• Title/Summary/Keyword: Document Retrieval

Search Result 448, Processing Time 0.025 seconds

Query Formulation for Heuristic Retrieval in Obfuscated and Translated Partially Derived Text

  • Kumar, Aarti;Das, Sujoy
    • Journal of Information Science Theory and Practice
    • /
    • v.3 no.1
    • /
    • pp.24-39
    • /
    • 2015
  • Pre-retrieval query formulation is an important step for identifying local text reuse. Local reuse with high obfuscation, paraphrasing, and translation poses a challenge of finding the reused text in a document. In this paper, three pre-retrieval query formulation strategies for heuristic retrieval in case of low obfuscated, high obfuscated, and translated text are studied. The strategies used are (a) Query formulation using proper nouns; (b) Query formulation using unique words (Hapax); and (c) Query formulation using most frequent words. Whereas in case of low and high obfuscation and simulated paraphrasing, keywords with Hapax proved to be slightly more efficient, initial results indicate that the simple strategy of query formulation using proper nouns gives promising results and may prove better in reducing the size of the corpus for post processing, for identifying local text reuse in case of obfuscated and translated text reuse.

Measurement of Document Similarity using Word and Word-Pair Frequencies (단어 및 단어쌍 별 빈도수를 이용한 문서간 유사도 측정)

  • 김혜숙;박상철;김수형
    • Proceedings of the IEEK Conference
    • /
    • 2003.07d
    • /
    • pp.1311-1314
    • /
    • 2003
  • In this paper, we propose a method to measure document similarity. First, we have exploited single-term method that extracts nouns by using a lexical analyzer as a preprocessing step to match one index to one noun. In spite of irrelevance between documents, possibility of increasing document similarity is high with this method. For this reason, a term-phrase method has been reported. This method constructs co-occurrence between two words as an index to measure document similarity. In this paper, we tried another method that combine these two methods to compensate the problems in these two methods. Six types of features are extracted from two input documents, and they are fed into a neural network to calculate the final value of document similarity. Reliability of our method has been proved by an experiment of document retrieval.

  • PDF

A Storage and Retrieval System for Structured SGML Documents using Grove (Grove를 이용한 구조적 SGML문서의 저장 및 검색)

  • Kim, Hak-Gyoon;Cho, Sung-Bae
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.8 no.5
    • /
    • pp.501-509
    • /
    • 2002
  • SGML(ISO 8879) has been proliferated to support various document styles and to transfer documents into different platforms. SGML documents have logical structure information in addition to contents. As SGML documents are widely used, there is an increasing need for database storage and retrieval system using the logical structure of documents. However. traditional search engines using document indexes cannot exploit the logical structure. In this Paper, we have developed an SGML document storage system, which is DTD-independent and store the document type and the document instance separately by using Grove which is the document model for DSSSL and HyTime. We have used the Object Store, an object-oriented DBMS, to store the structure information appropriately without any loss of structural information. Also, we have supported a index structure for search efficiency like the relational DBMS, and constructed an effective user interface which combines content-based search with structure-based search.

Document Clustering Method using Coherence of Cluster and Non-negative Matrix Factorization (비음수 행렬 분해와 군집의 응집도를 이용한 문서군집)

  • Kim, Chul-Won;Park, Sun
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.13 no.12
    • /
    • pp.2603-2608
    • /
    • 2009
  • Document clustering is an important method for document analysis and is used in many different information retrieval applications. This paper proposes a new document clustering model using the clustering method based NMF(non-negative matrix factorization) and refinement of documents in cluster by using coherence of cluster. The proposed method can improve the quality of document clustering because the re-assigned documents in cluster by using coherence of cluster based similarity between documents, the semantic feature matrix and the semantic variable matrix, which is used in document clustering, can represent an inherent structure of document set more well. The experimental results demonstrate appling the proposed method to document clustering methods achieves better performance than documents clustering methods.

A Re-Ranking Retrieval Model based on Two-Level Similarity Relation Matrices (2단계 유사관계 행렬을 기반으로 한 순위 재조정 검색 모델)

  • 이기영;은희주;김용성
    • Journal of KIISE:Software and Applications
    • /
    • v.31 no.11
    • /
    • pp.1519-1533
    • /
    • 2004
  • When Web-based special retrieval systems for scientific field extremely restrict the expression of user's information request, the process of the information content analysis and that of the information acquisition become inconsistent. In this paper, we apply the fuzzy retrieval model to solve the high time complexity of the retrieval system by constructing a reduced term set for the term's relatively importance degree. Furthermore, we perform a cluster retrieval to reflect the user's Query exactly through the similarity relation matrix satisfying the characteristics of the fuzzy compatibility relation. We have proven the performance of a proposed re-ranking model based on the similarity union of the fuzzy retrieval model and the document cluster retrieval model.

Design and Development of a Multimodal Biomedical Information Retrieval System

  • Demner-Fushman, Dina;Antani, Sameer;Simpson, Matthew;Thoma, George R.
    • Journal of Computing Science and Engineering
    • /
    • v.6 no.2
    • /
    • pp.168-177
    • /
    • 2012
  • The search for relevant and actionable information is a key to achieving clinical and research goals in biomedicine. Biomedical information exists in different forms: as text and illustrations in journal articles and other documents, in images stored in databases, and as patients' cases in electronic health records. This paper presents ways to move beyond conventional text-based searching of these resources, by combining text and visual features in search queries and document representation. A combination of techniques and tools from the fields of natural language processing, information retrieval, and content-based image retrieval allows the development of building blocks for advanced information services. Such services enable searching by textual as well as visual queries, and retrieving documents enriched by relevant images, charts, and other illustrations from the journal literature, patient records and image databases.

A Study on Effective Internet Data Extraction through Layout Detection

  • Sun Bok-Keun;Han Kwang-Rok
    • International Journal of Contents
    • /
    • v.1 no.2
    • /
    • pp.5-9
    • /
    • 2005
  • Currently most Internet documents including data are made based on predefined templates, but templates are usually formed only for main data and are not helpful for information retrieval against indexes, advertisements, header data etc. Templates in such forms are not appropriate when Internet documents are used as data for information retrieval. In order to process Internet documents in various areas of information retrieval, it is necessary to detect additional information such as advertisements and page indexes. Thus this study proposes a method of detecting the layout of Web pages by identifying the characteristics and structure of block tags that affect the layout of Web pages and calculating distances between Web pages. This method is purposed to reduce the cost of Web document automatic processing and improve processing efficiency by providing information about the structure of Web pages using templates through applying the method to information retrieval such as data extraction.

  • PDF

A Study on Building Structures and Processes for Intelligent Web Document Classification (지능적인 웹문서 분류를 위한 구조 및 프로세스 설계 연구)

  • Jang, Young-Cheol
    • Journal of Digital Convergence
    • /
    • v.6 no.4
    • /
    • pp.177-183
    • /
    • 2008
  • This paper aims to offer a solution based on intelligent document classification to create a user-centric information retrieval system allowing user-centric linguistic expression. So, structures expressing user intention and fine document classifying process using EBL, similarity, knowledge base, user intention, are proposed. To overcome the problem requiring huge and exact semantic information, a hybrid process is designed integrating keyword, thesaurus, probability and user intention information. User intention tree hierarchy is build and a method of extracting group intention between key words and user intentions is proposed. These structures and processes are implemented in HDCI(Hybrid Document Classification with Intention) system. HDCI consists of analyzing user intention and classifying web documents stages. Classifying stage is composed of knowledge base process, similarity process and hybrid coordinating process. With the help of user intention related structures and hybrid coordinating process, HDCI can efficiently categorize web documents in according to user's complex linguistic expression with small priori information.

  • PDF

Automatic Linkage Method Between Email and Block Structure to Store Construction Project Documents in The Blockchain

  • Kim, Eu Wang;Park, Min Seo;Kim, Jong Inn;Wei, Ameng;Kim, Kyoungmin;Kim, Kyong Ju
    • International conference on construction engineering and project management
    • /
    • 2022.06a
    • /
    • pp.886-892
    • /
    • 2022
  • In construction projects, it is common to exchange documents using email because of convenience. In this study, a method extracting and organizing block information automatically based on email was developed. This method is composed of document exchange and archiving processes, which are difficult to manage and vulnerable to loss. Therefore, this study aims to develop a solution that can automatically link email and block information. The block data components are designed to derive from email exchange and user-additional input information. Also, automatically generating blocks process including extraction and conversion of information was proposed. This solution can lead to promote the convenience of project document management in terms of identifying the document flow and preventing loss of information.

  • PDF

Design and Implementation of Document Filing and Retrieval System in an OSI Environment (OSI 환경에서 문서 파일링 및 검색 시스템의 설계 및 구현)

  • 임재홍;박용진
    • Journal of the Korean Institute of Telematics and Electronics B
    • /
    • v.31B no.2
    • /
    • pp.10-20
    • /
    • 1994
  • This paper describes a design and implementation of the DFR(Document Filing and Retrieval) system. one of applications of DOAM(Distributed Office Application Model) which is the international standard in ISO(International Standards Organization). On the basis of the international standard, the DFR system is implemented on SUN workstation and PC/386 with C language, and its implementation is verified by tracing the association descriptor and primitives of service elements when its operation is tested between client and server. The result of this study shows that the DFR system can be implemented on the basis of the international standard, and makes a contribution toward the establishment of functional standards for the DFR system.

  • PDF