• Title/Summary/Keyword: Document searching

A Research on Enhancement of Text Categorization Performance by using Okapi BM25 Word Weight Method (Okapi BM25 단어 가중치법 적용을 통한 문서 범주화의 성능 향상)

  • Lee, Yong-Hun;Lee, Sang-Bum
    • Journal of the Korea Academia-Industrial cooperation Society / v.11 no.12 / pp.5089-5096 / 2010
  • Text categorization, which classifies documents according to given criteria, is an important feature of information retrieval systems. Categorization methods generally classify target documents by extracting important index terms and assigning weights to them. Because the performance and correctness of text categorization depend heavily on the weighting algorithm, its effectiveness is critical. This paper introduces an enhanced text categorization method based on an improved term weighting technique. Okapi BM25 has proved effective in several information retrieval engines. We applied Okapi BM25 to categorization and found that it performed well. It is compared with other term weighting methods: TF-IDF, TF-ICF, and TF-ISF. The experiments use the Reuters-21578 collection with SVM and KNN classifiers. A modified Okapi BM25 shows the best performance.
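
    As a rough illustration of the term weighting this abstract refers to, the sketch below computes a standard Okapi BM25 weight for each term in each document. It is not the modified BM25 from the paper, whose details the abstract does not give. The function name `bm25_weights`, the defaults `k1=1.2` and `b=0.75`, and the non-negative `log(1 + ...)` form of the IDF are assumptions made for this example.

```python
import math
from collections import Counter

def bm25_weights(docs, k1=1.2, b=0.75):
    """Return one {term: weight} dict per tokenized document.

    Standard BM25 term weighting (assumed parameters, not the paper's variant).
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    weights = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: longer-than-average documents are penalized.
        norm = k1 * (1 - b + b * len(d) / avg_len)
        weights.append({
            t: math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
               * f * (k1 + 1) / (f + norm)
            for t, f in tf.items()
        })
    return weights

docs = [["oil", "price", "oil"], ["wheat", "price"], ["oil", "export"]]
weights = bm25_weights(docs)
```

    The resulting per-document weight vectors could then be passed to a classifier such as SVM or KNN in place of TF-IDF vectors.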

A Study on the Problems and Improvement of the 'Considerable Efforts' to Use Orphan Works: Focused on Mass Digitization in Libraries (고아저작물 활용을 위한 '상당한 노력' 규정의 문제점 및 개선에 관한 연구 - 도서관의 대량디지털화를 중심으로 -)

  • Joung, Kyoung Hee
    • Journal of the Korean Society for Library and Information Science / v.50 no.4 / pp.333-350 / 2016
  • Article 50 of the Copyright Act of Korea and Article 18 of its Enforcement Decree, which concern orphan works, define the 'considerable efforts' required to locate copyright owners. This study analyzed whether those efforts are reasonable for mass digitization in libraries. It identified three problems: duplication between searches on the 'Finding Copyright' website and document-based inquiries to copyright trust management organizations; ambiguous criteria for searching information networks; and the non-use of international standard identifiers in managing works in the copyright register, works with undistributed compensation, and copyright trust management organizations. The study suggests that copyright trust management organizations should register the works they hold in trust, that the government should develop detailed guidelines for searching information networks, and that copyrighted works should be managed using international standard identifiers.

RDB Schema Model of XML Document for Storage Capacity and Searching Efficiency (저장 공간과 검색 효율을 위한 XML 문서의 RDB 스키마 모델)

  • Kim Jeong-Hee;Kwak Ho-Young;Kwon Hoon
    • The Journal of the Korea Contents Association / v.6 no.4 / pp.19-28 / 2006
  • XML instances used for information exchange are normally stored in legacy relational databases, so effective XML applications require integration with relational databases. To support this, virtual decomposition storage and decomposition storage methods, which store the separated structures of instances in a relational database, have been studied. However, these methods hold different schema structure and level information, which makes queries difficult to process during search and increases overhead because the separate stores duplicate data. This research therefore adds an 'Eltype' field to the previous database schema structure for instances and schemas, provides consistent level information, and proposes a storage structure that maps each field to a field of the relational database schema. As a result, XML instances and structures can be stored together, minimizing overhead and the required storage space. The synchronized storage level structure also makes search queries easier to process.
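
    The idea of storing XML instances and their structure in one relational table can be sketched as follows. The abstract does not define the schema, so everything here is hypothetical: the table name `node`, its columns, and the assumption that `eltype` records whether a row is an element or an attribute are illustrative guesses, not the paper's design.

```python
import sqlite3
import xml.etree.ElementTree as ET

def store_xml(conn, xml_text):
    """Shred an XML document into a single table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node ("
        "id INTEGER PRIMARY KEY, parent_id INTEGER, level INTEGER, "
        "eltype TEXT, name TEXT, value TEXT)"
    )

    def insert(parent_id, level, eltype, name, value):
        cur = conn.execute(
            "INSERT INTO node (parent_id, level, eltype, name, value) "
            "VALUES (?, ?, ?, ?, ?)",
            (parent_id, level, eltype, name, value),
        )
        return cur.lastrowid

    def walk(elem, parent_id, level):
        # Elements and attributes share one table; eltype tells them apart.
        text = (elem.text or "").strip() or None
        node_id = insert(parent_id, level, "element", elem.tag, text)
        for attr, val in elem.attrib.items():
            insert(node_id, level + 1, "attribute", attr, val)
        for child in elem:
            walk(child, node_id, level + 1)

    walk(ET.fromstring(xml_text), None, 0)
    conn.commit()

conn = sqlite3.connect(":memory:")
store_xml(conn, '<book id="1"><title>XML</title></book>')
rows = conn.execute("SELECT eltype, name, level FROM node ORDER BY id").fetchall()
```

    Because every row carries its level and type, a search can filter on `eltype` and `level` in one query instead of joining separate instance and schema tables.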

A Study on Frequency of Subject on Content of Thesis in Field of Science and Technology (과학기술분야 학위논문 내용목차에 따른 주제어 출현빈도에 관한 연구)

  • Lee, Hye-Young;Kwak, Seung-Jin
    • Journal of the Korean Society for information Management / v.25 no.1 / pp.191-210 / 2008
  • Subject terms, such as those assigned through subject indexing, are generally used to search for and access documents, so there should be some relationship between a document's full text and its subject terms. This study starts from that question. It uses master's theses in science and technology because their full text is relatively consistently formatted. The study examines where subject terms occur in a thesis and how they are distributed across the parts of the full text: 'Contents', 'Introduction', 'Theory', 'Main subject', 'Conclusion', and 'References'. A thesis contained 1226.3 terms on average, and the subject terms comprised 12 to 13 terms on average. Subject terms occurred most frequently in 'Contents' and 'Introduction'.

A Comparative Study of XML and HTML: Focusing on Their Characteristics and Retrieval Functions (디지털도서관 문서양식으로서의 XML과 HTML의 특성 및 검색 기능 비교 연구)

  • 김현희;장혜원
    • Journal of the Korean Society for information Management / v.16 no.2 / pp.105-134 / 1999
  • For efficient and precise searches in the Web environment, resources should be coded in a structured way. HTML does not cover semantic structure because of its fixed tagging. XML, which has emerged as an alternative standard markup language, uses custom tags that allow structural searching. This study therefore compares XML with HTML in terms of their characteristics and retrieval functions. To test the retrieval functions of XML- and HTML-based systems, we constructed an experimental XML-based system. The XML-based system has several advantages over the HTML system, but some improvements are needed to make it more comprehensive and effective. First, XML document search engines with user-friendly interfaces are needed. Second, popular Web browsers such as Explorer and Communicator need to support the XML 1.0 specification completely. Third, an open DTD format, which would allow information retrieval systems to retrieve documents and compress them into a single format, is needed to control Web documents more efficiently.
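
    The structural searching that custom tags allow can be illustrated with a short sketch. The record layout (`record`, `title`, `author`) and the function `search_by_field` are made up for this example and are not taken from the paper's experimental system.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: two bibliographic records with custom tags.
RECORDS = """<records>
  <record><title>XML in Digital Libraries</title><author>Kim</author></record>
  <record><title>Notes by Kim</title><author>Jang</author></record>
</records>"""

def search_by_field(xml_text, field, keyword):
    """Return titles of records whose given field contains the keyword."""
    root = ET.fromstring(xml_text)
    return [
        rec.findtext("title")
        for rec in root.iter("record")
        if keyword.lower() in (rec.findtext(field) or "").lower()
    ]

by_author = search_by_field(RECORDS, "author", "kim")
by_title = search_by_field(RECORDS, "title", "kim")
```

    The same keyword returns different records depending on the field searched, a distinction that plain HTML presentation tags would not make available.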

A Study on the Musical Theme Clustering for Searching Note Sequences (음렬 탐색을 위한 주제소절 자동분류에 관한 연구)

  • 심지영;김태수
    • Journal of the Korean Society for information Management / v.19 no.3 / pp.5-30 / 2002
  • This paper selects a classification feature based on musical content, namely note sequence patterns, measures the similarity between note sequences, and builds clusters of similar note sequences, so that a CBMR system can show similar note sequences alongside the search result and make searching easier for users. The experimental document was 'A Dictionary of Musical Themes', an index of theme bars focused on classical music, obtained as kern-type files. Humdrum Toolkit version 1.0 was used to process the note sequences. Hierarchical clustering was carried out in stages on four types of similarity matrix, which differ in whether the note sequences are segmented and in where the starting point is. To evaluate the results, the WACS standard was used where a manual classification existed; for note sequences starting from any point, the distribution of common feature patterns within the resulting clusters was used. According to the results, clustering with segmented features, regardless of the starting point, performs distinctly better than clustering with non-segmented features.

Semantic Web based DQL Search System (시멘틱 웹 기반 DQL 검색 시스템 설계)

  • Kim Je-Min;Park Young-Tack
    • The KIPS Transactions:PartB / v.12B no.1 s.97 / pp.91-100 / 2005
  • As the amount of information on the web increases, diverse methods have been proposed to use it efficiently. Most search systems use a keyword-based method that relies mostly on syntactic information. They cannot utilize the semantic information of documents and thus may return results that are not relevant to users. To address this shortcoming in document searching, a technique using the Semantic Web is suggested. The Semantic Web can find information relevant to users by employing metadata represented with standard ontologies. Each document is annotated with metadata that agents can reason over. In this paper, we propose a search system using Semantic Web technologies. Our semantic search system semantically analyzes the questions that users enter and retrieves the information they want. To improve the efficiency and accuracy of semantic search systems, this paper proposes a DQL (DAML Query Language) engine that employs an inference engine to perform reasoning, and a DQL converter that translates the user's keyword-form question into DQL.

Design and Implementation of a Browser for Educational PDA Contents (교육용 PDA 컨텐츠 브라우저의 설계 및 구현)

  • 신재룡
    • Journal of the Korea Institute of Information and Communication Engineering / v.6 no.8 / pp.1223-1233 / 2002
  • Recently, various electronic books (E-Books) based on PDAs (personal digital assistants), which can be used easily anytime and anywhere, have been developed. An E-Book has much less volume and weight than a traditional book. For that reason, it is easy to carry and can serve content through diverse functions such as searching, bookmarks, a dictionary, and playback of color images, sound, and moving pictures. Because of these advantages, many E-Book products have emerged in the market. However, products for educational content are scarce, because such content requires not only the normal functions but also additional ones such as problem solving. It is therefore necessary to develop a browser and an editor for educational content. In this paper, we express educational content in XML and define the document structure with XML Schema. We then design and implement an editor and a browser that can manage educational content on a PDA.

A Design of Efficient Keyword Search Protocol Over Encrypted Document (암호화 문서상에서 효율적인 키워드 검색 프로토콜 설계)

  • Byun, Jin-Wook
    • Journal of the Institute of Electronics Engineers of Korea CI / v.46 no.1 / pp.46-55 / 2009
  • We study the problem of searching for documents that contain each of several keywords (conjunctive keyword search) over encrypted documents. A conjunctive keyword search protocol involves three entities: a data supplier, a storage system such as a database, and a user of the storage system. The data supplier uploads encrypted documents to the storage system, and a user of the storage system then searches for documents containing each of several keywords. Recently, many conjunctive keyword search schemes have been suggested in various settings. However, these schemes require high computation cost for the data supplier or large storage for the user. Moreover, up to now, their security has been proved only in the random oracle model. In this paper, we propose efficient conjunctive keyword search schemes over encrypted documents whose security is proved without random oracles. In the proposed schemes, the storage of a user and the computational and communication costs of a data supplier are constant. The security of the schemes relies only on the hardness of the Decisional Bilinear Diffie-Hellman (DBDH) problem.
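
    The query semantics of conjunctive keyword search (return only documents that contain every keyword) can be sketched on plaintext as below. This example contains no encryption and is not the proposed protocol, which operates on encrypted documents; the names `build_index` and `conjunctive_search` and the sample data are assumptions for illustration only.

```python
def build_index(docs):
    """Map each keyword to the set of document ids that contain it."""
    index = {}
    for doc_id, words in docs.items():
        for word in set(words):
            index.setdefault(word, set()).add(doc_id)
    return index

def conjunctive_search(index, keywords):
    """Return ids of documents that contain every one of the keywords."""
    postings = [index.get(k, set()) for k in keywords]
    if not postings:
        return set()
    # A document matches only if it appears in every keyword's posting set.
    return set.intersection(*postings)

# Hypothetical plaintext documents, keyed by document id.
docs = {
    1: ["urgent", "finance", "report"],
    2: ["urgent", "travel"],
    3: ["finance", "urgent", "audit"],
}
index = build_index(docs)
matches = conjunctive_search(index, ["urgent", "finance"])
```

    In an encrypted setting the storage system has to perform the equivalent test without learning the keywords, which is the cost the schemes in the abstract aim to keep low.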

An empirical analysis on the present situation of government publications and the operation of the publications in library (정부간행물의 출판현황과 도서관의 정부간행물 운영실태분석)

  • 강미혜
    • Journal of Korean Library and Information Science Society / v.23 / pp.79-108 / 1995
  • Government publications are published to keep records of governmental activities and performance. In a rapidly changing information-oriented society, the operation of government publications and of libraries must be managed effectively in order to satisfy the people's 'right to know' and to fulfill the government's obligation to 'let people know.' Accordingly, this paper analyzes five research items to examine how far government publications satisfy the people's 'right to know': the publication and distribution of government publications, the operation of the publications in libraries, the number of publications serving as secondary information sources for them, and the legal deposit of the publications in the National Library. The research findings were as follows: 1) Although government publications have gradually increased every year in number and kind, their publication, distribution, and sale were not well organized or systematic. Government publications accounted for no more than 1.47% of all publications by number. Moreover, more than half of them were published non-periodically or annually. To make matters worse, the publications were not easy to access because they were not for sale. 2) It appears that people could not use the publications efficiently because libraries and administrative document offices did not pay sufficient attention to public relations for the various government publications. In addition, there were not enough secondary information sources such as bibliographies, indexes, and catalogs. The capacity to search the information quickly posed another serious problem, which kept government publications from being used effectively. 3) The law states that all government publications should be deposited in the National Libraries. However, the law was not properly enforced because government agencies lacked understanding of it.