• Title/Summary/Keyword: Document Collection

Document Clustering with Relational Graph Of Common Phrase and Suffix Tree Document Model (공통 Phrase의 관계 그래프와 Suffix Tree 문서 모델을 이용한 문서 군집화 기법)

  • Cho, Yoon-Ho;Lee, Sang-Keun
    • The Journal of the Korea Contents Association / v.9 no.2 / pp.142-151 / 2009
  • The previous document clustering method, NSTC, measures the similarity between pairs of documents using TF-IDF during web document clustering. In this paper, we propose a new similarity measure that uses a relational graph of common phrases instead of TF-IDF. The method weights common phrases according to a relational graph that represents the relationships among the common phrases in the document collection. Experimental results indicate that the proposed method clusters document collections more effectively than NSTC.
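
To make the TF-IDF baseline mentioned in this abstract concrete, here is a minimal sketch of TF-IDF weighting with cosine similarity between two documents. It is an illustration only, not the NSTC algorithm or the proposed graph-based measure: it weights single words rather than suffix-tree phrases, and the function names `tfidf_vectors` and `cosine`, the natural-log IDF, and the toy documents are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: raw term count times ln(N / df).
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy collection (hypothetical data).
docs = [
    "suffix tree document clustering".split(),
    "suffix tree phrase clustering".split(),
    "library collection evaluation".split(),
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # documents that share several terms
sim_02 = cosine(vecs[0], vecs[2])  # documents that share no terms
```

In this toy collection the first two documents share three terms and so score higher than the unrelated pair, which scores zero.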

Shannon's Information Theory and Document Indexing (Shannon의 정보이론과 문헌정보)

  • Chung Young Mee
    • Journal of the Korean Society for Library and Information Science / v.6 / pp.87-103 / 1979
  • Information storage and retrieval is part of the general communication process. In Shannon's information theory, the information contained in a message is a measure of uncertainty about the information source, and the amount of information is measured by entropy. Indexing is a process that reduces the entropy of the information source, since the document collection is divided into many smaller groups according to the subjects the documents deal with. Significant concepts contained in every document are mapped into the set of all sets of index terms. Thus the index itself is formed by paired sets of index terms and documents. Without indexing, the entropy of a document collection consisting of $N$ documents is $\log_2 N$, whereas the average entropy of the smaller groups $(W_1, W_2, \ldots, W_m)$ is as small as $\left(\sum_{i=1}^{m} H(W_i)\right)/m$. Retrieval efficiency is a measure of an information system's performance, and it is largely affected by the quality of the index. If all and only the documents judged relevant to a user's query can be retrieved, the information system is said to be $100\%$ efficient. A document file W may potentially be classified into two sets for a specific query: relevant documents and non-relevant documents. After retrieval, the document file W' is reclassified into four sets: relevant-retrieved, relevant-not retrieved, non-relevant-retrieved, and non-relevant-not retrieved. The paper shows that the difference between the entropies of document file W and document file W' is a proper measure of retrieval efficiency.
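
The entropy comparison in this abstract can be checked numerically. The sketch below assumes that every document is equally likely within its group, so that $H(W_i) = \log_2 |W_i|$; that assumption, the function names, and the example sizes are illustrative and are not taken from the paper.

```python
import math


def collection_entropy(n_docs):
    """Entropy in bits of picking one of n_docs equally likely documents."""
    return math.log2(n_docs)


def average_group_entropy(group_sizes):
    """Mean of H(W_i) = log2(|W_i|) over the index groups W_1..W_m.

    Assumes documents are equally likely within each group.
    """
    return sum(math.log2(size) for size in group_sizes) / len(group_sizes)


# Hypothetical collection: 1024 documents indexed into 32 groups of 32.
before = collection_entropy(1024)
after = average_group_entropy([32] * 32)
reduction = before - after
```

With these sizes the unindexed collection has 10 bits of entropy and each indexed group has 5 bits, so indexing halves the uncertainty a searcher faces inside a group.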

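Current Status Analysis and Improvement Plan for Document Delivery Services in University Libraries (대학도서관 문헌제공봉사의 현황분석과 강화방안)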

  • 윤희윤
    • Journal of Korean Library and Information Science Society / v.29 / pp.27-63 / 1998
  • The purpose of this study is to analyze the document delivery service (DDS) of academic libraries in Korea and to suggest a model for improving it. DDS means providing copies of requested information in any format and from any source. DDS is gaining in importance as libraries turn to 'just-in-time' access rather than 'just-in-case' collection to meet users' information needs. The combination of rising journal subscription prices, declining financial resources, the cancellation of some journal subscriptions, electronic transmission technologies, and the rise of commercial document delivery services has allowed libraries to begin delivering articles to users in a much more rapid and acceptable time frame. Therefore, the library paradigm for the 2000s must be the creation of new document delivery structures that capitalize on the access tools and structures created by librarians over past generations. First of all, library-based document service requires a close review of existing library-to-library delivery mechanisms, the application of technology to the transfer of facsimiles of materials, and facilitated use of existing fee-based document sources. The ideal document delivery system would feature a transparent, seamless electronic service incorporating searching and browsing, identification and marking of desired items, and transmission and fulfillment of requests. Requested items would be supplied from the library collection, commercial suppliers, or other sources. But DDS will succeed in the future only when physical resources, policies, personnel, and practices are organized to provide timely information delivery to users.

An Archival Study on the Arrangement and Description of Old Document(Diploma) (고문서 정리(整理)에 대한 기록학적 연구 - 새로운 고문서 정리 방법의 모색을 위하여 -)

  • Cho, Kyung-Koo
    • The Korean Journal of Archival Studies / no.7 / pp.37-74 / 2003
  • An Old Document (Diploma) is a historical and unique record, so it must be collected, arranged, and preserved for research as soon as possible. In particular, for Old Documents (Diploma) to be used effectively, the material needs to be arranged and described systematically on the basis of modern archival theory. The Kyujanggak Archives at Seoul National University has published 23 volumes of Old Document (Diploma) material under the title Old Document (Diploma). But these volumes seem to inconvenience readers, because the materials are classified and gathered only by genre, the titles and the order of the materials are not standardized, and there is no description of the content of each Old Document (Diploma). The Jangseo-gak Library at The Academy of Korean Studies has also published a series of Old Document (Diploma) material, Old Document (Diploma) Collection. However, the situation is no different, since materials classified and gathered by genre, family, academy, or local school are all mixed together. A large part of the materials also have no titles and no description of the content of each Old Document (Diploma). For the arrangement and description of records, European and American archival science has established four theories: 1) the principle of provenance, 2) the principle of original order, 3) levels of control, and 4) collective description. These theories are valuable for the effective use of Old Documents (Diploma). From the viewpoint of the principle of provenance, Old Document (Diploma) materials should be classified not by subject and genre, but by family and person. Then, once the materials have been collected by family or person according to the principle of provenance, they should be arranged in their original order, both for more detailed arrangement and to reveal the relationships among them. This is the so-called principle of original order.
The hierarchical management of Old Document (Diploma) materials, for example classifying them by record group, sub-group, series, item, and so on, is the concept of levels of control, and a comprehensive description at each level of the hierarchy is the concept of collective description. Let us apply these archival theories to the 34 pieces of Chung, Man-Seok's material in the series Old Document (Diploma). First, collect the Old Document (Diploma) materials, which were scattered across the series classified by genre, into Chung, Man-Seok's collection (the principle of provenance). Second, rearrange them chronologically (the principle of original order); we can then find comprehensive information about Chung, Man-Seok. For the hierarchical management of the Old Document (Diploma) materials, we should establish a few concepts running from the general, large group to the specific, small item. The concepts can be organized as follows: 1) record group (Chung, Man-Seok record group) - 2) sub-group (personnel document, property document, family document, social activity document, political activity document, etc.) - 3) series (gyoji-series, gyoseo-series, yuji-series, etc. in the personnel document) - 4) folder (document with additions) - 5) item (one document). According to the theory of collective description, at the level of the record group there should be a collective description giving Chung, Man-Seok's biography or a summary of the record group. Similarly, there should be a collective description summarizing the sub-group at the sub-group level and one summarizing the series at the series level.

A Study of Effective Collection of Public Opinion on Environmental Impact Assessment (환경영향평가상의 효율성 주민의견 수렴에 관한 연구)

  • Yoo, Heon-Seok;Joo, Yong-Joon;Jeong, Seong-Hoon
    • Journal of Environmental Impact Assessment / v.11 no.4 / pp.311-319 / 2002
  • Procedures for achieving well-balanced development and effective environmental impact assessment need to include the participation of various stakeholders in writing and reviewing the environmental impact assessment document, collecting public opinion, and post-monitoring. Accordingly, to encourage effective and efficient collection of residents' opinions, this study analyzes present conditions and problems and suggests institutional and policy alternatives. The study reached the following conclusions. In institutional aspects: (1) a proposal for drafting the environmental impact assessment document; (2) formation of a committee for collecting and coordinating stakeholders' opinions; (3) a broader scope of civic participation. In policy aspects: (1) use of the local community; (2) guidance on local information from local society and environmental specialists; (3) understandable environmental impact assessment documents and data; (4) strengthened roles and duties of local government.

A Study on Feature Selection for kNN Classifier using Document Frequency and Collection Frequency (문헌빈도와 장서빈도를 이용한 kNN 분류기의 자질선정에 관한 연구)

  • Lee, Yong-Gu
    • Journal of Korean Library and Information Science Society / v.44 no.1 / pp.27-47 / 2013
  • This study investigated the classification performance of a kNN classifier using the feature selection methods based on document frequency(DF) and collection frequency(CF). The results of the experiments, which used HKIB-20000 data, were as follows. First, the feature selection methods that used high-frequency terms and removed low-frequency terms by the CF criterion achieved better classification performance than those using the DF criterion. Second, neither DF nor CF methods performed well when low-frequency terms were selected first in the feature selection process. Last, combining CF and DF criteria did not result in better classification performance than using the single feature selection criterion of DF or CF.
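
The two criteria compared in this abstract differ only in how they count: document frequency (DF) counts the documents a term appears in, while collection frequency (CF) counts every occurrence of the term in the collection. The sketch below shows that difference and a simple "keep the top-k highest-frequency terms" selection. The function names, the tie-breaking rule, and the toy documents are assumptions; the study's actual thresholds and the HKIB-20000 data are not reproduced here.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df


def collection_frequency(docs):
    """Count, for each term, its total occurrences across all documents."""
    cf = Counter()
    for doc in docs:
        cf.update(doc)
    return cf


def select_top_terms(freq, k):
    """Keep the k highest-frequency terms (ties broken alphabetically)."""
    ranked = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Toy tokenized collection (hypothetical data).
docs = [
    ["market", "market", "market", "stock"],
    ["stock", "bank"],
    ["bank", "stock", "rate"],
]
df = document_frequency(docs)
cf = collection_frequency(docs)
top_by_df = select_top_terms(df, 2)
top_by_cf = select_top_terms(cf, 2)
```

Here "market" occurs three times but in only one document, so it is selected under CF and dropped under DF. The selected terms would then serve as the feature set for a kNN classifier.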

Use Studies of Library Collections (장서평가에 관한 소고 -특히 이용조사를 중심으로-)

  • Yoo Chae-Ock
    • Journal of the Korean Society for Library and Information Science / v.15 / pp.175-195 / 1988
  • Use studies of library collections have been conducted as a method of evaluating the collections in a library. The main purpose of use studies is to evaluate the quality of a library collection in terms of the extent and mode of its use. In addition to use studies, both quantitative and qualitative methods can be used to evaluate a library collection. However, the quantitative and qualitative collection evaluation methods are more concerned with the collection itself than with its use. Use studies have been conducted in large academic libraries for the following specific purposes: 1) to identify little-used portions of collections that can be retired to less accessible and less expensive storage areas; 2) to identify core collections that will satisfy some degree of circulation demand in the near future; 3) to identify use patterns of selected subject areas or types of books that can be used to adjust collection development practices or fund allocations; 4) to assess the document delivery capability of a library in order to improve the availability of its documents. The methodologies employed for these purposes fall into four major categories: 1) the circulation analysis method, 2) the last circulation method, 3) the relative use method, and 4) the document delivery test. Each method is briefly reviewed along with its limitations.

An overview and analysis of commercial document delivery systems (국내외 상업적 문헌제공시스템의 현황파악과 비교분석)

  • 윤희윤
    • Journal of the Korean Society for Information Management / v.15 no.2 / pp.7-28 / 1998
  • The purpose of this study is to survey and analyze commercial document delivery systems. To this end, the study first compared the current systems under three headings: non-collection-based systems (Infotrieve, OCLC, UnCover, BIDS, Swets & Zeitlinger, Kyobobook), collection-based systems (EBSCO, ISI, UMI, BLDSC, CISTI, INIST, NCSI, JICST, KINITI), and specialized collection-based systems (Engineering Information Inc., IEEE/IEE, BIOSIS, CAS, NAL, RSC, TWI, ADONIS). Next, the study analyzed the advantages and disadvantages of each system based on four performance criteria: scope of inventory/journal coverage, turnaround time, delivery cost and payment options, and reliability and satisfaction rate.

Collection and Extraction Algorithm of Field-Associated Terms (분야연상어의 수집과 추출 알고리즘)

  • Lee, Sang-Kon;Lee, Wan-Kwon
    • The KIPS Transactions:PartB / v.10B no.3 / pp.347-358 / 2003
  • A field-associated term is a single or compound word that may occur in any document and that makes it possible to recognize the field of a text using common human knowledge. For example, a reader recognizes the field of a document, that is, the field name of the text, on encountering a word such as 'pitcher' or 'election'. We propose an efficient method of constructing field-associated terms (FTs) for specialized fields in order to decide the field of a text. A document classification scheme can be fixed from a well-classified document database or corpus. Considering the focus field, we discuss the levels and stability ranks of field-associated terms. To construct a balanced FT collection, we first construct single FTs. From the collections, the levels and stability ranks of FTs can be constructed automatically. We propose a new extraction algorithm of FTs for document classification that uses the concentration rate of FTs and their occurrence frequencies.
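
The abstract does not define its concentration rate, so the following is only a sketch under an assumed definition: the share of a term's occurrences that fall in its most frequent field. A term is kept as a candidate field-associated term when that share and its total frequency both clear a threshold. The field labels, thresholds, function names, and corpus are all hypothetical and do not reproduce the paper's algorithm, levels, or stability ranks.

```python
from collections import Counter, defaultdict


def term_field_counts(labeled_docs):
    """Map each term to a Counter of how often it occurs in each field."""
    counts = defaultdict(Counter)
    for field, tokens in labeled_docs:
        for token in tokens:
            counts[token][field] += 1
    return counts


def field_associated_terms(labeled_docs, min_rate=0.8, min_count=2):
    """Return {term: field} for terms concentrated in a single field.

    Assumed concentration rate: occurrences in the term's top field
    divided by its total occurrences in the labeled corpus.
    """
    result = {}
    for term, by_field in term_field_counts(labeled_docs).items():
        total = sum(by_field.values())
        field, top = by_field.most_common(1)[0]
        if total >= min_count and top / total >= min_rate:
            result[term] = field
    return result


# Toy pre-classified corpus (hypothetical labels and text).
corpus = [
    ("sports", "pitcher throws ball game".split()),
    ("sports", "pitcher wins game".split()),
    ("politics", "election vote game".split()),
    ("politics", "election campaign vote".split()),
]
fts = field_associated_terms(corpus)
```

In this toy corpus "game" appears in both fields and is rejected, while "pitcher" and "election" occur in only one field and are kept.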

Document Clustering Using Semantic Features and Fuzzy Relations

  • Kim, Chul-Won;Park, Sun
    • Journal of information and communication convergence engineering / v.11 no.3 / pp.179-184 / 2013
  • Traditional clustering methods are usually based on the bag-of-words (BOW) model. A disadvantage of the BOW model is that it ignores the semantic relationship among terms in the data set. To resolve this problem, ontology or matrix factorization approaches are usually used. However, a major problem of the ontology approach is that it is usually difficult to find a comprehensive ontology that can cover all the concepts mentioned in a collection. This paper proposes a new document clustering method using semantic features and fuzzy relations for solving the problems of ontology and matrix factorization approaches. The proposed method can improve the quality of document clustering because the clustered documents use fuzzy relation values between semantic features and terms to distinguish clearly among dissimilar documents in clusters. The selected cluster label terms can represent the inherent structure of a document set better by using semantic features based on non-negative matrix factorization, which is used in document clustering. The experimental results demonstrate that the proposed method achieves better performance than other document clustering methods.