• Title/Summary/Keyword: Text Document

Search Result 664, Processing Time 0.031 seconds

Implementation of Web-based Information System for Full-text Processing (전문 처리를 위한 웹 기반 정보시스템 구현)

  • Kim, Sang-Do;Mun, Byeong-Ju;Ryu, Geun-Ho
    • The Transactions of the Korea Information Processing Society
    • /
    • v.6 no.6
    • /
    • pp.1481-1492
    • /
    • 1999
  • As Internet is popularized by the advent of Web concept having characteristics such as open network, user-friendly, and easy-usage, there are many changes in Information systems providing various information. Web is rapidly transferred traditional Information systems to Web-based Information systems, because it provides not only text information but also multimedia information including image, audio, video, and etc. Also, as information contents were changed from text-based simple abstract information to full-text information, there was appeared various document formats processing Full-text information. But, as they naturally demand large systems memory, long processing time, broader transmission bandwidth, and etc, estimating of these factors is necessary when constructing information systems. This paper focuses on how to design and construct information system processing full-text information and providing function of an integrated document. Primarily, we should review standard document format which is used or developed, and any document format is appropriate to process full-text information in review with viewpoint of information system. Also, practically we should construct information system providing full-text information based on PDF document.

  • PDF

Improving the Performance of SVM Text Categorization with Inter-document Similarities (문헌간 유사도를 이용한 SVM 분류기의 문헌분류성능 향상에 관한 연구)

  • Lee, Jae-Yun
    • Journal of the Korean Society for information Management
    • /
    • v.22 no.3 s.57
    • /
    • pp.261-287
    • /
    • 2005
  • The purpose of this paper is to explore the ways to improve the performance of SVM (Support Vector Machines) text classifier using inter-document similarities. SVMs are powerful machine learning systems, which are considered as the state-of-the-art technique for automatic document classification. In this paper text categorization via SVMs approach based on feature representation with document vectors is suggested. In this approach, document vectors instead of index terms are used as features, and vector similarities instead of term weights are used as feature values. Experiments show that SVM classifier with document vector features can improve the document classification performance. For the sake of run-time efficiency, two methods are developed: One is to select document vector features, and the other is to use category centroid vector features instead. Experiments on these two methods show that we can get improved performance with small vector feature set than the performance of conventional methods with index term features.

A Text Detection Method Using Wavelet Packet Analysis and Unsupervised Classifier

  • Lee, Geum-Boon;Odoyo Wilfred O.;Kim, Kuk-Se;Cho, Beom-Joon
    • Journal of information and communication convergence engineering
    • /
    • v.4 no.4
    • /
    • pp.174-179
    • /
    • 2006
  • In this paper we present a text detection method inspired by wavelet packet analysis and improved fuzzy clustering algorithm(IAFC).This approach assumes that the text and non-text regions are considered as two different texture regions. The text detection is achieved by using wavelet packet analysis as a feature analysis. The wavelet packet analysis is a method of wavelet decomposition that offers a richer range of possibilities for document image. From these multi scale features, we adapt the improved fuzzy clustering algorithm based on the unsupervised learning rule. The results show that our text detection method is effective for document images scanned from newspapers and journals.

Full-text databases as a means for resource sharing (자원공유 수단으로서의 전문 데이터베이스)

  • 노진구
    • Journal of Korean Library and Information Science Society
    • /
    • v.24
    • /
    • pp.45-79
    • /
    • 1996
  • Rising publication costs and declining financial resources have resulted in renewed interest among librarians in resource sharing. Although the idea of sharing resources is not new, there is a sense of urgency not seen in the past. Driven by rising publication costs and static and often shrinking budgets, librarians are embracing resource sharing as an idea whose time may finally have come. Resource sharing in electronic environments is creating a shift in the concept of the library as a warehouse of print-based collection to the idea of the library as the point of access to need information. Much of the library's material will be delivered in electronic form, or printed. In this new paradigm libraries can not be expected to su n.0, pport research from their own collections. These changes, along with improved communications, computerization of administrative functions, fax and digital delivery of articles, advancement of data storage technologies, are improving the procedures and means for delivering needed information to library users. In short, for resource sharing to be truly effective and efficient, however, automation and data communication are essential. The possibility of using full-text online databases as a su n.0, pplement to interlibrary loan for document delivery is examined. At this point, this article presents possibility of using full-text online databases as a means to interlibrary loan for document delivery. The findings of the study can be summarized as follows : First, turn-around time and the cost of getting a hard copy of a journal article from online full-text databases was comparable to the other document delivery services. Second, the use of full-text online databases should be considered as a method for promoting interlibrary loan services, as it is more cost-effective and labour saving. Third, for full-text databases to work as a document delivery system the databases must contain as many periodicals as possible and be loaded on as many systems as possible. Forth, to contain many scholarly research journals on full-text databases, we need guidelines to cover electronic document delivery, electronic reserves. Fifth, to be a full full-text database, more advanced information technologies are really needed.

  • PDF

Local Similarity based Document Layout Analysis using Improved ARLSA

  • Kim, Gwangbok;Kim, SooHyung;Na, InSeop
    • International Journal of Contents
    • /
    • v.11 no.2
    • /
    • pp.15-19
    • /
    • 2015
  • In this paper, we propose an efficient document layout analysis algorithm that includes table detection. Typical methods of document layout analysis use the height and gap between words or columns. To correspond to the various styles and sizes of documents, we propose an algorithm that uses the mean value of the distance transform representing thickness and compare with components in the local area. With this algorithm, we combine a table detection algorithm using the same feature as that of the text classifier. Table candidates, separators, and big components are isolated from the image using Connected Component Analysis (CCA) and distance transform. The key idea of text classification is that the characteristics of the text parallel components that have a similar thickness and height. In order to estimate local similarity, we detect a text region using an adaptive searching window size. An improved adaptive run-length smoothing algorithm (ARLSA) was proposed to create the proper boundary of a text zone and non-text zone. Results from experiments on the ICDAR2009 page segmentation competition test set and our dataset demonstrate the superiority of our dataset through f-measure comparison with other algorithms.

Text Document Classification Scheme using TF-IDF and Naïve Bayes Classifier (TF-IDF와 Naïve Bayes 분류기를 활용한 문서 분류 기법)

  • Yoo, Jong-Yeol;Hyun, Sang-Hyun;Yang, Dong-Min
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference
    • /
    • 2015.10a
    • /
    • pp.242-245
    • /
    • 2015
  • Recently due to large-scale data spread in digital economy, the era of big data is coming. Through big data, unstructured text data consisting of technical text document, confidential document, false information documents are experiencing serious problems in the runoff. To prevent this, the need of art to sort and process the document consisting of unstructured text data has increased. In this paper, we propose a novel text classification scheme which learns some data sets and correctly classifies unstructured text data into two different categories, True and False. For the performance evaluation, we implement our proposed scheme using $Na{\ddot{i}}ve$ Bayes document classifier and TF-IDF modules in Python library, and compare it with the existing document classifier.

  • PDF

Encoding and language detection of text document using Deep learning algorithm (딥러닝 알고리즘을 이용한 문서의 인코딩 및 언어 판별)

  • Kim, Seonbeom;Bae, Junwoo;Park, Heejin
    • The Journal of Korean Institute of Next Generation Computing
    • /
    • v.13 no.5
    • /
    • pp.124-130
    • /
    • 2017
  • Character encoding is the method used to represent characters or symbols on a computer, and there are many encoding detection software tools. For the widely used encoding detection software"uchardet", the accuracy of encoding detection of unmodified normal text document is 91.39%, but the accuracy of language detection is only 32.09%. Also, if a text document is encrypted by substitution, the accuracy of encoding detection is 3.55% and the accuracy of language detection is 0.06%. Therefore, in this paper, we propose encoding and language detection of text document using the deep learning algorithm called LSTM(Long Short-Term Memory). The results of LSTM are better than encoding detection software"uchardet". The accuracy of encoding detection of normal text document using the LSTM is 99.89% and the accuracy of language detection is 99.92%. Also, if a text document is encrypted by substitution, the accuracy of encoding detection is 99.26%, the accuracy of language detection is 99.77%.

Automatic Text Categorization Using Passage-based Weight Function and Passage Type (문단 단위 가중치 함수와 문단 타입을 이용한 문서 범주화)

  • Joo, Won-Kyun;Kim, Jin-Suk;Choi, Ki-Seok
    • The KIPS Transactions:PartB
    • /
    • v.12B no.6 s.102
    • /
    • pp.703-714
    • /
    • 2005
  • Researches in text categorization have been confined to whole-document-level classification, probably due to lacks of full-text test collections. However, full-length documents availably today in large quantities pose renewed interests in text classification. A document is usually written in an organized structure to present its main topic(s). This structure can be expressed as a sequence of sub-topic text blocks, or passages. In order to reflect the sub-topic structure of a document, we propose a new passage-level or passage-based text categorization model, which segments a test document into several Passages, assigns categories to each passage, and merges passage categories to document categories. Compared with traditional document-level categorization, two additional steps, passage splitting and category merging, are required in this model. By using four subsets of Routers text categorization test collection and a full-text test collection of which documents are varying from tens of kilobytes to hundreds, we evaluated the proposed model, especially the effectiveness of various passage types and the importance of passage location in category merging. Our results show simple windows are best for all test collections tested in these experiments. We also found that passages have different degrees of contribution to main topic(s), depending on their location in the test document.

Research and Development of Document Recognition System for Utilizing Image Data (이미지데이터 활용을 위한 문서인식시스템 연구 및 개발)

  • Kwag, Hee-Kue
    • The KIPS Transactions:PartB
    • /
    • v.17B no.2
    • /
    • pp.125-138
    • /
    • 2010
  • The purpose of this research is to enhance document recognition system which is essential for developing full-text retrieval system of the document image data stored in the digital library of a public institution. To achieve this purpose, the main tasks of this research are: 1) analyzing the document image data and then developing its image preprocessing technology and document structure analysis one, 2) building its specialized knowledge base consisting of document layout and property, character model and word dictionary, respectively. In addition, developing the management tool of this knowledge base, the document recognition system is able to handle the various types of the document image data. Currently, we developed the prototype system of document recognition which is combined with the specialized knowledge base and the library of document structure analysis, respectively, adapted for the document image data housed in National Archives of Korea. With the results of this research, we plan to build up the test-bed and estimate the performance of document recognition system to maximize the utilization of full-text retrieval system.

A Study of Patent Document Processing by SGML (SGML을 이용한 특허정보처리 연구)

  • Kwon, Young-Sook
    • Journal of Information Management
    • /
    • v.30 no.3
    • /
    • pp.44-54
    • /
    • 1999
  • A description of SGML(Standard Generalized Markup Language) is given together with a detailed description of WIPO Standard ST.32. The benefits of the use of SGML are highlighted-its system Independence and flexibility in building publication systems and full-text databases. A structure of WIPO Standard ST,32 based patent content is defined by DTD(document type definition) written in ST.32, and full-text itself is described with generalized markup depending on DTD. This article explains how to represent a document structure : a hierarchy structure like a entire document, a specific, sub-document, a paragraph, or non-hirarchy structure like a table drawings, or chemical structures. Merits of SGML In patent document processing are also discussed.

  • PDF