• Title/Summary/Keyword: Original documents

Multi-Vector Document Embedding Using Semantic Decomposition of Complex Documents (복합 문서의 의미적 분해를 통한 다중 벡터 문서 임베딩 방법론)

  • Park, Jongin;Kim, Namgyu
    • Journal of Intelligence and Information Systems / v.25 no.3 / pp.19-41 / 2019
  • With demand for text data analysis increasing rapidly, research and investment in text mining are being actively pursued in academia and across industries. Text mining is generally conducted in two steps. In the first step, the text of the collected documents is tokenized and structured to convert the original documents into a computer-readable form. In the second step, tasks such as document classification, clustering, and topic modeling are conducted according to the purpose of the analysis. Until recently, text mining studies focused on the second step. However, with the discovery that the text structuring process substantially influences the quality of the analysis results, various embedding methods have been actively studied to improve that quality by preserving the meaning of words and documents when text data are represented as vectors. Unlike structured data, which can be directly applied to a variety of operations and traditional analysis techniques, unstructured text must first undergo a structuring task that transforms the original document into a form that the computer can understand. Mapping arbitrary objects into a space of a specific dimension while maintaining their algebraic properties, in order to structure text data, is called "embedding." Recently, attempts have been made to embed not only words but also sentences, paragraphs, and entire documents. In particular, as demand for document embedding has increased rapidly, many algorithms have been developed to support it. Among them, doc2Vec, which extends word2Vec and embeds each document into one vector, is the most widely used. However, the traditional document embedding method represented by doc2Vec generates a vector for each document using all of the words included in the document. As a result, the document vector is affected not only by core words but also by miscellaneous words. In addition, traditional document embedding schemes usually map each document to a single corresponding vector, so it is difficult to represent a complex document with multiple subjects accurately as a single vector. In this paper, we propose a new multi-vector document embedding method to overcome these limitations. This study targets documents that explicitly separate body content and keywords. For a document without keywords, the method can be applied after extracting keywords through various analysis methods. However, since keyword extraction is not the core subject of the proposed method, we describe the process of applying the method to documents with predefined keywords. The proposed method consists of (1) Parsing, (2) Word Embedding, (3) Keyword Vector Extraction, (4) Keyword Clustering, and (5) Multiple-Vector Generation. The specific process is as follows. All text in a document is tokenized, and each token is represented as an N-dimensional real-valued vector through word embedding. After that, to overcome the limitation that traditional document embedding is affected by miscellaneous words as well as core words, the vectors corresponding to the keywords of each document are extracted to form a set of keyword vectors for each document. Next, clustering is conducted on the set of keywords for each document to identify the multiple subjects included in the document. Finally, a multi-vector is generated from the vectors of the keywords constituting each cluster. Experiments on 3,147 academic papers revealed that the traditional single-vector approach cannot properly map complex documents because of interference among subjects in each vector. With the proposed multi-vector method, we ascertained that complex documents can be vectorized more accurately by eliminating the interference among subjects.
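
The five steps named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which clustering algorithm is used or how a cluster is turned into a vector, so plain k-means and a component-wise mean are assumptions made here, and the 2-dimensional word vectors and the function names (`mean_vector`, `kmeans`, `multi_vector_embedding`) are hypothetical.

```python
from math import dist


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def kmeans(vectors, k, iterations=20):
    """Plain k-means; the first k vectors serve as the initial centroids."""
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: dist(v, centroids[c])) for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = mean_vector(members)
    return assignment


def multi_vector_embedding(keywords, word_vectors, k):
    """Return one vector per keyword cluster (assumed aggregation: mean)."""
    # Step 3: keep only the vectors of the document's keywords.
    keyword_vectors = [word_vectors[w] for w in keywords if w in word_vectors]
    k = min(k, len(keyword_vectors))
    # Step 4: cluster the keyword vectors to separate the subjects.
    assignment = kmeans(keyword_vectors, k)
    clusters = {}
    for vector, label in zip(keyword_vectors, assignment):
        clusters.setdefault(label, []).append(vector)
    # Step 5: one vector per cluster.
    return [mean_vector(members) for _, members in sorted(clusters.items())]


# Hypothetical word vectors standing in for the output of steps 1 and 2.
toy_vectors = {
    "embedding": [1.0, 0.0],
    "doc2vec": [0.9, 0.1],
    "fermentation": [0.0, 1.0],
    "soybean": [0.1, 0.9],
}
keywords = ["embedding", "fermentation", "doc2vec", "soybean"]

doc_vectors = multi_vector_embedding(keywords, toy_vectors, k=2)
single_vector = mean_vector([toy_vectors[w] for w in keywords])
```

In this toy case the single averaged vector lies midway between the two subjects, while the two cluster vectors each stay close to one subject, which is the kind of interference the abstract describes.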

A Study on the Classified Jang(Fermented Soybean) in Goryeo and Chosun Dynasty Period (고려시대 및 조선시대 장류)

  • Ann, Yong-Geun;Woo, Nariyah
    • The Korean Journal of Food And Nutrition / v.25 no.3 / pp.460-482 / 2012
  • On the basis of cookbooks and the Data Base of the Korean Classics (http://db.itkc.or.kr/itkcdb/mainIndexIframe.jsp), this paper analyzes the fermented soybean products listed in the general documents of the Goryeo Dynasty (918~1392) and the Chosun Dynasty (1392~1897). In the Goryeo Dynasty, there are 15 kinds of Jang (soybean paste or solution), among which are Jang (soybean paste fermented by mold) (6 documents), Yeomgjang, Yeomshi (2), and Gaejang (1). However, no cookbook of that time survives. The Goryeo court relieved famine-stricken people by providing them with Jang. In the Chosun Dynasty, 111 kinds of Jang are listed in the general documents and 153 kinds in the cookbooks. Fifty-five kinds of general Jang, such as Jang (204), Yeomjang (63), Chojang, Goojang (7), and Gaejang (6), are listed in the general documents, and 55 kinds of Jang, such as Sookwhangjang (9 cookbooks), Daemaekjang (8), Myeonjang (8), Saengwhangjang (8), and Yooinjang (8), are listed in the cookbooks; 13 of them are of Chinese origin. A total of 9 kinds of Ganjang (soybean solution fermented by mold), such as Soojang (30), Cheongjang (23), Gamjang (8), and Ganjang (3), are found in the general documents. In the cookbooks, 12 kinds of Jang, such as Cheongjang (10), Cheonrijang (4), and Ganjang (3), are listed. Nine kinds of Gochoojang (red pepper-soybean paste), such as Chojang (12) and Gochojang (3), are listed in the general documents, and 9 kinds, such as Gochojang (7), Manchojang (7), and rapid Manchojang (4), are in the cookbooks. In addition, 16 kinds of Yookjang (fermented soybean-meat paste), such as Haejang (15), Hyejang (11), and Yookjang (11), are found in the documents, and 22 kinds, such as Nanjang (9), Gejang (6), Yookjang (5), and Shoigogijang (4), are in the cookbooks. Eighteen kinds of Shi (soybean paste fermented by bacteria), such as Yeomshi (40), Shi (35), and Shijang (6), are recorded in the documents, and 19 kinds, such as Jeonkookjang (6), Shi (4), and Sooshijang (4), are in the cookbooks; 11 of them are of Chinese origin. Six kinds of Jipjang (aqueous soybean paste), such as Jipjang (7), Uoopjang (4), Pojang (2), and Jangzoop (2), are recorded in the documents, and 15 kinds, such as Jipjang (9), Zoopjeo (7), and Hajeoljipjang (5), are in the cookbooks. Soybean paste or solution for relieving hunger is not recorded in the documents; however, the Chosun court used general Jang to relieve famine-stricken people. Twenty-one kinds of Jang for relieving famine-stricken people, such as Pojang (7), rapid Jang (6), and Sasamgilgyeongjang (4), are listed in the cookbooks. Geonjang (dried soybean paste), Nanjang (egg-soybean paste), Doojang (soybean paste), Maljang (random soybean paste), Myeonjang (wheat-soybean paste), Sodoojang (red bean-soybean paste), Yookjang (soybean-meat paste), and Jang (soybean paste) are recorded in the documents as well as in the cookbooks. Jang and Shi of Chinese origin are recorded in the cookbooks but not in the general documents. Therefore, it seems that they were not passed down to the general public.

A Study on Improvement of Information Methodology for SMEs (중소기업 정보화방법론 개선 연구)

  • Sun, Nam-Sun
    • Proceedings of the Korea Database Society Conference / 2010.06a / pp.13-19 / 2010
  • Information competitiveness accounts for a substantial part of the business competitiveness required for management in the knowledge-information society of the 21st century. To improve quality, productivity, and competitiveness through informatization among SMEs, which face particular difficulties in a rapidly changing business environment, the government has operated the "SME Information Support Project" for the past 8 years. The standard development methodology for this project, known as EISDM (Enterprise Information System Development Methodology), supports communication between the IT businesses and SMEs participating in the project and standardizes output document formats and how such documents are prepared. Unfortunately, the number of personnel taking part in a development project for an SME is no more than 2~4 per site on average. Further, they are required to complete requirements analysis, development, testing, and operation in about 6 months, which is a very short period. Moreover, the demand for documentation is excessive, and it is likely to end up as a formality carried out only for supervision and inspection. That is, documents may be produced merely for the sake of documentation and prove useless once the project finishes. Therefore, this study proposes an improved information system development methodology that takes into account SMEs' characteristics and environment, so as to relieve developers of the excessive documentation burden, to save time and resources through efficient management of software development, which is the original purpose of the methodology, and to produce software of the required quality.

Design of WWW IR System Based on Keyword Clustering Architecture (색인어 말뭉치 처리를 기반으로 한 웹 정보검색 시스템의 설계)

  • 송점동;이정현;최준혁
    • The Journal of Information Technology / v.1 no.1 / pp.13-26 / 1998
  • In general information retrieval systems, improper keywords are often extracted and the search results differ from what the user intended, because the systems use only term frequency information to select keywords and do not consider their meanings. Because recall and precision are inversely related, improving precision is limited unless the semantics of keywords are considered. In this paper, a system that improves precision without decreasing recall is designed and implemented by introducing a client user module that sends feedback reflecting the user's intention to the server. For this purpose, keywords are selected using relative term frequency and inverse document frequency, and co-occurring words are extracted from the original documents. The keywords are then clustered by their semantics using the calculated mutual information. The system can reject inappropriate documents using the segmented semantic information according to feedback from the client user module. Consequently, the precision of the system is improved without decreasing recall.
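
The two quantities this abstract relies on, relative term frequency weighted by inverse document frequency and mutual information between co-occurring words, can be sketched as follows. The exact formulas used in the paper are not given in the abstract, so the common `tf * log(N / df)` weighting and document-level pointwise mutual information are assumptions here, and the toy documents are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log


def tf_idf(documents):
    """Relative term frequency times log inverse document frequency."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        scores.append(
            {t: (c / total) * log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return scores


def pointwise_mutual_information(documents):
    """PMI of term pairs, counting co-occurrence within the same document."""
    n_docs = len(documents)
    term_docs = Counter()
    pair_docs = Counter()
    for doc in documents:
        terms = sorted(set(doc))
        term_docs.update(terms)
        pair_docs.update(combinations(terms, 2))
    return {
        (a, b): log(
            (joint / n_docs) / ((term_docs[a] / n_docs) * (term_docs[b] / n_docs))
        )
        for (a, b), joint in pair_docs.items()
    }


# Hypothetical tokenized documents.
docs = [
    ["web", "search", "index"],
    ["web", "search", "query"],
    ["soybean", "paste"],
    ["soybean", "paste", "web"],
]

scores = tf_idf(docs)
pmi = pointwise_mutual_information(docs)
```

Pairs with high positive PMI (here "paste" and "soybean") are candidates for the same semantic cluster, while pairs with negative PMI (here "paste" and "web") co-occur less often than chance would predict.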

A Study on the Anti-copying method for hard copy documents using Human Visual System (인간시각시스템을 이용한 하드카피 복사방지기법에 관한 연구)

  • Lee Kang-Ho
    • Journal of the Korea Society of Computer and Information / v.11 no.4 s.42 / pp.291-297 / 2006
  • This paper presents a new anti-copying method for hard copy documents. The approach protects a document and its content from unauthorized copying and forgery while using ordinary paper and an ordinary printer. The paper copy is protected because, to the human visual system, a photocopy appears different from the authorized printed original. The anti-copying pattern is created through pointillism with a halftone cell and a spot. The proposed method is useful against unauthorized copying and forgery with high-resolution scanners and photocopiers.

The Refinement Effect of Foreign Word Transliteration Query on Meta Search (메타 검색에서 외래어 질의 정제 효과)

  • Lee, Jae-Sung
    • The KIPS Transactions:PartB / v.15B no.2 / pp.171-178 / 2008
  • Foreign word transliterations are not used consistently in documents, which prevents information retrieval systems based on exact term matching from retrieving some important relevant documents. In this paper, a meta search method is proposed that expands an input foreign word transliteration query into its variants and refines them in order to retrieve more relevant documents. First, the method expands a transliteration query into variants using a statistical method. Second, it selects the valid variants: it submits each variant to the retrieval systems beforehand and checks its validity by counting the number of appearances of the variant in the retrieved documents and calculating the similarity of the variant's context. Experimental results showed that querying with the variants produced in the first step, which served as the baseline, achieved an average F-measure of 38%, whereas querying with the refined variants from the second step, which is the proposed method, significantly improved the average F-measure to 81%.
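
The refinement step and the F-measure used to evaluate it can be sketched as below. This is only an illustration under stated assumptions: the abstract does not specify how context similarity is computed, so Jaccard overlap of co-occurring words is assumed, the thresholds `min_hits` and `min_similarity` are hypothetical, and the romanized strings are made-up stand-ins for transliteration variants. The numbers it produces are unrelated to the 38% and 81% reported in the abstract.

```python
def context_words(term, documents):
    """Words that appear in the same document as the term."""
    words = set()
    for text in documents:
        tokens = text.split()
        if term in tokens:
            words.update(t for t in tokens if t != term)
    return words


def jaccard(a, b):
    """Jaccard overlap of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def refine_variants(query, variants, documents, min_hits=1, min_similarity=0.2):
    """Keep variants that occur often enough in contexts similar to the query's."""
    query_context = context_words(query, documents)
    kept = []
    for variant in variants:
        hits = sum(1 for text in documents if variant in text.split())
        similarity = jaccard(query_context, context_words(variant, documents))
        if hits >= min_hits and similarity >= min_similarity:
            kept.append(variant)
    return kept


def retrieve(terms, documents):
    """Indices of documents containing any of the terms (exact token match)."""
    return {
        i for i, text in enumerate(documents)
        if any(t in text.split() for t in terms)
    }


def f_measure(retrieved, relevant):
    """Harmonic mean of precision and recall."""
    hits = len(set(retrieved) & set(relevant))
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Hypothetical corpus; documents 0-2 are relevant to the query.
documents = [
    "dijiteol camera review",
    "dijital camera price",
    "dijiteol camera lens",
    "dijitol river fishing",
    "film camera history",
]
relevant = {0, 1, 2}
query = "dijiteol"
variants = ["dijital", "dijitol", "dijitel"]

kept = refine_variants(query, variants, documents)
f_expanded = f_measure(retrieve([query] + variants, documents), relevant)
f_refined = f_measure(retrieve([query] + kept, documents), relevant)
```

In this toy corpus the unrefined expansion retrieves one unrelated document through a spurious variant, and dropping that variant raises the F-measure.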

A study on the Strategy of e-L/C of Credit Utilization by Transaction Cost (거래비용측면에서 전자신용장 활용전략에 대한 연구)

  • Cho, Won-Gil
    • International Commerce and Information Review / v.16 no.1 / pp.247-269 / 2014
  • This study presents strategies for utilizing the e-L/C from the perspective of transaction cost. The documentary credit is the form of payment most used in export and import transactions to support the importer's credit and guarantee the purchase price. The beneficiary must provide the documents required in a letter of credit in order to claim payment from the issuing bank. In practice, this makes the procedure complex and entails considerable preparation and expense for the required and additional documents and for meeting the conditions of the credit. As a result, issues have been raised about the time and cost of the payment process. Because of the delays in the existing trade procedure, there are calls to use the e-L/C to address these problems from the transaction cost side. This study examines the use of the e-L/C to overcome the time and cost problems that arise from the transaction cost side. To do so, it identifies the problems of the original credit and the e-L/C and then proposes a strategy for the e-L/C from the transaction cost perspective.

A Study on Information Retrieval Effectiveness by Cited References (인용문헌에 의한 정보검색 효과에 관한 고찰)

  • Lee Lanju
    • Journal of the Korean Society for Library and Information Science / v.27 / pp.265-289 / 1994
  • Databases publicly available for online searching permit both citation and subject searching; however, subject searching has dominated the online search environment. Despite the power of citation searching, it may be underutilized. This study explored the relationship between the number of cited references used in a citation search and information retrieval effectiveness, a relatively unstudied phenomenon. Three articles in the library and information science literature were chosen to represent sample questions. Cited reference searches were conducted for each article and each of its references. All searches were conducted in Social Scisearch and Scisearch on DIALOG. Relevance judgments on the retrieved citations were obtained from the authors of the original articles. This research focused on analyzing, in terms of information retrieval effectiveness, the overlap among postings sets retrieved by various combinations of cited references. The findings from the three case studies clearly showed that the more cited references used for the citation search, the better the performance in terms of retrieving more relevant documents, up to a point of diminishing returns. In addition, the overall level of overlap among sets of relevant documents was generally found to be low. Therefore, if only some of the cited references among many candidates are used for a citation search, a significant proportion of relevant documents may be missed. The analysis of the characteristics of cited references provided ways to predict which cited references would be useful for improving information retrieval. The findings of this comprehensive exploratory study are of interest for both theoretical and practical reasons. They contribute to the development of a theoretical model for the effective use of the citation search. This model might also be implemented in operational online systems. In addition, the findings will potentially help online searchers improve their search strategies using the citation search so that they can better achieve their information retrieval goals: the retrieval of items relevant to a given question and the suppression of nonrelevant items.
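
The two measurements discussed in this abstract, how recall grows as more cited references are searched and how much the relevant postings sets overlap, can be sketched as follows. The abstract does not define its overlap measure, so Jaccard overlap is an assumption here, and the postings sets and relevance judgments are hypothetical.

```python
def cumulative_recall(postings, relevant):
    """Recall after adding each cited reference's postings set in turn."""
    retrieved = set()
    recalls = []
    for posting in postings:
        retrieved |= posting
        recalls.append(len(retrieved & relevant) / len(relevant))
    return recalls


def relevant_overlap(a, b, relevant):
    """Jaccard overlap between the relevant parts of two postings sets."""
    a_rel, b_rel = a & relevant, b & relevant
    union = a_rel | b_rel
    return len(a_rel & b_rel) / len(union) if union else 0.0


# Hypothetical postings sets (document ids), one per cited reference searched.
postings = [{1, 2, 9}, {2, 3, 4}, {4, 5, 10}, {5}]
relevant = {1, 2, 3, 4, 5, 6}

recalls = cumulative_recall(postings, relevant)
first_pair_overlap = relevant_overlap(postings[0], postings[1], relevant)
```

In this toy data recall rises with each of the first three cited references and then stops improving, and the first two relevant sets share only one of four documents, which mirrors the pattern of low overlap and diminishing returns that the abstract reports.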

Restoration and Reproduction Study for Antique Documents (고문서 복원 및 재현 시스템 연구)

  • Kim, Young-Sung;Kim, Su-Ho;Shin, Jong-Il;Park, Soo-Youl;Shin, Seung-Rim;Jun, Kun;Son, Young-A
    • Textile Coloration and Finishing / v.21 no.2 / pp.48-53 / 2009
  • Reproducing antique documents is important for sharing their contents and original appearance, in terms of both textual and artistic message. 'Pine tree ink stick' and 'oil ink stick' were used in most written documentary works. Thus, the approach of this study has considerable cultural significance. In this work, we investigated the reproduction and restoration of antique documents. After comparing and analyzing several types of ink stick, we prepared several ink samples while controlling viscosity, surfactant, and thickening agent, and applied these inks to the target antique document. Several reproduced samples showed potential for practical application in reproduction and restoration.

Forgery Protection System and 2D Bar-code inserted Watermark (워터마크가 삽입된 이차원 바코드와 위.변조 방지 시스템)

  • Lee, Sang-Kyung;Ko, Kwang-Enu;Sim, Kwee-Bo
    • Journal of the Korean Institute of Intelligent Systems / v.20 no.6 / pp.825-830 / 2010
  • Generally, copy protection marks and 2D barcode techniques are widely used for forgery protection in printed public documents. However, it is hard to distinguish an original from a copy with existing methods, because the existing 2D barcode is separate from the copy protection mark and can be recognized only by a dedicated optical barcode scanner. Therefore, in this paper, we propose a forgery protection technique that distinguishes an original from a copy by using a watermark-inserted 2D barcode, which can be accurately distinguished not only by the naked eye but also by a scanner. The copy protection mark consists of deformed patterns caused by the low-pass filter characteristic of digital I/O devices. We verified the performance of the proposed technique by applying histogram analysis to the original, copy, and scanned copy images of the printed documents. We also suggest a 2D barcode confirmation system that can be accessed through an online server using certification key data detected by a web camera or cell phone camera.