• Title/Summary/Keyword: multiple language document

Search Result 23, Processing Time 0.023 seconds

A Methodology for Automatic Multi-Categorization of Single-Categorized Documents (단일 카테고리 문서의 다중 카테고리 자동확장 방법론)

  • Hong, Jin-Sung;Kim, Namgyu;Lee, Sangwon
    • Journal of Intelligence and Information Systems / v.20 no.3 / pp.77-92 / 2014
  • Recently, numerous documents including unstructured data and text have been created due to the rapid increase in the usage of social media and the Internet. Each document is usually provided with a specific category for the convenience of the users. In the past, the categorization was performed manually. However, manual categorization not only cannot guarantee accuracy but also requires a large amount of time and cost. Many studies have been conducted towards the automatic creation of categories to overcome the limitations of manual categorization. Unfortunately, most of these methods cannot be applied to categorizing complex documents with multiple topics because they assume that one document can be categorized into one category only. In order to overcome this limitation, some studies have attempted to categorize each document into multiple categories. However, they are also limited in that their learning process requires training on a multi-categorized document set. These methods therefore cannot be applied to multi-categorization of most documents unless multi-categorized training sets are provided. To remove the requirement of a multi-categorized training set imposed by traditional multi-categorization algorithms, we propose a new methodology that can extend the category of a single-categorized document to multiple categories by analyzing relationships among categories, topics, and documents. First, we attempt to find the relationship between documents and topics by using the result of topic analysis for single-categorized documents. Second, we construct a correspondence table between topics and categories by investigating the relationship between them. Finally, we calculate the matching scores of each document for multiple categories.
The results imply that a document can be classified into a certain category if and only if the matching score is higher than the predefined threshold. For example, we can classify a certain document into three categories that have larger matching scores than the predefined threshold. The main contribution of our study is that our methodology can improve the applicability of traditional multi-category classifiers by generating multi-categorized documents from single-categorized documents. Additionally, we propose a module for verifying the accuracy of the proposed methodology. For performance evaluation, we performed intensive experiments with news articles. News articles are clearly categorized by theme and contain less vulgar language and slang than other typical text documents. We collected news articles from July 2012 to June 2013. The number of articles varies widely across categories. This is because readers have different levels of interest in each category. Additionally, the variation is also attributed to the differences in the frequency of the events in each category. In order to minimize the distortion of the result from the number of articles in different categories, we extracted 3,000 articles equally from each of the eight categories. Therefore, the total number of articles used in our experiments was 24,000. The eight categories were "IT Science," "Economy," "Society," "Life and Culture," "World," "Sports," "Entertainment," and "Politics." By using the news articles that we collected, we calculated the document/category correspondence scores by utilizing topic/category and document/topic correspondence scores. The document/category correspondence score can be said to indicate the degree of correspondence of each document to a certain category. As a result, we could present two additional categories for each of the 23,089 documents.
Precision, recall, and F-score were 0.605, 0.629, and 0.617, respectively, when only the top 1 predicted category was evaluated, whereas they were 0.838, 0.290, and 0.431 when the top 1-3 predicted categories were considered. It was very interesting to find a large variation among the eight categories in precision, recall, and F-score.
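The scoring pipeline described in this abstract can be illustrated with a minimal sketch. The abstract does not say how the document/topic and topic/category scores are combined, so the weighted sum below is an assumption, as are the function names and the toy matrices. The `f_score` helper uses the standard harmonic-mean definition, which reproduces the reported F-scores from the reported precision and recall.

```python
def document_category_scores(doc_topic, topic_category):
    """Combine document/topic and topic/category scores.

    Assumption: the document/category score is the sum over topics of
    (document/topic weight) * (topic/category weight). The abstract does
    not specify the exact combination rule.
    """
    n_categories = len(topic_category[0])
    scores = []
    for topic_weights in doc_topic:
        row = [0.0] * n_categories
        for t, weight in enumerate(topic_weights):
            for c in range(n_categories):
                row[c] += weight * topic_category[t][c]
        scores.append(row)
    return scores


def assign_categories(scores, threshold):
    """Return, per document, the categories whose score exceeds the threshold."""
    return [[c for c, s in enumerate(row) if s > threshold] for row in scores]


def f_score(precision, recall):
    """Standard F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: 2 documents, 3 topics, 2 categories.
doc_topic = [[0.7, 0.3, 0.0], [0.1, 0.1, 0.8]]
topic_category = [[0.9, 0.1], [0.5, 0.5], [0.0, 1.0]]

scores = document_category_scores(doc_topic, topic_category)
print(assign_categories(scores, threshold=0.2))
print(round(f_score(0.605, 0.629), 3), round(f_score(0.838, 0.290), 3))
```

With a threshold of 0.2, the first toy document is placed in both categories and the second in only one, which is the single-to-multiple extension the abstract describes.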

Design and Implementation of OCR Correction Model for Numeric Digits based on a Context Sensitive and Multiple Streams (제한적 문맥 인식과 다중 스트림을 기반으로 한 숫자 정정 OCR 모델의 설계 및 구현)

  • Shin, Hyun-Kyung
    • The KIPS Transactions:PartD / v.18D no.1 / pp.67-80 / 2011
  • In an automated business document processing system that maintains financial data, errors in query-based retrieval of numbers are critical to the overall performance and usability of the system. Automatic spelling correction methods have emerged and have played an important role in the development of information retrieval systems. However, the scope of these methods has been limited to symbols, for example alphabetic letter strings, which can be stored in the form of trainable templates or a custom dictionary. Numbers, on the other hand, are sequences of digits that cannot be stored in a dictionary and instead form a pure Markov sequence. In this paper we propose a new OCR model for spelling correction of numbers that uses multiple streams and context-based correction on top of a probabilistic information retrieval framework. We implemented the proposed error correction model as a sub-module and integrated it into an existing automated invoice document processing system. We also present comparative test results indicating that our model significantly enhances the overall precision of the system.

A Field Survey of Idiosyncratic Dwelling Space attached to Chang-Duk Palace's West Fence (창덕궁 담에 접한 자생주거지에 관한 연구 - 원서동 무허가 94번지의 실측 및 개선 안 기초연구 -)

  • 윤숙희;정진원
    • Journal of the Korean housing association / v.14 no.1 / pp.29-40 / 2003
  • The purpose of this research is to document and analyse the spatial transformation of unauthorized dwelling units on a peculiar site in Seoul. The site is physically attached to the rear of the west boundary wall of Chang-Duk Palace. These dwelling units appropriated not only the site, a narrow street that had once been a stream, but also two parallel walls belonging to others. The two walls, one being the palace wall and the other the wall of a house set back about 3.5 m from the palace wall under the Cultural Properties Protection Law, have served as the main structural members forming the shelter. By examining the realm of time, which provides the base of the spatial realm, this research shows how the multiple linkages tangled in an illegal shack gained and actualized an idiosyncratic architectural language, with a spontaneous order inherent in the inhabitants.

Korean Voice Phishing Text Classification Performance Analysis Using Machine Learning Techniques (머신러닝 기법을 이용한 한국어 보이스피싱 텍스트 분류 성능 분석)

  • Boussougou, Milandu Keith Moussavou;Jin, Sangyoon;Chang, Daeho;Park, Dong-Joo
    • Proceedings of the Korea Information Processing Society Conference / 2021.11a / pp.297-299 / 2021
  • Text classification is one of the popular tasks in Natural Language Processing (NLP), used to classify texts or documents in applications such as sentiment analysis and email filtering. Nowadays, state-of-the-art (SOTA) Machine Learning (ML) and Deep Learning (DL) algorithms are the core engines used to perform these classification tasks with high accuracy, and they show satisfying results. This paper conducts a benchmark performance analysis of multiple SOTA algorithms on the first known labeled Korean voice phishing dataset, called KorCCVi. Experimental results on a test set of 366 samples reveal which algorithm performs best in terms of training time and metrics such as accuracy and F1 score.

Multi-source information integration framework using self-supervised learning-based language model (자기 지도 학습 기반의 언어 모델을 활용한 다출처 정보 통합 프레임워크)

  • Kim, Hanmin;Lee, Jeongbin;Park, Gyudong;Sohn, Mye
    • Journal of Internet Computing and Services / v.22 no.6 / pp.141-150 / 2021
  • Based on Artificial Intelligence technology, AI-enabled warfare is expected to become the main issue in future warfare. Natural language processing is a core AI technology, and it can significantly reduce the information burden that commanders and staff face in understanding reports, information objects, and intelligence written in natural language. In this paper, we propose a Language model-based Multi-source Information Integration (LAMII) framework to reduce the information overload of commanders and support rapid decision-making. The proposed LAMII framework consists of two key steps: representation learning based on language models in a self-supervised way, and document integration using autoencoders. In the first step, representation learning that can identify the similarity relationship between two heterogeneous sentences is performed using the self-supervised learning technique. In the second step, using the learned model, documents from multiple sources that imply similar contents or topics are found and integrated. At this point, the autoencoder is used to measure the information redundancy of the sentences in order to remove duplicate sentences. To demonstrate the superiority of the proposed framework, we conducted comparison experiments using existing language models and the benchmark sets used to evaluate their performance. The experiments demonstrated that the proposed LAMII framework predicts the similarity relationship between heterogeneous sentences more effectively than the other language models.

The Design and Implementation of the System for Processing Well-Formed XML Document on the Client-side (클라이언트 상의 Well-Formed XML 문서 처리 시스템의 설계 및 구현)

  • Song, Jong-Chul;Moon, Byung-Joo;Hong, Gi-Chai;Cheong, Hyun-Soo;Kim, Gyu-Tae;Lee, Soo-Youn
    • The Transactions of the Korea Information Processing Society / v.7 no.10 / pp.3236-3246 / 2000
  • XML is a meta-language like SGML and can also be construed as a simplified Internet version of SGML, used in conjunction with XLL, XPointer, and XSL. The W3C also established the DTDless Well-Formed XML document so that XML documents can be used on the Web. However, no system has been offered that provides browsing, linking, and DTD generation facilities and efficiently processes DTDless Well-Formed XML documents. This paper studies the design and implementation of a system that processes DTDless Well-Formed XML documents on the client side. The system consists of a Well-Formed XML viewer that displays Well-Formed XML documents, an XLL processor that processes XLL, and an Auto DTD generator that automatically constructs DTDs based on multiple documents of the same class. This study focuses on automatic DTD generation during hyperlink navigation and on an implementation of extended links based on XLL and XPointer. ID and XPointer location addresses are used as the address mode in the links. The implemented system conforms to the validation of extended link facilities, extracts DTDs from Well-Formed XML documents of the same class that share the same root element, and constructs a generalized DTD.
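As a rough illustration of what constructing a generalized DTD from multiple well-formed documents of the same class can look like, the sketch below merges the element structure of several documents into loose element declarations. This is not the paper's algorithm: the function name `infer_dtd`, the `(a|b)*` content models, and the decision to ignore attributes are all simplifying assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def infer_dtd(xml_documents):
    """Infer loose <!ELEMENT ...> declarations from well-formed XML strings.

    Assumptions for this sketch: attributes are ignored, and each content
    model is the permissive form (child1|child2|...)*, which accepts every
    input document but does not capture ordering or cardinality.
    """
    children = {}  # element name -> set of child element names seen
    has_text = {}  # element name -> True if non-whitespace text was seen
    order = []     # element names in first-seen order
    for text in xml_documents:
        root = ET.fromstring(text)
        for elem in root.iter():
            if elem.tag not in children:
                children[elem.tag] = set()
                has_text[elem.tag] = False
                order.append(elem.tag)
            if elem.text and elem.text.strip():
                has_text[elem.tag] = True
            for child in elem:
                children[elem.tag].add(child.tag)
                # Text after a child belongs to the parent (mixed content).
                if child.tail and child.tail.strip():
                    has_text[elem.tag] = True
    lines = []
    for name in order:
        kids = sorted(children[name])
        if kids and has_text[name]:
            model = "(#PCDATA|" + "|".join(kids) + ")*"
        elif kids:
            model = "(" + "|".join(kids) + ")*"
        elif has_text[name]:
            model = "(#PCDATA)"
        else:
            model = "EMPTY"
        lines.append(f"<!ELEMENT {name} {model}>")
    return "\n".join(lines)


# Two hypothetical documents of the same class (same root element).
docs = [
    "<memo><to>Kim</to><body>Hello</body></memo>",
    "<memo><to>Lee</to><cc>Park</cc><body>Hi</body></memo>",
]
print(infer_dtd(docs))
```

The second document contributes the `cc` element that the first lacks, so the merged declaration for `memo` covers both, which is the sense in which the resulting DTD is generalized.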


A Study on the Utilization and Characteristics of Vietnam's Arbitration System in the FTA Era (FTA시대 베트남 중재제도의 특징과 활용방안에 관한 연구 - VIAC 중재규칙과 KCAB 국제중재규칙 비교를 중심으로 -)

  • Kim, Sung-Ryong
    • Journal of Arbitration Studies / v.30 no.2 / pp.23-42 / 2020
  • The purpose of this study is to analyze the characteristics of Vietnam's arbitration system and to present measures that companies can utilize in practice. This research considers the KCAB International Arbitration Rules, focusing on amendments to the Decree on the Vietnam Commercial Arbitration Act and amendments to the VIAC Arbitration Rules. To sum up some features, the decree on the Commercial Arbitration Act simplified the registration procedures for arbitration centers and their branches and provided for the publication of court decisions, including those on the recognition and enforcement of foreign arbitral awards, thereby enhancing transparency. First of all, the decree on the Commercial Arbitration Act simplified registration procedures for arbitration centers and their branches. In addition, the court strengthened transparency by officially announcing court judgments, recognition, and decisions. Next, there are some points to note in the arbitration rules of the VIAC. First of all, the rules of expedited procedure lack clarity. Next, parties should prepare a separate document for a counterclaim and submit it with the statement of defense. In addition, the Arbitral Tribunal may choose multiple languages as the languages of arbitration unless the parties agree otherwise. Therefore, companies need to deepen their understanding of the international arbitration system, which is mainly used in international disputes, and of the characteristics of the Vietnamese arbitration system.

Scene Composition Technology Based on HTML5 in Hybrid Broadcasting Environment (하이브리드 방송 환경 하에서 HTML5 기반 장면구성 기술)

  • Jo, Minwoo;Park, Jungwook;Kim, Kyuheon
    • Journal of Broadcast Engineering / v.18 no.2 / pp.237-248 / 2013
  • A hybrid broadcasting environment is the convergence of broadcasting and communication environments. In a hybrid broadcasting environment, a number of media can be delivered using both a broadcasting channel and other networks, unlike the traditional broadcast environment, which can deliver only a few media because of limited bandwidth. Now, starting with smart TV, hybrid broadcasting environments combining a broadcasting channel and an IP network are being established, and a variety of services are appearing. Moreover, services using the hybrid broadcasting environment are expected to appear soon for other smart terminals such as smartphones and tablet PCs. Scene composition is one of the methods for effectively consuming the many media delivered in a hybrid broadcasting environment. Using scene composition, multiple media can be consumed at a specified presentation time and in a specified space. Therefore, this paper proposes a scene composition technology that is suitable for hybrid broadcasting environments and smart terminals. However, spatial and temporal composition of media using the script language and style language of HTML5 might increase processing complexity and limit the range of available terminals. Also, an HTML5 document can describe only one scene. For these reasons, the proposed scene composition technology extends HTML5 to provide spatial and temporal composition of media and description of multiple scenes through markup language. In addition, it includes extensions of HTML5 for use in the hybrid broadcasting environment. This paper describes HTML5 technology and the proposed scene composition, and verifies the scene composition with both implementations and experiments.

Annotation Modeling and System Implementation for Hand-held Environment (휴대용 단말기 환경을 위한 Annotation 모델링 및 시스템 구현)

  • Sohn, Won-Sung
    • Journal of The Korean Association of Information Education / v.10 no.2 / pp.219-226 / 2006
  • For the accurate creation of annotation information in a free-form annotation environment, the ambiguity that arises in the analysis stage between the geometric information and annotations needs to be resolved. Therefore, this paper identifies and analyzes the ambiguity that can occur between free-form marking and various contexts in an XML-based annotation environment, and proposes methods for resolving it. The proposed method is based on context, which includes various textual and structural information between the free-form marking and the annotated part. The results show that the annotated areas included in the free-form marking information are identified more accurately, yielding more accurate exchange results among multiple users in a heterogeneous document environment. This study can be effectively applied to eLearning, Cyber-Class, and IETM.


A Korean Community-based Question Answering System Using Multiple Machine Learning Methods (다중 기계학습 방법을 이용한 한국어 커뮤니티 기반 질의-응답 시스템)

  • Kwon, Sunjae;Kim, Juae;Kang, Sangwoo;Seo, Jungyun
    • Journal of KIISE / v.43 no.10 / pp.1085-1093 / 2016
  • A community-based Question Answering system provides answers to questions from the documents uploaded on web communities. In order to enhance the capacity of question analysis, former methods have developed specific rules suitable for a target region or have applied machine learning to parts of the process. However, these methods incur an excessive cost when expanding to new fields or lead to cases in which the system is overfitted to a specific field. This paper proposes a multiple machine learning method that automates the overall process by applying an appropriate machine learning technique in each procedure for efficient processing in a community-based Question Answering system. The system can be divided into a question analysis part and an answer selection part. The question analysis part consists of the question focus extractor, which analyzes the focused phrases in questions using conditional random fields, and the question type classifier, which classifies the topics of questions using a support vector machine. In the answer selection part, we train the weights used by the similarity estimation models through an artificial neural network. Also, there are a number of cases in which the results of morphological analysis are not reliable for the data uploaded on web communities. Therefore, we suggest a method that minimizes the impact of morphological analysis by using character features in the question analysis stage. The proposed system outperforms the former system, achieving a Mean Average Precision of 0.765 and an R-Precision of 0.872.
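The two evaluation measures reported above can be sketched as follows. The abstract does not define them, so the code uses the standard information retrieval definitions, and it assumes every relevant answer for a question appears somewhere in the ranked list. The function names and the toy rankings are illustrative only.

```python
def average_precision(relevance):
    """Average precision for one ranked list of 0/1 relevance flags.

    Assumption: all relevant items appear in the list, so the sum of
    precision-at-hit values is normalized by the number of hits.
    """
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


def mean_average_precision(runs):
    """Mean of average precision over several questions."""
    return sum(average_precision(r) for r in runs) / len(runs)


def r_precision(relevance, num_relevant):
    """Precision within the top R results, where R = num_relevant."""
    if num_relevant == 0:
        return 0.0
    return sum(1 for rel in relevance[:num_relevant] if rel) / num_relevant


# Hypothetical rankings for two questions (1 = relevant answer).
runs = [[1, 0, 1, 0], [0, 1]]
print(mean_average_precision(runs))
print(r_precision(runs[0], num_relevant=2))
```

Both measures reward placing relevant answers near the top of the ranking; R-Precision looks only at the first R positions, while Mean Average Precision averages over every position where a relevant answer occurs.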