• Title/Summary/Keyword: Document Representation

Search Result 113, Processing Time 0.024 seconds

Categorization of Korean News Articles Based on Convolutional Neural Network Using Doc2Vec and Word2Vec (Doc2Vec과 Word2Vec을 활용한 Convolutional Neural Network 기반 한국어 신문 기사 분류)

  • Kim, Dowoo;Koo, Myoung-Wan
    • Journal of KIISE
    • /
    • v.44 no.7
    • /
    • pp.742-747
    • /
    • 2017
  • In this paper, we propose a novel approach to improve the performance of the Convolutional Neural Network(CNN) word embedding model on top of word2vec with the result of performing like doc2vec in conducting a document classification task. The Word Piece Model(WPM) is empirically proven to outperform other tokenization methods such as the phrase unit, a part-of-speech tagger with substantial experimental evidence (classification rate: 79.5%). Further, we conducted an experiment to classify ten categories of news articles written in Korean by feeding words and document vectors generated by an application of WPM to the baseline and the proposed model. From the results of the experiment, we report the model we proposed showed a higher classification rate (89.88%) than its counterpart model (86.89%), achieving a 22.80% improvement. Throughout this research, it is demonstrated that applying doc2vec in the document classification task yields more effective results because doc2vec generates similar document vector representation for documents belonging to the same category.

Genetic Clustering with Semantic Vector Expansion (의미 벡터 확장을 통한 유전자 클러스터링)

  • Song, Wei;Park, Soon-Cheol
    • The Journal of the Korea Contents Association
    • /
    • v.9 no.3
    • /
    • pp.1-8
    • /
    • 2009
  • This paper proposes a new document clustering system using fuzzy logic-based genetic algorithm (GA) and semantic vector expansion technology. It has been known in many GA papers that the success depends on two factors, the diversity of the population and the capability to convergence. We use the fuzzy logic-based operators to adaptively adjust the influence between these two factors. In traditional document clustering, the most popular and straightforward approach to represent the document is vector space model (VSM). However, this approach not only leads to a high dimensional feature space, but also ignores the semantic relationships between some important words, which would affect the accuracy of clustering. In this paper we use latent semantic analysis (LSA)to expand the documents to corresponding semantic vectors conceptually, rather than the individual terms. Meanwhile, the sizes of the vectors can be reduced drastically. We test our clustering algorithm on 20 news groups and Reuter collection data sets. The results show that our method outperforms the conventional GA in various document representation environments.

Implementation of SMIL Editor for Multimedia Broadcasting (멀티미디어 방송을 위한 SMIL 편집 시스템 구현)

  • 장대영;김창수;정회경
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.8 no.3
    • /
    • pp.622-629
    • /
    • 2004
  • Recently, as digital broadcasting and internet are spreaded out of the world, we can easily use informations with less restrictions of time and space. According to the current trends, concerns for the ways of representing multimedia data has been rapidly increased, and users demand the services with integrated document that takes not only simple text and image but also time varying audio-visual data. Therefore, in 1998, W3C presented an international standard, SMIL in order to solve multimedia object representation and synchronization problems. By using SMIL, various multimedia elements can be integrated as a multimedia document with proper view in a space and time. Using this SMIL document, we can create new internet radio broadcasting service that delivers not only audio data but also various text, image and video. In this paper, we describe on a SMIL document editor for the common users to be able to represent time varying multimedia data with special layout and synchronization of time and space.

A Study on Patent Literature Classification Using Distributed Representation of Technical Terms (기술용어 분산표현을 활용한 특허문헌 분류에 관한 연구)

  • Choi, Yunsoo;Choi, Sung-Pil
    • Journal of the Korean Society for Library and Information Science
    • /
    • v.53 no.2
    • /
    • pp.179-199
    • /
    • 2019
  • In this paper, we propose optimal methodologies for classifying patent literature by examining various feature extraction methods, machine learning and deep learning models, and provide optimal performance through experiments. We compared the traditional BoW method and a distributed representation method (word embedding vector) as a feature extraction, and compared the morphological analysis and multi gram as the method of constructing the document collection. In addition, classification performance was verified using traditional machine learning model and deep learning model. Experimental results show that the best performance is achieved when we apply the deep learning model with distributed representation and morphological analysis based feature extraction. In Section, Class and Subclass classification experiments, We improved the performance by 5.71%, 18.84% and 21.53%, respectively, compared with traditional classification methods.

Extended Entity-Relationship Model for Conceptual Modeling of XML Schema (XML 스키마의 개념적 모델링을 위한 확장된 개체관계 모델)

  • Jung, In-Hwan;Kim, Young-Ung
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.15 no.1
    • /
    • pp.157-163
    • /
    • 2015
  • XML has become one of the most influential standard language for representing and exchanging data on internet. However, XML itself has a ability to represent a logical structure for storing and managing data, it is inadequate to use as a conceptual modeling tool because of its complexity for representing the document structures. In this paper, we propose the graphical form of conceptual modeling techniques for representing the structure of the XML schema documents using an extended entity relationship diagram. For this, extended entity relationship model is presented for representing the XML schema structure, transformation rules are presented for transforming extended entity relationship model into XML schema document to show the completeness of the proposed model.

Improving Retrieval Effectiveness with Multiple Weighting Schemes (다중 가중치 기법을 이용한 검색 효과의 개선)

  • 이준호
    • Journal of the Korean Society for information Management
    • /
    • v.12 no.2
    • /
    • pp.213-223
    • /
    • 1995
  • It has known that different representations of either queries or documents, or different retrieval techniques retrieve different sets of documents. Recent works suggest that significant improvements in retrieval performance can be achieved by combining multiple representations or multiple retrieval techniques. In this paper we propose a simple method for retrieving different documents within a single query representation, a single document representation and a single retrieval technique. We classify the types of documents, and describe the properties of weighting schemes. Then. we explain that different properties of weighting schemes may retrieve different types of documents. Experimental results show that significant improvements can be obtained by combining the retrieval results form different properties of weighting schemes.

  • PDF

A Study of Designing the Intelligent Information Retrieval System by Automatic Classification Algorithm (자동분류 알고리즘을 이용한 지능형 정보검색시스템 구축에 관한 연구)

  • Seo, Whee
    • Journal of Korean Library and Information Science Society
    • /
    • v.39 no.4
    • /
    • pp.283-304
    • /
    • 2008
  • This is to develop Intelligent Retrieval System which can automatically present early query's category terms(association terms connected with knowledge structure of relevant terminology) through learning function and it changes searching form automatically and runs it with association terms. For the reason, this theoretical study of Intelligent Automatic Indexing System abstracts expert's index term through learning and clustering algorism about automatic classification, text mining(categorization), and document category representation. It also demonstrates a good capacity in the aspects of expense, time, recall ratio, and precision ratio.

  • PDF

An Indexing Model for Efficient Structure Retrieval of XML Documents (XML 문서의 효율적인 구조 검색을 위한 색인 모델)

  • Park, Jong-Gwan;Son, Chung-Beom;Gang, Hyeong-Il;Yu, Jae-Su;Lee, Byeong-Yeop
    • The KIPS Transactions:PartD
    • /
    • v.8D no.5
    • /
    • pp.451-460
    • /
    • 2001
  • In this paper, we propose an indexing model for efficient structure retrieval of XML documents. The proposed indexing model consists of structured information that supports a wide range of queries such as content-based queries and structure-attribute queries at all levels of the document hierarchy and index organizations that are constructed based on the information. To support structured retrieval, a new representation method for structured information is presented. Using this structured information, we design content index, structure index, and attribute index for efficient retrieval. also, we explain processing procedures for mixed queries and evaluate the performance of proposed indexing model. It is shown that the proposed indexing model achieves better retrieval performance than the existing method.

  • PDF

A Study of Efficiency Information Filtering System using One-Hot Long Short-Term Memory

  • Kim, Hee sook;Lee, Min Hi
    • International Journal of Advanced Culture Technology
    • /
    • v.5 no.1
    • /
    • pp.83-89
    • /
    • 2017
  • In this paper, we propose an extended method of one-hot Long Short-Term Memory (LSTM) and evaluate the performance on spam filtering task. Most of traditional methods proposed for spam filtering task use word occurrences to represent spam or non-spam messages and all syntactic and semantic information are ignored. Major issue appears when both spam and non-spam messages share many common words and noise words. Therefore, it becomes challenging to the system to filter correct labels between spam and non-spam. Unlike previous studies on information filtering task, instead of using only word occurrence and word context as in probabilistic models, we apply a neural network-based approach to train the system filter for a better performance. In addition to one-hot representation, using term weight with attention mechanism allows classifier to focus on potential words which most likely appear in spam and non-spam collection. As a result, we obtained some improvement over the performances of the previous methods. We find out using region embedding and pooling features on the top of LSTM along with attention mechanism allows system to explore a better document representation for filtering task in general.

Paper Recommendation Using SPECTER with Low-Rank and Sparse Matrix Factorization

  • Panpan Guo;Gang Zhou;Jicang Lu;Zhufeng Li;Taojie Zhu
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.18 no.5
    • /
    • pp.1163-1185
    • /
    • 2024
  • With the sharp increase in the volume of literature data, researchers must spend considerable time and energy locating desired papers. A paper recommendation is the means necessary to solve this problem. Unfortunately, the large amount of data combined with sparsity makes personalizing papers challenging. Traditional matrix decomposition models have cold-start issues. Most overlook the importance of information and fail to consider the introduction of noise when using side information, resulting in unsatisfactory recommendations. This study proposes a paper recommendation method (PR-SLSMF) using document-level representation learning with citation-informed transformers (SPECTER) and low-rank and sparse matrix factorization; it uses SPECTER to learn paper content representation. The model calculates the similarity between papers and constructs a weighted heterogeneous information network (HIN), including citation and content similarity information. This method combines the LSMF method with HIN, effectively alleviating data sparsity and cold-start issues and avoiding topic drift. We validated the effectiveness of this method on two real datasets and the necessity of adding side information.