Search | Korea Science

Combining Distributed Word Representation and Document Distance for Short Text Document Clustering

Kongwudhikunakorn, Supavit;Waiyamai, Kitsana
- Journal of Information Processing Systems / v.16 no.2 / pp.277-300 / 2020
This paper presents a method for clustering short text documents, such as news headlines, social media statuses, or instant messages. Due to the characteristics of these documents, which are usually short and sparse, an appropriate technique is required to discover hidden knowledge. The objective of this paper is to identify the combination of document representation, document distance, and document clustering that yields the best clustering quality. Document representations are expanded by external knowledge sources represented by a Distributed Representation. To cluster documents, a K-means partitioning-based clustering technique is applied, where the similarities of documents are measured by word mover's distance. To validate the effectiveness of the proposed method, experiments were conducted to compare the clustering quality against several leading methods. The proposed method produced clusters of documents that resulted in higher precision, recall, F1-score, and adjusted Rand index for both real-world and standard data sets. Furthermore, manual inspection of the clustering results was conducted to observe the efficacy of the proposed method. The topics of each document cluster are undoubtedly reflected by members in the cluster.
https://doi.org/10.3745/JIPS.04.0164
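The abstract above measures document similarity with word mover's distance over distributed word representations. As a rough illustration only, the sketch below computes the relaxed word mover's distance, a cheaper lower bound in which each word moves all of its mass to its nearest counterpart; it is not the exact distance and not the authors' implementation. The `EMBEDDINGS` table and its 2-D values are made up for the example, and every document is assumed to contain at least one known word.

```python
import math
from collections import Counter

# Toy 2-D word vectors; these values are invented for illustration only.
EMBEDDINGS = {
    "president": (1.0, 0.0),
    "leader": (0.9, 0.1),
    "speaks": (0.0, 1.0),
    "talks": (0.1, 0.9),
    "goal": (-1.0, 0.0),
    "scored": (-0.9, -0.1),
}


def nbow(tokens):
    """Normalized bag-of-words: word -> share of the document's mass."""
    counts = Counter(t for t in tokens if t in EMBEDDINGS)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def one_sided_cost(src, dst):
    """Move each source word's full mass to its nearest destination word."""
    return sum(
        mass * min(math.dist(EMBEDDINGS[w], EMBEDDINGS[v]) for v in dst)
        for w, mass in src.items()
    )


def relaxed_wmd(tokens_a, tokens_b):
    """Lower bound on word mover's distance: max of the two relaxations."""
    a, b = nbow(tokens_a), nbow(tokens_b)
    return max(one_sided_cost(a, b), one_sided_cost(b, a))
```

With these toy vectors, `relaxed_wmd(["president", "speaks"], ["leader", "talks"])` is smaller than the distance to `["goal", "scored"]`, even though the first two headlines share no words. A clustering step could then use such a distance in place of Euclidean distance on sparse term vectors.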

An Efficient Method of Document Store and Version Management for XML Repository System (XML 저장 관리 시스템에서 효율적인 버전 관리 및 문서 저장 방안)

Jung, Hyun-Joo;Kim, Kweon-Yang;Choi, Jae-Hyuk
- The Journal of Korean Association of Computer Education / v.6 no.4 / pp.11-21 / 2003
In a rapidly changing, information-oriented society, it is essential to manage massive amounts of document information as electronic files. For such electronic documents, it is also important to keep and maintain all kinds of information without loss. It should be possible to trace previous contents as well as recently updated contents by managing updates as versions. XML is well suited to this. In this paper, we aim to save document storage space by storing only the updated contents with each version, rather than the whole document, whenever a document is updated. For version-based management of the update history, we designed the system to omit the JOIN operation if the document size is under a certain size. We thereby implemented a new XML document repository system that enables quick search and efficient XML document storage by reducing the performance deterioration caused by the JOIN operation.
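The entry above describes storing only the updated contents of each version and treating small documents differently. The following is a minimal sketch of that idea, not the paper's system: it works on lists of text lines rather than XML nodes, uses Python's `difflib` to find changed ranges, and stores small documents whole so that restoring them needs no reconstruction step. `VersionStore`, `make_delta`, `apply_delta`, and the `SMALL_DOC_LINES` threshold are all hypothetical names and values chosen for the example.

```python
from difflib import SequenceMatcher

SMALL_DOC_LINES = 4  # hypothetical threshold; the abstract gives no value


def make_delta(old, new):
    """Record only the changed line ranges needed to turn old into new."""
    ops = SequenceMatcher(a=old, b=new, autojunk=False).get_opcodes()
    return [(i1, i2, new[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]


def apply_delta(old, delta):
    """Rebuild the newer version from the older lines plus a delta."""
    result, cursor = [], 0
    for i1, i2, replacement in delta:
        result.extend(old[cursor:i1])  # unchanged lines before the edit
        result.extend(replacement)     # inserted or replacing lines
        cursor = i2                    # skip deleted or replaced lines
    result.extend(old[cursor:])
    return result


class VersionStore:
    """Keeps version 0 in full and later versions as deltas or full copies."""

    def __init__(self, first_version):
        self.base = list(first_version)
        self.latest = list(first_version)
        self.entries = []  # each entry is ("full", lines) or ("delta", delta)

    def commit(self, new_version):
        new_version = list(new_version)
        if len(new_version) <= SMALL_DOC_LINES:
            # Small documents are stored whole, so no reconstruction is needed.
            self.entries.append(("full", new_version))
        else:
            self.entries.append(("delta", make_delta(self.latest, new_version)))
        self.latest = new_version

    def checkout(self, version):
        """Return the lines of a version; version 0 is the first one stored."""
        lines = self.base
        for kind, payload in self.entries[:version]:
            lines = payload if kind == "full" else apply_delta(lines, payload)
        return list(lines)
```

The trade-off mirrors the one in the abstract: deltas save space but cost work at read time, so documents under the threshold skip that work by being stored in full.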

Design and implementation of XML document edit system that intend to MathML mathematical formula structure representation (MathML 수식 구조 표현을 지향하는 XML 문서 편집 시스템의 설계 및 구현)

김철순;정회경
- Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2002.11a / pp.363-367 / 2002
In existing document editing systems used for computer-based electronic document processing, mathematical formulas are represented and processed in non-structural forms such as images or plain text. Such formulas hinder the readability and reusability of documents, as well as their processing and exchange. A document editing system is therefore required that overcomes these disadvantages and applies the MathML formula structure efficiently to structured documents. In this paper, we designed and implemented an XML-based document editing system for creating structured documents that supports MathML-based formula editing.

Document Clustering Method using Coherence of Cluster and Non-negative Matrix Factorization (비음수 행렬 분해와 군집의 응집도를 이용한 문서군집)

Kim, Chul-Won;Park, Sun
- Journal of the Korea Institute of Information and Communication Engineering / v.13 no.12 / pp.2603-2608 / 2009
Document clustering is an important method for document analysis and is used in many different information retrieval applications. This paper proposes a new document clustering model that combines an NMF (non-negative matrix factorization)-based clustering method with refinement of the documents in each cluster using cluster coherence. The proposed method can improve the quality of document clustering for two reasons: documents are re-assigned to clusters using cluster coherence based on the similarity between documents, and the semantic feature matrix and semantic variable matrix used in clustering can better represent the inherent structure of the document set. The experimental results demonstrate that applying the proposed method to document clustering methods achieves better performance than those methods alone.
https://doi.org/10.6109/JKIICE.2009.13.12.2603
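The entry above refines an initial clustering by re-assigning documents according to cluster coherence. The abstract does not define coherence, so the sketch below assumes one plausible reading: the mean cosine similarity between a document and the other members of a cluster. It takes the initial labels as given (for example from an NMF step, which is not shown) and performs a single refinement pass. The function names `coherence` and `refine` are illustrative, not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    return dot / norm if norm else 0.0


def coherence(doc_index, members, vectors):
    """Assumed definition: mean similarity to the cluster's other members."""
    others = [m for m in members if m != doc_index]
    if not others:
        return 0.0
    return sum(cosine(vectors[doc_index], vectors[m]) for m in others) / len(others)


def refine(labels, vectors, n_clusters):
    """One refinement pass: move each document to its most coherent cluster."""
    clusters = [
        [i for i, label in enumerate(labels) if label == c]
        for c in range(n_clusters)
    ]
    return [
        max(range(n_clusters), key=lambda c: coherence(i, clusters[c], vectors))
        for i in range(len(vectors))
    ]
```

A document that was initially placed in the wrong cluster moves to the cluster whose members it resembles most, while well-placed documents keep their labels. Repeating the pass until labels stop changing would be a natural extension, though whether the paper iterates is not stated.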

Multi-Vector Document Embedding Using Semantic Decomposition of Complex Documents (복합 문서의 의미적 분해를 통한 다중 벡터 문서 임베딩 방법론)

Park, Jongin;Kim, Namgyu
- Journal of Intelligence and Information Systems / v.25 no.3 / pp.19-41 / 2019
As demand for text data analysis increases rapidly, research and investment in text mining are being actively conducted not only in academia but also in various industries. Text mining is generally conducted in two steps. In the first step, the text of the collected documents is tokenized and structured to convert the original documents into a computer-readable form. In the second step, tasks such as document classification, clustering, and topic modeling are conducted according to the purpose of the analysis. Until recently, text mining studies focused on applications of the second step. However, with the discovery that the text structuring process substantially influences the quality of the analysis results, various embedding methods have been actively studied to improve that quality by preserving the meaning of words and documents when text data is represented as vectors. Unlike structured data, which can be directly applied to a variety of operations and traditional analysis techniques, unstructured text must first undergo a structuring task that transforms the original document into a form the computer can understand. Mapping arbitrary objects into a space of a specific dimension while maintaining algebraic properties, in order to structure text data, is called "embedding." Recently, attempts have been made to embed not only words but also sentences, paragraphs, and entire documents. In particular, as demand for document embedding increases rapidly, many algorithms have been developed to support it. Among them, doc2Vec, which extends word2Vec and embeds each document into one vector, is the most widely used. However, the traditional document embedding method represented by doc2Vec generates a vector for each document using all the words included in the document.
As a result, the document vector is affected not only by core words but also by miscellaneous words. Additionally, traditional document embedding schemes usually map each document to a single corresponding vector. It is therefore difficult to represent a complex document with multiple subjects accurately as a single vector using the traditional approach. In this paper, we propose a new multi-vector document embedding method to overcome these limitations. This study targets documents that explicitly separate body content and keywords. For a document without keywords, the method can be applied after extracting keywords through various analysis methods. However, since keyword extraction is not the core subject of the proposed method, we describe the process of applying it to documents that have predefined keywords. The proposed method consists of (1) Parsing, (2) Word Embedding, (3) Keyword Vector Extraction, (4) Keyword Clustering, and (5) Multiple-Vector Generation. The specific process is as follows. First, all text in a document is tokenized, and each token is represented as an N-dimensional real-valued vector through word embedding. Next, to overcome the limitation that traditional document embedding is affected by miscellaneous words as well as core words, the vectors corresponding to the keywords of each document are extracted to form a set of keyword vectors for that document. Clustering is then conducted on each document's set of keywords to identify the multiple subjects included in the document. Finally, a multi-vector is generated from the vectors of the keywords constituting each cluster. Experiments on 3,147 academic papers revealed that the traditional single-vector approach cannot properly map complex documents because of interference among subjects in each vector.
With the proposed multi-vector based method, we ascertained that complex documents can be vectorized more accurately by eliminating the interference among subjects.
https://doi.org/10.13088/jiis.2019.25.3.019
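The last three steps listed in the entry above (keyword vector extraction, keyword clustering, and multiple-vector generation) can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: the abstract does not name the clustering algorithm or say how cluster vectors are combined, so the sketch uses a simple greedy cosine-threshold grouping as a stand-in and averages the keyword vectors in each group. The `WORD_VECTORS` table, its 2-D values, and the `threshold` parameter are hypothetical.

```python
import math

# Hypothetical 2-D word embeddings; a real system would use trained vectors.
WORD_VECTORS = {
    "clustering": (0.9, 0.1),
    "k-means": (1.0, 0.0),
    "encryption": (0.0, 1.0),
    "signature": (0.1, 0.9),
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def mean(vectors):
    """Component-wise average of equally sized vectors."""
    return tuple(sum(xs) / len(vectors) for xs in zip(*vectors))


def cluster_keywords(keywords, threshold=0.8):
    """Greedy stand-in for keyword clustering: join the first similar group."""
    groups = []
    for word in keywords:
        vec = WORD_VECTORS[word]
        for group in groups:
            centroid = mean([WORD_VECTORS[w] for w in group])
            if cosine(vec, centroid) >= threshold:
                group.append(word)
                break
        else:
            groups.append([word])  # no similar group found, start a new one
    return groups


def multi_vector(keywords, threshold=0.8):
    """One vector per keyword cluster instead of one vector per document."""
    groups = cluster_keywords(keywords, threshold)
    return [mean([WORD_VECTORS[w] for w in group]) for group in groups]
```

For a document whose keywords are `["clustering", "encryption", "k-means", "signature"]`, this yields two vectors, one per subject, whereas averaging all four keywords into one vector would land between the two subjects and represent neither well.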

Query Space Exploration Using Genetic Algorithm

Lee, Jae-Hoon;Kim, Young-Cheon;Lee, Sung-Joo
- Proceedings of the Korean Institute of Intelligent Systems Conference / 2003.09a / pp.683-689 / 2003
An information retrieval system must be able to find, from a document set, the documents best suited to the user's need. If document relevance is predicted only by the degree of similarity to the query language (QL) expression, documents that the searcher does not want are also retrieved. This paper shows that searching the whole document space with a genetic algorithm can find the documents best suited to the user's request, and it uses knowledge-based operators to address the problems of various models.

Query Space Exploration Model Using Genetic Algorithm

Lee, Jae-Hoon;Lee, Sung-Joo
- International Journal of Fuzzy Logic and Intelligent Systems / v.3 no.2 / pp.222-226 / 2003
An information retrieval system must be able to find, from a document set, the documents best suited to the user's need. If document relevance is predicted only by the degree of similarity to the query language (QL) expression, documents that the searcher does not want are also retrieved. This paper shows that searching the whole document space with a genetic algorithm can find the documents best suited to the user's request, and it uses knowledge-based operators to address the problems of various models.
https://doi.org/10.5391/IJFIS.2003.3.2.222

Implementation of EDMS(Electric Document Management System) with Validity Verification (전자문서 유효기간 검증 기능을 탑재한 전자문서관리시스템 구현)

Park, Jung-Oh;Lee, Seung-Min;Kim, Sang-Geun;Jun, Moon-Seog
- The Journal of Korean Institute of Communications and Information Sciences / v.35 no.7B / pp.1043-1049 / 2010
Among the services of a CEDA (Certified E-Document Deposit Authority), which assures the reliability and stability of e-documents, the e-document deposit and issue service is critical. After an owner's e-document is registered with the CEDA, issuing only partial information (part of a page) prevents the exposure of superfluous information when the owner issues the e-document to a third party. We also propose inserting a validation check module into the proposed system so that the validity of an e-document can be verified.

Document Clustering with Relational Graph Of Common Phrase and Suffix Tree Document Model (공통 Phrase의 관계 그래프와 Suffix Tree 문서 모델을 이용한 문서 군집화 기법)

Cho, Yoon-Ho;Lee, Sang-Keun
- The Journal of the Korea Contents Association / v.9 no.2 / pp.142-151 / 2009
NSTC, a previous document clustering method, measures the similarity between pairs of documents using TF-IDF during web document clustering. In this paper, we propose a new similarity measure that uses a common-phrase-based relational graph instead of TF-IDF. The method weights common phrases using a relational graph that represents the relationships among common phrases in the document collection. Experimental results indicate that the proposed method clusters document collections more effectively than NSTC.
https://doi.org/10.5392/JKCA.2009.9.2.142

A study on secure transmission system for document image using mixing algorithm (합성 알고리즘을 이용한 안전한 문서화상 전송체계에 관한 연구)

박일남;이대영
- The Journal of Korean Institute of Communications and Information Sciences / v.22 no.11 / pp.2552-2562 / 1997
This paper presents a secure transmission system for document images using a mixing algorithm. For this, we apply the previously proposed DM and RDM algorithms. The transmitter secretly embeds the signature into the secure document, embeds the result into a non-secure document, and transfers it to the receiver. The receiver checks the signature and the document for forgery. The total amount of transmitted data and the image quality are about the same as those of the original document. Thus, a third party cannot notice that the signature and the secure document are embedded in the document.
