• Title/Summary/Keyword: Document Clustering Method

Search Result 131, Processing Time 0.032 seconds

Improving the Performance of Document Similarity by using GPU Parallelism (GPU 병렬성을 이용한 문서 유사도 계산 성능 개선)

  • Park, Il-Nam;Bae, Byung-Gurl;Im, Eun-Jin;Kang, Seung-Shik
    • The KIPS Transactions:PartB
    • /
    • v.19B no.4
    • /
    • pp.243-248
    • /
    • 2012
  • In the information retrieval systems like vector model implementation and document clustering, document similarity calculation takes a great part on the overall performance of the system. In this paper, GPU parallelism has been explored to enhance the processing speed of document similarity calculation in a CUDA framework. The proposed method increased the similarity calculation speed almost 15 times better compared to the typical CPU-based framework. It is 5.2 and 3.4 times better than the methods by using CUBLAS and Thrust, respectively.

Clustering-based Statistical Machine Translation Using Syntactic Structure and Word Similarity (문장구조 유사도와 단어 유사도를 이용한 클러스터링 기반의 통계기계번역)

  • Kim, Han-Kyong;Na, Hwi-Dong;Li, Jin-Ji;Lee, Jong-Hyeok
    • Journal of KIISE:Software and Applications
    • /
    • v.37 no.4
    • /
    • pp.297-304
    • /
    • 2010
  • Clustering method which based on sentence type or document genre is a technique used to improve translation quality of SMT(statistical machine translation) by domain-specific translation. But there is no previous research using sentence type and document genre information simultaneously. In this paper, we suggest an integrated clustering method that classifying sentence type by syntactic structure similarity and document genre by word similarity information. We interpolated domain-specific models from clusters with general models to improve translation quality of SMT system. Kernel function and cosine measures are applied to calculate structural similarity and word similarity. With these similarities, we used machine learning algorithms similar to K-means to clustering. In Japanese-English patent translation corpus, we got 2.5% point relative improvements of translation quality at optimal case.

Topic-based Multi-document Summarization Using Non-negative Matrix Factorization and K-means (비음수 행렬 분해와 K-means를 이용한 주제기반의 다중문서요약)

  • Park, Sun;Lee, Ju-Hong
    • Journal of KIISE:Software and Applications
    • /
    • v.35 no.4
    • /
    • pp.255-264
    • /
    • 2008
  • This paper proposes a novel method using K-means and Non-negative matrix factorization (NMF) for topic -based multi-document summarization. NMF decomposes weighted term by sentence matrix into two sparse non-negative matrices: semantic feature matrix and semantic variable matrix. Obtained semantic features are comprehensible intuitively. Weighted similarity between topic and semantic features can prevent meaningless sentences that are similar to a topic from being selected. K-means clustering removes noises from sentences so that biased semantics of documents are not reflected to summaries. Besides, coherence of document summaries can be enhanced by arranging selected sentences in the order of their ranks. The experimental results show that the proposed method achieves better performance than other methods.

Research on Function and Policy for e-Government System using Semantic Technology (전자정부내 의미기반 기술 도입에 따른 기능 및 정책 연구)

  • Go, Gwang-Seop;Jang, Yeong-Cheol;Lee, Chang-Hun
    • 한국디지털정책학회:학술대회논문집
    • /
    • 2007.06a
    • /
    • pp.79-87
    • /
    • 2007
  • This paper aims to offer a solution based on semantic document classification to improve e-Government utilization and efficiency for people using their own information retrieval system and linguistic expression Generally, semantic document classification method is an approach that classifies documents based on the diverse relationships between keywords in a document without fully describing hierarchial concepts between keywords. Our approach considers the deep meanings within the context of the document and radically enhances the information retrieval performance. Concept Weight Document Classification(CoWDC) method, which goes beyond using exist ing keyword and simple thesaurus/ontology methods by fully considering the concept hierarchy of various concepts is proposed, experimented, and evaluated. With the recognition that in order to verify the superiority of the semantic retrieval technology through test results of the CoWDC and efficiently integrate it into the e-Government, creation of a thesaurus, management of the operating system, expansion of the knowledge base and improvements in search service and accuracy at the national level were needed.

  • PDF

Automatic Generation of the Local Level Knowledge Structure of a Single Document Using Clustering Methods (클러스터링 기법을 이용한 개별문서의 지식구조 자동 생성에 관한 연구)

  • Han, Seung-Hee;Chung, Young-Mee
    • Journal of the Korean Society for information Management
    • /
    • v.21 no.3
    • /
    • pp.251-267
    • /
    • 2004
  • The purpose of this study is to generate the local level knowledge structure of a single document, similar to end-of-the-book indexes and table of contents of printed material through the use of term clustering and cluster representative term selection. Furthermore, it aims to analyze the functionalities of the knowledge structure. and to confirm the applicability of these methods in user-friend1y information services. The results of the term clustering experiment showed that the performance of the Ward's method was superior to that of the fuzzy K -means clustering method. In the cluster representative term selection experiment, using the highest passage frequency term as the representative yielded the best performance. Finally, the result of user task-based functionality tests illustrate that the automatically generated knowledge structure in this study functions similarly to the local level knowledge structure presented In printed material.

Document Summarization Method using Complete Graph (완전그래프를 이용한 문서요약 연구)

  • Lyu, Jun-Hyun;Park, Soon-Cheol
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.10 no.2
    • /
    • pp.26-31
    • /
    • 2005
  • In this paper, we present the document summarizers which are simpler and more condense than the existing ones generally used in the web search engines. This method is a statistic-based summarization method using the concept of the complete graph. We suppose that each sentence as a vertex and the similarity between two sentences as a link of the graph. We compare this summarizer with those of Clustering and MMR techniques which are well-known as the good summarization methods. For the comparison, we use FScore using the summarization results generated by human subjects. Our experimental results verify the accuracy of this method, being about $30\%$ better than the others.

  • PDF

Comparison of graph clustering methods for analyzing the mathematical subject classification codes

  • Choi, Kwangju;Lee, June-Yub;Kim, Younjin;Lee, Donghwan
    • Communications for Statistical Applications and Methods
    • /
    • v.27 no.5
    • /
    • pp.569-578
    • /
    • 2020
  • Various graph clustering methods have been introduced to identify communities in social or biological networks. This paper studies the entropy-based and the Markov chain-based methods in clustering the undirected graph. We examine the performance of two clustering methods with conventional methods based on quality measures of clustering. For the real applications, we collect the mathematical subject classification (MSC) codes of research papers from published mathematical databases and construct the weighted code-to-document matrix for applying graph clustering methods. We pursue to group MSC codes into the same cluster if the corresponding MSC codes appear in many papers simultaneously. We compare the MSC clustering results based on the several assessment measures and conclude that the Markov chain-based method is suitable for clustering the MSC codes.

A Effective Storage Method for Managing of MPEG-7 Document (MPEG-7 문서 관리를 위한 효율적인 저장 방법)

  • Ahn, Byeong-Tae;Lee, Jong-Ha;Chung, Bhum-Suk
    • Proceedings of the Korea Contents Association Conference
    • /
    • 2006.11a
    • /
    • pp.637-641
    • /
    • 2006
  • To use multimedia contents in restricted resources, any management method of MPEG-7 documents is needed. At this time, some XML clustering methods can be used. But, to improve the performance efficiency better, new clustering method which uses the characteristics of MPEG-7 documents is needed. In this paper, we suggest a new clustering method to manage MPEG-7 documents efficiently, which uses some semantic relationships among elements of MPEG-7 documents. And also we compare it to the existed clustering methods.

  • PDF

Research on the Hybrid Paragraph Detection System Using Syntactic-Semantic Analysis (구문의미 분석을 활용한 복합 문단구분 시스템에 대한 연구)

  • Kang, Won Seog
    • Journal of Korea Multimedia Society
    • /
    • v.24 no.1
    • /
    • pp.106-116
    • /
    • 2021
  • To increase the quality of the system in the subjective-type question grading and document classification, we need the paragraph detection. But it is not easy because it is accompanied by semantic analysis. Many researches on the paragraph detection solve the detection problem using the word based clustering method. However, the word based method can not use the order and dependency relation between words. This paper suggests the paragraph detection system using syntactic-semantic relation between words with the Korean syntactic-semantic analysis. This system is the hybrid system of word based, concept based, and syntactic-semantic tree based detection. The experiment result of the system shows it has the better result than the word based system. This system will be utilized in Korean subjective question grading and document classification.

A Document Ranking Method by Document Clustering Using Bayesian SoM and Botstrap (베이지안 SOM과 붓스트랩을 이용한 문서 군집화에 의한 문서 순위조정)

  • Choe, Jun-Hyeok;Jeon, Seong-Hae;Lee, Jeong-Hyeon
    • The Transactions of the Korea Information Processing Society
    • /
    • v.7 no.7
    • /
    • pp.2108-2115
    • /
    • 2000
  • The conventional Boolean retrieval systems based on vector spae model can provide the results of retrieval fast, they can't reflect exactly user's retrieval purpose including semantic information. Consequently, the results of retrieval process are very different from those users expected. This fact forces users to waste much time for finding expected documents among retrieved documents. In his paper, we designed a bayesian SOM(Self-Organizing feature Maps) in combination with bayesian statistical method and Kohonen network as a kind of unsupervised learning, then perform classifying documents depending on the semantic similarity to user query in real time. If it is difficult to observe statistical characteristics as there are less than 30 documents for clustering, the number of documents must be increased to at least 50. Also, to give high rank to the documents which is most similar to user query semantically among generalized classifications for generalized clusters, we find the similarity by means of Kohonen centroid of each document classification and adjust the secondary rank depending on the similarity.

  • PDF