• Title/Summary/Keyword: Document Clustering Method

A Dynamic Ontology-based Multi-Agent Context-Awareness User Profile Construction Method for Personalized Information Retrieval

  • Gao, Qian;Cho, Young Im
    • International Journal of Fuzzy Logic and Intelligent Systems / v.12 no.4 / pp.270-276 / 2012
  • With the growing amount of data and information available on the web, there is high demand for personalized information retrieval services that provide context-aware services to web users. This paper proposes a novel dynamic multi-agent context-awareness user profile construction method based on ontology, which incorporates concepts and properties to model the user profile. The method considers both the frequency and the specificity of a concept in a document and its corresponding domain ontology to construct the user profile. Based on this profile, a fuzzy c-means clustering method is adopted to cluster the user's interest domains, and a dynamic update policy is adopted to continuously track changes in the user's interests. The simulation results show that, as the user profile is gradually refined, the proposed system outperforms a traditional semantic-based retrieval system in terms of recall ratio and precision ratio.
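The fuzzy c-means step mentioned in this abstract can be illustrated with a minimal sketch. This is the textbook algorithm, not the authors' implementation: the feature vectors, the initial centers, and the function name `fuzzy_c_means` are assumptions made for illustration only.

```python
import math

def fuzzy_c_means(points, centers, m=2.0, iters=50, eps=1e-9):
    """Textbook fuzzy c-means on equal-length numeric tuples.

    points  : feature vectors (hypothetical concept weights from a profile)
    centers : initial cluster centers, one per assumed interest domain
    m       : fuzzifier (> 1); 2.0 is a common default
    Returns (centers, memberships); memberships[i][j] is the degree to which
    point i belongs to cluster j, as computed in the final iteration.
    """
    centers = [list(c) for c in centers]
    u = []
    for _ in range(iters):
        # Membership update: u_ij is proportional to d_ij^(-2/(m-1)).
        u = []
        for p in points:
            dists = [max(math.dist(p, c), eps) for c in centers]
            inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
            total = sum(inv)
            u.append([v / total for v in inv])
        # Center update: mean of the points weighted by u_ij^m.
        for j in range(len(centers)):
            weights = [row[j] ** m for row in u]
            wsum = sum(weights)
            centers[j] = [
                sum(w * p[k] for w, p in zip(weights, points)) / wsum
                for k in range(len(points[0]))
            ]
    return centers, u

# Toy data: two loose groups of 2-D vectors.
points = [(0.0, 0.1), (0.1, 0.0), (0.9, 1.0), (1.0, 0.9)]
centers, memberships = fuzzy_c_means(points, [(0.0, 0.0), (1.0, 1.0)])
```

Unlike hard K-means, each point keeps a graded membership in every cluster, which is one plausible reason to prefer it when a user's interests overlap several domains.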

Text Detection and Binarization using Color Variance and an Improved K-means Color Clustering in Camera-captured Images (카메라 획득 영상에서의 색 분산 및 개선된 K-means 색 병합을 이용한 텍스트 영역 추출 및 이진화)

  • Song Young-Ja;Choi Yeong-Woo
    • The KIPS Transactions:PartB / v.13B no.3 s.106 / pp.205-214 / 2006
  • Texts in images carry significant and detailed information about the scenes, and if those texts can be detected and recognized automatically in real time, they can be used in various applications. In this paper, we propose a new text detection method that finds text in various camera-captured images, and a text segmentation method for the detected text regions. The detection method uses color variance in RGB color space as the detection feature, and the segmentation method uses an improved K-means color clustering in RGB color space. We tested the proposed methods on various document-style and natural scene images captured by digital cameras and mobile-phone cameras, and also on a portion of the ICDAR[1] contest images.

Performance Improvement by Cluster Analysis in Korean-English and Japanese-English Cross-Language Information Retrieval (한국어-영어/일본어-영어 교차언어정보검색에서 클러스터 분석을 통한 성능 향상)

  • Lee, Kyung-Soon
    • The KIPS Transactions:PartB / v.11B no.2 / pp.233-240 / 2004
  • This paper presents a method to implicitly resolve ambiguities using dynamic incremental clustering in Korean-to-English and Japanese-to-English cross-language information retrieval (CLIR). The main objective of this paper is to show that document clusters can effectively resolve the ambiguities that increase tremendously in translated queries, while also taking into account the context of all the terms in a document. In the proposed framework, a query in Korean or Japanese is first translated into English by looking up bilingual dictionaries, and documents are then retrieved for the translated query terms based on the vector space retrieval model or the probabilistic retrieval model. For the top-ranked retrieved documents, query-oriented document clusters are created incrementally, and the weight of each retrieved document is recalculated using the clusters. In experiments on the TREC test collection, our method achieved 39.41% and 36.79% improvements over translated queries without ambiguity resolution in Korean-to-English CLIR, and 17.89% and 30.46% improvements in Japanese-to-English CLIR, on vector space retrieval and probabilistic retrieval, respectively. Our method also achieved a 12.30% improvement for all translation queries compared with blind feedback in Korean-to-English CLIR. These results indicate that cluster analysis helps to resolve ambiguity.

Examining the Intellectual Structure of Records Management & Archival Science in Korea with Text Mining (텍스트 마이닝을 이용한 국내 기록관리학 분야 지적구조 분석)

  • Lee, Jae-Yun;Moon, Ju-Young;Kim, Hee-Jung
    • Journal of the Korean Society for Library and Information Science / v.41 no.1 / pp.345-372 / 2007
  • In this study, the intellectual structure of Records Management & Archival Science in Korea was analyzed using document clustering, a widely used text mining method, and document similarity network analysis. The data used in this study were 145 articles on the subject of Records Management & Archival Science, selected from five major representative journals in the field of Library & Information Science in Korea and published from 2001 to 2006. The results of the cluster analysis show that the core subject areas are "electronic records management and digital preservation," "records management policy and institution," "records description and catalogues," and "records management domain and education." The results of the document analysis, which is more detailed than the cluster analysis, show that "digital archiving," a specialized subject in digital preservation, plays a central role. The results of the serial analysis, which proceeds according to a timeline, show the emergence of "archival services" as a new subject area.

Clustering Method Of Plagiarism Document To Use Similarity Syntagma Tree (유사 어절 트리를 이용한 표절 문서의 Clustering 방법)

  • Cheon, Seung-Hwan;Kim, Mi-Young;Lee, Guee-Sang
    • Proceedings of the Korea Information Processing Society Conference / 2002.11c / pp.2269-2272 / 2002
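  • When evaluating student assignments prepared with the Internet and computers, it is very difficult and cumbersome to identify plagiarism accurately, because plagiarizing is so easy. In particular, because assignments are often written on the same topic, it is not easy to distinguish independently written documents from plagiarized ones. This is an entirely different problem from the one addressed by existing information retrieval methods, which extract the occurrence frequencies of the main words (index terms) from the documents to be clustered and use them to find the most suitable clustering. In this paper, we propose a method that uses a similarity syntagma tree to generate clusters according to plagiarism similarity, so as to provide guidance for evaluating assignments.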

Research on Function and Policy for e-Government System using Semantic Technology (전자정부내 의미기반 기술 도입에 따른 기능 및 정책 연구)

  • Jang, Young-Cheol
    • Journal of Korea Society of Industrial Information Systems / v.13 no.5 / pp.22-28 / 2008
  • This paper aims to offer a solution based on semantic document classification to improve e-Government utilization and efficiency for people who use their own information retrieval systems and linguistic expressions. Generally, semantic document classification is an approach that classifies documents based on the diverse relationships between keywords in a document, without fully describing the hierarchical concepts between keywords. Our approach considers the deep meanings within the context of the document and substantially enhances information retrieval performance. We propose, test, and evaluate the Concept Weight Document Classification (CoWDC) method, which goes beyond existing keyword and simple thesaurus/ontology methods by fully considering the concept hierarchy of various concepts. We also recognize that, in order to verify the superiority of the semantic retrieval technology through the CoWDC test results and to integrate it efficiently into e-Government, the creation of a thesaurus, management of the operating system, expansion of the knowledge base, and improvements in search service and accuracy at the national level are needed.

A Clustering Method Based on Path Similarities of XML Data (XML 데이타의 경로 유사성에 기반한 클러스터링 기법)

  • Choi Il-Hwan;Moon Bong-Ki;Kim Hyoung-Joo
    • Journal of KIISE:Databases / v.33 no.3 / pp.342-352 / 2006
  • Current studies on storing XML data focus on either mapping XML data efficiently to an existing RDBMS or developing native XML storage. Some native XML storage systems store each XML node as a parsed object. With this storage method, clustering, the physical arrangement of the objects, can be an important factor in performance. In this paper, we propose re-clustering techniques that store an XML document efficiently. The proposed clustering technique uses path similarities among data nodes, which can reduce page I/Os when returning query results. It can also process a path query using as few clusters as possible instead of all clusters, which makes path query processing efficient because the search space is reduced by skipping unnecessary data. Finally, we apply existing clustering techniques to storing XML data and compare their performance with that of the proposed technique. Our results show that the performance of XML storage can be improved by using a proper clustering technique.
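The idea of grouping nodes by path similarity can be sketched as follows. The abstract does not define the similarity measure, so the shared-prefix ratio, the greedy grouping, the threshold, and the names `path_similarity` and `cluster_paths` are all assumptions for illustration, not the paper's technique.

```python
def path_similarity(a, b):
    """Fraction of leading steps shared by two root-to-node paths
    (an assumed measure; the paper's own definition is not given here)."""
    steps_a = a.strip("/").split("/")
    steps_b = b.strip("/").split("/")
    common = 0
    for x, y in zip(steps_a, steps_b):
        if x != y:
            break
        common += 1
    return common / max(len(steps_a), len(steps_b))

def cluster_paths(paths, threshold=0.5):
    """Greedy grouping: put each path in the first cluster whose
    representative (its first member) is similar enough, else open a new one."""
    clusters = []
    for p in paths:
        for c in clusters:
            if path_similarity(p, c[0]) >= threshold:
                c.append(p)
                break
        else:
            clusters.append([p])
    return clusters

groups = cluster_paths(["/lib/book/title", "/lib/book/author", "/shop/item/price"])
```

With nodes grouped this way, a path query such as `/lib/book/*` would only need to touch the first group, which is the kind of search-space reduction the abstract describes.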

XML Document Clustering Technique by K-means algorithm through PCA (주성분 분석의 K 평균 알고리즘을 통한 XML 문서 군집화 기법)

  • Kim, Woo-Saeng
    • The KIPS Transactions:PartD / v.18D no.5 / pp.339-342 / 2011
  • Recently, much research has been devoted to developing efficient techniques for accessing, querying, and storing XML documents, which are frequently used on the Internet. In this paper, we propose a new method to cluster XML documents efficiently. XML documents are first represented as vectors in a feature vector space built from the names and levels of the elements of their corresponding trees, and are then clustered using a K-means algorithm with Principal Component Analysis (PCA). The experiment shows that the proposed method produces good results.
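The vector representation described in this abstract can be sketched with the standard library. The sketch covers only the feature-extraction step and assumes that a feature is an (element name, depth) pair counted per document; the exact encoding used in the paper may differ, and the PCA and K-means steps that would follow are omitted.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def name_level_counts(xml_text):
    """Count (element name, depth) pairs in one XML document; root depth is 0."""
    counts = Counter()
    stack = [(ET.fromstring(xml_text), 0)]
    while stack:
        node, level = stack.pop()
        counts[(node.tag, level)] += 1
        stack.extend((child, level + 1) for child in node)
    return counts

def to_vectors(docs):
    """Map documents to count vectors over a shared, sorted feature list."""
    per_doc = [name_level_counts(d) for d in docs]
    features = sorted(set().union(*per_doc))
    return features, [[c[f] for f in features] for c in per_doc]

features, vectors = to_vectors(["<a><b/><b/></a>", "<a><c><b/></c></a>"])
```

The resulting vectors have one dimension per distinct name-level pair, so dimensionality grows quickly with schema variety, which is where a reduction step such as PCA becomes useful before clustering.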

A Comparative Study of Feature Selection Methods for Korean Web Documents Clustering (한글 웹 문서 클러스터링 성능향상을 위한 자질선정 기법 비교 연구)

  • Kim Young-Gi
    • Journal of the Korean Society for Library and Information Science / v.39 no.1 / pp.45-58 / 2005
  • This paper is a comparative study of feature selection methods for Korean web document clustering. First, we focused on how the term features and the co-links of web documents affect clustering performance. We clustered web documents by native term features, by co-links, and by both, and compared the results with the originally allocated categories. We then selected term features for each category from training documents using $\chi^2$, Information Gain (IG), and Mutual Information (MI), and applied these features to other experimental documents. In addition, we suggested a new method named Max Feature Selection, which selects the terms that have the maximum count for a category in each experimental document, applies the $\chi^2$ (or MI or IG) value to each term instead of the term frequency of the document, and then clusters the documents. In the results, $\chi^2$ performs better than IG or MI, but the difference appears to be slight. When Max Feature Selection was applied, however, clustering performance improved notably. Max Feature Selection is a simple but effective means of feature space reduction and performs strongly for Korean web document clustering.
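The chi-square term selection mentioned in this abstract can be sketched with the standard 2x2 contingency formula. The counts, the example terms, and the function names below are hypothetical; the sketch shows the general statistic, not the paper's Max Feature Selection procedure.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for one term and one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom

def top_terms(counts, k):
    """Return the k terms with the highest chi-square score."""
    scored = {term: chi_square(*abcd) for term, abcd in counts.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]

# Hypothetical (a, b, c, d) counts for three terms and a "sports" category.
counts = {"goal": (10, 0, 0, 10), "the": (5, 5, 5, 5), "match": (8, 2, 2, 8)}
selected = top_terms(counts, 2)
```

A term that appears equally often inside and outside the category scores zero, so common function words drop out of the selected feature set.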

An Automatic Classification System of Korean Documents Using Weight for Keywords of Document and Word Cluster (문서의 주제어별 가중치 부여와 단어 군집을 이용한 한국어 문서 자동 분류 시스템)

  • Hur, Jun-Hui;Choi, Jun-Hyeog;Lee, Jung-Hyun;Kim, Joong-Bae;Rim, Kee-Wook
    • The KIPS Transactions:PartB / v.8B no.5 / pp.447-454 / 2001
  • Automatic document classification is a method that assigns unlabeled documents to existing classes. It can be applied to classifying newsgroup articles, classifying web documents, and producing more precise information retrieval results by learning from users. In this paper, we use a weighted Bayesian classifier that assigns weights to the keywords of a document to improve classification accuracy. If the system cannot classify a document properly because the document has too few words to serve as features, it uses relevant word clusters to supplement the features of the document. The clusters are built by automatic word clustering from a corpus. As a result, the proposed system outperformed an existing classification system in classification accuracy on Korean documents.
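One way to read "a weighted Bayesian classifier that assigns weights to keywords" is a naive Bayes classifier in which the log-likelihood of keyword tokens is scaled up. The sketch below follows that reading; the weighting scheme, the `keyword_weight` parameter, and the toy training data are assumptions, and the word-cluster fallback described in the abstract is not included.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (tokens, label). Returns per-class token counts and class counts."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        class_counts[label] += 1
    return word_counts, class_counts

def classify(tokens, keywords, word_counts, class_counts, keyword_weight=2.0):
    """Pick the class with the highest keyword-weighted log posterior."""
    vocab = {w for c in word_counts.values() for w in c}
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for w in tokens:
            # Laplace-smoothed log-likelihood, scaled up for keyword tokens.
            p = (word_counts[label][w] + 1) / (total + len(vocab))
            weight = keyword_weight if w in keywords else 1.0
            score += weight * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train([(["ball", "goal", "team"], "sports"),
               (["vote", "party", "law"], "politics")])
label = classify(["goal", "team"], {"goal"}, *model)
```

Setting `keyword_weight` to 1.0 reduces this to an ordinary multinomial naive Bayes classifier, which makes the effect of the weighting easy to compare.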
