• Title/Summary/Keyword: web document clustering

Search Result 54, Processing Time 0.037 seconds

Document Summarization Method using Complete Graph (완전그래프를 이용한 문서요약 연구)

  • Lyu, Jun-Hyun;Park, Soon-Cheol
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.10 no.2
    • /
    • pp.26-31
    • /
    • 2005
  • In this paper, we present the document summarizers which are simpler and more condense than the existing ones generally used in the web search engines. This method is a statistic-based summarization method using the concept of the complete graph. We suppose that each sentence as a vertex and the similarity between two sentences as a link of the graph. We compare this summarizer with those of Clustering and MMR techniques which are well-known as the good summarization methods. For the comparison, we use FScore using the summarization results generated by human subjects. Our experimental results verify the accuracy of this method, being about $30\%$ better than the others.

  • PDF

A Study on the efficiency of similarity and clustering measure in Historical Writing Document (역사적 기록 문서에서 효율적인 유사도 및 클러스터링 측정에 관한 연구)

  • 한광덕
    • Journal of the Korea Society of Computer and Information
    • /
    • v.7 no.4
    • /
    • pp.94-101
    • /
    • 2002
  • It expected a lot of changes in mass media and documentation expression as documents on web are getting diverse, complex and massive. An Annals of The Chosun Dynasty is a very important document used for researching historical facts and is published as CD-Rom. However. The CD-Rom was composed as content-based and using simple search method, therefore it's very difficult to make determine event-relationship between documents factors. Because of that, we studied to discover event-relationship between documents through clustering and efficient similarity method among Annals of The Chosun Dynasty. For the research method, we discovered the best similarity method for historical written documents through simulation similarity measures of Annals of The Chosun Dynasty documents. Then we did simulation-clustering documents based on similarity probability. In evaluation of the clustered documents , the results were the same as when manually figured.

  • PDF

Recommendation System using Associative Web Document Classification by Word Frequency and α-Cut (단어 빈도와 α-cut에 의한 연관 웹문서 분류를 이용한 추천 시스템)

  • Jung, Kyung-Yong;Ha, Won-Shik
    • The Journal of the Korea Contents Association
    • /
    • v.8 no.1
    • /
    • pp.282-289
    • /
    • 2008
  • Although there were some technological developments in improving the collaborative filtering, they have yet to fully reflect the actual relation of the items. In this paper, we propose the recommendation system using associative web document classification by word frequency and ${\alpha}$-cut to address the short comings of the collaborative filtering. The proposed method extracts words from web documents through the morpheme analysis and accumulates the weight of term frequency. It makes associative rules and applies the weight of term frequency to its confidence by using Apriori algorithm. And it calculates the similarity among the words using the hypergraph partition. Lastly, it classifies related web document by using ${\alpha}$-cut and calculates similarity by using adjusted cosine similarity. The results show that the proposed method significantly outperforms the existing methods.

Document Clustering for Web Directory Service (웹 디렉토리 서비스를 위한 문서 클러스터링)

  • 이문기;권오욱;이종혁
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2000.04b
    • /
    • pp.351-353
    • /
    • 2000
  • 대부분의 검색 엔진에서의 사용자의 정보 검색 요구에서 나타나는 키워드 장벽의 문제점을 해결하고 사용자의 정보 검색 과정에 도움을 주기 위해 디렉토리 서비스를 제공한다. 하지만 디렉토리 서비스에서 새로운 웹 사이트를 지속적으로 인덱스하여 하나의 주제어에 너무 많은 수의 웹 사이트가 부여되어 있으면 사용자의 검색 편의를 위해서 재분류하여 세분류할 필요가 있다. 따라서 본 논문에서는 한 주제어에 과다하게 부여된 웹 사이트들을 세분류하기 위해 기존의 문서 클러스터링 기법을 사용하여 클러스터링 할 때 생기는 문제점을 보완한 문서 클러스터링 시스템을 소개한다.

  • PDF

Design and implementation of web document clustering system using on incremental algorithm (점진적 알고리즘을 이용한 웹 문서 클러스터링 시스템의 설계 및 구현)

  • 황태호;손기락
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 1999.10a
    • /
    • pp.207-209
    • /
    • 1999
  • 클러스터 분석은 관측의 대상이 되는 집합에 맞는 분류 구조를 생성하는데 이용되는 통계학적인 기술이다. 정보검색 응용에서 전형적으로 발견되는 높은 차원을 가진 많은 데이터 집합을 클러스터하기 위하여, 많은 공간과 시간이 필요하다. SLINK 알고리즘은 O(n2)의 시간과 O(n)의 공간의 성능을 갖으며 점진성을 반영할 수 있는 알고리즘이다. SLINK알고리즘을 이용하여 검색 엔진의 검색결과에 온라인으로 클러스터 분류를 수행하는 시스템을 구현하였다. 구현된 시스템은 상대적으로 높은 정확도와 각 클러스터를 저장하고 표현하는데 있어서의 장점을 제공하며, 상대적으로 느린 수행 속도는 온라인으로 문서들이 다운로드 되는 속도가 느리므로 문제가 되지 않음을 알 수 있었다.

  • PDF

Automatic Response and Conceptual Browsing of Internet FAQs Using Self-Organizing Maps (자기구성 지도를 이용한 인터넷 FAQ의 자동응답 및 개념적 브라우징)

  • Ahn, Joon-Hyun;Ryu, Jung-Won;Cho, Sung-Bae
    • Journal of the Korean Institute of Intelligent Systems
    • /
    • v.12 no.5
    • /
    • pp.432-441
    • /
    • 2002
  • Though many services offer useful information on internet, computer users are not so familiar with such services that they need an assistant system to use the services easily In the case of web sites, for example, the operators answer the users e-mail questions, but the increasing number of users makes it hard to answer the questions efficiently. In this paper, we propose an assistant system which responds to the users questions automatically and helps them browse the Hanmail Net FAQ (Frequently Asked Question) conceptually. This system uses two-level self-organizing map (SOM): the keyword clustering SOM and document classification SOM. The keyword clustering SOM reduces a variable length question to a normalized vector and the document classification SOM classifies the question into an answer class. Experiments on the 2,206 e-mail question data collected for a month from the Hanmail net show that this system is able to find the correct answers with the recognition rate of 95% and also the browsing based on the map is conceptual and efficient.

Web Document Clustering for Specific Subject Information Using WordNet and HTML Tags (WordNet과 HTML 태그를 활용한 특정영역 정보의 웹 문서 분류)

  • 조은휘;변영태
    • Proceedings of the Korean Society for Cognitive Science Conference
    • /
    • 2002.05a
    • /
    • pp.28-32
    • /
    • 2002
  • 웹 상의 많은 정보들 속에서 사용자가 원하는 정보를 찾아내는 일은 쉽지 않다. 사용자가 의도하는 양질의 정보 제공을 위해 특정 영역과 관련한 정보 제공 시스템이 .개발되고 있다. 이전 시스템은 특정 영역 관련 지식베이스를 토대로 하여 웹 문서를 수집해 놓고, 사용자에게 정보를 제공한다. 본 논문에서는 전문 사이트 내에 문서간의 유사성을 토대로 하여 동물 영역에 대한 효과적인 문서 클러스타링(clustering)에 관해 실험하였다. 기존의 방법에서는 문서의 분류나 질의어와 관련한 문서 선택이나 순위 결정이 주로 텀(term)을 바탕으로 하고 있다. 본 논문에서는 각 문서 내의 텀 뿐만 아니라 HTML 태그(tag), 지식베이스에 WordNet의 계층구조를 적용한 data를 활용하고, SVD(Singular Value Decomposition)를 사용하여 문서간의 관계를 밝혀내어 문서 분류 및 수집에 이용하였다. 특정 영역의 전문 문서를 많이 제공하는 사이트에 적용하여 좋은 결과를 볼 수 있었다.

  • PDF

An Automatic Classification System of Korean Documents Using Weight for Keywords of Document and Word Cluster (문서의 주제어별 가중치 부여와 단어 군집을 이용한 한국어 문서 자동 분류 시스템)

  • Hur, Jun-Hui;Choi, Jun-Hyeog;Lee, Jung-Hyun;Kim, Joong-Bae;Rim, Kee-Wook
    • The KIPS Transactions:PartB
    • /
    • v.8B no.5
    • /
    • pp.447-454
    • /
    • 2001
  • The automatic document classification is a method that assigns unlabeled documents to the existing classes. The automatic document classification can be applied to a classification of news group articles, a classification of web documents, showing more precise results of Information Retrieval using a learning of users. In this paper, we use the weighted Bayesian classifier that weights with keywords of a document to improve the classification accuracy. If the system cant classify a document properly because of the lack of the number of words as the feature of a document, it uses relevance word cluster to supplement the feature of a document. The clusters are made by the automatic word clustering from the corpus. As the result, the proposed system outperformed existing classification system in the classification accuracy on Korean documents.

  • PDF

Research on Function and Policy for e-Government System using Semantic Technology (전자정부내 의미기반 기술 도입에 따른 기능 및 정책 연구)

  • Jang, Young-Cheol
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.13 no.5
    • /
    • pp.22-28
    • /
    • 2008
  • This paper aims to offer a solution based on semantic document classification to improve e-Government utilization and efficiency for people using their own information retrieval system and linguistic expression. Generally, semantic document classification method is an approach that classifies documents based on the diverse relationships between keywords in a document without fully describing hierarchial concepts between keywords. Our approach considers the deep meanings within the context of the document and radically enhances the information retrieval performance. Concept Weight Document Classification(CoWDC) method, which goes beyond using existing keyword and simple thesaurus/ontology methods by fully considering the concept hierarchy of various concepts is proposed, experimented, and evaluated. With the recognition that in order to verify the superiority of the semantic retrieval technology through test results of the CoWDC and efficiently integrate it into the e-Government, creation of a thesaurus, management of the operating system, expansion of the knowledge base and improvements in search service and accuracy at the national level were needed.

  • PDF

A Clustering Technique Using Association Rules for The Library and Information Science Terminology (연관규칙을 이용한 문헌정보학 전문용어 클러스터링 기법에 관한 연구)

  • Seung, Hyon-Woo;Park, Mi-Young
    • Journal of the Korean Society for Library and Information Science
    • /
    • v.37 no.2
    • /
    • pp.89-105
    • /
    • 2003
  • In this paper, an effective method for clustering terminologies extracted from text is proposed, in order to develope a search engine to extract relevant information from large web documents. To prevent frequency of the meaningless association rules among general terminologies, only useful association rules among terminologies are produced using database tables which consist of domain-specific terminologies. Such association rules are produced by applying the Apriori algorithm after forming transaction units from groups of association rules in a document. A group of association rules produced from a terminology forms in a cluster.