• Title/Summary/Keyword: Document Clustering

Similarity checking between XML tags through expanding synonym vector (유사어 벡터 확장을 통한 XML태그의 유사성 검사)

  • Lee, Jung-Won;Lee, Hye-Soo;Lee, Ki-Ho
    • Journal of KIISE:Software and Applications / v.29 no.9 / pp.676-683 / 2002
  • The success of XML (eXtensible Markup Language) rests primarily on its flexibility: anyone can define the structure of XML documents to represent information in whatever form they desire. XML is so flexible, however, that XML documents cannot automatically be given an underlying semantics. Different tag sets, different names for elements or attributes, and different document structures in general hinder the precise classification and clustering of XML documents. In this paper, we design and implement a system that checks semantic similarity between XML tags. First, the system extracts the underlying semantics of tags and then expands the synonym set of each tag using the WordNet thesaurus and a user-defined word library that supports abbreviations and compound words in XML tags. Second, to account for the relative importance of XML tags within XML documents, we extend the conventional vector space model, the most widely used document model in the information retrieval field. Using this method, we were able to check the similarity between XML tags that are represented by different tag names.
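
The idea in this abstract, expanding tag vectors with synonyms before comparing them in a weighted vector space, can be illustrated with a minimal Python sketch. The `SYNONYMS` table, the tag weights, and the helper names `expand` and `cosine` are hypothetical stand-ins; the paper itself uses WordNet and a user-defined word library, and its exact weighting scheme is not given here.

```python
import math

# Hypothetical synonym library (assumption); the paper uses WordNet
# plus a user-defined word list for abbreviations and compound words.
SYNONYMS = {
    "author": {"writer", "creator"},
    "writer": {"author", "creator"},
    "price": {"cost"},
    "cost": {"price"},
}

def expand(tags):
    """Spread each tag's weight onto the tag itself and its synonyms."""
    vector = {}
    for tag, weight in tags.items():
        for term in {tag} | SYNONYMS.get(tag, set()):
            vector[term] = max(vector.get(term, 0.0), weight)
    return vector

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Two documents that use different tag names for the same concepts.
# The weights stand in for the relative importance of each tag (assumed values).
doc_a = {"author": 1.0, "price": 0.5}
doc_b = {"writer": 1.0, "cost": 0.5}

raw = cosine(doc_a, doc_b)                        # no shared tag names
expanded = cosine(expand(doc_a), expand(doc_b))   # synonyms now overlap
```

Without expansion the two documents share no dimensions, so their similarity is zero; after expansion they occupy the same dimensions and the similarity rises accordingly.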

XML Document Analysis based on Similarity (유사성 기반 XML 문서 분석 기법)

  • Lee, Jung-Won;Lee, Ki-Ho
    • Journal of KIISE:Software and Applications / v.29 no.6 / pp.367-376 / 2002
  • XML allows users to define elements using arbitrary words and organize them in a nested structure. These features of XML offer both challenges and opportunities in information retrieval and document management. In this paper, we propose a new methodology for computing similarity that considers XML semantics: the meanings of the elements and the nested structures of XML documents. Using a thesaurus, we generate extended-element vectors that normalize synonyms, compound words, and abbreviations, and build a similarity matrix from them. We then compute the similarity between XML elements. We also discover and minimize XML structures using automata (NFA, nondeterministic finite automata, and DFA, deterministic finite automata). We compute the similarity between XML structures using the element similarity matrix and the minimized XML structures. Our methodology, which considers XML semantics, shows 100% accuracy in identifying the category of real documents from an online bookstore.

Resampling Feedback Documents Using Overlapping Clusters (중첩 클러스터를 이용한 피드백 문서의 재샘플링 기법)

  • Lee, Kyung-Soon
    • The KIPS Transactions:PartB / v.16B no.3 / pp.247-256 / 2009
  • Typical pseudo-relevance feedback methods assume the top-retrieved documents are relevant and use these pseudo-relevant documents to expand terms. The initial retrieval set can, however, contain a great deal of noise. In this paper, we present a cluster-based resampling method to select better pseudo-relevant documents based on the relevance model. The main idea is to use document clusters to find dominant documents for the initial retrieval set, and to repeatedly feed the documents to emphasize the core topics of a query. Experimental results on large-scale web TREC collections show significant improvements over the relevance model. For justification of the resampling approach, we examine relevance density of feedback documents. The resampling approach shows higher relevance density than the baseline relevance model on all collections, resulting in better retrieval accuracy in pseudo-relevance feedback. This result indicates that the proposed method is effective for pseudo-relevance feedback.
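
A rough Python sketch of the resampling idea follows. It assumes details the abstract does not state: each retrieved document seeds a cluster with its nearest neighbours (so clusters overlap), clusters are ranked by the mean retrieval score of their members, and a document that appears in several top clusters is fed back once per appearance. The Jaccard similarity, the function names, and the toy data are all hypothetical.

```python
from collections import Counter

def jaccard(a, b):
    """Set-overlap similarity between two documents given as term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def resample(docs, scores, k=2, top_clusters=2):
    """Return how many times each document index is fed back.

    Assumptions (not from the abstract): clusters are k-nearest-neighbour
    groups seeded by every document, and clusters are ranked by the mean
    initial retrieval score of their members.
    """
    clusters = []
    for i, doc in enumerate(docs):
        others = sorted(
            (j for j in range(len(docs)) if j != i),
            key=lambda j: jaccard(doc, docs[j]),
            reverse=True,
        )
        clusters.append([i] + others[: k - 1])
    ranked = sorted(
        clusters,
        key=lambda members: sum(scores[j] for j in members) / len(members),
        reverse=True,
    )
    counts = Counter()
    for members in ranked[:top_clusters]:
        counts.update(members)  # overlap leads to repeated feedback
    return counts

# Toy initial retrieval set: index 2 is off-topic noise.
docs = [{"xml", "tag"}, {"xml", "tag", "schema"}, {"football"}, {"xml", "schema"}]
scores = [0.9, 0.8, 0.7, 0.6]
feedback = resample(docs, scores)
```

In this toy run the two mutually similar documents dominate the top clusters and are each counted twice, while the off-topic document is never selected.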

Examining the Intellectual Structure of Reading Studies with Co-Word Analysis Based on the Importance of Journals and Sequence of Keywords (학술지 중요도와 키워드 순서를 고려한 단어동시출현 분석을 이용한 독서분야의 지적구조 분석)

  • Zhang, Ling Ling;Hong, Hyun Jin
    • Journal of the Korean BIBLIA Society for library and Information Science / v.25 no.1 / pp.295-318 / 2014
  • The purpose of this study is to analyze the intellectual structure of reading studies using co-word analysis based on a mixed weight that combines the level of the academic journal and the position of each keyword. To this end, 838 academic articles on reading studies published between 2003 and 2012 were retrieved from KCI, and 56 keywords were extracted. The results of cluster analysis, MDS, and network analysis show that the network based on the mixed weight performs better in all three methods, and that reading studies can be divided into 4 major divisions and 11 subdivisions. Finally, the document analysis shows that the research tendency of reading studies has shifted from theoretical to empirical studies.

Study on Designing and Implementing Online Customer Analysis System based on Relational and Multi-dimensional Model (관계형 다차원모델에 기반한 온라인 고객리뷰 분석시스템의 설계 및 구현)

  • Kim, Keun-Hyung;Song, Wang-Chul
    • The Journal of the Korea Contents Association / v.12 no.4 / pp.76-85 / 2012
  • Through opinion mining, we can analyze the degree of positive or negative sentiment that customers express about important entities or attributes in online customer reviews. However, existing opinion mining techniques provide only simple functions for analyzing reviews. In this paper, we propose a novel technique for analyzing online customer reviews multi-dimensionally. The technique modifies existing OLAP techniques so that they can be applied to text data. The resulting multi-dimensional analytic model consists of noun, adjective, and document axes, which are converted into four relational tables in a relational database. The model could serve as a new framework that brings together existing opinion mining, information summarization, and clustering algorithms. We implemented the multi-dimensional analysis model and its algorithms, and found that the system enables online customer reviews to be analyzed in more complex ways.

Adaptive Data Mining Model using Fuzzy Performance Measures (퍼지 성능 측정자를 이용한 적응 데이터 마이닝 모델)

  • Rhee, Hyun-Sook
    • The KIPS Transactions:PartB / v.13B no.5 s.108 / pp.541-546 / 2006
  • Data mining is the process of finding hidden patterns in a large data set. Cluster analysis is a popular data mining technique. It is a fundamental process of data analysis and has played an important role in solving many problems in pattern recognition and image processing. If fuzzy cluster analysis is to make a significant contribution to engineering applications, much more attention must be paid to the fundamental decision on the number of clusters in the data. This is related to the cluster validity problem: how well the clustering has identified the structure that is present in the data. In this paper, we design an adaptive data mining model using fuzzy performance measures. It discovers clusters through an unsupervised neural network model based on a fuzzy objective function and evaluates the clustering results with a fuzzy performance measure. We also present experimental results on newsgroup data, which show that the proposed model can be used as a document classifier.
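
The abstract does not say which fuzzy performance measure is used. As an illustration of what a fuzzy cluster validity measure looks like, the sketch below computes the partition coefficient, a standard measure swapped in here as an assumption: the mean of the squared membership degrees, which equals 1 for a crisp partition and 1/c for a maximally fuzzy partition into c clusters. The function name and the membership matrices are hypothetical.

```python
def partition_coefficient(memberships):
    """Mean of squared membership degrees over all data points.

    `memberships` is a list of rows, one per data point, each row holding
    that point's membership degree in every cluster (rows sum to 1).
    This is a stand-in measure, not necessarily the one used in the paper.
    """
    n = len(memberships)
    return sum(u * u for row in memberships for u in row) / n

# Each point belongs fully to one cluster: the structure is clear.
crisp = [[1.0, 0.0], [0.0, 1.0]]
# Each point is split evenly between clusters: no structure was found.
fuzzy = [[0.5, 0.5], [0.5, 0.5]]

pc_crisp = partition_coefficient(crisp)
pc_fuzzy = partition_coefficient(fuzzy)
```

A measure of this kind can be compared across runs with different numbers of clusters, which is one common way to approach the cluster-number decision the abstract mentions.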

Unsupervised Motion Learning for Abnormal Behavior Detection in Visual Surveillance (영상감시시스템에서 움직임의 비교사학습을 통한 비정상행동탐지)

  • Jeong, Ha-Wook;Chang, Hyung-Jin;Choi, Jin-Young
    • Journal of the Institute of Electronics Engineers of Korea SC / v.48 no.5 / pp.45-51 / 2011
  • In this paper, we propose an unsupervised learning method for effectively modeling motion trajectory patterns. In our approach, observations of an object along a trajectory are treated as words in a document for the latent Dirichlet allocation algorithm, which is used in natural language processing to cluster words by topic. This allows topics (e.g., go straight, turn left, turn right) to be clustered effectively in complex scenes such as crossroads. After this procedure, we learn the patterns of word sequences in each cluster using the Baum-Welch algorithm, which finds the unknown parameters of a hidden Markov model. Abnormality can be evaluated with the forward algorithm by comparing the learned sequences with the input sequence. Experimental results show that the modeling of semantic regions is robust against noise in various scenes.
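
The abnormality check described in this abstract relies on the forward algorithm, which computes the likelihood of an observation sequence under a hidden Markov model. The sketch below is a minimal version with hand-picked parameters; in the paper the parameters would come from Baum-Welch training, and the two-state model, the symbol coding, and all probabilities here are assumptions for illustration only.

```python
def forward_likelihood(obs, start, trans, emit):
    """Likelihood of `obs` under an HMM, computed with the forward algorithm.

    start[s]    : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability of emitting symbol o in state s
    """
    states = range(len(start))
    alpha = [start[s] * emit[s][obs[0]] for s in states]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in states) * emit[s][symbol]
            for s in states
        ]
    return sum(alpha)

# Hypothetical model: symbol 0 = "go straight", symbol 1 = "turn".
start = [1.0, 0.0]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.9, 0.1], [0.1, 0.9]]

usual = forward_likelihood([0, 0, 0], start, trans, emit)
unusual = forward_likelihood([0, 1, 0], start, trans, emit)
```

A sequence whose likelihood falls well below that of typical learned sequences can be flagged as abnormal; where to place that cut-off is a design choice the abstract does not specify.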

Design and Implementation of Topic Map Generation System based Tag (태그 기반 토픽맵 생성 시스템의 설계 및 구현)

  • Lee, Si-Hwa;Lee, Man-Hyoung;Hwang, Dae-Hoon
    • Journal of Korea Multimedia Society / v.13 no.5 / pp.730-739 / 2010
  • Tagging is one of the core technologies of Web 2.0 and is widely applied to multimedia data such as blog documents, images, and videos. Contrary to the expectation that tags would be reused in information retrieval and maximize retrieval efficiency, unacceptable retrieval results appear owing to the limitations of tags. In this paper, building on preceding research on image retrieval through tag clustering, we design and implement a topic map generation system, which is a semantic knowledge system. The tag information in each cluster is automatically generated as topics of a topic map. The generated topics are given semantic relationships by means of WordNet. The topics are also given occurrence information suitable for each topic pair, so that a topic map with a semantic knowledge system can be generated. As a result, the topic map proposed in this paper can be used not only to meet users' information retrieval demands through semantic navigation but also to provide convenient and rich information services.

Analysis method of patent document to Forecast Patent Registration (특허 등록 예측을 위한 특허 문서 분석 방법)

  • Koo, Jung-Min;Park, Sang-Sung;Shin, Young-Geun;Jung, Won-Kyo;Jang, Dong-Sik
    • Journal of the Korea Academia-Industrial cooperation Society / v.11 no.4 / pp.1458-1467 / 2010
  • Recently, imitation and infringement of intellectual property rights have been recognized as impediments to a nation's industrial growth. To prevent the huge losses that result from these impediments, many researchers are studying the protection and efficient management of intellectual property in various ways. In particular, predicting patent registration is an important part of protecting and asserting intellectual property rights. In this study, we propose a patent document analysis method that uses text mining to predict whether a patent will be registered or rejected. First, the proposed method builds a database from the word frequencies of rejected patent documents. Comparing this database with other patent documents then yields a similarity value between each patent document and the database. We used k-means, a partitioning clustering algorithm, to select the criterion value for patent rejection. As a result, we conclude that patents similar to rejected patents have a strong possibility of rejection. We used U.S. patent documents on Bluetooth technology, solar battery technology, and display technology as experimental data.
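
One way to read the k-means step in this abstract is as a one-dimensional, two-cluster k-means over the similarity values, with the rejection criterion placed between the two cluster centres. The sketch below follows that reading; the midpoint rule, the function name, and the similarity values are assumptions, not details from the paper.

```python
def two_means_threshold(values, iterations=20):
    """Split 1-D values into two clusters and return the midpoint of the centres.

    Assumption: the criterion value is taken as the midpoint between the
    low-similarity and high-similarity cluster centres.
    """
    low_c, high_c = min(values), max(values)
    for _ in range(iterations):
        low = [v for v in values if abs(v - low_c) <= abs(v - high_c)]
        high = [v for v in values if abs(v - low_c) > abs(v - high_c)]
        if not high:  # all values collapsed into one cluster
            break
        low_c, high_c = sum(low) / len(low), sum(high) / len(high)
    return (low_c + high_c) / 2

# Hypothetical similarity values between patents and the rejected-patent database.
similarities = [0.10, 0.15, 0.20, 0.80, 0.85]
threshold = two_means_threshold(similarities)

# Patents whose similarity exceeds the threshold are flagged as likely rejections.
flagged = [s for s in similarities if s > threshold]
```

With these toy values the threshold falls in the gap between the two groups, so only the two highly similar patents are flagged.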

The Expressive Characteristics of Fashion Installation in Henrik Vibskov Collection (헨릭 빕스코브 컬렉션에 나타난 패션 인스톨레이션의 표현 특성)

  • Ko, Hyunzin
    • Journal of the Korean Society of Costume / v.65 no.6 / pp.133-147 / 2015
  • The aim of this study is to review the creative fashion installations of the Danish designer Henrik Vibskov and to contribute useful information for more innovative fashion presentation. As a research method, a document study and a case study were performed, and his collections from 2004 F/W to 2016 S/S were analyzed. In a fashion installation, the designer places objects in meaningful spaces in order to convey a certain message, create an integrated artwork, and interact with spectators. Fashion installation has been used in fashion exhibitions as well as in the set design of fashion performances and fashion shows. The results were as follows. Henrik Vibskov's fashion installations have three features: 1) a conceptual theme approach that communicates a twisted and metaphoric message with a poetic and interesting show title; 2) surrealistic scenography that plays with fragmentation of the human body, clustering of plastic and symbolic objects, innovative color transformations, and visual trickery between figures and the background; and 3) a setting for multisensory performance in which artistic objects and surroundings stimulate the five senses and make spectators interact. Henrik Vibskov's fashion installations can exist as independent artworks, not just as supporting pieces for a fashion show. They have both artistic and fashion value and can be an effective form of fashion presentation that communicates his conceptual fashion themes.