• Title/Summary/Keyword: Similar Documents

Search Result 283, Processing Time 0.03 seconds

Visualization Method of Document Retrieval Result based on Centers of Clusters (군집 중심 기반 문헌 검색 결과의 시각화)

  • Jee, Tae-Chang;Lee, Hyun-Jin;Lee, Yill-Byung
    • The Journal of the Korea Contents Association / v.7 no.5 / pp.16-26 / 2007
  • Because existing document retrieval systems have difficulty visualizing search results, they typically show only document titles and short summaries of the passages that contain the search keywords. When the result list is long, it is difficult to examine all the documents at once and to find relations among them. This study uses clustering to group similar documents so that the relations among the retrieved documents are easy to grasp. It also proposes a two-level visualization algorithm: first, the cluster centers are projected into a low-dimensional space using multidimensional scaling so that searchers can grasp the relations among clusters at a glance; second, individual documents are drawn in the low-dimensional space around their cluster centers using an orbital model, so that similarities among individual documents can be easily confirmed. The method is tested on benchmark data and real data, and the results show that search results can be visualized in real time.
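
The first level of the visualization described above, projecting cluster centers into two dimensions, could look roughly like the following sketch. It is not the authors' implementation: a plain gradient-descent stress minimization stands in for whichever MDS variant the paper uses, the three example centers and all function names are made up for illustration, and the orbital placement of individual documents is omitted because the abstract does not describe it.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def stress(points, target):
    """Sum of squared differences between layout and target distances."""
    n = len(points)
    return sum((dist(points[i], points[j]) - target[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def mds_2d(target, steps=2000, lr=0.05, seed=0):
    """Place n points in 2-D so their distances approximate `target`.

    Assumption: simple gradient descent on the raw stress, used here
    only as a stand-in for the MDS step named in the abstract.
    """
    rng = random.Random(seed)
    n = len(target)
    pts = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    for _ in range(steps):
        grads = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = dist(pts[i], pts[j]) or 1e-9  # avoid division by zero
                scale = 2 * (d - target[i][j]) / d
                grads[i][0] += scale * (pts[i][0] - pts[j][0])
                grads[i][1] += scale * (pts[i][1] - pts[j][1])
        for i in range(n):
            pts[i][0] -= lr * grads[i][0]
            pts[i][1] -= lr * grads[i][1]
    return pts

# Hypothetical cluster centers in a 3-dimensional term space.
centers = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
target = [[dist(a, b) for b in centers] for a in centers]
layout = mds_2d(target)
```

Because only the cluster centers are projected, the cost of this step depends on the number of clusters rather than the number of retrieved documents, which is consistent with the real-time goal stated in the abstract.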

Semi Automatic Ontology Generation about XML Documents

  • Gu Mi Sug;Hwang Jeong Hee;Ryu Keun Ho;Jung Doo Yeong;Lee Keum Woo
    • Proceedings of the KSRS Conference / 2004.10a / pp.730-733 / 2004
  • XML (eXtensible Markup Language) has recently become the standard for exchanging documents on the web. As the amount of information grows with the development of Internet technology, the Semantic Web has emerged to provide more accurate information retrieval results than the existing web. An ontology, which is the basis of the Semantic Web, provides a basic knowledge system for expressing particular knowledge, and can therefore yield accurate retrieval results. An ontology defines the concepts of a specific domain and the relationships between them, and has a hierarchy similar to a taxonomy. In this paper, we propose semi-automatic ontology generation based on XML documents, which interest many researchers as a means of knowledge expression. To construct an ontology for a particular domain, we suggest an algorithm for determining the domain; we chose movie information extracted from the web as the ontology domain. We use generalized association rules, a data mining method, to generate the ontology from the tags and contents of XML documents, and XTM (XML Topic Maps), an ISO standard, as the ontology language. Because the ontology is built from terms frequently used in documents related to the domain, it is useful for querying and retrieving information in that domain.


An Efficient Index Scheme of XML Documents Using Node Range and Pre-Order List (노드 범위와 Pre-Order List를 이용한 XML문서의 효율적 색인기법)

  • Kim Young;Park Sang-Ho;Lee Ju-Hong
    • Journal of Internet Computing and Services / v.7 no.4 / pp.23-32 / 2006
  • In this paper, we propose an indexing method that uses node ranges and a Pre-Order List to manage large numbers of XML documents efficiently. Most XML indexing methods are based on paths or on numbering schemes. However, path-based indexing suffers from performance degradation in join operations over ancestor-descendant relationships and in searches for middle and lower nodes. Numbering-scheme-based indexing has to number every node of the XML documents, which increases search overhead and wastes disk space on indexes. We therefore propose a novel indexing method using node ranges and Pre-Order Lists to overcome these problems. The proposed method stores similarly structured XML documents more efficiently and supports flexible insertion and deletion of XML documents.
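
To make the idea of node ranges over a pre-order numbering concrete, here is a minimal sketch of the generic interval-labeling technique. It is an illustration only: the abstract does not describe the paper's actual index layout or its Pre-Order List structure, and the function names and sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

def label_ranges(root):
    """Return {element: (start, end)}.

    `start` is the element's pre-order number and `end` is the largest
    pre-order number found anywhere in its subtree.
    """
    ranges = {}
    counter = 0

    def visit(node):
        nonlocal counter
        start = counter
        counter += 1
        for child in node:
            visit(child)
        ranges[node] = (start, counter - 1)

    visit(root)
    return ranges

def is_ancestor(ranges, ancestor, descendant):
    """Ancestor test by range containment, with no tree traversal."""
    a_start, a_end = ranges[ancestor]
    d_start, _ = ranges[descendant]
    return a_start < d_start <= a_end

# Hypothetical sample document.
doc = ET.fromstring("<lib><book><title/><author/></book><cd><title/></cd></lib>")
ranges = label_ranges(doc)
book, cd = doc[0], doc[1]
```

With these labels, `is_ancestor(ranges, book, book[0])` is true and `is_ancestor(ranges, book, cd[0])` is false, so an ancestor-descendant check reduces to two integer comparisons instead of a path join.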


Improving Performance of Change Detection Algorithms through the Efficiency of Matching (대응효율성을 통한 변화 탐지 알고리즘의 성능 개선)

  • Lee, Suk-Kyoon;Kim, Dong-Ah
    • The KIPS Transactions:PartD / v.14D no.2 / pp.145-156 / 2007
  • Recently, the need for effective real-time change detection algorithms for XML/HTML documents has increased in fields such as the detection of defacement attacks on web documents and version management. In particular, applications that perform real-time change detection over large numbers of XML/HTML documents require fast heuristic algorithms rather than algorithms that compute minimal-cost edit scripts. Existing heuristic algorithms execute quickly but do not produce satisfactory edit scripts. In this paper, we present the existing algorithms XyDiff and X-tree Diff, analyze their problems, and propose X-tree Diff+, which addresses them. X-tree Diff+ has execution time similar to that of the existing algorithms, but it improves the matching ratio between the nodes of two documents by refining the matching process based on the notion of efficiency of matching.

Latent Keyphrase Extraction Using LDA Model (LDA 모델을 이용한 잠재 키워드 추출)

  • Cho, Taemin;Lee, Jee-Hyong
    • Journal of the Korean Institute of Intelligent Systems / v.25 no.2 / pp.180-185 / 2015
  • As the number of document resources continues to increase, automatically extracting keyphrases from a document has become a major issue. However, most previous work has tried to extract keyphrases from the words in a document and has therefore overlooked latent keyphrases, which do not appear in the document. Although latent keyphrases do not appear in documents, they can play an important role in text summarization and information retrieval because they convey meaningful concepts or contents of the documents. They also account for more than one fourth of all keyphrases in real-world datasets, and they can be used for short texts such as SNS posts, which rarely contain explicit keyphrases. In this paper, we propose a new approach that selects candidate keyphrases from the keyphrases of neighbor documents that are similar to the given document and evaluates the importance of each candidate using the individual words in the candidate. Experimental results show that latent keyphrases can be extracted at a reasonable level.
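
The candidate-selection step described above (taking keyphrases from similar neighbor documents and keeping those absent from the given text) might be sketched as follows. This is a simplified stand-in, not the paper's method: bag-of-words cosine similarity replaces the LDA model, candidates are scored by a similarity-weighted vote rather than by their individual words, and the corpus, field names, and function names are invented for illustration.

```python
import math
from collections import Counter

def bag(text):
    """Lower-cased bag-of-words representation of a text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def latent_candidates(text, corpus, k=2):
    """Rank keyphrases of the k most similar documents that do not
    occur in `text` (assumption: similarity-weighted voting)."""
    query = bag(text)
    scored = [(cosine(query, bag(doc["text"])), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter()
    for sim, doc in scored[:k]:
        for phrase in doc["keyphrases"]:
            if phrase.lower() not in text.lower():
                votes[phrase] += sim
    return votes.most_common()

# Hypothetical corpus of documents with known keyphrases.
corpus = [
    {"text": "topic models cluster documents by latent topics",
     "keyphrases": ["topic modeling", "document clustering"]},
    {"text": "neural networks classify images",
     "keyphrases": ["deep learning"]},
    {"text": "latent topics summarize large document collections",
     "keyphrases": ["topic modeling", "text summarization"]},
]
ranked = latent_candidates("we group documents using latent topics", corpus)
```

In this toy corpus the top-ranked candidate is "topic modeling", a phrase that never appears in the input text but is shared by both of its nearest neighbors.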

The Classification arranged from Protectorate period to the early Japanese Colonial rule period : for Official Documents during the period from Kabo Reform to The Great Han Empire - Focusing on Classification Stamp and Warehouse Number Stamp - (통감부~일제 초기 갑오개혁과 대한제국기 공문서의 분류 - 분류도장·창고번호도장을 중심으로 -)

  • Park, Sung-Joon
    • The Korean Journal of Archival Studies / no.22 / pp.115-155 / 2009
  • When Korea was annexed by Japan, the official documents from the Kabo Reform and Great Han Empire periods were handed over to the Government-General of Chosun and reclassified from a section-based to a ministry-based scheme. However, they had already been reclassified many times before. Traces of these reclassifications can be found in the classification stamps and warehouse number stamps that remain on the covers of official documents from the Kabo Reform to the Great Han Empire. The documents were classified by Section within the Ministry-Department-Section classification system, then stamped and numbered. This is consistent with the official document classification system of the Great Han Empire and shows that section-based classification was maintained. Although the documents were stamped and numbered by Section, the sub-classification system differed from Section to Section. In the documents of the Land Tax Section, many institutions can be found: documents of the same year appear in different groups, while documents of similar character are classified in the same group. The Customs Section and the Other Tax Section seem to have numbered their documents according to the year of the documents; however, the years and the order of the 'i-ro-ha (イロハ) song' do not match. From the Kabo Reform to the Great Han Empire, documents were grouped by Section, but there were no classification rules for the sub-units of a Section. Therefore, it is not clear whether the document grouping shown by the classification stamps can be understood as the original order of the official document classification system of the Great Han Empire. Nevertheless, given that the grouping method reflects the document classification system, the sub-section classification system of the Great Han Empire can be inferred from it. By this inference, the classification system was divided into two types: 'Section - Counterpart Institution' and 'Section - Document Issuance Year'.
The Government-General of Chosun took over the official documents of the Great Han Empire, stored them in warehouses, and marked them with warehouse number stamps. A warehouse number stamp recorded the institution that had grouped the documents, and the documents were stored by warehouse. Although most of the documents on the shelves in each warehouse were arranged by classification stamp number, some were mixed, and the order of the shelves did not match that of the documents. Although the documents were arranged on the shelves and given symbols in the order of the 'i-ro-ha (イロハ) song', these symbols were not assigned in numerical order. The classification system reflected in the classification stamps was thus disturbed while the documents were stored by the Government-General of Chosun. One characteristic of the warehouse number stamps is that the preservation period of each document group lost its meaning. A preservation period is decided according to historical and administrative value. However, the warehouse number stamps did not distinguish documents by preservation period, and documents with different preservation periods were placed on the same shelf. After Japan annexed Korea, the Government-General of Chosun did not regard the official documents of the Great Han Empire as administrative documents to be disposed of after some time; it regarded them as reference materials on past practice needed for colonial governance. As the meaning of the documents changed from general administrative documents to materials needed to govern the colony, all the official documents of the Great Han Empire were treated alike regardless of preservation period.
The Government-General of Chosun destroyed the classification system of the Great Han Empire, which was based on Sections and the functions within each Section, by reclassifying the official documents from the Kabo Reform and Great Han Empire periods by Ministry in order to use them to govern the colony.

WebDBs : A User oriented Web Search Engine (WebDBs: 사용자 중심의 웹 검색 엔진)

  • 김홍일;임해철
    • The Journal of Korean Institute of Communications and Information Sciences / v.24 no.7B / pp.1331-1341 / 1999
  • This paper proposes WebDBs (Web Database system), which retrieves information published on the web using a query language similar to SQL. The proposed system automatically extracts the information needed for retrieval from HTML documents dispersed across the web, and it can process SQL-based queries over the extracted information. A web database system spends most of its query processing time fetching documents over the network. Because most web retrieval exhibits web locality, previously retrieved information is stored in a cache and reused by similar applications. We propose a cache mechanism adapted to user applications that stores cached information together with the query that retrieved it. A web search engine is implemented based on these concepts.


Query Term Expansion and Reweighting using Term-Distribution Similarity (용어 분포 유사도를 이용한 질의 용어 확장 및 가중치 재산정)

  • Kim, Ju-Youn;Kim, Byeong-Man;Park, Hyuk-Ro
    • Journal of KIISE:Databases / v.27 no.1 / pp.90-100 / 2000
  • In this paper, we propose a new query expansion technique with term reweighting. All terms in the documents fed back by a user, excluding stopwords, are selected as candidate terms for query expansion and are reweighted using a relevance degree calculated from the term-distribution similarity between a candidate term and each term in the initial query. The term-distribution similarity of two terms measures how similar their occurrence distributions in the relevant documents are. The terms actually used for expansion are selected by relevance degree and combined with the initial query to construct an expanded query. We use KT-set 1.0 and KT-set 2.0 to evaluate performance, comparing our method, in terms of recall and precision, with two others: one with no relevance feedback and the Dec-Hi method, which is similar to ours.
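
One way to picture term-distribution similarity is to represent each term by its frequency vector across the feedback documents and compare those vectors. The sketch below does this under explicit assumptions: cosine similarity is used as the similarity measure, the relevance degree is taken as the mean similarity to the query terms, and the documents and function names are made up. The abstract does not give the paper's actual formulas, and the final reweighting of the expanded query is not shown.

```python
import math
from collections import Counter

def term_vectors(docs):
    """Map each term to its frequency vector across the documents."""
    counts = [Counter(d.lower().split()) for d in docs]
    vocab = set().union(*counts)
    return {t: [c[t] for c in counts] for t in vocab}

def cosine(u, v):
    """Cosine similarity between two equal-length frequency vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def expansion_terms(query_terms, docs, stopwords, top_n=2):
    """Rank candidate terms by mean similarity to the query terms
    (assumed definition of the relevance degree)."""
    vecs = term_vectors(docs)
    scores = {}
    for term, vec in vecs.items():
        if term in stopwords or term in query_terms:
            continue
        sims = [cosine(vec, vecs[q]) for q in query_terms if q in vecs]
        if sims:
            scores[term] = sum(sims) / len(sims)
    # Sort by descending score, then alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Hypothetical feedback documents.
feedback = ["xml index xml path", "xml index query", "image retrieval color"]
expanded = expansion_terms(["xml"], feedback, stopwords={"the"})
```

Here "index" and "path" rank highest because they occur in the same documents as "xml", while the terms from the image-retrieval document score zero.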


A Study on the Visual Representation of TREC Text Documents in the Construction of Digital Library (디지털도서관 구축과정에서 TREC 텍스트 문서의 시각적 표현에 관한 연구)

  • Jeong, Ki-Tai;Park, Il-Jong
    • Journal of the Korean Society for Information Management / v.21 no.3 / pp.1-14 / 2004
  • Visualization of documents helps users when they search for similar documents. All research in information retrieval addresses the problem of a user with an information need facing a data source that contains an acceptable solution to that need. In various contexts, adequate solutions to this problem have included alphabetized cubbyholes housing papyrus rolls, microfilm registers, card catalogs, and inverted files coded onto discs. Many information retrieval systems rely on the use of a document surrogate. Though they might be surprised to discover it, nearly every information seeker uses an array of document surrogates. Summaries, tables of contents, abstracts, reviews, and MARC records are all document surrogates. That is, they stand in for a document, allowing a user to make some decision regarding it: whether to retrieve a book from the stacks, whether to read an entire article, and so on. In this paper, another type of document surrogate is investigated using a grouping method of term lists. Using the Multidimensional Scaling (MDS) method, those surrogates are visualized on a two-dimensional graph. The distances between dots on the two-dimensional graph represent the similarity of the documents: the closer the dots, the more similar the documents.

A reuse recommendation framework of artifacts based on task similarity to improve R&D performance (연구개발 생산성 향상을 위한 태스크 유사도 기반 산출물 재사용 추천 프레임워크)

  • Nam, Seungwoo;Daneth, Horn;Hong, Jang-Eui
    • Journal of Convergence for Information Technology / v.9 no.2 / pp.23-33 / 2019
  • Research and development (R&D) activities consist of analytical surveys and the writing of state-of-the-art reports on technical information. As R&D activities become more concrete, researchers often refer to related technical documents created in previous steps or in previous similar projects. This paper proposes a research-task based reuse recommendation framework (RTRF), a reuse recommendation system that enables researchers to reuse existing artifacts efficiently. In addition to the existing keyword-based retrieval and reuse, the proposed framework provides reusable information that researchers may need by recommending reusable artifacts based on task similarity; other developers who have a task similar to the researcher's work can also recommend reusable documents. A case study was performed to show the researchers' efficiency in writing a technology trend report by reusing existing documents. When reuse is performed with RTRF, documents from different stages or from other research fields are reused more frequently than when RTRF is not used. The RTRF may contribute to the efficient reuse of desired artifacts among the huge number of R&D documents stored in a repository.
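
Recommending artifacts by task similarity could, in its simplest form, look like the sketch below. Everything here is an assumption made for illustration: tasks are reduced to keyword sets, Jaccard similarity stands in for the framework's task-similarity measure (which the abstract does not define), and the task history, file names, and function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two keyword collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(task_keywords, history, top_n=3):
    """Return artifacts from past tasks, most similar task first."""
    best = {}
    for past in history:
        sim = jaccard(task_keywords, past["keywords"])
        if sim == 0:
            continue  # unrelated task, nothing to recommend
        for artifact in past["artifacts"]:
            best[artifact] = max(best.get(artifact, 0.0), sim)
    return sorted(best, key=lambda name: (-best[name], name))[:top_n]

# Hypothetical task history stored in an R&D repository.
history = [
    {"keywords": ["lidar", "survey", "trend"],
     "artifacts": ["lidar_trend_report.docx"]},
    {"keywords": ["battery", "patent", "survey"],
     "artifacts": ["battery_patent_list.xlsx"]},
    {"keywords": ["marketing"],
     "artifacts": ["campaign_plan.pptx"]},
]
suggested = recommend(["lidar", "survey", "report"], history)
```

Unlike a pure keyword search over document contents, this ranks artifacts by how close the past task is to the current one, so a document from another research field can still surface when the tasks overlap.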