• Title/Summary/Keyword: Document searching

170 search results

A Design and Implementation of Ontology-based Retrieval System for the Electronic Records of Universities (대학 전자기록물을 위한 온톨로지 기반 검색시스템 설계 및 구현)

  • Lee, Jung-Hee; Kim, Hee-Sop
    • Journal of the Korean Society for Information Management, v.24 no.3, pp.343-362, 2007
  • The purpose of this study is to design and implement an ontology-based retrieval system for the electronic records of universities and to compare its performance with an existing keyword-based retrieval system. We used OntoStudio 1.4 to implement the ontology-based retrieval system, and the test collection consisted of the following: (1) 5,099 electronic records of the 'personnel management notification' created by Korea Maritime University, (2) 20 topics (10 short topics and 10 long topics), and (3) relevance assessments conducted by a group of human experts. Ten university staff members participated in the keyword-based searching experiment, using the same test collection as the ontology-based searching experiment. The ontology-based retrieval system outperformed the keyword-based retrieval system in terms of recall and precision, and the same result held in both the short-topic and long-topic comparisons.
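
As a minimal illustration of the evaluation used above, the sketch below computes precision and recall for one topic from hypothetical result lists and expert relevance judgments; the record ids are invented, not the paper's data.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one topic.

    retrieved: ids returned by a system; relevant: ids judged
    relevant by the human experts (as in the paper's test collection).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: compare two systems on the same topic.
relevant = {"rec01", "rec02", "rec03", "rec04"}
p_onto, r_onto = precision_recall({"rec01", "rec02", "rec03"}, relevant)
p_kw, r_kw = precision_recall({"rec01", "rec05", "rec06"}, relevant)
print(f"ontology: P={p_onto:.2f} R={r_onto:.2f}")  # P=1.00 R=0.75
print(f"keyword:  P={p_kw:.2f} R={r_kw:.2f}")      # P=0.33 R=0.25
```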

Constructing Domain Ontologies Using Japanese DODDLE and General Ontologies (일본어 DODDLE와 범용 온토로지를 이용한 도메인 온토로지의 구축 및 평가)

  • Hong, Yun-Ki; Yamaguchi, Takahira; Kim, Tai-Suk
    • Journal of Korea Multimedia Society, v.9 no.2, pp.226-233, 2006
  • With the advancement of the Internet, a vast amount of information overflows on the Web. When Internet users want to find the information they need, a retrieval system is essential, yet it is not easy for users to extract that information from the system's results. Various research activities have aimed at improving retrieval results. Although retrieval results can be improved with an ontology, constructing a Japanese domain ontology usually costs users a great deal. This paper discusses how to integrate search result refinement and domain ontology refinement using the domain ontology tool called Japanese DODDLE, and how to improve retrieval results using the constructed ontology. To demonstrate the effectiveness of the suggested methodology, a case study on rocket operation is performed, and it shows that the methodology is promising.
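
As a rough illustration of how a constructed domain ontology can refine retrieval, the sketch below expands a query term with narrower concepts from a toy is-a hierarchy; the concepts are invented for illustration, and this is not DODDLE's actual mechanism.

```python
# concept -> narrower concepts (is-a children); invented content
ONTOLOGY = {
    "rocket": ["launch vehicle", "booster"],
    "launch vehicle": ["H-IIA"],
}

def expand(term, depth=1):
    """Collect a term plus its narrower concepts up to `depth` levels."""
    terms = {term}
    if depth > 0:
        for child in ONTOLOGY.get(term, []):
            terms |= expand(child, depth - 1)
    return terms

print(expand("rocket", depth=2))
# contains: rocket, launch vehicle, booster, H-IIA
```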

Query Processing using Information of Parent Nodes in Partitioned Inverted Index Tables (분할된 역 인덱스 테이블에서 부모노드의 정보를 이용한 질의 처리)

  • Kim, Myung-Soo; Hwang, Byung-Yeon
    • Journal of Korea Multimedia Society, v.11 no.7, pp.905-913, 2008
  • Many heterogeneous XML documents are in wide use with the increasing adoption of XML, and research on data structures for more efficient document management has grown steadily in importance. We propose a query processing technique that uses parent node information in partitioned inverted index tables. The search efficiency over these heterogeneous documents is greatly influenced by the number of query-processing steps and the size of the target data sets, so considering these two factors is essential when designing a data structure. First, our technique stores each parent node's information in an inverted index table. Using this information, we can halve the number of query-processing steps. The size of the target data sets can also be lessened by using a partitioned inverted index table. XML documents collected from the Internet are used to demonstrate the new method, and its efficiency is compared with that of existing search methods.
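
A minimal sketch of the core idea, under assumed element names and a simplified posting format (not the paper's schema): each posting carries the parent label of the node containing the term, so a two-step path query such as "author/name" resolves in a single posting-list scan instead of two lookups joined afterwards.

```python
from collections import defaultdict

inverted = defaultdict(list)  # term -> [(doc_id, node_id, parent_label)]

def index_node(term, doc_id, node_id, parent_label):
    """Store the parent element's label alongside the posting."""
    inverted[term].append((doc_id, node_id, parent_label))

index_node("kim", "doc1", 7, "author")
index_node("kim", "doc2", 3, "editor")

def query(term, parent_label):
    """Resolve a two-step path query with one posting-list scan."""
    return [(d, n) for d, n, p in inverted[term] if p == parent_label]

print(query("kim", "author"))  # [('doc1', 7)]
```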

Improving the Accuracy of Document Classification by Learning Heterogeneity (이질성 학습을 통한 문서 분류의 정확성 향상 기법)

  • Wong, William Xiu Shun; Hyun, Yoonjin; Kim, Namgyu
    • Journal of Intelligence and Information Systems, v.24 no.3, pp.21-44, 2018
  • In recent years, the rapid development of Internet technology and the popularization of smart devices have produced massive amounts of text data, distributed through media platforms such as the World Wide Web, Internet news feeds, microblogs, and social media. However, this enormous amount of easily obtained information lacks organization, a problem that has drawn the interest of many researchers. Managing it requires techniques capable of classifying relevant information, and hence text classification was introduced. Text classification is a challenging task in modern data analysis, in which a text document must be assigned to one or more predefined categories or classes. Available techniques include K-Nearest Neighbor, the Naïve Bayes algorithm, Support Vector Machines, Decision Trees, and Artificial Neural Networks. However, when dealing with huge amounts of text data, model performance and accuracy become a challenge; depending on the vocabulary of the corpus and the features created for classification, the performance of a text classification model can vary. Most previous attempts proposed a new algorithm or modified an existing one, and such research has arguably reached its limits. In this study, rather than proposing or modifying an algorithm, we focus on modifying how the data are used. It is widely known that classifier performance is influenced by the quality of the training data on which the classifier is built. Real-world datasets usually contain noise, which can affect the decisions made by classifiers built from them. In this study, we observe that data from different domains, that is, heterogeneous data, have noise-like characteristics that can be exploited in the classification process. Machine learning algorithms build classifiers on the assumption that the characteristics of the training data and the target data are the same or very similar. However, for unstructured data such as text, the features are determined by the vocabulary of the documents; if the viewpoints of the training data and target data differ, so may their features. We therefore attempt to improve classification accuracy by strengthening the robustness of the document classifier through artificially injecting noise into its construction. Data coming from various sources are likely formatted differently, which causes difficulties for traditional machine learning algorithms, as they are not designed to handle different data representations at once and generalize over them together. To utilize heterogeneous data in the learning process of the document classifier, we apply semi-supervised learning. Because unlabeled data may degrade the performance of the document classifier, we further propose a method called the Rule Selection-Based Ensemble Semi-Supervised Learning Algorithm (RSESLA) to select only the documents that contribute to improving the classifier's accuracy. RSESLA creates multiple views by manipulating the features using different types of classification models and different types of heterogeneous data; the most confident classification rules are selected and applied for the final decision. In this paper, three types of real-world data sources were used: news, Twitter, and blogs.
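
A minimal self-training sketch in the spirit of the approach above, not the authors' RSESLA itself: an ensemble of two classifiers (two views) pseudo-labels unlabeled documents, and only confidently labeled ones are kept. The corpus, classifier choices, and threshold are all assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
import numpy as np

labeled = ["stock prices rose", "the team won the match"]
labels = ["finance", "sports"]
unlabeled = ["shares fell sharply", "a great goal was scored"]

vec = TfidfVectorizer()
X = vec.fit_transform(labeled + unlabeled)
X_lab, X_unlab = X[:2], X[2:]

# Two different model types act as different "views" of the data.
models = [LogisticRegression(max_iter=1000), MultinomialNB()]
probas = []
for m in models:
    m.fit(X_lab, labels)
    probas.append(m.predict_proba(X_unlab))
avg = np.mean(probas, axis=0)  # average the ensemble's confidences

THRESHOLD = 0.6  # keep only confident pseudo-labels (assumed value)
for i, row in enumerate(avg):
    if row.max() >= THRESHOLD:
        print(unlabeled[i], "->", models[0].classes_[row.argmax()])
```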

An Empirical Study on Improvement model for Measuring of Project Similarity (과제 유사도 측정 개선모형에 관한 실증적 연구)

  • Jung, Ok-Nam; Rhew, Sung-Yul; Kim, Jong-Bae
    • Journal of Digital Contents Society, v.12 no.4, pp.457-465, 2011
  • The annual R&D investment in Korea increased by an average of 12.2 percent over the last 5 years, so preventing duplicate projects has become an important factor in promoting the efficiency of R&D investment and the originality of R&D projects. When measuring project similarity, the measurement model used to estimate the accuracy of the similarity is crucial. In this paper, we propose an improved model for checking the similarity of R&D projects to promote the efficiency of R&D investment. The proposed model comprises two steps, sampling and analysis. In the sampling step, we add the abstracts of R&D reports to a document-vector-based search engine. In the analysis step, we measure project similarity using a research-title network consisting of compound keywords and item weights. The proposed method improved the accuracy of measuring project similarity by an average of 0.19 over the existing search engine and by 9.25 over simple keyword search on R&D projects. When searching for similarity with appended conditions and larger samples, it further improved the accuracy of measuring the similarity of R&D projects.
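
As a rough illustration of similarity over weighted keywords, the sketch below computes cosine similarity between two hypothetical projects; the keywords and weights are invented and much simpler than the paper's research-title network.

```python
import math

def cosine(a, b):
    """Cosine similarity between two keyword->weight dictionaries."""
    shared = set(a) & set(b)
    num = sum(a[k] * b[k] for k in shared)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

proj_a = {"sensor network": 0.8, "energy": 0.5, "routing": 0.3}
proj_b = {"sensor network": 0.7, "security": 0.6, "routing": 0.4}
print(f"similarity = {cosine(proj_a, proj_b):.3f}")
```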

A Search Method for Components Based-on XML Component Specification (XML 컴포넌트 명세서 기반의 컴포넌트 검색 기법)

  • Park, Seo-Young; Shin, Yoeng-Gil; Wu, Chi-Su
    • Journal of KIISE: Software and Applications, v.27 no.2, pp.180-192, 2000
  • Recently, component technology has played a main role in software reuse. It has changed code-based reuse into binary-code-based reuse, because components can easily be combined into software under development purely through component interfaces. Since components and component users have increased rapidly, users need to be able to find the most suitable components among the enormous number of components on the Internet. It is desirable to use web-document-typed specifications for component specifications on the Internet. This paper proposes using XML component specifications instead of HTML specifications, because HTML cannot represent the semantics of contexts. We also propose an XML context-search method based on XML component specifications. In their queries, component users give contexts for the component properties and terms for the values of those properties. The index structure for the context-based search method is an inverted file of term-context-component specification entries. Beyond the XML context-based search method itself, a variety of search methods built on it, such as keyword search, faceted search, and browsing, are provided for the convenience of users. The search engine uses a 3-layer architecture, with an interface layer, a query expansion layer, and an XML search engine layer, for an efficient index scheme. In this paper, an XML DTD (Document Type Definition) for component specification is defined, and experimental results comparing the search performance of XML with HTML are discussed.
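
A minimal sketch of a term-context-component inverted file, with invented component ids and context names: each posting is keyed by both the term and the XML context (component property) in which it occurs, so the same term in different contexts retrieves different components.

```python
from collections import defaultdict

index = defaultdict(set)  # (term, context) -> {component ids}

def index_spec(component_id, context, text):
    """Index the terms of one property (context) of a specification."""
    for term in text.lower().split():
        index[(term, context)].add(component_id)

index_spec("comp1", "function", "sort list")
index_spec("comp2", "environment", "sort list")

def context_search(term, context):
    """Require both the value term and the property context."""
    return index.get((term.lower(), context), set())

print(context_search("sort", "function"))     # {'comp1'}
print(context_search("sort", "environment"))  # {'comp2'}
```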

An Analysis on the Factors Affecting Online Search Effect (온라인 정보탐색의 효과변인 분석)

  • Kim, Sun-Ho
    • Journal of the Korean Society for Library and Information Science, v.22, pp.361-396, 1992
  • The purpose of this study is to verify the correlations between the amount of online searchers' search experience and their search effect. To achieve this purpose, 28 online searchers working at selected libraries and information centers participated as subjects. The subjects were classified into two cognitive styles by the Group Embedded Figures Test (GEFT): 15 Field Independence (FI) searchers and 13 Field Dependence (FD) searchers. A subject's search experience consists of 3 elements: disciplinary, training, and working experience. To gather data on these elements, a questionnaire was sent to the 28 subjects. An online search request form prepared by a practical user was sent to all subjects, who searched overseas databases through Dialog to retrieve what was requested. The resulting outputs were collected and sent back to the user, who individually evaluated the relevance and pertinence of the search effect. In this study, the search effect was divided into relevance and pertinence. Relevance was subdivided into 3 elements: the number of relevant documents, recall ratio, and cost per relevant document. Pertinence was subdivided into 3 elements: the number of pertinent documents, utility ratio, and cost per pertinent document. The correlations between the 3 elements of a subject's experience and the 6 elements of the search effect were analysed for the FI and FD searchers separately. At the 0.01 significance level, the findings and conclusions of the study are summarised as follows: 1. There are strong correlations between the amount of training and the recall ratio, the number of pertinent documents, and the utility ratio for FI searchers. 2. There are strong correlations between the amount of working experience and the number of relevant documents and the recall ratio for FD searchers. However, there is also a significant inverse correlation between the amount of working experience and the search cost per pertinent document for FD searchers. 3. For FD searchers, the amount of working experience correlates more strongly with the number of pertinent documents and the utility ratio than the amount of training does. 4. There is a strong correlation between the amount of training and pertinence for both FI and FD searchers.
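
As an illustration of the correlation analysis reported above, the sketch below computes a Pearson correlation between training experience and recall ratio; the data points are invented, not the study's measurements.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs)) * \
          math.sqrt(sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0

training_hours = [10, 40, 25, 60, 5]            # invented
recall_ratio   = [0.42, 0.71, 0.55, 0.80, 0.35]  # invented
print(f"r = {pearson(training_hours, recall_ratio):.2f}")
```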

Design and Implementation of code generator to remove the parameter boundary failure in the unit test (단위테스트 중 매개변수 경계오류제거를 위한 코드 자동생성 시스템 설계와 구현)

  • Park, Youngjo; Bang, Hyeja
    • Journal of Korea Society of Digital Industry and Information Management, v.11 no.2, pp.1-10, 2015
  • As programs become more complicated and are developed by many hands, the possibility of bugs in the code increases, and developers usually run unit tests to find such problems. Developers also struggle to keep the code stable when they must modify it frequently to meet client requirements. In the TDD (Test Driven Development) methodology, developers write a unit test first and then write program code that passes the test. The unit test must include boundary condition tests, because the possibility of bugs there is very high. When a test fails because a parameter value is incorrect, missing, out of range, or mismatched, the program code should return an error code or raise an exception. In this paper, a system is designed and implemented that automatically inserts the generated code, or suggests it to the developer, when a boundary condition test fails. In conclusion, developers can attain code stability through this system, which automatically searches the code and checks for omitted guards.
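
A minimal sketch of the idea, with an invented function and range: a boundary-condition unit test plus a generator for the kind of range-check guard such a system might insert or suggest to the developer.

```python
import unittest

def guard_snippet(param, low, high):
    """Generate a range-check guard for a parameter (the kind of code
    the paper's system would insert or suggest; format is assumed)."""
    return (f"if not ({low} <= {param} <= {high}):\n"
            f"    raise ValueError('{param} out of range [{low}, {high}]')")

def set_volume(level):           # invented function under test
    if not (0 <= level <= 100):  # guard of the generated form
        raise ValueError("level out of range [0, 100]")
    return level

class BoundaryTest(unittest.TestCase):
    def test_boundaries(self):
        self.assertEqual(set_volume(0), 0)      # lower bound
        self.assertEqual(set_volume(100), 100)  # upper bound
        with self.assertRaises(ValueError):
            set_volume(101)                     # just outside the range

print(guard_snippet("level", 0, 100))
unittest.main(argv=["boundary"], exit=False)
```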

Adaptive User Profile for Information Retrieval from the Web

  • Srinil, Phaitoon; Pinngern, Ouen
    • Institute of Control, Robotics and Systems: Conference Proceedings (제어로봇시스템학회:학술대회논문집), 2003.10a, pp.1986-1989, 2003
  • This paper proposes an improvement to information retrieval on the Web that uses the structure and hyperlinks of HTML documents along with a user profile. The method is based on the rationale that terms appearing in different structures of a document may have different significance in identifying the document. The method partitions the occurrences of terms in a document collection into six classes according to the tags in which the terms occur (such as Title, H1-H6, and Anchor). We use a genetic algorithm to determine class importance values and to expand the user query, and we also use these values in similarity computation and in updating the user profile. A genetic algorithm is then used again to select terms from the user profile to expand the original query. Lastly, the search engine uses the expanded query for searching, and the search engine's results are scored by similarity between each result and the user profile. The vector space model is used, and the weighting schemes of traditional information retrieval were extended to include class importance values. Test results show precision of up to 81.5%.
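
A minimal sketch of class-weighted term scoring as described above; the class importance values here are invented stand-ins for the weights the paper learns with a genetic algorithm.

```python
# Tag class -> importance weight (invented stand-ins for GA output).
CLASS_WEIGHT = {"title": 3.0, "h1-h6": 2.0, "anchor": 1.5, "body": 1.0}

def score(doc_occurrences, query_terms):
    """Score a document by summing class-weighted term counts.

    doc_occurrences: term -> {tag_class: count} for one document.
    """
    total = 0.0
    for term in query_terms:
        for tag_class, count in doc_occurrences.get(term, {}).items():
            total += CLASS_WEIGHT.get(tag_class, 1.0) * count
    return total

doc = {"retrieval": {"title": 1, "body": 4}, "web": {"anchor": 2}}
print(score(doc, ["retrieval", "web"]))  # 3.0 + 4.0 + 3.0 = 10.0
```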

Symmetric Searchable Encryption with Efficient Conjunctive Keyword Search

  • Jho, Nam-Su; Hong, Dowon
    • KSII Transactions on Internet and Information Systems (TIIS), v.7 no.5, pp.1328-1342, 2013
  • Searchable encryption is a cryptographic protocol for searching documents in encrypted databases. A simple searchable encryption protocol capable of using only one keyword at a time is very limited and cannot satisfy the demands of various applications. Thus, designing a searchable encryption with useful additional functions, for example conjunctive keyword search, is one of the most important goals. There have been many attempts to construct a searchable encryption with conjunctive keyword search. However, most previously proposed protocols are based on public-key cryptosystems, which require a large amount of computation. Moreover, their search cost depends on the number of documents stored in the database, making them unsuitable for extremely large data sets. In this paper, we propose a new searchable encryption protocol with conjunctive keyword search based on a linked tree structure instead of public-key techniques. The protocol requires remarkably little computation, particularly when applied to extremely large databases: the amount of computation in the search procedure depends on the number of documents matching the query rather than on the size of the entire database.
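
A minimal sketch in the spirit of the protocol above, not the authors' exact construction: keyword tokens are derived with a symmetric key, and conjunctive search intersects per-keyword posting sets, so cost tracks the number of matches rather than the size of the database. The key, keywords, and document ids are invented.

```python
import hashlib
from collections import defaultdict

SECRET_KEY = b"demo-key"  # shared symmetric key (illustrative only)

def token(keyword):
    """Deterministic keyed token; the server never sees the keyword."""
    return hashlib.sha256(SECRET_KEY + keyword.encode()).hexdigest()

index = defaultdict(set)  # token -> {encrypted document ids}

def add_document(doc_id, keywords):
    for kw in keywords:
        index[token(kw)].add(doc_id)

def conjunctive_search(keywords):
    """Intersect posting sets, starting from the smallest one, so the
    work is bounded by the matching documents, not the whole database."""
    sets = sorted((index.get(token(k), set()) for k in keywords), key=len)
    return set.intersection(*sets) if sets else set()

add_document("enc-doc-1", ["tax", "2013"])
add_document("enc-doc-2", ["tax", "2014"])
print(conjunctive_search(["tax", "2013"]))  # {'enc-doc-1'}
```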