• Title/Summary/Keyword: Retrieval systems

Design and Implementation of Multiple Filter Distributed Deduplication System Applying Cuckoo Filter Similarity (쿠쿠 필터 유사도를 적용한 다중 필터 분산 중복 제거 시스템 설계 및 구현)

  • Kim, Yeong-A;Kim, Gea-Hee;Kim, Hyun-Ju;Kim, Chang-Geun
    • Journal of Convergence for Information Technology / v.10 no.10 / pp.1-8 / 2020
  • As data generated by enterprise business activities has become central to business success, the need for techniques to store, manage, and retrieve such alternative data has grown. Existing big data platforms must ingest large volumes of real-time unstructured data without delay and must manage storage space efficiently by deduplicating redundant data across different storage systems. In this paper, we propose a multi-layer distributed deduplication system that applies similarity based on the Cuckoo hashing filter technique, taking the characteristics of big data into account. Similarity between virtual machines is computed with Cuckoo hashing, individual storage nodes improve performance through efficient deduplication, and a multi-layer Cuckoo filter is applied to reduce processing time. Experimental results show that the proposed method shortens processing time by 8.9% and increases the deduplication rate by 10.3%.
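
The authors' implementation is not included in the abstract; the following is a minimal, illustrative Python sketch (class, function, and parameter names are my own, and the multi-layer and distributed aspects are omitted) of a Cuckoo filter used to skip duplicate data chunks.

```python
import hashlib
import random

class CuckooFilter:
    """Minimal Cuckoo filter: stores short fingerprints in one of two candidate buckets."""

    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500):
        self.num_buckets = num_buckets      # power of two keeps the alt-index invariant
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _fingerprint(self, item):
        return hashlib.sha1(item.encode()).hexdigest()[:8]

    def _index(self, item):
        return int(hashlib.md5(item.encode()).hexdigest(), 16) % self.num_buckets

    def _alt_index(self, index, fp):
        # Partial-key cuckoo hashing: the alternate bucket is derived from the fingerprint.
        return (index ^ self._index(fp)) % self.num_buckets

    def insert(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets full: evict an existing fingerprint and relocate it.
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            victim = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][victim] = self.buckets[i][victim], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # filter is considered full

    def contains(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]


def deduplicate(chunks, filt):
    """Keep only chunks whose fingerprints are not already present in the filter."""
    unique = []
    for chunk in chunks:
        if not filt.contains(chunk):
            filt.insert(chunk)
            unique.append(chunk)
    return unique

print(deduplicate(["a", "b", "a", "c", "b"], CuckooFilter()))  # ['a', 'b', 'c']
```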

A Study on the Development of Electronic Resource Management System in a University Library (대학도서관 전자자원관리시스템(ERMS) 구축에 관한 연구)

  • Kim, Yong;Cho, Su-Kyeong
    • Journal of the Korean Society for Library and Information Science / v.44 no.4 / pp.249-276 / 2010
  • With the rapid growth of information technology and the Internet, the volume of information published in electronic formats such as video, audio, and digitized text, and the number of users accessing information online to satisfy their information needs, are growing at a tremendous rate. This study analyzes standardized components for constructing an ERMS and proposes an ERMS model based on the results of the analysis. The main functions of an ERMS in a university library are: 1) managing and controlling access information, metadata, holdings, and user resources for various electronic resources, while remaining compatible with existing library systems such as information retrieval (IR), linking, and proxy systems; 2) full compatibility with acquisition and cataloging systems for effective, integrated information organization and library budget management; 3) systematic and effective management of license information for electronic resources; 4) an effective environment for the use and access control of electronic resources and an integrated tool for managing all electronic resources. Additionally, this study points out the need to organize committees, like DLF ERMI, to establish standardized rules and collaborative management of electronic resources among university libraries, and to redesign library organizations and librarians' job descriptions.

Efficient Management of Statistical Information of Keywords on E-Catalogs (전자 카탈로그에 대한 효율적인 색인어 통계 정보 관리 방법)

  • Lee, Dong-Joo;Hwang, In-Beom;Lee, Sang-Goo
    • The Journal of Society for e-Business Studies / v.14 no.4 / pp.1-17 / 2009
  • E-catalogs, which describe products or services, are among the most important data in electronic commerce. They are created, updated, and removed to keep the e-catalog database up to date. However, as the number of catalogs grows, data integrity is violated for several reasons, such as catalog duplication and incorrect classification. Catalog search, duplication checking, and automatic classification are therefore important functions for utilizing e-catalogs and preserving the integrity of the e-catalog database. To implement these functions, probabilistic models based on statistics of index words extracted from e-catalogs have been proposed, and their feasibility has been shown in several papers. However, even though these functions are used together in an e-catalog management system, little consideration has been given to how the common data used by each function should be shared and how index-word statistics should be managed effectively. In this paper, we present a method that implements all three functions using plain SQL supported by a relational database management system. In addition, we use materialized views to reduce the application-side load of maintaining index-word statistics, letting the database management system optimize how the statistics are updated. An empirical evaluation shows that the method is feasible for implementing the three functions and effective for managing index-word statistics.
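
The paper's actual schema is not given; as a rough sketch of the idea, the snippet below maintains index-word statistics through an aggregating view in SQLite (table, column, and view names are assumptions). SQLite only has plain views; in PostgreSQL a CREATE MATERIALIZED VIEW would play the caching role the abstract describes.

```python
import sqlite3

# Minimal sketch: index words per catalog, plus a view that aggregates the
# statistics (document frequency, total term frequency) shared by the
# search, duplication-checking, and classification functions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE catalog_index_word (
        catalog_id INTEGER,
        word       TEXT,
        tf         INTEGER              -- term frequency of the word in this catalog
    );
    CREATE VIEW word_stats AS
        SELECT word,
               COUNT(DISTINCT catalog_id) AS doc_freq,
               SUM(tf)                    AS total_tf
        FROM catalog_index_word
        GROUP BY word;
""")

rows = [(1, "notebook", 3), (1, "battery", 1), (2, "notebook", 2), (2, "charger", 1)]
conn.executemany("INSERT INTO catalog_index_word VALUES (?, ?, ?)", rows)

for word, df, tf in conn.execute("SELECT * FROM word_stats ORDER BY word"):
    print(f"{word}: doc_freq={df}, total_tf={tf}")
```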

Information System Evaluation using IPA Method (IPA 기법을 활용한 정보시스템 평가)

  • Park, Minsoo
    • The Journal of the Convergence on Culture Technology / v.6 no.3 / pp.431-436 / 2020
  • Information service organizations that provide science and technology (S&T) information, which has a relatively short life cycle, whether free or paid, need to reflect rapidly changing user needs and behaviors and to adopt the latest technologies. The purpose of this study is to derive improvements for each system by comparing and analyzing users' overall perceptions of domestic and foreign S&T information sites and the importance of individual S&T information attributes. A total of 816 S&T information users participated in an online survey, and the collected data were analyzed with quantitative methods including the Importance-Performance Analysis (IPA) technique, where importance was measured by impact values derived from regression analysis. The analysis showed that overall user perception was relatively high for national S&T information services, and also high for Google Scholar and ScienceDirect; Google Scholar was found to have more strengths than areas needing improvement. A better understanding of users' preferred systems is a strong driver for improving the shortcomings of existing systems. In particular, the information retrieval of S&T information service systems, i.e., search speed and functionality, and the user interface need to be improved for greater convenience and usability.
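
IPA reduces to a quadrant analysis of importance against performance; the Python sketch below is purely illustrative (attributes, ratings, and the toy regression data are invented) of how importance taken from regression coefficients, as in the abstract, can be combined with mean performance scores to place attributes in IPA quadrants.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data only: per-respondent performance ratings for three
# hypothetical attributes and an overall satisfaction score.
attributes = ["search speed", "search functions", "usability"]
X = np.array([[5, 4, 3], [4, 4, 4], [3, 5, 4], [5, 3, 5], [2, 4, 3]], dtype=float)
overall = np.array([4, 4, 4, 5, 3], dtype=float)

# "Importance" as the regression impact of each attribute on overall satisfaction.
importance = LinearRegression().fit(X, overall).coef_
performance = X.mean(axis=0)

# IPA quadrants: compare each attribute against the mean importance/performance.
imp_mid, perf_mid = importance.mean(), performance.mean()
quadrants = {
    (True, True): "Keep up the good work",
    (True, False): "Concentrate here",
    (False, True): "Possible overkill",
    (False, False): "Low priority",
}
for name, imp, perf in zip(attributes, importance, performance):
    label = quadrants[(bool(imp >= imp_mid), bool(perf >= perf_mid))]
    print(f"{name}: importance={imp:.2f}, performance={perf:.2f} -> {label}")
```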

Text Filtering using Iterative Boosting Algorithms (반복적 부스팅 학습을 이용한 문서 여과)

  • Hahn, Sang-Youn;Zang, Byoung-Tak
    • Journal of KIISE: Software and Applications / v.29 no.4 / pp.270-277 / 2002
  • Text filtering is the task of deciding whether a document is relevant to a specified topic. As the Internet and the Web become widespread and the number of documents delivered by e-mail grows explosively, the importance of text filtering increases as well. The aim of this paper is to improve the accuracy of text filtering systems by using machine learning techniques. We apply AdaBoost algorithms to the filtering task. An AdaBoost algorithm generates and combines a series of simple hypotheses, each of which decides the relevance of a document to a topic on the basis of whether the document includes a certain word. We begin with an existing AdaBoost algorithm whose weak hypotheses output 1 or -1. We then extend it to weak hypotheses with real-valued outputs, which were recently proposed to improve error reduction rates and final filtering performance. Next, we attempt to improve AdaBoost's performance further by setting the initial weights randomly according to a continuous Poisson distribution, running AdaBoost, repeating these steps several times, and then combining all the hypotheses learned. This mitigates the overfitting that may occur when learning from a small amount of data. Experiments were performed on the real document collection used in TREC-8, a well-established text retrieval contest; the dataset includes Financial Times articles from 1992 to 1994. The experimental results show that AdaBoost with real-valued hypotheses outperforms AdaBoost with binary-valued hypotheses, and that AdaBoost iterated with random weights further improves filtering accuracy. Comparisons with all participants of the TREC-8 filtering task are also provided.
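
The paper's real-valued and randomly re-weighted variants are not reproduced here; as a rough reference point, this is a minimal sketch of the baseline the abstract starts from: discrete AdaBoost whose weak hypotheses test word presence (the function names and toy data are assumptions).

```python
import math

def adaboost_word_stumps(docs, labels, vocab, rounds=10):
    """Discrete AdaBoost where each weak hypothesis is 'word present -> +1, else -1'."""
    n = len(docs)
    w = [1.0 / n] * n                      # example weights
    ensemble = []                          # list of (alpha, word)
    for _ in range(rounds):
        # Pick the word whose stump has the lowest weighted error.
        best_word, best_err = None, None
        for word in vocab:
            err = sum(wi for doc, y, wi in zip(docs, labels, w)
                      if (1 if word in doc else -1) != y)
            if best_err is None or err < best_err:
                best_word, best_err = word, err
        best_err = min(max(best_err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - best_err) / best_err)
        ensemble.append((alpha, best_word))
        # Re-weight: misclassified documents gain weight, then normalize.
        for i, (doc, y) in enumerate(zip(docs, labels)):
            h = 1 if best_word in doc else -1
            w[i] *= math.exp(-alpha * y * h)
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def classify(ensemble, doc):
    score = sum(alpha * (1 if word in doc else -1) for alpha, word in ensemble)
    return 1 if score >= 0 else -1

# Toy usage: relevant documents mention "finance".
docs = [{"finance", "market"}, {"finance"}, {"sports"}, {"weather", "sports"}]
labels = [1, 1, -1, -1]
model = adaboost_word_stumps(docs, labels, {"finance", "market", "sports", "weather"}, rounds=5)
print(classify(model, {"finance", "news"}))   # expected: 1
```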

Design and Implementation of Automatic Linking Support System for Efficient Generating and Retrieving Integrated Documents Based on Web (웹 통합문서의 효율적 생성과 검색을 위한 자동링크지원 시스템의 설계 및 구축)

  • Lee, Won-Jung;Jung, Eun-Jae;Joo, Su-Chong;Lee, Seung-Yong
    • The KIPS Transactions: Part A / v.10A no.2 / pp.93-100 / 2003
  • With the advent of distributed computing and Web service technologies, many users require services that can conveniently obtain and assemble well-organized information on the Web. For this reason, we construct an Automatic Linking Support System that generates Web-based integrated documents and supports their retrieval according to users' various requirements. The system is organized as a client/server architecture. The server environment consists of an automatic linking engine providing lexical analysis, query processing, and integrated document generation, together with databases of dictionaries, images, and URL contents. The client environment consists of a Web editor that generates integrated documents and a Web helper that retrieves them through the automatic linking engine and the databases. For user-friendly client interfaces, the Web editor and helper can be downloaded from the server and executed directly, without prior installation on the client. To reduce server overhead, parts of the server's executable modules are distributed to, and executed on, the clients. For the implementation we used JDK 1.3 and Swing for the user interfaces (Web editor and helper), the RMI mechanism for client/server interaction, and SQL Server 7.0 for the databases. Finally, we show the access procedures of the automatic document linking engine and databases from the Web editor and Web helper, and the results displayed on their screens.
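
The linking engine itself is not available, and the original system was written in Java with RMI; the sketch below only illustrates, in Python and with an invented dictionary, the core idea of automatic linking: replacing dictionary terms found in a document with hyperlinks to their registered URLs.

```python
import re
import html

# Hypothetical dictionary of terms and target URLs (illustrative only).
link_dict = {
    "RMI": "https://example.org/rmi",
    "Web editor": "https://example.org/web-editor",
}

def auto_link(text, dictionary):
    """Wrap every known dictionary term in the text with an HTML hyperlink."""
    # Longest terms first, so "Web editor" wins over a hypothetical shorter term "Web".
    pattern = "|".join(re.escape(t) for t in sorted(dictionary, key=len, reverse=True))
    def repl(match):
        term = match.group(0)
        return f'<a href="{dictionary[term]}">{html.escape(term)}</a>'
    return re.sub(pattern, repl, text)

print(auto_link("The Web editor talks to the server over RMI.", link_dict))
```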

Optimizing Similarity Threshold and Coverage of CBR (사례기반추론의 유사 임계치 및 커버리지 최적화)

  • Ahn, Hyunchul
    • KIPS Transactions on Software and Data Engineering / v.2 no.8 / pp.535-542 / 2013
  • Since case-based reasoning (CBR) has many advantages, it has been used to support decision making in various areas, including medical checkups, production planning, and customer classification. However, several factors must be set heuristically when designing an effective CBR system. Among these, this study addresses the selection of appropriate neighbors in the case retrieval step. As the selection criterion, conventional studies have used either a preset number of neighbors to combine (i.e., the k of k-nearest neighbors) or a relative proportion of the maximum similarity. This study instead proposes using an absolute similarity threshold, varying from 0 to 1, as the criterion for selecting the neighbors to combine. In this setting, a threshold that is set too strictly may leave some cases without qualifying neighbors, so the model rarely produces a solution. To avoid this, we adopt coverage, the ratio of cases for which a solution is produced to the total number of training cases, as a constraint when optimizing the similarity threshold. To validate the proposed model, we applied it to a real-world target marketing case of an online shopping mall in Korea and found that it can significantly improve the performance of CBR.
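
A minimal sketch of the retrieval step described above, under stated assumptions: cosine similarity, a leave-one-out estimate of coverage, and synthetic data (all names are invented). Neighbors are selected by an absolute similarity threshold, and the threshold is swept subject to a coverage constraint.

```python
import numpy as np

def retrieve(query, cases, labels, threshold):
    """Majority label among cases whose cosine similarity to the query is >= threshold."""
    sims = cases @ query / (np.linalg.norm(cases, axis=1) * np.linalg.norm(query))
    neighbors = labels[sims >= threshold]
    if neighbors.size == 0:
        return None                        # no sufficiently similar case: no solution
    values, counts = np.unique(neighbors, return_counts=True)
    return values[np.argmax(counts)]

def coverage_and_accuracy(cases, labels, threshold):
    """Leave-one-out evaluation: coverage = fraction of cases that receive a solution."""
    solved, correct = 0, 0
    for i in range(len(cases)):
        mask = np.arange(len(cases)) != i
        pred = retrieve(cases[i], cases[mask], labels[mask], threshold)
        if pred is not None:
            solved += 1
            correct += int(pred == labels[i])
    coverage = solved / len(cases)
    accuracy = correct / solved if solved else 0.0
    return coverage, accuracy

# Toy data; pick the most accurate threshold that still satisfies coverage >= 0.9.
rng = np.random.default_rng(0)
X = rng.random((60, 5))
y = (X[:, 0] > 0.5).astype(int)
best = max((t for t in np.linspace(0.5, 0.99, 20)
            if coverage_and_accuracy(X, y, t)[0] >= 0.9),
           key=lambda t: coverage_and_accuracy(X, y, t)[1], default=None)
print("best threshold under coverage >= 0.9:", best)
```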

Developmental Disability Diagnosis and Assessment System Implementation Using a Multimedia Authoring Tool (멀티미디어 저작도구를 이용한 발달장애 진단.평가 시스템 구현연구)

  • Byun, Sang-Hea;Lee, Jae-Hyun
    • Asia-Pacific Journal of Business Venturing and Entrepreneurship / v.3 no.1 / pp.57-72 / 2008
  • In this paper, we construct a developmental disability diagnosis and assessment system that combines specialists' views with computer-based processing of diagnosis and assessment data. Such a system must continuously supply the specialized domain information that specialists hold, and it requires multimedia data processing that is specialized and refined for developmental disability classification, diagnosis, and decision making. The system studied in this paper provides quick feedback on results, reduces recording and calculation errors, shortens examination time, and can be used easily by non-specialists with little training. However, the multimedia information essential to building the system has many kinds of attributes; describing all of the diagnosis and assessment information manually involves a great amount of work, and descriptions of the same data can differ depending on who manages them. To address these problems, we applied content-based retrieval technology, which searches for related information by the content of the target data, in developing the multimedia data processing for diagnosis and assessment data. A typical approach to supporting fast image search is to extract data features as N-dimensional vectors, store them in the database as N-dimensional values, and use tree-based index structures to search for related data. However, such structures do not fit the purpose of developmental disability diagnosis and assessment well, because they were developed for application fields that use low-dimensional data, such as spatial databases and geographic information systems. Therefore, we studied a new storage structure and index mechanism that supports fast search over large volumes of diagnostic data.
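
The paper's own storage structure is not described in the abstract; as a rough illustration of the conventional approach it contrasts itself with, extracting N-dimensional feature vectors and searching them through a tree index, here is a short Python sketch using SciPy's k-d tree with random, purely illustrative data.

```python
import numpy as np
from scipy.spatial import cKDTree

# Illustrative only: N-dimensional feature vectors for a set of multimedia items,
# indexed with a k-d tree and queried for the nearest items to a query vector.
rng = np.random.default_rng(42)
features = rng.random((1000, 16))          # 1000 items, 16-dimensional features
tree = cKDTree(features)

query = rng.random(16)
distances, indices = tree.query(query, k=5)   # 5 most similar items
print("nearest item ids:", indices)
print("distances:", np.round(distances, 3))
```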

Implementation and Verification of Dynamic Search Ranking Model for Information Search Tasks: The Evaluation of Users' Relevance Judgement Model (정보 검색 과제별 동적 검색 랭킹 모델 구현 및 검증: 사용자 중심 적합성 판단 모형 평가를 중심으로)

  • Park, Jung-Ah;Sohn, Young-Woo
    • Science of Emotion and Sensibility / v.15 no.3 / pp.367-380 / 2012
  • The purpose of this research was to implement and verify an information retrieval (IR) system based on users' relevance criteria for different information search tasks. To this end, we implemented an IR system with a dynamic ranking model, in which users' relevance criteria vary with the type of search task, and evaluated it in a user experiment. Forty-five participants performed three information search tasks, fact finding, problem solving, and decision making, on IR systems with both a static and a dynamic ranking model, and rated the top five search results on a 7-point Likert scale of relevance. We observed that the system with the dynamic ranking model returned more relevant results than the system with the static ranking model. This research is significant for designing IR systems for information search tasks, for testing the validity of a user-oriented relevance judgement model by implementing an IR system for actual search tasks, and for connecting user research to the improvement of IR systems.
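
The paper's ranking model is not specified in the abstract; the sketch below is a hypothetical illustration of a dynamic ranking model in which relevance criteria are weighted differently per search-task type (the criteria, weights, and document scores are all invented).

```python
# Hypothetical illustration: each search-task type weights per-document
# relevance criteria differently, and documents are re-ranked by the
# weighted score for the current task.
TASK_WEIGHTS = {
    "fact_finding":    {"topicality": 0.6, "authority": 0.3, "recency": 0.1},
    "problem_solving": {"topicality": 0.4, "authority": 0.2, "depth": 0.4},
    "decision_making": {"topicality": 0.3, "authority": 0.4, "diversity": 0.3},
}

def dynamic_rank(docs, task):
    """Rank documents by the criterion weights associated with the given task type."""
    weights = TASK_WEIGHTS[task]
    score = lambda doc: sum(w * doc["criteria"].get(c, 0.0) for c, w in weights.items())
    return sorted(docs, key=score, reverse=True)

docs = [
    {"id": "d1", "criteria": {"topicality": 0.9, "authority": 0.2, "recency": 0.8}},
    {"id": "d2", "criteria": {"topicality": 0.6, "authority": 0.9, "diversity": 0.7}},
    {"id": "d3", "criteria": {"topicality": 0.7, "authority": 0.5, "depth": 0.9}},
]
for task in TASK_WEIGHTS:
    print(task, [d["id"] for d in dynamic_rank(docs, task)])
```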

Detecting Errors in a POS-Tagged Corpus Using XGBoost and Cross-Validation (XGBoost와 교차검증을 이용한 품사부착말뭉치에서의 오류 탐지)

  • Choi, Min-Seok;Kim, Chang-Hyun;Park, Ho-Min;Cheon, Min-Ah;Yoon, Ho;Namgoong, Young;Kim, Jae-Kyun;Kim, Jae-Hoon
    • KIPS Transactions on Software and Data Engineering / v.9 no.7 / pp.221-228 / 2020
  • A part-of-speech (POS) tagged corpus is a collection of electronic text in which each word is annotated with its corresponding POS tag; it is widely used as training data for natural language processing. Such training data is generally assumed to be error-free, but in reality it includes various types of errors, which degrade the performance of systems trained on it. To alleviate this problem, we propose a novel method for detecting errors in an existing POS-tagged corpus using an XGBoost classifier and cross-validation. We first train a POS-tagging classifier on the POS-tagged corpus, which contains some errors, and then look for errors using cross-validation; however, the classifier cannot detect errors directly, because there is no training data labeled with POS-tagging errors. We therefore detect errors by comparing the classifier's outputs (POS probabilities) against the annotated tags while adjusting hyperparameters. The hyperparameters are estimated on a small error-tagged corpus, sampled from the POS-tagged corpus and annotated with POS errors by experts. We use recall and precision, metrics widely used in information retrieval, for evaluation. Because not all detected errors can be manually checked, we show the validity of the proposed method by comparing the distributions of the sample (the error-tagged corpus) and the population (the POS-tagged corpus). In the near future, we will apply the proposed method to a dependency-tagged corpus and a semantic-role-tagged corpus.
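
A minimal sketch of the detection idea under stated assumptions: out-of-fold POS probabilities are obtained with cross-validation, and tokens whose annotated tag receives low probability are flagged. The features, tags, and threshold here are synthetic; in the paper, the threshold-like hyperparameters are tuned on a small expert-checked, error-tagged sample.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from xgboost import XGBClassifier

# Toy data: each row is a token represented by numeric context features;
# `noisy` is the (possibly erroneous) tag id taken from the corpus.
rng = np.random.default_rng(0)
X = rng.random((500, 8))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)          # "true" tags
noisy = y.copy()
flip = rng.choice(len(y), size=25, replace=False)  # simulate annotation errors
noisy[flip] = 1 - noisy[flip]

# Out-of-fold tag probabilities via cross-validation, so each token is scored
# by a model that did not see it during training.
clf = XGBClassifier(n_estimators=100, max_depth=3, eval_metric="logloss")
proba = cross_val_predict(clf, X, noisy, cv=5, method="predict_proba")

# Flag tokens whose annotated tag receives low predicted probability.
threshold = 0.2
suspect = proba[np.arange(len(noisy)), noisy] < threshold
print("flagged tokens:", suspect.sum())
print("flagged tokens that are real errors:", np.isin(np.where(suspect)[0], flip).sum())
```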