• Title/Summary/Keyword: 색인화 (indexing)

A Big Data Analysis by Between-Cluster Information using k-Modes Clustering Algorithm (k-Modes 분할 알고리즘에 의한 군집의 상관정보 기반 빅데이터 분석)

  • Park, In-Kyoo
    • Journal of Digital Convergence
    • /
    • v.13 no.11
    • /
    • pp.157-164
    • /
    • 2015
  • This paper describes subspace clustering of categorical data for convergence and integration. Because conventional evaluation measures are designed for numerical data, they are likely to be limited on categorical data by the absence of ordering, high dimensionality, and sparse frequencies. Hence, a conditional entropy measure is proposed to evaluate the cohesion among attributes within each cluster. We propose a new objective function that reflects optimal clustering, so that within-cluster dispersion is minimized and between-cluster separation is enhanced. We performed experiments on five real-world datasets, comparing the performance of our algorithm with four algorithms using three evaluation metrics: accuracy, F-measure, and adjusted Rand index. According to the experiments, the proposed algorithm outperforms the algorithms considered in the evaluation on these metrics.
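
The abstract does not give the proposed objective function, so the following is only a minimal sketch of the standard k-modes building blocks that such a method starts from: the simple matching dissimilarity between categorical records, the per-attribute mode of a cluster, and one assignment step. The function names and the toy records are hypothetical, and the conditional entropy measure from the paper is not implemented here.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def cluster_mode(records):
    """Return the most frequent value of each attribute across the records."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


def assign(records, modes):
    """Assign each record to the index of its nearest mode."""
    return [
        min(range(len(modes)), key=lambda j: matching_dissimilarity(r, modes[j]))
        for r in records
    ]


# Hypothetical toy data: (colour, size) records.
records = [("red", "s"), ("red", "l"), ("blue", "s")]
mode = cluster_mode(records)
labels = assign([("red", "s"), ("blue", "l")], [("red", "s"), ("blue", "l")])
```

Alternating `assign` and `cluster_mode` until the labels stop changing gives the basic k-modes loop; a different objective would change how the dissimilarity is scored.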

Fast Hilbert R-tree Bulk-loading Scheme using GPGPU (GPGPU를 이용한 Hilbert R-tree 벌크로딩 고속화 기법)

  • Yang, Sidong;Choi, Wonik
    • Journal of KIISE
    • /
    • v.41 no.10
    • /
    • pp.792-798
    • /
    • 2014
  • In spatial databases, the R-tree is one of the most widely used indexing structures, and many variants have been proposed to improve its performance. Among these variants, the Hilbert R-tree is a representative method that uses the Hilbert curve to process large amounts of data without the costly split techniques otherwise needed to construct the R-tree. The Hilbert R-tree, however, is rarely applicable to large-scale applications in practice, mainly due to high pre-processing costs and slow bulk-loading time. To overcome these limitations, we propose a novel approach that parallelizes Hilbert mapping and thus accelerates bulk-loading of the Hilbert R-tree in GPU memory. The GPU-based Hilbert R-tree improves bulk-loading performance by applying the inversed-cell method and exploiting parallelism for packing the R-tree structure. Our experimental results show that the proposed scheme is up to 45 times faster than traditional CPU-based bulk-loading schemes.
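
As a rough illustration of what Hilbert-based bulk-loading involves, the sketch below maps grid cells to their position along a Hilbert curve, sorts points by that position, and packs the sorted run into leaf nodes. It is a sequential CPU sketch using the widely known coordinate-to-index conversion, not the paper's GPU scheme or its inversed-cell method; `pack_leaves`, the grid order, and the leaf capacity are assumptions made for the example.

```python
def hilbert_index(order, x, y):
    """Map cell (x, y) of a 2^order x 2^order grid to its Hilbert curve position."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is in canonical orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def pack_leaves(points, order, capacity):
    """Sort points by Hilbert index, then cut the sorted run into leaves.

    Each leaf is returned as (mbr, points), where mbr is
    (min_x, min_y, max_x, max_y).
    """
    ordered = sorted(points, key=lambda p: hilbert_index(order, p[0], p[1]))
    leaves = []
    for i in range(0, len(ordered), capacity):
        chunk = ordered[i:i + capacity]
        xs = [p[0] for p in chunk]
        ys = [p[1] for p in chunk]
        leaves.append(((min(xs), min(ys), max(xs), max(ys)), chunk))
    return leaves


leaves = pack_leaves([(0, 0), (3, 3), (0, 1), (1, 1)], order=2, capacity=2)
```

Each point's index depends only on its own coordinates, so the mapping step can be computed independently per point, which is the property a parallel implementation can exploit.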

A Method for Same Author Name Disambiguation in Domestic Academic Papers (국내 학술논문의 동명이인 저자명 식별을 위한 방법)

  • Shin, Daye;Yang, Kiduk
    • Journal of the Korean BIBLIA Society for library and Information Science
    • /
    • v.28 no.4
    • /
    • pp.301-319
    • /
    • 2017
  • The task of author name disambiguation involves identifying an author who appears under different names or different authors who share the same name. Author name disambiguation is important for correctly assessing authors' research achievements and finding experts in given areas, as well as for the effective operation of scholarly information services such as citation indexes. In this study, we performed error correction and normalization of data and applied rule-based author name disambiguation, comparing it with baseline machine learning disambiguation to see whether human intervention could improve machine learning performance. The improvement of over 0.1 in F-measure achieved by the corrected and normalized email-based author name disambiguation over machine learning demonstrates the potential of human pattern identification and inference, which enabled the data correction and normalization process as well as the formation of the rule-based disambiguation, to complement the weaknesses of machine learning and improve author name disambiguation results.

A study on the improving and constructing the content for the Sijo database in the Period of Modern Enlightenment (계몽기·근대시조 DB의 개선 및 콘텐츠화 방안 연구)

  • Chang, Chung-Soo
    • Sijohaknonchong
    • /
    • v.44
    • /
    • pp.105-138
    • /
    • 2016
  • Recently, the "XML Digital collection of Sijo Texts in the Period of Modern Enlightenment" database has been provided, with a research function, through the Korean Research Memory (http://www.krm.or.kr), laying the foundation for developing content from the Sijo texts of the Period of Modern Enlightenment. In this paper, I review the characteristics and problems of this digital collection, look for improvements, and explore ways to turn it into content. The primary significance of this database is that it integrates the vast body of Sijo from the Period of Modern Enlightenment, amounting to 12,500 pieces, and makes it viewable at a glance. It is also the first Sijo database to provide a variety of search features, by literature, poet name, work title, original text, period, and so on. However, the database has limits for verifying the overall picture of the Sijo of this period. Titles and original texts written in archaic words or Chinese characters cannot be searched, because no standardized modern-language text is provided. In addition, the works and individual Sijo works released after 1945 are missing from the database. Extracting data by poet is inconvenient because poets are recorded in various ways, such as by real name or pen name. To solve these problems and improve the utilization of the database, I propose providing standardized modern-language text, assigning index terms for content, providing information on the form of each work, and so on. Furthermore, if the Sijo database of the Period of Modern Enlightenment were built with the character of a Sijo Culture Information System, it could be connected to academic and educational content. As specific plans, I suggest the following: learning support materials for modern history and for the perception of the national territory in the modern age; source materials for studying indigenous animals and plants and for creating commercial characters; and use as a Sijo learning tool, such as a Sijo game.

A Design of Computerizing System for Record Management of ULJIN Nuclear Power Plant (울진원자력발전소(原子力發展所) 자료관리업무(資料管理業務) 전산화(電算化) 연구(硏究)(I))

  • Park, Yong-Boo
    • Journal of the Korean Society for information Management
    • /
    • v.4 no.1
    • /
    • pp.154-169
    • /
    • 1987
  • A computerized Record Management System (RMS) has been developed for the ULJIN Nuclear Power Plant in Korea on the basis of the existing manual system. Through review and analysis of the processing flow and project requirements, system logic was established for computerization, covering the receiving, logging, distribution, filing, and indexing and retrieval systems, as well as an output system for statistics and various reports. The structure of the master file was designed to include bibliographic data, transmittal data, distribution data, area data, and RMS data for plant operation. The RMS data were designed for construction and operation of the plant by adding index parameters for operation, such as system code, KEPCO No., component link code, and retention period, at the point of receiving. The RMS provides easy cross-referencing between the RMS and the Material Control System.

Word Vectorization Method Based on Bag of Characters (Bag of Characters를 응용한 단어의 벡터 표현 생성 방법)

  • Lee, Chanhee;Lee, Seolhwa;Lim, Heuiseok
    • Proceedings of The KACE
    • /
    • 2017.08a
    • /
    • pp.47-49
    • /
    • 2017
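  • When artificial neural network-based natural language processing systems convert words into vectors, there are two main approaches: using an index and a lookup table, or using a convolutional neural network or a recurrent neural network. The former requires a predefined vocabulary that the system can accept, and adding new words to the vocabulary is difficult. The latter needs no vocabulary because it generates vector representations from the characters that make up a word, but it requires an additional neural network structure, which increases model complexity and the number of parameters. To overcome the limitations of both approaches, this study proposes a method that applies Bag of Characters to generate vector representations from the set of characters that make up a word. Because the proposed method operates on characters, no vocabulary needs to be defined, and because no neural network structure is used, system complexity does not increase. In addition, because information about the characters of a word is reflected in its vector representation, performance on Out-Of-Vocabulary words is also expected to be better than that of vocabulary-based methods.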
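
The abstract does not specify how the character set is turned into a vector, so the following is a minimal sketch of one plausible reading: a fixed-length count vector over an alphabet. The alphabet (lowercase ASCII letters) and the function name `bag_of_characters` are assumptions for illustration, not the authors' formulation.

```python
import string

# Assumption: a lowercase ASCII alphabet; a real system would choose its own.
ALPHABET = string.ascii_lowercase


def bag_of_characters(word, alphabet=ALPHABET):
    """Return a count vector with one slot per alphabet character."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    vector = [0] * len(alphabet)
    for ch in word.lower():
        position = index.get(ch)
        if position is not None:  # characters outside the alphabet are skipped
            vector[position] += 1
    return vector


vec = bag_of_characters("abba")
```

No vocabulary is consulted, so any unseen word still receives a vector. One consequence of ignoring character order in this simple form is that anagrams such as "listen" and "silent" map to the same vector.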

A Study on automatic assignment of descriptors using machine learning (기계학습을 통한 디스크립터 자동부여에 관한 연구)

  • Kim, Pan-Jun
    • Journal of the Korean Society for information Management
    • /
    • v.23 no.1 s.59
    • /
    • pp.279-299
    • /
    • 2006
  • This study applies various machine learning approaches to automatically assigning descriptors to journal articles. After selecting core journals in the field of information science and organizing a test collection from articles of the past 11 years, the effectiveness of feature selection and the size of the training set were examined. Regarding feature selection, the Support Vector Machine (SVM) trained after reducing the feature set using $\chi^2$ statistics (CHI) and criteria that prefer high-frequency features (COS, GSS, JAC) performed the best. The size of the training set significantly influenced the performance of SVM and the Voted Perceptron (VTP), but had little effect on Naive Bayes (NB).
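
To make the $\chi^2$ feature selection step concrete, here is a minimal sketch that scores a term against a category from a 2x2 document-count table and keeps the top-scoring terms. It uses the common text-categorization form of the statistic; the function names, the count layout, and the toy counts are assumptions, and the study's exact setup is not given in the abstract.

```python
def chi_square(a, b, c, d):
    """Chi-square score of a term for a category from document counts.

    a: in category, contains term      b: not in category, contains term
    c: in category, lacks term         d: not in category, lacks term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator


def select_features(term_counts, k):
    """Return the k terms with the highest chi-square scores."""
    ranked = sorted(term_counts, key=lambda t: chi_square(*term_counts[t]), reverse=True)
    return ranked[:k]


# Hypothetical counts: "index" separates the category perfectly, "the" not at all.
counts = {"index": (10, 0, 0, 10), "the": (10, 10, 10, 10)}
selected = select_features(counts, k=1)
```

A term distributed evenly across categories scores 0 and is dropped first, which is how the feature set shrinks before the classifier is trained.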

Fast Scene Change Detection Algorithm in Compressed Video by a phased-approach Method (압축 비디오에서 단계적 접근방법에 의한 빠른 장면전환검출 알고리듬)

  • 이재승;천이진;윤정오
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.6 no.3
    • /
    • pp.115-122
    • /
    • 2001
  • Scene change detection is an important step for video indexing and retrieval. This paper proposes a phased algorithm for fast and accurate detection of abrupt scene changes in the MPEG compressed domain with minimal decoding requirements and computational effort. The proposed method compares two successive I-frames to locate a scene change occurring within a GOP and uses the macroblock coding type information contained in B-frames to detect the exact frame where the scene change occurred. The algorithm has the advantages of speed, simplicity, and accuracy. In addition, it requires less storage. The experimental results demonstrate that the proposed algorithm has better detection performance, in terms of precision and recall, than the existing method using all DC images.

E-mail Classification Using Dynamic Category Hierarchy and Automatic Generation of Category Label (분류 주제 자동 생성 및 동적분류체계 방법을 이용한 이메일 분류)

  • Ahn, C.M.;Park, S.;Park, S.H.;Choi, B.K.;Lee, J.H.
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2004.04a
    • /
    • pp.439-441
    • /
    • 2004
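  • As email use becomes widespread, the volume of incoming mail keeps growing. This growth creates a need for methods that let users classify email more efficiently. However, current email classification research focuses mainly on binary classification that filters spam using rule-based, Bayesian, SVM, and similar techniques. Research on multi-class classification includes clustering-based methods, but these go no further than grouping mail by similarity. This paper proposes a new automatic multi-class email classification method that combines an algorithm for automatically generating category labels based on vector-model similarity with a dynamic category hierarchy method. The proposed method classifies email automatically, supports index search and directory search over the classified results, and can manage large volumes of mail efficiently. In addition, by allowing messages to be dynamically reclassified, it improves recall in directory search.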
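
The abstract refers to similarity in the vector model without naming a measure. A common choice is cosine similarity over term-frequency vectors, sketched below under that assumption; the whitespace tokenization and the function name are illustrative only and do not reproduce the paper's label-generation algorithm.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between whitespace-tokenized term-frequency vectors."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


same = cosine_similarity("meeting agenda", "meeting agenda")
disjoint = cosine_similarity("meeting agenda", "invoice payment")
```

Messages with overlapping terms score closer to 1 and messages with no shared terms score 0, which is the kind of signal a similarity-based grouping or labeling step would work from.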

An Efficient Audio Indexing Scheme based on User Query Patterns (사용자 질의 패턴을 이용한 효율적인 오디오 색인기법)

  • 노승민;박동문;황인준
    • Journal of KIISE:Databases
    • /
    • v.31 no.4
    • /
    • pp.341-351
    • /
    • 2004
  • With the popularity of digital audio content, querying and retrieving audio content efficiently from databases has become essential. In this paper, we propose a new indexing scheme for retrieving audio content efficiently using audio portions that have been queried frequently. This scheme is based on the observation that users tend to memorize and query a small number of audio portions. Detecting and indexing such portions enables fast retrieval and shows better performance than sequential search-based audio retrieval. Moreover, this scheme is independent of the underlying retrieval system, which means it can work together with any other audio retrieval system. We have implemented a prototype system and demonstrated its performance gain through experiments.