• Title/Summary/Keyword: Data Items (데이터 항목)

An Efficient Method for Mining Frequent Patterns based on Weighted Support over Data Streams (데이터 스트림에서 가중치 지지도 기반 빈발 패턴 추출 방법)

  • Kim, Young-Hee;Kim, Won-Young;Kim, Ung-Mo
    • Journal of the Korea Academia-Industrial cooperation Society / v.10 no.8 / pp.1998-2004 / 2009
  • Recently, due to technical developments in various storage devices and networks, the amount of data is increasing rapidly. The large volume of data streams poses unique space and time constraints on the data mining process, and the continuous nature of streaming data necessitates algorithms that require only one scan over the stream for knowledge discovery. Most support-based research is concerned only with frequent itemsets and ignores infrequent itemsets even when they are crucial. In this paper, we propose an efficient method, WSFI-Mine (Weighted Support Frequent Itemsets Mine), to mine all frequent itemsets in one scan of the data stream. The method can also discover closed frequent itemsets using a DCT (Data Stream Closed Pattern Tree). We compare the performance of our algorithm with DSM-FI and THUI-Mine under different minimum supports. The results show that WSFI-Mine not only runs significantly faster but also consumes less memory.
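
The abstract does not spell out the weighted-support formula, so the sketch below uses a common definition from the weighted frequent-itemset literature: the mean weight of an itemset's members multiplied by its ordinary support, computed in a single pass. The item weights, toy stream, and threshold are all assumptions, not values from the paper.

```python
# Weighted support: wsup(X) = mean(weight of items in X) * support(X).
from itertools import combinations
from collections import defaultdict

weights = {"a": 0.9, "b": 0.4, "c": 0.7}           # hypothetical item weights
stream  = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}]

counts = defaultdict(int)
for transaction in stream:                          # one scan over the stream
    for k in range(1, len(transaction) + 1):
        for itemset in combinations(sorted(transaction), k):
            counts[itemset] += 1

def weighted_support(itemset):
    mean_w = sum(weights[i] for i in itemset) / len(itemset)
    return mean_w * counts[itemset] / len(stream)

min_wsup = 0.3                                      # assumed threshold
print({X: round(weighted_support(X), 3)
       for X in counts if weighted_support(X) >= min_wsup})
```

The brute-force subset enumeration above is only workable on toy data; a compact structure such as the paper's DCT is what makes a genuine single-scan streaming setting feasible.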

Finding Frequent Itemsets Over Data Streams in Confined Memory Space (한정된 메모리 공간에서 데이터 스트림의 빈발항목 최적화 방법)

  • Kim, Min-Jung;Shin, Se-Jung;Lee, Won-Suk
    • The KIPS Transactions:PartD / v.15D no.6 / pp.741-754 / 2008
  • Due to the characteristics of a data stream, it is very important to confine the memory usage of a data mining process regardless of the amount of information generated in the stream. For this purpose, this paper proposes the prime pattern tree (PPT) for finding frequent itemsets over data streams within a confined memory space. Unlike a prefix tree, a node of a PPT can maintain the information needed to estimate the current supports of several itemsets together. The number of itemsets summarized in a prime pattern, and hence the total number of nodes, is controlled by a split threshold split_delta $S_{\delta}$, which determines both the size and the accuracy of the PPT. A smaller $S_{\delta}$ gives better accuracy, since a large $S_{\delta}$ forces the frequencies of many itemsets to be estimated jointly, so the trade-off between the size of a PPT and the accuracy of the mining result must be considered. Based on this characteristic, the size and the accuracy of the PPT can be flexibly controlled by merging or splitting nodes during the mining process. To find all frequent itemsets over a data stream, this paper proposes the PPT as a replacement for the prefix tree in the previously proposed estDec method, which makes it efficient to optimize memory usage when finding frequent itemsets in a confined memory space. Finally, the performance of the proposed method is analyzed through a series of experiments to identify its various characteristics.
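
The abstract does not define the PPT node layout, so the toy class below only illustrates the trade-off it describes: one node summarizes several itemsets with a shared count range, and a split (spending extra nodes to regain exact counts) is triggered once the range exceeds split_delta. The layout and update rule are assumptions, not the paper's data structure.

```python
# One node estimates several itemsets' supports together; the gap between
# cnt_min and cnt_max is the estimation error the node tolerates.
class PrimeNode:
    def __init__(self, itemsets):
        self.itemsets = set(itemsets)   # itemsets summarized by this node
        self.cnt_min = 0                # count certain for every member
        self.cnt_max = 0                # count possible for some member

    def update(self, transaction):
        hits = [X for X in self.itemsets if X <= transaction]
        if hits:
            self.cnt_max += 1           # at least one member occurred
            if len(hits) == len(self.itemsets):
                self.cnt_min += 1       # every member occurred

    def should_split(self, split_delta):
        # A large gap means the shared estimate is too coarse.
        return self.cnt_max - self.cnt_min > split_delta

node = PrimeNode([frozenset({"a"}), frozenset({"a", "b"})])
for t in [{"a"}, {"a", "b"}, {"a"}]:
    node.update(t)
print(node.cnt_min, node.cnt_max, node.should_split(1))   # 1 3 True
```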

Construction of Metadata Format and Ontology for Religious architecture heritage Information (종교유적 건축물 정보의 메타데이터 구성과 온톨로지 구축)

  • Chung, Heesun;Kim, Heesoon;Song, Hyun-Sook;Lee, Myeong-Hee
    • Journal of Korean Library and Information Science Society / v.44 no.1 / pp.5-26 / 2013
  • Although organizing standardized metadata is important for the effective management of cultural heritage information, current metadata are represented differently according to the properties of the resources or the objectives of the organizations in which they are accumulated. This research compared 6 different metadata formats and created 18 data elements for constructing databases. A religious architecture heritage information database was constructed from 72 historic religious buildings, each described in three parts. An ontology for religious architecture heritage information was designed using a revised CIDOC-CRM and developed with a semi-automated corpus program.
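
As a small illustration of what one ontology entry might look like, here is a hedged rdflib sketch using CIDOC-CRM classes; the URIs, property choices, and identifiers are invented for illustration and are not the paper's 18 elements.

```python
# Describe one heritage building with CIDOC-CRM vocabulary (hypothetical data).
from rdflib import Graph, Namespace, Literal, RDF

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
EX  = Namespace("http://example.org/heritage/")       # assumed local namespace

g = Graph()
g.bind("crm", CRM)

building = EX["temple_001"]                           # hypothetical identifier
g.add((building, RDF.type, CRM["E22_Man-Made_Object"]))
g.add((building, CRM["P1_is_identified_by"], Literal("Example Temple")))
g.add((building, CRM["P55_has_current_location"], EX["place_seoul"]))
print(g.serialize(format="turtle"))
```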

Search Method of the time sensitive frequent itemsets (시간에 따른 가변성을 고려한 상대적인 빈발항목 탐색방법)

  • Park, Tae-Su;Lee, Ju-Hong;Park, Sun
    • Proceedings of the Korea Information Processing Society Conference / 2005.11a / pp.97-100 / 2005
  • With the recent growth of interest in ubiquitous computing and Internet services, there is an increasing demand to process the information embedded in large volumes of data quickly and to create new knowledge from it. Existing research that uses data mining techniques to find frequent itemsets in a data stream does not guarantee accuracy, because it finds frequent itemsets by simple counting without considering time. This paper therefore proposes a new algorithm for finding relative frequent itemsets in a data stream that takes the temporal aspect into account. The performance of the proposed algorithm is verified through various experiments.
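
One standard way to make frequency time-sensitive, which the abstract gestures at without giving details, is to decay old counts exponentially so that support is measured relative to recent history. The decay factor and toy stream below are assumptions; this is a generic sketch, not necessarily the paper's algorithm.

```python
# Exponentially decayed ("relative") support over a stream of transactions.
from collections import defaultdict

decay = 0.9                        # per-transaction decay factor (assumed)
total = 0.0                        # decayed number of transactions seen
score = defaultdict(float)         # decayed occurrence count per item

def update(transaction):
    global total
    total = total * decay + 1.0
    for item in score:
        score[item] *= decay       # age every existing count first
    for item in transaction:
        score[item] += 1.0         # then credit the new transaction

for t in [{"a", "b"}, {"a"}, {"b", "c"}, {"c"}, {"c"}]:
    update(t)

# Recent items ("c") now outweigh older ones ("a") in relative support.
print({i: round(s / total, 3) for i, s in score.items()})
```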

The Taxonomy of Dirty Data for MPEG-2 TS (MPEG-2 표준을 위한 오류 데이터 분류)

  • 곽태희;최병주
    • Proceedings of the Korean Information Science Society Conference / 2001.04a / pp.691-693 / 2001
  • DASE (Digital TV Application Software Environment) is an international standard for data broadcasting that processes data in the MPEG-2 TS (Moving Picture Experts Group-2 Transport Stream) format. Because a DASE system exposes only its input data specification rather than its source code, injecting errors into test data is the appropriate way to test it for faults, and this requires a taxonomy of errors for the MPEG-2 standard. In this paper, we develop such an error taxonomy for the MPEG-2 standard based on the dirty data taxonomy of Kim et al. for relational databases. The taxonomy can be usefully applied to fault-injection testing of DASE systems.
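
Fault-injection testing of this kind starts from well-formed TS packets and corrupts one field at a time. The sketch below dirties a packet's sync byte, one plausible error category: MPEG-2 TS packets are indeed 188 bytes long and begin with the sync byte 0x47, but the choice of corruption and the helper name are assumptions.

```python
# Inject a sync-byte error into one packet of an MPEG-2 transport stream.
import random

PACKET_SIZE = 188                  # fixed TS packet length per the standard
SYNC_BYTE = 0x47                   # every packet must start with this byte

def inject_sync_error(ts_data: bytes, packet_index: int) -> bytes:
    """Return a copy of the stream with one packet's sync byte corrupted."""
    data = bytearray(ts_data)
    offset = packet_index * PACKET_SIZE
    assert data[offset] == SYNC_BYTE, "not aligned on a TS packet boundary"
    data[offset] = random.choice([b for b in range(256) if b != SYNC_BYTE])
    return bytes(data)

# Build a toy stream of three well-formed packets, then dirty the second one.
clean = bytes([SYNC_BYTE] + [0x00] * (PACKET_SIZE - 1)) * 3
dirty = inject_sync_error(clean, 1)
print(hex(dirty[PACKET_SIZE]))     # no longer 0x47
```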

Finding Frequent Itemsets based on Open Data Mining in Data Streams (데이터 스트림에서 개방 데이터 마이닝 기반의 빈발항목 탐색)

  • Chang, Joong-Hyuk;Lee, Won-Suk
    • The KIPS Transactions:PartD / v.10D no.3 / pp.447-458 / 2003
  • The basic assumption of conventional data mining methodology is that the data set of a knowledge discovery process should be fixed and available before the process can proceed. Consequently, this assumption is valid only when the static knowledge embedded in a specific data set is the target of data mining. In addition, a conventional mining method requires considerable computing time to produce a result from a large data set. For these reasons, it is almost impossible to apply such methods to real-time analysis of a data stream, where new transactions are generated continuously and the up-to-date mining result, including the newly generated transactions, is needed as quickly as possible. In this paper, a new mining concept, open data mining in a data stream, is proposed for this purpose. In open data mining, whenever a transaction is newly generated, the updated mining result over all transactions, including the new one, is obtained instantly. To implement this mechanism efficiently, it is necessary to incorporate the delayed insertion of newly identified information from recent transactions as well as the pruning of insignificant information from the mining result of past transactions. The proposed algorithm is analyzed through a series of experiments in order to identify its various characteristics.
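
The two mechanisms the abstract names, delayed insertion and pruning, can be illustrated for single items (the real method tracks itemsets) in a few lines; the thresholds and the promotion rule below are assumptions.

```python
# Track items incrementally: insert late (so noise never enters the result)
# and prune items whose estimated support falls below a threshold.
from collections import defaultdict

INSERT_AFTER = 2      # delayed insertion: track an item after 2 sightings
PRUNE_BELOW = 0.2     # prune when estimated support drops under 20%

counts, candidates, seen = {}, defaultdict(int), 0

def process(transaction):
    global seen
    seen += 1
    for item in transaction:
        if item in counts:
            counts[item] += 1
        else:
            candidates[item] += 1
            if candidates[item] >= INSERT_AFTER:
                counts[item] = candidates.pop(item)    # promote
    for item in [i for i, c in counts.items() if c / seen < PRUNE_BELOW]:
        del counts[item]                               # prune stale items

for t in [{"a"}, {"a", "b"}, {"a", "b"}, {"a"}, {"a", "c"}]:
    process(t)
print(counts)                      # {'a': 5, 'b': 2}
```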

A study of the Health Data Application (보건 데이터 활용에 관한 연구(II))

  • Lim, Gi-Young;Cho, Eun-Hee
    • Proceedings of the Korea Information Processing Society Conference / 2001.10b / pp.1213-1216 / 2001
  • To analyze complex density distributions, for which assumptions such as normality are difficult to make, without prior knowledge of the data, we present a method that classifies data with many items and a complex density distribution into groups with fewer, simpler density distributions; the data can be grouped either by samples or by items. By using Parzen-window estimation together with a log-likelihood evaluation function, even density distributions with complex shapes can be analyzed without prior knowledge. To represent the density distributions of samples and items, we show how to expand them as sums and products of multiple density distributions, and applying the proposed method to artificially generated data yielded classification results consistent with the original density distributions.
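
The estimator the abstract relies on is the Parzen window with a log-likelihood evaluation function; a minimal one-dimensional Gaussian-kernel version is sketched below. The bandwidth and the toy samples are assumptions.

```python
# Parzen-window density estimate: average of kernels centered on each sample.
import math

def parzen_density(x, samples, h=0.5):
    norm = 1.0 / (math.sqrt(2 * math.pi) * h)
    return sum(norm * math.exp(-((x - s) ** 2) / (2 * h * h))
               for s in samples) / len(samples)

def log_likelihood(data, model_samples, h=0.5):
    """Evaluation function: how well the estimate explains the data."""
    return sum(math.log(parzen_density(x, model_samples, h)) for x in data)

samples = [0.1, 0.3, 0.2, 2.9, 3.1, 3.0]    # two-cluster toy data
print(round(parzen_density(0.2, samples), 3))
print(round(log_likelihood([0.0, 3.0], samples), 3))
```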

Extracting and Validating Metadata in Electronic Records (전자기록물의 메타데이터 추출 및 비교 검증 기술 연구)

  • Choi, Joo Ho;Lee, Jae Young
    • Journal of Korean Society of Archives and Records Management / v.12 no.1 / pp.7-32 / 2012
  • When electronic records are migrated, it is important to validate the required metadata of the records against the metadata embedded in the documents themselves. This paper presents a method, and implements a tool, to extract data from files in various formats and use them to validate the metadata associated with those files in electronic records. Unlike other metadata extraction tools, most of which were developed abroad, the tool takes into account the standard document forms used by the Korean government and extracts metadata from the content of the files. The tool then compares the extracted data against the encapsulated metadata for validation.
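
The validation step, comparing metadata extracted from a file's content against the metadata encapsulated with the record, reduces to a field-by-field comparison. The field names and dictionaries below are hypothetical; the actual extraction from Korean government document formats is the hard part the paper addresses.

```python
# Compare extracted metadata with the record's encapsulated metadata.
def validate_metadata(extracted: dict, encapsulated: dict) -> dict:
    """Return a per-field verdict: match, mismatch, or missing."""
    report = {}
    for field, declared in encapsulated.items():
        found = extracted.get(field)
        if found is None:
            report[field] = "missing from file content"
        elif found == declared:
            report[field] = "match"
        else:
            report[field] = f"mismatch: file says {found!r}"
    return report

extracted    = {"title": "2011 Budget Plan", "author": "Ministry A"}
encapsulated = {"title": "2011 Budget Plan", "author": "Ministry B",
                "date": "2011-03-02"}
print(validate_metadata(extracted, encapsulated))
```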

A Study on the Necessary Factors to Establish for Public Institutions Big Data System (공공기관 빅데이터 시스템 구축 시 고려해야 할 측정항목에 관한 연구)

  • Lee, Gwang-Su;Kwon, Jungin
    • Journal of Digital Convergence / v.19 no.10 / pp.143-149 / 2021
  • As the need to establish big data systems for the rapid provision of big data and the efficient management of resources has emerged with the rapid transition to a hyper-connected, intelligent information society, public institutions are pushing to establish big data systems. This study therefore analyzed and combined the success factors identified in big data-related studies, together with aspects specific to big data in public institutions, based on measurements of environmental factors for establishing an integrated information system for higher-education institutions. In addition, 19 measurement items reflecting big data characteristics were derived from big data experts using brainstorming and Delphi methods, and a plan for applying them successfully to public institutions that want to build big data systems was proposed. We hope that these results will serve as a foundation for the successful establishment of big data systems in public institutions.

A MultiDatabase Clustering using Distance of Itemsets (항목집합의 거리를 이용한 다중데이터베이스 클러스터링)

  • Kim, Jin-Hyun;Park, Sung-Lyeon;Youn, Sung-Dae
    • Proceedings of the Korea Information Processing Society Conference / 2003.05c / pp.1567-1570 / 2003
  • Ideal & Goodness is an existing preprocessing technique for mining multidatabases composed of market-basket data, but it has the drawback that it cannot discriminate between databases containing similar items. The technique proposed in this paper therefore measures the distance between databases by building sets composed only of items; then, to improve the discrimination between itemsets, it builds item data sets that pair each item with its support, computes probabilities over the supports, and compares them to compute weights. This paper proposes a clustering technique that can be used as a preprocessing step for market-basket analysis, and a performance evaluation shows that it discriminates well between databases.
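
The abstract's distance combines item overlap with a support-based weighting; one plausible formulation, assumed here rather than taken from the paper, averages the Jaccard distance between the databases' item sets with the mean support gap on shared items.

```python
# Distance between two market-basket databases from items and their supports.
def supports(db):
    items = {i for t in db for i in t}
    return {i: sum(i in t for t in db) / len(db) for i in items}

def db_distance(db1, db2):
    s1, s2 = supports(db1), supports(db2)
    union, shared = set(s1) | set(s2), set(s1) & set(s2)
    jaccard_dist = 1 - len(shared) / len(union)
    support_gap = (sum(abs(s1[i] - s2[i]) for i in shared) / len(shared)
                   if shared else 1.0)
    return (jaccard_dist + support_gap) / 2

db_a = [{"milk", "bread"}, {"milk"}, {"bread", "eggs"}]
db_b = [{"milk", "bread"}, {"milk", "bread"}, {"milk"}]
print(round(db_distance(db_a, db_b), 3))    # 0.25: fairly similar baskets
```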
