• Title/Summary/Keyword: large data

Search Results: 14,238

A Hybrid Clustering Technique for Processing Large Data (대용량 데이터 처리를 위한 하이브리드형 클러스터링 기법)

  • Kim, Man-Sun;Lee, Sang-Yong
    • The KIPS Transactions:PartB / v.10B no.1 / pp.33-40 / 2003
  • Data mining plays an important role in the knowledge discovery process, and various data mining algorithms can be selected for a specific purpose. Most traditional hierarchical clustering methods are suited to small data sets and have difficulty handling large data sets because of limited resources and insufficient efficiency. In this study we propose a hybrid neural-network clustering technique, called PPC (Pre-Post Clustering), that can be applied to large data sets to find unknown patterns. PPC combines an artificial intelligence method, SOM, with a statistical method, hierarchical clustering, and clusters data in two stages. In the pre-clustering stage, PPC condenses large data sets using SOM. In the post-clustering stage, PPC measures similarity values based on cohesive distances, which capture the inner features of clusters, and adjacent distances, which capture the external distances between clusters. Finally, PPC clusters the large data sets using these similarity values. Experiments with UCI repository data showed that PPC achieved better cohesive values than other clustering techniques.
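The two-stage procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the SOM is a tiny 1-D batch SOM, the post-clustering step is plain single-linkage merging of the prototypes, and the cohesive/adjacent similarity measure of the paper is simplified to Euclidean distance.

```python
import numpy as np

def som_prototypes(data, n_units=4, epochs=10):
    """Pre-clustering: condense the data set into a few SOM prototypes
    (1-D map, batch updates, shrinking neighbourhood)."""
    step = max(1, len(data) // n_units)
    protos = data[::step][:n_units].astype(float)       # spread initial units over the data
    idx = np.arange(n_units)
    for e in range(epochs):
        sigma = 1.0 * (0.2 ** (e / max(1, epochs - 1)))  # neighbourhood width: 1.0 -> 0.2
        bmu = np.argmin(((data[:, None, :] - protos[None]) ** 2).sum(-1), axis=1)
        w = np.exp(-((idx[None, :] - bmu[:, None]) ** 2) / (2 * sigma ** 2))
        protos = (w.T @ data) / w.sum(0)[:, None]        # weighted mean per unit
    return protos

def merge_prototypes(protos, n_clusters=2):
    """Post-clustering: single-linkage agglomerative merging of prototypes."""
    clusters = [[i] for i in range(len(protos))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(protos[i] - protos[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))              # b > a, so pop is safe
    return clusters

# Two well-separated synthetic blobs stand in for a "large" data set.
rng = np.random.default_rng(7)
data = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
                  rng.normal(5.0, 0.3, (50, 2))])
protos = som_prototypes(data)
groups = merge_prototypes(protos, n_clusters=2)

def nearest_group(x):
    return min(range(len(groups)),
               key=lambda g: min(np.linalg.norm(x - protos[i]) for i in groups[g]))

labels = [nearest_group(x) for x in data]
```

The point of the two stages is that the expensive O(n²) hierarchical step runs only on the handful of prototypes, not on the full data set.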

Implementation of the Large-scale Data Signature System Using Hash Tree Replication Approach (해시 트리 기반의 대규모 데이터 서명 시스템 구현)

  • Park, Seung Kyu
    • Convergence Security Journal / v.18 no.1 / pp.19-31 / 2018
  • As ICT technologies advance, an unprecedentedly large amount of digital data is created, transferred, stored, and utilized in every industry. With the growth in data scale and the advancement of the technologies applied to it, new services emerging from the use of large-scale data make our lives more convenient and useful. But cybercrimes such as data forgery and falsification of data-generation times are also increasing. To secure data against such crimes, technologies for verifying data integrity and generation time are necessary. Today, public-key-based signature technology is the most commonly used, but the costly system resources and the additional infrastructure required to manage certificates and keys make it impractical in large-scale data environments. In this research, a new signature technology for large-scale data, based on hash functions and the Merkle tree and consuming far fewer system resources, is introduced. An improved method for processing distributed hash trees is also suggested to mitigate disruptions caused by server failures. A prototype system was implemented and its performance evaluated. The results show that the technology can be used effectively in areas that produce large-scale data, such as cloud computing, IoT, big data, and fin-tech.
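The core hash-tree idea behind such a signature system can be sketched with the standard library alone: sign many items by publishing one Merkle root, and prove any single item's integrity with a short audit path. This is a generic illustration of the technique, not the paper's distributed, replicated implementation.

```python
import hashlib

def sha(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def merkle_root_and_path(leaves, target):
    """Build the tree bottom-up; return (root, audit path for leaves[target]).
    The path is a list of (sibling_hash, sibling_is_right) pairs."""
    level = [sha(x) for x in leaves]
    pos, path = target, []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])          # duplicate last node on odd levels
        sib = pos ^ 1                        # sibling index within the pair
        path.append((level[sib], sib > pos))
        level = [sha(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        pos //= 2
    return level[0], path

def verify(leaf, path, root):
    """Recompute the root from one leaf and its audit path."""
    node = sha(leaf)
    for sibling, sibling_is_right in path:
        node = sha(node + sibling) if sibling_is_right else sha(sibling + node)
    return node == root

docs = [b"doc-%d" % i for i in range(5)]
root, path = merkle_root_and_path(docs, 2)
```

Verification touches only log2(n) hashes, which is why the approach scales to large data volumes without per-item public-key operations.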


Development of the design methodology for large-scale database based on MongoDB

  • Lee, Jun-Ho;Joo, Kyung-Soo
    • Journal of the Korea Society of Computer and Information / v.22 no.11 / pp.57-63 / 2017
  • The recent surge of big data is characterized by continuous generation, large volume, and unstructured formats. Existing relational database technologies are inadequate for such big data because of their limited processing speed and the significant cost of storage expansion. Thus, big data processing technologies, normally based on distributed file systems, distributed database management, and parallel processing, have arisen as core technologies for implementing big data repositories. In this paper, we propose a design methodology for large-scale databases based on MongoDB, extending the information engineering methodology based on the E-R data model.
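A central design decision when mapping an E-R model to MongoDB is whether the N-side of a 1:N relationship becomes embedded sub-documents or referenced documents. The sketch below illustrates both mappings with plain dictionaries; the Customer/Order entities and field names are hypothetical examples, not taken from the paper.

```python
def embed_one_to_many(parent, children, key):
    """Embedding: fold the child rows into the parent document.
    Suits data that is always read together with the parent."""
    doc = dict(parent)
    doc[key] = [dict(c) for c in children]
    return doc

def reference_one_to_many(children, fk_field, parent_id):
    """Referencing: keep children as separate documents that carry
    the parent's id, analogous to a relational foreign key."""
    return [dict(c, **{fk_field: parent_id}) for c in children]

# Hypothetical E-R fragment: Customer 1..N Order.
customer = {"_id": 1, "name": "Hong"}
orders = [{"order_no": 101, "total": 25000},
          {"order_no": 102, "total": 9900}]

customer_doc = embed_one_to_many(customer, orders, "orders")          # one document
order_docs = reference_one_to_many(orders, "customer_id", customer["_id"])  # separate collection
```

Embedding avoids joins at read time; referencing keeps documents small and lets the child collection grow independently.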

Design of Distributed Cloud System for Managing large-scale Genomic Data

  • Jang, Seine;Moon, Seok-Jae
    • International Journal of Internet, Broadcasting and Communication / v.16 no.2 / pp.119-126 / 2024
  • The volume of genomic data is constantly increasing in various modern industries and research fields. This growth presents new challenges and opportunities in terms of the quantity and diversity of genetic data. In this paper, we propose a distributed cloud system for integrating and managing large-scale gene databases. By introducing a distributed data storage and processing system based on the Hadoop Distributed File System (HDFS), various formats and sizes of genomic data can be efficiently integrated. Furthermore, by leveraging Spark on YARN, efficient management of distributed cloud computing tasks and optimal resource allocation are achieved. This establishes a foundation for the rapid processing and analysis of large-scale genomic data. Additionally, by utilizing BigQuery ML, machine learning models are developed to support genetic search and prediction, enabling researchers to more effectively utilize data. It is expected that this will contribute to driving innovative advancements in genetic research and applications.

Level Scale Interface Design for Real-Time Visualizing Large-Scale Data (대용량 자료 실시간 시각화를 위한 레벨 수준 표현 인터페이스 설계)

  • Lee, Do-Hoon
    • Journal of the Korea Society of Computer and Information / v.13 no.2 / pp.105-111 / 2008
  • Various visualization methods have been proposed according to input and output types. To show complex, large-scale raw data and information, LOD and special-region-scale methods have been used. In this paper, I propose a level scale interface for dynamic, interactive control of large-scale data such as bio-data. The method has not only the advantages of LOD and special-region scaling but also dynamic, real-time processing. In addition, it supports elaborate control from large scale down to small scale, so that a region can be visualized in detail. The proposed method was adopted in a genome relationship visualization tool and provided a reasonable control mechanism.
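The LOD idea of rendering coarsely at the overview level and in full detail inside a zoomed region can be sketched in a few lines. The function names and the fixed point budget are illustrative assumptions, not the paper's interface.

```python
def lod_sample(points, max_points):
    """Pick a stride so that at most max_points are drawn at this level."""
    stride = max(1, -(-len(points) // max_points))   # ceiling division
    return points[::stride]

def visible_slice(points, view_start, view_end, max_points):
    """Re-sample only the visible window: as the window shrinks,
    the stride shrinks too, so detail increases while the number
    of rendered points stays bounded."""
    return lod_sample(points[view_start:view_end], max_points)

data = list(range(10000))                       # stand-in for a large 1-D data track
overview = visible_slice(data, 0, 10000, 200)   # coarse level: every 50th point
detail = visible_slice(data, 4000, 4200, 200)   # fine level: every point in the region
```

Bounding the rendered point count per frame is what makes the interaction real-time regardless of the underlying data size.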


Efficient Quantitative Association Rules with Parallel Processing (병렬처리를 이용한 효율적인 수량 연관규칙)

  • Lee, Hye-Jung;Hong, Min;Park, Doo-Soon
    • Journal of Korea Multimedia Society / v.10 no.8 / pp.945-957 / 2007
  • Quantitative association rules apply binary association to data that have relatively strong quantitative attributes in a large database system. When the domain range of quantitative data that carries significant meaning for an association is too broad, the domain needs to be divided into proper intervals that satisfy the minimum support for the generation of large-interval items. The reliability of the formulated rules is strongly influenced by how the large-interval items are generated. Therefore, this paper proposes a new method to generate large-interval items efficiently. The proposed method does not lose any meaningful intervals compared to other existing methods, produces accurate large-interval items that are close to the minimum support, and minimizes the loss of data characteristics. In addition, since our method merges data where the frequency is high enough, it runs faster than other methods on broad quantitative domains. To verify the superiority of the proposed method, real national census data were used for the performance analysis, and a Clunix HPC system was used for the parallel processing.
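The interval-generation step the abstract describes, merging adjacent values until each interval just reaches the minimum support, can be sketched as follows. This is a simplified serial illustration (no parallelism), and the greedy left-to-right merge is an assumption, not the authors' exact algorithm.

```python
def large_intervals(freq, minsup):
    """freq: list of (value, count) pairs sorted by value.
    Merge adjacent values until each interval meets minsup, keeping
    each interval's support as close to minsup as possible."""
    total = sum(c for _, c in freq)
    intervals, lo, acc = [], None, 0
    for v, c in freq:
        if lo is None:
            lo = v
        acc += c
        if acc >= minsup * total:                # interval just reached minsup
            intervals.append((lo, v, acc))
            lo, acc = None, 0
    if lo is not None and intervals:             # fold a short tail into the last interval
        l, _, a = intervals.pop()
        intervals.append((l, freq[-1][0], a + acc))
    return intervals

# Hypothetical quantitative attribute (e.g. age) with its value counts.
ages = [(20, 5), (21, 3), (22, 2), (23, 10), (24, 1), (25, 4)]
iv = large_intervals(ages, 0.3)   # minimum support 30%
```

Each emitted triple is (interval start, interval end, support count); every interval satisfies the minimum support, so all of them qualify as large-interval items.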


Implementation of Hardware RAID and LVM-based Large Volume Storage on Global Data Center System of International GNSS Service

  • Lee, Dae-Kyu;Cho, Sung-Ki;Park, Jong-Uk;Park, Pil-Ho
    • Institute of Control, Robotics and Systems: Conference Proceedings / 2005.06a / pp.1553-1557 / 2005
  • High performance and reliability of storage systems that handle very large amounts of data have become very important. Many techniques have been applied in various application systems to establish very-large-capacity storage that satisfies the requirements of high I/O speed and protection against physical or logical failure. We applied RAID and LVM to construct a storage system for the global data center, which needs a very reliable large-capacity storage system. The storage system was successfully established and deployed on an up-to-date Linux application server.


Design and Implementation of an Efficient Communication System for Collecting Sensor Data in Large Scale Sensors Networks (대규모 센서 네트워크에서 센서 데이터 수집을 위한 효율적인 통신 시스템 설계 및 구현)

  • Jang, Si-woong;Kim, Ji-Seong
    • Journal of the Korea Institute of Information and Communication Engineering / v.24 no.1 / pp.113-119 / 2020
  • Large sensor networks require the collection and analysis of data from a large number of sensors, but the number of sensors that can be controlled per microcontroller is limited. In this paper, we propose a way to aggregate sensor data from a large number of sensors using many microcontrollers and multiple bridge nodes, and we design and implement an efficient communication system for sensor data collection. Bridge nodes aggregate data from multiple microcontrollers using SPI communication and transfer the aggregated data to PC servers using wireless TCP/IP communication. The communication system was constructed using the open-hardware Arduino Mini and ESP8266, and its performance was analyzed. The performance analysis showed that more than 30 sensing data items per second can be collected from more than 700 sensors.
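The bridge-to-server aggregation step can be illustrated with a small binary framing scheme on the server side. The frame layout here (1-byte bridge id, 2-byte big-endian reading count, then one float32 per reading) is a hypothetical example, not the paper's actual protocol.

```python
import struct

def pack_frame(bridge_id, readings):
    """Bridge side: serialize one batch of sensor readings into a frame."""
    return struct.pack(">BH", bridge_id, len(readings)) + \
           b"".join(struct.pack(">f", r) for r in readings)

def unpack_frame(frame):
    """Server side: parse one frame back into (bridge id, readings)."""
    bridge_id, n = struct.unpack_from(">BH", frame)
    readings = [struct.unpack_from(">f", frame, 3 + 4 * i)[0] for i in range(n)]
    return bridge_id, readings

def aggregate(frames):
    """Merge frames arriving from several bridge nodes, keyed by bridge id."""
    out = {}
    for f in frames:
        bid, vals = unpack_frame(f)
        out.setdefault(bid, []).extend(vals)
    return out

frames = [pack_frame(1, [20.5, 21.0]), pack_frame(2, [19.0]), pack_frame(1, [22.5])]
merged = aggregate(frames)
```

Batching many readings into one frame per bridge is what keeps the per-reading TCP/IP overhead low enough to sustain the reported collection rates.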

Dog-Species Classification through CycleGAN and Standard Data Augmentation

  • Park, Chan;Moon, Nammee
    • Journal of Information Processing Systems / v.19 no.1 / pp.67-79 / 2023
  • In the image field, data augmentation refers to increasing the amount of data through an editing method such as rotating or cropping a photo. In this study, a generative adversarial network (GAN) image was created using CycleGAN, and various colors of dogs were reflected through data augmentation. In particular, dog data from the Stanford Dogs Dataset and Oxford-IIIT Pet Dataset were used, and 10 breeds of dog, corresponding to 300 images each, were selected. Subsequently, a GAN image was generated using CycleGAN, and four learning groups were established: 2,000 original photos (group I); 2,000 original photos + 1,000 GAN images (group II); 3,000 original photos (group III); and 3,000 original photos + 1,000 GAN images (group IV). The amount of data in each learning group was augmented using existing data augmentation methods such as rotating, cropping, erasing, and distorting. The augmented photo data were used to train the MobileNet_v3_Large, ResNet-152, InceptionResNet_v2, and NASNet_Large frameworks to evaluate the classification accuracy and loss. The top-3 accuracy for each deep neural network model was as follows: MobileNet_v3_Large of 86.4% (group I), 85.4% (group II), 90.4% (group III), and 89.2% (group IV); ResNet-152 of 82.4% (group I), 83.7% (group II), 84.7% (group III), and 84.9% (group IV); InceptionResNet_v2 of 90.7% (group I), 88.4% (group II), 93.3% (group III), and 93.1% (group IV); and NASNet_Large of 85% (group I), 88.1% (group II), 91.8% (group III), and 92% (group IV). The InceptionResNet_v2 model exhibited the highest image classification accuracy, and the NASNet_Large model exhibited the highest increase in the accuracy owing to data augmentation.
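The "standard" augmentations named in the abstract (rotating, cropping, erasing) can be sketched on a plain 2-D grid without any imaging library; the CycleGAN part is out of scope here, and the tiny 3x3 image is purely illustrative.

```python
def hflip(img):
    """Mirror the image left-to-right."""
    return [row[::-1] for row in img]

def rotate90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def crop(img, top, left, h, w):
    """Cut out an h x w window starting at (top, left)."""
    return [row[left:left + w] for row in img[top:top + h]]

def erase(img, top, left, h, w, fill=0):
    """Random-erasing style cutout: blank an h x w region."""
    out = [list(row) for row in img]
    for r in range(top, top + h):
        for c in range(left, left + w):
            out[r][c] = fill
    return out

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
augmented = [hflip(img), rotate90(img), crop(img, 0, 0, 2, 2), erase(img, 1, 1, 1, 1)]
```

Each transform yields a new labeled sample from the same photo, which is how the study multiplies its per-group training data before feeding the networks.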

A Method for Generating Large-Interval Itemset using Locality of Data (데이터의 지역성을 이용한 빈발구간 항목집합 생성방법)

  • Park, Won-Hwan;Park, Doo-Soon
    • Journal of Korea Multimedia Society / v.4 no.5 / pp.465-475 / 2001
  • Recently, there has been growing attention to research on inducing association rules from large volumes of data. One line of this work concerns methods that can be applied to quantitative attribute data. This paper presents a new method for generating large-interval itemsets that uses locality to partition the range of the data. The method minimizes the loss of data-inherent characteristics by generating denser large-interval items than other methods. Performance evaluation results show that the new approach is more efficient than previously proposed techniques.
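One way to read "using locality" is to let interval boundaries follow the local density of the values instead of a fixed-width grid: consecutive values that are all frequent form one interval item. The sketch below is a generic illustration of that idea under this assumption, not the paper's exact algorithm.

```python
def dense_intervals(freq, density_threshold):
    """freq: list of (value, count) pairs sorted by value.
    Group runs of consecutive values whose count meets the threshold
    into interval items, so intervals track where the data is dense."""
    intervals, run = [], []
    for v, c in freq:
        if c >= density_threshold:
            run.append(v)                        # extend the current dense run
        elif run:
            intervals.append((run[0], run[-1]))  # close the run at a sparse value
            run = []
    if run:
        intervals.append((run[0], run[-1]))
    return intervals

# Hypothetical value histogram for one quantitative attribute.
freq = [(1, 9), (2, 8), (3, 1), (4, 7), (5, 7), (6, 6), (7, 0), (8, 5)]
iv = dense_intervals(freq, 5)
```

Because boundaries land on the sparse values, the dense regions are not diluted by merging them with near-empty neighbors, which is how locality-aware partitioning preserves the data's characteristics.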
