• Title/Summary/Keyword: huge data

On the clustering of huge categorical data

  • Kim, Dae-Hak
    • Journal of the Korean Data and Information Science Society / v.21 no.6 / pp.1353-1359 / 2010
  • The basic objective of cluster analysis is to discover natural groupings of items. In general, clustering is conducted based on a similarity (or dissimilarity) matrix or on the original input data, and various measures of similarity between objects have been developed. In this paper, we consider the clustering of a huge real categorical data set describing the time, location, and activity patterns of Korean people. Useful similarity measures for the categorical variables are developed and adopted for this data set, and hierarchical and nonhierarchical clustering methods are applied to it.
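
The abstract does not specify the similarity measure used; a minimal sketch using the simple matching dissimilarity (a common choice for categorical data, assumed here purely for illustration) together with average-linkage hierarchical clustering might look like this:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy categorical data: rows are respondents, columns are
# (time slot, location, activity) codes -- illustrative only.
X = np.array([
    ["morning", "home",   "sleep"],
    ["morning", "office", "work"],
    ["evening", "home",   "leisure"],
    ["morning", "office", "work"],
])

# Simple matching dissimilarity: fraction of attributes that differ.
n = X.shape[0]
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = np.mean(X[i] != X[j])

# Average-linkage hierarchical clustering on the dissimilarity matrix.
Z = linkage(squareform(D), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```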

Training for Huge Data set with On Line Pruning Regression by LS-SVM

  • Kim, Dae-Hak; Shim, Joo-Yong; Oh, Kwang-Sik
    • Proceedings of the Korean Statistical Society Conference / 2003.10a / pp.137-141 / 2003
  • LS-SVM (least squares support vector machine) is a widely applicable and useful machine learning technique for classification and regression analysis. LS-SVM can be a good substitute for classical statistical methods, but computational difficulties remain because the method requires inverting a matrix whose size grows with the data set. In the modern information society, we can easily obtain huge data sets in online or batch mode. For such huge data sets, we suggest an online pruning regression method based on LS-SVM. With a relatively small number of pruned support vectors, we can achieve almost the same performance as regression on the full data set.
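
A minimal sketch of batch LS-SVM regression, which solves a single linear system for the dual coefficients and bias; the paper's online pruning step (discarding support vectors as data arrive) is not reproduced, and the RBF kernel and parameter values are assumptions for illustration:

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    """Gaussian (RBF) kernel matrix between two sample sets."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    """Solve the LS-SVM regression linear system for (b, alpha)."""
    n = len(y)
    K = rbf_kernel(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]          # bias b, dual coefficients alpha

def lssvm_predict(X_train, alpha, b, X_new, sigma=1.0):
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b

# Toy usage on synthetic data.
X = np.random.rand(50, 1)
y = np.sin(4 * X[:, 0]) + 0.1 * np.random.randn(50)
b, alpha = lssvm_fit(X, y)
print(lssvm_predict(X, alpha, b, X[:5]))
```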

Artificial Intelligence and Pattern Recognition Using Data Mining Algorithms

  • Al-Shamiri, Abdulkawi Yahya Radman
    • International Journal of Computer Science & Network Security / v.21 no.7 / pp.221-232 / 2021
  • In recent years, with huge amounts of data stored in large, multi-source databases, the need for accurate tools for analyzing the data and extracting information and knowledge from those databases has increased. Hence, new and modern techniques have emerged that contribute to the development of other sciences. Knowledge discovery techniques are among these technologies; one popular knowledge discovery technique is data mining, which aims at discovering knowledge from huge amounts of data. Data mining is an important and interesting technique with many different and varied algorithms; therefore, this paper presents an overview of data mining and clarifies the most important of those algorithms and their uses.

A Study on Data Mining Using the Spline Basis

  • Lee, Sun-Geune; Sim, Songyong; Koo, Ja-Yong
    • Communications for Statistical Applications and Methods / v.11 no.2 / pp.255-264 / 2004
  • Due to computerized data processing, we often encounter huge data sets; on the other hand, advances in computing technology make it possible to deal with them, and data mining is one important area. In this paper we consider data mining when the dependent variable is binary. The proposed method uses the polyclass model when the independent variables consist of continuous and discrete variables. An example is provided.
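
The polyclass model selects spline knots adaptively; a minimal fixed-knot approximation for a binary response (a B-spline basis expansion followed by logistic regression, using scikit-learn as an assumed library rather than the paper's implementation) might look like this:

```python
import numpy as np
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data: one continuous predictor, binary response with a
# nonlinear relationship between predictor and log-odds.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(500, 1))
p = 1 / (1 + np.exp(-3 * np.sin(2 * x[:, 0])))
y = rng.binomial(1, p)

# Expand the continuous predictor in a B-spline basis, then fit a
# logistic model that is linear in the basis functions.
model = make_pipeline(
    SplineTransformer(degree=3, n_knots=6),
    LogisticRegression(max_iter=1000),
)
model.fit(x, y)
print(model.predict_proba(x[:5]))
```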

Performance of Distributed Database System built on Multicore Systems

  • Kim, Kangseok
    • Journal of Internet Computing and Services / v.18 no.6 / pp.47-53 / 2017
  • Recently, huge datasets have been generated rapidly in a variety of fields, so there is an urgent need for technologies that allow them to be processed efficiently and effectively. Partitioning a huge dataset effectively and alleviating the processing overhead of the partitioned data are therefore critical factors for scalability and performance in a distributed database system. In our work we utilize multicore servers to provide scalable service in our distributed system; partitioning a database over multicore servers arises from the need for a new architectural design of distributed database systems, driven by scalability and performance concerns in today's data deluge. The system allows uniform access, through a web service interface, to the databases distributed over the multicore servers, using an SQMD (Single Query Multiple Database) mechanism based on the publish/subscribe paradigm. We present performance results for the distributed database system built on multicore servers for workloads that are time-intensive with traditional architectures, and we also discuss future work.
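
A minimal sketch of the SQMD idea, broadcasting one query to several database partitions and merging the partial results; it substitutes a thread pool for the paper's publish/subscribe layer, and the SQLite shard files and table name are hypothetical:

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shard files standing in for databases spread over
# multicore servers; names are illustrative, not from the paper.
SHARDS = ["shard0.db", "shard1.db", "shard2.db"]

def query_shard(path, sql, params=()):
    """Run the same query against one shard and return its rows."""
    with sqlite3.connect(path) as conn:
        return conn.execute(sql, params).fetchall()

def sqmd(sql, params=()):
    """Single Query, Multiple Databases: broadcast the query to all
    shards in parallel and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda p: query_shard(p, sql, params), SHARDS)
    return [row for rows in partials for row in rows]

# Usage (assumes each shard holds part of a 'records' table):
# rows = sqmd("SELECT id, value FROM records WHERE value > ?", (100,))
```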

A sample size calibration approach for the p-value problem in huge samples

  • Park, Yousung; Jeon, Saebom; Kwon, Tae Yeon
    • Communications for Statistical Applications and Methods / v.25 no.5 / pp.545-557 / 2018
  • The inclusion of covariates in a model often affects not only the estimates of the variables of interest but also their statistical significance. Such a gap between statistical and subject-matter significance is a critical issue in huge-sample studies. A popular huge-sample study, the sample cohort data from the Korean National Health Insurance Service, showed such a gap in the inference for the effect of obesity on cause of mortality, requiring careful consideration. In this regard, this paper proposes a sample size calibration method based on a Monte Carlo t (or z)-test approach that does not require Monte Carlo simulation, and also proposes a test procedure for subject-matter significance using this calibration method in order to complement the deflated p-values obtained at huge sample sizes. Our calibration method shows no subject-matter significance of the obesity paradox regardless of race, sex, and age group, unlike traditional conclusions based on p-values.
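
A small numerical illustration of the p-value problem the paper addresses: a fixed, subject-matter-irrelevant effect becomes "statistically significant" once the sample is large enough. The calibration itself follows the authors' derivation and is not reproduced here.

```python
import numpy as np
from scipy import stats

# A tiny mean difference that is irrelevant in subject-matter terms.
effect, sigma = 0.02, 1.0

for n in (1_000, 100_000, 10_000_000):
    z = effect / (sigma / np.sqrt(n))        # one-sample z statistic
    p = 2 * stats.norm.sf(abs(z))
    print(f"n={n:>10,}  z={z:6.2f}  p={p:.2e}")

# The paper's calibration replaces n by a smaller calibrated sample
# size before computing the test statistic, so that statistical and
# subject-matter significance agree.
```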

A Case Study of Economic Infographic by Beautiful Visualization Method

  • Hua, Zheng-yang; Kim, Se-hwa
    • Proceedings of the Korea Contents Association Conference / 2012.05a / pp.339-340 / 2012
  • With the flood of huge data, simple static traditional diagrams can no longer help readers understand these datasets, and economic data analysis takes a long time to be clearly understood. The purpose of this study is to use beautiful visualization methods to analyze economic infographics so that huge data can be displayed easily, quickly, and aesthetically.

Efficient Processing of Huge Airborne Laser Scanned Data Utilizing Parallel Computing and Virtual Grid (병렬처리와 가상격자를 이용한 대용량 항공 레이저 스캔 자료의 효율적인 처리)

  • Han, Soo-Hee; Heo, Joon; Lkhagva, Enkhbaatar
    • Journal of Korea Spatial Information System Society / v.10 no.4 / pp.21-26 / 2008
  • A method for processing huge airborne laser scanned data using parallel computing and a virtual grid is proposed, and the method is tested by generating a raster DSM (Digital Surface Model) with IDW (Inverse Distance Weighting). Parallelism is employed for fast interpolation of the huge point data, and the virtual grid is adopted to enhance the search efficiency over irregularly distributed points. Processing time was measured on a cluster consisting of one master node and six slave nodes, resulting in an efficiency near 1 and good load scalability. In addition, data too large to be processed on a single system were processed on the cluster.
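
A minimal sketch of the virtual-grid idea for IDW (not the paper's implementation): scattered points are binned into coarse cells so that interpolation only searches neighbouring cells instead of the whole point cloud. The cell size, neighbourhood, and parallelization over DSM rows are assumptions left out for brevity.

```python
import numpy as np

def build_virtual_grid(points, cell):
    """Bin (x, y, z) points into a dict keyed by coarse grid cell."""
    grid = {}
    for p in points:
        key = (int(p[0] // cell), int(p[1] // cell))
        grid.setdefault(key, []).append(p)
    return grid

def idw(x, y, grid, cell, power=2.0):
    """IDW estimate at (x, y) using points from the 3x3 neighbouring cells."""
    cx, cy = int(x // cell), int(y // cell)
    cand = [p for i in range(cx - 1, cx + 2)
              for j in range(cy - 1, cy + 2)
              for p in grid.get((i, j), [])]
    if not cand:
        return np.nan
    cand = np.asarray(cand)
    d = np.hypot(cand[:, 0] - x, cand[:, 1] - y)
    if d.min() < 1e-9:                       # query coincides with a point
        return float(cand[d.argmin(), 2])
    w = 1.0 / d ** power
    return float((w * cand[:, 2]).sum() / w.sum())

# Usage: random (x, y, z) points; a DSM is produced by calling idw()
# on every raster cell, with rows distributed over worker processes.
pts = np.random.rand(10_000, 3)
grid = build_virtual_grid(pts, cell=0.1)
print(idw(0.5, 0.5, grid, cell=0.1))
```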

A Study on Data Modeling for VLDB Performance (VLDB의 성능을 고려한 데이터 모델링에 관한 연구)

  • Li, Zhong-Shi; Lee, Chang-Ho
    • Journal of the Korea Safety Management & Science / v.14 no.2 / pp.185-192 / 2012
  • A decade ago, a database of 10 GB was considered huge; nowadays, however, 10 TB databases are common and even larger ones exist, so a new generation of Very Large Databases (VLDB) has begun. Moving into this generation of VLDBs has caused major problems with backup, restore, and management, and especially with performance: because of the huge amount of data, it is now very hard to export the necessary data rapidly. In the past, such problems were out of the question because there was less data, but as VLDBs become common, performance optimization becomes a big issue. Therefore, new professional techniques are urgently required to maintain and optimize a database that has become, or is in the process of becoming, a VLDB.

3-DIMENSIONAL TILING TECHNIQUE TO PROCESS HUGE SIZE HIGH RESOLUTION SATELLITE IMAGE SEAMLESSLY AND RAPIDLY

  • Jung, Chan-Gyu; Kim, Jun-Chul; Hwang, Hyun-Deok
    • Proceedings of the KSRS Conference / 2007.10a / pp.85-89 / 2007
  • This paper presents a method to provide fast service for user manipulation, such as zooming and panning, of huge high-resolution satellite images (e.g., gigabytes per scene). The proposed technique is based on a hierarchical structure with 3D tiling in both the horizontal and the vertical direction, which provides the image service more effectively than the earlier 2D tiling technique. The essence of the proposed technique is to create tiles at the optimal horizontal and vertical levels based on the currently displayed area, which changes as the user manipulates the huge image. The technique therefore provides a seamless service and is powerful and useful for manipulating huge images without data conversion.
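
The abstract does not spell out the tile layout; a minimal sketch of pyramid-style tile addressing (an assumed scheme in which the vertical direction corresponds to resolution levels, not taken from the paper) might look like this:

```python
import math

TILE = 512  # assumed tile edge length in pixels

def level_for_zoom(zoom):
    """Pick the pyramid level whose resolution best matches the view
    (zoom = 1.0 shows full resolution, 0.5 shows every other pixel)."""
    return max(0, int(math.floor(math.log2(1.0 / zoom))))

def visible_tiles(level, x0, y0, x1, y1):
    """Tile indices at the given level covering the full-resolution
    window [x0, x1) x [y0, y1)."""
    scale = 2 ** level
    tx0, ty0 = x0 // (TILE * scale), y0 // (TILE * scale)
    tx1, ty1 = (x1 - 1) // (TILE * scale), (y1 - 1) // (TILE * scale)
    return [(level, tx, ty)
            for ty in range(ty0, ty1 + 1)
            for tx in range(tx0, tx1 + 1)]

# Usage: panning or zooming only requests tiles not already cached,
# which is what keeps the display seamless.
print(visible_tiles(level_for_zoom(0.25), 0, 0, 8192, 8192))
```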
