• Title/Summary/Keyword: Big Data Clustering

Multi-class Support Vector Machines Model Based Clustering for Hierarchical Document Categorization in Big Data Environment (빅 데이터 환경에서 계층적 문서 유형 분류를 위한 클러스터링 기반 다중 SVM 모델)

  • Kim, Young Soo;Lee, Byoung Yup
    • The Journal of the Korea Contents Association / v.17 no.11 / pp.600-608 / 2017
  • Data volumes have recently been growing exponentially with the rapid expansion of the internet. Because users need only part of the available information, they bear a heavy workload in examining and discovering the content they need. Information retrieval must therefore provide hierarchical class information and an examination priority by evaluating the similarity between the query and the documents. In this paper we propose a clustering approach based on a multi-class support vector machine (SVM) model for hierarchical document categorization, which makes semantic search possible by considering word co-occurrence measures. Combining hierarchical document categorization with an SVM classifier gives high performance for the analytical classification of web documents, whose number increases exponentially as the document hierarchy expands. We expect more information retrieval systems to adopt the proposed model in their development and to deliver accurate and rapid information retrieval services.

Design of Meteorological Radar Pattern Classifier Using Clustering-based RBFNNs : Comparative Studies and Analysis (클러스터링 기반 RBFNNs를 이용한 기상레이더 패턴분류기 설계 : 비교 연구 및 해석)

  • Choi, Woo-Yong;Oh, Sung-Kwun
    • Journal of the Korean Institute of Intelligent Systems / v.24 no.5 / pp.536-541 / 2014
  • Meteorological radar data include ground echo, sea-clutter echo, anomalous propagation echo, clear echo, and so on. Each of these is a kind of non-precipitation echo, and the characteristics of the individual echoes are analyzed in order to identify non-precipitation. Because meteorological radar data are given as big data, they are analyzed after a pre-processing procedure. In this study, an echo pattern classifier is designed to distinguish non-precipitation echoes from precipitation echoes in meteorological radar data using RBFNNs and an echo judgement module. The output performance of HCM clustering-based RBFNNs and FCM clustering-based RBFNNs is compared and analyzed.
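
The difference between the two clustering schemes compared above can be illustrated with a minimal sketch. It assumes Euclidean distance and a fuzzifier `m = 2`; the function names, the sample centers, and the sample point are hypothetical and are not taken from the paper. Hard c-means (HCM) gives each point a 0/1 membership, while fuzzy c-means (FCM) gives graded memberships that sum to 1.

```python
import math


def hcm_membership(point, centers):
    """Hard c-means: membership 1 for the nearest center, 0 for the rest."""
    dists = [math.dist(point, c) for c in centers]
    nearest = dists.index(min(dists))
    return [1.0 if i == nearest else 0.0 for i in range(len(centers))]


def fcm_membership(point, centers, m=2.0):
    """Fuzzy c-means: graded memberships that sum to 1 (fuzzifier m > 1)."""
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that center.
    if 0.0 in dists:
        hit = dists.index(0.0)
        return [1.0 if i == hit else 0.0 for i in range(len(centers))]
    power = 2.0 / (m - 1.0)
    # Standard FCM membership: u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)).
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]


# Hypothetical two-dimensional feature vectors, for illustration only.
centers = [(0.0, 0.0), (4.0, 0.0)]
sample = (1.0, 0.0)
hard = hcm_membership(sample, centers)
fuzzy = fcm_membership(sample, centers)
```

In an RBFNN built on either scheme, memberships like these could serve as hidden-layer activations; how the paper wires them into the network is not stated in the abstract.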

Design of Efficient Big Data Collection Method based on Mass IoT devices (방대한 IoT 장치 기반 환경에서 효율적인 빅데이터 수집 기법 설계)

  • Choi, Jongseok;Shin, Yongtae
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology / v.14 no.4 / pp.300-306 / 2021
  • With the development of IT, the hardware technologies applied to IoT equipment have recently advanced, and smart systems using low-cost, high-performance RF and computing devices are being developed. However, in an infrastructure environment where a large number of IoT devices are installed, big data collection places a load on the collection server because of bottlenecks among the transmitted data. As a result, data sent to the collection server suffer packet loss and reduced throughput. An efficient big data collection technique is therefore needed for such environments, and this paper proposes one. In the performance evaluation of packet loss and data throughput, the proposed technique completed transmission without loss of the transmitted file. In the future, the system needs to be implemented based on this design.

Parallel k-Modes Algorithm for Spark Framework (스파크 프레임워크를 위한 병렬적 k-Modes 알고리즘)

  • Chung, Jaehwa
    • KIPS Transactions on Software and Data Engineering / v.6 no.10 / pp.487-492 / 2017
  • Clustering is a technique used to measure similarities between data in big data analysis and data mining. Among the various clustering methods, the k-Modes algorithm is the representative choice for categorical data. To improve the performance of iteration-centric tasks such as k-Modes, Spark, a distributed and concurrent framework, has recently received great attention because it overcomes the limitations of Hadoop. Spark provides an environment that can process large amounts of data in main memory using abstract objects called RDDs. Spark provides MLlib, a dedicated machine learning library, but MLlib includes only k-means, which handles only continuous data, so categorical data cannot be processed. In this paper, we design RDDs for the k-Modes algorithm for categorical data clustering in the Spark environment and implement an algorithm that operates effectively. Experiments show that the performance of the proposed algorithm increases linearly in the Spark environment.
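
For readers unfamiliar with k-Modes, the following is a minimal single-machine sketch in plain Python. It is not the paper's RDD-based Spark design: it only shows the two steps that such a design would have to distribute, namely assigning each record to the nearest mode under Hamming (mismatch-count) dissimilarity and recomputing each mode column by column. The function names and the toy records are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of attribute positions where two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Most frequent value of each attribute across the given records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(records, init_modes, max_iter=10):
    """Alternate assignment and mode update until the modes stop changing."""
    modes = [tuple(m) for m in init_modes]
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest mode by Hamming dissimilarity.
        labels = [
            min(range(len(modes)), key=lambda j: hamming(r, modes[j]))
            for r in records
        ]
        # Update step: per-cluster, per-attribute mode (empty clusters keep theirs).
        new_modes = []
        for j, old in enumerate(modes):
            members = [r for r, lab in zip(records, labels) if lab == j]
            new_modes.append(column_modes(members) if members else old)
        if new_modes == modes:
            break
        modes = new_modes
    return modes, labels


# Hypothetical categorical records: (color, size).
records = [("red", "s"), ("red", "m"), ("blue", "l"), ("blue", "l"), ("red", "s")]
modes, labels = k_modes(records, init_modes=[("red", "s"), ("blue", "l")])
```

In a distributed setting the assignment step is naturally a map over records and the update step a per-cluster aggregation of value counts, which is one plausible reason the algorithm suits an in-memory framework.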

Analysis of Covid-19, Tourism, Stress Keywords Using Social Network Big Data_Semantic Network Analysis

  • Yun, Su-Hyun;Moon, Seok-Jae;Ryu, Ki-Hwan
    • International Journal of Advanced Culture Technology / v.10 no.1 / pp.204-210 / 2022
  • From the 1970s to the present, the number of new infectious diseases such as SARS, Ebola virus, and MERS has steadily increased. The new infectious disease COVID-19, which began in Wuhan, Hubei Province, China, has pushed the world into a pandemic era. As a result, countries restricted entry from abroad over concerns about the spread of COVID-19, which reduced the movement of tourists. With travel restricted, keywords such as "Corona blue" have soared and depression has increased. This study therefore aims to analyze the semantic network of stress in the COVID-19 era in order to derive keywords and propose a plan for a travel-related platform for the post-COVID-19 era. We analyzed travel and stress caused by COVID-19 using TEXTOM, a big data analysis tool, and conducted semantic network analysis using UCINET6. We also conducted a CONCOR analysis to cluster keywords with similarities. However, since the data collected so far are oriented toward travel and stress, more data need to be collected and analyzed in the future.

Location Recommendation Customize System Using Opinion Mining (오피니언마이닝을 이용한 사용자 맞춤 장소 추천 시스템)

  • Choi, Eun-jeong;Kim, Dong-keun
    • Journal of the Korea Institute of Information and Communication Engineering / v.21 no.11 / pp.2043-2051 / 2017
  • Along with the growing interest in big data, interest in application fields built on big data processing is also growing. Opinion mining is a big data processing technique widely used to provide personalized services to users. In this paper, users' textual reviews of places are processed with opinion mining, and user sentiment is analyzed through k-means clustering. Users whose sentiment falls into a similar category under the clustering are assigned the same numerical value. We propose a method that predicts preferences with a collaborative filtering recommendation system using the assigned values and shows recommendations to users by placing markers on a map in descending order of predicted value.

A Customer Segmentation Scheme Base on Big Data in a Bank (빅데이터를 활용한 은행권 고객 세분화 기법 연구)

  • Chang, Min-Suk;Kim, Hyoung Joong
    • Journal of Digital Contents Society / v.19 no.1 / pp.85-91 / 2018
  • Most banks segment customers using only demographic information such as gender, age, occupation, and address, which does not reflect customers' financial behavior patterns. In this study, we aim to solve this problem by using the various big data held by a bank and to develop a customer segmentation method that can be widely used by many banks in the future. We propose an approach that segments clustering blocks with a bottom-up method. Its advantage is that it can accurately reflect customers' various financial needs based on transaction patterns, channel contact patterns, and existing demographic information. On this basis, we will develop various marketing models such as product recommendation, financial need rating calculation, and customer churn prediction, and we will adapt these models to the marketing strategy of NH Bank.

A Study on Application of Machine Learning Algorithms to Visitor Marketing in Sports Stadium (기계학습 알고리즘을 사용한 스포츠 경기장 방문객 마케팅 적용 방안)

  • Park, So-Hyun;Ihm, Sun-Young;Park, Young-Ho
    • Journal of Digital Contents Society / v.19 no.1 / pp.27-33 / 2018
  • In this study, we analyze big data on visitors to a sports stadium from a marketing perspective and conduct research to provide customized marketing services to consumers. For this purpose, we derive groups of similar visitors using the K-means clustering method. We also use the K-nearest neighbors method to predict the store of interest for new visitors. The experiments showed that, by deriving groups of similar visitors with these two algorithms, it was possible to provide a marketing service suited to each group's attributes and to recommend products and events to new visitors.
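
The two-step pipeline described above can be sketched in plain Python as follows. This is an illustrative sketch only: the visitor features, store labels, initial centroids, and function names are hypothetical, and the abstract does not say which features or distance measure the authors used (Euclidean distance is assumed here). K-means groups existing visitors, and K-nearest neighbors predicts a store of interest for a new visitor by majority vote among the closest known visitors.

```python
import math
from collections import Counter


def kmeans(points, centroids, max_iter=20):
    """Lloyd's algorithm with caller-supplied initial centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Move each centroid to the mean of its members (empty clusters stay put).
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                updated.append(old)
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


def knn_predict(query, points, targets, k=3):
    """Majority vote among the k points closest to the query."""
    nearest = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))[:k]
    return Counter(targets[i] for i in nearest).most_common(1)[0][0]


# Hypothetical visitor features, e.g. (visits per month, average spend index).
visitors = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
stores = ["snack", "snack", "snack", "merch", "merch", "merch"]

centroids, groups = kmeans(visitors, centroids=[(0.0, 0.0), (10.0, 10.0)])
new_visitor_store = knn_predict((8.5, 8.5), visitors, stores, k=3)
```

Here `groups` plays the role of the derived visitor segments, and `new_visitor_store` is the predicted store of interest for a visitor who was not part of the clustering.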

High-performance computing for SARS-CoV-2 RNAs clustering: a data science-based genomics approach

  • Oujja, Anas;Abid, Mohamed Riduan;Boumhidi, Jaouad;Bourhnane, Safae;Mourhir, Asmaa;Merchant, Fatima;Benhaddou, Driss
    • Genomics & Informatics / v.19 no.4 / pp.49.1-49.11 / 2021
  • Genomic data constitute one of the fastest growing datasets in the world. By 2025, they are expected to become the fourth largest source of Big Data, which mandates an adequate high-performance computing (HPC) platform for processing. With the latest unprecedented and unpredictable mutations in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the research community is in crucial need of ICT tools to process SARS-CoV-2 RNA data, e.g., by classifying it (i.e., clustering), thereby assisting in tracking virus mutations and predicting future ones. In this paper, we present an HPC-based SARS-CoV-2 RNA clustering tool. We adopt a data science approach, from data collection, through analysis, to visualization. In the analysis step, we present how our clustering approach leverages HPC and the longest common subsequence (LCS) algorithm. The approach uses the Hadoop MapReduce programming paradigm and adapts the LCS algorithm to efficiently compute the length of the LCS for each pair of SARS-CoV-2 RNA sequences. The sequences are extracted from the U.S. National Center for Biotechnology Information (NCBI) Virus repository. The computed LCS lengths are used to measure the dissimilarities between RNA sequences in order to work out existing clusters. In addition, we present a comparative study of the LCS algorithm's performance under variable workloads and different numbers of Hadoop worker nodes.
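
The pairwise computation at the core of this approach can be sketched on a single machine as below. The LCS-length routine is the standard dynamic program (kept to two rows of memory). The dissimilarity formula, `1 - LCS / max(len)`, is an assumption made for illustration, since the abstract does not give the authors' exact definition; the sequence names and short sequences are likewise hypothetical, and the Hadoop MapReduce distribution of the pairs is not shown.

```python
from itertools import combinations


def lcs_length(a, b):
    """Length of the longest common subsequence, using O(min(len)) memory."""
    if len(b) > len(a):
        a, b = b, a  # keep the DP row as short as possible
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, other in enumerate(b, start=1):
            if ch == other:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_dissimilarity(a, b):
    """Assumed measure: 0 for identical sequences, 1 when nothing is shared."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else 1.0 - lcs_length(a, b) / longest


def pairwise_dissimilarities(sequences):
    """Dissimilarity for every unordered pair of named sequences."""
    return {
        (n1, n2): lcs_dissimilarity(sequences[n1], sequences[n2])
        for n1, n2 in combinations(sorted(sequences), 2)
    }


# Hypothetical toy sequences; real SARS-CoV-2 genomes are roughly 30,000 bases long.
toy = {"s1": "AUGGC", "s2": "AUGC", "s3": "CCCCC"}
matrix = pairwise_dissimilarities(toy)
```

Each pair is independent of every other pair, which is what makes the workload easy to split across worker nodes; the resulting dissimilarities can then be fed to any clustering method that accepts a distance matrix.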

Management of Distributed Nodes for Big Data Analysis in Small-and-Medium Sized Hospital (중소병원에서의 빅데이터 분석을 위한 분산 노드 관리 방안)

  • Ryu, Wooseok
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2016.05a / pp.376-377 / 2016
  • The performance of Hadoop, a distributed data processing framework for big data analysis, is affected by characteristics of each node in the distributed cluster, such as processing power and network bandwidth. This paper analyzes previous approaches for heterogeneous Hadoop clusters and presents several requirements for distributed node clustering in small-and-medium sized hospitals, taking the hospitals' computing environments into account.
