• Title/Summary/Keyword: Big Data Clustering


Signed Hellinger measure for directional association (연관성 방향을 고려한 부호 헬링거 측도의 제안)

  • Park, Hee Chang
    • Journal of the Korean Data and Information Science Society, v.27 no.2, pp.353-362, 2016
  • According to Wikipedia, data mining is the process of discovering patterns in big data sets involving methods at the intersection of association rules, decision trees, clustering, artificial intelligence, machine learning, and database systems. Association rule mining is a method for discovering interesting relations between items in large transaction databases by means of interestingness measures. Association rule interestingness measures play a major role in the knowledge discovery process in databases and have been developed by many researchers. Among them, the Hellinger measure is a good association threshold that considers both the information content and the generality of a rule, but it has the drawback that it cannot determine the direction of the association. In this paper we propose a signed Hellinger measure that can be interpreted operationally, and we check three conditions for an association threshold. Furthermore, we investigate some of its properties through a few examples. The results show that the signed Hellinger measure is better than the Hellinger measure because the signed version can identify the direction of the association.
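As a rough sketch of the idea only (the exact formulation in the paper may differ), a Hellinger-type measure for a rule A => B can be computed from a 2x2 contingency table by comparing the joint distribution with the product of the marginals, and a sign can be attached according to whether A and B co-occur more or less often than independence would predict:

```python
import numpy as np

def signed_hellinger(n11, n10, n01, n00):
    """Illustrative signed Hellinger measure for the rule A => B.

    n11, n10, n01, n00 are the cell counts of the 2x2 table for (A, B),
    e.g. n11 = #(A and B), n10 = #(A and not B).  This is a sketch of the
    general idea, not necessarily the paper's exact formula.
    """
    n = n11 + n10 + n01 + n00
    joint = np.array([n11, n10, n01, n00], dtype=float) / n
    pa, pb = (n11 + n10) / n, (n11 + n01) / n
    indep = np.array([pa * pb, pa * (1 - pb), (1 - pa) * pb, (1 - pa) * (1 - pb)])
    h = np.sqrt(np.sum((np.sqrt(joint) - np.sqrt(indep)) ** 2))
    sign = np.sign(joint[0] - indep[0])          # direction of the association
    return sign * h

# A and B co-occur more (first call) or less (second call) than expected.
print(signed_hellinger(40, 10, 10, 40))   # positive value
print(signed_hellinger(10, 40, 40, 10))   # negative value
```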

The similarities analysis of location fishing information through 2 step clustering (2단계 군집분석을 통한 해구별 조업정보의 유사성 분석)

  • Cho, Yong-Jun
    • Journal of the Korean Data and Information Science Society, v.20 no.3, pp.551-562, 2009
  • In this paper, I present a method for using the Fishing Operation Information (FOI) of the National Federation of Fisheries Cooperatives (NFFC) through an availability analysis, and derive the similarities between sections of the sea by classifying the characteristics of fishing patterns by location. Although the catch recorded in FOI amounts to only about 33% of the National Fishery Production Statistics (NFPS), the FOI data is useful for understanding fishing operation patterns by location, because both the patterns and the correlations were very similar when FOI was compared with NFPS in the usability analysis. I therefore derived optimal clusters for catch, number of fishing days, and number of fishing vessels through a two-step cluster analysis by large marine zone and identified the resulting fishing patterns.
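The abstract does not give implementation details, but a minimal two-step clustering sketch (hierarchical clustering to pick a number of groups, then k-means on per-zone catch, fishing days, and vessel counts; all data and thresholds here are hypothetical placeholders) might look like:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-marine-zone features: [catch, fishing_days, vessels]
rng = np.random.default_rng(0)
zones = rng.gamma(shape=2.0, scale=[500.0, 20.0, 15.0], size=(60, 3))

X = StandardScaler().fit_transform(zones)

# Step 1: hierarchical clustering to choose a plausible number of clusters.
tree = linkage(X, method="ward")
k = len(np.unique(fcluster(tree, t=5.0, criterion="distance")))

# Step 2: k-means with that k to obtain the final zone groups.
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
print(k, np.bincount(labels))
```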


Multivariate Outlier Removing for the Risk Prediction of Gas Leakage based Methane Gas (메탄 가스 기반 가스 누출 위험 예측을 위한 다변량 특이치 제거)

  • Dashdondov, Khongorzul;Kim, Mi-Hye
    • Journal of the Korea Convergence Society, v.11 no.12, pp.23-30, 2020
  • In this study, the relationship between natural gas (NG) data and gas-related environmental elements was analyzed using machine learning algorithms to predict the level of gas leakage risk without directly measuring gas leakage data. The study was based on open data provided by a server using the specification of the IoT-based, remotely controlled Picarro gas sensor. Natural gas leaking into the air is a serious problem for air pollution, the environment, and health. The proposed method is a multivariate outlier removal method combined with Random Forest (RF) classification for predicting the risk of an NG leak. After unsupervised k-means clustering, the experimental dataset was imbalanced, so we focused on whether the proposed models could predict the medium- and high-risk classes well. We compared the receiver operating characteristic (ROC) curve, accuracy, area under the ROC curve (AUC), and mean squared error (MSE) for each classification model. In our experiments, the proposed MOL_RF achieved an accuracy of 99.71%, an AUC of 99.57%, and an MSE of 0.0016.
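A hedged sketch of the general pipeline described (here using Mahalanobis-distance-based multivariate outlier removal followed by Random Forest classification; the data, features, and cutoff are placeholders, not the paper's MOL_RF method):

```python
import numpy as np
from scipy.stats import chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 6))                                    # placeholder sensor features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=2000) > 1).astype(int)  # placeholder risk label

# Multivariate outlier removal via Mahalanobis distance (chi-square cutoff).
mu, cov = X.mean(axis=0), np.cov(X, rowvar=False)
inv_cov = np.linalg.inv(cov)
d2 = np.einsum("ij,jk,ik->i", X - mu, inv_cov, X - mu)
keep = d2 < chi2.ppf(0.975, df=X.shape[1])
X, y = X[keep], y[keep]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```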

Visual Model of Pattern Design Based on Deep Convolutional Neural Network

  • Jingjing Ye;Jun Wang
    • KSII Transactions on Internet and Information Systems (TIIS), v.18 no.2, pp.311-326, 2024
  • The rapid development of neural network technology is promoting big-data-driven neural network models that can overcome the texture effects of complex objects. Because of the limitations of complex scenes, it is necessary to establish custom template matching and apply it across many fields of computer vision research. Such models depend only weakly on high-quality, small, labeled sample databases, while machine learning systems that rely on deep feature connections still perform texture-effect inference relatively poorly. The style-transfer algorithm based on neural networks collects and preserves pattern data and extracts and modernizes pattern features; through the algorithm model, it becomes easier to render the texture and color of patterns and display them digitally. In this paper, following the texture-effect reasoning of custom template matching, the 3D visualization of the target is transformed into a 3D model. The similarity between the scene to be inferred and the user-defined template is computed from a user-defined template of multi-dimensional external feature labels. A convolutional neural network is adopted to optimize the external area of the object, improving the sampling quality and computational performance of the sample pyramid structure. The results indicate that the proposed algorithm can accurately capture the salient target, suppress ablation noise, and improve the visualization results. The proposed deep convolutional neural network optimization algorithm shows good speed, data accuracy, and robustness; it can adapt to more task scenes, display the redundant vision-related information of image conversion, and further improve the computational efficiency and accuracy of convolutional networks, which is of high significance for research on image information conversion.
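The abstract stays high level, but the texture/style matching it alludes to is commonly based on Gram matrices of convolutional feature maps; a minimal numpy sketch of that descriptor (feature maps and template are random placeholders, and this is not claimed to be the paper's algorithm) is:

```python
import numpy as np

rng = np.random.default_rng(5)

# Placeholder convolutional feature maps for a pattern image: (channels, H, W).
features = rng.normal(size=(16, 32, 32))

# The Gram matrix of channel correlations is the usual texture/style descriptor
# in neural style transfer; templates can be matched by comparing Gram matrices.
F = features.reshape(16, -1)
gram = F @ F.T / F.shape[1]

template_gram = gram + rng.normal(scale=0.01, size=gram.shape)   # hypothetical template
similarity = -np.linalg.norm(gram - template_gram)               # higher = closer match
print(gram.shape, round(float(similarity), 4))
```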

Using Big Data and Small Data to Understand Linear Parks - Focused on the 606 Trail, USA and Gyeongchun Line Forest, Korea - (빅데이터와 스몰데이터로 본 선형공원 - 시카고 606 트레일과 서울 경춘선 숲길을 중심으로 -)

  • Sim, Ji-Soo;Oh, Chang Song
    • Journal of the Korean Institute of Landscape Architecture, v.48 no.5, pp.28-41, 2020
  • This study selects two linear parks representing different cultures and reveals the differences between them using a visitor survey as small data and social media analytics as big data, based on the three components of the model of landscape perception. The 606 in Chicago, U.S., and the Gyeongchun Line Forest in Seoul, Korea, are representative parks built on former railroads. A total of 505 surveys were collected from these parks, and the responses were analyzed using descriptive statistics, principal component analysis, and linear regression. In addition, more than 20,000 tweets mentioning each of the two linear parks were collected; using those tweets, the authors conducted clustering analysis and drew bigram network diagrams to identify and compare the placeness of each park. The results suggest that a more diverse design concept is linked to less diversity in behavior; that half of the park users use the park as a shortcut; and that the same physical exercise provides different benefits depending on the park. Social media analysis showed that the 606 is more closely related to its neighborhoods than the Gyeongchun Line Forest is, while the Gyeongchun Line Forest was a more event-related place than the 606.
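A minimal sketch of the tweet bigram counting that feeds such a network diagram (the example tweets are placeholders; the study itself used more than 20,000 tweets per park):

```python
from collections import Counter
from itertools import pairwise   # Python 3.10+

tweets = [
    "walked the 606 trail with neighbors this morning",
    "606 trail connects our neighborhood parks",
    "evening run on the 606 trail",
]

bigrams = Counter()
for tweet in tweets:
    tokens = tweet.lower().split()
    bigrams.update(pairwise(tokens))     # adjacent word pairs become network edges

# The most frequent bigrams become the strongest edges in the network diagram.
print(bigrams.most_common(5))
```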

A Collaborative Filtering-based Recommendation System with Relative Classification and Estimation Revision based on Time (상대적 분류 방법과 시간에 따른 평가값 보정을 적용한 협력적 필터링 기반 추천 시스템)

  • Lee, Se-Il;Lee, Sang-Yong
    • Journal of the Korean Institute of Intelligent Systems, v.20 no.2, pp.189-194, 2010
  • Collaborative filtering is widely used in recommendation systems that recommend services to a specific user based on other users' ratings. Such systems, however, have two problems: exact classification is not possible because the specific user is assigned to an already-formed group during clustering, and inaccurate recommendations can result when users' rating values contain large errors. In this paper, to increase rating accuracy, we propose a recommendation system that applies collaborative filtering after reclassifying users on the basis of the specific user's classification items and then finding and correcting the ratings of users that fall beyond a time threshold. Instead of assigning the specific user to an already-formed group during clustering, the group is reorganized around that user. In addition, we correct the rating information by cutting off the bottom 10% relative to a trimmed mean of the samples and then applying time-based weights to the remaining data. Experiments showed that the proposed method produced about 14.9% more accurate results in terms of MAE than the general collaborative filtering method.
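A simplified sketch of the rating-correction step described above (trim ratings below the bottom 10% relative to a trimmed mean, then down-weight old ratings); the exponential decay and all values are illustrative assumptions, not the paper's exact scheme:

```python
import numpy as np
from scipy.stats import trim_mean

# Hypothetical ratings (1-5) with ages in days since the rating was made.
ratings = np.array([4.0, 5.0, 1.0, 3.0, 4.5, 2.0, 5.0, 3.5])
age_days = np.array([10, 400, 5, 30, 200, 700, 15, 60])

# Trimmed mean of the sample, then drop ratings in the bottom 10%.
center = trim_mean(ratings, proportiontocut=0.1)
cutoff = np.quantile(ratings, 0.10)
mask = ratings >= cutoff

# Apply a simple exponential time decay to the remaining ratings (assumed form).
weights = np.exp(-age_days[mask] / 365.0)
corrected = np.sum(weights * ratings[mask]) / np.sum(weights)
print(round(center, 2), round(corrected, 2))
```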

A Methodology for Estimating Large Scale Dynamic O/D of Commuter Working Trip (대규모 동적 O/D 생성을 위한 추정 방법론 연구: 첨두 출근통행을 기준으로)

  • HAN, He;HONG, Kiman;KIM, Taegyun;WHANG, Junmun;HONG, Young Suk;CHO, Joong Rae
    • Journal of Korean Society of Transportation, v.36 no.3, pp.203-215, 2018
  • This study suggests a method to construct a large-scale dynamic O/D matrix that reflects how passengers' travel patterns change according to the land-use patterns of the destination. Existing research on dynamic O/D estimation has limitations, such as data collection difficulties that restrict analyses to small areas or to specific networks such as highways or public transportation. In this paper, we propose a method to estimate dynamic O/D without limiting the analysis area, based on transportation data that can be easily collected and used in the big data era. Clustering analysis was used to calculate the departure-time trip distribution ratio based on arrival time, and a departure-time trip distribution function was estimated for each cluster. A comparison test against survey data showed that the estimated distribution function was statistically significant.
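A rough sketch of the clustering step (grouping destination zones by their arrival-time profiles, then taking a per-cluster distribution as the departure-time trip distribution ratio; the data, time bins, and cluster count are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Hypothetical arrival-time histograms (share of commuters arriving in each
# 30-minute bin between 06:00 and 10:00) for 100 destination zones.
profiles = rng.dirichlet(alpha=np.ones(8), size=100)

# Cluster zones with similar arrival-time patterns.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(profiles)

# The mean profile of each cluster serves as its trip distribution ratio
# (a fuller model would shift it by travel time to get departure times).
for c in range(4):
    print(c, np.round(profiles[labels == c].mean(axis=0), 3))
```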

Online news-based stock price forecasting considering homogeneity in the industrial sector (산업군 내 동질성을 고려한 온라인 뉴스 기반 주가예측)

  • Seong, Nohyoon;Nam, Kihwan
    • Journal of Intelligence and Information Systems, v.24 no.2, pp.1-19, 2018
  • Since forecasting stock movements is an important issue both academically and practically, studies related to stock price prediction have been actively conducted. Stock price forecasting research is classified by structured and unstructured data, and in more detail into technical analysis, fundamental analysis, and media-effect analysis. In the big data era, research on stock price prediction that combines big data is actively underway, and prediction research based on large amounts of data mainly relies on machine learning techniques. In particular, methods that incorporate media effects have recently attracted attention, and among them, studies that analyze online news and use it to forecast stock prices are becoming mainstream. Previous studies predicting stock prices through online news mostly perform sentiment analysis of the news, build a separate corpus for each company, and construct a dictionary that predicts stock prices by recording responses to past stock prices. Such studies therefore examine the impact of online news on an individual company; for example, stock movements of Samsung Electronics are predicted only with online news about Samsung Electronics. More recently, methods that consider influences among highly related companies have also been studied; for example, stock movements of Samsung Electronics are predicted with news about Samsung Electronics and a highly related company such as LG Electronics. These studies examine the effect on an individual company of news from an industrial sector assumed to be homogeneous, where homogeneous industries are defined by the Global Industry Classification Standard. In other words, existing studies assume that sectors defined by the Global Industry Classification Standard are homogeneous. However, they do not account for influential, highly related companies or for heterogeneity within the same sector, and our examination of various sectors shows that some sectors are not homogeneous groups. To overcome this limitation, our study proposes a methodology that reflects the heterogeneous effects of the industrial sector on stock prices by applying k-means clustering. Multiple Kernel Learning is mainly used to integrate data with varied characteristics: it uses several kernels, each of which receives and predicts different data. To incorporate the effects of the target firm and its related firms simultaneously, we used Multiple Kernel Learning, assigning each kernel to predict stock prices with variables from the financial news of an industrial group obtained by k-means cluster analysis around the target firm. To show that the suggested methodology is appropriate, experiments were conducted on three years of online news and stock prices. The results are as follows. (1) We confirmed that information about the industrial sectors related to the target company also contains meaningful information for predicting the target company's stock movements, and that the machine learning algorithm has better predictive power when the news of the related companies is considered together with the target company's news.
(2) It is important to predict stock movements with a number of clusters that varies with the level of homogeneity in the industrial sector. In other words, when stock prices within a sector are homogeneous, it is important to use the relational effect at the level of the industry group, without clustering or with a small number of clusters; when stock prices within a sector are heterogeneous, it is important to cluster the firms into groups. This study contributes by showing that firms classified under the Global Industry Classification Standard can be heterogeneous, and by suggesting that relevance should be defined through machine learning and statistical analysis rather than simply by the Global Industry Classification Standard. It also contributes by demonstrating the efficiency of a prediction model that reflects this heterogeneity.
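A hedged sketch of the core idea (cluster related firms with k-means, then combine per-group kernels into one precomputed kernel for an SVM; the features, labels, kernel weights, and cluster assignment are placeholders, and this is a simple weighted-sum stand-in for the paper's Multiple Kernel Learning formulation):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(3)

# Placeholder daily news-sentiment features: target firm + firms in its sector.
X_target = rng.normal(size=(300, 10))
X_sector = rng.normal(size=(300, 10))
y = (rng.random(300) > 0.5).astype(int)          # placeholder up/down label

# Group sector firms with k-means (here simply on the firms' feature columns).
firm_clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_sector.T)
groups = [X_sector[:, firm_clusters == c] for c in range(2)]

# One kernel per data source; the combined kernel is a weighted sum (simple MKL).
kernels = [rbf_kernel(X_target)] + [rbf_kernel(g) for g in groups]
weights = np.ones(len(kernels)) / len(kernels)
K = sum(w * k for w, k in zip(weights, kernels))

clf = SVC(kernel="precomputed").fit(K, y)
print("train accuracy:", clf.score(K, y))
```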

Cluster-head-selection-algorithm in Wireless Sensor Networks by Considering the Distance (무선 센서네트워크에서 거리를 고려한 클러스터 헤드 선택 알고리즘)

  • Kim, Byung-Joon;Yoo, Sang-Shin
    • Journal of the Korea Society of Computer and Information, v.13 no.4, pp.127-132, 2008
  • Wireless sensor network technologies applicable to various industrial fields are growing rapidly. Because it is difficult to replace batteries in an already-deployed wireless sensor network, energy-efficient design is critical, and a number of studies have examined energy-efficient routing protocols for this purpose. A sensor network consumes energy in proportion to the data transmission distance and the amount of data to send. Cluster-based routing protocols such as LEACH-C achieve energy efficiency by minimizing the data transmission distance. In LEACH-C, however, only the total distance among the nodes composing each cluster is considered when forming clusters. This paper examines a cluster-head-selection algorithm that also reflects the distance between the base station and the cluster head, which has a big influence on energy consumption. The proposed method improved performance by 4~7% on average over LEACH-C when the base station is located beyond a certain distance, showing that the distance between the cluster head and the base station has a substantial influence on lifetime performance in cluster-based routing protocols.
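A toy sketch of the selection idea (scoring candidate cluster heads by distance to the base station and intra-cluster distance, discounted by residual energy; the weighting, field geometry, and single-cluster setup are assumptions for illustration, not the paper's exact algorithm):

```python
import numpy as np

rng = np.random.default_rng(4)

nodes = rng.uniform(0, 100, size=(50, 2))        # sensor positions on a 100x100 field
energy = rng.uniform(0.5, 1.0, size=50)          # residual energy per node
base_station = np.array([150.0, 50.0])           # BS placed outside the field

def head_score(i, members):
    """Lower is better: intra-cluster distance plus distance to the BS,
    discounted by residual energy (illustrative weighting)."""
    intra = np.linalg.norm(nodes[members] - nodes[i], axis=1).sum()
    to_bs = np.linalg.norm(nodes[i] - base_station)
    return (intra + to_bs) / energy[i]

members = np.arange(len(nodes))                  # single cluster for simplicity
head = min(members, key=lambda i: head_score(i, members))
print("selected cluster head:", head, nodes[head])
```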


Extended Information Entropy via Correlation for Autonomous Attribute Reduction of BigData (빅 데이터의 자율 속성 감축을 위한 확장된 정보 엔트로피 기반 상관척도)

  • Park, In-Kyu
    • Journal of Korea Game Society, v.18 no.1, pp.105-114, 2018
  • The various data analysis methods used for customer-type analysis are very important for game companies to understand customer types and characteristics, so that they can plan customized content and provide more convenient services. In this paper, we propose a k-modes cluster analysis algorithm that uses information uncertainty, extending information entropy in order to reduce information loss. The similarity of attributes is therefore measured in two respects: one measures the uncertainty between each attribute and the center of each partition, and the other measures the uncertainty of the probability distribution of each attribute's uncertainty. In particular, attribute uncertainty is considered on both non-probabilistic and probabilistic scales, because the entropy of an attribute is transformed into probabilistic information to measure uncertainty. The accuracy of the algorithm is demonstrated through cluster analysis results based on optimal initial values, extensive performance analysis, and various indexes.
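A small sketch of the entropy idea for categorical attributes (weighting a simple matching dissimilarity by each attribute's entropy so that more informative attributes count more; this illustrates the general direction, not the paper's exact extended-entropy measure):

```python
import numpy as np
from collections import Counter

def attribute_entropy(column):
    """Shannon entropy of one categorical attribute."""
    counts = np.array(list(Counter(column).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def weighted_dissimilarity(x, mode, entropies):
    """Matching dissimilarity between a record and a cluster mode,
    weighted by attribute entropy (illustrative)."""
    mismatch = np.array([xi != mi for xi, mi in zip(x, mode)], dtype=float)
    return float(np.dot(mismatch, entropies))

# Hypothetical categorical customer records: [segment, spend_level, churned]
data = [["A", "low", "yes"], ["A", "high", "no"], ["B", "low", "yes"], ["B", "low", "no"]]
entropies = np.array([attribute_entropy([row[j] for row in data]) for j in range(3)])
mode = ["A", "low", "yes"]
print([round(weighted_dissimilarity(row, mode, entropies), 3) for row in data])
```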