• Title/Summary/Keyword: software clustering

Dynamic Subspace Clustering for Online Data Streams (온라인 데이터 스트림에서의 동적 부분 공간 클러스터링 기법)

  • Park, Nam Hun
    • Journal of Digital Convergence / v.20 no.2 / pp.217-223 / 2022
  • Subspace clustering for online data streams requires a large amount of memory because all subsets of the data dimensions must be examined. To track the continuous change of clusters in a data stream within a finite memory space, this paper proposes a grid-based subspace clustering algorithm that uses memory resources effectively. Given an n-dimensional data stream, the distribution of data items in the data space is monitored by a grid-cell list. When the frequency of data items in a first-level grid-cell becomes high enough for it to become a unit grid-cell, the grid-cell list of the next level is created as a child node in order to find clusters of all possible subspaces from that grid-cell. In this way, a subspace grid-cell tree of at most n levels is constructed, and a k-dimensional subspace cluster can be found at the kth level of the tree. Experiments confirmed that the proposed method uses computing resources more efficiently by expanding only the dense space while maintaining the same accuracy as the existing method.
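The level-by-level expansion described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class names, the fixed `cell_width`, the absolute `dense_count` threshold (in place of a frequency ratio), and the rule of expanding only higher-numbered dimensions to avoid duplicate subspaces are all assumptions, and any decay of old stream items is omitted.

```python
class GridNode:
    """One grid-cell: a hit counter plus an optional next-level grid-cell list."""

    def __init__(self):
        self.count = 0
        self.children = None  # created only once the cell becomes dense


class SubspaceGridTree:
    """Hypothetical subspace grid-cell tree of at most n_dims levels."""

    def __init__(self, n_dims, cell_width, dense_count):
        self.n_dims = n_dims
        self.cell_width = cell_width
        self.dense_count = dense_count  # assumed absolute density threshold
        self.root = {}  # (dimension, cell index) -> GridNode

    def _cell(self, value):
        return int(value // self.cell_width)

    def insert(self, point):
        self._update(self.root, point, 0)

    def _update(self, level, point, start_dim):
        for dim in range(start_dim, self.n_dims):
            key = (dim, self._cell(point[dim]))
            node = level.setdefault(key, GridNode())
            node.count += 1
            if node.children is None:
                # Expand only dense cells, so sparse space costs no extra memory.
                if node.count >= self.dense_count:
                    node.children = {}
            else:
                # The cell is already dense: refine it in the remaining dimensions.
                self._update(node.children, point, dim + 1)

    def dense_cells(self, level=None, prefix=()):
        """Yield each dense cell as a path of (dimension, cell index) pairs."""
        level = self.root if level is None else level
        for key, node in level.items():
            if node.count >= self.dense_count:
                path = prefix + (key,)
                yield path
                if node.children:
                    yield from self.dense_cells(node.children, path)


tree = SubspaceGridTree(n_dims=2, cell_width=1.0, dense_count=3)
for _ in range(6):
    tree.insert((0.5, 0.5))
found = set(tree.dense_cells())
```

In this sketch a path of length k corresponds to a dense cell in a k-dimensional subspace, which mirrors the statement that k-dimensional clusters appear at the kth level of the tree.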

Distance Measures in HMM Clustering for Large-scale On-line Chinese Character Recognition (대용량 온라인 한자 인식을 위한 클러스터링 거리계산 척도)

  • Kim, Kwang-Seob;Ha, Jin-Young
    • Journal of KIISE:Software and Applications / v.36 no.9 / pp.683-690 / 2009
  • One of the major problems that prevents us from building a good HMM-based recognition system for large-scale on-line Chinese character recognition is increasing recognition time. In this paper, we propose a clustering method to solve the recognition speed problem, together with an efficient distance measure between HMMs. In our experiments, recognition was about twice as fast, with a 10-candidate recognition accuracy of 95.37%, a decrease of only 0.9%, for the 20,902 Chinese characters defined in Unicode CJK Unified Ideographs.

A Function Approximation Method for Q-learning of Reinforcement Learning (강화학습의 Q-learning을 위한 함수근사 방법)

  • 이영아;정태충
    • Journal of KIISE:Software and Applications / v.31 no.11 / pp.1431-1438 / 2004
  • Reinforcement learning learns policies for accomplishing a task's goal from experience gained through interaction between an agent and its environment. Q-learning, a basic algorithm of reinforcement learning, suffers from the curse of dimensionality and from slow learning in the early stage of learning. To solve these problems, new function approximation methods suitable for reinforcement learning should be studied. In this paper, we propose the Fuzzy Q-Map algorithm, which is based on online fuzzy clustering. Fuzzy Q-Map is a function approximation method suited to reinforcement learning that can learn on-line and express the uncertainty of the environment. We ran an experiment on the mountain car problem with Fuzzy Q-Map, and the results show that learning is accelerated in the early stage.

An Ensemble Clustering Algorithm based on a Prior Knowledge (사전정보를 활용한 앙상블 클러스터링 알고리즘)

  • Ko, Song;Kim, Dae-Won
    • Journal of KIISE:Software and Applications / v.36 no.2 / pp.109-121 / 2009
  • Although prior knowledge can improve clustering performance, the benefit depends on how it is used. In particular, when prior knowledge is used to construct the initial centroids of cluster groups, the similarities among the prior-knowledge objects must be taken into account. Even when some prior-knowledge objects share the same label, objects with low similarity to one another should be separated. Separating them keeps the initial centroids from being affected by collisions between objects with low similarity. The separated prior knowledge can then be used in various ways, such as in different initializations. To apply association rules, the proposed method creates a sufficient number of cluster groups, so that the centroids of the initial groups can be constructed from the separated prior knowledge. The ensemble of the resulting clusterings outperforms the approach in which the prior knowledge is not separated.

Optimization Driven MapReduce Framework for Indexing and Retrieval of Big Data

  • Abdalla, Hemn Barzan;Ahmed, Awder Mohammed;Al Sibahee, Mustafa A.
    • KSII Transactions on Internet and Information Systems (TIIS) / v.14 no.5 / pp.1886-1908 / 2020
  • With technical advances, the amount of big data is increasing day by day, to the point that traditional software tools struggle to handle it. In addition, the presence of imbalanced data in big data is a major concern for researchers. To manage big data effectively and to deal with imbalanced data, this paper proposes a new indexing algorithm for retrieving big data in the MapReduce framework. In the mappers, data clustering is performed with the Sparse Fuzzy-c-means (Sparse FCM) algorithm. The reducer combines the clusters generated by the mappers and clusters the data again with the Sparse FCM algorithm. Two-level query matching is then used to locate the requested data: the first level determines the cluster, and the second level accesses the requested data within it. The data is ranked using the proposed Monarch chaotic whale optimization algorithm (M-CWOA), which combines Monarch butterfly optimization (MBO) [22] and the chaotic whale optimization algorithm (CWOA) [21]. The Parametric Enabled-Similarity Measure (PESM) is adapted for matching the similarities between two datasets. The proposed M-CWOA outperformed other methods, with a maximal precision of 0.9237, recall of 0.9371, and F1-score of 0.9223.

XML Documents Clustering Technique Based on Bit Vector (비트벡터에 기반한 XML 문서 군집화 기법)

  • Kim, Woo-Saeng
    • Journal of the Institute of Electronics Engineers of Korea CI / v.47 no.5 / pp.10-16 / 2010
  • XML is increasingly important in data exchange and information management. A great deal of effort has been spent on developing efficient techniques for accessing, querying, and storing XML documents. In this paper, we propose a new method for clustering XML documents efficiently. Each XML document is represented by a bit vector, and the similarity between two XML documents is measured by a bit-wise AND operation between their corresponding bit vectors. The experiment shows that clusters are formed well and efficiently when a bit vector is used as the feature of an XML document.
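As a rough illustration of the bit-vector idea, the sketch below assumes that each document has already been reduced to a list of element paths and that every distinct path is assigned one bit position. The abstract does not say how bits are assigned or how the AND result is scored, so the function names, the path-based encoding, and the normalization by the bit-wise OR (a Jaccard-style ratio) are assumptions made for this example.

```python
def build_vocabulary(docs):
    """Assign one bit position to each distinct path seen in any document."""
    vocab = {}
    for paths in docs:
        for path in paths:
            vocab.setdefault(path, len(vocab))
    return vocab


def to_bit_vector(paths, vocab):
    """Encode a document as an integer whose set bits mark the paths it contains."""
    bits = 0
    for path in paths:
        bits |= 1 << vocab[path]
    return bits


def similarity(a, b):
    """Shared bits (bit-wise AND) divided by all bits in either vector (assumed scoring)."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


# Hypothetical documents, each given as the element paths it contains.
docs = [
    ["/book/title", "/book/author"],
    ["/book/title", "/book/price"],
    ["/cd/artist"],
]
vocab = build_vocabulary(docs)
vectors = [to_bit_vector(paths, vocab) for paths in docs]
```

Because each comparison is a single AND over machine words, pairwise similarities are cheap to compute, which is consistent with the efficiency claim in the abstract. Any standard clustering method could then be run on these similarities.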

Deconstructing Opinion Survey: A Case Study

  • Alanazi, Entesar
    • International Journal of Computer Science & Network Security / v.21 no.4 / pp.52-58 / 2021
  • Questionnaires and surveys are increasingly being used to collect information from participants of empirical software engineering studies. Usually, such data is analyzed using statistical methods to show an overall picture of participants' agreement or disagreement. In general, the whole survey population is considered as one group with some methods to extract varieties. Sometimes, there are different opinions in the same group, but they are not well discovered. In some cases of the analysis, the population may be divided into subgroups according to some data. The opinions of different segments of the population may be the same. Even though the existing approach can capture the general trends, there is a risk that the opinions of different sub-groups are lost. The problem becomes more complex in longitudinal studies where minority opinions might fade over time. Longitudinal survey data may include several interesting patterns that can be extracted using a clustering process. It can discover new information and give attention to different opinions. We suggest using a data mining approach to finding the diversity among the different groups in longitudinal studies. Our study shows that diversity can be revealed and tracked over time using the clustering approach, and the minorities have an opportunity to be heard.

Deconstructing Agile Survey to Identify Agile Skeptics

  • Entesar Alanazi;Mohammad Mahdi Hassan
    • International Journal of Computer Science & Network Security / v.24 no.3 / pp.201-210 / 2024
  • In empirical software engineering research, questionnaires and surveys are increasingly used to collect information from practitioners. Typically, such data is then analyzed using overall, descriptive statistics, which treat the whole survey population as a single group and apply some sampling techniques to extract varieties. In some cases, the population is also partitioned into sub-groups based on background information. However, this does not reveal opinion diversity properly, because similar opinions can exist in different segments of the population, whereas people within the same group might hold different opinions. Even though the existing approach can capture the general trends, there is a risk that the opinions of different sub-groups are lost. The problem becomes more complex in longitudinal studies, where minority opinions might fade or resolve over time. Survey-based longitudinal data may contain patterns that can be extracted through a clustering process, which may reveal new information and draw attention to alternative perspectives. We suggest using a data mining approach to find the diversity among the different groups in longitudinal studies (agile skeptics). Our study shows that diversity can be revealed and tracked over time using a clustering approach, and that minorities have an opportunity to be heard.

Distributed data deduplication technique using similarity based clustering and multi-layer bloom filter (SDS 환경의 유사도 기반 클러스터링 및 다중 계층 블룸필터를 활용한 분산 중복제거 기법)

  • Yoon, Dabin;Kim, Deok-Hwan
    • The Journal of Korean Institute of Next Generation Computing / v.14 no.5 / pp.60-70 / 2018
  • Software-defined storage (SDS) is being deployed in cloud environments to allow multiple users to virtualize physical servers, but a solution is needed for optimizing space efficiency with limited physical resources. In a conventional data deduplication system, it is difficult to deduplicate redundant data uploaded to distributed storage. In this paper, we propose a distributed deduplication method using similarity-based clustering and a multi-layer Bloom filter. A Rabin hash is applied to determine the degree of similarity between virtual machine servers and to cluster similar virtual machines, which improves deduplication efficiency compared with deduplicating on individual storage nodes. In addition, a multi-layer Bloom filter is incorporated into the deduplication process to shorten processing time by reducing the number of false positives. Experimental results show that the proposed method improves the deduplication ratio by 9% compared with a deduplication method using IP-address-based clusters, without any difference in processing time.
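The Bloom-filter part of this abstract can be illustrated with a small sketch. The abstract does not describe how the layers are arranged, so the design here is an assumption: each layer is an independent Bloom filter with its own hash salt, and a chunk is reported as a possible duplicate only if every layer agrees, which lowers the false-positive rate at the cost of extra memory. The class names, the SHA-256-based hashing, and the parameter values are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter: may report false positives, never false negatives."""

    def __init__(self, n_bits, n_hashes, salt=""):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.salt = salt
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions from salted SHA-256 digests.
        for i in range(self.n_hashes):
            prefix = f"{self.salt}:{i}:".encode()
            digest = hashlib.sha256(prefix + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


class LayeredBloomFilter:
    """Assumed layering: an item counts as present only if all layers agree."""

    def __init__(self, n_layers, n_bits, n_hashes):
        self.layers = [
            BloomFilter(n_bits, n_hashes, salt=f"layer{j}")
            for j in range(n_layers)
        ]

    def add(self, item):
        for layer in self.layers:
            layer.add(item)

    def might_contain(self, item):
        return all(layer.might_contain(item) for layer in self.layers)


seen = LayeredBloomFilter(n_layers=2, n_bits=1024, n_hashes=3)
seen.add(b"chunk-fingerprint-a")
is_duplicate = seen.might_contain(b"chunk-fingerprint-a")
```

In a deduplication pipeline, a negative answer means the chunk is definitely new and can be stored without consulting the full index, while a positive answer still has to be confirmed against the index.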

Identification of Microservices to Develop Cloud-Native Applications (클라우드네이티브 애플리케이션 구축을 위한 마이크로서비스 식별 방법)

  • Choi, Okjoo;Kim, Yukyong
    • Journal of Software Assessment and Valuation / v.17 no.1 / pp.51-58 / 2021
  • Microservices are not only developed independently but can also be run and deployed independently, which enables more flexible scaling and efficient collaboration in a cloud computing environment. This has led to a surge in migration to microservices-oriented application environments in recent years. To introduce microservices, the problem of identifying microservice units in a single application built with a single architecture must first be solved. In this paper, we propose an algorithm-based approach to identifying microservices from legacy systems. A graph is generated from the meta-information of the legacy code, and microservice candidates are extracted by applying a clustering algorithm. The modularization quality of the extracted candidates is evaluated using metrics. To validate the proposed method, candidate services are derived from the code of open-source software that is widely used for benchmarking, and their level of modularity is evaluated using metrics. The results show that microservices can be identified as smaller units and that module quality improves as a result.