• Title/Abstract/Keyword: Imprecise set


러프 엔트로피를 이용한 범주형 데이터의 클러스터링 (Clustering of Categorical Data using Rough Entropy)

  • 박인규
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • Vol. 13, No. 5
    • /
    • pp.183-188
    • /
    • 2013
  • Cluster analysis, which groups objects on the basis of similar features, is essential in data mining. However, for the categorical data found in many databases, existing partitioning approaches have limitations in handling the uncertainty between objects. In the partitioning of categorical data, the treatment of the uncertainty of equivalence classes arising from indiscernibility has been confined to the algebraic logic of rough sets, which degrades the stability and efficiency of the resulting algorithms. In this paper, to account for the dependency among attributes in categorical data, we define a rough entropy based on an information-theoretic measure and propose an algorithm called MMMR to extract the partitioning attributes. To analyze and compare the performance of the proposed method, we examine its comparative advantage over K-means, a fuzzy-based method, and an existing method using standard deviation, restricted to the ZOO data set. Using the ZOO data set, we compare the proposed algorithm with existing categorical clustering algorithms and verify its effectiveness.
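The partition-by-indiscernibility step described above can be sketched with a generic partition entropy over the equivalence classes an attribute induces. This is an illustrative stand-in, not the paper's rough entropy or its MMMR algorithm, and the toy `animals` records are invented (loosely in the spirit of the ZOO data set):

```python
import math
from collections import defaultdict

def equivalence_classes(objects, attribute):
    """Group objects that are indiscernible on the given attribute."""
    classes = defaultdict(list)
    for obj in objects:
        classes[obj[attribute]].append(obj)
    return list(classes.values())

def partition_entropy(objects, attribute):
    """Entropy of the partition induced by one categorical attribute."""
    n = len(objects)
    return -sum((len(c) / n) * math.log2(len(c) / n)
                for c in equivalence_classes(objects, attribute))

# Invented toy records; attribute values are categorical strings.
animals = [
    {"hair": "yes", "legs": "4"},
    {"hair": "yes", "legs": "2"},
    {"hair": "no",  "legs": "0"},
    {"hair": "no",  "legs": "0"},
]
print(partition_entropy(animals, "hair"))  # 1.0: an even two-way split
print(partition_entropy(animals, "legs"))  # 1.5: a finer partition
```

An attribute with higher partition entropy splits the objects more finely; an information-theoretic criterion of this kind is what lets the selection of partitioning attributes go beyond purely algebraic rough-set logic.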

Pre-Computation Based Selective Probing (PCSP) Scheme for Distributed Quality of Service (QoS) Routing with Imprecise State Information

  • Lee Won-Ick;Lee Byeong-Gi
    • Journal of Communications and Networks
    • /
    • Vol. 8, No. 1
    • /
    • pp.70-84
    • /
    • 2006
  • We propose a new distributed QoS routing scheme called pre-computation based selective probing (PCSP). The PCSP scheme is designed to provide an exact solution to the constrained optimization problem with moderate overhead, considering the practical environment where the state information available for the routing decision is not exact. It does not limit the number of probe messages; instead, it employs a qualitative (or conditional) selective probing approach. It considers both the cost and QoS metrics of the least-cost and best-QoS paths to calculate the end-to-end cost of the feasible paths found and to identify QoS-satisfying least-cost paths. It defines a strict probing condition that excludes not only non-feasible paths but also non-optimal paths. It additionally pre-computes the QoS variation, taking into account the impreciseness of the state information, and applies two modified QoS-satisfying conditions to the selection rules. This strict probing condition and the carefully designed probing approach make it possible to strictly limit the set of neighbor nodes involved in the probing process, thereby reducing the message overhead without sacrificing the optimality properties. However, the PCSP scheme may suffer from high message overhead in the worst case due to its conservative search process. In order to bound such message overhead, we extend the PCSP algorithm by applying additional quantitative heuristics. Computer simulations reveal that the PCSP scheme reduces message overhead and achieves an ideal success ratio with guaranteed optimal search. In addition, the quantitative extensions of the PCSP scheme turn out to bound the worst-case message overhead with only slight performance degradation.
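The core idea of probing only those neighbors that can still yield a feasible and optimal path can be sketched as a single predicate. The parameter names, the single delay metric, and the `margin` term that absorbs imprecise state information are illustrative assumptions, not the paper's actual probing conditions:

```python
def should_probe(cost_so_far, delay_so_far, link_cost, link_delay,
                 best_qos_delay_to_dest, least_cost_to_dest,
                 delay_bound, best_known_cost, margin=0.0):
    """Forward a probe over a link only if the extended path can still be
    (a) feasible: even the best-QoS continuation meets the delay bound, and
    (b) optimal: even the least-cost continuation beats the best known cost.
    `margin` inflates the pre-computed delay estimate to account for
    imprecise state information (an assumed, simplified mechanism)."""
    feasible = (delay_so_far + link_delay
                + best_qos_delay_to_dest * (1 + margin)) <= delay_bound
    optimal = cost_so_far + link_cost + least_cost_to_dest < best_known_cost
    return feasible and optimal
```

Pruning on both pre-computed bounds at once is what lets a scheme of this shape exclude non-feasible and non-optimal neighbors before any probe message is sent.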

이동객체 데이터베이스에서의 밀집 영역 연속 탐색 (Continuous Discovery of Dense Regions in the Database of Moving Objects)

  • 이영구;김원영
    • Journal of Internet Computing and Services
    • /
    • Vol. 9, No. 4
    • /
    • pp.115-131
    • /
    • 2008
  • Finding dense regions of the small mobile devices in everyday use, such as cell phones and PDAs, is an important problem with applications in many areas, such as monitoring troop assembly or vehicle movement. This paper proposes a new algorithm for continuously discovering dense regions over a large database of moving objects. We assume an environment in which moving objects, for reasons such as power saving, do not report their locations to the server periodically, but only when they deviate significantly from their expected locations. In this setting, the object locations maintained at the server cannot be pinpointed exactly and are represented probabilistically, and finding dense regions while accounting for the probabilistic distributions of a large number of moving objects is very expensive. This paper reduces the computational complexity by grouping nearby moving objects and treating all objects in the same group identically. Individual moving objects are examined in detail only when the density decision in the final result is ambiguous. We demonstrate the superiority of the proposed algorithm through extensive experiments on several data sets, and present sensitivity and scalability analyses.
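The group-then-refine idea, bounding a region's object count so that only ambiguous cases need expensive per-object inspection, can be sketched with per-object uncertainty rectangles. This is a simplified illustration, not the paper's probabilistic location model:

```python
def rect_inside(r, cell):
    """True if uncertainty rectangle r lies entirely within the cell."""
    (x1, y1, x2, y2), (cx1, cy1, cx2, cy2) = r, cell
    return cx1 <= x1 and cy1 <= y1 and x2 <= cx2 and y2 <= cy2

def rect_intersects(r, cell):
    """True if uncertainty rectangle r overlaps the cell at all."""
    (x1, y1, x2, y2), (cx1, cy1, cx2, cy2) = r, cell
    return x1 <= cx2 and cx1 <= x2 and y1 <= cy2 and cy1 <= y2

def classify_cell(rects, cell, threshold):
    """Bound the number of objects in a cell: objects fully inside give a
    lower bound, objects merely overlapping give an upper bound.  Only
    'ambiguous' cells would need a detailed per-object examination."""
    lower = sum(rect_inside(r, cell) for r in rects)
    upper = sum(rect_intersects(r, cell) for r in rects)
    if lower >= threshold:
        return "dense"
    if upper < threshold:
        return "sparse"
    return "ambiguous"
```

Most cells are resolved by the cheap bounds alone, which is the source of the complexity reduction the abstract describes.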


Hierarchical Clustering Approach of Multisensor Data Fusion: Application of SAR and SPOT-7 Data on Korean Peninsula

  • Lee, Sang-Hoon;Hong, Hyun-Gi
    • Korean Society of Remote Sensing: Conference Proceedings
    • /
    • Korean Society of Remote Sensing, Proceedings of the 2002 International Symposium on Remote Sensing
    • /
    • pp.65-65
    • /
    • 2002
  • In remote sensing, images are acquired over the same area by sensors of different spectral ranges (from the visible to the microwave) and/or with different numbers, positions, and widths of spectral bands. These images are generally partially redundant, as they represent the same scene, and partially complementary. For many applications of image classification, the information provided by a single sensor is often incomplete or imprecise, resulting in misclassification. Fusion with redundant data can draw more consistent inferences for the interpretation of the scene, and can thus improve classification accuracy. The common approach to the classification of multisensor data as a data fusion scheme at the pixel level is to concatenate the data into one vector as if they were measurements from a single sensor. The multiband data acquired by a single multispectral sensor or by two or more different sensors are not completely independent, and a certain degree of informative overlap may exist between the observation spaces of the different bands. This dependence may make the data less informative and should be properly modeled in the analysis so that its effect can be eliminated. For modeling and eliminating the effect of such dependence, this study employs a strategy using self and conditional information variation measures. The self information variation reflects the self certainty of the individual bands, while the conditional information variation reflects the degree of dependence between the different bands. One data set might be much less reliable than the others and may even exacerbate the classification results; such an unreliable data set should be excluded from the analysis. To account for this, the self information variation is utilized to measure the degree of reliability. A team of positively dependent bands can gather more information jointly than a team of independent ones. However, when bands are negatively dependent, the combined analysis of these bands may yield worse information. Using the conditional information variation measure, the multiband data are split into two or more subsets according to the dependence between the bands. Each subset is classified separately, and a data fusion scheme at the decision level is applied to integrate the individual classification results. In this study, a two-level algorithm using a hierarchical clustering procedure is used for unsupervised image classification. The hierarchical clustering algorithm is based on similarity measures between all pairs of candidates being considered for merging. In the first level, the image is partitioned into regions, which are sets of spatially contiguous pixels, such that no union of adjacent regions is statistically uniform. The regions resulting from the first level are then clustered into a parsimonious number of groups according to their statistical characteristics. The algorithm has been applied to satellite multispectral data and airborne SAR data.
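A minimal sketch of the conditional measure, using standard conditional entropy H(X|Y) over quantized band values as an assumed stand-in for the paper's conditional information variation:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy of a sequence of discrete (quantized) values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_information_variation(x, y):
    """H(X|Y) = H(X, Y) - H(Y): the uncertainty left in band X after
    observing band Y.  Small values indicate strongly dependent bands,
    which would be grouped into the same subset before classification."""
    return entropy(list(zip(x, y))) - entropy(y)
```

For example, two identical quantized bands give a conditional information variation of 0.0, while independent bands leave the full marginal entropy of X; thresholding this quantity over all band pairs is one way to split the data into the subsets the study classifies separately.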


유사도 알고리즘을 활용한 시맨틱 프로세스 검색방안 (Semantic Process Retrieval with Similarity Algorithms)

  • 이홍주
    • Asia Pacific Journal of Information Systems
    • /
    • Vol. 18, No. 1
    • /
    • pp.79-96
    • /
    • 2008
  • One of the roles of Semantic Web services is to execute dynamic intra-organizational services, including the integration and interoperation of business processes. Since different organizations design their processes differently, the retrieval of similar semantic business processes is necessary in order to support inter-organizational collaboration. Most approaches for finding services that have certain features and support certain business processes have relied on some type of logical reasoning and exact matching. This paper presents our approach of using imprecise matching to expand the results from an exact matching engine when querying the OWL (Web Ontology Language) MIT Process Handbook. The MIT Process Handbook is an electronic repository of best-practice business processes, intended to help people (1) redesign organizational processes, (2) invent new processes, and (3) share ideas about organizational practices. In order to use the MIT Process Handbook for process retrieval experiments, we had to export it into an OWL-based format: we model the Process Handbook meta-model in OWL and export the processes in the Handbook as instances of the meta-model. Next, we need a sizable number of queries and their corresponding correct answers in the Process Handbook. Many previous studies devised artificial data sets composed of randomly generated numbers without real meaning, and used subjective ratings for the correct answers and the similarity values between processes. To generate a semantics-preserving test data set, we create 20 variants of each target process that are syntactically different but semantically equivalent, using mutation operators. These variants represent the correct answers for the target process. We devise diverse similarity algorithms based on the values of process attributes and the structures of business processes.
We use simple similarity algorithms for text retrieval, such as TF-IDF and Levenshtein edit distance, to devise our approaches, and utilize a tree edit distance measure because semantic processes appear to have a graph structure. We also design similarity algorithms that consider the similarity of process structure, such as part processes, goals, and exceptions. Since we can identify relationships between a semantic process and its subcomponents, this information can be utilized for calculating similarities between processes. Dice's coefficient and the Jaccard similarity measure are utilized to calculate the portion of overlap between processes in diverse ways. We perform retrieval experiments to compare the performance of the devised similarity algorithms, measuring retrieval performance in terms of precision, recall, and the F measure, the harmonic mean of precision and recall. The tree edit distance shows the poorest performance on all measures. TF-IDF and the method combining TF-IDF with Levenshtein edit distance show better performance than the other devised methods; these two measures focus on the similarity between process names and descriptions. In addition, we calculate a rank correlation coefficient, Kendall's tau-b, between the number of process mutations and the ranking of similarity values among the mutation sets. In this experiment, similarity measures based on process structure, such as Dice's, Jaccard, and derivatives of these measures, show greater coefficients than measures based on the values of process attributes. However, the Lev-TFIDF-JaccardAll measure, which considers process structure and attribute values together, shows reasonably better performance in both experiments. For retrieving semantic processes, it is therefore better to consider diverse aspects of process similarity, such as process structure and the values of process attributes.
We generate semantic process data and a dataset for retrieval experiments from the MIT Process Handbook repository. We suggest imprecise query algorithms that expand the retrieval results from an exact matching engine such as SPARQL, and compare the retrieval performance of the similarity algorithms. As for limitations and future work, we need to perform experiments with other data sets from other domains. In addition, since diverse measures yield many similarity values, we may find better ways to identify relevant processes by applying these values simultaneously.
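The text-overlap measures the abstract names (Levenshtein edit distance, Jaccard, Dice) can be combined as sketched below. The `process_similarity` weighting and the process fields (`name`, `parts`) are illustrative assumptions, not the paper's Lev-TFIDF-JaccardAll measure:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def jaccard(s, t):
    """|intersection| / |union| over two collections, as sets."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t) if s | t else 1.0

def dice(s, t):
    """Dice's coefficient: 2|intersection| / (|s| + |t|)."""
    s, t = set(s), set(t)
    return 2 * len(s & t) / (len(s) + len(t)) if s or t else 1.0

def process_similarity(p, q, w_name=0.5, w_parts=0.5):
    """Blend name similarity (normalized edit distance) with structural
    overlap (Jaccard over subprocess sets).  Field names and weights are
    hypothetical, for illustration only."""
    name_sim = 1 - levenshtein(p["name"], q["name"]) / max(
        len(p["name"]), len(q["name"]), 1)
    return w_name * name_sim + w_parts * jaccard(p["parts"], q["parts"])
```

Blending an attribute-value measure with a structural-overlap measure in this way mirrors the abstract's conclusion that considering both aspects of process similarity works better than either alone.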