• Title/Summary/Keyword: data similarity

Search Result 2,098, Processing Time 0.025 seconds

Evaluation of Similarity Analysis of Newspaper Article Using Natural Language Processing

  • Ayako Ohshiro;Takeo Okazaki;Takashi Kano;Shinichiro Ueda
    • International Journal of Computer Science & Network Security
    • /
    • v.24 no.6
    • /
    • pp.1-7
    • /
    • 2024
  • Comparing text features involves evaluating the "similarity" between texts. It is crucial to use appropriate similarity measures when comparing similarities. This study utilized various techniques to assess the similarities between newspaper articles, including deep learning and a previously proposed method: a combination of Pointwise Mutual Information (PMI) and Word Pair Matching (WPM), denoted as PMI+WPM. For performance comparison, law data from medical research in Japan were utilized as validation data in evaluating the PMI+WPM method. The distribution of similarities in text data varies depending on the evaluation technique and genre, as revealed by the comparative analysis. For newspaper data, non-deep learning methods demonstrated better similarity evaluation accuracy than deep learning methods. Additionally, evaluating similarities in law data is more challenging than in newspaper articles. Despite deep learning being the prevalent method for evaluating textual similarities, this study demonstrates that non-deep learning methods can be effective regarding Japanese-based texts.

A Behavior-based Authentication Using the Measuring Cosine Similarity (코사인 유사도 측정을 통한 행위 기반 인증)

  • Gil, Seon-Woong;Lee, Ki-Young
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.20 no.4
    • /
    • pp.17-22
    • /
    • 2020
  • Behavior-based authentication technology, which is currently being researched a lot, requires a long extraction of a lot of data to increase the recognition rate of authentication compared to other authentication technologies. This paper uses the touch sensor and the gyroscope embedded in the smartphone in the Android environment to measure five times to the user to use only the minimum data that is essential among the behavior feature data used in the behavior-based authentication study. By requesting, a total of six behavior feature data were collected by touching the five touch screen, and the mean value was calculated from the changes in data during the next touch measurement to measure the cosine similarity between the value and the measured value. After generating the allowable range of cosine similarity by performing, we propose a user behavior based authentication method that compares the cosine similarity value of the authentication attempt data. Through this paper, we succeeded in demonstrating high performance from the first EER of 37.6% to the final EER of 1.9% by adjusting the threshold applied to the cosine similarity authentication range even in a small number of feature data and experimenter environments.

A Tolerant Rough Set Approach for Handwritten Numeral Character Classification

  • Kim, Daijin;Kim, Chul-Hyun
    • Proceedings of the Korean Institute of Intelligent Systems Conference
    • /
    • 1998.06a
    • /
    • pp.288-295
    • /
    • 1998
  • This paper proposes a new data classification method based on the tolerant rough set that extends the existing equivalent rough set. Similarity measure between two data is described by a distance function of all constituent attributes and they are defined to be tolerant when their similarity measure exceeds a similarity threshold value. The determination of optimal similarity theshold value is very important for the accurate classification. So, we determine it optimally by using the genetic algorithm (GA), where the goal of evolution is to balance two requirements such that (1) some tolerant objects are required to be included in the same class as many as possible. After finding the optimal similarity threshold value, a tolerant set of each object is obtained and the data set is grounded into the lower and upper approximation set depending on the coincidence of their classes. We propose a two-stage classification method that all data are classified by using the lower approxi ation at the first stage and then the non-classified data at the first stage are classified again by using the rough membership functions obtained from the upper approximation set. We apply the proposed classification method to the handwritten numeral character classification. problem and compare its classification performance and learning time with those of the feed forward neural network's back propagation algorithm.

  • PDF

Semi-supervised learning using similarity and dissimilarity

  • Seok, Kyung-Ha
    • Journal of the Korean Data and Information Science Society
    • /
    • v.22 no.1
    • /
    • pp.99-105
    • /
    • 2011
  • We propose a semi-supervised learning algorithm based on a form of regularization that incorporates similarity and dissimilarity penalty terms. Our approach uses a graph-based encoding of similarity and dissimilarity. We also present a model-selection method which employs cross-validation techniques to choose hyperparameters which affect the performance of the proposed method. Simulations using two types of dat sets demonstrate that the proposed method is promising.

Matrix-based Filtering and Load-balancing Algorithm for Efficient Similarity Join Query Processing in Distributed Computing Environment (분산 컴퓨팅 환경에서 효율적인 유사 조인 질의 처리를 위한 행렬 기반 필터링 및 부하 분산 알고리즘)

  • Yang, Hyeon-Sik;Jang, Miyoung;Chang, Jae-Woo
    • The Journal of the Korea Contents Association
    • /
    • v.16 no.7
    • /
    • pp.667-680
    • /
    • 2016
  • As distributed computing platforms like Hadoop MapReduce have been developed, it is necessary to perform the conventional query processing techniques, which have been executed in a single computing machine, in distributed computing environments efficiently. Especially, studies on similarity join query processing in distributed computing environments have been done where similarity join means retrieving all data pairs with high similarity between given two data sets. But the existing similarity join query processing schemes for distributed computing environments have a problem of skewed computing load balance between clusters because they consider only the data transmission cost. In this paper, we propose Matrix-based Load-balancing Algorithm for efficient similarity join query processing in distributed computing environment. In order to uniform load balancing of clusters, the proposed algorithm estimates expected computing cost by using matrix and generates partitions based on the estimated cost. In addition, it can reduce computing loads by filtering out data which are not used in query processing in clusters. Finally, it is shown from our performance evaluation that the proposed algorithm is better on query processing performance than the existing one.

Development of Similarity-Based Document Clustering System (유사성 계수에 의한 문서 클러스터링 시스템 개발)

  • Woo Hoon-Shik;Yim Dong-Soon
    • Proceedings of the Society of Korea Industrial and System Engineering Conference
    • /
    • 2002.05a
    • /
    • pp.119-124
    • /
    • 2002
  • Clustering of data is of a great interest in many data mining applications. In the field of document clustering, a document is represented as a data in a high dimensional space. Therefore, the document clustering can be accomplished with a general data clustering techniques. In this paper, we introduce a document clustering system based on similarity among documents. The developed system consists of three functions: 1) gatherings documents utilizing a search agent; 2) determining similarity coefficients between any two documents from term frequencies; 3) clustering documents with similarity coefficients. Especially, the document clustering is accomplished by a hybrid algorithm utilizing genetic and K-Means methods.

  • PDF

A Study on the CBR Pattern using Similarity and the Euclidean Calculation Pattern (유사도와 유클리디안 계산패턴을 이용한 CBR 패턴연구)

  • Yun, Jong-Chan;Kim, Hak-Chul;Kim, Jong-Jin;Youn, Sung-Dae
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.14 no.4
    • /
    • pp.875-885
    • /
    • 2010
  • CBR (Case-Based Reasoning) is a technique to infer the relationships between existing data and case data, and the method to calculate similarity and Euclidean distance is mostly frequently being used. However, since those methods compare all the existing and case data, it also has a demerit that it takes much time for data search and filtering. Therefore, to solve this problem, various researches have been conducted. This paper suggests the method of SE(Speed Euclidean-distance) calculation that utilizes the patterns discovered in the existing process of computing similarity and Euclidean distance. Because SE calculation applies the patterns and weight found during inputting new cases and enables fast data extraction and short operation time, it can enhance computing speed for temporal or spatial restrictions and eliminate unnecessary computing operation. Through this experiment, it has been found that the proposed method improves performance in various computer environments or processing rate more efficiently than the existing method that extracts data using similarity or Euclidean method does.

Operations on the Similarity Measures of Fuzzy Sets

  • Omran, Saleh;Hassaballah, M.
    • International Journal of Fuzzy Logic and Intelligent Systems
    • /
    • v.7 no.3
    • /
    • pp.205-208
    • /
    • 2007
  • Measuring the similarity between fuzzy sets plays a vital role in several fields. However, none of all well-known similarity measure methods is all-powerful, and all have the localization of its usage. This paper defines some operations on the similarity measures of fuzzy sets such as summation and multiplication of two similarity measures. Also, these operations will be generalized to any number of similarity measures. These operations will be very useful especially in the field of computer vision, and data retrieval because these fields need to combine and find some relations between similarity measures.

Similarity Search Algorithm Based on Hyper-Rectangular Representation of Video Data Sets (비디오 데이터 세트의 하이퍼 사각형 표현에 기초한 비디오 유사성 검색 알고리즘)

  • Lee, Seok-Lyong
    • The KIPS Transactions:PartD
    • /
    • v.11D no.4
    • /
    • pp.823-834
    • /
    • 2004
  • In this research, the similarity search algorithms are provided for large video data streams. A video stream that consists of a number of frames can be expressed by a sequence in the multidimensional data space, by representing each frame with a multidimensional vector By analyzing various characteristics of the sequence, it is partitioned into multiple video segments and clusters which are represented by hyper-rectangles. Using the hyper-rectangles of video segments and clusters, similarity functions between two video streams are defined, and two similarity search algorithms are proposed based on the similarity functions algorithms by hyper-rectangles and by representative frames. The former is an algorithm that guarantees the correctness while the latter focuses on the efficiency with a slight sacrifice of the correctness Experiments on different types of video streams and synthetically generated stream data show the strength of our proposed algorithms.

Development of a New Similarity Index to Compare Time-series Profile Data for Animal and Human Experiments (동물 및 임상 시험의 시계열 프로파일 데이터 비교를 위한 유사성 지수 개발)

  • Lee, Ye Gyoung;Lee, Hyun Jeong;Jang, Hyeon Ae;Shin, Sangmun
    • Journal of Korean Society for Quality Management
    • /
    • v.49 no.2
    • /
    • pp.145-159
    • /
    • 2021
  • Purpose: A statistical similarity evaluation to compare pharmacokinetics(PK) profile data between nonclinical and clinical experiments has become a significant issue on many drug development processes. This study proposes a new similarity index by considering important parameters, such as the area under the curve(AUC) and the time-series profile of various PK data. Methods: In this study, a new profile similarity index(PSI) by using the concept of a process capability index(Cp) is proposed in order to investigate the most similar animal PK profile compared to the target(i.e., Human PK profile). The proposed PSI can be calculated geometric and arithmetic means of all short term similarity indices at all time points on time-series both animal and human PK data. Designed simulation approaches are demonstrated for a verification purpose. Results: Two different simulation studies are conducted by considering three variances(i.e., small, medium, and large variances) as well as three different characteristic types(smaller the better, larger the better, nominal the best). By using the proposed PSI, the most similar animal PK profile compare to the target human PK profile can be obtained in the simulation studies. In addition, a case study represents differentiated results compare to existing simple statistical analysis methods(i.e., root mean squared error and quality loss). Conclusion: The proposed PSI can effectively estimate the level of similarity between animal, human PK profiles. By using these PSI results, we can reduce the number of animal experiments because we only focus on the significant animal representing a high PSI value.