• Title/Summary/Keyword: String Similarity

Search Result 47, Processing Time 0.021 seconds

Developing JSequitur to Study the Hierarchical Structure of Biological Sequences in a Grammatical Inference Framework of String Compression Algorithms

  • Galbadrakh, Bulgan;Lee, Kyung-Eun;Park, Hyun-Seok
    • Genomics & Informatics
    • /
    • v.10 no.4
    • /
    • pp.266-270
    • /
    • 2012
  • Grammatical inference methods are expected to find grammatical structures hidden in biological sequences. One hopes that studies of grammar serve as an appropriate tool for theory formation. Thus, we have developed JSequitur for automatically generating the grammatical structure of biological sequences in an inference framework of string compression algorithms. Our original motivation was to find any grammatical traits of several cancer genes that can be detected by string compression algorithms. Through this research, we could not find any meaningful unique traits of the cancer genes yet, but we could observe some interesting traits in regards to the relationship among gene length, similarity of sequences, the patterns of the generated grammar, and compression rate.

Word Similarity Calculation by Using the Edit Distance Metrics with Consonant Normalization

  • Kang, Seung-Shik
    • Journal of Information Processing Systems
    • /
    • v.11 no.4
    • /
    • pp.573-582
    • /
    • 2015
  • Edit distance metrics are widely used for many applications such as string comparison and spelling error corrections. Hamming distance is a metric for two equal length strings and Damerau-Levenshtein distance is a well-known metrics for making spelling corrections through string-to-string comparison. Previous distance metrics seems to be appropriate for alphabetic languages like English and European languages. However, the conventional edit distance criterion is not the best method for agglutinative languages like Korean. The reason is that two or more letter units make a Korean character, which is called as a syllable. This mechanism of syllable-based word construction in the Korean language causes an edit distance calculation to be inefficient. As such, we have explored a new edit distance method by using consonant normalization and the normalization factor.

Efficient Inverted List Search Technique using Bitmap Filters (비트맵 필터를 이용한 효율적인 역 리스트 탐색 기법)

  • Kwon, In-Teak;Kim, Jong-Ik
    • The KIPS Transactions:PartD
    • /
    • v.18D no.6
    • /
    • pp.415-422
    • /
    • 2011
  • Finding similar strings is an important operation because textual data can have errors, duplications, and inconsistencies by nature. Many algorithms have been developed for string approximate searches and most of them make use of inverted lists to find similar strings. These algorithms basically perform merge operations on inverted lists. In this paper, we develop a bitmap representation of an inverted list and propose an efficient search algorithm that can skip unnecessary inverted lists without searching using bitmap filters. Experimental results show that the proposed technique consistently improve the performance of the search.

The Recognition of Occluded 2-D Objects Using the String Matching and Hash Retrieval Algorithm (스트링 매칭과 해시 검색을 이용한 겹쳐진 이차원 물체의 인식)

  • Kim, Kwan-Dong;Lee, Ji-Yong;Lee, Byeong-Gon;Ahn, Jae-Hyeong
    • The Transactions of the Korea Information Processing Society
    • /
    • v.5 no.7
    • /
    • pp.1923-1932
    • /
    • 1998
  • This paper deals with a 2-D objects recognition algorithm. And in this paper, we present an algorithm which can reduce the computation time in model retrieval by means of hashing technique instead of using the binary~tree method. In this paper, we treat an object boundary as a string of structural units and use an attributed string matching algorithm to compute similarity measure between two strings. We select from the privileged strings a privileged string wIth mmimal eccentricity. This privileged string is treated as the reference string. And thell we wllstructed hash table using the distance between privileged string and the reference string as a key value. Once the database of all model strings is built, the recognition proceeds by segmenting the scene into a polygonal approximation. The distance between privileged string extracted from the scene and the reference string is used for model hypothesis rerieval from the table. As a result of the computer simulation, the proposed method can recognize objects only computing, the distance 2-3tiems, while previous method should compute the distance 8-10 times for model retrieval.

  • PDF

Similarity Measure based on XML Document's Structure and Contents (XML 문서의 구조와 내용을 고려한 유사도 측정)

  • Kim, Woo-Saeng
    • Journal of Korea Multimedia Society
    • /
    • v.11 no.8
    • /
    • pp.1043-1050
    • /
    • 2008
  • XML has become a standard for data representation and exchange on the Internet. With a large number of XML documents on the Web, there is an increasing need to automatically process those structurally rich documents for information retrieval, document management, and data mining applications. In this paper, we propose a new method to measure the similarity between XML documents by considering their structures and contents. The similarity of document's structure is found by a simple string matching technique and that of document's contents is found by weights taking into account of the names and positions of elements. The overall algorithm runs in time that is linear in the combined size of the two documents involved in comparison evaluation.

  • PDF

Efficient Hyperplane Generation Techniques for Human Activity Classification in Multiple-Event Sensors Based Smart Home (다중 이벤트 센서 기반 스마트 홈에서 사람 행동 분류를 위한 효율적 의사결정평면 생성기법)

  • Chang, Juneseo;Kim, Boguk;Mun, Changil;Lee, Dohyun;Kwak, Junho;Park, Daejin;Jeong, Yoosoo
    • IEMEK Journal of Embedded Systems and Applications
    • /
    • v.14 no.5
    • /
    • pp.277-286
    • /
    • 2019
  • In this paper, we propose an efficient hyperplane generation technique to classify human activity from combination of events and sequence information obtained from multiple-event sensors. By generating hyperplane efficiently, our machine learning algorithm classify with less memory and run time than the LSVM (Linear Support Vector Machine) for embedded system. Because the fact that light weight and high speed algorithm is one of the most critical issue in the IoT, the study can be applied to smart home to predict human activity and provide related services. Our approach is based on reducing numbers of hyperplanes and utilizing robust string comparing algorithm. The proposed method results in reduction of memory consumption compared to the conventional ML (Machine Learning) algorithms; 252 times to LSVM and 34,033 times to LSTM (Long Short-Term Memory), although accuracy is decreased slightly. Thus our method showed outstanding performance on accuracy per hyperplane; 240 times to LSVM and 30,520 times to LSTM. The binarized image is then divided into groups, where each groups are converted to binary number, in order to reduce the number of comparison done in runtime process. The binary numbers are then converted to string. The test data is evaluated by converting to string and measuring similarity between hyperplanes using Levenshtein algorithm, which is a robust dynamic string comparing algorithm. This technique reduces runtime and enables the proposed algorithm to become 27% faster than LSVM, and 90% faster than LSTM.

Exclusion of Non-similar Candidates using Positional Accuracy based on Levenstein Distance from N-best Recognition Results of Isolated Word Recognition (레벤스타인 거리에 기초한 위치 정확도를 이용한 고립 단어 인식 결과의 비유사 후보 단어 제외)

  • Yun, Young-Sun;Kang, Jeom-Ja
    • Phonetics and Speech Sciences
    • /
    • v.1 no.3
    • /
    • pp.109-115
    • /
    • 2009
  • Many isolated word recognition systems may generate non-similar words for recognition candidates because they use only acoustic information. In this paper, we investigate several techniques which can exclude non-similar words from N-best candidate words by applying Levenstein distance measure. At first, word distance method based on phone and syllable distances are considered. These methods use just Levenstein distance on phones or double Levenstein distance algorithm on syllables of candidates. Next, word similarity approaches are presented that they use characters' position information of word candidates. Each character's position is labeled to inserted, deleted, and correct position after alignment between source and target string. The word similarities are obtained from characters' positional probabilities which mean the frequency ratio of the same characters' observations on the position. From experimental results, we can find that the proposed methods are effective for removing non-similar words without loss of system performance from the N-best recognition candidates of the systems.

  • PDF

A Study on the Prediction of Drug Efficacy by Using Molecular Structure (분자구조 유사도를 활용한 약물 효능 예측 알고리즘 연구)

  • Jeong, Hwayoung;Song, Changhyeon;Cho, Hyeyoun;Key, Jaehong
    • Journal of Biomedical Engineering Research
    • /
    • v.43 no.4
    • /
    • pp.230-240
    • /
    • 2022
  • Drug regeneration technology is an efficient strategy than the existing new drug development process, which requires large costs and time by using drugs that have already been proven safe. In this study, we recognize the importance of the new drug regeneration aspect of new drug development and research in predicting functional similarities through the basic molecular structure that forms drugs. We test four string-based algorithms by using SMILES data and searching for their similarities. And by using the ATC codes, pair them with functional similarities, which we compare and validate to select the optimal model. We confirmed that the higher the molecular structure similarity, the higher the ATC code matching rate. We suggest the possibility of additional potency of random drugs, which can be predicted through data that give information on drugs with high molecular similarities. This model has the advantage of being a great combination with additional data, so we look forward to using this model in future research.

Modified Edit Distance Method for Finding Similar Words in Various Smartphone Keypad Environment (다양한 스마트폰 키패드 환경에서 유사 단어 검색을 위한 수정된 편집 거리 계산 방법)

  • Song, Yeong-Kil;Kim, Hark-Soo
    • The Journal of the Korea Contents Association
    • /
    • v.11 no.12
    • /
    • pp.12-18
    • /
    • 2011
  • Most smartphone use virtual keypads based on touch-pad. The virtual keypads often make typographical errors because of the physical limitations of device such as small screen and limited input methods. To resolve this problem, many similar word-finding methods have been studied. In the paper, we propose an edit distance method (a well-known string similarity measure) that is modified to consider various types of virtual keypads. The proposed method effectively covers typographical errors in various keypads by converting an input string into a physical key sequence and by reflecting characteristics of virtual keypads to edit scores. In the experiments with various keypads, the proposed method showed better performances than a typical edit distance method.

Semantic Image Retrieval Using Color Distribution and Similarity Measurement in WordNet (컬러 분포와 WordNet상의 유사도 측정을 이용한 의미적 이미지 검색)

  • Choi, Jun-Ho;Cho, Mi-Young;Kim, Pan-Koo
    • The KIPS Transactions:PartB
    • /
    • v.11B no.4
    • /
    • pp.509-516
    • /
    • 2004
  • Semantic interpretation of image is incomplete without some mechanism for understanding semantic content that is not directly visible. For this reason, human assisted content-annotation through natural language is an attachment of textual description to image. However, keyword-based retrieval is in the level of syntactic pattern matching. In other words, dissimilarity computation among terms is usually done by using string matching not concept matching. In this paper, we propose a method for computerized semantic similarity calculation In WordNet space. We consider the edge, depth, link type and density as well as existence of common ancestors. Also, we have introduced method that applied similarity measurement on semantic image retrieval. To combine wi#h the low level features, we use the spatial color distribution model. When tested on a image set of Microsoft's 'Design Gallery Line', proposed method outperforms other approach.