• Title/Summary/Keyword: Text similarity

Text Region Detection using Edge and Regional Minima/Maxima Transformation from Natural Scene Images (에지 및 국부적 최소/최대 변환을 이용한 자연 이미지로부터 텍스트 영역 검출)

  • Park, Jong-Cheon;Lee, Keun-Wang
    • Journal of the Korea Academia-Industrial cooperation Society / v.10 no.2 / pp.358-363 / 2009
  • Text region detection in natural scene images is used in a variety of applications, and much research is still needed in this field. Recent methods detect text regions by combining edge-based and connected-component-based algorithms. This paper proposes a text region detection algorithm for natural scene images that uses edges and a regional minima/maxima transformation. The algorithm detects and labels the connected components of the edges and of the regional minima/maxima, analyzes the labeled regions to find text candidate regions, and combines the detected candidates into a single text candidate image. The final text regions are validated by comparing the similarity and adjacency of individual characters. Experimental results show that the proposed algorithm, by combining edge and regional minima/maxima connected component detection, improves the accuracy of text region detection.

An Experimental Study on Selecting Association Terms Using Text Mining Techniques (텍스트 마이닝 기법을 이용한 연관용어 선정에 관한 실험적 연구)

  • Kim, Su-Yeon;Chung, Young-Mee
    • Journal of the Korean Society for information Management / v.23 no.3 s.61 / pp.147-165 / 2006
  • In this study, experiments on selecting association terms were conducted to find the best method for choosing additional terms related to an initial query term. Association term sets were generated using the support, confidence, and lift measures of the Apriori algorithm, as well as similarity measures such as GSS, the Jaccard coefficient, the cosine coefficient, Sokal & Sneath 5, and mutual information. The term selection methods were evaluated by the precision of the association terms and by the overlap ratio between the association terms and the indexing terms of relevant documents. The Apriori algorithm and GSS achieved the highest performance.
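Several of the measures named in this abstract can be computed from document co-occurrence counts. The following is a minimal sketch, assuming each document is represented as a set of terms; the function name `association_measures` and the sample data are hypothetical and are not taken from the paper. GSS, Sokal & Sneath 5, and mutual information are omitted.

```python
import math


def association_measures(docs, a, b):
    """Co-occurrence measures for terms a and b over a list of term sets."""
    n = len(docs)
    n_a = sum(1 for d in docs if a in d)              # documents containing a
    n_b = sum(1 for d in docs if b in d)              # documents containing b
    n_ab = sum(1 for d in docs if a in d and b in d)  # documents containing both

    support = n_ab / n if n else 0.0
    confidence = n_ab / n_a if n_a else 0.0           # estimate of P(b | a)
    lift = confidence / (n_b / n) if n_b else 0.0     # P(b | a) / P(b)
    union = n_a + n_b - n_ab
    jaccard = n_ab / union if union else 0.0
    cosine = n_ab / math.sqrt(n_a * n_b) if n_a and n_b else 0.0
    return {
        "support": support,
        "confidence": confidence,
        "lift": lift,
        "jaccard": jaccard,
        "cosine": cosine,
    }


# Hypothetical toy collection, for illustration only.
docs = [
    {"text", "mining"},
    {"text", "retrieval"},
    {"text", "mining", "apriori"},
    {"retrieval"},
]
measures = association_measures(docs, "text", "mining")
```

A lift above 1 suggests that the two terms co-occur more often than they would if they were independent.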

An Improvement Of Efficiency For kNN By Using A Heuristic (휴리스틱을 이용한 kNN의 효율성 개선)

  • Lee, Jae-Moon
    • The KIPS Transactions:PartB / v.10B no.6 / pp.719-724 / 2003
  • This paper proposes a heuristic that speeds up kNN without loss of accuracy. The heuristic minimizes the number of similarity computations between documents, which is the dominant cost in kNN. To do this, the paper proposes a method for calculating an upper limit of the similarity and for sorting the training documents. The heuristic was implemented on an existing text categorization framework, AI::Categorizer, and compared with conventional kNN on the well-known Reuters-21578 data set. The comparisons show that the proposed heuristic reduces execution time by about 30-40% relative to conventional kNN.
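The abstract does not state which upper limit the paper uses, so the sketch below substitutes a simple bound as an assumption. For unit-normalized vectors with non-negative weights, the cosine similarity is at most the largest query weight times the L1 norm of the training vector. Training documents are sorted by that L1 norm, and the scan stops once the bound can no longer beat the current k-th best similarity. The class name `BoundedKNN` and the sample data are hypothetical.

```python
import heapq
import math


def normalize(vec):
    """Scale a sparse term-weight dict to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def dot(a, b):
    """Sparse dot product, iterating over the smaller dict."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


class BoundedKNN:
    def __init__(self, train):
        # train: list of (label, term-weight dict) with non-negative weights.
        items = []
        for label, vec in train:
            unit = normalize(vec)
            items.append((sum(unit.values()), label, unit))  # (L1 norm, label, vector)
        # Sort by descending L1 norm so the upper bound only shrinks during a scan.
        items.sort(key=lambda item: item[0], reverse=True)
        self.items = items

    def neighbors(self, query, k):
        """Return the top-k (label, similarity) pairs and the number of similarities computed."""
        q = normalize(query)
        if not q:
            return [], 0
        q_max = max(q.values())
        heap = []      # min-heap of (similarity, index); heap[0] is the k-th best so far
        computed = 0
        for idx, (l1, _label, unit) in enumerate(self.items):
            # Assumed bound: cos(q, d) <= max(q) * L1(d) for non-negative unit vectors.
            if len(heap) == k and q_max * l1 <= heap[0][0]:
                break  # no remaining document can enter the top k
            sim = dot(q, unit)
            computed += 1
            if len(heap) < k:
                heapq.heappush(heap, (sim, idx))
            elif sim > heap[0][0]:
                heapq.heapreplace(heap, (sim, idx))
        ranked = sorted(heap, reverse=True)
        return [(self.items[i][1], s) for s, i in ranked], computed


# Hypothetical toy training set, for illustration only.
train = [
    ("sports", {"ball": 2, "team": 1}),
    ("sports", {"ball": 1, "goal": 1}),
    ("tech", {"code": 3}),
    ("tech", {"code": 1, "bug": 1, "test": 1, "team": 1}),
]
knn = BoundedKNN(train)
top, computed = knn.neighbors({"ball": 1, "team": 1}, k=2)
```

Because the bound never underestimates the true similarity, the pruned scan returns the same top-k similarities as an exhaustive scan. How much work it saves depends on the data.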

A Comparative Study of WWW Search Engine Performance (WWW 탐색도구의 색인 및 탐색 기능 평가에 관한 연구)

  • Chung Young-Mee;Kim Seong-Eun
    • Journal of the Korean Society for Library and Information Science / v.31 no.1 / pp.153-184 / 1997
  • The importance of WWW search services is increasing as Internet information resources explode. Nine current search services were first evaluated by descriptively comparing their indexing, searching, and result-ranking features. Second, a couple of search queries were used to evaluate the search performance of those services in terms of retrieval effectiveness, the degree of overlap in the sites retrieved, and the degree of similarity between services. In this experiment, Alta Vista, HotBot, and Open Text Index showed better retrieval effectiveness. The level of similarity among the nine search services was extremely low.

An Innovative Approach of Bangla Text Summarization by Introducing Pronoun Replacement and Improved Sentence Ranking

  • Haque, Md. Majharul;Pervin, Suraiya;Begum, Zerina
    • Journal of Information Processing Systems / v.13 no.4 / pp.752-777 / 2017
  • This paper proposes an automatic method for summarizing Bangla news documents. In the proposed approach, pronoun replacement is applied for the first time to minimize dangling pronouns in the summary. After pronouns are replaced, sentences are ranked using term frequency, sentence frequency, numerical figures, and title words. If two sentences have at least 60% cosine similarity, the frequency of the larger sentence is increased and the smaller sentence is removed to eliminate redundancy. Moreover, the first sentence is always included in the summary if it contains any title word. In Bangla text, numerical figures can be presented in both words and digits in a variety of forms; all these forms are identified to assess the importance of sentences. We used a rule-based system in this approach, with a hidden Markov model and a Markov chain model. To derive the rules, we analyzed 3,000 Bangla news documents and studied several Bangla grammar books. A series of experiments was performed on 200 Bangla news documents and 600 summaries (3 summaries per document). The evaluation results demonstrate the effectiveness of the proposed technique over the four latest methods.
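The redundancy step described in this abstract can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: sentences are compared as whitespace-tokenized bags of words, "larger" is taken to mean "more tokens", and the score increase of 1 is an arbitrary stand-in for the frequency increase. The function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sentences as bags of whitespace tokens."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def remove_redundant(sentences, scores, threshold=0.6):
    """Drop the shorter sentence of each near-duplicate pair and boost the longer one."""
    scores = list(scores)
    removed = set()
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if i in removed or j in removed:
                continue
            if cosine(sentences[i], sentences[j]) >= threshold:
                len_i = len(sentences[i].split())
                len_j = len(sentences[j].split())
                longer, shorter = (i, j) if len_i >= len_j else (j, i)
                scores[longer] += 1   # assumed boost; the paper's increment is not given here
                removed.add(shorter)
    return [(s, sc) for k, (s, sc) in enumerate(zip(sentences, scores)) if k not in removed]


# Hypothetical English stand-in sentences, for illustration only.
sentences = [
    "the river flooded the town on monday",
    "the river flooded the town",
    "schools reopened today",
]
kept = remove_redundant(sentences, [2, 1, 1])
```

Here the first two sentences exceed the 0.6 threshold, so the shorter one is dropped and the longer one gains score.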

Korean Text Automatic Summarization using Semantically Expanded Sentence Similarity (의미적으로 확장된 문장 간 유사도를 이용한 한국어 텍스트 자동 요약)

  • Kim, Heechan;Lee, Soowon
    • Proceedings of the Korea Information Processing Society Conference / 2014.11a / pp.841-844 / 2014
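  • Automatic text summarization is an important research area for processing large volumes of text data. Within it, extractive summarization is currently the most actively studied form of automatic summarization. TextRank, a leading work in extractive summarization, does not sufficiently consider the semantic similarity between words within sentences when computing sentence-to-sentence similarity. This study proposes a new word-to-word similarity measure that takes semantic similarity into account. The resulting sentence similarities are represented as a graph and evaluated experimentally using the same ranking algorithm as TextRank. The experimental results confirm that, when computing similarity between sentences, the semantic components of words must be sufficiently considered in order to minimize information loss.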

Developing a New Algorithm for Conversational Agent to Detect Recognition Error and Neologism Meaning: Utilizing Korean Syllable-based Word Similarity (대화형 에이전트 인식오류 및 신조어 탐지를 위한 알고리즘 개발: 한글 음절 분리 기반의 단어 유사도 활용)

  • Jung-Won Lee;Il Im
    • Journal of Intelligence and Information Systems / v.29 no.3 / pp.267-286 / 2023
  • Conversational agents such as AI speakers use voice conversation for human-computer interaction, and voice recognition errors often occur in conversational situations. Recognition errors in user utterance records can be categorized into two types. The first type is misrecognition errors, where the agent fails to recognize the user's speech entirely. The second type is misinterpretation errors, where the user's speech is recognized and services are provided, but the interpretation differs from the user's intention. Misinterpretation errors require separate error detection because they are recorded as successful service interactions. In this study, various text separation methods were applied to detect misinterpretation. For each text separation method, the similarity of consecutive speech pairs was calculated using word embedding and document embedding techniques, which convert words and documents into vectors. This approach goes beyond simple word-based similarity calculation to explore a new method for detecting misinterpretation errors. Real user utterance records were used to train and develop a detection model by applying patterns of misinterpretation error causes. The most significant result was obtained with initial consonant extraction, which detected misinterpretation errors caused by the use of unregistered neologisms. Comparison with other separation methods revealed different error types. This study has two main implications. First, for misinterpretation errors that are difficult to detect due to lack of recognition, the study proposed diverse text separation methods and found a novel method that improved performance remarkably. Second, if this approach is applied to conversational agents or voice recognition services requiring neologism detection, patterns of errors arising from the voice recognition stage can be specified. The study also proposed and verified that, even for utterances not categorized as errors, services can be provided according to user-desired results.
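Initial consonant extraction, one of the text separation methods mentioned in this abstract, can be illustrated with the standard Unicode arithmetic for precomposed Hangul syllables. The sketch below is an assumption-laden illustration: the paper measures similarity with word and document embeddings, whereas this example substitutes a plain string ratio from `difflib` as a stand-in. The function names are hypothetical.

```python
from difflib import SequenceMatcher

# The 19 initial consonants (choseong) in Unicode order.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
HANGUL_BASE = 0xAC00      # first precomposed syllable, "가"
HANGUL_LAST = 0xD7A3      # last precomposed syllable, "힣"
PER_INITIAL = 21 * 28     # 21 medial vowels x 28 final-consonant slots per initial


def initial_consonants(text):
    """Replace each precomposed Hangul syllable with its initial consonant."""
    out = []
    for ch in text:
        code = ord(ch)
        if HANGUL_BASE <= code <= HANGUL_LAST:
            out.append(CHOSEONG[(code - HANGUL_BASE) // PER_INITIAL])
        else:
            out.append(ch)  # leave non-Hangul characters unchanged
    return "".join(out)


def initial_similarity(a, b):
    """Stand-in similarity: string ratio over the initial-consonant sequences."""
    return SequenceMatcher(None, initial_consonants(a), initial_consonants(b)).ratio()


sample = initial_consonants("한글")
```

Comparing utterances at the initial-consonant level keeps two words with the same consonant skeleton close together even when their vowels or final consonants differ.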

Similarity Analysis of Hospitalization using Crowding Distance

  • Jung, Yong Gyu;Choi, Young Jin;Cha, Byeong Heon
    • International journal of advanced smart convergence / v.5 no.2 / pp.53-58 / 2016
  • With the growing use of big data and data mining, it is useful to understand how such techniques can reveal relationships in the healthcare field. This study uses hierarchical methods of data analysis to explore similarities in hospitalization across several New York state counties. The study measures the crowding distance of data on age-specific hospitalization periods. Crowding distance is defined as the longest distance, or least similarity, between cities. Clinton is expected to have the greatest distance, while Albany and the other cities are closer because they are connected by the shortest distance at each step. Similarities were stronger across hospital stays categorized by age. Hierarchical clustering with crowding distance can be applied to predict the similarity of hospitalization data across the 10 cities. To enhance the performance of hierarchical clustering, crowding distances can be compared after text is first converted to an attribute vector. The similarity measured between two objects depends on the method used in clustering, and it is distinct from distance: the smaller the distance value, the more similar two objects are to one another. Applying this technique, the crowding distance was found to decrease consistently as the similarity between the data increased, which improved experimental performance. Furthermore, the similarity of hospitalization periods across cities is expected to be useful when planning the construction of hospital wards, since the experimental results can help predict the required size of hospital facilities and support efficient management of patients in similar areas.

Similarity Measurement Method of Trajectory using Indexing Information of Moving Object in Video (비디오 내 이동 객체의 색인 정보를 이용한 궤적 유사도 측정 기법)

  • Kim, Jeong In;Choi, Chang;Kim, Pan Koo
    • Smart Media Journal / v.1 no.3 / pp.43-47 / 2012
  • The recent proliferation of multimedia data necessitates effective and efficient retrieval of multimedia data. Research in this area focuses not only on retrieval by text matching but also on using multimedia data features. This paper therefore proposes a trajectory similarity measurement method that uses indexing information of moving objects in video. The method consists of two steps. First, CCTV video data is indexed to extract the trajectories of moving objects. Second, the DTW (Dynamic Time Warping) and TSR (Tangent Space Representation) algorithms are compared.
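DTW, one of the two algorithms compared in this abstract, can be illustrated with the classic dynamic-programming formulation. The sketch below assumes that a trajectory is a list of (x, y) points and that the local cost is Euclidean distance; the paper's exact setup is not given in the abstract, and the sample trajectories are hypothetical.

```python
import math


def dtw_distance(a, b):
    """Dynamic Time Warping distance between two point sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])   # local Euclidean cost (Python 3.8+)
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b[j-1] is reused
                cost[i][j - 1],      # b advances, a[i-1] is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Hypothetical trajectories, for illustration only.
straight = [(0, 0), (1, 0), (2, 0)]
paused = [(0, 0), (1, 0), (1, 0), (2, 0)]    # same path, with one repeated sample
shifted = [(0, 1), (1, 1), (2, 1)]           # same shape, offset by 1 in y

same_path = dtw_distance(straight, paused)
offset_path = dtw_distance(straight, shifted)
```

Because DTW allows one point to align with several points in the other sequence, two trajectories that follow the same path at different speeds still have a distance of zero.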
