• Title/Summary/Keyword: Text Clustering

Search Result 202, Processing Time 0.022 seconds

Title Extraction from Book Cover Images Using Histogram of Oriented Gradients and Color Information

  • Do, Yen;Kim, Soo Hyung;Na, In Seop
    • International Journal of Contents
    • /
    • v.8 no.4
    • /
    • pp.95-102
    • /
    • 2012
  • In this paper, we present a technique to extract the title areas from book cover images. A typical book cover image may contain text, pictures, diagrams as well as complex and irregular background. In addition, the high variability of character features such as thickness, font, position, background and tilt of the text also makes the text extraction task more complicated. Therefore, we propose a two steps efficient method that uses Histogram of Oriented Gradients and color information to find the title areas. Firstly, text localization is carried out to find the title candidates. Finally, refinement process is performed to find the sufficient components of title areas. To obtain the best result, we also use other constraints about the size, ratio between the length and width of the title. We achieve encouraging results of extracted title regions from book cover images which prove the advantages and efficiency of the proposed method.

Text extraction in images using simplify color and edges pattern analysis (색상 단순화와 윤곽선 패턴 분석을 통한 이미지에서의 글자추출)

  • Yang, Jae-Ho;Park, Young-Soo;Lee, Sang-Hun
    • Journal of the Korea Convergence Society
    • /
    • v.8 no.8
    • /
    • pp.33-40
    • /
    • 2017
  • In this paper, we propose a text extraction method by pattern analysis on contour for effective text detection in image. Text extraction algorithms using edge based methods show good performance in images with simple backgrounds, The images of complex background has a poor performance shortcomings. The proposed method simplifies the color of the image by using K-means clustering in the preprocessing process to detect the character region in the image. Enhance the boundaries of the object through the High pass filter to improve the inaccuracy of the boundary of the object in the color simplification process. Then, by using the difference between the expansion and erosion of the morphology technique, the edges of the object is detected, and the character candidate region is discriminated by analyzing the pattern of the contour portion of the acquired region to remove the unnecessary region (picture, background). As a final result, we have shown that the characters included in the candidate character region are extracted by removing unnecessary regions.

Decomposition of a Text Block into Words Using Projection Profiles, Gaps and Special Symbols (투영 프로파일, GaP 및 특수 기호를 이용한 텍스트 영역의 어절 단위 분할)

  • Jeong Chang Bu;Kim Soo Hyung
    • Journal of KIISE:Software and Applications
    • /
    • v.31 no.9
    • /
    • pp.1121-1130
    • /
    • 2004
  • This paper proposes a method for line and word segmentation for machine-printed text blocks. To separate a text region into the unit of lines, it analyses the horizontal projection profile and performs a recursive projection profile cut method. In the word segmentation, between-word gaps are identified by a hierarchical clustering method after finding gaps in the text line by using a connected component analysis. In addition, a special symbol detection technique is applied to find two types of special symbols tying between words using their morphologic features. An experiment with 84 text regions from English and Korean documents shows that the proposed method achieves 99.92% accuracy of word segmentation, while a commercial OCR software named Armi 6.0 Pro$^{TM}$ has 97.58% accuracy.y.

Keyword Analysis of Two SCI Journals on Rock Engineering by using Text Mining (텍스트 마이닝을 이용한 암반공학분야 SCI논문의 주제어 분석)

  • Jung, Yong-Bok;Park, Eui-Seob
    • Tunnel and Underground Space
    • /
    • v.25 no.4
    • /
    • pp.303-319
    • /
    • 2015
  • Text mining is one of the branches of data mining and is used to find any meaningful information from the large amount of text. In this study, we analyzed titles and keywords of two SCI journals on rock engineering by using text mining to find major research area, trend and associations of research fields. Visualization of the results was also included for the intuitive understanding of the results. Two journals showed similar research fields but different patterns in the associations among research fields. IJRMMS showed simple network, that is one big group based on the keyword 'rock' with a few small groups. On the other hand, RMRE showed a complex network among various medium groups. Trend analysis by clustering and linear regression of keyword - year frequency matrix provided that most of the keywords increased in number as time goes by except a few descending keywords.

Discovering Meaningful Trends in the Inaugural Addresses of North Korean Leader Via Text Mining (텍스트마이닝을 활용한 북한 지도자의 신년사 및 연설문 트렌드 연구)

  • Park, Chul-Soo
    • Journal of Information Technology Applications and Management
    • /
    • v.26 no.3
    • /
    • pp.43-59
    • /
    • 2019
  • The goal of this paper is to investigate changes in North Korea's domestic and foreign policies through automated text analysis over North Korean new year addresses, one of most important and authoritative document publicly announced by North Korean government. Based on that data, we then analyze the status of text mining research, using a text mining technique to find the topics, methods, and trends of text mining research. We also investigate the characteristics and method of analysis of the text mining techniques, confirmed by analysis of the data. We propose a procedure to find meaningful tendencies based on a combination of text mining, cluster analysis, and co-occurrence networks. To demonstrate applicability and effectiveness of the proposed procedure, we analyzed the inaugural addresses of Kim Jung Un of the North Korea from 2017 to 2019. The main results of this study show that trends in the North Korean national policy agenda can be discovered based on clustering and visualization algorithms. We found that uncovered semantic structures of North Korean new year addresses closely follow major changes in North Korean government's positions toward their own people as well as outside audience such as USA and South Korea.

The Document Clustering using Multi-Objective Genetic Algorithms (다목적 유전자 알고리즘을 이용한문서 클러스터링)

  • Lee, Jung-Song;Park, Soon-Cheol
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.17 no.2
    • /
    • pp.57-64
    • /
    • 2012
  • In this paper, the multi-objective genetic algorithm is proposed for the document clustering which is important in the text mining field. The most important function in the document clustering algorithm is to group the similar documents in a corpus. So far, the k-means clustering and genetic algorithms are much in progress in this field. However, the k-means clustering depends too much on the initial centroid, the genetic algorithm has the disadvantage of coming off in the local optimal value easily according to the fitness function. In this paper, the multi-objective genetic algorithm is applied to the document clustering in order to complement these disadvantages while its accuracy is analyzed and compared to the existing algorithms. In our experimental results, the multi-objective genetic algorithm introduced in this paper shows the accuracy improvement which is superior to the k-means clustering(about 20 %) and the general genetic algorithm (about 17 %) for the document clustering.

Decision Method of Importance of E-Mail based on User Profiles (사용자 프로파일에 기반한 전자 메일의 중요도 결정)

  • Lee, Samuel Sang-Kon
    • The KIPS Transactions:PartB
    • /
    • v.15B no.5
    • /
    • pp.493-500
    • /
    • 2008
  • Although modern day people gather many data from the network, the users want only the information needed. Using this technology, the users can extract on the data that satisfy the query. As the previous studies use the single data in the document, frequency of the data for example, it cannot be considered as the effective data clustering method. What is needed is the effective clustering technology that can process the electronic network documents such as the e-mail or XML that contain the tags of various formats. This paper describes the study of extracting the information from the user query based on the multi-attributes. It proposes a method of extracting the data such as the sender, text type, time limit syntax in the text, and title from the e-mail and using such data for filtering. It also describes the experiment to verify that the multi-attribute based clustering method is more accurate than the existing clustering methods using only the word frequency.

A Binary Prediction Method for Outlier Detection using One-class SVM and Spectral Clustering in High Dimensional Data (고차원 데이터에서 One-class SVM과 Spectral Clustering을 이용한 이진 예측 이상치 탐지 방법)

  • Park, Cheong Hee
    • Journal of Korea Multimedia Society
    • /
    • v.25 no.6
    • /
    • pp.886-893
    • /
    • 2022
  • Outlier detection refers to the task of detecting data that deviate significantly from the normal data distribution. Most outlier detection methods compute an outlier score which indicates the degree to which a data sample deviates from normal. However, setting a threshold for an outlier score to determine if a data sample is outlier or normal is not trivial. In this paper, we propose a binary prediction method for outlier detection based on spectral clustering and one-class SVM ensemble. Given training data consisting of normal data samples, a clustering method is performed to find clusters in the training data, and the ensemble of one-class SVM models trained on each cluster finds the boundaries of the normal data. We show how to obtain a threshold for transforming outlier scores computed from the ensemble of one-class SVM models into binary predictive values. Experimental results with high dimensional text data show that the proposed method can be effectively applied to high dimensional data, especially when the normal training data consists of different shapes and densities of clusters.

Cluster-based Information Retrieval with Tolerance Rough Set Model

  • Ho, Tu-Bao;Kawasaki, Saori;Nguyen, Ngoc-Binh
    • International Journal of Fuzzy Logic and Intelligent Systems
    • /
    • v.2 no.1
    • /
    • pp.26-32
    • /
    • 2002
  • The objectives of this paper are twofold. First is to introduce a model for representing documents with semantics relatedness using rough sets but with tolerance relations instead of equivalence relations (TRSM). Second is to introduce two document hierarchical and nonhierarchical clustering algorithms based on this model and TRSM cluster-based information retrieval using these two algorithms. The experimental results show that TRSM offers an alterative approach to text clustering and information retrieval.

Extracting and Clustering of Story Events from a Story Corpus

  • Yu, Hye-Yeon;Cheong, Yun-Gyung;Bae, Byung-Chull
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.15 no.10
    • /
    • pp.3498-3512
    • /
    • 2021
  • This article describes how events that make up text stories can be represented and extracted. We also address the results from our simple experiment on extracting and clustering events in terms of emotions, under the assumption that different emotional events can be associated with the classified clusters. Each emotion cluster is based on Plutchik's eight basic emotion model, and the attributes of the NLTK-VADER are used for the classification criterion. While comparisons of the results with human raters show less accuracy for certain emotion types, emotion types such as joy and sadness show relatively high accuracy. The evaluation results with NRC Word Emotion Association Lexicon (aka EmoLex) show high accuracy values (more than 90% accuracy in anger, disgust, fear, and surprise), though precision and recall values are relatively low.