Stroke Width-Based Contrast Feature for Document Image Binarization

  • Van, Le Thi Khue;Lee, Gueesang
    • Journal of Information Processing Systems / v.10 no.1 / pp.55-68 / 2014
  • Automatic segmentation of foreground text from the background in degraded document images is essential for smooth reading of the document content and for machine recognition tasks. In this paper, we present a novel approach to the binarization of degraded document images. The proposed method uses a new local contrast feature extracted based on the stroke width of text. First, a pre-processing step is carried out for noise removal. Text boundary detection is then performed on the image constructed from the contrast feature. Local estimation follows to extract text from the background. Finally, a refinement procedure is applied to the binarized image as a post-processing step to improve the quality of the final results. Experiments on extracting text from degraded handwritten and machine-printed document images, and comparisons against some well-known binarization algorithms, demonstrate the effectiveness of the proposed method.
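
The abstract does not define the stroke-width contrast feature, so it cannot be reproduced here. As a rough illustration of the local estimation idea only, the sketch below swaps in a plain local-mean threshold: a pixel is labelled as text when it is darker than the average of its neighbourhood. The function name `local_mean_binarize` and the parameters `radius` and `offset` are assumptions made for this example, not part of the paper.

```python
def local_mean_binarize(image, radius=1, offset=0):
    """Return a 0/1 mask; 1 marks pixels darker than their local mean minus offset.

    `image` is a list of rows of grayscale values (0 = black, 255 = white).
    This is a generic local threshold, not the paper's contrast feature.
    """
    height, width = len(image), len(image[0])
    result = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Collect the neighbourhood, clipped at the image borders.
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            if image[y][x] < local_mean - offset:
                result[y][x] = 1
    return result


# A tiny synthetic "page": a dark two-pixel stroke on a bright background.
page = [
    [200, 200, 200, 200, 200],
    [200, 40, 200, 200, 200],
    [200, 40, 200, 200, 200],
    [200, 200, 200, 200, 200],
]
binary = local_mean_binarize(page, radius=1, offset=10)
```

A local threshold adapts to uneven illumination and stains better than a single global cut-off, which is presumably why the described method estimates text locally after boundary detection.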

Machine Printed and Handwritten Text Discrimination in Korean Document Images

  • Trieu, Son Tung;Lee, Guee Sang
    • Smart Media Journal / v.5 no.3 / pp.30-34 / 2016
  • Many Korean documents need to be identified as containing either printed or handwritten text. Early identification methods use structural features, which are simple and easy to apply to text of a specific font, but their performance depends on the font type and characteristics of the text. Recently, the bag-of-words model has been used for identification, as it can be invariant to changes in font size, distortions, or modifications to the text. The bag-of-words method includes three steps: word segmentation using connected component grouping, feature extraction, and classification using an SVM (Support Vector Machine). In this paper, a bag-of-words method using SURF (Speeded Up Robust Features) is proposed for the identification of machine-printed and handwritten text in Korean documents. The experiment shows that the proposed method outperforms methods based on structural features.

An Optimal Weighting Method in Supervised Learning of Linguistic Model for Text Classification

  • Mikawa, Kenta;Ishida, Takashi;Goto, Masayuki
    • Industrial Engineering and Management Systems / v.11 no.1 / pp.87-93 / 2012
  • This paper discusses a new weighting method for text analysis from the viewpoint of supervised learning. The term frequency and inverse document frequency measure (tf-idf measure) is a well-known weighting method for information retrieval, and it can be used for text analysis as well. However, it is an empirical weighting method for information retrieval whose effectiveness has not been established theoretically. Therefore, other effective weighting measures may be obtainable for document classification problems. In this study, we propose an optimal weighting method for document classification problems from the viewpoint of supervised learning. Because it makes use of training data, the proposed measure is more suitable for the text classification problem than the tf-idf measure. The effectiveness of our proposal is demonstrated by simulation experiments on text classification of newspaper articles and customer reviews posted on a website.
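
For reference, the tf-idf baseline mentioned in this abstract can be sketched as follows. This is one common variant (relative term frequency multiplied by the natural log of the inverse document frequency); the abstract does not say which variant the authors compare against, and the function name `tf_idf` and the toy documents are assumptions for illustration. The proposed supervised weighting is not shown because the abstract does not specify it.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute tf-idf weights for a list of tokenized documents.

    tf  = count of the term in the document / document length
    idf = ln(number of documents / number of documents containing the term)
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["the", "price", "rises"],
    ["the", "team", "wins"],
    ["the", "price", "falls"],
]
weights = tf_idf(docs)
# "the" occurs in every document, so its idf and therefore its weight is zero;
# "rises" occurs in one document and outweighs "price", which occurs in two.
```

Note that these weights are computed without any class labels, which is the gap a supervised weighting method is meant to address.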

Query Formulation for Heuristic Retrieval in Obfuscated and Translated Partially Derived Text

  • Kumar, Aarti;Das, Sujoy
    • Journal of Information Science Theory and Practice / v.3 no.1 / pp.24-39 / 2015
  • Pre-retrieval query formulation is an important step for identifying local text reuse. Local reuse with high obfuscation, paraphrasing, and translation makes it challenging to find the reused text in a document. In this paper, three pre-retrieval query formulation strategies for heuristic retrieval are studied for low obfuscated, high obfuscated, and translated text. The strategies used are (a) query formulation using proper nouns; (b) query formulation using unique words (Hapax); and (c) query formulation using the most frequent words. In the case of low and high obfuscation and simulated paraphrasing, keywords with Hapax proved to be slightly more efficient. However, initial results indicate that the simple strategy of query formulation using proper nouns gives promising results and may prove better at reducing the size of the corpus for post-processing when identifying local text reuse in obfuscated and translated text.
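
The three strategies can be sketched in a few lines. The code below is only an illustration: the function names are hypothetical, and proper nouns are approximated by a crude heuristic (capitalized words that do not start a sentence), whereas the paper presumably uses a proper tagger.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into alphabetic tokens, keeping the original case."""
    return re.findall(r"[A-Za-z]+", text)


def proper_noun_query(text):
    """(a) Capitalized words not at the start of a sentence (crude heuristic)."""
    query = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for word in tokenize(sentence)[1:]:
            if word[0].isupper() and word not in query:
                query.append(word)
    return query


def hapax_query(text):
    """(b) Words that occur exactly once in the text (hapax legomena)."""
    counts = Counter(word.lower() for word in tokenize(text))
    return [word for word, count in counts.items() if count == 1]


def frequent_query(text, k=3):
    """(c) The k most frequent words in the text."""
    counts = Counter(word.lower() for word in tokenize(text))
    return [word for word, _ in counts.most_common(k)]


sample = "The river Ganges floods often. Farmers near Varanasi fear the river."
nouns = proper_noun_query(sample)      # ['Ganges', 'Varanasi']
hapax = hapax_query(sample)
frequent = frequent_query(sample, k=2)
```

In practice strategy (c) would normally filter stop words first; without that step the most frequent words in the sample are "the" and "river".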

A Text Similarity Measurement Method Based on Singular Value Decomposition and Semantic Relevance

  • Li, Xu;Yao, Chunlong;Fan, Fenglong;Yu, Xiaoqiang
    • Journal of Information Processing Systems / v.13 no.4 / pp.863-875 / 2017
  • Traditional text similarity measurement methods based on word frequency vectors ignore the semantic relationships between words, which, together with the high dimensionality and sparsity of document vectors, has become an obstacle to text similarity calculation. To address these problems, an improved singular value decomposition is used to reduce the dimensionality and remove noise from the text representation model. The optimal number of singular values is analyzed, and the semantic relevance between words can be calculated in the constructed semantic space. An inverted index construction algorithm and similarity definitions between vectors are proposed to calculate the similarity between two documents at the semantic level. Experimental results on a benchmark corpus demonstrate that the proposed method improves the F-measure.
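
The limitation described in the first sentence is easy to reproduce. The sketch below implements the traditional baseline only, cosine similarity over raw word-frequency vectors, and not the SVD-based method of the paper; the function name and toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the word-frequency vectors of two token lists."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


same_words = cosine_similarity(["car", "fast"], ["car", "fast"])
synonyms = cosine_similarity(["car", "fast"], ["automobile", "quick"])
# Identical wording scores close to 1, while the synonym pair scores exactly 0
# because the two documents share no surface terms.
```

Projecting documents into a lower-dimensional space, as the abstract describes, is one way to let related words such as "car" and "automobile" contribute to the similarity.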

A Study on Herbal Processing Terminology (본초(本草) 포제관련(炮製關聯) 용어(用語)에 대(對)한 연구(硏究))

  • Song, Ji-Chung;Shim, Hyun-A;Eom, Dong-Myung
    • Journal of Society of Preventive Korean Medicine / v.16 no.3 / pp.107-117 / 2012
  • Objective : Processing of medicinals is one of the most important parts of medicinal treatment. However, textbooks contain disagreements and several terms with the same meaning. Method : We compared the processing of medicinals described in the textbook Bonchohak, especially for exterior-releasing and heat-clearing medicinals. Results : The terms for processing of medicinals in the introduction of the textbook Bonchohak differ from those in the itemized discussion of exterior-releasing and heat-clearing medicinals. Conclusion : The terms for processing of medicinals in the textbook Bonchohak should be reorganized and improved so that they are clear and reliable for a textbook.

Building a text collection for Urdu information retrieval

  • Rasheed, Imran;Banka, Haider;Khan, Hamaid M.
    • ETRI Journal / v.43 no.5 / pp.856-868 / 2021
  • Urdu is a widely spoken language in the Indian subcontinent with over 300 million speakers worldwide. However, linguistic advancements in Urdu are rare compared to those in other European and Asian languages. Therefore, by following Text Retrieval Conference standards, we attempted to construct an extensive text collection of 85 304 documents from diverse categories covering over 52 topics with relevance judgment sets at 100 pool depth. We also present several applications to demonstrate the effectiveness of our collection. Although this collection is primarily intended for text retrieval, it can also be used for named entity recognition, text summarization, and other linguistic applications with suitable modifications. Ours is the most extensive existing collection for the Urdu language, and it will be freely available for future research and academic education.
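
In TREC-style evaluation, relevance judgments are usually collected by pooling: for each topic, the top-ranked documents from every participating retrieval run are merged, and only that pool is judged by assessors. The sketch below shows the idea under the assumption that "100 pool depth" refers to this procedure; the run names and document IDs are made up for illustration.

```python
def build_pool(runs, depth=100):
    """Union of the top-`depth` documents from each ranked run for one topic."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool


# Two hypothetical retrieval runs for a single topic, best match first.
runs = {
    "run_a": ["d3", "d1", "d7", "d2"],
    "run_b": ["d1", "d9", "d3", "d5"],
}
pool = build_pool(runs, depth=2)  # {'d1', 'd3', 'd9'}
```

Documents outside the pool are conventionally treated as not relevant, so a deeper pool gives more complete judgments at a higher assessment cost.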

The Influence of English Proficiency and Text Types on Korean College Students' Paraphrasing for Plagiarism Prevention

  • Choe, Yoonhee
    • International Journal of Advanced Culture Technology / v.9 no.1 / pp.183-189 / 2021
  • This study examines the effects of Korean college students' English proficiency and English text types on their paraphrases. Korean college students in three English proficiency groups (high, mid, and low) read two types of English texts, causal and argumentative, and paraphrased them in English. The students' paraphrases were evaluated in terms of content (idea exposition, idea development, and wrap up), organization (coherence and cohesion), and language use (grammatical accuracy), and analyzed by MANOVA. The results showed a significant difference in paraphrase performance according to the participants' English proficiency levels rather than the types of English texts. The results of this study have educational implications for English paraphrase education to prevent plagiarism among Korean university students.

A Comparative Study of Word Embedding Models for Arabic Text Processing

  • Assiri, Fatmah;Alghamdi, Nuha
    • International Journal of Computer Science & Network Security / v.22 no.8 / pp.399-403 / 2022
  • Natural texts are analyzed to obtain their intended meaning so that they can be classified depending on the problem under study. One way to represent words is by generating vectors of real values that encode their meaning; this is called word embedding. Similarities between word representations are measured to identify the text class. Word embeddings can be created using the word2vec technique. More recently, fastText was introduced to provide better results when used with classifiers. In this paper, we study the performance of well-known classifiers when using both word embedding techniques with an Arabic dataset. We applied them to real data collected from Wikipedia and found that word2vec and fastText had similar accuracy with all classifiers used.

Evaluation of Human Factors on Text Content Displayed in Mixed Reality (혼합현실에서 텍스트 콘텐츠 표시에 대한 휴먼팩터 평가)

  • Kim, Dae-Yeon
    • Journal of Korea Multimedia Society / v.25 no.9 / pp.1316-1327 / 2022
  • In this study, the effect of text content on users in mixed reality was investigated, and subjective evaluation was performed using a statistical approach. The position and size of the text were defined as independent variables, and eye comfort and visibility were analyzed as dependent variables. Twenty participants viewed the content for 96 seconds and then completed a related survey rating task. A two-way ANOVA showed that the interaction between text position and size and the main effect of text size were not statistically significant. The main effect of text position was statistically significant, and the analysis showed that the bottom middle was preferred for both eye comfort and visibility.