• Title/Summary/Keyword: Text processing

Search Result 1,197, Processing Time 0.025 seconds

A Hierarchical Text Rating System for Objectionable Documents

  • Jeong, Chi-Yoon;Han, Seung-Wan;Nam, Taek-Yong
    • Journal of Information Processing Systems
    • /
    • v.1 no.1 s.1
    • /
    • pp.22-26
    • /
    • 2005
  • In this paper, we classified the objectionable texts into four rates according to their harmfulness and proposed the hierarchical text rating system for objectionable documents. Since the documents in the same category have similarities in used words, expressions and structure of the document, the text rating system, which uses a single classification model, has low accuracy. To solve this problem, we separate objectionable documents into several subsets by using their properties, and then classify the subsets hierarchically. The proposed system consists of three layers. In each layer, we select features using the chi-square statistics, and then the weight of the features, which is calculated by using the TF-IDF weighting scheme, is used as an input of the non-linear SVM classifier. By means of a hierarchical scheme using the different features and the different number of features in each layer, we can characterize the objectionability of documents more effectively and expect to improve the performance of the rating system. We compared the performance of the proposed system and performance of several text rating systems and experimental results show that the proposed system can archive an excellent classification performance.

Automatic In-Text Keyword Tagging based on Information Retrieval

  • Kim, Jin-Suk;Jin, Du-Seok;Kim, Kwang-Young;Choe, Ho-Seop
    • Journal of Information Processing Systems
    • /
    • v.5 no.3
    • /
    • pp.159-166
    • /
    • 2009
  • As shown in Wikipedia, tagging or cross-linking through major keywords in a document collection improves not only the readability of documents but also responsive and adaptive navigation among related documents. In recent years, the Semantic Web has increased the importance of social tagging as a key feature of the Web 2.0 and, as its crucial phenotype, Tag Cloud has emerged to the public. In this paper we provide an efficient method of automated in-text keyword tagging based on large-scale controlled term collection or keyword dictionary, where the computational complexity of O(mN) - if a pattern matching algorithm is used - can be reduced to O(mlogN) - if an Information Retrieval technique is adopted - while m is the length of target document and N is the total number of candidate terms to be tagged. The result shows that automatic in-text tagging with keywords filtered by Information Retrieval speeds up to about 6 $\sim$ 40 times compared with the fastest pattern matching algorithm.

A method for Character Segmentation using Frequence Characteristics and Back Propagation Neural Network (주파수 특성과 역전파 신경망 알고리즘을 이용한 문자 영역 분할 방법)

  • Chun Byung-Tae;Song Chee-Yang
    • Journal of the Korea Society of Computer and Information
    • /
    • v.11 no.4 s.42
    • /
    • pp.55-60
    • /
    • 2006
  • The proposed method uses FFT(Fast Fourier Transform) and neural networks in order to extract texts in real time. In general, text areas are found in the higher frequency domain, thus, can be characterized using FFT. The neural network are learned by character region(high frequency) and non character region(low frequency). The candidate text areas can be thus found by applying the higher frequency characteristics to neural network. Therefore, the final text area is extracted by verifying the candidate areas. Experimental results show a perfect candidate extraction rate and about 95% text extraction rate. The strength of the proposed algorithm is its simplicity, real-time processing by not processing the entire image.

  • PDF

Research on Chinese Microblog Sentiment Classification Based on TextCNN-BiLSTM Model

  • Haiqin Tang;Ruirui Zhang
    • Journal of Information Processing Systems
    • /
    • v.19 no.6
    • /
    • pp.842-857
    • /
    • 2023
  • Currently, most sentiment classification models on microblogging platforms analyze sentence parts of speech and emoticons without comprehending users' emotional inclinations and grasping moral nuances. This study proposes a hybrid sentiment analysis model. Given the distinct nature of microblog comments, the model employs a combined stop-word list and word2vec for word vectorization. To mitigate local information loss, the TextCNN model, devoid of pooling layers, is employed for local feature extraction, while BiLSTM is utilized for contextual feature extraction in deep learning. Subsequently, microblog comment sentiments are categorized using a classification layer. Given the binary classification task at the output layer and the numerous hidden layers within BiLSTM, the Tanh activation function is adopted in this model. Experimental findings demonstrate that the enhanced TextCNN-BiLSTM model attains a precision of 94.75%. This represents a 1.21%, 1.25%, and 1.25% enhancement in precision, recall, and F1 values, respectively, in comparison to the individual deep learning models TextCNN. Furthermore, it outperforms BiLSTM by 0.78%, 0.9%, and 0.9% in precision, recall, and F1 values.

Effectiveness of Fuzzy Graph Based Document Model

  • Aswathy M R;P.C. Reghu Raj;Ajeesh Ramanujan
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.18 no.8
    • /
    • pp.2178-2198
    • /
    • 2024
  • Graph-based document models have good capabilities to reveal inter-dependencies among unstructured text data. Natural language processing (NLP) systems that use such models as an intermediate representation have shown good performance. This paper proposes a novel fuzzy graph-based document model and to demonstrate its effectiveness by applying fuzzy logic tools for text summarization. The proposed system accepts a text document as input and identifies some of its sentence level features, namely sentence position, sentence length, numerical data, thematic word, proper noun, title feature, upper case feature, and sentence similarity. The fuzzy membership value of each feature is computed from the sentences. We also propose a novel algorithm to construct the fuzzy graph as an intermediate representation of the input document. The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric is used to evaluate the model. The evaluation based on different quality metrics was also performed to verify the effectiveness of the model. The ANOVA test confirms the hypothesis that the proposed model improves the summarizer performance by 10% when compared with the state-of-the-art summarizers employing alternate intermediate representations for the input text.

The Adaptive SPAM Mail Detection System using Clustering based on Text Mining

  • Hong, Sung-Sam;Kong, Jong-Hwan;Han, Myung-Mook
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.8 no.6
    • /
    • pp.2186-2196
    • /
    • 2014
  • Spam mail is one of the most general mail dysfunctions, which may cause psychological damage to internet users. As internet usage increases, the amount of spam mail has also gradually increased. Indiscriminate sending, in particular, occurs when spam mail is sent using smart phones or tablets connected to wireless networks. Spam mail consists of approximately 68% of mail traffic; however, it is believed that the true percentage of spam mail is at a much more severe level. In order to analyze and detect spam mail, we introduce a technique based on spam mail characteristics and text mining; in particular, spam mail is detected by extracting the linguistic analysis and language processing. Existing spam mail is analyzed, and hidden spam signatures are extracted using text clustering. Our proposed method utilizes a text mining system to improve the detection and error detection rates for existing spam mail and to respond to new spam mail types.

Skewed Angle Detection in Text Images Using Orthogonal Angle View

  • Chin, Seong-Ah;Choo, Moon-Won
    • Proceedings of the IEEK Conference
    • /
    • 2000.07a
    • /
    • pp.62-65
    • /
    • 2000
  • In this paper we propose skewed angle detection methods for images that contain text that is not aligned horizontally. In most images text areas are aligned along the horizontal axis, however there are many occasions when the text may be at a skewed angle (denoted by 0 < ${\theta}\;{\leq}\;{\pi}$). In the work described, we adapt the Hough transform, Shadow and Threshold Projection methods to detect the skewed angle of text in an input image using the orthogonal angle view property. The results of this method are a primary text skewed angle, which allows us to rotate the original input image into an image with horizontally aligned text. This utilizes document image processing prior to the recognition stage.

  • PDF

Keywords-based Video Summary System using FastText Algorithm (FastText 알고리즘을 이용한 사용자 지정 키워드 기반 동영상 요약 시스템)

  • Kyungmin Kim;Seungmin Park
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2023.07a
    • /
    • pp.693-694
    • /
    • 2023
  • 본 논문에서는 FastText 알고리즘을 기반으로 한 사용자 지정 키워드 기반 동영상 요약 시스템을 제안한다. 사용자가 키워드를 입력하면 시스템은 해당 키워드와 관련된 단어들을 FastText를 통해 추출하며, 이를 STT (Speech-to-Text)로 변환된 동영상에서 타임 스탬프 기반으로 인식한다. 인식된 키워드와 관련된 내용은 클립 형식으로 요약되어 사용자에게 제공된다. 본 연구의 목적은 숏폼 콘텐츠 환경에서 효과적인 콘텐츠 추출 및 제공을 통해 사용자 경험과 정보 제공의 효율성을 향상시키기 위함이다. 제안된 시스템은 사용자 지정 키워드에 맞춰 다양한 동영상 플랫폼에서 효율적인 영상 요약을 제공함으로써 온라인 동영상 환경에서 큰 혁신을 이끌어낼 것으로 기대된다.

  • PDF

A Study on Smart Text Reader for converting Text through TTS (단위 또는 약어의 의미에 맞는 풀 네임(fulI name) 음성 출력 방법에 관한 연구)

  • Park, An na;Son, Byoung-jun
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2014.11a
    • /
    • pp.806-808
    • /
    • 2014
  • 현재까지의 음성 출력 시스템은 텍스트를 있는 그대로 읽어 주는 것에 불과했다. 단위, 약어의 경우 알파벳을 그대로 읽어 주게 되어 그 본래의 의미를 제대로 파악하기 어려웠다. 본 연구에서는 단위나 약어의 본래의 의미를 찾아서 풀어서 음성 변환해 주는 방법을 제안함으로써 시각 장애인에게도 텍스트의 정확한 정보를 전달할 수 있다는 장점이 있다.