• Title/Summary/Keyword: non-text identification

Separation of Text and Non-text in Document Layout Analysis using a Recursive Filter

  • Tran, Tuan-Anh;Na, In-Seop;Kim, Soo-Hyung
    • KSII Transactions on Internet and Information Systems (TIIS), v.9 no.10, pp.4072-4091, 2015
  • Separating text from non-text elements plays an important role in document layout analysis. A number of approaches have been proposed, but the quality of the separation is still limited by the complexity of document layouts. In this paper, we present an efficient method for classifying text and non-text components in a document image. It combines whitespace analysis with multi-layer homogeneous regions, a combination we call the recursive filter. First, the input binary document is analyzed by connected component analysis and whitespace extraction. Second, a heuristic filter is applied to identify non-text components. Then, using a statistical method, we apply the recursive filter to multi-layer homogeneous regions to identify all text and non-text elements of the binary image. Finally, all regions are reshaped and denoised to produce the text document and the non-text document. Experimental results on the ICDAR2009 page segmentation competition dataset and other datasets demonstrate the effectiveness and superiority of the proposed method.
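The first two steps of the abstract above (connected component analysis, then a heuristic filter) can be illustrated with a minimal sketch. The abstract does not specify the heuristic, so the height threshold in `split_by_height`, the function names, and the toy `PAGE` image are all assumptions made for illustration and are not the authors' method.

```python
from collections import deque


def connected_components(image):
    """Return bounding boxes (top, left, bottom, right) of 8-connected foreground blobs."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def split_by_height(boxes, max_text_height):
    """Hypothetical heuristic: components taller than the threshold count as non-text."""
    text, non_text = [], []
    for box in boxes:
        height = box[2] - box[0] + 1
        (text if height <= max_text_height else non_text).append(box)
    return text, non_text


# Toy binary page (1 = foreground): two small marks and one tall block.
PAGE = [
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

text_boxes, non_text_boxes = split_by_height(connected_components(PAGE), max_text_height=2)
```

With this toy input the tall block is the only component placed in `non_text_boxes`. A real system would use richer features than height, and the recursive filter itself is not reproduced here.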

Protein Named Entity Identification Based on Probabilistic Features Derived from GENIA Corpus and Medical Text on the Web

  • Sumathipala, Sagara;Yamada, Koichi;Unehara, Muneyuki;Suzuki, Izumi
    • International Journal of Fuzzy Logic and Intelligent Systems, v.15 no.2, pp.111-120, 2015
  • Protein named entity identification is one of the most essential and fundamental prerequisites for extracting information about protein-protein interactions from biomedical literature. In this paper, we explore the use of abstracts of biomedical literature in MEDLINE for protein name identification and present the results of our experiments. We present a robust and effective approach to classifying biomedical named entities into protein and non-protein classes, based on a rich set of features: orthographic, keyword, morphological and the newly introduced Protein-Score features. Our procedure performs strongly in experiments on the GENIA corpus using Random Forest, achieving its highest values of 92.7% precision, 91.7% recall, and 92.2% F-measure for protein identification, while significantly reducing training and testing time.
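The reported F-measure is consistent with the reported precision and recall if it is taken to be the balanced F1 score (their harmonic mean). The abstract does not state which F-measure variant was used, so the default `beta = 1` below is an assumption, and `f_measure` is an illustrative helper name.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Values quoted in the abstract, in percent.
f1 = f_measure(92.7, 91.7)
print(round(f1, 1))  # 92.2
```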

Speaker Identification Using Higher-Order Statistics In Noisy Environment (고차 통계를 이용한 잡음 환경에서의 화자식별)

  • Shin, Tae-Young;Kim, Gi-Sung;Kwon, Young-Uk;Kim, Hyung-Soon
    • The Journal of the Acoustical Society of Korea, v.16 no.6, pp.25-35, 1997
  • Most speech analysis methods developed to date are based on second-order statistics, and one of their biggest drawbacks is a dramatic degradation in performance in noisy environments. In contrast, methods using higher-order statistics (HOS), which have the property of suppressing Gaussian noise, enable robust feature extraction in noisy environments. In this paper, we propose a text-independent speaker identification system using higher-order statistics and compare its performance with that of the conventional second-order-statistics-based method in both white and colored noise environments. The proposed speaker identification system is based on the vector quantization approach and employs an HOS-based voiced/unvoiced detector in order to extract feature parameters for voiced speech only, which has a non-Gaussian distribution and is known to contain most of the speaker-specific characteristics. Experimental results using a 50-speaker database show that the higher-order-statistics-based method gives better identification performance than the conventional second-order-statistics-based method in noisy environments.
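The noise-suppression property mentioned above rests on the fact that cumulants of order higher than two vanish for a Gaussian process. The sketch below computes only the zero-lag fourth-order cumulant of a sample sequence. It is meant to illustrate that property and is not the feature set or the voiced/unvoiced detector used in the paper, neither of which the abstract specifies. The function name and the test signals are assumptions.

```python
import random


def fourth_order_cumulant(samples):
    """Zero-lag fourth-order cumulant: m4 - 3 * m2**2 of the mean-removed samples."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [x - mean for x in samples]
    m2 = sum(v * v for v in centered) / n   # second central moment (variance)
    m4 = sum(v ** 4 for v in centered) / n  # fourth central moment
    return m4 - 3 * m2 * m2


# A two-level (strongly non-Gaussian) signal has a clearly non-zero cumulant.
square_wave = [-1.0, 1.0] * 100
print(fourth_order_cumulant(square_wave))  # -2.0

# For Gaussian noise the estimate should be close to zero, up to sampling error.
rng = random.Random(0)
gaussian_noise = [rng.gauss(0.0, 1.0) for _ in range(20000)]
noise_cumulant = fourth_order_cumulant(gaussian_noise)
```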

Spin in Randomised Clinical Trial Reports of Interventions for Obesity (비만 중재 관련 무작위배정 비교임상연구 보고의 spin 연구)

  • Lee, Sle;Won, Jiyoon;Kim, Seoyeon;Park, Su Jeong;Lee, Hyangsook
    • Korean Journal of Acupuncture, v.34 no.4, pp.251-264, 2017
  • Objectives: To identify the prevalence and types of spin in randomised controlled trials (RCTs) of obesity with statistically non-significant results for primary outcomes, in order to provide directions for adequate reporting. Methods: Spin is a specific reporting strategy that could lead readers to misinterpret the results of RCTs. RCTs on obesity with statistically non-significant primary outcomes published from July 2015 to June 2016 were retrieved from PubMed. All included RCTs were classified into 3 intervention categories. The identification and classification of spin in the included articles were performed by two independent researchers. Results: Among 46 RCTs with statistically non-significant primary outcomes, 32 studies were assessed as having at least one instance of spin in the title, abstract or main text. Of these, 9 articles were on complementary and alternative medicine, 7 on western medicine and 16 on dietary supplements and exercise. The frequency of spin was similar across the types of intervention. The most common type of spin was 'focusing on statistical significance within-group comparison' in the results section of the abstract and main text, and 'focusing only on treatment effectiveness with no consideration of statistical significance' in the conclusion section of the abstract and main text. Studies in which random sequence generation was appropriately done were less likely to have spin. Conclusions: As a majority of obesity RCTs have spin, researchers should pay more attention to adequately interpreting and reporting statistically non-significant results.

Development of Interface Design and Evaluation Criteria of Internet Retail Transaction (인터넷 상거래의 인터페이스 디자인 및 평가지침 개발)

  • Park, Hee-Sok;Jang, Dong-Sung;Lee, Jeong-Kew;In, Chi-Ho
    • Journal of Korean Institute of Industrial Engineers, v.26 no.2, pp.146-154, 2000
  • This study presents a way to develop a shopping mall web site through systematic identification of the interface elements that affect users' performance and subjective sensibility in Internet shopping malls. To quantify the effects of these factors, simulators were designed and used in experiments. The location of the 'Table of Contents' (left > right = up = down) and the 'Menu' type (non_Drop down and page Movement = Drop down and page Movement > Drop down and non_page Movement) were significant factors. However, whether or not a 'Frame' was used made no significant difference. In the evaluation of subjective sensibility, 'Background color' was a significant factor. For 'Header & Scanning Column color', yellow tended to enhance satisfaction with 'simplicity', while green or blue strengthened the feeling of 'balance'. 'Text style', however, was not significant.

Distinguishing Referential Expression 'Geot' Using Decision Tree (결정 트리를 이용한 지시 표현 '것'의 구별)

  • Jo, Eun-Kyoung;Kim, Hark-Soo;Seo, Jung-Yun
    • Journal of KIISE:Software and Applications, v.34 no.9, pp.880-888, 2007
  • The referential expression 'geot' occurs frequently in Korean dialogues. However, previous research on reference resolution has not dealt with it properly, since it is not by itself a referential expression like pronouns and definite noun phrases, and it has never been distinguished from non-referring 'geot'. To resolve this problem, we establish a feature set based on the linguistic properties of 'geot' and the discourse properties of its text, and propose a method that uses a decision tree to distinguish referential 'geot' from non-referring 'geot'. In the experiment, our system achieved F-measures of 92.3% for non-referring 'geot' and 82.2% for referential 'geot', with a total classification performance of 89.27%, and outperformed a classification system based on pattern rules.

A Knowledge-based System for Analyzing Sophisticated Geometric Structure of Document Images (문서 영상의 정교한 기하적 구조분석을 위한 지식베이스 시스템)

  • Lee, Kyong-Ho;Choy, Yoon-Chul;Cho, Sung-Bae
    • Journal of KIISE:Software and Applications, v.28 no.11, pp.795-813, 2001
  • Sophisticated geometric structure analysis must precede the creation of an electronic document from the logical components extracted from a document image. This paper presents a knowledge-based method for sophisticated geometric structure analysis of technical journal pages. The proposed knowledge base encodes, in the form of rules, geometric characteristics that are not only common to technical journals but also publication-specific. The method is a hybrid of top-down and bottom-up techniques and consists of two phases: region segmentation and identification. Generally, the result of the segmentation process does not match composite layout components one-to-one. Therefore, the proposed method identifies non-text objects such as images, drawings and tables, as well as text objects such as text lines and equations, by splitting or grouping segmented regions into composite layout components. Experimental results with 372 images scanned from the IEEE Transactions on Pattern Analysis and Machine Intelligence show that the proposed method performed geometric structure analysis successfully on more than 99% of the test images, a superior performance compared with previous works.

Characterization of Five Shu Acupoint Pattern in Saam Acupuncture Using Text Mining (텍스트마이닝을 통한 사암침법 오수혈 사용 패턴 분석)

  • Park, In-Soo;Jung, Won-Mo;Lee, Ye-Seul;Hahm, Dae-Hyun;Park, Hi-Joon;Chae, Younbyoung
    • Korean Journal of Acupuncture, v.32 no.2, pp.66-74, 2015
  • Background: Saam acupuncture was composed by applying the elemental concepts of the Five Phase theory, namely the relationships between the cycles such as Saeng (Sheng, 'nourishing' or 'creating') and Geuk (Ke, 'suppressing' or 'controlling'), to the Five Phase points and 12 channels to compensate for the imbalance in each of the 12 main energy traits. Objective: The present study aims to identify the characteristics of Five Phase point patterns in Saam acupuncture. Methods: We analysed the characteristics of the five elements of the Five Phase points in Korean medical texts from the mid-Chosun Dynasty, such as Saamdoinchimguyogyeol, Dongeuibogam and Chimgugyeongheombang. Using non-negative matrix factorization (NNMF), we extracted the feature matrix of the five elements of the Five Phase points in each classic medical text. Results: In Saam acupuncture, two characteristics were most prominent: (1) the "Self" component of the Five elements, and (2) the "Mother" and "Grandmother" components of the Five elements. Conclusions: Saam acupuncture used combinations of Five-Shu acupoints based on ZangFu pattern identification. Our findings suggest that grasping the characteristics of Five Phase point combinations can improve understanding of the selection of relevant acupoints based on ZangFu pattern identification.
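For readers unfamiliar with non-negative matrix factorization, the following is a minimal pure-Python sketch of the standard multiplicative-update scheme (Lee and Seung) for approximating a non-negative matrix V by a product W H. The abstract does not say which NNMF algorithm or input matrix the authors used, so the update rule, the function names, and the made-up `COUNTS` matrix (rows as texts, columns as five elements) are all assumptions for illustration only.

```python
import random


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor V (n x m) into non-negative W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


# Made-up usage counts, NOT data from the paper: 3 texts x 5 elements.
COUNTS = [
    [8, 1, 0, 2, 7],
    [7, 2, 1, 1, 6],
    [0, 5, 6, 4, 1],
]

w, h = nmf(COUNTS, rank=2)
print(frobenius_error(COUNTS, w, h))
```

Each row of `h` can be read as a latent usage pattern over the five columns, and each row of `w` gives the weight of those patterns in one text. The rank of 2 is an arbitrary choice for the toy example.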

Online Document Mining Approach to Predicting Crowdfunding Success (온라인 문서 마이닝 접근법을 활용한 크라우드펀딩의 성공여부 예측 방법)

  • Nam, Suhyeon;Jin, Yoonsun;Kwon, Ohbyung
    • Journal of Intelligence and Information Systems, v.24 no.3, pp.45-66, 2018
  • Crowdfunding has become more popular than angel funding for fundraising by venture companies. Identification of success factors may be useful for fundraisers and investors to make decisions related to crowdfunding projects and predict a priori whether they will be successful or not. Recent studies have suggested several numeric factors, such as project goals and the number of associated SNS, studying how these affect the success of crowdfunding campaigns. However, prediction of the success of crowdfunding campaigns via non-numeric and unstructured data is not yet possible, especially through analysis of structural characteristics of documents introducing projects in need of funding. Analysis of these documents is promising because they are open and inexpensive to obtain. We propose a novel method to predict the success of a crowdfunding project based on the introductory text. To test the performance of the proposed method, in our study, texts related to 1,980 actual crowdfunding projects were collected and empirically analyzed. From the text data set, the following details about the projects were collected: category, number of replies, funding goal, fundraising method, reward, number of SNS followers, number of images and videos, and miscellaneous numeric data. These factors were identified as significant input features to be used in classification algorithms. The results suggest that the proposed method outperforms other recently proposed, non-text-based methods in terms of accuracy, F-score, and elapsed time.

A Case Study on Closed Captions: Focusing on "Squid Game" on Netflix (넷플릭스 <오징어 게임> 폐쇄자막 연구)

  • Jeong, Sua;Lee, Jimin
    • The Journal of the Convergence on Culture Technology, v.10 no.2, pp.279-285, 2024
  • This study aims to evaluate the accuracy and completeness of Korean and English closed captions for Netflix's "Squid Game" and to present implications based on the findings. To achieve this, the closed captioning guidelines of the U.S. Federal Communications Commission, DCMP, and the Korea Communications Commission were identified and analyzed. The analysis of the subtitles of the entire "Squid Game" series reveals that, while the Korean closed captions accurately present slang and titles, they present non-existent information in speaker identification. In the English closed captions, speaker identification guidelines are well followed, but omissions of slang and mistranslations of titles are observed. In terms of completeness, both the Korean and English closed captions are found to omit certain audio parts. To address these issues, the study suggests strengthening the QA process, establishing a system to communicate original text problems during translation, and utilizing general English subtitles.