• Title/Summary/Keyword: Text segmentation

Search Result 140, Processing Time 0.031 seconds

Language-Independent Word Acquisition Method Using a State-Transition Model

  • Xu, Bin;Yamagishi, Naohide;Suzuki, Makoto;Goto, Masayuki
    • Industrial Engineering and Management Systems
    • /
    • v.15 no.3
    • /
    • pp.224-230
    • /
    • 2016
  • The use of new words, numerous spoken languages, and abbreviations on the Internet is extensive. As such, automatically acquiring words for the purpose of analyzing Internet content is very difficult. In a previous study, we proposed a method for Japanese word segmentation using character N-grams. The previously proposed method is based on a simple state-transition model that is established under the assumption that the input document is described based on four states (denoted as A, B, C, and D) specified beforehand: state A represents words (nouns, verbs, etc.); state B represents statement separators (punctuation marks, conjunctions, etc.); state C represents postpositions (namely, words that follow nouns); and state D represents prepositions (namely, words that precede nouns). According to this state-transition model, based on the states applied to each pseudo-word, we search the document from beginning to end for an accessible pattern. In other words, the process of this transition detects some words during the search. In the present paper, we perform experiments based on the proposed word acquisition algorithm using Japanese and Chinese newspaper articles. These articles were obtained from Japan's Kyoto University and the Chinese People's Daily. The proposed method does not depend on the language structure. If text documents are expressed in Unicode the proposed method can, using the same algorithm, obtain words in Japanese and Chinese, which do not contain spaces between words. Hence, we demonstrate that the proposed method is language independent.

Extracting the Slope and Compensating the Image Using Edges and Image Segmentation in Real World Image (실세계 영상에서 경계선과 영상 분할을 이용한 기울기 검출 및 보정)

  • Paek, Jaegyung;Seo, Yeong Geon
    • Journal of Digital Contents Society
    • /
    • v.17 no.5
    • /
    • pp.441-448
    • /
    • 2016
  • In this paper, we propose a method that segments the image, extracts its slope and compensate it in the image that text and background are mixed. The proposed method uses morphology based preprocessing and extracts the edges using canny operator. And after segmenting the image which the edges are extracted, it excludes the areas which the edges are included, only uses the area which the edges are included and creates the projection histograms according to their various direction slopes. Using them, it takes a slope having the greatest edge concentrativeness of each area and compensates the slope of the scene. On extracting the slope of the mixed scene of the text and background, the method can get better results as 0.7% than the existing methods as it excludes the useless areas that the edges do not exist.

Text extraction from camera based document image (카메라 기반 문서영상에서의 문자 추출)

  • 박희주;김진호
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.8 no.2
    • /
    • pp.14-20
    • /
    • 2003
  • This paper presents a text extraction method of camera based document image. It is more difficult to recognize camera based document image in comparison with scanner based image because of segmentation problem due to variable lighting condition and versatile fonts. Both document binarization and character extraction are important processes to recognize camera based document image. After converting color image into grey level image, gray level normalization is used to extract character region independent of lighting condition and background image. Local adaptive binarization method is then used to extract character from the background after the removal of noise. In this character extraction step, the information of the horizontal and vertical projection and the connected components is used to extract character line, word region and character region. To evaluate the proposed method, we have experimented with documents mixed Hangul, English, symbols and digits of the ETRI database. An encouraging binarization and character extraction results have been obtained.

  • PDF

Movement Search in Video Stream Using Shape Sequence (동영상에서 모양 시퀀스를 이용한 동작 검색 방법)

  • Choi, Min-Seok
    • Journal of Korea Multimedia Society
    • /
    • v.12 no.4
    • /
    • pp.492-501
    • /
    • 2009
  • Information on movement of objects in videos can be used as an important part in categorizing and separating the contents of a scene. This paper is proposing a shape-based movement-matching algorithm to effectively find the movement of an object in video streams. Information on object movement is extracted from the object boundaries from the input video frames becoming expressed in continuous 2D shape information while individual 2D shape information is converted into a lD shape feature using the shape descriptor. Object movement in video can be found as simply as searching for a word in a text without a separate movement segmentation process using the sequence of the shape descriptor listed according to order. The performance comparison results with the MPEG-7 shape variation descriptor showed that the proposed method can effectively express the movement information of the object and can be applied to movement search and analysis applications.

  • PDF

Analysis of Published Articles in the Korean Journal of Health Service Management (2007-2018): Centered on Research Methodology (2007~2018 보건의료산업학회지 게재논문 분석: 연구방법론 중심으로)

  • Moon, Jeong Eun;Jang, Keum-Seong
    • The Korean Journal of Health Service Management
    • /
    • v.14 no.1
    • /
    • pp.195-209
    • /
    • 2020
  • Objectives: This study aimed to analyze the papers in the Korean Journal of Health Service Management (KJHSM) (2007-2018) in order to identify the research trends and aid the future development of healthcare-related research. Methods: Data collection was conducted from September 1-30, 2019. The KSHSM website and Lorea Citatin Index (KCI) electronic database provided 605 copies of original text. Results: Of these, 538 studies are original articles and 7 studies are review articles; 23.7% of the studies presented conceptual framework, 58.4% implemented convenience extraction, and 64.7% collected data using questionnaires. 29.3% of key words were included in the healthcare service, and 48.5% were excluded from the submission field. Conclusions: For the qualitative improvement and development of the journal, it is necessary to consider the relevance of refinement of the methodological approach, segmentation in the field of submission, and selection of keywords.

Grapheme Segmentation Method for Low Quality Printed Hangul Text Recognition (저해상도 인쇄체 한글 영상 인식을 위한 자소 분할 방법)

  • Lee Seong-Hun;Cho Kyu-Tae;Kim Jin-Sik;Kim Jin-Hyung;Jung Cheol-Kon;Kim Sang-Kyun;Moon Young-Su;Kim Ji-Yeun
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2006.06b
    • /
    • pp.382-384
    • /
    • 2006
  • 본 논문에서는 저해상도 한글 영상을 자소 단위로 분리하는 방법을 제안한다. 비디오 자막이나 저해상도 스캔 영상의 경우 자소간 획이 접촉되거나 잡영이 많이 포함되어 기존의 자소 분할 방법으로는 한계가 있다. 한자 문자열을 문자 단위로 분할하는데 사용된 비선형 분할 경로 알고리즘을 한글 낱자 영상에 적용하여 자소 단위로 분할한다. 기존의 분할 경로 알고리즘을 한글 자소 분할에 효과적으로 적용하기 위해서 우세점 탐지 알고리즘을 이용하여 자소간 접촉점을 찾고 이를 바탕으로 생성된 분할 경로에 따라 여러 개의 자소 후보 영상이 생성된다. 자소 영상을 자소 인식기로 인식한 결과 높은 인식률을 보이는 것을 실험을 통하여 확인하였다.

  • PDF

Comparison of Word Extraction Methods Based on Unsupervised Learning for Analyzing East Asian Traditional Medicine Texts (한의학 고문헌 텍스트 분석을 위한 비지도학습 기반 단어 추출 방법 비교)

  • Oh, Junho
    • Journal of Korean Medical classics
    • /
    • v.32 no.3
    • /
    • pp.47-57
    • /
    • 2019
  • Objectives : We aim to assist in choosing an appropriate method for word extraction when analyzing East Asian Traditional Medical texts based on unsupervised learning. Methods : In order to assign ranks to substrings, we conducted a test using one method(BE:Branching Entropy) for exterior boundary value, three methods(CS:cohesion score, TS:t-score, SL:simple-ll) for interior boundary value, and six methods(BExSL, BExTS, BExCS, CSxTS, CSxSL, TSxSL) from combining them. Results : When Miss Rate(MR) was used as the criterion, the error was minimal when the TS and SL were used together, while the error was maximum when CS was used alone. When number of segmented texts was applied as weight value, the results were the best in the case of SL, and the worst in the case of BE alone. Conclusions : Unsupervised-Learning-Based Word Extraction is a method that can be used to analyze texts without a prepared set of vocabulary data. When using this method, SL or the combination of SL and TS could be considered primarily.

Retrieval of Player Event in Golf Videos Using Spoken Content Analysis (음성정보 내용분석을 통한 골프 동영상에서의 선수별 이벤트 구간 검색)

  • Kim, Hyoung-Gook
    • The Journal of the Acoustical Society of Korea
    • /
    • v.28 no.7
    • /
    • pp.674-679
    • /
    • 2009
  • This paper proposes a method of player event retrieval using combination of two functions: detection of player name in speech information and detection of sound event from audio information in golf videos. The system consists of indexing module and retrieval module. At the indexing time audio segmentation and noise reduction are applied to audio stream demultiplexed from the golf videos. The noise-reduced speech is then fed into speech recognizer, which outputs spoken descriptors. The player name and sound event are indexed by the spoken descriptors. At search time, text query is converted into phoneme sequences. The lists of each query term are retrieved through a description matcher to identify full and partial phrase hits. For the retrieval of the player name, this paper compares the results of word-based, phoneme-based, and hybrid approach.

Sign Language Dataset Built from S. Korean Government Briefing on COVID-19 (대한민국 정부의 코로나 19 브리핑을 기반으로 구축된 수어 데이터셋 연구)

  • Sim, Hohyun;Sung, Horyeol;Lee, Seungjae;Cho, Hyeonjoong
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.11 no.8
    • /
    • pp.325-330
    • /
    • 2022
  • This paper conducts the collection and experiment of datasets for deep learning research on sign language such as sign language recognition, sign language translation, and sign language segmentation for Korean sign language. There exist difficulties for deep learning research of sign language. First, it is difficult to recognize sign languages since they contain multiple modalities including hand movements, hand directions, and facial expressions. Second, it is the absence of training data to conduct deep learning research. Currently, KETI dataset is the only known dataset for Korean sign language for deep learning. Sign language datasets for deep learning research are classified into two categories: Isolated sign language and Continuous sign language. Although several foreign sign language datasets have been collected over time. they are also insufficient for deep learning research of sign language. Therefore, we attempted to collect a large-scale Korean sign language dataset and evaluate it using a baseline model named TSPNet which has the performance of SOTA in the field of sign language translation. The collected dataset consists of a total of 11,402 image and text. Our experimental result with the baseline model using the dataset shows BLEU-4 score 3.63, which would be used as a basic performance of a baseline model for Korean sign language dataset. We hope that our experience of collecting Korean sign language dataset helps facilitate further research directions on Korean sign language.

Design and Implementation of Automated Detection System of Personal Identification Information for Surgical Video De-Identification (수술 동영상의 비식별화를 위한 개인식별정보 자동 검출 시스템 설계 및 구현)

  • Cho, Youngtak;Ahn, Kiok
    • Convergence Security Journal
    • /
    • v.19 no.5
    • /
    • pp.75-84
    • /
    • 2019
  • Recently, the value of video as an important data of medical information technology is increasing due to the feature of rich clinical information. On the other hand, video is also required to be de-identified as a medical image, but the existing methods are mainly specialized in the stereotyped data and still images, which makes it difficult to apply the existing methods to the video data. In this paper, we propose an automated system to index candidate elements of personal identification information on a frame basis to solve this problem. The proposed system performs indexing process using text and person detection after preprocessing by scene segmentation and color knowledge based method. The generated index information is provided as metadata according to the purpose of use. In order to verify the effectiveness of the proposed system, the indexing speed was measured using prototype implementation and real surgical video. As a result, the work speed was more than twice as fast as the playing time of the input video, and it was confirmed that the decision making was possible through the case of the production of surgical education contents.