• Title/Summary/Keyword: 텍스트

Search Result 5,030, Processing Time 0.029 seconds

Translation Pre-processing Technique for Improving Analysis Performance of Korean News (한국어 뉴스 분석 성능 향상을 위한 번역 전처리 기법)

  • Lee, Ji-Min;Jeong, Da-Woon;Gu, Yeong-Hyeon;Yoo, Seong-Joon
    • Proceedings of the Korean Society of Broadcast Engineers Conference
    • /
    • 2020.07a
    • /
    • pp.619-623
    • /
    • 2020
  • 한국어는 교착어로 1개 이상의 형태소가 단어를 이루고 있기 때문에 텍스트 분석 시 형태소를 분리하는 작업이 필요하다. 자연어를 처리하는 대부분의 알고리즘은 영미권에서 만들어졌고 영어는 굴절어로 특정 경우를 제외하고 일반적으로 하나의 형태소가 단어를 구성하는 구조이다. 그리고 영문은 주로 띄어쓰기 위주로 토큰화가 진행되기 때문에 텍스트 분석이 한국어에 비해 복잡함이 떨어지는 편이다. 이러한 이유들로 인해 한국어 텍스트 분석은 영문 텍스트 분석에 비해 한계점이 있다고 알려져 있다. 한국어 텍스트 분석의 성능 향상을 위해 본 논문에서는 번역 전처리 기법을 제안한다. 번역 전처리 기법이란 원본인 한국어 텍스트를 영문으로 번역하고 전처리를 거친 뒤 분석된 결과를 재번역하는 것이다. 본 논문에서는 한국어 뉴스 기사 데이터와 번역 전처리 기법이 적용된 영문 뉴스 텍스트 데이터를 사용했다. 그리고 주제어 역할을 하는 키워드를 단어 간의 유사도를 계산하는 알고리즘인 Word2Vec(Word to Vector)을 통해 유사 단어를 추출했다. 이렇게 도출된 유사 단어를 텍스트 분석 전문가 대상으로 성능 비교 투표를 진행했을 때, 한국어 뉴스보다 번역 전처리 기법이 적용된 영문 뉴스가 약 3배의 득표 차이로 의미있는 결과를 도출했다.

  • PDF

Text Region Detection using Edge and Regional Minima/Maxima Transformation from Natural Scene Images (에지 및 국부적 최소/최대 변환을 이용한 자연 이미지로부터 텍스트 영역 검출)

  • Park, Jong-Cheon;Lee, Keun-Wang
    • Journal of the Korea Academia-Industrial cooperation Society
    • /
    • v.10 no.2
    • /
    • pp.358-363
    • /
    • 2009
  • Text region detection from the natural scene images used in a variety of applications, many research are needed in this field. Recent research methods is to detect the text region using various algorithm which it is combination of edge based and connected component based. Therefore, this paper proposes an text region detection using edge and regional minima/maxima transformation algorithm from natural scene images, and then detect the connected components of edge and regional minima/maxima, labeling edge and regional minima/maxima connected components. Analysis the labeled regions and then detect a text candidate regions, each of detected text candidates combined and create a single text candidate image, Final text region validated by comparing the similarity and adjacency of individual characters, and then as the final text regions are detected. As the results of experiments, proposed algorithm improved the correctness of text regions detection using combined edge and regional minima/maxima connected components detection methods.

The Slope Extraction and Compensation Based on Adaptive Edge Enhancement to Extract Scene Text Region (장면 텍스트 영역 추출을 위한 적응적 에지 강화 기반의 기울기 검출 및 보정)

  • Back, Jaegyung;Jang, Jaehyuk;Seo, Yeong Geon
    • Journal of Digital Contents Society
    • /
    • v.18 no.4
    • /
    • pp.777-785
    • /
    • 2017
  • In the modern real world, we can extract and recognize some texts to get a lot of information from the scene containing them, so the techniques for extracting and recognizing text areas from a scene are constantly evolving. They can be largely divided into texture-based method, connected component method, and mixture of both. Texture-based method finds and extracts text based on the fact that text and others have different values such as image color and brightness. Connected component method is determined by using the geometrical properties after making similar pixels adjacent to each pixel to the connection element. In this paper, we propose a method to adaptively change to improve the accuracy of text region extraction, detect and correct the slope of the image using edge and image segmentation. The method only extracts the exact area containing the text by correcting the slope of the image, so that the extracting rate is 15% more accurate than MSER and 10% more accurate than EEMSER.

Case Study on Public Document Classification System That Utilizes Text-Mining Technique in BigData Environment (빅데이터 환경에서 텍스트마이닝 기법을 활용한 공공문서 분류체계의 적용사례 연구)

  • Shim, Jang-sup;Lee, Kang-wook
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference
    • /
    • 2015.10a
    • /
    • pp.1085-1089
    • /
    • 2015
  • Text-mining technique in the past had difficulty in realizing the analysis algorithm due to text complexity and degree of freedom that variables in the text have. Although the algorithm demanded lots of effort to get meaningful result, mechanical text analysis took more time than human text analysis. However, along with the development of hardware and analysis algorithm, big data technology has appeared. Thanks to big data technology, all the previously mentioned problems have been solved while analysis through text-mining is recognized to be valuable as well. However, applying text-mining to Korean text is still at the initial stage due to the linguistic domain characteristics that the Korean language has. If not only the data searching but also the analysis through text-mining is possible, saving the cost of human and material resources required for text analysis will lead efficient resource utilization in numerous public work fields. Thus, in this paper, we compare and evaluate the public document classification by handwork to public document classification where word frequency(TF-IDF) in a text-mining-based text and Cosine similarity between each document have been utilized in big data environment.

  • PDF

Construction Bid Data Analysis for Overseas Projects Based on Text Mining - Focusing on Overseas Construction Project's Bidder Inquiry (텍스트 마이닝을 통한 해외건설공사 입찰정보 분석 - 해외건설공사의 입찰자 질의(Bidder Inquiry) 정보를 대상으로 -)

  • Lee, JeeHee;Yi, June-Seong;Son, JeongWook
    • Korean Journal of Construction Engineering and Management
    • /
    • v.17 no.5
    • /
    • pp.89-96
    • /
    • 2016
  • Most data generated in construction projects is unstructured text data. Unstructured data analysis is very needed in order for effective analysis on large amounts of text-based documents, such as contracts, specifications, and RFI. This study analysed previously performed project's bid related documents (bidder inquiry) in overseas construction projects; as a results of the analysis frequent words in documents, association rules among the words, and various document topics were derived. This study suggests effective text analysis approach for massive documents with short time using text mining technique, and this approach is expected to extend the unstructured text data analysis in construction industry.

HTML Text Extraction Using Frequency Analysis (빈도 분석을 이용한 HTML 텍스트 추출)

  • Kim, Jin-Hwan;Kim, Eun-Gyung
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.25 no.9
    • /
    • pp.1135-1143
    • /
    • 2021
  • Recently, text collection using a web crawler for big data analysis has been frequently performed. However, in order to collect only the necessary text from a web page that is complexly composed of numerous tags and texts, there is a cumbersome requirement to specify HTML tags and style attributes that contain the text required for big data analysis in the web crawler. In this paper, we proposed a method of extracting text using the frequency of text appearing in web pages without specifying HTML tags and style attributes. In the proposed method, the text was extracted from the DOM tree of all collected web pages, the frequency of appearance of the text was analyzed, and the main text was extracted by excluding the text with high frequency of appearance. Through this study, the superiority of the proposed method was verified.

Machine Learning Language Model Implementation Using Literary Texts (문학 텍스트를 활용한 머신러닝 언어모델 구현)

  • Jeon, Hyeongu;Jung, Kichul;Kwon, Kyoungah;Lee, Insung
    • The Journal of the Convergence on Culture Technology
    • /
    • v.7 no.2
    • /
    • pp.427-436
    • /
    • 2021
  • The purpose of this study is to implement a machine learning language model that learns literary texts. Literary texts have an important characteristic that pairs of question-and-answer are not frequently clearly distinguished. Also, literary texts consist of pronouns, figurative expressions, soliloquies, etc. They hinder the necessity of machine learning using literary texts by making it difficult to learn algorithms. Algorithms that learn literary texts can show more human-friendly interactions than algorithms that learn general sentences. For this goal, this paper proposes three text correction tasks that must be preceded in researches using literary texts for machine learning language model: pronoun processing, dialogue pair expansion, and data amplification. Learning data for artificial intelligence should have clear meanings to facilitate machine learning and to ensure high effectiveness. The introduction of special genres of texts such as literature into natural language processing research is expected not only to expand the learning area of machine learning, but to show a new language learning method.

Scene Text Extraction in Natural Images using Hierarchical Feature Combination and Verification (계층적 특징 결합 및 검증을 이용한 자연이미지에서의 장면 텍스트 추출)

  • 최영우;김길천;송영자;배경숙;조연희;노명철;이성환;변혜란
    • Journal of KIISE:Software and Applications
    • /
    • v.31 no.4
    • /
    • pp.420-438
    • /
    • 2004
  • Artificially or naturally contained texts in the natural images have significant and detailed information about the scenes. If we develop a method that can extract and recognize those texts in real-time, the method can be applied to many important applications. In this paper, we suggest a new method that extracts the text areas in the natural images using the low-level image features of color continuity. gray-level variation and color valiance and that verifies the extracted candidate regions by using the high-level text feature such as stroke. And the two level features are combined hierarchically. The color continuity is used since most of the characters in the same text lesion have the same color, and the gray-level variation is used since the text strokes are distinctive in their gray-values to the background. Also, the color variance is used since the text strokes are distinctive in their gray-values to the background, and this value is more sensitive than the gray-level variations. The text level stroke features are extracted using a multi-resolution wavelet transforms on the local image areas and the feature vectors are input to a SVM(Support Vector Machine) classifier for the verification. We have tested the proposed method using various kinds of the natural images and have confirmed that the extraction rates are very high even in complex background images.

Skew Estimation and Correction in Text Images using Shape Moments (형태 모멘트를 이용한 텍스트 이미지 경사 측정 및 교정)

  • Choo, Moon-Won;Chin, Seong-Ah
    • The Journal of the Korea Contents Association
    • /
    • v.3 no.1
    • /
    • pp.14-20
    • /
    • 2003
  • In this paper efficient skew estimation and correction approaches are proposed. To detect the skew of text images, Hough transform using the perpendicular angle view property and shape moments are peformed. The resultant primary text skew angle is used to align the original text. The performance evaluations of the proposed methods with respect to running time are shown.

  • PDF

College Admissions Consultation Chatbot based on Text Similarity (텍스트 유사도 기반의 대학 입시 상담 챗봇)

  • Lee, Se-Hoon;Cha, Hyun-Suk;Jeon, Chan-Ho;Baek, Yeong-Tae
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2018.07a
    • /
    • pp.441-442
    • /
    • 2018
  • 본 논문에서는 입시상담을 위한 챗봇 시스템을 텍스트 유사도 기반으로 개발하였다. 텍스트를 인지하여 답변을 제공해 주는 방식이며 실시간을 요하는 데이터들은 크롤링한 데이터를 가공을 한 후 사용자에게 대답을 해주고 사용자가 답변에 얼마나 좋은 정보인지 체크하여 그에 맞는 답변을 내어 준다. 사용자의 텍스트를 인식하는 것은 텍스트 유사도를 이용하여 정확하게 인지하고 사용자의 질문과 답변을 서버 DB에 저장을 하여 비슷한 질문이 있을 경우 저장된 답변과 평점을 이용하여 답변을 제공한다.

  • PDF