• 제목/요약/키워드: text vector

검색결과 283건 처리시간 0.029초

문서측 자질선정을 이용한 고속 문서분류기의 성능향상에 관한 연구 (Improving the Performance of a Fast Text Classifier with Document-side Feature Selection)

  • 이재윤
    • 정보관리연구
    • /
    • 제36권4호
    • /
    • pp.51-69
    • /
    • 2005
  • 문서분류에 있어서 분류속도의 향상이 중요한 연구과제가 되고 있다. 최근 개발된 자질값투표 기법은 문서자동분류 문제에 대해서 매우 빠른 속도를 가졌지만, 분류정확도는 만족스럽지 못하다. 이 논문에서는 새로운 자질선정 기법인 문서측 자질선정 기법을 제안하고, 이를 자질값투표 기법에 적용해 보았다. 문서측 자질선정은 일반적인 분류자질선정과 달리 학습집단이 아닌 분류대상 문서의 자질 중 일부만을 선택하여 분류에 이용하는 방식이다. 문서측 자질선정을 적용한 실험에서는, 간단하고 빠른 자질값투표 분류기로 SVM 분류기만큼 좋은 성능을 얻을 수 있었다.

An Efficient Machine Learning-based Text Summarization in the Malayalam Language

  • P Haroon, Rosna;Gafur M, Abdul;Nisha U, Barakkath
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • 제16권6호
    • /
    • pp.1778-1799
    • /
    • 2022
  • Automatic text summarization is a procedure that packs enormous content into a more limited book that incorporates significant data. Malayalam is one of the toughest languages utilized in certain areas of India, most normally in Kerala and in Lakshadweep. Natural language processing in the Malayalam language is relatively low due to the complexity of the language as well as the scarcity of available resources. In this paper, a way is proposed to deal with the text summarization process in Malayalam documents by training a model based on the Support Vector Machine classification algorithm. Different features of the text are taken into account for training the machine so that the system can output the most important data from the input text. The classifier can classify the most important, important, average, and least significant sentences into separate classes and based on this, the machine will be able to create a summary of the input document. The user can select a compression ratio so that the system will output that much fraction of the summary. The model performance is measured by using different genres of Malayalam documents as well as documents from the same domain. The model is evaluated by considering content evaluation measures precision, recall, F score, and relative utility. Obtained precision and recall value shows that the model is trustable and found to be more relevant compared to the other summarizers.

Co-Trained Support Vector Machines을 이용한 문서분류 (Text Categorization Using Co-Trained Support Vector Machines)

  • 박성배;장병탁
    • 한국정보과학회:학술대회논문집
    • /
    • 한국정보과학회 2002년도 봄 학술발표논문집 Vol.29 No.1 (B)
    • /
    • pp.259-261
    • /
    • 2002
  • 대부분의 자동문서분류 시스템은 문서에 사용된 단어의 분포만 고려하고, 또 하나의 중요한 정보인 통사 정보는 무시한다. 본 논문에서는 통사정보와 어휘정보를 모두 사용함으로써 대규모의 비구조 문서를 분류하는 방법을 제시한다. 이를 위해, 학습 데이터에 대해 독립된 두 개의 관점을 요구하는 일종의 부분 감독 학습 알고리즘인 co-training 알고리즘을 사용한다. 어휘정보와 통사정보가 각각 문서의 독립된 관점이 될 수 있으므로, 이 두 정보와 레이블이 없는 문서를 사용하여 문서 분류의 성능을 높일 수 있다. Reelers-21578 문서집합과 TREC-7 filtering 문서집합에 대한 실험 결과는 제시된 방법의 유효성을 보인다.

  • PDF

A Functional Central Limit Theorem for the Multivariate Linear Process Generated by Negatively Associated Random Vectors

  • Kim, Tae-Sung;Seo, Hye-Young
    • Communications for Statistical Applications and Methods
    • /
    • 제8권3호
    • /
    • pp.615-623
    • /
    • 2001
  • A functional central limit theorem is obtained for a stationary multivariate linear process of the form (no abstract. see full-text) where{ $Z_{t}$} is a sequence of strictly stationary m-dimensional negatively associated random vectors with E $Z_{t}$=O and E∥ $Z_{t}$$^2$<$\infty$ and { $A_{u}$} is a sequence of coefficient matrices with (no abstract. see full-text) and (no abstract. see full-text).text).).

  • PDF

의료 웹포럼에서의 텍스트 분석을 통한 정보적 지지 및 감성적 지지 유형의 글 분류 모델 (The Informative Support and Emotional Support Classification Model for Medical Web Forums using Text Analysis)

  • 우지영;이민정
    • 한국IT서비스학회지
    • /
    • 제11권sup호
    • /
    • pp.139-152
    • /
    • 2012
  • In the medical web forum, people share medical experience and information as patients and patents' families. Some people search medical information written in non-expert language and some people offer words of comport to who are suffering from diseases. Medical web forums play a role of the informative support and the emotional support. We propose the automatic classification model of articles in the medical web forum into the information support and emotional support. We extract text features of articles in web forum using text mining techniques from the perspective of linguistics and then perform supervised learning to classify texts into the information support and the emotional support types. We adopt the Support Vector Machine (SVM), Naive-Bayesian, decision tree for automatic classification. We apply the proposed model to the HealthBoards forum, which is also one of the largest and most dynamic medical web forum.

문자영역의 분리와 기하학적 도면요소의 인식에 의한 도면 자동입력 (Automatic Drawing Input by Segmentation of Text Region and Recognltion of Geometric Drawing Element)

  • 배창석;민병우
    • 전자공학회논문지B
    • /
    • 제31B권6호
    • /
    • pp.91-103
    • /
    • 1994
  • As CAD systems are introduced in the filed of engineering design, the necessities for automatic drawing input are increased . In this paper, we propose a method for realizing automatic drawing input by separation of text regions and graphic regions, extraction of line vectors from graphic regions, and recognition of circular arcs and circles from line vectors. Sizes of isolated regions, on a drawing are used for separating text regions and graphic regions. Thinning and maximum allowable error method are used to extract line vectors. And geometric structures of line vectors are analyzed to recognize circular arcs and circles. By processing text regions and graphic regions separately, 30~40% of vector information can be reduced. Recognition of circular arcs and circles can increase the utilization of automatic drawing input function.

  • PDF

Single Pass Algorithm for Text Clustering by Encoding Documents into Tables

  • Jo, Tae-Ho
    • 한국멀티미디어학회논문지
    • /
    • 제11권12호
    • /
    • pp.1749-1757
    • /
    • 2008
  • This research proposes a modified version of single pass algorithm specialized for text clustering. Encoding documents into numerical vectors for using the traditional version of single pass algorithm causes the two main problems: huge dimensionality and sparse distribution. Therefore, in order to address the two problems, this research modifies the single pass algorithm into its version where documents are encoded into not numerical vectors but other forms. In the proposed version, documents are mapped into tables and the operation on two tables is defined for using the single pass algorithm. The goal of this research is to improve the performance of single pass algorithm for text clustering by modifying it into the specialized version.

  • PDF

Design of Image Generation System for DCGAN-Based Kids' Book Text

  • Cho, Jaehyeon;Moon, Nammee
    • Journal of Information Processing Systems
    • /
    • 제16권6호
    • /
    • pp.1437-1446
    • /
    • 2020
  • For the last few years, smart devices have begun to occupy an essential place in the life of children, by allowing them to access a variety of language activities and books. Various studies are being conducted on using smart devices for education. Our study extracts images and texts from kids' book with smart devices and matches the extracted images and texts to create new images that are not represented in these books. The proposed system will enable the use of smart devices as educational media for children. A deep convolutional generative adversarial network (DCGAN) is used for generating a new image. Three steps are involved in training DCGAN. Firstly, images with 11 titles and 1,164 images on ImageNet are learned. Secondly, Tesseract, an optical character recognition engine, is used to extract images and text from kids' book and classify the text using a morpheme analyzer. Thirdly, the classified word class is matched with the latent vector of the image. The learned DCGAN creates an image associated with the text.