• Title/Abstract/Keywords: Vector Space Model

365 search results, processing time 0.026 s

A New Approach to Automatic Keyword Generation Using Inverse Vector Space Model

  • 조원진;노상규;윤지영;박진수
    • Asia Pacific Journal of Information Systems / Vol. 21, No. 1 / pp. 103-122 / 2011
  • Numerous documents have recently become available electronically. Internet search engines and digital libraries commonly return query results containing hundreds or even thousands of documents, so it is virtually impossible for users to read each document in full to judge whether it is useful. For this reason, some online documents are accompanied by a list of author-specified keywords that helps users filter results. A set of keywords is thus often regarded as a condensed version of the whole document and plays an important role in document retrieval, Web page retrieval, document clustering, summarization, text mining, and so on. Because many academic journals ask authors to provide five or six keywords on the first page of an article, keywords are most familiar in the context of journal articles. However, many other types of documents, including Web pages, email messages, news reports, magazine articles, and business papers, do not yet benefit from keywords. Although the potential benefit is large, implementation is the obstacle: manually assigning keywords to all documents is daunting, even impractical, because it is extremely tedious and time-consuming and requires a certain level of domain knowledge. It is therefore highly desirable to automate keyword generation. There are two main approaches: keyword assignment and keyword extraction. Both use machine learning methods and require, for training, a set of documents with keywords already attached. In keyword assignment, a vocabulary is given and the aim is to match its entries to texts; in other words, the approach selects the words from a controlled vocabulary that best describe a document. 
Although this approach is domain dependent and not easy to transfer or expand, it can generate implicit keywords that do not appear in a document. In keyword extraction, by contrast, the aim is to extract keywords according to their relevance in the text, without a prior vocabulary. Automatic keyword generation is treated as a classification task, and keywords are commonly extracted with supervised learning techniques: the algorithms classify candidate keywords in a document as positive or negative examples. Several systems, such as Extractor and Kea, were developed using this approach. The most indicative words in a document are selected as its keywords, so keyword extraction is limited to terms that appear in the document and cannot generate implicit keywords. According to Turney's experimental results, about 64% to 90% of author-assigned keywords can be found in the full text of an article. Conversely, 10% to 36% of author-assigned keywords do not appear in the article and cannot be generated by keyword extraction algorithms. Our preliminary experiment also shows that 37% of author-assigned keywords are not included in the full text. This is why we adopted the keyword assignment approach. In this paper, we propose a new approach to automatic keyword assignment, IVSM (Inverse Vector Space Model). The model is based on the vector space model, a conventional information retrieval model that represents documents and queries as vectors in a multidimensional space. IVSM generates an appropriate keyword set for a specific document by measuring the distance between the document and the keyword sets. 
The keyword assignment process of IVSM is as follows: (1) calculate the vector length of each keyword set from its keyword weights; (2) preprocess and parse a target document that has no keywords; (3) calculate the vector length of the target document from its term frequencies; (4) measure the cosine similarity between each keyword set and the target document; and (5) generate the keywords with high similarity scores. Two keyword generation systems were implemented with IVSM: an IVSM system for a Web-based community service and a stand-alone IVSM system. The first is implemented in a community service for sharing knowledge and opinions on current trends such as fashion, movies, social problems, and health information. The stand-alone IVSM system is dedicated to generating keywords for academic papers and has been tested on a number of academic papers, including those published by the Korean Association of Shipping and Logistics, the Korea Research Academy of Distribution Information, the Korea Logistics Society, the Korea Logistics Research Association, and the Korea Port Economic Association. We measured the performance of IVSM by the number of matches between the IVSM-generated keywords and the author-assigned keywords. In our experiment, the precision of IVSM was 0.75 for the Web-based community service and 0.71 for academic journals. Both systems perform much better than baseline systems that generate keywords based on simple probability. IVSM also performs comparably to Extractor, a representative keyword extraction system developed by Turney. As the number of electronic documents increases, we expect that the IVSM proposed in this paper can be applied to many electronic documents in Web-based communities and digital libraries.
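
The five steps listed in this abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each candidate keyword is assumed to come with a hypothetical term-weight profile (`profiles`), the tokenizer is a simple lowercase word splitter, and the function names and `top_k` parameter are invented for the example.

```python
import math
import re
from collections import Counter


def vector_length(weights):
    """Euclidean length of a sparse term-weight vector (steps 1 and 3)."""
    return math.sqrt(sum(w * w for w in weights.values()))


def tokenize(text):
    """Step 2: a deliberately simple preprocessing stand-in."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a, b):
    """Step 4: cosine of the angle between two sparse vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    len_a, len_b = vector_length(a), vector_length(b)
    if len_a == 0 or len_b == 0:
        return 0.0
    return dot / (len_a * len_b)


def assign_keywords(document, keyword_profiles, top_k=2):
    """Step 5: return the top_k keywords with the highest non-zero similarity."""
    doc_vector = Counter(tokenize(document))  # term frequencies
    scored = [
        (cosine_similarity(profile, doc_vector), keyword)
        for keyword, profile in keyword_profiles.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [keyword for score, keyword in scored[:top_k] if score > 0]


# Hypothetical keyword profiles; in practice these would be learned
# from documents that already carry keywords.
profiles = {
    "information retrieval": {"query": 2.0, "document": 1.0, "search": 2.0},
    "logistics": {"port": 2.0, "shipping": 2.0, "cargo": 1.0},
    "text mining": {"document": 1.0, "term": 2.0, "keyword": 2.0},
}
doc = "Search engines match each query against document term vectors."
assigned = assign_keywords(doc, profiles)
```

Because the keywords come from the profiles rather than from the document text, a keyword such as "information retrieval" can be assigned even though that phrase never appears in `doc`, which is the property the abstract calls implicit keyword generation.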

Text-mining Based Graph Model for Keyword Extraction from Patent Documents

  • 이순근;임영문;엄완섭
    • 대한안전경영과학회지 / Vol. 17, No. 4 / pp. 335-342 / 2015
  • Growing interest in patents has led many individuals and companies to file patent applications in various areas. Patent applications are stored as electronic documents, and searching and categorizing these documents are major issues in data mining. Keyword extraction, which retrieves the representative keywords of a document, is especially important. Most keyword extraction techniques are based on the vector space model. This model, however, relies simply on term frequency: it weights terms by their frequency in documents and selects keywords in order of weight, so it cannot reflect the relations between keywords. This paper proposes an improved method that extracts more representative keywords by overcoming this limitation. The proposed model first prepares a candidate set using the vector model, then builds a graph representing the relations between pairs of candidate keywords in the set, and selects keywords based on this relationship graph.

Chatbot Design Method Using Hybrid Word Vector Expression Model Based on Real Telemarketing Data

  • Zhang, Jie;Zhang, Jianing;Ma, Shuhao;Yang, Jie;Gui, Guan
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 14, No. 4 / pp. 1400-1418 / 2020
  • In commercial promotion, the chatbot is known as a significant application of natural language processing (NLP). Conventional design methods use the bag-of-words (BOW) model alone, based on the Google database and other online corpora. First, in the bag-of-words model the vectors are unrelated to one another. Although this method suits discrete features, it does not help the machine understand continuous statements, because the connection between words is lost in the encoded word vector. Second, existing methods are tested on state-of-the-art online corpora but are hard to apply to real applications such as telemarketing data. In this paper, we propose an improved chatbot design method that uses a hybrid of the bag-of-words model and the skip-gram model, based on real telemarketing data. Specifically, we first collect real data in the telemarketing field and perform data cleaning and data classification on the constructed corpus. Second, words are represented with the hybrid bag-of-words and skip-gram model. The skip-gram model maps synonyms to nearby points in the vector space. Because the correlation between words is expressed, the word vector carries more information, making up for the shortcomings of using the bag-of-words model alone. Third, we use the term frequency-inverse document frequency (TF-IDF) weighting method to increase the weight of key words, then output the final word representation. Finally, the answer is produced using a hybrid of a retrieval model and a generative model. The retrieval model can accurately answer in-domain questions. The generative model supplements it for open-domain questions, where the final reply is produced through long short-term memory (LSTM) training and prediction. 
Experimental results show that the hybrid word vector expression model improves response accuracy and that the whole system can communicate with humans.
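
The TF-IDF weighting step mentioned in this abstract can be illustrated with a short Python sketch. The abstract does not state which TF-IDF variant the authors use, so the formula here (relative term frequency times the natural log of N/df) is an assumption, and the tiny tokenized corpus is invented for the example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {
                term: (count / total) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return weights


# Invented telemarketing-style utterances, already tokenized.
corpus = [
    ["buy", "phone", "plan"],
    ["phone", "price"],
    ["cancel", "phone", "plan"],
]
weights = tf_idf(corpus)
```

In this toy corpus "phone" occurs in every utterance and so receives zero weight, while "cancel", which occurs in only one, receives the largest weight in its utterance. That is the sense in which TF-IDF raises the weight of key words.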

Line-of-Sight (LOS) Vector Adjustment Model for Restitution of SPOT 4 Imagery

  • 정형섭
    • 한국측량학회지 / Vol. 28, No. 2 / pp. 247-254 / 2010
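  • A new approach was studied to correct the geometric distortion of SPOT 4 satellite imagery. A new condition equation was derived by establishing the relationship between the satellite and the Earth in space. Assuming that the initial satellite information is distorted by a roughly constant offset, we studied a LOS (Line-Of-Sight) vector adjustment model that corrects the geometry of the satellite image by adjusting the LOS vector. To validate the model, we tested it on SPOT 4 imagery acquired with a large look angle. For an accurate experiment, 10 ground control points (GCPs) and 25 check points surveyed with GPS were used. The image geometry computed directly from the initial satellite information supplied with the SPOT 4 imagery (satellite position, velocity, attitude, look angle, etc.) showed a nearly constant offset across all 35 GCPs and check points, which confirmed that the LOS vector adjustment model can be applied to SPOT 4 imagery. We applied the model while varying the number of GCPs, evenly distributed over the image, from 2 to 10 and computed the check-point errors; the errors at all 25 check points were below 1 pixel. This new approach effectively corrected SPOT 4 image geometry using two or more GCPs, and it is expected to give good results for high-resolution satellite imagery acquired in the same way as SPOT imagery.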

Design of nonlinear optimal regulators using lower dimensional riemannian geometric models

  • Izawa, Yoshiaki;Hakomori, Kyojiro
    • 제어로봇시스템학회:학술대회논문집 / 제어로봇시스템학회 1994 Proceedings of the Korea Automatic Control Conference, 9th (KACC); Taejeon, Korea; 17-20 Oct. 1994 / pp. 628-633 / 1994
  • A new Riemannian geometric model of the controlled plant is proposed by embedding the control vector space in the state space, so as to reduce the dimension of the model. This geometric model is derived by replacing the orthogonal straight coordinate axes on the state space of a linear system with curvilinear coordinate axes. The integral manifold of the geometric model therefore becomes homeomorphic to that of a fictitious linear system. For the lower-dimensional Riemannian geometric model, a nonlinear optimal regulator is designed with a quadratic performance index that contains the Riemannian metric tensor. Since the integral manifold of the nonlinear regulator is determined to be homeomorphic to that of the linear regulator, the basic properties of the linear regulator, such as feedback structure, stability, and robustness, are expected to carry over to the nonlinear regulator. To apply this regulator theory to a real nonlinear plant, we discuss how to distort the curvilinear coordinate axes on which a nonlinear plant behaves as a linear system. Consequently, a partial differential equation with respect to the homeomorphism is derived. Finally, the computational algorithm for the nonlinear optimal regulator is discussed and a numerical example is shown.

Development of Planetary Ephemeris Generation Program for Satellite

  • 이광현;조동현;김해동
    • 한국항공우주학회지 / Vol. 47, No. 3 / pp. 220-227 / 2019
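  • Satellites in orbit form a reference vector using an ephemeris-based solar model. For this purpose they use either the DE-Series, an ephemeris developed by the Jet Propulsion Laboratory (JPL), or the reference vector generation formula proposed by Vallado. The DE-Series provides numerical coefficients of Chebyshev polynomials and has the advantage of high precision, but it places a computational burden on the satellite's on-board computer; the Vallado method obtains the sun vector simply from a formula, but with low precision. In this paper, we developed a program that curve-fits the sun positions obtained from the DE-Series with Chebyshev polynomials and provides Chebyshev polynomial coefficients from which the sun's position coordinates in the inertial frame can be computed. Precision was improved over the existing method, and the proposed method can be used for high-performance, high-precision very small satellite missions.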
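
The idea of storing Chebyshev coefficients on board and evaluating them cheaply can be sketched in Python as follows. This is only an illustration of Chebyshev fitting and Clenshaw evaluation, not the authors' program: the function `toy_sun_x` is an invented stand-in for one coordinate of a DE-Series sun position, and the 32-day interval and 8 coefficients are arbitrary choices for the example.

```python
import math


def chebyshev_coefficients(f, lo, hi, n):
    """Coefficients c_0..c_{n-1} of the Chebyshev interpolant of f on [lo, hi]."""
    # Sample f at the n Chebyshev nodes mapped from [-1, 1] to [lo, hi].
    nodes = [math.cos(math.pi * (k + 0.5) / n) for k in range(n)]
    values = [f(0.5 * (hi - lo) * x + 0.5 * (hi + lo)) for x in nodes]
    coeffs = []
    for j in range(n):
        s = sum(
            values[k] * math.cos(math.pi * j * (k + 0.5) / n) for k in range(n)
        )
        coeffs.append(2.0 * s / n)
    coeffs[0] *= 0.5  # so the series is sum_j c_j * T_j(x)
    return coeffs


def chebyshev_evaluate(coeffs, lo, hi, t):
    """Evaluate sum_j c_j * T_j(x) at time t using the Clenshaw recurrence."""
    x = (2.0 * t - lo - hi) / (hi - lo)  # map t to [-1, 1]
    b1 = b2 = 0.0
    for c in reversed(coeffs[1:]):
        b1, b2 = 2.0 * x * b1 - b2 + c, b1
    return x * b1 - b2 + coeffs[0]


def toy_sun_x(day):
    """Invented stand-in for one sun coordinate: a unit circle, 365.25-day period."""
    return math.cos(2.0 * math.pi * day / 365.25)


coeffs = chebyshev_coefficients(toy_sun_x, 0.0, 32.0, 8)
max_error = max(
    abs(chebyshev_evaluate(coeffs, 0.0, 32.0, d / 4.0) - toy_sun_x(d / 4.0))
    for d in range(0, 129)
)
```

Evaluation needs only a handful of multiply-add operations per coordinate, which is the kind of trade-off between precision and on-board computation that the abstract describes.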

A Study of Automatic Recognition on Target and Flame Based Gradient Vector Field Using Infrared Image

  • 김춘호;이주영
    • 한국항공우주학회지 / Vol. 49, No. 1 / pp. 63-73 / 2021
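  • This paper proposes a technique for automatically recognizing a target so that an EOTS (Electro-Optical Targeting System) mounted on an unmanned aerial vehicle can track the target robustly against the influence of flame when a target and flame are present at the same time against an aerial or maritime background. The proposed technique converts infrared images of the target and flame into a Gradient Vector Field, applies each gradient magnitude to a polynomial curve fitting tool to extract polynomial coefficients, and trains a shallow neural network model on them, thereby recognizing the target and flame automatically. A collected database of various infrared images of targets and flames was divided into training, validation, and test data to confirm the automatic recognition performance of the proposed technique. The algorithm can be applied to collision avoidance during autonomous flight of unmanned aerial vehicles, wildfire detection, and automatic detection and recognition of aerial and maritime targets.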

Support-vector-machine Based Sensorless Control of Permanent Magnet Synchronous Motor

  • Back, Woon-Jae;Han, Dong-Chang;Kim, Jong-Mu;Park, Jung-Il;Lee, Suk-Gyu
    • 제어로봇시스템학회:학술대회논문집 / 제어로봇시스템학회 2004 ICCAS / pp. 149-152 / 2004
  • Speed and torque control of a PMSM (Permanent Magnet Synchronous Motor) is usually achieved with position and speed sensors, which require additional mounting space, reduce reliability in harsh environments, and increase the cost of the motor. Many studies have therefore sought to eliminate speed and position sensors. In this paper, a novel speed-sensorless control of a permanent magnet synchronous motor based on SVMR (Support Vector Machine Regression) is presented. The SVM regression method is an algorithm that estimates an unknown mapping between a system's inputs and outputs from the available training data. Two well-known, distinct voltage models are needed to estimate the speed of a PMSM. The validity and usefulness of the proposed algorithm are verified through numerical simulation.

Comparing Korean Spam Document Classification Using Document Classification Algorithms

  • 송철환;유성준
    • 한국정보과학회:학술대회논문집 / 한국정보과학회 2006 Fall Conference Proceedings, Vol. 33, No. 2 (C) / pp. 222-225 / 2006
  • Korea has many Internet users compared with other countries. Correspondingly, Korean Internet users report considerable inconvenience from spam mail. To address this problem, this paper describes a study of Korean spam document filtering using various feature weighting, feature selection, and document classification algorithms. Nouns are extracted from Korean documents (spam and non-spam) and used as the input features of each classification algorithm. For feature weighting, instead of the conventional method, we compute a variance value for each feature, apply Max Value Selection to choose global features, and then apply the traditional feature selection methods MI, IG, and CHI to extract features. The extracted features are fed to classification algorithms such as Naive Bayes and Support Vector Machine. The Vector Space Model is used in its traditional form. As a result, the combination of a Support Vector Machine classifier, TF-IDF Variance Weighting (combined with Max Value Selection), and CHI feature selection achieved a recall of 99.4%, a precision of 97.4%, and an F-measure of 98.39%.
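
As a consistency check on the figures reported in this abstract, the F-measure can be recomputed from the stated precision and recall. The abstract does not say which F-measure variant was used; the balanced F1 definition is assumed here.

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.974 \times 0.994}{0.974 + 0.994}
    = \frac{1.936312}{1.968}
    \approx 0.9839
```

Under that assumption the result agrees with the reported 98.39%.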

A Study of Thrust-Vectoring Nozzle Flow Using Coflow-Counterflow Concept

  • 정성재;;김희동
    • 대한기계학회:학술대회논문집 / 대한기계학회 2003 Fall Conference / pp. 592-597 / 2003
  • Thrust vector control using a coflow-counterflow concept is achieved by suction and blowing through a slot adjacent to a primary jet that is shrouded by a suction collar. In the present study, the flow characteristics of thrust vectoring are investigated numerically. The nozzle has a design Mach number of 2.0, and the operating pressure ratio is varied to obtain various features of the nozzle flow. Test conditions cover nozzle pressure ratios from 6.0 to 10.0 and suction pressures from 90 kPa to 35 kPa. Two-dimensional, compressible Navier-Stokes computations are conducted with the RNG $\kappa$-$\varepsilon$ turbulence model. The computational results provide an understanding of the detailed physics of the thrust vectoring process. An increase in the nozzle pressure ratio is found to increase thrust efficiency but reduce the thrust vector angle.
