• Title/Abstract/Keywords: IVSM


A New Approach to Automatic Keyword Generation Using Inverse Vector Space Model (original Korean title: 키워드 자동 생성에 대한 새로운 접근법: 역 벡터공간모델을 이용한 키워드 할당 방법)

  • 조원진;노상규;윤지영;박진수
    • Asia Pacific Journal of Information Systems, Vol. 21, No. 1, pp. 103-122, 2011
  • Numerous documents have recently become available electronically, and Internet search engines and digital libraries commonly return query results containing hundreds or even thousands of documents. In this situation, it is virtually impossible for users to examine complete documents to determine whether they might be useful. For this reason, some online documents are accompanied by a list of author-specified keywords that guide users by facilitating the filtering process. A set of keywords is thus often considered a condensed version of the whole document and plays an important role in document retrieval, Web page retrieval, document clustering, summarization, text mining, and so on. Since many academic journals ask authors to provide a list of five or six keywords on the first page of an article, keywords are most familiar in the context of journal articles. However, many other types of documents could also benefit from the use of keywords, including Web pages, email messages, news reports, magazine articles, and business papers. Although the potential benefit is large, implementation is the obstacle: manually assigning keywords to all documents is a daunting, even impractical, task because it is extremely tedious and time-consuming and requires a certain level of domain knowledge. Therefore, it is highly desirable to automate the keyword generation process. There are two main approaches to achieving this aim: keyword assignment and keyword extraction. Both approaches use machine learning methods and require, for training purposes, a set of documents with keywords already attached. In the former approach, a controlled vocabulary is given, and the aim is to match its terms to the texts. In other words, the keyword assignment approach seeks to select the words from a controlled vocabulary that best describe a document.
Although this approach is domain dependent and not easy to transfer or expand, it can generate implicit keywords that do not appear in a document. In the latter approach, by contrast, the aim is to extract keywords according to their relevance in the text, without a prior vocabulary. Here, automatic keyword generation is treated as a classification task, and keywords are commonly extracted using supervised learning techniques: keyword extraction algorithms classify candidate keywords in a document as positive or negative examples. Several systems, such as Extractor and Kea, were developed using the keyword extraction approach. The most indicative words in a document are selected as its keywords, so keyword extraction is limited to terms that appear in the document and cannot generate implicit keywords that are not included in it. According to Turney's experimental results, about 64% to 90% of the keywords assigned by authors can be found in the full text of an article. Conversely, 10% to 36% of author-assigned keywords do not appear in the article and therefore cannot be generated by keyword extraction algorithms. Our preliminary experiment also shows that 37% of author-assigned keywords are not included in the full text. This is why we adopt the keyword assignment approach. In this paper, we propose a new approach to automatic keyword assignment, namely IVSM (Inverse Vector Space Model). The model is based on the vector space model, a conventional information retrieval model that represents documents and queries as vectors in a multidimensional space. IVSM generates an appropriate keyword set for a specific document by measuring the distance between the document and the keyword sets.
The keyword assignment process of IVSM is as follows: (1) calculate the vector length of each keyword set based on each keyword weight; (2) preprocess and parse a target document that does not have keywords; (3) calculate the vector length of the target document based on term frequency; (4) measure the cosine similarity between each keyword set and the target document; and (5) generate the keywords that have high similarity scores. Two keyword generation systems were implemented using IVSM: an IVSM system for a Web-based community service and a stand-alone IVSM system. The first is implemented in a community service for sharing knowledge and opinions on current trends such as fashion, movies, social problems, and health information. The stand-alone IVSM system is dedicated to generating keywords for academic papers and has been tested on a number of academic papers, including those published by the Korean Association of Shipping and Logistics, the Korea Research Academy of Distribution Information, the Korea Logistics Society, the Korea Logistics Research Association, and the Korea Port Economic Association. We measured the performance of IVSM by the number of matches between the IVSM-generated keywords and the author-assigned keywords. In our experiment, the precision of IVSM applied to the Web-based community service and to academic journals was 0.75 and 0.71, respectively. Both systems perform much better than baseline systems that generate keywords based on simple probability. IVSM also shows performance comparable to Extractor, a representative keyword extraction system developed by Turney. As the number of electronic documents increases, we expect that IVSM can be applied to many electronic documents in Web-based communities and digital libraries.
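The five steps listed in the abstract can be sketched roughly in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how keyword weights are learned or how documents are preprocessed, so the sketch assumes each candidate keyword comes with a hypothetical term-weight vector (`keyword_sets`) and that preprocessing is simple lowercasing and whitespace splitting. The names `vector_length`, `cosine_similarity`, `assign_keywords`, and `top_n` are all illustrative.

```python
import math
from collections import Counter


def vector_length(weights):
    """Euclidean length of a sparse vector stored as {term: weight}."""
    return math.sqrt(sum(w * w for w in weights.values()))


def cosine_similarity(doc_tf, keyword_weights):
    """Cosine similarity between a document vector and a keyword vector."""
    dot = sum(tf * keyword_weights.get(term, 0.0) for term, tf in doc_tf.items())
    denom = vector_length(doc_tf) * vector_length(keyword_weights)
    return dot / denom if denom else 0.0


def assign_keywords(text, keyword_sets, top_n=3):
    """Return up to top_n (keyword, score) pairs with nonzero similarity.

    keyword_sets is an ASSUMED structure: {keyword: {term: weight}}.
    """
    # Step 2 (assumed preprocessing): lowercase and split on whitespace.
    # Step 3: represent the target document by raw term frequencies.
    doc_tf = Counter(text.lower().split())
    # Steps 1 and 4: vector lengths are computed inside cosine_similarity.
    scores = {
        keyword: cosine_similarity(doc_tf, weights)
        for keyword, weights in keyword_sets.items()
    }
    # Step 5: keep the keywords with the highest similarity scores.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(keyword, score) for keyword, score in ranked[:top_n] if score > 0]


# Toy data, invented purely for illustration.
example_keyword_sets = {
    "logistics": {"port": 2.0, "shipping": 1.0},
    "fashion": {"dress": 1.0, "trend": 1.0},
}
example_result = assign_keywords("port shipping port cargo", example_keyword_sets)
```

Because the candidate keywords come from `keyword_sets` and not from the document text, a keyword can be assigned even when the word itself never appears in the document, which is the property the abstract cites as the motivation for keyword assignment over keyword extraction.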

An Improved Method for Determination of Catechin and Its Derivatives in Extract and Oil of Grape Seeds (original Korean title: 포도씨유 및 추출물의 카테킨류 측정방법 개선)

  • 문성옥;이준영;김은정;최상원
    • Korean Journal of Food Science and Technology (한국식품과학회지), Vol. 35, No. 4, pp. 576-585, 2003
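  • The vanillin colorimetric method, currently widely used in the Korean Food Code and Food Additives Code to determine the catechin content of grape seed oil and extracts, was improved by using a polyamide cartridge, and an HPLC method was developed that measures catechin content more accurately and reproducibly. When measured by the vanillin colorimetric method, the catechin contents of grape seed oil and of solvent (hot water, ethanol, methanol, and acetone) extracts were in the ranges of $30\sim40$ mg% (g/100 g grape seed oil) and $17\sim43$% (g/100 g extract), respectively. In contrast, with the improved vanillin colorimetric method using a polyamide cartridge, the catechin contents of the same grape seed oil and extracts were trace amounts ($1\sim5$ ppm) and $4.0\sim7.5$%, respectively. When grape seed oil was analyzed by HPLC, the four major catechins of grape seeds [(+)-catechin, procyanidin $B_2$, (-)-epicatechin, and (-)-epicatechin gallate] were barely detectable. In the solvent extracts of grape seeds, however, all four major catechins were identified: the (+)-catechin and (-)-epicatechin contents were considerably high, at $1.35\sim2.60$% and $2.35\sim4.59$%, respectively, whereas the procyanidin $B_2$ and (-)-epicatechin gallate contents were relatively low, at $0.77\sim1.36$% and $0.06\sim0.30$%, respectively. In addition, the catechin composition of grape seeds from different domestic grape cultivars and growing regions was determined by HPLC. With the HPLC method established here, recoveries of the four major grape seed catechins were high, at over 95%, and reproducibility showed deviations below 5% in all cases, indicating excellent separation efficiency. The detection limits of the four catechins were in the range of $1\sim5$ ppm.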