• Title/Summary/Keyword: Noun Phrase Extraction

Effective Thematic Words Extraction from a Book using Compound Noun Phrase Synthesis Method

  • Ahn, Hee-Jeong;Kim, Kee-Won;Kim, Seung-Hoon
    • Journal of the Korea Society of Computer and Information / v.22 no.3 / pp.107-113 / 2017
  • Most online bookstores provide users with bibliographic information about books rather than concrete information such as thematic words and atmosphere. Thematic words in particular help a user understand books and cast a wide net. In this paper, we propose an efficient method for extracting thematic words from book text by applying a compound noun and noun phrase synthesis method. Compound nouns represent the characteristics of a book in more detail than single nouns. The proposed method extracts thematic words from book text by recognizing two types of noun phrases: single nouns and compound nouns formed by combining single nouns. The recognized single nouns, compound nouns, and noun phrases are weighted with TF-IDF and extracted as main words. In addition, this paper suggests calculating the frequencies of subject, object, and other roles separately in the TF-IDF calculation, rather than simply summing the frequencies of all nouns. Experiments are carried out in the field of economic management, and the extracted thematic words are verified through a survey and book search. In 9 out of the 10 experimental results used in this study, the thematic words extracted by the proposed method are more effective for understanding the content. It is also confirmed that the thematic words extracted by the proposed method yield better book search results.
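
The role-separated frequency idea in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the role weights, the function names `role_weighted_tf` and `tf_idf`, and the toy books are all assumptions, since the abstract gives no concrete values.

```python
import math
from collections import Counter

# Hypothetical role weights; the abstract does not state actual values.
ROLE_WEIGHTS = {"subject": 2.0, "object": 1.5, "other": 1.0}

def role_weighted_tf(mentions):
    """Sum a per-role weight for each (term, role) mention in one book."""
    tf = Counter()
    for term, role in mentions:
        tf[term] += ROLE_WEIGHTS.get(role, ROLE_WEIGHTS["other"])
    return tf

def tf_idf(books):
    """books maps book_id -> list of (term, role); returns book_id -> {term: score}."""
    n_books = len(books)
    tfs = {book: role_weighted_tf(mentions) for book, mentions in books.items()}
    df = Counter()
    for tf in tfs.values():
        df.update(tf.keys())  # each term counted once per book
    return {
        book: {term: weight * math.log(n_books / df[term]) for term, weight in tf.items()}
        for book, tf in tfs.items()
    }

# Toy data (assumed): terms may be single nouns or compound nouns.
books = {
    "A": [("interest rate", "subject"), ("market", "object"), ("market", "other")],
    "B": [("market", "subject"), ("supply chain", "object")],
}
scores = tf_idf(books)
top_a = max(scores["A"], key=scores["A"].get)
print(top_a)
```

Because "market" appears in both toy books, its IDF is zero, so the compound noun "interest rate" ranks first for book A regardless of the role weights chosen.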

Restoring Omitted Sentence Constituents in Encyclopedia Documents Using Structural SVM (Structural SVM을 이용한 백과사전 문서 내 생략 문장성분 복원)

  • Hwang, Min-Kook;Kim, Youngtae;Ra, Dongyul;Lim, Soojong;Kim, Hyunki
    • Journal of Intelligence and Information Systems / v.21 no.2 / pp.131-150 / 2015
  • Omission of noun phrases for obligatory cases is a common phenomenon in Korean and Japanese sentences that is not observed in English. When an argument of a predicate can be filled with a noun phrase co-referential with the title, the argument is more easily omitted in encyclopedia texts. The omitted noun phrase is called a zero anaphor or zero pronoun. Encyclopedias like Wikipedia are a major source for information extraction by intelligent application systems such as information retrieval and question answering systems. However, omission of noun phrases lowers the quality of information extraction. This paper deals with the problem of developing a system that can restore omitted noun phrases in encyclopedia documents. The problem that our system deals with is very similar to zero anaphora resolution, which is one of the important problems in natural language processing. A noun phrase existing in the text that can be used for restoration is called an antecedent. An antecedent must be co-referential with the zero anaphor. While the candidates for the antecedent are only noun phrases in the same text in the case of zero anaphora resolution, the title is also a candidate in our problem. In our system, the first stage is in charge of detecting the zero anaphor. In the second stage, antecedent search is carried out by considering the candidates. If antecedent search fails, an attempt is made, in the third stage, to use the title as the antecedent. The main characteristic of our system is that it makes use of a structural SVM for finding the antecedent. The noun phrases in the text that appear before the position of the zero anaphor comprise the search space. The main technique used in the methods proposed in previous research is to perform binary classification for all the noun phrases in the search space. The noun phrase classified as an antecedent with the highest confidence is selected as the antecedent.
However, we propose in this paper that antecedent search be viewed as the problem of assigning antecedent indicator labels to a sequence of noun phrases. In other words, sequence labeling is employed in antecedent search in the text. We are the first to suggest this idea. To perform sequence labeling, we suggest using a structural SVM which receives a sequence of noun phrases as input and returns the sequence of labels as output. An output label takes one of two values: one indicating that the corresponding noun phrase is the antecedent and the other indicating that it is not. The structural SVM we used is based on the modified Pegasos algorithm, which exploits a subgradient descent methodology used for optimization problems. To train and test our system, we selected a set of Wikipedia texts and constructed an annotated corpus in which gold-standard answers, such as zero anaphors and their possible antecedents, are provided. Training examples are prepared using the annotated corpus and used to train the SVMs and test the system. For zero anaphor detection, sentences are parsed by a syntactic analyzer and omitted subject or object cases are identified. Thus the performance of our system depends on that of the syntactic analyzer, which is a limitation of our system. When an antecedent is not found in the text, our system tries to use the title to restore the zero anaphor. This is based on binary classification using a regular SVM. The experiment showed that our system's performance is F1 = 68.58%. This means that a state-of-the-art system can be developed with our technique. It is expected that future work that enables the system to utilize semantic information can lead to a significant performance improvement.
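
The subgradient descent idea behind Pegasos can be sketched for the simpler binary case. This is plain binary Pegasos with a linear model, not the modified structural variant the abstract describes. The two-dimensional feature vectors, the hyperparameters, and the names `pegasos_train` and `score` are assumptions made for illustration.

```python
import random

def pegasos_train(examples, lam=0.1, epochs=20, seed=0):
    """Plain binary Pegasos: stochastic subgradient descent on the
    L2-regularized hinge loss. examples: list of (features, label), label in {-1, +1}."""
    rng = random.Random(seed)
    w = [0.0] * len(examples[0][0])
    t = 0
    for _ in range(epochs):
        for x, y in rng.sample(examples, len(examples)):  # shuffled pass
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            # The regularizer's gradient shrinks the weights on every step.
            w = [(1.0 - eta * lam) * wi for wi in w]
            # The hinge subgradient is non-zero only when the margin is violated.
            if margin < 1.0:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def score(w, x):
    """Signed distance-like score; positive means 'is the antecedent' here."""
    return sum(wi * xi for wi, xi in zip(w, x))

# Toy candidate feature vectors (assumed), labeled +1 for antecedent, -1 otherwise.
data = [([2.0, 1.0], 1), ([1.0, 2.0], 1), ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]
w = pegasos_train(data)
print([score(w, x) > 0 for x, _ in data])
```

In a structural setting the per-example hinge term would be replaced by a loss over whole label sequences, which this sketch does not attempt.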

A Method for Clustering Noun Phrases into Coreferents for the Same Person in Novels Translated into Korean (한국어 번역 소설에서 인물명 명사구의 동일인물 공통참조 클러스터링 방법)

  • Park, Taekeun;Kim, Seung-Hoon
    • Journal of Korea Multimedia Society / v.20 no.3 / pp.533-542 / 2017
  • Novels include various character names, depending on the genre, the spatio-temporal background of the novel, and the nationality of the characters. Moreover, characters and their names in a novel are created by the author's pen and imagination. As a result, no proper noun dictionary can include all kinds of character names. In addition, novels translated into Korean have character names consisting of two or more nouns (such as "Harry Potter"). In this paper, we propose a method to extract noun phrases for character names and to cluster the noun phrases into coreferents for the same character name. In the extraction of noun phrases, we utilize the KKMA morpheme analyzer and the CPFoAN character identification tool. In clustering the noun phrases into coreferents, we construct a directed graph with the character names extracted by CPFoAN and the extracted noun phrases, and then we create name sets for characters by traversing connected subgraphs in the directed graph. With four novels translated into Korean, we conduct a survey to evaluate the proposed method. The results show that the proposed method will be useful for speaker identification as well as for constructing the social network of characters.
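
The graph traversal step described in this abstract can be sketched as below. The edge rule (a short name pointing at a longer noun phrase that contains it), the choice to ignore edge direction during traversal, and the function name `name_sets` are assumptions for illustration, as the abstract does not specify how edges are built or followed.

```python
from collections import defaultdict

def name_sets(edges):
    """Group names into sets by traversing connected subgraphs.
    edges: list of (source_name, target_name) pairs of a directed graph;
    direction is ignored while traversing (weak connectivity, an assumption)."""
    adjacency = defaultdict(set)
    for src, dst in edges:
        adjacency[src].add(dst)
        adjacency[dst].add(src)
    seen, groups = set(), []
    for start in sorted(adjacency):  # sorted for a deterministic group order
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(adjacency[node] - seen)
        groups.append(group)
    return groups

# Toy edges (assumed): partial name -> full noun phrase.
edges = [
    ("Harry", "Harry Potter"),
    ("Potter", "Harry Potter"),
    ("Hermione", "Hermione Granger"),
]
groups = name_sets(edges)
print(groups)
```

A surname shared by several characters would merge their sets under this simple rule, so a real system would need additional constraints on which edges to create.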

A Review of the Opinion Target Extraction using Sequence Labeling Algorithms based on Features Combinations

  • Aziz, Noor Azeera Abdul;Maarof, Mohd Aizaini;Zainal, Anazida;Alkawaz, Mohammed Hazim
    • Journal of Internet Computing and Services / v.17 no.5 / pp.111-119 / 2016
  • In recent years, opinion analysis has been one of the key research fronts in many domains. Opinion target extraction is an essential process of opinion analysis. The target usually refers to a noun or noun phrase denoting an entity that is discussed by the opinion holder. Extracting the opinion target makes opinion analysis more precise and also helps to identify the opinion polarity, i.e., users can perceive the opinion on a target in detail, including all its features. One of the most commonly employed algorithms is the sequence labeling algorithm known as Conditional Random Fields. In the present article, recent opinion target extraction approaches based on sequence labeling algorithms and their feature combinations are reviewed by analyzing and comparing these approaches. A good selection of feature combinations can yield better accuracy. Selecting feature combinations is an essential process that can be used to identify and remove unneeded, irrelevant, and redundant attributes that do not contribute to the accuracy of a predictive model or may in fact decrease it. Hence, this review contributes to opinion analysis approaches in general and assists researchers working on opinion target extraction in particular.

Text Classification for Patents: Experiments with Unigrams, Bigrams and Different Weighting Methods

  • Im, ChanJong;Kim, DoWan;Mandl, Thomas
    • International Journal of Contents / v.13 no.2 / pp.66-74 / 2017
  • Patent classification is becoming more critical as patent filings have been increasing over the years. Despite comprehensive studies in the area, several issues remain in classifying patents on IPC hierarchical levels. Both the structural complexity and the shortage of patents at the lower levels of the hierarchy cause a decline in classification performance. Therefore, we propose a new method of classification based on different criteria, namely categories defined by domain experts in trend analysis reports, i.e., Patent Landscape Reports (PLR). Several experiments were conducted to identify the types of features and weighting methods that lead to the best classification performance using a Support Vector Machine (SVM). Two types of features (nouns and noun phrases) and five different weighting schemes (TF-idf, TF-rf, TF-icf, TF-icf-based, and TF-idcef-based) were tested.

Mention Detection Using Pointer Networks for Coreference Resolution

  • Park, Cheoneum;Lee, Changki;Lim, Soojong
    • ETRI Journal / v.39 no.5 / pp.652-661 / 2017
  • A mention has a noun or noun phrase as its head and forms a chunk, including any modifiers, that conveys a meaning. Mention detection refers to the extraction of mentions from a document. Coreference resolution refers to determining which mentions have the same meaning. Pointer networks, which are models based on a recurrent neural network encoder-decoder, output a list of elements corresponding to an input sequence. In this paper, we propose mention detection using pointer networks. This approach can solve the problem of overlapped mention detection, which cannot be solved by a sequence labeling approach. The experimental results show that the proposed mention detection approach achieves an F1 of 80.75%, which is 8% higher than rule-based mention detection, and that coreference resolution achieves a CoNLL F1 of 56.67% (mention boundary), which is 7.68% higher than coreference resolution using rule-based mention detection.

Feature Extraction of Web Document using Association Word Mining (연관 단어 마이닝을 사용한 웹문서의 특징 추출)

  • 고수정;최준혁;이정현
    • Journal of KIISE:Databases / v.30 no.4 / pp.351-361 / 2003
  • Previous studies that extract document features through word association have the problems of updating profiles periodically, dealing with noun phrases, and calculating the probability for indices. We propose a more effective feature extraction method that uses association word mining. The association word mining method, by using the Apriori algorithm, represents a document feature not as single words but as association-word vectors. Association words extracted from a document by the Apriori algorithm depend on confidence, support, and the number of composed words. This paper proposes an effective method to determine confidence, support, and the number of words composing association words. Since the feature extraction method using association word mining does not use a profile, it need not update the profile, and it automatically generates noun phrases by using confidence and support in the Apriori algorithm without calculating the probability for an index. We apply the proposed method to document classification using a Naive Bayes classifier, and compare it with methods based on information gain and TF-IDF. In addition, we compare the method proposed in this paper with document classification methods using index association and word association based on a probability model, respectively.
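
How support, confidence, and the number of composed words interact in Apriori can be sketched as follows. This is a generic, minimal Apriori over word sets, not the authors' method for choosing the thresholds. The threshold values, the toy documents, and the names `support`, `confidence`, and `association_words` are assumptions.

```python
from itertools import combinations

def support(itemset, docs):
    """Fraction of documents that contain every word of the itemset."""
    return sum(1 for doc in docs if itemset <= doc) / len(docs)

def confidence(antecedent, consequent, docs):
    """Estimated P(consequent | antecedent) over the documents."""
    return support(antecedent | consequent, docs) / support(antecedent, docs)

def association_words(docs, min_support=0.5, max_size=3):
    """Return frequent word sets of size 2..max_size (level-wise Apriori)."""
    words = sorted({w for doc in docs for w in doc})
    current = [frozenset([w]) for w in words
               if support(frozenset([w]), docs) >= min_support]
    result, size = [], 1
    while current and size < max_size:
        size += 1
        previous = set(current)
        # Join step: merge frequent sets that differ by one word.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Prune step: every (size-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in previous for s in combinations(c, size - 1))}
        current = [c for c in candidates if support(c, docs) >= min_support]
        result.extend(current)
    return result

# Toy documents (assumed), each reduced to its set of nouns.
docs = [frozenset(d) for d in (
    {"game", "player", "score"},
    {"game", "player"},
    {"game", "score"},
    {"stock", "market"},
)]
found = association_words(docs)
rule_conf = confidence(frozenset({"player"}), frozenset({"game"}), docs)
print(found, rule_conf)
```

Raising `min_support` or lowering `max_size` shrinks the set of association words, which is the trade-off the abstract says the proposed method is designed to tune.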

A Relation Analysis between NDSL User Queries and Technical Terms (NDSL 검색 질의어와 기술용어간의 관계에 대한 분석적 연구)

  • Kang, Nam-Gyu;Cho, Min-Hee;Kwon, Oh-Seok
    • Journal of Information Management / v.39 no.3 / pp.163-177 / 2008
  • In this paper, we analyzed the relationship between user query keywords that are used to search NDSL and technical terms extracted from NDSL journals. For the analysis, we extracted about 833,000 query keywords from NDSL search logs covering nearly 17 months and approximately 41,000,000 technical terms from NDSL, INSPEC, and FSTA journals. We used only the English noun phrases among those extracted, and then conducted equality analysis, relationship analysis, and frequency analysis.