• Title/Summary/Keyword: Compound noun extraction

Search Result 9, Processing Time 0.026 seconds

Effective Thematic Words Extraction from a Book using Compound Noun Phrase Synthesis Method

  • Ahn, Hee-Jeong;Kim, Kee-Won;Kim, Seung-Hoon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.22 no.3
    • /
    • pp.107-113
    • /
    • 2017
  • Most of online bookstores are providing a user with the bibliographic book information rather than the concrete information such as thematic words and atmosphere. Especially, thematic words help a user to understand books and cast a wide net. In this paper, we propose an efficient extraction method of thematic words from book text by applying the compound noun and noun phrase synthetic method. The compound nouns represent the characteristics of a book in more detail than single nouns. The proposed method extracts the thematic word from book text by recognizing two types of noun phrases, such as a single noun and a compound noun combined with single nouns. The recognized single nouns, compound nouns, and noun phrases are calculated through TF-IDF weights and extracted as main words. In addition, this paper suggests a method to calculate the frequency of subject, object, and other roles separately, not just the sum of the frequencies of all nouns in the TF-IDF calculation method. Experiments is carried out in the field of economic management, and thematic word extraction verification is conducted through survey and book search. Thus, 9 out of the 10 experimental results used in this study indicate that the thematic word extracted by the proposed method is more effective in understanding the content. Also, it is confirmed that the thematic word extracted by the proposed method has a better book search result.

Korean Base-Noun Extraction and its Application (한국어 기준명사 추출 및 그 응용)

  • Kim, Jae-Hoon
    • The KIPS Transactions:PartB
    • /
    • v.15B no.6
    • /
    • pp.613-620
    • /
    • 2008
  • Noun extraction plays an important part in the fields of information retrieval, text summarization, and so on. In this paper, we present a Korean base-noun extraction system and apply it to text summarization to deal with a huge amount of text effectively. The base-noun is an atomic noun but not a compound noun and we use tow techniques, filtering and segmenting. The filtering technique is used for removing non-nominal words from text before extracting base-nouns and the segmenting technique is employed for separating a particle from a nominal and for dividing a compound noun into base-nouns. We have shown that both of the recall and the precision of the proposed system are about 89% on the average under experimental conditions of ETRI corpus. The proposed system has applied to Korean text summarization system and is shown satisfactory results.

Error-driven Noun-Connection Rule Extraction for Morphological Analysis (오류에 기반한 복합명사 좌우접속규칙 사전 구축)

  • Lee, Kong Joo;Lee, Songwook
    • Journal of Advanced Marine Engineering and Technology
    • /
    • v.36 no.8
    • /
    • pp.1123-1128
    • /
    • 2012
  • The goal of this research is to develop an error-driven noun-connection rules which is used for breaking complicate nouns in Korean morphology analysis module. We collected complicate nouns from Web sites, and analyzed them by CnuMa. Whenever we find errors from outputs of the analyzer, we write noun-connection rules to correct the errors. The noun-connection rules are devised by considering left/right contexts in compound nouns. The error-driven noun-connection rules are helpful in improving precision and recall of a Korean morphology analyzer, CnuMa by 2.8% and 10.8%, respectively.

A Method for Compound Noun Extraction to Improve Accuracy of Keyword Analysis of Social Big Data

  • Kim, Hyeon Gyu
    • Journal of the Korea Society of Computer and Information
    • /
    • v.26 no.8
    • /
    • pp.55-63
    • /
    • 2021
  • Since social big data often includes new words or proper nouns, statistical morphological analysis methods have been widely used to process them properly which are based on the frequency of occurrence of each word. However, these methods do not properly recognize compound nouns, and thus have a problem in that the accuracy of keyword extraction is lowered. This paper presents a method to extract compound nouns in keyword analysis of social big data. The proposed method creates a candidate group of compound nouns by combining the words obtained through the morphological analysis step, and extracts compound nouns by examining their frequency of appearance in a given review. Two algorithms have been proposed according to the method of constructing the candidate group, and the performance of each algorithm is expressed and compared with formulas. The comparison result is verified through experiments on real data collected online, where the results also show that the proposed method is suitable for real-time processing.

Extraction of English-Korean Compound Noun Translation through Automatic Alignment Method (자동 정렬을 통한 영한 복합어의 역어 추출)

  • 이주호;최기선;이재성
    • Proceedings of the Korean Society for Cognitive Science Conference
    • /
    • 2000.06a
    • /
    • pp.309-314
    • /
    • 2000
  • 본 논문에서는 양국어로 된 병렬 코퍼스로부터 복합어의 역어를 추출하기 위한 정렬 방법을 제시한다. 여기에서는 개념어에 대한 양국어 공기정보를 사용하여 기본 정렬을 하고, 인접한 개념어로 정렬의 단위를 확장했다. 또한 재추정 기법을 사용하여 대역 확률을 계산함으로써 보다 높은 정확률을 얻을 수 있었다. 본 논문에서 제안한 방법을 적용하여 139,265개의 영어 어절로 이루어진 우루과이 라운드 영한 병렬 코퍼스에 대해서 실험한 결과 2,290개의 대역어쌍을 얻었고, 그 정확률은 74%였다.

  • PDF

Extraction of English-Korean Compound Noun Translation through Automatic Alignment Method (자동 정렬을 통한 영한 복합어의 역어 추출)

  • Lee, Ju-Ho;Choi, Key-Sun;Lee, Jae-Sung
    • Annual Conference on Human and Language Technology
    • /
    • 2000.10d
    • /
    • pp.309-314
    • /
    • 2000
  • 본 논문에서는 양국어로 된 병렬 코퍼스로부터 복합어의 역어를 추출하기 위한 정렬 방법을 제시한다. 여기에서는 개념어에 대한 양국어 공기정보를 사용하여 기본 정렬을 하고, 인접한 개념어로 정렬의 단위를 확장했다. 또한 재추정 기법을 사용하여 대역 확률을 계산함으로써 보다 높은 정확률을 얻을 수 있었다. 본 논문에서 제안한 방법을 적용하여 139,265개의 영어 어절로 이루어진 우루과이 라운드 영한 병렬 코퍼스에 대해서 실험한 결과 2,290개의 대역어 쌍을 얻었고, 그 정확률은 74%였다.

  • PDF

A Normalization Method of Distorted Korean SMS Sentences for Spam Message Filtering (스팸 문자 필터링을 위한 변형된 한글 SMS 문장의 정규화 기법)

  • Kang, Seung-Shik
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.3 no.7
    • /
    • pp.271-276
    • /
    • 2014
  • Short message service(SMS) in a mobile communication environment is a very convenient method. However, it caused a serious side effect of generating spam messages for advertisement. Those who send spam messages distort or deform SMS sentences to avoid the messages being filtered by automatic filtering system. In order to increase the performance of spam filtering system, we need to recover the distorted sentences into normal sentences. This paper proposes a method of normalizing the various types of distorted sentence and extracting keywords through automatic word spacing and compound noun decomposition.

The Big Data Analytics Regarding the Cadastral Resurvey News Articles

  • Joo, Yong-Jin;Kim, Duck-Ho
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography
    • /
    • v.32 no.6
    • /
    • pp.651-659
    • /
    • 2014
  • With the popularization of big data environment, big data have been highlighted as a key information strategy to establish national spatial data infrastructure for a scientific land policy and the extension of the creative economy. Especially interesting from our point of view is the cadastral information is a core national information source that forms the basis of spatial information that leads to people's daily life including the production and consumption of information related to real estate. The purpose of our paper is to suggest the scheme of big data analytics with respect to the articles of cadastral resurvey project in order to approach cadastral information in terms of spatial data integration. As specific research method, the TM (Text Mining) package from R was used to read various formats of news reports as texts, and nouns were extracted by using the KoNLP package. That is, we searched the main keywords regarding cadastral resurvey, performing extraction of compound noun and data mining analysis. And visualization of the results was presented. In addition, new reports related to cadastral resurvey between 2012 and 2014 were searched in newspapers, and nouns were extracted from the searched data for the data mining analysis of cadastral information. Furthermore, the approval rating, reliability, and improvement of rules were presented through correlation analyses among the extracted compound nouns. As a result of the correlation analysis among the most frequently used ones of the extracted nouns, five groups of data consisting of 133 keywords were generated. The most frequently appeared words were "cadastral resurvey," "civil complaint," "dispute," "cadastral survey," "lawsuit," "settlement," "mediation," "discrepant land," and "parcel." In Conclusions, the cadastral resurvey performed in some local governments has been proceeding smoothly as positive results. On the other hands, disputes from owner of land have been provoking a stream of complaints from parcel surveying for the cadastral resurvey. Through such keyword analysis, various public opinion and the types of civil complaints related to the cadastral resurvey project can be identified to prevent them through pre-emptive responses for direct call centre on the cadastral surveying, Electronic civil service and customer counseling, and high quality services about cadastral information can be provided. This study, therefore, provides a stepping stones for developing an account of big data analytics which is able to comprehensively examine and visualize a variety of news report and opinions in cadastral resurvey project promotion. Henceforth, this will contribute to establish the foundation for a framework of the information utilization, enabling scientific decision making with speediness and correctness.

An Analysis of IT Trends Using Tweet Data (트윗 데이터를 활용한 IT 트렌드 분석)

  • Yi, Jin Baek;Lee, Choong Kwon;Cha, Kyung Jin
    • Journal of Intelligence and Information Systems
    • /
    • v.21 no.1
    • /
    • pp.143-159
    • /
    • 2015
  • Predicting IT trends has been a long and important subject for information systems research. IT trend prediction makes it possible to acknowledge emerging eras of innovation and allocate budgets to prepare against rapidly changing technological trends. Towards the end of each year, various domestic and global organizations predict and announce IT trends for the following year. For example, Gartner Predicts 10 top IT trend during the next year, and these predictions affect IT and industry leaders and organization's basic assumptions about technology and the future of IT, but the accuracy of these reports are difficult to verify. Social media data can be useful tool to verify the accuracy. As social media services have gained in popularity, it is used in a variety of ways, from posting about personal daily life to keeping up to date with news and trends. In the recent years, rates of social media activity in Korea have reached unprecedented levels. Hundreds of millions of users now participate in online social networks and communicate with colleague and friends their opinions and thoughts. In particular, Twitter is currently the major micro blog service, it has an important function named 'tweets' which is to report their current thoughts and actions, comments on news and engage in discussions. For an analysis on IT trends, we chose Tweet data because not only it produces massive unstructured textual data in real time but also it serves as an influential channel for opinion leading on technology. Previous studies found that the tweet data provides useful information and detects the trend of society effectively, these studies also identifies that Twitter can track the issue faster than the other media, newspapers. Therefore, this study investigates how frequently the predicted IT trends for the following year announced by public organizations are mentioned on social network services like Twitter. IT trend predictions for 2013, announced near the end of 2012 from two domestic organizations, the National IT Industry Promotion Agency (NIPA) and the National Information Society Agency (NIA), were used as a basis for this research. The present study analyzes the Twitter data generated from Seoul (Korea) compared with the predictions of the two organizations to analyze the differences. Thus, Twitter data analysis requires various natural language processing techniques, including the removal of stop words, and noun extraction for processing various unrefined forms of unstructured data. To overcome these challenges, we used SAS IRS (Information Retrieval Studio) developed by SAS to capture the trend in real-time processing big stream datasets of Twitter. The system offers a framework for crawling, normalizing, analyzing, indexing and searching tweet data. As a result, we have crawled the entire Twitter sphere in Seoul area and obtained 21,589 tweets in 2013 to review how frequently the IT trend topics announced by the two organizations were mentioned by the people in Seoul. The results shows that most IT trend predicted by NIPA and NIA were all frequently mentioned in Twitter except some topics such as 'new types of security threat', 'green IT', 'next generation semiconductor' since these topics non generalized compound words so they can be mentioned in Twitter with other words. To answer whether the IT trend tweets from Korea is related to the following year's IT trends in real world, we compared Twitter's trending topics with those in Nara Market, Korea's online e-Procurement system which is a nationwide web-based procurement system, dealing with whole procurement process of all public organizations in Korea. The correlation analysis show that Tweet frequencies on IT trending topics predicted by NIPA and NIA are significantly correlated with frequencies on IT topics mentioned in project announcements by Nara market in 2012 and 2013. The main contribution of our research can be found in the following aspects: i) the IT topic predictions announced by NIPA and NIA can provide an effective guideline to IT professionals and researchers in Korea who are looking for verified IT topic trends in the following topic, ii) researchers can use Twitter to get some useful ideas to detect and predict dynamic trends of technological and social issues.