• Title/Summary/Keyword: idf-domain

Search Result 26, Processing Time 0.031 seconds

Event Sentence Extraction for Online Trend Analysis (온라인 동향 분석을 위한 이벤트 문장 추출 방안)

  • Yun, Bo-Hyun
    • The Journal of the Korea Contents Association
    • /
    • v.12 no.9
    • /
    • pp.9-15
    • /
    • 2012
  • A conventional event sentence extraction research doesn't learn the 3W features in the learning step and applies the rule on whether the 3W feature exists in the extraction step. This paper presents a sentence weight based event sentence extraction method that calculates the weight of the 3W features in the learning step and applies the weight of the 3W features in the extraction step. In the experimental result, we show that top 30% features by the $TF{\times}IDF$ weighting method is good in the feature filtering. In the real estate domain of the public issue, the performance of sentence weight based event sentence extraction method is improved by who and when of 3W features. Moreover, In the real estate domain of the public issue, the sentence weight based event sentence extraction method is better than the other machine learning based extraction method.

A Clustering Technique Using Association Rules for The Library and Information Science Terminology (연관규칙을 이용한 문헌정보학 전문용어 클러스터링 기법에 관한 연구)

  • Seung, Hyon-Woo;Park, Mi-Young
    • Journal of the Korean Society for Library and Information Science
    • /
    • v.37 no.2
    • /
    • pp.89-105
    • /
    • 2003
  • In this paper, an effective method for clustering terminologies extracted from text is proposed, in order to develope a search engine to extract relevant information from large web documents. To prevent frequency of the meaningless association rules among general terminologies, only useful association rules among terminologies are produced using database tables which consist of domain-specific terminologies. Such association rules are produced by applying the Apriori algorithm after forming transaction units from groups of association rules in a document. A group of association rules produced from a terminology forms in a cluster.

Retrieval methodology for similar NPP LCO cases based on domain specific NLP

  • No Kyu Seong ;Jae Hee Lee ;Jong Beom Lee;Poong Hyun Seong
    • Nuclear Engineering and Technology
    • /
    • v.55 no.2
    • /
    • pp.421-431
    • /
    • 2023
  • Nuclear power plants (NPPs) have technical specifications (Tech Specs) to ensure that the equipment and key operating parameters necessary for the safe operation of the power plant are maintained within limiting conditions for operation (LCO) determined by a safety analysis. The LCO of Tech Specs that identify the lowest functional capability of equipment required for safe operation for a facility must be complied for the safe operation of NPP. There have been previous studies to aid in compliance with LCO relevant to rule-based expert systems; however, there is an obvious limit to expert systems for implementing the rules for many situations related to LCO. Therefore, in this study, we present a retrieval methodology for similar LCO cases in determining whether LCO is met or not met. To reflect the natural language processing of NPP features, a domain dictionary was built, and the optimal term frequency-inverse document frequency variant was selected. The retrieval performance was improved by adding a Boolean retrieval model based on terms related to the LCO in addition to the vector space model. The developed domain dictionary and retrieval methodology are expected to be exceedingly useful in determining whether LCO is met.

A Semantic Orientation Prediction Method of Sentiment Features Based on the General and Domain-Dependent Characteristics (일반적, 영역 의존적 특성을 반영한 감정 자질의 의미지향성 추정 방법)

  • Hwang, Jaewon;Ko, Youngjoong
    • Annual Conference on Human and Language Technology
    • /
    • 2009.10a
    • /
    • pp.155-159
    • /
    • 2009
  • 본 논문은 한국어 문서 감정분류를 위한 중요한 어휘 자원인 감정자질(Sentiment Feature)의 의미지향성(Semantic Orientation) 추정을 위해 일반적인 특성과 영역(Domain) 의존적인 특성을 반영하여 한국어 문서 감정분류(Sentiment Classification)의 성능 향상을 얻을 수 있는 기법을 제안한다. 감정자질의 의미지 향성은 검색 엔진을 통해 추출한 각 감정 자질의 스니핏(Snippet)과 실험 말뭉치를 이용하여 추정할 수 있다. 검색 엔진을 통해 추출된 스니핏은 감정자질의 일반적인 특성을 반영하며, 실험 말뭉치는 분류하고자 하는 영역 의존적인 특성을 반영한다. 이렇게 얻어진 감정자질의 의미지향성 수치는 각 문장의 감정강도를 추정하기 위해 이용되며, 문장의 감정 강도의 값을 TF-IDF 가중치 기법에 접목하여 감정자질의 가중치를 책정한다. 최종적으로 학습 과정에서 긍정 문서에서는 긍정 감정자질, 부정 문서에서는 부정 감정자질을 대상으로 추가 가중치를 부여하여 학습하였다. 본 논문에서는 문서 분류에 뛰어난 성능을 보여주는 지지 벡터 기계(Support Vector Machine)를 사용하여 제안한 방법의 성능을 평가한다. 평가 결과, 일반적인 정보 검색에서 사용하는 내용어(Content Word) 기반의 자질을 사용한 경우보다 3.1%의 성능향상을 보였다.

  • PDF

Text Classification for Patents: Experiments with Unigrams, Bigrams and Different Weighting Methods

  • Im, ChanJong;Kim, DoWan;Mandl, Thomas
    • International Journal of Contents
    • /
    • v.13 no.2
    • /
    • pp.66-74
    • /
    • 2017
  • Patent classification is becoming more critical as patent filings have been increasing over the years. Despite comprehensive studies in the area, there remain several issues in classifying patents on IPC hierarchical levels. Not only structural complexity but also shortage of patents in the lower level of the hierarchy causes the decline in classification performance. Therefore, we propose a new method of classification based on different criteria that are categories defined by the domain's experts mentioned in trend analysis reports, i.e. Patent Landscape Report (PLR). Several experiments were conducted with the purpose of identifying type of features and weighting methods that lead to the best classification performance using Support Vector Machine (SVM). Two types of features (noun and noun phrases) and five different weighting schemes (TF-idf, TF-rf, TF-icf, TF-icf-based, and TF-idcef-based) were experimented on.

A Study on the Dense Vector Representation of Query-Passage for Open Domain Question Answering (오픈 도메인 질의응답을 위한 질문-구절의 밀집 벡터 표현 연구)

  • Minji Jung;Saebyeok Lee;Youngjune Kim;Cheolhun Heo;Chunghee Lee
    • Annual Conference on Human and Language Technology
    • /
    • 2022.10a
    • /
    • pp.115-121
    • /
    • 2022
  • 질문에 답하기 위해 관련 구절을 검색하는 기술은 오픈 도메인 질의응답의 검색 단계를 위해 필요하다. 전통적인 방법은 정보 검색 기법인 빈도-역문서 빈도(TF-IDF) 기반으로 희소한 벡터 표현을 활용하여 구절을 검색한다. 하지만 희소 벡터 표현은 벡터 길이가 길 뿐만 아니라, 질문에 나오지 않는 단어나 토큰을 검색하지 못한다는 취약점을 가진다. 밀집 벡터 표현 연구는 이러한 취약점을 개선하고 있으며 대부분의 연구가 영어 데이터셋을 학습한 것이다. 따라서, 본 연구는 한국어 데이터셋을 학습한 밀집 벡터 표현을 연구하고 여러 가지 부정 샘플(negative sample) 추출 방법을 도입하여 전이 학습한 모델 성능을 비교 분석한다. 또한, 대화 응답 선택 태스크에서 밀집 검색에 활용한 순위 재지정 상호작용 레이어를 추가한 실험을 진행하고 비교 분석한다. 밀집 벡터 표현 모델을 학습하는 것이 도전적인 과제인만큼 향후에도 다양한 시도가 필요할 것으로 보인다.

  • PDF

A Collaborative Filtering System Combined with Users' Review Mining : Application to the Recommendation of Smartphone Apps (사용자 리뷰 마이닝을 결합한 협업 필터링 시스템: 스마트폰 앱 추천에의 응용)

  • Jeon, ByeoungKug;Ahn, Hyunchul
    • Journal of Intelligence and Information Systems
    • /
    • v.21 no.2
    • /
    • pp.1-18
    • /
    • 2015
  • Collaborative filtering(CF) algorithm has been popularly used for recommender systems in both academic and practical applications. A general CF system compares users based on how similar they are, and creates recommendation results with the items favored by other people with similar tastes. Thus, it is very important for CF to measure the similarities between users because the recommendation quality depends on it. In most cases, users' explicit numeric ratings of items(i.e. quantitative information) have only been used to calculate the similarities between users in CF. However, several studies indicated that qualitative information such as user's reviews on the items may contribute to measure these similarities more accurately. Considering that a lot of people are likely to share their honest opinion on the items they purchased recently due to the advent of the Web 2.0, user's reviews can be regarded as the informative source for identifying user's preference with accuracy. Under this background, this study proposes a new hybrid recommender system that combines with users' review mining. Our proposed system is based on conventional memory-based CF, but it is designed to use both user's numeric ratings and his/her text reviews on the items when calculating similarities between users. In specific, our system creates not only user-item rating matrix, but also user-item review term matrix. Then, it calculates rating similarity and review similarity from each matrix, and calculates the final user-to-user similarity based on these two similarities(i.e. rating and review similarities). As the methods for calculating review similarity between users, we proposed two alternatives - one is to use the frequency of the commonly used terms, and the other one is to use the sum of the importance weights of the commonly used terms in users' review. In the case of the importance weights of terms, we proposed the use of average TF-IDF(Term Frequency - Inverse Document Frequency) weights. To validate the applicability of the proposed system, we applied it to the implementation of a recommender system for smartphone applications (hereafter, app). At present, over a million apps are offered in each app stores operated by Google and Apple. Due to this information overload, users have difficulty in selecting proper apps that they really want. Furthermore, app store operators like Google and Apple have cumulated huge amount of users' reviews on apps until now. Thus, we chose smartphone app stores as the application domain of our system. In order to collect the experimental data set, we built and operated a Web-based data collection system for about two weeks. As a result, we could obtain 1,246 valid responses(ratings and reviews) from 78 users. The experimental system was implemented using Microsoft Visual Basic for Applications(VBA) and SAS Text Miner. And, to avoid distortion due to human intervention, we did not adopt any refining works by human during the user's review mining process. To examine the effectiveness of the proposed system, we compared its performance to the performance of conventional CF system. The performances of recommender systems were evaluated by using average MAE(mean absolute error). The experimental results showed that our proposed system(MAE = 0.7867 ~ 0.7881) slightly outperformed a conventional CF system(MAE = 0.7939). Also, they showed that the calculation of review similarity between users based on the TF-IDF weights(MAE = 0.7867) leaded to better recommendation accuracy than the calculation based on the frequency of the commonly used terms in reviews(MAE = 0.7881). The results from paired samples t-test presented that our proposed system with review similarity calculation using the frequency of the commonly used terms outperformed conventional CF system with 10% statistical significance level. Our study sheds a light on the application of users' review information for facilitating electronic commerce by recommending proper items to users.

Key Factors of Talented Scientists' Growth and ExpeI1ise Development (과학인재의 성장 및 전문성 발달과정에서의 영향 요인에 관한 연구)

  • Oh, Hun-Seok;Choi, Ji-Young;Choi, Yoon-Mi;Kwon, Kwi-Heon
    • Journal of The Korean Association For Science Education
    • /
    • v.27 no.9
    • /
    • pp.907-918
    • /
    • 2007
  • This study was conducted to explore key factors of expertise development of talented scientists who achieved outstanding research performance according to the stages of expertise development and dimensions of individual-domain-field. To fulfill the research purpose, 31 domestic scientists who were awarded major prizes in the field of science were interviewed in-depth from March to September, 2007. Stages of expertise development were analyzed in light of Csikszentmihalyi's IDFI (individual-domain-field interaction) model. Self-directed learning, multiple interests and finding strength, academic and liberal home environment, and meaningful encounter were major factors affecting expertise development in the exploration stage. In the beginner stage, independence, basic knowledge on major, and thirst for knowledge at university affected expertise development. Task commitment, finding flow, finding their field of interest and lifelong research topic, and mentor in formal education were the affecting factors in the competent stage. Finally, placing priority, communication skills, pioneering new domain, expansion of the domain, and evaluation and support system affected talented scientists' expertise development in the leading stage. The meaning of major patterns of expertise development were analyzed and described. Based on these analyses, educational implications for nurturing scientists were suggested.

Export Control System based on Case Based Reasoning: Design and Evaluation (사례 기반 지능형 수출통제 시스템 : 설계와 평가)

  • Hong, Woneui;Kim, Uihyun;Cho, Sinhee;Kim, Sansung;Yi, Mun Yong;Shin, Donghoon
    • Journal of Intelligence and Information Systems
    • /
    • v.20 no.3
    • /
    • pp.109-131
    • /
    • 2014
  • As the demand of nuclear power plant equipment is continuously growing worldwide, the importance of handling nuclear strategic materials is also increasing. While the number of cases submitted for the exports of nuclear-power commodity and technology is dramatically increasing, preadjudication (or prescreening to be simple) of strategic materials has been done so far by experts of a long-time experience and extensive field knowledge. However, there is severe shortage of experts in this domain, not to mention that it takes a long time to develop an expert. Because human experts must manually evaluate all the documents submitted for export permission, the current practice of nuclear material export is neither time-efficient nor cost-effective. Toward alleviating the problem of relying on costly human experts only, our research proposes a new system designed to help field experts make their decisions more effectively and efficiently. The proposed system is built upon case-based reasoning, which in essence extracts key features from the existing cases, compares the features with the features of a new case, and derives a solution for the new case by referencing similar cases and their solutions. Our research proposes a framework of case-based reasoning system, designs a case-based reasoning system for the control of nuclear material exports, and evaluates the performance of alternative keyword extraction methods (full automatic, full manual, and semi-automatic). A keyword extraction method is an essential component of the case-based reasoning system as it is used to extract key features of the cases. The full automatic method was conducted using TF-IDF, which is a widely used de facto standard method for representative keyword extraction in text mining. TF (Term Frequency) is based on the frequency count of the term within a document, showing how important the term is within a document while IDF (Inverted Document Frequency) is based on the infrequency of the term within a document set, showing how uniquely the term represents the document. The results show that the semi-automatic approach, which is based on the collaboration of machine and human, is the most effective solution regardless of whether the human is a field expert or a student who majors in nuclear engineering. Moreover, we propose a new approach of computing nuclear document similarity along with a new framework of document analysis. The proposed algorithm of nuclear document similarity considers both document-to-document similarity (${\alpha}$) and document-to-nuclear system similarity (${\beta}$), in order to derive the final score (${\gamma}$) for the decision of whether the presented case is of strategic material or not. The final score (${\gamma}$) represents a document similarity between the past cases and the new case. The score is induced by not only exploiting conventional TF-IDF, but utilizing a nuclear system similarity score, which takes the context of nuclear system domain into account. Finally, the system retrieves top-3 documents stored in the case base that are considered as the most similar cases with regard to the new case, and provides them with the degree of credibility. With this final score and the credibility score, it becomes easier for a user to see which documents in the case base are more worthy of looking up so that the user can make a proper decision with relatively lower cost. The evaluation of the system has been conducted by developing a prototype and testing with field data. The system workflows and outcomes have been verified by the field experts. This research is expected to contribute the growth of knowledge service industry by proposing a new system that can effectively reduce the burden of relying on costly human experts for the export control of nuclear materials and that can be considered as a meaningful example of knowledge service application.

Chatbot Design Method Using Hybrid Word Vector Expression Model Based on Real Telemarketing Data

  • Zhang, Jie;Zhang, Jianing;Ma, Shuhao;Yang, Jie;Gui, Guan
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.14 no.4
    • /
    • pp.1400-1418
    • /
    • 2020
  • In the development of commercial promotion, chatbot is known as one of significant skill by application of natural language processing (NLP). Conventional design methods are using bag-of-words model (BOW) alone based on Google database and other online corpus. For one thing, in the bag-of-words model, the vectors are Irrelevant to one another. Even though this method is friendly to discrete features, it is not conducive to the machine to understand continuous statements due to the loss of the connection between words in the encoded word vector. For other thing, existing methods are used to test in state-of-the-art online corpus but it is hard to apply in real applications such as telemarketing data. In this paper, we propose an improved chatbot design way using hybrid bag-of-words model and skip-gram model based on the real telemarketing data. Specifically, we first collect the real data in the telemarketing field and perform data cleaning and data classification on the constructed corpus. Second, the word representation is adopted hybrid bag-of-words model and skip-gram model. The skip-gram model maps synonyms in the vicinity of vector space. The correlation between words is expressed, so the amount of information contained in the word vector is increased, making up for the shortcomings caused by using bag-of-words model alone. Third, we use the term frequency-inverse document frequency (TF-IDF) weighting method to improve the weight of key words, then output the final word expression. At last, the answer is produced using hybrid retrieval model and generate model. The retrieval model can accurately answer questions in the field. The generate model can supplement the question of answering the open domain, in which the answer to the final reply is completed by long-short term memory (LSTM) training and prediction. Experimental results show which the hybrid word vector expression model can improve the accuracy of the response and the whole system can communicate with humans.