• Title/Abstract/Keywords: Natural Language Processing

Search results: 940 (processing time: 0.027 s)

Evaluation of Similarity Analysis of Newspaper Article Using Natural Language Processing

  • Ayako Ohshiro;Takeo Okazaki;Takashi Kano;Shinichiro Ueda
    • International Journal of Computer Science & Network Security / Vol. 24, No. 6 / pp.1-7 / 2024
  • Comparing text features involves evaluating the "similarity" between texts, so choosing an appropriate similarity measure is crucial. This study used several techniques to assess the similarity between newspaper articles, including deep learning and a previously proposed method that combines Pointwise Mutual Information (PMI) and Word Pair Matching (WPM), denoted PMI+WPM. For performance comparison, law data from medical research in Japan were used as validation data in evaluating the PMI+WPM method. The comparative analysis revealed that the distribution of similarities in text data varies with the evaluation technique and genre. For newspaper data, non-deep learning methods achieved better similarity evaluation accuracy than deep learning methods. Additionally, evaluating similarities in law data is more challenging than in newspaper articles. Although deep learning is the prevalent method for evaluating textual similarity, this study demonstrates that non-deep learning methods can be effective for Japanese-language texts.
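
The abstract above names Pointwise Mutual Information as one half of PMI+WPM but does not say how the scores are computed or combined. As a minimal sketch of the PMI part only, assuming document-level co-occurrence counts and the standard definition PMI(a, b) = log2(p(a, b) / (p(a) p(b))), the following Python scores word pairs. The function name `pmi_table` and the toy documents are hypothetical, and Word Pair Matching is not modeled.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(documents):
    """Return base-2 PMI for every word pair that co-occurs in at least one document.

    Probabilities are estimated from document frequencies (an assumption;
    the abstract does not specify the counting scheme).
    """
    n_docs = len(documents)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words of a pair
    for tokens in documents:
        vocab = sorted(set(tokens))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))  # keys are alphabetically ordered pairs

    table = {}
    for (a, b), joint in pair_df.items():
        p_ab = joint / n_docs
        p_a = word_df[a] / n_docs
        p_b = word_df[b] / n_docs
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table


# Hypothetical toy corpus, already tokenized.
docs = [
    ["court", "ruling", "appeal"],
    ["court", "ruling"],
    ["weather", "rain"],
    ["weather", "court"],
]
scores = pmi_table(docs)
```

Pairs that co-occur more often than chance predicts get a positive score, and pairs that co-occur less often get a negative one. How such scores would feed into a document-level similarity is left open here, since the listing does not describe it.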

소프트웨어 부품의 재사용을 위한 개선된 패싯 분류 방법과 의미 유사도 측정 (Advanced Faceted Classification Scheme and Semantic Similarity Measure for Reuse of Software Components)

  • 강문설
    • 한국정보처리학회논문지 / Vol. 3, No. 4 / pp.855-865 / 1996
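  • This paper proposes a way to automate the classification of reusable software components and store them in a structured software component library. For efficient, automated classification, features such as semantic information and sentence-structure information are obtained from software component descriptions written in natural language. These features determine the facets that express a component's characteristics, and the terms corresponding to each facet are extracted automatically to form a software component identifier. The semantic similarity between classified components is then measured so that components with similar characteristics are stored in adjacent locations, building a structured software component library. With the proposed method, the classification process is simple, similar software components can be identified easily, and components can be stored in the library in a structured manner.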

의미 사전과 반전 의견 처리를 이용한 한국어 의견 분석 시스템 개발 (Development of Korean Opinion Analysis System using Semantic Dictionary and Inverse Opinion Processing)

  • 장재건;박진수;류승택
    • 한국산학기술학회논문지 / Vol. 11, No. 8 / pp.3070-3075 / 2010
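  • With the arrival of the Web 2.0 era, ordinary users have come to express their opinions and thoughts on blogs and community spaces on the Internet. Many people consult these opinions when purchasing products, but users consult only a small number of opinions and cannot take in the overall opinion. An opinion analysis system analyzes Internet posts about products and services to evaluate whether a product is viewed positively or negatively, and can be regarded as a form of search that evolved from natural language search. To determine whether a sentence is positive or negative, which is the core of an opinion analysis service, this paper proposes syntactic analysis and inversion processing that additionally learns and handles 'inversion' information alongside the 'positive', 'negative', and 'neutral' polarity information.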

구문의미트리 비교기를 이용한 유사문서 판별기 (Discriminator of Similar Documents Using the Syntactic-Semantic Tree Comparator)

  • 강원석
    • 한국콘텐츠학회논문지 / Vol. 15, No. 10 / pp.636-646 / 2015
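  • In the information society, the need to detect document duplication and plagiarism is growing. Much research has been conducted in response, but problems in natural language processing have limited improvements in the quality of similar-document discrimination. Recent work has tried to improve discrimination performance by incorporating syntactic-semantic analysis, but comparing the resulting syntactic-semantic trees has been difficult. This paper develops a syntactic-semantic tree comparator that computes the similarity of syntactic-semantic trees, and uses it to design and implement a system that discriminates similar documents. To test the system's performance, the correlation coefficient between human judgments and the proposed system's judgments was analyzed. The experimental results verified the performance of the similar-document discriminator based on the syntactic-semantic tree comparator. In future work, document types need to be defined and discrimination techniques suited to each type developed.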

A bio-text mining system using keywords and patterns in a grid environment

  • Kwon, Hyuk-Ryul;Jung, Tae-Sung;Kim, Kyoung-Ran;Jahng, Hye-Kyoung;Cho, Wan-Sup;Yoo, Jae-Soo
    • 한국산업정보학회: Conference Proceedings / 한국산업정보학회 2007 Spring Conference / pp.48-52 / 2007
  • As a huge amount of literature including biological data is being generated in the post-genome era, it has become difficult for researchers to find useful knowledge in biological databases. Bio-text mining and related natural language processing techniques are key to intelligent knowledge retrieval from biological databases. We propose a bio-text mining technique for biologists who seek knowledge in this vast literature. First, a web robot is used to extract and transform related literature from remote databases. To improve retrieval speed, we generate an inverted file for the keywords in the literature. Then, a text mining system is used to extract given knowledge patterns and keywords. Finally, we construct a grid computing environment to guarantee processing speed in text mining even for huge literature databases. In an experiment on 10,000 bio-literature documents, the system achieved 95% precision and 98% recall.
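
The abstract above mentions an inverted file for keywords and reports precision and recall. As a minimal sketch of both ideas, assuming whitespace tokenization and hypothetical toy documents (the paper's actual pipeline, patterns, and grid environment are not reproduced), an inverted index maps each keyword to the set of documents containing it, and precision and recall compare a retrieved set with a relevant set:

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase keyword to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant (0.0 if a set is empty)."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical mini-corpus keyed by document ID.
docs = {
    1: "protein kinase binds receptor",
    2: "gene expression in yeast",
    3: "kinase inhibitor reduces expression",
}
index = build_inverted_index(docs)

# Look up one keyword, then score the result against a hand-labeled relevant set.
retrieved = index["kinase"]
precision, recall = precision_recall(retrieved, relevant={1})
```

The lookup touches only the posting set for the queried keyword instead of scanning every document, which is the speed benefit the abstract attributes to the inverted file.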

한국어 발화음성에서 중점단어 탐색을 위한 기본주파수에 대한 연구 (A Study of Fundamental Frequency for Focused Word Spotting in Spoken Korean)

  • 권순일;박지형;박능수
    • 정보처리학회논문지B / Vol. 15B, No. 6 / pp.595-602 / 2008
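  • The focused word of each sentence helps in recognizing spoken utterances and understanding their meaning. As part of an effort to find a way to spot focused words in speech signals, we experimentally analyzed the mean and variance of the fundamental frequency, as well as the mean energy, of focused words and the other words within a sentence. In experiments on speech data for 100 spoken Korean sentences, focused words mostly showed either a relatively higher mean fundamental frequency or a relatively higher variance of the fundamental frequency than the other words. These results can help not only in understanding the prosodic characteristics of spoken Korean sentences but also in extracting keywords using natural language processing.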
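
The study above compares the mean and variance of the fundamental frequency (F0) across the words of a sentence. A minimal sketch of that per-word comparison follows, assuming F0 values in Hz have already been extracted and aligned to words. The F0 values, the names `f0_statistics` and `focus_candidate`, and the "highest mean F0" decision rule are all hypothetical illustrations, not the paper's procedure.

```python
from statistics import mean, pvariance


def f0_statistics(word_f0):
    """Map each word to (mean F0, population variance of F0) over its frames."""
    return {
        word: (mean(frames), pvariance(frames))
        for word, frames in word_f0.items()
    }


def focus_candidate(stats):
    """Pick the word with the highest mean F0 (a hypothetical decision rule)."""
    return max(stats, key=lambda word: stats[word][0])


# Hypothetical per-word F0 contours in Hz for one sentence.
word_f0 = {
    "word_1": [120.0, 122.0, 121.0],
    "word_2": [150.0, 170.0, 160.0],
    "word_3": [118.0, 119.0, 120.0],
}
stats = f0_statistics(word_f0)
candidate = focus_candidate(stats)
```

Because the abstract reports that focused words show a higher mean or a higher variance, a practical rule would likely need to consider both statistics; the single-criterion rule here is only for illustration.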

변형된 한글 금칙어에 대한 실시간 필터링 시스템 (Realtime Word Filtering System against Variations of Censored Words in Korean)

  • 김찬우;성미영
    • 한국멀티미디어학회논문지 / Vol. 22, No. 6 / pp.695-705 / 2019
  • The psychological damage that verbal abuse causes victims of cyberbullying is very serious. This paper introduces a system that determines the level of sanctions for chat messages in real time, using automatic prohibited-word filtering based on an artificial neural network. We propose a keyword filtering method that detects modified prohibited words and determines in real time whether the corresponding chat should be sanctioned, along with a real-time chat screening system that uses it. Filtering accuracy through machine learning was improved by preprocessing the data with a coding technique that places consonants and vowels of similar pronunciation at close distances. In a comparison of Mahalanobis-based clustering algorithms and artificial neural network-based algorithms, the algorithms that use artificial neural networks showed high performance. Applied to Internet chat, comments, or online games, the method is expected to filter more effectively than existing methods and to ease the communication inconvenience caused by existing indiscriminate filtering.

Privacy-Preserving in the Context of Data Mining and Deep Learning

  • Altalhi, Amjaad;AL-Saedi, Maram;Alsuwat, Hatim;Alsuwat, Emad
    • International Journal of Computer Science & Network Security / Vol. 21, No. 6 / pp.137-142 / 2021
  • Machine-learning systems have proven their worth in various industries, including healthcare and banking, by assisting in the extraction of valuable inferences. Information in these crucial sectors is traditionally stored in databases distributed across multiple environments, making it difficult to access and extract data from them. In addition, these data sources contain sensitive information, implying that the data cannot be shared outside of the head. Using cryptographic techniques, Privacy-Preserving Machine Learning (PPML) helps solve this challenge, enabling information discovery while maintaining data privacy. In this paper, we discuss how to keep data mining private, because data mining has a wide variety of uses, including business intelligence, medical diagnostic systems, image processing, web search, and scientific discovery. We also discuss privacy preservation in deep learning, because deep learning (DL) exhibits exceptional accuracy in image detection, speech recognition, and natural language processing compared with other fields of machine learning, so that it detects any error that may occur in the data, as well as access to systems and the addition of data by unauthorized persons.

Opera Clustering: K-means on librettos datasets

  • 정하림;유주헌
    • 인터넷정보학회논문지 / Vol. 23, No. 2 / pp.45-52 / 2022
  • With the development of artificial intelligence analysis methods, especially machine learning, various fields are widely expanding their range of applications. In classical music, however, some difficulties remain in applying machine learning techniques. Genre classification and music recommendation systems built with deep learning algorithms are actively used for general music, but not for classical music. In this paper, we attempted to classify opera within classical music. To this end, an experiment was conducted to determine which criterion is most suitable among composer, period of composition, and emotional atmosphere, which are basic features of music. To generate emotional labels, we adopted zero-shot classification with four basic emotions: 'happiness', 'sadness', 'anger', and 'fear'. After embedding the opera librettos with the doc2vec model, the optimal number of clusters was computed with the elbow method. The resulting four centroids were then used in k-means clustering to group the unlabeled libretto dataset. We obtained an optimized clustering based on adjusted Rand index scores and compared these results with the annotated variables of the music. The four clusters computed by the machine after training were most similar to the grouping by period. Additionally, we verified that emotional similarity did not appear significantly between composer and period. Knowing that period is the most suitable criterion, we hope this makes it easier for music listeners to find music that suits their tastes.
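
The abstract above uses adjusted Rand index (ARI) scores to compare machine-computed clusters with annotated groupings. A minimal pure-Python sketch of the standard ARI formula follows. The label lists are hypothetical toy data, and the doc2vec embedding, elbow method, and k-means steps from the paper are not reproduced.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items.

    1.0 means identical groupings (up to label names); values near 0 mean
    chance-level agreement; negative values mean worse than chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no item pairs to disagree on

    # Count item pairs that share a cell, a row, or a column of the contingency table.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one group
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical cluster assignments and annotations for six librettos.
clusters = [0, 0, 1, 1, 2, 2]
periods = ["baroque", "baroque", "classical", "classical", "romantic", "romantic"]
composers = ["A", "B", "A", "B", "A", "B"]

ari_period = adjusted_rand_index(clusters, periods)
ari_composer = adjusted_rand_index(clusters, composers)
```

In this toy data the clusters line up exactly with the period labels and not at all with the composer labels, which mirrors the kind of comparison the abstract describes without reproducing its actual numbers.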

Aspect-Based Sentiment Analysis with Position Embedding Interactive Attention Network

  • Xiang, Yan;Zhang, Jiqun;Zhang, Zhoubin;Yu, Zhengtao;Xian, Yantuan
    • Journal of Information Processing Systems / Vol. 18, No. 5 / pp.614-627 / 2022
  • Aspect-based sentiment analysis aims to discover the sentiment polarity towards an aspect in user-generated natural language. So far, most methods use only the implicit position information of the aspect in the context, instead of directly using the position relationship between the aspect and the sentiment terms. In fact, words neighboring the aspect terms should be given more attention than other words in the context. This paper studies the influence of different position embedding methods on the sentiment polarities of given aspects, and proposes a position embedding interactive attention network based on a long short-term memory network. First, it uses the position information of the context simultaneously in the input layer and the attention layer. Second, it mines the importance of different context words for the aspect with the interactive attention mechanism. Finally, it generates a valid representation of the aspect and the context for sentiment classification. The proposed model was evaluated on the Semantic Evaluation 2014 datasets. Compared with other baseline models, the accuracy of our model increases by about 2% on the restaurant dataset and 1% on the laptop dataset.
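
The abstract above argues that words near the aspect term deserve more attention. A minimal sketch of the underlying position information follows: each token's distance to the aspect span, which is the kind of index a position embedding layer would look up. The linear `proximity_weights` heuristic, the function names, and the example sentence are assumptions for illustration; the paper's embedding and attention layers are not reproduced.

```python
def relative_positions(tokens, aspect_start, aspect_end):
    """Distance of each token to the aspect span [aspect_start, aspect_end).

    Tokens inside the span get 0; the nearest neighbors on either side get 1.
    """
    positions = []
    for i in range(len(tokens)):
        if i < aspect_start:
            positions.append(aspect_start - i)
        elif i >= aspect_end:
            positions.append(i - aspect_end + 1)
        else:
            positions.append(0)
    return positions


def proximity_weights(positions):
    """Linearly decaying weight per token (a common heuristic, assumed here)."""
    n = len(positions)
    return [1 - distance / n for distance in positions]


# Hypothetical review sentence; the aspect "battery life" spans tokens 1..2.
tokens = ["the", "battery", "life", "is", "great", "but", "heavy"]
positions = relative_positions(tokens, aspect_start=1, aspect_end=3)
weights = proximity_weights(positions)
```

With these weights, "great" (distance 2) counts for more than "heavy" (distance 4) when judging the "battery life" aspect, which is the intuition the abstract states. A learned position embedding would replace the fixed linear decay.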