• Title/Summary/Keyword: Text corpus

Network Analysis between Uncertainty Words based on Word2Vec and WordNet (Word2Vec과 WordNet 기반 불확실성 단어 간의 네트워크 분석에 관한 연구)

  • Heo, Go Eun
    • Journal of the Korean Society for Library and Information Science / v.53 no.3 / pp.247-271 / 2019
  • Uncertainty in scientific knowledge refers to a state in which a proposition is, at present, neither true nor false. Existing studies have analyzed propositions written in the academic literature and have evaluated rule-based and machine learning-based approaches on corpora. Although these studies recognized the importance of word construction, there have been few attempts to expand the set of uncertainty words by analyzing their meaning. Meanwhile, network structure analysis using bibliometrics and text mining techniques is widely used to understand intellectual structure and relationships in various disciplines. In this study, therefore, semantic relations were analyzed by applying Word2Vec to existing uncertainty words. In addition, WordNet, an English lexical database and thesaurus, was applied to perform a network analysis based on the hypernym, hyponym, and synonym relations linked to uncertainty words. The semantic and lexical relationships of uncertainty words were thereby structurally identified. As a result, we identified the possibility of automatically expanding uncertainty words.
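
As a rough illustration of the similarity-network idea in this abstract (not the authors' implementation), the sketch below links words whose embedding vectors have a cosine similarity above a threshold. The three-dimensional vectors, the word list, the threshold, and the function names are all invented for the example; a real study would take its vectors from a trained Word2Vec model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_edges(vectors, threshold):
    """Return (word1, word2, score) for every pair at or above the threshold."""
    words = sorted(vectors)
    edges = []
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            score = cosine(vectors[w1], vectors[w2])
            if score >= threshold:
                edges.append((w1, w2, round(score, 3)))
    return edges

# Toy vectors, assumed for illustration only (not real Word2Vec output).
toy_vectors = {
    "may": [0.9, 0.1, 0.0],
    "might": [0.8, 0.2, 0.1],
    "prove": [0.0, 0.1, 0.9],
}
edges = similarity_edges(toy_vectors, threshold=0.8)
print(edges)
```

The resulting edge list can be loaded into any graph tool for network analysis; edges derived from WordNet relations could be added to the same list.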

Developing and Pre-Processing a Dataset using a Rhetorical Relation to Build a Question-Answering System based on an Unsupervised Learning Approach

  • Dutta, Ashit Kumar;Wahab sait, Abdul Rahaman;Keshta, Ismail Mohamed;Elhalles, Abheer
    • International Journal of Computer Science & Network Security / v.21 no.11 / pp.199-206 / 2021
  • Rhetorical relations between two text fragments are essential information that helps natural language processing applications, such as question-answering (QA) systems and automatic text summarization, produce effective results. A QA system helps users retrieve a meaningful response. Developing such a system to interpret and respond to user requests requires datasets based on rhetorical relations. Only a limited number of datasets exist for developing an Arabic QA system, and an effective QA system for the Arabic language is therefore lacking. Recent research shows that unsupervised learning can help a QA system reply to users' queries. In this study, the researchers develop a rhetorical relation-based dataset for implementing unsupervised learning applications. A web crawler is developed to crawl Arabic content from the web. A discourse-annotated corpus is generated using Rhetorical Structure Theory. A Naïve Bayes-based QA system is developed to evaluate the performance of the datasets. The results show that the performance of the QA system improves with the proposed dataset and that the system is able to answer user queries with an appropriate response. In addition, the results on fine-grained and coarse-grained relations indicate that the dataset is highly reliable.

Multi-Label Classification for Corporate Review Text: A Local Grammar Approach (머신러닝 기반의 기업 리뷰 다중 분류: 부분 문법 적용을 중심으로)

  • HyeYeon Baek;Young Kyun Chang
    • Information Systems Review / v.25 no.3 / pp.27-41 / 2023
  • Unlike previous work that focuses on state-of-the-art methodologies to improve the performance of machine learning models, this study improves the 'quality' of the training data used in machine learning. We propose a method to enhance the quality of training data by applying 'local grammar,' which is frequently used in corpus analysis. We collected a large amount of unstructured corporate review text posted by employees of the top 100 companies in Korea. After improving the data quality with the local grammar process, we confirmed that the classification model with local grammar outperformed the model without it. We defined five factors of work engagement as classification categories and analyzed how the pattern of reviews changed before and after the COVID-19 pandemic. Through this study, we provide evidence of the value of local grammar-based automatic identification and classification of employee experiences, and offer some clues to significant organizational cultural phenomena.

Analysis on the English Translation of The First Chosen Educational Ordinance, Manual of Education of Koreans (1913), and Manual of Education in Chosen 1920 (1920) Using Text Mining Analytics (텍스트 마이닝(Text mining) 기법을 활용한 『제1차조선교육령』과 『조선교육요람』(1913, 1920)의영어번역본 분석)

  • Jinyoung Tak;Eunjoo Kwak;Silo Chin;Minjoo Shon;Dongmie Kim
    • The Journal of the Convergence on Culture Technology / v.9 no.6 / pp.309-317 / 2023
  • The purpose of this paper is to investigate how Japan tried to dominate Chosen through educational policies by analyzing three official English texts published by the Japanese Government-General of Korea: the First Chosen Educational Ordinance declared in 1911, the Manual of Education of Koreans (1913), and the Manual of Education in Chosen 1920 (1920). To this end, the present study carried out a corpus-based diachronic analysis rather than a qualitative analysis. Using text analytics such as Word Cloud and CONCOR, this paper derived the following results. First, the First Chosen Educational Ordinance (1911) includes overall educational regulations, curriculum, and operations of schools. Second, the Manual of Education of Koreans (1913) contains the educational medium and contents on how to educate. Finally, it can be proposed that the Manual of Education in Chosen 1920 (1920) contains the specific implementation of education and the subject of education.

A Study on the Design and the Construction of a Korean Speech DB for Common Use (공동이용을 위한 음성DB의 설계 및 구축에 관한 연구)

  • Kim, Bong-Wan;Kim, Jong-Jin;Kim, Sun-Tae;Lee, Yong-Ju
    • The Journal of the Acoustical Society of Korea / v.16 no.4 / pp.35-41 / 1997
  • A speech database is an indispensable part of speech research. It is needed in speech research and development processes and for evaluating the performance of various speech-processing systems. For a speech database to serve common purposes, its utterance list must cover all possible phonetic events in a minimal number of words and must be independent of any particular task. To meet these requirements, this paper extracts a PBW set from a large text corpus. The paper describes the speech database constructed with the PBW set as its utterance list, along with its properties.
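
The abstract does not say how the word set is chosen. One common way to approximate "all phonetic events in a minimal number of words" is a greedy set cover, sketched below under that assumption. The function name, the toy words, and the use of single letters as stand-ins for phonetic units are all hypothetical.

```python
def greedy_word_cover(word_units):
    """Pick words until every phonetic unit is covered.

    word_units maps each candidate word to the set of units it contains.
    At each step the word covering the most still-uncovered units is chosen;
    ties are broken alphabetically so the result is deterministic.
    """
    uncovered = set().union(*word_units.values())
    chosen = []
    while uncovered:
        best = max(sorted(word_units),
                   key=lambda w: len(word_units[w] & uncovered))
        chosen.append(best)
        uncovered -= word_units[best]
    return chosen

# Toy candidates; letters stand in for phonetic units (illustration only).
toy_words = {
    "bada": {"b", "a", "d"},
    "dal": {"d", "a", "l"},
    "bal": {"b", "a", "l"},
    "a": {"a"},
}
selected = greedy_word_cover(toy_words)
print(selected)
```

Greedy selection does not guarantee the true minimum, but it is a standard approximation for this kind of coverage problem and is simple to run over a large corpus vocabulary.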

Natural language processing techniques for bioinformatics

  • Tsujii, Jun-ichi
    • Proceedings of the Korean Society for Bioinformatics Conference / 2003.10a / pp.3-3 / 2003
  • With biomedical literature expanding so rapidly, there is an urgent need to discover and organize knowledge extracted from texts. Although factual databases contain crucial information, the overwhelming amount of new knowledge remains in textual form (e.g. MEDLINE). In addition, new terms are constantly coined as the relationships linking new genes, drugs, proteins etc. As the size of biomedical literature is expanding, more systems are applying a variety of methods to automate the process of knowledge acquisition and management. In my talk, I focus on the project, GENIA, of our group at the University of Tokyo, the objective of which is to construct an information extraction system of protein-protein interaction from abstracts of MEDLINE. The talk includes (1) techniques we use for named entity recognition: (1-a) SOHMM (self-organized HMM), (1-b) maximum entropy model, (1-c) lexicon-based recognizer; (2) treatment of term variants and acronym finders; (3) event extraction using a full parser; and (4) linguistic resources for text mining (GENIA corpus): (4-a) semantic tags, (4-b) structural annotations, (4-c) co-reference tags, (4-d) GENIA ontology. I will also talk about a possible extension of our work that links the findings of molecular biology with clinical findings, and claim that text-based or concept-based biology would be a viable alternative to systems biology, which tends to emphasize the role of simulation models in bioinformatics.

A Corpus-based Hybrid Translation System for Limited Domain (제한된 도메인을 위한 코퍼스 기반의 하이브리드 번역 시스템)

  • Kang, Un-Gu;Kim, Sung-Hyun;Lee, Byung-Mun;Lee, Young-Ho
    • Journal of KIISE:Software and Applications / v.37 no.11 / pp.826-836 / 2010
  • This paper proposes a hybrid machine translation system that integrates SMT, RBMT, and PBMT in a serial manner. SMT in our project has been implemented as a quasi-syntax-based system in which a monotone search is performed over a preprocessed string of the foreign language. Preprocessing includes rule-based reordering, NE recognition, clausal splitting, and attaching pattern translation information at the end of the input text. For lengthy and complex sentences, clausal splitting turned out to generate better translations than unsplit input.

Semantic Image Search: Case Study for Western Region Tourism in Thailand

  • Chantrapornchai, Chantana;Bunlaw, Netnapa;Choksuchat, Chidchanok
    • Journal of Information Processing Systems / v.14 no.5 / pp.1195-1214 / 2018
  • Typical search engines may not be the most efficient means of returning images in accordance with user requirements. With the help of semantic web technology, it is possible to search through images more precisely in any required domain, because the images are annotated according to a custom-built ontology. With appropriate annotations, a search can then return images according to the context. This paper reports on the design of a tourism ontology relevant to touristic images. In particular, the image features and the meaning of the images are described using various properties, along with other types of information relevant to tourist attractions, using the OWL language. The methodology used is described, commencing with building an image and tourism corpus, creating the ontology, and developing the search engine. The system was tested through a case study involving the western region of Thailand. Users can search by specifying a class of image or by using text-based searches. The results are ranked using weighted scores based on the kinds of properties. The precision and recall of the prototype system were measured to show its efficiency. User satisfaction was also evaluated and was found to be high.
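
For readers unfamiliar with the two evaluation measures mentioned in this abstract, the following minimal sketch computes precision and recall for a single query. The image identifiers and the function name are made up for the example and do not come from the paper.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved items; recall = hits / relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: 4 images returned, 3 images truly relevant.
returned_images = ["img1", "img2", "img3", "img4"]
relevant_images = ["img1", "img3", "img5"]
precision, recall = precision_recall(returned_images, relevant_images)
print(precision, recall)
```

Here two of the four returned images are relevant, and two of the three relevant images were found.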

Spam Filter by Using χ² Statistics and Support Vector Machines (카이제곱 통계량과 지지벡터기계를 이용한 스팸메일 필터)

  • Lee, Song-Wook
    • The KIPS Transactions:PartB / v.17B no.3 / pp.249-254 / 2010
  • We propose an automatic spam filter for e-mail data using Support Vector Machines (SVM). We use the lexical form of a word and its part-of-speech (POS) tag as features and select features by chi-square statistics. For the experiments, we represent each feature by TF (text frequency), TF-IDF, and binary weights. After the SVM is trained with the selected features, it classifies each e-mail as spam or not. In the experiments, the selected features improve the performance of our system, and we achieved an overall accuracy of 98.9% on the TREC05-p1 spam corpus.
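
The chi-square feature selection mentioned in this abstract can be illustrated with the standard 2x2 contingency-table form of the statistic. This is a sketch of the general technique, not the paper's code; the counts and the function name are invented for the example.

```python
def chi_square(a, b, c, d):
    """Chi-square score of a term for a class from a 2x2 table.

    a: spam mails containing the term      b: non-spam mails containing it
    c: spam mails without the term         d: non-spam mails without it
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or class never varies
    return n * (a * d - b * c) ** 2 / denom

# Hypothetical counts over 100 mails (50 spam, 50 non-spam).
strong_term = chi_square(40, 5, 10, 45)    # term concentrated in spam
neutral_term = chi_square(10, 10, 40, 40)  # term equally common in both
print(strong_term, neutral_term)
```

Terms would then be ranked by score and only the highest-scoring ones kept as classifier features; the cutoff is a tuning choice.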

Do Words in Central Bank Press Releases Affect Thailand's Financial Markets?

  • CHATCHAWAN, Sapphasak
    • The Journal of Asian Finance, Economics and Business / v.8 no.4 / pp.113-124 / 2021
  • The study investigates how financial markets respond to shocks to the tone and semantic similarity of Bank of Thailand press releases. Natural language processing techniques are employed to quantify the tone and the semantic similarity of 69 press releases from 2010 to 2018. The corpus of press releases is accessible to the general public. Stock market returns are measured by the log return on the SET50, and bond yields by short-term and long-term government bonds. The data are daily, from January 4, 2010, to August 8, 2019. The study uses the Structural Vector Autoregressive (SVAR) model to analyze the effects of unanticipated and temporary shocks to the tone and the semantic similarity on bond yields and stock market returns. Impulse response functions are also constructed for the analysis. The results show that 1-month, 3-month, 6-month and 1-year bond yields increase significantly in response to a positive shock to the tone of press releases, and that 1-month, 3-month, 6-month, 1-year and 25-year bond yields increase significantly in response to a positive shock to the semantic similarity. Interestingly, stock market returns on the SET50 index do not respond significantly to shocks to the tone or the semantic similarity of the press releases.