• Title/Summary/Keyword: Language Processing


Automatic Classification and Vocabulary Analysis of Political Bias in News Articles by Using Subword Tokenization (부분 단어 토큰화 기법을 이용한 뉴스 기사 정치적 편향성 자동 분류 및 어휘 분석)

  • Cho, Dan Bi;Lee, Hyun Young;Jung, Won Sup;Kang, Seung Shik
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.10 no.1
    • /
    • pp.1-8
    • /
    • 2021
  • In the political field of news articles, there are polarized and biased characteristics such as conservative and liberal, which is called political bias. We constructed a keyword-based dataset to classify the bias of news articles. Most embedding studies represent a sentence as a sequence of morphemes. In our work, we expect that the number of unknown tokens will be reduced if sentences are composed of subwords segmented by a language model. We propose a document embedding model with subword tokenization and apply it to an SVM and a feedforward neural network to classify political bias. Compared with a document embedding model based on morphological analysis, the subword-based model showed the highest accuracy at 78.22%, and we confirmed that subword tokenization reduces the number of unknown tokens. Using the best-performing embedding model in our bias classification task, we extracted keywords associated with politicians. The bias of the keywords was verified by their average similarity to the vectors of politicians from each political tendency.
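The subword idea above can be sketched as a greedy longest-match segmenter over a subword vocabulary; the vocabulary and words below are illustrative, while the paper learns its vocabulary with a language model over Korean text:

```python
# Minimal sketch of greedy longest-match subword segmentation.
# The vocabulary here is hypothetical; a real one is learned from a corpus.
def subword_tokenize(word, vocab, unk="[UNK]"):
    """Split a word into the longest matching vocabulary subwords, left to right."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest span first
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:                               # no subword matched this character
            tokens.append(unk)
            i += 1
    return tokens

vocab = {"politic", "al", "bias", "token", "ization"}
print(subword_tokenize("political", vocab))   # ['politic', 'al']
```

Because unmatched characters fall back to a single `[UNK]` rather than the whole word, the number of unknown tokens shrinks compared with whole-word lookup.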

An Investigation on Digital Humanities Research Trend by Analyzing the Papers of Digital Humanities Conferences (디지털 인문학 연구 동향 분석 - Digital Humanities 학술대회 논문을 중심으로 -)

  • Chung, EunKyung
    • Journal of the Korean Society for Library and Information Science
    • /
    • v.55 no.1
    • /
    • pp.393-413
    • /
    • 2021
  • Digital humanities, which creates new and innovative knowledge by combining digital information technology with humanities research problems, can be seen as a representative multidisciplinary field of study. To investigate the intellectual structure of the digital humanities field, co-author and keyword co-word network analyses were performed on a total of 441 papers from the last two Digital Humanities Conferences (2019, 2020). The author and keyword analyses show active participation by authors in Europe, North America, and East Asia (Japan and China). The co-author network contains 11 disconnected sub-networks, which can be seen as the result of closed co-authoring activities. The keyword analysis identifies 16 sub-subject areas: machine learning, pedagogy, metadata, topic modeling, stylometry, cultural heritage, network, digital archive, natural language processing, digital library, twitter, drama, big data, neural network, virtual reality, and ethics. These results imply that a wide variety of digital information technologies play a major role in the digital humanities. In addition, high-frequency keywords can be classified into humanities-based keywords, digital information technology-based keywords, and convergence keywords. The dynamics of the growth and development of digital humanities can be represented by these combinations of keywords.
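The keyword co-word analysis described above reduces to counting keyword pairs that appear in the same paper; a minimal sketch with made-up keyword sets (not the conference data):

```python
from itertools import combinations
from collections import Counter

# Hypothetical toy data; the study builds this from DH conference keyword lists.
papers = [
    {"machine learning", "pedagogy"},
    {"machine learning", "topic modeling"},
    {"topic modeling", "metadata"},
]

def coword_counts(keyword_sets):
    """Count how often each keyword pair co-occurs in the same paper."""
    pairs = Counter()
    for kws in keyword_sets:
        for a, b in combinations(sorted(kws), 2):
            pairs[(a, b)] += 1
    return pairs

print(coword_counts(papers))
```

The resulting pair counts are the edge weights of the co-word network, from which sub-subject clusters can be read off.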

A Study on the Current State of the Library's AI Service and the Service Provision Plan (도서관의 인공지능(AI) 서비스 현황 및 서비스 제공 방안에 관한 연구)

  • Kwak, Woojung;Noh, Younghee
    • Journal of Korean Library and Information Science Society
    • /
    • v.52 no.1
    • /
    • pp.155-178
    • /
    • 2021
  • In the era of the 4th industrial revolution, public libraries need a strategy for promoting intelligent library services in order to respond actively to changes in the external environment such as artificial intelligence. In this study, based on the concept of artificial intelligence and an analysis of domestic and foreign AI-related trends, policies, and cases, we propose a future direction for introducing and developing artificial intelligence services in libraries. Libraries currently operate reference information services that answer questions automatically through AI technologies such as deep learning and natural language processing, and they have developed big data-based AI book recommendation and automatic book inspection systems to increase operational efficiency and provide customized services for users. Companies and industries, both domestic and overseas, are developing and deploying AI-based services such as autonomous driving and personalization, using deep learning to self-learn from information and deliver optimal results. Accordingly, libraries should use artificial intelligence to recommend personalized books, reading programs, and cultural programs based on users' usage records, and, for book delivery services, should pursue real-time delivery through transport methods such as autonomous drones and cars.

A Convergence Study of the Research Trends on Stress Urinary Incontinence using Word Embedding (워드임베딩을 활용한 복압성 요실금 관련 연구 동향에 관한 융합 연구)

  • Kim, Jun-Hee;Ahn, Sun-Hee;Gwak, Gyeong-Tae;Weon, Young-Soo;Yoo, Hwa-Ik
    • Journal of the Korea Convergence Society
    • /
    • v.12 no.8
    • /
    • pp.1-11
    • /
    • 2021
  • The purpose of this study was to analyze the trends and characteristics of 'stress urinary incontinence' research through word frequency analysis and to model their relationships using word embedding. Abstract data from 9,868 papers with abstracts in PubMed's MEDLINE were extracted using a Python program. Through frequency analysis, the 10 most frequent keywords were selected. The similarity of words related to these keywords was analyzed with the Word2Vec machine learning algorithm. Word locations and distances were visualized using the t-SNE technique, and the groups were classified and analyzed. The number of studies related to stress urinary incontinence has increased rapidly since the 1980s. The keywords used most frequently in the abstracts were 'woman', 'urethra', and 'surgery'. In the Word2Vec model, words such as 'female', 'urge', and 'symptom' showed the highest relevance to the keywords. In addition, through the t-SNE technique, the keywords and related words could be classified into three groups centered on symptoms, anatomical characteristics, and surgical interventions of stress urinary incontinence. This study is the first to examine trends in stress urinary incontinence research using keyword frequency analysis and word embedding of abstracts. The results can serve as a basis for future researchers selecting subjects and directions in this field.
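The Word2Vec relevance judgments above come down to cosine similarity between word vectors; a stdlib-only sketch with toy 3-dimensional vectors (real Word2Vec embeddings are learned and typically have 100+ dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy vectors standing in for learned embeddings; values are made up to show
# that related words ('woman'/'female') score higher than unrelated ones.
vec = {
    "woman":   [0.9, 0.1, 0.0],
    "female":  [0.8, 0.2, 0.1],
    "urethra": [0.1, 0.9, 0.3],
}
print(cosine(vec["woman"], vec["female"]) > cosine(vec["woman"], vec["urethra"]))
```

t-SNE then projects such high-dimensional vectors into 2D so that clusters of similar words become visible.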

Method of ChatBot Implementation Using Bot Framework (봇 프레임워크를 활용한 챗봇 구현 방안)

  • Kim, Ki-Young
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology
    • /
    • v.15 no.1
    • /
    • pp.56-61
    • /
    • 2022
  • In this paper, we classify and present the AI algorithms and natural language processing methods used in chatbots, and describe a framework that can be used to implement one. A chatbot is a system that presents a conversational user interface, interprets the input string, selects an appropriate answer to it from learned data, and outputs that answer. However, training is required to generate an appropriate set of answers to questions, and hardware with considerable computational power is needed; this places practical limits not only on companies developing chatbots but also on students learning AI development. Chatbots are now replacing existing traditional tasks, so a practice course for understanding and implementing such systems is required. Beyond responding only to standardized data, technologies such as deep learning are applied: RNN and Char-CNN models learn unstructured data to increase the accuracy of answers to questions. Understanding this theory is necessary to implement a chatbot. In addition, we present methods that can be used for coding education and an example implementation of the entire system on a platform where existing developers and students can build chatbots.
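The answer-selection structure described above can be illustrated, in a drastically simplified form, by token-overlap retrieval over learned question-answer pairs; the pairs below are invented, and the paper's actual approach uses RNN/Char-CNN models rather than overlap scoring:

```python
# Toy retrieval-based chatbot: pick the learned answer whose question
# shares the most tokens with the user's input (a stand-in for the
# neural answer-selection step described in the paper).
def best_answer(user_input, qa_pairs):
    tokens = set(user_input.lower().split())
    question, answer = max(
        qa_pairs,
        key=lambda qa: len(tokens & set(qa[0].lower().split())),
    )
    return answer

qa = [
    ("what are your opening hours", "We open at 9 am."),
    ("where is the library", "The library is on Main Street."),
]
print(best_answer("opening hours please", qa))   # We open at 9 am.
```

A deep learning model replaces the overlap score with a learned similarity, which is what allows unstructured, non-standardized inputs to be matched.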

Development of big data based Skin Care Information System SCIS for skin condition diagnosis and management

  • Kim, Hyung-Hoon;Cho, Jeong-Ran
    • Journal of the Korea Society of Computer and Information
    • /
    • v.27 no.3
    • /
    • pp.137-147
    • /
    • 2022
  • Diagnosing and managing skin condition is a basic and important task for workers in the beauty and cosmetics industries. Accurate diagnosis and management require understanding the skin condition and needs of customers. In this paper, we developed SCIS, a big data-based skin care information system that supports skin condition diagnosis and management using social media big data. With the developed system, core information for skin condition diagnosis and management can be analyzed and extracted from text. SCIS consists of a big data collection stage, a text preprocessing stage, an image preprocessing stage, and a text word analysis stage. It collects the big data necessary for skin diagnosis and management and extracts key words and topics from text through simple frequency analysis, relative frequency analysis, co-occurrence analysis, and correlation analysis of key words. In addition, by analyzing the extracted key words and information and applying various visualization processes such as scatter plots, NetworkX, t-SNE, and clustering, the system can be used efficiently in diagnosing and managing skin conditions.
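The simple and relative frequency analysis steps can be sketched as follows; the documents below are toy stand-ins for the collected social media text:

```python
import re
from collections import Counter

def keyword_frequencies(docs):
    """Return each word's simple count and relative frequency across documents."""
    counts = Counter()
    for doc in docs:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    total = sum(counts.values())
    return {w: (n, n / total) for w, n in counts.items()}

docs = ["dry skin care", "oily skin care tips"]
print(keyword_frequencies(docs)["skin"])   # skin appears in 2 of 7 tokens
```

Co-occurrence and correlation analysis then build on these counts, pairing up words that appear in the same posts.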

Sentiment Analysis of Product Reviews to Identify Deceptive Rating Information in Social Media: A SentiDeceptive Approach

  • Marwat, M. Irfan;Khan, Javed Ali;Alshehri, Dr. Mohammad Dahman;Ali, Muhammad Asghar;Hizbullah;Ali, Haider;Assam, Muhammad
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.16 no.3
    • /
    • pp.830-860
    • /
    • 2022
  • [Introduction] Many companies are shifting their businesses online due to the growing trend among customers to buy and shop online. [Problem] Users share a vast amount of information about products, making it difficult and challenging for end-users to make decisions. [Motivation] Therefore, we need a mechanism to automatically analyze end-user opinions, thoughts, and feelings about products on social media platforms that can help customers make or change purchasing decisions. [Proposed Solution] For this purpose, we propose an automated SentiDeceptive approach, which classifies end-user reviews into negative, positive, and neutral sentiments and identifies deceptive crowd-user rating information on social media platforms to support decision-making. [Methodology] We first collected 11,781 end-user comments from the Amazon store and the Flipkart web application covering diverse products, such as watches, mobile phones, shoes, clothes, and perfumes. Next, we developed a coding guideline used as the basis for the comment annotation process. We then applied a content analysis approach and the existing VADER library to annotate the end-user comments with the identified codes, producing a labelled data set used as input to the machine learning classifiers. Finally, we applied the sentiment analysis approach to identify end-user opinions and overcome deceptive rating information by preprocessing the input data to remove irrelevant items (stop words, special characters, etc.), employing two standard resampling approaches (oversampling and under-sampling) to balance the data set, extracting different features (TF-IDF and BOW) from the textual data, and then training and testing the machine learning algorithms with standard cross-validation (KFold and Shuffle Split). [Results/Outcomes] To support the study, we developed an automated tool that analyzes each customer's feedback and displays the collective sentiment about a specific product in a graph, helping customers make decisions. In a nutshell, the proposed approach produces good results when identifying customer sentiments from online user feedback, obtaining an average 94.01% precision, 93.69% recall, and 93.81% F-measure for classifying positive sentiments.
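Of the two feature sets mentioned, TF-IDF can be sketched with the standard formula tf(w, d) · log(N / df(w)); the toy reviews below are illustrative, not the Amazon/Flipkart data:

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF weights per document: term frequency times log inverse document frequency."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))            # document frequency: one count per doc
    n = len(docs)
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append(
            {w: (c / len(toks)) * math.log(n / df[w]) for w, c in tf.items()}
        )
    return weights

w = tfidf(["good product", "bad product"])
print(w[0])   # 'product' occurs everywhere, so its weight is 0
```

Words occurring in every review get zero weight, which is why TF-IDF highlights the discriminative sentiment words that the classifiers rely on.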

Determination of Fire Risk Assessment Indicators for Building using Big Data (빅데이터를 활용한 건축물 화재위험도 평가 지표 결정)

  • Joo, Hong-Jun;Choi, Yun-Jeong;Ok, Chi-Yeol;An, Jae-Hong
    • Journal of the Korea Institute of Building Construction
    • /
    • v.22 no.3
    • /
    • pp.281-291
    • /
    • 2022
  • This study uses big data to determine the indicators necessary for a fire risk assessment of buildings. Because most of the indicators affecting building fire risk have been fixed with only the building itself in mind, previous assessments have been limited and subjective. If various internal and external indicators can be considered using big data, effective measures can be taken to reduce the fire risk of buildings. To collect the necessary data, search terms were first selected, and professional literature was collected as unstructured data using a web crawling technique. To extract the words in the literature, pre-processing such as user dictionary registration, removal of duplicate literature, and stopword elimination was performed. Then, through a review of previous research, the words were classified into four components, and representative keywords related to risk were selected from each component. Risk-related indicators were collected by analyzing words related to the representative keywords. By examining the indicators against the selection criteria, 20 indicators were determined. This methodology indicates the applicability of big data analysis for establishing measures to reduce fire risk in buildings, and the determined risk indicators can be used as reference material for assessment.
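The stopword-elimination step in the pre-processing stage can be sketched as below; the stopword list is illustrative, and the study additionally registers user dictionaries and removes duplicate literature:

```python
import re

# Illustrative English stopword list; the study works on Korean literature
# with its own user dictionary and stopword set.
STOPWORDS = {"the", "of", "and", "a", "in"}

def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, tokenize, and drop stopwords, as in the pre-processing step."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords]

print(preprocess("The risk of fire in the building"))   # ['risk', 'fire', 'building']
```

Only after this cleaning are the remaining words classified into components and mined for risk-related indicators.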

A Study on the Oral Characteristics in Personal Narrative Storytelling (체험 이야기하기의 구술적 특성에 대하여)

  • Kim, Kyung-Seop
    • The Journal of the Convergence on Culture Technology
    • /
    • v.8 no.4
    • /
    • pp.143-150
    • /
    • 2022
  • The folk language that lives and breathes in modern works does not come only from old stories; it includes personal narratives based on the experiences of the narrator. Like many genres of oral literature, most personal narratives arise from the impulse to communicate and reinvent rather than the impulse to create. Compared with traditional folktales, stories about an individual's experiences are often performed with the narrator's individual tendencies added. In doing so, the phenomenon of "processing the experience by estimating it and reinterpreting the memories roughly" occurs, and this is a significant factor in the making of oral literature. The question that arises is how to deal with these significant elements that are inevitably captured in oral performance. Text linguistics, the main methodology of this paper, offers a way to express the impromptu elements of oral literature. Textual linguistic analysis of personal narratives also makes it possible to discuss oral characteristics that have been difficult to analyze, such as the on-site atmosphere, speaker mistakes, contradictions in stories, and audience reactions. Hence, it becomes possible to discuss the oral poetics of oral literature based on the one-off nature of 'words', the 'roughness' of the on-site atmosphere, and the cumulative 'wisdom of crowds'. Furthermore, this work is expected to contribute to the study of personal narrative storytelling, which plays an important part in verbal art in community culture.

Korean Part-Of-Speech Tagging by using Head-Tail Tokenization (Head-Tail 토큰화 기법을 이용한 한국어 품사 태깅)

  • Suh, Hyun-Jae;Kim, Jung-Min;Kang, Seung-Shik
    • Smart Media Journal
    • /
    • v.11 no.5
    • /
    • pp.17-25
    • /
    • 2022
  • Korean part-of-speech taggers decompose a compound morpheme into unit morphemes and attach part-of-speech tags. This has the disadvantage that morpheme part-of-speech tags are over-classified in detail and complex word types are generated depending on the purpose of the tagger. When a part-of-speech tagger is used for keyword extraction in deep learning-based language processing, decomposing compound particles and verb endings is not required. In this study, the part-of-speech tagging problem is simplified by a Head-Tail tokenization technique that divides each word into only two types of tokens, a lexical morpheme part and a grammatical morpheme part, which resolves the problem of excessive morpheme decomposition. Part-of-speech tagging was attempted with a statistical technique and a deep learning model on the Head-Tail tokenized corpus, and the accuracy of each model was evaluated. Tagging was implemented with the TnT tagger, a statistics-based part-of-speech tagger, and a Bi-LSTM tagger, a deep learning-based one; both were trained on the Head-Tail tokenized corpus to measure tagging accuracy. As a result, the Bi-LSTM tagger performed part-of-speech tagging with a high accuracy of 99.52%, compared with 97.00% for the TnT tagger.
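The Head-Tail split described above can be sketched as cutting a word into a lexical head and a grammatical tail; the tail list below is a small illustrative set of Korean particles, whereas the paper determines split points from a tagged corpus:

```python
# Rough sketch of Head-Tail tokenization: each word (eojeol) is cut into a
# lexical head and a grammatical tail. The tail inventory here is illustrative,
# not the paper's learned segmentation.
TAILS = ["에서", "에게", "은", "는", "이", "가", "을", "를"]

def head_tail(word):
    """Return (head, tail), matching the longest known grammatical tail."""
    for tail in sorted(TAILS, key=len, reverse=True):
        if len(word) > len(tail) and word.endswith(tail):
            return word[:-len(tail)], tail
    return word, ""   # no grammatical tail found

print(head_tail("학교에서"))   # ('학교', '에서')
```

Because every word yields at most two tokens, the tag set and sequence length stay small, which is what simplifies the downstream TnT or Bi-LSTM tagging task.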