• Title/Summary/Keyword: Datasets as Topic

Search Result 33, Processing Time 0.026 seconds

Effects of Preprocessing on Text Classification in Balanced and Imbalanced Datasets

  • Mehmet F. Karaca
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.18 no.3
    • /
    • pp.591-609
    • /
    • 2024
  • In this study, preprocessings with all combinations were examined in terms of the effects on decreasing word number, shortening the duration of the process and the classification success in balanced and imbalanced datasets which were unbalanced in different ratios. The decreases in the word number and the processing time provided by preprocessings were interrelated. It was seen that more successful classifications were made with Turkish datasets and English datasets were affected more from the situation of whether the dataset is balanced or not. It was found out that the incorrect classifications, which are in the classes having few documents in highly imbalanced datasets, were made by assigning to the class close to the related class in terms of topic in Turkish datasets and to the class which have many documents in English datasets. In terms of average scores, the highest classification was obtained in Turkish datasets as follows: with not applying lowercase, applying stemming and removing stop words, and in English datasets as follows: with applying lowercase and stemming, removing stop words. Applying stemming was the most important preprocessing method which increases the success in Turkish datasets, whereas removing stop words in English datasets. The maximum scores revealed that feature selection, feature size and classifier are more effective than preprocessing in classification success. It was concluded that preprocessing is necessary for text classification because it shortens the processing time and can achieve high classification success, a preprocessing method does not have the same effect in all languages, and different preprocessing methods are more successful for different languages.

Identifying Topic-Specific Experts on Microblog

  • Yu, Yan;Mo, Lingfei;Wang, Jian
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.10 no.6
    • /
    • pp.2627-2647
    • /
    • 2016
  • With the rapid growth of microblog, expert identification on microblog has been playing a crucial role in many applications. While most previous expert identification studies only assess global authoritativeness of a user, there is no way to differentiate the authoritativeness in a particular aspect of topics. In this paper, we propose a novel model, which jointly models text and following relationship in the same generative process. Furthermore, we integrate a similarity-based weight scheme into the model to address the popular bias problem, and use followee topic distribution as prior information to make user's topic distribution more precisely. Our empirical study on two large real-world datasets shows that our proposed model produces significantly higher quality results than the prior arts.

Too Much Information - Trying to Help or Deceive? An Analysis of Yelp Reviews

  • Hyuk Shin;Hong Joo Lee;Ruth Angelie Cruz
    • Asia pacific journal of information systems
    • /
    • v.33 no.2
    • /
    • pp.261-281
    • /
    • 2023
  • The proliferation of online customer reviews has completely changed how consumers purchase. Consumers now heavily depend on authentic experiences shared by previous customers. However, deceptive reviews that aim to manipulate customer decision-making to promote or defame a product or service pose a risk to businesses and buyers. The studies investigating consumer perception of deceptive reviews found that one of the important cues is based on review content. This study aims to investigate the impact of the information amount of review on the review truthfulness. This study adopted the Information Manipulation Theory (IMT) as an overarching theory, which asserts that the violations of one or more of the Gricean maxim are deceptive behaviors. It is regarded as a quantity violation if the required information amount is not delivered or more information is delivered; that is an attempt at deception. A topic modeling algorithm is implemented to reveal the distribution of each topic embedded in a text. This study measures information amount as topic diversity based on the results of topic modeling, and topic diversity shows how heterogeneous a text review is. Two datasets of restaurant reviews on Yelp.com, which have Filtered (deceptive) and Unfiltered (genuine) reviews, were used to test the hypotheses. Reviews that contain more diverse topics tend to be truthful. However, excessive topic diversity produces an inverted U-shaped relationship with truthfulness. Moreover, we find an interaction effect between topic diversity and reviews' ratings. This result suggests that the impact of topic diversity is strengthened when deceptive reviews have lower ratings. This study contributes to the existing literature on IMT by building the connection between topic diversity in a review and its truthfulness. In addition, the empirical results show that topic diversity is a reliable measure for gauging information amount of reviews.

Digital Epidemiology: Use of Digital Data Collected for Non-epidemiological Purposes in Epidemiological Studies

  • Park, Hyeoun-Ae;Jung, Hyesil;On, Jeongah;Park, Seul Ki;Kang, Hannah
    • Healthcare Informatics Research
    • /
    • v.24 no.4
    • /
    • pp.253-262
    • /
    • 2018
  • Objectives: We reviewed digital epidemiological studies to characterize how researchers are using digital data by topic domain, study purpose, data source, and analytic method. Methods: We reviewed research articles published within the last decade that used digital data to answer epidemiological research questions. Data were abstracted from these articles using a data collection tool that we developed. Finally, we summarized the characteristics of the digital epidemiological studies. Results: We identified six main topic domains: infectious diseases (58.7%), non-communicable diseases (29.4%), mental health and substance use (8.3%), general population behavior (4.6%), environmental, dietary, and lifestyle (4.6%), and vital status (0.9%). We identified four categories for the study purpose: description (22.9%), exploration (34.9%), explanation (27.5%), and prediction and control (14.7%). We identified eight categories for the data sources: web search query (52.3%), social media posts (31.2%), web portal posts (11.9%), webpage access logs (7.3%), images (7.3%), mobile phone network data (1.8%), global positioning system data (1.8%), and others (2.8%). Of these, 50.5% used correlation analyses, 41.3% regression analyses, 25.6% machine learning, and 19.3% descriptive analyses. Conclusions: Digital data collected for non-epidemiological purposes are being used to study health phenomena in a variety of topic domains. Digital epidemiology requires access to large datasets and advanced analytics. Ensuring open access is clearly at odds with the desire to have as little personal data as possible in these large datasets to protect privacy. Establishment of data cooperatives with restricted access may be a solution to this dilemma.

Combining Ego-centric Network Analysis and Dynamic Citation Network Analysis to Topic Modeling for Characterizing Research Trends (자아 중심 네트워크 분석과 동적 인용 네트워크를 활용한 토픽모델링 기반 연구동향 분석에 관한 연구)

  • Yu, So-Young
    • Journal of the Korean Society for information Management
    • /
    • v.32 no.1
    • /
    • pp.153-169
    • /
    • 2015
  • The combined approach of using ego-centric network analysis and dynamic citation network analysis for refining the result of LDA-based topic modeling was suggested and examined in this study. Tow datasets were constructed by collecting Web of Science bibliographic records of White LED and topic modeling was performed by setting a different number of topics on each dataset. The multi-assigned top keywords of each topic were re-assigned to one specific topic by applying an ego-centric network analysis algorithm. It was found that the topical cohesion of the result of topic modeling with the number of topic corresponding to the lowest value of perplexity to the dataset extracted by SPLC network analysis was the strongest with the best values of internal clustering evaluation indices. Furthermore, it demonstrates the possibility of developing the suggested approach as a method of multi-faceted research trend detection.

Investigation of Topic Trends in Computer and Information Science by Text Mining Techniques: From the Perspective of Conferences in DBLP (텍스트 마이닝 기법을 이용한 컴퓨터공학 및 정보학 분야 연구동향 조사: DBLP의 학술회의 데이터를 중심으로)

  • Kim, Su Yeon;Song, Sung Jeon;Song, Min
    • Journal of the Korean Society for information Management
    • /
    • v.32 no.1
    • /
    • pp.135-152
    • /
    • 2015
  • The goal of this paper is to explore the field of Computer and Information Science with the aid of text mining techniques by mining Computer and Information Science related conference data available in DBLP (Digital Bibliography & Library Project). Although studies based on bibliometric analysis are most prevalent in investigating dynamics of a research field, we attempt to understand dynamics of the field by utilizing Latent Dirichlet Allocation (LDA)-based multinomial topic modeling. For this study, we collect 236,170 documents from 353 conferences related to Computer and Information Science in DBLP. We aim to include conferences in the field of Computer and Information Science as broad as possible. We analyze topic modeling results along with datasets collected over the period of 2000 to 2011 including top authors per topic and top conferences per topic. We identify the following four different patterns in topic trends in the field of computer and information science during this period: growing (network related topics), shrinking (AI and data mining related topics), continuing (web, text mining information retrieval and database related topics), and fluctuating pattern (HCI, information system and multimedia system related topics).

Detects depression-related emotions in user input sentences (사용자 입력 문장에서 우울 관련 감정 탐지)

  • Oh, Jaedong;Oh, Hayoung
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.26 no.12
    • /
    • pp.1759-1768
    • /
    • 2022
  • This paper proposes a model to detect depression-related emotions in a user's speech using wellness dialogue scripts provided by AI Hub, topic-specific daily conversation datasets, and chatbot datasets published on Github. There are 18 emotions, including depression and lethargy, in depression-related emotions, and emotion classification tasks are performed using KoBERT and KOELECTRA models that show high performance in language models. For model-specific performance comparisons, we build diverse datasets and compare classification results while adjusting batch sizes and learning rates for models that perform well. Furthermore, a person performs a multi-classification task by selecting all labels whose output values are higher than a specific threshold as the correct answer, in order to reflect feeling multiple emotions at the same time. The model with the best performance derived through this process is called the Depression model, and the model is then used to classify depression-related emotions for user utterances.

What Topics Have Been Studied in Korean Mathematics Education for 15 Years: Latent Topic Modeling Analysis

  • Hwang, Jihyun
    • Research in Mathematical Education
    • /
    • v.24 no.4
    • /
    • pp.313-335
    • /
    • 2021
  • The purpose of this research is to identify topics discussed by Korean mathematics education studies and examine research trends for 15 years. I applied latent Dirichlet allocation (LDA) to the original text datasets including English abstracts of 3,157 articles published in eight journals indexed by the Korean Citation Index (KCI) from 1997 to 2019. I identified an LDA model with 60 topics, then research trends in 2,884 articles between 2002 and 2018 were as follows; mathematics educators have paid most attention to teacher education through 2010 to 2015 and curriculum analysis after 2016. The findings in this research can contribute to understand what have been discussed in Korean mathematics education society as well as what will and need to be emphasized more in the future compared to the global research trends. In addition, LDA has potentials to identify topics and keywords of manuscripts newly written and submitted to any journals in addition to information provided by authors.

Human Posture Recognition: Methodology and Implementation

  • Htike, Kyaw Kyaw;Khalifa, Othman O.
    • Journal of Electrical Engineering and Technology
    • /
    • v.10 no.4
    • /
    • pp.1910-1914
    • /
    • 2015
  • Human posture recognition is an attractive and challenging topic in computer vision due to its promising applications in the areas of personal health care, environmental awareness, human-computer-interaction and surveillance systems. Human posture recognition in video sequences consists of two stages: the first stage is training and evaluation and the second is deployment. In the first stage, the system is trained and evaluated using datasets of human postures to ‘teach’ the system to classify human postures for any future inputs. When the training and evaluation process is deemed satisfactory as measured by recognition rates, the trained system is then deployed to recognize human postures in any input video sequence. Different classifiers were used in the training such as Multilayer Perceptron Feedforward Neural networks, Self-Organizing Maps, Fuzzy C Means and K Means. Results show that supervised learning classifiers tend to perform better than unsupervised classifiers for the case of human posture recognition.

Large Language Models: A Guide for Radiologists

  • Sunkyu Kim;Choong-kun Lee;Seung-seob Kim
    • Korean Journal of Radiology
    • /
    • v.25 no.2
    • /
    • pp.126-133
    • /
    • 2024
  • Large language models (LLMs) have revolutionized the global landscape of technology beyond natural language processing. Owing to their extensive pre-training on vast datasets, contemporary LLMs can handle tasks ranging from general functionalities to domain-specific areas, such as radiology, without additional fine-tuning. General-purpose chatbots based on LLMs can optimize the efficiency of radiologists in terms of their professional work and research endeavors. Importantly, these LLMs are on a trajectory of rapid evolution, wherein challenges such as "hallucination," high training cost, and efficiency issues are addressed, along with the inclusion of multimodal inputs. In this review, we aim to offer conceptual knowledge and actionable guidance to radiologists interested in utilizing LLMs through a succinct overview of the topic and a summary of radiology-specific aspects, from the beginning to potential future directions.