• Title/Summary/Keyword: Unstructured text data

Search Result 228, Processing Time 0.022 seconds

The Frequency Analysis of Teacher's Emotional Response in Mathematics Class (수학 담화에서 나타나는 교사의 감성적 언어 빈도 분석)

  • Son, Bok Eun;Ko, Ho Kyoung
    • Communications of Mathematical Education
    • /
    • v.32 no.4
    • /
    • pp.555-573
    • /
    • 2018
  • The purpose of this study is to identify the emotional language of math teachers in math class using text mining techniques. For this purpose, we collected the discourse data of the teachers in the class by using the excellent class video. The analysis of the extracted unstructured data proceeded to three stages: data collection, data preprocessing, and text mining analysis. According to text mining analysis, there was few emotional language in teacher's response in mathematics class. This result can infer the characteristics of mathematics class in the aspect of affective domain.

Detecting Spam Data for Securing the Reliability of Text Analysis (텍스트 분석의 신뢰성 확보를 위한 스팸 데이터 식별 방안)

  • Hyun, Yoonjin;Kim, Namgyu
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.42 no.2
    • /
    • pp.493-504
    • /
    • 2017
  • Recently, tremendous amounts of unstructured text data that is distributed through news, blogs, and social media has gained much attention from many researchers and practitioners as this data contains abundant information about various consumers' opinions. However, as the usefulness of text data is increasing, more and more attempts to gain profits by distorting text data maliciously or nonmaliciously are also increasing. This increase in spam text data not only burdens users who want to obtain useful information with a large amount of inappropriate information, but also damages the reliability of information and information providers. Therefore, efforts must be made to improve the reliability of information and the quality of analysis results by detecting and removing spam data in advance. For this purpose, many studies to detect spam have been actively conducted in areas such as opinion spam detection, spam e-mail detection, and web spam detection. In this study, we introduce core concepts and current research trends of spam detection and propose a methodology to detect the spam tag of a blog as one of the challenging attempts to improve the reliability of blog information.

Crafting a Quality Performance Evaluation Model Leveraging Unstructured Data (비정형데이터를 활용한 건축현장 품질성과 평가 모델 개발)

  • Lee, Kiseok;Song, Taegeun;Yoo, Wi Sung
    • Journal of the Korea Institute of Building Construction
    • /
    • v.24 no.1
    • /
    • pp.157-168
    • /
    • 2024
  • The frequent occurrence of structural failures at building construction sites in Korea has underscored the critical role of rigorous oversight in the inspection and management of construction projects. As mandated by prevailing regulations and standards, onsite supervision by designated supervisors encompasses thorough documentation of construction quality, material standards, and the history of any reconstructions, among other factors. These reports, predominantly consisting of unstructured data, constitute approximately 80% of the data amassed at construction sites and serve as a comprehensive repository of quality-related information. This research introduces the SL-QPA model, which employs text mining techniques to preprocess supervision reports and establish a sentiment dictionary, thereby enabling the quantification of quality performance. The study's findings, demonstrating a statistically significant Pearson correlation between the quality performance scores derived from the SL-QPA model and various legally defined indicators, were substantiated through a one-way analysis of variance of the correlation coefficients. The SL-QPA model, as developed in this study, offers a supplementary approach to evaluating the quality performance of building construction projects. It holds the promise of enhancing quality inspection and management practices by harnessing the wealth of unstructured data generated throughout the lifecycle of construction projects.

Effectiveness of Fuzzy Graph Based Document Model

  • Aswathy M R;P.C. Reghu Raj;Ajeesh Ramanujan
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.18 no.8
    • /
    • pp.2178-2198
    • /
    • 2024
  • Graph-based document models have good capabilities to reveal inter-dependencies among unstructured text data. Natural language processing (NLP) systems that use such models as an intermediate representation have shown good performance. This paper proposes a novel fuzzy graph-based document model and to demonstrate its effectiveness by applying fuzzy logic tools for text summarization. The proposed system accepts a text document as input and identifies some of its sentence level features, namely sentence position, sentence length, numerical data, thematic word, proper noun, title feature, upper case feature, and sentence similarity. The fuzzy membership value of each feature is computed from the sentences. We also propose a novel algorithm to construct the fuzzy graph as an intermediate representation of the input document. The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric is used to evaluate the model. The evaluation based on different quality metrics was also performed to verify the effectiveness of the model. The ANOVA test confirms the hypothesis that the proposed model improves the summarizer performance by 10% when compared with the state-of-the-art summarizers employing alternate intermediate representations for the input text.

User-Perspective Issue Clustering Using Multi-Layered Two-Mode Network Analysis (다계층 이원 네트워크를 활용한 사용자 관점의 이슈 클러스터링)

  • Kim, Jieun;Kim, Namgyu;Cho, Yoonho
    • Journal of Intelligence and Information Systems
    • /
    • v.20 no.2
    • /
    • pp.93-107
    • /
    • 2014
  • In this paper, we report what we have observed with regard to user-perspective issue clustering based on multi-layered two-mode network analysis. This work is significant in the context of data collection by companies about customer needs. Most companies have failed to uncover such needs for products or services properly in terms of demographic data such as age, income levels, and purchase history. Because of excessive reliance on limited internal data, most recommendation systems do not provide decision makers with appropriate business information for current business circumstances. However, part of the problem is the increasing regulation of personal data gathering and privacy. This makes demographic or transaction data collection more difficult, and is a significant hurdle for traditional recommendation approaches because these systems demand a great deal of personal data or transaction logs. Our motivation for presenting this paper to academia is our strong belief, and evidence, that most customers' requirements for products can be effectively and efficiently analyzed from unstructured textual data such as Internet news text. In order to derive users' requirements from textual data obtained online, the proposed approach in this paper attempts to construct double two-mode networks, such as a user-news network and news-issue network, and to integrate these into one quasi-network as the input for issue clustering. One of the contributions of this research is the development of a methodology utilizing enormous amounts of unstructured textual data for user-oriented issue clustering by leveraging existing text mining and social network analysis. In order to build multi-layered two-mode networks of news logs, we need some tools such as text mining and topic analysis. We used not only SAS Enterprise Miner 12.1, which provides a text miner module and cluster module for textual data analysis, but also NetMiner 4 for network visualization and analysis. Our approach for user-perspective issue clustering is composed of six main phases: crawling, topic analysis, access pattern analysis, network merging, network conversion, and clustering. In the first phase, we collect visit logs for news sites by crawler. After gathering unstructured news article data, the topic analysis phase extracts issues from each news article in order to build an article-news network. For simplicity, 100 topics are extracted from 13,652 articles. In the third phase, a user-article network is constructed with access patterns derived from web transaction logs. The double two-mode networks are then merged into a quasi-network of user-issue. Finally, in the user-oriented issue-clustering phase, we classify issues through structural equivalence, and compare these with the clustering results from statistical tools and network analysis. An experiment with a large dataset was performed to build a multi-layer two-mode network. After that, we compared the results of issue clustering from SAS with that of network analysis. The experimental dataset was from a web site ranking site, and the biggest portal site in Korea. The sample dataset contains 150 million transaction logs and 13,652 news articles of 5,000 panels over one year. User-article and article-issue networks are constructed and merged into a user-issue quasi-network using Netminer. Our issue-clustering results applied the Partitioning Around Medoids (PAM) algorithm and Multidimensional Scaling (MDS), and are consistent with the results from SAS clustering. In spite of extensive efforts to provide user information with recommendation systems, most projects are successful only when companies have sufficient data about users and transactions. Our proposed methodology, user-perspective issue clustering, can provide practical support to decision-making in companies because it enhances user-related data from unstructured textual data. To overcome the problem of insufficient data from traditional approaches, our methodology infers customers' real interests by utilizing web transaction logs. In addition, we suggest topic analysis and issue clustering as a practical means of issue identification.

Using noise filtering and sufficient dimension reduction method on unstructured economic data (노이즈 필터링과 충분차원축소를 이용한 비정형 경제 데이터 활용에 대한 연구)

  • Jae Keun Yoo;Yujin Park;Beomseok Seo
    • The Korean Journal of Applied Statistics
    • /
    • v.37 no.2
    • /
    • pp.119-138
    • /
    • 2024
  • Text indicators are increasingly valuable in economic forecasting, but are often hindered by noise and high dimensionality. This study aims to explore post-processing techniques, specifically noise filtering and dimensionality reduction, to normalize text indicators and enhance their utility through empirical analysis. Predictive target variables for the empirical analysis include monthly leading index cyclical variations, BSI (business survey index) All industry sales performance, BSI All industry sales outlook, as well as quarterly real GDP SA (seasonally adjusted) growth rate and real GDP YoY (year-on-year) growth rate. This study explores the Hodrick and Prescott filter, which is widely used in econometrics for noise filtering, and employs sufficient dimension reduction, a nonparametric dimensionality reduction methodology, in conjunction with unstructured text data. The analysis results reveal that noise filtering of text indicators significantly improves predictive accuracy for both monthly and quarterly variables, particularly when the dataset is large. Moreover, this study demonstrated that applying dimensionality reduction further enhances predictive performance. These findings imply that post-processing techniques, such as noise filtering and dimensionality reduction, are crucial for enhancing the utility of text indicators and can contribute to improving the accuracy of economic forecasts.

Analysis of News Regarding New Southeastern Airport Using Text Mining Techniques (텍스트 마이닝 기법을 활용한 동남권 신공항 신문기사 분석)

  • Han, Mu Moung Cho;Kim, Yang Sok;Lee, Choong Kwon
    • Smart Media Journal
    • /
    • v.6 no.1
    • /
    • pp.47-53
    • /
    • 2017
  • Social issues are important factors that decide government policy and newspapers are critical channels that reflect them. Analysing news articles can contribute to understanding social issues, but it is very difficult to analyse the unstructured large volumes of news data manually. Therefore, this study aims to analyze the different views among stakeholders of a specific social issue by using text analysis, word cloud analysis and associative analysis methods, which systematically transform unstructured news data into structured one. We analyzed a total of 115 news articles and a total of 6,772 comments, collected from the selected newspapers (Chosun-Il-bo, Joongang-Il-bo, Donga-Il-bo, Maeil Newspaper, Busan-Il-bo) for two weeks. We found that there are significant differences in tone between newspapers. While nation-wide daily newspapers focus on political relations with local areas, local daily newspapers tend to write articles to represent local governments' interests.

Identify the Failure Mode of Weapon System (or equipment) using Machine Learning (Machine Learning을 이용한 무기 체계(or 구성품) 고장 유형 식별)

  • Park, Yun-Kyung;Lee, Hye-Won;Kim, Sang-Moon
    • Journal of the Korea Academia-Industrial cooperation Society
    • /
    • v.19 no.8
    • /
    • pp.64-70
    • /
    • 2018
  • The development of weapon systems (or components) is hindered by the number of tests due to the limited development period and cost, which reduces the scale of accumulated data related to failures. Nevertheless, because a large amount of failure data and maintenance details during the operational period are managed by computerized data, the cause of failure of weapon systems (or components) can be analyzed using the data. On the other hand, analyzing the failure and maintenance details of various weapon systems is difficult because of the variation among groups and companies, and details of the cause of failure are described as unstructured text data. Fortunately, the recent developments of big data processing technology, machine learning algorithm, and improved HW computation ability have supported major research into various methods for processing the above unstructured data. In this paper, unstructured data related to the failure / maintenance of defense weapon systems (or components) is presented by applying doc2vec, a machine learning technique, to analyze the failure cases.

Analysis of drama viewership related words through unstructured data collection (비정형데이터 수집을 통한 드라마 시청률 연관어 분석)

  • Kang, Sun-Kyoung;Lee, Hyun-Chang;Shin, Seong-Yoon
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.21 no.8
    • /
    • pp.1567-1574
    • /
    • 2017
  • In this paper, we analyzed the stereotyped and non - stereotyped data in order to analyze the drama 's ratings. The formalized data collection collected 19 items from the four areas of drama information, person information, broadcasting information, and audience rating information of each broadcasting company. Atypical data were collected from bulletin boards, pre - broadcast blogs and post - broadcast blogs operated by each broadcasting company using a crawling technique. As a result of comparing the differences according to the four areas for each broadcaster from the collected regular data, the results were similar to each other. And we derived seven related words by analyzing the correlation of occurrence frequencies from unstructured data collected from bulletin boards and blogs of each broadcasting company. The derived associations were obtained through reliability analysis.

A Study on the General Public's Perceptions of Dental Fear Using Unstructured Big Data

  • Han-A Cho;Bo-Young Park
    • Journal of dental hygiene science
    • /
    • v.23 no.4
    • /
    • pp.255-263
    • /
    • 2023
  • Background: This study used text mining techniques to determine public perceptions of dental fear, extracted keywords related to dental fear, identified the connection between the keywords, and categorized and visualized perceptions related to dental fear. Methods: Keywords in texts posted on Internet portal sites (NAVER and Google) between 1 January, 2000, and 31 December, 2022, were collected. The four stages of analysis were used to explore the keywords: frequency analysis, term frequency-inverse document frequency (TF-IDF), centrality analysis and co-occurrence analysis, and convergent correlations. Results: In the top ten keywords based on frequency analysis, the most frequently used keyword was 'treatment,' followed by 'fear,' 'dental implant,' 'conscious sedation,' 'pain,' 'dental fear,' 'comfort,' 'taking medication,' 'experience,' and 'tooth.' In the TF-IDF analysis, the top three keywords were dental implant, conscious sedation, and dental fear. The co-occurrence analysis was used to explore keywords that appear together and showed that 'fear and treatment' and 'treatment and pain' appeared the most frequently. Conclusion: Texts collected via unstructured big data were analyzed to identify general perceptions related to dental fear, and this study is valuable as a source data for understanding public perceptions of dental fear by grouping associated keywords. The results of this study will be helpful to understand dental fear and used as factors affecting oral health in the future.