• Title/Summary/Keyword: Text-as-data

Text summarization of dialogue based on BERT

  • Nam, Wongyung;Lee, Jisoo;Jang, Beakcheol
    • Journal of the Korea Society of Computer and Information / v.27 no.8 / pp.41-47 / 2022
  • In this paper, we propose a way to summarize colloquial text that is not clearly organized. We used the SAMSum dataset, a collection of colloquial dialogues, and applied the BERTSumExtAbs model proposed in earlier work on automatic summarization. More than 70% of the SAMSum dataset consists of conversations between two people, and the rest consists of conversations among three or more people. Applying the automatic text summarization model to this colloquial data yielded a ROUGE-1 (R-1) score of 42.43 or higher. In addition, fine-tuning the previously proposed BERTSum text summarization model yielded a higher score of 45.81. This study demonstrates the performance of generative (abstractive) summarization on colloquial data, and we hope it will serve as basic data for helping computers understand human natural language as it is and solve various tasks.
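
The R-1 score mentioned in this abstract measures unigram overlap between a generated summary and a reference summary. The following is a minimal sketch of that idea, assuming whitespace tokenization and lowercasing. It is not the evaluation script used by the authors; the function name `rouge_1` and the example sentences are hypothetical.

```python
from collections import Counter


def rouge_1(candidate: str, reference: str) -> dict[str, float]:
    """Unigram-overlap precision, recall, and F1 (simplified ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical dialogue summary pair, for illustration only.
scores = rouge_1("amanda baked cookies", "amanda baked cookies for jerry")
```

Scores such as 42.43 are presumably this kind of value on a 0 to 100 scale; published evaluations usually add stemming and other normalization that this sketch omits.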

Korean Middle School Students' Epistemic Ideas of Claim, Data, Evidence, and Argument When Evaluating and Critiquing Arguments (한국 중학생들의 주장, 자료, 근거와 과학 논의에 대한 인식론적 이해조사)

  • Ryu, Suna
    • Journal of The Korean Association For Science Education / v.35 no.2 / pp.199-208 / 2015
  • An enhanced understanding of the nature of scientific knowledge (what counts as a scientific argument and how scientists justify their claims with evidence) has been central in Korean science instruction. However, despite its importance, scholars are generally concerned about the difficulty of both addressing and improving students' epistemic understanding, especially for young students. This study investigated Korean middle school students' epistemic ideas about claim, data, evidence, and argument when they read both text-based and data-inscription arguments. Compared with students in previous studies, Korean middle school students show a sophisticated understanding of the role of claim and evidence. Yet these students think that there is only a single way of interpreting data. When students' ideas about text-based and data-inscription arguments are compared, the majority of Korean students barely perceive text description as evidence and recognize only measured data as evidence.

A Study of on Extension Compression Algorithm of Mixed Text by Hangeul-Alphabet

  • Ji, Kang-yoo;Cho, Mi-nam;Hong, Sung-soo;Park, Soo-bong
    • Proceedings of the IEEK Conference / 2002.07a / pp.446-449 / 2002
  • This paper presents an improved data compression algorithm for mixed text files that combine 2-byte completion-type Hangeul and 1-byte alphabet characters. The original LZW algorithm compresses alphabet text files efficiently but compresses 2-byte completion-type Hangeul text files inefficiently. To solve this problem, a data compression algorithm that uses a 2-byte prefix field and a 2-byte suffix field in the compression table was developed. However, it has another problem: the compression ratio for alphabet text files decreases. In this paper, we propose an improved LZW algorithm, the Extended LZW (ELZW) algorithm, whose compression table uses a 2-byte prefix field as a pointer (index) into the table and a 1-byte suffix field as a counter of overlapping or recurring text data in the table. To increase the compression ratio, once the compression table has been constructed, the table data are packed into different bit strings according to whether they hold an alphabet character, Hangeul, or a pointer. As a result, the proposed ELZW algorithm is superior to the 1-byte LZW algorithm by 7.0125 percent and to the 2-byte LZW algorithm by 11.725 percent.
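
For readers unfamiliar with the baseline, the following is a minimal sketch of the standard byte-oriented LZW algorithm that this abstract takes as its starting point. It does not implement the ELZW table layout (2-byte prefix pointer plus 1-byte repeat counter) or the bit packing described above. The function names and the sample text are hypothetical, and EUC-KR is assumed here only as one example of a 2-byte completion-type Hangeul encoding.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Standard LZW: emit table codes for the longest known byte sequences."""
    table = {bytes([i]): i for i in range(256)}  # single bytes are pre-seeded
    next_code = 256
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate  # keep extending the match
        else:
            codes.append(table[current])
            table[candidate] = next_code  # learn the new sequence
            next_code += 1
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes: list[int]) -> bytes:
    """Rebuild the table on the fly and invert lzw_compress."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = table[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            # Code refers to the entry being defined in this very step.
            entry = previous + previous[:1]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        output.append(entry)
        table[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(output)


# Mixed Hangeul/alphabet sample; each Hangeul syllable is 2 bytes in EUC-KR.
sample = "한글 ABC 한글 ABC 한글 ABC".encode("euc_kr")
restored = lzw_decompress(lzw_compress(sample))
```

Because this baseline seeds its table with single bytes, every 2-byte Hangeul syllable starts out as two separate symbols, which is the inefficiency the abstract attributes to plain LZW.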

Korean Consumers' Political Consumption of Japanese Fashion Products (국내 소비자의 일본 패션제품에 대한 정치적 소비 연구)

  • Choi, Yeong-Hyeon;Lee, Kyu-Hye
    • Journal of the Korean Society of Clothing and Textiles / v.44 no.2 / pp.295-309 / 2020
  • In 2019, Japan announced trade regulations against Korean products; consequently, sales of Japanese products in Korea dropped because of a boycott by Korean consumers. This study measured Korean consumers' political consumption behavior toward Japanese fashion products. Unstructured text data were collected from online media sources and from consumer-posted sources such as blogs and SNS. Text mining techniques and semantic network analysis were used to process the unstructured data. The results identified boycotting of Japanese fashion products and buycotting of alternative products and Korean brands as forms of consumers' political consumption. Two brand cases were investigated in detail. Online text data from before and after the political action were compared, and significant changes in consumption and in emotional expressions were identified. Product-related industry sectors were also identified in terms of the political consumption of fashion: the liquor, automobile, and tourism sectors were closely linked to the fashion sector in terms of boycotting. More "boycott" and "buycott" fashion brands (reflected in consumer attitudes and feelings) were detected in consumer-driven texts than in media-driven sources.
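
Semantic network analysis of the kind mentioned in this abstract typically starts from word co-occurrence counts, which become the edge weights of a network. The following is a minimal sketch under that assumption; the function name `cooccurrence_edges` and the tokenized posts are hypothetical and are not the study's data or pipeline.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(documents: list[list[str]]) -> Counter:
    """Count, for each word pair, the number of documents containing both."""
    edges = Counter()
    for tokens in documents:
        # Sorting the unique tokens gives each pair one canonical orientation.
        for a, b in combinations(sorted(set(tokens)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical, already-tokenized posts for illustration.
posts = [
    ["boycott", "japan", "brand"],
    ["boycott", "brand"],
    ["buycott", "korea", "brand"],
]
edges = cooccurrence_edges(posts)
strongest = edges.most_common(1)
```

Each key is an undirected edge between two words, and its count can serve as the edge weight when the network is drawn or when centrality measures are computed.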

An Efficient Machine Learning-based Text Summarization in the Malayalam Language

  • P Haroon, Rosna;Gafur M, Abdul;Nisha U, Barakkath
    • KSII Transactions on Internet and Information Systems (TIIS) / v.16 no.6 / pp.1778-1799 / 2022
  • Automatic text summarization is a procedure that condenses a large body of content into a shorter text that retains the significant information. Malayalam is one of the most difficult languages spoken in certain areas of India, most commonly in Kerala and Lakshadweep. Natural language processing work on Malayalam is relatively scarce because of the complexity of the language and the scarcity of available resources. In this paper, an approach to text summarization of Malayalam documents is proposed that trains a model based on the Support Vector Machine classification algorithm. Different features of the text are taken into account during training so that the system can output the most important information from the input text. The classifier sorts sentences into separate classes (most important, important, average, and least significant), and based on this classification the system creates a summary of the input document. The user can select a compression ratio so that the system outputs a summary of that proportion. Model performance is measured using Malayalam documents from different genres as well as documents from the same domain. The model is evaluated using the content evaluation measures precision, recall, F score, and relative utility. The precision and recall values obtained show that the model is reliable and more relevant than the other summarizers.
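
The last step described in this abstract, turning per-sentence importance classes into a summary of a user-chosen compression ratio, can be sketched as follows. This is only an illustration: the function name `select_summary` is hypothetical, and the integer scores stand in for the output of the SVM classifier, which is not reproduced here.

```python
import math


def select_summary(sentences: list[str], scores: list[int], ratio: float) -> list[str]:
    """Keep the top `ratio` fraction of sentences, preserving document order."""
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    keep = max(1, math.ceil(len(sentences) * ratio))
    # Rank sentence indices by importance; ties keep their original order.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:keep])
    return [sentences[i] for i in chosen]


# Assumed class ranks: 3 = most important, 2 = important, 1 = average, 0 = least.
sentences = ["s0", "s1", "s2", "s3"]
classes = [3, 0, 2, 1]
summary = select_summary(sentences, classes, ratio=0.5)
```

Restoring document order after ranking keeps the extracted sentences readable as a summary rather than as a ranked list.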

Big Data Analysis of News on Purchasing Second-hand Clothing and Second-hand Luxury Goods: Identification of Social Perception and Current Situation Using Text Mining (중고의류와 중고명품 구매 관련 언론 보도 빅데이터 분석: 텍스트마이닝을 활용한 사회적 인식과 현황 파악)

  • Hwa-Sook Yoo
    • Human Ecology Research / v.61 no.4 / pp.687-707 / 2023
  • This study sought useful information for the development of the future second-hand fashion market by examining the current situation through unstructured text data distributed as news articles related to 'purchase of second-hand clothing' and 'purchase of second-hand luxury goods'. Text-based unstructured data were collected daily from Naver news from January 1 to December 31, 2022, using 'purchase of second-hand clothing' and 'purchase of second-hand luxury goods' as collection keywords. The data were analyzed using text mining, and the results are as follows. First, in terms of frequency, the data collected on the purchase of second-hand luxury goods were almost four times the data on the purchase of second-hand clothing, indicating that the purchase of second-hand luxury goods receives more social attention. Second, the data obtained with the two collection keywords shared some words but also contained different ones. For second-hand clothing, words related to donations, sharing, and compensation sales were mainly mentioned, indicating that the purchase of second-hand clothing tends to be recognized as an eco-friendly transaction. For second-hand luxury goods, resale and authenticity controversies related to transactions, second-hand trading platforms, and luxury brands were frequently mentioned. Third, clustering divided the data on the purchase of second-hand clothing into five groups and the data on the purchase of second-hand luxury goods into six groups.

A Review on Expressive Materials and Approaches to Text Visualization (텍스트 데이터 시각화의 표현 재료와 접근 방식에 관한 고찰)

  • Kim, Hyoyoung;Park, Jin Wan
    • The Journal of the Korea Contents Association / v.13 no.1 / pp.64-72 / 2013
  • In this study, we considered the types, essence, and characteristics of text data, the material for visual expression in text visualization (a branch of data visualization research), and analyzed the varied expressive approaches to it. Studies of text visualization have spread dramatically under the influence of advances in computing, open data, the wide use of visualization tools, and so on. For these reasons, text visualization works have been created as artworks or research outputs through interdisciplinary, convergent research spanning engineering, art, the humanities, sociology, and other fields. Nevertheless, theoretical studies of text data itself and its visualization, and systematic analyses of the approaches to it, are rare. Data is an object of understanding and interpretation, and it holds unlimited information and possibilities depending on how it is processed and approached. Considering the status that data may attain in future human society, text visualization, a convergent academic field that starts from the understanding and interpretation of data, needs further methodological research and theoretical accumulation.

A Machine Learning Based Facility Error Pattern Extraction Framework for Smart Manufacturing (스마트제조를 위한 머신러닝 기반의 설비 오류 발생 패턴 도출 프레임워크)

  • Yun, Joonseo;An, Hyeontae;Choi, Yerim
    • The Journal of Society for e-Business Studies / v.23 no.2 / pp.97-110 / 2018
  • With the advent of the fourth industrial revolution, manufacturing companies are increasingly interested in realizing smart manufacturing by utilizing their accumulated facility data. However, most previous research dealt with structured data such as sensor signals, and few studies focused on unstructured data such as text, which actually makes up a large portion of the accumulated data. Therefore, we propose a facility error pattern extraction framework based on association rule mining, in which text data written by operators are analyzed. Specifically, phrases were extracted and used as the unit of text analysis, because a word, which is normally used as the unit, cannot convey the technical meaning of facility errors. The performance of the proposed framework was evaluated on a real-world case, and adopting the framework is expected to enhance the productivity of manufacturing companies.
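
Association rule mining over phrases, as described in this abstract, rests on two quantities: the support of a phrase set and the confidence of a rule between phrases. The following is a minimal sketch restricted to rules between two phrases. The function name `mine_pair_rules`, the thresholds, and the operator-log phrases are hypothetical, and the phrase extraction step itself is not shown.

```python
from itertools import combinations


def mine_pair_rules(transactions, min_support, min_confidence):
    """Return (lhs, rhs, support, confidence) rules between single phrases."""
    sets = [set(t) for t in transactions]
    n = len(sets)
    if n == 0:
        return []

    def support(itemset):
        # Fraction of transactions that contain every phrase in the itemset.
        return sum(itemset <= s for s in sets) / n

    items = sorted(set().union(*sets))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support({lhs})
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules


# Hypothetical phrases extracted from four operator-written error reports.
logs = [
    {"nozzle clogged", "pressure drop"},
    {"nozzle clogged", "pressure drop", "motor overheated"},
    {"motor overheated"},
    {"nozzle clogged", "pressure drop"},
]
rules = mine_pair_rules(logs, min_support=0.5, min_confidence=0.8)
```

Using whole phrases such as "nozzle clogged" as items, rather than the separate words, reflects the abstract's point that single words cannot carry the technical meaning of a facility error.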

Text Classification with Heterogeneous Data Using Multiple Self-Training Classifiers

  • William Xiu Shun Wong;Donghoon Lee;Namgyu Kim
    • Asia pacific journal of information systems / v.29 no.4 / pp.789-816 / 2019
  • Text classification is a challenging task, especially when dealing with a huge amount of text data. The performance of a classification model can vary depending on the types of words contained in the document corpus and the types of features generated for classification. Rather than proposing a modified version of an existing algorithm or creating a new one, we attempt to modify the use of data. Classifier performance is usually affected by the quality of the learning data, since the classifier is built from these training data. We assume that data from different domains may have different noise characteristics, which can be utilized in the process of learning the classifier. Therefore, we attempt to enhance the robustness of the classifier, and thereby improve classification accuracy, by artificially injecting heterogeneous data into the learning process. A semi-supervised approach was applied to utilize the heterogeneous data when learning the document classifier. However, the performance of the document classifier might be degraded by the unlabeled data. Therefore, we further propose an algorithm that extracts only the documents that contribute to improving the accuracy of the classifier.

Quantitative Text Mining for Social Science: Analysis of Immigrant in the Articles (사회과학을 위한 양적 텍스트 마이닝: 이주, 이민 키워드 논문 및 언론기사 분석)

  • Yi, Soo-Jeong;Choi, Doo-Young
    • The Journal of the Korea Contents Association / v.20 no.5 / pp.118-127 / 2020
  • This paper introduces trends and methodological challenges in quantitative Korean text analysis through case studies of academic and news media articles on "migration" and "immigration" published in 2017-2019. Quantitative text analysis is based on natural language processing (NLP) technology and has become an essential tool for social science. It is a part of data science that converts documents into structured data, performs hypothesis discovery and verification on the data, and visualizes the data. Furthermore, we examined the statistical models commonly applied in social-scientific quantitative text analysis, using NLP with R programming and Quanteda.
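
The step of converting documents into structured data usually means building a document-feature matrix of term counts. The following is a minimal Python sketch of that conversion, assuming whitespace tokenization. It is not the R/Quanteda workflow used in the paper, and the function name `document_feature_matrix` and the sample documents are hypothetical.

```python
from collections import Counter


def document_feature_matrix(documents: list[str]):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in doc i."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for absent terms, so every row has the same length.
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical English stand-ins for article texts.
docs = ["migration policy debate", "immigration policy policy"]
vocabulary, matrix = document_feature_matrix(docs)
```

Real Korean text would typically need a morphological analyzer in place of whitespace splitting before the counts become meaningful.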