• Title/Summary/Keyword: text data (텍스트 데이터)

Case Analysis of Bible Visualization based on Text Data Traits -Focused on Content, Structure, Quotation of Text- (텍스트 데이터의 특성에 따른 성경 시각화 사례 분석 -텍스트의 내용적, 구조적 특성 및 인용 정보를 중심으로-)

  • Kim, Hyoyoung;Park, Jin Wan
    • The Journal of the Korea Contents Association / v.13 no.8 / pp.83-92 / 2013
  • Text visualization begins with understanding the text itself, which is the material of visual expression. To visualize any text data, the characteristics of the text must first be sufficiently understood; the expressive approach can then be chosen according to the unique characteristics derived from it. In this research, we aimed to establish a theoretical foundation for text visualization approaches by examining diverse visualization examples derived from various characteristics of text. To do this, we chose the Bible, which is well known globally, whose digital data is easily accessible, and for which diverse visualization examples therefore exist, and we analyzed examples of Bible text visualization. We derived three unique characteristics of the text (content, structure, and quotation) as analysis criteria and supported the validity of the analysis by adopting at least two to three examples for each criterion. The results show that the goals and expressive approaches of a visualization are decided by the unique characteristics of the Bible text. Building on this research, we expect to develop a theoretical method for choosing materials and approaches by analyzing more diverse examples from various points of view.

Summarization of Korean Dialogues through Dialogue Restructuring (대화문 재구조화를 통한 한국어 대화문 요약)

  • Eun Hee Kim;Myung Jin Lim;Ju Hyun Shin
    • Smart Media Journal / v.12 no.11 / pp.77-85 / 2023
  • After COVID-19, communication through online platforms has increased, leading to an accumulation of massive amounts of conversational text data. With the growing importance of summarizing this text data to extract meaningful information, there has been active research on deep learning-based abstractive summarization. However, conversational data, compared to structured texts like news articles, often contains missing or transformed information and therefore has to be considered from multiple perspectives. In particular, vocabulary omissions and unrelated expressions in the conversation can hinder effective summarization. Therefore, in this study, we restructured dialogues in light of the characteristics of Korean conversational data, fine-tuned a pre-trained KoBART-based text summarization model, and improved dialogue summarization performance through a refinement step that removes redundant elements from the summary. To restructure the dialogues, we combined two methods: restructuring sentences based on the order of utterances, and extracting a central speaker and restructuring the conversation around that speaker. As a result, the ROUGE-1 score improved by about 4 points. This study demonstrates that our dialogue restructuring approach, which considers the characteristics of dialogue, is effective in enhancing Korean dialogue summarization performance.
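
For readers unfamiliar with the metric reported above, ROUGE-1 measures unigram overlap between a generated summary and a reference summary. The following is a minimal sketch only: the function name `rouge1` and the whitespace tokenization are assumptions made for illustration, and the abstract does not state which tokenizer or scoring tool the authors used.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """ROUGE-1 precision, recall, and F1 using whitespace tokens (an assumption)."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    # Clipped overlap: each unigram counts at most as often as it appears in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Korean text would normally be split into morphemes or subword tokens before scoring, so whitespace splitting here only keeps the sketch self-contained.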

On-Device Gender Prediction Framework Based on the Development of Discriminative Word and Emoticon Sets (특징적 단어 및 이모티콘 집합을 활용한 모바일 기기 내 성별 예측 프레임워크)

  • Kim, Solee;Choi, Yerim;Kim, Yoonjung;Park, Kyuyon;Park, Jonghun
    • KIISE Transactions on Computing Practices / v.21 no.11 / pp.733-738 / 2015
  • User demographic information is necessary to improve the quality of personalized services such as recommendation systems. Mobile data, especially text data, is known to be effective for predicting user demographic information. However, mobile text data raises privacy concerns, which limits its use. We therefore introduce an on-device gender prediction framework that uses mobile text data while minimizing privacy issues. Discriminative word and emoticon sets for each gender are constructed from web documents written by authors of that gender. Gender is first predicted by comparing the discriminative word set and the discriminative emoticon set with a user's mobile text data, and an ensemble method then combines the two predictions into a final result. In experiments on real-world mobile text data, the proposed on-device framework shows promising results for gender prediction.

A Study on the Use of Stopword Corpus for Cleansing Unstructured Text Data (비정형 텍스트 데이터 정제를 위한 불용어 코퍼스의 활용에 관한 연구)

  • Lee, Won-Jo
    • The Journal of the Convergence on Culture Technology / v.8 no.6 / pp.891-897 / 2022
  • In big data analysis, raw text data mostly exists in various unstructured forms and becomes structured, analyzable data only after heuristic pre-processing and computer-based post-processing cleansing. In this study, therefore, unnecessary elements are removed from the collected raw data during pre-processing so that the wordcloud technique of R, one of the text data analysis techniques, can be applied, and stopwords are removed during post-processing. A wordcloud case study was then conducted, which counts word occurrences and presents high-frequency words as key issues. To address the problems of the "nested stopword source code" method, the existing stopword processing method for the word cloud technique of R, we propose using a "general stopword corpus" and a "user-defined stopword corpus" and conduct a case analysis. The advantages and disadvantages of the proposed "unstructured data cleansing process model" are comparatively verified and presented, along with a practical application of word cloud visualization analysis using the proposed external corpus cleansing technique.
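
The idea of keeping stopwords in external corpora rather than nesting them in the analysis code can be sketched as follows. The study itself uses R; this Python sketch is only an illustration, and the set names `GENERAL_STOPWORDS` and `USER_STOPWORDS`, their contents, and the function `word_frequencies` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stopword corpora. In practice these would be loaded from
# external files so they can be maintained separately from the analysis code.
GENERAL_STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}
USER_STOPWORDS = {"data"}


def word_frequencies(text, general=GENERAL_STOPWORDS, user=USER_STOPWORDS):
    """Count word occurrences after removing general and user-defined stopwords."""
    stopwords = general | user
    tokens = re.findall(r"\w+", text.lower())
    return Counter(token for token in tokens if token not in stopwords)
```

Calling `word_frequencies(text).most_common(20)` would give the high-frequency words that a word cloud renders at the largest sizes.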

Numerical Reasoning Dataset Augmentation Using Large Language Model and In-Context Learning (대규모 언어 모델 및 인컨텍스트 러닝을 활용한 수치 추론 데이터셋 증강)

  • Yechan Hwang;Jinsu Lim;Young-Jun Lee;Ho-Jin Choi
    • Annual Conference on Human and Language Technology / 2023.10a / pp.203-208 / 2023
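  • In this paper, we propose a methodology that uses in-context learning and prompting of large language models to effectively augment datasets for numerical reasoning tasks. To ensure the quality of the generated data, we also add preprocessing that helps the model understand numerical reasoning data and a verification step that filters out outputs that fail to meet the requirements. After augmenting data with this procedure, we trained reasoning models and showed that a model trained on the dataset augmented by our methodology can achieve higher performance than models trained with other augmentation methodologies. In our experiments, the model trained on our augmented data improved by at least 2%p on all metrics over the model trained on the original data, and various cases confirmed that our model can greatly increase the diversity of numerical reasoning training data.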

A Text-Based SMIL Editor with Real-Time Execution (실시간 실행 기능을 포함한 텍스트기반 SMIL 문서편집기)

  • 김정훈;김은혜;채진석
    • Proceedings of the Korea Multimedia Society Conference / 2000.04a / pp.445-448 / 2000
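  • XML began as an effort to overcome both the simplicity of HTML and the complexity of SGML, and it is being actively studied in many areas related to Internet document representation. SMIL is an XML-based language for representing multimedia data. Although few web browsers support it natively so far, SMIL's ability to synchronize and present diverse multimedia data suggests that it will become an important standard for representing and transmitting multimedia data. In this paper, we design and implement a text-based SMIL document editor with a real-time execution function, so that when editing multimedia data with SMIL, authors can preview the execution result of the SMIL document under construction and apply it back to the editing of the document.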

A Study on the Method for Extracting the Purpose-Specific Customized Information from Online Product Reviews based on Text Mining (텍스트 마이닝 기반의 온라인 상품 리뷰 추출을 통한 목적별 맞춤화 정보 도출 방법론 연구)

  • Kim, Joo Young;Kim, Dong soo
    • The Journal of Society for e-Business Studies / v.21 no.2 / pp.151-161 / 2016
  • In the era of Web 2.0, characterized by openness, sharing, and participation, internet users can easily produce and share data. The amount of unstructured data, which accounts for most of the digital world's data, has increased exponentially. Online product reviews written by individuals, one kind of unstructured data, are valuable both to the companies that produce the products and to potential customers interested in them. Extracting useful information from large amounts of scattered review data requires a process of collecting, storing, preprocessing, and analyzing the data and drawing conclusions. We therefore introduce a text-mining methodology that applies natural language processing technology to text-format data such as product reviews in order to extract structured data using R programming. We also introduce a data-mining step that derives purpose-specific customized information from the structured review information produced by text mining.

A Study on Extracting the Document Text for Unallocated Areas of Data Fragments (비할당 영역 데이터 파편의 문서 텍스트 추출 방안에 관한 연구)

  • Yoo, Byeong-Yeong;Park, Jung-Heum;Bang, Je-Wan;Lee, Sang-Jin
    • Journal of the Korea Institute of Information Security & Cryptology / v.20 no.6 / pp.43-51 / 2010
  • Investigating data in unallocated space is meaningful because it allows deleted data to be examined. File carving can recover files that are stored contiguously and completely in the unallocated area, but it cannot recover noncontiguous or incomplete data. Analysis of data fragments is nevertheless needed, because they are likely to contain large amounts of information. The text of Microsoft Word, Excel, PowerPoint, and PDF document files is stored using compression or a specific document format. If part of such a document file is stored in an unallocated data fragment, its text can be extracted by using the specific document format. In this paper, we propose a method for extracting the text of particular document files from unallocated data fragments.

A Techniques to Conceal Information Using Eojeol in Hangul Text Steganography (한글 텍스트 스테가노그래피에서 어절을 이용한 정보은닉 기법)

  • Ji, Seon Su
    • Journal of Korea Society of Industrial Information Systems / v.22 no.5 / pp.9-15 / 2017
  • In the digital age, all data used on the Internet is digitized and transmitted and received over a communications network. It is therefore important to transmit data with confidentiality and integrity, since digital data may be tampered with by unauthorized users. Steganography is an efficient method for ensuring confidentiality and integrity together with encryption techniques. I propose a Hangul steganography method that inserts a secret message based on a changing insertion position and a changing eojeol size in a cover medium. Considering the insertion capacity of 3.35% and the file size change of 0.4% in Hangul text steganography, experimental results show that the Jaro_score value needs to be maintained at 0.946.
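
The Jaro_score mentioned above presumably refers to the Jaro string similarity, which ranges from 0 (no similarity) to 1 (identical strings). The sketch below implements the standard Jaro definition; how the paper applies it (for example, which pair of texts is compared) is not stated in the abstract, so that usage is left as an assumption.

```python
def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3
```

Under that assumption, a score near 1 between a cover text and its stego text would indicate that the embedding changed the text only slightly.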

Study of the text analysis and feature selection performance for emotional inference (텍스트 기반 감정 추정을 위한 특징 추출 및 선택기법에 따른 성능 연구)

  • Kim, Hanjoo;Ha, Heonseok;Park, Seunghyun;Yoon, Sungroh
    • Proceedings of the Korea Information Processing Society Conference / 2014.11a / pp.876-878 / 2014
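  • As Internet usage surges and the amount of user-generated data grows, user data analysis is being extended beyond the search and analysis of objective information to the analysis of subjective emotions. Such emotion analysis can be applied across diverse fields such as business, public administration, and diplomacy. In this study, we take text data as the main object of analysis, extract features from various elements of sentence composition, and infer the emotion carried by the text by training various support vector machines on the featurized sentences. We apply a variety of feature extraction methods and, to reduce the dimensionality of the data matrix, which is expected to be sparse, we apply an information-entropy-based feature selection technique.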
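
One common form of entropy-based feature selection is ranking features by information gain, that is, by how much knowing a feature's value reduces the entropy of the class labels. The sketch below shows that technique for discrete features. It is an illustration under that assumption and not necessarily the criterion used in the paper; the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def select_top_k(rows, labels, k):
    """Return the indices of the k feature columns with the highest information gain."""
    n_features = len(rows[0])
    gains = [information_gain([row[j] for row in rows], labels) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: gains[j], reverse=True)[:k]
```

The selected column indices can then be used to shrink the sparse feature matrix before training a support vector machine.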