• Title/Summary/Keyword: Unstructured data mining

Research Trends in Record Management Using Unstructured Text Data Analysis (비정형 텍스트 데이터 분석을 활용한 기록관리 분야 연구동향)

  • Deokyong Hong;Junseok Heo
    • Journal of Korean Society of Archives and Records Management / v.23 no.4 / pp.73-89 / 2023
  • This study analyzes the frequency of keywords in Korean abstracts, which are unstructured text data in the domestic record management research field, using text mining techniques, and identifies domestic record management research trends through distance analysis between keywords. To this end, 1,157 keywords of 77,578 journals were visualized by extracting 1,157 articles from 7 journal types (28 types) searched by major category (complex study) and middle category (literature informatics) from the institutional statistics (registered site, candidate site) of the Korean Citation Index (KCI). The analysis used Word2vec together with t-Distributed Stochastic Neighbor Embedding (t-SNE) and Scattertext. As a result of the analysis, first, it was confirmed that keywords such as "record management" (889 times), "analysis" (888 times), "archive" (742 times), "record" (562 times), and "utilization" (449 times) were treated as significant topics by researchers. Second, Word2vec analysis generated vector representations of keywords, and similarity distances between them were investigated and visualized using t-SNE and Scattertext. In the visualization results, the research area for record management was divided into two groups, with keywords such as "archiving," "national record management," "standardization," "official documents," and "record management systems" occurring frequently in the first group (past). In contrast, keywords such as "community," "data," "record information service," "online," and "digital archives" in the second group (current) were receiving substantial attention.

Analysis of the National Police Agency business trends using text mining (텍스트 마이닝 기법을 이용한 경찰청 업무 트렌드 분석)

  • Sun, Hyunseok;Lim, Changwon
    • The Korean Journal of Applied Statistics / v.32 no.2 / pp.301-317 / 2019
  • There has been significant research on how to discover insights from text data using statistical techniques. In this study we analyzed text data produced by the Korean National Police Agency to identify trends in its work by year and to compare work characteristics among local authorities by identifying distinctive keywords in the documents produced by each local authority. Each data set was preprocessed according to its characteristics, and word frequencies were calculated for each document in order to draw meaningful conclusions. Simple term frequency within a document does not adequately describe the characteristics of keywords; therefore, the frequency of each term was recalculated using term frequency-inverse document frequency (TF-IDF) weights. L2 norm normalization was used to compare word frequencies. The analysis can serve as basic data for future police work improvement policies and as a method for improving the efficiency of the police service; it also helps identify demand for improvements in indoor work.
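
The weighting described in this abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization, the plain idf = log(N / df) variant, and three invented documents standing in for texts from different local authorities; the function name `tfidf_l2` is hypothetical, and none of this is the authors' actual pipeline.

```python
import math
from collections import Counter

def tfidf_l2(docs):
    """Return one {term: weight} dict per document, scaled to unit L2 norm."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed variant: weight = tf * log(N / df). A term that occurs
        # in every document therefore gets weight 0.
        weights = {t: c * math.log(n_docs / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: (w / norm if norm else 0.0) for t, w in weights.items()})
    return vectors

# Hypothetical documents, not data from the study.
docs = [
    "patrol traffic patrol report",
    "traffic accident report",
    "cyber crime investigation report",
]
vectors = tfidf_l2(docs)
top_term = max(vectors[0], key=vectors[0].get)
```

Because "report" appears in every document, its weight drops to zero, while "patrol" becomes the most distinctive term of the first document. The L2 step puts documents of different lengths on the same scale so their weights can be compared.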

Analysis of Domestic Security Solution Market Trend using Big Data (빅데이터를 활용한 국내 보안솔루션 시장 동향 분석)

  • Park, Sangcheon;Park, Dongsoo
    • Journal of the Korea Academia-Industrial cooperation Society / v.20 no.5 / pp.492-501 / 2019
  • To use systems safely in cyberspace, a security solution appropriate to the situation is needed. To strengthen cyber security, it is necessary to accurately understand the flow of security from past to present and to prepare for various future threats. In this study, information security terms were collected from the security/hacking news section of Naver News, a reliable source, and analyzed using text mining. First, we checked the number of security news articles over the past seven years and analyzed the trends. Second, after confirming the security/hacking word rankings, we identified the major concerns of each year. Third, we analyzed the words associated with each security solution to see which security group is of interest. Fourth, after separating the title and the body of the security news, security-related words were extracted and analyzed. Fifth, we confirmed trends by detailed security solution. Lastly, annual revenue and security word frequencies were analyzed. Through this big-data news analysis, we survey overall awareness of security solutions and analyze a large volume of unstructured data to identify current market trends and provide information that can help predict the future.

Performance Comparison of Naive Bayesian Learning and Centroid-Based Classification for e-Mail Classification (전자메일 분류를 위한 나이브 베이지안 학습과 중심점 기반 분류의 성능 비교)

  • Kim, Kuk-Pyo;Kwon, Young-S.
    • IE interfaces / v.18 no.1 / pp.10-21 / 2005
  • With the increasing proliferation of the World Wide Web, electronic mail systems have become very widely used communication tools. Research on e-mail classification is important because an e-mail classification system is a major engine of e-mail response management systems, which mine unstructured e-mail messages and automatically categorize them. In this research we compare the performance of Naive Bayesian learning and Centroid-Based Classification using different data sets from an online shopping mall and a credit card company. We analyze which method performs better under which conditions. We compare their classification accuracy, which depends on the structure and size of the training set and on the number of classes. The experimental results indicate that Naive Bayesian learning performs better, while Centroid-Based Classification is more robust in terms of classification accuracy.
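
To make the two methods named in this abstract concrete, the following is a minimal sketch of each. It assumes a multinomial Naive Bayes model with add-one smoothing and a centroid classifier that averages raw term counts and compares by cosine similarity; the paper's exact variants are not stated in the abstract. The training messages, labels, and function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(samples):
    """samples: list of (tokens, label). Returns counts for multinomial NB."""
    class_docs = Counter(label for _, label in samples)
    term_counts = defaultdict(Counter)
    for tokens, label in samples:
        term_counts[label].update(tokens)
    vocab = {t for tokens, _ in samples for t in tokens}
    return class_docs, term_counts, vocab

def predict_naive_bayes(model, tokens):
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n in class_docs.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(n / total_docs)  # log prior
        for t in tokens:
            if t in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing of the term likelihood.
                score += math.log((term_counts[label][t] + 1) / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def train_centroids(samples):
    """Average the term-count vectors of each class into one centroid."""
    sums, counts = defaultdict(Counter), Counter()
    for tokens, label in samples:
        sums[label].update(tokens)
        counts[label] += 1
    return {lab: {t: c / counts[lab] for t, c in vec.items()} for lab, vec in sums.items()}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def predict_centroid(centroids, tokens):
    vec = Counter(tokens)
    return max(centroids, key=lambda lab: cosine(vec, centroids[lab]))

# Hypothetical e-mail snippets, not data from the study.
train = [
    ("refund order delivery late".split(), "shipping"),
    ("delivery address change order".split(), "shipping"),
    ("card payment declined".split(), "billing"),
    ("payment charged twice card".split(), "billing"),
]
nb_model = train_naive_bayes(train)
centroids = train_centroids(train)
query = "order delivery delayed".split()
nb_label = predict_naive_bayes(nb_model, query)
centroid_label = predict_centroid(centroids, query)
```

On this toy input both classifiers assign the query to "shipping". The example only shows how each decision is computed and says nothing about their relative accuracy.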

Pilot Experiment for Named Entity Recognition of Construction-related Organizations from Unstructured Text Data

  • Baek, Seungwon;Han, Seung H.;Jung, Wooyong;Kim, Yuri
    • International conference on construction engineering and project management / 2022.06a / pp.847-854 / 2022
  • The aim of this study is to develop a Named Entity Recognition (NER) model to automatically identify construction-related organizations from news articles. News articles were collected using a web crawling technique, and construction-related organizations were labeled in a total of 1,000 news articles. The Bidirectional Encoder Representations from Transformers (BERT) model was used to recognize clients, constructors, consultants, engineers, and others. In this pilot experiment, the best average F1 score of the NER model was 0.692. The result of this study is expected to contribute to the establishment of international business strategies by collecting timely information and analyzing it automatically.
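
For readers unfamiliar with the metric reported here, the sketch below shows one common way an average F1 score for NER can be computed. It assumes exact-match scoring of entity spans and an unweighted (macro) average over entity types; the abstract does not say which averaging the authors used. The spans and the function name `f1_per_type` are invented for illustration.

```python
def f1_per_type(gold, predicted):
    """gold/predicted: sets of (doc_id, start, end, entity_type) spans.

    A predicted span counts as correct only if it matches a gold span exactly.
    """
    scores = {}
    for etype in {span[3] for span in gold | predicted}:
        g = {s for s in gold if s[3] == etype}
        p = {s for s in predicted if s[3] == etype}
        tp = len(g & p)
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        total = precision + recall
        scores[etype] = 2 * precision * recall / total if total else 0.0
    return scores

# Hypothetical annotations, not data from the study.
gold = {
    (0, 0, 2, "client"), (0, 5, 7, "constructor"),
    (1, 3, 4, "client"), (1, 8, 10, "consultant"),
}
predicted = {
    (0, 0, 2, "client"), (0, 5, 7, "constructor"),
    (1, 3, 5, "client"),    # wrong boundary, so not an exact match
    (1, 8, 10, "engineer"), # right span, wrong type
}
scores = f1_per_type(gold, predicted)
macro_f1 = sum(scores.values()) / len(scores)
```

Here one of two "client" spans is recovered (F1 of 0.5), "constructor" is perfect, and the mislabeled span hurts both "consultant" and "engineer", which is why type confusion lowers an averaged F1 quickly.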

Investigating the Impact of Corporate Social Responsibility on Firm's Short- and Long-Term Performance with Online Text Analytics (온라인 텍스트 분석을 통해 추정한 기업의 사회적책임 성과가 기업의 단기적 장기적 성과에 미치는 영향 분석)

  • Lee, Heesung;Jin, Yunseon;Kwon, Ohbyung
    • Journal of Intelligence and Information Systems / v.22 no.2 / pp.13-31 / 2016
  • Despite expectations of short- or long-term positive effects of corporate social responsibility (CSR) on firm performance, the results of existing research into this relationship are inconsistent, partly due to a lack of clarity about subordinate CSR concepts. In this study, keywords related to CSR concepts are extracted from atypical sources, such as newspapers, using text mining techniques to examine the relationship between CSR and firm performance. The analysis is based on data from the New York Times, a major news publication, and Google Scholar. We used text analytics to process unstructured data collected from open online documents to explore the effects of CSR on short- and long-term firm performance. The results suggest that the CSR index computed using the proposed text analytics of online media predicts long-term performance considerably better than short-term performance, in the absence of any internal firm reports or CSR institute reports. Our study demonstrates that text analytics is useful for evaluating CSR performance in terms of convenience and cost effectiveness.

The Effect of Medical Service Design Thinking Teaching-learning on Empathic Problem Solving Ability: Convergence Analysis of Structured and Unstructured Data (의료서비스 디자인싱킹 교육의 공감적 문제해결능력 향상 효과: 정형 및 비정형 데이터 융복합 분석 중심으로)

  • Yoo, Jin-Yeong
    • Journal of Digital Convergence / v.18 no.6 / pp.311-321 / 2020
  • The purpose of this study is to verify the effect of Medical Service Design Thinking (MSDT), applied in an undergraduate SNS hospital marketing course, on the Empathic Problem Solving Ability (EPSA) of Freshman Preliminary Health Administrators (FPHA). A pre-post questionnaire survey was conducted with 39 freshmen in the Department of Health Administration at a college in Daegu after MSDT was applied for 15 weeks from September to December 2019. MSDT had a positive influence on the empathic imagination, empathic interest, and empathic awakening components of the FPHA's EPSA. In the analysis of key common words, the use of neutral and negative words was low, while the use of positive words was high. To systematically build empathic problem-solving job competency in the age of artificial intelligence, it is meaningful to develop a program for the freshman curriculum and to analyze structured and unstructured data to verify its effectiveness. Additional program development research is needed for the application of theoretical subjects.

A Study on the Quantitative Evaluation of Initial Coin Offering (ICO) Using Unstructured Data (비정형 데이터를 이용한 ICO(Initial Coin Offering) 정량적 평가 방법에 대한 연구)

  • Lee, Han Sol;Ahn, Sangho;Kang, Juyoung
    • Smart Media Journal / v.11 no.5 / pp.63-74 / 2022
  • Initial public offerings (IPOs) have a legal framework for investor protection and various quantitative evaluation factors, so objective analysis is possible and various studies have been conducted. Crowdfunding also has several mechanisms to prevent indiscriminate funding as part of its legal system for investor protection. In contrast, the blockchain-based initial coin offering (ICO), which has recently been in the spotlight, has ambiguous legal means and standards for protecting investors and lacks quantitative methods for evaluating ICOs objectively. Therefore, this study collects ICO white papers published online to detect fraud in ICOs, performs ICO fraud prediction based on BERT, a text embedding technique, compares the results with an existing Random Forest machine learning technique, and shows the potential for fraud detection. Finally, this study is expected to contribute to research on ICO fraud detection by presenting the possibility of a quantitative approach that uses unstructured data to identify fraud in ICOs.

A study on Korean language processing using TF-IDF (TF-IDF를 활용한 한글 자연어 처리 연구)

  • Lee, Jong-Hwa;Lee, MoonBong;Kim, Jong-Weon
    • The Journal of Information Systems / v.28 no.3 / pp.105-121 / 2019
  • Purpose: One of the reasons for the expansion of information systems in the enterprise is the increased efficiency of data analysis. In particular, complex and unstructured data types such as video, voice, images, and conversations in and out of social networks are increasing rapidly. The purpose of this study is to analyze customer needs from customer voices, i.e., text data, in the web environment. Design/methodology/approach: Previous studies suggest that extracting the words that best interpret a sentence is more effective than simple frequency analysis. In this study, we applied the TF-IDF method, which extracts important keywords in real sentences, rather than the TF method, a word extraction technique that represents sentences by simple frequency only, to Korean language research. We visualized the two techniques by cluster analysis and describe the difference. Findings: The TF and TF-IDF techniques were applied to Korean natural language processing. The research showed the value of moving from frequency analysis to semantic analysis, and it is expected to change the techniques used by Korean language processing researchers.

A Research on Difference Between Consumer Perception of Slow Fashion and Consumption Behavior of Fast Fashion: Application of Topic Modelling with Big Data

  • YANG, Oh-Suk;WOO, Young-Mok;YANG, Yae-Rim
    • The Journal of Economics, Marketing and Management / v.9 no.1 / pp.1-14 / 2021
  • Purpose: The article deals with the proposition that consumers' fashion consumption behavior will still follow the consumption behavior of fast fashion, despite recognizing the importance of slow fashion. Research design, data and methodology: The research model used to verify this proposition is topic modelling with big data, including unstructured textual data. We combined 5,506 news articles about fast fashion and slow fashion posted on the Naver news search platform during the 2003-2019 period, derived high-frequency words, and found topics using an LDA model. Based on these, we examined consumers' perception of and consumption behavior toward slow fashion through Topic Network analysis. Results: (1) Looking at the annual number of articles collected, consumers' interest in slow fashion mainly began in 2005 and showed a steady increase up to 2019. (2) Term Frequency analysis showed that the keywords for slow fashion are the lowest, with consumers' consumption patterns continuing to center on 'brand.' (3) Each topic's weight in the articles showed that 'social value', which includes slow fashion, ranked sixth among the 9 topics and had low linkage with other topics. (4) Lastly, 'brand' and 'fashion trend' were key topics, and the topic 'social value' accounted for a low proportion. Conclusion: Slow fashion was not a considerable factor in consumption behavior. Consumption patterns in the fashion sector are still dominated by general consumption patterns centered on brands and fast fashion.