• Title/Summary/Keyword: Unstructured data analysis

Search Result 422, Processing Time 0.026 seconds

A Study on Data Cleansing Techniques for Word Cloud Analysis of Text Data (텍스트 데이터 워드클라우드 분석을 위한 데이터 정제기법에 관한 연구)

  • Lee, Won-Jo
    • The Journal of the Convergence on Culture Technology
    • /
    • v.7 no.4
    • /
    • pp.745-750
    • /
    • 2021
  • In Big data visualization analysis of unstructured text data, raw data is mostly large-capacity, and analysis techniques cannot be applied without cleansing it unstructured. Therefore, from the collected raw data, unnecessary data is removed through the first heuristic cleansing process and Stopwords are removed through the second machine cleansing process. Then, the frequency of the vocabulary is calculated, visualized using the word cloud technique, and key issues are extracted and informationalized, and the results are analyzed. In this study, we propose a new Stopword cleansing technique using an external Stopword set (DB) in Python word cloud, and derive the problems and effectiveness of this technique through practical case analysis. And, through this verification result, the utility of the practical application of word cloud analysis applying the proposed cleansing technique is presented.

User-Perspective Issue Clustering Using Multi-Layered Two-Mode Network Analysis (다계층 이원 네트워크를 활용한 사용자 관점의 이슈 클러스터링)

  • Kim, Jieun;Kim, Namgyu;Cho, Yoonho
    • Journal of Intelligence and Information Systems
    • /
    • v.20 no.2
    • /
    • pp.93-107
    • /
    • 2014
  • In this paper, we report what we have observed with regard to user-perspective issue clustering based on multi-layered two-mode network analysis. This work is significant in the context of data collection by companies about customer needs. Most companies have failed to uncover such needs for products or services properly in terms of demographic data such as age, income levels, and purchase history. Because of excessive reliance on limited internal data, most recommendation systems do not provide decision makers with appropriate business information for current business circumstances. However, part of the problem is the increasing regulation of personal data gathering and privacy. This makes demographic or transaction data collection more difficult, and is a significant hurdle for traditional recommendation approaches because these systems demand a great deal of personal data or transaction logs. Our motivation for presenting this paper to academia is our strong belief, and evidence, that most customers' requirements for products can be effectively and efficiently analyzed from unstructured textual data such as Internet news text. In order to derive users' requirements from textual data obtained online, the proposed approach in this paper attempts to construct double two-mode networks, such as a user-news network and news-issue network, and to integrate these into one quasi-network as the input for issue clustering. One of the contributions of this research is the development of a methodology utilizing enormous amounts of unstructured textual data for user-oriented issue clustering by leveraging existing text mining and social network analysis. In order to build multi-layered two-mode networks of news logs, we need some tools such as text mining and topic analysis. We used not only SAS Enterprise Miner 12.1, which provides a text miner module and cluster module for textual data analysis, but also NetMiner 4 for network visualization and analysis. Our approach for user-perspective issue clustering is composed of six main phases: crawling, topic analysis, access pattern analysis, network merging, network conversion, and clustering. In the first phase, we collect visit logs for news sites by crawler. After gathering unstructured news article data, the topic analysis phase extracts issues from each news article in order to build an article-news network. For simplicity, 100 topics are extracted from 13,652 articles. In the third phase, a user-article network is constructed with access patterns derived from web transaction logs. The double two-mode networks are then merged into a quasi-network of user-issue. Finally, in the user-oriented issue-clustering phase, we classify issues through structural equivalence, and compare these with the clustering results from statistical tools and network analysis. An experiment with a large dataset was performed to build a multi-layer two-mode network. After that, we compared the results of issue clustering from SAS with that of network analysis. The experimental dataset was from a web site ranking site, and the biggest portal site in Korea. The sample dataset contains 150 million transaction logs and 13,652 news articles of 5,000 panels over one year. User-article and article-issue networks are constructed and merged into a user-issue quasi-network using Netminer. Our issue-clustering results applied the Partitioning Around Medoids (PAM) algorithm and Multidimensional Scaling (MDS), and are consistent with the results from SAS clustering. In spite of extensive efforts to provide user information with recommendation systems, most projects are successful only when companies have sufficient data about users and transactions. Our proposed methodology, user-perspective issue clustering, can provide practical support to decision-making in companies because it enhances user-related data from unstructured textual data. To overcome the problem of insufficient data from traditional approaches, our methodology infers customers' real interests by utilizing web transaction logs. In addition, we suggest topic analysis and issue clustering as a practical means of issue identification.

Analysis of Adverse Drug Reaction Reports using Text Mining (텍스트마이닝을 이용한 약물유해반응 보고자료 분석)

  • Kim, Hyon Hee;Rhew, Kiyon
    • Korean Journal of Clinical Pharmacy
    • /
    • v.27 no.4
    • /
    • pp.221-227
    • /
    • 2017
  • Background: As personalized healthcare industry has attracted much attention, big data analysis of healthcare data is essential. Lots of healthcare data such as product labeling, biomedical literature and social media data are unstructured, extracting meaningful information from the unstructured text data are becoming important. In particular, text mining for adverse drug reactions (ADRs) reports is able to provide signal information to predict and detect adverse drug reactions. There has been no study on text analysis of expert opinion on Korea Adverse Event Reporting System (KAERS) databases in Korea. Methods: Expert opinion text of KAERS database provided by Korea Institute of Drug Safety & Risk Management (KIDS-KD) are analyzed. To understand the whole text, word frequency analysis are performed, and to look for important keywords from the text TF-IDF weight analysis are performed. Also, related keywords with the important keywords are presented by calculating correlation coefficient. Results: Among total 90,522 reports, 120 insulin ADR report and 858 tramadol ADR report were analyzed. The ADRs such as dizziness, headache, vomiting, dyspepsia, and shock were ranked in order in the insulin data, while the ADR symptoms such as vomiting, 어지러움, dizziness, dyspepsia and constipation were ranked in order in the tramadol data as the most frequently used keywords. Conclusion: Using text mining of the expert opinion in KIDS-KD, frequently mentioned ADRs and medications are easily recovered. Text mining in ADRs research is able to play an important role in detecting signal information and prediction of ADRs.

An Automatically Extracting Formal Information from Unstructured Security Intelligence Report (비정형 Security Intelligence Report의 정형 정보 자동 추출)

  • Hur, Yuna;Lee, Chanhee;Kim, Gyeongmin;Jo, Jaechoon;Lim, Heuiseok
    • Journal of Digital Convergence
    • /
    • v.17 no.11
    • /
    • pp.233-240
    • /
    • 2019
  • In order to predict and respond to cyber attacks, a number of security companies quickly identify the methods, types and characteristics of attack techniques and are publishing Security Intelligence Reports(SIRs) on them. However, the SIRs distributed by each company are huge and unstructured. In this paper, we propose a framework that uses five analytic techniques to formulate a report and extract key information in order to reduce the time required to extract information on large unstructured SIRs efficiently. Since the SIRs data do not have the correct answer label, we propose four analysis techniques, Keyword Extraction, Topic Modeling, Summarization, and Document Similarity, through Unsupervised Learning. Finally, has built the data to extract threat information from SIRs, analysis applies to the Named Entity Recognition (NER) technology to recognize the words belonging to the IP, Domain/URL, Hash, Malware and determine if the word belongs to which type We propose a framework that applies a total of five analysis techniques, including technology.

Qualitative Data Analysis using Computers (컴퓨터를 이용한 질적 자료 분석)

  • Yi Myung-Sun
    • Journal of Korean Academy of Fundamentals of Nursing
    • /
    • v.6 no.3
    • /
    • pp.570-582
    • /
    • 1999
  • Although computers cannot analyze textual data in the same way as they analyze numerical data. they can nevertheless be of great assistance to qualitative researchers. Thus, the use of computers in analyzing qualitative data has increased since the 1980s. The purpose of this article was to explore advantages and disadvanteges of using computers to analyze textual data and to suggest strategies to prevent problems of using computers. In additon, it illustrated characteristics and functions of softwares designed to analyze qualitative data to help researchers choose the program wisely. It also demonstrated precise functions and procedures of the NUDIST program which was designed to develop a conceptual framework or grounded theory from unstructured data. Major advantage of using computers in qualitative research is the management of huge amount of unstructured data. By managing overloaded data, researcher can keep track of the emerging ideas, arguments and theoretical concepts and can organize these tasks mope efficiently than the traditional method of 'cut-and-paste' technique. Additional advantages are the abilities to increase trustworthiness of research, transparency of research process, and intuitional creativity of the researcher, and to facilitate team and secondary research. On the other hand, disvantages of using computers were identified as worries that the machine could conquer the human understanding and as probability of these problems. it suggested strategies such as 1) deep understanding of orthodoxy in analytical process. To overcome philosophical and theoretical background of qualitative research method, 2) deep understanding of the data as a whole before using software, 3) use of software after familiarity with it, 4) continuous evaluation of software and feedback from them, and 5) continuous awareness of the limitation of the machine, that is computer, in the interpretive analysis.

  • PDF

Comparative Analysis of Low Fertility Response Policies (Focusing on Unstructured Data on Parental Leave and Child Allowance) (저출산 대응 정책 비교분석 (육아휴직과 아동수당의 비정형 데이터 중심으로))

  • Eun-Young Keum;Do-Hee Kim
    • The Journal of the Convergence on Culture Technology
    • /
    • v.9 no.5
    • /
    • pp.769-778
    • /
    • 2023
  • This study compared and analyzed parental leave and child allowance, two major policies among solutions to the current serious low fertility rate problem, using unstructured data, and sought future directions and implications for related response policies based on this. The collection keywords were "low fertility + parental leave" and "low fertility + child allowance", and data analysis was conducted in the following order: text frequency analysis, centrality analysis, network visualization, and CONCOR analysis. As a result of the analysis, first, parental leave was found to be a realistic and practical policy in response to low fertility rates, as data analysis showed more diverse and systematic discussions than child allowance. Second, in terms of child allowance, data analysis showed that there was a high level of information and interest in the cash grant benefit system, including child allowance, but there were no other unique features or active discussions. As a future improvement plan, both policies need to utilize the existing system. First, parental leave requires improvement in the working environment and blind spots in order to expand the system, and second, child allowance requires a change in the form of payment that deviates from the uniform and biased system. should be sought, and it was proposed to expand the target age.

Suggestions on how to convert official documents to Machine Readable (공문서의 기계가독형(Machine Readable) 전환 방법 제언)

  • Yim, Jin Hee
    • The Korean Journal of Archival Studies
    • /
    • no.67
    • /
    • pp.99-138
    • /
    • 2021
  • In the era of big data, analyzing not only structured data but also unstructured data is emerging as an important task. Official documents produced by government agencies are also subject to big data analysis as large text-based unstructured data. From the perspective of internal work efficiency, knowledge management, records management, etc, it is necessary to analyze big data of public documents to derive useful implications. However, since many of the public documents currently held by public institutions are not in open format, a pre-processing process of extracting text from a bitstream is required for big data analysis. In addition, since contextual metadata is not sufficiently stored in the document file, separate efforts to secure metadata are required for high-quality analysis. In conclusion, the current official documents have a low level of machine readability, so big data analysis becomes expensive.

An Exploratory Study on the Prediction of Business Survey Index Using Data Mining (기업경기실사지수 예측에 대한 탐색적 연구: 데이터 마이닝을 이용하여)

  • Kyungbo Park;Mi Ryang Kim
    • Journal of Information Technology Services
    • /
    • v.22 no.4
    • /
    • pp.123-140
    • /
    • 2023
  • In recent times, the global economy has been subject to increasing volatility, which has made it considerably more difficult to accurately predict economic indicators compared to previous periods. In response to this challenge, the present study conducts an exploratory investigation that aims to predict the Business Survey Index (BSI) by leveraging data mining techniques on both structured and unstructured data sources. For the structured data, we have collected information regarding foreign, domestic, and industrial conditions, while the unstructured data consists of content extracted from newspaper articles. By employing an extensive set of 44 distinct data mining techniques, our research strives to enhance the BSI prediction accuracy and provide valuable insights. The results of our analysis demonstrate that the highest predictive power was attained when using data exclusively from the t-1 period. Interestingly, this suggests that previous timeframes play a vital role in forecasting the BSI effectively. The findings of this study hold significant implications for economic decision-makers, as they will not only facilitate better-informed decisions but also serve as a robust foundation for predicting a wide range of other economic indicators. By improving the prediction of crucial economic metrics, this study ultimately aims to contribute to the overall efficacy of economic policy-making and decision processes.

A Study on Ontology Modeling for Weapon Parts Development Information (무기체계 부품국산화 정보의 온톨로지 구축방안 연구)

  • Jang, Woo Hyuk
    • Journal of Korea Multimedia Society
    • /
    • v.18 no.7
    • /
    • pp.873-885
    • /
    • 2015
  • Today, It is difficult to search the various and numerous information efficiently. For this reason, Semantic Web emerged to provide searching services more easily through the structuring of a variety of unstructured format data and the definition of meaningful relationships between information. Especially, definition of relationship and meaning among resources is significant to share and infer related information. Ontology modeling plays just that role. Weapon parts development information is unstructured and dispersed all over. There are many difficulties in finding desired information, leading to getting improper outcomes. In this paper, we present an intuitive ontology model with weapon parts development information including the multi-dimensional information analysis and expansion of the relevant information. This study build up a ontology model through creating class and hierarchy about parts information and defining the properties of classes with Ontology Development 101[1] procedures using Protégé tools. The ontology model provides users with a platform on which search of needed information can be easy and efficient.

Numerical Analysis of Helicopter Rotor Blade in Forward Flight Using Unstructured Adaptive Meshes (비정렬 적응격자 기법을 이용한 전진비행하는 헬리콥터 로터 블레이드의 수치 해석)

  • Park Y. M.;Lee J. Y.;Kwon O. J.
    • 한국전산유체공학회:학술대회논문집
    • /
    • 2003.08a
    • /
    • pp.95-101
    • /
    • 2003
  • A three dimensional inviscid parallel flow solver has been developed for the simulation of rotor blades in forward flight. The computational domain is divided into stationary and rotating zones for the more efficient mesh adaptation. The conservative mesh treatment algorithm is used for the convection of flow variables and fluxes across the sliding boundary. A deforming mesh algorithm using modified spring analogy is used for the blade motion. In the present paper, detail descriptions of numerical analysis for forward flight are introduced. Some results are presented for a two bladed AH-1G rotor and compared with experimental data.

  • PDF