• Title/Summary/Keyword: unstructured data analysis

Multi-Vector Document Embedding Using Semantic Decomposition of Complex Documents (복합 문서의 의미적 분해를 통한 다중 벡터 문서 임베딩 방법론)

  • Park, Jongin;Kim, Namgyu
    • Journal of Intelligence and Information Systems / v.25 no.3 / pp.19-41 / 2019
  • With demand for text data analysis increasing rapidly, research and investment in text mining are being actively pursued not only in academia but also in various industries. Text mining is generally conducted in two steps. In the first step, the text of the collected documents is tokenized and structured to convert the original documents into a computer-readable form. In the second step, tasks such as document classification, clustering, and topic modeling are conducted according to the purpose of the analysis. Until recently, text mining studies focused on applications of the second step. However, with the discovery that the text structuring process substantially influences the quality of the analysis results, various embedding methods have been actively studied to improve that quality by preserving the meaning of words and documents when text data is represented as vectors. Unlike structured data, which can be directly applied to a variety of operations and traditional analysis techniques, unstructured text must first undergo a structuring task that transforms the original document into a form the computer can understand. Mapping arbitrary objects into a space of a specific dimension while maintaining their algebraic properties, in order to structure text data, is called "embedding." Recently, attempts have been made to embed not only words but also sentences, paragraphs, and entire documents. In particular, as demand for document embedding has increased rapidly, many algorithms have been developed to support it. Among them, doc2Vec, which extends word2Vec and embeds each document into one vector, is the most widely used. However, traditional document embedding methods, represented by doc2Vec, generate a vector for each document using all the words included in the document. 
As a result, the document vector is affected not only by core words but also by miscellaneous words. Additionally, traditional document embedding schemes usually map each document to a single corresponding vector. It is therefore difficult to represent a complex document with multiple subjects accurately as a single vector using the traditional approach. In this paper, we propose a new multi-vector document embedding method to overcome these limitations. This study targets documents that explicitly separate body content and keywords. For a document without keywords, the method can be applied after keywords are extracted through various analysis methods. However, since keyword extraction is not the core subject of the proposed method, we describe the process of applying the method to documents that predefine keywords in the text. The proposed method consists of (1) Parsing, (2) Word Embedding, (3) Keyword Vector Extraction, (4) Keyword Clustering, and (5) Multiple-Vector Generation. The specific process is as follows. First, all text in a document is tokenized, and each token is represented as an N-dimensional real-valued vector through word embedding. Next, to overcome the limitation that traditional document embedding is affected by miscellaneous words as well as core words, the vectors corresponding to the keywords of each document are extracted to form a set of keyword vectors for that document. Clustering is then conducted on the set of keywords for each document to identify the multiple subjects included in the document. Finally, a multi-vector is generated from the vectors of the keywords constituting each cluster. Experiments on 3,147 academic papers revealed that the traditional single-vector approach cannot properly map complex documents because of interference among subjects in each vector. 
With the proposed multi-vector method, we ascertained that complex documents can be vectorized more accurately by eliminating the interference among subjects.
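
The last three steps described above (keyword vector extraction, keyword clustering, and multiple-vector generation) can be illustrated with a minimal sketch. Several parts are assumptions, not taken from the abstract: the toy two-dimensional `word_vectors` table stands in for a trained word embedding, k-means with farthest-first initialization stands in for the unspecified clustering algorithm, and each cluster is summarized by the mean of its keyword vectors. All function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def farthest_first_centroids(vectors, k):
    """Deterministic initialization: start at the first vector, then
    repeatedly add the vector farthest from the centroids chosen so far."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(squared_distance(v, c) for c in centroids))
        )
    return centroids


def kmeans_labels(vectors, k, iterations=10):
    """Plain k-means (an assumed choice of clustering algorithm)."""
    centroids = farthest_first_centroids(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels


def multi_vector_embedding(keywords, word_vectors, k):
    """Steps (3) to (5): look up keyword vectors, cluster them, and
    return one vector per cluster (here, the cluster mean)."""
    keyword_vectors = [word_vectors[w] for w in keywords if w in word_vectors]
    labels = kmeans_labels(keyword_vectors, k)
    clusters = {}
    for vector, label in zip(keyword_vectors, labels):
        clusters.setdefault(label, []).append(vector)
    return [mean_vector(members) for _, members in sorted(clusters.items())]


# Toy word embedding (hypothetical values) for a document with two subjects.
word_vectors = {
    "embedding": [1.0, 0.1],
    "word2vec": [0.9, 0.0],
    "clustering": [0.0, 1.0],
    "kmeans": [0.1, 0.9],
}
keywords = ["embedding", "word2vec", "clustering", "kmeans"]

multi_vectors = multi_vector_embedding(keywords, word_vectors, k=2)
single_vector = mean_vector([word_vectors[w] for w in keywords])
```

In this toy case the single averaged vector lands midway between the two subjects, while the two cluster vectors stay close to their own keywords, which is the kind of interference among subjects the abstract describes.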

An Investigation on the Periodical Transition of News related to North Korea using Text Mining (텍스트마이닝을 활용한 북한 관련 뉴스의 기간별 변화과정 고찰)

  • Park, Chul-Soo
    • Journal of Intelligence and Information Systems / v.25 no.3 / pp.63-88 / 2019
  • The goal of this paper is to investigate changes in North Korea's domestic and foreign policies through automated text analysis of how North Korea is represented in South Korean mass media. Based on that data, we then analyze the status of text mining research, using a text mining technique to find the topics, methods, and trends of text mining research. We also investigate the characteristics and analysis methods of the text mining techniques, as confirmed by analysis of the data. In this study, the R program, which is free software for statistical computing and graphics, was used to apply the text mining technique. Text mining methods make it possible to highlight the most frequently used keywords in a body of text, for example by creating a word cloud, also referred to as a text cloud or tag cloud. This study proposes a procedure for finding meaningful tendencies based on a combination of word clouds and co-occurrence networks. It aims to explore more objectively the images of North Korea represented in South Korean newspapers by quantitatively reviewing the patterns of language use related to North Korea in newspaper big data from November 1, 2016 to May 23, 2019. We divided this span into three periods in light of recent inter-Korean relations. The period before January 1, 2018 was set as the Before Phase of Peace Building. The period from January 1, 2018 to February 24, 2019 was set as the Peace Building Phase, during which Kim Jong-un's New Year's message and the PyeongChang Olympics created an atmosphere of peace on the Korean peninsula. The third period, after the Hanoi peace summit, was marked by silence in the relationship between North Korea and the United States and was therefore called the Depression Phase of Peace Building. This study analyzes news articles related to North Korea from the Korea Press Foundation database (www.bigkinds.or.kr) through text mining to investigate the characteristics of the Kim Jong-un regime's South Korea policy and unification discourse. 
The main results show that trends in the North Korean national policy agenda can be discovered using clustering and visualization algorithms. In particular, the study uses co-occurrence word analysis to examine changes in international circumstances, domestic conflicts, living conditions in North Korea, the South's aid projects for the North, conflicts between the two Koreas, the North Korean nuclear issue, and the North Korean refugee problem. It also offers an analysis of the South Korean mentality toward North Korea in terms of semantic prosody. In the Before Phase of Peace Building, the top words, in order, were 'Missile', 'North Korea Nuclear', 'Diplomacy', 'Unification', and 'South-North Korean'. In the Peace Building Phase, the order was 'Panmunjom', 'Unification', 'North Korea Nuclear', 'Diplomacy', and 'Military'. In the Depression Phase of Peace Building, the order was 'North Korea Nuclear', 'North and South Korea', 'Missile', 'State Department', and 'International'. There are 16 words common to all three periods. Those listed, in order, are 'Missile', 'North Korea Nuclear', 'Diplomacy', 'Unification', 'North and South Korea', 'Military', 'Kaesong Industrial Complex', 'Defense', 'Sanctions', 'Denuclearization', 'Peace', 'Exchange and Cooperation', and 'South Korea'. We expect the results of this study to contribute to analyzing trends in news content about North Korea associated with its provocations, and future research on North Korean trends will build on these results. We will continue to develop a model for measuring North Korea risk that can anticipate and respond to North Korea's behavior in advance. We expect that text mining and scientific data analysis techniques will be applied to the field of North Korea and unification research. 
Through such academic work, we hope to see many studies that make important contributions to the nation.
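
The keyword-frequency and co-occurrence analysis described above can be sketched as follows. The study itself used R; Python is used here purely for illustration, and the tokenized `articles`, the function names, and the choice to count each word pair once per article are all assumptions rather than details from the abstract.

```python
from collections import Counter
from itertools import combinations


def term_frequencies(documents):
    """Count how often each term appears across all documents
    (the input to a word cloud)."""
    return Counter(token for tokens in documents for token in tokens)


def cooccurrence_counts(documents):
    """Count, for each unordered pair of terms, the number of documents
    in which both appear (the edge weights of a co-occurrence network)."""
    pair_counts = Counter()
    for tokens in documents:
        unique_terms = sorted(set(tokens))  # sorted so each pair has one canonical order
        pair_counts.update(combinations(unique_terms, 2))
    return pair_counts


# Hypothetical, already-tokenized keyword lists for three news articles.
articles = [
    ["missile", "sanctions", "diplomacy"],
    ["missile", "sanctions"],
    ["summit", "diplomacy", "peace"],
]

frequencies = term_frequencies(articles)
edges = cooccurrence_counts(articles)
```

Running the same counts separately on the articles of each period would give per-period rankings that can be compared, in the spirit of the three-phase comparison reported in the abstract.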

Structural and functional characteristics of rock-boring clam Barnea manilensis (암석을 천공하는 돌맛조개(Barnea manilensis)의 구조 및 기능)

  • Ji Yeong Kim;Yun Jeon Ahn;Tae Jin Kim;Seung Min Won;Seung Won Lee;Jongwon Song;Jeongeun Bak
    • Korean Journal of Environmental Biology / v.40 no.4 / pp.413-422 / 2022
  • Barnea manilensis is a bivalve that bores into soft rocks, such as limestone or mudstone, in the low intertidal zone. These clams make burrows with narrow entrances and wide interiors and live in them for their entire lives. In this study, the morphology and the microstructure of the valve of the rock-boring clam B. manilensis were observed using a stereoscopic microscope and FE-SEM, respectively. The chemical composition of specific parts of the valve was assessed by energy dispersive X-ray spectroscopy (EDS). 3D modeling and structural dynamic analysis were used to simulate the boring behavior of B. manilensis. Microscopy showed that the valve was asymmetric, with plow-like spikes located on the anterior surface of the valve and distributed in a specific direction. The anterior parts of the valve were thicker than the posterior parts. EDS results indicated that the valve consisted mainly of calcium carbonate, while metal elements such as Al, Si, Mn, Fe, and Mg were detected on the outer surface of the anterior spikes. It was assumed that these metal elements increase the strength of the valve, helping B. manilensis to bore into sediment. The simulation showed that spikes located on the anterior part of the valve received a load at all angles, suggesting that the anterior part of the shell bears the load while drilling rocks. The boring mechanism using the amorphous valve of B. manilensis is expected to serve as basic data for devising an efficient drilling mechanism.

The Effect of Expert Reviews on Consumer Product Evaluations: A Text Mining Approach (전문가 제품 후기가 소비자 제품 평가에 미치는 영향: 텍스트마이닝 분석을 중심으로)

  • Kang, Taeyoung;Park, Do-Hyung
    • Journal of Intelligence and Information Systems / v.22 no.1 / pp.63-82 / 2016
  • Individuals gather information online to resolve problems in their daily lives and to make decisions about purchasing products or services. With the revolutionary development of information technology, Web 2.0 has allowed more people to easily generate and use online reviews, so the volume of information is rapidly increasing, and the usefulness and significance of analyzing unstructured data have increased as well. This paper presents an analysis of the lexical features of expert product reviews to determine their influence on consumers' purchasing decisions. The focus was on how unstructured data can be organized and used in diverse contexts through text mining. In addition, diverse lexical features of expert reviews provided by a third-party review site were extracted and defined. Expert reviews are defined as evaluations, published in newspapers or magazines, by people who have expert knowledge about specific products or services; this type of review is also called a critic review. Before the widespread use of the Internet, consumers could access expert reviews only through newspapers or magazines and thus could not access many of them. Recently, however, major media have begun to provide online services, so people can access expert reviews more easily and affordably than in the past. Diverse reviews from experts in several fields are important because of information asymmetry, in which some information is not shared between consumers and sellers. This asymmetry can be resolved when third parties with expertise provide information to consumers. Consumers can then read expert reviews and make purchasing decisions by considering the abundant information on products or services. Expert reviews therefore play an important role in consumers' purchasing decisions and in the performance of companies across diverse industries. 
If the influence of qualitative data, such as reviews or post-purchase assessments, can be identified separately from quantitative data, such as the actual quality or price of products, it is possible to identify which aspects of product reviews hamper or promote product sales. Previous studies of expert reviews have focused on the characteristics of the experts themselves, such as the expertise and credibility of sources; however, these studies did not address the influence of the linguistic features of experts' product reviews on consumers' overall evaluations. This study instead focused on experts' recommendations and evaluations to reveal the lexical features of expert reviews and whether such features influence consumers' overall evaluations and purchasing decisions. Real expert product reviews were analyzed using the suggested methodology, and five lexical features of expert reviews were ultimately determined. Specifically, "review depth" (i.e., the degree of detail of the expert's product analysis) and "lack of assurance" (i.e., the degree of confidence that the expert has in the evaluation) have statistically significant effects on consumers' product evaluations. In contrast, "positive polarity" (i.e., the degree of positivity of an expert's evaluations) has an insignificant effect, while "negative polarity" (i.e., the degree of negativity of an expert's evaluations) has a significant negative effect on consumers' product evaluations. Finally, "social orientation" (i.e., the degree to which experts include social expressions in their reviews) does not have a significant effect on consumers' product evaluations. In summary, the lexical properties of the product reviews were defined according to each relevant factor, and the influence of each linguistic factor of expert reviews on consumers' final evaluations was then tested. 
In addition, a test was performed on whether each linguistic factor influencing consumers' product evaluations differs depending on the lexical features. The results of these analyses should provide guidelines on how individuals process massive volumes of unstructured data depending on lexical features in various contexts and on how companies can use this mechanism from their own perspective. This paper provides several theoretical and practical contributions, such as the proposal of a new methodology and its application to real data.
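
One simple way to turn a review into lexical features of the kind named above is dictionary-based counting. The sketch below is only an illustration under stated assumptions: the abstract does not say how the features were measured, so the tiny word lists, the use of token count as a proxy for review depth, and the function name `lexical_features` are all hypothetical.

```python
import re

# Hypothetical mini-lexicons; a real analysis would use a validated dictionary.
POSITIVE_WORDS = {"excellent", "great", "impressive", "reliable"}
NEGATIVE_WORDS = {"poor", "weak", "disappointing", "flawed"}
TENTATIVE_WORDS = {"might", "maybe", "perhaps", "possibly"}


def lexical_features(review):
    """Return simple lexical features of one review.

    review_depth      : number of word tokens (assumed proxy for detail)
    positive_polarity : share of tokens found in POSITIVE_WORDS
    negative_polarity : share of tokens found in NEGATIVE_WORDS
    lack_of_assurance : share of tokens found in TENTATIVE_WORDS
    """
    tokens = re.findall(r"[a-z']+", review.lower())
    total = len(tokens)
    if total == 0:
        return {
            "review_depth": 0,
            "positive_polarity": 0.0,
            "negative_polarity": 0.0,
            "lack_of_assurance": 0.0,
        }

    def share(lexicon):
        # Fraction of tokens that belong to the given word list.
        return sum(token in lexicon for token in tokens) / total

    return {
        "review_depth": total,
        "positive_polarity": share(POSITIVE_WORDS),
        "negative_polarity": share(NEGATIVE_WORDS),
        "lack_of_assurance": share(TENTATIVE_WORDS),
    }


features = lexical_features("The camera is excellent, but the battery might be poor.")
```

Features computed this way for each review could then serve as explanatory variables in a regression on consumers' product evaluations, which is the kind of test the abstract reports.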

Artificial Intelligence In Wheelchair: From Technology for Autonomy to Technology for Interdependence and Care (휠체어 탄 인공지능: 자율적 기술에서 상호의존과 돌봄의 기술로)

  • HA, Dae-Cheong
    • Journal of Science and Technology Studies / v.19 no.2 / pp.169-206 / 2019
  • This article explores new relationships and ethics between humans and technology by analyzing the cultural imaginary produced by artificial intelligence. Drawing on theoretical reflections from feminist science and technology studies, which understand science and technology as matters of care (Puig de la Bellacasa, 2011), this paper focuses on the fact that artificial intelligence and robots materialize cultural imaginaries such as autonomy. This autonomy, defined as the capacity to adapt to a new environment through self-learning, is accepted as a way to conceptualize an authentic human or an ideal subject. However, this article argues that artificial intelligence is mediated by and dependent on invisible human labor and complex material devices, suggesting that such autonomy is close to fiction. The recent growth of so-called 'assistant technology' shows that it differentially visualizes the care work of machines and humans. Technology and its cultural imaginary hide the care work of human workers and actively visualize that of the machine. They also make autonomy and agency the ideal of humanness, leaving disabled bodies and dependency as unworthy. Artificial intelligence and its cultural imaginary negate the value of disabled bodies while idealizing able bodies, and thereby erase the real relationship between humans and technology as mutually dependent beings. In conclusion, the author argues that the technology we need is not one that excludes non-typical bodies and the care work of others, but one that includes them as they are. Such technology responsibly empathizes with marginalized beings and encourages solidarity between fragile beings. Inspired by a performance by the artist Sue Austin, the author finally proposes 'artificial intelligence in wheelchair' as an alternative figuration to the currently dominant 'autonomous artificial intelligence'.