• Title/Summary/Keyword: Stream Data Mining

Search Result 97, Processing Time 0.026 seconds

Analysis of Twitter for 2012 South Korea Presidential Election by Text Mining Techniques (텍스트 마이닝을 이용한 2012년 한국대선 관련 트위터 분석)

  • Bae, Jung-Hwan;Son, Ji-Eun;Song, Min
    • Journal of Intelligence and Information Systems
    • /
    • v.19 no.3
    • /
    • pp.141-156
    • /
    • 2013
  • Social media is a representative form of the Web 2.0 that shapes the change of a user's information behavior by allowing users to produce their own contents without any expert skills. In particular, as a new communication medium, it has a profound impact on the social change by enabling users to communicate with the masses and acquaintances their opinions and thoughts. Social media data plays a significant role in an emerging Big Data arena. A variety of research areas such as social network analysis, opinion mining, and so on, therefore, have paid attention to discover meaningful information from vast amounts of data buried in social media. Social media has recently become main foci to the field of Information Retrieval and Text Mining because not only it produces massive unstructured textual data in real-time but also it serves as an influential channel for opinion leading. But most of the previous studies have adopted broad-brush and limited approaches. These approaches have made it difficult to find and analyze new information. To overcome these limitations, we developed a real-time Twitter trend mining system to capture the trend in real-time processing big stream datasets of Twitter. The system offers the functions of term co-occurrence retrieval, visualization of Twitter users by query, similarity calculation between two users, topic modeling to keep track of changes of topical trend, and mention-based user network analysis. In addition, we conducted a case study on the 2012 Korean presidential election. We collected 1,737,969 tweets which contain candidates' name and election on Twitter in Korea (http://www.twitter.com/) for one month in 2012 (October 1 to October 31). The case study shows that the system provides useful information and detects the trend of society effectively. The system also retrieves the list of terms co-occurred by given query terms. We compare the results of term co-occurrence retrieval by giving influential candidates' name, 'Geun Hae Park', 'Jae In Moon', and 'Chul Su Ahn' as query terms. General terms which are related to presidential election such as 'Presidential Election', 'Proclamation in Support', Public opinion poll' appear frequently. Also the results show specific terms that differentiate each candidate's feature such as 'Park Jung Hee' and 'Yuk Young Su' from the query 'Guen Hae Park', 'a single candidacy agreement' and 'Time of voting extension' from the query 'Jae In Moon' and 'a single candidacy agreement' and 'down contract' from the query 'Chul Su Ahn'. Our system not only extracts 10 topics along with related terms but also shows topics' dynamic changes over time by employing the multinomial Latent Dirichlet Allocation technique. Each topic can show one of two types of patterns-Rising tendency and Falling tendencydepending on the change of the probability distribution. To determine the relationship between topic trends in Twitter and social issues in the real world, we compare topic trends with related news articles. We are able to identify that Twitter can track the issue faster than the other media, newspapers. The user network in Twitter is different from those of other social media because of distinctive characteristics of making relationships in Twitter. Twitter users can make their relationships by exchanging mentions. We visualize and analyze mention based networks of 136,754 users. We put three candidates' name as query terms-Geun Hae Park', 'Jae In Moon', and 'Chul Su Ahn'. The results show that Twitter users mention all candidates' name regardless of their political tendencies. This case study discloses that Twitter could be an effective tool to detect and predict dynamic changes of social issues, and mention-based user networks could show different aspects of user behavior as a unique network that is uniquely found in Twitter.

The controversial points for the assessment of soil contamination related to the change of pH of extraction solution in using partial extraction in standard method in Korea (국내 토양오염 공정시험방법의 용출법 사용시 용출액의 pH의 변화가 토양 오염 평가에 미치는 문제점)

  • 오창환;유연희;이평구;이영엽
    • Proceedings of the Korean Society of Soil and Groundwater Environment Conference
    • /
    • 2000.11a
    • /
    • pp.294-297
    • /
    • 2000
  • Heavy metals are extracted from Chonju stream sediment, roadside soils and sediments along Honam expressway, soils and tailings from mining area using partial ectraction in Standard Method, partial ectraction method with maintaining 0.1N of extraction solution and acid digestion. In samples having buffer capacity against acid, 0.1N of extraction solution can not be maintained and pH of extraction solution increases up to 8.0 when partial extraction in Standard Method is used. The averages and ranges of (heavy metals extracted using partial extraction in standard method, HPE)/(heavy metals extracted using partial extraction method with maintaining 0.1N of extraction solution, HPEM) values are 0.506 and 0.145~1.126 in Cd, 0.534~ and 0.078~0.928 in Zn, 0.461 and 0.041~1.715 in Mn, 0.359 and 0.011~0.874 in Cu, 0.195 and 0.018~1.785 in Cr, 0.710 and 0.003~3.075 in Pb, and 0.088 and 1.73$\times$10$^{-5}$ ~0.303 in Fe. These data indicate that the difference between HPE and HPEM is big in the order of Fe, Cr, Cu, Mn, Cd, Zn and Pb. It is quite possible that the partial extraction method in Standard Method of soil in Korea is not adequate for an assessment of contamination in area where buffer capacity of soil will be decreased or lost after a long term exposure of soils to environmental damage.

  • PDF

A Study on Analysis of consumer perception of YouTube advertising using text mining (텍스트 마이닝을 활용한 Youtube 광고에 대한 소비자 인식 분석)

  • Eum, Seong-Won
    • Management & Information Systems Review
    • /
    • v.39 no.2
    • /
    • pp.181-193
    • /
    • 2020
  • This study is a study that analyzes consumer perception by utilizing text mining, which is a recent issue. we analyzed the consumer's perception of Samsung Galaxy by analyzing consumer reviews of Samsung Galaxy YouTube ads. for analysis, 1,819 consumer reviews of YouTube ads were extracted. through this data pre-processing, keywords for advertisements were classified and extracted into nouns, adjectives, and adverbs. after that, frequency analysis and emotional analysis were performed. Finally, clustering was performed through CONCOR. the summary of this study is as follows. the first most frequently mentioned words were Galaxy Note (n = 217), Good (n = 135), Pen (n = 40), and Function (n = 29). it can be judged through the advertisement that consumers "Galaxy Note", "Good", "Pen", and "Features" have good functional aspects for Samsung mobile phone products and positively recognize the Note Pen. in addition, the recognition of "Samsung Pay", "Innovation", "Design", and "iPhone" shows that Samsung's mobile phone is highly regarded for its innovative design and functional aspects of Samsung Pay. second, it is the result of sentiment analysis on YouTube advertising. As a result of emotional analysis, the ratio of emotional intensity was positive (75.95%) and higher than negative (24.05%). this means that consumers are positively aware of Samsung Galaxy mobile phones. As a result of the emotional keyword analysis, positive keywords were "good", "good", "innovative", "highest", "fast", "pretty", etc., negative keywords were "frightening", "I want to cry", "discomfort", "sorry", "no", etc. were extracted. the implication of this study is that most of the studies by quantitative analysis methods were considered when looking at the consumer perception study of existing advertisements. In this study, we deviated from quantitative research methods for advertising and attempted to analyze consumer perception through qualitative research. this is expected to have a great influence on future research, and I am sure that it will be a starting point for consumer awareness research through qualitative research.

Twitter Issue Tracking System by Topic Modeling Techniques (토픽 모델링을 이용한 트위터 이슈 트래킹 시스템)

  • Bae, Jung-Hwan;Han, Nam-Gi;Song, Min
    • Journal of Intelligence and Information Systems
    • /
    • v.20 no.2
    • /
    • pp.109-122
    • /
    • 2014
  • People are nowadays creating a tremendous amount of data on Social Network Service (SNS). In particular, the incorporation of SNS into mobile devices has resulted in massive amounts of data generation, thereby greatly influencing society. This is an unmatched phenomenon in history, and now we live in the Age of Big Data. SNS Data is defined as a condition of Big Data where the amount of data (volume), data input and output speeds (velocity), and the variety of data types (variety) are satisfied. If someone intends to discover the trend of an issue in SNS Big Data, this information can be used as a new important source for the creation of new values because this information covers the whole of society. In this study, a Twitter Issue Tracking System (TITS) is designed and established to meet the needs of analyzing SNS Big Data. TITS extracts issues from Twitter texts and visualizes them on the web. The proposed system provides the following four functions: (1) Provide the topic keyword set that corresponds to daily ranking; (2) Visualize the daily time series graph of a topic for the duration of a month; (3) Provide the importance of a topic through a treemap based on the score system and frequency; (4) Visualize the daily time-series graph of keywords by searching the keyword; The present study analyzes the Big Data generated by SNS in real time. SNS Big Data analysis requires various natural language processing techniques, including the removal of stop words, and noun extraction for processing various unrefined forms of unstructured data. In addition, such analysis requires the latest big data technology to process rapidly a large amount of real-time data, such as the Hadoop distributed system or NoSQL, which is an alternative to relational database. We built TITS based on Hadoop to optimize the processing of big data because Hadoop is designed to scale up from single node computing to thousands of machines. Furthermore, we use MongoDB, which is classified as a NoSQL database. In addition, MongoDB is an open source platform, document-oriented database that provides high performance, high availability, and automatic scaling. Unlike existing relational database, there are no schema or tables with MongoDB, and its most important goal is that of data accessibility and data processing performance. In the Age of Big Data, the visualization of Big Data is more attractive to the Big Data community because it helps analysts to examine such data easily and clearly. Therefore, TITS uses the d3.js library as a visualization tool. This library is designed for the purpose of creating Data Driven Documents that bind document object model (DOM) and any data; the interaction between data is easy and useful for managing real-time data stream with smooth animation. In addition, TITS uses a bootstrap made of pre-configured plug-in style sheets and JavaScript libraries to build a web system. The TITS Graphical User Interface (GUI) is designed using these libraries, and it is capable of detecting issues on Twitter in an easy and intuitive manner. The proposed work demonstrates the superiority of our issue detection techniques by matching detected issues with corresponding online news articles. The contributions of the present study are threefold. First, we suggest an alternative approach to real-time big data analysis, which has become an extremely important issue. Second, we apply a topic modeling technique that is used in various research areas, including Library and Information Science (LIS). Based on this, we can confirm the utility of storytelling and time series analysis. Third, we develop a web-based system, and make the system available for the real-time discovery of topics. The present study conducted experiments with nearly 150 million tweets in Korea during March 2013.

Analysis of Research Trends of 'Word of Mouth (WoM)' through Main Path and Word Co-occurrence Network (주경로 분석과 연관어 네트워크 분석을 통한 '구전(WoM)' 관련 연구동향 분석)

  • Shin, Hyunbo;Kim, Hea-Jin
    • Journal of Intelligence and Information Systems
    • /
    • v.25 no.3
    • /
    • pp.179-200
    • /
    • 2019
  • Word-of-mouth (WoM) is defined by consumer activities that share information concerning consumption. WoM activities have long been recognized as important in corporate marketing processes and have received much attention, especially in the marketing field. Recently, according to the development of the Internet, the way in which people exchange information in online news and online communities has been expanded, and WoM is diversified in terms of word of mouth, score, rating, and liking. Social media makes online users easy access to information and online WoM is considered a key source of information. Although various studies on WoM have been preceded by this phenomenon, there is no meta-analysis study that comprehensively analyzes them. This study proposed a method to extract major researches by applying text mining techniques and to grasp the main issues of researches in order to find the trend of WoM research using scholarly big data. To this end, a total of 4389 documents were collected by the keyword 'Word-of-mouth' from 1941 to 2018 in Scopus (www.scopus.com), a citation database, and the data were refined through preprocessing such as English morphological analysis, stopwords removal, and noun extraction. To carry out this study, we adopted main path analysis (MPA) and word co-occurrence network analysis. MPA detects key researches and is used to track the development trajectory of academic field, and presents the research trend from a macro perspective. For this, we constructed a citation network based on the collected data. The node means a document and the link means a citation relation in citation network. We then detected the key-route main path by applying SPC (Search Path Count) weights. As a result, the main path composed of 30 documents extracted from a citation network. The main path was able to confirm the change of the academic area which was developing along with the change of the times reflecting the industrial change such as various industrial groups. The results of MPA revealed that WoM research was distinguished by five periods: (1) establishment of aspects and critical elements of WoM, (2) relationship analysis between WoM variables, (3) beginning of researches of online WoM, (4) relationship analysis between WoM and purchase, and (5) broadening of topics. It was found that changes within the industry was reflected in the results such as online development and social media. Very recent studies showed that the topics and approaches related WoM were being diversified to circumstantial changes. However, the results showed that even though WoM was used in diverse fields, the main stream of the researches of WoM from the start to the end, was related to marketing and figuring out the influential factors that proliferate WoM. By applying word co-occurrence network analysis, the research trend is presented from a microscopic point of view. Word co-occurrence network was constructed to analyze the relationship between keywords and social network analysis (SNA) was utilized. We divided the data into three periods to investigate the periodic changes and trends in discussion of WoM. SNA showed that Period 1 (1941~2008) consisted of clusters regarding relationship, source, and consumers. Period 2 (2009~2013) contained clusters of satisfaction, community, social networks, review, and internet. Clusters of period 3 (2014~2018) involved satisfaction, medium, review, and interview. The periodic changes of clusters showed transition from offline to online WoM. Media of WoM have become an important factor in spreading the words. This study conducted a quantitative meta-analysis based on scholarly big data regarding WoM. The main contribution of this study is that it provides a micro perspective on the research trend of WoM as well as the macro perspective. The limitation of this study is that the citation network constructed in this study is a network based on the direct citation relation of the collected documents for MPA.

A Generation and Matching Method of Normal-Transient Dictionary for Realtime Topic Detection (실시간 이슈 탐지를 위한 일반-급상승 단어사전 생성 및 매칭 기법)

  • Choi, Bongjun;Lee, Hanjoo;Yong, Wooseok;Lee, Wonsuk
    • The Journal of Korean Institute of Next Generation Computing
    • /
    • v.13 no.5
    • /
    • pp.7-18
    • /
    • 2017
  • Recently, the number of SNS user has rapidly increased due to smart device industry development and also the amount of generated data is exponentially increasing. In the twitter, Text data generated by user is a key issue to research because it involves events, accidents, reputations of products, and brand images. Twitter has become a channel for users to receive and exchange information. An important characteristic of Twitter is its realtime. Earthquakes, floods and suicides event among the various events should be analyzed rapidly for immediately applying to events. It is necessary to collect tweets related to the event in order to analyze the events. But it is difficult to find all tweets related to the event using normal keywords. In order to solve such a mentioned above, this paper proposes A Generation and Matching Method of Normal-Transient Dictionary for realtime topic detection. Normal dictionaries consist of general keywords(event: suicide-death-loop, death, die, hang oneself, etc) related to events. Whereas transient dictionaries consist of transient keywords(event: suicide-names and information of celebrities, information of social issues) related to events. Experimental results show that matching method using two dictionary finds more tweets related to the event than a simple keyword search.

The Effects of pH Change in Extraction Solution on the Heavy Metals Extraction from Soil and Controversial Points for Partial Extraction in Korean Standard Method (용출액의 pH 변화가 토양내 중금속 용출에 미치는 영향과 그에 따른 국내 토양 오염 공정시험방법의 문제점)

  • 오창환;유연희;이평구;이영엽
    • Economic and Environmental Geology
    • /
    • v.36 no.3
    • /
    • pp.159-170
    • /
    • 2003
  • Heavy metals are extracted from Chonju stream sediment, roadside soils and sediments along Honam expressway, soils and tailings from mining area using three different methods (partial extraction in Standard Method, partial extraction method with maintaining 0.1 N of extraction solution and Sequential Extraction Method). In samples having buffer capacity against acid, pH 1 (0.1 N HCl) of extraction solution can not be maintained and pH of extraction solution increases up to 8.0 when partial extraction in Standard Method is used. The averages and ranges of HPE(heavy metals extracted using partial extraction in Standard Method)/HPEM(heavy metals extracted using partial extraction method with maintaining 0.1 N of extraction solution) values are 0.479 and 0.145~0.929 for Cd, 0.534 and 0.078~0.928 for Zn, 0.432 and 0.041~0.992 for Mn, 0.359 and 0.011~0.874 for Cu, 0.150 and 0.018~0.530 for Cr, 0.219 and 0.003~0.853 for Pb, and 0.088 and 1.73${\times}$10$^{-5}$~0.303 for Fe. These data indicate that the difference between HPE and HPEM is large in the order of Fe, Cr, Pb, Cu, Mn, Cd and Zn. The amounts of heavy metals extracted decreases in the follow order; Sum III(sum of fraction I, II, III in sequential extraction)>HPEM>Sum III (sum of fraction I and II)>HPE for Zn, Cd and Mn and Sum III>HPEM>HPE for Cr and Fe. In the case Cr, Sum II is lower than HPEM and higher than HPE. In case of Cu, extracted heavy metals is large in the order Sum IV>HPEM>Sum III HPE. HPE/HPEM value decreases with increasing the amount of HCl used for maintaining 0.1 N of extraction solution. For samples with high buffer capacity, HPE/HPEM value in all elements is lower than 0.2. On the other hand, for samples with low buffer capacity, HPE/HPEM value are over 0.2 and many samples have values higher than 0.6 for Zn, Cd Mn and Cu due to the small difference between Sum II and Sum III, and relatively higher mobility. However, for Fe and Cr, HPE/HPEM value is below 0.2 even for samples with low buffer capacity due to their low mobility and big difference between Sum II and Sum III. This study indicates that the partial extraction method in Korean Standard Method of soil is not suitable for an assessment of soil contamination in area where buffer capacity of soil can be decreased or lost because of a long term exposure to environmental damage such as acidic rain.