• Title/Summary/Keyword: Social Big Data Mining

Derivation of Green Infrastructure Planning Factors for Reducing Particulate Matter - Using Text Mining - (미세먼지 저감을 위한 그린인프라 계획요소 도출 - 텍스트 마이닝을 활용하여 -)

  • Seok, Youngsun;Song, Kihwan;Han, Hyojoo;Lee, Junga
    • Journal of the Korean Institute of Landscape Architecture
    • /
    • v.49 no.5
    • /
    • pp.79-96
    • /
    • 2021
  • Green infrastructure planning represents landscape planning measures to reduce particulate matter. This study aimed to derive factors that may be used in planning green infrastructure for particulate matter reduction using text mining techniques. A range of analyses were carried out by focusing on keywords such as 'particulate matter reduction plan' and 'green infrastructure planning elements'. The analyses included Term Frequency-Inverse Document Frequency (TF-IDF) analysis, centrality analysis, related word analysis, and topic modeling analysis. These analyses were carried out via text mining by collecting information on previous related research, policy reports, and laws. First, TF-IDF analysis results were used to classify major keywords relating to particulate matter and green infrastructure into three groups: (1) environmental issues (e.g., particulate matter, environment, carbon, and atmosphere), (2) target spaces (e.g., urban, park, and local green space), and (3) application methods (e.g., analysis, planning, evaluation, development, ecological aspect, policy management, technology, and resilience). Second, the centrality analysis results were found to be similar to those of TF-IDF; it was confirmed that the central connectors to the major keywords were 'Green New Deal' and 'Vacant land'. Third, the results from the analysis of related words verified that planning green infrastructure for particulate matter reduction required planning forests and ventilation corridors. Additionally, moisture must be considered for microclimate control. It was also confirmed that utilizing vacant space, establishing mixed forests, introducing particulate matter reduction technology, and understanding the system may be important for the effective planning of green infrastructure. Finally, topic analysis was used to classify the planning elements of green infrastructure based on ecological, technological, and social functions.
The planning elements of ecological function were classified into morphological (e.g., urban forest, green space, wall greening) and functional aspects (e.g., climate control, carbon storage and absorption, provision of habitats, and biodiversity for wildlife). The planning elements of technical function were classified into various themes, including the disaster prevention functions of green infrastructure, buffer effects, stormwater management, water purification, and energy reduction. The planning elements of the social function were classified into themes such as community function, improving the health of users, and scenery improvement. These results suggest that green infrastructure planning for particulate matter reduction requires approaches related to key concepts, such as resilience and sustainability. In particular, there is a need to apply green infrastructure planning elements in order to reduce exposure to particulate matter.
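The TF-IDF weighting named in the abstract above can be illustrated with a minimal sketch. The toy corpus, the function name `tf_idf`, and the plain `log(N / df)` variant of the inverse document frequency are assumptions for illustration; the abstract does not state which TF-IDF variant or tooling the study used.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    Assumes tf = relative term frequency and idf = log(N / df);
    other variants (smoothed, sublinear) are equally common.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical toy corpus, not the study's data.
docs = [
    "particulate matter urban forest".split(),
    "particulate matter policy".split(),
    "urban park green space".split(),
]
scores = tf_idf(docs)
```

In this toy corpus, "forest" outranks "particulate" in the first document because it appears in fewer documents overall, which is the property that lets TF-IDF surface distinctive keywords.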

Investigating Dynamic Mutation Process of Issues Using Unstructured Text Analysis (비정형 텍스트 분석을 활용한 이슈의 동적 변이과정 고찰)

  • Lim, Myungsu;Kim, Namgyu
    • Journal of Intelligence and Information Systems
    • /
    • v.22 no.1
    • /
    • pp.1-18
    • /
    • 2016
  • Owing to the extensive use of Web media and the development of the IT industry, a large amount of data has been generated, shared, and stored. Nowadays, various types of unstructured data such as image, sound, video, and text are distributed through Web media. Therefore, many attempts have been made in recent years to discover new value through an analysis of these unstructured data. Among these types of unstructured data, text is recognized as the most representative method for users to express and share their opinions on the Web. In this sense, demand for obtaining new insights through text analysis is steadily increasing. Accordingly, text mining is increasingly being used for different purposes in various fields. In particular, issue tracking is being widely studied not only in the academic world but also in industries because it can be used to extract various issues from text such as news and social network services (SNS) and to analyze the trends of these issues. Conventionally, issue tracking is used to identify major issues sustained over a long period of time through topic modeling and to analyze the detailed distribution of documents involved in each issue. However, because conventional issue tracking assumes that the content composing each issue does not change throughout the entire tracking period, it cannot represent the dynamic mutation process of detailed issues that can be created, merged, divided, and deleted between these periods. Moreover, because only keywords that appear consistently throughout the entire period can be derived as issue keywords, concrete issue keywords such as "nuclear test" and "separated families" may be concealed by more general issue keywords such as "North Korea" in an analysis over a long period of time. This implies that many meaningful but short-lived issues cannot be discovered by conventional issue tracking.
Note that detailed keywords are preferable to general keywords because the former can be clues for providing actionable strategies. To overcome these limitations, we performed an independent analysis on the documents of each detailed period. We generated an issue flow diagram based on the similarity of each issue between two consecutive periods. The issue transition pattern among categories was analyzed by using the category information of each document. In this study, we then applied the proposed methodology to a real case of 53,739 news articles. We derived an issue flow diagram from the articles. We then proposed the following useful application scenarios for the issue flow diagram presented in the experiment section. First, we can identify an issue that actively appears during a certain period and promptly disappears in the next period. Second, the preceding and following issues of a particular issue can be easily discovered from the issue flow diagram. This implies that our methodology can be used to discover the association between inter-period issues. Finally, an interesting pattern of one-way and two-way transitions was discovered by analyzing the transition patterns of issues through category analysis. Thus, we discovered that a pair of mutually similar categories induces two-way transitions. In contrast, one-way transitions can be recognized as an indicator that issues in a certain category tend to be influenced by other issues in another category. For practical application of the proposed methodology, high-quality word and stop word dictionaries need to be constructed. In addition, not only the number of documents but also additional meta-information such as the read counts, written time, and comments of documents should be analyzed. A rigorous performance evaluation or validation of the proposed methodology should be performed in future works.
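The linking step behind the issue flow diagram, connecting issues in consecutive periods by similarity, could be sketched as follows. The cosine measure, the 0.3 threshold, the keyword weights, and the names `cosine` and `issue_links` are all assumptions for illustration; the abstract does not say which similarity measure or cutoff the authors used.

```python
import math

def cosine(a, b):
    """Cosine similarity between two {keyword: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def issue_links(prev_issues, next_issues, threshold=0.3):
    """Link issues of two consecutive periods whose similarity
    reaches the (assumed) threshold."""
    links = []
    for p_name, p_vec in prev_issues.items():
        for n_name, n_vec in next_issues.items():
            sim = cosine(p_vec, n_vec)
            if sim >= threshold:
                links.append((p_name, n_name, round(sim, 3)))
    return links

# Hypothetical issues, each analyzed independently per period.
period1 = {"A": {"north": 0.5, "korea": 0.5, "nuclear": 0.8, "test": 0.7}}
period2 = {
    "B": {"north": 0.5, "korea": 0.5, "families": 0.8, "separated": 0.7},
    "C": {"baseball": 0.9, "league": 0.6},
}
links = issue_links(period1, period2)
```

Here issue A links forward to B through the shared general keywords, while C has no predecessor and would show up in the diagram as a newly created issue. An issue with no outgoing link corresponds to one that disappears in the next period.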

Identifying Landscape Perceptions of Visitors to the Taean Coast National Park Using Social Media Data - Focused on Kkotji Beach, Sinduri Coastal Sand Dune, and Manlipo Beach - (소셜미디어 데이터를 활용한 태안해안국립공원 방문객의 경관인식 파악 - 꽃지해수욕장·신두리해안사구·만리포해수욕장을 대상으로 -)

  • Lee, Sung-Hee;Son, Yong-Hoon
    • Journal of the Korean Institute of Landscape Architecture
    • /
    • v.46 no.5
    • /
    • pp.10-21
    • /
    • 2018
  • This study used text mining methodology to focus on the perceptions of the landscape embedded in text that users spontaneously uploaded to "Taean Travel" blog posts. The study area is the Taean Coast National Park. Most of the places returned by a blog search for 'Taean Travel' were located in the Taean Coast National Park. We conducted a network analysis on the top three places and extracted keywords related to the landscape. Finally, using a centrality and cohesion analysis, we derived landscape perceptions and the major characteristics of those landscapes. As a result of the study, it was possible to identify the main tourist places in Taean, the individual landscape experience, and the landscape perception in specific places. There were three different types of landscape characteristics: atmosphere-related keywords appeared in Kkotji Beach, symbolic image-related keywords appeared in Sinduri Coastal Sand Dune, and landscape object-related keywords appeared in Manlipo Beach. It can be inferred that the characteristics of these three places are perceived differently. Kkotji Beach is recognized as a place to appreciate a view of the sunset and is a base for the Taean Coast National Park's trekking course. Sinduri Coastal Sand Dune is recognized as a place with unusual scenery, and is an ecologically valuable space. Finally, Manlipo Beach is adjacent to the Chunlipo Arboretum, which is often visited by tourists, and the beach itself is recognized as a place with an impressive appearance. Social media data is very useful because it can enable analysis of various types of content that are not from an expert's point of view. In this study, we used social media data to analyze various aspects of how people perceive and enjoy landscapes by integrating various content, such as landscape objects, images, and activities.
However, because social media data may be amplified or distorted by users' memories and perceptions, field surveys are needed to verify the results of this study.

Korean Word Sense Disambiguation using Dictionary and Corpus (사전과 말뭉치를 이용한 한국어 단어 중의성 해소)

  • Jeong, Hanjo;Park, Byeonghwa
    • Journal of Intelligence and Information Systems
    • /
    • v.21 no.1
    • /
    • pp.1-13
    • /
    • 2015
  • As opinion mining in big data applications has been highlighted, a lot of research on unstructured data has been carried out. Social media on the Internet generate unstructured or semi-structured data every second, and these data are often written in the natural or human languages we use in daily life. Many words in human languages have multiple meanings or senses. As a result, it is very difficult for computers to extract useful information from these datasets. Traditional web search engines are usually based on keyword search, resulting in incorrect search results which are far from users' intentions. Even though a lot of progress has been made over the last years in enhancing the performance of search engines in order to provide users with appropriate results, there is still much room for improvement. Word sense disambiguation can play a very important role in natural language processing and is considered one of the most difficult problems in this area. Major approaches to word sense disambiguation can be classified as knowledge-based, supervised corpus-based, and unsupervised corpus-based approaches. This paper presents a method which automatically generates a corpus for word sense disambiguation by taking advantage of examples in existing dictionaries and avoids expensive sense tagging processes. It evaluates the effectiveness of the method based on the Naïve Bayes Model, which is one of the supervised learning algorithms, by using the Korean standard unabridged dictionary and the Sejong Corpus. The Korean standard unabridged dictionary has approximately 57,000 sentences. The Sejong Corpus has about 790,000 sentences tagged with both part-of-speech and senses. For the experiment of this study, the Korean standard unabridged dictionary and the Sejong Corpus were tested both in combination and as separate entities using cross validation. Only nouns, the target subjects in word sense disambiguation, were selected.
93,522 word senses among 265,655 nouns and 56,914 sentences from related proverbs and examples were additionally combined in the corpus. The Sejong Corpus was easily merged with the Korean standard unabridged dictionary because the Sejong Corpus was tagged based on sense indices defined by the Korean standard unabridged dictionary. Sense vectors were formed after the merged corpus was created. Terms used in creating sense vectors were added to the named entity dictionary of the Korean morphological analyzer. By using the extended named entity dictionary, term vectors were extracted from the input sentences. Given the extracted term vector and the sense vector model made during the pre-processing stage, the sense-tagged terms were determined by vector space model-based word sense disambiguation. In addition, this study shows the effectiveness of the merged corpus built from examples in the Korean standard unabridged dictionary and the Sejong Corpus. The experiment shows that the merged corpus yields better precision and recall. This study suggests that the method can practically enhance the performance of internet search engines and help us to understand the meaning of a sentence more accurately in natural language processing pertinent to search engines, opinion mining, and text mining. The Naïve Bayes classifier used in this study is a supervised learning algorithm based on Bayes' theorem. The Naïve Bayes classifier assumes that all senses are independent. Even though this assumption is not realistic and ignores the correlation between attributes, the Naïve Bayes classifier is widely used because of its simplicity, and in practice it is known to be very effective in many applications such as text classification and medical diagnosis. However, further research needs to be carried out to consider all possible combinations and/or partial combinations of all senses in a sentence.
Also, the effectiveness of word sense disambiguation may be improved if rhetorical structures or morphological dependencies between words are analyzed through syntactic analysis.
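As a rough illustration of the supervised approach described in the abstract above, a Naïve Bayes sense classifier trained on sense-tagged example sentences might look like this. The class name `NaiveBayesWSD`, the add-one smoothing, and the English toy examples are assumptions for illustration; the study itself worked with Korean dictionary examples and the Sejong Corpus, and its exact feature and smoothing choices are not given.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesWSD:
    """Minimal Naive Bayes word sense classifier over context words."""

    def __init__(self):
        self.sense_counts = Counter()            # examples per sense
        self.word_counts = defaultdict(Counter)  # context words per sense
        self.vocab = set()

    def train(self, examples):
        # examples: iterable of (sense_label, [context words])
        for sense, context in examples:
            self.sense_counts[sense] += 1
            self.word_counts[sense].update(context)
            self.vocab.update(context)

    def predict(self, context):
        total = sum(self.sense_counts.values())
        best_sense, best_score = None, float("-inf")
        for sense, n in self.sense_counts.items():
            # Log prior plus add-one smoothed log likelihoods.
            score = math.log(n / total)
            denom = sum(self.word_counts[sense].values()) + len(self.vocab)
            for word in context:
                score += math.log((self.word_counts[sense][word] + 1) / denom)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense

# Hypothetical sense-tagged examples standing in for dictionary sentences.
model = NaiveBayesWSD()
model.train([
    ("bank/finance", ["money", "deposit", "loan"]),
    ("bank/finance", ["account", "money"]),
    ("bank/river", ["river", "water", "shore"]),
    ("bank/river", ["fishing", "river"]),
])
river_sense = model.predict(["river", "fishing"])
money_sense = model.predict(["money"])
```

The independence assumption mentioned in the abstract is visible in the inner loop: each context word contributes its own log-probability term without regard to the other words.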

An Analysis of the Internal Marketing Impact on the Market Capitalization Fluctuation Rate based on the Online Company Reviews from Jobplanet (직원을 위한 내부마케팅이 기업의 시가 총액 변동률에 미치는 영향 분석: 잡플래닛 기업 리뷰를 중심으로)

  • Kichul Choi;Sang-Yong Tom Lee
    • Information Systems Review
    • /
    • v.20 no.2
    • /
    • pp.39-62
    • /
    • 2018
  • Thanks to the growth of computing power and the recent development of data analytics, researchers have started to work on the data produced by users through the Internet or social media. This study is in line with these recent research trends and attempts to adopt data analytical techniques. We focus on the impact of "internal marketing" factors on firm performance, which is typically studied through survey methodologies. We looked into the job review platform Jobplanet (www.jobplanet.co.kr), which is a website where employees and former employees anonymously review companies and their management. With web crawling processes, we collected over 40K data points and performed morphological analysis to classify employees' reviews for internal marketing data. We then implemented econometric analysis to see the relationship between internal marketing and market capitalization. Contrary to the findings of extant survey studies, internal marketing is positively related to a firm's market capitalization only within a limited area. In most of the areas, the relationships are negative. Particularly, female-friendly environment and human resource development (HRD) are the areas exhibiting positive relations with market capitalization in the manufacturing industry. In the service industry, most of the areas, such as employee welfare and work-life balance, are negatively related to market capitalization. When firm size is small (or the history is short), a female-friendly environment positively affects firm performance. On the contrary, when firm size is big (or the history is long), most of the internal marketing factors are either negative or insignificant. We explain the theoretical contributions and managerial implications of these results.

Analysis of the Importance and Satisfaction of Viewing Quality Factors among Non-Audience in Professional Baseball According to Corona 19 (코로나 19에 따른 프로야구 무관중 시청품질요인의 중요도, 만족도 분석)

  • Baek, Seung-Heon;Kim, Gi-Tak
    • Journal of Korea Entertainment Industry Association
    • /
    • v.15 no.2
    • /
    • pp.123-135
    • /
    • 2021
  • The data processing of this study focused on keywords related to 'Corona 19 and professional baseball' and 'Corona 19 and professional baseball no spectators', using the text mining and social network analysis of the Textom program to identify problems and to set the variables of viewing quality. For quantitative analysis, a questionnaire on viewing quality was constructed, and out of 270 survey respondents, 250 questionnaires were used for the final study. To secure the validity and reliability of the questionnaire, exploratory factor analysis and reliability analysis were conducted. IPA analysis (importance-satisfaction) was then conducted based on the validated questionnaire, and the results and strategies were presented. As a result of IPA analysis, factors related to the image (image composition, image coloration, image clarity, image enlargement and composition, high-quality image) were found in the first quadrant. The second quadrant contained game situation (support team game level, support player game level, star player discovery, competition with rival teams), game information (match schedule information, player information check, team performance and player performance, game information), and some interaction factors (consensus with the supporting team). The factors of commentator (baseball-related knowledge, communication ability, pronunciation and voice, use of standard language, introduction of game-related information) and interaction (real-time communication with the front desk, sympathy with viewers, information exchange such as chatting) also appeared.

Methodology for Identifying Issues of User Reviews from the Perspective of Evaluation Criteria: Focus on a Hotel Information Site (사용자 리뷰의 평가기준 별 이슈 식별 방법론: 호텔 리뷰 사이트를 중심으로)

  • Byun, Sungho;Lee, Donghoon;Kim, Namgyu
    • Journal of Intelligence and Information Systems
    • /
    • v.22 no.3
    • /
    • pp.23-43
    • /
    • 2016
  • As a result of the growth of Internet data and the rapid development of Internet technology, "big data" analysis has gained prominence as a major approach for evaluating and mining enormous data for various purposes. Especially, in recent years, people tend to share their experiences related to their leisure activities while also reviewing others' inputs concerning their activities. Therefore, by referring to others' leisure activity-related experiences, they are able to gather information that might guarantee them better leisure activities in the future. This phenomenon has appeared throughout many aspects of leisure activities such as movies, traveling, accommodation, and dining. Apart from blogs and social networking sites, many other websites provide a wealth of information related to leisure activities. Most of these websites provide information of each product in various formats depending on different purposes and perspectives. Generally, most of the websites provide the average ratings and detailed reviews of users who actually used products/services, and these ratings and reviews can actually support the decision of potential customers in purchasing the same products/services. However, the existing websites offering information on leisure activities only provide the rating and review based on one stage of a set of evaluation criteria. Therefore, to identify the main issue for each evaluation criterion as well as the characteristics of specific elements comprising each criterion, users have to read a large number of reviews. In particular, as most of the users search for the characteristics of the detailed elements for one or more specific evaluation criteria based on their priorities, they must spend a great deal of time and effort to obtain the desired information by reading more reviews and understanding the contents of such reviews. 
Although some websites break down the evaluation criteria and direct the user to input their reviews according to different levels of criteria, there exist excessive amounts of input sections that make the whole process inconvenient for the users. Further, problems may arise if a user does not follow the instructions for the input sections or fills in the wrong input sections. Finally, treating the evaluation criteria breakdown as a realistic alternative is difficult, because identifying all the detailed criteria for each evaluation criterion is a challenging task. For example, if a review about a certain hotel has been written, people tend to only write one-stage reviews for various components such as accessibility, rooms, services, or food. These might be the reviews for most frequently asked questions, such as the distance to the nearest subway station or the condition of the bathroom, but they still lack detailed information for these questions. In addition, in case a breakdown of the evaluation criteria was provided along with various input sections, the user might only fill in the evaluation criterion for accessibility or fill in the wrong information such as information regarding rooms in the evaluation criteria for accessibility. Thus, the reliability of the segmented review will be greatly reduced. In this study, we propose an approach to overcome the limitations of the existing leisure activity information websites, namely, (1) the reliability of reviews for each evaluation criterion and (2) the difficulty of identifying the detailed contents that make up the evaluation criteria. In our proposed methodology, we first identify the review content and construct the lexicon for each evaluation criterion by using the terms that are frequently used for each criterion. Next, the sentences in the review documents containing the terms in the constructed lexicon are decomposed into review units, which are then reconstructed by using the evaluation criteria.
Finally, the issues of the constructed review units by evaluation criteria are derived and the summary results are provided. Apart from the derived issues, the review units are also provided. Therefore, this approach aims to help users save time and effort, because they will only be reading the relevant information they need for each evaluation criterion rather than going through the entire text of each review. Our proposed methodology is based on topic modeling, which is being actively used in text analysis. The review is decomposed into sentence units rather than considering the whole review as a document unit. After being decomposed into individual review units, the review units are reorganized according to each evaluation criterion and then used in the subsequent analysis. In this respect, this work largely differs from the existing topic modeling-based studies. In this paper, we collected 423 reviews from hotel information websites and decomposed these reviews into 4,860 review units. We then reorganized the review units according to six different evaluation criteria. By applying our methodology to these review units, we present the analysis results and demonstrate the utility of the proposed methodology.
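The decomposition step described in the abstract above, splitting reviews into sentence-level review units and regrouping them by evaluation criterion through a lexicon, can be sketched as follows. The lexicon contents, the regex-based sentence splitter, and the names `split_sentences` and `group_review_units` are assumptions for illustration; the paper builds its lexicon from terms frequently used for each criterion, and the later topic modeling step is not shown here.

```python
import re
from collections import defaultdict

# Hypothetical lexicon: criterion -> terms that signal it.
LEXICON = {
    "accessibility": {"subway", "station", "airport", "walk"},
    "rooms": {"room", "bed", "bathroom"},
    "food": {"breakfast", "buffet", "restaurant"},
}

def split_sentences(review):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", review)
    return [s.strip() for s in parts if s.strip()]

def group_review_units(reviews, lexicon):
    """Assign each sentence (review unit) to every criterion whose
    lexicon shares at least one term with it."""
    units = defaultdict(list)
    for review in reviews:
        for sentence in split_sentences(review):
            tokens = set(re.findall(r"[a-z]+", sentence.lower()))
            for criterion, terms in lexicon.items():
                if tokens & terms:
                    units[criterion].append(sentence)
    return dict(units)

reviews = [
    "The subway station is a five minute walk. The bathroom was spotless.",
    "Breakfast buffet was great! Staff were friendly.",
]
units = group_review_units(reviews, LEXICON)
```

Sentences that match no lexicon term, such as "Staff were friendly." here, are left out of every criterion; in a fuller pipeline each criterion's units would then be passed to topic modeling to derive its issues.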

Trend Analysis of Sports for All-Related Issues in Early Stage of COVID-19 Using Topic Modeling (토픽 모델링을 활용한 코로나19 초기 생활체육 이슈 분석)

  • Chung, Yunkil;Seo, Sumin;Kang, Hyunmin
    • Journal of Intelligence and Information Systems
    • /
    • v.28 no.3
    • /
    • pp.57-79
    • /
    • 2022
  • COVID-19, which started in December 2019, has had a great impact on our lives in general, including politics, economy, society, and culture, and activities in sports and arts have also been significantly reduced. In the case of sports, the sports for all field, in which ordinary citizens participate, was particularly affected, and cases of infection in places closely related to people's lives, such as gyms and table tennis and badminton clubs, also amplified the social fear of the spread of COVID-19. Therefore, in this study, we analyzed news articles related to sports for all at the time when COVID-19 first spread, and investigated what issues were emerging and being discussed in the sports for all field under the COVID-19 situation. Specifically, we collected news articles dealing with sports for all issues under the COVID-19 situation from Korea's leading portal news sites and identified key sports for all issues by performing topic modeling on these articles. Through the analysis, we found meaningful issues such as COVID-19 outbreaks in sports facilities and support for sports activities. In addition, through word cloud analysis of these major issues, we visualized the issues and identified how they changed over time.

A Study on the Landscape Cognition of Wind Power Plant in Social Media (소셜미디어에 나타난 풍력발전시설의 경관 인식 연구)

  • Woo, Kyung-Sook;Suh, Joo-Hwan
    • Journal of the Korean Institute of Landscape Architecture
    • /
    • v.50 no.5
    • /
    • pp.69-79
    • /
    • 2022
  • This study aims to assess the current understanding of the landscape of wind power facilities as renewable energy sources that supply sightseeing, tourism, and other opportunities. Therefore, social media data related to the landscape of wind power facilities experienced by visitors from different regions was analyzed. The analysis results showed that the common characteristics of the landscape of wind power facilities are based on the scale of wind power facilities, the distance between the wind power facilities and overlook points, the visual openness of the wind power facilities from the overlook points, and the terrain where the wind power facilities are located. In addition, the preference for wind power facilities is higher in places where the shape of wind power facilities and the surrounding landscape can be clearly seen; flat ground or the sea are considered better landscapes. Negative keywords about the landscape appear on Gade Mountain in Taibai, Meifeng Mountain in Taibai, Taiqi Mountain, and Gyeongju Wind Power Generation Facilities on Gyeongshang Road in Gangwon. Negative keywords occur when looking at wind power facilities at close range. Because of the high angle of the view, viewers can feel overwhelmed seeing the size of the facility and the ridge simultaneously, feeling psychological pressure. On the contrary, positive landscape adjectives are obtained from wind power facilities on flat ground or the sea. Visitors think that the visual volume of the landscape is fully ensured on flat ground or the sea, and it is a symbolic element that can represent the site. This study analyzes landscape awareness based on the opinions of visitors who have experienced wind power facilities. However, wind power facilities are built in different areas. Therefore, landscape characteristics are different, and there are many variables, such as viewpoints and observers, so the research results are difficult to generalize and have limitations.
In recent years, landscape damage due to the construction of wind power facilities has become a hot issue, and the domestic methods of landscape evaluation of wind power facilities are unsatisfactory. Therefore, when evaluating the landscape of wind power facilities, the scale of wind power facilities, the inherent natural characteristics of the area where wind power facilities are set up, and the distance between wind power facilities and overlook points are important elements to consider. In addition, wind power facilities are set in the natural environment, which needs to be protected. Therefore, from the landscape perspective, it is necessary to study the landscape of wind power facilities and the surrounding environment.

A Method for Evaluating News Value based on Supply and Demand of Information Using Text Analysis (텍스트 분석을 활용한 정보의 수요 공급 기반 뉴스 가치 평가 방안)

  • Lee, Donghoon;Choi, Hochang;Kim, Namgyu
    • Journal of Intelligence and Information Systems
    • /
    • v.22 no.4
    • /
    • pp.45-67
    • /
    • 2016
  • Given the recent development of smart devices, users are producing, sharing, and acquiring a variety of information via the Internet and social network services (SNSs). Because users tend to use multiple media simultaneously according to their goals and preferences, domestic SNS users use around 2.09 media concurrently on average. Since the information provided by such media is usually textually represented, recent studies have been actively conducting textual analysis in order to understand users more deeply. Earlier studies using textual analysis focused on analyzing a document's contents without substantive consideration of the diverse characteristics of the source medium. However, current studies argue that analytical and interpretive approaches should be applied differently according to the characteristics of a document's source. Documents can be classified into the following types: informative documents for delivering information, expressive documents for expressing emotions and aesthetics, operational documents for inducing the recipient's behavior, and audiovisual media documents for supplementing the above three functions through images and music. Further, documents can be classified according to their contents, which comprise facts, concepts, procedures, principles, rules, stories, opinions, and descriptions. Documents have unique characteristics according to the source media by which they are distributed. In terms of newspapers, only highly trained people tend to write articles for public dissemination. In contrast, with SNSs, various types of users can freely write any message and such messages are distributed in an unpredictable way. Again, in the case of newspapers, each article exists independently and does not tend to have any relation to other articles. However, messages (original tweets) on Twitter, for example, are highly organized and regularly duplicated and repeated through replies and retweets. 
There have been many studies focusing on the different characteristics between newspapers and SNSs. However, it is difficult to find a study that focuses on the difference between the two media from the perspective of supply and demand. We can regard the articles of newspapers as a kind of information supply, whereas messages on various SNSs represent a demand for information. By investigating traditional newspapers and SNSs from the perspective of supply and demand of information, we can explore and explain the information dilemma more clearly. For example, there may be superfluous issues that are heavily reported in newspaper articles despite the fact that users seldom have much interest in these issues. Such overproduced information is not only a waste of media resources but also makes it difficult to find valuable, in-demand information. Further, some issues that are covered by only a few newspapers may be of high interest to SNS users. To alleviate the deleterious effects of information asymmetries, it is necessary to analyze the supply and demand of each information source and, accordingly, provide information flexibly. Such an approach would allow the value of information to be explored and approximated on the basis of the supply-demand balance. Conceptually, this is very similar to the price of goods or services being determined by the supply-demand relationship. Adopting this concept, media companies could focus on the production of highly in-demand issues that are in short supply. In this study, we selected Internet news sites and Twitter as representative media for investigating information supply and demand, respectively. We present the notion of News Value Index (NVI), which evaluates the value of news information in terms of the magnitude of Twitter messages associated with it. In addition, we visualize the change of information value over time using the NVI. We conducted an analysis using 387,014 news articles and 31,674,795 Twitter messages. 
The analysis results revealed interesting patterns: most issues show a lower NVI than the average across all issues, whereas a few issues show a steadily higher NVI than the average.
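The abstract does not give the formula for the News Value Index, so the following is only a hypothetical stand-in that captures the supply-demand idea: it divides each issue's share of Twitter messages (demand) by its share of news articles (supply). The function name `news_value_index`, the ratio definition, and the counts are all assumptions for illustration and should not be read as the authors' NVI.

```python
def news_value_index(articles, tweets):
    """Hypothetical demand/supply ratio per issue.

    articles: {issue: number of news articles}   (supply)
    tweets:   {issue: number of Twitter messages} (demand)
    Values above 1 suggest demand exceeds supply for that issue.
    """
    total_articles = sum(articles.values())
    total_tweets = sum(tweets.values())
    nvi = {}
    for issue, n_articles in articles.items():
        if n_articles == 0 or total_tweets == 0:
            continue  # ratio undefined without supply or any demand data
        supply = n_articles / total_articles
        demand = tweets.get(issue, 0) / total_tweets
        nvi[issue] = demand / supply
    return nvi

# Made-up counts, not the study's 387,014 articles or 31,674,795 tweets.
articles = {"issue_a": 500, "issue_b": 400, "issue_c": 100}
tweets = {"issue_a": 2000, "issue_b": 2000, "issue_c": 6000}
nvi = news_value_index(articles, tweets)
```

With these made-up counts, the two heavily reported issues score below 1 while the sparsely reported `issue_c` scores far above it, the kind of in-demand, undersupplied issue the abstract suggests media companies could focus on.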