• Title/Summary/Keyword: Web data mining

Building Hierarchical Knowledge Base of Research Interests and Learning Topics for Social Computing Support (소셜 컴퓨팅을 위한 연구·학습 주제의 계층적 지식기반 구축)

  • Kim, Seonho;Kim, Kang-Hoe;Yeo, Woondong
    • The Journal of the Korea Contents Association / v.12 no.12 / pp.489-498 / 2012
  • This paper consists of two parts. In the first part, we describe our work to build a hierarchical knowledge base of digital library patrons' research interests and learning topics in various scholarly areas by analyzing well-classified Electronic Theses and Dissertations (ETDs) from the NDLTD Union Catalog. Journal articles from ACM Transactions and conference web sites in computing areas are also added to the analysis to refine the computing fields. This hierarchical knowledge base would be a useful tool for many social computing and information service applications, such as personalization, recommender systems, text mining, technology opportunity mining, and information visualization. In the second part, we compare four grouping algorithms to select the best one for our data mining research by testing each with the hierarchical knowledge base described in the first part. From these two studies, we intend to show that traditional verification methods for social community mining research, which are based on interviews and questionnaires and are expensive, slow, and privacy threatening, can be replaced with systematic, consistent, fast, and privacy-protecting methods using our suggested hierarchical knowledge base.

The Image of Ruralism in Korea through a Text Mining for Online News Media analysis (인터넷 뉴스 데이터 텍스트 분석을 통해 본 우리나라 농촌다움에 대한 이미지 연구)

  • Son, Yong-hoon;Kim, Young-jin
    • Journal of Korean Society of Rural Planning / v.25 no.4 / pp.13-26 / 2019
  • Rural areas in South Korea have changed rapidly in the course of national land development. Rural landscapes have become discoloured, and their attractiveness has decreased as cities have expanded. Yet the attractiveness and multifunctional values of rural areas have become more important in contemporary society around the world. In response to this social demand, conserving the rural landscape is a high priority, and the recovery of ruralism in these areas is required. This study tried to understand how the public image of ruralism in South Korea has been influenced by the news media. The study retrieved news articles through a web search portal using six keywords commonly used to refer to ruralism: 'rural landscape', 'rural community', 'rural tourism', 'rural life', 'rural amenity', and 'rural environment'. News data for each of the six keywords were collected for the periods 2004-05, 2007-08, 2012-13, and 2016-17. In the text mining analysis, the nouns with high degree centrality were identified, and the changes across periods were examined. Then, LDA topic analysis was performed on the text datasets of the six keywords. As a result, the study found that the news articles focused on only a handful of issues, such as 'poor rural living conditions', 'regional or village improvement projects', 'rural tourism promotion projects', and 'other government support projects'. On the other hand, nouns related to virtues and values in the rural landscape appeared less often in news articles. These tendencies have become more apparent in recent years. In the topic analysis, 35 topics were identified. 'Village development projects', 'rural tourism', and 'urban-rural exchange projects' appeared repeatedly across several keywords. Among the topics, there are also topics closely related to ruralism, such as 'rural landscape conservation', 'eco-friendly rural areas', 'local amenity resources', 'public interest values of agriculture', and 'rural life and communities'. The study presented an image map showing ruralism in South Korea using a network map between all topics and keywords. At the end of the study, implications for Korean rural area policy and research directions were discussed.
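
The degree centrality step mentioned in this abstract can be illustrated with a minimal Python sketch. It assumes a noun co-occurrence network in which two nouns are linked when they appear in the same article. The abstract does not say how the network was built, so this construction, the function name `degree_centrality`, and the toy noun lists are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def degree_centrality(docs):
    """Return {noun: normalized degree} for a noun co-occurrence network.

    docs is a list of noun lists, one per article. Two nouns are linked
    when they occur in the same article (an assumption of this sketch).
    Nouns that never co-occur with another noun are left out.
    """
    neighbors = defaultdict(set)
    for nouns in docs:
        for a, b in combinations(sorted(set(nouns)), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {noun: 0.0 for noun in neighbors}
    # Normalize by the maximum possible degree, n - 1.
    return {noun: len(adj) / (n - 1) for noun, adj in neighbors.items()}


# Toy data, not taken from the study.
docs = [
    ["village", "project", "tourism"],
    ["village", "landscape"],
    ["project", "government"],
]
centrality = degree_centrality(docs)
for noun, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(noun, round(score, 2))
```

Comparing such scores between the collection periods is one way the changes across periods could be tracked.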

Latent topics-based product reputation mining (잠재 토픽 기반의 제품 평판 마이닝)

  • Park, Sang-Min;On, Byung-Won
    • Journal of Intelligence and Information Systems / v.23 no.2 / pp.39-70 / 2017
  • Data-driven analytics techniques have recently been applied to public surveys. Instead of simply gathering survey results or expert opinions to research the preference for a recently launched product, enterprises need a way to collect and analyze various types of online data and then accurately determine customer preferences. In existing data-based survey methods, the sentiment lexicon for a particular domain is first constructed by domain experts, who usually judge the positive, neutral, or negative meanings of frequently used words in the collected text documents. To research the preference for a particular product, the existing approach (1) collects review posts related to the product from several product review web sites; (2) extracts sentences (or phrases) from the collection after pre-processing steps such as stemming and stop-word removal; (3) classifies the polarity (either positive or negative) of each sentence (or phrase) based on the sentiment lexicon; and (4) estimates the positive and negative ratios of the product by dividing the total numbers of positive and negative sentences (or phrases) by the total number of sentences (or phrases) in the collection. Furthermore, the existing approach automatically finds important sentences (or phrases) that carry positive or negative meaning about the product. As a motivating example, given a product like the Sonata made by Hyundai Motors, customers often want to see a summary note that includes the positive points in the 'car design' aspect as well as the negative points in the same aspect. They also want more useful information regarding other aspects such as 'car quality', 'car performance', and 'car service'. Such information will enable customers to make good choices when they purchase brand-new vehicles. In addition, automobile makers will be able to determine the preference for, and the positive and negative points of, new models on the market. In the near future, the weak points of the models can be improved on the basis of the sentiment analysis. For this, the existing approach computes the sentiment score of each sentence (or phrase) and then selects the top-k sentences (or phrases) with the highest positive and negative scores. However, the existing approach has several shortcomings that limit its use in real applications. Its main disadvantages are as follows: (1) The main aspects (e.g., car design, quality, performance, and service) of a product (e.g., Hyundai Sonata) are not considered. Because the sentiment analysis ignores aspects, customers and car makers receive only a summary note containing the positive and negative ratios of the product and the top-k sentences (or phrases) with the highest sentiment scores in the entire corpus. This is not enough; the main aspects of the target product need to be considered in the sentiment analysis. (2) In general, since the same word has different meanings across different domains, a sentiment lexicon suited to each domain needs to be constructed. An efficient way to construct the sentiment lexicon per domain is required because sentiment lexicon construction is labor intensive and time consuming. To address the above problems, in this article we propose a novel product reputation mining algorithm that (1) extracts topics hidden in review documents written by customers; (2) mines main aspects based on the extracted topics; (3) measures the positive and negative ratios of the product using the aspects; and (4) presents a digest in which a few important sentences with positive and negative meanings are listed for each aspect. Unlike the existing approach, using hidden topics lets experts construct the sentiment lexicon easily and quickly. Furthermore, by reinforcing topic semantics, we can improve the accuracy of the product reputation mining algorithm beyond that of the existing approach. In the experiments, we collected a large set of review documents on domestic vehicles such as the K5, SM5, and Avante; measured the positive and negative ratios of the three cars; showed top-k positive and negative summaries per aspect; and conducted statistical analysis. Our experimental results clearly show the effectiveness of the proposed method compared with the existing method.
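
Steps (3) and (4) of the existing lexicon-based approach described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny lexicon, the whitespace tokenization, the tie-handling rule, and the function name `polarity_ratios` are hypothetical, and the sketch does not reproduce the topic-based method the paper proposes.

```python
def polarity_ratios(sentences, positive_words, negative_words):
    """Return (positive_ratio, negative_ratio) over all sentences.

    A sentence counts as positive when it contains more positive than
    negative lexicon words, and as negative in the opposite case. Ties
    and sentences with no lexicon words count toward neither ratio
    (an assumption of this sketch).
    """
    positive = negative = 0
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        pos_hits = len(tokens & positive_words)
        neg_hits = len(tokens & negative_words)
        if pos_hits > neg_hits:
            positive += 1
        elif neg_hits > pos_hits:
            negative += 1
    total = len(sentences)
    if total == 0:
        return 0.0, 0.0
    return positive / total, negative / total


# Hypothetical lexicon and review sentences, for illustration only.
positive_words = {"sleek", "great"}
negative_words = {"noisy", "poor"}
sentences = [
    "the design is sleek",
    "the engine is noisy",
    "delivery took a week",
    "great mileage but poor service desk",
]
pos_ratio, neg_ratio = polarity_ratios(sentences, positive_words, negative_words)
print(pos_ratio, neg_ratio)
```

Running the same function separately on the sentences assigned to each aspect would give per-aspect ratios, which is the gap in the existing approach that the abstract points out.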

A Study on the Open Platform Architecture for the Integrated Utilization of Spatial Information and Statistics (공간정보와 통계정보의 융합 활용을 위한 오픈플랫폼 아키텍처에 관한 연구)

  • Kim, Min-Soo;Yoo, Jeong-Ki
    • Journal of Cadastre & Land InformatiX / v.46 no.2 / pp.211-224 / 2016
  • Under 'Government 3.0', the government opens public data and encourages its active use in the private sector. Recently, spatial and statistical information, which is one kind of public data, has been widely used in various web businesses as high value-added information. In this study, we propose an architecture for a high-availability, high-reliability, and high-performance open platform that can provide a variety of services such as searching, analysis, data mining, and thematic mapping. In particular, we present two different system architectures, one for government services and one for public services, reflecting the importance of information security and the respective utilization in the private and public sectors. We also compared a variety of server architecture configurations, such as a clustered server configuration, a cloud-based virtual server configuration, and a CDN server configuration, in order to design a cost- and performance-effective open platform for spatial-statistical information.

The Development of GIS-based Small Hydropower Package Tool (GIS기반 소수력 Package Tool 개발)

  • Heo, June-Ho;Jeong, Sang-Man;Park, Wan-Soon;Lee, Chul-Hyung
    • Proceedings of the Korean Solar Energy Society Conference (한국태양에너지학회:학술대회논문집) / 2009.04a / pp.139-144 / 2009
  • Compared with other development methods, small hydropower generation is one of the clean energy sources. Accordingly, various application systems are being developed with IT techniques for advanced data mining of small hydropower energy resources. However, the existing data analysis of the New & Renewable Information System for small hydropower resources is incomplete, which limits how this information can be presented on the Web. Thus, for active use of small hydropower resources, a more systematic and precise analysis system should be built. This study seeks to develop a map of domestic small hydropower resources, in order to further improve their use, through a Package Tool that can accurately evaluate a wide range of small hydropower basins in a short period of time. The existing analysis system for small hydropower resources did not provide diverse capabilities; the Package Tool was therefore applied to 840 standard basins classified by A, facility capacity, etc., and the expected annual electricity production was calculated assuming a 40% annual capacity. The small hydropower resource potential calculated for the national water system will be used as basic data for the development of small hydropower.
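
The annual production estimate mentioned in this abstract can be illustrated with a short Python sketch. It reads the "40% annual capacity" as a capacity factor, which is an interpretation and not something the abstract states. The function name and the 1,000 kW example plant are hypothetical.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def expected_annual_energy_kwh(capacity_kw, capacity_factor=0.40):
    """Estimate annual electricity production in kWh.

    energy = installed capacity (kW) x hours per year x capacity factor.
    The 0.40 default reflects one reading of the abstract's 40% figure.
    """
    if capacity_kw < 0 or not 0.0 <= capacity_factor <= 1.0:
        raise ValueError("capacity must be >= 0 and factor within [0, 1]")
    return capacity_kw * HOURS_PER_YEAR * capacity_factor


# Hypothetical 1,000 kW plant: roughly 3.5 million kWh per year.
energy = expected_annual_energy_kwh(1000.0)
print(f"{energy:,.0f} kWh")
```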

A Scalable Clustering Method for Categorical Sequences (범주형 시퀀스들에 대한 확장성 있는 클러스터링 방법)

  • Oh, Seung-Joon;Kim, Jae-Yearn
    • Journal of the Korean Institute of Intelligent Systems / v.14 no.2 / pp.136-141 / 2004
  • There has been enormous growth in the amount of commercial and scientific data, such as retail transactions, protein sequences, and web-logs. Such datasets consist of sequence data that have an inherent sequential nature. However, few clustering algorithms consider sequentiality. In this paper, we study how to cluster sequence datasets. We propose a new similarity measure to compute the similarity between two sequences. We also present an efficient method for determining the similarity measure and develop a clustering algorithm. Due to the high computational complexity of hierarchical clustering algorithms for clustering large datasets, a new clustering method is required. Therefore, we propose a new scalable clustering method using sampling and a k-nearest-neighbor method. Using a real dataset and a synthetic dataset, we show that the quality of clusters generated by our proposed approach is better than that of clusters produced by traditional algorithms.
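
The general idea of combining sampling with a k-nearest-neighbor step can be sketched in Python as follows. This is not the authors' algorithm: the abstract does not define their similarity measure, so `difflib.SequenceMatcher` is used as a stand-in, and the single-link grouping of the sample, the threshold, and all function names are assumptions made for illustration.

```python
import random
from collections import Counter
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in measure (assumption): difflib's matching ratio in [0, 1].
    return SequenceMatcher(None, a, b).ratio()


def cluster_sample(sample, threshold):
    """Single-link grouping: merge sequences with similarity >= threshold."""
    labels = list(range(len(sample)))
    for i in range(len(sample)):
        for j in range(i + 1, len(sample)):
            if similarity(sample[i], sample[j]) >= threshold:
                old, new = labels[j], labels[i]
                labels = [new if label == old else label for label in labels]
    return labels


def assign_by_knn(seq, sample, labels, k):
    """Give seq the majority label of its k most similar sampled sequences."""
    ranked = sorted(range(len(sample)),
                    key=lambda i: similarity(seq, sample[i]), reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def scalable_cluster(sequences, sample_size, threshold, k, seed=0):
    """Cluster a random sample, then label the remaining sequences by k-NN."""
    rng = random.Random(seed)
    sample_idx = rng.sample(range(len(sequences)), sample_size)
    sample = [sequences[i] for i in sample_idx]
    sample_labels = cluster_sample(sample, threshold)
    result = dict(zip(sample_idx, sample_labels))
    for i, seq in enumerate(sequences):
        if i not in result:
            result[i] = assign_by_knn(seq, sample, sample_labels, k)
    return [result[i] for i in range(len(sequences))]


# Toy categorical sequences, for illustration only.
toy = ["abcab", "abcac", "xyzxy", "xyzxz"]
toy_labels = cluster_sample(toy, threshold=0.7)
new_label = assign_by_knn("abcaa", toy, toy_labels, k=3)
print(toy_labels, new_label)
```

Only the sample is compared pairwise, so the quadratic cost applies to the sample size and not to the full dataset; each remaining sequence needs one pass over the sample.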

Development of Prototype for Screening Anti-Inflammation Effects concerning p38 MAPK Signal Pathway (p38 MAPK을 이용한 항염증 효능 규명 프로토타입 개발)

  • Kim, Chul;Yae, Sang-Jun;Nam, Ky-Youb;Kim, Sang-Kyun;Jang, Hyun-Chul;Kim, Jin-Hyun;Kim, Young-Eun;Song, Mi-Young
    • Korean Journal of Oriental Medicine / v.17 no.3 / pp.77-85 / 2011
  • Objectives: The purpose of this study was to develop a simulator, based on e-cell, that can analyze the anti-inflammatory effects of medical herbs with respect to the p38 MAPK signal pathway. Methods: We collected data on medical herbs with anti-inflammatory effects and their active compounds to serve as a fundamental database and to validate the newly developed algorithm. We used PubMed as the target database and gathered the data with a data mining tool, Pathway Studio. We also developed a web-based search system for consulting the anti-inflammation database. We studied the mechanisms of action of proteins in the p38 MAPK signal pathway when an active compound is inserted into the network. We reduced the total network to TAK-MKK3-p38 and built two types of mathematical models of the active compounds' interactions. Results & Conclusion: We constructed a database that has 69 cases of medical herbs, 71 cases of active compounds, and about 8,000 URLs (Uniform Resource Locators) related to papers and reports. We designed ordinary differential equations for the response of TAK, MKK3, and p38 in the e-cell's cytosol and nucleus. We used this formula to measure whether an active compound of a medicinal plant entered by a user would have an anti-inflammatory effect. We developed a visualization program that shows the change of concentration over time.

An Analysis of Keywords Related to Neighborhood Healing Gardens Using Big Data (빅데이터를 활용한 생활밀착형 치유정원 연관키워드 분석)

  • Huang, Zhirui;Lee, Ai-Ran
    • Land and Housing Review / v.13 no.2 / pp.81-90 / 2022
  • This study is based on social needs for green healing spaces assumed to enhance mental health in a city. This study proposes development directions through the analysis of modern social recognition factors for neighborhood gardens. As a research method, web information data was collected using Textom among big data tools. Text Mining was conducted to extract elements and analyze their relationship through keyword analysis, network analysis, and cluster analysis. As a result, first, the healing space and the healing environment were creating an eco-friendly healthy environment in a space close to the neighborhood within the city. Second, neighborhood gardens included projects and activities that involved government, local administration, and citizens by linking facilities as well as living culture and urban environments. These gardens have been reinforced through green welfare and service programs. In conclusion, friendly gardens in the neighborhood for the purpose of public interest, which are beneficial to mental health, are green infrastructures as a healing environment that can produce positive effects.

An Efficient Estimation of Place Brand Image Power Based on Text Mining Technology (텍스트마이닝 기반의 효율적인 장소 브랜드 이미지 강도 측정 방법)

  • Choi, Sukjae;Jeon, Jongshik;Subrata, Biswas;Kwon, Ohbyung
    • Journal of Intelligence and Information Systems / v.21 no.2 / pp.113-129 / 2015
  • Location branding is a very important income-generating activity: it gives special meaning to a specific location while producing identity and communal value, based on an understanding of the place's location branding concept and methodology. Many other areas, such as marketing, architecture, and city construction, exert an influence on creating an impressive brand image. A place brand that is well recognized by both native people of S. Korea and foreigners creates significant economic effects. There has been research on creating a strategic and detailed place brand image, and the representative research was carried out by Anholt, who surveyed two million people from 50 different countries. However, such investigation, including survey research, required a great deal of workforce effort and significant expense. As a result, there is a need for more affordable, objective, and effective research methods. The purpose of this paper is to find a way to measure the intensity of a brand image objectively and at low cost through text mining. The proposed method extracts the keywords and the factors constructing the location brand image from related web documents. In this way, we can measure the brand image intensity of a specific location. The performance of the proposed methodology was verified through comparison with Anholt's image consistency index ranking of 50 cities around the world. Four methods are applied to the test. First, the RANDOM method ranks the cities included in the experiment arbitrarily. The HUMAN method first prepares a questionnaire and selects 9 volunteers who are well acquainted with brand management and, at the same time, with the cities to evaluate. They are then asked to rank the cities, and the rankings are compared with Anholt's evaluation results. The TM method applies the proposed method to evaluate the cities with all evaluation criteria. TM-LEARN, an extension of TM, selects significant evaluation items from the items in every criterion. The method then evaluates the cities with all selected evaluation criteria. RMSE is used as the metric to compare the evaluation results. The experimental results for the proposed methodology are as follows. First, the method appeared to be more accurate than the evaluation method that targets ordinary people. Second, compared to the traditional survey method, the time and cost are much lower because we used automated means. Third, the proposed methodology is very timely because the evaluation can be repeated from time to time. Fourth, compared to Anholt's method, which evaluated only already specified cities, the proposed methodology is applicable to any location. Finally, the proposed methodology has relatively high objectivity because our research was conducted on open source data. As a result, our text mining approach to city image evaluation has been found valid in terms of accuracy, cost-effectiveness, timeliness, scalability, and reliability. The proposed method provides managers with clear guidelines regarding brand management in the public and private sectors. In the public sector, local officers could use the proposed method to formulate strategies and enhance the image of their places in an efficient manner. Rather than conducting heavy questionnaires, local officers could quickly monitor the current place image in advance, and then decide to carry out the formal place image test only if the evaluation results from the proposed method are out of the ordinary, whether the results indicate an opportunity or a threat to the place. Moreover, by combining the proposed method with morphological analysis, extraction of meaningful facets of the place brand from text, sentiment analysis, and more, marketing strategy planners or civil engineering professionals may obtain deeper and richer insights for better place brand images. In the future, a prototype system will be implemented to show the feasibility of the idea proposed in this paper.
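
The RMSE comparison mentioned in this abstract can be sketched in Python as follows. The abstract only states that RMSE was the metric, so treating the inputs as rank positions is an assumption, and the four-city rankings below are invented for illustration and are not results from the paper.

```python
import math


def rmse(predicted, reference):
    """Root mean squared error between two equal-length numeric lists."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical rank positions for four cities.
reference_ranks = [1, 2, 3, 4]   # stands in for the benchmark ranking
method_ranks = [2, 1, 3, 4]      # stands in for one method's ranking
error = rmse(method_ranks, reference_ranks)
print(round(error, 4))
```

A lower RMSE means the ranking produced by a method is closer to the benchmark ranking.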

The Effect of Expert Reviews on Consumer Product Evaluations: A Text Mining Approach (전문가 제품 후기가 소비자 제품 평가에 미치는 영향: 텍스트마이닝 분석을 중심으로)

  • Kang, Taeyoung;Park, Do-Hyung
    • Journal of Intelligence and Information Systems / v.22 no.1 / pp.63-82 / 2016
  • Individuals gather information online to resolve problems in their daily lives and make various decisions about the purchase of products or services. With the revolutionary development of information technology, Web 2.0 has allowed more people to easily generate and use online reviews such that the volume of information is rapidly increasing, and the usefulness and significance of analyzing the unstructured data have also increased. This paper presents an analysis on the lexical features of expert product reviews to determine their influence on consumers' purchasing decisions. The focus was on how unstructured data can be organized and used in diverse contexts through text mining. In addition, diverse lexical features of expert reviews of contents provided by a third-party review site were extracted and defined. Expert reviews are defined as evaluations by people who have expert knowledge about specific products or services in newspapers or magazines; this type of review is also called a critic review. Consumers who purchased products before the widespread use of the Internet were able to access expert reviews through newspapers or magazines; thus, they were not able to access many of them. Recently, however, major media also now provide online services so that people can more easily and affordably access expert reviews compared to the past. The reason why diverse reviews from experts in several fields are important is that there is an information asymmetry where some information is not shared among consumers and sellers. The information asymmetry can be resolved with information provided by third parties with expertise to consumers. Then, consumers can read expert reviews and make purchasing decisions by considering the abundant information on products or services. Therefore, expert reviews play an important role in consumers' purchasing decisions and the performance of companies across diverse industries. 
If the influence of qualitative data such as reviews or assessment after the purchase of products can be separately identified from the quantitative data resources, such as the actual quality of products or price, it is possible to identify which aspects of product reviews hamper or promote product sales. Previous studies have focused on the characteristics of the experts themselves, such as the expertise and credibility of sources regarding expert reviews; however, these studies did not suggest the influence of the linguistic features of experts' product reviews on consumers' overall evaluation. However, this study focused on experts' recommendations and evaluations to reveal the lexical features of expert reviews and whether such features influence consumers' overall evaluations and purchasing decisions. Real expert product reviews were analyzed based on the suggested methodology, and five lexical features of expert reviews were ultimately determined. Specifically, the "review depth" (i.e., degree of detail of the expert's product analysis), and "lack of assurance" (i.e., degree of confidence that the expert has in the evaluation) have statistically significant effects on consumers' product evaluations. In contrast, the "positive polarity" (i.e., the degree of positivity of an expert's evaluations) has an insignificant effect, while the "negative polarity" (i.e., the degree of negativity of an expert's evaluations) has a significant negative effect on consumers' product evaluations. Finally, the "social orientation" (i.e., the degree of how many social expressions experts include in their reviews) does not have a significant effect on consumers' product evaluations. In summary, the lexical properties of the product reviews were defined according to each relevant factor. Then, the influence of each linguistic factor of expert reviews on the consumers' final evaluations was tested. 
In addition, a test was performed on whether each linguistic factor influencing consumers' product evaluations differs depending on the lexical features. The results of these analyses should provide guidelines on how individuals process massive volumes of unstructured data depending on lexical features in various contexts and how companies can use this mechanism from their perspective. This paper provides several theoretical and practical contributions, such as the proposal of a new methodology and its application to real data.