• Title/Summary/Keyword: Web data mining

Search Result 409, Processing Time 0.029 seconds

Analysis of Twitter for 2012 South Korea Presidential Election by Text Mining Techniques (텍스트 마이닝을 이용한 2012년 한국대선 관련 트위터 분석)

  • Bae, Jung-Hwan;Son, Ji-Eun;Song, Min
    • Journal of Intelligence and Information Systems
    • /
    • v.19 no.3
    • /
    • pp.141-156
    • /
    • 2013
  • Social media is a representative form of the Web 2.0 that shapes the change of a user's information behavior by allowing users to produce their own contents without any expert skills. In particular, as a new communication medium, it has a profound impact on the social change by enabling users to communicate with the masses and acquaintances their opinions and thoughts. Social media data plays a significant role in an emerging Big Data arena. A variety of research areas such as social network analysis, opinion mining, and so on, therefore, have paid attention to discover meaningful information from vast amounts of data buried in social media. Social media has recently become main foci to the field of Information Retrieval and Text Mining because not only it produces massive unstructured textual data in real-time but also it serves as an influential channel for opinion leading. But most of the previous studies have adopted broad-brush and limited approaches. These approaches have made it difficult to find and analyze new information. To overcome these limitations, we developed a real-time Twitter trend mining system to capture the trend in real-time processing big stream datasets of Twitter. The system offers the functions of term co-occurrence retrieval, visualization of Twitter users by query, similarity calculation between two users, topic modeling to keep track of changes of topical trend, and mention-based user network analysis. In addition, we conducted a case study on the 2012 Korean presidential election. We collected 1,737,969 tweets which contain candidates' name and election on Twitter in Korea (http://www.twitter.com/) for one month in 2012 (October 1 to October 31). The case study shows that the system provides useful information and detects the trend of society effectively. The system also retrieves the list of terms co-occurred by given query terms. We compare the results of term co-occurrence retrieval by giving influential candidates' name, 'Geun Hae Park', 'Jae In Moon', and 'Chul Su Ahn' as query terms. General terms which are related to presidential election such as 'Presidential Election', 'Proclamation in Support', Public opinion poll' appear frequently. Also the results show specific terms that differentiate each candidate's feature such as 'Park Jung Hee' and 'Yuk Young Su' from the query 'Guen Hae Park', 'a single candidacy agreement' and 'Time of voting extension' from the query 'Jae In Moon' and 'a single candidacy agreement' and 'down contract' from the query 'Chul Su Ahn'. Our system not only extracts 10 topics along with related terms but also shows topics' dynamic changes over time by employing the multinomial Latent Dirichlet Allocation technique. Each topic can show one of two types of patterns-Rising tendency and Falling tendencydepending on the change of the probability distribution. To determine the relationship between topic trends in Twitter and social issues in the real world, we compare topic trends with related news articles. We are able to identify that Twitter can track the issue faster than the other media, newspapers. The user network in Twitter is different from those of other social media because of distinctive characteristics of making relationships in Twitter. Twitter users can make their relationships by exchanging mentions. We visualize and analyze mention based networks of 136,754 users. We put three candidates' name as query terms-Geun Hae Park', 'Jae In Moon', and 'Chul Su Ahn'. The results show that Twitter users mention all candidates' name regardless of their political tendencies. This case study discloses that Twitter could be an effective tool to detect and predict dynamic changes of social issues, and mention-based user networks could show different aspects of user behavior as a unique network that is uniquely found in Twitter.

Consumers Perceptions on Sodium Saccharin in Social Media (소셜미디어 분석을 통한 삭카린나트륨 소비자 인식 조사)

  • Lee, Sooyeon;Lee, Wonsung;Moon, Il-Chul;Kwon, Hoonjeong
    • Journal of Food Hygiene and Safety
    • /
    • v.30 no.4
    • /
    • pp.329-342
    • /
    • 2015
  • The purpose of this study was to investigate consumers' perceptions of sodium saccharin in social media. Data was collected from Naver blogs and Naver web communities (Korean representative portal web-site), and media reports including comment sections on a Yonhap news website (Korean largest news agency). The results from Naver blogs and Naver web communities showed that it was primarily mentioned 'sodium saccharin-no added' products, properties of sodium saccharin, and methods of reducing sodium saccharin in food. When media reported the expansion of food categories permitted to use sodium saccharin, search volume for sodium saccharin has increased in both PC and mobile search engines. Also, it was mainly commented about distrust of government, criticism of food product price, and distrust of food companies below the news on the news site. The label of sodium saccharin-no added products in market emphasized "no added-sodium saccharin". These results suggest that consumers are interested in sodium saccharin and especially when media reported the expansion of food categories permitted to use it. Consumers were able to search various information on sodium saccharin except safety or acceptable daily intake through social media. Therefore media or competent authority should report item on sodium saccharin with information including safety or acceptable daily intake based on scientific background and reference or experts' interview for consumers to get reliable information.

A Case Study of e-Business Implementation in Part Manufacturing Industry(B2B in PCB Industry) (부품 제조 산업에서의 e-Business 구축 사례(PCB 산업의 B2B))

  • Bae, Joon-Soo;Bae, Eun-Hae;Cheong, Min-Chang;Shin, In-Ki;Park, Young-Chul
    • IE interfaces
    • /
    • v.13 no.3
    • /
    • pp.503-511
    • /
    • 2000
  • The main theme of this research is a case of e-Business implementation in part manufacturing industry, especially in a PCB manufacturing company. The characteristics of part manufacturing industry are as follows. First, an ERP system runs as a legacy system that is ready to be combined with e-Business system. Secondly, the number of customers is very small. The customers are not many individuals but only a few big electronic enterprises that are strategically affiliated with the part manufacturing company. This means that the e-Business of the part manufacturing industry needs to focus on sharing pertinent information throughout the transactions with the customers, not on data-warehousing or data-mining customers' potential needs or requests. In this paper, we extracted e-Business opportunity domains from a PCB manufacturing company, a typical part manufacturing industry. We are intended to enhance information sharing between customers and the company, and provide functions of transactions necessary in the whole value chain from order to shipment. Implementing the e-Business system on the Web can increase the visibility of customers, and further, the company can be transformed into an extended enterprise where the relationship with the customers becomes very close and interleaved. Also, the Cyber Office functionality of the e-Business system can support the salespersons effectively, so that they can spend more time on customer satisfaction. Such efforts, in the future, can be a basis for active adaptation to the industry transformations such as forming e-community and participating in the marketplace.

  • PDF

Construction of the Digital Archive System from the Records of Westerners Who Stayed in Korea during the Enlightenment Period of Chosun (개화기 조선 체류 서양인 기록물의 디지털 아카이브 시스템 구축)

  • Chung, Heesun;Kim, Heesoon;Song, Hyun-Sook;Lee, Myeong-Hee
    • Journal of the Korean BIBLIA Society for library and Information Science
    • /
    • v.27 no.4
    • /
    • pp.229-249
    • /
    • 2016
  • This study was conducted to create a digital archive for local cultural contents compiled from the records of westerners who stayed in Korea during the Enlightenment Period of Chosun. The compiled information were gathered from 22 records, and 10 main subjects, 40 sub-subjects and 239 mini-subjects were derived through the subject classification scheme. Item analysis was conducted through 38 metadata and input data types were classified and databased in Excel. Finally, a web-based digital archiving system was developed for searching and providing information through various access points. Suggestions for future research were made to expand archive contents through continuous excavation of westerners' records, to build an integrated information system of Korean digital archives incorporating individual archive systems, to develop standardization of classification schemes and a multidimensional classification system considering facet structure in cultural heritage areas, to keep consistency of contents through standardization of metadata format, and to build ontology using semantic search functions and data mining functions.

The Korean HapMap Project Website

  • Kim, Young-Uk;Kim, Seung-Ho;Jin, Hoon;Park, Young-Kyu;Ji, Mi-Hyun;Kim, Young-Joo
    • Genomics & Informatics
    • /
    • v.6 no.2
    • /
    • pp.91-94
    • /
    • 2008
  • Single nucleotide polymorphisms (SNPs) are the most abundant form of human genetic variation and are a resource for mapping complex genetic traits. A genome is covered by millions of these markers, and researchers are able to compare which SNPs predominate in people who have a certain disease. The International HapMap Project, launched in October, 2002, motivated us to start the Korean HapMap Project in order to support Korean HapMap infrastructure development and to accelerate the finding of genes that affect health, disease, and individual responses to medications and environmental factors. A Korean SNP and haplotype database system was developed through the Korean HapMap Project to provide Korean researchers with useful data-mining information about disease-associated biomarkers for studies on complex diseases, such as diabetes, cancer, and stroke. Also, we have developed a series of software programs for association studies as well as the comparison and analysis of Korean HapMap data with other populations, such as European, Chinese, Japanese, and African populations. The developed software includes HapMapSNPAnalyzer, SNPflank, HWE Test, FESD, D2GSNP, SNP@Domain, KMSD, KFOD, KFRG, and SNP@WEB. We developed a disease-related SNP retrieval system, in which OMIM, GeneCards, and MeSH information were integrated and analyzed for medical research scientists. The kHapMap Browser system that we developed and integrated provides haplotype retrieval and comparative study tools of human ethnicities for comprehensive disease association studies (http://www.khapmap.org). It is expected that researchers may be able to retrieve useful information from the kHapMap Browser to find useful biomarkers and genes in complex disease association studies and use these biomarkers and genes to study and develop new drugs for personalized medicine.

A Literature Review and Classification of Recommender Systems on Academic Journals (추천시스템관련 학술논문 분석 및 분류)

  • Park, Deuk-Hee;Kim, Hyea-Kyeong;Choi, Il-Young;Kim, Jae-Kyeong
    • Journal of Intelligence and Information Systems
    • /
    • v.17 no.1
    • /
    • pp.139-152
    • /
    • 2011
  • Recommender systems have become an important research field since the emergence of the first paper on collaborative filtering in the mid-1990s. In general, recommender systems are defined as the supporting systems which help users to find information, products, or services (such as books, movies, music, digital products, web sites, and TV programs) by aggregating and analyzing suggestions from other users, which mean reviews from various authorities, and user attributes. However, as academic researches on recommender systems have increased significantly over the last ten years, more researches are required to be applicable in the real world situation. Because research field on recommender systems is still wide and less mature than other research fields. Accordingly, the existing articles on recommender systems need to be reviewed toward the next generation of recommender systems. However, it would be not easy to confine the recommender system researches to specific disciplines, considering the nature of the recommender system researches. So, we reviewed all articles on recommender systems from 37 journals which were published from 2001 to 2010. The 37 journals are selected from top 125 journals of the MIS Journal Rankings. Also, the literature search was based on the descriptors "Recommender system", "Recommendation system", "Personalization system", "Collaborative filtering" and "Contents filtering". The full text of each article was reviewed to eliminate the article that was not actually related to recommender systems. Many of articles were excluded because the articles such as Conference papers, master's and doctoral dissertations, textbook, unpublished working papers, non-English publication papers and news were unfit for our research. We classified articles by year of publication, journals, recommendation fields, and data mining techniques. The recommendation fields and data mining techniques of 187 articles are reviewed and classified into eight recommendation fields (book, document, image, movie, music, shopping, TV program, and others) and eight data mining techniques (association rule, clustering, decision tree, k-nearest neighbor, link analysis, neural network, regression, and other heuristic methods). The results represented in this paper have several significant implications. First, based on previous publication rates, the interest in the recommender system related research will grow significantly in the future. Second, 49 articles are related to movie recommendation whereas image and TV program recommendation are identified in only 6 articles. This result has been caused by the easy use of MovieLens data set. So, it is necessary to prepare data set of other fields. Third, recently social network analysis has been used in the various applications. However studies on recommender systems using social network analysis are deficient. Henceforth, we expect that new recommendation approaches using social network analysis will be developed in the recommender systems. So, it will be an interesting and further research area to evaluate the recommendation system researches using social method analysis. This result provides trend of recommender system researches by examining the published literature, and provides practitioners and researchers with insight and future direction on recommender systems. We hope that this research helps anyone who is interested in recommender systems research to gain insight for future research.

Detecting Weak Signals for Carbon Neutrality Technology using Text Mining of Web News (탄소중립 기술의 미래신호 탐색연구: 국내 뉴스 기사 텍스트데이터를 중심으로)

  • Jisong Jeong;Seungkook Roh
    • Journal of Industrial Convergence
    • /
    • v.21 no.5
    • /
    • pp.1-13
    • /
    • 2023
  • Carbon neutrality is the concept of reducing greenhouse gases emitted by human activities and making actual emissions zero through removal of remaining gases. It is also called "Net-Zero" and "carbon zero". Korea has declared a "2050 Carbon Neutrality policy" to cope with the climate change crisis. Various carbon reduction legislative processes are underway. Since carbon neutrality requires changes in industrial technology, it is important to prepare a system for carbon zero. This paper aims to understand the status and trends of global carbon neutrality technology. Therefore, ROK's web platform "www.naver.com." was selected as the data collection scope. Korean online articles related to carbon neutrality were collected. Carbon neutrality technology trends were analyzed by future signal methodology and Word2Vec algorithm which is a neural network deep learning technology. As a result, technology advancement in the steel and petrochemical sectors, which are carbon over-release industries, was required. Investment feasibility in the electric vehicle sector and technology advancement were on the rise. It seems that the government's support for carbon neutrality and the creation of global technology infrastructure should be supported. In addition, it is urgent to cultivate human resources, and possible to confirm the need to prepare support policies for carbon neutrality.

Efficient Topic Modeling by Mapping Global and Local Topics (전역 토픽의 지역 매핑을 통한 효율적 토픽 모델링 방안)

  • Choi, Hochang;Kim, Namgyu
    • Journal of Intelligence and Information Systems
    • /
    • v.23 no.3
    • /
    • pp.69-94
    • /
    • 2017
  • Recently, increase of demand for big data analysis has been driving the vigorous development of related technologies and tools. In addition, development of IT and increased penetration rate of smart devices are producing a large amount of data. According to this phenomenon, data analysis technology is rapidly becoming popular. Also, attempts to acquire insights through data analysis have been continuously increasing. It means that the big data analysis will be more important in various industries for the foreseeable future. Big data analysis is generally performed by a small number of experts and delivered to each demander of analysis. However, increase of interest about big data analysis arouses activation of computer programming education and development of many programs for data analysis. Accordingly, the entry barriers of big data analysis are gradually lowering and data analysis technology being spread out. As the result, big data analysis is expected to be performed by demanders of analysis themselves. Along with this, interest about various unstructured data is continually increasing. Especially, a lot of attention is focused on using text data. Emergence of new platforms and techniques using the web bring about mass production of text data and active attempt to analyze text data. Furthermore, result of text analysis has been utilized in various fields. Text mining is a concept that embraces various theories and techniques for text analysis. Many text mining techniques are utilized in this field for various research purposes, topic modeling is one of the most widely used and studied. Topic modeling is a technique that extracts the major issues from a lot of documents, identifies the documents that correspond to each issue and provides identified documents as a cluster. It is evaluated as a very useful technique in that reflect the semantic elements of the document. Traditional topic modeling is based on the distribution of key terms across the entire document. Thus, it is essential to analyze the entire document at once to identify topic of each document. This condition causes a long time in analysis process when topic modeling is applied to a lot of documents. In addition, it has a scalability problem that is an exponential increase in the processing time with the increase of analysis objects. This problem is particularly noticeable when the documents are distributed across multiple systems or regions. To overcome these problems, divide and conquer approach can be applied to topic modeling. It means dividing a large number of documents into sub-units and deriving topics through repetition of topic modeling to each unit. This method can be used for topic modeling on a large number of documents with limited system resources, and can improve processing speed of topic modeling. It also can significantly reduce analysis time and cost through ability to analyze documents in each location or place without combining analysis object documents. However, despite many advantages, this method has two major problems. First, the relationship between local topics derived from each unit and global topics derived from entire document is unclear. It means that in each document, local topics can be identified, but global topics cannot be identified. Second, a method for measuring the accuracy of the proposed methodology should be established. That is to say, assuming that global topic is ideal answer, the difference in a local topic on a global topic needs to be measured. By those difficulties, the study in this method is not performed sufficiently, compare with other studies dealing with topic modeling. In this paper, we propose a topic modeling approach to solve the above two problems. First of all, we divide the entire document cluster(Global set) into sub-clusters(Local set), and generate the reduced entire document cluster(RGS, Reduced global set) that consist of delegated documents extracted from each local set. We try to solve the first problem by mapping RGS topics and local topics. Along with this, we verify the accuracy of the proposed methodology by detecting documents, whether to be discerned as the same topic at result of global and local set. Using 24,000 news articles, we conduct experiments to evaluate practical applicability of the proposed methodology. In addition, through additional experiment, we confirmed that the proposed methodology can provide similar results to the entire topic modeling. We also proposed a reasonable method for comparing the result of both methods.

Improving Performance of Recommendation Systems Using Topic Modeling (사용자 관심 이슈 분석을 통한 추천시스템 성능 향상 방안)

  • Choi, Seongi;Hyun, Yoonjin;Kim, Namgyu
    • Journal of Intelligence and Information Systems
    • /
    • v.21 no.3
    • /
    • pp.101-116
    • /
    • 2015
  • Recently, due to the development of smart devices and social media, vast amounts of information with the various forms were accumulated. Particularly, considerable research efforts are being directed towards analyzing unstructured big data to resolve various social problems. Accordingly, focus of data-driven decision-making is being moved from structured data analysis to unstructured one. Also, in the field of recommendation system, which is the typical area of data-driven decision-making, the need of using unstructured data has been steadily increased to improve system performance. Approaches to improve the performance of recommendation systems can be found in two aspects- improving algorithms and acquiring useful data with high quality. Traditionally, most efforts to improve the performance of recommendation system were made by the former approach, while the latter approach has not attracted much attention relatively. In this sense, efforts to utilize unstructured data from variable sources are very timely and necessary. Particularly, as the interests of users are directly connected with their needs, identifying the interests of the user through unstructured big data analysis can be a crew for improving performance of recommendation systems. In this sense, this study proposes the methodology of improving recommendation system by measuring interests of the user. Specially, this study proposes the method to quantify interests of the user by analyzing user's internet usage patterns, and to predict user's repurchase based upon the discovered preferences. There are two important modules in this study. The first module predicts repurchase probability of each category through analyzing users' purchase history. We include the first module to our research scope for comparing the accuracy of traditional purchase-based prediction model to our new model presented in the second module. This procedure extracts purchase history of users. The core part of our methodology is in the second module. This module extracts users' interests by analyzing news articles the users have read. The second module constructs a correspondence matrix between topics and news articles by performing topic modeling on real world news articles. And then, the module analyzes users' news access patterns and then constructs a correspondence matrix between articles and users. After that, by merging the results of the previous processes in the second module, we can obtain a correspondence matrix between users and topics. This matrix describes users' interests in a structured manner. Finally, by using the matrix, the second module builds a model for predicting repurchase probability of each category. In this paper, we also provide experimental results of our performance evaluation. The outline of data used our experiments is as follows. We acquired web transaction data of 5,000 panels from a company that is specialized to analyzing ranks of internet sites. At first we extracted 15,000 URLs of news articles published from July 2012 to June 2013 from the original data and we crawled main contents of the news articles. After that we selected 2,615 users who have read at least one of the extracted news articles. Among the 2,615 users, we discovered that the number of target users who purchase at least one items from our target shopping mall 'G' is 359. In the experiments, we analyzed purchase history and news access records of the 359 internet users. From the performance evaluation, we found that our prediction model using both users' interests and purchase history outperforms a prediction model using only users' purchase history from a view point of misclassification ratio. In detail, our model outperformed the traditional one in appliance, beauty, computer, culture, digital, fashion, and sports categories when artificial neural network based models were used. Similarly, our model outperformed the traditional one in beauty, computer, digital, fashion, food, and furniture categories when decision tree based models were used although the improvement is very small.

Design and Implementation of High-dimensional Index Structure for the support of Concurrency Control (필터링에 기반한 고차원 색인구조의 동시성 제어기법의 설계 및 구현)

  • Lee, Yong-Ju;Chang, Jae-Woo;Kim, Hang-Young;Kim, Myung-Joon
    • The KIPS Transactions:PartD
    • /
    • v.10D no.1
    • /
    • pp.1-12
    • /
    • 2003
  • Recently, there have been many indexing schemes for multimedia data such as image, video data. But recent database applications, for example data mining and multimedia database, are required to support multi-user environment. In order for indexing schemes to be useful in multi-user environment, a concurrency control algorithm is required to handle it. So we propose a concurrency control algorithm that can be applied to CBF (cell-based filtering method), which uses the signature of the cell for alleviating the dimensional curse problem. In addition, we extend the SHORE storage system of Wisconsin university in order to handle high-dimensional data. This extended SHORE storage system provides conventional storage manager functions, guarantees the integrity of high-dimensional data and is flexible to the large scale of feature vectors for preventing the usage of large main memory. Finally, we implement the web-based image retrieval system by using the extended SHORE storage system. The key feature of this system is platform-independent access to the high-dimensional data as well as functionality of efficient content-based queries. Lastly. We evaluate an average response time of point query, range query and k-nearest query in terms of the number of threads.