• Title/Summary/Keyword: Modeling System

Search Result 10,844, Processing Time 0.033 seconds

Clickstream Big Data Mining for Demographics based Digital Marketing (인구통계특성 기반 디지털 마케팅을 위한 클릭스트림 빅데이터 마이닝)

  • Park, Jiae;Cho, Yoonho
    • Journal of Intelligence and Information Systems
    • /
    • v.22 no.3
    • /
    • pp.143-163
    • /
    • 2016
  • The demographics of Internet users are the most basic and important sources for target marketing or personalized advertisements on the digital marketing channels which include email, mobile, and social media. However, it gradually has become difficult to collect the demographics of Internet users because their activities are anonymous in many cases. Although the marketing department is able to get the demographics using online or offline surveys, these approaches are very expensive, long processes, and likely to include false statements. Clickstream data is the recording an Internet user leaves behind while visiting websites. As the user clicks anywhere in the webpage, the activity is logged in semi-structured website log files. Such data allows us to see what pages users visited, how long they stayed there, how often they visited, when they usually visited, which site they prefer, what keywords they used to find the site, whether they purchased any, and so forth. For such a reason, some researchers tried to guess the demographics of Internet users by using their clickstream data. They derived various independent variables likely to be correlated to the demographics. The variables include search keyword, frequency and intensity for time, day and month, variety of websites visited, text information for web pages visited, etc. The demographic attributes to predict are also diverse according to the paper, and cover gender, age, job, location, income, education, marital status, presence of children. A variety of data mining methods, such as LSA, SVM, decision tree, neural network, logistic regression, and k-nearest neighbors, were used for prediction model building. However, this research has not yet identified which data mining method is appropriate to predict each demographic variable. Moreover, it is required to review independent variables studied so far and combine them as needed, and evaluate them for building the best prediction model. The objective of this study is to choose clickstream attributes mostly likely to be correlated to the demographics from the results of previous research, and then to identify which data mining method is fitting to predict each demographic attribute. Among the demographic attributes, this paper focus on predicting gender, age, marital status, residence, and job. And from the results of previous research, 64 clickstream attributes are applied to predict the demographic attributes. The overall process of predictive model building is compose of 4 steps. In the first step, we create user profiles which include 64 clickstream attributes and 5 demographic attributes. The second step performs the dimension reduction of clickstream variables to solve the curse of dimensionality and overfitting problem. We utilize three approaches which are based on decision tree, PCA, and cluster analysis. We build alternative predictive models for each demographic variable in the third step. SVM, neural network, and logistic regression are used for modeling. The last step evaluates the alternative models in view of model accuracy and selects the best model. For the experiments, we used clickstream data which represents 5 demographics and 16,962,705 online activities for 5,000 Internet users. IBM SPSS Modeler 17.0 was used for our prediction process, and the 5-fold cross validation was conducted to enhance the reliability of our experiments. As the experimental results, we can verify that there are a specific data mining method well-suited for each demographic variable. For example, age prediction is best performed when using the decision tree based dimension reduction and neural network whereas the prediction of gender and marital status is the most accurate by applying SVM without dimension reduction. We conclude that the online behaviors of the Internet users, captured from the clickstream data analysis, could be well used to predict their demographics, thereby being utilized to the digital marketing.

Analysis of the Time-dependent Relation between TV Ratings and the Content of Microblogs (TV 시청률과 마이크로블로그 내용어와의 시간대별 관계 분석)

  • Choeh, Joon Yeon;Baek, Haedeuk;Choi, Jinho
    • Journal of Intelligence and Information Systems
    • /
    • v.20 no.1
    • /
    • pp.163-176
    • /
    • 2014
  • Social media is becoming the platform for users to communicate their activities, status, emotions, and experiences to other people. In recent years, microblogs, such as Twitter, have gained in popularity because of its ease of use, speed, and reach. Compared to a conventional web blog, a microblog lowers users' efforts and investment for content generation by recommending shorter posts. There has been a lot research into capturing the social phenomena and analyzing the chatter of microblogs. However, measuring television ratings has been given little attention so far. Currently, the most common method to measure TV ratings uses an electronic metering device installed in a small number of sampled households. Microblogs allow users to post short messages, share daily updates, and conveniently keep in touch. In a similar way, microblog users are interacting with each other while watching television or movies, or visiting a new place. In order to measure TV ratings, some features are significant during certain hours of the day, or days of the week, whereas these same features are meaningless during other time periods. Thus, the importance of features can change during the day, and a model capturing the time sensitive relevance is required to estimate TV ratings. Therefore, modeling time-related characteristics of features should be a key when measuring the TV ratings through microblogs. We show that capturing time-dependency of features in measuring TV ratings is vitally necessary for improving their accuracy. To explore the relationship between the content of microblogs and TV ratings, we collected Twitter data using the Get Search component of the Twitter REST API from January 2013 to October 2013. There are about 300 thousand posts in our data set for the experiment. After excluding data such as adverting or promoted tweets, we selected 149 thousand tweets for analysis. The number of tweets reaches its maximum level on the broadcasting day and increases rapidly around the broadcasting time. This result is stems from the characteristics of the public channel, which broadcasts the program at the predetermined time. From our analysis, we find that count-based features such as the number of tweets or retweets have a low correlation with TV ratings. This result implies that a simple tweet rate does not reflect the satisfaction or response to the TV programs. Content-based features extracted from the content of tweets have a relatively high correlation with TV ratings. Further, some emoticons or newly coined words that are not tagged in the morpheme extraction process have a strong relationship with TV ratings. We find that there is a time-dependency in the correlation of features between the before and after broadcasting time. Since the TV program is broadcast at the predetermined time regularly, users post tweets expressing their expectation for the program or disappointment over not being able to watch the program. The highly correlated features before the broadcast are different from the features after broadcasting. This result explains that the relevance of words with TV programs can change according to the time of the tweets. Among the 336 words that fulfill the minimum requirements for candidate features, 145 words have the highest correlation before the broadcasting time, whereas 68 words reach the highest correlation after broadcasting. Interestingly, some words that express the impossibility of watching the program show a high relevance, despite containing a negative meaning. Understanding the time-dependency of features can be helpful in improving the accuracy of TV ratings measurement. This research contributes a basis to estimate the response to or satisfaction with the broadcasted programs using the time dependency of words in Twitter chatter. More research is needed to refine the methodology for predicting or measuring TV ratings.

Spatial effect on the diffusion of discount stores (대형할인점 확산에 대한 공간적 영향)

  • Joo, Young-Jin;Kim, Mi-Ae
    • Journal of Distribution Research
    • /
    • v.15 no.4
    • /
    • pp.61-85
    • /
    • 2010
  • Introduction: Diffusion is process by which an innovation is communicated through certain channel overtime among the members of a social system(Rogers 1983). Bass(1969) suggested the Bass model describing diffusion process. The Bass model assumes potential adopters of innovation are influenced by mass-media and word-of-mouth from communication with previous adopters. Various expansions of the Bass model have been conducted. Some of them proposed a third factor affecting diffusion. Others proposed multinational diffusion model and it stressed interactive effect on diffusion among several countries. We add a spatial factor in the Bass model as a third communication factor. Because of situation where we can not control the interaction between markets, we need to consider that diffusion within certain market can be influenced by diffusion in contiguous market. The process that certain type of retail extends is a result that particular market can be described by the retail life cycle. Diffusion of retail has pattern following three phases of spatial diffusion: adoption of innovation happens in near the diffusion center first, spreads to the vicinity of the diffusing center and then adoption of innovation is completed in peripheral areas in saturation stage. So we expect spatial effect to be important to describe diffusion of domestic discount store. We define a spatial diffusion model using multinational diffusion model and apply it to the diffusion of discount store. Modeling: In this paper, we define a spatial diffusion model and apply it to the diffusion of discount store. To define a spatial diffusion model, we expand learning model(Kumar and Krishnan 2002) and separate diffusion process in diffusion center(market A) from diffusion process in the vicinity of the diffusing center(market B). The proposed spatial diffusion model is shown in equation (1a) and (1b). Equation (1a) is the diffusion process in diffusion center and equation (1b) is one in the vicinity of the diffusing center. $$\array{{S_{i,t}=(p_i+q_i{\frac{Y_{i,t-1}}{m_i}})(m_i-Y_{i,t-1})\;i{\in}\{1,{\cdots},I\}\;(1a)}\\{S_{j,t}=(p_j+q_j{\frac{Y_{j,t-1}}{m_i}}+{\sum\limits_{i=1}^I}{\gamma}_{ij}{\frac{Y_{i,t-1}}{m_i}})(m_j-Y_{j,t-1})\;i{\in}\{1,{\cdots},I\},\;j{\in}\{I+1,{\cdots},I+J\}\;(1b)}}$$ We rise two research questions. (1) The proposed spatial diffusion model is more effective than the Bass model to describe the diffusion of discount stores. (2) The more similar retail environment of diffusing center with that of the vicinity of the contiguous market is, the larger spatial effect of diffusing center on diffusion of the vicinity of the contiguous market is. To examine above two questions, we adopt the Bass model to estimate diffusion of discount store first. Next spatial diffusion model where spatial factor is added to the Bass model is used to estimate it. Finally by comparing Bass model with spatial diffusion model, we try to find out which model describes diffusion of discount store better. In addition, we investigate the relationship between similarity of retail environment(conceptual distance) and spatial factor impact with correlation analysis. Result and Implication: We suggest spatial diffusion model to describe diffusion of discount stores. To examine the proposed spatial diffusion model, 347 domestic discount stores are used and we divide nation into 5 districts, Seoul-Gyeongin(SG), Busan-Gyeongnam(BG), Daegu-Gyeongbuk(DG), Gwan- gju-Jeonla(GJ), Daejeon-Chungcheong(DC), and the result is shown

    . In a result of the Bass model(I), the estimates of innovation coefficient(p) and imitation coefficient(q) are 0.017 and 0.323 respectively. While the estimate of market potential is 384. A result of the Bass model(II) for each district shows the estimates of innovation coefficient(p) in SG is 0.019 and the lowest among 5 areas. This is because SG is the diffusion center. The estimates of imitation coefficient(q) in BG is 0.353 and the highest. The imitation coefficient in the vicinity of the diffusing center such as BG is higher than that in the diffusing center because much information flows through various paths more as diffusion is progressing. A result of the Bass model(II) shows the estimates of innovation coefficient(p) in SG is 0.019 and the lowest among 5 areas. This is because SG is the diffusion center. The estimates of imitation coefficient(q) in BG is 0.353 and the highest. The imitation coefficient in the vicinity of the diffusing center such as BG is higher than that in the diffusing center because much information flows through various paths more as diffusion is progressing. In a result of spatial diffusion model(IV), we can notice the changes between coefficients of the bass model and those of the spatial diffusion model. Except for GJ, the estimates of innovation and imitation coefficients in Model IV are lower than those in Model II. The changes of innovation and imitation coefficients are reflected to spatial coefficient(${\gamma}$). From spatial coefficient(${\gamma}$) we can infer that when the diffusion in the vicinity of the diffusing center occurs, the diffusion is influenced by one in the diffusing center. The difference between the Bass model(II) and the spatial diffusion model(IV) is statistically significant with the ${\chi}^2$-distributed likelihood ratio statistic is 16.598(p=0.0023). Which implies that the spatial diffusion model is more effective than the Bass model to describe diffusion of discount stores. So the research question (1) is supported. In addition, we found that there are statistically significant relationship between similarity of retail environment and spatial effect by using correlation analysis. So the research question (2) is also supported.

  • PDF
  • An Analytical Approach Using Topic Mining for Improving the Service Quality of Hotels (호텔 산업의 서비스 품질 향상을 위한 토픽 마이닝 기반 분석 방법)

    • Moon, Hyun Sil;Sung, David;Kim, Jae Kyeong
      • Journal of Intelligence and Information Systems
      • /
      • v.25 no.1
      • /
      • pp.21-41
      • /
      • 2019
    • Thanks to the rapid development of information technologies, the data available on Internet have grown rapidly. In this era of big data, many studies have attempted to offer insights and express the effects of data analysis. In the tourism and hospitality industry, many firms and studies in the era of big data have paid attention to online reviews on social media because of their large influence over customers. As tourism is an information-intensive industry, the effect of these information networks on social media platforms is more remarkable compared to any other types of media. However, there are some limitations to the improvements in service quality that can be made based on opinions on social media platforms. Users on social media platforms represent their opinions as text, images, and so on. Raw data sets from these reviews are unstructured. Moreover, these data sets are too big to extract new information and hidden knowledge by human competences. To use them for business intelligence and analytics applications, proper big data techniques like Natural Language Processing and data mining techniques are needed. This study suggests an analytical approach to directly yield insights from these reviews to improve the service quality of hotels. Our proposed approach consists of topic mining to extract topics contained in the reviews and the decision tree modeling to explain the relationship between topics and ratings. Topic mining refers to a method for finding a group of words from a collection of documents that represents a document. Among several topic mining methods, we adopted the Latent Dirichlet Allocation algorithm, which is considered as the most universal algorithm. However, LDA is not enough to find insights that can improve service quality because it cannot find the relationship between topics and ratings. To overcome this limitation, we also use the Classification and Regression Tree method, which is a kind of decision tree technique. Through the CART method, we can find what topics are related to positive or negative ratings of a hotel and visualize the results. Therefore, this study aims to investigate the representation of an analytical approach for the improvement of hotel service quality from unstructured review data sets. Through experiments for four hotels in Hong Kong, we can find the strengths and weaknesses of services for each hotel and suggest improvements to aid in customer satisfaction. Especially from positive reviews, we find what these hotels should maintain for service quality. For example, compared with the other hotels, a hotel has a good location and room condition which are extracted from positive reviews for it. In contrast, we also find what they should modify in their services from negative reviews. For example, a hotel should improve room condition related to soundproof. These results mean that our approach is useful in finding some insights for the service quality of hotels. That is, from the enormous size of review data, our approach can provide practical suggestions for hotel managers to improve their service quality. In the past, studies for improving service quality relied on surveys or interviews of customers. However, these methods are often costly and time consuming and the results may be biased by biased sampling or untrustworthy answers. The proposed approach directly obtains honest feedback from customers' online reviews and draws some insights through a type of big data analysis. So it will be a more useful tool to overcome the limitations of surveys or interviews. Moreover, our approach easily obtains the service quality information of other hotels or services in the tourism industry because it needs only open online reviews and ratings as input data. Furthermore, the performance of our approach will be better if other structured and unstructured data sources are added.


    (34141) Korea Institute of Science and Technology Information, 245, Daehak-ro, Yuseong-gu, Daejeon
    Copyright (C) KISTI. All Rights Reserved.