• Title/Summary/Keyword: Model validation

Clickstream Big Data Mining for Demographics based Digital Marketing (인구통계특성 기반 디지털 마케팅을 위한 클릭스트림 빅데이터 마이닝)

  • Park, Jiae;Cho, Yoonho
    • Journal of Intelligence and Information Systems
    • /
    • v.22 no.3
    • /
    • pp.143-163
    • /
    • 2016
  • The demographics of Internet users are the most basic and important sources for target marketing or personalized advertisements on digital marketing channels, which include email, mobile, and social media. However, it has gradually become difficult to collect the demographics of Internet users because their activities are anonymous in many cases. Although a marketing department can obtain demographics through online or offline surveys, these approaches are expensive, slow, and likely to include false statements. Clickstream data is the record an Internet user leaves behind while visiting websites. As the user clicks anywhere on a webpage, the activity is logged in semi-structured website log files. Such data allows us to see which pages users visited, how long they stayed there, how often they visited, when they usually visited, which sites they prefer, what keywords they used to find the site, whether they purchased anything, and so forth. For this reason, some researchers have tried to infer the demographics of Internet users from their clickstream data. They derived various independent variables likely to be correlated with the demographics. The variables include search keywords, frequency and intensity by time, day, and month, variety of websites visited, text information from web pages visited, etc. The demographic attributes to predict also vary from paper to paper, and cover gender, age, job, location, income, education, marital status, and presence of children. A variety of data mining methods, such as LSA, SVM, decision tree, neural network, logistic regression, and k-nearest neighbors, were used to build prediction models. However, this research has not yet identified which data mining method is appropriate for predicting each demographic variable. Moreover, the independent variables studied so far need to be reviewed, combined as needed, and evaluated in order to build the best prediction model. 
The objective of this study is to choose the clickstream attributes most likely to be correlated with the demographics from the results of previous research, and then to identify which data mining method is best suited to predicting each demographic attribute. Among the demographic attributes, this paper focuses on predicting gender, age, marital status, residence, and job. From the results of previous research, 64 clickstream attributes are applied to predict the demographic attributes. The overall process of predictive model building is composed of 4 steps. In the first step, we create user profiles which include 64 clickstream attributes and 5 demographic attributes. The second step performs dimension reduction of the clickstream variables to address the curse of dimensionality and the overfitting problem. We utilize three approaches, based on decision tree, PCA, and cluster analysis. In the third step, we build alternative predictive models for each demographic variable. SVM, neural network, and logistic regression are used for modeling. The last step evaluates the alternative models in terms of model accuracy and selects the best model. For the experiments, we used clickstream data representing 5 demographics and 16,962,705 online activities for 5,000 Internet users. IBM SPSS Modeler 17.0 was used for our prediction process, and 5-fold cross validation was conducted to enhance the reliability of our experiments. The experimental results verify that there is a specific data mining method well suited to each demographic variable. For example, age prediction performs best when using decision tree based dimension reduction and a neural network, whereas the prediction of gender and marital status is most accurate when applying SVM without dimension reduction. 
We conclude that the online behaviors of Internet users, captured through clickstream data analysis, can be used to predict their demographics and can thereby be applied to digital marketing.
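
The 5-fold cross validation mentioned in this abstract can be sketched as follows. This is a minimal illustration in plain Python, not the authors' IBM SPSS Modeler workflow: the function names `k_fold_indices` and `cross_validate`, the fixed random seed, and the majority-class baseline are all assumptions made for the example.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding by k spreads the shuffled indices evenly over the folds.
    return [indices[i::k] for i in range(k)]


def cross_validate(features, labels, fit, predict, k=5, seed=0):
    """Return the mean held-out accuracy over k folds."""
    folds = k_fold_indices(len(labels), k, seed)
    accuracies = []
    for held_out in folds:
        held_out_set = set(held_out)
        train = [i for i in range(len(labels)) if i not in held_out_set]
        # Fit on the k-1 training folds only, then score on the held-out fold.
        model = fit([features[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, features[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


# Hypothetical baseline model: always predict the most frequent training label.
def fit_majority(train_features, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, feature_row):
    return model
```

Any of the alternative models named in the abstract (SVM, neural network, logistic regression) could be plugged in through the `fit` and `predict` arguments, so that each candidate is compared on the same folds.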

The Classification System and Information Service for Establishing a National Collaborative R&D Strategy in Infectious Diseases: Focusing on the Classification Model for Overseas Coronavirus R&D Projects (국가 감염병 공동R&D전략 수립을 위한 분류체계 및 정보서비스에 대한 연구: 해외 코로나바이러스 R&D과제의 분류모델을 중심으로)

  • Lee, Doyeon;Lee, Jae-Seong;Jun, Seung-pyo;Kim, Keun-Hwan
    • Journal of Intelligence and Information Systems
    • /
    • v.26 no.3
    • /
    • pp.127-147
    • /
    • 2020
  • The world is suffering numerous human and economic losses due to the novel coronavirus infection (COVID-19). The Korean government established a strategy to overcome the national infectious disease crisis through research and development. It is difficult to find distinctive features and changes in a specific R&D field when using the existing technical classification or the science and technology standard classification. Recently, a few studies have been conducted to establish a classification system that provides information about the research areas of infectious disease investment in Korea through a comparative analysis of Korean government-funded research projects. However, these studies did not provide the information needed to establish cooperative research strategies among countries in the field of infectious diseases, which is required as an execution plan to achieve the goals of national health security and fostering new growth industries. Therefore, it is necessary to study information services based on a classification system and classification model for establishing a national collaborative R&D strategy. A classification system with seven categories (Diagnosis_biomarker, Drug_discovery, Epidemiology, Evaluation_validation, Mechanism_signaling pathway, Prediction, and Vaccine_therapeutic antibody) was derived by reviewing infectious disease-related nationally funded research projects in South Korea. A classification model was trained by combining Scopus data with a bidirectional RNN model. The final model achieved robust classification performance, with an accuracy of over 90%. 
In order to conduct the empirical study, the infectious disease classification system was applied to the coronavirus-related research and development projects of major countries, drawn from STAR Metrics (National Institutes of Health) and the NSF (National Science Foundation) of the United States (US), CORDIS (Community Research & Development Information Service) of the European Union (EU), and KAKEN (Database of Grants-in-Aid for Scientific Research) of Japan. The research and development trends for infectious diseases (coronavirus) in major countries are mostly concentrated in the prediction category, which deals with predicting the success of clinical trials at the new drug development stage or predicting toxicity that causes side effects. An intriguing result is that, for all of these nations, the share of national investment in vaccine_therapeutic antibody, which is recognized as the area of research and development aimed at developing vaccines and treatments, was also very small (5.1%). This indirectly explains the poor development of vaccines and treatments. Examining the investment status of coronavirus-related research projects through comparative analysis by country shows that the US and Japan invest relatively evenly across all infectious disease research areas, while Europe invests relatively heavily in specific research areas such as diagnosis_biomarker. Moreover, the classification system provided information on major coronavirus-related research organizations in major countries, making it possible to establish international collaborative R&D projects.

Prediction of Growth of Escherichia coli O157:H7 in Lettuce Treated with Alkaline Electrolyzed Water at Different Temperatures

  • Ding, Tian;Jin, Yong-Guo;Rahman, S.M.E.;Kim, Jai-Moung;Choi, Kang-Hyun;Choi, Gye-Sun;Oh, Deog-Hwan
    • Journal of Food Hygiene and Safety
    • /
    • v.24 no.3
    • /
    • pp.232-237
    • /
    • 2009
  • This study was conducted to develop a model describing the effect of storage temperature (4, 10, 15, 20, 25, 30 and $35^{\circ}C$) on the growth of Escherichia coli O157:H7 in ready-to-eat (RTE) lettuce treated with or without (control) alkaline electrolyzed water (AIEW). The growth curves were well fitted by the Gompertz equation ($R^2$ = 0.994), which was used to determine the specific growth rate (SGR) and lag time (LT) of E. coli O157:H7. Results showed that the obtained SGR and LT depended on the storage temperature. The growth rate increased with increasing temperature from 4 to $35^{\circ}C$. Square root models were used to evaluate the effect of storage temperature on the growth of E. coli O157:H7 in lettuce samples treated with or without AIEW. The coefficient of determination ($R^2$), adjusted coefficient of determination ($R^2_{Adj}$), and mean square error (MSE) were employed to validate the established models. Both $R^2$ and $R^2_{Adj}$ were close to 1 (> 0.93), and the MSE values calculated from the models for untreated and treated lettuce were 0.031 and 0.025, respectively. The results demonstrated that the overall predictions of the growth of E. coli O157:H7 agreed with the observed data.
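
The three validation statistics named in this abstract can be sketched in a few lines. The abstract does not give the exact form of the square root model, so the Ratkowsky-type form sqrt(rate) = b (T - T_min) used here is an assumption, as are the function names, the parameter names `b` and `t_min`, and the choice of dividing the residual sum of squares by n for the MSE (some authors divide by the residual degrees of freedom instead).

```python
def square_root_model(temperature_c, b, t_min):
    """Assumed Ratkowsky-type model: sqrt(rate) = b * (T - T_min).

    Returns the growth rate itself; no growth is predicted at or below T_min.
    """
    if temperature_c <= t_min:
        return 0.0
    return (b * (temperature_c - t_min)) ** 2


def fit_statistics(observed, predicted, n_predictors):
    """Return (R^2, adjusted R^2, MSE) for paired observed/predicted values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot
    # The adjustment penalizes R^2 for the number of predictors in the model.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    mse = ss_res / n  # assumption: plain mean of squared residuals
    return r2, adj_r2, mse
```

Values of $R^2$ and $R^2_{Adj}$ close to 1 together with a small MSE, as reported in the abstract, indicate that predictions track the observations closely.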

Development of Work-related Musculoskeletal Disorder Questionnaire Using Receiver Operating Characteristic Analysis (Receiver Operating Characteristic 분석법을 이용한 업무관련성 근골격계질환 설문지 개발)

  • Kwon, Ho-Jang;Ju, Yeong-Su;Cho, Soo-Hun;Kang, Dae-Hee;Sung, Joo-Hon;Choi, Seong-Woo;Choi, Jae-Wook;Kim, Jae-Young;Kim, Don-Gyu;Kim, Jai-Yong
    • Journal of Preventive Medicine and Public Health
    • /
    • v.32 no.3
    • /
    • pp.361-373
    • /
    • 1999
  • Objectives: The Receiver Operating Characteristic (ROC) curve, with the area under the ROC curve (AUC), is one of the most popular indicators for evaluating the criterion validity of a measurement tool. This study was conducted to develop a standardized questionnaire to discriminate workers at high risk of work-related musculoskeletal disorders using ROC analysis. Methods: The diagnostic results determined by rehabilitation medicine specialists for 370 persons (89 shipyard CAD workers, 113 telephone directory assistance operators, 79 women with occupations, and 89 housewives) were compared with the participants' own replies to 'the questionnaire on the worker's subjective physical symptoms' (Kwon, 1996). The AUCs from four models with different methods of item selection and weighting were compared with each other. These 4 models were applied to 225 persons working on a motor vehicle assembly line to test AUC reliability. Results: In a weighted model with 11 items, the AUC was 0.8155 in the primary study population and 0.8026 in the secondary study population (p=0.3780). It was superior in terms of discriminability, reliability, and convenience. A new questionnaire on musculoskeletal disorders could be constructed from this model. Conclusion: The main results of this study are a more valid questionnaire with a small number of items and quantitative weight scores useful for relative comparisons. Although an absolute reference value applicable to a wide range of populations was not estimated, the basic intent of this study, developing a surveillance tool through quantitative validation of the measures, should serve systematic disease prevention activities.
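
To make the AUC comparison concrete, the sketch below computes the AUC of a weighted questionnaire score against a binary specialist diagnosis. It is an illustration only: the function names and the pairwise (Mann-Whitney style) computation are assumptions, and the abstract does not state how the authors calculated the AUC or which item weights they used.

```python
def weighted_score(answers, weights):
    """Combine item answers into one score using per-item weights."""
    return sum(a * w for a, w in zip(answers, weights))


def roc_auc(scores, labels):
    """AUC as the probability that a random positive case outscores a
    random negative case, counting ties as one half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

An AUC of 0.5 corresponds to a questionnaire that discriminates no better than chance, and 1.0 to perfect separation of diagnosed and undiagnosed workers.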

A Management Plan According to the Estimation of Nutria (Myocastor coypus) Distribution Density and Potential Suitable Habitat (뉴트리아(Myocastor coypus) 분포밀도 및 잠재적 서식가능지역 예측에 따른 관리방향)

  • Kim, Areum;Kim, Young-Chae;Lee, Do-Hun
    • Journal of Environmental Impact Assessment
    • /
    • v.27 no.2
    • /
    • pp.203-214
    • /
    • 2018
  • The purpose of this study is to estimate the concentrated distribution area and potential suitable habitat of nutria (Myocastor coypus) and to provide useful data for setting an effective management direction. Based on nationwide distribution data of nutria, the cross-validation value was applied to analyze the distribution density. As a result, concentrated distribution areas that require preferential elimination were found in 14 administrative areas, including Busan Metropolitan City, Daegu Metropolitan City, 11 cities and counties in Gyeongsangnam-do, and 1 county in Gyeongsangbuk-do. In the potential suitable habitat estimation using a MaxEnt (Maximum Entropy) model, the possibility of emergence was found in the middle and lower reaches of the Nakdong River, the lower reaches of the Seomjin River, and the Gahwacheon River area. In terms of variable contribution to the model, DEM, precipitation of the driest month, minimum temperature of the coldest month, and distance from river contributed in descending order. In terms of the relation with the probability of appearance, the probability of emergence was higher than the threshold value in areas with an altitude of less than 34 m, a minimum temperature of the coldest month of $-5.7^{\circ}C{\sim}-0.6^{\circ}C$, precipitation of the driest month of 15-30 mm, and a distance of less than 1,373 m from a river. Considering the research results and the physiological and ecological characteristics of nutria, altitude, the presence of water, and winter temperature affected the settlement and expansion of nutria. Therefore, it is necessary to reflect them as important variables in future modeling of habitable area detection and expansion estimation. It is essential to distinguish the concentrated distribution area from the management area of invasive alien species such as nutria, and to establish and apply a suitable management strategy to each management site for permanent control. 
The results of this study can be used as useful data for strategic management, such as rapid management of the preferential management area and preemptive and preventive management of the possible spreading area.

Estimation of Soil Surface Temperature by Heat Flux in Soil (Heat flux를 이용한 토양 표면 온도 예측)

  • Hur, Seung-Oh;Kim, Won-Tae;Jung, Kang-Ho;Ha, Sang-Keon
    • Korean Journal of Soil Science and Fertilizer
    • /
    • v.37 no.3
    • /
    • pp.131-135
    • /
    • 2004
  • This study was carried out to analyze the temperature characteristics of the soil surface using soil heat flux, which is one of the important parameters determining soil temperature. Soil surface temperature was estimated using the soil temperature measured at 10 cm soil depth and the soil heat flux measured by a flux plate at 5 cm soil depth. There was a time lag of two hours between soil temperature and soil heat flux. Temperature changes over time showed a positive correlation with soil heat flux. Soil surface temperature was estimated by an equation derived using the variable separation method. For accuracy, the arithmetic mean of the temperatures measured at the soil surface and at 10 cm depth was compared with the soil temperature measured at 5 cm depth. An F-test was used to validate the regression model through this comparison. The derived regression model was judged useful because the significance of the F-value was smaller than 0.001 and the coefficient of determination was 0.968. It can be concluded that the surface soil temperatures estimated by the variable separation method were almost equal to the measured surface soil temperatures.

Airborne Hyperspectral Imagery availability to estimate inland water quality parameter (수질 매개변수 추정에 있어서 항공 초분광영상의 가용성 고찰)

  • Kim, Tae-Woo;Shin, Han-Sup;Suh, Yong-Cheol
    • Korean Journal of Remote Sensing
    • /
    • v.30 no.1
    • /
    • pp.61-73
    • /
    • 2014
  • This study reviewed the application of Airborne Hyperspectral Imagery (A-HSI) to water quality estimation and tested the estimation of water quality (especially suspended solids) for a part of the Han River using available in-situ data. Water quality was estimated by two methods. One uses observation data: downwelling radiance to the water surface, and scattering and reflectance within the water body. The other is linear regression analysis of in-situ water quality measurements against upwelling data expressed as at-sensor radiance (or reflectance). Both methods derive meaningful results for remote sensing (RS) estimation. However, the results depend heavily on auxiliary datasets such as in-situ water quality measurements and water body scattering measurements. The test covered a part of the Han River located downstream of Paldang-dam. We applied linear regression analysis to AISA Eagle hyperspectral sensor data and in-situ water quality measurements. The linear regression for a meaningful band combination is $-24.847+0.013L_{560}$, where $L_{560}$ is the radiance (L) at 560 nm, with an R-square of 0.985. For comparison with the Multispectral Imagery (MSI) case, we simulated Landsat TM data by spectral resampling. The regression using MSI is -55.932 + 33.881 (TM1/TM3) in radiance, with an R-square of 0.968. The Suspended Solid (SS) concentration was about 3.75 mg/l in the in-situ data, while the SS concentration estimated at the same location was about 3.65 mg/l with A-HSI and about 5.85 mg/l with MSI. Estimation using MSI thus shows a tendency to overestimate. To increase the practical value and precision of the estimates, it is necessary to minimize the sun glint effect across the whole image, construct an elaborate flight plan that considers the solar altitude angle, and build a good pre-processing and calibration system. 
Through the literature review and a test adopting general methods, we found limitations and restrictions concerning precise atmospheric correction, the sample count of water quality measurements, the retrieval of spectral bands from A-HSI, the selection of an adequate linear regression model, and quantitative calibration/validation methods.
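
The two regressions reported in this abstract, and the size of the MSI overestimate, can be written out as a short sketch. The coefficients and the concentrations 3.75, 3.65, and 5.85 mg/l come from the abstract; the function names are hypothetical, and the radiance units expected by each equation are not stated in the abstract, so inputs must be in whatever units the authors used.

```python
def ss_from_hyperspectral(radiance_560):
    """Suspended solids from A-HSI radiance at 560 nm (reported R-square 0.985)."""
    return -24.847 + 0.013 * radiance_560


def ss_from_multispectral(tm1, tm3):
    """Suspended solids from the simulated Landsat TM band ratio TM1/TM3
    (reported R-square 0.968)."""
    if tm3 == 0:
        raise ValueError("TM3 radiance must be non-zero")
    return -55.932 + 33.881 * (tm1 / tm3)


def relative_error(estimated, measured):
    """Signed relative error of an estimate against the in-situ value."""
    return (estimated - measured) / measured


# Values reported in the abstract (mg/l): in-situ 3.75, A-HSI 3.65, MSI 5.85.
hsi_error = relative_error(3.65, 3.75)
msi_error = relative_error(5.85, 3.75)
```

With the reported values, the A-HSI estimate is within a few percent of the in-situ measurement, while the MSI estimate is higher by more than half, which matches the overestimation tendency described above.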

Change Detection of land-surface Environment in Gongju Areas Using Spatial Relationships between Land-surface Change and Geo-spatial Information (지표변화와 지리공간정보의 연관성 분석을 통한 공주지역 지표환경 변화 분석)

  • Jang Dong-Ho
    • Journal of the Korean Geographical Society
    • /
    • v.40 no.3 s.108
    • /
    • pp.296-309
    • /
    • 2005
  • In this study, we investigated future land-surface change and the relationships between land-surface change and geo-spatial information, using a Bayesian prediction model based on a likelihood ratio function, in order to analyse land-surface change in the Gongju area. We classified the land-surface satellite images and then extracted the changed areas using post-classification comparison. Land-surface information related to land-surface change was constructed in a GIS environment, and the land-surface change prediction map was made using the likelihood ratio function. The results show that the thematic maps which clearly influence land-surface change in rural or urban areas are elevation, water system, population density, roads, population movement, the number of establishments, land price, etc. The thematic maps which clearly influence land-surface change in forest areas are elevation, slope, population density, population movement, land price, etc. According to the land-surface change analysis, the expansion of the old and new downtown centers is concentrated near Gum-river, and the downtown area will spread around the local roads and interchange areas in the urban area. In agricultural areas, small tributaries of Gum-river or areas along local roads connected to adjacent areas showed a high probability of change. Most of the forest areas are located in the southeast, which may explain why the wide chestnut-tree cultivation complex is located in these areas and why the potential for forest damage is very high. In validation using a prediction rate curve, the prediction capability was $80\%$ for urban areas, $55\%$ for agricultural areas, and $40\%$ for forest areas within the top $10\%$ of the predicted possibility of land-surface change. 
This integration model is unsatisfactory for predicting the forest area in the study area; thus, as future work, it is necessary to apply new thematic maps or prediction models. In conclusion, we expect that this approach can become one of the most essential methods for land-surface change studies in a few years.
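
The likelihood ratio and the prediction rate at the top 10% can be illustrated with a small sketch. This is one common formulation, not necessarily the authors' exact model: the per-class ratio P(class | changed) / P(class | unchanged), the function names, and the cell-based inputs are all assumptions for the example.

```python
from collections import Counter


def likelihood_ratios(class_labels, changed):
    """Per-class ratio P(class | changed) / P(class | unchanged) for one
    thematic map, given each cell's class and whether it changed."""
    changed_counts = Counter(c for c, ch in zip(class_labels, changed) if ch)
    unchanged_counts = Counter(c for c, ch in zip(class_labels, changed) if not ch)
    n_changed = sum(changed_counts.values())
    n_unchanged = sum(unchanged_counts.values())
    if n_changed == 0 or n_unchanged == 0:
        raise ValueError("need both changed and unchanged cells")
    ratios = {}
    for cls in set(class_labels):
        p_changed = changed_counts[cls] / n_changed
        p_unchanged = unchanged_counts[cls] / n_unchanged
        ratios[cls] = p_changed / p_unchanged if p_unchanged > 0 else float("inf")
    return ratios


def prediction_rate(scores, changed, top_fraction=0.10):
    """Share of truly changed cells found among the top-scoring fraction."""
    n_changed = sum(1 for c in changed if c)
    if n_changed == 0:
        raise ValueError("need at least one changed cell")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_top = max(1, round(len(scores) * top_fraction))
    captured = sum(1 for i in ranked[:n_top] if changed[i])
    return captured / n_changed
```

A ratio above 1 marks a class where change is over-represented. For a fair validation, `prediction_rate` should be evaluated on change occurrences that were not used to build the prediction map.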

The Effect of Social Entrepreneurship on Market Orientation (사회적 기업가정신이 시장지향성에 미치는 영향)

  • Oh, Sang-Hwan;Yun, Dae-Hong;Ock, Jung-Won
    • Management & Information Systems Review
    • /
    • v.36 no.5
    • /
    • pp.27-44
    • /
    • 2017
  • The purpose of this study was to empirically verify the effect of social entrepreneurship on market orientation. A total of 500 questionnaires were distributed to workers in social enterprises and preliminary social enterprises, and 202 questionnaires were used for the final validation of the research model. The hypotheses set in this study were validated using SPSS 18.0 and LISREL 8.3 based on the research model. The results showed that all hypotheses were accepted, except for 5 hypotheses (Hypothesis 1-1, Hypothesis 1-2, Hypothesis 1-3, Hypothesis 1-6, Hypothesis 1-9). First, we examined the effect that empathy might have on market orientation in connection with social entrepreneurship. The results suggested that empathy did not have a statistically significant effect on customer-orientation, inter-department cooperation and coordination, or competitor-orientation. Second, we examined the effect that innovativeness might have on market orientation in connection with social entrepreneurship. The results showed that innovativeness had a positive (+) effect on customer-orientation and inter-department cooperation and coordination but did not have a statistically significant effect on competitor-orientation. Third, we examined the effect that risk-taking might have on market orientation in connection with social entrepreneurship. The results implied that risk-taking had a positive (+) effect on customer-orientation and inter-department cooperation and coordination but did not have a statistically significant effect on competitor-orientation. Finally, the relationship among the market orientation variables was as follows: inter-department cooperation and coordination had a positive (+) effect on both customer-orientation and competitor-orientation. The results of this study are expected to provide a useful basis for an overall understanding of the effect of social entrepreneurship on market orientation and to present important theoretical and practical implications.

Temporal and Spatial Characteristics of Sediment Yields from the Chungju Dam Upstream Watershed (충주댐 상류유역의 유사 발생에 대한 시공간적인 특성)

  • Kim, Chul-Gyum;Lee, Jeong-Eun;Kim, Nam-Won
    • Journal of Korea Water Resources Association
    • /
    • v.40 no.11
    • /
    • pp.887-898
    • /
    • 2007
  • A physically based semi-distributed model, SWAT, was applied to the Chungju Dam upstream watershed in order to investigate the spatial and temporal characteristics of watershed sediment yields. For this, the general features of SWAT and the sediment simulation algorithm within the model were described briefly, and a watershed sediment modeling system was constructed after calibration and validation of the parameters related to runoff and sediment. With this modeling system, the temporal and spatial variation of soil loss and sediment yields according to watershed scale, land use, and reach was analyzed. Sediment yield rates by drainage area were $0.5{\sim}0.6$ ton/ha/yr excluding some upstream sub-watersheds, and around 0.51 ton/ha/yr for areas above 1,000 $km^2$. Annual average soil loss by land use was higher in upland areas and relatively lower in paddy and forest areas, which is similar to previous results from other researchers. Among the upstream reaches, Pyeongchanggang and Jucheongang showed higher sediment yields, which was thought to be caused by their larger area and higher fraction of upland compared with other upstream sub-areas. Monthly sediment yields at the main outlet showed the same trend as the seasonal rainfall distribution; that is, approximately 62% of the annual yield was generated from July to August, and the amount was about 208 ton/yr. From the results, we could obtain a uniform value of the sediment yield rate, roughly evaluate the effect of land use on soil loss, and analyze the temporal and spatial characteristics of sediment yields from each reach as well as the monthly variation for the Chungju Dam upstream watershed.