• Title/Summary/Keyword: similarity calculation


A Study on Market Size Estimation Method by Product Group Using Word2Vec Algorithm (Word2Vec을 활용한 제품군별 시장규모 추정 방법에 관한 연구)

  • Jung, Ye Lim;Kim, Ji Hui;Yoo, Hyoung Sun
    • Journal of Intelligence and Information Systems / v.26 no.1 / pp.1-21 / 2020
  • With the rapid development of artificial intelligence technology, various techniques have been developed to extract meaningful information from unstructured text data, which constitutes a large portion of big data. Over the past decades, text mining technologies have been utilized in various industries for practical applications. In the field of business intelligence, text mining has been employed to discover new market and/or technology opportunities and to support rational decision making by business participants. Market information such as market size, market growth rate, and market share is essential for setting companies' business strategies. There has been a continuous demand in various fields for market information at the specific product level. However, such information has generally been provided at the industry level or in broad categories based on classification standards, making it difficult to obtain specific and appropriate information. In this regard, we propose a new methodology that can estimate the market sizes of product groups at more detailed levels than previously offered. We applied the Word2Vec algorithm, a neural-network-based semantic word embedding model, to enable automatic market size estimation from individual companies' product information in a bottom-up manner. The overall process is as follows: First, the data related to product information are collected, refined, and restructured into a form suitable for the Word2Vec model. Next, the preprocessed data are embedded into a vector space by Word2Vec, and product groups are derived by extracting similar product names based on cosine similarity. Finally, the sales data on the extracted products are summed to estimate the market size of each product group. As experimental data, product names from Statistics Korea's microdata (345,103 cases) were mapped into a multidimensional vector space by Word2Vec training. We performed parameter optimization for training and then applied a vector dimension of 300 and a window size of 15 as the optimized parameters for further experiments. We employed the index words of the Korean Standard Industry Classification (KSIC) as a product name dataset to cluster product groups more efficiently. Product names similar to the KSIC indexes were extracted based on cosine similarity, and the market size of the extracted products as one product category was calculated from individual companies' sales data. The market sizes of 11,654 specific product lines were automatically estimated by the proposed model. For performance verification, the results were compared with the actual market sizes of some items; the Pearson correlation coefficient was 0.513. Our approach has several advantages over previous studies. First, text mining and machine learning techniques were applied to market size estimation for the first time, overcoming the limitations of traditional methods that rely on sampling or require multiple assumptions. In addition, the level of the market category can be easily and efficiently adjusted to the purpose of information use by changing the cosine similarity threshold. Furthermore, the approach has high potential for practical application, since it can resolve unmet needs for detailed market size information in the public and private sectors. Specifically, it can be utilized in technology evaluation and technology commercialization support programs conducted by governmental institutions, as well as in business strategy consulting and market analysis report publishing by private firms. A limitation of our study is that the presented model needs to be improved in terms of accuracy and reliability. The semantics-based word embedding module could be advanced by imposing a proper order on the preprocessed dataset or by combining another measure such as Jaccard similarity with Word2Vec. The product group clustering could also be replaced with other types of unsupervised machine learning algorithms. Our group is currently working on subsequent studies, which we expect will further improve the performance of the basic model conceptually proposed in this study.
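
The pipeline described above can be sketched with an off-the-shelf embedding library. Below is a minimal sketch (not the authors' code), assuming gensim 4.x and a hypothetical pre-tokenized set of product-name records; only the 300-dimension and window-15 settings come from the abstract.

```python
from gensim.models import Word2Vec

# Hypothetical tokenized product-name records, one list per company record.
corpus = [["stainless", "steel", "pipe"], ["steel", "pipe", "fitting"],
          ["office", "chair"], ["ergonomic", "office", "chair"]]

# Parameters follow the abstract: 300-dimensional vectors, window size 15.
model = Word2Vec(sentences=corpus, vector_size=300, window=15, min_count=1)

def product_group(seed, threshold=0.5):
    """Words whose cosine similarity to the seed exceeds the threshold.

    Lowering the threshold broadens the product group, which mirrors how
    the paper adjusts the market-category level via the similarity cutoff.
    """
    return [w for w, sim in model.wv.most_similar(seed, topn=50)
            if sim >= threshold]

# Market size of the group = sum of sales mapped to the matched names
# (sales figures here are hypothetical).
sales = {"pipe": 120.0, "fitting": 30.0, "chair": 55.0}
print(sum(sales.get(w, 0.0) for w in product_group("pipe")))
```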

A Study on the Effect of Network Centralities on Recommendation Performance (네트워크 중심성 척도가 추천 성능에 미치는 영향에 대한 연구)

  • Lee, Dongwon
    • Journal of Intelligence and Information Systems / v.27 no.1 / pp.23-46 / 2021
  • Collaborative filtering, which is often used in personalized recommendation, is recognized as a very useful technique for finding similar customers and recommending products to them based on their purchase history. However, the traditional collaborative filtering technique has difficulty calculating similarities for new customers or products, because similarities are calculated from direct connections and common features among customers. For this reason, hybrid techniques were designed that use content-based filtering together with collaborative filtering. In parallel, efforts have been made to solve these problems by applying the structural characteristics of social networks, indirectly calculating similarities through the similar customers placed between two customers. This means creating a customer network based on purchase data and calculating the similarity between two customers from the features of the network that indirectly connects them. Such similarity can be used as a measure to predict whether a target customer will accept a recommendation, and the centrality metrics of the network can be utilized to calculate it. Different centrality metrics are important in that they may have different effects on recommendation performance. Furthermore, in this study, the effect of these centrality metrics on recommendation performance may vary depending on the recommender algorithm. In addition, recommendation techniques using network analysis can be expected to increase recommendation performance not only for new customers or products but also for entire sets of customers or products. By considering a customer's purchase of an item as a link generated between the customer and the item on the network, predicting user acceptance of a recommendation becomes predicting whether a new link will be created between them. Because classification models fit this binary problem of whether a link is created or not, decision tree, k-nearest neighbors (KNN), logistic regression, artificial neural network, and support vector machine (SVM) models were selected for the research. The data for performance evaluation were order data collected from an online shopping mall over four years and two months. Of these, the first three years and eight months of records were organized into the social network used in the experiment, and the next four months' records were used to train and evaluate the recommender models. Experiments with the centrality metrics applied to each model show that the recommendation acceptance rates of the centrality metrics differ for each algorithm at a meaningful level. In this work, we analyzed only four commonly used centrality metrics: degree centrality, betweenness centrality, closeness centrality, and eigenvector centrality. Eigenvector centrality recorded the lowest performance in all models except the support vector machine. Closeness centrality and betweenness centrality showed similar performance across all models. Degree centrality ranked in the middle across the models, while betweenness centrality always ranked higher than degree centrality. Finally, closeness centrality was characterized by distinct differences in performance according to the model. It ranked first in logistic regression, artificial neural network, and decision tree with numerically high performance, but recorded very low rankings in the support vector machine and k-nearest neighbors models with low performance levels. As the experimental results reveal, in a classification model, network centrality metrics over a subnetwork that connects two nodes can effectively predict the connectivity between the two nodes in a social network. Furthermore, each metric performs differently depending on the classification model type. This result implies that choosing appropriate metrics for each algorithm can lead to higher recommendation performance. In general, betweenness centrality can guarantee a high level of performance in any model, and introducing closeness centrality could be considered to obtain higher performance for certain models.
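
As a concrete illustration of this link-prediction framing, the sketch below (assumptions: networkx for the centrality metrics, scikit-learn for the classifier; the graph and training pairs are stand-ins, not the paper's shopping-mall data) uses the four centrality metrics of a candidate link's endpoints as features for a binary classifier.

```python
import networkx as nx
import numpy as np
from sklearn.linear_model import LogisticRegression

G = nx.karate_club_graph()  # stand-in for the customer-item network

# The four centrality metrics analyzed in the paper.
metrics = {
    "degree": nx.degree_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
    "closeness": nx.closeness_centrality(G),
    "eigenvector": nx.eigenvector_centrality(G),
}

def link_features(u, v):
    """Concatenate the four centralities of both endpoints of a candidate link."""
    return [metrics[m][n] for n in (u, v) for m in metrics]

# Hypothetical training pairs: existing edges (label 1) vs non-edges (label 0).
pos = list(G.edges())[:30]
neg = list(nx.non_edges(G))[:30]
X = np.array([link_features(u, v) for u, v in pos + neg])
y = np.array([1] * len(pos) + [0] * len(neg))

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba([link_features(0, 33)])[0, 1])  # new-link probability
```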

Comparative Study of Commercial CFD Software Performance for Prediction of Reactor Internal Flow (원자로 내부유동 예측을 위한 상용 전산유체역학 소프트웨어 성능 비교 연구)

  • Lee, Gong Hee;Bang, Young Seok;Woo, Sweng Woong;Kim, Do Hyeong;Kang, Min Ku
    • Transactions of the Korean Society of Mechanical Engineers B / v.37 no.12 / pp.1175-1183 / 2013
  • Even though some CFD software developers and their users believe that state-of-the-art CFD software can reasonably solve at least single-phase nuclear reactor safety problems, limitations and uncertainties remain in the calculation results. From a regulatory perspective, the Korea Institute of Nuclear Safety (KINS) is presently conducting a performance assessment of commercial CFD software for nuclear reactor safety problems. In this study, to examine the prediction performance of commercial CFD software with a porous model in the analysis of the scaled-down APR+ (Advanced Power Reactor Plus) internal flow, simulations were conducted with the built-in numerical models in ANSYS CFX R.14 and FLUENT R.14. It was concluded that, depending on the CFD software, the internal flow distribution of the scaled-down APR+ was locally somewhat different. Although the limited amount of measured data constrained the assessment of prediction performance, CFX R.14 showed more reasonable predictions than FLUENT R.14. Meanwhile, owing to differences in discretization methodology, FLUENT R.14 required more computational memory than CFX R.14 for the same grid system. Therefore, CFD software suitable for the available computational resources should be selected for massively parallel computations.
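
For context on the porous model mentioned above: in both CFX and FLUENT, a porous region is commonly represented as a momentum sink combining a viscous (Darcy) term and an inertial loss term. Below is a minimal sketch with hypothetical coefficients, not values from the paper.

```python
def porous_pressure_gradient(v, mu=8.9e-4, rho=997.0, alpha=1e-7, c2=5.0):
    """Streamwise pressure gradient (Pa/m) in a porous zone.

    dp/dx = -(mu/alpha)*v - c2*(rho/2)*v*|v|, where v is the superficial
    velocity (m/s), alpha the permeability (m^2), and c2 the inertial
    resistance factor (1/m). All coefficient values here are hypothetical.
    """
    return -(mu / alpha) * v - c2 * 0.5 * rho * v * abs(v)

print(porous_pressure_gradient(2.0))  # ≈ -2.8e4 Pa/m for these coefficients
```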

A study of calculate a time to peak enhancement of contrast level by using blood flow (혈류에 의한 조영제 peak time의 산출에 관한 연구)

  • Choi, Kwan-Woo;Son, Soon-Yong;Lee, Ho-Beom
    • Journal of the Korea Academia-Industrial cooperation Society / v.14 no.5 / pp.2315-2321 / 2013
  • This study attempts to develop and propose a new process, one that minimizes side effects, for calculating the time to peak contrast enhancement from blood flow instead of the current mathematical process. We studied 127 patients who underwent CE MRA using a test-bolus contrast injection method. We measured the contrast inflow time and the time to peak enhancement of each cerebrovascular branch, and compared the blood-flow-based calculation of the time to peak enhancement for each branch against the time to peak enhancement calculated by the current mathematical process after contrast enhancement, to determine which branch gives the most similar result. Confidence intervals were used for continuous variables. Differences existed among the four groups, but in group 1 there was no difference between the time to peak enhancement calculated by the mathematical method and the inflow time at the sigmoid sinus; this was statistically significant, and the Bland-Altman plot additionally showed significantly low heterogeneity. Thus, applying the new blood-flow-based method of calculating the time to peak enhancement can minimize the damage caused by side effects, maintain image quality, and provide easy and fast access. It could serve as a replacement for the current mathematical calculation of the time to peak enhancement.
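
The quantities involved can be illustrated on a test-bolus time-intensity curve. Below is a minimal sketch with synthetic data; the curve, the 10% arrival threshold, and the timings are hypothetical, not from the study.

```python
import numpy as np

times = np.arange(0, 30, 1.0)                  # seconds after injection
signal = np.exp(-0.5 * ((times - 14) / 3)**2)  # synthetic enhancement curve

# Inflow time: first sample exceeding 10% of peak signal (hypothetical rule).
inflow_time = times[np.argmax(signal > 0.1 * signal.max())]
# Time to peak enhancement: sample where the curve reaches its maximum.
time_to_peak = times[np.argmax(signal)]

print(f"inflow: {inflow_time} s, time to peak: {time_to_peak} s")
```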

Evaluation of Spatio-temporal Fusion Models of Multi-sensor High-resolution Satellite Images for Crop Monitoring: An Experiment on the Fusion of Sentinel-2 and RapidEye Images (작물 모니터링을 위한 다중 센서 고해상도 위성영상의 시공간 융합 모델의 평가: Sentinel-2 및 RapidEye 영상 융합 실험)

  • Park, Soyeon;Kim, Yeseul;Na, Sang-Il;Park, No-Wook
    • Korean Journal of Remote Sensing / v.36 no.5_1 / pp.807-821 / 2020
  • The objective of this study is to evaluate the applicability of representative spatio-temporal fusion models, developed for the fusion of mid- and low-resolution satellite images, to constructing a set of time-series high-resolution images for crop monitoring. In particular, the effects of the characteristics of the input image pairs on the prediction performance are investigated by considering the principle of spatio-temporal fusion. An experiment on the fusion of multi-temporal Sentinel-2 and RapidEye images over agricultural fields was conducted to evaluate prediction performance. Three representative fusion models, the Spatial and Temporal Adaptive Reflectance Fusion Model (STARFM), the SParse-representation-based SpatioTemporal reflectance Fusion Model (SPSTFM), and Flexible Spatiotemporal DAta Fusion (FSDAF), were applied in this comparative experiment. The three spatio-temporal fusion models exhibited different prediction performance in terms of prediction errors and spatial similarity. However, regardless of model type, the correlation between the coarse-resolution images acquired on the pair dates and on the prediction date was more important for improving prediction performance than the temporal gap between the pair dates and the prediction date. In addition, using a vegetation index as the input to spatio-temporal fusion showed better prediction performance by alleviating error propagation, compared with calculating the vegetation index from fused reflectance values. These experimental results can be used as basic information both for the selection of optimal image pairs and input types and for the development of advanced spatio-temporal fusion models for crop monitoring.
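
For orientation, the temporal-difference idea shared by these fusion models can be reduced to a one-line update. This is a drastic simplification of STARFM-like models (which additionally weight spectrally and spatially similar neighbors), shown here on hypothetical NDVI arrays, not the Sentinel-2/RapidEye data.

```python
import numpy as np

fine_t0 = np.random.rand(100, 100)    # fine-resolution NDVI at the pair date
coarse_t0 = np.random.rand(100, 100)  # coarse NDVI at the pair date (resampled)
coarse_tp = coarse_t0 + 0.05          # coarse NDVI at the prediction date

# Predicted fine image at the prediction date: add the temporal change
# observed in the co-registered coarse images to the fine pair-date image.
fine_tp = fine_t0 + (coarse_tp - coarse_t0)
```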

Content Based Image Retrieval using 8AB Representation of Spatial Relations between Objects (객체 위치 관계의 8AB 표현을 이용한 내용 기반 영상 검색 기법)

  • Joo, Chan-Hye;Chung, Chin-Wan;Park, Ho-Hyun;Lee, Seok-Lyong;Kim, Sang-Hee
    • Journal of KIISE:Databases / v.34 no.4 / pp.304-314 / 2007
  • Content-based image retrieval (CBIR) stores and retrieves images using feature descriptions of image contents. In order to support more accurate image retrieval, it has become necessary to develop features that can effectively describe image contents. The commonly used low-level features, such as color, texture, and shape, may not map directly to human visual perception. In addition, such features cannot effectively describe a single image that contains multiple objects of interest. As a result, research on feature description has shifted its focus to higher-level features that support representations closer to human visual perception, such as spatial relationships between objects. Nevertheless, prior work on the representation of spatial relations still has shortcomings, particularly with respect to supporting rotational invariance. Rotational invariance is a key requirement for a feature description to provide robust and accurate retrieval of images. This paper proposes a high-level feature named 8AB (8 Angular Bin) that effectively describes the spatial relations of objects in an image while providing rotational invariance. With this representation, a similarity calculation and a retrieval technique are also proposed. In addition, this paper proposes a search-space pruning technique that supports efficient image retrieval using the 8AB feature. The 8AB feature is incorporated into a CBIR system, and experiments on both real and synthetic image sets show the effectiveness of 8AB as a high-level feature and the efficiency of the pruning technique.
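
The following sketch is one plausible reading of an 8-angular-bin description (an illustrative interpretation, not the paper's exact definition): pairwise angles between object centroids are quantized into eight 45° bins, and similarity takes the best histogram match over the eight circular shifts, which makes the measure rotation-invariant.

```python
import numpy as np
from itertools import combinations

def angular_histogram(centroids):
    """8-bin histogram of pairwise angles between object centroids."""
    hist = np.zeros(8)
    for (x1, y1), (x2, y2) in combinations(centroids, 2):
        angle = np.arctan2(y2 - y1, x2 - x1) % (2 * np.pi)
        hist[int(angle // (np.pi / 4)) % 8] += 1
    return hist

def similarity(h1, h2):
    """Max histogram intersection over circular shifts (rotation invariance)."""
    return max(np.minimum(np.roll(h1, k), h2).sum() for k in range(8))

a = angular_histogram([(0, 0), (1, 0), (0, 1)])
b = angular_histogram([(0, 0), (0, 1), (-1, 0)])  # same layout rotated 90°
print(similarity(a, b))  # full match despite the rotation
```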

Scaleup of Electrolytic Reactors in Pyroprocessing (Pyroprocessing 공정에 사용되는 전해반응장치의 규모 확대)

  • Yoo, Jae-Hyung;Kim, Jeong-Guk;Lee, Han-Soo
    • Journal of Nuclear Fuel Cycle and Waste Technology(JNFCWT) / v.7 no.4 / pp.237-242 / 2009
  • In the pyroprocessing of spent nuclear fuels, fuel materials are recovered by electrochemical reactions on the surfaces of electrodes while the electrolyte is stirred in electrolytic cells such as the electrorefiner, electroreducer, and electrowinner. This equipment must first be scaled up in order to commercialize pyroprocessing. In this study, therefore, the scale-up of these electrolytic cells was studied in order to design a large-scale system that can be employed in a future commercial process. Basically, the dimensions of both the electrolytic cells and the electrodes should be enlarged on the basis of geometric similarity. The criterion of constant power input per unit volume, which characterizes the fluid behavior in the cells, was then introduced, and a trial-and-error calculation procedure was derived that makes it possible to determine a proper agitation speed in the electrolytic cells. Finally, scale-up examples for an arbitrary small-scale system are shown for the criterion of constant power input per unit volume and, separately, for the criterion of constant impeller tip speed.
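
Under geometric similarity and turbulent mixing, the two criteria named above have standard closed forms from stirred-vessel theory: power scales as $P \propto N^3D^5$ and volume as $V \propto D^3$, so constant $P/V$ gives $N_2 = N_1(D_1/D_2)^{2/3}$, while constant tip speed ($\pi N D$) gives $N_2 = N_1(D_1/D_2)$. A minimal sketch with hypothetical numbers follows; this is textbook scale-up, not the paper's trial-and-error procedure.

```python
def scale_up_speed(n1, d1, d2, criterion="power_per_volume"):
    """Large-scale impeller speed n2 for impeller diameters d1 -> d2.

    Assumes geometric similarity and turbulent mixing, where the impeller
    power number is constant so P ~ N^3 D^5 and V ~ D^3.
    """
    if criterion == "power_per_volume":   # constant P/V
        return n1 * (d1 / d2) ** (2.0 / 3.0)
    if criterion == "tip_speed":          # constant pi*N*D
        return n1 * (d1 / d2)
    raise ValueError(criterion)

# Example: scaling a 0.2 m impeller at 300 rpm up to a 1.0 m impeller.
print(scale_up_speed(300, 0.2, 1.0, "power_per_volume"))  # ≈ 103 rpm
print(scale_up_speed(300, 0.2, 1.0, "tip_speed"))         # 60 rpm
```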

Representative Labels Selection Technique for Document Cluster using WordNet (문서 클러스터를 위한 워드넷기반의 대표 레이블 선정 방법)

  • Kim, Tae-Hoon;Sohn, Mye
    • Journal of Internet Computing and Services / v.18 no.2 / pp.61-73 / 2017
  • In this paper, we propose a document cluster labeling method that uses the information content of the words in the clusters to understand what the clusters imply. To do so, we calculate the weight and frequency of the words; these two measures are used to determine the relative weight among the words in a cluster. As a next step, we identify candidate labels using WordNet. At this stage, the candidate labels are matched to the least common hypernyms of the words in the cluster. Finally, the representative labels are determined from the information content and the weight of the words. To demonstrate the superiority of our method, we perform a heuristic experiment using two kinds of measures, the suitability of the candidate labels ($Suitability_{cl}$) and the appropriacy of the representative labels ($Appropriacy_{rl}$). When the method proposed in this research is applied, the suitability of the candidate labels decreases slightly compared with existing methods, but the computational cost is about 20% of that of the conventional methods. We also confirmed that the appropriacy of the representative labels gives better results than the existing methods. As a result, the method is expected to help data analysts interpret document clusters more easily.
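
A minimal sketch of the hypernym step, assuming NLTK's WordNet interface (not the authors' implementation): candidate labels are collected as the lowest common hypernyms of word pairs in a hypothetical cluster, after which they could be scored by weight and information content as the abstract describes.

```python
from itertools import combinations
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

cluster_words = ["car", "truck", "bicycle"]  # hypothetical document cluster

candidates = set()
for w1, w2 in combinations(cluster_words, 2):
    s1, s2 = wn.synsets(w1, pos=wn.NOUN), wn.synsets(w2, pos=wn.NOUN)
    if s1 and s2:
        # Least common hypernym of the first senses of the two words.
        for hyp in s1[0].lowest_common_hypernyms(s2[0]):
            candidates.add(hyp.name())

print(candidates)  # e.g. {'wheeled_vehicle.n.01', ...}
```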

On Estimation of Zero Plane Displacement from Single-Level Wind Measurement above a Coniferous Forest (침엽수림 상부의 단일층 풍속 관측으로부터의 영면변위 추정에 관하여)

  • Yoo, Jae-Ill;Hong, Jin-Kyu;Kwon, Hyo-Jung;Lim, Jong-Hwan;Kim, Joon
    • Korean Journal of Agricultural and Forest Meteorology / v.12 no.1 / pp.45-62 / 2010
  • Zero plane displacement (d) is the elevated height of the apparent momentum sink exerted by the vegetation on the air. For a vegetative canopy, d depends on the roughness structure of the plant canopy, such as leaf area index, canopy height, and canopy density, and is thus critical for the analysis of canopy turbulence and the calculation of surface scalar fluxes. In this research note, we estimated d at the Gwangneung coniferous forest by employing the two independent methods of Rotach (1994) and Martano (2000), which require only a single-level eddy-covariance measurement. In general, these two methods provided comparable estimates of $d/h_c$ (where $h_c$ is the canopy height, i.e., ~23 m), which ranged from 0.51 to 0.97 depending on wind direction. These estimates of $d/h_c$ were within the range (i.e., 0.64~0.94) reported for other forests in the literature but were sensitive to the forms of the nondimensional functions for atmospheric stability. Our finding indicates that one should be careful in the interpretation of a zero plane displacement estimated from a single-level eddy-covariance measurement conducted within the roughness sublayer.
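
As a rough illustration of how d can be recovered from single-level data (a heavily simplified version of the flux-profile idea; the actual methods of Rotach (1994) and Martano (2000) differ in detail), one can least-squares fit the stability-corrected similarity relation $u/u_* = (1/k)[\ln((z-d)/z_0) - \psi((z-d)/L)]$ over many runs with varying stability. All numbers below are synthetic.

```python
import numpy as np
from scipy.optimize import curve_fit

K, Z = 0.4, 40.0  # von Karman constant; hypothetical measurement height (m)

def psi(zeta):
    """Businger-Dyer stability correction for momentum."""
    x = (1.0 - 16.0 * np.minimum(zeta, 0.0)) ** 0.25
    psi_unstable = (2 * np.log((1 + x) / 2) + np.log((1 + x**2) / 2)
                    - 2 * np.arctan(x) + np.pi / 2)
    return np.where(zeta < 0, psi_unstable, -5.0 * zeta)

def wind_similarity(L, d, z0):
    """Predicted u/u* at height Z for Obukhov length L, given d and z0."""
    zeta = (Z - d) / L
    return (np.log((Z - d) / z0) - psi(zeta)) / K

# Synthetic runs: Obukhov lengths and 'measured' u/u* with a small offset.
L_runs = np.array([-50.0, -100.0, -200.0, 200.0, 100.0])
u_over_ustar = wind_similarity(L_runs, 15.0, 2.0) + 0.05

(d_hat, z0_hat), _ = curve_fit(wind_similarity, L_runs, u_over_ustar,
                               p0=(10.0, 1.0),
                               bounds=([0.0, 0.1], [Z - 1.0, 10.0]))
print(d_hat, z0_hat)  # recovers d ≈ 15 m, z0 ≈ 2 m
```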

A Three-Dimensional Galerkin-FEM Model with Density Variation (밀도 변화를 포함하는 3차원 연직함수 전개모형)

  • 이호진;정경태;소재귀;강관수;정종율
    • Journal of Korean Society of Coastal and Ocean Engineers / v.8 no.2 / pp.123-136 / 1996
  • A three-dimensional Galerkin-FEM model that can handle the temporal and spatial variation of density is presented. The hydrostatic approximation is used, and density effects are included by means of the heat conservation equation and the equation of state. Finite difference grids are used in the horizontal plane, and a set of linear shape functions is used for the vertical expansion. A similarity transform is introduced to solve the resultant matrix equations. The proposed model was first applied to the density-driven circulation in an idealized basin in the presence of heat exchange between the air and the sea. The advection terms in the momentum equations were ignored, while the convection terms were retained in the heat equation. The coefficients of vertical eddy viscosity and diffusivity were fixed as constants. Calculations in a non-rotating idealized basin show that the difference in heat capacity with depth gives rise to a horizontal temperature gradient. Consequently, there is a steady flow in the upper layer in the direction of increasing depth, with a compensatory counterflow in the lower layer. With the Coriolis force included, geostrophic flow predominated owing to the balance between the pressure gradient and the Coriolis force. As a test in a region of irregular topography, the model was applied to the Yellow Sea. Although the resultant flow was very complex, its character was shown to be geostrophic on the whole.
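
A minimal sketch of the vertical Galerkin step (assumptions, not the paper's code): expand the vertical structure in linear "hat" shape functions, assemble the standard 1-D finite-element mass and diffusion matrices, and diagonalize the generalized eigenproblem; the eigenvector matrix provides the similarity transform that decouples the vertical modes.

```python
import numpy as np
from scipy.linalg import eigh

n, H = 10, 50.0   # number of vertical elements; water depth (m), hypothetical
h = H / n         # element thickness
nu = 1e-2         # constant vertical eddy viscosity (m^2/s)

# Assemble (n+1)x(n+1) mass matrix M and diffusion (stiffness) matrix A
# for linear elements on a uniform vertical grid.
M = np.zeros((n + 1, n + 1))
A = np.zeros((n + 1, n + 1))
for e in range(n):
    i = slice(e, e + 2)
    M[i, i] += h / 6.0 * np.array([[2.0, 1.0], [1.0, 2.0]])
    A[i, i] += nu / h * np.array([[1.0, -1.0], [-1.0, 1.0]])

# Generalized eigenproblem A v = lambda M v; the eigenvector matrix is the
# similarity transform that decouples the semi-discrete vertical equations.
lam, V = eigh(A, M)
print(lam[:3])  # slowest-decaying vertical modes (first is the barotropic ~0)
```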
