• Title/Summary/Keyword: Data Clustering


A Study on Market Size Estimation Method by Product Group Using Word2Vec Algorithm (Word2Vec을 활용한 제품군별 시장규모 추정 방법에 관한 연구)

  • Jung, Ye Lim;Kim, Ji Hui;Yoo, Hyoung Sun
    • Journal of Intelligence and Information Systems
    • /
    • v.26 no.1
    • /
    • pp.1-21
    • /
    • 2020
  • With the rapid development of artificial intelligence technology, various techniques have been developed to extract meaningful information from unstructured text data, which constitutes a large portion of big data. Over the past decades, text mining technologies have been utilized in various industries for practical applications. In the field of business intelligence, text mining has been employed to discover new market and/or technology opportunities and to support rational decision making by business participants. Market information such as market size, market growth rate, and market share is essential for setting companies' business strategies. There has been continuous demand in various fields for market information at the level of specific products. However, such information has generally been provided at the industry level or in broad categories based on classification standards, making it difficult to obtain specific and appropriate information. In this regard, we propose a new methodology that can estimate the market sizes of product groups at more detailed levels than those previously offered. We applied the Word2Vec algorithm, a neural-network-based semantic word embedding model, to enable automatic market size estimation from individual companies' product information in a bottom-up manner. The overall process is as follows: First, the data related to product information is collected, refined, and restructured into a form suitable for the Word2Vec model. Next, the preprocessed data is embedded into a vector space by Word2Vec, and the product groups are derived by extracting similar product names based on cosine similarity. Finally, the sales data for the extracted products is summed to estimate the market size of each product group. As experimental data, product names from Statistics Korea's microdata (345,103 cases) were mapped into a multidimensional vector space by Word2Vec training. We optimized the training parameters and then applied a vector dimension of 300 and a window size of 15 in further experiments. We employed the index words of the Korean Standard Industry Classification (KSIC) as a product name dataset to cluster product groups more efficiently. Product names similar to the KSIC index words were extracted based on cosine similarity. The market size of the extracted products, treated as one product category, was calculated from individual companies' sales data. The market sizes of 11,654 specific product lines were automatically estimated by the proposed model. For performance verification, the results were compared with the actual market sizes of some items; the Pearson correlation coefficient was 0.513. Our approach has several advantages over previous studies. First, text mining and machine learning techniques were applied to market size estimation for the first time, overcoming the limitations of traditional methods that rely on sampling or on multiple assumptions. In addition, the level of the market category can be adjusted easily and efficiently, according to the purpose of the information, by changing the cosine similarity threshold. Furthermore, the approach has high potential for practical application since it can resolve unmet needs for detailed market size information in the public and private sectors. Specifically, it can be utilized in technology evaluation and technology commercialization support programs conducted by governmental institutions, as well as in business strategy consulting and market analysis report publishing by private firms. The limitation of our study is that the presented model needs to be improved in terms of accuracy and reliability. The semantic word embedding module could be advanced by imposing a proper order on the preprocessed dataset or by combining another measure, such as Jaccard similarity, with Word2Vec. The product group clustering method could also be replaced with other types of unsupervised machine learning algorithms. Our group is currently working on subsequent studies, and we expect them to further improve the performance of the basic model conceptually proposed in this study.
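
The bottom-up estimation step described in this abstract (extract product names whose vectors are close to an index term, then sum their sales) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the three-dimensional vectors stand in for trained Word2Vec embeddings, and the product names, sales figures, and the `estimate_market_size` helper are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical toy vectors standing in for trained Word2Vec embeddings.
embeddings = {
    "lithium battery": [0.9, 0.1, 0.0],
    "battery pack": [0.8, 0.2, 0.1],
    "rice cooker": [0.0, 0.9, 0.3],
}
# Hypothetical sales per product name (arbitrary units).
sales = {"lithium battery": 120.0, "battery pack": 80.0, "rice cooker": 50.0}
# Hypothetical vector for a classification index term.
index_vec = [1.0, 0.1, 0.0]

def estimate_market_size(index_vec, embeddings, sales, threshold):
    """Collect products similar to the index term and sum their sales."""
    members = sorted(
        name for name, vec in embeddings.items()
        if cosine(index_vec, vec) >= threshold
    )
    return members, sum(sales[name] for name in members)

members, size = estimate_market_size(index_vec, embeddings, sales, threshold=0.9)
print(members, size)
```

Raising or lowering `threshold` narrows or widens the product group, which mirrors the adjustable category level mentioned in the abstract.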

A Study on the Satisfaction of Self-Employed (만족도를 이용한 자영업에 관한 연구)

  • Oh, Yu-Jin
    • The Korean Journal of Applied Statistics
    • /
    • v.22 no.2
    • /
    • pp.281-296
    • /
    • 2009
  • This study examines the job and life satisfaction of the self-employed. It uses data from the Korean Labour and Income Panel Study (KLIPS, hereafter) for 1998 and 2004. We examine the levels of satisfaction and the variables that influence satisfaction in both years, and compare the results to see what changed between the two regimes. We use k-means clustering to divide the self-employed into groups with similar degrees of satisfaction. As a result, we are able to classify the self-employed into three groups (low, medium, and high) in both regimes. The high group is relatively younger and better educated, works fewer days, and has a higher proportion of women than the other groups. The regression analysis provides some evidence that women are more satisfied with their jobs than men and that, for life satisfaction, the existence of income is more important than its amount. Age, education, satisfaction with the workplace, and health are significant for both types of satisfaction.
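
A minimal sketch of the k-means grouping step mentioned in this abstract, assuming one satisfaction score per respondent. The scores are invented for illustration, and the deterministic initialization is a simplification chosen to keep the example reproducible; it is not taken from the study.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on scalar values; requires k >= 2."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted values.
    centers = [ordered[(i * (len(ordered) - 1)) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        # Recompute each center as the mean of its members (keep it if empty).
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical satisfaction scores on a 1-5 scale.
scores = [1.0, 1.5, 1.8, 3.0, 3.2, 2.8, 4.5, 5.0, 4.8]
centers, clusters = kmeans_1d(scores, k=3)
print(clusters)  # low, medium, and high groups
```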

A Grouping Method of Photographic Advertisement Information Based on the Efficient Combination of Features (특징의 효과적 병합에 의한 광고영상정보의 분류 기법)

  • Jeong, Jae-Kyong;Jeon, Byeung-Woo
    • Journal of the Institute of Electronics Engineers of Korea CI
    • /
    • v.48 no.2
    • /
    • pp.66-77
    • /
    • 2011
  • We propose a framework for grouping photographic advertising images that employs a hierarchical indexing scheme based on efficient feature combinations. The study provides one specific application of effective tools for monitoring photographic advertising information through online and offline channels. Specifically, it develops a preprocessor for advertising image information tracking. We consider both global features, which contain general information on the overall image, and local features, which are based on local image characteristics. The developed local features are invariant under image rotation and scale, the addition of noise, and changes in illumination. Thus, they achieve reliable matching between different views of a scene across affine transformations and exhibit high accuracy in the search for matched pairs of identical images. The method first uses global features to organize coarse clusters, each consisting of several image groups, and then executes fine matching with local features within each cluster to construct refined clusters, each containing a group of identical images. Fine matching with local features between identical images is computationally expensive, so to reduce the computation time we apply a conventional clustering method that first groups together images with similar global characteristics.
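
The coarse-to-fine idea in this abstract can be sketched as below. Everything concrete here is an assumption made for illustration: each image is reduced to a single scalar global feature and a set of local descriptor ids, leader clustering stands in for the unspecified "conventional clustering method", and Jaccard overlap stands in for local-feature matching.

```python
from itertools import combinations

# Hypothetical records: (image id, global feature, set of local descriptor ids).
images = [
    ("a1", 0.10, {1, 2, 3, 4}),
    ("a2", 0.12, {1, 2, 3, 5}),
    ("b1", 0.15, {7, 8, 9}),
    ("c1", 0.80, {30, 31, 32, 33}),
    ("c2", 0.82, {20, 21, 22}),
]

def coarse_clusters(images, radius):
    """Leader clustering on the global feature (cheap, one pass)."""
    clusters = []  # list of (leader value, members)
    for img in images:
        for leader, members in clusters:
            if abs(img[1] - leader) <= radius:
                members.append(img)
                break
        else:
            clusters.append((img[1], [img]))
    return [members for _, members in clusters]

def fine_groups(cluster, min_overlap):
    """Pairwise local matching, restricted to one coarse cluster."""
    groups = {img[0]: {img[0]} for img in cluster}
    comparisons = 0
    for x, y in combinations(cluster, 2):
        comparisons += 1
        jaccard = len(x[2] & y[2]) / len(x[2] | y[2])
        if jaccard >= min_overlap:
            merged = groups[x[0]] | groups[y[0]]
            for name in merged:
                groups[name] = merged
    unique = {frozenset(g) for g in groups.values()}
    return sorted(sorted(g) for g in unique), comparisons

all_groups, total = [], 0
for cluster in coarse_clusters(images, radius=0.1):
    groups, n = fine_groups(cluster, min_overlap=0.5)
    all_groups.extend(groups)
    total += n
print(all_groups, total)
```

With five images, exhaustive fine matching would need 10 pairwise comparisons; restricting it to the two coarse clusters needs 4 in this toy case.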

A STUDY OF MANDIBULAR DENTAL ARCH OF KOREAN ADULTS (한국 성인 유치악자의 하악 치열궁에 관한 조사)

  • Kim, Il-Han;Choi, Dae-Gyun
    • The Journal of Korean Academy of Prosthodontics
    • /
    • v.36 no.1
    • /
    • pp.166-182
    • /
    • 1998
  • The purposes of this study are to evaluate the Korean mandibular dental arch and to classify its shape and size based on the incisal angle, canine angle, and inter-second molar width and height. Mandibular study models were fabricated using irreversible hydrocolloid impression material from 225 volunteers with a mean age of 23.62 years (range 19-29). The study models were measured with a 3-dimensional measuring device, the mandibular dental arch was classified by means of the K-means clustering method and visual inspection, and the obtained data were analyzed with the t-test. The results were as follows: 1. The average canine height was 5.19 mm (s.d. 1.17) for both sexes combined, 5.34 mm in males, and 4.95 mm in females, and the difference between the sexes was significant. 2. The average second molar height was 39.81 mm (s.d. 2.44) for both sexes combined, 40.19 mm in males, and 39.21 mm in females, and the difference between the sexes was significant. 3. The average inter-canine width was 27.16 mm (s.d. 1.78) for both sexes combined, 27.41 mm in males, and 26.77 mm in females, and the difference between the sexes was significant. 4. The average inter-first molar width was 46.93 mm (s.d. 2.67) for both sexes combined, 47.72 mm in males, and 45.7 mm in females, and the difference between the sexes was significant. 5. The average inter-second molar width was 56.09 mm (s.d. 3.01) for both sexes combined, 57.24 mm in males, and 54.32 mm in females, and the difference between the sexes was significant. 6. The arch form was classified into three shapes based on the incisal and canine angles. The V-shape showed an incisal angle of $124.88^{\circ}$ and a canine angle of $141.64^{\circ}$, the U-shape showed $152.76^{\circ}$ and $125.35^{\circ}$, and the O-shape showed $138.03^{\circ}$ and $33.66^{\circ}$, respectively. Of the 225 study models, 14.2% were V-shaped, 14.7% were U-shaped, and 71.1% were O-shaped. 7. The second molar width was considered more reasonable than the height for classifying the dental arch size. The arch size was classified into four sizes based on the second molar width: size 1 ranged from 42.24 to 48.23 mm, size 2 from 48.24 to 54.23 mm, size 3 from 54.24 to 60.23 mm, and size 4 from 60.24 to 66.23 mm. Of the 225 study models, 1.3% were size 1, 27.1% were size 2, 63.6% were size 3, and 8.0% were size 4.
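
The four size classes reported above amount to a lookup on the inter-second molar width. The sketch below encodes the ranges exactly as given in the abstract; the function name and the rounding to two decimals are assumptions added for illustration.

```python
# Size classes from the abstract: (size, lower bound mm, upper bound mm).
SIZE_CLASSES = [
    (1, 42.24, 48.23),
    (2, 48.24, 54.23),
    (3, 54.24, 60.23),
    (4, 60.24, 66.23),
]

def arch_size_class(width_mm):
    """Return the size class for an inter-second molar width, or None if out of range."""
    w = round(width_mm, 2)  # the reported ranges use two decimals
    for size, low, high in SIZE_CLASSES:
        if low <= w <= high:
            return size
    return None

# The reported overall mean width of 56.09 mm falls in size 3,
# which is also the most common class (63.6% of models).
print(arch_size_class(56.09))
```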


Seasonal fluctuations and changing characteristics of a temperate zone wetland bird community

  • Lee, Soo-Dong;Kang, Hyun-Kyung
    • Journal of Ecology and Environment
    • /
    • v.43 no.2
    • /
    • pp.104-116
    • /
    • 2019
  • Background: The composition of wild bird populations in temperate zones varies greatly with phenological changes rather than with other environmental factors. In particular, wild birds appearing in wetlands fluctuate greatly because species arriving to breed in summer overlap with species arriving to winter. Therefore, to understand the changes in species composition related to phenology, we conducted this basic population analysis to support the protection of wetland-dependent wild birds. Methods: It is wrong to simply divide a wild bird population investigation into seasons. This study identifies the species composition and the indicator species that change with the seasons. The surveyed wetlands are protected under natural monument and wetland inventory designations and are in a near-natural state. To identify as many species as possible, the survey covered both shallow and deep wetlands. The water depth in these areas ranged from 0.2 to 2.0 m, allowing both dabbling and diving ducks to inhabit the area. Surveys were conducted at 2-week intervals using line-transect and distance sampling methods, by observing wild birds along the wetland boundary, under three categories: the eco-tone and emergent zone, the submergent zone, and the water surface. The PC-ORD program was used for clustering, and the SAS program was used to analyze the changes in species composition. The data strongly indicate that day length is the main factor determining seasonal migration periods, even though climate change and increasing temperatures are often discussed. Results and conclusions: The indicator species for determining seasons include migrant birds such as Ardea cinerea, Alcedo atthis, Anas penelope, and Podiceps ruficollis, as well as resident birds such as Streptopelia orientalis and Emberiza elegans. Importantly, increases in local individual counts of these species may also serve as indicators. The survey results on seasonal fluctuations in temperate zones show that spring (April to June), summer (July to September), autumn (October), and winter (November to March) are clearly distinguishable, even though the spring and summer seasons tend to overlap. Additional research could more clearly identify the fluctuation patterns in species composition and abundance in the study area.

Context-Dependent Classification of Multi-Echo MRI Using Bayes Compound Decision Model (Bayes의 복합 의사결정모델을 이용한 다중에코 자기공명영상의 context-dependent 분류)

  • 전준철;권수일
    • Investigative Magnetic Resonance Imaging
    • /
    • v.3 no.2
    • /
    • pp.179-187
    • /
    • 1999
  • Purpose: This paper introduces a computationally inexpensive context-dependent classification of multi-echo MRI with a Bayes compound decision model. To produce accurate region segmentation, especially in homogeneous areas and along region boundaries, we propose a classification method that uses contextual information from the local neighborhood system in the image. Materials and Methods: The performance of a context-free classifier over a statistically heterogeneous image can be improved if the locally stationary regions in the image are dissociated from each other through interaction parameters defined at the local neighborhood level. To improve the classification accuracy, we use contextual information, which resolves ambiguities in the class assignment of a pattern based on the labels of the neighboring patterns. Since the data immediately surrounding a given pixel are closely associated with that pixel, knowing the true nature of the surrounding pixels helps to determine the true nature of the given pixel. The proposed context-dependent compound decision model uses the compound Bayes decision rule with contextual information. As the contextual information in the model, the directional transition probabilities estimated from the local neighborhood system are used as the interaction parameters. Results: A context-dependent classification paradigm with a compound Bayesian model for multi-echo MR images is developed. Compared with context-free classification, which does not consider contextual information, the context-dependent classifier shows improved classification results, especially in homogeneous areas and along region boundaries, since contextual information is used during the classification. Conclusion: We introduce a new paradigm for classifying multi-echo MRI using clustering analysis and a Bayesian compound decision model to improve the classification results.
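
One way to picture how neighbor labels can resolve an ambiguous pixel is the simplified sketch below. It is not the paper's model: it works on a single row of pixels, uses one intensity channel instead of multi-echo data, and uses a single non-directional transition table, whereas the abstract describes directional transition probabilities. The class statistics, priors, and transition values are all hypothetical.

```python
import math

def gaussian(x, mean, std):
    """Gaussian likelihood of intensity x for a class with the given mean and std."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Hypothetical class statistics (mean, std) and priors.
CLASSES = {"tissue_a": (50.0, 15.0), "tissue_b": (100.0, 15.0)}
PRIOR = {"tissue_a": 0.5, "tissue_b": 0.5}
# Hypothetical transition probabilities: TRANSITION[neighbor label][label here].
TRANSITION = {
    "tissue_a": {"tissue_a": 0.9, "tissue_b": 0.1},
    "tissue_b": {"tissue_a": 0.1, "tissue_b": 0.9},
}

def context_free(pixels):
    """Plain Bayes rule per pixel, ignoring neighbors."""
    return [
        max(CLASSES, key=lambda c: PRIOR[c] * gaussian(x, *CLASSES[c]))
        for x in pixels
    ]

def context_dependent(pixels):
    """Reweight each pixel's Bayes score by transitions from its neighbors' labels."""
    initial = context_free(pixels)
    labels = []
    for i, x in enumerate(pixels):
        neighbors = [initial[j] for j in (i - 1, i + 1) if 0 <= j < len(pixels)]

        def score(c):
            s = PRIOR[c] * gaussian(x, *CLASSES[c])
            for n in neighbors:
                s *= TRANSITION[n][c]
            return s

        labels.append(max(CLASSES, key=score))
    return labels

row = [48.0, 52.0, 80.0, 49.0, 51.0]  # one noisy pixel inside a homogeneous area
print(context_free(row))
print(context_dependent(row))
```

In this toy row the context-free rule assigns the noisy middle pixel to the other class, while the neighbor-weighted rule keeps the region homogeneous.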


Gene Expression Profiling in Diethylnitrosamine Treated Mouse Liver: From Pathological Data to Microarray Analysis (Diethylnitrosamine 처리 후 병리학적 결과를 기초로 한 마우스 간에서의 유전자 발현 분석)

  • Kim, Ji-Young;Yoon, Seok-Joo;Park, Han-Jin;Kim, Yong-Bum;Cho, Jae-Woo;Koh, Woo-Suk;Lee, Michael
    • Toxicological Research
    • /
    • v.23 no.1
    • /
    • pp.55-63
    • /
    • 2007
  • Diethylnitrosamine (DEN) is a nitrosamine compound that forms DNA-carcinogen adducts and can induce a variety of liver lesions, including hepatic carcinoma. In the present study, microarray analyses were performed with the Affymetrix Murine Genome 430A Array in order to identify the gene expression profiles for DEN and to provide valuable information for the evaluation of potential hepatotoxicity. C57BL/6NCrj mice were orally administered a single dose of DEN at 0, 3, 7, or 20 mg/kg. The liver of each animal was removed 2, 4, 8, or 24 h after administration. The histopathological and serum biochemical analyses showed no significant difference between the DEN-treated groups and the control group. In contrast, the principal component analysis (PCA) profiles demonstrated that the normal gene expression profile of the control groups differed clearly from the expression profiles of the DEN-treated groups. Within groups, little variance was found between individuals. Student's t-test was performed on the results obtained from triplicate hybridizations to identify genes with statistically significant changes in expression. Statistical analysis revealed that 11 genes were significantly downregulated and 28 genes were upregulated in all three animals 2 h after treatment at 20 mg/kg. The upregulated group included Gdf15, JunD1, and Mdm2, while genes including Sox6, Shmt2, and Slc6a6 were largely downregulated. Hierarchical clustering of gene expression also allowed the identification of functionally related clusters that encode proteins related to metabolism and the MAPK signaling pathway. Taken together, this study suggests that, if a database containing response patterns to various toxic compounds is established, a match with a toxicant signature can assign a putative mechanism of action to a test compound.

Evaluation of Water Quality Characteristics in the Nakdong River using Statistical Analysis (통계분석을 이용한 낙동강유역의 수질변화 특성 조사)

  • Choi, Kil Yong;Im, Toe Hyo;Lee, Jae Woon;Cheon, Se Uk
    • Journal of Korea Water Resources Association
    • /
    • v.45 no.11
    • /
    • pp.1157-1168
    • /
    • 2012
  • In this study, we assess changes in water quality trends over time based on certain control measurements in order to identify and analyze the causes of those trends. Current water pollution in the Nakdong River was analyzed, as significant changes in water quality appear to have occurred between 2006 and 2010. Based on monthly average data, we examined trends in the Nakdong River watershed for water temperature, Biological Oxygen Demand (BOD), Chemical Oxygen Demand (COD), Total Nitrogen (TN), and Total Phosphorus (TP). Moreover, we investigated the seasonal variation of water quality at sites within the Nakdong River Basin through further analyses in SPSS, such as correlation analysis, regression analysis, hierarchical clustering, and time series analysis. Geology and topography of the watershed, controlled by various conditions such as climate, vegetation, topography, soil, and rain medium, have been affected by the non-homogeneity. Our study suggests that such variables could possibly cause eutrophication problems in the river. One possible way to overcome this particular problem is to lay up a ship on the river by increasing the nasal flow measurement of the Nakdong River during rainy season. Moreover, water management requires arranging flow measurement in order to secure the river, while the numerous construction projects need to be observed continuously. However, the water is not flowing tributary of the reason for the timing to be flowing in a natural state of river water and industrial water intake because agriculture. Therefore, ongoing research is needed in addition to configuration of all observations.

Analysis of Genetic Relatedness by Random Amplified Polymorphic DNA (RAPD) in Pecan Taxa (RAPD를 이용한 Pecan 품종의 유전적 관계 분석)

  • 신동영;김회택;박종인;노일섭
    • Korean Journal of Plant Resources
    • /
    • v.13 no.1
    • /
    • pp.1-10
    • /
    • 2000
  • Pecan is a deciduous tree that belongs to the Juglandaceae family. It is economically important as a nut and timber crop. Heterozygosity is expected to be high because the species is typically cross-pollinated, yet little is known about the nature of genetic variation within it. In addition, the pedigree of many pecan cultivars remains unknown or questionable. In this study, the phylogenetic relationships among 22 pecan cultivars were analyzed by RAPD (randomly amplified polymorphic DNA). PCR amplification used 40 randomly selected oligonucleotides as primers. Based on the genetic similarities derived from the RAPD data on agarose gel, the 22 pecan cultivars were classified into five different groups. UPGMA clustering analysis also classified the 22 cultivars into five groups. C. flacra and Black walnut showed a similarity index of 0.9, and Farley and Pawnee showed a similarity index of 0.85. The 22 pecan cultivars were likewise classified into five different groups by analysis of the 4% polyacrylamide gel fraction (Group I: 1, 2, 3, 4, 13, 16, 17, 20, 21; Group II: 14, 18; Group III: 6, 12; Group IV: 5, 11, 15, 19, 22; Group V: 7, 8, 9, 10). Group V showed a similarity index of 1.0; Farley, Sturya, Clarke, and Pawnee showed a similarity index of 0.98; and Kiowa and Schley showed a similarity index of 0.92. The results of this study indicate that RAPD can be used to establish the genetic relationships among the 22 pecan cultivars. The similarity coefficients generally agreed with what would be predicted for cultivars with known pedigrees, and we could accurately construct relationships among cultivars. In addition, we have shown that RAPD provides useful information on the origin of unknown cultivars.
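
For readers unfamiliar with UPGMA, the clustering method named in this abstract, the sketch below shows the merge procedure on a toy distance matrix. The cultivar labels `cv1` to `cv4` and their distances are hypothetical (a distance here can be read as 1 minus a similarity index); none of the values come from the study.

```python
def upgma(names, dist):
    """Return the merge history [(cluster_a, cluster_b, distance), ...].

    dist maps frozenset({name1, name2}) to a pairwise distance.
    """
    clusters = [(name,) for name in names]
    d = {
        frozenset({(a,), (b,)}): dist[frozenset({a, b})]
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
    merges = []
    while len(clusters) > 1:
        pair = min(d, key=d.get)  # closest pair of current clusters
        a, b = sorted(pair)
        merges.append((a, b, d[pair]))
        merged = tuple(sorted(a + b))
        clusters = [c for c in clusters if c not in (a, b)]
        for c in clusters:
            # Size-weighted average of the distances from a and b to c.
            d[frozenset({merged, c})] = (
                len(a) * d[frozenset({a, c})] + len(b) * d[frozenset({b, c})]
            ) / (len(a) + len(b))
        # Drop every entry that still refers to the two merged clusters.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        clusters.append(merged)
    return merges

names = ["cv1", "cv2", "cv3", "cv4"]  # hypothetical cultivar labels
dist = {
    frozenset({"cv1", "cv2"}): 0.1,
    frozenset({"cv3", "cv4"}): 0.2,
    frozenset({"cv1", "cv3"}): 0.5,
    frozenset({"cv1", "cv4"}): 0.6,
    frozenset({"cv2", "cv3"}): 0.4,
    frozenset({"cv2", "cv4"}): 0.7,
}
merges = upgma(names, dist)
for a, b, height in merges:
    print(a, b, round(height, 2))
```

Cutting the resulting merge history at a chosen distance yields the groups, which is how a dendrogram is turned into a fixed number of clusters.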


A Study of User Interests and Tag Classification related to resources in a Social Tagging System (소셜 태깅에서 관심사로 바라본 태그 특징 연구 - 소셜 북마킹 사이트 'del.icio.us'의 태그를 중심으로 -)

  • Bae, Joo-Hee;Lee, Kyung-Won
    • Proceedings of the HCI Society of Korea Conference (한국HCI학회:학술대회논문집)
    • /
    • 2009.02a
    • /
    • pp.826-833
    • /
    • 2009
  • The rise of social tagging is currently shifting classification from taxonomy to folksonomy. Tags represent a new approach to organizing information. Nonhierarchical classification allows data to be gathered freely, allows easy access, and makes it possible to move directly to other content topics. Tags are expected to play a key role in clustering various types of content and to expand into networks of common interests among users. First, this paper determines the relationships among users, tags, and resources in a social tagging system and examines which aspects of a website's features users consider when creating a tag. To this end, the study uses tags from the social bookmarking service 'del.icio.us' to analyze the features of tag words chosen when a new web page is added to a list. Website features were classified into seven items, referred to as the tag classification related to resources. Experiments were conducted to test the proposed classification method in the areas of music, photography, and games. This paper attempts to investigate the perspective from which users apply a tag to a web page and to establish the potential for expanding a social service that offers the opportunity to create a new business model.
