• Title/Summary/Keyword: Big Data Clustering

Search results: 147

Design and Implementation of Paper Classification Systems based on Keyword Extraction and Clustering (키워드 추출과 군집화 기반의 논문 분류 시스템의 설계 및 구현)

  • Lee, Yun-Soo;Pheaktra, They;Lee, Jong-Hyuk;Gil, Joon-Min
    • Proceedings of the Korea Information Processing Society Conference / 2018.05a / pp.48-51 / 2018
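  • Driven by advances in computing and technology, a vast number of papers are now published online as well as offline, and new fields keep emerging, so users have great difficulty searching for or classifying the papers they need among this large volume. To overcome this limitation, this paper proposes a method for classifying papers with similar content and clustering them. The proposed method uses TF-IDF to extract representative keywords from the abstract of each paper, and then uses the K-means clustering algorithm to group the papers by similar content based on the extracted TF-IDF values.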
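The pipeline this abstract describes (TF-IDF keyword extraction from abstracts, then K-means over the TF-IDF values) can be sketched as follows. This is a minimal illustration, not the authors' system: the whitespace tokenizer, the `log(N/df)` weighting, the deterministic farthest-point initialization, and the sample abstracts are all assumptions made here so that the example is small and reproducible.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for non-empty, whitespace-tokenized docs."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([
            (counts[t] / len(toks)) * math.log(len(docs) / df[t]) for t in vocab
        ])
    return vocab, vectors

def top_keywords(vocab, vector, k=3):
    """Pick the k highest-scoring terms of one document as its keywords."""
    ranked = sorted(zip(vector, vocab), reverse=True)
    return [term for score, term in ranked[:k] if score > 0]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20):
    """Lloyd's K-means; centroids start from a deterministic farthest-point pick
    (an assumption of this sketch, chosen instead of random initialization)."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical abstracts, reduced to a few words each.
abstracts = [
    "rock tunnel stress rock",
    "tunnel rock excavation stress",
    "protein motif sequence protein",
    "motif protein structure sequence",
]
vocab, vectors = tfidf_vectors(abstracts)
labels = kmeans(vectors, k=2)
for text, vector, label in zip(abstracts, vectors, labels):
    print(label, top_keywords(vocab, vector, k=2), "<-", text)
```

A real system would likely need a Korean-aware tokenizer and stop-word handling before this step; those details are not given in the abstract.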

Keyword Analysis of Two SCI Journals on Rock Engineering by using Text Mining (텍스트 마이닝을 이용한 암반공학분야 SCI논문의 주제어 분석)

  • Jung, Yong-Bok;Park, Eui-Seob
    • Tunnel and Underground Space / v.25 no.4 / pp.303-319 / 2015
  • Text mining is a branch of data mining used to extract meaningful information from large amounts of text. In this study, we used text mining to analyze the titles and keywords of two SCI journals on rock engineering in order to identify major research areas, trends, and associations among research fields. The results were also visualized for intuitive understanding. The two journals covered similar research fields but showed different patterns of association among them. IJRMMS showed a simple network: one large group centered on the keyword 'rock', with a few small groups. RMRE, in contrast, showed a complex network among several medium-sized groups. Trend analysis by clustering and linear regression of the keyword-year frequency matrix showed that most keywords increased in frequency over time, with only a few declining.
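The linear-regression part of the trend analysis can be illustrated with a least-squares slope fitted to each row of a keyword-year frequency matrix. The sketch below covers only that step, not the clustering; the keywords, years, and counts are invented for illustration and are not taken from the study.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def keyword_trends(years, frequency_matrix):
    """Map each keyword to the yearly change in its frequency."""
    return {kw: slope(years, counts) for kw, counts in frequency_matrix.items()}

def split_by_trend(trends):
    """Separate keywords with a positive slope from those with a negative slope."""
    rising = sorted(kw for kw, s in trends.items() if s > 0)
    falling = sorted(kw for kw, s in trends.items() if s < 0)
    return rising, falling

# Hypothetical keyword-year frequency matrix (one row per keyword).
years = [2010, 2011, 2012, 2013]
frequency_matrix = {
    "rock": [1, 2, 3, 4],
    "numerical": [2, 2, 5, 6],
    "blasting": [4, 3, 2, 1],
}
trends = keyword_trends(years, frequency_matrix)
rising, falling = split_by_trend(trends)
print("rising:", rising)
print("falling:", falling)
```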

Research on Natural Language Processing Package using Open Source Software (오픈소스 소프트웨어를 활용한 자연어 처리 패키지 제작에 관한 연구)

  • Lee, Jong-Hwa;Lee, Hyun-Kyu
    • The Journal of Information Systems / v.25 no.4 / pp.121-139 / 2016
  • Purpose: In this study, we propose a special-purpose R package named "new_Noun()" to process the nonstandard text that appears on various social networks. As interest in big data grows, R, an open source analysis tool, is also receiving more attention in many fields. Design/methodology/approach: With more than 9,000 packages, R provides user-friendly functions for data mining, social network analysis, and simulation, including statistical analysis, classification, prediction, clustering, and association analysis. In particular, "KoNLP", a natural language processing package for Korean, has reduced the time and effort required of many researchers. However, as social data grows, informal expressions in Hangeul (the Korean script) such as emoticons, informal terms, and symbols make natural language processing more difficult. Findings: To address these difficulties, we studied algorithms that extend the existing open source natural language processing package. By using the "KoNLP" package and analyzing the main functions of its noun-extraction command, we developed a new integrated noun-processing function, "new_Noun()", which extracts nouns and improves on the existing package by more than 29.1%.

A Study on the Development of Industrial Clusters in the International Science and Business Belt through the Industrial Clustering Analysis (산업 클러스터링 분석을 통한 국제과학비즈니스벨트의 클러스터 발전 방향 연구)

  • Jung, Hye-Jin;Og, Joo-Young;Kim, Byung-Keun;Ji, Il-Yong
    • Journal of the Korea Academia-Industrial cooperation Society / v.19 no.2 / pp.370-379 / 2018
  • In 2009, the Korean government announced plans for the International Science Business Belt as a spatial area for promoting the linkage between scientific knowledge and commercialization. R&D and entrepreneurial activities are essential for the success of the International Science Business Belt. In particular, prioritizing the types of businesses is critical at the cluster establishment stage, because this largely affects the features and development of the clusters that make up the International Science Business Belt. This research aims to predict the entry and growth of firms that specialize in four industrial clusters: the Big Science Cluster, Frontier Cluster, ICT Cluster, and Bio-Healthcare Cluster. For this purpose, we employ Swann & Prevezer's industrial clustering model to identify sectors that affect the establishment and growth of industrial clusters in the International Science Business Belt, focusing on the ICT, Bio-Healthcare, and Frontier clusters. Data were collected from the 2014 Korean Innovation Survey (KIS) and University Alimi for the ICT cluster, the 2014 National Bio Industry Survey and University Alimi for the Bio-Healthcare cluster, and the 2015 National Nano Convergent Industry Survey and Annual Report of Nano Technology for the Frontier cluster. Empirical results show that the ICT service sector, the bio process/equipment sector, and the nano electronic sector promote clustering in other sectors. Based on the analysis results, we discuss several policy implications and strategies that can attract relevant firms for the development of industrial clusters.

Comparative Analysis for Clustering Based Optimal Vehicle Routes Planning (클러스터링 기반의 최적 차량 운행 계획 수립을 위한 비교연구)

  • Kim, Jae-Won;Shin, KwangSup
    • The Journal of Bigdata / v.5 no.1 / pp.155-180 / 2020
  • Assigning vehicles and designing optimal routes for each vehicle play the most important role in enhancing the logistics service level. In solving this problem, cost factors such as the number of vehicles, vehicle capacity, and total travelling distance must be considered simultaneously. Although most logistics service providers have introduced a Transportation Management System (TMS), such systems are limited in that they cannot account for practical constraints. To make a TMS solution applicable, experts must revise it based on their own experience and intuition. Unlike previous research, which has focused on minimizing total cost, this research proposes a methodology that enhances the efficiency and fairness of asset utilization simultaneously. First, the Cluster-First Route-Second (CFRS) approach is adopted. Based on customer locations, customers are grouped into clusters using four clustering algorithms (K-Means, K-Medoids, DBSCAN, and model-based clustering) and a procedural approach, the Fisher & Jaikumar algorithm. After clustering, optimal vehicle routes are developed within each cluster. The numerical experiments indicate that the proposed CFRS-based approach may deliver better performance in terms of total travelling time and distance. In addition, judging by the variance across vehicles in travelling distance and in the number of customers visited, the proposed approach also assigns tasks more fairly.
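The Cluster-First Route-Second idea can be sketched in a few lines: group customers by location, then build one route per group. The version below is only an illustration under stated assumptions: it uses K-means (one of the algorithms named above) with a deterministic farthest-point initialization, substitutes a nearest-neighbor heuristic for the within-cluster route optimization, ignores vehicle capacity, and uses invented coordinates. The variance of route lengths stands in for the fairness measure.

```python
import math
import statistics

def kmeans_2d(points, k, iterations=20):
    """Cluster 2-D points with K-means (farthest-point initialization)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels

def nearest_neighbor_route(depot, stops):
    """Greedy tour: always drive to the closest unvisited stop, then return."""
    route, current, remaining = [depot], depot, list(stops)
    while remaining:
        nxt = min(remaining, key=lambda p: math.dist(current, p))
        remaining.remove(nxt)
        route.append(nxt)
        current = nxt
    route.append(depot)
    return route

def route_length(route):
    """Total Euclidean length of a route given as a list of points."""
    return sum(math.dist(a, b) for a, b in zip(route, route[1:]))

def cluster_first_route_second(depot, customers, n_vehicles):
    """Cluster customers first, then build one route per cluster."""
    labels = kmeans_2d(customers, n_vehicles)
    routes = []
    for v in range(n_vehicles):
        stops = [p for p, lab in zip(customers, labels) if lab == v]
        routes.append(nearest_neighbor_route(depot, stops))
    return routes

# Hypothetical depot and customer coordinates.
depot = (0, 0)
customers = [(1, 0), (2, 0), (-1, 0), (-2, 0)]
routes = cluster_first_route_second(depot, customers, n_vehicles=2)
lengths = [route_length(r) for r in routes]
print("routes:", routes)
print("variance of route length:", statistics.pvariance(lengths))
```

A lower variance of route length across vehicles indicates a more even workload, which is the sense of fairness the abstract refers to.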

Analysis of Reading Domain of Men and Women Elderly Using Book Lending Data (도서 대출데이터를 활용한 남녀 노령자의 독서 주제 분석)

  • Cho, Jane
    • Journal of Korean Library and Information Science Society / v.50 no.1 / pp.23-41 / 2019
  • This study identifies the subject domains of books read by elderly men and women by analyzing a PFNET built from library big data, and examines how they differ from those of adults aged 30-40. A co-occurrence matrix of book loans was extracted from the popular book lists in library big data for four groups: elderly men, elderly women, adult men, and adult women. PFNET analysis was performed on these matrices. Pearson correlation analysis, based on the triangle betweenness centrality calculated for the loaned books, was then performed to understand the correlation among the four clusters created by the PNNC algorithm. The results reveal that elderly readers of both genders concentrate on modern Korean novels, and that elderly men in particular show an extreme concentration on long modern Korean novel series. In the correlation analysis, elderly men showed a weak negative correlation with adult men (r = -0.222) and negative correlations with all other groups, indicating that the borrowing tendencies of elderly men run opposite to those of the other groups.
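The group-to-group comparison rests on the Pearson correlation coefficient r. A minimal sketch of that computation follows; the per-book centrality scores are invented placeholders, not values from the study, and the triangle betweenness centrality itself is not computed here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical centrality scores of the same five books in two reader groups.
elderly_men = [0.9, 0.7, 0.4, 0.2, 0.1]
adult_men = [0.2, 0.5, 0.3, 0.6, 0.4]
print(round(pearson(elderly_men, adult_men), 3))
```

A value near -1 or +1 indicates a strong linear relationship; a value such as the reported r = -0.222 indicates a weak negative one.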

A Study On Clusters and Ecosystem In Distribution Industry Using Big Data Analysis (빅데이타 분석을 통한 유통산업 클러스터의 형성과 생태계 연구)

  • Jung, Jaeheon
    • The Journal of the Korea Contents Association / v.19 no.7 / pp.360-375 / 2019
  • This paper studies the ecosystem of the distribution industry by constructing a network of continuing transactions from data on more than 50 thousand firms provided by the Korean enterprise data (KED) for 2015. Applying a clustering method, one of the tools of social network analysis, we find that the firms in the network group into 732 clusters accounting for about 80% of total distribution industry sales in the KED data. The firms in a cluster conduct most of their transactions with other firms in the same cluster. Compared with the manufacturing industry, however, the clusters contain fewer firms, and the biggest firms in the industry account for a smaller share of sales. Input-output analysis of the biggest distribution firms shows that, in some clusters, small and medium-sized enterprises (SMEs) depend very heavily on a main firm for their sales. This points to a need for more efficient fair-transaction policies within the clusters. A small number of big distribution firms also have very strong backward production linkage effects on SMEs, or on the 10th or 31st group, which have a high share of SME employment. These firms should be considered important in SME growth and employment policies.

A Big Data Based Random Motif Frequency Method for Analyzing Human Proteins (인간 단백질 분석을 위한 빅 데이타 기반 RMF 방법)

  • Kim, Eun-Mi;Jeong, Jong-Cheol;Lee, Bae-Ho
    • The Journal of the Korea institute of electronic communication sciences / v.13 no.6 / pp.1397-1404 / 2018
  • Because of the technical difficulty and high cost of obtaining 3-dimensional structure data, sequence-based approaches to proteins have not been widely acknowledged. A motif can be defined as any segment of a protein or gene sequence. Because of this simplicity, motifs have been actively and widely used in various areas. However, the motif itself has not been studied comprehensively. The contribution of this study, which analyzes human proteins using an artificial intelligence method, falls into three areas: (1) To the best of our knowledge, this research is the first comprehensive motif analysis, covering motifs in all human proteins in the Protein Data Bank (PDB) associated with the Enzyme Commission (EC) number database and the Structural Classification of Proteins (SCOP). (2) We analyze motifs in depth in three categories: pattern, statistical, and functional analysis of clusters. (3) Finally and most importantly, we propose the random motif frequency (RMF) matrix, which can efficiently distinguish the characteristics of proteins by separating interface residues from non-interface residues and by clustering protein functions based on big data while varying the size of the random motif.
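Since the abstract defines a motif as any segment of a sequence, the basic counting step can be illustrated by tallying every contiguous segment of a given size. The sketch below is only that: a plain segment-frequency count over toy sequences. It is not the paper's RMF method, whose exact definition (including how motifs are sampled) is not given in the abstract; the function names and sequences are hypothetical.

```python
from collections import Counter

def motif_counts(sequence, size):
    """Count every contiguous segment of the given size in one sequence."""
    return Counter(sequence[i:i + size] for i in range(len(sequence) - size + 1))

def motif_frequencies(sequences, size):
    """Relative frequency of each segment across a collection of sequences."""
    total = Counter()
    for seq in sequences:
        total.update(motif_counts(seq, size))
    n = sum(total.values())
    return {motif: count / n for motif, count in total.items()}

# Toy amino-acid strings; real input would come from a database such as PDB.
sequences = ["MKTAYIAK", "MKTLLIAK"]
for size in (2, 3):  # vary the motif size, as the abstract describes
    freqs = motif_frequencies(sequences, size)
    most_common = max(freqs, key=freqs.get)
    print(size, most_common, round(freqs[most_common], 3))
```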

A Study on the Changes of the Restaurant Industry Before and After COVID-19 Using BigData (빅데이터를 활용한 코로나 19 이전과 이후 외식산업의 변화에 관한 연구)

  • Ahn, Youn Ju
    • The Journal of the Convergence on Culture Technology / v.8 no.6 / pp.787-793 / 2022
  • Since COVID-19, with the emergence of social distancing, non-face-to-face services, and the home economy, dining out in person is rapidly being replaced by non-face-to-face dining. The purpose of this study is to find ways to create a safe dining culture centered on everyday quarantine practices in line with the changing trends of the restaurant industry since the outbreak of COVID-19, to set the direction of food culture improvement projects, and to enhance the effectiveness of those projects. This study used TEXTOM to collect and refine search frequencies and to perform TF-IDF analysis, and used the Ucinet6 program with NetDraw for visualization, from January 1, 2018 to October 31, 2019 and December 31, 2021, and identified the network among the nodes of key keywords. Finally, the keywords were clustered through CONCOR analysis. A comparison of search frequencies before and after COVID-19 shows that the pandemic has greatly affected changes in the restaurant industry.

A Study of Similarity Measure Algorithms for Recommendation System about the PET Food (반려동물 사료 추천시스템을 위한 유사성 측정 알고리즘에 대한 연구)

  • Kim, Sam-Taek
    • Journal of the Korea Convergence Society / v.10 no.11 / pp.159-164 / 2019
  • Recent developments in ICT have increased interest in the care and health of pets such as dogs and cats. In this paper, cluster analysis was performed on the component data of pet food for use in various fields of the pet industry. For the cluster analysis, similarity was measured by analyzing the correlation between the components of 300 dog and cat foods on the market. Clustering techniques including hierarchical, K-Means, partitioning around medoids (PAM), density-based, and Mean-Shift clustering were applied and analyzed. We also propose a personalized recommendation system for pets. The results can be used for personalized services such as a feed recommendation system for pets.
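A similarity-based recommendation over component data can be sketched as below. Cosine similarity is used here as one common similarity measure; the abstract does not say which measure the recommender uses, so this choice, the product names, and the component percentages are all assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two component vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(target, catalog, top_n=2):
    """Rank the other products by similarity to the target product."""
    scores = [
        (cosine_similarity(catalog[target], vector), name)
        for name, vector in catalog.items()
        if name != target
    ]
    return [name for _, name in sorted(scores, reverse=True)[:top_n]]

# Hypothetical feeds described by (protein %, fat %, fiber %).
catalog = {
    "feed_a": [30, 15, 4],
    "feed_b": [31, 14, 4],
    "feed_c": [10, 5, 30],
}
print(most_similar("feed_a", catalog, top_n=1))
```

The same pairwise similarities could also feed a clustering step such as PAM or hierarchical clustering, which the abstract lists among the techniques compared.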