• Title/Summary/Keyword: K-Means clustering algorithm

Analysis of Research Trends Related to drug Repositioning Based on Machine Learning (머신러닝 기반의 신약 재창출 관련 연구 동향 분석)

  • So Yeon Yoo;Gyoo Gun Lim
    • Information Systems Review / v.24 no.1 / pp.21-37 / 2022
  • Drug repositioning, one method of developing new drugs, discovers new indications for drugs that have already been approved for use in people. With recent advances in machine learning, vast amounts of biological information are increasingly being analyzed and used to develop new drugs. Applying machine learning to drug repositioning can help find effective treatments quickly. The world is currently struggling with a new disease caused by a coronavirus (COVID-19), a severe acute respiratory syndrome. Repositioning drugs that have already been clinically approved could offer an alternative route to therapeutics for COVID-19 patients. This study examines research trends in the field of drug repositioning using machine learning techniques. A total of 4,821 papers were collected from PubMed by web scraping with the keyword 'Drug Repositioning'. After data preprocessing, frequency analysis, LDA-based topic modeling, random forest classification analysis, and prediction performance evaluation were performed on 4,419 papers. Associated words were analyzed with a Word2vec model; after dimensionality reduction with PCA, K-Means clustering was used to generate labels, and the structure of the literature was then visualized with the t-SNE algorithm. Hierarchical clustering was applied to the LDA results and visualized as a heat map. This study identified research topics related to drug repositioning and presented a method for deriving and visualizing meaningful topics from a large body of literature using machine learning algorithms. The results are expected to serve as basic data for establishing research or development strategies in the field of drug repositioning.
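As a rough illustration of the label-generation step described in this abstract, the following is a minimal sketch of plain Lloyd's K-Means in pure Python. It is not the authors' code: the function name `kmeans`, the seed, and the tiny 2-D `vectors` (standing in for PCA-reduced word embeddings) are all assumptions made for the example.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on a list of equal-length float tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers: k distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels, centers

# Hypothetical 2-D vectors standing in for PCA-reduced word embeddings.
vectors = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centers = kmeans(vectors, k=2)
```

The resulting `labels` could then be used to colour points in a 2-D projection such as t-SNE; which integer is assigned to which group depends on the random initialization.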

A Statistical Approach for Improving the Embedding Capacity of Block Matching based Image Steganography (블록 매칭 기반 영상 스테가노그래피의 삽입 용량 개선을 위한 통계적 접근 방법)

  • Kim, Jaeyoung;Park, Hanhoon;Park, Jong-Il
    • Journal of Broadcast Engineering / v.22 no.5 / pp.643-651 / 2017
  • Steganography is an information hiding technology that differs from cryptography in that it focuses on preventing third parties from detecting the existence of the hidden information, rather than on protecting it from being decoded. In this paper, as an image steganography method that uses images as media, we propose a new block matching method that embeds information in the discrete wavelet transform (DWT) domain. Based on a statistical analysis, the proposed method reduces the loss of embedding capacity caused by unequal use of candidate blocks. It computes the variance of each candidate block, preserves candidate blocks with high-frequency components, and reduces the number of candidate blocks with low-frequency components by compressing them with the k-means clustering algorithm. Compared with the previous block matching method, the proposed method can reconstruct secret images with similar PSNRs while embedding higher-capacity information.
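The variance-based split of candidate blocks might look roughly like the sketch below. It covers only the variance step and assumes blocks are flat lists of coefficients. The function `split_by_variance`, the threshold value, and the sample blocks are hypothetical and are not taken from the paper.

```python
from statistics import pvariance

def split_by_variance(blocks, threshold):
    """Split blocks (flat lists of coefficients) into high- and low-variance groups."""
    high, low = [], []
    for block in blocks:
        # Variance is used here as a simple proxy for high-frequency content.
        (high if pvariance(block) >= threshold else low).append(block)
    return high, low

# Made-up candidate blocks for illustration only.
blocks = [
    [10, 10, 10, 10],
    [0, 50, 0, 50],
    [11, 10, 11, 10],
    [5, 90, 20, 60],
]
high, low = split_by_variance(blocks, threshold=100.0)
```

In a scheme like the one described, the `high` group would be kept as is, while the `low` group would be handed to a clustering step so that many similar flat blocks are replaced by a few representatives.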

Design of Robust Face Recognition System Realized with the Aid of Automatic Pose Estimation-based Classification and Preprocessing Networks Structure

  • Kim, Eun-Hu;Kim, Bong-Youn;Oh, Sung-Kwun;Kim, Jin-Yul
    • Journal of Electrical Engineering and Technology / v.12 no.6 / pp.2388-2398 / 2017
  • In this study, we propose a face recognition system that is robust to pose variations, based on automatic pose estimation. A radial basis function neural network (RBFNN) is applied as one of the functional components of the overall face recognition system. The proposed system consists of preprocessing and recognition modules that address pose variation and high-dimensional pattern recognition problems. In the preprocessing part, principal component analysis (PCA) and 2-dimensional 2-directional PCA ($(2D)^2$ PCA) are applied. These functional modules are useful for reducing the dimensionality of the feature space. The proposed RBFNN architecture consists of three functional modules, namely the condition, conclusion, and inference phases, realized in terms of fuzzy "if-then" rules. In the condition phase of the fuzzy rules, the input space is partitioned by fuzzy clustering realized with the Fuzzy C-Means (FCM) algorithm. In the conclusion phase of the rules, the connections (weights) are realized through four types of polynomials: constant, linear, quadratic, and modified quadratic. The coefficients of the RBFNN model are obtained by the fuzzy inference method that constitutes the inference phase of the fuzzy rules. The essential design parameters of the networks (such as the number of nodes and the fuzzification coefficient) are optimized with the aid of Particle Swarm Optimization (PSO). Experimental results on standard face databases (Honda/UCSD, Cambridge Head Pose, and IC&CI) demonstrate the effectiveness and efficiency of the face recognition system compared with other studies.
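For readers unfamiliar with FCM, the following minimal sketch shows the standard membership formula that partitions an input softly across clusters, with `m` playing the role of the fuzzification coefficient. The function name, the two centers, and the sample point are assumptions for illustration; the paper's actual network is not reproduced here.

```python
import math

def fcm_memberships(point, centers, m=2.0):
    """Fuzzy C-Means membership of one point in each cluster (requires m > 1)."""
    dists = [math.dist(point, c) for c in centers]
    # A point that coincides with a center belongs fully to that center.
    if 0.0 in dists:
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    # u_i = 1 / sum_j (d_i / d_j)^(2 / (m - 1)); memberships sum to 1.
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]

# Hypothetical cluster centers and input point.
centers = [(0.0, 0.0), (4.0, 0.0)]
u = fcm_memberships((1.0, 0.0), centers)
```

Larger values of `m` flatten the memberships toward equal shares, which is presumably why a coefficient of this kind is worth tuning.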

Flower Recognition System Using OpenCV on Android Platform (OpenCV를 이용한 안드로이드 플랫폼 기반 꽃 인식 시스템)

  • Kim, Kangchul;Yu, Cao
    • Journal of the Korea Institute of Information and Communication Engineering / v.21 no.1 / pp.123-129 / 2017
  • New mobile phones with high-tech cameras and large memory have recently been launched, and people upload pictures of beautiful scenes or unknown flowers to SNS. This paper develops a flower recognition system that can provide information on flowers even in places where mobile communication is not available. It consists of a registration part for reference flowers and a recognition part based on OpenCV for the Android platform. A new color classification method using RGB color channels and K-means clustering is proposed to reduce the recognition processing time. ORB is used for feature extraction and the Brute-Force Hamming algorithm for matching. We use 12 kinds of flowers in four color groups, with 60 images for the reference DB design and 60 images for testing. Simulation results show a success rate of 83.3% and an average recognition time of 2.58 s on a Huawei ALEUL00, indicating that the proposed system is suitable for a mobile phone without a network.

The Effect of the Number of Phoneme Clusters on Speech Recognition (음성 인식에서 음소 클러스터 수의 효과)

  • Lee, Chang-Young
    • The Journal of the Korea institute of electronic communication sciences / v.9 no.11 / pp.1221-1226 / 2014
  • In an effort to improve the efficiency of speech recognition, we investigate the effect of the number of phoneme clusters. For this purpose, codebooks with varying numbers of phoneme clusters are prepared with a modified k-means clustering algorithm. Subsequent processing uses fuzzy vector quantization (FVQ) and hidden Markov models (HMM) for the speech recognition test. The results show two distinct regimes. For a large number of phoneme clusters, recognition performance is roughly independent of that number. For a small number of phoneme clusters, however, the recognition error rate increases nonlinearly as the number decreases. Numerical calculation shows that this nonlinear regime might be modeled by a power-law function. The results also show that about 166 phoneme clusters would be optimal for recognizing 300 isolated words. This amounts to roughly 3 variations per phoneme.
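One common way to check whether a regime follows a power law is a straight-line fit in log-log space. The sketch below assumes that approach. The function `fit_power_law` and the synthetic `clusters`/`errors` data are invented for illustration and are not the paper's measurements.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    # Slope of the regression line is the exponent b.
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum((u - mx) ** 2 for u in lx)
    # Intercept gives log(a).
    a = math.exp(my - b * mx)
    return a, b

# Synthetic data generated from error = 50 * n**-1.5 (not real results).
clusters = [10, 20, 40, 80]
errors = [50.0 * n ** -1.5 for n in clusters]
a, b = fit_power_law(clusters, errors)
```

On real error rates the fit would not be exact, and the residuals in log-log space would indicate how well the power-law description holds.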

A Study on Recommendation Technique Using Mining and Clustering of Weighted Preference based on FRAT (마이닝과 FRAT기반 가중치 선호도 군집을 이용한 추천 기법에 관한 연구)

  • Park, Wha-Beum;Cho, Young-Sung;Ko, Hyung-Hwa
    • Journal of Digital Contents Society / v.14 no.4 / pp.419-428 / 2013
  • Real-time accessibility and agility are required in u-commerce under a ubiquitous computing environment. Most existing recommendation techniques evaluate customers based on personal profiles, which makes it difficult to analyze customers' levels of interest and tendencies accurately, raises costs, and consequently leaves customers unsatisfied. Research has been conducted to improve the accuracy of information such as customers' levels of interest and tendencies. However, the problem lies not in the preconstructed database but in generating the new and diverse profiles used to evaluate the existing data. It is also difficult for existing recommendation techniques to apply a distinct, hierarchical recommendation method to each customer, since customers have varied characteristics. Accordingly, unlike other evaluation techniques, this dissertation uses an implicit method based on purchase data, without burdening users with questions and answers. We applied the FRAT technique, which can analyze varied personalization tendencies and characterize individual customers precisely.

A Study on the Deduction of Social Issues Applying Word Embedding: With an Emphasis on News Articles related to the Disables (단어 임베딩(Word Embedding) 기법을 적용한 키워드 중심의 사회적 이슈 도출 연구: 장애인 관련 뉴스 기사를 중심으로)

  • Choi, Garam;Choi, Sung-Pil
    • Journal of the Korean Society for information Management / v.35 no.1 / pp.231-250 / 2018
  • In this paper, we propose a new methodology for extracting and formalizing subjective topics at a specific time, using a set of keywords extracted automatically from online news articles. To do this, we first extracted a set of keywords with a TF-IDF method, selected through a series of comparative experiments on various statistical weighting schemes that measure the importance of individual words in a large set of texts. To calculate the semantic relations between the extracted keywords effectively, a set of word embedding vectors was constructed from about 1,000,000 separately collected news articles. The extracted keywords were represented as numerical vectors and clustered with the K-means algorithm. A qualitative in-depth analysis of the resulting keyword clusters showed that most clusters were appropriate topics with sufficient semantic concentration for us to assign labels to them easily.
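The keyword-weighting step can be illustrated with a basic TF-IDF computation. The abstract does not say which TF-IDF variant was selected, so this sketch assumes the textbook form (relative term frequency times `log(N / df)`). The function name and the toy tokenized documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return weights

# Toy tokenized articles, invented for the example.
docs = [
    ["welfare", "policy", "employment"],
    ["employment", "education", "policy"],
    ["policy", "accessibility", "transport"],
]
w = tf_idf(docs)
```

A term that appears in every document, such as "policy" here, gets weight zero, so the highest-weighted terms per article are natural keyword candidates for the later embedding and clustering steps.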

Modeling of the Cluster-based Multi-hop Sensor Networks (클거스터 기반 다중 홉 센서 네트워크의 모델링 기법)

  • Choi Jin-Chul;Lee Chae-Woo
    • Journal of the Institute of Electronics Engineers of Korea TC / v.43 no.1 s.343 / pp.57-70 / 2006
  • A wireless sensor network, consisting of a number of small sensors each with a transceiver and data processor, is an effective means of gathering data in a variety of environments. The data collected by each sensor is transmitted to a processing center that uses all reported data to estimate characteristics of the environment or to detect an event. This process must be designed to conserve the limited energy resources of the sensors, and neighboring sensors generally hold similar data. Therefore, a clustering scheme that sends aggregated information to the processing center may save energy. The existing multi-hop cluster energy consumption modeling scheme cannot estimate the exact energy consumption of an individual sensor. In this paper, we propose a new cluster energy consumption model that remedies this problem. By using Voronoi tessellation, we can estimate total energy consumption more accurately as a function of the number of cluster heads, and thus realize an energy-efficient cluster formation. Our model has an accuracy of over $90\%$ when compared with simulation and is considerably superior to the existing modeling scheme, by about $60\%$. We also confirmed that the energy consumption estimated by the proposed modeling scheme becomes more accurate as the sensor density increases.
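The role of the Voronoi tessellation can be pictured as grouping each sensor with its nearest cluster head. The sketch below shows only that assignment and does not implement the paper's energy model. The function `voronoi_cells` and all coordinates are assumptions for illustration.

```python
import math

def voronoi_cells(sensors, heads):
    """Group sensors by nearest cluster head, i.e. by Voronoi cell."""
    cells = {i: [] for i in range(len(heads))}
    for s in sensors:
        # The nearest head determines which Voronoi cell the sensor falls in.
        nearest = min(range(len(heads)), key=lambda i: math.dist(s, heads[i]))
        cells[nearest].append(s)
    return cells

# Hypothetical cluster-head and sensor positions.
heads = [(0.0, 0.0), (10.0, 0.0)]
sensors = [(1.0, 1.0), (2.0, -1.0), (9.0, 0.5), (6.0, 0.0)]
cells = voronoi_cells(sensors, heads)
```

Per-cell sensor counts and sensor-to-head distances obtained this way are the kind of quantities a cluster energy model would take as input.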

Analysis method of patent document to Forecast Patent Registration (특허 등록 예측을 위한 특허 문서 분석 방법)

  • Koo, Jung-Min;Park, Sang-Sung;Shin, Young-Geun;Jung, Won-Kyo;Jang, Dong-Sik
    • Journal of the Korea Academia-Industrial cooperation Society / v.11 no.4 / pp.1458-1467 / 2010
  • Recently, imitation and infringement of intellectual property rights have come to be recognized as impediments to a nation's industrial growth. To prevent the huge losses that come from these impediments, many researchers are studying the protection and efficient management of intellectual property in various ways. In particular, predicting patent registration is an important part of protecting and asserting intellectual property rights. In this study, we propose a patent document analysis method that uses text mining to predict whether a patent will be registered or rejected. The proposed method first builds a database from the word frequencies of rejected patent documents. Comparing this database with other patent documents then yields a similarity value between each patent document and the database. We used k-means, a partitioning clustering algorithm, to select the criterion value for patent rejection. As a result, we found that patents similar to rejected patents have a strong possibility of rejection. We used U.S. patent documents on Bluetooth technology, solar battery technology, and display technology as experimental data.
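One plausible reading of "using k-means to select the criterion value" is to cluster the one-dimensional similarity scores into two groups and take the boundary between them as the threshold. The sketch below assumes that reading. The function `two_means_threshold` and the sample scores are hypothetical, not the paper's procedure or data.

```python
def two_means_threshold(values, iters=100):
    """1-D k-means with k=2; returns the midpoint between the two centers."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return lo  # all values identical: no meaningful split
    for _ in range(iters):
        cut = (lo + hi) / 2
        low = [v for v in values if v <= cut]   # members of the lower cluster
        high = [v for v in values if v > cut]   # members of the upper cluster
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # centers stopped moving
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2

# Invented similarity scores between patents and the rejected-patent database.
scores = [0.10, 0.15, 0.20, 0.80, 0.85, 0.90]
threshold = two_means_threshold(scores)
```

Documents whose similarity exceeds `threshold` would then be flagged as likely rejections under this assumed scheme.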

Assessment through Statistical Methods of Water Quality Parameters(WQPs) in the Han River in Korea

  • Kim, Jae Hyoun
    • Journal of Environmental Health Sciences / v.41 no.2 / pp.90-101 / 2015
  • Objective: This study was conducted to develop a chemical oxygen demand (COD) regression model using water quality monitoring data (January 2014) obtained from the Han River auto-monitoring stations. Methods: Surface water quality data from 198 sampling stations along the six major areas were assembled and analyzed to determine the spatial distribution and clustering of monitoring stations based on 18 WQPs, and to perform regression modeling using selected parameters. Statistical techniques, including combined genetic algorithm-multiple linear regression (GA-MLR), cluster analysis (CA), and principal component analysis (PCA), were used to build a COD model from the water quality data. Results: The best GA-MLR model facilitated computing the WQPs for a 5-descriptor COD model with satisfactory statistical results ($r^2=92.64$, $Q^2_{LOO}=91.45$, $Q^2_{Ext}=88.17$). This approach includes variable selection of the WQPs in order to find the most important factors affecting water quality. Additionally, ordination techniques such as PCA and CA were used to classify monitoring stations. The biplot based on the first two principal components (PCs) of the PCA model identified three distinct groups of stations that also differ in their correlation with the WQPs, which enables better interpretation of the water quality characteristics at particular stations as of January 2014. Conclusion: This data analysis procedure appears to provide an efficient means of modelling water quality by interpreting and defining its most essential variables, such as TOC and BOD. The water parameters selected in the COD model as contributing most to environmental health and water pollution can be used in water quality management strategies. At present, the river is under threat from anthropogenic disturbances during festival periods, especially in upstream areas.