• Title/Summary/Keyword: Probabilistic Latent Semantic Analysis


Reputation Analysis of Document Using Probabilistic Latent Semantic Analysis Based on Weighting Distinctions (가중치 기반 PLSA를 이용한 문서 평가 분석)

  • Cho, Shi-Won;Lee, Dong-Wook
    • The Transactions of The Korean Institute of Electrical Engineers / v.58 no.3 / pp.632-638 / 2009
  • Probabilistic Latent Semantic Analysis (PLSA) has many applications in information retrieval and filtering, natural language processing, machine learning from text, and related areas. In this paper, we propose an algorithm that uses a weighted PLSA model to find contextual phrases and opinions in documents. Traditional keyword search cannot find the semantic relations between phrases; overcoming this limitation requires techniques for automatically classifying those relations. Through experiments, we show that the proposed algorithm discovers semantic relations between phrases effectively and represents them in the vector-space model. The proposed algorithm supports a variety of analyses, including document classification, online reputation analysis, and collaborative recommendation.

Bag of Visual Words Method based on PLSA and Chi-Square Model for Object Category

  • Zhao, Yongwei;Peng, Tianqiang;Li, Bicheng;Ke, Shengcai
    • KSII Transactions on Internet and Information Systems (TIIS) / v.9 no.7 / pp.2633-2648 / 2015
  • Synonymy and ambiguity of visual words are persistent problems in object categorization methods based on the conventional bag of visual words (BoVW) model. In addition, noisy visual words, so-called "visual stop-words", degrade the semantic resolution of the visual dictionary. To address this, a novel bag of visual words method based on PLSA and a chi-square model is proposed for object categorization. First, Probabilistic Latent Semantic Analysis (PLSA) is used to analyze the semantic co-occurrence probability of visual words, infer the latent semantic topics in images, and obtain the latent topic distributions induced by the words. Second, the KL divergence is adopted to measure the semantic distance between visual words, which yields semantically related homoionyms (near-synonymous visual words). Then, an adaptive soft-assignment strategy is used to realize a soft mapping between SIFT features and several homoionyms. Finally, the chi-square model is introduced to eliminate the "visual stop-words" and reconstruct the visual vocabulary histograms. A Support Vector Machine (SVM) is then applied to perform object classification. Experimental results indicate that the synonymy and ambiguity problems of visual words can be overcome effectively. Both the discriminative ability of the visual semantic representation and the object classification performance are substantially improved compared with traditional methods.
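
The KL-divergence step in this abstract can be illustrated with a minimal sketch. It assumes each visual word already has a topic distribution P(z|w) from a fitted PLSA model; the symmetrized form of the divergence, the function names, and the toy distributions are assumptions made for illustration and are not taken from the paper.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same topics."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p), so the distance is symmetric."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

def nearest_words(topic_given_word, word, k=2):
    """Return the k visual words whose topic distributions are closest to `word`."""
    target = topic_given_word[word]
    others = [w for w in topic_given_word if w != word]
    others.sort(key=lambda w: symmetric_kl(target, topic_given_word[w]))
    return others[:k]

# Hypothetical P(z | w) for three visual words over three latent topics.
topic_given_word = {
    "w0": [0.8, 0.1, 0.1],
    "w1": [0.7, 0.2, 0.1],
    "w2": [0.1, 0.1, 0.8],
}
closest = nearest_words(topic_given_word, "w0", k=1)
```

Here `w1` is the closest neighbor of `w0` because their topic distributions nearly coincide, while `w2` concentrates on a different topic.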

Trend Analysis of School Health Research using Latent Semantic Analysis (잠재의미분석방법을 통한 학교보건 연구동향 분석)

  • Shin, Seon-Hi;Park, Youn-Ju
    • Journal of the Korean Society of School Health / v.33 no.3 / pp.184-193 / 2020
  • Purpose: This study was designed to investigate trends in school health research in Korea using probabilistic latent semantic analysis. The study longitudinally analyzed the abstracts of papers published in 「The Journal of the Korean Society of School Health」 over the past 17 years, from 2004 to August 2020. By classifying all the papers according to the topics identified through the analysis, it was possible to see how the distribution of topics has changed over the years. Based on the results, implications for school health research and educational uses of latent semantic analysis were suggested. Methods: This study investigated the research trends by longitudinally analyzing journal abstracts using Latent Dirichlet Allocation (LDA), a type of LSA. The abstracts in 「The Journal of the Korean Society of School Health」 published from 2004 to August 2020 were used for the analysis. Results: A total of 34 latent topics were identified by LDA. Six topics were covered most frequently by the journal: 「Adolescent depression and suicide prevention」, 「Students' knowledge, attitudes, & behaviors」, 「Effective self-esteem program through depression interventions」, 「Factors of students' stress」, 「Intervention program to prevent adolescent risky behaviors」, and 「Sex education curriculum, and teacher」. Each of them was dealt with in at least 20 papers. The topics related to 「Intervention program to prevent adolescent risky behaviors」, 「Effective self-esteem program through depression interventions」, and 「Preventive vaccination and factors of effective vaccination」 appeared repeatedly over the most recent 5 years. Conclusion: This study introduced an AI-powered analysis method that enables data-centered, objective text analysis without human intervention. Based on the results, implications for school health research were presented, and various uses of latent semantic analysis (LSA) in educational research were suggested.

Target Word Selection Disambiguation using Untagged Text Data in English-Korean Machine Translation (영한 기계 번역에서 미가공 텍스트 데이터를 이용한 대역어 선택 중의성 해소)

  • Kim Yu-Seop;Chang Jeong-Ho
    • The KIPS Transactions:PartB / v.11B no.6 / pp.749-758 / 2004
  • In this paper, we propose a new method for disambiguating target word selection in English-Korean machine translation that uses only a raw corpus, without additional human effort. We use two data-driven techniques: Latent Semantic Analysis (LSA) and Probabilistic Latent Semantic Analysis (PLSA). Both techniques can represent complex semantic structures in given contexts, such as text passages. We construct linguistic semantic knowledge with the two techniques and use that knowledge for target word selection in English-Korean machine translation. For target word selection, we utilize grammatical relationships stored in a dictionary. We use the k-nearest neighbor learning algorithm to resolve the data sparseness problem in target word selection, and estimate the distance between instances based on these models. In experiments, we use TREC AP news data to construct the latent semantic space and the Wall Street Journal corpus to evaluate target word selection. With the latent semantic analysis methods, the accuracy of target word selection improved by more than 10%, and PLSA showed better accuracy than LSA. Finally, using correlation analysis, we show how accuracy relates to two important factors: the dimensionality of the latent space and the value of k in k-NN learning.
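
The k-nearest-neighbor fallback mentioned in this abstract can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the latent vectors are taken as given (in the paper they would come from LSA or PLSA), cosine similarity stands in for the distance measure, and the dictionary, the placeholder labels `ko_drink` and `ko_drive`, and all function names are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two latent-space vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def select_target_word(word, latent, dictionary, k=3):
    """Look the word up directly; otherwise vote among its k nearest
    dictionary entries in the latent space, weighted by similarity."""
    if word in dictionary:
        return dictionary[word]
    candidates = [w for w in dictionary if w in latent]
    candidates.sort(key=lambda w: cosine(latent[word], latent[w]), reverse=True)
    votes = Counter()
    for neighbor in candidates[:k]:
        votes[dictionary[neighbor]] += cosine(latent[word], latent[neighbor])
    return votes.most_common(1)[0][0]

# Hypothetical 2-dimensional latent vectors and a toy dictionary that maps
# a collocate noun to a placeholder target-word label.
latent = {
    "coffee": [0.9, 0.1],
    "tea": [0.8, 0.2],
    "car": [0.1, 0.9],
    "juice": [0.85, 0.15],
}
dictionary = {"coffee": "ko_drink", "tea": "ko_drink", "car": "ko_drive"}

choice = select_target_word("juice", latent, dictionary, k=2)
```

Because "juice" is missing from the dictionary, the choice is borrowed from "coffee" and "tea", its two nearest registered neighbors in the latent space.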

Learning Similarity with Probabilistic Latent Semantic Analysis for Image Retrieval

  • Li, Xiong;Lv, Qi;Huang, Wenting
    • KSII Transactions on Internet and Information Systems (TIIS) / v.9 no.4 / pp.1424-1440 / 2015
  • Finding the intended images among a large number of candidates is a challenging problem. Content-based image retrieval (CBIR) is the most promising way to tackle it, and its central issue is measuring image similarity in a way that covers variation in shape, color, pose, illumination, etc. Although previous work has made significant progress, its adaptability to a given dataset has not been fully explored. In this paper, we propose a similarity learning method based on a probabilistic generative model, namely probabilistic latent semantic analysis (PLSA). The method first derives a Fisher kernel, a function of the parameters and variables, from PLSA. The parameters are then determined by simultaneously maximizing the log-likelihood function of PLSA and the retrieval performance over the training dataset. The main advantages of this work are twofold: (1) it derives a similarity measure from PLSA that fully exploits the data distribution and Bayesian inference; (2) it learns the model parameters by simultaneously maximizing the fit of the model to the data and the retrieval performance. The proposed method (PLSA-FK) is empirically evaluated on three datasets, and the results show promising performance.

Target Word Selection using Word Similarity based on Latent Semantic Structure in English-Korean Machine Translation (잠재의미구조 기반 단어 유사도에 의한 역어 선택)

  • Chang, Jeong-Ho;Kim, Yu-Seop;Zhang, Byoung-Tak
    • Proceedings of the Korean Information Science Society Conference / 2002.04b / pp.502-504 / 2002
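  • In this paper, we measure the similarity between words based on latent semantics extracted from a large corpus and apply it to target word selection in English-Korean machine translation. Latent semantic analysis (LSA) and probabilistic LSA (PLSA) are used to extract the latent semantics. When selecting the target word for a given word, a collocation dictionary is searched first; for an unregistered word, information from the registered entries most similar to that word is used, and the k-nearest neighbor method is applied at this step. The similarity between words is computed in the latent semantic space. In experiments, performance improved by up to 15% compared with using the collocation dictionary alone, and the PLSA-based method performed slightly better than the LSA-based method in target word selection.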


Analysis of Virus Types by a Latent Variable Model (Latent variable model에 의한 바이러스 유형 분석)

  • Kim Soo-Jin;Joung Je-Gun;Tae Kang Soo;Zhang Byoung-Tak
    • Proceedings of the Korean Information Science Society Conference / 2005.11b / pp.262-264 / 2005
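  • Human papillomavirus (HPV) is known to be associated with a range of diseases, from warts to invasive cancers of the genital and excretory organs. More than 200 types are currently known, and the complete genomes of 85 of them have been determined. Among the proteins produced during HPV infection, the E6 and E7 proteins bind to tumor suppressor genes (p53, pRb), impairing the cell's tumor-suppression function and thereby causing cancer. In this paper, we cluster HPV with the PLSA (Probabilistic Latent Semantic Analysis) method, using HPV E6 protein sequences, which are closely related to carcinogenesis, together with HPV types. The experimental results show that certain clusters are closely associated with disease and that the key sequences related to them can be analyzed.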


Accelerated Learning of Latent Topic Models by Incremental EM Algorithm (점진적 EM 알고리즘에 의한 잠재토픽모델의 학습 속도 향상)

  • Chang, Jeong-Ho;Lee, Jong-Woo;Eom, Jae-Hong
    • Journal of KIISE:Software and Applications / v.34 no.12 / pp.1045-1055 / 2007
  • Latent topic models are statistical models that automatically capture, in a probabilistic way, salient patterns or correlations among the features underlying a data collection. They are gaining popularity as an effective tool for automatic semantic feature extraction from text corpora, multimedia data analysis including image data, and bioinformatics. An important issue in applying latent topic models to massive data sets is efficient learning of the model. This paper proposes an accelerated learning technique for the PLSA model, one of the popular latent topic models, that uses an incremental EM algorithm instead of the conventional EM algorithm. The incremental EM algorithm is characterized by a series of partial E-steps, each performed on a subset of the entire data collection, in contrast to the conventional EM algorithm, where one batch E-step is performed over the whole data set. Because a single batch EM step is replaced with a series of partial E-steps and M-steps, the inference result for the previous data subset is reflected directly in the next inference step, which can speed up learning over the entire data set. The algorithm also has the advantage that it is guaranteed to converge to a local maximum and can be implemented with only slight modification of an existing implementation of conventional EM. We present the basic application of the incremental EM algorithm to PLSA learning and empirically evaluate the acceleration obtained with several possible data partitioning methods for practical use. Experimental results on a real-world news data set show that the proposed approach achieves a meaningful improvement in the convergence rate when learning a latent topic model. Additionally, we present an interesting result that supports a possible synergistic effect of combining the incremental EM algorithm with parallel computing.
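
The partial E-step idea described in this abstract can be sketched as below. This is a minimal pure-Python illustration in the style of incremental EM with cached sufficient statistics, not the paper's implementation: the asymmetric PLSA parameterization P(z|d), P(w|z), the round-robin partition into blocks, the initial batch pass, and all names and toy counts are assumptions.

```python
import math
import random

def normalize(values):
    """Scale non-negative values to sum to 1 (uniform if all are zero)."""
    clipped = [max(v, 0.0) for v in values]  # guard against float round-off
    total = sum(clipped)
    if total == 0.0:
        return [1.0 / len(clipped)] * len(clipped)
    return [v / total for v in clipped]

def plsa_incremental_em(counts, n_topics, n_blocks=2, n_sweeps=20, seed=0):
    """Fit P(z|d) and P(w|z) to a document-word count matrix."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]
    # doc_stats[d][z][w] caches n(d, w) * P(z | d, w); total sums it over d.
    doc_stats = [[[0.0] * n_words for _ in range(n_topics)]
                 for _ in range(n_docs)]
    total = [[0.0] * n_words for _ in range(n_topics)]

    def e_step(d):
        # Recompute the posteriors of one document and swap its old
        # contribution to the global statistics for the new one.
        for w in range(n_words):
            joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
            norm = sum(joint)
            for z in range(n_topics):
                new = counts[d][w] * joint[z] / norm if norm > 0 else 0.0
                total[z][w] += new - doc_stats[d][z][w]
                doc_stats[d][z][w] = new

    def m_step(docs):
        # P(z|d) only changes for the listed documents; P(w|z) uses the
        # global statistics, which still hold cached values for the rest.
        for d in docs:
            p_z_d[d] = normalize([sum(doc_stats[d][z]) for z in range(n_topics)])
        for z in range(n_topics):
            p_w_z[z] = normalize(total[z])

    # One batch pass so every document has cached statistics.
    for d in range(n_docs):
        e_step(d)
    m_step(range(n_docs))

    blocks = [list(range(b, n_docs, n_blocks)) for b in range(n_blocks)]
    for _ in range(n_sweeps):
        for block in blocks:
            for d in block:      # partial E-step on one block
                e_step(d)
            m_step(block)        # M-step right after each partial E-step
    return p_z_d, p_w_z

def log_likelihood(counts, p_z_d, p_w_z):
    """Sum over (d, w) of n(d, w) * log sum_z P(z|d) P(w|z)."""
    ll = 0.0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n > 0:
                p = sum(p_z_d[d][z] * p_w_z[z][w] for z in range(len(p_w_z)))
                ll += n * math.log(max(p, 1e-300))
    return ll

# Hypothetical toy corpus: 4 documents over a 4-word vocabulary.
toy_counts = [[4, 3, 0, 0], [3, 5, 0, 0], [0, 0, 5, 2], [0, 0, 3, 4]]
doc_topics, topic_words = plsa_incremental_em(toy_counts, n_topics=2)
toy_ll = log_likelihood(toy_counts, doc_topics, topic_words)
```

Setting `n_blocks=1` reduces the loop to conventional batch EM, which gives a simple baseline for comparing convergence speed on the same data.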

Information Retrieval based on Probabilistic Latent Semantic Analysis within P2P Environments (P2P 환경에서 확률적 잠재 의미 분석에 기반한 정보 검색)

  • Gu, Tae-Wan;Kim, Yu-Seop;Lee, Kwang-Mo
    • Proceedings of the Korea Information Processing Society Conference / 2004.05a / pp.515-518 / 2004
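  • In traditional peer-to-peer models, information retrieval has mostly been handled by sending queries and keywords to each peer and comparing them with that peer's documents. This paper extends that approach and aims to improve retrieval accuracy through semantic analysis of the documents. To this end, we build an index of the information held by each peer using a probabilistic semantic analysis technique, and then propose a distributed index distribution algorithm for applying it in a peer-to-peer environment.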


Salient Object Detection Based on Regional Contrast and Relative Spatial Compactness

  • Xu, Dan;Tang, Zhenmin;Xu, Wei
    • KSII Transactions on Internet and Information Systems (TIIS) / v.7 no.11 / pp.2737-2753 / 2013
  • In this study, we propose a novel salient object detection strategy based on regional contrast and relative spatial compactness. Our algorithm consists of four basic steps. First, we learn color names offline using the probabilistic latent semantic analysis (PLSA) model to find the mapping between basic color names and pixel values. The color names can be used for image segmentation and region description. Second, image pixels are assigned to specific color names according to their values, forming different color clusters. The saliency measure for every cluster is evaluated by its spatial compactness relative to other clusters rather than by the intra-cluster variance alone. Third, every cluster is divided into local regions that are described with color name descriptors. The regional contrast is evaluated by computing the color distance between different regions in the entire image. Last, the final saliency map is constructed by combining the color cluster's spatial compactness measure with the corresponding regional contrast. Experiments on public datasets show that our algorithm outperforms several existing salient object detection methods, with higher precision and better recall rates.