Search | Korea Science

An Analysis Scheme Design of Customer Spending Pattern using Text Mining (텍스트 마이닝을 이용한 소비자 소비패턴 분석 기법 설계)

Jeong, Eun-Hee;Lee, Byung-Kwan
- The Journal of Korea Institute of Information, Electronics, and Communication Technology
- /
- v.11 no.2
- /
- pp.181-188
- /
- 2018
In this paper, we propose an analysis scheme of customer spending pattern using text mining. In proposed consumption pattern analysis scheme, first we analyze user's rating similarity using Pearson correlation, second we analyze user's review similarity using TF-IDF cosine similarity, third we analyze the consistency of the rating and review using Sendiwordnet. And we select the nearest neighbors using rating similarity and review similarity, and provide the recommended list that is proper with consumption pattern. The precision of recommended list are 0.79 for the Pearson correlation, 0.73 for the TF-IDF, and 0.82 for the proposed consumption pattern. That is, the proposed consumption pattern analysis scheme can more accurately analyze consumption pattern because it uses both quantitative rating and qualitative reviews of consumers.
https://doi.org/10.17661/jkiiect.2018.11.2.181 인용 PDF KSCI

Question Similarity Measurement of Chinese Crop Diseases and Insect Pests Based on Mixed Information Extraction

Zhou, Han;Guo, Xuchao;Liu, Chengqi;Tang, Zhan;Lu, Shuhan;Li, Lin
- KSII Transactions on Internet and Information Systems (TIIS)
- /
- v.15 no.11
- /
- pp.3991-4010
- /
- 2021
The Question Similarity Measurement of Chinese Crop Diseases and Insect Pests (QSM-CCD&IP) aims to judge the user's tendency to ask questions regarding input problems. The measurement is the basis of the Agricultural Knowledge Question and Answering (Q & A) system, information retrieval, and other tasks. However, the corpus and measurement methods available in this field have some deficiencies. In addition, error propagation may occur when the word boundary features and local context information are ignored when the general method embeds sentences. Hence, these factors make the task challenging. To solve the above problems and tackle the Question Similarity Measurement task in this work, a corpus on Chinese crop diseases and insect pests(CCDIP), which contains 13 categories, was established. Then, taking the CCDIP as the research object, this study proposes a Chinese agricultural text similarity matching model, namely, the AgrCQS. This model is based on mixed information extraction. Specifically, the hybrid embedding layer can enrich character information and improve the recognition ability of the model on the word boundary. The multi-scale local information can be extracted by multi-core convolutional neural network based on multi-weight (MM-CNN). The self-attention mechanism can enhance the fusion ability of the model on global information. In this research, the performance of the AgrCQS on the CCDIP is verified, and three benchmark datasets, namely, AFQMC, LCQMC, and BQ, are used. The accuracy rates are 93.92%, 74.42%, 86.35%, and 83.05%, respectively, which are higher than that of baseline systems without using any external knowledge. Additionally, the proposed method module can be extracted separately and applied to other models, thus providing reference for related research.
https://doi.org/10.3837/tiis.2021.11.007 인용 PDF KSCI HTML

Local Similarity based Document Layout Analysis using Improved ARLSA

Kim, Gwangbok;Kim, SooHyung;Na, InSeop
- International Journal of Contents
- /
- v.11 no.2
- /
- pp.15-19
- /
- 2015
In this paper, we propose an efficient document layout analysis algorithm that includes table detection. Typical methods of document layout analysis use the height and gap between words or columns. To correspond to the various styles and sizes of documents, we propose an algorithm that uses the mean value of the distance transform representing thickness and compare with components in the local area. With this algorithm, we combine a table detection algorithm using the same feature as that of the text classifier. Table candidates, separators, and big components are isolated from the image using Connected Component Analysis (CCA) and distance transform. The key idea of text classification is that the characteristics of the text parallel components that have a similar thickness and height. In order to estimate local similarity, we detect a text region using an adaptive searching window size. An improved adaptive run-length smoothing algorithm (ARLSA) was proposed to create the proper boundary of a text zone and non-text zone. Results from experiments on the ICDAR2009 page segmentation competition test set and our dataset demonstrate the superiority of our dataset through f-measure comparison with other algorithms.
https://doi.org/10.5392/IJoC.2015.11.2.015 인용 PDF KSCI KPUBS HTML

An Experimental Study on Feature Selection Using Wikipedia for Text Categorization (위키피디아를 이용한 분류자질 선정에 관한 연구)

Kim, Yong-Hwan;Chung, Young-Mee
- Journal of the Korean Society for information Management
- /
- v.29 no.2
- /
- pp.155-171
- /
- 2012
In text categorization, core terms of an input document are hardly selected as classification features if they do not occur in a training document set. Besides, synonymous terms with the same concept are usually treated as different features. This study aims to improve text categorization performance by integrating synonyms into a single feature and by replacing input terms not in the training document set with the most similar term occurring in training documents using Wikipedia. For the selection of classification features, experiments were performed in various settings composed of three different conditions: the use of category information of non-training terms, the part of Wikipedia used for measuring term-term similarity, and the type of similarity measures. The categorization performance of a kNN classifier was improved by 0.35~1.85% in $F_1$ value in all the experimental settings when non-learning terms were replaced by the learning term with the highest similarity above the threshold value. Although the improvement ratio is not as high as expected, several semantic as well as structural devices of Wikipedia could be used for selecting more effective classification features.
https://doi.org/10.3743/KOSIM.2012.29.2.155 인용 PDF KSCI

Web Image Caption Extraction using Positional Relation and Lexical Similarity (위치적 연관성과 어휘적 유사성을 이용한 웹 이미지 캡션 추출)

Lee, Hyoung-Gyu;Kim, Min-Jeong;Hong, Gum-Won;Rim, Hae-Chang
- Journal of KIISE:Software and Applications
- /
- v.36 no.4
- /
- pp.335-345
- /
- 2009
In this paper, we propose a new web image caption extraction method considering the positional relation between a caption and an image and the lexical similarity between a caption and the main text containing the caption. The positional relation between a caption and an image represents how the caption is located with respect to the distance and the direction of the corresponding image. The lexical similarity between a caption and the main text indicates how likely the main text generates the caption of the image. Compared with previous image caption extraction approaches which only utilize the independent features of image and captions, the proposed approach can improve caption extraction recall rate, precision rate and 28% F-measure by including additional features of positional relation and lexical similarity.
PDF KSCI

Misinformation Detection and Rectification Based on QA System and Text Similarity with COVID-19

Insup Lim;Namjae Cho
- Journal of Information Technology Applications and Management
- /
- v.28 no.5
- /
- pp.41-50
- /
- 2021
As COVID-19 spread widely, and rapidly, the number of misinformation is also increasing, which WHO has referred to this phenomenon as "Infodemic". The purpose of this research is to develop detection and rectification of COVID-19 misinformation based on Open-domain QA system and text similarity. 9 testing conditions were used in this model. For open-domain QA system, 6 conditions were applied using three different types of dataset types, scientific, social media, and news, both datasets, and two different methods of choosing the answer, choosing the top answer generated from the QA system and voting from the top three answers generated from QA system. The other 3 conditions were the Closed-Domain QA system with different dataset types. The best results from the testing model were 76% using all datasets with voting from the top 3 answers outperforming by 16% from the closed-domain model.
https://doi.org/10.21219/jitam.2021.28.5.041 인용 PDF KSCI

Sentence Similarity Measurement Method Using a Set-based POI Data Search (집합 기반 POI 검색을 이용한 문장 유사도 측정 기법)

Ko, EunByul;Lee, JongWoo
- KIISE Transactions on Computing Practices
- /
- v.20 no.12
- /
- pp.711-716
- /
- 2014
With the gradual increase of interest in plagiarism and intelligent file content search, the demand for similarity measuring between two sentences is increasing. There is a lot of researches for sentence similarity measurement methods in various directions such as n-gram, edit-distance and LSA. However, these methods have their own advantages and disadvantages. In this paper, we propose a new sentence similarity measurement method approaching from another direction. The proposed method uses the set-based POI data search that improves search performance compared to the existing hard matching method when data includes the inverse, omission, insertion and revision of characters. Using this method, we are able to measure the similarity between two sentences more accurately and more quickly. We modified the data loading and text search algorithm of the set-based POI data search. We also added a word operation algorithm and a similarity measure between two sentences expressed as a percentage. From the experimental results, we observe that our sentence similarity measurement method shows better performance than n-gram and the set-based POI data search.
https://doi.org/10.5626/KTCP.2014.20.12.711 인용

Generative probabilistic model with Dirichlet prior distribution for similarity analysis of research topic

Milyahilu, John;Kim, Jong Nam
- Journal of Korea Multimedia Society
- /
- v.23 no.4
- /
- pp.595-602
- /
- 2020
We propose a generative probabilistic model with Dirichlet prior distribution for topic modeling and text similarity analysis. It assigns a topic and calculates text correlation between documents within a corpus. It also provides posterior probabilities that are assigned to each topic of a document based on the prior distribution in the corpus. We then present a Gibbs sampling algorithm for inference about the posterior distribution and compute text correlation among 50 abstracts from the papers published by IEEE. We also conduct a supervised learning to set a benchmark that justifies the performance of the LDA (Latent Dirichlet Allocation). The experiments show that the accuracy for topic assignment to a certain document is 76% for LDA. The results for supervised learning show the accuracy of 61%, the precision of 93% and the f1-score of 96%. A discussion for experimental results indicates a thorough justification based on probabilities, distributions, evaluation metrics and correlation coefficients with respect to topic assignment.
https://doi.org/10.9717/kmms.2020.23.4.595 인용 PDF KSCI HTML

Semantic Word Categorization using Feature Similarity based K Nearest Neighbor

Jo, Taeho
- Journal of Multimedia Information System
- /
- v.5 no.2
- /
- pp.67-78
- /
- 2018
This article proposes the modified KNN (K Nearest Neighbor) algorithm which considers the feature similarity and is applied to the word categorization. The texts which are given as features for encoding words into numerical vectors are semantic related entities, rather than independent ones, and the synergy effect between the word categorization and the text categorization is expected by combining both of them with each other. In this research, we define the similarity metric between two vectors, including the feature similarity, modify the KNN algorithm by replacing the exiting similarity metric by the proposed one, and apply it to the word categorization. The proposed KNN is empirically validated as the better approach in categorizing words in news articles and opinions. The significance of this research is to improve the classification performance by utilizing the feature similarities.
https://doi.org/10.9717/JMIS.2018.5.2.67 인용 PDF KSCI

Deep learning-based custom problem recommendation algorithm to improve learning rate (학습률 향상을 위한 딥러닝 기반 맞춤형 문제 추천 알고리즘)

Lim, Min-Ah;Hwang, Seung-Yeon;Kim, Jeong-Jun
- The Journal of the Institute of Internet, Broadcasting and Communication
- /
- v.22 no.5
- /
- pp.171-176
- /
- 2022
With the recent development of deep learning technology, the areas of recommendation systems have also diversified. This paper studied algorithms to improve the learning rate and studied the significance results according to words through comparison with the performance characteristics of the Word2Vec model. The problem recommendation algorithm was implemented with the values expressed through the reflection of meaning and similarity test between texts, which are characteristics of the Word2Vec model. Through Word2Vec's learning results, problem recommendations were conducted using text similarity values, and problems with high similarity can be recommended. In the experimental process, it was seen that the accuracy decreased with the quantitative amount of data, and it was confirmed that the larger the amount of data in the data set, the higher the accuracy.
https://doi.org/10.7236/JIIBC.2022.22.5.171 인용 PDF KSCI HTML

Search Result 280, Processing Time 0.025 seconds

이메일무단수집거부

이용약관

제 1 장 총칙

제 2 장 이용계약의 체결

제 3 장 계약 당사자의 의무

제 4 장 서비스의 이용

제 5 장 계약 해지 및 이용 제한

제 6 장 손해배상 및 기타사항

Detail Search

Image Search (β)