• Title/Summary/Keyword: number word


Selective Word Embedding for Sentence Classification by Considering Information Gain and Word Similarity (문장 분류를 위한 정보 이득 및 유사도에 따른 단어 제거와 선택적 단어 임베딩 방안)

  • Lee, Min Seok;Yang, Seok Woo;Lee, Hong Joo
    • Journal of Intelligence and Information Systems
    • /
    • v.25 no.4
    • /
    • pp.105-122
    • /
    • 2019
  • Dimensionality reduction is one of the methods for handling big data in text mining. For dimensionality reduction, we should consider the density of the data, which has a significant influence on the performance of sentence classification. High-dimensional data requires heavy computation, which can lead to high computational cost and overfitting in the model; thus, a dimension reduction process is necessary to improve model performance. Diverse methods have been proposed, ranging from merely reducing noise in the data, such as misspellings or informal text, to incorporating semantic and syntactic information. Moreover, the representation and selection of text features affect the performance of the classifier in sentence classification, one of the fields of Natural Language Processing. The common goal of dimension reduction is to find a latent space that is representative of the raw data in the observation space. Existing methods utilize various algorithms for dimensionality reduction, such as feature extraction and feature selection. In addition to these algorithms, word embeddings, which learn low-dimensional vector space representations of words and can capture semantic and syntactic information, are also utilized. To improve performance, recent studies have suggested methods in which the word dictionary is modified according to the positive and negative scores of pre-defined words. The basic idea of this study is that similar words have similar vector representations: once a feature selection algorithm identifies unimportant words, words similar to them are also assumed to have little impact on sentence classification. This study proposes two approaches to more accurate classification that selectively eliminate words under specific rules and construct word embeddings based on Word2Vec. To select words of low importance from the text, the information gain algorithm is used to measure importance, and cosine similarity is used to search for similar words. First, words with comparatively low information gain values are eliminated from the raw text and word embeddings are formed. Second, words similar to those with low information gain values are additionally selected for elimination, and word embeddings are formed. Finally, the filtered text and word embeddings are applied to deep learning models: a Convolutional Neural Network and an Attention-Based Bidirectional LSTM. This study uses customer reviews on Kindle at Amazon.com, IMDB, and Yelp as datasets and classifies each dataset using the deep learning models. Reviews that received more than five helpful votes and whose ratio of helpful votes exceeded 70% were classified as helpful reviews. Because Yelp shows only the number of helpful votes, 100,000 reviews with more than five helpful votes were extracted by random sampling from 750,000 reviews. Minimal preprocessing, such as removing numbers and special characters, was applied to each dataset. To evaluate the proposed methods, their performance was compared with that of Word2Vec and GloVe word embeddings that used all the words. One of the proposed methods outperformed the embeddings that used all the words: removing unimportant words yields better performance, but removing too many words lowers it. For future research, diverse preprocessing strategies and in-depth analysis of word co-occurrence should be considered for measuring similarity among words. In addition, the proposed method was applied only with Word2Vec; other embedding methods such as GloVe, fastText, and ELMo can be combined with the proposed elimination methods, making it possible to identify effective combinations of embedding and elimination methods.
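The following is a minimal sketch, not the authors' code, of how the selection step described above could be implemented: information gain is approximated with scikit-learn's mutual_info_classif, and similar words are gathered from a gensim Word2Vec model via cosine similarity. The thresholds, function names, and data variables (texts, labels) are illustrative assumptions.

```python
# Sketch of selective word elimination: low-information-gain words plus their
# Word2Vec neighbors are removed before building embeddings and classifiers.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif
from gensim.models import Word2Vec

def select_removable_words(texts, labels, ig_quantile=0.1, sim_threshold=0.7):
    # Step 1: score each vocabulary word by mutual information (an IG proxy).
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)
    ig_scores = mutual_info_classif(X, labels, discrete_features=True)
    vocab = vectorizer.get_feature_names_out()

    # Words in the lowest information-gain quantile are removal candidates.
    cutoff = sorted(ig_scores)[int(len(ig_scores) * ig_quantile)]
    low_ig = {w for w, s in zip(vocab, ig_scores) if s <= cutoff}

    # Step 2: expand the removal set with words whose Word2Vec vectors are
    # close (high cosine similarity) to a low-information-gain word.
    w2v = Word2Vec([t.split() for t in texts], vector_size=100, min_count=2)
    expanded = set(low_ig)
    for word in low_ig:
        if word in w2v.wv:
            for similar, sim in w2v.wv.most_similar(word, topn=10):
                if sim >= sim_threshold:
                    expanded.add(similar)
    return expanded

def filter_texts(texts, removable):
    # Remove the selected words before training the CNN / attention-based BiLSTM.
    return [" ".join(w for w in t.split() if w not in removable) for t in texts]
```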

Analyzing the Sentence Structure for Automatic Identification of Metadata Elements based on the Logical Semantic Structure of Research Articles (연구 논문의 의미 구조 기반 메타데이터 항목의 자동 식별 처리를 위한 문장 구조 분석)

  • Song, Min-Sun
    • Journal of the Korean Society for information Management
    • /
    • v.35 no.3
    • /
    • pp.101-121
    • /
    • 2018
  • This study proposes a method of analyzing sentence semantics so that sentences corresponding to the logical semantic structure metadata of research papers can be automatically identified and assigned to the appropriate metadata elements according to their composition. To this end, the structure of sentences corresponding to 'Research Objectives' and 'Research Outcomes' among the semantic structure metadata was analyzed in terms of the number of words, the types of link words, the roles of frequently appearing words, and the sentence-final endings. The number of words in the sentences was 38 in 'Research Objectives' and 212 in 'Research Outcomes'. The link word types in 'Research Objectives' occurred in the order of Causality, Sequence, Equivalence, and In-other-words/Summary relations, while those in 'Research Outcomes' appeared in the order of Causality, Equivalence, Sequence, and In-other-words/Summary relations. Analysis target words such as '역할(Role)', '요인(Factor)', and '관계(Relation)' played similar roles in both the objectives and the outcomes, but the role of '연구(Study)' differed slightly. Finally, the most frequent verb endings were '~고자' and '~였다' in 'Research Objectives', and '~었다', '~있다', and '~였다' in 'Research Outcomes'. This study is significant as fundamental research that can be used to automatically identify and populate metadata elements reflecting the common logical semantics of research papers, in order to support researchers' scholarly sensemaking.
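As a rough illustration of the kind of surface features counted in this analysis (word counts and sentence-final verb endings), here is a hypothetical Python sketch; the ending list mirrors the endings reported above, and the input sentences are assumed to be preprocessed.

```python
# Hypothetical sketch: count words and sentence-final endings for a set of
# 'Research Objectives' or 'Research Outcomes' sentences.
from collections import Counter

ENDINGS = ["고자", "였다", "었다", "있다"]  # endings reported in the study

def sentence_features(sentences):
    word_count = sum(len(s.split()) for s in sentences)
    ending_counts = Counter()
    for s in sentences:
        last_word = s.rstrip(".! ").split()[-1]
        for ending in ENDINGS:
            if last_word.endswith(ending):
                ending_counts[ending] += 1
    return word_count, ending_counts

# e.g. total_words, endings = sentence_features(objective_sentences)
```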

Prediction of Physical Examination Demand Using Text Mining (텍스트 마이닝을 이용한 건강검진 수요 예측)

  • Park, Kyungbo;Kim, Mi Ryang
    • Journal of Information Technology Services
    • /
    • v.21 no.5
    • /
    • pp.95-106
    • /
    • 2022
  • Recently, physical examinations have become an important strategy for reducing costs for both individuals and society. Pre-examination counseling is important for an effective physical examination, but counseling remains incomplete because the demand for physical examinations is not predicted. Therefore, in this study, the demand for physical examinations was predicted using text mining and stepwise regression. The analysis showed that the most recent text data had high explanatory power for physical examination demand, and that larger amounts of data also had high explanatory power. In addition, a high frequency of the term "health food" was found to reduce the number of health examination customers, and the higher the frequency of the word "food", the lower the number of customers. However, when the word "wild ginseng" was mentioned frequently on Twitter, the number of customers visiting hospitals for physical examinations increased. In other words, customers consume efficiently by comparing the price of a health examination with the prices of consumer goods. The proposed research framework can help predict demand in other industries.
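As a sketch of how text frequencies could feed a stepwise regression of the kind described above (the paper's exact variables and procedure are not specified here), the following hypothetical Python code performs greedy forward selection over keyword-frequency predictors using statsmodels.

```python
# Hypothetical forward stepwise regression: monthly keyword frequencies
# (e.g. "health food", "food", "wild ginseng") as candidate predictors of
# physical examination demand.
import pandas as pd
import statsmodels.api as sm

def forward_stepwise(X: pd.DataFrame, y: pd.Series, threshold_in=0.05):
    """Add, at each step, the predictor with the smallest p-value while it
    stays below threshold_in; stop otherwise."""
    selected, remaining = [], list(X.columns)
    while remaining:
        pvals = {}
        for col in remaining:
            model = sm.OLS(y, sm.add_constant(X[selected + [col]])).fit()
            pvals[col] = model.pvalues[col]
        best = min(pvals, key=pvals.get)
        if pvals[best] < threshold_in:
            selected.append(best)
            remaining.remove(best)
        else:
            break
    return sm.OLS(y, sm.add_constant(X[selected])).fit(), selected
```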

A Convergence Study of the Research Trends on Stress Urinary Incontinence using Word Embedding (워드임베딩을 활용한 복압성 요실금 관련 연구 동향에 관한 융합 연구)

  • Kim, Jun-Hee;Ahn, Sun-Hee;Gwak, Gyeong-Tae;Weon, Young-Soo;Yoo, Hwa-Ik
    • Journal of the Korea Convergence Society
    • /
    • v.12 no.8
    • /
    • pp.1-11
    • /
    • 2021
  • The purpose of this study was to analyze the trends and characteristics of 'stress urinary incontinence' research through word frequency analysis and to model the relationships among terms using word embedding. Abstract data from 9,868 papers with abstracts in PubMed's MEDLINE were extracted using a Python program. Through frequency analysis, 10 keywords were then selected according to their high frequency. The similarity of words related to the keywords was analyzed using the Word2Vec machine learning algorithm. The locations and distances of words were visualized using the t-SNE technique, and the resulting groups were classified and analyzed. The number of studies related to stress urinary incontinence has increased rapidly since the 1980s. The keywords used most frequently in the abstracts were 'woman', 'urethra', and 'surgery'. Through Word2Vec modeling, words such as 'female', 'urge', and 'symptom' showed the highest relevance to the keywords in research on stress urinary incontinence. In addition, through the t-SNE technique, the keywords and related words could be classified into three groups centered on the symptoms, anatomical characteristics, and surgical interventions of stress urinary incontinence. This study is the first to examine trends in stress urinary incontinence research using keyword frequency analysis and word embedding of abstracts. The results can serve as a basis for future researchers in selecting topics and directions for research related to stress urinary incontinence.
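A minimal Python sketch of this pipeline, with assumed inputs (tokenized abstracts) and illustrative parameters, might look like the following; it is not the authors' script.

```python
# Sketch: frequency-based keyword selection, Word2Vec neighbors, and a t-SNE
# layout of the keywords plus their related words.
from collections import Counter
from gensim.models import Word2Vec
from sklearn.manifold import TSNE
import numpy as np

def analyze(abstract_tokens, top_k=10, neighbors=10):
    # 1) keyword selection by raw frequency across all abstracts
    freq = Counter(w for doc in abstract_tokens for w in doc)
    keywords = [w for w, _ in freq.most_common(top_k)]

    # 2) Word2Vec model and nearest neighbors (cosine similarity) per keyword
    model = Word2Vec(abstract_tokens, vector_size=200, window=5, min_count=5)
    related = {k: model.wv.most_similar(k, topn=neighbors)
               for k in keywords if k in model.wv}

    # 3) 2-D t-SNE projection of keywords and neighbors for visual grouping
    words = list({w for sims in related.values() for w, _ in sims} | set(related))
    vectors = np.array([model.wv[w] for w in words])
    coords = TSNE(n_components=2, perplexity=min(30, len(words) - 1)).fit_transform(vectors)
    return keywords, related, dict(zip(words, coords))
```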

Pronunciation Variation Modeling for Korean Point-of-Interest Data Using Prosodic Information (운율 정보를 이용한 한국어 위치 정보 데이타의 발음 모델링)

  • Kim, Sun-He;Park, Jeon-Gue;Na, Min-Soo;Jeon, Je-Hun;Chung, Min-Wha
    • Journal of KIISE:Software and Applications
    • /
    • v.34 no.2
    • /
    • pp.104-111
    • /
    • 2007
  • This paper examines how the performance of an automatic speech recognizer for Korean Point-of-Interest (POI) data was improved by modeling pronunciation variation using structural prosodic information such as prosodic words and syllable length. First, multiple pronunciation variants were generated using prosodic words, given that each POI word can be broken down into prosodic words. Then, cross-prosodic-word variations were modeled considering the syllable length of each word. A total of 81 experiments were conducted using 9 test sets (3 baseline and 6 proposed) on 9 training sets (3 baseline, 6 proposed). The results show that (i) performance was improved when the pronunciation lexica were generated using prosodic words; (ii) the best performance was achieved when the maximum number of variants was constrained to 3 based on syllable length; and (iii) compared to the baseline word error rate (WER) of 4.63%, a maximum WER reduction of 8.4% was achieved when both prosodic words and syllable length were considered.
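The following hypothetical Python sketch illustrates only the variant-capping idea (combining per-prosodic-word pronunciations and keeping at most three variants, using string length as a crude proxy for syllable length); it is not the authors' lexicon-generation system.

```python
# Hypothetical sketch: build a pruned pronunciation lexicon entry for one POI.
from itertools import product

def build_lexicon_entry(prosodic_word_variants, max_variants=3):
    """prosodic_word_variants: one list of candidate pronunciations per
    prosodic word in the POI entry."""
    candidates = [" ".join(combo) for combo in product(*prosodic_word_variants)]
    # Prefer shorter variants (fewer syllables, approximated by length) when pruning.
    candidates.sort(key=len)
    return candidates[:max_variants]

# e.g. build_lexicon_entry([["서울 역", "서울력"], ["광장"]])
```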

A Study of the Kinds and Frequency Characteristics of Descriptors in the Articles Related to Scientific Literacy (과학적 소양 관련 논문에서 서술자의 종류와 빈도 특성 연구)

  • Lee, Myeong-Je
    • Journal of Korean Elementary Science Education
    • /
    • v.29 no.4
    • /
    • pp.401-413
    • /
    • 2010
  • This study analyzed the kinds and frequencies of descriptors in 154 articles retrieved from the ERIC database on January 4, 2010, whose titles include the phrase 'scientific literacy'. Since each descriptor consists of two or more words, the first word of a descriptor was defined in this study as the 'restrictive word' and the remaining word(s) as the 'target word(s)'. The results are as follows. First, the descriptors with high target-word frequencies correspond to the traditionally important themes of scientific literacy education. Target words with relatively high frequency are 'education', 'literacy', 'instruction', and 'countries'. A low-frequency word is 'curriculum', which has various restrictive words and shows wide differentiation. Second, among the descriptors with low target-word frequencies, the relatively high-frequency descriptors are '(and) society', 'change', 'secondary education', 'concepts', and 'biology', which have received more attention in scientific literacy research than the remaining descriptors. Third, the descriptors showing the widely distributed pattern A, which appears continuously over 15 years, account for more than half of all analyzed descriptors, indicating that they have been major topics in research on scientific literacy. Most pattern-A descriptors show a normal frequency distribution or a trend of increasing frequency toward the present. Fourth, the descriptors are divided into four groups according to time span, with the following research trends: in the late 1980s, research emphasizing the importance of society and technology in science curricula at all school levels; in the late 1990s, research on educational change toward inquiry-centered science curricula that consider technological literacy in social contexts; in the early 2000s, research in which scientists and science teachers develop science curricula mostly related to scientific principles and thinking, especially in chemistry and biology; and in the late 2000s, case studies relating teaching methods and science process activities to students' attitudes, scientific concepts, and curricula.
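As an illustration of the restrictive-word/target-word decomposition defined above, a small hypothetical Python sketch follows; the descriptor strings are examples only.

```python
# Hypothetical sketch: split each descriptor into a restrictive word (first
# word) and target word(s) (the rest), then count target-word frequencies.
from collections import Counter, defaultdict

def decompose_descriptors(descriptors):
    target_freq = Counter()
    restrictive_by_target = defaultdict(set)
    for d in descriptors:
        words = d.split()
        restrictive = words[0]
        target = " ".join(words[1:]) or words[0]  # fall back for one-word descriptors
        target_freq[target] += 1
        restrictive_by_target[target].add(restrictive)
    return target_freq, restrictive_by_target

# e.g. decompose_descriptors(["Science Education", "Scientific Literacy", "Science Curriculum"])
```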


A Study to Compare between Groups Classified by Demographic Characteristic into Effects of Word of Mouth and Methods of Sales Promotion in Intention of Watching Movies (개봉 전 후 영화의 구전효과와 판촉방식에 따른 인구통계학적 집단 간의 차이에 관한 연구)

  • Kim, Yang Sug;Lee, Bo Young
    • Asia-Pacific Journal of Business Venturing and Entrepreneurship
    • /
    • v.10 no.6
    • /
    • pp.59-68
    • /
    • 2015
  • Analyzing the effects of word of mouth is important for increasing its impact on the performance of motion pictures, and various sales promotion activities such as free gifts, promotional goods, and price discounts need to be combined with word of mouth for a film's box office success. The purpose of this study is to compare groups classified by demographic characteristics with respect to the effects of word of mouth and sales promotion methods on the intention to watch a film. Whereas existing studies on sales promotion and word of mouth were one-sided in their theoretical background, the significance of this study lies in theorizing the social phenomenon of movie sales promotion using actual examples currently carried out by production companies, movie theaters, distribution companies, and affiliated companies. For this purpose, a survey targeting 500 students at B University in Seoul was conducted; 379 responses were received, and the study proceeded with 369 responses after excluding 10 inaccurate ones. The questionnaires used a 5-point Likert scale, with strong agreement scored as 5 points and the opposite as 1 point. T-tests and ANOVA were conducted according to gender, field of major, and average number of movies watched per month, and the results were analyzed after comparing the classified groups (see the sketch after this abstract). The results are summarized as follows. First, offering premiums is more effective for men than for women, whereas free gifts show the opposite result. Second, there are no differences in the effects of word of mouth and sales promotion methods across majors. Third, there are large differences among groups classified by the average number of movies watched per month in the effects of word of mouth and sales promotion methods; the group watching three or more movies per month is influenced more than the other groups in their intention to watch a film through word of mouth. Fourth, word of mouth is the greatest factor in increasing the intention to watch a film, followed by price discounts.
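For readers unfamiliar with the statistics used, the following Python sketch shows an independent t-test and a one-way ANOVA of the kind reported above, run on synthetic Likert-scale data rather than the actual survey responses.

```python
# Synthetic example: t-test across gender and one-way ANOVA across monthly
# viewing-frequency groups, on 5-point Likert scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
male = rng.integers(1, 6, size=180)          # illustrative Likert scores
female = rng.integers(1, 6, size=189)
t_stat, t_p = stats.ttest_ind(male, female, equal_var=False)

low_freq = rng.integers(1, 6, size=120)      # <1 film per month
mid_freq = rng.integers(1, 6, size=150)      # 1-2 films per month
high_freq = rng.integers(1, 6, size=99)      # 3+ films per month
f_stat, f_p = stats.f_oneway(low_freq, mid_freq, high_freq)

print(f"t-test: t={t_stat:.2f}, p={t_p:.3f}; ANOVA: F={f_stat:.2f}, p={f_p:.3f}")
```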


A Study of Psychometric Function Curve for Korean Standard Monosyllabic Word Lists for Preschoolers (KS-MWL-P) (한국표준 학령전기용 단음절어표 (Korean Standard Monosyllabic Word Lists for Preschoolers, KS-MWL-P)의 심리음향기능곡선 연구)

  • Shin, Hyun-Wook;Kim, Jin-Sook
    • The Journal of the Acoustical Society of Korea
    • /
    • v.28 no.6
    • /
    • pp.534-541
    • /
    • 2009
  • The word recognition test (WRT) for children can be useful for diagnosing the degree of communication disability, prescribing hearing instruments, planning aural rehabilitation and speech therapy, and determining the site of lesions. The Korean Standard Monosyllabic Word Lists for Preschoolers (KS-MWL-P) were developed considering the criteria given in the literature. However, the authors of KS-MWL-P suggested that more children should be included to verify the homogeneity of the lists using psychometric function curves, since only 8 children participated in the development process. The purpose of this study was to explore the homogeneity of KS-MWL-P, supplementing the limitations of the lists through psychometric analysis. The 100 monosyllabic KS-MWL-P words were presented with pictures to 23 preschoolers with normal hearing. Psychometric function curves were obtained and analyzed by computing recognition scores for each monosyllabic word at intensities from -10 to 40 dB HL, with linear slopes calculated between the 20% and 80% correct rates. The resulting psychometric function curves were s-shaped, with the correct rate increasing with intensity, and showed no statistically significant differences among words or lists. The congruent curve shapes across lists also indicated good homogeneity, and the average slopes of lists 1, 2, 3, and 4 were 4.48, 3.86, 4.65, and 4.50, respectively. The homogeneity was verified as suitable because the analysis of variance showed no statistical significance among the lists (p>0.05). However, the ordering of slopes by item range (items 1-10, 1-20, and 1-25) showed no differences, with p-values of 0.93, 0.59, 0.91, and 0.70 for lists 1, 2, 3, and 4, respectively. Although the KS-MWL-P assumed that the lower-numbered items were easier for testing younger ages, the results of this study could not support that assumption. Accordingly, the items should be rearranged according to the slope analysis suggested by this study so that younger children can be tested with easier items. Apart from this issue, in conclusion, the KS-MWL-P proved to be a useful clinical and rehabilitative evaluation and training tool for preschoolers.
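A minimal Python sketch of fitting an s-shaped psychometric function and computing the slope between the 20% and 80% correct points is shown below, using synthetic scores rather than the KS-MWL-P data.

```python
# Fit a logistic psychometric function to correct-rate data and report the
# slope between the 20% and 80% correct points.
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, midpoint, k):
    return 1.0 / (1.0 + np.exp(-k * (x - midpoint)))

levels = np.arange(-10, 45, 5)  # presentation levels in dB HL
correct = np.array([0, .02, .05, .15, .35, .60, .80, .92, .97, .99, 1.0])  # synthetic

params, _ = curve_fit(logistic, levels, correct, p0=[10.0, 0.3])
midpoint, k = params

# Levels giving 20% and 80% correct (logistic inverse), then slope in %/dB.
x20 = midpoint - np.log(4) / k
x80 = midpoint + np.log(4) / k
slope = (0.80 - 0.20) / (x80 - x20) * 100
print(f"50% point: {midpoint:.1f} dB HL, slope 20-80%: {slope:.2f} %/dB")
```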

A Feature -Based Word Spotting for Content-Based Retrieval of Machine-Printed English Document Images (내용기반의 인쇄체 영문 문서 영상 검색을 위한 특징 기반 단어 검색)

  • Jeong, Gyu-Sik;Gwon, Hui-Ung
    • Journal of KIISE:Software and Applications
    • /
    • v.26 no.10
    • /
    • pp.1204-1218
    • /
    • 1999
  • Most existing digital libraries for document image retrieval provide only a limited retrieval service because their indexing is built from document titles and/or abstracts. This paper proposes a word spotting system for full English document image retrieval based on word image shape features. In order to improve both the efficiency and the precision of retrieval, the system 1) uses a combination of the holistic features that have been used in existing word spotting systems, 2) performs image matching by comparing the order of features in a word in addition to the number of features and their positions, and 3) adopts a two-stage retrieval strategy that first obtains retrieval results by image feature matching and then applies OCR (Optical Character Recognition) to part of the results for filtering purposes. The proposed system operates as follows: given a document image, its structure is analyzed and segmented into a set of word regions; then word shape features are extracted and stored. Given a user's text query, the corresponding word image is generated and its features are extracted. These reference features are compared with the stored features to find similar words. The proposed system was implemented on an IBM PC in a web environment, and experiments were performed with English document images. Experimental results show the effectiveness of the proposed methods.
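A simplified Python sketch of the two-stage matching idea follows; the shape-feature extraction and OCR calls are placeholders, and the matching uses a generic sequence-similarity measure rather than the authors' exact distance.

```python
# Stage 1: rank stored word images by similarity of ordered shape-feature
# sequences to the query. Stage 2: filter the shortlist with OCR.
from difflib import SequenceMatcher

def rank_candidates(query_features, stored_words, top_n=20):
    """stored_words: list of (word_id, feature_sequence) pairs, where a
    feature sequence is a string/list of shape-feature codes (ascender,
    descender, dot, hole, ...) in left-to-right order."""
    scored = [
        (word_id, SequenceMatcher(None, query_features, feats).ratio())
        for word_id, feats in stored_words
    ]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:top_n]

def filter_with_ocr(candidates, query_text, ocr_fn):
    # Stage 2: run OCR (placeholder ocr_fn) only on the shortlisted word images.
    return [word_id for word_id, _ in candidates if ocr_fn(word_id) == query_text]
```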

Investigating Opinion Mining Performance by Combining Feature Selection Methods with Word Embedding and BOW (Bag-of-Words) (속성선택방법과 워드임베딩 및 BOW (Bag-of-Words)를 결합한 오피니언 마이닝 성과에 관한 연구)

  • Eo, Kyun Sun;Lee, Kun Chang
    • Journal of Digital Convergence
    • /
    • v.17 no.2
    • /
    • pp.163-170
    • /
    • 2019
  • Over the past decade, the development of the Web has explosively increased the amount of available data, and feature selection is an important step in extracting valuable information from it. This study proposes a novel opinion mining model that combines feature selection (FS) methods with word embedding (Word2vec) and BOW (bag-of-words) representations. The FS methods adopted in this study are CFS (correlation-based feature selection) and IG (information gain). To select an optimal FS method, a number of classifiers were compared, ranging from LR (logistic regression), NN (neural network), and NBN (naive Bayesian network) to RF (random forest), RS (random subspace), and ST (stacking). Empirical results with the electronics and kitchen datasets showed that the LR and ST classifiers combined with IG applied to BOW features yield the best performance in opinion mining. Results with the laptop and restaurant datasets revealed that the RF classifier using IG applied to Word2vec features achieves the best performance in opinion mining.
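As a hedged sketch of one configuration from this abstract (IG feature selection over BOW features followed by logistic regression), the following scikit-learn pipeline could be assembled; mutual information is used here as a stand-in for information gain, dataset loading is assumed, and the CFS and Word2vec branches are omitted for brevity.

```python
# Sketch: BOW features -> information-gain-style feature selection -> LR classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

pipeline = Pipeline([
    ("bow", CountVectorizer(min_df=2)),
    ("ig", SelectKBest(mutual_info_classif, k=1000)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# texts: list of review strings, labels: 0/1 sentiment (e.g. electronics reviews)
# scores = cross_val_score(pipeline, texts, labels, cv=5, scoring="accuracy")
```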