• Title/Summary/Keyword: Sentence Similarity

Subject-Balanced Intelligent Text Summarization Scheme (주제 균형 지능형 텍스트 요약 기법)

  • Yun, Yeoil;Ko, Eunjung;Kim, Namgyu
    • Journal of Intelligence and Information Systems / v.25 no.2 / pp.141-166 / 2019
  • Channels such as social media and social networking services (SNS) now generate enormous amounts of data, and the share of unstructured text data in particular has grown geometrically. Because it is impractical to read all of this text, it is important to access it quickly and grasp its key points. To meet this need, many text summarization studies have been proposed. In particular, many recent methods use machine learning and artificial intelligence algorithms to generate summaries objectively and effectively, an approach called "automatic summarization". However, almost all summarization methods proposed to date build the summary around the frequency of contents in the original documents. Such summaries tend to omit low-weight subjects that are mentioned less often in the original text. If a summary covers only the major subjects, it is biased and loses information, so it becomes hard to ascertain every subject the documents contain. One way to avoid this bias is to summarize with the balance between topics in mind, so that every subject can be identified; even so, the distribution across subjects can remain unbalanced. To retain the balance of subjects in a summary, it is necessary to consider the proportion of every subject in the original documents and also to allocate portions to subjects evenly, so that sentences on minor subjects are sufficiently included. In this study, we propose a "subject-balanced" text summarization method that secures balance among all subjects and minimizes the omission of low-frequency subjects. For subject-balanced summarization, we use two summary evaluation metrics, "completeness" and "succinctness". Completeness means that the summary fully covers the contents of the original documents, and succinctness means that the summary has minimal duplication within itself.
The proposed method has three phases. The first phase constructs subject term dictionaries. Topic modeling is used to calculate topic-term weights, which indicate how strongly each term is related to each topic. From these weights, the terms most related to each topic can be identified, and the subjects of the documents can be found from topics composed of terms with similar meanings. A few terms that represent each subject well are then selected; in this method they are called "seed terms". Because the seed terms are too few to describe each subject adequately, enough terms similar to them are needed for a well-constructed subject dictionary. Word2Vec is used for this word expansion: word vectors are created by Word2Vec modeling, and the similarity between any two terms is derived from those vectors using cosine similarity. The higher the cosine similarity between two terms, the stronger the relationship between them is taken to be. Terms with high similarity to the seed terms of each subject are selected, and after filtering these expanded terms the subject dictionary is finally constructed. The second phase allocates a subject to every sentence in the original documents. To grasp the contents of all sentences, frequency analysis is first conducted with the terms in the subject dictionaries. The TF-IDF weight of each subject is then calculated, which shows how much each sentence says about each subject. However, TF-IDF weights have no upper bound, so the weights of every subject in the sentences are normalized to values between 0 and 1. Each sentence is then allocated to the subject with the maximum TF-IDF weight, and a sentence group is constructed for each subject.
The last phase generates the summary. Sen2Vec is used to measure the similarity between subject sentences, and a similarity matrix is formed. By repeatedly selecting sentences, it is possible to generate a summary that fully includes the contents of the original documents while minimizing duplication within the summary itself. For evaluation, 50,000 TripAdvisor reviews were used to construct the subject dictionaries and 23,087 reviews were used to generate summaries. A comparison between summaries from the proposed method and frequency-based summaries verified that the proposed method better retains the balance of all subjects that the documents originally have.
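
The similarity and subject-allocation steps described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the TF-IDF weights are normalized, so min-max scaling per subject is an assumption here, and the subject names and weights are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def min_max_normalize(values):
    """Rescale values to [0, 1]; min-max scaling is an assumption of this sketch."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def allocate_subjects(tfidf_by_subject):
    """Assign each sentence to the subject with the largest normalized weight.

    tfidf_by_subject: dict of subject name -> list of raw TF-IDF weights,
    one weight per sentence (all lists have the same length).
    """
    normalized = {s: min_max_normalize(w) for s, w in tfidf_by_subject.items()}
    n_sentences = len(next(iter(normalized.values())))
    return [
        max(normalized, key=lambda s: normalized[s][i])
        for i in range(n_sentences)
    ]


# Hypothetical weights for three sentences and two subjects.
weights = {"room": [0.0, 4.0, 2.0], "food": [3.0, 0.0, 3.0]}
print(allocate_subjects(weights))
```

When two subjects tie for a sentence, `max` returns the first subject in dictionary order; a real system would need an explicit tie-breaking rule.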

An Analysis Method of User Preference by using Web Usage Data in User Device (사용자 기기에서 이용한 웹 데이터 분석을 통한 사용자 취향 분석 방법)

  • Lee, Seung-Hwa;Choi, Hyoung-Kee;Lee, Eun-Seok
    • Journal of KIISE:Computing Practices and Letters / v.15 no.3 / pp.189-199 / 2009
  • The amount of information on the Web is growing explosively as the Internet gains popularity. However, only a small portion of that information is truly relevant or useful to a given user, so offering suitable information according to user demand is an important topic in information retrieval. In e-commerce, a recommender system is essential for stimulating commercial transactions and raising user satisfaction and loyalty toward the information provider. Existing recommender systems are mostly based on user data collected at servers, so user data are dispersed over several servers. As a result, web servers that lack sufficient user behavior data cannot easily infer user preferences, and if the user visits a server infrequently, it is hard for that server to reflect the user's dynamically changing interests. This paper proposes a novel personalization system that analyzes user preference based on the web documents the user accesses on the user's own device. The system also identifies non-content blocks that appear repeatedly in dynamically generated web documents, and adds weight to keywords extracted from the hyperlink sentences the user selects. The system can therefore establish recommendation strategies at an early stage for web servers that have little user data, and user profiles are generated more rapidly and accurately by identifying the information blocks. To evaluate the proposed system, we collected web data and purchase histories from users with current purchase activity and computed the similarity between the purchase data and the user profile. The accuracy of the generated user profile was confirmed, since web pages containing purchased items showed higher correlation with the profile than other item pages.

A Method for Spelling Error Correction in Korean Using a Hangul Edit Distance Algorithm (한글 편집거리 알고리즘을 이용한 한국어 철자오류 교정방법)

  • Bak, Seung Hyeon;Lee, Eun Ji;Kim, Pan Koo
    • Smart Media Journal / v.6 no.1 / pp.16-21 / 2017
  • A long time has passed since computers, once a tool for research, were commercialized and became available to the general public. Before then people wrote with writing instruments, but today a growing number write on computers instead. Computerized word processing is faster and causes less hand fatigue than writing by hand, which makes it better suited to long texts. However, word processing is also more prone to spelling errors caused by user mistakes. Many spelling errors distort the shape of a word and are therefore easy for the writer to find and correct directly, but errors caused by the user's lack of knowledge, or errors that are hard to spot, can make it almost impossible to produce a document free of spelling errors. Spelling errors in important documents such as theses or business proposals can reduce their credibility. Consequently, research is needed on high-quality spelling correction programs for the general public. This study was designed to build a system that corrects sentence-level spelling errors into correct words using a Korean alphabet (Hangul) similarity algorithm. Based on findings in the related literature that corrected words are significantly similar in form to their misspelled counterparts, spelling errors and their corrections were extracted from a corpus. Misspelled words found by a spelling error detection algorithm were then replaced with the extracted corrected words.
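
One common way to build an edit distance for Hangul is to decompose each precomposed syllable into its initial, medial and final letters (jamo) and run a standard Levenshtein distance over the resulting sequence. The sketch below follows that idea; the abstract does not give the paper's exact algorithm or edit costs, so unit costs and the function names are assumptions.

```python
HANGUL_BASE = 0xAC00      # code point of the first precomposed syllable
SYLLABLE_COUNT = 11172    # 19 initials * 21 medials * 28 finals


def decompose(text):
    """Split Hangul syllables into (position, index) jamo units.

    Non-Hangul characters are kept as single units.
    """
    units = []
    for ch in text:
        code = ord(ch) - HANGUL_BASE
        if 0 <= code < SYLLABLE_COUNT:
            initial, rest = divmod(code, 21 * 28)
            medial, final = divmod(rest, 28)
            units.append(("initial", initial))
            units.append(("medial", medial))
            if final:  # index 0 means the syllable has no final consonant
                units.append(("final", final))
        else:
            units.append(("char", ch))
    return units


def edit_distance(a, b):
    """Levenshtein distance between two sequences with unit costs."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (x != y),  # substitution or match
            ))
        previous = current
    return previous[-1]


def hangul_edit_distance(s, t):
    """Edit distance measured in jamo rather than whole syllables."""
    return edit_distance(decompose(s), decompose(t))


print(hangul_edit_distance("한글", "한굴"))
```

Measuring distance in jamo lets a misspelling that changes one letter inside a syllable count as a single edit, while a syllable in which all three letters differ counts as three.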

Validity and Reliability of a Korean Version of Nurse Clinical Reasoning Competence Scale (한국어판 간호사 임상적 추론 역량 척도의 타당도와 신뢰도)

  • Joung, Jaewon;Han, Jeong Won
    • Journal of the Korea Academia-Industrial cooperation Society / v.18 no.4 / pp.304-310 / 2017
  • This methodological study tests the validity and reliability of a Korean version of the Nurse Clinical Reasoning Competence scale (NCRC), an instrument developed by Liou and colleagues, in order to provide basic data for enhancing the clinical reasoning competence of nurses. The scale was translated into Korean, and the similarity of sentence structure and meaning between the two versions was checked. Validity and reliability were verified with data from 166 nurses working in four tertiary hospitals located in Seoul and Busan. Expert review showed that every item had a content validity index (CVI) above 0.8. Exploratory and confirmatory factor analyses showed that the instrument comprises 15 items loading on a single factor. Concurrent validity was tested by correlating the Korean version of the NCRC with measures of nurses' critical thinking disposition and clinical decision-making ability (correlation coefficients = .55-.64, p < .001), and Cronbach's α was .93. Thus, the Korean version of the NCRC may be a useful instrument for evaluating the clinical reasoning competence of Korean nurses and for providing basic data to assess that competence and develop strategies to promote it.
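
For readers unfamiliar with the reliability statistic reported above, the following sketch computes Cronbach's α from a respondents-by-items score table using the standard formula α = k/(k-1) · (1 - Σ item variances / variance of total scores). The function name and the sample scores are illustrative assumptions, not data from the study.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in rows])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical scores: three respondents, two items.
scores = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]]
print(round(cronbach_alpha(scores), 3))
```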

Automatic Classification and Vocabulary Analysis of Political Bias in News Articles by Using Subword Tokenization (부분 단어 토큰화 기법을 이용한 뉴스 기사 정치적 편향성 자동 분류 및 어휘 분석)

  • Cho, Dan Bi;Lee, Hyun Young;Jung, Won Sup;Kang, Seung Shik
    • KIPS Transactions on Software and Data Engineering / v.10 no.1 / pp.1-8 / 2021
  • News articles in the political field show polarized and biased characteristics, such as conservative and liberal leanings; this is called political bias. We constructed a keyword-based dataset to classify the bias of news articles. Most embedding research represents a sentence as a sequence of morphemes. In our work, we expected that the number of unknown tokens would be reduced if sentences were instead composed of subwords segmented by a language model. We propose a document embedding model with subword tokenization and apply it to an SVM and a feedforward neural network to classify political bias. Compared with the document embedding model based on morphological analysis, the document embedding model with subwords showed the highest accuracy, 78.22%, and it was confirmed that subword tokenization reduced the number of unknown tokens. Using the best-performing embedding model in our bias classification task, we extracted keywords based on politicians. The bias of the keywords was verified by their average similarity to the vectors of politicians from each political tendency.

Analysis of the Continuity of Reading Passages in the 5th and 6th Grade Elementary School English Textbooks Based on Readability (이독성을 통한 초등학교 5, 6학년 영어 교과서 읽기 지문의 연계성 분석)

  • Jang, Hankyeol;Lee, Je-Young
    • The Journal of the Korea Contents Association / v.22 no.6 / pp.116-124 / 2022
  • The purpose of this study is to examine vertical continuity between grades and horizontal continuity between publishers by analyzing the readability of reading passages in English textbooks for the 5th and 6th grades of elementary school. To do so, a corpus was constructed from the reading passages in 10 textbooks, and the passages in each textbook were analyzed with Coh-Metrix. One-way ANOVA was used to examine whether readability differed significantly between grades and between publishers. The results are as follows. First, in the comparison of publishers within the same grade, there was a statistically significant difference between fifth-grade textbooks in the L2 readability index. Second, in the analysis of vertical continuity between grades within each publisher, textbook A was significantly more difficult in grade 6 than in grade 5 according to Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL). In contrast, when L2 readability was used as the standard, textbook B was less difficult in 6th grade than in 5th grade. This difference seems to arise because FRE and FKGL calculate readability from sentence and word length, whereas L2 readability is based on content word overlap, word frequency, and syntactic similarity of sentences.
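
The two length-based indices mentioned above have well-known closed forms, sketched here from pre-computed counts. Syllable counting itself is heuristic and tool-dependent, so the counts are taken as inputs; the sample numbers are hypothetical, and the L2 readability index computed by Coh-Metrix is not reproduced here.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical passage: 100 words, 10 sentences, 150 syllables.
print(round(flesch_reading_ease(100, 10, 150), 3))
print(round(flesch_kincaid_grade(100, 10, 150), 2))
```

Both formulas depend only on average sentence length and average syllables per word, which is why they can disagree with an index built on word overlap, word frequency and syntactic similarity.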

Artificial Intelligence for Assistance of Facial Expression Practice Using Emotion Classification (감정 분류를 이용한 표정 연습 보조 인공지능)

  • Kim, Dong-Kyu;Lee, So Hwa;Bong, Jae Hwan
    • The Journal of the Korea institute of electronic communication sciences / v.17 no.6 / pp.1137-1144 / 2022
  • In this study, an artificial intelligence (AI) was developed to assist with practicing facial expressions that convey emotions. The AI feeds multimodal inputs, consisting of sentences and facial images, to deep neural networks (DNNs). The DNNs calculate the similarity between the emotion predicted from a sentence and the emotion predicted from a facial image. The user practices facial expressions for the situation described by a sentence, and the AI gives numerical feedback based on that similarity. A ResNet34 network was trained on the public FER2013 data to predict emotions from facial images. To predict emotions from sentences, a KoBERT model was trained by transfer learning on the conversational speech dataset for emotion classification released to the public by AIHub. The DNN that predicts emotions from facial images achieved 65% accuracy, which is comparable to human emotion classification ability, and the DNN that predicts emotions from sentences achieved 90% accuracy. The performance of the developed AI was evaluated through experiments in which an ordinary person participated and changed facial expressions.

Development of Block-based Code Generation and Recommendation Model Using Natural Language Processing Model (자연어 처리 모델을 활용한 블록 코드 생성 및 추천 모델 개발)

  • Jeon, In-seong;Song, Ki-Sang
    • Journal of The Korean Association of Information Education / v.26 no.3 / pp.197-207 / 2022
  • In this paper, we develop a machine learning based block code generation and recommendation model to reduce the cognitive load of learners during coding education. Using a natural language processing model and fine-tuning, the model learns the blocks a learner has built in a block programming environment and then generates and recommends selectable blocks for the next step. To develop the model, a training dataset was produced by pre-processing 50 block codes from the popular block programming language website 'Entry'. After dividing the pre-processed blocks into training, validation and test datasets, we developed models that generate block codes based on LSTM, Seq2Seq, and GPT-2. In the performance evaluation, GPT-2 outperformed the LSTM and Seq2Seq models on the BLEU and ROUGE scores, which measure sentence similarity. The output generated by the GPT-2 model showed relatively similar BLEU and ROUGE scores across cases, except when the number of blocks was 1 or 17.
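
BLEU and ROUGE both rest on counting n-grams shared between a generated sequence and a reference. The sketch below shows only the simplest unigram case, clipped precision (the core of BLEU-1) and recall (ROUGE-1 recall); full BLEU additionally combines higher-order n-grams and a brevity penalty. The block names are hypothetical and this is not the evaluation code used in the paper.

```python
from collections import Counter


def unigram_overlap(candidate, reference):
    """Return (clipped unigram precision, unigram recall) for two token lists."""
    cand_counts, ref_counts = Counter(candidate), Counter(reference)
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / len(candidate) if candidate else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical block sequences.
generated = ["move", "say", "say"]
expected = ["move", "turn", "say"]
print(unigram_overlap(generated, expected))
```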

Automatic Quality Evaluation with Completeness and Succinctness for Text Summarization (완전성과 간결성을 고려한 텍스트 요약 품질의 자동 평가 기법)

  • Ko, Eunjung;Kim, Namgyu
    • Journal of Intelligence and Information Systems / v.24 no.2 / pp.125-148 / 2018
  • As the demand for big data analysis increases, cases of analyzing unstructured data and using the results are also increasing. Among the various types of unstructured data, text is used as a means of communicating information in almost all fields. Many analysts are also interested in text because the amount of data is very large and it is relatively easy to collect compared with other unstructured and structured data. Among text analysis applications, the following have been actively studied: document classification, which classifies documents into predetermined categories; topic modeling, which extracts major topics from a large number of documents; sentiment analysis or opinion mining, which identifies emotions or opinions contained in texts; and text summarization, which summarizes the main contents of one or several documents. Text summarization in particular is actively applied in business through news summary services, privacy policy summary services, etc. In academia, much research has followed either the extractive approach, which selects the main elements of the document, or the abstractive approach, which extracts elements of the document and composes new sentences by combining them. However, techniques for evaluating the quality of automatically summarized documents have not progressed as much as the summarization techniques themselves. Most existing studies on summary quality evaluation manually summarize documents, use those summaries as reference documents, and measure the similarity between the automatic summary and the reference document. Specifically, an automatic summary is produced from the full text by various techniques, and its quality is measured by comparison with the reference document, which is regarded as an ideal summary.
Reference documents are provided in two major ways. The most common is manual summarization, in which a person creates an ideal summary by hand. Because this requires human intervention, preparing the summary takes a lot of time and cost, and the evaluation result may vary with the subjectivity of the summarizer. To overcome these limitations, attempts have been made to measure the quality of summary documents without human intervention. As a representative attempt, a method has recently been devised that reduces the size of the full text and measures the similarity between the reduced full text and the automatic summary. In this method, the more often the frequent terms of the full text appear in the summary, the better the quality of the summary is judged to be. However, since summarization essentially means condensing a large amount of content while minimizing omissions, it is unreasonable to say that a summary judged "good" on frequency alone is always a "good summary" in this essential sense. To overcome this limitation of previous work, this study proposes an automatic quality evaluation method for text summarization based on the essential meaning of summarization. Specifically, succinctness is defined as an element indicating how little content is duplicated among the sentences of the summary, and completeness is defined as an element indicating how little of the original content is missing from the summary. The proposed evaluation method is built on these two concepts.
To evaluate the practical applicability of the proposed methodology, 29,671 sentences were extracted from TripAdvisor's hotel reviews, the reviews were summarized for each hotel, and the quality of the summaries was evaluated with the proposed methodology. The study also provides a way to integrate completeness and succinctness, which are in a trade-off relationship, into an F-score, and proposes a method for performing optimal summarization by changing the sentence similarity threshold.
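
The combination step at the end of this abstract can be sketched as follows. The abstract does not give the exact formula, so this assumes the usual F-score form, the harmonic mean of the two values, and it takes completeness and succinctness as already-computed numbers in [0, 1]. The threshold table is hypothetical.

```python
def f_score(completeness, succinctness):
    """Harmonic mean of completeness and succinctness (assumed F-score form)."""
    if completeness + succinctness == 0:
        return 0.0
    return 2 * completeness * succinctness / (completeness + succinctness)


def best_threshold(candidates):
    """Pick the similarity threshold whose summary has the highest F-score.

    candidates: iterable of (threshold, completeness, succinctness) tuples.
    """
    return max(candidates, key=lambda c: f_score(c[1], c[2]))[0]


# Hypothetical measurements at three sentence similarity thresholds.
measurements = [(0.3, 0.9, 0.4), (0.5, 0.7, 0.7), (0.7, 0.4, 0.9)]
print(best_threshold(measurements))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a summary scores well only when it is both complete and succinct, which fits the trade-off the abstract describes.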

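A Study of the Doctrine on Five Elements' Motion and Six Kinds of Natural Factors (運氣學說) through Wang Bing's Commentary on the Seven Great Chapters of the Yellow Emperor's Internal Classic Su Wen (『황제내경소문(黃帝內經素問)·칠편대론(七篇大論)』 왕빙 주본(注本)을 통(通)한 운기학설(運氣學說) 관(關)한 연구(硏究))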

  • Kim, Gi-Uk;Park, Hyeon-Guk
    • The Journal of Dong Guk Oriental Medicine / v.4 / pp.109-140 / 1995
  • Our investigation of the 'Doctrine on five elements' motion and six kinds of natural factors(運氣學說)' through 'Wang Bing's Commentary(王氷 注本)' on 'The seven great chapters in The Yellow Emperor's Internal Classic Su Wen'("黃帝內經素問 七篇大論") yields the following findings.
(1) Regarding the theory that Wang Bing supplemented 'The seven great chapters("七篇大論")' and his scholarship as an interpreter: judging from the character 'forget(亡)' used in 'The missing chapters("素問遺篇")', 'Bonbyung-ron("本病論")' and 'Jabeob-ron("刺法論")', 'The seven great chapters("七篇大論")' must be a supplementary work by Wang Bing. In addition, he quoted some forty books in the commentary, including medical, Taoist, Confucian and miscellaneous works, and passages quoted from the 'Su Wen(素問)' and 'Ling Shu("靈樞")' account for most of the citations. Following the method of interpreting scripture by scripture, he re-edited the order of the 'Internal Classic("內經")' handed down from ancient times, and he wrote his commentary with a rigorous scholarly attitude, observing natural phenomena in practice and recording pathology and methods of treatment. We found that the book incorporates the study of the 'Doctrine on five elements motion and six kinds of natural factors(運氣學說)'.
(2) Comparing and analyzing similar phrases in 'The seven great chapters in The Yellow Emperor's Internal Classic Su Wen'("黃帝內經素問ㆍ七篇大論") through 'Wang Bing's Commentary(王氷 注本)', we find that 'The seven great chapters("七篇大論")' give a more organized account of the 'five elements(五行)' and 'heaven's regular movement(天道運行)' than 'Emyangengsangdae-ron("陰陽應象大論")'. In the 'Ohanunhangdae-ron("五運行大論")', the sentences repeated from 'Emyangengsangdae-ron("陰陽應象大論")' are long and are therefore omitted. In the 'Youkmijidae-ron("六微旨大論")', the 'Cheonjin ideology(天眞思想)' based on the 'Sanggocheonjin-ron("上古天眞論")' and 'Sagijosindae-ron("四氣調神大論")' is set out. In the 'Gigoupyondae-ron("氣交變大論")', syndromes and symptoms are explained in more detail than in 'Janggibeobsi-ron("藏氣法時論")' and 'Okgijinjang-ron("玉機眞藏論")'. In the 'Osangieongdae-ron("五常政大論")', the concept of the 'five elements(五行)' from the 'Gemgwejineon-ron("金櫃眞言論")' is expanded into 'the five elements' motion concept(五運槪念)'. In the 'Youkwonjeonggidae-ron("六元正紀大論")', the functions of 'The five elements' motion and six kinds of natural factors(運氣)' are the main topic, and systematic pathology is less developed than in 'Emyangengsangdae-ron("陰陽應象大論")'. In the 'Jijinyodae-ron("至眞要大論")', the changes of atmosphere that correspond to treatment principles are explained through the more advanced concept of 'The three Yin and Yang(三陰三陽)'. There is therefore much similarity between the phrases of 'Emyangengsangdae-ron("陰陽應象大論")' and the 'chapters of addition(補缺之篇)'. Overall, the view that 'The seven great chapters("七篇大論")' were added by Wang Bing(王氷) is supported, because they contain more profound concepts than the other chapters.
(3) Examining Wang Bing's(王氷) 'Pattern on five elements motion and six kinds of natural factors(運氣格局)' in 'The seven great chapters("七篇大論")': in the 'Cheonwongi-dae-ron("天元紀大論")', together with the 'Cheonjin ideology(天眞思想)', the concepts of 'Owang(旺)'·'Sang(相)'·'Sa(死)'·'Su(囚)'·'Hu(休)' and 'Cheonbu(天符)'·'Sehwoi(歲會)' are assessed in terms of time and space against the concept of 'Three Sum(三合)', and the concept of 'Taeulcheonbu(太乙天符)' is explained. In the 'Ounhangdae-ron("五運行大論")', 'The calendar Signs five Sum(天干五合)' is compared to the concepts of 'couples(夫婦)' and 'weak-strong(柔强)'. In the 'Youkmijidae-ron("六微旨大論")', 'the relationship of obedience and disobedience(順逆關係)', which follows changes in 'energy status(氣位)' and the 'monarch-minister(君相)' position, is mentioned. In the 'Gikyobyeondae-ron("氣交變大論")', the concepts of 'Sang-duk(相得)' and 'Pyungsang(平常)' are emphasized but concrete measurement is mentioned. In the 'Osangieongdae-ron("五常政大論")', a detailed explanation of twenty-three forms of the 'system of the five elements' motion(五運體系)' is given, and 'routine and contrary treatment(正治·反治)' with 'chill-fever-warm-cold(寒·熱·溫·凉)' is described according to 'analyzing and differentiating pathological conditions in accordance with the eight principal syndromes(八綱辨證)'. In the 'Youkwonjeonggidae-ron("六元正紀大論")', Wang Bing does not mention the concept of 'Jungwun(中運)' that appears in the original classic. Since the concepts of 'Jungwun, Dongcheonbu, Dongsehae and Taeulcheonbu(中運, 同天符, 同歲會, 太乙天符)' appear in the new corrected edition, Wang Bing seems to have used only the concepts of 'Daewun, Juwun, and Gaekwun(大運, 主運, 客運)'. In the 'Jijinyodaeron("至眞要大論")', Wang Bing added detailed commentary on pathology and treatment doctrine by explaining the numerous manifestations of 'Sebo, sufficiency, deficiency(歲步, 有餘, 不足)', and regarding the relation of 'victory-defeat(勝復)' he argued clearly that it is not a matter of mechanical estimation.
(4) Turning to Wang Bing's originality in the study of the 'Doctrine on five elements' motion and six kinds of natural factors(運氣學說)': by adding 'Wang Bing's Commentary(王氷 注本)' to 'The seven great chapters("七篇大論")', he emphasized 'The idea of Jeongindogi and Health preserving(全眞導氣·養生思想)', explained the doctrine(運氣學說) clearly, simplified and expanded the meaning of 'man, as a microcosm, is connected with the macrocosm(天人相應)', and used 'Atmosphere theory(大氣論)' to explain the meaning of the 'rising and falling mechanism(升降氣機)'. Commenting on the sentence 'By examining the pathology, take care of your health(審察病機 無失氣宜)', he explained the pathology of 'heart-kidney-water-fire(心腎水火)' and suggested the doctrine and management of prescription. In diagnosis and treatment, he proposed the two-way assessment of 'asthenia and sthenia(虛實)', and demonstrated 'contrary treatment(反治)', the treatment principle of 'lowering heart fire and tonifying kidney water(降心火益腎水)', and the 'two classes of chill and fever(寒熱二綱)'. The sentence 'There are inside and outside in the illness and so inner and outer in the treatment(病有中外 治有表裏)' likewise suggests the 'two classes of superficies and interior(表裏二綱)' according to the position of the disease. Wang Bing was therefore an excellent theorist who introduced the 'Cheonjin ideology(天眞思想)' and, as a clinician, put medical science into practice. With these accomplishments, written mainly in the 'Doctrine on five elements' motion and six kinds of natural factors(運氣學說)' of 'The seven great chapters("七篇大論")', he interpreted the ancient medical scriptures, expanded their meaning, and decisively contributed to the development of 'Korean Oriental Medicine(韓醫學)'.
