• Title/Summary/Keyword: Stop Words

Automatic Keyword Extraction System for Korean Documents Information Retrieval (국내(國內) 문헌정보(文獻情報) 검색(檢索)을 위한 키워드 자동추출(自動抽出) 시스템 개발(開發))

  • Yae, Yong-Hee
    • Journal of Information Management / v.23 no.1 / pp.39-62 / 1992
  • In this paper, about 60 auxiliary words and 320 stopwords are selected from an analysis of sample data, and stop words are classified into four types: left, right and - auxiliary word truncation, and normal. A keyword extraction system is then proposed that efficiently truncates auxiliary words from words, converts Chinese words to Korean, and excludes stopwords. The keywords selected by this system show a 92.2% agreement ratio with keywords selected manually by an expert. In addition, compound words of $4{\sim}6$ characters generate twice as many additional new words, and 58.8% of those are useful as keywords.

A Comparative Study on Requirements Analysis Techniques using Natural Language Processing and Machine Learning

  • Cho, Byung-Sun;Lee, Seok-Won
    • Journal of the Korea Society of Computer and Information / v.25 no.7 / pp.27-37 / 2020
  • In this paper, we propose a methodology based on a data-driven approach that uses Natural Language Processing and Machine Learning to classify requirements into functional and non-functional requirements. Analysis of the classification results showed that models trained with data preprocessing and a classification algorithm based on the characteristics and information of existing requirements, using term weights based on TF and IDF, outperformed models that used stemming and stop-word removal. This observation also shows that term weights calculated without stemming and stop-word removal influenced the results positively. Furthermore, we investigate an optimized method for classifying software requirements into functional and non-functional requirements.
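
The TF and IDF term weighting mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `tf_idf`, the whitespace tokenizer, the raw-count TF, the `log(N / df)` IDF variant, and the sample requirement sentences are all assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document: raw TF times log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical requirement sentences, invented for illustration only.
requirements = [
    "the system shall export the report as pdf",
    "the system shall respond within two seconds",
    "the user shall reset the password",
]
weights = tf_idf(requirements)
```

With this IDF variant, a term that occurs in every document (here "the" and "shall") receives a weight of zero without any explicit stop-word list, which is one plausible reason weighting alone can do part of the work of stop-word removal.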

Compensation in VC and Word

  • Yun, Il-Sung
    • Phonetics and Speech Sciences / v.2 no.3 / pp.81-89 / 2010
  • Korean and three other languages (English, Arabic, and Japanese) were compared with regard to the compensatory movements in a VC (Vowel and Consonant) sequence and word. For this, Korean data were collected from an experiment and the other languages' data from the literature. All the test words of the languages had the same syllabic structure, i.e., /CVCV(r)/, where C was an oral stop and intervocalic consonants were either bilabial or alveolar stops. The present study found that (1) Korean is most striking in the durational variations of segments (vowel and the following hetero-syllabic consonant); (2) unlike the three languages that show a constant sum of VC, Korean yields a three-way distinction in the length of VC according to the type (lax unaspirated vs. tense unaspirated vs. tense aspirated) of the following stop consonant; (3) a durational constancy is maintained up to the word level in the three languages, but Korean word duration varies as a function of the feature tenseness of the intervocalic consonants; (4) consonant duration is proven to differentiate Korean the most from the other languages. It is suggested that the durational difference between a lax consonant and its tense cognate(s) and the degree of compensation between V and C are determined by the phonology in each language.

A stemming algorithm for a Korean language free-text retrieval system (자연어검색시스템을 위한 스태밍알고리즘의 설계 및 구현)

  • 이효숙
    • Journal of the Korean Society for Information Management / v.14 no.2 / pp.213-234 / 1997
  • A stemming algorithm for a Korean language free-text retrieval system has been designed and implemented. The algorithm contains three major parts and operates iteratively: first, stop-words are removed using a stop-word list; second, a basic removal procedure proceeds with rule table 1, which contains the suffixes, the postpositional particles, and the optionally adopted symbols specifying each stemming action; third, extended stemming and rewriting procedures continue with rule table 2, which is composed of the suffixes and the optionally combined symbols representing various actions depending upon the context-sensitive rules. A test was carried out to obtain an indication of how successful the algorithm was and to identify any minor changes that would enhance it. As a result, 21.4% compression was achieved with an error rate of 15.9%.
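
The three-part, iterative structure described in this abstract can be sketched roughly as follows. This is only an illustration of the control flow, not the published algorithm: the stop-word list, both rule tables, the `MIN_STEM` threshold, and all function names are hypothetical, and English toy rules stand in for the Korean suffix and particle tables, which the abstract does not list.

```python
# All tables below are hypothetical English stand-ins for illustration.
STOP_WORDS = {"the", "a", "of", "and"}        # part 1: stop-word list
RULE_TABLE_1 = ["ing", "ed", "s"]             # part 2: suffixes to strip
RULE_TABLE_2 = [("nn", "n"), ("tt", "t")]     # part 3: end-of-word rewrites
MIN_STEM = 3                                  # shortest stem a strip may leave

def strip_suffix(word):
    """Apply the first matching rule from rule table 1, if any."""
    for suffix in RULE_TABLE_1:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word

def rewrite(word):
    """Apply the first matching rewrite rule from rule table 2, if any."""
    for ending, replacement in RULE_TABLE_2:
        if word.endswith(ending):
            return word[: -len(ending)] + replacement
    return word

def stem(word):
    """Repeat stripping and rewriting until no rule changes the word."""
    while True:
        new_word = rewrite(strip_suffix(word))
        if new_word == word:
            return word
        word = new_word

def stem_text(text):
    """Remove stop words first, then stem the remaining tokens."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [stem(t) for t in tokens]

stems = stem_text("the running of cats and dogs")
```

Iterating to a fixed point keeps the sketch short but can over-strip some words, which is consistent with the abstract reporting a non-zero error rate for rule-based stemming.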

Voice Onset Time of Korean Stops as a Function of Speaking Rate (발화 속도에 따른 한국어 폐쇄음의 VOT 값 변화)

  • Oh, Eun-Jin
    • Phonetics and Speech Sciences / v.1 no.3 / pp.39-48 / 2009
  • Previous studies on the effects of speaking rate on voice onset time (VOT) of stops in English, French, Icelandic, and Thai indicate that speaking rate asymmetrically affects VOT values. That is, pre-voiced and long-lag stops vary due to the rate factor more than short-lag stops do. One suggested explanation for this asymmetry is that it is due to the necessity of maintaining phonetic contrasts among the stop categories. Since pre-voiced and long-lag stops represent the ends of the VOT scale, they encompass broad swathes of that range and consequently allow for large variations. On the other hand, the VOT variations of short-lag stops may result in overlap with the VOTs of long-lag stops. This study aimed to explore the effects of speaking rate on the VOTs of Korean stops and see whether Korean fortis and lenis stops are limited in the degrees of variation as a function of rates due to the existence of stops with larger VOT values, lenis and aspirated stops respectively. Conversely, aspirated stops were expected to show more variation since there are no other categories with longer VOTs. Fortis, lenis, and aspirated stops in /CVn/ words (C = bilabial or velar stop, V = /i/ or /a/) were examined in isolation, and at normal and fast rates in a carrier sentence. Speaking rates were controlled by alternating words or sentences on a computer screen at intervals of two seconds for the isolation- and normal-rate conditions and one second for the fast-rate condition. This study found that while the VOTs of fortis stops did not change significantly, those of lenis and aspirated stops showed considerable changes as a function of speaking rates. Also, overlap between lenis and aspirated stops occurred considerably at all speaking rates. These phenomena were interpreted to relate to the fact that VOT contrasts between lenis and aspirated stops in Korean are currently being collapsed. 
    Large variations of lenis stops as a function of rates seem to occur due to a weak motivation to limit the degree of variations for the purpose of maintaining phonetic contrasts. The significant overlap between lenis and aspirated stops at all rates was interpreted to occur because the VOT merger between the two categories became considerably fixed. Also, the percentage of correctly-classified VOTs by optimal-boundary values between lenis and aspirated stops turned out to be lower than in previously-studied languages. This was interpreted to be further evidence that VOTs are losing their role in contrasting the two stop categories in Korean.

A Study on Unstructured text data Post-processing Methodology using Stopword Thesaurus (불용어 시소러스를 이용한 비정형 텍스트 데이터 후처리 방법론에 관한 연구)

  • Won-Jo Lee
    • The Journal of the Convergence on Culture Technology / v.9 no.6 / pp.935-940 / 2023
  • Most text data collected through web scraping for artificial intelligence and big data analysis is large and unstructured, so a cleaning process is required before analysis. The data become structured and analyzable through a heuristic pre-processing refining step and a post-processing machine refining step. In this study, the post-processing machine refining step uses a Korean dictionary and a stopword dictionary to extract vocabulary for the frequency analysis behind a word cloud. In this process, "user-defined stopwords" are used to efficiently remove stopwords that the dictionary did not remove. We propose a methodology for applying this "thesaurus" and, through a case analysis with R's word cloud technique, examine the pros and cons of the "user-defined stop word thesaurus" technique, which was proposed to complement the problems of the existing "stop word dictionary" method. We present a comparative verification and suggest that the proposed methodology is effective in practical application.
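
The idea of supplementing a fixed stop-word dictionary with user-defined stopwords before counting word frequencies can be sketched as below. The study itself works in R; this Python sketch only mirrors the idea. The two stop-word sets, the function name `word_frequencies`, and the sample texts are hypothetical and not taken from the paper.

```python
from collections import Counter

# Hypothetical lists: a small stand-in for a stop-word dictionary, plus
# user-defined additions for scraping residue the dictionary misses.
BASE_STOPWORDS = {"the", "and", "of", "to"}
USER_STOPWORDS = {"click", "share", "login"}

def word_frequencies(texts, extra_stopwords=frozenset()):
    """Count tokens across texts, skipping base and extra stop words."""
    stop = BASE_STOPWORDS | set(extra_stopwords)
    counts = Counter()
    for text in texts:
        counts.update(t for t in text.lower().split() if t not in stop)
    return counts

# Invented sample of scraped text, for illustration only.
scraped = [
    "click to share the report",
    "login and share the data report",
]
before = word_frequencies(scraped)
after = word_frequencies(scraped, USER_STOPWORDS)
```

Comparing `before` and `after` shows which high-frequency tokens would otherwise dominate a word cloud; the resulting counts could then be handed to any word cloud renderer.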

An Experimental Study of Korean Dialectal Speech (한국어 방언 음성의 실험적 연구)

  • Kim, Hyun-Gi;Choi, Young-Sook;Kim, Deok-Su
    • Speech Sciences / v.13 no.3 / pp.49-65 / 2006
  • Recently, several theories of digital speech signal processing have drastically expanded the communication boundary between human beings and machines. The aim of this study is to collect dialectal speech in Korea on a large scale and to establish a digital speech database for further research on Korean dialects and the creation of a value-added network. 528 informants across the country participated in this study. Acoustic characteristics of vowels and consonants were analyzed using the power spectrum and spectrogram of CSL. Test words were presented on picture cards and letter cards that contained each vowel and each consonant in the initial position of words. Formants were plotted on a vowel chart, and transitions of diphthongs were compared across dialects. Spectral times, VOT, VD, and TD were measured on a spectrogram for stop consonants, and fricative frequency, intensity, and lateral formants (LF1, LF2, LF3) for fricative consonants. Nasal formants (NF1, NF2, NF3) were analyzed for different nasalities of nasal consonants. The acoustic characteristics of dialectal speech showed that younger-generation speakers did not distinguish between close-mid /e/ and open-mid $/\epsilon/$. The diphthongs /we/ and /wj/ were realized as simple vowels or diphthongs depending on the dialect. The sibilant /s/ showed aspiration preceding the fricative noise. Lateral /l/ was realized as the variant /r/ in Kyungsang dialectal speech. The duration of nasal consonants in Chungchong dialectal speech was the longest among the dialects.

Text Network Analysis of Newspaper Articles on Life-sustaining Treatments (연명의료 관련 신문 기사의 텍스트네트워크분석)

  • Park, Eun-Jun;Ahn, Dae Woong;Park, Chan Sook
    • Research in Community and Public Health Nursing / v.29 no.2 / pp.244-256 / 2018
  • Purpose: This study tried to understand discourses of life-sustaining treatments in general daily and healthcare newspapers. Methods: A text-network analysis was conducted using the NetMiner program. Firstly, 572 articles from 11 daily newspapers and 258 articles from 8 healthcare newspapers were collected, which were published from August 2013 to October 2016. Secondly, keywords (semantic morphemes) were extracted from the articles and rearranged by removing stop-words, refining similar words, excluding non-relevant words, and defining meaningful phrases. Finally, co-occurrence matrices of the keywords with a frequency of 30 times or higher were developed, and statistical measures (indices of degree and betweenness centrality, ego-networks, and clustering) were obtained. Results: In the general daily and healthcare newspapers, the top eight core keywords were common: "patients," "death," "LST (life-sustaining treatments)," "hospice palliative care," "hospitals," "family," "opinion," and "withdrawal." There were also common subtopics shared by the general daily and healthcare newspapers: withdrawal of LST, hospice palliative care, National Bioethics Review Committee, and self-determination and proxy decision of patients and family. Additionally, the general daily newspapers included diverse social interests or events like well-dying, euthanasia, and the death of farmer Baek Nam-ki, whereas the healthcare newspapers discussed problems of the relevant laws, and insufficient infrastructure and low reimbursement for hospice-palliative care. Conclusion: The discourse that withdrawal of futile LST should be allowed according to the patient's will was consistent in the newspapers. Given that newspaper articles influence knowledge and attitudes of the public, RNs are recommended to participate actively in public communication on LST.
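
The step of building a keyword co-occurrence matrix above a frequency threshold and then computing degree centrality can be sketched in plain Python. The study used the NetMiner program; nothing here reflects its internals. The function names, the choice to count frequency as the number of articles containing a keyword, the threshold of 2 (the study used 30), and the toy keyword lists are all assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs, min_freq):
    """Count keyword pairs appearing in the same document, keeping only
    keywords found in at least min_freq documents (an assumed definition)."""
    freq = Counter(k for doc in docs for k in set(doc))
    kept = {k for k, n in freq.items() if n >= min_freq}
    pairs = Counter()
    for doc in docs:
        present = sorted(set(doc) & kept)
        pairs.update(combinations(present, 2))
    return pairs

def degree(pairs):
    """Unnormalized degree centrality: number of distinct neighbours."""
    neighbours = {}
    for a, b in pairs:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return {k: len(v) for k, v in neighbours.items()}

# Toy keyword lists per article, invented for illustration only.
articles = [
    ["patients", "death", "hospice"],
    ["patients", "family", "withdrawal"],
    ["patients", "death", "family"],
    ["death", "opinion"],
]
pairs = cooccurrence(articles, min_freq=2)
centrality = degree(pairs)
```

The sparse pair counter plays the role of the co-occurrence matrix; keywords below the threshold never enter the network, so they have no centrality score at all.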

Acoustic Evidence for the Development of Aspiration Feature in Putonghua Stops

  • Han, Ji-Yeon
    • Speech Sciences / v.12 no.3 / pp.201-209 / 2005
  • This study investigated developmental temporal features in Putonghua-speaking children. A total of 212 children in Shanghai between the ages of 2;6 and 6;5 participated. Speech materials were constructed according to the aspiration feature of Putonghua stop sounds, and six words were selected. Voice onset time (VOT) was measured. Non-parametric procedures were employed for all the analyses. VOT values across bilabial, alveolar, and velar stops differed significantly between aspirated and unaspirated stops for each age group. The effect of age was significant for unaspirated stops. Each of the Putonghua stops clearly showed a decreasing mean and standard deviation. The overshoot phenomenon of VOT was apparent from the age of 2;6-2;11 to 4;6-4;11. There was high variability in the production of lag time for aspirated stops.

The Use of Persona Based Scenario Method for the Development of Web Board Game for the Pre-elderly

  • Seo, Mi-Ra;Kim, Ae-Kyung
    • International Journal of Contents / v.10 no.2 / pp.37-41 / 2014
  • This study defined the pre-elderly as middle-aged people from 50 to 59. Because it is difficult to produce a design that satisfies the pre-elderly without deeply understanding them, their financial and physical characteristics and the persona-based scenario method were studied. An experimental study of the persona-based scenario method was conducted, and the types of personas found were as follows: 1) Users enjoy the same games online and offline. 2) Users enjoy playing alone on the computer. 3) Users prefer games that end quickly with a win or a loss. By writing a situation scenario for each type, the problems and needs of the pre-elderly that occur while they play web board games were obtained. The obtained user requests were as follows: users would like the level of difficulty to be simpler in the game of baduk; users wanted unlimited credit and less use of English words in Go-Stop; and there were simple comments about game screen design.