• Title/Abstract/Keyword: Corpus Frequency

Search results: 166

R&D Perspective Social Issue Packaging using Text Analysis

  • Wong, William Xiu Shun; Kim, Namgyu
    • 한국IT서비스학회지 / Vol. 15, No. 3 / pp. 71-95 / 2016
  • In recent years, text mining has been used to extract meaningful insights from the large volume of unstructured text data sets of various domains. As one of the most representative text mining applications, topic modeling has been widely used to extract main topics in the form of a set of keywords extracted from a large collection of documents. In general, topic modeling is performed according to the weighted frequency of words in a document corpus. However, general topic modeling cannot discover the relation between documents if the documents share only a few terms, although the documents are in fact strongly related from a particular perspective. For instance, a document about "sexual offense" and another document about "silver industry for aged persons" might not be classified into the same topic because they may not share many key terms. However, these two documents can be strongly related from the R&D perspective because some technologies, such as "RF Tag," "CCTV," and "Heart Rate Sensor," are core components of both "sexual offense" and "silver industry." Thus, in this study, we attempted to discover the differences between the results of general topic modeling and R&D perspective topic modeling. Furthermore, we package social issues from the R&D perspective and present a prototype system, which provides a package of news articles for each R&D issue. Finally, we analyze the quality of R&D perspective topic modeling and provide the results of inter- and intra-topic analysis.
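
The baseline discussed above is frequency-weighted topic modeling over a document corpus. As a point of reference only, here is a minimal sketch of that baseline (standard LDA over a term-frequency matrix) in Python with scikit-learn; the toy documents, the number of topics, and the keyword cutoff are illustrative assumptions, not the paper's data or its R&D-perspective packaging method.

```python
# Minimal sketch of frequency-based topic modeling on a small document set.
# The example documents and parameter values are illustrative, not from the paper.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "CCTV and RF tag systems help monitor sexual offense hotspots",
    "heart rate sensors and RF tags support the silver industry for aged persons",
    "topic modeling extracts keywords from large document collections",
]

# Term-frequency matrix over the corpus
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit a small LDA model and print the top keywords per topic
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {k}: {top}")
```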

Another Evidence for Nitric Oxide as One of the Mediators of the Rat gastric Fundus in Response to NANC-Mediated Relaxation

  • Chang, Ki-Churl
    • Biomolecules & Therapeutics / Vol. 3, No. 2 / pp. 149-153 / 1995
  • Nitric oxide (NO) has been regarded as one of the neurotransmitters of nonadrenergic, noncholinergic (NANC) nerve stimulation in rabbit corpus cavernosum, rat gastric fundus and human intestine. PIANO (photo-induced adequate nitric oxide) is a very useful tool to investigate the role of NO in various smooth muscles where NO is a mediator. The present study was undertaken to compare the physiological responses of rat gastric smooth muscle to NANC nerve stimulation and to PIANO. Photolysis of L-NAME, D-NAME and streptozotocin (STZ) by UV light in the bathing medium caused relaxation of the rat gastric fundus contracted with carbachol, and this relaxation was resistant to tetrodotoxin (TTX, 1 $\mu$M). Electrical stimulation (20 V, 2~32 Hz, 0.2 msec, 10 s) of the gastric fundus, in the presence of atropine and guanethidine, induced frequency-dependent, TTX-sensitive relaxation. Sodium nitroprusside (1 nM-10 $\mu$M), an NO donor, mimicked the relaxations observed after NANC stimulation or PIANO. Furthermore, PIANO caused a UV light exposure time-dependent increase of cGMP in rat gastric fundus strips. These results provide further indirect evidence that NO is one of the mediators of NANC inhibitory nerve stimulation in the rat gastric fundus.

Topic Extraction and Classification Method Based on Comment Sets

  • Tan, Xiaodong
    • Journal of Information Processing Systems / Vol. 16, No. 2 / pp. 329-342 / 2020
  • In recent years, emotional text classification has become one of the essential research topics in natural language processing. It has been widely used in sentiment analysis of commentary corpora such as hotel reviews. This paper proposes an improved W-LDA (weighted latent Dirichlet allocation) topic model to address the shortcomings of the traditional LDA topic model. In the Gibbs sampling process of the W-LDA topic model, when sampling the topic of a word and calculating the expectation of its word distribution, an average weighted value is adopted to prevent topic-related words from being submerged by high-frequency words and to improve the distinctness of the topics. The extracted high-quality document-topic distributions and topic-word vectors are then fed into a support vector machine for classification. Finally, an efficient integrated method is constructed for the analysis and extraction of emotional words, topic distribution calculation, and sentiment classification. Tests on real teaching-evaluation data and a public comment test set show that the proposed method has distinct advantages over two other typical algorithms in terms of topic differentiation, classification precision, and F1-measure.
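
The abstract describes two steps: a weighting scheme that keeps high-frequency words from dominating the topics, and an SVM classifier trained on the resulting document-topic distributions. The sketch below only approximates that idea (it rescales raw counts by corpus frequency before a standard LDA rather than modifying the Gibbs sampler as W-LDA does); the toy reviews, labels, and parameter values are assumptions.

```python
# Rough sketch of the overall pipeline: down-weight very frequent words before
# topic modeling, then classify documents by their topic distributions with an SVM.
# This approximates the weighting idea; it is NOT the paper's modified Gibbs sampler.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import SVC

docs = ["the room was clean and the staff were friendly",
        "the lecture was clear and the examples were helpful",
        "the food was cold and the service was slow",
        "the course was boring and the slides were unreadable"]
labels = [1, 1, 0, 0]  # toy sentiment labels

vec = CountVectorizer()
X = vec.fit_transform(docs).astype(float)

# Average-style weighting: divide each word's counts by its corpus frequency
# so high-frequency words do not dominate the topics.
corpus_freq = np.asarray(X.sum(axis=0)).ravel()
X = X.multiply(1.0 / np.maximum(corpus_freq, 1.0)).tocsr()

lda = LatentDirichletAllocation(n_components=3, random_state=0)
theta = lda.fit_transform(X)          # document-topic distributions

clf = SVC(kernel="linear").fit(theta, labels)
print(clf.predict(theta))
```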

Phoneme distribution and syllable structure of entry words in the CMU English Pronouncing Dictionary

  • Yang, Byunggon
    • 말소리와 음성과학 / Vol. 8, No. 2 / pp. 11-16 / 2016
  • This study explores the phoneme distribution and syllable structure of entry words in the CMU English Pronouncing Dictionary to provide phoneticians and linguists with fundamental phonetic data on English word components. Entry words in the dictionary file were syllabified using an R script and examined to obtain the following results: First, English words preferred consonants to vowels in their word components. In addition, monophthongs occurred much more frequently than diphthongs. When all consonants were categorized by manner and place, the distribution indicated the frequency order of stops, fricatives, and nasals according to manner and that of alveolars, bilabials and velars according to place. These results were comparable to the results obtained from the Buckeye Corpus (Yang, 2012). Second, from the analysis of syllable structure, two-syllable words were most favored, followed by three- and one-syllable words. Of the words in the dictionary, 92.7% consisted of one, two or three syllables. This result may be related to human memory or decoding time. Third, the English words tended to exhibit discord between onset and coda consonants and between adjacent vowels. Dissimilarity between the last onset and the first coda was found in 93.3% of the syllables, while 91.6% of the adjacent vowels were different. From the results above, the author concludes that an analysis of the phonetic symbols in a dictionary may lead to a deeper understanding of English word structures and components.
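
For readers who want to reproduce counts of this kind, the sketch below tallies vowel and consonant symbols over CMU dictionary entries via NLTK's copy of the dictionary. The paper's own analysis used an R script with full syllabification; this is only a rough starting point under stated assumptions (first pronunciation only, no syllable structure).

```python
# Count vowel vs. consonant ARPAbet symbols across CMU Pronouncing Dictionary entries.
# Requires nltk and the cmudict data (nltk.download('cmudict')).
from collections import Counter
from nltk.corpus import cmudict

pron = cmudict.dict()  # word -> list of ARPAbet transcriptions
counts = Counter()
for word, transcriptions in pron.items():
    for phone in transcriptions[0]:          # first pronunciation only
        # ARPAbet vowels carry a stress digit (e.g. 'AH0'); consonants do not
        kind = "vowel" if phone[-1].isdigit() else "consonant"
        counts[kind] += 1

total = sum(counts.values())
for kind, n in counts.items():
    print(f"{kind}: {n} ({100 * n / total:.1f}%)")
```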

Empirical Comparison of Word Similarity Measures Based on Co-Occurrence, Context, and a Vector Space Model

  • Kadowaki, Natsuki; Kishida, Kazuaki
    • Journal of Information Science Theory and Practice / Vol. 8, No. 2 / pp. 6-17 / 2020
  • Word similarity is often measured to enhance system performance in information retrieval and other related areas. This paper reports an experimental comparison of word similarity measures computed for 50 intentionally selected words from a Reuters corpus. Three groups of measures were targeted: (1) co-occurrence-based similarity measures (for which co-occurrence frequency is counted as the number of documents or sentences), (2) context-based distributional similarity measures obtained from latent Dirichlet allocation (LDA), nonnegative matrix factorization (NMF), and the Word2Vec algorithm, and (3) similarity measures computed from the tf-idf weights of each word according to a vector space model (VSM). The Pearson correlation coefficient was highest for the pair consisting of the VSM-based similarity measure and the co-occurrence-based similarity measure counted by the number of documents. Group-average agglomerative hierarchical clustering was also applied to the similarity matrices computed by the individual measures. An evaluation of the cluster sets against an answer set revealed that the VSM- and LDA-based similarity measures performed best.
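
Two of the similarity families compared in the abstract can be sketched compactly: document-level co-occurrence counts and cosine similarity between tf-idf word vectors, with a Pearson correlation between the two measures over all word pairs. The mini-corpus below is a toy assumption, not the Reuters setup or the 50-word selection used in the paper.

```python
# Compare document-level co-occurrence similarity with tf-idf cosine similarity
# and report the Pearson correlation between the two measures over word pairs.
from itertools import combinations
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["oil prices rose as markets rallied",
        "oil exports fell while prices dropped",
        "the central bank cut interest rates",
        "markets rallied after the bank decision"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)                  # documents x terms
vocab = vec.get_feature_names_out()

# VSM similarity: cosine between tf-idf term vectors (terms x documents)
vsm_sim = cosine_similarity(X.T)

# Co-occurrence similarity: number of documents in which both words appear
presence = (X > 0).astype(int).toarray()
cooc = presence.T @ presence

pairs = list(combinations(range(len(vocab)), 2))
vsm_vals = [vsm_sim[i, j] for i, j in pairs]
cooc_vals = [cooc[i, j] for i, j in pairs]
print("Pearson r:", pearsonr(vsm_vals, cooc_vals)[0])
```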

Studies on the Occurrence of Reproductive Disorders of Dairy Cattle Resident in High-land (고지사육유우의 번식장해 발생상태에 관한 조사연구)

  • 박춘근; 고광두
    • 한국가축번식학회지 / Vol. 10, No. 1 / pp. 9-18 / 1986
  • The breeding and infertility status of Holstein cows reared in high-land pasture was investigated. The results obtained are summarized as follows: 1. Eight hundred and forty Holstein cows were investigated: heifers (12.4%), 3-year-old cows (19.3%), 4-year-old cows (16.7%), 5-year-old cows (12.5%), 6-year-old cows (11.4%), 7-year-old cows (13.2%), 8-year-old cows (9.4%) and cows aged 9 years or older (5.1%). 2. Of these, 72.4% were conceived cows, 8.2% were of uncertain pregnancy, 7.7% were physiologically empty and 11.7% had reproductive disorders. 3. The percentage of cows conceived with 1, 2, 3 and more than 4 services of A.I. was 49.2, 28.8, 14.6 and 7.4%, respectively. 4. The nutritional condition of infertile cows was excellent, good, fair and poor in 7.1, 30.6, 36.7 and 25.5% of cases, respectively. 5. Among 98 infertile cows, the distribution of reproductive disorders was 41.8, 37.8, 5.1, 5.1 and 11.2% in the ovary, uterus, vagina, oviduct and others, respectively; the ovary showed a higher percentage than any other reproductive organ. Among the ovarian syndromes, luteal cystic ovary, follicular cystic ovary and persistent corpus luteum accounted for 31.7, 26.8 and 19.5%, respectively. 6. Four-year-old cows showed the highest distribution (16.4%) among the age groups of disordered cows. Among the syndromes of reproductive disorder, latent endometritis showed a higher frequency (14.3%) than any other. 7. Infertile cows with complex syndromes of genital disease accounted for 29.6%.

Spam Filter Using Chi-square Statistics and Support Vector Machines (카이제곱 통계량과 지지벡터기계를 이용한 스팸메일 필터)

  • 이성욱
    • 정보처리학회논문지B / Vol. 17B, No. 3 / pp. 249-254 / 2010
  • This paper proposes a system that automatically classifies spam e-mail using support vector machines (SVMs). The lexical information and part-of-speech tag information of the words contained in each e-mail are used as SVM features. We selected features using the chi-square statistic and then represented each feature with TF, TF-IDF, and binary weights in our experiments. After an SVM was trained on the features selected by the chi-square statistic, the SVM classifier decides whether each e-mail is spam. Experimental results show that the selected features improved performance, yielding an accuracy of about 98.9% on the TREC05-p1 spam corpus.
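
A minimal sketch of the pipeline described above, assuming scikit-learn: chi-square feature selection over word features followed by a linear SVM. The toy messages, the number of selected features, and the TF-IDF weighting are illustrative; the paper additionally used part-of-speech tag features and evaluated on the TREC05-p1 corpus.

```python
# Chi-square feature selection over word features, then an SVM spam/ham classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

mails = ["win a free prize now", "cheap meds online, click here",
         "meeting moved to 3pm tomorrow", "please review the attached report"]
labels = [1, 1, 0, 0]           # 1 = spam, 0 = ham

clf = make_pipeline(
    TfidfVectorizer(),
    SelectKBest(chi2, k=5),     # keep the 5 highest-scoring word features
    LinearSVC(),
)
clf.fit(mails, labels)
print(clf.predict(["free prize meds", "see the report before the meeting"]))
```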

Teaching Grammar for Spoken Korean to English-speaking Learners: Reported Speech Marker '-dae' (영어권 학습자를 위한 한국어 구어 문법 교육 - 보고 표지 '-대'를 중심으로)

  • 김영아; 조인정
    • 한국어교육 / Vol. 23, No. 1 / pp. 1-23 / 2012
  • The development of corpora in recent years has attracted increased research on spoken Korean. Nevertheless, these research outcomes are yet to be meaningfully and adequately reflected in Korean language textbooks. The reported speech marker '-dae' is one of the areas that need more attention. This study investigates whether '-dae' is explained in textbooks clearly enough for English-speaking learners to prevent confusion and misuse. Based on a contrastive analysis of Korean and English, this study argues three points: Firstly, '-dae' should be introduced to Korean learners as an independent sentence ender rather than a contracted form of '-dago hae'. Secondly, it is necessary to teach English-speaking learners that '-dae' is not equivalent to the English reported speech form; it functions more or less as a third-person marker in Korean. Learners should be informed that '-dae' is used for statements that are hearsay, where the source of information does not need to be specified. This is a very distinctive difference between Korean and English and should be emphasized in class when '-dae' is taught. Thirdly, '-dae' should be introduced before indirect speech constructions, because it is mainly used in simple statements and its frequency is very high in spoken Korean.

Building Hybrid Stop-Words Technique with Normalization for Pre-Processing Arabic Text

  • Atwan, Jaffar
    • International Journal of Computer Science & Network Security / Vol. 22, No. 7 / pp. 65-74 / 2022
  • In natural language processing, commonly used words such as prepositions are referred to as stop-words; they have no inherent meaning and are therefore ignored in indexing and retrieval tasks. The removal of stop-words from Arabic text has a significant impact in terms of reducing the size of a corpus, which leads to an improvement in the effectiveness and performance of Arabic-language processing systems. This study investigated the effectiveness of applying stop-word list elimination together with normalization as a preprocessing step. The idea was to merge a statistical method with a linguistic method to attain the best efficacy, and to compare the effects of this two-pronged approach in reducing corpus size for Arabic natural language processing systems. Three stop-word lists were considered: an Arabic Text Lookup Stop-list, a Frequency-based Stop-list using Zipf's law, and a Combined Stop-list. An experiment was conducted using a selected file from the Arabic Newswire data set, in which the size of the corpus was compared after removing the words contained in each list. The results showed that the best reduction in size was achieved by the Combined Stop-list with normalization, with a word-count reduction of 452,930 and a compression rate of 30%.
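
A frequency-based stop-list of the kind mentioned above can be sketched as follows: treat the most frequent tokens in a corpus as stop-words, remove them, and report the word-count reduction and compression rate. The toy English text and the cutoff of five top-frequency tokens are assumptions; the paper works on Arabic Newswire text and also applies normalization and a lookup stop-list.

```python
# Build a frequency-based stop-list, remove the stop-words, and report the
# word-count reduction and compression rate of the filtered corpus.
from collections import Counter

corpus = ("the oil price rose and the market reacted as the report said "
          "the bank cut rates and the market rallied on the news").split()

freq = Counter(corpus)
stop_list = {w for w, _ in freq.most_common(5)}   # top-frequency tokens as stop-words

filtered = [w for w in corpus if w not in stop_list]
removed = len(corpus) - len(filtered)
print(f"stop-list: {sorted(stop_list)}")
print(f"removed {removed} tokens, compression rate {100 * removed / len(corpus):.1f}%")
```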

Emotion Recognition in Arabic Speech from Saudi Dialect Corpus Using Machine Learning and Deep Learning Algorithms

  • Hanaa Alamri; Hanan S. Alshanbari
    • International Journal of Computer Science & Network Security / Vol. 23, No. 8 / pp. 9-16 / 2023
  • Speech can actively convey feelings and attitudes through words. It is important for researchers to identify the emotional content contained in speech signals as well as the sort of emotion the speech conveys. In this study, we examined an emotion recognition system using an Arabic database, specifically in the Saudi dialect, drawn from a YouTube channel called Telfaz11. The four emotions examined were anger, happiness, sadness, and neutral. In our experiments, we extracted features from the audio signals, such as Mel Frequency Cepstral Coefficients (MFCC) and the Zero-Crossing Rate (ZCR), and then classified emotions using several algorithms: machine learning algorithms (Support Vector Machine (SVM) and K-Nearest Neighbor (KNN)) and deep learning algorithms (Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM)). Our experiments showed that the MFCC feature extraction method and the CNN model obtained the best accuracy, 95%, demonstrating the effectiveness of this classification system in recognizing spoken Arabic emotions.
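
A rough sketch of the feature-extraction step described above, assuming librosa and scikit-learn are available: MFCC and zero-crossing-rate features are averaged over time and fed to an SVM. The random "audio" signals and labels below stand in for the Telfaz11 recordings; a real run would load wav files and could swap in a CNN or LSTM as in the paper.

```python
# Extract MFCC and zero-crossing-rate features from audio signals, average them
# over time, and train a simple SVM emotion classifier on placeholder data.
import numpy as np
import librosa
from sklearn.svm import SVC

def extract_features(y, sr=16000):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    zcr = librosa.feature.zero_crossing_rate(y).mean(axis=1)
    return np.concatenate([mfcc, zcr])

rng = np.random.default_rng(0)
signals = [rng.standard_normal(16000).astype(np.float32) for _ in range(8)]
labels = [0, 1, 2, 3, 0, 1, 2, 3]     # anger / happiness / sadness / neutral

X = np.stack([extract_features(y) for y in signals])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X[:2]))
```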