• Title/Summary/Keyword: Word Count Analysis

A Performance Analysis Based on Hadoop Application's Characteristics in Cloud Computing (클라우드 컴퓨팅에서 Hadoop 애플리케이션 특성에 따른 성능 분석)

  • Keum, Tae-Hoon; Lee, Won-Joo; Jeon, Chang-Ho
    • Journal of the Korea Society of Computer and Information / v.15 no.5 / pp.49-56 / 2010
  • In this paper, we implement a Hadoop-based cluster for cloud computing and evaluate the performance of this cluster according to application characteristics by executing the RandomTextWriter, WordCount, and PI applications. RandomTextWriter creates a given amount of random words and stores them in the HDFS (Hadoop Distributed File System). WordCount reads an input file and determines the frequency of each word per block. The PI application estimates the value of PI using the Monte Carlo method. During simulation, we investigate the effect of the data block size and the number of replications on the execution time of the applications. Through simulation, we have confirmed that the execution time of RandomTextWriter was proportional to the number of replications, whereas the execution times of WordCount and PI were not affected by it. Moreover, the execution time of WordCount was optimal when the block size was 64~256MB. These results show that the performance of a cloud computing system can be enhanced by using a scheduling scheme that considers application characteristics.
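
As a point of reference for the PI application mentioned in the abstract above, here is a minimal, self-contained sketch of estimating PI with the Monte Carlo method; it is not the paper's Hadoop implementation, and the sample count is an arbitrary choice.

```python
import random

def estimate_pi(samples: int = 1_000_000) -> float:
    """Estimate PI by sampling random points in the unit square
    and counting how many fall inside the quarter circle."""
    inside = 0
    for _ in range(samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # The area ratio (quarter circle / unit square) is PI/4.
    return 4.0 * inside / samples

if __name__ == "__main__":
    print(estimate_pi())  # e.g. 3.1417...
```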

Comparison between Word Embedding Techniques in Traditional Korean Medicine for Data Analysis: Implementation of a Natural Language Processing Method (한의학 고문헌 데이터 분석을 위한 단어 임베딩 기법 비교: 자연어처리 방법을 적용하여)

  • Oh, Junho
    • Journal of Korean Medical Classics / v.32 no.1 / pp.61-74 / 2019
  • Objectives : The purpose of this study is to help select an appropriate word embedding method when analyzing East Asian traditional medicine texts as data. Methods : Based on prescription data that imply traditional methods in traditional East Asian medicine, we examined four count-based and two prediction-based word embedding methods. In order to compare these word embedding methods intuitively, we proposed a "prescription generating game" and compared its results with those obtained by applying the six methods. Results : When adjacent vectors are extracted, the count-based word embedding methods derive the main herbs that are frequently used in conjunction with each other, whereas the prediction-based word embedding methods derive synonyms of the herbs. Conclusions : Count-based word embedding methods seem to be more effective than prediction-based methods in analyzing the use of domesticated herbs. Among the count-based methods, the TF vector tends to exaggerate the frequency effect, so the TF-IDF vector or co-word vector may be a more reasonable choice. The t-score vector may be recommended when searching for unusual information that cannot be found through frequency alone. On the other hand, prediction-based embedding seems to be effective for deriving words with similar meanings in context.
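
To make the count-based "co-word vector" idea concrete, the following is a minimal sketch assuming a toy list of prescriptions (hypothetical herb names, not the study's data); it builds a co-occurrence matrix and retrieves the nearest herbs by cosine similarity, which is only one plausible reading of the method, not the authors' implementation.

```python
import numpy as np

# Hypothetical toy prescriptions: each is a list of herbs used together.
prescriptions = [
    ["ginseng", "licorice", "ginger"],
    ["ginseng", "licorice", "jujube"],
    ["ginger", "jujube", "licorice"],
]

herbs = sorted({h for p in prescriptions for h in p})
index = {h: i for i, h in enumerate(herbs)}

# Co-word vector: how often each pair of herbs appears in the same prescription.
cooc = np.zeros((len(herbs), len(herbs)))
for p in prescriptions:
    for a in p:
        for b in p:
            if a != b:
                cooc[index[a], index[b]] += 1

def nearest(herb: str, k: int = 2):
    """Return the k herbs whose co-word vectors are most similar (cosine)."""
    v = cooc[index[herb]]
    norms = np.linalg.norm(cooc, axis=1) * np.linalg.norm(v) + 1e-9
    sims = cooc @ v / norms
    order = [i for i in np.argsort(-sims) if herbs[i] != herb]
    return [herbs[i] for i in order[:k]]

print(nearest("ginseng"))
```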

The Review about the Development of Korean Linguistic Inquiry and Word Count (언어적 특성을 이용한 '심리학적 한국어 글분석 프로그램(KLIWC)' 개발 과정에 대한 고찰)

  • Lee, Chang H.; Sim, Jung-Mi; Yoon, Aesun
    • Korean Journal of Cognitive Science / v.16 no.2 / pp.93-121 / 2005
  • A substantial body of research has accumulated from attempts to use linguistic style as the dependent measure in psychological research. This research was conducted to develop a Korean text analysis program (KLIWC) based on the English text analysis program LIWC (Linguistic Inquiry and Word Count); the program reflects Korean linguistic characteristics and the culture related to the language. Linguistic tagging makes it possible to analyze agglutinative phrases composed of many morphemes, and a basic-form dictionary and inflection rules were built. In addition, face-saving words and emotional words were included as analysis variables. The development process and characteristics of the Korean text analysis program are reviewed, and future directions for its improvement are discussed.
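
As an illustration of the dictionary-based word counting that LIWC/KLIWC-style programs build on, here is a schematic sketch with hypothetical categories; it omits KLIWC's morphological tagging and is not the program's actual variable set.

```python
from collections import Counter
import re

# Hypothetical category dictionaries; KLIWC's real variables also rely on
# morphological analysis of Korean, which this sketch omits.
CATEGORIES = {
    "positive_emotion": {"happy", "glad", "love"},
    "negative_emotion": {"sad", "angry", "hate"},
}

def liwc_style_counts(text: str) -> dict:
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    # Report each category as a percentage of all words, as LIWC-style tools do.
    return {
        cat: 100.0 * sum(counts[w] for w in words) / total
        for cat, words in CATEGORIES.items()
    }

print(liwc_style_counts("I am happy and glad, not sad."))
```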

Design of a Sentiment Analysis System to Prevent School Violence and Student's Suicide (학교폭력과 자살사고를 예방하기 위한 감성분석 시스템의 설계)

  • Kim, YoungTaek
    • The Journal of Korean Association of Computer Education / v.17 no.6 / pp.115-122 / 2014
  • One of the problems of the current youth generation is the increasing rate of violence and suicide in school life, and this study aims at the design of a sentiment analysis system to prevent suicide by using big data processing. The main design goals are an economical implementation and easy, fast processing for users, so the open-source Hadoop system with the MapReduce algorithm is used on HDFS (Hadoop Distributed File System) for the experiment. This study uses a word count method to perform sentiment analysis, in terms of text mining, on informal data from SNS communications containing violent words, thereby avoiding expensive and complex statistical analysis methods.
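
A minimal sketch of the word-count style screening described above, assuming a hypothetical watch list and in-memory messages rather than the paper's Hadoop/HDFS pipeline:

```python
from collections import Counter
import re

# Hypothetical watch list; the paper's actual lexicon of violent words is not given.
FLAGGED = {"kill", "hit", "die", "hate"}

def flag_counts(messages):
    """Count occurrences of flagged words across SNS-style messages."""
    counts = Counter()
    for msg in messages:
        for token in re.findall(r"\w+", msg.lower()):
            if token in FLAGGED:
                counts[token] += 1
    return counts

messages = ["I hate school", "they hit him again"]
print(flag_counts(messages))  # Counter({'hate': 1, 'hit': 1})
```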

A Study on Phone Call Big Data Analytics (전화통화 빅데이터 분석에 관한 연구)

  • Kim, Jeongrae; Jeong, Chanki
    • Journal of Information Technology and Architecture / v.10 no.3 / pp.387-397 / 2013
  • This paper proposes an approach to big data analytics for phone call data. The analytical model for phone call data is composed of the PVPF (Parallel Variable-length Phrase Finding) algorithm for identifying verbal phrases of natural language and the word count algorithm for measuring the usage frequency of keywords. In the proposed model, we identify words using the PVPF algorithm and measure the usage frequency of the identified words using the word count algorithm in MapReduce. The results can be interpreted from various viewpoints. We design and implement the model based on HDFS (Hadoop Distributed File System) and verify the proposed approach through a case study of phone call data, extracting useful results through analysis of keyword correlation and usage frequency.
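
The word-count step of such a model can be pictured with the classic MapReduce pattern; the sketch below simulates the map, shuffle, and reduce phases in plain Python. It is not the authors' HDFS/MapReduce implementation, and the input records are hypothetical.

```python
from itertools import groupby

def mapper(record):
    # Emit (word, 1) for every token in a call-record transcript line.
    for word in record.split():
        yield word.lower(), 1

def reducer(word, ones):
    # Sum the counts shuffled to this key.
    return word, sum(ones)

records = ["missed call about delivery", "call back about the delivery"]

# Map phase.
pairs = [kv for rec in records for kv in mapper(rec)]
# Shuffle phase: group pairs by key.
pairs.sort(key=lambda kv: kv[0])
grouped = {k: [v for _, v in g] for k, g in groupby(pairs, key=lambda kv: kv[0])}
# Reduce phase.
counts = dict(reducer(k, vs) for k, vs in grouped.items())
print(counts)  # e.g. {'about': 2, 'back': 1, 'call': 2, ...}
```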

Bi-directional Maximal Matching Algorithm to Segment Khmer Words in Sentence

  • Mao, Makara; Peng, Sony; Yang, Yixuan; Park, Doo-Soon
    • Journal of Information Processing Systems / v.18 no.4 / pp.549-561 / 2022
  • In the Khmer writing system, the Khmer script is the official script of Cambodia; it is written from left to right without a space separator, which makes it complex and demands further analysis. Without clear standard guidelines, the space separator in the Khmer language is used inconsistently and informally to separate words in sentences. Therefore, a segmentation method should be discussed in combination with future Khmer natural language processing (NLP) in order to define appropriate rules for Khmer sentences. One of the essential components of Khmer language processing is how to split a sentence into a series of words and count the words used in it; currently, Microsoft Word cannot count Khmer words correctly. This study therefore presents a systematic library that segments Khmer phrases using the bi-directional maximal matching (BiMM) method to address these constraints. In the BiMM algorithm, the paper focuses on a bidirectional implementation of forward maximal matching (FMM) and backward maximal matching (BMM) to improve word segmentation accuracy. A digital or prefix tree data structure, also known as a trie, improves the segmentation procedure by finding the children of each word's parent node. The accuracy of BiMM is higher than using FMM or BMM independently; moreover, the proposed approach improves the dictionary structure and reduces the number of errors. On a corpus of 94,807 Khmer words, the proposed approach reduces the error by 8.57% compared to the FMM and BMM algorithms.
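
To illustrate the maximal-matching idea behind BiMM, here is a minimal forward maximal matching (FMM) sketch over a trie built from a toy dictionary; Latin placeholders stand in for Khmer words, and this is not the authors' library.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def fmm_segment(text, root):
    """Forward maximal matching: at each position take the longest dictionary
    word found by walking the trie; fall back to a single character."""
    i, out = 0, []
    while i < len(text):
        node, longest = root, 0
        for j in range(i, len(text)):
            node = node.children.get(text[j])
            if node is None:
                break
            if node.is_word:
                longest = j - i + 1
        step = longest or 1          # unknown character becomes its own token
        out.append(text[i:i + step])
        i += step
    return out

# Toy dictionary and unspaced input (placeholders for Khmer script).
trie = build_trie(["ban", "bank", "note", "river"])
print(fmm_segment("banknoteriver", trie))  # ['bank', 'note', 'river']
```

BMM runs the same procedure from the right end of the string, and a BiMM-style wrapper would compare the two segmentations and keep the one judged better (for example, the one with fewer unknown tokens).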

A Performance Analysis Based on Spark Application (Spark 애플리케이션 기반의 성능 분석)

  • Jung, Young Gyo; Lee, Byung-Jun; Cho, Young-Joo; Youn, Hee Yong
    • Proceedings of the Korean Society of Computer Information Conference / 2016.01a / pp.79-80 / 2016
  • Apache Spark is an open-source distributed data processing platform that uses distributed memory abstraction to process large volumes of data efficiently. However, the performance of a given job on the Apache Spark platform can vary greatly in memory usage and I/O cost depending on the type and size of the input data, the design and implementation of the algorithm, and the available computing power. To address this problem and to enable high-precision prediction of job performance on the Apache Spark platform, this paper compares and evaluates WordCount simulations as the number of CPU cores increases.
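
A minimal PySpark WordCount sketch along these lines, where the number of local cores is set through the master URL; the input path and core count are placeholders, not the configuration used in the paper.

```python
from pyspark.sql import SparkSession

# local[4]: run Spark locally with 4 worker threads (vary this to study scaling).
spark = (SparkSession.builder
         .master("local[4]")
         .appName("wordcount-benchmark")
         .getOrCreate())
sc = spark.sparkContext

counts = (sc.textFile("input.txt")                      # placeholder input path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

for word, n in counts.take(10):
    print(word, n)

spark.stop()
```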

Performance analysis of Various Embedding Models Based on Hyper Parameters (다양한 임베딩 모델들의 하이퍼 파라미터 변화에 따른 성능 분석)

  • Lee, Sanga; Park, Jaeseong; Kang, Sangwoo; Lee, Jeong-Eom; Kim, Seona
    • Annual Conference on Human and Language Technology / 2018.10a / pp.510-513 / 2018
  • This paper studies how combinations of various word embedding models and hyperparameters perform in a specific domain. A total of 36 embedding vectors were created from three word embedding models, Word2Vec, FastText, and GloVe, by varying the dimension, window size, and minimum count. Each embedding vector was applied to a Fast and Accurate Dependency Parser model to measure the performance of each configuration. For all models, performance improved as the dimension increased, and FastText achieved the highest performance in most cases.
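
A minimal sketch of this kind of hyperparameter sweep using gensim's Word2Vec and FastText; the corpus, parameter values, and the stand-in evaluation are placeholders, and the paper's GloVe setup and dependency-parser scoring are not reproduced here.

```python
from itertools import product
from gensim.models import Word2Vec, FastText

# Placeholder corpus: a list of tokenized sentences.
corpus = [
    ["the", "quick", "brown", "fox", "jumps"],
    ["the", "lazy", "dog", "sleeps"],
    ["the", "quick", "dog", "jumps"],
]

dims = [50, 100]          # embedding dimension
windows = [3, 5]          # context window size
min_counts = [1, 2]       # minimum token frequency

for model_cls, dim, window, min_count in product(
        [Word2Vec, FastText], dims, windows, min_counts):
    model = model_cls(corpus, vector_size=dim, window=window,
                      min_count=min_count, epochs=10)
    # In the paper, each embedding is fed to a dependency parser and scored;
    # here we just report the resulting vocabulary size as a stand-in.
    print(model_cls.__name__, dim, window, min_count, len(model.wv))
```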

Trend Analysis of Korea Papers in the Fields of 'Artificial Intelligence', 'Machine Learning' and 'Deep Learning' ('인공지능', '기계학습', '딥 러닝' 분야의 국내 논문 동향 분석)

  • Park, Hong-Jin
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology / v.13 no.4 / pp.283-292 / 2020
  • Artificial intelligence, one of the representative technologies of the 4th industrial revolution, has received wide recognition since 2016. This paper analyzes domestic paper trends for 'Artificial Intelligence', 'Machine Learning', and 'Deep Learning' among the domestic papers provided by the Korea Academic Education and Information Service. Approximately 10,000 papers were retrieved, and word count analysis, topic modeling, and semantic network analysis were used to analyze the paper trends. As a result of analyzing the extracted papers, the number of papers in 2016, compared to 2015, increased by 600% in the field of artificial intelligence, 176% in machine learning, and 316% in deep learning. In machine learning, support vector machine models have been widely studied, and in deep learning, convolutional neural networks implemented with TensorFlow are widely used. This paper can help set future research directions in the fields of 'artificial intelligence', 'machine learning', and 'deep learning'.
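
As a schematic of the word-count side of this kind of trend analysis (the records below are hypothetical stand-ins for the retrieved paper metadata; the paper additionally applies topic modeling and semantic network analysis):

```python
from collections import Counter, defaultdict

# Hypothetical (year, title) records standing in for the retrieved papers.
papers = [
    (2015, "A study on machine learning for intrusion detection"),
    (2016, "Deep learning based image classification"),
    (2016, "Machine learning approach to churn prediction"),
]

KEYWORDS = {"artificial intelligence", "machine learning", "deep learning"}

by_year = defaultdict(Counter)
for year, title in papers:
    text = title.lower()
    for kw in KEYWORDS:
        if kw in text:
            by_year[year][kw] += 1

for year in sorted(by_year):
    print(year, dict(by_year[year]))  # e.g. 2015 {'machine learning': 1}
```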

Preliminary Analysis of the Relationship between Language Use and Subjective Well-being (주관적 삶의 질과 언어 사용의 관계성 분석)

  • Kim, Kyung-Il; Bae, Jin-Hee; Kim, Young-Jin; Kim, Dong-Geun
    • Journal of the Korea Academia-Industrial cooperation Society / v.12 no.11 / pp.4875-4880 / 2011
  • Individuals' language use has been hypothesized to be a useful window into psychological characteristics. This study examined the relationship between language use and subjective well-being, which consists of life satisfaction and feelings about life. For this purpose, 126 college students wrote an essay and responded to a subjective well-being scale. We then analyzed their writings with KLIWC (Korean Linguistic Inquiry and Word Count) and compared language use between the high and low subjective well-being groups. We also examined the relationships between KLIWC factors and the two sub-factors of subjective well-being. The results show that various individual KLIWC factors reflect participants' subjective well-being and provide preliminary descriptive data on language use and subjective well-being.
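
A minimal sketch of the kind of group comparison reported above, assuming hypothetical per-essay scores for one KLIWC variable in the high and low well-being groups; this is illustrative only, not the study's data or full analysis.

```python
from scipy import stats

# Hypothetical KLIWC scores (e.g. % positive-emotion words) per participant.
high_wellbeing = [4.2, 5.1, 3.8, 4.9, 5.5]
low_wellbeing = [2.9, 3.4, 2.7, 3.9, 3.1]

# Independent-samples t-test comparing the two groups on this variable.
t, p = stats.ttest_ind(high_wellbeing, low_wellbeing)
print(f"t = {t:.2f}, p = {p:.4f}")
```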