• Title/Summary/Keyword: Corpus Frequency


A Model for evaluating the efficiency of inputting Hangul on a telephone keyboard (전화기 자판의 한글 입력 효율성 평가 모형)

  • Koo, Min-Mo;Lee, Mahn-Young
    • The KIPS Transactions:PartD / v.8D no.3 / pp.295-304 / 2001
  • Standards for a telephone Hangul keyboard should be decided on objective factors: the number of strokes and the distance the fingers move. Many designers are likely to accept these factors because they can be calculated objectively. We therefore developed a model that evaluates the efficiency of inputting Hangul on a telephone keyboard in terms of these two factors. Compared with other models, its major features are as follows: (1) it calculates the number of strokes rather than typing time; (2) it directly uses concurrence frequencies counted on the KOREA-1 Corpus; (3) it uses a total set of 67 consonants and vowels; and (4) it can evaluate keyboards that use a syllabic function key (the complete key, the null key, or the final consonant key) as well as keyboards that adopt no syllabic function key. However, many other factors also bear on the efficiency of inputting Hangul on a telephone keyboard. A more accurate evaluation of a telephone Hangul keyboard must consider experimental data as well as logical data.
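
The stroke-count idea in this abstract can be illustrated with a minimal sketch. It assumes a simplified reading of the model: each jamo has a fixed number of key presses on a given layout, and the score is the corpus-frequency-weighted mean. The function name `expected_strokes`, the two layouts, and the counts are hypothetical and are not taken from the paper or the KOREA-1 Corpus; finger-movement distance and syllabic function keys are left out.

```python
from collections import Counter


def expected_strokes(jamo_counts, strokes_per_jamo):
    """Return the frequency-weighted mean number of key presses per jamo."""
    total = sum(jamo_counts.values())
    if total == 0:
        raise ValueError("corpus counts are empty")
    weighted = sum(
        count * strokes_per_jamo[jamo] for jamo, count in jamo_counts.items()
    )
    return weighted / total


# Hypothetical toy frequencies (assumption, not real corpus counts).
jamo_counts = Counter({"ㄱ": 50, "ㅏ": 30, "ㄲ": 20})

# Hypothetical layouts: key presses needed to enter each jamo.
layout_a = {"ㄱ": 1, "ㅏ": 1, "ㄲ": 2}
layout_b = {"ㄱ": 1, "ㅏ": 2, "ㄲ": 3}

score_a = expected_strokes(jamo_counts, layout_a)
score_b = expected_strokes(jamo_counts, layout_b)
print(score_a, score_b)
```

With these toy numbers, layout A averages 1.2 key presses per jamo and layout B averages 1.7, so layout A would be rated as more efficient on this single factor.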


Financial Fraud Detection using Text Mining Analysis against Municipal Cybercriminality (지자체 사이버 공간 안전을 위한 금융사기 탐지 텍스트 마이닝 방법)

  • Choi, Sukjae;Lee, Jungwon;Kwon, Ohbyung
    • Journal of Intelligence and Information Systems / v.23 no.3 / pp.119-138 / 2017
  • Recently, SNS has become an important channel for marketing as well as personal communication. However, cybercrime has also evolved with the development of information and communication technology, and illegal advertising is distributed on SNS in large quantities. As a result, personal information is lost and monetary damage occurs more frequently. In this study, we propose a method to analyze which sentences and documents posted to SNS are related to financial fraud. First, as a conceptual framework, we developed a matrix of conceptual characteristics of cybercriminality on SNS and emergency management. We also suggested an emergency management process consisting of Pre-Cybercriminality (e.g., risk identification) and Post-Cybercriminality steps. Of these, this paper focuses on risk identification. The main process consists of data collection, preprocessing, and analysis. First, we selected two seed words, 'daechul (loan)' and 'sachae (private loan)', and collected data containing these words from SNS such as Twitter. Two researchers then judged whether each collected item was related to cybercriminality, particularly financial fraud. We then selected as keywords those vocabulary items that were nominals or symbols. With the selected keywords, we searched web sources such as Twitter, news, and blogs and collected more than 820,000 articles. The collected articles were refined through preprocessing and turned into learning data. Preprocessing is divided into three steps: morphological analysis, stop-word removal, and selection of valid parts of speech. In the morphological analysis step, a complex sentence is broken into morpheme units to enable mechanical analysis. In the stop-word removal step, non-lexical elements such as numbers, punctuation marks, and double spaces are removed from the text. In the step of selecting valid parts of speech, only two kinds, nouns and symbols, are considered. Because nouns refer to things, they express the intent of a message better than other parts of speech. Moreover, the more illegal a text is, the more frequently symbols are used. To turn the preprocessed data into learning data, each item must be classified as legitimate or not, so each selected item is labeled 'legal' or 'illegal'. The processed data is then converted into a corpus and a Document-Term Matrix. Finally, the 'legal' and 'illegal' files were mixed and randomly divided into a learning data set and a test data set; in this study, 70% was used for learning and 30% for testing. SVM was used as the discrimination algorithm. Since SVM requires gamma and cost values as its main parameters, we set gamma to 0.5 and cost to 10, based on the optimal value function. The cost is set higher than in general cases. To show the feasibility of the proposed idea, we compared the proposed method with MLE (Maximum Likelihood Estimation), Term Frequency, and Collective Intelligence methods, using overall accuracy as the metric. The overall accuracy of the proposed method was 92.41% for illegal loan advertisements and 77.75% for illegal visit sales, which is higher than that of Term Frequency, MLE, and the other methods. Hence, the results suggest that the proposed method is valid and practically usable. In this paper, we propose a framework for managing crises caused by abnormalities in unstructured data sources such as SNS. We hope this study contributes to academia by identifying what to consider when applying SVM-like discrimination algorithms to text analysis. The study should also benefit practitioners in brand management and opinion mining.
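
The data-preparation steps described in this abstract (removing non-lexical elements, building a Document-Term Matrix, a random 70/30 split, and overall accuracy) might look roughly like the following sketch. It uses only the Python standard library; the helper names and the toy English documents are hypothetical, morphological analysis and part-of-speech selection are omitted, and the SVM itself is not implemented here. In a real pipeline, gamma = 0.5 and cost = 10 would be passed to an SVM library of choice.

```python
import random
import re
from collections import Counter


def preprocess(text):
    """Remove numbers and punctuation, then return lowercase tokens."""
    text = re.sub(r"[0-9]+", " ", text)    # numbers
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation marks
    return text.lower().split()            # split() also collapses double spaces


def document_term_matrix(token_lists):
    """Return (vocabulary, rows) where rows[i][j] counts term j in document i."""
    vocab = sorted({tok for toks in token_lists for tok in toks})
    rows = []
    for toks in token_lists:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


def split_70_30(items, seed=0):
    """Shuffle with a fixed seed and split into 70% learning / 30% test data."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]


def overall_accuracy(true_labels, predicted_labels):
    """Fraction of items whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical labeled toy documents (assumption, not data from the study).
docs = [
    ("Loan approved today!!  Call 010 now", "illegal"),
    ("Private loan, no credit check", "illegal"),
    ("Quick loan for anyone: 24 hours", "illegal"),
    ("Instant private loan @ low rate", "illegal"),
    ("Cash loan today, call now", "illegal"),
    ("Bank announces new loan policy", "legal"),
    ("Students discuss loan interest rates", "legal"),
    ("News report on household loan growth", "legal"),
    ("Regulator reviews private loan market", "legal"),
    ("Library opens new reading room", "legal"),
]

tokens = [preprocess(text) for text, _ in docs]
vocab, dtm = document_term_matrix(tokens)
train, test = split_70_30(docs)
print(len(vocab), len(train), len(test))
```

Fixing the shuffle seed keeps the split reproducible, which makes accuracy figures comparable across the methods being benchmarked.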

Comparative study of typical and atypical benign epilepsy with centrotemporal spikes (Rolandic epilepsy) (중심 측두부 극파를 보이는 전형적 및 비전형적 양성 부분 간진의 비교 연구)

  • Song, Junhyuk;Lee, Kyuha;Chung, Sajun
    • Clinical and Experimental Pediatrics / v.51 no.10 / pp.1085-1089 / 2008
  • Purpose: This study aims to examine and compare the features of typical and atypical rolandic epilepsy. Methods: Of 158 patients selected retrospectively, 116 had typical (group A) and 42 had atypical (group B) rolandic epilepsy, as defined by Worrall's criteria. Results: The age at onset of seizures was $8.6 \pm 2.0$ y in group A and $6.2 \pm 1.7$ y in group B (P>0.05). Among the 40 patients who underwent neuroimaging studies (25 patients in group A and 15 patients in group B), abnormal findings in group B included ventricular dilatation, mild cortical atrophy, and partial agenesis of the corpus callosum. Group A had no abnormal findings. The frequency of seizures was $2.0 \pm 1.0$ and $2.3 \pm 1.2$ per month in groups A and B, respectively. Seizure control with the initial anticonvulsant treatment was achieved within 3 months in group A and within 3 to 12 months in group B. A 2-year remission was noted in 105 patients in group A and in 38 patients in group B. Of these, recurrence after 2 y occurred in 13 patients in group A and 12 patients in group B. Conclusion: Age at onset of seizures, gender, frequency of seizures before therapy, and 2-y remission rate were not significantly different between the 2 groups. However, neuroimaging abnormalities, the time to achieve seizure control with the initial anticonvulsant treatment, and the recurrence rate after being seizure-free for 2 y were significantly different between the 2 groups.

A Comparative Analysis of Basal Body Temperature to Ultrasound, as a Method of Ovulation Detection in Induced Ovulatory Menstrual Cycles (배란유도주기에 따른 초음파검사와 기초체온표의 비교분석)

  • Choi, W.;Suh, B.H.;Lee, J.H.
    • Clinical and Experimental Reproductive Medicine / v.12 no.2 / pp.25-37 / 1985
  • Four points on the basal body temperature (BBT) curve were correlated with the estimated time of ovulation, as determined by serial ultrasound in 50 induced menstrual cycles from 22 subjects. The time of ovulation was estimated by measuring the maximal diameter of follicles and observing the morphologic changes within the ovary from follicle to corpus luteum. The results were as follows. 1. The diameter of the follicle on the day before its disappearance averaged 21.1 mm (S.D.: 2.14). Follicular growth over the 4 days before ovulation averaged 2.8 mm/day, and rapid growth of 3.1 mm/day was observed on the day before. 2. The changes associated with rupture of the follicles were, in order of frequency: decrease in size (94%), disappearance of follicles (64%), fluid in the Cul-de-Sac (26%), and increased internal echoes (16%). 3. A BBT dip was exhibited in only 20 of 50 cycles and correlated with the time of ovulation estimated by ultrasound in 2 of them (10%). A BBT nadir was exhibited in 30 of 50 cycles and correlated in 5 (16.7%). The first day of hyperthermic plateau (FDHP) and the BBT coverline were exhibited in all cycles and correlated in 41 (82%) and 35 (70%) cases, respectively. 4. The relationship between the diameter of the dominant follicle, measured by ultrasound, and the BBT curve was as follows. In cycles in which a dip was observed on the BBT curve, the follicular diameter was $10.5 \pm 2.12$ mm 4 days prior to the dip (D-4), $12.5 \pm 2.12$ mm (D-3), $15.5 \pm 2.12$ mm (D-2), $17.0 \pm 1.41$ mm (D-1), and $21.5 \pm 2.12$ mm just prior to the dip (D-0). For the nadir: $9.6 \pm 1.67$ mm (N-4), $12.8 \pm 1.79$ mm (N-3), $16.2 \pm 1.92$ mm (N-2), $18.2 \pm 2.17$ mm (N-1), and $21.4 \pm 2.61$ mm (N-0). For the first day of hyperthermic plateau (FDHP): $9.8 \pm 1.36$ mm (F-4), $12.4 \pm 1.41$ mm (F-3), $15.1 \pm 1.57$ mm (F-2), $18.1 \pm 1.67$ mm (F-1), and $21.2 \pm 2.25$ mm (F-0). For the BBT coverline endpoint: $9.9 \pm .39$ mm (C-4), $12.5 \pm 1.44$ mm (C-3), $15.2 \pm 1.64$ mm (C-2), $18.0 \pm 1.69$ mm (C-1), and $21.2 \pm 2.31$ mm (C-0). 5. The relationship between the ultrasonographic signs of ovulation and the BBT curve was as follows. The BBT dip correlated with ovulation in 2 cases, which showed decrease in follicular diameter (100%), fluid pattern in the Cul-de-Sac (1 case, 50%), and complete disappearance of the follicle (1 case, 50%). For the nadir (5 cases), the ultrasonographic signs of ovulation were decrease in follicular diameter (5 cases, 100%), fluid pattern in the Cul-de-Sac (1 case, 20%), and complete disappearance of the follicle (3 cases, 60%). For the first day of hyperthermic plateau (41 cases): decrease in follicular diameter (40 cases, 97.6%), fluid pattern in the Cul-de-Sac (11 cases, 26.8%), appearance of internal echo and thickening of the wall (6 cases, 14.6%), and complete disappearance of the follicle (28 cases, 68.3%). For the BBT coverline endpoint (35 cases): decrease in follicular diameter (33 cases, 94.3%), fluid pattern in the Cul-de-Sac (9 cases, 25.7%), appearance of internal echo and thickening of the wall (5 cases, 14.3%), and complete disappearance of the follicle (20 cases, 57.1%).


The Effect of Domain Specificity on the Performance of Domain-Specific Pre-Trained Language Models (도메인 특수성이 도메인 특화 사전학습 언어모델의 성능에 미치는 영향)

  • Han, Minah;Kim, Younha;Kim, Namgyu
    • Journal of Intelligence and Information Systems / v.28 no.4 / pp.251-273 / 2022
  • Research on applying deep learning to text analysis has continued steadily. In particular, studies have actively sought to understand the meaning of words and to perform tasks such as summarization and sentiment classification with pre-trained language models trained on large datasets. However, existing pre-trained language models are limited in that they do not understand specific domains well. Research has therefore recently shifted toward creating language models specialized for a particular domain. Domain-specific pre-trained language models understand the knowledge of a particular domain better and show performance improvements on various tasks in that field. However, domain-specific further pre-training is costly because corpus data for the target domain must be acquired. Furthermore, many cases have been reported in which the performance improvement after further pre-training is insignificant in some domains. It is therefore difficult to decide to develop a domain-specific pre-trained language model when it is unclear whether performance will improve substantially. In this paper, we present a way to estimate, before actually performing further pre-training, the performance improvement expected from further pre-training in a domain. Specifically, after selecting three domains, we measured the increase in classification accuracy through further pre-training in each domain. We also developed new indicators that estimate the specificity of a domain based on the normalized frequency of the keywords used in each domain. Finally, we conducted classification using a pre-trained language model and domain-specific pre-trained language models for the three domains. As a result, we confirmed that the higher the domain specificity index, the greater the performance improvement from further pre-training.
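
The abstract does not give the formula for its domain specificity indicators, so the following is only an illustrative sketch of one way a normalized-keyword-frequency index could be computed. The definition used here (mean gap between the relative frequency of a domain's top keywords in the domain corpus and in a general corpus), the function names, and the toy corpora are all assumptions and should not be read as the paper's method.

```python
from collections import Counter


def normalized_frequencies(tokens):
    """Return each term's count divided by the total number of tokens."""
    if not tokens:
        raise ValueError("token list is empty")
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def specificity_index(domain_tokens, general_tokens, top_k=3):
    """Hypothetical index: mean excess relative frequency of the domain's
    top_k keywords compared with a general corpus."""
    domain_freq = normalized_frequencies(domain_tokens)
    general_freq = normalized_frequencies(general_tokens)
    # Rank by descending frequency; break ties alphabetically for determinism.
    keywords = sorted(domain_freq, key=lambda t: (-domain_freq[t], t))[:top_k]
    gaps = [domain_freq[t] - general_freq.get(t, 0.0) for t in keywords]
    return sum(gaps) / len(gaps)


# Hypothetical toy corpora (assumption, for illustration only).
general = "the market is open and the weather is fine".split()
medical = "the patient received the dose and the patient recovered".split()

print(specificity_index(medical, general))
print(specificity_index(general, general))
```

Under this toy definition, a corpus compared with itself scores zero, and a corpus whose frequent terms are rare in general text scores higher, which matches the direction of the relationship the abstract reports.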