• Title/Summary/Keyword: 10-fold Validation

Sentiment Classification of Movie Reviews using Levenshtein Distance (Levenshtein 거리를 이용한 영화평 감성 분류)

  • Ahn, Kwang-Mo;Kim, Yun-Suk;Kim, Young-Hoon;Seo, Young-Hoon
    • Journal of Digital Contents Society / v.14 no.4 / pp.581-587 / 2013
  • In this paper, we propose a sentiment classification method that uses Levenshtein distance. We generated a BOW (Bag-of-Words) by applying Levenshtein distance to sentiment features and used it as the training set. The machine learning algorithms we used were SVMs (Support Vector Machines) and NB (Naive Bayes). As the data set, we gathered 2,385 movie reviews from an online movie community (Daum movie service). From the collected reviews, we manually picked out sentiment words and compiled 778 words. In the experiment, we trained the classifiers on the previously generated BOW, in which Levenshtein distance was applied to sentiment words, and then evaluated classifier performance by 10-fold cross-validation. The evaluation yielded an accuracy of 85.46% using Multinomial Naive Bayes when the Levenshtein distance was 3. The experimental results show that classification performance is less affected by spelling errors in documents.
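
The idea of matching sentiment words within a Levenshtein-distance threshold can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the distance was applied to the features, so the nearest-word matching rule, the function names, the English lexicon, and the threshold of 1 are all assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def fuzzy_bow(tokens, lexicon, max_distance=1):
    """Count lexicon words, accepting tokens within max_distance edits.

    Hypothetical matching rule: each token is credited to its nearest
    lexicon word, provided that word is close enough.
    """
    counts = {word: 0 for word in lexicon}
    for token in tokens:
        distance, word = min((levenshtein(token, w), w) for w in lexicon)
        if distance <= max_distance:
            counts[word] += 1
    return counts


# Hypothetical sentiment lexicon and a review containing two misspellings.
lexicon = ["great", "boring", "terrible"]
review = "a gret film but the ending was borring".split()
features = fuzzy_bow(review, lexicon, max_distance=1)
print(features)
```

With exact matching, both misspelled tokens would be lost; with the threshold they still contribute to the feature vector, which is the effect the abstract attributes to the method.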

An automated Classification System of Standard Industry and Occupation Codes by Using Information Retrieval Techniques (정보검색 기법을 이용한 산업/직업 코드 자동 분류 시스템)

  • Lim, Heui Seok
    • The Journal of Korean Association of Computer Education / v.7 no.4 / pp.51-60 / 2004
  • This paper proposes an automated coding system for Korean standard industry and occupation codes in the census, which greatly reduces the cost and labor of manual coding. The proposed system converts natural language responses on survey questionnaires into the corresponding numeric codes using information retrieval techniques and a document classification algorithm. The system was evaluated on 46,762 industry records and 36,286 occupation records using 10-fold cross-validation. In the experiments, the system shows production rates of 87.08% and 66.08% when classifying industry records into level 2 and level 5 codes, respectively. The system shows slightly lower performance on occupation code classification. Although it has much room for improvement as a fully automated coding system, we expect that it is good enough to be used as a semi-automated coding system that minimizes manual coding, or as a verification tool for manual coding results.

Prediction of Protein-Protein Interaction Sites Based on 3D Surface Patches Using SVM (SVM 모델을 이용한 3차원 패치 기반 단백질 상호작용 사이트 예측기법)

  • Park, Sung-Hee;Hansen, Bjorn
    • The KIPS Transactions:PartD / v.19D no.1 / pp.21-28 / 2012
  • Prediction of protein interaction sites for monomer structures can reduce the search space for protein docking and has been regarded as very significant for predicting unknown functions of proteins from their interacting proteins whose functions are known. On the other hand, prediction of interaction sites has been limited by the difficulty of crystallizing weakly interacting complexes, which are transient and, for the most important protein-protein interactions, do not form complexes stable enough to yield experimental structures by crystallization or even NMR. This work reports the calculation of 3D surface patches of complex structures and their properties, and a machine learning approach that uses a support vector machine to build a predictive model for the 3D surface patches in interaction and non-interaction sites. To overcome the classification problems caused by class-imbalanced data, we employed an under-sampling technique. Nine properties of the patches were calculated from amino acid compositions and secondary structure elements. With 10-fold cross-validation, the predictive model built with SVM achieved an accuracy of 92.7% for classification of 3D patches in interaction and non-interaction sites from 147 complexes.
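
Under-sampling, as mentioned above for class-imbalanced data, can be sketched in a few lines. The abstract does not say which under-sampling variant was used, so random under-sampling is assumed here; the function name, the toy feature vectors, and the 3-to-9 class ratio are hypothetical.

```python
import random


def undersample(samples, labels, seed=0):
    """Randomly discard samples until every class is as small as the rarest.

    Assumption: plain random under-sampling; the paper may have used a
    different variant.
    """
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    n_min = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        # rng.sample draws n_min distinct items without replacement
        balanced.extend((x, y) for x in rng.sample(xs, n_min))
    rng.shuffle(balanced)  # avoid class-ordered output
    return [x for x, _ in balanced], [y for _, y in balanced]


# Toy data: 3 "interaction" patches (label 1) versus 9 "non-interaction" (0).
patches = [[0.1 * i, 1.0 - 0.1 * i] for i in range(12)]
labels = [1] * 3 + [0] * 9
X_bal, y_bal = undersample(patches, labels)
print(len(X_bal), y_bal.count(0), y_bal.count(1))
```

In a cross-validated setting, the balancing step would normally be applied to the training folds only, so that the held-out fold keeps the original class ratio.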

A Comparison Study of Runoff Projections for Yongdam Dam Watershed Using SWAT (SWAT모형을 이용한 용담댐 유역의 유량 전망 결과 비교 연구)

  • Jung, Cha Mi;Shin, Mun-Ju;Kim, Young-Oh
    • Journal of Korea Water Resources Association / v.48 no.6 / pp.439-449 / 2015
  • In this study, reliable future runoff projections based on RCPs for the Yongdam dam watershed were produced using the SWAT model, which was validated by the k-fold cross-validation method, and the factors that cause differences in runoff projections between this study and previous studies were investigated. As a result, annual average runoff compared to baseline runoff would increase by 17.7% and 26.1% in the 2040s and 2080s, respectively, under the RCP8.5 scenario, and by 21.9% and 44.6% in the 2040s and 2080s, respectively, under the RCP4.5 scenario. Compared with previous studies, the minimum and maximum differences between runoff projections across studies were 10.3% and 53.2%, even though runoff was projected with the same rainfall-runoff model. The SWAT model has 27 parameters and a complex, physically based structure, so its results tend to vary with the settings chosen by the model user. In the future, the causes of these differences need to be reduced in order to generate standard runoff scenarios.

Vulnerability Assessment for Fine Particulate Matter (PM2.5) in the Schools of the Seoul Metropolitan Area, Korea: Part II - Vulnerability Assessment for PM2.5 in the Schools (인공지능을 이용한 수도권 학교 미세먼지 취약성 평가: Part II - 학교 미세먼지 범주화)

  • Son, Sanghun;Kim, Jinsoo
    • Korean Journal of Remote Sensing / v.37 no.6_2 / pp.1891-1900 / 2021
  • Fine particulate matter (FPM; diameter ≤ 2.5 ㎛) is frequently found in metropolitan areas due to activities associated with rapid urbanization and population growth. Many adolescents spend a substantial amount of time at school where, for various reasons, FPM generated outdoors may flow into indoor areas. The aims of this study were to estimate FPM concentrations and categorize types of FPM in schools. Meteorological and chemical variables as well as satellite-based aerosol optical depth were analyzed as input data in a random forest model, which applied 10-fold cross validation and a grid-search method, to estimate school FPM concentrations, with four statistical indicators used to evaluate accuracy. Loose and strict standards were established to categorize types of FPM in schools. Under the former classification scheme, FPM in most schools was classified as type 2 or 3, whereas under strict standards, school FPM was mostly classified as type 3 or 4.
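
The abstract above pairs 10-fold cross validation with a grid search. A minimal sketch of that combination follows. To keep the block dependent on the standard library only, a one-dimensional k-nearest-neighbour regressor stands in for the random forest used in the paper; the synthetic data, the grid values, and all function names are hypothetical.

```python
import math
import random


def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def knn_predict(train, x, n_neighbors):
    """Average the targets of the n_neighbors training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:n_neighbors]
    return sum(t for _, t in nearest) / len(nearest)


def cv_rmse(data, n_neighbors, k=10):
    """Root-mean-square error pooled over the k held-out folds."""
    sq_err = 0.0
    for fold in kfold_indices(len(data), k):
        held_out = set(fold)
        train = [p for i, p in enumerate(data) if i not in held_out]
        for i in fold:
            x, t = data[i]
            sq_err += (knn_predict(train, x, n_neighbors) - t) ** 2
    return math.sqrt(sq_err / len(data))


def grid_search(data, grid, k=10):
    """Score every candidate with the same folds and keep the lowest RMSE."""
    scores = {g: cv_rmse(data, g, k) for g in grid}
    return min(scores, key=scores.get), scores


# Synthetic stand-in for (predictor, concentration) pairs.
rng = random.Random(42)
data = [(x / 10, math.sin(x / 10) + rng.gauss(0, 0.1)) for x in range(100)]
best_k, scores = grid_search(data, grid=[1, 3, 5, 9, 15])
print(best_k, round(scores[best_k], 3))
```

Because the fold assignment uses a fixed seed, every grid value is scored on identical splits, which keeps the comparison between candidates fair.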

New Automatic Taxonomy Generation Algorithm for the Audio Genre Classification (음악 장르 분류를 위한 새로운 자동 Taxonomy 구축 알고리즘)

  • Choi, Tack-Sung;Moon, Sun-Kook;Park, Young-Cheol;Youn, Dae-Hee;Lee, Seok-Pil
    • The Journal of the Acoustical Society of Korea / v.27 no.3 / pp.111-118 / 2008
  • In this paper, we propose a new automatic taxonomy generation algorithm for audio genre classification. The proposed algorithm automatically generates a hierarchical taxonomy based on the estimated classification accuracy at all possible nodes. The classification accuracy is estimated by applying the training data to the classifier using k-fold cross-validation. Subsequent classification accuracy is then tested at every node that consists of two clusters by applying a one-versus-one support vector machine. To assess the performance of the proposed algorithm, we extracted various features that represent characteristics such as timbre, rhythm, and pitch. We then compared the classification performance of the proposed algorithm with that of previous flat classifiers. The classification accuracy reaches 89 percent with the proposed scheme, which is 5 to 25 percent higher than previous flat classification methods. With low-dimensional feature vectors in particular, it is 10 to 25 percent higher than previous algorithms in classification experiments.

Deep Learning-based Happiness Index Model Considering Social Variables and Individual Emotional Index (사회적 변수와 개개인의 감정지수를 함께 고려한 딥러닝 기반 행복 지수 모델 설계)

  • Sumin Oh;Minseo Park
    • The Journal of the Convergence on Culture Technology / v.10 no.1 / pp.489-493 / 2024
  • The happiness index is a measurement system for understanding collective happiness. As values change, studies have proposed adding the value of behavior to the happiness index. However, there is a lack of studies that analyze the relationship using individual emotions. Using a deep learning model, we predicted the happiness index from social variables and an individual emotional index. First, we collected social and emotional variables from January 2005 to December 2020. Second, we preprocessed the data and identified significant variables. Finally, we trained a deep learning-based regression model. Our proposed model was evaluated using 5-fold cross-validation and showed 90.86% accuracy on the test sets. Our model is expected to help analyze the significant factors of country-specific happiness indices.

Improvement of FK506 Production in the High-Yielding Strain Streptomyces sp. RM7011 by Engineering the Supply of Allylmalonyl-CoA Through a Combination of Genetic and Chemical Approach

  • Mo, SangJoon;Lee, Sung-Kwon;Jin, Ying-Yu;Suh, Joo-Won
    • Journal of Microbiology and Biotechnology / v.26 no.2 / pp.233-240 / 2016
  • FK506, a widely used immunosuppressant, is a 23-membered polyketide macrolide that is produced by several Streptomyces species. The FK506 high-yielding strain Streptomyces sp. RM7011 was developed from the discovered Streptomyces sp. KCCM 11116P by random mutagenesis in our previous study. The results of transcript expression analysis showed that the transcription levels of tcsA, B, C, and D were increased in Streptomyces sp. RM7011 by 2.1-, 3.1-, 3.3-, and 4.1-fold, respectively, compared with Streptomyces sp. KCCM 11116P. The overexpression of tcsABCD genes in Streptomyces sp. RM7011 gave rise to an approximately 2.5-fold (238.1 μg/ml) increase in the level of FK506 production compared with that of Streptomyces sp. RM7011. When vinyl pentanoate was added to the culture broth of Streptomyces sp. RM7011, the level of FK506 production was approximately 2.2-fold (207.7 μg/ml) higher than that of the unsupplemented fermentation. Furthermore, supplementing the culture broth of Streptomyces sp. RM7011 expressing tcsABCD genes with vinyl pentanoate resulted in an additional 1.7-fold improvement in the FK506 titer (498.1 μg/ml) compared with that observed under the non-supplemented condition. Overall, the level of FK506 production was increased approximately 5.2-fold by engineering the supply of allylmalonyl-CoA in the high-yielding strain Streptomyces sp. RM7011, using a combination of overexpressing tcsABCD genes and adding vinyl pentanoate, as compared with Streptomyces sp. RM7011 (95.3 μg/ml). Moreover, among the three precursors analyzed, pentanoate was the most effective precursor, supporting the highest titer of FK506 in the FK506 high-yielding strain Streptomyces sp. RM7011.
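
The fold increases quoted relative to the parent strain can be checked directly from the titers reported in the abstract. The short calculation below is only an arithmetic check: the dictionary keys are descriptive labels chosen for this example, and the "additional 1.7-fold" figure is left out because the abstract does not state the titer it is measured against.

```python
# Titers in μg/ml as reported in the abstract.
baseline = 95.3  # Streptomyces sp. RM7011 without modification
titers = {
    "tcsABCD overexpression": 238.1,
    "vinyl pentanoate supplement": 207.7,
    "overexpression + supplement": 498.1,
}

# Fold change = titer under the condition / baseline titer.
fold_change = {name: round(value / baseline, 1) for name, value in titers.items()}
for name, fold in fold_change.items():
    print(f"{name}: {fold}-fold")
```

The three ratios round to 2.5, 2.2, and 5.2, in agreement with the figures stated in the abstract.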

Motion Recognition for Kinect Sensor Data Using Machine Learning Algorithm with PNF Patterns of Upper Extremities

  • Kim, Sangbin;Kim, Giwon;Kim, Junesun
    • The Journal of Korean Physical Therapy / v.27 no.4 / pp.214-220 / 2015
  • Purpose: The purpose of this study was to investigate the availability of software for rehabilitation with the Kinect sensor by presenting an efficient machine learning algorithm for classifying the motion data of the PNF pattern when the subjects were wearing a patient gown. Methods: The motion data of the PNF pattern for upper extremities were collected with a Kinect sensor. The data were obtained from 8 healthy university students without upper-extremity limitations. The subjects, wearing a T-shirt, performed the PNF patterns (D1 and D2 flexion and extension) 30 times; the same protocol was repeated while wearing a patient gown to compare the classification performance of the algorithms. For the performance comparison, we chose four algorithms: Naive Bayes Classifier, C4.5, Multilayer Perceptron, and Hidden Markov Model. The motion data recorded while wearing a T-shirt were used as the training set, and a 10-fold cross-validation test was performed. The motion data recorded while wearing a gown were used as the test set. Results: All of the algorithms performed well in the 10-fold cross-validation test. However, when classifying the data recorded with a hospital gown, the Hidden Markov Model (HMM) was the best algorithm for classifying the PNF motions. Conclusion: We showed that HMM is the most efficient algorithm for handling time-related sequence data. We therefore suggest that an algorithm that considers the sequence of motion, such as HMM, should be selected when developing rehabilitation software that must determine the correctness of a motion.

Association of miR-193b Down-regulation and miR-196a up-Regulation with Clinicopathological Features and Prognosis in Gastric Cancer

  • Mu, Yong-Ping;Tang, Song;Sun, Wen-Jie;Gao, Wei-Min;Wang, Mao;Su, Xiu-Lan
    • Asian Pacific Journal of Cancer Prevention / v.15 no.20 / pp.8893-8900 / 2014
  • Dysregulated expression of microRNAs (miRNAs) has been shown to be closely associated with tumor development, progression, and carcinogenesis. However, their clinical implications for gastric cancer remain elusive. To investigate the hypothesis that genome-wide alterations of miRNAs differentiate gastric cancer tissues from matched adjacent non-tumor tissues (ANTTs), miRNA arrays were employed to examine miRNA expression profiles in the 5-pair discovery stage, and quantitative real-time polymerase chain reaction (qRT-PCR) was applied to validate candidate miRNAs in the 48-pair validation stage. Furthermore, the relationship between altered miRNAs and the clinicopathological features and prognosis of gastric cancer was explored. Among a total of 1,146 miRNAs analyzed, 16 miRNAs were found to be significantly differentially expressed in gastric cancer tissues compared to ANTTs (p<0.05). qRT-PCR further confirmed the variation in expression of miR-193b and miR-196a in the validation stage. Down-expression of miR-193b was significantly correlated with Lauren type, differentiation, UICC stage, invasion, and metastasis of gastric cancer (p<0.05), while over-expression of miR-196a was significantly associated with poor differentiation (p=0.022). Moreover, binary logistic regression analysis demonstrated that the UICC stage was a significant risk factor for down-expression of miR-193b (adjusted OR=8.69; 95%CI=1.06-56.91; p=0.043). Additionally, Kaplan-Meier survival curves indicated that patients with a high fold-change of down-regulated miR-193b had a significantly shorter survival time (n=19; median survival=29 months) compared to patients with a low fold-change of down-regulated miR-193b (n=29; median survival=54 months) (p=0.001). Overall survival time of patients with a low fold-change of up-regulated miR-196a (n=27; median survival=52 months) was significantly longer than that of patients with a high fold-change of up-regulated miR-196a (n=21; median survival=46 months) (p=0.003). Hence, miR-193b and miR-196a may be applied as novel and promising prognostic markers in gastric cancer.