• Title/Summary/Keyword: Subset selection


Extraction Method of Significant Clinical Tests Based on Data Discretization and Rough Set Approximation Techniques: Application to Differential Diagnosis of Cholecystitis and Cholelithiasis Diseases (데이터 이산화와 러프 근사화 기술에 기반한 중요 임상검사항목의 추출방법: 담낭 및 담석증 질환의 감별진단에의 응용)

  • Son, Chang-Sik;Kim, Min-Soo;Seo, Suk-Tae;Cho, Yun-Kyeong;Kim, Yoon-Nyun
    • Journal of Biomedical Engineering Research
    • /
    • v.32 no.2
    • /
    • pp.134-143
    • /
    • 2011
  • The selection of meaningful clinical tests and their reference values from high-dimensional clinical data with an imbalanced class distribution, where one class is represented by a large number of examples while the other is represented by only a few, is an important but difficult issue in the differential diagnosis of similar diseases. For this purpose, this study introduces methods based on the concepts of the discernibility matrix and discernibility function in rough set theory (RST), combined with two discretization approaches: equal-width and equal-frequency discretization. The discretization approaches are used to define the reference values of the clinical tests, and the discernibility matrix and function are used to extract a subset of significant clinical tests from the translated nominal attribute values. To show its applicability to the differential diagnosis problem, we applied the method to extract the significant clinical tests and their reference values between a normal group (N = 351) and an abnormal group (N = 101) with either cholecystitis or cholelithiasis. In addition, we investigated not only the selected significant clinical tests and the variation of their reference values, but also the average predictive accuracy on four evaluation criteria, i.e., accuracy, sensitivity, specificity, and geometric mean, during 10-fold cross-validation. The experimental results confirmed that the rough set approximation methods based on the two discretization approaches give better results with relative frequency than with absolute frequency on the evaluation criteria (i.e., average geometric mean). This shows that the prediction model using relative frequency can be used effectively for classification and prediction on clinical data with an imbalanced class distribution.
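The two discretization schemes and the discernibility matrix used by this abstract can be sketched in a few lines. This is a minimal illustration with made-up data, not the paper's actual binning parameters or clinical values; the bin count `k` and the toy patient matrix are assumptions:

```python
import numpy as np

def equal_width_bins(x, k):
    """Equal-width discretization: split the value range into k equal intervals."""
    edges = np.linspace(x.min(), x.max(), k + 1)[1:-1]
    return np.digitize(x, edges)

def equal_frequency_bins(x, k):
    """Equal-frequency discretization: cut points at quantiles, so each bin
    holds roughly the same number of samples."""
    edges = np.quantile(x, np.linspace(0, 1, k + 1))[1:-1]
    return np.digitize(x, edges)

def discernibility_matrix(X, y):
    """For every pair of objects with different class labels, record the set
    of attributes on which their discretized values differ. Attributes that
    appear in many (or all) non-empty entries are candidates for a reduct."""
    n = len(y)
    return {(i, j): set(np.flatnonzero(X[i] != X[j]).tolist())
            for i in range(n) for j in range(i + 1, n) if y[i] != y[j]}

# Toy example: 6 patients, 2 clinical tests, binary outcome.
raw = np.array([[1.0, 0.2], [1.1, 0.3], [1.2, 0.1],
                [3.0, 2.2], [3.1, 2.3], [2.9, 2.1]])
X = np.column_stack([equal_width_bins(raw[:, 0], 3),
                     equal_frequency_bins(raw[:, 1], 3)])
y = np.array([0, 0, 0, 1, 1, 1])
M = discernibility_matrix(X, y)   # 9 cross-class pairs
```

The bin edges produced by the two schemes play the role of the reference values the abstract describes, and the discernibility entries feed the subset extraction step.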

Evolutionary Design of Radial Basis Function-based Polynomial Neural Network with the aid of Information Granulation (정보 입자화를 통한 방사형 기저 함수 기반 다항식 신경 회로망의 진화론적 설계)

  • Park, Ho-Sung;Jin, Yong-Ha;Oh, Sung-Kwun
    • The Transactions of The Korean Institute of Electrical Engineers
    • /
    • v.60 no.4
    • /
    • pp.862-870
    • /
    • 2011
  • In this paper, we introduce a new topology of Radial Basis Function-based Polynomial Neural Networks (RPNN) that is based on a genetically optimized multi-layer perceptron with Radial Polynomial Neurons (RPNs). This study offers a comprehensive design methodology involving mechanisms of optimization algorithms, especially the Fuzzy C-Means (FCM) clustering method and Particle Swarm Optimization (PSO). In contrast to the typical architectures encountered in Polynomial Neural Networks (PNNs), our main objective is to develop a design strategy for RPNNs as follows: (a) The architecture of the proposed network consists of Radial Polynomial Neurons (RPNs). Here, the RPN is fully reflective of the structure encountered in the numeric data, which are granulated with the aid of FCM clustering. The RPN dwells on the concepts of a collection of radial basis functions and function-based nonlinear (polynomial) processing. (b) The PSO-based design procedure applied at each layer of the RPNN leads to the selection of preferred nodes of the network (RPNs) whose local characteristics (such as the number of input variables, the specific subset of input variables, the order of the polynomial, and the number of clusters as well as the fuzzification coefficient of the FCM clustering) can be easily adjusted. The performance of the RPNN is quantified through experimentation on a number of modeling benchmarks, namely NOx emission process data of a gas turbine power plant and machine learning data (the Automobile Miles Per Gallon data) already studied in fuzzy or neurofuzzy modeling. A comparative analysis reveals that the proposed RPNN exhibits higher accuracy and superb predictive capability in comparison to some previous models available in the literature.
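The information-granulation step can be illustrated with a bare-bones Fuzzy C-Means loop. This is the generic textbook FCM update, not the paper's full RPN construction; the fuzzification coefficient `m` is one of the parameters the PSO procedure would tune, and the data here are invented:

```python
import numpy as np

def fcm(X, c, m=2.0, iters=100, seed=0):
    """Bare-bones Fuzzy C-Means: alternate between computing prototypes as
    membership-weighted means and re-deriving the fuzzy membership matrix U
    (shape n_samples x c) from the distances to those prototypes."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)            # each row sums to 1
    p = 2.0 / (m - 1.0)
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = (d ** -p) / (d ** -p).sum(axis=1, keepdims=True)
    return centers, U

# Granulate a tiny numeric dataset with two obvious granules.
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
centers, U = fcm(X, c=2)
```

In the RPNN, the resulting prototypes would parameterize the radial basis functions of the RPNs, with the polynomial processing layered on top.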

A Comparison of InSAR Techniques for Deformation Monitoring using Multi-temporal SAR (다중시기 SAR 영상을 이용한 시계열 변위 관측기법 비교 분석)

  • Kim, Sang-Wan
    • Korean Journal of Remote Sensing
    • /
    • v.26 no.2
    • /
    • pp.143-151
    • /
    • 2010
  • We carried out studies on InSAR techniques for time-series deformation monitoring using multi-temporal SAR. The PSInSAR method, which uses permanent scatterers, is much more complicated than SBAS because it involves many non-linear equations due to the input of wrapped phase. We confirmed that the PS algorithm is very sensitive even to the selection of permanent scatterer candidates (PSC). On the other hand, the SBAS method, which uses interferograms of small-baseline subsets, is simple but sensitive to the accuracy of the unwrapped phase. SBAS is the better method when no significant unwrapping error is expected, while PSInSAR is more appropriate when local deformation within a very limited area is expected. We used 51 ERS-1/2 SAR scenes acquired during 1992-2000 over Las Vegas, USA for the comparison between PSInSAR and SBAS. Both methods show similar ground deformation values, although local deformation seems to be detected by the PSInSAR method only.

2D-QSAR analysis for hERG ion channel inhibitors (hERG 이온채널 저해제에 대한 2D-QSAR 분석)

  • Jeon, Eul-Hye;Park, Ji-Hyeon;Jeong, Jin-Hee;Lee, Sung-Kwang
    • Analytical Science and Technology
    • /
    • v.24 no.6
    • /
    • pp.533-543
    • /
    • 2011
  • The hERG (human ether-a-go-go related gene) ion channel is a main factor in cardiac repolarization, and blockade of this channel can induce arrhythmia and sudden death. Therefore, potential hERG ion channel inhibitors are now a primary concern in the drug discovery process, and much effort is focused on minimizing this cardiotoxic side effect. In this study, $IC_{50}$ data of 202 organic compounds in HEK (human embryonic kidney) cells, collected from the literature, were used to develop a predictive 2D-QSAR model. Multiple linear regression (MLR), support vector machine (SVM), and artificial neural network (ANN) machine learning methods were utilized to predict the inhibition concentration of the hERG ion channel. A population-based forward selection method with a cross-validation procedure was combined with each learning method and used to select the best descriptor subset for each learning algorithm. The best model was an ANN model based on 14 descriptors ($R^2_{CV}$=0.617, RMSECV=0.762, MAECV=0.583), while the MLR model could describe the structural characteristics of the inhibitors and their interaction with hERG receptors. The QSAR models were validated through 5-fold cross-validation and a Y-scrambling test.
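A plain greedy forward-selection loop wrapped around cross-validated MLR conveys the idea. This is a simplified stand-in for the population-based forward selection in the abstract, and the descriptor matrix is synthetic:

```python
import numpy as np

def cv_rmse(X, y, folds=5):
    """RMSE of ordinary least squares under k-fold cross-validation."""
    idx = np.arange(len(y))
    errs = []
    for f in range(folds):
        test = idx[f::folds]
        train = np.setdiff1d(idx, test)
        A = np.column_stack([np.ones(len(train)), X[train]])
        w, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred = np.column_stack([np.ones(len(test)), X[test]]) @ w
        errs.append((y[test] - pred) ** 2)
    return np.sqrt(np.concatenate(errs).mean())

def forward_select(X, y, max_feats):
    """Greedy forward selection: at each step add the descriptor that most
    reduces cross-validated RMSE; stop when no candidate improves it."""
    chosen, best = [], np.inf
    while len(chosen) < max_feats:
        scores = {j: cv_rmse(X[:, chosen + [j]], y)
                  for j in range(X.shape[1]) if j not in chosen}
        j, s = min(scores.items(), key=lambda kv: kv[1])
        if s >= best:
            break
        chosen, best = chosen + [j], s
    return chosen, best

# Synthetic "descriptors": only columns 0 and 2 actually drive the response.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 6))
y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + 0.1 * rng.normal(size=80)
chosen, score = forward_select(X, y, max_feats=6)
```

A population-based variant keeps several such candidate subsets alive in parallel instead of a single greedy path, which reduces the risk of early lock-in.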

Short-Term Prediction of Vehicle Speed on Main City Roads using the k-Nearest Neighbor Algorithm (k-Nearest Neighbor 알고리즘을 이용한 도심 내 주요 도로 구간의 교통속도 단기 예측 방법)

  • Rasyidi, Mohammad Arif;Kim, Jeongmin;Ryu, Kwang Ryel
    • Journal of Intelligence and Information Systems
    • /
    • v.20 no.1
    • /
    • pp.121-131
    • /
    • 2014
  • Traffic speed is an important measure in transportation. It can be employed for various purposes, including traffic congestion detection, travel time estimation, and road design. Consequently, accurate speed prediction is essential in the development of intelligent transportation systems. In this paper, we present an analysis and speed prediction of a certain road section in Busan, South Korea. In previous works, only the historical data of the target link were used for prediction. Here, we extract features from real traffic data by also considering the neighboring links. After obtaining the candidate features, linear regression, model tree, and k-nearest neighbor (k-NN) are employed for both feature selection and speed prediction. The experimental results show that k-NN outperforms the model tree and linear regression for the given dataset. Compared to the other predictors, k-NN significantly reduces the error measures that we use, including mean absolute percentage error (MAPE) and root mean square error (RMSE).
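The predictor itself reduces to k-NN regression over feature vectors built from the target and neighboring links, scored with MAPE and RMSE. This is a generic sketch; the feature layout and numbers below are toy stand-ins for the paper's Busan data:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Predict each test speed as the mean speed of the k nearest training
    examples under Euclidean distance."""
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def rmse(y_true, y_pred):
    """Root mean square error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Toy history: feature = [speed on target link 5 min ago, speed on an upstream link].
X_train = np.array([[30.0, 28.0], [32.0, 30.0], [31.0, 29.0], [60.0, 58.0]])
y_train = np.array([29.0, 31.0, 30.0, 59.0])
pred = knn_predict(X_train, y_train, np.array([[31.0, 29.0]]), k=3)
```

Feature selection then amounts to re-running this evaluation on candidate feature subsets and keeping whichever subset minimizes the cross-validated error.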

Feature Selection to Predict Very Short-term Heavy Rainfall Based on Differential Evolution (미분진화 기반의 초단기 호우예측을 위한 특징 선택)

  • Seo, Jae-Hyun;Lee, Yong Hee;Kim, Yong-Hyuk
    • Journal of the Korean Institute of Intelligent Systems
    • /
    • v.22 no.6
    • /
    • pp.706-714
    • /
    • 2012
  • The Korea Meteorological Administration provided a weather dataset covering the most recent four years for our very short-term heavy rainfall prediction. We divided the dataset into three parts: training, validation, and test sets. Through feature selection, we select only the important features among the 72 available, to avoid the solution space growing exponentially with the dimensionality. We used a differential evolution algorithm with two classifiers as the fitness function of the evolutionary computation to select a more accurate feature subset. One of the classifiers is the support vector machine (SVM), which shows high performance, and the other is k-nearest neighbor (k-NN), which is fast in general. In our experiments, the test results of SVM were more prominent than those of k-NN. We also preprocessed the weather data using undersampling and normalization techniques. The test results of our differential evolution algorithm were about five times better than those using all features and about 1.36 times better than those using a genetic algorithm, the best-known alternative. Running times with the genetic algorithm were about twenty times longer than with the differential evolution algorithm.
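A compact DE/rand/1/bin loop over a relaxed 0-1 feature mask shows the mechanism, with leave-one-out 1-NN accuracy as the fitness. The 1-NN fitness stands in for the paper's SVM and k-NN classifiers, the data are synthetic, and the crossover omits the usual forced mutant gene for brevity:

```python
import numpy as np

def loo_1nn_acc(X, y):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier, used as the
    fitness of a candidate feature subset."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return np.mean(y[d.argmin(axis=1)] == y)

def de_feature_select(X, y, pop=20, gens=30, F=0.5, CR=0.9, seed=0):
    """DE/rand/1/bin over [0, 1]^d; a gene > 0.5 switches its feature on."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    P = rng.random((pop, d))

    def fit(v):
        mask = v > 0.5
        return loo_1nn_acc(X[:, mask], y) if mask.any() else 0.0

    f = np.array([fit(v) for v in P])
    for _ in range(gens):
        for i in range(pop):
            a, b, c = P[rng.choice([j for j in range(pop) if j != i], 3, replace=False)]
            trial = np.where(rng.random(d) < CR, a + F * (b - c), P[i])
            ft = fit(trial)
            if ft >= f[i]:               # greedy one-to-one replacement
                P[i], f[i] = trial, ft
    best = P[f.argmax()] > 0.5
    return best, f.max()

# Synthetic weather-like data: feature 0 is informative, the rest is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
y = np.repeat([0, 1], 20)
X[:20, 0] -= 3.0
X[20:, 0] += 3.0
mask, acc = de_feature_select(X, y)
```

A genetic algorithm would replace the `a + F * (b - c)` difference-vector mutation with crossover and bit-flip mutation over binary strings; the DE arithmetic on the relaxed mask is what makes it cheap.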

Ensemble of Nested Dichotomies for Activity Recognition Using Accelerometer Data on Smartphone (Ensemble of Nested Dichotomies 기법을 이용한 스마트폰 가속도 센서 데이터 기반의 동작 인지)

  • Ha, Eu Tteum;Kim, Jeongmin;Ryu, Kwang Ryel
    • Journal of Intelligence and Information Systems
    • /
    • v.19 no.4
    • /
    • pp.123-132
    • /
    • 2013
  • As smartphones are equipped with various sensors such as the accelerometer, GPS, gravity sensor, gyros, ambient light sensor, proximity sensor, and so on, there have been many research works on making use of these sensors to create valuable applications. Human activity recognition is one such application, motivated by various welfare applications such as support for the elderly, measurement of calorie consumption, analysis of lifestyles, analysis of exercise patterns, and so on. One of the challenges faced when using smartphone sensors for activity recognition is that the number of sensors used should be minimized to save battery power. When the number of sensors used is restricted, it is difficult to realize a highly accurate activity recognizer or classifier, because it is hard to distinguish between subtly different activities relying on only limited information. The difficulty gets especially severe when the number of different activity classes to be distinguished is very large. In this paper, we show that a fairly accurate classifier can be built that distinguishes ten different activities by using only a single sensor, i.e., the smartphone accelerometer. The approach that we take to deal with this ten-class problem is the ensemble of nested dichotomies (END) method, which transforms a multi-class problem into multiple two-class problems. END builds a committee of binary classifiers in a nested fashion using a binary tree. At the root of the binary tree, the set of all the classes is split into two subsets of classes by a binary classifier. At a child node of the tree, a subset of classes is again split into two smaller subsets by another binary classifier. Continuing in this way, we obtain a binary tree where each leaf node contains a single class. This binary tree can be viewed as a nested dichotomy that can make multi-class predictions.
Depending on how a set of classes is split into two subsets at each node, the final tree that we obtain can be different. Since there can be classes that are correlated, a particular tree may perform better than the others. However, we can hardly identify the best tree without deep domain knowledge. The END method copes with this problem by building multiple dichotomy trees randomly during learning, and then combining the predictions made by each tree during classification. The END method is generally known to perform well even when the base learner is unable to model complex decision boundaries. As the base classifier at each node of the dichotomy, we have used another ensemble classifier called the random forest. A random forest is built by repeatedly generating decision trees, each time with a different random subset of features, using a bootstrap sample. By combining bagging with random feature subset selection, a random forest enjoys the advantage of having more diverse ensemble members than simple bagging. As an overall result, our ensemble of nested dichotomies can be seen as a committee of committees of decision trees that can deal with a multi-class problem with high accuracy. The ten classes of activities that we distinguish in this paper are 'Sitting', 'Standing', 'Walking', 'Running', 'Walking Uphill', 'Walking Downhill', 'Running Uphill', 'Running Downhill', 'Falling', and 'Hobbling'. The features used for classifying these activities include not only the magnitude of the acceleration vector at each time point but also the maximum, minimum, and standard deviation of the vector magnitude within a time window of the last 2 seconds, etc. For experiments comparing the performance of END with that of other methods, accelerometer data were collected at every 0.1 second for 2 minutes for each activity from 5 volunteers.
Among the 5,900 ($=5{\times}(60{\times}2-2)/0.1$) data points collected for each activity (the data for the first 2 seconds are discarded because they do not have time window data), 4,700 have been used for training and the rest for testing. Although 'Walking Uphill' is often confused with some other similar activities, END has been found to classify all ten activities with a fairly high accuracy of 98.4%. In comparison, the accuracies achieved by a decision tree, a k-nearest neighbor, and a one-versus-rest support vector machine were 97.6%, 96.5%, and 97.6%, respectively.
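The nested-dichotomy construction described above is straightforward to sketch. Here each internal node uses a simple nearest-centroid stub in place of the random-forest base classifiers the paper uses, and the three "activity classes" in 2-D feature space are invented for illustration:

```python
import numpy as np

class Dichotomy:
    """One nested-dichotomy tree: recursively split the current class set into
    two random subsets and train a binary classifier at each internal node
    (here a nearest-centroid stub instead of a random forest)."""
    def __init__(self, X, y, classes, rng):
        self.classes = list(classes)
        if len(self.classes) == 1:
            return
        perm = rng.permutation(len(self.classes))
        half = len(self.classes) // 2
        left_set = [self.classes[i] for i in perm[:half]]
        right_set = [self.classes[i] for i in perm[half:]]
        in_left = np.isin(y, left_set)
        self.c_left = X[in_left].mean(axis=0)      # binary "classifier": two centroids
        self.c_right = X[~in_left].mean(axis=0)
        self.left = Dichotomy(X[in_left], y[in_left], left_set, rng)
        self.right = Dichotomy(X[~in_left], y[~in_left], right_set, rng)

    def predict(self, x):
        if len(self.classes) == 1:
            return self.classes[0]
        if np.linalg.norm(x - self.c_left) < np.linalg.norm(x - self.c_right):
            return self.left.predict(x)
        return self.right.predict(x)

def end_predict(trees, x):
    """Ensemble of nested dichotomies: majority vote over the trees."""
    votes = [t.predict(x) for t in trees]
    return max(set(votes), key=votes.count)

# Three tight "activity" clusters in a 2-D feature space.
rng = np.random.default_rng(0)
means = {0: (0.0, 0.0), 1: (6.0, 0.0), 2: (0.0, 6.0)}
X = np.vstack([rng.normal(means[c], 0.3, size=(20, 2)) for c in means])
y = np.repeat([0, 1, 2], 20)
trees = [Dichotomy(X, y, [0, 1, 2], np.random.default_rng(s)) for s in range(5)]
```

Each random permutation yields a different tree, and the vote across trees averages out an unlucky grouping of correlated classes, which is exactly the point of the ensemble.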

An Optimized Combination of π-fuzzy Logic and Support Vector Machine for Stock Market Prediction (주식 시장 예측을 위한 π-퍼지 논리와 SVM의 최적 결합)

  • Dao, Tuanhung;Ahn, Hyunchul
    • Journal of Intelligence and Information Systems
    • /
    • v.20 no.4
    • /
    • pp.43-58
    • /
    • 2014
  • As the use of trading systems has increased rapidly, many researchers have become interested in developing effective stock market prediction models using artificial intelligence techniques. Stock market prediction involves multifaceted interactions between market-controlling factors and unknown random processes. A successful prediction model achieves the most accurate result from minimal input data with the least complex model. In this research, we develop a combined model of ${\pi}$-fuzzy logic and support vector machine (SVM) models, using a genetic algorithm to optimize the parameters of the SVM and the ${\pi}$-fuzzy functions, as well as to perform feature subset selection, to improve the performance of stock market prediction. To evaluate the proposed model, we compare it with logistic regression, multiple discriminant analysis, classification and regression tree, artificial neural network, SVM, and fuzzy SVM models on the same data. The results show that our model outperforms all comparative models in prediction accuracy as well as return on investment.
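The ${\pi}$-shaped membership function has a standard closed form. The parameterization below is the common textbook/MATLAB-style one with feet `a`, `d` and shoulders `b`, `c`; the paper's genetic algorithm would tune these break points alongside the SVM hyperparameters, a detail not reproduced here:

```python
import numpy as np

def smf(x, a, b):
    """S-shaped ramp: 0 below a, 1 above b, two quadratic splines in between."""
    m = (a + b) / 2.0
    return np.where(x <= a, 0.0,
           np.where(x <= m, 2 * ((x - a) / (b - a)) ** 2,
           np.where(x <= b, 1 - 2 * ((x - b) / (b - a)) ** 2, 1.0)))

def pimf(x, a, b, c, d):
    """Pi-shaped membership: rises on [a, b], stays at 1 on [b, c], falls on [c, d]."""
    return smf(x, a, b) * (1.0 - smf(x, c, d))

# Membership grades of a few raw indicator values under pi(1, 2, 3, 4).
grades = pimf(np.array([0.0, 1.5, 2.5, 3.5, 5.0]), 1.0, 2.0, 3.0, 4.0)
```

Fuzzifying each raw market indicator through such functions before the SVM smooths hard thresholds into graded memberships, which is the role the ${\pi}$-fuzzy stage plays in the combined model.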

A Novel Compressed Sensing Technique for Traffic Matrix Estimation of Software Defined Cloud Networks

  • Qazi, Sameer;Atif, Syed Muhammad;Kadri, Muhammad Bilal
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.12 no.10
    • /
    • pp.4678-4702
    • /
    • 2018
  • Traffic matrix estimation has always caught the attention of researchers for better network management and future planning. With the advent of high traffic loads due to cloud computing platforms and Software Defined Networking-based tunable routing and traffic management algorithms on the Internet, it is more necessary than ever to be able to predict current and future traffic volumes on the network. For large networks, this origin-destination traffic prediction problem takes the form of a large under-constrained and under-determined system of equations with a dynamic measurement matrix. Previously, researchers relied on the assumption that the measurement (routing) matrix is stationary, due to which those schemes are not suitable for modern software defined networks. In this work, we present our Compressed Sensing with Dynamic Model Estimation (CS-DME) architecture, suitable for modern software defined networks. Our main contributions are: (1) We formulate an approach in which the measurement matrix in the compressed sensing scheme can be accurately and dynamically estimated through a reformulation of the problem based on traffic demands. (2) By inspecting its eigenspectrum on two real-world datasets, we show that a problem formulation using a dynamic measurement matrix based on instantaneous traffic demands may be used instead of a stationary binary routing matrix, and is better suited to modern Software Defined Networks whose routing is constantly evolving. (3) We also show that linking this compressed measurement matrix dynamically with the measured parameters leads to acceptable estimation of origin-destination (OD) traffic flows, with results only marginally worse than state-of-the-art schemes that rely on fixed measurement matrices.
(4) Furthermore, using this compressed reformulated problem, a new strategy for the selection of vantage points for the most efficient traffic matrix estimation is presented, through a secondary compression technique based on a subset of link measurements. Experimental evaluation of the proposed technique using the real-world datasets Abilene and GEANT shows that it is practical for use in modern software defined networks. Further, the performance of the scheme is compared with recent state-of-the-art techniques proposed in the research literature.

Loss of p15INK4b Expression in Colorectal Cancer is Linked to Ethnic Origin

  • Abdel-Rahman, Wael Mohamed;Nieminen, Taina Tuulikki;Shoman, Soheir;Eissa, Saad;Peltomaki, Paivi
    • Asian Pacific Journal of Cancer Prevention
    • /
    • v.15 no.5
    • /
    • pp.2083-2087
    • /
    • 2014
  • Colorectal cancers remain a common cause of cancer-related death. Early-onset cases, as well as those of various ethnic origins, have aggressive clinical features, the basis of which requires further exploration. The aim of this work was to examine the expression patterns of $p15^{INK4b}$ and SMAD4 in colorectal carcinomas of different ethnic origins. Fifty-five sporadic colorectal carcinomas of Egyptian origin, 25 of which were early onset, and 54 cancers of Finnish origin were immunohistochemically stained with antibodies against the $p15^{INK4b}$ and SMAD4 proteins. Data were compared to the methylation status of the $p15^{INK4b}$ gene promoter. $p15^{INK4b}$ was totally lost or deficient (lost in ${\geq}50%$ of tumor cells) in 47/55 (85%) tumors of Egyptian origin as compared to 6/50 (12%) tumors of Finnish origin (p=7e-15). Among the Egyptian cases with $p15^{INK4b}$ loss and available promoter methylation status, 89% of cases that lost $p15^{INK4b}$ expression were associated with $p15^{INK4b}$ gene promoter hypermethylation. SMAD4 was lost or deficient in 25/54 (46%) tumors of Egyptian origin and 28/48 (58%) tumors of Finnish origin. 22/54 (41%) Egyptian tumors showed combined loss/deficiency of both $p15^{INK4b}$ and SMAD4, while $p15^{INK4b}$ was selectively lost/deficient with positive SMAD4 expression in 24/54 (44%) tumors. Loss of $p15^{INK4b}$ was associated with older age at presentation (>50 years) in the Egyptian tumors (p=0.04). These data show for the first time that loss of $p15^{INK4b}$ expression marks a subset of colorectal cancers and that ethnic origin may play a role in this selection. In a substantial number of cases, the loss was independent of SMAD4 but rather associated with $p15^{INK4b}$ gene promoter hypermethylation and older age, which could be related to different environmental exposures.