• Title/Summary/Keyword: high dimensional data sets

A Feature Selection Method Based on Fuzzy Cluster Analysis (퍼지 클러스터 분석 기반 특징 선택 방법)

  • Rhee, Hyun-Sook
    • The KIPS Transactions:PartB
    • /
    • v.14B no.2
    • /
    • pp.135-140
    • /
    • 2007
  • Feature selection is a preprocessing technique commonly used on high-dimensional data. It studies how to select a subset or ranked list of attributes that are used to construct models describing the data. Feature selection methods attempt to explore the intrinsic properties of data by employing statistics or information theory. Recent developments have involved approaches such as correlation methods, dimensionality reduction, and mutual information techniques. Feature selection has become the focus of much research in application areas with massive and complex data sets. In this paper, we provide a feature selection method that considers data characteristics and generalization capability. It provides a computational approach to feature selection based on fuzzy cluster analysis of attribute values and on performance measures. We apply it to a system for classifying computer viruses and compare it with a heuristic method that uses the contrast concept. Experimental results show that the proposed approach can produce a feature ranking, select features, and improve system performance.

Comparative Study of NIR-based Prediction Methods for Biomass Weight Loss Profiles

  • Cho, Hyun-Woo;Liu, J. Jay
    • Clean Technology
    • /
    • v.18 no.1
    • /
    • pp.31-37
    • /
    • 2012
  • Biomass has become a major feedstock for bioenergy and other bio-based products because of its renewability and environmental benefits. Much research has addressed the prediction of crucial biomass characteristics, including the active use of spectroscopy data. Near-infrared (NIR) spectroscopy has been widely used because it is non-destructive and cost-effective and produces fast, reliable analysis results. This work developed a multivariate statistical scheme for predicting weight loss profiles from NIR spectra measured for six lignocellulosic biomass types. Wavelet analysis was used as a compression tool to suppress irrelevant noise and to select the features, or wavelengths, that better explain the NIR data. The scheme was demonstrated on real NIR data sets, on which different prediction models were evaluated in terms of prediction performance. The benefits of choosing the right pretreatment of NIR spectra are also shown. In our case, compressing the high-dimensional NIR spectra by wavelets and then applying partial least squares (PLS) modeling yielded more reliable predictions without handling the full set of noisy data. This work showed that the developed scheme can be easily applied to rapid analysis of biomass.
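The wavelet compression step described in this abstract can be illustrated with a minimal sketch. It assumes a one-level Haar transform and a hard threshold on the detail coefficients; the abstract does not state which wavelet family, decomposition depth, or thresholding rule the authors used. The function names, the threshold, and the sample spectrum are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_forward(signal):
    """One-level Haar transform: return (approximation, detail) coefficients."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) / SQRT2 for a, b in pairs]
    detail = [(a - b) / SQRT2 for a, b in pairs]
    return approx, detail

def haar_inverse(approx, detail):
    """Rebuild the signal from one level of Haar coefficients."""
    signal = []
    for a, d in zip(approx, detail):
        signal.extend(((a + d) / SQRT2, (a - d) / SQRT2))
    return signal

def compress(signal, threshold):
    """Zero out detail coefficients whose magnitude is at most `threshold`."""
    approx, detail = haar_forward(signal)
    kept = [d if abs(d) > threshold else 0.0 for d in detail]
    return approx, kept

# Hypothetical absorbance values at eight neighbouring wavelengths.
spectrum = [0.50, 0.52, 0.51, 0.90, 0.88, 0.87, 0.40, 0.41]
approx, detail = compress(spectrum, threshold=0.05)
smoothed = haar_inverse(approx, detail)
```

In this sketch only the large jump between the third and fourth values survives thresholding, so small fluctuations are averaged out while the sharp feature is kept. The surviving coefficients would then serve as inputs to a regression model such as PLS.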

Evidence for a Common Molecular Basis for Sequence Recognition of N3-Guanine and N3-Adenine DNA Adducts Involving the Covalent Bonding Reaction of (+)-CC-1065

  • Park, Hyun-Ju
    • Archives of Pharmacal Research
    • /
    • v.25 no.1
    • /
    • pp.11-24
    • /
    • 2002
  • The antitumor antibiotic (+)-CC-1065 can alkylate N3 of guanine in certain sequences. A previous high-field $^1H$ NMR study on the $(+)-CC-1065d[GCGCAATTG*CGC]_2$ adduct ($^*$ indicates the drug alkylation site) showed that drug modification on N3 of guanine results in protonation of the cross-strand cytosine [Park, H-J.; Hurley, L. H. J. Am. Chem. Soc. 1997, 119, 629]. In this contribution we describe a further analysis of the NMR data sets together with restrained molecular dynamics (rMD). This study provides not only a solution structure of the (+)-CC-1065(N3-guanine) DNA duplex adduct but also new insight into the molecular basis for the sequence-specific interaction between (+)-CC-1065 and N3-guanine in the DNA duplex. On the basis of NOESY data, we propose that the narrow minor groove at the 7T8T step and conformational kinks at the junctions of 16C17A and 18A19T are both related to DNA bending in the drug-DNA adduct. Analysis of the one-dimensional $^1H$ NMR (in $H_2O$) data and rMD trajectories strongly suggests that hydrogen bonding linkages between the 8-OH group of the (+)-CC-1065 A-subunit and the 9G10C phosphate via a water molecule are present. All the phenomena observed here in the (+)-CC-1065(N3-guanine) adduct at $5'-AATTG^*$ are reminiscent of those obtained from the studies on the (+)-CC-1065(N3-adenine) adduct at $5'-AGTTA^*$, suggesting that (+)-CC-1065 takes advantage of the conformational flexibility of the 5'-TPu step to entrap the bent structure required for the covalent bonding reaction. This study reveals a common molecular basis for (+)-CC-1065 alkylation at both $5'-TTG^*$ and $5'-TTA^*$, which involves a trapping out of sequence-dependent DNA conformational flexibility as well as sequence-dependent general acid and general base catalysis by duplex DNA.

Automatic Clustering on Trained Self-organizing Feature Maps via Graph Cuts (그래프 컷을 이용한 학습된 자기 조직화 맵의 자동 군집화)

  • Park, An-Jin;Jung, Kee-Chul
    • Journal of KIISE:Software and Applications
    • /
    • v.35 no.9
    • /
    • pp.572-587
    • /
    • 2008
  • The self-organizing feature map (SOFM), an unsupervised neural network, is a very powerful tool for clustering and visualizing high-dimensional data sets. Although the SOFM has been applied to many engineering problems, it requires a post-processing step that clusters similar weights on the trained SOFM into one class, and this step is performed manually in many cases. Traditional clustering algorithms such as k-means, however, do not yield satisfactory results on the trained SOFM, especially when clusters have arbitrary shapes. This paper proposes automatic clustering on a trained SOFM that can deal with arbitrary cluster shapes and is globally optimized by graph cuts. Graph cuts require two additional vertices in the graph, called terminals, and the weights between the terminals and the other vertices are generally set from data obtained manually by users. The proposed method sets these weights automatically, based on mode-seeking on a distance matrix. Experimental results demonstrated the effectiveness of the proposed method in texture segmentation: it improved precision rates over traditional clustering algorithms, because it can handle arbitrary cluster shapes through graph-theoretic clustering.

Application of Chiu's Two Dimensional Velocity Distribution Equations to Natural Rivers (Chiu가 제안한 2차원 유속분포식의 자연하천 적용성 분석)

  • Lee, Chan-Joo;Seo, Il-Won;Kim, Chang-Wan;Kim, Won
    • Journal of Korea Water Resources Association
    • /
    • v.40 no.12
    • /
    • pp.957-968
    • /
    • 2007
  • Accurate and highly reliable streamflow data are essential for the quantitative management of water resources. For this reason, real-time streamflow gauging methods such as the ultrasonic flowmeter and the index-velocity method have recently been introduced. Since these methods calculate the flow rate through the entire cross-section from velocities measured over only part of it, a rational, theoretical basis is necessary for accurate estimation of discharge. The purpose of the present study is to analyze the applicability of Chiu's (1987, 1988) two-dimensional velocity distribution equations by applying them to natural rivers and comparing simulated velocity distributions with observed ones obtained with an ADCP. Maximum and mean velocities are calculated from the observed data to estimate the entropy parameter M. The isovel shape parameters h and $\beta_i$ are estimated with an objective function based on the least squares criterion. When optimized parameters are applied, Chiu's velocity distributions simulate the observed ones fairly well. Using 14 simulated data sets with relatively high correlation coefficients, the properties of the parameters are analyzed, and h and $\beta_i$ are estimated for river sections where velocity is unknown. When the estimated parameters are adopted for verification, the simulated velocity distributions reproduce the real ones well. Finally, the calculated discharges show rough agreement with measured data. The results of the present study indicate that, if the related parameters are properly estimated, Chiu's velocity distribution is likely to reproduce the real velocity distribution of natural rivers.
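The step of estimating the entropy parameter M from mean and maximum velocities can be sketched as follows. The sketch relies on the relation usually attributed to Chiu, $\bar{u}/u_{max} = e^M/(e^M - 1) - 1/M$, and on the velocity profile $u = (u_{max}/M)\ln[1 + (e^M - 1)\,\xi]$ with a normalized isovel coordinate $\xi \in [0, 1]$. The abstract does not say how the authors solved for M, so the bisection solver, the function names, and the search bounds are assumptions made for illustration.

```python
import math

def mean_to_max_ratio(m):
    """phi(M) = e^M / (e^M - 1) - 1/M, the ratio of mean to maximum velocity."""
    return math.exp(m) / math.expm1(m) - 1.0 / m

def entropy_parameter(u_mean, u_max, lo=1e-3, hi=50.0, tol=1e-10):
    """Solve phi(M) = u_mean / u_max for M by bisection (phi increases with M)."""
    target = u_mean / u_max
    if not mean_to_max_ratio(lo) < target < mean_to_max_ratio(hi):
        raise ValueError("velocity ratio outside the range covered by [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_to_max_ratio(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def velocity(xi, u_max, m):
    """Velocity on the isovel with normalized coordinate xi in [0, 1]."""
    return u_max / m * math.log1p(math.expm1(m) * xi)
```

For example, `entropy_parameter(0.8, 1.0)` returns the M for which the mean velocity is 80% of the maximum, and `velocity(1.0, u_max, m)` recovers `u_max` at the isovel where the maximum occurs. The shape parameters h and $\beta_i$ that map a point in the cross-section to $\xi$ are outside the scope of this sketch.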

Data Assimilation Effect of Mobile Rawinsonde Observation using Unified Model Observing System Experiment during the Summer Intensive Observation Period in 2013 (2013년 여름철 집중관측동안 통합모델 관측시스템실험을 이용한 이동형 레윈존데 관측의 자료동화 효과)

  • Lim, Yun-Kyu;Song, Sang-Keun;Han, Sang-Ok
    • Journal of the Korean earth science society
    • /
    • v.35 no.4
    • /
    • pp.215-224
    • /
    • 2014
  • The data assimilation effect of mobile rawinsonde observations was evaluated using the Unified Model (UM) with a Three-Dimensional Variational (3DVAR) data assimilation system during the intensive observation program of the 2013 summer season (rainy season: 20 June-7 July 2013; heavy rain period: 8 July-30 July 2013). The analysis was performed with two sets of simulation experiments: (1) a ConTroL experiment (CTL) with observation data provided by the Korea Meteorological Administration (KMA) and (2) an Observing System Experiment (OSE) including both KMA and mobile rawinsonde observation data. In the model verification during the rainy season, there were no distinctive differences in 500 hPa geopotential height, 850 hPa air temperature, or 300 hPa wind speed between the CTL and OSE simulations, owing to data limitation (0000 and 1200 UTC only) at stationary rawinsonde stations. In contrast, precipitation verification using the hourly accumulated precipitation data of the Automatic Synoptic Observation System (ASOS) showed that the Equivalent Threat Score (ETS) of the OSE was improved by about 2% compared with that of the CTL. For cases in which the OSE simulation had a positive effect, the ETS of the OSE showed a significantly larger improvement (up to 41%) over that of the CTL. These results suggest that assimilating mobile rawinsonde observation data with UM 3DVAR is a reasonable way to assess improvements in prediction accuracy.
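The ETS used for precipitation verification here can be computed from a 2x2 contingency table of forecast and observed events. The following is a minimal sketch of the standard definition (also known as the equitable threat score or Gilbert skill score); the function name and the counts in the usage example are hypothetical and are not taken from the paper.

```python
def equivalent_threat_score(hits, misses, false_alarms, correct_negatives):
    """ETS = (H - H_random) / (H + M + F - H_random).

    H_random is the number of hits expected by chance,
    (H + M) * (H + F) / N, where N is the total number of cases.
    """
    total = hits + misses + false_alarms + correct_negatives
    if total == 0:
        raise ValueError("contingency table is empty")
    hits_random = (hits + misses) * (hits + false_alarms) / total
    denominator = hits + misses + false_alarms - hits_random
    if denominator == 0:
        raise ValueError("ETS is undefined for this contingency table")
    return (hits - hits_random) / denominator

# Hypothetical counts for one precipitation threshold.
score = equivalent_threat_score(hits=50, misses=20, false_alarms=30,
                                correct_negatives=900)
```

A perfect forecast gives a score of 1, and a forecast with no skill beyond chance gives 0, so comparing the CTL and OSE scores at the same threshold shows whether the extra observations helped.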

The Visual Evaluation according to various Methods of Motif Presentation and the Value contrast between the Motif and Background -Floral Pattern- (모티프의 표현방법, 모티프와 배경과의 명도대비에 따른 시각적 평가 -꽃패턴을 중심으로-)

  • 장수경
    • Journal of the Korean Home Economics Association
    • /
    • v.35 no.2
    • /
    • pp.159-172
    • /
    • 1997
  • The purpose of this study was to investigate visual evaluation according to various methods of motif presentation and the value contrast between the motif and the background. The instruments developed for this purpose were two sets of stimuli and a response scale. The first set consisted of pattern stimuli: eight photographs of floral patterns constructed using six different motif presentation methods and two different value contrasts. The second set consisted of eight clothing stimuli: photographs of clothing with the above floral patterns. A 7-point semantic differential scale of 19 bipolar adjectives was used as the response scale. The data were analyzed by factor analysis, ANOVA, and t-test. The major findings were as follows. 1. Four factors emerged to account for the dimensional structure of the floral pattern image: attractiveness, tenderness, attention, and maturity. Among them, attractiveness and tenderness were the major dimensions. 2. The patterns and the clothing did not differ significantly in attractiveness or tenderness, but they did differ in maturity and attention. The patterns presented a cute and sober image, whereas the clothing presented a mature and gorgeous image. 3. The method of motif presentation had significant effects on all the factors. The pattern made by the shading method gave the most attractive and softest image, the one by line the soberest, the one by area the most gorgeous, the one by collage the most unattractive, hardest, and cutest, and the one by mosaic the most mature. 4. The value contrast between the motif and the background had no significant effect on attractiveness or maturity, but it did affect tenderness and attention. The patterns with a high-value background presented a soft image, whereas those with a low-value background presented a hard image. The patterns with a low-value area presented a gorgeous image.

Anthropometry for clothing construction and cluster analysis ( I ) (피복구성학적 인체계측과 집낙구조분석 ( I ))

  • Kim Ku Ja
    • Journal of the Korean Society of Clothing and Textiles
    • /
    • v.10 no.3
    • /
    • pp.37-48
    • /
    • 1986
  • The purpose of this study was to analyze the 'natural groupings' of subjects in order to classify highly similar somatotypes for clothing construction. The sample was drawn randomly from senior high school boys in the Seoul urban area and comprised 425 boys between the ages of 16 and 18. Cluster analysis was used to find the hierarchical structure of the subjects from the three-dimensional distance defined by stature, bust girth, and sleeve length. The groups forming a partition can be subdivided into 5 and 6 sets according to the hierarchical tree of the given subjects. Ward's minimum variance method was applied after the distance matrix was extracted using the standardized Euclidean distance. All of the above data were analyzed on the computer installed at the Korea Advanced Institute of Science and Technology. The major findings, taking the 16-year-old age group as an example, can be summarized as follows. 1. Cluster 1 (32 persons, $18.29\%$ of the total) has a smaller bust girth than cluster 5, but the largest stature and sleeve length of all the clusters. 2. Cluster 2 (18 persons, $10.29\%$ of the total) is the group with the smallest stature and sleeve length, but a larger bust girth than cluster 3. 3. Cluster 3 (35 persons, $20\%$ of the total) is classified as the smallest group in stature, bust girth, and sleeve length. 4. Cluster 4 (60 persons, $34.29\%$ of the total) has a sleeve length equal to the mean value of the 16-year-old age group, but its stature and bust girth are smaller than the mean values of this age group. 5. Cluster 5 (30 persons, $17.14\%$ of the total) has a smaller stature and a larger bust girth than cluster 1, and a sleeve length equal to the mean value of the 16-year-old age group.
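The two building blocks named in this abstract, the standardized Euclidean distance and Ward's minimum variance criterion, can be sketched in a few lines. This is an illustration of the general definitions only: the measurement values are hypothetical, the function names are invented for the example, and the use of the sample variance for standardization is an assumption, since the abstract does not give these details.

```python
import math
import statistics

def column_variances(rows):
    """Sample variance of each measurement (column) across all subjects."""
    return [statistics.variance(column) for column in zip(*rows)]

def standardized_euclidean(x, y, variances):
    """Euclidean distance with each squared difference divided by its variance."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))

def ward_merge_cost(cluster_a, cluster_b):
    """Increase in within-cluster sum of squares caused by merging two clusters."""
    n_a, n_b = len(cluster_a), len(cluster_b)
    centroid_a = [statistics.fmean(column) for column in zip(*cluster_a)]
    centroid_b = [statistics.fmean(column) for column in zip(*cluster_b)]
    squared_gap = sum((p - q) ** 2 for p, q in zip(centroid_a, centroid_b))
    return n_a * n_b / (n_a + n_b) * squared_gap

# Hypothetical (stature, bust girth, sleeve length) values in cm.
subjects = [(170.0, 85.0, 55.0), (160.0, 80.0, 50.0), (180.0, 90.0, 60.0)]
variances = column_variances(subjects)
distance = standardized_euclidean(subjects[0], subjects[1], variances)
```

Dividing by the variance keeps stature, which has the largest spread in centimetres, from dominating the distance. At each step of the hierarchy, Ward's method merges the pair of clusters with the smallest merge cost.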

Genetic Clustering with Semantic Vector Expansion (의미 벡터 확장을 통한 유전자 클러스터링)

  • Song, Wei;Park, Soon-Cheol
    • The Journal of the Korea Contents Association
    • /
    • v.9 no.3
    • /
    • pp.1-8
    • /
    • 2009
  • This paper proposes a new document clustering system using a fuzzy logic-based genetic algorithm (GA) and semantic vector expansion. Many GA papers have noted that success depends on two factors: the diversity of the population and the capability to converge. We use fuzzy logic-based operators to adaptively adjust the balance between these two factors. In traditional document clustering, the most popular and straightforward way to represent a document is the vector space model (VSM). However, this approach not only leads to a high-dimensional feature space but also ignores the semantic relationships between some important words, which affects the accuracy of clustering. In this paper, we use latent semantic analysis (LSA) to expand documents into corresponding semantic vectors at the conceptual level, rather than representing them by individual terms. At the same time, the sizes of the vectors can be reduced drastically. We test our clustering algorithm on the 20 Newsgroups and Reuters collection data sets. The results show that our method outperforms the conventional GA in various document representation environments.

Comparison of the Wind Speed from an Atmospheric Pressure Map (Na Wind) and Satellite Scatterometer-observed Wind Speed (NSCAT) over the East (Japan) Sea

  • Park, Kyung-Ae;Kim, Kyung-Ryul;Kim, Kuh;Chung, Jong-Yul;Conillor, Peter-C.
    • Journal of the korean society of oceanography
    • /
    • v.38 no.4
    • /
    • pp.173-184
    • /
    • 2003
  • Major differences between wind speeds from atmospheric pressure maps (Na wind) and near-surface wind speeds derived from satellite scatterometer (NSCAT) observations over the East (Japan) Sea have been examined. The root-mean-square errors of Na wind and NSCAT wind speeds collocated with Japanese Meteorological Agency (JMA) buoy winds are about $3.84\;\mathrm{m\,s^{-1}}$ and $1.53\;\mathrm{m\,s^{-1}}$, respectively. Time series of NSCAT wind speeds showed a high coherency of 0.92 with the real buoy measurements and contained higher spectral energy at low frequencies (periods >3 days) than the Na wind. The magnitudes of monthly Na winds are lower than NSCAT winds by up to 45%, particularly in September 1996. The spatial structures of the two are mostly coherent on basin-wide large scales; however, significant differences and energy loss are found on spatial scales of less than 100 km. This was evidenced by the temporal EOFs (Empirical Orthogonal Functions) of the two wind speed data sets and by their two-dimensional spectra. Since the Na wind was based on the atmospheric pressures on the weather map, it overlooked small-scale features of less than 100 km. The center of the cold-air outbreak through Vladivostok, expressed by the Na wind in January 1997, was shifted towards the North Korean coast when compared with that of the NSCAT wind, whereas NSCAT winds revealed its temporal evolution as well as spatial distribution.
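The root-mean-square error quoted in this abstract compares each wind product with collocated buoy winds. A minimal sketch of that calculation follows; the function name and the sample values are hypothetical, and the collocation step that pairs each estimate with a buoy measurement is assumed to have been done already.

```python
import math

def rmse(estimates, references):
    """Root-mean-square error between paired estimates and reference values."""
    if not estimates or len(estimates) != len(references):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(e - r) ** 2 for e, r in zip(estimates, references)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical collocated wind speeds in m/s.
buoy_winds = [4.0, 7.0, 11.0]
product_winds = [5.0, 7.0, 9.0]
error = rmse(product_winds, buoy_winds)
```

Applying the same function to each product against the same buoy series gives directly comparable error values.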