• Title/Summary/Keyword: Data Clustering

3-tag-based Web Image Retrieval Technique (3-태그 기반의 웹 이미지 검색 기법)

  • Lee, Si-Hwa;Hwang, Dae-Hoon
    • Journal of Korea Multimedia Society / v.15 no.9 / pp.1165-1173 / 2012
  • One of the most popular technologies in Web 2.0 is tagging, which is widely applied to Web content as well as multimedia data such as images and video. Web users have expected that the tags they create would be reused in information search and would maximize search efficiency, but wrong tags added by irresponsible Web users have led to incorrect search results. In past papers, we gathered various information resources and tags scattered across the Web, mapped each tag onto other tags, and clustered these tags according to the correlation between them. This paper proposes a 3-tag based search algorithm that uses the clustered tags from those papers. For performance evaluation, the proposed algorithm is compared with the image search results of Flickr, a typical tag-based site, and is evaluated in terms of accuracy and recall.

A Transaction Data Study of the Day-of-the-Week Clustering Patterns Induced by the Discreteness of Observed Stock Prices - Further Evidence : The Case of the Stock Market in Korea (이산성으로 인한 요일별 관찰주가의 군집현상에 관한 거래자료 연구 - 한국 주식시장에서의 일별주가변동을 중심으로 -)

  • Choi, Don-Il
    • Korean Business Review / v.7 / pp.165-196 / 1994
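  • Harris (1986) [22] states that evidence of the day-of-the-week effect in stock prices appears in studies of daily close-to-close returns on broad market indices. These studies conclusively demonstrate systematic return patterns, in particular negative Monday returns. Harris (1990) [24] argues that clustering should be taken into account when analyzing the effect of price discreteness on estimators. In particular, if clustering results from traders using a coarser discrete price set than the one based on the prescribed minimum price variation, the biases in the variance and serial covariance estimators identified by Gottlieb and Kalay (1985) [21] and Harris (1990) [24] would be much more severe. He also states that all studies should consider clustering, because discreteness is a significant characteristic of transaction prices. If a day-of-the-week effect exists in the stock market, this raises questions such as whether the distribution of the last digits of daily stock prices, induced by the discreteness of observed prices, differs between Monday and the other days of the week, and whether the degree of price resolution by day of the week differs with (1) the price level, (2) the volatility of stock returns, and (3) the trading volume in the market. Accordingly, to study transaction data on the day-of-the-week clustering of observed stock prices induced by discreteness, this study examined daily stock price transaction data from the Korean stock market for the most recent available period, the four years and six months from January 4, 1990 to June 30, 1994, and performed an empirical analysis. According to the results, the day-of-the-week effect in stock prices does not appear to be attributable to the discreteness of observed prices, in particular the tick size of quotes. However, as in Choi (1993) [7], the results make it difficult to accept the claims of Gottlieb and Kalay (1985) [21] and Ball (1988) [9]. The results of this study, which extends Choi (1993) [7], also differ from those of Choi (1993).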

Molecular Characterization of 170 New gDNA-SSR Markers for Genetic Diversity in Button Mushroom (Agaricus bisporus)

  • An, Hyejin;Jo, Ick-Hyun;Oh, Youn-Lee;Jang, Kab-Yeul;Kong, Won-Sik;Sung, Jwa-Kyung;So, Yoon-Sup;Chung, Jong-Wook
    • Mycobiology / v.47 no.4 / pp.527-532 / 2019
  • We designed 170 new simple sequence repeat (SSR) markers based on the whole-genome sequence data of button mushroom (Agaricus bisporus), and selected 121 polymorphic markers. Across the 121 polymorphic markers, the average major allele frequency (MAF) and the average number of alleles (NA) were 0.50 and 5.47, respectively. The average number of genotypes (NG), observed heterozygosity (HO), expected heterozygosity (HE), and polymorphic information content (PIC) were 6.177, 0.227, 0.619, and 0.569, respectively. Pearson's correlation coefficient showed that MAF was negatively correlated with NG (-0.683), NA (-0.600), HO (-0.584), and PIC (-0.941). NG, NA, HO, and PIC were positively correlated with the other polymorphism parameters except for MAF. UPGMA clustering showed that 26 A. bisporus accessions were classified into 3 groups, and each accession was differentiated. The 121 SSR markers should facilitate the use of molecular markers in button mushroom breeding and genetic studies.
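
The per-marker statistics named in this abstract can be illustrated with a minimal sketch. It assumes the standard textbook definitions (expected heterozygosity as one minus the sum of squared allele frequencies, and PIC as defined by Botstein et al.); the abstract does not state which estimators or software the authors used, and the function name `marker_stats` and the input layout are hypothetical.

```python
from itertools import combinations


def marker_stats(allele_counts):
    """Return (MAF, HE, PIC) for one marker.

    allele_counts: dict mapping allele label -> observed count
    (hypothetical input layout, chosen for illustration only).
    """
    total = sum(allele_counts.values())
    freqs = [count / total for count in allele_counts.values()]

    # Major allele frequency: frequency of the most common allele.
    maf = max(freqs)

    # Expected heterozygosity (biased form): 1 - sum(p_i^2).
    he = 1.0 - sum(p * p for p in freqs)

    # PIC: 1 - sum(p_i^2) - sum_{i<j} 2 * p_i^2 * p_j^2.
    cross = sum(2 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    pic = he - cross
    return maf, he, pic


if __name__ == "__main__":
    # Toy marker with three alleles; the counts are made up.
    print(marker_stats({"180bp": 6, "184bp": 3, "190bp": 1}))
```

With these definitions, a marker dominated by one allele has a high MAF and a low PIC, which is consistent with the strong negative correlation between MAF and PIC reported above.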

Mining Approximate Sequential Patterns in a Large Sequence Database (대용량 순차 데이터베이스에서 근사 순차패턴 탐색)

  • Kum Hye-Chung;Chang Joong-Hyuk
    • The KIPS Transactions:PartD / v.13D no.2 s.105 / pp.199-206 / 2006
  • Sequential pattern mining is an important data mining task with broad applications. However, conventional methods may meet inherent difficulties in mining databases with long sequences and noise. They may generate a huge number of short and trivial patterns but fail to find interesting patterns shared by many sequences. In this paper, to overcome these problems, we propose the theme of approximate sequential pattern mining, roughly defined as identifying patterns approximately shared by many sequences. The proposed method works in two steps: the first clusters the target sequences by similarity, and the second finds consensus patterns that are similar to the sequences in each cluster directly through multiple alignment. For this purpose, a novel structure called a weighted sequence is presented to compress the alignment result, and the longest consensus pattern that represents each cluster is generated from its weighted sequence. Finally, the effectiveness of the proposed method is verified by a set of experiments.
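
The second step described above can be sketched as follows. This is a simplified illustration, not the authors' algorithm: it assumes the sequences in a cluster have already been aligned (gaps are written as `None`), and the function names and the `min_strength` cutoff are hypothetical.

```python
from collections import Counter


def weighted_sequence(aligned):
    """Compress aligned sequences into per-position item counts.

    aligned: list of sequences of equal length; each position holds a
    set of items or None for an alignment gap (assumed input layout).
    """
    columns = []
    for j in range(len(aligned[0])):
        counts = Counter()
        for seq in aligned:
            if seq[j] is not None:
                counts.update(seq[j])
        columns.append(counts)
    return columns


def consensus_pattern(columns, n_sequences, min_strength=0.5):
    """Keep items present in at least min_strength of the sequences."""
    pattern = []
    for counts in columns:
        kept = sorted(
            item for item, weight in counts.items()
            if weight / n_sequences >= min_strength
        )
        if kept:  # positions with no sufficiently common item are dropped
            pattern.append(kept)
    return pattern


if __name__ == "__main__":
    cluster = [
        [{"a"}, {"b", "c"}, None],
        [{"a"}, {"b"}, {"d"}],
        [{"a", "e"}, None, {"d"}],
    ]
    ws = weighted_sequence(cluster)
    print(consensus_pattern(ws, len(cluster)))
```

The weighted sequence stores only counts per aligned position, so the consensus can be regenerated for different cutoffs without repeating the alignment.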

Graph Construction Based on Fast Low-Rank Representation in Graph-Based Semi-Supervised Learning (그래프 기반 준지도 학습에서 빠른 낮은 계수 표현 기반 그래프 구축)

  • Oh, Byonghwa;Yang, Jihoon
    • Journal of KIISE / v.45 no.1 / pp.15-21 / 2018
  • Low-Rank Representation (LRR) based methods are widely used in many practical applications, such as face clustering and object detection, because they can guarantee high prediction accuracy when used to construct graphs in graph-based semi-supervised learning. However, solving the LRR problem requires a singular value decomposition of a square matrix whose size is the number of data points at each iteration of the algorithm; hence the calculation is inefficient. To solve this problem, we propose an improved and faster LRR method based on the recently published Fast LRR (FaLRR), and suggest ways to introduce and optimize additional constraints on the underlying optimization goals in order to address the fact that FaLRR is fast but actually performs poorly in classification problems. Our experiments confirm that the proposed method finds a better solution than LRR does. We also propose Fast MLRR (FaMLRR), which shows better results when the goal of minimizing is added.

Application of Machine Learning Techniques for Resolving Korean Author Names (한글 저자명 중의성 해소를 위한 기계학습기법의 적용)

  • Kang, In-Su
    • Journal of the Korean Society for Information Management / v.25 no.3 / pp.27-39 / 2008
  • In bibliographic data, the use of personal names to indicate authors makes it difficult to specify a particular author, since there are numerous authors whose personal names are the same. Resolving same-name author instances into different individuals is called author resolution, which consists of two steps: calculating author similarities and then clustering same-name author instances into different person groups. Author similarities are computed from similarities of author-related bibliographic features such as coauthors, titles of papers, and publication information, using supervised or unsupervised methods. Supervised approaches employ machine learning techniques to automatically learn the author similarity function from author-resolved training samples. So far, however, only a few machine learning methods have been investigated for author resolution. This paper provides a comparative evaluation of a variety of recent high-performing machine learning techniques on author disambiguation, and compares several methods of processing author disambiguation features such as coauthors and titles of papers.
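
The two-step procedure described above (similarity computation, then clustering) can be sketched in an unsupervised form. This is only an illustration under stated assumptions: the similarity is a plain Jaccard overlap of coauthor sets, the clustering is single-link merging above a fixed threshold, and the names `jaccard`, `cluster_authors`, and `threshold` are hypothetical. The paper itself evaluates learned similarity functions, which this sketch does not reproduce.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_authors(coauthor_sets, threshold=0.3):
    """Group same-name author instances by coauthor overlap.

    coauthor_sets: one set of coauthor names per paper on which the
    ambiguous name appears (assumed input layout).
    Returns a sorted list of clusters, each a list of record indices.
    """
    parent = list(range(len(coauthor_sets)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Step 1: pairwise similarities. Step 2: merge pairs above threshold.
    for i in range(len(coauthor_sets)):
        for j in range(i + 1, len(coauthor_sets)):
            if jaccard(coauthor_sets[i], coauthor_sets[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(coauthor_sets)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    papers = [{"Kim", "Lee"}, {"Lee", "Park"}, {"Choi"}, {"Choi", "Jung"}]
    print(cluster_authors(papers))
```

In a supervised setting, the `jaccard` call would be replaced by a similarity function learned from author-resolved training samples, while the clustering step could stay the same.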

Comparison of several criteria for ordering independent components (독립성분의 순서화 방법 비교)

  • Choi, Eunbin;Cho, Sulim;Park, Mira
    • The Korean Journal of Applied Statistics / v.30 no.6 / pp.889-899 / 2017
  • Independent component analysis (ICA) is a multivariate approach that separates mixed signals into the original signals. It is the most widely used blind source separation technique. Like principal component analysis (PCA) and factor analysis, ICA uses a linear transformation, but it differs in that it assumes the original signals are statistically independent and non-Gaussian. PCA has a natural ordering based on the cumulative proportion of explained variance; however, ICA algorithms cannot identify a unique optimal ordering of the components. Setting an order is meaningful because the major components can be used for further analysis such as clustering and low-dimensional graphs. In this paper, we compare the performance of several criteria for determining the order of the components. Kurtosis, absolute value of kurtosis, negentropy, the Kolmogorov-Smirnov statistic, and the sum of squared coefficients are considered. The criteria are evaluated by their ability to classify known groups. Two types of data are analyzed for illustration.
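
One of the criteria listed above, the absolute value of kurtosis, can be sketched as follows. The sketch assumes the components have already been estimated by some ICA algorithm and uses the population (biased) excess kurtosis; the abstract does not say which estimator the authors used, and the function names are hypothetical.

```python
def excess_kurtosis(x):
    """Population excess kurtosis: m4 / m2^2 - 3 (0 for a Gaussian)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m4 / (m2 * m2) - 3.0


def order_by_abs_kurtosis(components):
    """Return component indices sorted from most to least non-Gaussian.

    components: list of equal-length lists of floats, one per
    estimated independent component (assumed input layout).
    """
    scores = [abs(excess_kurtosis(c)) for c in components]
    return sorted(range(len(components)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    comps = [
        [-1.0, 1.0, -1.0, 1.0],           # flat two-point signal
        [0.0] * 18 + [3.0, -3.0],         # spiky, heavy-tailed signal
        [-2.0, -1.0, 0.0, 1.0, 2.0],      # roughly uniform signal
    ]
    print(order_by_abs_kurtosis(comps))
```

Using the absolute value treats strongly sub-Gaussian and strongly super-Gaussian components as equally important, whereas ordering by signed kurtosis would rank the super-Gaussian ones first.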

Principal Component Analysis and Molecular Characterization of Reniform Nematode Populations in Alabama

  • Nyaku, Seloame T.;Kantety, Ramesh V.;Cebert, Ernst;Lawrence, Kathy S.;Honger, Joseph O.;Sharma, Govind C.
    • The Plant Pathology Journal / v.32 no.2 / pp.123-135 / 2016
  • U.S. cotton production is suffering from the yield loss caused by the reniform nematode (RN), Rotylenchulus reniformis. Management of this devastating pest is of utmost importance because no upland cotton cultivar exhibits adequate resistance to RN. Nine populations of RN from distinct regions in Alabama and one population from Mississippi were studied, and thirteen morphometric features were measured on 20 male and 20 female nematodes from each population. Highly correlated variables (positive) in female and male RN morphometric parameters were observed for body length (L) and distance of vulva from the lip region (V) (r = 0.7) and for tail length (TL) and c' (r = 0.8), respectively. The first and second principal components for the female and male populations showed distinct clustering into three groups. These results show a pattern of sub-groups within the RN populations in Alabama. A one-way ANOVA on female and male RN populations showed significant differences ($p \leq 0.05$) among the variables. Multiple sequence alignment (MSA) of 18S rRNA sequences (421) showed lengths of 653 bp. Sites within the aligned sequences were conserved (53%), parsimony-informative (17%), singletons (28%), and indels (2%). Neighbor-Joining analysis showed intra- and inter-nematodal variations within the populations, as clone sequences from different nematodes clustered together irrespective of the sex of the nematode isolate. Morphologically, the three groups (I, II and III) could not be distinctly associated with the molecular data from the 18S rRNA sequences. The three groups may be identified as being non-geographically contiguous.

Habitat and Phytosociological Characters of Ceratopteris thalictroides, Endangered Plant Species on Paddy Field, in Nakdong River (논 잡초 멸종위기식물인 물고사리의 낙동강유역 자생지 최초보고 및 군락분류)

  • Choi, Byoung-Ki;Lee, Chang-Woo;Huh, Man-Kyu
    • Weed & Turfgrass Science / v.3 no.1 / pp.50-55 / 2014
  • This study aims to classify the syntaxa of Ceratopteris thalictroides-dominated communities on the Nakdong River and to collect basic data for habitat research. The communities were analyzed using the Z.-M. school's method and a numerical classification technique. The syntaxa were classified into three communities: the Persicaria japonica-Ceratopteris thalictroides community, the Lindernia procumbens-Ceratopteris thalictroides community, and the Limnophila indica-Ceratopteris thalictroides community. The ordination analysis displayed the vegetation types with respect to complex environmental gradients. After ordination and clustering analysis, effective humidity, soil stability, trampling effects, anthropogenic effects, and flooding frequency were identified as the important factors determining the vegetation pattern. The study points out the need to establish a long-term ecological site to protect such vulnerable vegetation against overexploitation and global climate change.

The Development of Evaluation Criteria Model for Discriminating Specialized General Hospital (종합전문요양기관 인정기준 모형 개발)

  • Chun Ki Hong;Kang Hye-Young;Kang Dae Ryong;Nam Chung Mo;Lee Gye-Cheol
    • Health Policy and Management / v.15 no.4 / pp.46-64 / 2005
  • This study was conducted to verify the current criteria and classification system used to determine specialized general hospital status. We propose a new classification system that is simpler and more convenient than the current one. In the new classification system, the clinical procedure was chosen as the unit of analysis in order to reflect resource consumption and the complexity and level of medical technology in determining specialized general hospitals. We developed a statistical model and applied it to 117 general hospitals that submit their national insurance claims through electronic data interchange (EDI). An analysis based on 984 clinical procedures and a variable for medical facility characteristics discriminated the current specialized general hospitals without misclassification. This means that specialized general hospital status can be determined in a new way without using the current complicated criteria. The proposed model discriminates specialized general hospitals based on the clinical procedures provided by each hospital. To cluster medical facilities of the same type using the 984 clinical procedures, we performed multidimensional scaling and divided the 117 hospitals into 4 groups along two axes: the variety of procedures and the proportion of high-technology procedures. One of the 4 groups was considered to consist of specialized general hospitals. In the discriminant analysis, we extracted the proportions of 16 clinical procedures that affect the discrimination of specialized general hospitals and identified discriminant functions that include these variables. As a result, we identified 2 discriminant functions: one for the current discriminating system and the other for the new discriminating system of specialized general hospitals.