• Title/Summary/Keyword: Term Clustering

Creation and clustering of proximity data for text data analysis (텍스트 데이터 분석을 위한 근접성 데이터의 생성과 군집화)

  • Jung, Min-Ji;Shin, Sang Min;Choi, Yong-Seok
    • The Korean Journal of Applied Statistics / v.32 no.3 / pp.451-462 / 2019
  • A document-term frequency matrix is a type of data used in text mining. The matrix is often built from various documents provided by the objects to be analyzed. When analyzing objects with this matrix, researchers generally select as keywords only the terms that are common to the documents belonging to one object, and then use those keywords to analyze the object. However, this approach loses the information unique to individual documents and removes potential keywords that occur frequently in a specific document. In this study, we define data that can overcome this problem as proximity data. We introduce twelve methods for generating proximity data and cluster the objects using two methods, multidimensional scaling and k-means cluster analysis. Finally, we choose the method best suited to clustering the objects.
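The abstract does not say which twelve proximity measures were used, so the following is only a minimal sketch under assumptions: cosine distance stands in for one possible proximity measure, the toy `doc_term` matrix is invented for illustration, and a small hand-written k-means (with fixed initial centers) is run on the rows of the proximity matrix. The multidimensional scaling step is omitted.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)

def proximity_matrix(doc_term):
    """Pairwise distances between the rows (objects) of a document-term matrix."""
    n = len(doc_term)
    return [[cosine_distance(doc_term[i], doc_term[j]) for j in range(n)]
            for i in range(n)]

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, init_centers, iterations=20):
    """Plain k-means with caller-supplied initial centers; returns cluster labels."""
    centers = [list(c) for c in init_centers]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: squared_distance(pt, centers[k]))
            for pt in points
        ]
        # Update step: each center becomes the mean of its members.
        for k in range(len(centers)):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical term counts: rows are objects, columns are terms.
doc_term = [
    [4, 3, 0, 0],
    [5, 2, 0, 1],
    [0, 0, 6, 2],
    [0, 1, 4, 3],
]
prox = proximity_matrix(doc_term)
labels = kmeans(prox, init_centers=[prox[0], prox[2]])
print(labels)  # -> [0, 0, 1, 1]
```

Clustering the rows of the proximity matrix, rather than the raw counts, is one simple way to use proximity data as k-means input; the paper may well proceed differently.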

A Study on the Analysis of Representative Bus Crash Types through Establishment of Bus In-depth Accident Data (버스 실사고 데이터 구축을 통한 대표 버스충돌유형 분석 연구)

  • Kim, Hyung Jun;Jang, Jeong Ah;Lee, Insik;Yi, Yongju;Oh, Sei Chang
    • Journal of Auto-vehicle Safety Association / v.12 no.4 / pp.39-47 / 2020
  • In this study, the crash situations of representative bus crash types were identified by analyzing a total of 1,416 bus repair records collected in 2018-2019, using k-means clustering. Each bus repair record contains the repair term, type of bus operation, responsibility for the accident, weather condition, road surface condition, type of accident, other party, type of road, and type of location. In addition, by checking the collision parts in each bus repair record, each record was classified by collision region: 760 records were classified as the frontal type, 363 as the middle-frontal type, 374 as the middle-rear type, and 331 as the rear type. K-means clustering was then performed on each collision-region type. As a result, this study analyzed the severity of bus crashes based on actual bus accident data drawn from bus repair records rather than crash data from the TAAS, and presented the crash situations of representative bus crash types. This study is expected to be extended to analyzing hydrogen bus crashes and defining indicators of hydrogen bus safety.

Clustering based Novel Interference Management Scheme in Dense Small Cell Network (밀집한 소형셀 네트워크에서 클러스터링 기반 새로운 간섭 관리 기법)

  • Moon, Sangmi;Chu, Myeonghun;Lee, Jihye;Kwon, Soonho;Kim, Hanjong;Kim, Daejin;Hwang, Intae
    • Journal of the Institute of Electronics and Information Engineers / v.53 no.5 / pp.13-18 / 2016
  • In Long Term Evolution-Advanced (LTE-A), small cell enhancement (SCE) has been developed as a cost-effective way of supporting the exponentially increasing demand for wireless data services and satisfying user quality of service (QoS). However, the dense and irregular distribution of a large number of small cells causes many problems, such as degradation of transmission rate and transmission quality. In this paper, we propose a clustering-based interference management scheme for dense small cell networks. We divide the small cells into clusters according to the reference signal received power (RSRP) from user equipment (UE). Within a cluster, an almost blank subframe (ABS) is implemented to mitigate interference between the small cells. In addition, we apply power control to reduce interference between the clusters. Simulation results show that the proposed scheme can improve the signal to interference plus noise ratio (SINR), throughput, and spectral efficiency of small cell users, and ultimately overall cell performance.

The Development of Dynamic Forecasting Model for Short Term Power Demand using Radial Basis Function Network (Radial Basis 함수를 이용한 동적 - 단기 전력수요예측 모형의 개발)

  • Min, Joon-Young;Cho, Hyung-Ki
    • The Transactions of the Korea Information Processing Society / v.4 no.7 / pp.1749-1758 / 1997
  • This paper develops a dynamic forecasting model for short-term power demand based on a Radial Basis Function Network and Pal's GLVQ algorithm. Radial Basis Function methods are often compared with the backpropagation-trained feed-forward network, which is the most widely used neural network paradigm. The Radial Basis Function Network is a feed-forward neural network with a single hidden layer. Each node of the hidden layer has a parameter vector called a center, which is determined by a clustering algorithm. Classical approaches to clustering include those of Hartigan (K-means algorithm), Kohonen (Self-Organizing Feature Maps: SOFM, and Learning Vector Quantization: LVQ), and Carpenter and Grossberg (ART-2 model). In this model, the first step organizes the load patterns into two clusters using Pal's GLVQ clustering algorithm. GLVQ is used because it can classify the patterns better than the other algorithms. The second step forecasts hourly load patterns with a radial basis function network constructed with two hidden nodes, which are determined from the GLVQ cluster centers found in the first step. The model was applied to forecast the hourly loads on Mar. 4th, Jun. 4th, Jul. 4th, Sep. 4th, and Nov. 4th, 1995, after training on the data for Mar. 1st to 3rd, Jun. 1st to 3rd, Jul. 1st to 3rd, Sep. 1st to 3rd, and Nov. 1st to 3rd, 1995, respectively. In the experiments, the average absolute error of one-hour-ahead forecasts on actual utility data was 1.3795%.
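To make the two-hidden-node structure concrete, here is a minimal forward-pass sketch. It assumes Gaussian basis functions with a shared width, which the abstract does not specify, and it takes the two centers as given instead of running GLVQ. All numbers (`centers`, `width`, `weights`, `bias`) are hypothetical placeholders, not values from the paper.

```python
import math

def rbf_activations(x, centers, width):
    """Gaussian activation of each hidden node for input vector x (assumed basis)."""
    return [
        math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / (2 * width ** 2))
        for c in centers
    ]

def rbf_predict(x, centers, width, weights, bias):
    """Network output: bias plus a weighted sum of the hidden activations."""
    phi = rbf_activations(x, centers, width)
    return bias + sum(w * p for w, p in zip(weights, phi))

# Two hidden nodes, one per cluster center (hypothetical normalized load patterns).
centers = [[0.2, 0.3], [0.8, 0.7]]
width = 0.5
weights = [1.0, 2.0]
bias = 0.5

phi = rbf_activations(centers[0], centers, width)
prediction = rbf_predict(centers[0], centers, width, weights, bias)
print(phi, prediction)
```

An input that coincides with a center activates that node fully (activation 1.0), and the activation decays as the input moves away, which is why the quality of the cluster centers matters for the forecast.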

A Term Weight Mensuration based on Popularity for Search Query Expansion (검색 질의 확장을 위한 인기도 기반 단어 가중치 측정)

  • Lee, Jung-Hun;Cheon, Suh-Hyun
    • Journal of KIISE:Software and Applications / v.37 no.8 / pp.620-628 / 2010
  • With Internet use pervasive in everyday life, people can now retrieve a great deal of information through the web. However, the exponential growth in the quantity of information on the web has limited the performance of online search engines, which return piles of unwanted information. As a result, web users now need more time and effort than in the past to find the information they need. This paper suggests a method that uses query expansion to bring wanted information to web users quickly. In experiments where the search subject did not change, Popularity-based Term Weight Mensuration performed better than TF-IDF and Simple Popularity Term Weight Mensuration. When the subject changed during a search, its performance changed less than that of the other methods.
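For reference, the TF-IDF baseline named in the abstract can be sketched as follows. This is one common variant (relative term frequency times the natural log of inverse document frequency) and not necessarily the exact formula the authors compared against; the popularity-based weighting itself is not described in the abstract and is not attempted here. The toy `docs` are invented.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical tokenized documents.
docs = [
    ["query", "expansion", "search"],
    ["search", "engine"],
    ["query", "term", "term"],
]
weights = tf_idf(docs)
print(weights[2]["term"])
```

A term that appears in every document gets weight 0 under this variant, which is the behavior that motivates looking for additional signals such as popularity.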

Short-term Forecasting of Power Demand based on AREA (AREA 활용 전력수요 단기 예측)

  • Kwon, S.H.;Oh, H.S.
    • Journal of Korean Society of Industrial and Systems Engineering / v.39 no.1 / pp.25-30 / 2016
  • For industry and the national economy, it is critical to forecast the maximum daily and monthly demand for power with as little error as possible. In general, long-term forecasting of power demand has been studied both from the consumer's perspective and with econometric models in the form of a generalized linear model with predictors. Time series techniques without predictors are used for short-term forecasting, because predictors must themselves be predicted before the response variable is forecast, and estimation error inevitably enters during this process. In previous research, various approaches have been applied to short-term power supply forecasting, including the seasonal exponential smoothing method, SARMA (Seasonal Auto Regressive Moving Average) with consideration of the weekly pattern, the Neuro-Fuzzy model, the SVR (Support Vector Regression) model with predictors explored through machine learning, and the K-means clustering technique. In this paper, SARMA and an intervention model are fitted to forecast the daily, weekly, and monthly maximum power load using empirical data from 2011 through 2013. $ARMA(2,1,2)(1,1,1)_7$ and $ARMA(0,1,1)(1,1,0)_{12}$ are fitted to the daily and monthly power demand, respectively, but the weekly power demand is not fitted by ARIMA because the series has a unit root. In the fitted intervention model, the factors of long holidays, summer, and winter are significant in the form of indicator functions. SARMA, with a MAPE (Mean Absolute Percentage Error) of 2.45%, and the intervention model, with a MAPE of 2.44%, are more efficient than the present seasonal exponential smoothing, with a MAPE of about 4%. A dynamic regression model with humidity, temperature, and seasonal dummies as predictors was also applied to forecast the daily power demand, but it led to a high MAPE of 3.5%, reflecting estimation error in the predictors.
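The models above are compared by MAPE. As a quick illustration of how that figure is computed, here is the standard definition (mean of absolute errors relative to the actual values, expressed as a percentage) applied to made-up load values; the numbers are hypothetical and unrelated to the paper's data.

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent. Actual values must be nonzero."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and equally long")
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

# Hypothetical daily maximum loads and their forecasts.
actual = [100.0, 200.0, 400.0]
forecast = [98.0, 205.0, 390.0]
error = mape(actual, forecast)
print(round(error, 2))  # -> 2.33
```

Because each error is scaled by the actual value, MAPE lets daily and monthly forecasts of very different magnitudes be compared on the same percentage scale.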

Alleviating Semantic Term Mismatches in Korean Information Retrieval (한국어 정보 검색에서 의미적 용어 불일치 완화 방안)

  • Yun, Bo-Hyun;Park, Sung-Jin;Kang, Hyun-Kyu
    • The Transactions of the Korea Information Processing Society / v.7 no.12 / pp.3874-3884 / 2000
  • An information retrieval system has to retrieve all and only the documents that are relevant to a user query, even if index terms and query terms do not match exactly. However, mismatches between index terms and query terms have been a serious obstacle to improving retrieval performance. In this paper, we discuss automatic term normalization between words in text corpora and its application to a Korean information retrieval system. We perform two types of term normalization to alleviate semantic term mismatches: equivalence classes and co-occurrence clusters. First, transliterations, spelling errors, and synonyms are normalized into equivalence classes by using contextual similarity. Second, context-based terms are normalized by using a combination of mutual information and word context to establish word similarities. Next, unsupervised clustering is performed with the K-means algorithm, and co-occurrence clusters are identified. These normalized term products are used in query expansion to alleviate semantic term mismatches. In other words, we use the two kinds of term normalization, equivalence class and co-occurrence cluster, to expand users' queries with new terms, in an attempt to make the queries more comprehensive (adding transliterations) or more specific (adding specializations). For query expansion, we employ two complementary methods: term suggestion and term relevance feedback. The experimental results show that the proposed system can alleviate semantic term mismatches and can also provide appropriate similarity measurements. As a result, the system can improve the retrieval efficiency of the information retrieval system.
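The abstract does not give the exact mutual information formula, so the sketch below assumes pointwise mutual information (PMI) computed from co-occurrence within the same context window, which is one common way to score word association before clustering. The `contexts` data and the function name `pmi_table` are illustrative only.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(contexts):
    """PMI (base 2) for every word pair that co-occurs in at least one context."""
    n = len(contexts)
    word_count = Counter()
    pair_count = Counter()
    for ctx in contexts:
        words = sorted(set(ctx))  # presence per context, in a stable order
        word_count.update(words)
        pair_count.update(combinations(words, 2))
    return {
        pair: math.log2(
            (count / n) / ((word_count[pair[0]] / n) * (word_count[pair[1]] / n))
        )
        for pair, count in pair_count.items()
    }

# Hypothetical contexts (e.g. sentences reduced to index terms).
contexts = [
    ["computer", "pc", "screen"],
    ["computer", "pc"],
    ["river", "bank"],
    ["bank", "loan"],
]
table = pmi_table(contexts)
print(table[("computer", "pc")])  # -> 1.0
```

Pairs with high scores are candidates for the same co-occurrence cluster; pairs that never co-occur are simply absent from the table, and a real system would also need to smooth or threshold low counts.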

Variation Analysis of Long-term in vitro Cultured Cymbidium goeringii Lindley and Cymbidium kanran Makino (장기간 기내 배양한 춘란(Cymbidium goeringii Lindley) 및 한란(Cymbidium kanran Makino)의 변이 비교)

  • Ryu, Jai-Hyunk;Lee, Hyo-Yeon;Bae, Chang-Hyu
    • Korean Journal of Plant Resources / v.24 no.2 / pp.139-149 / 2011
  • RAPD (random amplified polymorphic DNA) analysis was performed to detect variation in 30 in vitro cultured rhizomes each of Cymbidium goeringii Lindley and Cymbidium kanran Makino after long-term (8 years) subculture. Of the 151 DNA bands detected in C. goeringii, 40 were polymorphic, a polymorphic rate of 26.4%. Of the 155 DNA bands detected in C. kanran, 56 were polymorphic, a polymorphic rate of 36.1%. The genetic similarity matrix (GSM) values ranged from 0.825 to 1.00, with an average of 0.944, in the rhizomes of C. goeringii, and from 0.812 to 1.00, with an average of 0.913, in C. kanran. According to the clustering analysis, C. goeringii was divided into one group and two independent individuals, and its clustering structure was simpler than that of C. kanran. Higher polymorphism and decreased GSM values were observed in the long-term in vitro cultured C. goeringii and C. kanran supplemented with growth regulators. The results provide fundamental data for developing new materials for plant breeding and plant resources.

Drought Classification Method for Jeju Island using Standard Precipitation Index (표준강수지수를 활용한 제주도 가뭄의 공간적 분류 방법 연구)

  • Park, Jae-Kyu;Lee, Jun-ho;Yang, Sung-Kee;Kim, Min-Chul;Yang, Se-Chang
    • Journal of Environmental Science International / v.25 no.11 / pp.1511-1519 / 2016
  • Jeju Island relies on subterranean water for over 98% of its water resources, so continued study of drought under climate change is necessary. In this study, the representative standardized precipitation index (SPI) is classified by various criteria, and the spatial characteristics and applicability of drought in Jeju Island are evaluated from the results. When SPI was calculated for four weather stations (SPI 3, 6, 9, 12), SPI 12 was found to be relatively simple compared to SPI 6. It was also verified that the fluctuation of SPI was greater for short-term data, and that long-term data was relatively more useful for judging extreme drought. Cluster analysis was performed using the K-means technique with two variables extracted by factor analysis; the clustering terminated after seven iterations, and two clusters were ultimately formed.

Design of WWW IR System Based on Keyword Clustering Architecture (색인어 말뭉치 처리를 기반으로 한 웹 정보검색 시스템의 설계)

  • 송점동;이정현;최준혁
    • The Journal of Information Technology / v.1 no.1 / pp.13-26 / 1998
  • In general information retrieval systems, improper keywords are often extracted and the search results differ from the user's aim, because the systems use only term frequency information to select keywords and do not consider their meanings. Because recall ratio and precision are inversely related, improving precision is limited unless the semantics of keywords are considered. In this paper, a system that can improve precision without decreasing the recall ratio is designed and implemented by introducing a client user module that sends feedback reflecting the user's intention to the server. For this purpose, keywords are selected using relative term frequency and inverse document frequency, and co-occurrence words are extracted from the original documents. The keywords are then clustered by their semantics using the calculated mutual information. The system can reject inappropriate documents using segmented semantic information according to feedback from the client user module. Consequently, the precision of the system is improved without decreasing the recall ratio.