• Title/Summary/Keyword: Top-K mining

A Study on the Characteristics of Prematurely Discharged Patients and the Model for Predicting Premature Discharge (환자이탈군 특성요인과 이탈환자 예측모형에 관한 연구)

  • Min, Kyung-Jin;Song, Kyu-Moon;Kim, Kwang-Hwan
    • Quality Improvement in Health Care / v.9 no.1 / pp.18-32 / 2002
  • Background: We developed a model for predicting premature discharge and identifying related factors. Methods: The prediction model was developed using data mining techniques. Basic data were collected from the complete discharge database of a university hospital in Chungnam Province for the period from July 1, 1999 to June 30, 2000. Results: 1. Among 22,873 patients, 21,695 (94.8%) were discharged with usual discharge orders and 1,178 (5.2%) were discharged prematurely. 2. The primary reason for unusual discharge was transfer to another hospital; moving to a local hospital closer to home and burdensome medical expenses were the main reasons given. 3. The predictive power of each model was tested using the top 10 percent of patients with the highest predicted probabilities of premature discharge. The neural network model was chosen as the most appropriate model for predicting prematurely discharged patients. 4. Ten percent of all patients were selected at random to test the effectiveness of the neural network model, with the model's threshold set to 0.7. The number of patients predicted to be discharged prematurely was 312; of these, 241 (77.2%) were actually discharged prematurely. Conclusion: Of the several data mining techniques used, the neural network model was the most effective. It can be used to identify and manage patients who are expected to be discharged prematurely.
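
The two evaluation steps described in this abstract (ranking patients by predicted probability and applying a 0.7 threshold) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy scores and labels are hypothetical, and only the final 241-of-312 figure comes from the abstract.

```python
def precision_at_threshold(probs, labels, threshold):
    """Return (flagged count, true positives, precision) for scores >= threshold."""
    flagged = [y for p, y in zip(probs, labels) if p >= threshold]
    hits = sum(flagged)
    precision = hits / len(flagged) if flagged else 0.0
    return len(flagged), hits, precision


def top_fraction_capture(probs, labels, fraction=0.10):
    """Share of all positives found among the top `fraction` of scores."""
    ranked = sorted(zip(probs, labels), key=lambda t: t[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    captured = sum(y for _, y in ranked[:k])
    total = sum(labels)
    return captured / total if total else 0.0


# Toy data (hypothetical): predicted probabilities and true outcomes (1 = premature discharge).
probs = [0.95, 0.80, 0.75, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05, 0.02]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

flagged, hits, precision = precision_at_threshold(probs, labels, 0.7)
capture = top_fraction_capture(probs, labels, 0.10)

# Figures reported in the abstract: 241 of the 312 flagged patients were true positives.
reported_precision = round(241 / 312 * 100, 1)  # 77.2
```

Threshold precision answers "how many flagged patients really left early", while top-fraction capture answers "how many early leavers appear in the highest-scored group"; the abstract reports the former for the neural network model.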

A Personalized Recommendation Methodology based on Collaborative Filtering (협업 필터링 기법을 활용한 개인화된 상품 추천 방법론 개발에 관한 연구)

  • Kim, Jae-Kyeong;Suh, Ji-Hae;Ahn, Do-Hyun;Cho, Yoon-Ho
    • Journal of Intelligence and Information Systems / v.8 no.2 / pp.139-157 / 2002
  • The rapid growth of e-commerce has created a new situation for both companies and customers. Companies find it harder to survive as competition intensifies, while customers can choose among an ever larger number of products. Recommender systems that suggest suitable products to customers therefore play an important role in e-commerce. This research introduces a collaborative filtering-based recommender system that helps customers find products they would like to purchase by producing a list of top-N recommended products. The proposed methodology is based on decision trees, product taxonomy, and association rule mining. The decision tree is used to select target customers who have a high probability of purchasing the recommended products. We applied the recommender system to a Korean department store. The methodology is evaluated through the analysis of a real department store case and compared with other methodologies.
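
As a rough illustration of how association rules can yield a top-N list, the sketch below scores each product a customer has not yet bought by the highest confidence of any pairwise rule "owned item → candidate item". It is a simplification under stated assumptions: the decision-tree customer selection and the product taxonomy from the abstract are omitted, and the function name, `min_support` parameter, and baskets are hypothetical.

```python
from collections import Counter
from itertools import permutations


def top_n_recommendations(baskets, customer_items, n=2, min_support=2):
    """Rank unpurchased items by the best confidence of a rule (owned -> candidate)."""
    item_count = Counter()
    pair_count = Counter()
    for basket in baskets:
        items = set(basket)
        item_count.update(items)
        pair_count.update(permutations(items, 2))  # ordered pairs (a, b), a != b

    scores = {}
    for (a, b), count in pair_count.items():
        if a in customer_items and b not in customer_items and count >= min_support:
            confidence = count / item_count[a]  # estimate of P(b | a)
            scores[b] = max(scores.get(b, 0.0), confidence)

    # Highest confidence first; ties broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:n]]


# Toy purchase baskets (hypothetical).
baskets = [
    ["shirt", "tie"],
    ["shirt", "tie", "belt"],
    ["shirt", "belt"],
    ["shoes", "socks"],
    ["shoes", "socks", "belt"],
]
recommended = top_n_recommendations(baskets, {"shirt"}, n=2)
```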

Latent topics-based product reputation mining (잠재 토픽 기반의 제품 평판 마이닝)

  • Park, Sang-Min;On, Byung-Won
    • Journal of Intelligence and Information Systems / v.23 no.2 / pp.39-70 / 2017
  • Data-driven analytics techniques have recently been applied to public surveys. Instead of simply gathering survey results or expert opinions to research the preference for a recently launched product, enterprises need a way to collect and analyze various types of online data and then accurately determine customer preferences. In existing data-based survey methods, the sentiment lexicon for a particular domain is first constructed by domain experts, who usually judge the positive, neutral, or negative meanings of frequently used words in the collected text documents. To research the preference for a particular product, the existing approach (1) collects review posts related to the product from several product review web sites; (2) extracts sentences (or phrases) from the collection after pre-processing steps such as stemming and stop-word removal; (3) classifies the polarity (positive or negative) of each sentence (or phrase) based on the sentiment lexicon; and (4) estimates the positive and negative ratios of the product by dividing the numbers of positive and negative sentences (or phrases) by the total number of sentences (or phrases) in the collection. Furthermore, the existing approach automatically finds important sentences (or phrases) that carry positive or negative meaning about the product. As a motivating example, given a product such as the Sonata made by Hyundai Motors, customers often want to see a summary of the positive points in the 'car design' aspect as well as the negative points in the same aspect. They also want more useful information on other aspects such as 'car quality', 'car performance', and 'car service'. Such information will enable customers to make good choices when they purchase brand-new vehicles. 
In addition, automobile makers will be able to determine the preference for, and the positive and negative points of, new models on the market. In the near future, the weak points of those models can then be improved on the basis of the sentiment analysis. For this, the existing approach computes the sentiment score of each sentence (or phrase) and then selects the top-k sentences (or phrases) with the highest positive and negative scores. However, the existing approach has several shortcomings that limit its use in real applications. Its main disadvantages are as follows: (1) The main aspects (e.g., car design, quality, performance, and service) of a product (e.g., Hyundai Sonata) are not considered. Because the sentiment analysis ignores aspects, customers and car makers receive only a summary containing the positive and negative ratios of the product and the top-k sentences (or phrases) with the highest sentiment scores in the entire corpus. This is not enough; the main aspects of the target product need to be considered in the sentiment analysis. (2) In general, since the same word has different meanings across domains, a sentiment lexicon appropriate to each domain needs to be constructed. An efficient way to construct the sentiment lexicon per domain is required because lexicon construction is labor intensive and time consuming. To address these problems, in this article we propose a novel product reputation mining algorithm that (1) extracts topics hidden in review documents written by customers; (2) mines main aspects based on the extracted topics; (3) measures the positive and negative ratios of the product using the aspects; and (4) presents a digest in which a few important sentences with positive and negative meanings are listed for each aspect. Unlike the existing approach, using hidden topics lets experts construct the sentiment lexicon easily and quickly. 
Furthermore, by reinforcing topic semantics, we can improve the accuracy of product reputation mining beyond that of the existing approach. In the experiments, we collected a large set of review documents on domestic vehicles such as the K5, SM5, and Avante; measured the positive and negative ratios of the three cars; showed top-k positive and negative summaries per aspect; and conducted statistical analysis. Our experimental results clearly show the effectiveness of the proposed method compared with the existing method.
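
The lexicon-based baseline that this abstract calls the existing approach (score each sentence, compute positive and negative ratios, then pick the top-k sentences) can be sketched roughly as below. The tiny lexicon, the whitespace tokenizer, and the review sentences are hypothetical stand-ins; the topic extraction and aspect mining of the proposed method are not shown.

```python
import heapq

# Hypothetical toy sentiment lexicon: +1 for positive words, -1 for negative words.
LEXICON = {"good": 1, "great": 1, "sleek": 1, "poor": -1, "noisy": -1, "bad": -1}


def sentence_score(sentence, lexicon=LEXICON):
    """Sum the lexicon polarity of each word (unknown words count as 0)."""
    return sum(lexicon.get(w.strip(".,!?").lower(), 0) for w in sentence.split())


def polarity_ratios(sentences):
    """Fractions of positive and negative sentences over all sentences."""
    scores = [sentence_score(s) for s in sentences]
    total = len(scores)
    if total == 0:
        return 0.0, 0.0
    pos = sum(1 for s in scores if s > 0)
    neg = sum(1 for s in scores if s < 0)
    return pos / total, neg / total


def top_k(sentences, k=1):
    """The k most positive and the k most negative sentences."""
    positives = [s for s in sentences if sentence_score(s) > 0]
    negatives = [s for s in sentences if sentence_score(s) < 0]
    return (heapq.nlargest(k, positives, key=sentence_score),
            heapq.nsmallest(k, negatives, key=sentence_score))


reviews = [
    "The design is sleek and the seats are great.",
    "Engine is noisy.",
    "Good mileage.",
    "Service was poor and the interior feels bad.",
    "Delivered on Tuesday.",
]
pos_ratio, neg_ratio = polarity_ratios(reviews)
best, worst = top_k(reviews, k=1)
```

Because the scores are computed over the whole corpus, the top-k sentences here mix all aspects together, which is the first shortcoming the abstract attributes to this baseline.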

Analysis of Defense Communication-Electronics Technologies using Data Mining Technique (데이터 마이닝 기법을 이용한 군 통신·전자 분야 기술 분석)

  • Baek, Seong-Ho;Kang, Seok-Joong
    • Journal of the Korea Institute of Information and Communication Engineering / v.24 no.6 / pp.687-699 / 2020
  • The government-led, top-down approach to developing weapons systems faces the problem of technological obsolescence now that technology advances rapidly. As a result, the government has gradually expanded the company-led, bottom-up project implementation method in the defense industry. The key success factor of bottom-up project implementation is the ability of defense companies to plan their technologies. This paper presents a method of analyzing patent data with data mining techniques so that domestic defense companies can use it in technology planning activities. Its main contributions are a technique for selecting companies in the defense communication-electronics sector and a principal component analysis and cluster analysis of International Patent Classification codes. Through this analysis, the technologies were classified into four groups based on the patents of nine companies, and the representative enterprises of each group were derived.

Mining Trip Patterns in the Large Trip-Transaction Database and Analysis of Travel Behavior (대용량 교통카드 트랜잭션 데이터베이스에서 통행 패턴 탐사와 통행 행태의 분석)

  • Park, Jong-Soo;Lee, Keum-Sook
    • Journal of the Economic Geographical Society of Korea / v.10 no.1 / pp.44-63 / 2007
  • The purpose of this study is to propose mining processes for the large trip-transaction database of the Seoul metropolitan area and to analyze the spatial characteristics of travel behavior. For this purpose, the study introduces a mining algorithm developed for exploring trip patterns in the large trip-transaction database produced every day by transit users in the Seoul metropolitan area. The algorithm computes the trip chains of transit users using the bus routes and a graph of the stops in the Seoul subway network. We explore the transfer frequency of transit users in their trip chains in one-day transaction databases from three different years. We find that the number of transit users who transfer to another bus or subway is increasing yearly. From the trip chains of the large trip-transaction database, trip patterns are mined to analyze how transit users travel in the public transportation system. The mining algorithm is a level-wise approach to finding frequent trip patterns. The resulting frequent patterns are illustrated to show the top-ranked subway stations and bus stops by support. From the outputs, we explore the travel patterns of three different time periods in a day. We find clear differences in the spatial structure of origin and destination travel patterns depending on the time period. To examine changes in travel patterns over time, we apply the algorithm to one day of data per year since 2004. The results are visualized using GIS, and the spatial characteristics of travel patterns are then analyzed. The spatial distribution of trip origins and destinations shows a sharp distinction among time periods.
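
A level-wise search for frequent patterns can be illustrated with a generic Apriori-style sketch: patterns of size k are built only from frequent patterns of size k-1. This is not the authors' algorithm. It treats each trip chain as an unordered set of stops (an assumption; the paper's patterns may be ordered), and the function name and toy chains are hypothetical.

```python
from itertools import combinations


def levelwise_frequent_patterns(chains, min_support):
    """Return {pattern (frozenset of stops): support} for all frequent patterns."""
    transactions = [frozenset(chain) for chain in chains]

    def support(pattern):
        return sum(1 for t in transactions if pattern <= t)

    level = {frozenset([stop]) for t in transactions for stop in t}
    frequent = {}
    k = 1
    while level:
        # Count candidates at this level and keep those meeting min_support.
        current = {c: s for c in level if (s := support(c)) >= min_support}
        frequent.update(current)
        k += 1
        # Join step: merge frequent patterns into candidates of size k.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        level = {c for c in candidates
                 if all(frozenset(sub) in current for sub in combinations(c, k - 1))}
    return frequent


# Toy trip chains (hypothetical data; stop names are used only as labels).
chains = [
    ["Gangnam", "Jamsil", "Sadang"],
    ["Gangnam", "Jamsil"],
    ["Gangnam", "Sadang"],
    ["Jamsil", "Seoul Station"],
]
patterns = levelwise_frequent_patterns(chains, min_support=2)

# Rank patterns by support, as the abstract does for stations and stops.
ranked = sorted(patterns.items(), key=lambda kv: (-kv[1], sorted(kv[0])))
```

The prune step is what makes the level-wise approach practical on large databases: a candidate is never counted if any of its sub-patterns already failed the support threshold.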

A Study of Perception of Golfwear Using Big Data Analysis (빅데이터를 활용한 골프웨어에 관한 인식 연구)

  • Lee, Areum;Lee, Jin Hwa
    • Fashion & Textile Research Journal / v.20 no.5 / pp.533-547 / 2018
  • The objective of this study is to examine the perception of golfwear and related trends based on major keywords and associated words related to golfwear, using big data. For this study, data were collected from blogs, Jisikin and Tips, news articles, and web cafés on two of the most commonly used search engines (Naver and Daum) using the keywords 'Golfwear' and 'Golf clothes'. Frequency and matrix data were extracted through Textom for the period from January 1, 2016 to December 31, 2017. From the matrix created by Textom, degree centrality, closeness centrality, betweenness centrality, and eigenvector centrality were calculated and analyzed using Netminer 4.0. The analysis found that the keyword 'brand' ranked highest in web visibility, followed by 'woman', 'size', 'man', 'fashion', 'sports', 'price', 'store', 'discount', and 'equipment' in the top 10 frequency rankings. For the centrality calculations, only the top 30 keywords were included because the density was extremely high owing to the high frequency of co-occurring keywords. The centrality calculations showed that the top-ranked keywords were similar to those in the raw frequency data. When the frequency was adjusted by subtracting 100 and 500 words, the results differed: keywords that ranked low in the frequency analysis, such as J. Lindberg, ranked high, and the rankings changed in all centrality calculations. These findings provide a basis for marketing strategies and for ways to increase the awareness and web visibility of golfwear brands.
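
To show what one of the centrality measures computes, the sketch below derives normalized degree centrality (number of distinct neighbours divided by n - 1) from a keyword co-occurrence matrix. The exact definition used by Netminer 4.0 is not given in the abstract, so this textbook form is an assumption, and the co-occurrence counts are hypothetical.

```python
def degree_centrality(cooccurrence):
    """Normalized degree centrality from a symmetric co-occurrence mapping.

    cooccurrence maps keyword -> {neighbour keyword: co-occurrence count}.
    """
    n = len(cooccurrence)
    if n <= 1:
        return {kw: 0.0 for kw in cooccurrence}
    return {
        kw: sum(1 for nb, count in nbrs.items() if count > 0 and nb != kw) / (n - 1)
        for kw, nbrs in cooccurrence.items()
    }


# Toy co-occurrence counts (hypothetical; keywords borrowed from the abstract).
cooccurrence = {
    "brand": {"woman": 5, "size": 3, "price": 2},
    "woman": {"brand": 5, "size": 1},
    "size": {"brand": 3, "woman": 1},
    "price": {"brand": 2},
}
centrality = degree_centrality(cooccurrence)

# Top-ranked keywords by centrality, ties broken alphabetically.
top_keywords = sorted(centrality, key=lambda kw: (-centrality[kw], kw))
```

When nearly every keyword co-occurs with every other, all scores approach 1.0 and the ranking carries little information, which is consistent with the abstract's remark about extremely high density.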

Physical Properties of Surface Sediments of the KR(Korea Reserved) 1, 2, and 5 Areas, Northeastern Equatorial Pacific (북동태평양 대한민국 광구 KR1, 2, 5 지역 표층 퇴적물의 물리적 특성 비교)

  • Lee, Hyun-Bok;Chi, Sang-Bum;Park, Cheong-Kee;Kim, Ki-Hyune;Ju, Se-Jong;Oh, Jae-Kyung
    • The Sea:JOURNAL OF THE KOREAN SOCIETY OF OCEANOGRAPHY / v.13 no.3 / pp.168-177 / 2008
  • The trafficability of a miner and the potential environmental impacts of mining activities should be considered when selecting a commercial manganese nodule mining site. These two factors can be evaluated comparatively using the physical properties and shear strength of sea-bed sediments. For a qualitative comparison of potential mining sites in terms of these two factors, physical properties such as water content, void ratio, porosity, and grain density, together with the shear strength of surface sediments, were determined for the three potential manganese nodule mining sites (KR1, KR2, and KR5) in the Korean manganese nodule contract area, northeast Pacific. For the study, sediment samples were collected from 107 stations from 2004 to 2006. The physical properties of the surface sediments differed more between the northern (KR1, KR2) and southern (KR5) blocks than between the northern blocks (KR1 vs. KR2). The water content, void ratio, and porosity of sediments from KR5 were higher than those from KR1 and KR2, and the grain density of sediments from KR5 was lower. Shear strengths of the top 10 cm of sediment were higher in KR1 and KR2, whereas those of the deeper part were highest in the KR5 block. Generally, sediments with high water content are less easily suspended by benthic disturbances than those with low water content, so mining activities are expected to cause less disturbance in sediments with high water content. In terms of trafficability, the shear strength of sediment below 10 cm depth is more important than that of the shallower part because the miner will disturb at least the top 10 cm of the surface sediments. Based on these results, we conclude that the KR5 area is the best site for commercial mining among the three sites investigated in this study.
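
For readers unfamiliar with the physical properties named in this abstract, the standard soil-mechanics definitions are given below. The abstract does not state which definitions or corrections (for example, for pore-water salt) the authors used, so these textbook forms and the symbols $M_w$, $M_s$, $V_v$, $V_s$ (mass of water, mass of solids, volume of voids, volume of solids) are assumptions.

```latex
\begin{align*}
  w      &= \frac{M_w}{M_s}                         && \text{(water content)} \\
  e      &= \frac{V_v}{V_s}                         && \text{(void ratio)} \\
  n      &= \frac{V_v}{V_v + V_s} = \frac{e}{1 + e} && \text{(porosity)} \\
  \rho_s &= \frac{M_s}{V_s}                         && \text{(grain density)}
\end{align*}
% For a fully saturated sediment with pore-water density \rho_w (salt ignored):
\begin{equation*}
  e = \frac{w \, \rho_s}{\rho_w}
\end{equation*}
```

For example, a void ratio of $e = 4$ corresponds to a porosity of $n = 4/5 = 0.8$. Because porosity increases monotonically with void ratio, the two properties rank the three sites in the same order, as the abstract reports.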

EXTENDED ONLINE DIVISIVE AGGLOMERATIVE CLUSTERING

  • Musa, Ibrahim Musa Ishag;Lee, Dong-Gyu;Ryu, Keun-Ho
    • Proceedings of the KSRS Conference / 2008.10a / pp.406-409 / 2008
  • Clustering data streams is important in many applications, such as sensor networks. Existing hierarchical methods follow a semi-fuzzy clustering approach that yields duplicate clusters. To solve this problem, we propose an extended online divisive agglomerative clustering method for data streams. It builds a tree-like, top-down hierarchy of clusters that evolves with the data stream, using a geometric time frame for snapshots. It enhances Online Divisive Agglomerative Clustering (ODAC) with a pruning strategy that avoids duplicate clusters. Its main features are update time and memory usage that are independent of the number of examples in the data stream. It can be used for clustering sensor data, for network monitoring, and for web click streams.

Analysis of Waterpark Status and Recognition Using Big Data Analysis (빅데이터 분석을 활용한 워터파크 현황 및 인식 분석)

  • Kim, Jae-Hwan;Lee, Jae-Moon
    • Journal of Digital Convergence / v.15 no.10 / pp.525-535 / 2017
  • This study aims to examine consumer perception and the current status of water parks. Naver and Daum were used as data collection channels, and the keyword 'water park' was used for data retrieval. The analysis covered the two-year period from January 1, 2015 to December 31, 2016. First, in the frequency analysis, the top 5 keywords in 2015 were hidden camera, Lotte water park, arrest, suspect, and Gimhae, and the top 5 in 2016 were Lotte water park, swimming, summer, opening, and admission ticket. Second, in the degree centrality analysis, the top 5 in 2015 were hidden camera, arrest, suspect, female, and shower room, and the top 5 in 2016 were swimming, Lotte water park, summer, One Mount, and admission ticket. Third, in the N-gram network graph, the top 5 pairs in 2015 were water park/hidden camera, hidden camera/hidden camera, suspect/arrest, Gimhae/Lotte water park, and water park/suspect, and the top 5 in 2016 were One Mount/water park, Gimhae/Lotte water park, water park/admission ticket, water park/water park, and water park/opening. Fourth, the CONCOR analysis formed three groups in 2015 and two groups in 2016.

Prediction of golf scores on the PGA tour using statistical models (PGA 투어의 골프 스코어 예측 및 분석)

  • Lim, Jungeun;Lim, Youngin;Song, Jongwoo
    • The Korean Journal of Applied Statistics / v.30 no.1 / pp.41-55 / 2017
  • This study predicts the average scores of the top 150 PGA golf players in 132 PGA Tour tournaments (2013-2015) using data mining techniques and statistical analysis. It also aims to predict the Top 10 and Top 25 players in 4 different playoffs. Linear and nonlinear regression methods were used to predict average scores. Stepwise regression, best subset selection, LASSO, ridge regression, and principal component regression were used as linear methods. Tree, bagging, gradient boosting, neural network, random forest, and KNN models were used as nonlinear methods. We found that the average score increases as fairway firmness, green height, or average maximum wind speed increases. We also found that the average score decreases as the number of one-putts, the scrambling variable, or the longest driving distance increases. All 11 models have low prediction error when predicting the average scores of the 2015 PGA tournaments, which are not included in the training set. However, the bagging and random forest models perform best among all models, and these two models also have the highest accuracy when predicting the Top 10 and Top 25 players in the 4 playoffs.