• Title/Summary/Keyword: Pattern mining

Search Result 624, Processing Time 0.023 seconds

Job Creation and Job Destruction in Korean Mining and Manufacturing, 1981-2000 (1981-2000년간 한국 광공업 5인 이상 사업체에서의 일자리 창출과 소멸)

  • Kim, Hye Won
    • Journal of Labour Economics
    • /
    • v.27 no.2
    • /
    • pp.29-66
    • /
    • 2004
  • In this paper, I investigate job creation and destruction in Korean mining and manufacturing between 1982 and 2000 using the raw data of Annual Mining and Manufacturing Survey. The rate of job creation and destruction of continuing plants averaged 9.75 and 10.33, respectively, which are higher than those of OECD countries, Chile, and Colombia. The created jobs showed weak persistence and the concentration of job reallocation is high, compared with other countries. Job reallocation accounts for major fraction of worker reallocation and the fraction has increased before 1997. Analysis of time series data of job flow revealed a general pattern of pro-cyclic job creation and counter-cyclic job destruction. However job reallocation in Korea is strongly acyclic whereas the rate is known to be counter-cyclical in the U.S.

  • PDF

An Active Co-Training Algorithm for Biomedical Named-Entity Recognition

  • Munkhdalai, Tsendsuren;Li, Meijing;Yun, Unil;Namsrai, Oyun-Erdene;Ryu, Keun Ho
    • Journal of Information Processing Systems
    • /
    • v.8 no.4
    • /
    • pp.575-588
    • /
    • 2012
  • Exploiting unlabeled text data with a relatively small labeled corpus has been an active and challenging research topic in text mining, due to the recent growth of the amount of biomedical literature. Biomedical named-entity recognition is an essential prerequisite task before effective text mining of biomedical literature can begin. This paper proposes an Active Co-Training (ACT) algorithm for biomedical named-entity recognition. ACT is a semi-supervised learning method in which two classifiers based on two different feature sets iteratively learn from informative examples that have been queried from the unlabeled data. We design a new classification problem to measure the informativeness of an example in unlabeled data. In this classification problem, the examples are classified based on a joint view of a feature set to be informative/non-informative to both classifiers. To form the training data for the classification problem, we adopt a query-by-committee method. Therefore, in the ACT, both classifiers are considered to be one committee, which is used on the labeled data to give the informativeness label to each example. The ACT method outperforms the traditional co-training algorithm in terms of f-measure as well as the number of training iterations performed to build a good classification model. The proposed method tends to efficiently exploit a large amount of unlabeled data by selecting a small number of examples having not only useful information but also a comprehensive pattern.

Discovery of Frequent Sequence Pattern in Moving Object Databases (이동 객체 데이터베이스에서 빈발 시퀀스 패턴 탐색)

  • Vu, Thi Hong Nhan;Lee, Bum-Ju;Ryu, Keun-Ho
    • The KIPS Transactions:PartD
    • /
    • v.15D no.2
    • /
    • pp.179-186
    • /
    • 2008
  • The converge of location-aware devices, GIS functionalities and the increasing accuracy and availability of positioning technologies pave the way to a range of new types of location-based services. The field of spatiotemporal data mining where relationships are defined by spatial and temporal aspect of data is encountering big challenges since the increased search space of knowledge. Therefore, we aim to propose algorithms for mining spatiotemporal patterns in mobile environment in this paper. Moving patterns are generated utilizing two algorithms called All_MOP and Max_MOP. The first one mines all frequent patterns and the other discovers only maximal frequent patterns. Our proposed approach is able to reduce consuming time through comparison with DFS_MINE algorithm. In addition, our approach is applicable to location-based services such as tourist service, traffic service, and so on.

Customer Satisfaction Analysis for Global Cosmetic Brands: Text-mining Based Online Review Analysis (글로벌 화장품 브랜드의 소비자 만족도 분석: 텍스트마이닝 기반의 사용자 후기 분석을 중심으로)

  • Park, Jaehun;Kim, Ye-Rim;Kang, Su-Bin
    • Journal of Korean Society for Quality Management
    • /
    • v.49 no.4
    • /
    • pp.595-607
    • /
    • 2021
  • Purpose: This study introduces a systematic framework to evaluate service satisfaction of cosmetic brands through online review analysis utilizing Text-Mining technique. Methods: The framework assumes that the service satisfaction is evaluated by positive comments from online reviews. That is, the service satisfaction of a cosmetic brand is evaluated higher as more positive opinions are commented in the online reviews. This study focuses on two approaches. First, it collects online review comments from the top 50 global cosmetic brands and evaluates customer service satisfaction for each cosmetic brands by applying Sentimental Analysis and Latent Dirichlet Allocation. Second, it analyzes the determinants that induce or influence service satisfaction and suggests the guidelines for cosmetic brands with low satisfaction to improve their service satisfaction. Results: For the satisfaction evaluation, online review data were extracted from the top 50 global cosmetic brands in the world based on 2018 sales announced by Brand Finance in the UK. As a result of the satisfaction analysis, it was found that overall there were more positive opinions than negative opinions and the averages for polarity, subjectivity, positive ratio, and negative ratio were calculated as 0.50, 0.76, 0.57, and 0.19, respectively. Polarity, subjectivity and positive ratio showed the opposite pattern to negative ratio, and although there was a slight difference in fluctuation range and ranking between them, the patterns are almost same. Conclusion: The usefulness of the proposed framework was verified through case study. Although some studies have suggested a method to analyze online reviews, they didn't deal with the satisfaction evaluation among competitors and cause analysis. This study is different from previous studies in that it evaluates service satisfaction from a relative point of view among cosmetic brands and analyze determinants.

Re-evaluation of Obesity Syndrome Differentiation Questionnaire Based on Real-world Survey Data Using Data Mining (데이터 마이닝을 이용한 한의비만변증 설문지 재평가: 실제 임상에서 수집한 설문응답 기반으로)

  • Oh, Jihong;Wang, Jing-Hua;Choi, Sun-Mi;Kim, Hojun
    • Journal of Korean Medicine for Obesity Research
    • /
    • v.21 no.2
    • /
    • pp.80-94
    • /
    • 2021
  • Objectives: The purpose of this study is to re-evaluate the importance of questions of obesity syndrome differentiation (OSD) questionnaire based on real-world survey and to explore the possibility of simplifying OSD types. Methods: The OSD frequency was identified, and variance threshold feature selection was performed to filter the questions. Filtered questions were clustered by K-means clustering and hierarchical clustering. After principal component analysis (PCA), the distribution patterns of the subjects were identified and the differences in the syndrome distribution were compared. Results: The frequency of OSD in spleen deficiency, phlegm (PH), and blood stasis (BS) was lower than in food retention (FR), liver qi stagnation (LS), and yang deficiency. We excluded 13 questions with low variance, 7 of which were related to BS. Filtered questions were clustered into 3 groups by K-means clustering; Cluster 1 (17 questions) mainly related to PH, BS syndromes; Cluster 2 (11 questions) related to swelling, and indigestion; Cluster 3 (11 questions) related to overeating or emotional symptoms. After PCA, significant different patterns of subjects were observed in the FR, LS, and other obesity syndromes. The questions that mainly affect the FR distribution were digestive symptoms. And emotional symptoms mainly affect the distribution of LS subjects. And other obesity syndrome was partially affected by both digestive and emotional symptoms, and also affected by symptoms related to poor circulation. Conclusions: In-depth data mining analysis identified relatively low importance questions and the potential to simplify OSD types.

Effective Utilization of Data based on Analysis of Spatial Data Mining (공간 데이터마이닝 분석을 통한 데이터의 효과적인 활용)

  • Kim, Kibum;An, Beongku
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.13 no.3
    • /
    • pp.157-163
    • /
    • 2013
  • Data mining is a useful technology that can support new discoveries based on the pattern analysis and a variety of linkages between data, and currently is utilized in various fields such as finance, marketing, medical. In this paper, we propose an effective utilization method of data based on analysis of spatial data mining. We make use of basic data of foreigners living in Seoul. However, the data has some features distinguished from other areas of data, classification as sensitive information and legal problem such as personal information protection. So, we use the basic statistical data that does not contain personal information. The main features and contributions of the proposed method are as follows. First, we can use Big Data as information through a variety of ways and can classify and cluster Big Data through refinement. Second. we can use these kinds of information for decision-making of future and new patterns. In the performance evaluation, we will use visual approach through graph of themes. The results of performance evaluation show that the analysis using data mining technology can support new discoveries of patterns and results.

Transaction Pattern Discrimination of Malicious Supply Chain using Tariff-Structured Big Data (관세 정형 빅데이터를 활용한 우범공급망 거래패턴 선별)

  • Kim, Seongchan;Song, Sa-Kwang;Cho, Minhee;Shin, Su-Hyun
    • The Journal of the Korea Contents Association
    • /
    • v.21 no.2
    • /
    • pp.121-129
    • /
    • 2021
  • In this study, we try to minimize the tariff risk by constructing a hazardous cargo screening model by applying Association Rule Mining, one of the data mining techniques. For this, the risk level between supply chains is calculated using the Apriori Algorithm, which is an association analysis algorithm, using the big data of the import declaration form of the Korea Customs Service(KCS). We perform data preprocessing and association rule mining to generate a model to be used in screening the supply chain. In the preprocessing process, we extract the attributes required for rule generation from the import declaration data after the error removing process. Then, we generate the rules by using the extracted attributes as inputs to the Apriori algorithm. The generated association rule model is loaded in the KCS screening system. When the import declaration which should be checked is received, the screening system refers to the model and returns the confidence value based on the supply chain information on the import declaration data. The result will be used to determine whether to check the import case. The 5-fold cross-validation of 16.6% precision and 33.8% recall showed that import declaration data for 2 years and 6 months were divided into learning data and test data. This is a result that is about 3.4 times higher in precision and 1.5 times higher in recall than frequency-based methods. This confirms that the proposed method is an effective way to reduce tariff risks.

Pattern Classification Model Design and Performance Comparison for Data Mining of Time Series Data (시계열 자료의 데이터마이닝을 위한 패턴분류 모델설계 및 성능비교)

  • Lee, Soo-Yong;Lee, Kyoung-Joung
    • Journal of the Korean Institute of Intelligent Systems
    • /
    • v.21 no.6
    • /
    • pp.730-736
    • /
    • 2011
  • In this paper, we designed the models for pattern classification which can reflect the latest trend in time series. It has been shown that fusion models based on statistical and AI methods are superior to traditional ones for the pattern classification model supporting decision making. Especially, the hit rates of pattern classification models combined with fuzzy theory are relatively increased. The statistical SVM models combined with fuzzy membership function, or the models combining neural network and FCM has shown good performance. BPN, PNN, FNN, FCM, SVM, FSVM, Decision Tree, Time Series Analysis, and Regression Analysis were used for pattern classification models in the experiments of this paper. The economical indices DB with time series properties of the financial market(Korea, KOSPI200 DB) and the electrocardiogram DB of arrhythmia patients in hospital emergencies(USA, MIT-BIH DB) were used for data base.

An Extended Frequent Pattern Tree for Hiding Sensitive Frequent Itemsets (민감한 빈발 항목집합 숨기기 위한 확장 빈발 패턴 트리)

  • Lee, Dan-Young;An, Hyoung-Geun;Koh, Jae-Jin
    • The KIPS Transactions:PartD
    • /
    • v.18D no.3
    • /
    • pp.169-178
    • /
    • 2011
  • Recently, data sharing between enterprises or organizations is required matter for task cooperation. In this process, when the enterprise opens its database to the affiliates, it can be occurred to problem leaked sensitive information. To resolve this problem it is needed to hide sensitive information from the database. Previous research hiding sensitive information applied different heuristic algorithms to maintain quality of the database. But there have been few studies analyzing the effects on the items modified during the hiding process and trying to minimize the hided items. This paper suggests eFP-Tree(Extended Frequent Pattern Tree) based FP-Tree(Frequent Pattern Tree) to hide sensitive frequent itemsets. Node formation of eFP-Tree uses border to minimize impacts of non sensitive frequent itemsets in hiding process, by organizing all transaction, sensitive and border information differently to before. As a result to apply eFP-Tree to the example transaction database, the lost items were less than 10%, proving it is more effective than the existing algorithm and maintain the quality of database to the optimal.

Data mining Algorithms for the Development of Sasang Type Diagnosis (사상체질 진단검사를 위한 데이터마이닝 알고리즘 연구)

  • Hong, Jin-Woo;Kim, Young-In;Park, So-Jung;Kim, Byoung-Chul;Eom, Il-Kyu;Hwang, Min-Woo;Shin, Sang-Woo;Kim, Byung-Joo;Kwon, Young-Kyu;Chae, Han
    • Journal of Physiology & Pathology in Korean Medicine
    • /
    • v.23 no.6
    • /
    • pp.1234-1240
    • /
    • 2009
  • This study was to compare the effectiveness and validity of various data-mining algorithm for Sasang type diagnostic test. We compared the sensitivity and specificity index of nine attribute selection and eleven class classification algorithms with 31 data-set characterizing Sasang typology and 10-fold validation methods installed in Waikato Environment Knowledge Analysis (WEKA). The highest classification validity score can be acquired as follows; 69.9 as Percentage Correctly Predicted index with Naive Bayes Classifier, 80 as sensitivity index with LWL/Tae-Eum type, 93.5 as specificity index with Naive Bayes Classifier/So-Eum type. The classification algorithm with highest PCP index of 69.62 after attribute selection was Naive Bayes Classifier. In this study we can find that the best-fit algorithm for traditional medicine is case sensitive and that characteristics of clinical circumstances, and data-mining algorithms and study purpose should be considered to get the highest validity even with the well defined data sets. It is also confirmed that we can't find one-fits-all algorithm and there should be many studies with trials and errors. This study will serve as a pivotal foundation for the development of medical instruments for Pattern Identification and Sasang type diagnosis on the basis of traditional Korean Medicine.