• Title/Summary/Keyword: Tree algorithm

Wood Anatomy of Korean Symplocos Jacq. (Symplocaceae)

  • Balkrishna Ghimire;Beom Kyun Park;Seung-Hwan Oh;Dong Chan Son
    • Proceedings of the Plant Resources Society of Korea Conference
    • /
    • 2020.08a
    • /
    • pp.36-36
    • /
    • 2020
  • Symplocos Jacq., comprising about 350 species, is the sole genus of the family Symplocaceae. Despite poorly documented species delimitation and unresolved taxonomic nomenclature, four species of Symplocos (S. coreana, S. prunifolia, S. sawafutagi, and S. tanakana) have been described in Korea. In this study, we carried out a comparative wood anatomical survey of all four Korean Symplocos species to understand the wood anatomical variation among them. The results indicated that Korean Symplocos species are largely indistinguishable in terms of their qualitative wood features, except that S. prunifolia has exclusively uniseriate rays whereas the other three species have uni- to multiseriate rays. However, discrepancies are observed in quantitative wood variables such as vessel density, vessel size, and ray density. The vessel density of S. prunifolia (the highest among the four species) is more than twice that of S. sawafutagi (the lowest) and S. tanakana. Vessel size shows the reverse relationship: the vessel circumference and diameter in both planes of S. sawafutagi and S. tanakana are almost twice as large as those of S. prunifolia. Interestingly, S. coreana lies between these two groups in terms of vessel features and is closer to S. prunifolia in terms of ray density. A cluster analysis based on the paired-group (UPGMA) algorithm with the Euclidean similarity index clearly separates S. prunifolia from the rest of the taxa as the first isolated clade of the tree (a minimal UPGMA sketch follows this entry).

  • PDF
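
As context for the UPGMA step mentioned in this abstract, the following is a minimal sketch using SciPy's average-linkage clustering on hypothetical quantitative wood variables; the species names are real, but every number is an illustrative stand-in, not a measurement from the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

species = ["S. coreana", "S. prunifolia", "S. sawafutagi", "S. tanakana"]
# Hypothetical [vessel density, vessel diameter, ray density] per species,
# chosen only to mirror the relationships described in the abstract.
features = np.array([
    [120.0, 45.0, 14.0],   # S. coreana: intermediate vessels, dense rays
    [160.0, 30.0, 15.0],   # S. prunifolia: many narrow vessels
    [ 70.0, 60.0, 10.0],   # S. sawafutagi: few wide vessels
    [ 75.0, 58.0, 10.5],   # S. tanakana: few wide vessels
])

# Standardize so no single variable dominates the Euclidean distance.
z = (features - features.mean(axis=0)) / features.std(axis=0)

# method="average" over Euclidean distances is the paired-group (UPGMA) algorithm.
tree = linkage(pdist(z, metric="euclidean"), method="average")
print(tree)  # each row: (cluster_i, cluster_j, merge_distance, new_cluster_size)
```

With the paper's actual measurements in place of these toy values, the same call would be expected to split S. prunifolia off first, as the abstract reports.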

Hierarchical Overlapping Clustering to Detect Complex Concepts (중복을 허용한 계층적 클러스터링에 의한 복합 개념 탐지 방법)

  • Hong, Su-Jeong;Choi, Joong-Min
    • Journal of Intelligence and Information Systems
    • /
    • v.17 no.1
    • /
    • pp.111-125
    • /
    • 2011
  • Clustering is the process of grouping similar or relevant documents into a cluster and assigning a meaningful concept to that cluster. Clustering thereby facilitates fast and accurate search for relevant documents by narrowing the search to the collection of documents belonging to related clusters. Effective clustering requires techniques for identifying similar documents and grouping them into a cluster, and for discovering the concept most relevant to the cluster. One problem that often appears in this context is the detection of a complex concept that overlaps several simple concepts at the same hierarchical level. Previous clustering methods were unable to identify and represent a complex concept that belongs to several different clusters at the same level of the concept hierarchy, and they also could not validate the semantic hierarchical relationship between a complex concept and each of the simple concepts. To solve these problems, this paper proposes a new clustering method that identifies and represents complex concepts efficiently. We developed the Hierarchical Overlapping Clustering (HOC) algorithm, which modifies the traditional agglomerative hierarchical clustering algorithm to allow overlapping clusters at the same level of the concept hierarchy. The HOC algorithm represents the clustering result not as a tree but as a lattice in order to detect complex concepts. We developed a system that employs the HOC algorithm for complex concept detection. This system operates in three phases: 1) preprocessing of documents, 2) clustering using the HOC algorithm, and 3) validation of the semantic hierarchical relationships among the concepts in the lattice obtained from the clustering. The preprocessing phase represents the documents as x-y coordinate values in a 2-dimensional space based on the weights of the terms appearing in the documents. First, the documents go through a refinement process, applying stopword removal and stemming to extract index terms. Each index term is then assigned a TF-IDF weight, and the x-y coordinate value for each document is determined by combining the TF-IDF values of the terms in it. The clustering phase uses the HOC algorithm, in which the similarity between documents is calculated using the Euclidean distance. Initially, a cluster is generated for each document by grouping the documents closest to it. Then the distance between any two clusters is measured, and the closest clusters are grouped into a new cluster. This process is repeated until the root cluster is generated. In the validation phase, feature selection is applied to validate the appropriateness of the cluster concepts built by the HOC algorithm, to see whether they have meaningful hierarchical relationships. Feature selection extracts key features from a document by identifying and weighting its important and representative terms. To select key features correctly, a method is needed to determine how each term contributes to the class of the document. Among the several methods achieving this goal, this paper adopted the $\chi^2$ statistic, which measures the degree of dependency of a term t on a class c and represents the relationship between t and c as a numerical value. To demonstrate the effectiveness of the HOC algorithm, a series of performance evaluations was carried out using the well-known Reuters-21578 news collection. The results showed that the HOC algorithm contributes greatly to detecting and producing complex concepts by generating the concept hierarchy as a lattice structure.
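
Two building blocks of the pipeline above can be sketched with standard tools: the bottom-up agglomerative merging that HOC generalizes, and the $\chi^2$ term-class statistic used in the validation phase. The snippet below is a hedged illustration with toy coordinates; it is not the HOC algorithm itself, whose overlap-allowing lattice is the paper's contribution.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Documents as 2-D points, as produced by the paper's TF-IDF preprocessing
# (coordinates here are toy values, not real document projections).
docs = np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2], [0.5, 0.5]])

# Standard agglomerative clustering: repeatedly merge the closest clusters
# (Euclidean distance) until a root remains. HOC modifies this step to let
# a document join several clusters at the same level of the hierarchy.
print(linkage(docs, method="average", metric="euclidean"))

def chi_square(n11, n10, n01, n00):
    """Chi-square dependency of term t on class c from a 2x2 table:
    n11 = docs in c with t, n10 = docs outside c with t,
    n01 = docs in c without t, n00 = docs outside c without t."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return num / den

print(chi_square(40, 5, 10, 45))  # a strongly class-dependent term (toy counts)
```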

Development of Sentiment Analysis Model for the hot topic detection of online stock forums (온라인 주식 포럼의 핫토픽 탐지를 위한 감성분석 모형의 개발)

  • Hong, Taeho;Lee, Taewon;Li, Jingjing
    • Journal of Intelligence and Information Systems
    • /
    • v.22 no.1
    • /
    • pp.187-204
    • /
    • 2016
  • Document classification based on emotional polarity has become a welcome emerging task owing to the great explosion of data on the Web. In the big data age, there are too many information sources to refer to when making decisions. For example, when considering travel to a city, a person may search for reviews through a search engine such as Google or on social networking services (SNSs) such as blogs, Twitter, and Facebook. The emotional polarity of positive and negative reviews helps a user decide whether or not to make the trip. Sentiment analysis of customer reviews has become an important research topic as data mining technology is widely accepted for text mining of the Web. Sentiment analysis classifies documents through machine learning techniques, such as decision trees, neural networks, and support vector machines (SVMs), and is used to determine the attitude, position, and sensibility of people who write articles about various topics published on the Web. Regardless of the polarity of customer reviews, emotional reviews are very helpful material for analyzing customers' opinions. Sentiment analysis helps with instantly understanding what customers really want through automated text mining: it extracts subjective information from text on the Web and determines the attitude or position of the person who wrote an article and expressed an opinion about a particular topic. In this study, we developed a model that selects hot topics from user posts on China's online stock forums by using the k-means algorithm and a self-organizing map (SOM). In addition, we developed a detection model to predict hot topics by using machine learning techniques such as logit, decision trees, and SVMs. We employed sentiment analysis to develop our model for the selection and detection of hot topics from China's online stock forums. The sentiment analysis calculates a sentiment value for a document based on contrast and classification according to a polarity sentiment dictionary (positive or negative). The online stock forum was an attractive site because of its information about stock investment. Users post numerous texts about stock movements, analyzing the market in light of government policy announcements, market reports, reports from economic research institutes, and even rumors. We divided the forum's topics into 21 categories for sentiment analysis, and 144 topics were selected among these categories. The posts were crawled to build a positive and negative text database, and we ultimately obtained 21,141 posts on 88 topics by preprocessing the text from March 2013 to February 2015. An interest index was defined to select the hot topics, and the k-means algorithm and SOM produced equivalent results on these data. We developed decision tree models to detect hot topics with three algorithms: CHAID, CART, and C4.5; the results of CHAID were subpar compared to the others. We also employed an SVM to detect hot topics from the negative data. The SVM models were trained with the radial basis function (RBF) kernel, tuned by a grid search, to detect the hot topics. The detection of hot topics using sentiment analysis provides investors with the latest trends and hot topics in the stock forum, so they no longer need to search the vast amounts of information on the Web. Our proposed model is also helpful for rapidly determining customers' signals or attitudes towards government policy and firms' products and services.
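
The final SVM step described above, an RBF-kernel classifier tuned by grid search, can be sketched with scikit-learn. Everything below is a hedged illustration: the feature matrix (imagined as per-topic sentiment and interest-index values) and the hot-topic labels are synthetic, not the forum data.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                   # hypothetical per-topic features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # 1 = hot topic (synthetic rule)

# Grid search over the RBF kernel's C and gamma, as the abstract describes.
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```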

An Analytical Approach Using Topic Mining for Improving the Service Quality of Hotels (호텔 산업의 서비스 품질 향상을 위한 토픽 마이닝 기반 분석 방법)

  • Moon, Hyun Sil;Sung, David;Kim, Jae Kyeong
    • Journal of Intelligence and Information Systems
    • /
    • v.25 no.1
    • /
    • pp.21-41
    • /
    • 2019
  • Thanks to the rapid development of information technology, the data available on the Internet have grown rapidly. In this era of big data, many studies have attempted to offer insights and demonstrate the value of data analysis. In the tourism and hospitality industry, many firms and studies have paid attention to online reviews on social media because of their large influence over customers. As tourism is an information-intensive industry, the effect of these information networks on social media platforms is more remarkable than for any other type of media. However, there are some limitations to the service quality improvements that can be made based on opinions from social media platforms. Users on social media represent their opinions as text, images, and so on, so the raw data sets from these reviews are unstructured. Moreover, these data sets are too big for humans to extract new information and hidden knowledge from unaided. To use them for business intelligence and analytics applications, proper big data techniques such as natural language processing and data mining are needed. This study suggests an analytical approach to directly yield insights from these reviews to improve the service quality of hotels. Our proposed approach consists of topic mining to extract topics contained in the reviews and decision tree modeling to explain the relationship between topics and ratings. Topic mining refers to a method for finding, within a collection of documents, a group of words that represents a document. Among several topic mining methods, we adopted the Latent Dirichlet Allocation (LDA) algorithm, which is the most widely used. However, LDA alone is not enough to find insights that can improve service quality, because it cannot find the relationship between topics and ratings. To overcome this limitation, we also use the Classification and Regression Tree (CART) method, a kind of decision tree technique. Through the CART method, we can find which topics are related to positive or negative ratings of a hotel and visualize the results. Therefore, this study investigates an analytical approach for improving hotel service quality from unstructured review data sets. Through experiments on four hotels in Hong Kong, we find the strengths and weaknesses of each hotel's services and suggest improvements to aid customer satisfaction. From positive reviews in particular, we find what these hotels should maintain: for example, compared with the other hotels, one hotel has a good location and room condition, as extracted from its positive reviews. In contrast, we also find what they should modify in their services from negative reviews; for example, one hotel should improve room conditions related to soundproofing. These results show that our approach is useful for finding insights into the service quality of hotels. That is, from an enormous body of review data, our approach can provide practical suggestions for hotel managers to improve their service quality. In the past, studies for improving service quality relied on surveys or interviews of customers, but these methods are often costly and time consuming, and the results may be distorted by biased sampling or untrustworthy answers. The proposed approach directly obtains honest feedback from customers' online reviews and draws insights through a type of big data analysis, so it is a more useful tool for overcoming the limitations of surveys or interviews. Moreover, our approach easily obtains service quality information for other hotels or services in the tourism industry, because it needs only open online reviews and ratings as input data. Furthermore, the performance of our approach will improve if other structured and unstructured data sources are added.
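
The two-stage pipeline above (LDA topic mining followed by a CART-style decision tree over ratings) can be outlined with scikit-learn. The snippet is a hedged sketch: the reviews and ratings are toy placeholders, and `LatentDirichletAllocation` plus `DecisionTreeClassifier` stand in for the paper's exact implementations.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.tree import DecisionTreeClassifier

reviews = [
    "great location near the station, clean room",
    "noisy room, poor soundproofing, could not sleep",
    "friendly staff and excellent breakfast",
    "thin walls and street noise all night",
]
ratings = [1, 0, 1, 0]  # 1 = positive review, 0 = negative (toy labels)

# Topic mining: LDA over word counts yields per-review topic proportions.
counts = CountVectorizer().fit_transform(reviews)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_weights = lda.fit_transform(counts)

# CART: which topics drive positive vs. negative ratings?
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(topic_weights, ratings)
print(tree.feature_importances_)  # importance of each topic for the ratings
```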

Finding the One-to-One Optimum Path Considering User's Route Perception Characteristics of Origin and Destination (Focused on the Origin-Based Formulation and Algorithm) (출발지와 도착지의 경로인지특성을 반영한 One-to-One 최적경로탐색 (출발지기반 수식 및 알고리즘을 중심으로))

  • Shin, Seong-Il;Sohn, Kee-Min;Cho, Chong-Suk;Cho, Tcheol-Woong;Kim, Won-Keun
    • Journal of Korean Society of Transportation
    • /
    • v.23 no.7 s.85
    • /
    • pp.99-110
    • /
    • 2005
  • The total travel cost of a route connecting an origin with a destination (O-D) consists of the sum of link travel costs and route perception costs. If link perception costs differ according to the origin and destination, optimal route search is limited in how well it can reflect actual conditions because of the route enumeration problem. The purpose of this study is to propose an optimal route searching formulation and algorithm that can reflect a different link perception cost for each route while avoiding the enumeration problem between origin and destination. This method defines the minimum unit of a route as a link and ultimately compares routes using link-unit costs. The proposed method considers the perception travel cost at both the origin and the destination in the optimal route searching process, whereas conventional models reflect the perception cost only at the origin. However, this two-way searching algorithm still cannot guarantee an optimum solution. To overcome this problem, this study proposes an origin-based optimal route searching method developed from a destination-based optimal perception route tree. The study investigates whether the proposed formulations and algorithms can reflect route perception behavior tied to the features of the origin and destination in a real traffic network, through an example study covering the diversity of route information for the surrounding area and the perception cost for the road hierarchy.
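
To make the cost structure concrete, here is a hedged sketch of a one-to-one search over generalized link costs (travel cost plus perception cost), using plain Dijkstra. The network and cost values are hypothetical, and the flat per-link perception cost is a simplification: the paper's point is precisely that perception costs vary with the origin and destination, which this sketch does not model.

```python
import heapq

# graph[node] = [(neighbor, travel_cost, perception_cost), ...]  (toy network)
graph = {
    "O": [("A", 4.0, 1.0), ("B", 2.0, 2.5)],
    "A": [("D", 5.0, 0.5)],
    "B": [("A", 1.0, 0.5), ("D", 8.0, 1.0)],
    "D": [],
}

def best_path(graph, origin, dest):
    # Dijkstra over the generalized cost = travel cost + perception cost.
    pq, seen = [(0.0, origin, [origin])], set()
    while pq:
        cost, node, path = heapq.heappop(pq)
        if node == dest:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, travel, perceive in graph[node]:
            if nxt not in seen:
                heapq.heappush(pq, (cost + travel + perceive, nxt, path + [nxt]))
    return float("inf"), []

print(best_path(graph, "O", "D"))  # -> (10.5, ['O', 'A', 'D'])
```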

Automated Detecting and Tracing for Plagiarized Programs using Gumbel Distribution Model (굼벨 분포 모델을 이용한 표절 프로그램 자동 탐색 및 추적)

  • Ji, Jeong-Hoon;Woo, Gyun;Cho, Hwan-Gue
    • The KIPS Transactions:PartA
    • /
    • v.16A no.6
    • /
    • pp.453-462
    • /
    • 2009
  • Studies on software plagiarism detection, prevention, and judgement have become widespread due to the growing interest in, and importance of, the protection and authentication of software intellectual property. Many previous studies focused on comparing all pairs of submitted codes by using attribute counting, token patterns, program parse trees, and similarity measuring algorithms. It is important to provide a clear-cut model for distinguishing plagiarism from collaboration. This paper proposes a source code clustering algorithm using a probability model on an extreme value distribution. First, we propose an asymmetric distance measure pdist($P_a$, $P_b$) to measure the similarity of $P_a$ and $P_b$. Then, we construct the Plagiarism Direction Graph (PDG) for a given program set using pdist($P_a$, $P_b$) as edge weights, and transform the PDG into a Gumbel Distance Graph (GDG) model, since we found that the pdist($P_a$, $P_b$) score distribution is similar to the well-known Gumbel distribution. Second, we newly define pseudo-plagiarism, a sort of virtual plagiarism forced by a very strong functional requirement in the specification. We conducted experiments with 18 groups of programs (more than 700 source codes) collected from the ICPC (International Collegiate Programming Contest) and KOI (Korean Olympiad for Informatics) programming contests. The experiments showed that most plagiarized codes could be detected with high sensitivity and that our algorithm successfully separated real plagiarism from pseudo-plagiarism.
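
The statistical core above, modeling similarity scores with an extreme value distribution, can be sketched with SciPy. The scores below are synthetic draws rather than real pdist($P_a$, $P_b$) values, and the 99th-percentile cutoff is an illustrative choice, not the paper's decision rule.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
scores = rng.gumbel(loc=40.0, scale=8.0, size=500)  # synthetic similarity scores

# Fit a Gumbel model to the score population, then flag pairs whose score
# is improbably high under that model as plagiarism suspects.
loc, scale = stats.gumbel_r.fit(scores)
threshold = stats.gumbel_r.ppf(0.99, loc, scale)    # top 1% of the fitted tail

suspicious = scores[scores > threshold]
print(f"fit loc={loc:.1f} scale={scale:.1f}, {suspicious.size} suspect pairs")
```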

Accelerated compression of sub-images by use of effective motion estimation and difference image methods in integral imaging (집적영상에서 효율적인 물체움직임 추정 및 차 영상 기법을 이용한 서브영상의 고속 압축)

  • Lee, Hyoung-Woo;Kim, Eun-Soo
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.16 no.12
    • /
    • pp.2762-2770
    • /
    • 2012
  • In this paper, we propose a novel approach to effectively compress the sub-images transformed from picked-up elemental images in integral imaging, in which the motion vectors of the object in each sub-image are quickly and accurately estimated and compensated by the combined use of MSE (mean square error)-based TSS (three-step search) and FS (full search) schemes. That is, the possible object areas in each sub-image are first searched with the fast TSS algorithm, and these selected object areas are then fully searched with the accurate FS algorithm. Furthermore, the sub-images in which all the object's motion vectors have been compensated are transformed into residual images by the difference image method and finally compressed with the MPEG-4 algorithm. Experimental results reveal that the proposed method achieves a 214% improvement in compression time per image frame compared to the conventional method while keeping the same compression ratio. These successful results confirm the feasibility of the proposed method in practical applications.
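
The coarse search stage above is the classic MSE-based three-step search. Below is a minimal sketch on synthetic frames: a Gaussian-bump image shifted between frames so the recovered motion vector is easy to verify; the block size and search step are illustrative.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a.astype(np.float64) - b) ** 2))

def tss(ref, cur, top, left, block=8, step=4):
    """Three-step search: find where cur[top:top+block, left:left+block]
    best matches in ref, halving the search step each round."""
    target = cur[top:top + block, left:left + block]
    cy, cx = top, left
    while step >= 1:
        best = (mse(ref[cy:cy + block, cx:cx + block], target), cy, cx)
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                y, x = cy + dy, cx + dx
                if 0 <= y <= ref.shape[0] - block and 0 <= x <= ref.shape[1] - block:
                    best = min(best, (mse(ref[y:y + block, x:x + block], target), y, x))
        _, cy, cx = best
        step //= 2
    return cy - top, cx - left  # motion vector (dy, dx)

# Smooth synthetic frame: a Gaussian bump, shifted by (3, -2) in the next frame.
yy, xx = np.mgrid[0:64, 0:64].astype(np.float64)
ref = np.exp(-((xx - 32) ** 2 + (yy - 32) ** 2) / 200.0)
cur = np.roll(ref, shift=(3, -2), axis=(0, 1))
print(tss(ref, cur, 24, 24))  # -> (-3, 2): the block's source in the reference
```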

An Efficient CPLD Technology Mapping considering Area under Time Constraint (시간 제약 조건하에서 면적을 고려한 효율적인 CPLD 기술 매핑)

  • Kim, Jae-Jin;Kim, Hui-Seok
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.38 no.1
    • /
    • pp.79-85
    • /
    • 2001
  • In this paper, we propose a new technology mapping algorithm for CPLDs that considers area under a time constraint (TMFCPLD). The algorithm detects feedback in Boolean networks and replaces the variables that have feedback with temporary variables; creating the temporary variables transforms the sequential circuit into a combinational circuit. The transformed circuits are represented as a DAG. After traversing all nodes in the DAG, nodes with multiple output edges are replicated and reconstructed into fanout-free trees. This reduces area and improves the total run time of circuits compared with the previously proposed TEMPLA. Using the time constraint and the delay time of the device, the number of multi-levels into which the graph can be partitioned is decided. The initial cost of each node is the number of OR-terms it has. Among the mappable clusters, the cluster with the fewest multi-levels is selected, and the graph is partitioned. Nodes in the partitioned clusters are merged by collapsing and fitted to the number of OR-terms in a given CLB by bin packing. The proposed algorithm has been applied to MCNC logic synthesis benchmark circuits, reducing the number of CLBs by 62.2% compared with DDMAP, by 17.6% compared with TEMPLA, and by 4.7% compared with TMCPLD. These results show the efficiency of the proposed technology mapping for CPLDs (a bin-packing sketch follows this entry).

  • PDF
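
The final fitting step above packs collapsed nodes into CLBs by OR-term count. Below is a hedged sketch using a first-fit-decreasing heuristic; the paper does not specify its bin-packing variant, and the node sizes and CLB capacity here are hypothetical.

```python
def pack_into_clbs(or_term_counts, clb_capacity):
    """First-fit decreasing: place each node (largest first) into the first
    CLB with enough remaining OR-term capacity, opening a new CLB if none fits."""
    bins = []        # remaining OR-term capacity of each CLB
    assignment = []  # (node_size, clb_index) placements
    for size in sorted(or_term_counts, reverse=True):
        for i, free in enumerate(bins):
            if size <= free:
                bins[i] -= size
                assignment.append((size, i))
                break
        else:
            bins.append(clb_capacity - size)
            assignment.append((size, len(bins) - 1))
    return len(bins), assignment

nodes = [5, 3, 7, 2, 4, 6, 1]    # OR-terms per collapsed node (toy values)
print(pack_into_clbs(nodes, 8))  # -> number of CLBs used and the placements
```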

A binary adaptive arithmetic coding algorithm based on adaptive symbol changes for lossless medical image compression (무손실 의료 영상 압축을 위한 적응적 심볼 교환에 기반을 둔 이진 적응 산술 부호화 방법)

  • 지창우;박성한
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.22 no.12
    • /
    • pp.2714-2726
    • /
    • 1997
  • In this paper, a medical image compression method based on adaptive symbol changes is presented. First, the differential image domain is obtained by applying differentiation rules or adaptive predictors to the original medical image. The algorithm then determines the context associated with the differential image from that domain. Prediction symbols, which are thought to be the most probable differential image values, are maintained at a high probability through the adaptive symbol change procedure, based on estimates of the polarity coincidence between the differential image values to be coded under the context and the differential image values in the model template. At the coding step, the differential image values are encoded as "predicted" or "non-predicted" by a binary adaptive arithmetic encoder, in which a binary decision tree is employed. The simulation results indicate that the prediction hit ratios of differential image values under the proposed algorithm improve the coding gain by 25% and 23% over an arithmetic coder with the ISO JPEG lossless predictor and an arithmetic coder with differentiation rules or adaptive predictors, respectively. The method can be used in the compression part of a medical PACS, because it allows the encoder to be applied directly to full bit-plane medical images without decomposing them into a series of binary bit-planes, and it lowers encoder complexity by using only additions when recursively subdividing unit intervals (a sketch of the adaptive probability model follows this entry).

  • PDF
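
The coding step above relies on an adaptive, context-conditioned probability model. The sketch below shows only that modeling idea: per-context counts that drive, and are updated by, each binary "predicted"/"non-predicted" decision. It is not the paper's coder, and the context and bit stream are toy values.

```python
from collections import defaultdict

class AdaptiveBinaryModel:
    def __init__(self):
        # Laplace-smoothed counts of (0, 1) decisions per context.
        self.counts = defaultdict(lambda: [1, 1])

    def p_one(self, context):
        # P(bit == 1 | context): the probability an arithmetic coder would
        # use to size the code interval for the next binary decision.
        zero, one = self.counts[context]
        return one / (zero + one)

    def update(self, context, bit):
        # Adapt after each coded decision, as an adaptive coder does.
        self.counts[context][bit] += 1

model = AdaptiveBinaryModel()
# Toy stream of "predicted" (1) / "non-predicted" (0) decisions for one context.
for bit in [1, 1, 0, 1, 1, 1, 0, 1]:
    print(f"P(predicted)={model.p_one('ctx'):.2f} before coding bit {bit}")
    model.update("ctx", bit)
```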

Machine-learning Approaches with Multi-temporal Remotely Sensed Data for Estimation of Forest Biomass and Forest Reference Emission Levels (시계열 위성영상과 머신러닝 기법을 이용한 산림 바이오매스 및 배출기준선 추정)

  • Lee, Yong-Kyu;Lee, Jung-Soo
    • Journal of Korean Society of Forest Science
    • /
    • v.111 no.4
    • /
    • pp.603-612
    • /
    • 2022
  • The aims of this study were to evaluate a machine-learning-based forest biomass estimation model for estimating subnational forest biomass and to comparatively analyze REDD+ forest reference emission levels. Time-series Landsat satellite imagery and ESA Biomass Climate Change Initiative information were used to build the machine-learning-based biomass estimation model. The k-nearest neighbors (kNN) algorithm, a non-parametric learning model, and the tree-based random forest (RF) model were applied as the machine-learning algorithms, and the estimated biomasses were compared with the forest reference emission level (FREL) data provided by the Paraguayan government. The root mean square error (RMSE) of the kNN model with its optimal parameter was 35.9, while the RMSE of the RF model was lower at 34.41, showing that the RF model was superior. When the FREL, kNN, and RF methods were separately used to set the reference emission levels, the gradient was approximately -33,000 tons, -253,000 tons, and -92,000 tons, respectively. These results show that the machine-learning-based estimation model is more suitable than the existing methods for setting reference emission levels.
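
The model comparison above (kNN versus random forest regression, scored by RMSE) can be reproduced in outline with scikit-learn. The predictors and biomass targets below are synthetic placeholders for the Landsat-derived features, so the RMSE values printed have no relation to the paper's 35.9 and 34.41.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(500, 6))                         # synthetic spectral bands
y = 100 * X[:, 0] + 50 * X[:, 1] + rng.normal(0, 10, 500)    # synthetic biomass

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for name, model in [("kNN", KNeighborsRegressor(n_neighbors=5)),
                    ("RF", RandomForestRegressor(random_state=0))]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rmse = mean_squared_error(y_te, pred) ** 0.5  # RMSE, as in the paper
    print(f"{name} RMSE: {rmse:.2f}")
```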