• Title/Summary/Keyword: fast search algorithm

Finding the time sensitive frequent itemsets based on data mining technique in data streams (데이터 스트림에서 데이터 마이닝 기법 기반의 시간을 고려한 상대적인 빈발항목 탐색)

  • Park, Tae-Su;Chun, Seok-Ju;Lee, Ju-Hong;Kang, Yun-Hee;Choi, Bum-Ghi
    • Journal of The Korean Association of Information Education
    • /
    • v.9 no.3
    • /
    • pp.453-462
    • /
    • 2005
  • Recently, due to technical improvements in storage devices and networks, the amount of data has increased rapidly, and it is required to find the knowledge embedded in a data stream as quickly as possible. Huge volumes of data in a stream are created continuously and change fast. Various algorithms for finding frequent itemsets in a data stream have been actively proposed. Current research does not offer an appropriate method for finding frequent itemsets that reflect the flow of time, but provides only frequent items based on total aggregation values. In this paper we propose a novel algorithm for finding relative frequent itemsets according to time in a data stream. We also propose a method to store frequent and sub-frequent items within limited memory, and a method to update time-variant frequent items. The performance of the proposed method is analyzed through a series of experiments. The proposed method can search both frequent itemsets and relative frequent itemsets using only the activity patterns of students at each time slot. Thus, our method can enhance the effectiveness of learning and support the best plan for individual learning.
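
The abstract does not spell out the exact bookkeeping, so the following is only a minimal Python sketch of the general idea: decay counts so that recent time slots weigh more, and prune sub-frequent items to respect the limited-memory constraint. The class name, decay factor, and thresholds are all illustrative.

```python
from collections import defaultdict

class TimeSensitiveCounter:
    """Tracks relatively frequent items in a stream, weighting recent
    time slots more heavily via exponential decay (illustrative sketch)."""

    def __init__(self, decay=0.95, freq_threshold=3.0, sub_threshold=1.0):
        self.decay = decay                  # per-time-slot decay factor
        self.freq_threshold = freq_threshold
        self.sub_threshold = sub_threshold  # below this, items are pruned
        self.counts = defaultdict(float)

    def new_time_slot(self, items):
        # Age all existing counts so old activity fades out.
        for item in list(self.counts):
            self.counts[item] *= self.decay
            # Prune items that fell below the sub-frequent threshold
            # to respect the limited-memory constraint.
            if self.counts[item] < self.sub_threshold:
                del self.counts[item]
        # Count the items seen in the current time slot.
        for item in items:
            self.counts[item] += 1.0

    def frequent_items(self):
        return {i: c for i, c in self.counts.items() if c >= self.freq_threshold}

# Usage: feed one batch of items per time slot.
counter = TimeSensitiveCounter()
for slot in [["a", "b", "a"], ["a", "c"], ["a", "a", "d"]]:
    counter.new_time_slot(slot)
print(counter.frequent_items())   # "a" dominates recent slots
```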

Hierarchical Overlapping Clustering to Detect Complex Concepts (중복을 허용한 계층적 클러스터링에 의한 복합 개념 탐지 방법)

  • Hong, Su-Jeong;Choi, Joong-Min
    • Journal of Intelligence and Information Systems
    • /
    • v.17 no.1
    • /
    • pp.111-125
    • /
    • 2011
  • Clustering is the process of grouping similar or relevant documents into a cluster and assigning a meaningful concept to that cluster. By doing so, clustering facilitates fast and accurate search for relevant documents by narrowing the search range to the collection of documents belonging to related clusters. Effective clustering requires techniques for identifying similar documents and grouping them into a cluster, and for discovering the concept most relevant to the cluster. One problem that often appears in this context is the detection of a complex concept that overlaps several simple concepts at the same hierarchical level. Previous clustering methods were unable to identify and represent a complex concept that belongs to several different clusters at the same level of the concept hierarchy, and could not validate the semantic hierarchical relationship between a complex concept and each of the simple concepts. To solve these problems, this paper proposes a new clustering method that identifies and represents complex concepts efficiently. We developed the Hierarchical Overlapping Clustering (HOC) algorithm, which modifies the traditional agglomerative hierarchical clustering algorithm to allow overlapping clusters at the same level of the concept hierarchy. The HOC algorithm represents the clustering result not as a tree but as a lattice in order to detect complex concepts. We developed a system that employs the HOC algorithm to carry out complex concept detection. The system operates in three phases: 1) preprocessing of documents, 2) clustering using the HOC algorithm, and 3) validation of the semantic hierarchical relationships among the concepts in the lattice obtained from clustering. The preprocessing phase represents each document as x-y coordinate values in a 2-dimensional space based on the weights of the terms appearing in it. First, it applies stopword removal and stemming to extract index terms. Then, each index term is assigned a TF-IDF weight, and the x-y coordinates of each document are determined by combining the TF-IDF values of its terms. The clustering phase uses the HOC algorithm, in which the similarity between documents is calculated using the Euclidean distance. Initially, a cluster is generated for each document by grouping the documents closest to it. Then, the distance between any two clusters is measured, and the closest clusters are merged into a new cluster. This process is repeated until the root cluster is generated. In the validation phase, feature selection is applied to check whether the cluster concepts built by the HOC algorithm have meaningful hierarchical relationships. Feature selection extracts key features from a document by identifying and weighting its important and representative terms. To select key features correctly, a method is needed to determine how much each term contributes to the class of the document. Among several methods for this purpose, this paper adopts the $\chi^2$ statistic, which measures the degree of dependency between a term t and a class c and represents the relationship between t and c as a numerical value. To demonstrate the effectiveness of the HOC algorithm, a series of performance evaluations was carried out using the well-known Reuters-21578 news collection. The results showed that the HOC algorithm contributes greatly to detecting and producing complex concepts by generating the concept hierarchy as a lattice structure.
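
The $\chi^2$ statistic used in the validation phase is commonly computed from a 2x2 contingency table of term occurrence against class membership. A small illustrative Python sketch (not the authors' code; the table layout follows the standard feature-selection formulation):

```python
def chi_square(A, B, C, D):
    """Chi-square dependency between term t and class c from a 2x2 table:
    A = docs in c containing t,  B = docs not in c containing t,
    C = docs in c without t,     D = docs not in c without t."""
    N = A + B + C + D
    num = N * (A * D - C * B) ** 2
    den = (A + C) * (B + D) * (A + B) * (C + D)
    return num / den if den else 0.0

# A term appearing in 40 of 50 documents of class c but only 10 of 150
# others gets a high score, i.e. a strong term-class dependency.
print(chi_square(A=40, B=10, C=10, D=140))
```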

Development of a back analysis program for reasonable derivation of tunnel design parameters (합리적인 터널설계정수 산정을 위한 역해석 프로그램 개발)

  • Kim, Young-Joon;Lee, Yong-Joo
    • Journal of Korean Tunnelling and Underground Space Association
    • /
    • v.15 no.3
    • /
    • pp.357-373
    • /
    • 2013
  • In this paper, a back analysis program for analyzing the behavior of a tunnel-ground system and evaluating material properties and tunnel design parameters was developed. The program implements back analysis of underground structures by combining FLAC with an optimization algorithm as a direct method. In particular, among the available optimization methods, the Rosenbrock method, which performs a direct search without computing derivatives, was adopted as the back analysis algorithm. The program was applied in the field to evaluate design parameters, and back analyses were carried out using field measurement results from five sites. In the course of the back analysis, nonlinear regression was performed to identify the optimal function fitting the measured ground displacements. Exponential and fractional functions were used for the regression, and the total displacement calculated by the optimal function was used as the back analysis input. As a result, the displacement recalculated through back analysis from the measured structural displacement showed an error of 4.5% compared with the measured data. Hence, the program developed in this study proved to be effectively applicable to tunnel analysis.
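
The coupling to FLAC is not shown in the abstract, so the sketch below replaces the forward run with a dummy function and uses a simplified axis-by-axis direct search in place of the full Rosenbrock method (which additionally rotates the search directions after each sweep). All parameter names and values are illustrative:

```python
import numpy as np

# Back analysis fits material parameters so that computed displacements
# match measured ones. The real program drives FLAC; a dummy forward
# model stands in here.
def forward_model(params):
    E, K0 = params                      # e.g. stiffness, earth pressure coefficient
    return np.array([120.0 / E, 60.0 / E + 0.1 * K0])

measured = np.array([0.012, 0.010])     # measured displacements (illustrative)

def misfit(params):
    return float(np.sum((forward_model(params) - measured) ** 2))

def direct_search(x0, steps, shrink=0.5, sweeps=60):
    """Simplified derivative-free, axis-by-axis direct search."""
    x = np.asarray(x0, dtype=float)
    steps = np.asarray(steps, dtype=float)
    best = misfit(x)
    for _ in range(sweeps):
        for i in range(len(x)):
            improved = False
            for sign in (1.0, -1.0):
                trial = x.copy()
                trial[i] += sign * steps[i]
                f = misfit(trial)
                if f < best:
                    x, best = trial, f
                    improved = True
                    break
            if not improved:
                steps[i] *= shrink      # no gain on this axis: refine the step
    return x, best

params, err = direct_search([5000.0, 0.5], steps=[2000.0, 0.2])
print(params, err)   # should approach E ~ 10000, K0 ~ 0.04 for this toy model
```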

Determining the number of Clusters in On-Line Document Clustering Algorithm (온라인 문서 군집화에서 군집 수 결정 방법)

  • Jee, Tae-Chang;Lee, Hyun-Jin;Lee, Yill-Byung
    • The KIPS Transactions:PartB
    • /
    • v.14B no.7
    • /
    • pp.513-522
    • /
    • 2007
  • Clustering divides given data and automatically discovers the hidden meanings in the data. It analyzes data that are difficult for people to examine in detail and forms clusters of data with similar characteristics. An on-line document clustering system, which groups similar documents using the results of a search engine, aims to improve convenience in information retrieval. Document clustering is performed automatically without human intervention, and the number of clusters, which affects the clustering result, should be decided automatically too. Furthermore, an on-line system must guarantee fast response times. This paper proposes a method for determining the number of clusters automatically from geometric information. The proposed method consists of two stages: in the first stage, the cluster centers are projected onto a low-dimensional plane, and in the second stage, clusters are merged based on the distances between their centers in that plane. Experiments with real data showed that clustering performance improved and the response time was suitable for on-line environments.
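
A minimal sketch of the two-stage idea, assuming PCA as the projection and a fixed distance threshold for merging (neither choice is specified in the abstract):

```python
import numpy as np
from sklearn.decomposition import PCA

def merge_clusters_by_projection(centers, threshold):
    """Project cluster centers onto a 2-D plane, then merge clusters whose
    projected centers lie within `threshold` (union-find style merging)."""
    proj = PCA(n_components=2).fit_transform(centers)
    parent = list(range(len(centers)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(proj)):
        for j in range(i + 1, len(proj)):
            if np.linalg.norm(proj[i] - proj[j]) < threshold:
                parent[find(i)] = find(j)

    labels = [find(i) for i in range(len(centers))]
    return len(set(labels)), labels

# Five initial cluster centers; two close pairs and one singleton.
centers = np.array([[0, 0, 0], [0.1, 0, 0], [5, 5, 5], [5.1, 5, 5], [9, 0, 9]])
k, labels = merge_clusters_by_projection(centers, threshold=1.0)
print(k, labels)   # expected: 3 merged clusters
```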

Extracting Silhouettes of a Polyhedral Model from a Curved Viewpoint Trajectory (곡선 궤적의 이동 관측점에 대한 다면체 모델의 윤곽선 추출)

  • Kim, Gu-Jin;Baek, Nak-Hun
    • Journal of the Korea Computer Graphics Society
    • /
    • v.8 no.2
    • /
    • pp.1-7
    • /
    • 2002
  • Fast extraction of the silhouettes of a model is very useful for many applications in computer graphics and animation. In this paper, we present an efficient algorithm to compute a sequence of perspective silhouettes of a polyhedral model from a moving viewpoint. The viewpoint is assumed to move along a trajectory q(t), a space curve in a time parameter t. We can then compute the time intervals during which each edge of the model is contained in the silhouette using two major computations: (i) intersecting q(t) with two planes and (ii) a number of dot products. If q(t) is a curve of degree n, there are at most n + 1 time intervals in which an edge lies on the silhouette. For each time point $t_i$, we can extract the silhouette edges by searching for the intervals containing $t_i$ among the computed intervals. For efficient search, we propose two data structures for storing the intervals: an interval tree and an array. Our algorithm can easily be extended to compute parallel silhouettes with minor modifications.
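
A sketch of the array-based interval query (the interval-tree variant is not shown); the intervals and edge names below are illustrative:

```python
import bisect

# Each edge contributes up to n + 1 intervals [t_start, t_end] during
# which it lies on the silhouette (illustrative values).
intervals = [
    (0.0, 0.4, "e1"),
    (0.3, 0.9, "e2"),
    (0.5, 1.0, "e3"),
]

# Sort by start time so the scan can be cut off early.
intervals.sort(key=lambda iv: iv[0])
starts = [iv[0] for iv in intervals]

def silhouette_edges(t):
    """Return the edges whose interval contains time t."""
    # Only intervals starting at or before t can contain it.
    hi = bisect.bisect_right(starts, t)
    return [edge for s, e, edge in intervals[:hi] if e >= t]

print(silhouette_edges(0.35))   # ['e1', 'e2']
```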

A Study on Shape Optimization of Plane Truss Structures (평면(平面) 트러스 구조물(構造物)의 형상최적화(形狀最適化)에 관한 연구(硏究))

  • Lee, Gyu won;Byun, Keun Joo;Hwang, Hak Joo
    • KSCE Journal of Civil and Environmental Engineering Research
    • /
    • v.5 no.3
    • /
    • pp.49-59
    • /
    • 1985
  • The formulation of geometric optimization for truss structures based on elasticity theory turns out to be a nonlinear programming problem that must deal with the cross-sectional areas of the members and the coordinates of the nodes simultaneously. A few techniques have been proposed and adopted for this nonlinear programming problem to date. These techniques, however, have limitations on truss shapes, loading conditions, and design criteria that restrict their practical application to real structures. A generalized algorithm for the geometric optimization of truss structures that eliminates the above limitations is developed in this study. The developed algorithm uses a two-phase technique. In the first phase, the cross-sectional areas of the truss members are optimized by transforming the nonlinear problem into a SUMT (sequential unconstrained minimization technique) formulation and solving it with a modified Newton-Raphson method. In the second phase, the geometric shape is optimized using the unidirectional search technique of the Rosenbrock method, which makes it possible to minimize the objective function alone. The algorithm developed in this study is numerically tested on several truss structures with various shapes, loading conditions, and design criteria, and compared with the results of other algorithms to examine its applicability and stability. The numerical comparisons show that the two-phase algorithm developed in this study is safely applicable to any design criterion, and that its convergence is fast and stable compared with other iterative methods for the geometric optimization of truss structures.
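
A generic sketch of the SUMT idea on a toy sizing problem, not the paper's formulation: the constrained problem becomes a sequence of unconstrained minimizations with a growing penalty weight. The paper solves each subproblem with a modified Newton-Raphson method; scipy's Nelder-Mead stands in here.

```python
import numpy as np
from scipy.optimize import minimize

# Toy sizing problem: minimize weight w(x) = 2*x1 + 3*x2 subject to a
# stress-like constraint g(x) = 1/x1 + 1/x2 - 1 <= 0 with x > 0
# (problem and numbers are illustrative).
def weight(x):
    return 2.0 * x[0] + 3.0 * x[1]

def g(x):
    return 1.0 / x[0] + 1.0 / x[1] - 1.0

def penalized(x, r):
    if x[0] <= 0.0 or x[1] <= 0.0:
        return 1e9                      # keep the search in the positive domain
    return weight(x) + r * max(0.0, g(x)) ** 2

def sumt(x0, r0=1.0, grow=10.0, outer=6):
    """Exterior-penalty SUMT: solve a sequence of unconstrained problems
    with growing penalty weight r, pushing the iterates toward feasibility."""
    x, r = np.asarray(x0, dtype=float), r0
    for _ in range(outer):
        x = minimize(lambda v: penalized(v, r), x, method="Nelder-Mead").x
        r *= grow
    return x

x = sumt([3.0, 3.0])
print(x, weight(x), g(x))   # optimum near x1 ~ 2.22, x2 ~ 1.82
```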

Molecular Analysis of Pathogenic Molds Isolated from Clinical Specimen (임상검체에서 분리된 병원성 사상균의 분자생물학적 분석)

  • Lee, Jang Ho;Kwon, Kye Chul;Koo, Sun Hoe
    • Korean Journal of Clinical Laboratory Science
    • /
    • v.52 no.3
    • /
    • pp.229-236
    • /
    • 2020
  • Sixty-five molds isolated from clinical specimens were included in this study. The isolates comprised molds that could be identified morphologically, strains that are difficult to identify because of morphological similarities, and strains requiring species-level identification. PCR and direct sequencing were performed targeting the internal transcribed spacer (ITS) region, the D1/D2 region, and the β-tubulin gene. Comparative sequence analysis against the GenBank database was performed using the basic local alignment search tool (BLAST) algorithm. Sixty-seven percent of the fungi were identified morphologically to the genus level, while sequencing identified 62 of the 65 strains to the genus and species level. The results of phenotypic and molecular identification disagreed for 14 (21.5%) of the 65 strains. B. dermatitidis, T. marneffei, and G. argillacea were identified for the first time in Korea using DNA sequencing. Morphological identification is very useful in terms of reporting time and cost for frequently isolated, fast-growing molds such as Aspergillus. When molecular methods are employed, cost and clinical significance should be considered. On the other hand, molecular identification of molds can provide fast and accurate results.
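
For readers unfamiliar with the workflow, a hedged sketch of the BLAST comparison step using Biopython's NCBI web interface; the library calls are real, but the query sequence below is a short placeholder, not a clinical ITS read, and the call requires network access:

```python
from Bio.Blast import NCBIWWW, NCBIXML

its_read = "TCCGTAGGTGAACCTGCGGAAGGATCATTACTGAGTG"  # placeholder sequence

# Submit the read to NCBI BLAST against the nucleotide database.
result_handle = NCBIWWW.qblast("blastn", "nt", its_read)
record = NCBIXML.read(result_handle)

# Review the top hits with percent identity and E-value, as a lab might.
for alignment in record.alignments[:5]:
    hsp = alignment.hsps[0]
    identity = 100.0 * hsp.identities / hsp.align_length
    print(f"{alignment.title[:60]}  identity={identity:.1f}%  E={hsp.expect:.2g}")
```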

Analysis of the Precedence of Stock Price Variables Using Cultural Content Big Data (문화콘텐츠 빅데이터를 이용한 주가 변수 선행성 분석)

  • Ryu, Jae Pil;Lee, Ji Young;Jeong, Jeong Young
    • The Journal of the Korea Contents Association
    • /
    • v.22 no.4
    • /
    • pp.222-230
    • /
    • 2022
  • Korea's cultural content industry has been developing recently, and behind its growing worldwide recognition lies the real-time sharing services used by global network users, enabled by advances in science and technology. YouTube in particular propagates content quickly and powerfully, in that anyone, not just a limited set of users, can become a potential video provider. Since more than 80% of mobile phone users in Korea use YouTube, YouTube data reflects the psychological factors of its users. For example, metrics such as the view counts, likes, and comments of a channel devoted to a specific personality measure the level of interest in that channel. This parallels the finding that information such as the frequency of keyword searches on portal sites is closely related to the stock market, both economically and psychologically. Therefore, in this study, YouTube information for a representative entertainment company is collected through a crawling algorithm and analyzed for causal relationships with major stock-price variables. This study is meaningful in that it combines the cultural content, IT, and financial fields in keeping with the era of the Fourth Industrial Revolution.
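
The abstract does not name the specific precedence test; the Granger causality test is a standard choice for this kind of lead-lag analysis. A sketch with synthetic data standing in for the crawled YouTube metrics and stock variables (toy numbers only):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
n = 200
dviews = rng.normal(size=n)                    # daily change in view counts
returns = np.zeros(n)
# Returns follow view changes with a one-day lag by construction.
returns[1:] = 0.3 * dviews[:-1] + rng.normal(scale=0.5, size=n - 1)

data = pd.DataFrame({"returns": returns, "dviews": dviews})
# Tests whether the second column ("dviews") helps predict the first.
grangercausalitytests(data[["returns", "dviews"]], maxlag=3)
```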

Selective Word Embedding for Sentence Classification by Considering Information Gain and Word Similarity (문장 분류를 위한 정보 이득 및 유사도에 따른 단어 제거와 선택적 단어 임베딩 방안)

  • Lee, Min Seok;Yang, Seok Woo;Lee, Hong Joo
    • Journal of Intelligence and Information Systems
    • /
    • v.25 no.4
    • /
    • pp.105-122
    • /
    • 2019
  • Dimensionality reduction is one way to handle big data in text mining. For dimensionality reduction, we should consider the density of the data, which has a significant influence on the performance of sentence classification. High-dimensional data require heavy computation and can cause high computational cost and overfitting in the model, so a dimension reduction step is necessary to improve model performance. Diverse methods have been proposed, ranging from merely reducing noise in the data, such as misspellings and informal text, to incorporating semantic and syntactic information. Moreover, the representation and selection of text features affect the performance of classifiers for sentence classification, one of the fields of Natural Language Processing. The common goal of dimension reduction is to find a latent space that is representative of the raw data in the observation space. Existing methods employ various algorithms for dimensionality reduction, such as feature extraction and feature selection. In addition, word embeddings, which learn low-dimensional vector representations of words that capture semantic and syntactic information, are also utilized. To improve performance, recent studies have suggested methods in which the word dictionary is modified according to the positive and negative scores of pre-defined words. The basic idea of this study is that similar words have similar vector representations. Once a feature selection algorithm marks certain words as unimportant, we assume that words similar to them also have no impact on sentence classification. This study proposes two ways to achieve more accurate classification: selective word elimination under specific rules, and construction of word embeddings based on Word2Vec. To select unimportant words from the text, we use the information gain algorithm to measure importance and cosine similarity to search for similar words. First, we eliminate words with comparatively low information gain values from the raw text and build word embeddings. Second, we additionally select words similar to those with low information gain values and build word embeddings from the remaining text. Finally, the filtered text and word embeddings are fed into deep learning models: a Convolutional Neural Network and an attention-based bidirectional LSTM. This study uses customer reviews of Kindle products on Amazon.com, IMDB, and Yelp as datasets and classifies each with the deep learning models. Reviews that received more than five helpful votes, with a ratio of helpful votes over 70%, were classified as helpful reviews; since Yelp shows only the number of helpful votes, we randomly sampled 100,000 reviews with more than five helpful votes from among 750,000 reviews. Minimal preprocessing was applied to each dataset, such as removing numbers and special characters from the text. To evaluate the proposed methods, we compared their performance against Word2Vec and GloVe embeddings built from all the words, and showed that one of the proposed methods outperforms the embeddings that use all the words: removing unimportant words improves performance, although removing too many words lowers it.
For future research, diverse preprocessing approaches and an in-depth analysis of word co-occurrence should be considered when measuring similarity between words. Also, we applied the proposed method only with Word2Vec; other embedding methods such as GloVe, fastText, and ELMo can be combined with the proposed elimination methods, making it possible to explore the combinations of word embedding and elimination methods.
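
A condensed sketch of the elimination idea on a toy corpus, using scikit-learn's mutual information as the information-gain score and gensim's Word2Vec for similarity (the tiny corpus and thresholds are illustrative, not the paper's settings):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif
from gensim.models import Word2Vec

docs = ["great phone loved it", "terrible battery hated it",
        "loved the screen", "hated the battery life"]
labels = [1, 0, 1, 0]

# Score each word's importance by mutual information with the class.
vec = CountVectorizer(binary=True)
X = vec.fit_transform(docs)
ig = mutual_info_classif(X, labels, discrete_features=True)
vocab = vec.get_feature_names_out()
low_info = {w for w, s in zip(vocab, ig) if s < 0.1}   # illustrative threshold

# Expand the removal set with Word2Vec neighbors of low-information words.
w2v = Word2Vec([d.split() for d in docs], vector_size=20, min_count=1, seed=1)
expanded = set(low_info)
for w in low_info:
    expanded.update(n for n, sim in w2v.wv.most_similar(w, topn=2) if sim > 0.5)

# Remove the selected words before building the final embeddings/classifier.
filtered = [" ".join(t for t in d.split() if t not in expanded) for d in docs]
print(filtered)
```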

Development of Multimedia Annotation and Retrieval System using MPEG-7 based Semantic Metadata Model (MPEG-7 기반 의미적 메타데이터 모델을 이용한 멀티미디어 주석 및 검색 시스템의 개발)

  • An, Hyoung-Geun;Koh, Jae-Jin
    • The KIPS Transactions:PartD
    • /
    • v.14D no.6
    • /
    • pp.573-584
    • /
    • 2007
  • As multimedia information has recently been increasing rapidly, various types of multimedia data retrieval are becoming issues of great importance. Efficient multimedia data processing requires semantics-based retrieval techniques that can extract the meaningful content of multimedia data. Existing retrieval methods for multimedia data are annotation-based retrieval, feature-based retrieval, and retrieval integrating annotations and features. These systems demand considerable annotator effort and time, and feature extraction involves complicated computation. In addition, the created data support only static searches that do not change, and user-friendly, semantic search techniques are not supported. This paper develops S-MARS (Semantic Metadata-based Multimedia Annotation and Retrieval System), which can represent and retrieve multimedia data efficiently using MPEG-7. The system provides a graphical user interface for annotating, searching, and browsing multimedia data. It is implemented on the basis of a semantic metadata model for representing multimedia information. The semantic metadata about multimedia data are organized according to a multimedia description schema, defined with XML Schema, that complies with the MPEG-7 standard. In conclusion, the proposed scheme can easily be implemented on any multimedia platform supporting XML technology. It can be used to enable efficient sharing of semantic metadata between systems, and it will contribute to improving retrieval correctness and user satisfaction with the annotation-based multimedia retrieval method.
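
A minimal sketch of the kind of MPEG-7-style semantic description such a system might store; the element names follow the MPEG-7 flavor only loosely, and this fragment is illustrative, not a validated MPEG-7 document:

```python
import xml.etree.ElementTree as ET

# Build a small MPEG-7-style annotation for one video segment.
mpeg7 = ET.Element("Mpeg7")
desc = ET.SubElement(mpeg7, "Description")
content = ET.SubElement(desc, "MultimediaContent")
segment = ET.SubElement(content, "VideoSegment", id="scene_01")
ET.SubElement(segment, "MediaTime").text = "00:01:20-00:02:05"
semantic = ET.SubElement(segment, "Semantic")
ET.SubElement(semantic, "Label").text = "goal scene"
ET.SubElement(semantic, "Agent").text = "player A"

print(ET.tostring(mpeg7, encoding="unicode"))
```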