• Title/Summary/Keyword: contiguous sequential pattern algorithm

Search Result 4, Processing Time 0.02 seconds

Mining Maximal Frequent Contiguous Sequences in Biological Data Sequences

  • Kang, Tae-Ho;Yoo, Jae-Soo;Kim, Hak-Yong;Lee, Byoung-Yup
    • International Journal of Contents
    • /
    • v.3 no.2
    • /
    • pp.18-24
    • /
    • 2007
  • Biological sequences such as DNA and amino acid sequences typically contain a large number of items. They have contiguous sequences that ordinarily consist of more than hundreds of frequent items. In biological sequences analysis(BSA), a frequent contiguous sequence search is one of the most important operations. Many studies have been done for mining sequential patterns efficiently. Most of the existing methods for mining sequential patterns are based on the Apriori algorithm. In particular, the prefixSpan algorithm is one of the most efficient sequential pattern mining schemes based on the Apriori algorithm. However, since the algorithm expands the sequential patterns from frequent patterns with length-1, it is not suitable for biological datasets with long frequent contiguous sequences. In recent years, the MacosVSpan algorithm was proposed based on the idea of the prefixSpan algorithm to significantly reduce its recursive process. However, the algorithm is still inefficient for mining frequent contiguous sequences from long biological data sequences. In this paper, we propose an efficient method to mine maximal frequent contiguous sequences in large biological data sequences by constructing the spanning tree with a fixed length. To verify the superiority of the proposed method, we perform experiments in various environments. The experiments show that the proposed method is much more efficient than MacosVSpan in terms of retrieval performance.

Mining Maximal Frequent Contiguous Sequences in Biological Data Sequences (생물학적 데이터 서열들에서 빈번한 최대길이 연속 서열 마이닝)

  • Kang, Tae-Ho;Yoo, Jae-Soo
    • The KIPS Transactions:PartD
    • /
    • v.15D no.2
    • /
    • pp.155-162
    • /
    • 2008
  • Biological sequences such as DNA sequences and amino acid sequences typically contain a large number of items. They have contiguous sequences that ordinarily consist of hundreds of frequent items. In biological sequences analysis(BSA), a frequent contiguous sequence search is one of the most important operations. Many studies have been done for mining sequential patterns efficiently. Most of the existing methods for mining sequential patterns are based on the Apriori algorithm. In particular, the prefixSpan algorithm is one of the most efficient sequential pattern mining schemes based on the Apriori algorithm. However, since the algorithm expands the sequential patterns from frequent patterns with length-1, it is not suitable for biological dataset with long frequent contiguous sequences. In recent years, the MacosVSpan algorithm was proposed based on the idea of the prefixSpan algorithm to significantly reduce its recursive process. However, the algorithm is still inefficient for mining frequent contiguous sequences from long biological data sequences. In this paper, we propose an efficient method to mine maximal frequent contiguous sequences in large biological data sequences by constructing the spanning tree with the fixed length. To verify the superiority of the proposed method, we perform experiments in various environments. As the result, the experiments show that the proposed method is much more efficient than MacosVSpan in terms of retrieval performance.

A Study on the Inference of Detailed Protocol Structure in Protocol Reverse Engineering (상세한 프로토콜 구조를 추론하는 프로토콜 리버스 엔지니어링 방법에 대한 연구)

  • Chae, Byeong-Min;Moon, Ho-Won;Goo, Young-Hoon;Shim, Kyu-Seok;Lee, Min-Seob;Kim, Myung-Sup
    • KNOM Review
    • /
    • v.22 no.1
    • /
    • pp.42-51
    • /
    • 2019
  • Recently, the amount of internet traffic is increasing due to the increase in speed and capacity of the network environment, and protocol data is increasing due to mobile, IoT, application, and malicious behavior. Most of these private protocols are unknown in structure. For efficient network management and security, analysis of the structure of private protocols must be performed. Many protocol reverse engineering methodologies have been proposed for this purpose, but there are disadvantages to applying them. In this paper, we propose a methodology for inferring a detailed protocol structure based on network trace analysis by hierarchically combining CSP (Contiguous Sequential Pattern) and SP (Sequential Pattern) Algorithm. The proposed methodology is designed and implemented in a way that improves the preceeding study, A2PRE, We describe performance index for comparing methodologies and demonstrate the superiority of the proposed methodology through the example of HTTP, DNS protocol.

Two-Pathway Model for Enhancement of Protocol Reverse Engineering

  • Goo, Young-Hoon;Shim, Kyu-Seok;Baek, Ui-Jun;Kim, Myung-Sup
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.14 no.11
    • /
    • pp.4310-4330
    • /
    • 2020
  • With the continuous emergence of new applications and cyberattacks and their frequent updates, the need for automatic protocol reverse engineering is gaining recognition. Although several methods for automatic protocol reverse engineering have been proposed, each method still faces major limitations in extracting clear specifications and in its universal application. In order to overcome such limitations, we propose an automatic protocol reverse engineering method using a two-pathway model based on a contiguous sequential pattern (CSP) algorithm. By using this model, the method can infer both command-oriented protocols and non-command-oriented protocols clearly and in detail. The proposed method infers all the key elements of the protocol, which are syntax, semantics, and finite state machine (FSM), and extracts clear syntax by defining fine-grained field types and three types of format: field format, message format, and flow format. We evaluated the efficacy of the proposed method over two non-command-oriented protocols and three command-oriented protocols: the former are HTTP and DNS, and the latter are FTP, SMTP, and POP3. The experimental results show that this method can reverse engineer with high coverage and correctness rates, more than 98.5% and 99.1% respectively, and be general for both command-oriented and non-command-oriented protocols.