• Title/Summary/Keyword: Sequence pattern analysis

Search Result 316, Processing Time 0.034 seconds

A Pattern Summary System Using BLAST for Sequence Analysis

  • Choi, Han-Suk;Kim, Dong-Wook;Ryu, Tae-W.
    • Genomics & Informatics
    • /
    • v.4 no.4
    • /
    • pp.173-181
    • /
    • 2006
  • Pattern finding is one of the important tasks in a protein or DNA sequence analysis. Alignment is the widely used technique for finding patterns in sequence analysis. BLAST (Basic Local Alignment Search Tool) is one of the most popularly used tools in bio-informatics to explore available DNA or protein sequence databases. BLAST may generate a huge output for a large sequence data that contains various sequence patterns. However, BLAST does not provide a tool to summarize and analyze the patterns or matched alignments in the BLAST output file. BLAST lacks of general and robust parsing tools to extract the essential information out from its output. This paper presents a pattern summary system which is a powerful and comprehensive tool for discovering pattern structures in huge amount of sequence data in the BLAST. The pattern summary system can identify clusters of patterns, extract the cluster pattern sequences from the subject database of BLAST, and display the clusters graphically to show the distribution of clusters in the subject database.

Sequence Analysis in Women's Work Transition (여성취업이행 경로의 생애과정 씨퀀스(sequence) 분석)

  • 은기수;박수미
    • Korea journal of population studies
    • /
    • v.25 no.2
    • /
    • pp.107-138
    • /
    • 2002
  • In general, women's labor force participation follows a M-curve pattern because women's state of economic activity usually changes by their life course stage. This research attentions that the effect of sequence of life course as well as the effects of‘marriage bar’, or‘maternity leave’is very important in understanding women's chaning economic activity status. First, this research hypothesizes that women's four patterns of job career such as‘continuous pattern’,‘discontinuous pattern’,‘non-economic activity pattern’,‘marriage leave pattern’result a significant difference in social and demographic variables. Second, this research analyzes the effect of ordering and timing of life events on women's work transition. This research investigates labor market dynamics to conceptualize labor market behaviors using longitudinal data and sequence analysis and event history analysis. We find that four patterns of job career vary by age, educational attainment, having a certificate or not, their parents’human capital and health status. And we find that the ordering and timing of‘participation in labor market’and‘marriage’determine the pattern of women's work transition.

Finding Weighted Sequential Patterns over Data Streams via a Gap-based Weighting Approach (발생 간격 기반 가중치 부여 기법을 활용한 데이터 스트림에서 가중치 순차패턴 탐색)

  • Chang, Joong-Hyuk
    • Journal of Intelligence and Information Systems
    • /
    • v.16 no.3
    • /
    • pp.55-75
    • /
    • 2010
  • Sequential pattern mining aims to discover interesting sequential patterns in a sequence database, and it is one of the essential data mining tasks widely used in various application fields such as Web access pattern analysis, customer purchase pattern analysis, and DNA sequence analysis. In general sequential pattern mining, only the generation order of data element in a sequence is considered, so that it can easily find simple sequential patterns, but has a limit to find more interesting sequential patterns being widely used in real world applications. One of the essential research topics to compensate the limit is a topic of weighted sequential pattern mining. In weighted sequential pattern mining, not only the generation order of data element but also its weight is considered to get more interesting sequential patterns. In recent, data has been increasingly taking the form of continuous data streams rather than finite stored data sets in various application fields, the database research community has begun focusing its attention on processing over data streams. The data stream is a massive unbounded sequence of data elements continuously generated at a rapid rate. In data stream processing, each data element should be examined at most once to analyze the data stream, and the memory usage for data stream analysis should be restricted finitely although new data elements are continuously generated in a data stream. Moreover, newly generated data elements should be processed as fast as possible to produce the up-to-date analysis result of a data stream, so that it can be instantly utilized upon request. To satisfy these requirements, data stream processing sacrifices the correctness of its analysis result by allowing some error. Considering the changes in the form of data generated in real world application fields, many researches have been actively performed to find various kinds of knowledge embedded in data streams. They mainly focus on efficient mining of frequent itemsets and sequential patterns over data streams, which have been proven to be useful in conventional data mining for a finite data set. In addition, mining algorithms have also been proposed to efficiently reflect the changes of data streams over time into their mining results. However, they have been targeting on finding naively interesting patterns such as frequent patterns and simple sequential patterns, which are found intuitively, taking no interest in mining novel interesting patterns that express the characteristics of target data streams better. Therefore, it can be a valuable research topic in the field of mining data streams to define novel interesting patterns and develop a mining method finding the novel patterns, which will be effectively used to analyze recent data streams. This paper proposes a gap-based weighting approach for a sequential pattern and amining method of weighted sequential patterns over sequence data streams via the weighting approach. A gap-based weight of a sequential pattern can be computed from the gaps of data elements in the sequential pattern without any pre-defined weight information. That is, in the approach, the gaps of data elements in each sequential pattern as well as their generation orders are used to get the weight of the sequential pattern, therefore it can help to get more interesting and useful sequential patterns. Recently most of computer application fields generate data as a form of data streams rather than a finite data set. Considering the change of data, the proposed method is mainly focus on sequence data streams.

A Pattern Matching Extended Compression Algorithm for DNA Sequences

  • Murugan., A;Punitha., K
    • International Journal of Computer Science & Network Security
    • /
    • v.21 no.8
    • /
    • pp.196-202
    • /
    • 2021
  • DNA sequencing provides fundamental data in genomics, bioinformatics, biology and many other research areas. With the emergent evolution in DNA sequencing technology, a massive amount of genomic data is produced every day, mainly DNA sequences, craving for more storage and bandwidth. Unfortunately, managing, analyzing and specifically storing these large amounts of data become a major scientific challenge for bioinformatics. Those large volumes of data also require a fast transmission, effective storage, superior functionality and provision of quick access to any record. Data storage costs have a considerable proportion of total cost in the formation and analysis of DNA sequences. In particular, there is a need of highly control of disk storage capacity of DNA sequences but the standard compression techniques unsuccessful to compress these sequences. Several specialized techniques were introduced for this purpose. Therefore, to overcome all these above challenges, lossless compression techniques have become necessary. In this paper, it is described a new DNA compression mechanism of pattern matching extended Compression algorithm that read the input sequence as segments and find the matching pattern and store it in a permanent or temporary table based on number of bases. The remaining unmatched sequence is been converted into the binary form and then it is been grouped into binary bits i.e. of seven bits and gain these bits are been converted into an ASCII form. Finally, the proposed algorithm dynamically calculates the compression ratio. Thus the results show that pattern matching extended Compression algorithm outperforms cutting-edge compressors and proves its efficiency in terms of compression ratio regardless of the file size of the data.

Frequent Origin-Destination Sequence Pattern Analysis from Taxi Trajectories (택시 기종점 빈번 순차 패턴 분석)

  • Lee, Tae Young;Jeon, Seung Bae;Jeong, Myeong Hun;Choi, Yun Woong
    • KSCE Journal of Civil and Environmental Engineering Research
    • /
    • v.39 no.3
    • /
    • pp.461-467
    • /
    • 2019
  • Advances in location-aware and IoT (Internet of Things) technology increase the rapid generation of massive movement data. Knowledge discovery from massive movement data helps us to understand the urban flow and traffic management. This paper proposes a method to analyze frequent origin-destination sequence patterns from irregular spatiotemporal taxi pick-up locations. The proposed method starts by conducting cluster analysis and then run a frequent sequence pattern analysis based on identified clusters as a base unit. The experimental data is Seoul taxi trajectory data between 7 a.m. and 9 a.m. during one week. The experimental results present that significant frequent sequence patterns occur within Gangnam. The significant frequent sequence patterns of different regions are identified between Gangnam and Seoul City Hall area. Further, this study uses administrative boundaries as a base unit. The results based on administrative boundaries fails to detect the frequent sequence patterns between different regions. The proposed method can be applied to decrease not only taxis' empty-loaded rate, but also improve urban flow management.

Some Considerations on the Problems of PSA(Pulse Sequence Analysis) as a Partial Discharge Analysis Method (부분방전 해석 방법으로 PSA(Pulse Sequence Analysis)의 문제점에 대한 고찰)

  • Kim, Jeong-Tae;Lee, Ho-Keun
    • Proceedings of the Korean Institute of Electrical and Electronic Material Engineers Conference
    • /
    • 2004.11a
    • /
    • pp.327-330
    • /
    • 2004
  • Because of its effectiveness for the PD(partial discharge) pattern recognition, PSA(Pulse Sequence Analysis) has been considered as a new analytic method instead of conventional PRPDA(Phase Resolved Partial Discharge Analysis). However, PSA has a big problem that can misanalyze patterns in case of data missing resulting from poor sensitivity because it analyses the correlation between sequential pulses, which leads to hesitate to apply it to on-site. Therefore, in this paper, the problems of PSA such as data missing and noise adding cases were investigated. For the purpose, PD data obtained from various defects including noise adding data were used and analysed, The result showed that both cases can cause fatal errors in recognizing PD patterns. In case of the data missing, the error depends on the kinds of defect and the degree of degradation. Also, it could be noticed that the error due to adding noises was larger than that due to some data missing.

  • PDF

Some Considerations on the On-site Applicability of PSA(Pulse Sequence Analysis) as a Partial Discharge Analysis Method (부분방전 해석 방법으로 PSA(Pulse Sequence Analysis)의 현장 적용성에 대한 고찰)

  • Kim, Jeong-Tae;Lee, Ho-Keun
    • Journal of the Korean Institute of Electrical and Electronic Material Engineers
    • /
    • v.18 no.5
    • /
    • pp.484-489
    • /
    • 2005
  • Because of its effectiveness for the PD(Partial Discharge) pattern recognition, PSA(Pulse Sequence Analysis) has been considered as a new analytic method instead of conventional PRPDA(Phase Resolved Partial Discharge Analysis). However, it is generally thought that PSA has some possibility to misjudge patterns in case of data-missing resulting from poor sensitivity because it analyses the correlation between sequential pulses, which leads to hesitate to apply it to on-site. Therefore, in this paper, the problems of PSA such as data-missing and noise-adding cases were investigated. for the purpose, PD data obtained from various defects including noise-adding data were used and analyzed. As a result, it was shown that both cases could cause fatal errors in recognizing PD patterns. In case of the data missing, the error was dependant on the kinds of defect and the degree of degradation Also, it could be noticed that the error due to adding noises was larger than that due to some data missing.

Sign Language Transformation System based on a Morpheme Analysis (형태소분석에 기초한 수화영상변환시스템에 관한 연구)

  • Lee, Yong-Dong;Kim, Hyoung-Geun;Jeong, Woon-Dal
    • The Journal of the Acoustical Society of Korea
    • /
    • v.15 no.6
    • /
    • pp.90-98
    • /
    • 1996
  • In this paper we have proposed the sign language transformation system for deaf based on a morpheme analysis. The proposed system extracts phoneme components and connection informations of the input character sequence by using a morpheme analysis. And then the sign image obtained by component analysis is correctly and automatically generated through the sign image database. For the effective sign language transformation, the language description dictionary which consists of a morpheme analysis part for analysis of input character sequence and sign language description part for reference of sign language pattern is costructed. To avoid the duplicating sign language pattern, the pattern is classified a basic, a compound and a similar sign word. The computer simulation shows the usefulness of the proposed system.

  • PDF

Biogeography and Distribution Pattern of a Korean Wood-eating Cockroach Species, Cryptocercus kyebangensis, Based on Genetic Network Analysis and DNA Sequence Information

  • Park, Yung-Chul;Choe, Jae-Chun
    • Journal of Ecology and Environment
    • /
    • v.30 no.4
    • /
    • pp.331-340
    • /
    • 2007
  • We examined the evolutionary and ecological processes shaping current geographical distributions of a Korean wood-eating cockroach species, Cryptocercus kyebangensis. Our research aims were to understand evolutionary pattern of DNA sequences, to construct genetic network of Cryptocercus kyebangensis local populations and to understand evolutionary and ecological processes shaping their current geographical distribution patterns via DNA sequence information and genetic networks, using sequence data of two genes (ITS-2 and AT region) from local populations of C. kyebangensis. The results suggest that the ITS-2 and AT region are appropriate molecular markers for elucidating C. kyebangensis geographic patterns at the population level. The MSN-A based on the ITS-2 showed two possible routes, the Hwaak-san and Myeongji-san route and the Seorak-san and Gyebang-san route, for migration of ancestral C. kyebangensis into South Korea. The MSNs (MSN-A and -B) elucidate migration routes well within South Korea, especially the route of Group I and Group II.

An Analysis System for Whole Genomic Sequence Using String B-Tree (스트링 B-트리를 이용한 게놈 서열 분석 시스템)

  • Choe, Jeong-Hyeon;Jo, Hwan-Gyu
    • The KIPS Transactions:PartA
    • /
    • v.8A no.4
    • /
    • pp.509-516
    • /
    • 2001
  • As results of many genome projects, genomic sequences of many organisms are revealed. Various methods such as global alignment, local alignment are used to analyze the sequences of the organisms, and k -mer analysis is one of the methods for analyzing the genomic sequences. The k -mer analysis explores the frequencies of all k-mers or the symmetry of them where the k -mer is the sequenced base with the length of k. However, existing on-memory algorithms are not applicable to the k -mer analysis because a whole genomic sequence is usually a large text. Therefore, efficient data structures and algorithms are needed. String B-tree is a good data structure that supports external memory and fits into pattern matching. In this paper, we improve the string B-tree in order to efficiently apply the data structure to k -mer analysis, and the results of k -mer analysis for C. elegans and other 30 genomic sequences are shown. We present a visualization system which enables users to investigate the distribution and symmetry of the frequencies of all k -mers using CGR (Chaotic Game Representation). We also describe the method to find the signature which is the part of the sequence that is similar to the whole genomic sequence.

  • PDF