• Title/Summary/Keyword: N-GRAM


Performance Analysis of n-Gram Indexing Methods for Korean text Retrieval (한글 문서 검색에서 n-Gram 색인방법의 성능 분석)

  • 이준규;심수정;박혁로
    • Proceedings of the IEEK Conference
    • /
    • 2003.11b
    • /
    • pp.145-148
    • /
    • 2003
  • The agglutinative nature of the Korean language makes automatic indexing of Korean quite different from that of Indo-European languages. Indexing compound nouns is especially problematic because of the exponential number of possible analyses and the existence of unknown words. To deal with this compound noun indexing problem, we propose a new indexing method that combines the merits of morpheme-based and n-gram-based indexing. Our experiments also show that n-gram indexing performs best with 1.75-grams, a setting not considered in previous research.


Studies on the New Antimetabolites Produced by Microorganisms (미생물이 생산하는 새로운 대사길항물질에 관한 연구)

  • 박부길
    • Microbiology and Biotechnology Letters
    • /
    • v.6 no.4
    • /
    • pp.187-196
    • /
    • 1978
  • Antimetabolite N-2292, an antagonist of L-aspartic acid and L-glutamic acid, was isolated from the fermentation broth of a Streptomyces strain. A taxonomic study based on cultural characteristics and carbon utilization identified the producing strain as a species related to Streptomyces albulus. N-2292 was isolated as an amorphous white powder with a melting point of 185 °C. Its physicochemical characteristics indicate a peptide-like substance. It was active against Gram-positive and Gram-negative bacteria but inactive against yeasts and molds. Its activity was reversed by L-Asp and L-Glu on synthetic medium.


Protein Sequence Search based on N-gram Indexing

  • Hwang, Mi-Nyeong;Kim, Jin-Suk
    • Bioinformatics and Biosystems
    • /
    • v.1 no.1
    • /
    • pp.46-50
    • /
    • 2006
  • With advances in experimental techniques in molecular biology, genomic and protein sequence databases are growing exponentially in size, and mean sequence lengths are also increasing. As these databases grow, it becomes difficult to search them for sequences with significant homology to a query sequence. In this paper, we present an N-gram indexing method for retrieving similar sequences quickly and precisely. The method regards a protein sequence as text written in a language of 20 amino acid codes and adopts fixed-length N-gram tokens as its indexing scheme for sequence strings. Once such tokens are indexed for all the sequences in the database, the sequences can be searched with information retrieval algorithms. Using this method, we have developed a protein sequence search system named ProSeS (PROtein Sequence Search). ProSeS is a protein sequence analysis system that provides overall analysis results such as similar sequences with significant homology, predicted subcellular locations of the query sequence, and major keywords extracted from the annotations of similar sequences. We show experimentally that the N-gram indexing approach reduces retrieval time significantly and is as accurate as the popular search tool BLAST.

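The N-gram indexing idea described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the ProSeS implementation: the function names, the choice of n = 3, the shared-n-gram count used for ranking, and the toy sequences are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def ngrams(seq, n=3):
    """Return all overlapping fixed-length n-grams of a sequence string."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def build_index(sequences, n=3):
    """Build an inverted index: n-gram -> set of sequence ids containing it."""
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for gram in ngrams(seq, n):
            index[gram].add(seq_id)
    return index


def search(index, query, n=3):
    """Rank sequence ids by how many distinct query n-grams they share."""
    hits = Counter()
    for gram in set(ngrams(query, n)):
        for seq_id in index.get(gram, ()):
            hits[seq_id] += 1
    return hits.most_common()


# Toy database; the ids and sequences are made up for illustration.
db = {
    "seqA": "MKTAYIAKQR",
    "seqB": "MKTAYLLKQR",
    "seqC": "GGGSSGGGSS",
}
index = build_index(db)
results = search(index, "KTAYIAK")
print(results)
```

A real system would likely replace the raw shared-token count with a weighted information retrieval score, but the inverted index over fixed-length tokens is the part the abstract describes.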

Etiology of Bacteremia in Children With Hemato-Oncologic Diseases From 2013 to 2023: A Single Center Study

  • Sun Woo Park;Ji Young Park;Hyoung Soo Choi;Hyunju Lee
    • Pediatric Infection and Vaccine
    • /
    • v.31 no.1
    • /
    • pp.46-54
    • /
    • 2024
  • Purpose: This study aimed to identify the pathogens of bloodstream infection in children with underlying hemato-oncologic diseases, analyze susceptibility patterns, compare temporal trends with those of previous studies, and assess empirical antimicrobial therapy. Methods: A retrospective review of bacteremia in children with hemato-oncologic diseases was conducted at Seoul National University Bundang Hospital from January 2013 to July 2023. Results: Overall, 98 episodes of bacteremia were observed in 74 patients. Among the pathogens isolated, 57.1% (n=56) were Gram-positive bacteria, 38.8% (n=38) were Gram-negative bacteria, and 4.1% (n=4) were Candida spp. The most common Gram-positive bacteria were coagulase-negative staphylococci (n=21, 21.4%) and Staphylococcus aureus (n=14, 14.3%), whereas the most common Gram-negative bacteria were Klebsiella pneumoniae (n=16, 16.3%) and Escherichia coli (n=10, 10.2%). The susceptibility of Gram-positive bacteria to penicillin, oxacillin, and vancomycin was 11.5%, 32.7%, and 94.2%, respectively, and the susceptibility of Gram-negative bacteria to cefotaxime, piperacillin/tazobactam, imipenem, gentamicin, and amikacin was 68.6%, 80%, 97.1%, 82.9%, and 91.4%, respectively. Methicillin-resistant S. aureus was detected in 1 strain, and among Gram-negative strains, extended-spectrum β-lactamase accounted for 28.9% (12/38). When antibiotic susceptibility was compared with the empirical antibiotics used, the mismatch rate was 25.5% (n=25). The mortality rate within 30 days of bacteremia was 7.1% (n=7). Conclusions: Empirical antibiotic therapy for bacteremia in children with hemato-oncologic diseases should be based on the local antibiogram of each institution, and continuous monitoring is necessary.

Weighted N-Gram Indexing for Image Search Engine (영상검색엔진을 위한 가중치 N-Gram색인 방법)

  • 이상열;정성호;황병곤
    • Proceedings of the Korea Society of Information Technology Applications Conference
    • /
    • 2002.11a
    • /
    • pp.412-416
    • /
    • 2002
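  • Multimedia retrieval systems currently support only text-based search, because content-based retrieval techniques do not yet perform well enough for practical use. We propose a technique that extracts text for use as image index information by selecting, from the text in an HTML document, the captions placed under images and the text attached to image links. The N-gram indexing method is used to extract the text, and words with high query intent are weighted to improve retrieval efficiency.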


A post processing of continuous speech recognition using N-gram words and sentence patterns (문형정보와 N-gram 단어정보를 이용한 연속음성인식 후처리)

  • 엄한용;황도삼
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2000.04b
    • /
    • pp.324-326
    • /
    • 2000
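  • This paper presents a post-processing method for a continuous speech recognition system in the restricted domain of flight reservation. The proposed method extracts sentence-pattern information from 200 sentences of flight-reservation text data and classifies the sentences by pattern. The recognized sentence is compared with the classified patterns to infer the most similar one. The morphemes that can occur in the inferred pattern are then corrected and supplemented using a database of N-gram information between morphemes, and the result is output as the final sentence.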


Weighted N-Gram Indexing for Image Search Engine (영상검색엔진을 위한 가중치 N-Gram색인 방법)

  • 이상열;정성호;황병곤
    • Proceedings of the Korea Society for Industrial Systems Conference
    • /
    • 2002.11a
    • /
    • pp.412-416
    • /
    • 2002
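  • Multimedia retrieval systems currently support only text-based search, because content-based retrieval techniques do not yet perform well enough for practical use. We propose a technique that extracts text for use as image index information by selecting, from the text in an HTML document, the captions placed under images and the text attached to image links. The N-gram indexing method is used to extract the text, and words with high query intent are weighted to improve retrieval efficiency.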


A Numerical Coding System (MCRCODE-N) for Identification of Glucose Nonfermenting Gram-Negative Bacilli (숫자표기에 의한 포도당 비발효균의 동정시안(MCRCODE-N))

  • Hong, Seok-Il;Kim, Chung-Suk
    • Journal of Yeungnam Medical Science
    • /
    • v.2 no.1
    • /
    • pp.183-190
    • /
    • 1985
  • Glucose nonfermenting gram-negative bacilli account for about 10% of all gram-negative bacilli isolated from clinical material. Rapid and correct identification of these organisms is therefore important for better management of infectious disease. Many conventional systems exist for identifying glucose nonfermenting gram-negative bacilli, but most of them have problems and difficulties. Commercial kit systems exist, but they are too expensive for daily use in Korea. Based on 12 selected tests, we propose a new code system, MCRCODE-N, for rapid and inexpensive identification of glucose nonfermenting gram-negative bacilli. The 12 selected tests are oxidase, glucose oxidation, motility, urease, DNase, arginine dihydrolase, nitrate reduction, gelatin liquefaction, esculin hydrolysis, mannitol oxidation, maltose oxidation, and lactose oxidation. The 12 tests are divided into 4 groups of 3 tests each, and the result of each group is expressed as a number as follows. A positive test is given a specific number (1st test = 1, 2nd test = 2, 3rd test = 4), while any negative result is 0. The 3 numbers of each group are added to give a single digit. The resulting four-digit number is looked up in the code book of the MCRCODE-N system or in the computer-based MCRCODE system (Apple II) created by the authors. The MCRCODE-N system is suitable for use in Korea, and we propose it for clinical use.

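The digit-coding scheme described in the abstract above (4 groups of 3 tests, weights 1, 2, and 4, one digit per group) can be illustrated with a short sketch. The grouping of tests in the order they are listed is an assumption, since the abstract does not state which tests belong to which group, and the function name `mcrcode` and the example isolate are hypothetical.

```python
# The 12 tests in the order the abstract lists them. Grouping them in
# consecutive triples is an ASSUMPTION; the paper may group them differently.
TESTS = [
    "oxidase", "glucose oxidation", "motility",
    "urease", "DNase", "arginine dihydrolase",
    "nitrate reduction", "gelatin liquefaction", "esculin hydrolysis",
    "mannitol oxidation", "maltose oxidation", "lactose oxidation",
]

# Weight of the 1st, 2nd, and 3rd test within each group (from the abstract).
WEIGHTS = (1, 2, 4)


def mcrcode(positive_tests):
    """Return the four-digit code for a set of positive test names."""
    digits = []
    for start in range(0, len(TESTS), 3):
        group = TESTS[start:start + 3]
        # A positive test contributes its weight; a negative test contributes 0.
        digit = sum(w for test, w in zip(group, WEIGHTS) if test in positive_tests)
        digits.append(str(digit))
    return "".join(digits)


# Hypothetical isolate, for illustration only.
code = mcrcode({"oxidase", "motility", "nitrate reduction", "maltose oxidation"})
print(code)
```

Because the weights are powers of two, each digit from 0 to 7 corresponds to exactly one combination of results within its group, so the four-digit number encodes all 12 results without loss.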

N-gram Feature Selection for Text Classification Based on Symmetrical Conditional Probability and TF-IDF (대칭 조건부 확률과 TF-IDF 기반 텍스트 분류를 위한 N-gram 특질 선택)

  • Choi, Woo-Sik;Kim, Seoung Bum
    • Journal of Korean Institute of Industrial Engineers
    • /
    • v.41 no.4
    • /
    • pp.381-388
    • /
    • 2015
  • The rapid growth of the World Wide Web and online information services has generated a huge number of text documents and made them accessible. Selecting important keywords is an essential step in analyzing such texts. In this paper, we propose a feature selection method that combines term frequency-inverse document frequency (TF-IDF) with symmetrical conditional probability. The proposed method can identify N-gram features, that is, sequential multiword units. The effectiveness of the proposed method is demonstrated on real text data from the machine learning repository of the University of California, Irvine.
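The two ingredients named in the abstract above can be sketched for bigrams as follows. The abstract does not say how the two scores are combined, so the product used at the end is purely a hypothetical placeholder. The sketch uses the commonly cited bigram form of symmetrical conditional probability, SCP(x, y) = p(x, y)^2 / (p(x) p(y)), and assumes all probabilities are estimated with the same corpus size, so the normalizer cancels and only counts remain. The function names and toy corpus are made up for illustration.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Return the adjacent word pairs of a token list."""
    return list(zip(tokens, tokens[1:]))


def scp_scores(docs):
    """SCP(x, y) = p(x, y)^2 / (p(x) * p(y)) for every bigram in the corpus.

    With one shared normalizer N for all probabilities (an assumption),
    this reduces to count(x, y)^2 / (count(x) * count(y)).
    """
    unigram = Counter(tok for doc in docs for tok in doc)
    bigram = Counter(bg for doc in docs for bg in bigrams(doc))
    return {(x, y): c * c / (unigram[x] * unigram[y])
            for (x, y), c in bigram.items()}


def tfidf_scores(docs):
    """Per-document bigram TF-IDF using tf * log(N / df)."""
    n_docs = len(docs)
    df = Counter(bg for doc in docs for bg in set(bigrams(doc)))
    return [{bg: tf * math.log(n_docs / df[bg])
             for bg, tf in Counter(bigrams(doc)).items()}
            for doc in docs]


# Toy corpus, made up for illustration.
docs = [
    ["machine", "learning", "model"],
    ["machine", "learning", "data"],
    ["text", "data", "model"],
]
scp = scp_scores(docs)
tfidf = tfidf_scores(docs)

# HYPOTHETICAL combination: the abstract does not specify the actual rule.
combined = [{bg: scp[bg] * score for bg, score in doc.items()} for doc in tfidf]
print(scp[("machine", "learning")], tfidf[0][("learning", "model")])
```

A high SCP marks word pairs that nearly always occur together, while TF-IDF favors pairs that are frequent in one document but rare across the collection, which is why a method might plausibly use both signals.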

N-gram Based Robust Spoken Document Retrievals for Phoneme Recognition Errors (음소인식 오류에 강인한 N-gram 기반 음성 문서 검색)

  • Lee, Su-Jang;Park, Kyung-Mi;Oh, Yung-Hwan
    • MALSORI
    • /
    • no.67
    • /
    • pp.149-166
    • /
    • 2008
  • In spoken document retrieval (SDR), subword (typically phoneme) indexing terms are used to avoid the out-of-vocabulary (OOV) problem. This makes the indexing and retrieval process independent of any vocabulary, and only a small corpus is required to train the acoustic model. However, the subword indexing approach has a major drawback: it shows higher word error rates than a large vocabulary continuous speech recognition (LVCSR) system. In this paper, we propose a probabilistic slot detection and n-gram based string matching method for phone-based spoken document retrieval to overcome the high error rates of the phone recognizer. Experimental results show a 9.25% relative improvement in mean average precision (mAP) with a 1.7-fold speedup compared with the baseline system.
