• Title/Abstract/Keywords: Annotation tool

72 search results, processing time 0.023 s

Robust Syntactic Annotation of Corpora and Memory-Based Parsing

  • Hinrichs, Erhard W.
    • Korean Society for Language and Information: Conference Proceedings / Korean Society for Language and Information, 2002, Language, Information, and Computation: Proceedings of the 16th Pacific Asia Conference / pp.1-1 / 2002
  • This talk provides an overview of current work in my research group on the syntactic annotation of the Tübingen corpus of spoken German and of the German Reference Corpus (Deutsches Referenzkorpus: DEREKO) of written texts. Morpho-syntactic and syntactic annotation, as well as annotation of function-argument structure, is performed automatically for these corpora by a hybrid architecture that combines robust symbolic parsing using finite-state methods ("chunk parsing" in the sense of Abney) with memory-based parsing (in the sense of Daelemans). The resulting robust annotations can be used by theoretical linguists, who are interested in large-scale empirical data, and by computational linguists, who need training material for a wide range of language technology applications. To aid retrieval of annotated trees from the treebank, a query tool, VIQTORYA, with a graphical user interface and a logic-based query language has been developed. VIQTORYA allows users to query the treebanks for linguistic structures at the word level, at the level of individual phrases, and at the clausal level.

CaGe: A Web-Based Cancer Gene Annotation System for Cancer Genomics

  • Park, Young-Kyu;Kang, Tae-Wook;Baek, Su-Jin;Kim, Kwon-Il;Kim, Seon-Young;Lee, Do-Heon;Kim, Yong-Sung
    • Genomics & Informatics / Vol. 10, No. 1 / pp.33-39 / 2012
  • High-throughput genomic technologies (HGTs), including next-generation DNA sequencing (NGS), microarray, and serial analysis of gene expression (SAGE), have become effective experimental tools for cancer genomics to identify cancer-associated somatic genomic alterations and genes. The main hurdle in cancer genomics is to identify the real causative mutations or genes out of many candidates from an HGT-based cancer genomic analysis. One useful approach is to refer to known cancer genes and associated information. The list of known cancer genes can be used to determine candidates of cancer driver mutations, while cancer gene-related information, including gene expression, protein-protein interaction, and pathways, can be useful for scoring novel candidates. Some cancer gene or mutation databases exist for this purpose, but few specialized tools exist for an automated analysis of a long gene list from an HGT-based cancer genomic analysis. This report presents a new web-accessible bioinformatic tool, called CaGe, a cancer genome annotation system for the assessment of candidates of cancer genes from HGT-based cancer genomics. The tool provides users with information on cancer-related genes, mutations, pathways, and associated annotations through annotation and browsing functions. With this tool, researchers can classify their candidate genes from cancer genome studies into either previously reported or novel categories of cancer genes and gain insight into underlying carcinogenic mechanisms through a pathway analysis. We show the usefulness of CaGe by assessing its performance in annotating somatic mutations from a published small cell lung cancer study.

Development of MPEG-7 Description-based Annotation Tool for Production of Semantic Multimedia Metadata

  • 안형근;고재진
    • The KIPS Transactions: Part D / Vol. 14D, No. 1 / pp.35-44 / 2007
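  • The recent rapid growth in the volume of multimedia data has raised a new challenge: retrieving the desired data quickly and accurately. The most important foundation for such efficient retrieval is an appropriate representation of multimedia data. For this reason, the international standard MPEG-7 addresses the standardization of multimedia data representation. This paper proposes a new approach to metadata generation. The user decomposes the given multimedia content into small units and can easily annotate the decomposed units with basic information such as time and location, as well as classification information such as events and relations that follows the MPEG-7 standard. The purpose of this annotation is to generate semantic descriptions automatically; a semantic description is a graph whose nodes are events and whose links are the relations between them. Finally, based on the proposed technique, we implemented an annotation tool for semantic descriptions (SMAT) and evaluated its performance through experiments. In conclusion, the proposed tool has two important properties: reusability and extensibility.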

Development of Python-based Annotation Tool Program for Constructing Object Recognition Deep-Learning Model

  • 임송원;박구만
    • The Korean Institute of Broadcast and Media Engineers: Conference Proceedings / The Korean Institute of Broadcast and Media Engineers, 2019 Fall Conference / pp.162-164 / 2019
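  • In this paper, we implemented a GUI program that lets users build effective training data with a variety of functions during the labeling process required to create an object recognition deep-learning model. The program's interface uses Tkinter, a Python-based GUI module, and provides a crawling function that collects image data in real time and a function that performs annotation automatically by recognizing image data with a pretrained RetinaNet. In addition, an augmentation function applies various effects, noise, and transformations to the collected image data, so that users can carry out the data preprocessing steps for model training within a single GUI program. The program is also designed so that users can convert a model they have trained themselves into an inference model and load it into the program.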

Annotation Guidelines for Korean Sentiment Analysis and Annotation Tool

  • 하은주;오진영;차정원
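    • Korean Institute of Information Scientists and Engineers, Language Engineering Research Group: Conference Proceedings (Hangul and Korean Information Processing) / 2018, 30th Conference on Hangul and Korean Information Processing / pp.84-87 / 2018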
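  • Research on Korean sentiment analysis is being actively pursued. However, there has been little discussion of how training and evaluation corpora should be represented. This paper defines Korean sentiment analysis and presents guidelines for corpus construction. We also built a corpus following the tagging guidelines and implemented a semi-automatic tagging tool for Korean sentiment analysis.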

ORF Miner: a Web-based ORF Search Tool

  • Park, Sin-Gi;Kim, Ki-Bong
    • Genomics & Informatics / Vol. 7, No. 4 / pp.217-219 / 2009
  • The primary clue for locating protein-coding regions is the open reading frame (ORF), and determining ORFs is the first step toward gene prediction, especially for prokaryotes. To this end, we have developed a web-based ORF search tool called ORF Miner. ORF Miner is a graphical analysis utility that determines all possible open reading frames of a selectable minimum size in an input sequence. The tool identifies all open reading frames using alternative genetic codes as well as the standard one and reports a list of ORFs with their deduced amino acid sequences. ORF Miner can be employed for sequence annotation and gives a crucial clue for determining actual protein-coding regions.
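
The ORF search described in this abstract can be sketched roughly as follows. This is a minimal illustration and not ORF Miner's actual implementation: the function names (`reverse_complement`, `find_orfs`), the `min_len` parameter, the ATG-to-stop definition of an ORF, and the six-frame scan are all assumptions, and only the stop codons of the standard genetic code are used. An alternative genetic code could be modeled by passing a different stop-codon set.

```python
# Hypothetical sketch: names and ORF definition are illustrative assumptions,
# not taken from ORF Miner.
STANDARD_STOPS = frozenset({"TAA", "TAG", "TGA"})  # standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=30, stops=STANDARD_STOPS):
    """Scan all six reading frames for ATG..stop ORFs.

    Returns tuples (strand, frame, start, end, orf). start and end are
    0-based offsets on the scanned strand; end is exclusive and the ORF
    includes its stop codon. ORFs shorter than min_len bases are skipped.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # offset of the first ATG of the currently open ORF
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in stops:
                    if i + 3 - start >= min_len:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs
```

Calling `find_orfs(sequence, min_len=90)` would list every candidate ORF of at least 90 bases; translating each ORF into its deduced amino acid sequence is left out to keep the sketch short.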

The design of representation tool for constructing shared knowledge in CSCL

  • 신윤희;김동식
    • The Journal of Korean Association of Computer Education / Vol. 19, No. 2 / pp.73-85 / 2016
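  • When people with diverse perspectives hold a discussion in a shared space using a computer-supported collaborative learning tool, it is hard to tell which part of the task content a given post refers to, and sharing knowledge and opinions is difficult. In this study, we analyzed, through a literature review, the factors that hinder the construction of shared knowledge in computer-supported collaborative learning, and designed a collaborative representation tool based on the principles derived. The designed tool was iteratively refined by collecting diverse opinions from instructors, designers, and learners through a checklist based on evaluation criteria and focus group interviews (FGI). We expect the final tool to minimize obstacles to sharing knowledge and opinions among learners in computer-supported collaborative learning settings that involve complex tasks, thereby facilitating negotiation and contributing to higher-order solutions.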

Comparative Evaluation of Intron Prediction Methods and Detection of Plant Genome Annotation Using Intron Length Distributions

  • Yang, Long;Cho, Hwan-Gue
    • Genomics & Informatics / Vol. 10, No. 1 / pp.58-64 / 2012
  • Intron prediction is an important problem in genome annotation, which is constantly updated. Using the genomes of two model plants (rice and Arabidopsis), we compared two well-known intron prediction tools: the Blast-Like Alignment Tool (BLAT) and Sim4cc. The results showed that each tool had its own advantages and disadvantages. BLAT predicted more than 99% of all genomic introns with a small number of false-positive introns. Sim4cc was successful at finding the correct introns, with a false-negative rate of 1.02% to 4.85%, but it needed a longer run time than BLAT. Further, we evaluated the intron information of 10 complete plant genomes. Because introns are non-coding sequences, their lengths are not constrained by the triplet codon frame, so intron lengths fall into three phases: a multiple of three bases (3n), a multiple of three bases plus one (3n + 1), and a multiple of three bases plus two (3n + 2). It has been widely accepted that the percentages of 3n, 3n + 1, and 3n + 2 introns are quite similar within a genome. Our study showed that in 80% (8/10) of the species, the three phases occurred in similar numbers. The percentage of 3n introns in Ostreococcus lucimarinus was excessive (47.7%), while in Ostreococcus tauri it was deficient (29.1%). This discrepancy could have been the result of errors in intron prediction. We suggest that a three-phase evaluation is a fast and effective method of detecting intron annotation problems.
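
The three-phase evaluation described in this abstract amounts to counting intron lengths modulo 3 and checking how far each share departs from one third. The sketch below is only an illustration under stated assumptions: the function names (`phase_fractions`, `is_skewed`) are hypothetical, and the `tolerance` of 0.04 is an arbitrary threshold chosen for the example, not a value from the paper.

```python
# Hypothetical sketch: function names and the tolerance value are
# illustrative assumptions, not the authors' method.
from collections import Counter


def phase_fractions(intron_lengths):
    """Return the share of introns whose length is 3n, 3n+1, or 3n+2."""
    total = len(intron_lengths)
    if total == 0:
        raise ValueError("at least one intron length is required")
    counts = Counter(length % 3 for length in intron_lengths)
    return {
        "3n": counts[0] / total,
        "3n+1": counts[1] / total,
        "3n+2": counts[2] / total,
    }


def is_skewed(fractions, tolerance=0.04):
    """Flag a genome if any phase share departs from 1/3 by more than tolerance."""
    return any(abs(share - 1 / 3) > tolerance for share in fractions.values())
```

Under this assumed threshold, both 3n shares reported in the abstract, 47.7% and 29.1%, would be flagged, because each lies more than 0.04 from one third.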

Lessons from Developing an Annotated Corpus of Patient Histories

  • Rost, Thomas Brox;Huseth, Ola;Nytro, Oystein;Grimsmo, Anders
    • Journal of Computing Science and Engineering / Vol. 2, No. 2 / pp.162-179 / 2008
  • We have developed a tool for annotation of electronic health record (EHR) data. Currently we are in the process of manually annotating a corpus of Norwegian general practitioners' EHRs with mainly linguistic information. The purpose of this project is to attain a linguistically annotated corpus of patient histories from general practice. This corpus will be put to future use in medical language processing and information extraction applications. The paper outlines some of our practical experiences from developing such a corpus and, in particular, the effects of semi-automated annotation. We have also done some preliminary experiments with part-of-speech tagging based on our corpus. The results indicated that relevant training data from the clinical domain give better results for the tagging task in this domain than training the tagger on a corpus from a more general domain. We are planning to expand the corpus annotations with medical information at a later stage.

Comparative genome characterization of Leptospira interrogans from mild and severe leptospirosis patients

  • Anuntakarun, Songtham;Sawaswong, Vorthon;Jitvaropas, Rungrat;Praianantathavorn, Kesmanee;Poomipak, Witthaya;Suputtamongkol, Yupin;Chirathaworn, Chintana;Payungporn, Sunchai
    • Genomics & Informatics / Vol. 19, No. 3 / pp.31.1-31.9 / 2021
  • Leptospirosis is a zoonotic disease caused by spirochetes from the genus Leptospira. In Thailand, Leptospira interrogans is a major cause of leptospirosis. Leptospirosis patients present with a wide range of clinical manifestations, from asymptomatic or mild infections to severe illness involving organ failure. To better understand the difference between Leptospira isolates causing mild and severe leptospirosis, Illumina sequencing was used to sequence genomic DNA in both serotypes. DNA of Leptospira isolated from two patients, one with mild and another with severe symptoms, was included in this study. Adapters were removed from the paired-end reads, which were trimmed at a Q30 score using Trimmomatic. Trimmed reads were assembled into contigs and scaffolds using SPAdes. Cross-contamination of scaffolds was evaluated with ContEst16S. The Prokka tool for bacterial annotation was used to annotate sequences from both Leptospira isolates. Predicted amino acid sequences from Prokka were searched against the EggNOG and DAVID gene ontology databases to characterize gene ontology. In addition, sequences from the mild and severe isolates that passed the criterion of e-value < 10e-5 in a BLASTP search against the virulence factor database were compared using a Venn diagram. From this study, we found 13 and 12 genes that were unique to the isolates from the mild and severe patients, respectively. The 12 genes in the severe isolate might be virulence factor genes that affect disease severity. However, these genes should be validated in further studies.