• Title/Summary/Keyword: sentence alignment (문장 정렬)

Search Result 36, Processing Time 0.027 seconds

Integrated Sentence Preprocessing System for Web Indexing (웹 인덱싱을 위한 통합 전처리 시스템의 개발)

  • Shim, Jun-Hyuk;Cha, Jong-Won;Lee, Geun-Bae
    • Annual Conference on Human and Language Technology / 2000.10d / pp.216-223 / 2000
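  • Unlike ordinary documents, web documents are written in free form and contain a great deal of unnecessary content such as tags and code, which makes them unsuitable for direct use in language processing. This paper introduces the architecture and preprocessing methods of Web Tagger, an integrated preprocessing system that automatically collects the web documents to be indexed and then produces and manages them as sentence-aligned documents. Web Tagger extracts standardized information from web documents through document cleaning, sentence segmentation, and word spacing, and it automatically generates and manages an XML-format raw-text corpus suited to the purpose of the application system, including morphological analyzers. A variety of preprocessing techniques, such as 'regular expressions (Regexp)', 'heuristics', 'part-of-speech index lookup', and 'rules learned with C4.5', contribute to improving morphological analysis accuracy and ensuring system stability.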
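The cleaning and sentence-segmentation steps described above can be illustrated with a minimal sketch. It is not Web Tagger's implementation: the function names, the regular expressions, and the `<doc>`/`<s>` XML layout are all assumptions made for illustration, and the word-spacing step and the C4.5-learned rules are not covered.

```python
import html
import re
from xml.sax.saxutils import escape

# Hypothetical patterns; the paper does not publish its actual rules.
SCRIPT_RE = re.compile(r"<(script|style)\b.*?</\1>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")              # any remaining markup tag
SENT_END_RE = re.compile(r"(?<=[.!?])\s+")   # whitespace after sentence-final punctuation


def clean_document(raw):
    """Remove script/style blocks and tags, decode entities, collapse whitespace."""
    text = SCRIPT_RE.sub(" ", raw)
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Split cleaned text into sentences at ., ! or ? followed by whitespace."""
    return [s for s in SENT_END_RE.split(text) if s]


def to_xml(sentences):
    """Wrap each sentence in an <s> element (assumed layout, one sentence per line)."""
    lines = ["<doc>"]
    lines += ['  <s id="%d">%s</s>' % (i, escape(s)) for i, s in enumerate(sentences, 1)]
    lines.append("</doc>")
    return "\n".join(lines)


raw_page = "<html><script>var x = 1;</script><p>웹 문서를 수집한다. 문장을 분할한다!</p></html>"
print(to_xml(split_sentences(clean_document(raw_page))))
```

A purely punctuation-based splitter like this one will mis-split abbreviations and decimal numbers, which is presumably one reason the paper combines regular expressions with heuristics and learned rules.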


Searching Similar Example Sentences for the Computer-Aided Translation System (번역지원 시스템을 위한 유사 예문 검색)

  • Kim Dong-Joo;Kim Han-Woo
    • KSCI Review / v.14 no.1 / pp.197-204 / 2006
  • This paper proposes a similar-sentence search algorithm for computer-aided translation. The proposed algorithm, which is based on the Needleman-Wunsch algorithm, measures the similarity between the input sentence and the example sentences by combining the surface, lemma, and part-of-speech information of words as multi-layered information. It also aligns the input sentence with each example sentence. In experiments on example sentences from the electricity and communication domain, the proposed algorithm achieved very high accuracy.


Automatically Extracting Unknown Translations Using Phrase Alignment (정렬기법을 이용한 미등록 대역어의 자동 추출)

  • Kim, Jae-Hoon;Yang, Sung-Il
    • The KIPS Transactions:PartB / v.14B no.3 s.113 / pp.231-240 / 2007
  • In this paper, we propose an automatic extraction model for unknown translations and implement an unknown-translation extraction system based on it. The proposed model, a phrase-alignment model, incorporates three component models: a phrase-boundary model, a language model, and a translation model. The system built on this model consists of three parts: construction of parallel corpora, alignment of Korean and English words, and extraction of unknown translations. To evaluate the system, we built a reference corpus for unknown-translation extraction comprising 2,220 parallel sentences that include about 1,500 unknown translations. Several experiments show that the proposed model is very useful for extracting unknown translations. Future work should address objective evaluation, the construction of high-quality parallel corpora, and further improvement of extraction performance.

Analysis of Korean Language Parsing System and Speed Improvement of Machine Learning using Feature Module (한국어 의존 관계 분석과 자질 집합 분할을 이용한 기계학습의 성능 개선)

  • Kim, Seong-Jin;Ock, Cheol-Young
    • Journal of the Institute of Electronics and Information Engineers / v.51 no.8 / pp.66-74 / 2014
  • Recently, many software engineers and linguists have studied Korean parsing systems. Such systems mainly rely on machine learning or on the symbol-processing paradigm. However, parsers based on machine learning take a long time to train because the Korean sentence data is very large, and their recognition rate is limited because the data itself contains errors. In this study, we design a system using feature modules that reduces training time, and we analyze the recognition rate as a function of the number of training sentences and the number of training repetitions. The designed system uses separated modules and a sorted table for binary search. We use 36,090 refined sentences extracted from the Sejong Corpus. The training time is reduced by about three hours, and the recognition rate is highest, at 84.54%, when 10,000 sentences are trained 50 times. When all 32,481 training sentences are trained 10 times, the recognition rate is 82.99%. These results suggest that it is more efficient to use refined data and to repeat training until the system reaches a steady state.
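The abstract mentions a sorted table for binary search but does not describe its layout. As a rough illustration only, the sketch below stores hypothetical feature strings and weights in two parallel sorted lists and looks a feature up with the standard-library `bisect` module; the feature names and weights are invented for the example.

```python
from bisect import bisect_left


def build_table(weights):
    """Turn a {feature: weight} dict into parallel lists sorted by feature string."""
    items = sorted(weights.items())
    keys = [k for k, _ in items]
    values = [v for _, v in items]
    return keys, values


def lookup(keys, values, feature, default=0.0):
    """Binary-search the sorted keys; return the weight or a default if absent."""
    i = bisect_left(keys, feature)
    if i < len(keys) and keys[i] == feature:
        return values[i]
    return default


# Hypothetical dependency-parsing features, not taken from the paper.
keys, values = build_table({"pos=NNG|dist=1": 0.8, "pos=VV|dist=2": -0.3})
print(lookup(keys, values, "pos=VV|dist=2"))
print(lookup(keys, values, "pos=MAG|dist=1"))
```

Each lookup costs O(log n) comparisons, which is the usual motivation for keeping such a table sorted.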

Searching Similar Example-Sentences Using the Needleman-Wunsch Algorithm (Needleman-Wunsch 알고리즘을 이용한 유사예문 검색)

  • Kim Dong-Joo;Kim Han-Woo
    • Journal of the Korea Society of Computer and Information / v.11 no.4 s.42 / pp.181-188 / 2006
  • In this paper, we propose a search algorithm for similar example sentences in computer-aided translation. Similar-example search, a core part of computer-aided translation, retrieves from a set of examples those most similar to a given query in terms of structural and semantic analogy. The proposed algorithm is based on the Needleman-Wunsch algorithm, which is used in bioinformatics to measure the similarity between protein or nucleotide sequences. If the original Needleman-Wunsch algorithm is applied directly to similar-sentence search, it is likely to miss similar sentences because the similarity is sensitive to the inflectional components of words. Therefore, we use the lemma in addition to the (typographical) surface form, and we use the part of speech to capture structural analogy. In other words, this paper proposes a similarity metric that combines the surface, lemma, and part-of-speech information of a word. Finally, we present a search algorithm that uses the proposed metric and reports the word pairs that contribute to the similarity between a query and a retrieved example. Our algorithm shows good performance in the electricity and communication domain.
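To make the idea concrete, here is a minimal sketch of Needleman-Wunsch global alignment over word tokens, where the match score falls back from surface form to lemma to part of speech. The weights, the gap penalty, the mismatch penalty, and the fallback order are assumptions chosen for illustration; the paper's actual metric may combine the three layers differently.

```python
# Each token is a (surface, lemma, part_of_speech) tuple.
# Hypothetical scores; the paper's actual values are not given in the abstract.
W_SURFACE, W_LEMMA, W_POS = 1.0, 0.7, 0.3
MISMATCH = -1.0
GAP = -0.5


def token_similarity(a, b):
    """Score two tokens by the most specific layer on which they agree."""
    if a[0] == b[0]:
        return W_SURFACE
    if a[1] == b[1]:
        return W_LEMMA
    if a[2] == b[2]:
        return W_POS
    return MISMATCH


def needleman_wunsch(query, example):
    """Return (alignment score, list of aligned (query_idx, example_idx) pairs)."""
    n, m = len(query), len(example)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * GAP
    for j in range(1, m + 1):
        score[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + token_similarity(query[i - 1], example[j - 1]),
                score[i - 1][j] + GAP,   # query token aligned to a gap
                score[i][j - 1] + GAP,   # example token aligned to a gap
            )
    # Trace back from the bottom-right cell to recover the aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + token_similarity(query[i - 1], example[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + GAP:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


query = [("runs", "run", "VBZ"), ("fast", "fast", "RB")]
example = [("ran", "run", "VBD"), ("fast", "fast", "RB")]
print(needleman_wunsch(query, example))
```

In this example "runs" and "ran" differ on the surface but share a lemma, so they still align with a positive score, which is the effect the abstract attributes to adding lemma information.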


Foreign Language Education of Korean Peninsula: Insights from Nogeldae (『노걸대』 분석을 통해서 바라본 우리 반도의 외국어 교육)

  • Kim, Jeong-ryeol
    • The Journal of the Korea Contents Association / v.17 no.6 / pp.408-414 / 2017
  • This paper investigates the value and resilience of Nogeoldae, which was written at the end of the Koryo dynasty and was used as the most important foreign language teaching material throughout the 500 years of the Chosun dynasty. To this end, 106 volumes of dialogues, 12 of meeting, 17 of lodging, 21 of the journey to Daedo, 34 of life in Daedo, and 11 of the return in Nogeoldae are analyzed in terms of average sentence length, average word length, type-token ratio, number of words before the main verb, and number of words before nouns, in order to identify how progressively the complexity increases. The analysis shows that Nogeoldae exhibits the kind of progressive complexity that is desirable in modern foreign language textbooks.
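Three of the measures named above (average sentence length, average word length, and type-token ratio) are simple to compute. The sketch below assumes plain text in which sentences end with ., ! or ? and words are runs of letters; it is not the paper's procedure, and the tokenization used for the original Nogeoldae text is not specified in the abstract. The counts of words before main verbs and nouns would require part-of-speech tagging and are left out.

```python
import re


def complexity_metrics(text):
    """Compute basic lexical complexity measures for a passage of text."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    return {
        "avg_sentence_length": len(words) / len(sentences),        # words per sentence
        "avg_word_length": sum(len(w) for w in words) / len(words),  # letters per word
        "type_token_ratio": len(set(words)) / len(words),          # distinct / total words
    }


print(complexity_metrics("The cat sat. The dog ran away."))
```

Comparing these values across successive sections of a textbook is one way to check whether its complexity rises gradually.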

Analysis of the Usability of Machine Translators as an English Learning Tool -Through backtranslation of the as phrase (영어학습 도구로서 기계번역기의 가용성 분석 - as구문 역번역을 통하여)

  • Park, Kwonho;Kim, Jeong-ryeol
    • The Journal of the Korea Contents Association / v.21 no.5 / pp.259-267 / 2021
  • Machine translators first appeared in the 1950s, and their translation accuracy leapt forward in the 2010s with the adoption of neural translation systems. However, they still have difficulty translating complex sentences, which makes them inconvenient to use as an English learning tool. This study therefore analyzed the usability of machine translators as an English learning tool through a backtranslation experiment on 'as' phrases. Google Translator, Naver Papago, and Microsoft Translator were used as analysis tools because they are representative machine translators built on neural translation systems. The study found that usability differed significantly depending on the usage of 'as', and the usages of 'as' in sentences were accordingly classified into high, ordinary, and low usability. Unlike previous studies, this study contributes by analyzing the machine translator as a direct learning tool and by quantifying its usability for the conjunction 'as'.

An Analysis on Problem Solving Ability of 3rd Grade Types of Multiplication and Division Word Problem (곱셈과 나눗셈 문장제 유형에 따른 문제해결능력)

  • Lim, Ja Sun;Kim, Sung Joon
    • Journal of Elementary Mathematics Education in Korea / v.19 no.4 / pp.501-525 / 2015
  • This study analyzes the multiplication and division word problems in the 3rd-grade elementary school mathematics textbooks and workbooks of the 2009 revised curriculum. It also analyzes the types of problem solving that 4th graders prefer when solving arithmetic word problems, and their problem-solving ability by type, in order to find efficient methods for teaching arithmetic word problem solving. First, in the 3rd-grade mathematics textbook and workbook, multiplication and division word problems appear in various sections such as 'thought opening', 'activities', 'finish', and 'let's check'. By semantic element, multiplication was classified into five types (cumulated addition of the same number, rate, comparison, array, and combination), while division was classified into two types (division into equal parts and division by equal part). According to the analysis, cumulated addition of the same number was the most frequent type for multiplication, while the two division types were evenly distributed. Second, in the first test, which measured word problem solving ability by the meaning of the arithmetic operation, 4th graders showed the highest correct answer ratio for multiplication on the cumulated-addition type. For division, 4th graders showed a 90% correct answer ratio on 4 of the 5 questions. In the second test, which measured word problem solving ability by the construction of the arithmetic operation, the correct answer ratio for both multiplication and division was higher when the unknown was the result than when the unknown was the changed amount or the initial amount. This was because the mathematics textbook and workbook mostly present problems that ask for the result, so students found it difficult to understand questions about the changed amount or the initial amount, which they rarely see. Therefore, students need to experience more varied types of problems so that they can more easily recognize the problems presented in a textbook and improve their understanding of problems and their problem-solving ability.

Automatic Product Review Helpfulness Estimation based on Review Information Types (상품평의 정보 분류에 기반한 자동 상품평 유용성 평가)

  • Kim, Munhyong;Shin, Hyopil
    • Journal of KIISE / v.43 no.9 / pp.983-997 / 2016
  • The large number of online reviews available for any given product makes it difficult for a consumer to locate the helpful ones. The purpose of this study was to investigate automatic helpfulness evaluation of online product reviews according to review information types based on the target of the information. The underlying assumption was that consumers find reviews containing specific information related to the product itself or to the reliability of the reviewer more helpful than reviews containing peripheral information, such as shipping or customer service. Therefore, each sentence was categorized by the given information types, which reduced the semantic space of review sentences. Subsequently, we extracted specific information from sentences by using a topic-based representation of the sentences and a clustering algorithm. In review ranking experiments, this approach produced more effective results than other comparable approaches.

Nonlinear Vector Alignment Methodology for Mapping Domain-Specific Terminology into General Space (전문어의 범용 공간 매핑을 위한 비선형 벡터 정렬 방법론)

  • Kim, Junwoo;Yoon, Byungho;Kim, Namgyu
    • Journal of Intelligence and Information Systems / v.28 no.2 / pp.127-146 / 2022
  • Recently, as word embedding has shown excellent performance in various deep learning-based natural language processing tasks, research on advancing and applying word, sentence, and document embeddings has been actively conducted. Among these lines of work, cross-language transfer, which enables semantic exchange between different languages, is growing alongside the development of embedding models. Academic interest in vector alignment is growing with the expectation that it can be applied to various embedding-based analyses. In particular, vector alignment is expected to be applied to mapping between specialized domains and general domains. In other words, it is expected to make it possible to map the vocabulary of specialized fields such as R&D, medicine, and law into the space of a language model pre-trained on a huge volume of general-purpose documents, or to provide clues for mapping vocabulary between different specialized fields. However, linear vector alignment, which has been the main approach studied in academia, assumes statistical linearity and therefore tends to oversimplify the vector space. In effect, it assumes that different vector spaces are geometrically similar, which causes inevitable distortion in the alignment process. To overcome this limitation, we propose a deep learning-based vector alignment methodology that effectively learns the nonlinearity of the data. The proposed methodology sequentially trains a skip-connected autoencoder and a regression model to align the specialized word embeddings expressed in each space to the general embedding space. Finally, through inference with the two trained models, the specialized vocabulary can be aligned in the general space. To verify the performance of the proposed methodology, an experiment was performed on a total of 77,578 documents in the field of 'health care' among national R&D tasks performed from 2011 to 2020. The results confirmed that the proposed methodology outperformed existing linear vector alignment in terms of cosine similarity.
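The evaluation criterion named above, cosine similarity between an aligned vector and its target in the general space, can be sketched as follows. This covers only the metric, not the autoencoder or regression model; the function names and the toy vectors are assumptions for illustration, and the paper's exact evaluation protocol is not given in the abstract.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def mean_cosine(aligned, targets):
    """Average cosine similarity over paired aligned/target embeddings."""
    if not aligned or len(aligned) != len(targets):
        raise ValueError("need two non-empty lists of equal length")
    return sum(cosine_similarity(a, t) for a, t in zip(aligned, targets)) / len(aligned)


# Toy 2-dimensional vectors standing in for aligned and general-space embeddings.
aligned_vectors = [[1.0, 0.0], [1.0, 0.0]]
target_vectors = [[1.0, 0.0], [0.0, 1.0]]
print(mean_cosine(aligned_vectors, target_vectors))
```

Under this metric, an alignment method scores higher when the vectors it produces point in nearly the same direction as their counterparts in the general space, regardless of vector length.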