• Title/Summary/Keyword: Part-of-Speech Set


Building an Annotated English-Vietnamese Parallel Corpus for Training Vietnamese-related NLPs

  • Dien Dinh;Kiem Hoang
    • Proceedings of the IEEK Conference / summer / pp.103-109 / 2004
  • In NLP (Natural Language Processing) tasks, the greatest difficulty computers face is the built-in ambiguity of natural languages. Formerly, disambiguation relied on human-devised rules. Building such a complete rule set is a time-consuming and labor-intensive task, yet it still does not cover all cases; moreover, as the system grows, the rule set becomes very difficult to maintain. Recently, therefore, many NLP tasks have shifted from rule-based approaches to corpus-based approaches with large annotated corpora. Corpus-based NLP for popular languages such as English and French has been well studied, with satisfactory results. In contrast, corpus-based NLP for Vietnamese is at a deadlock due to the absence of annotated training data. Furthermore, hand-annotation of even reasonably well-determined features such as part-of-speech (POS) tags has proved labor-intensive and costly. In this paper, we present the construction of an annotated English-Vietnamese parallel aligned corpus, named EVC, for training Vietnamese-related NLP tasks such as word segmentation, POS tagging, word-order transfer, word sense disambiguation, and English-to-Vietnamese machine translation.
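As a hedged sketch of what one entry in such an annotated, word-aligned parallel corpus might look like, the snippet below uses an invented structure and invented tags (the actual EVC format is not described here; Vietnamese diacritics are omitted), and shows the usual payoff of alignment, projecting POS tags from the annotated side onto the low-resource side:

```python
# A minimal, hypothetical representation of one annotated parallel-corpus
# entry: POS-tagged tokens on each side plus word-alignment links.
entry = {
    "en": [("I", "PRP"), ("read", "VBP"), ("books", "NNS")],
    "vi": [("Toi", "P"), ("doc", "V"), ("sach", "N")],
    # alignment links: (index into en tokens, index into vi tokens)
    "align": [(0, 0), (1, 1), (2, 2)],
}

def projected_tags(entry):
    """Project English POS tags onto aligned Vietnamese words -- one
    common way to bootstrap annotation for a low-resource language."""
    vi_words = [w for w, _ in entry["vi"]]
    en_tags = [t for _, t in entry["en"]]
    return [(vi_words[j], en_tags[i]) for i, j in entry["align"]]

print(projected_tags(entry))
```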

A Splog Detection System Using Support Vector Machines and $\chi^2$ Statistics (지지벡터기계와 카이제곱 통계량을 이용한 스팸 블로그(Splog) 판별 시스템)

  • Lee, Song-Wook
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2010.05a / pp.905-908 / 2010
  • Our purpose is to develop a system that automatically detects splogs among blogs in the Web environment. After HTML is removed from the blogs, they are tagged by a part-of-speech (POS) tagger. Words and their POS tags are used as features. Among these features, we select useful ones with the $\chi^2$ statistic and train an SVM on the selected features. Our system achieved an F1 measure of 90.5% on the SPLOG data set.
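The feature-selection step the abstract describes can be sketched with the standard $\chi^2$ statistic over a 2x2 contingency table per feature; the feature values and counts below are invented for illustration, not the paper's data:

```python
def chi2_score(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 contingency table:
    n11 = docs with feature & splog, n10 = feature & non-splog,
    n01 = no feature & splog,       n00 = no feature & non-splog."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return num / den if den else 0.0

# Toy labeled blogs with word_POS features (as in the paper's feature type).
docs = [({"viagra_NN", "click_VB"}, 1), ({"viagra_NN"}, 1),
        ({"recipe_NN", "click_VB"}, 0), ({"recipe_NN"}, 0)]
vocab = {f for feats, _ in docs for f in feats}
scores = {}
for f in sorted(vocab):
    n11 = sum(1 for feats, y in docs if f in feats and y == 1)
    n10 = sum(1 for feats, y in docs if f in feats and y == 0)
    n01 = sum(1 for feats, y in docs if f not in feats and y == 1)
    n00 = sum(1 for feats, y in docs if f not in feats and y == 0)
    scores[f] = chi2_score(n11, n10, n01, n00)

# Keep the highest-scoring features for SVM training.
print(sorted(scores, key=scores.get, reverse=True))
```

Class-correlated features ("viagra_NN", "recipe_NN") score high; the uninformative "click_VB", which appears once in each class, scores zero and would be discarded before training the SVM.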

A Study of Methods for the Oriental·Western Medical Approach to Child Neuropsychiatric Disorders (소아신경정신 질환의 한·양방적 접근 방법론 연구)

  • Kim, Geun-Woo
    • Journal of Oriental Neuropsychiatry / v.14 no.2 / pp.15-25 / 2003
  • Objectives: This study aimed to investigate clinical developments in child neuropsychiatry through an Oriental·Western medical approach to child neuropsychiatric disorders. Methods: DSM-IV and ICD-10 were taken as the standard for clinical expression; according to this standard and Oriental medical disease categories, child neuropsychiatric disorders were divided into six symptom groups. Results and Conclusion: From the viewpoint of Oriental medicine, 1. psychosomatic stroke (including spasm) falls under the categories 'Epilepsy(癎)', 'Children's fit(驚風)' and 'Chi-Kyeung(?痙)'; 2. mental retardation falls under 'Dementia(?)', 'Amnesia(健忘)' and 'Speech Disorder(語遲)'; 3. emotional disorder falls under 'Adjustment Disorder(客?)', 'Cry with anxiety at night(夜啼症)', 'Gi-Byung(?病)' and 'Child Depressive Disorder(小兒癲症)'; 4. conduct and developmental disorder falls under 'Physical frailty of the five parts(五軟)' and 'Physical stiffness of the five parts(五硬)'; 5. childhood psychosis falls under 'Insanity(癲狂)'; 6. somatoform disorder falls under 'Palpitation of the heart(驚悸)', 'Vomiting and Diarrhea(吐瀉)', 'Asthma(喘)', 'Headache(頭痛)' and 'Enuresis(遺尿)'.

Linguistic Modeling for Multilingual Machine Translation based on Common Transfer (공통변환 기반 다국어 자동번역을 위한 언어학적 모델링)

  • Choi, Sungkwon;Kim, Younggil
    • Language and Information / v.18 no.1 / pp.77-97 / 2014
  • Multilingual machine translation is machine translation for more than two languages. Common transfer is transfer in which the transfer rules can be reused among similar languages according to linguistic typology. Multilingual machine translation based on common transfer is therefore multilingual machine translation that shares transfer rules among languages with similar linguistic typology. This paper describes the linguistic modeling for multilingual machine translation based on common transfer that is under development. The modeling consists of linguistic devices such as 1) a multilingual common part-of-speech set, 2) a multilingual common transfer format, 3) multilingual common transfer chunking, and 4) multilingual common transfer rules based on linguistic typology. The validity of this linguistic modeling for multilingual machine translation is shown in simulation. A multilingual machine translation system based on common transfer covering Korean, English, Chinese, Spanish, and French will be developed by 2018.
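The idea of sharing one transfer rule across typologically similar pairs can be illustrated with a toy reordering rule; the typology table, tags, and rule below are invented for illustration and are not the paper's devices:

```python
# Hypothetical sketch: one transfer rule shared across every language pair
# that crosses a typological divide (here, adjective-noun order).
TYPOLOGY = {"en": "ADJ_N", "ko": "ADJ_N", "es": "N_ADJ", "fr": "N_ADJ"}

def transfer_np(tagged, src, tgt):
    """Reorder an [ADJ, N] chunk when source and target typologies differ.
    A single rule covers en->es, en->fr, ko->es, ... without duplication."""
    if TYPOLOGY[src] != TYPOLOGY[tgt] and [t for _, t in tagged] == ["ADJ", "N"]:
        return [tagged[1], tagged[0]]
    return tagged

chunk = [("red", "ADJ"), ("house", "N")]
print(transfer_np(chunk, "en", "es"))  # reordered: typologies differ
print(transfer_np(chunk, "en", "ko"))  # unchanged: same typology
```

The design point is that the rule is keyed on a typological property rather than on a specific language pair, which is what makes it "common" and reusable.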

Context-sensitive Word Error Detection and Correction for Automatic Scoring System of English Writing (영작문 자동 채점 시스템을 위한 문맥 고려 단어 오류 검사기)

  • Choi, Yong Seok;Lee, Kong Joo
    • KIPS Transactions on Software and Data Engineering / v.4 no.1 / pp.45-56 / 2015
  • In this paper, we present a method that can detect context-sensitive word errors and generate correction candidates. Spelling error detection is one of the most widespread research topics; however, the approach proposed in this paper is adjusted for an automated English scoring system. A common strategy in context-sensitive word error detection is to use a pre-defined confusion set to generate correction candidates. We automatically generate a confusion set in order to consider the characteristics of sentences written by second-language learners. We define a word error that cannot be detected by a conventional grammar checker because of part-of-speech ambiguity, and propose how to detect this kind of error and generate correction candidates for it. An experiment was performed on English writings composed by junior-high school students whose mother tongue is Korean. The F1 value of the proposed method is 70.48%, which shows that our method is promising compared to the current state of the art.
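The paper derives its confusion sets automatically from learner data; as a generic stand-in (not the paper's method), the Norvig-style sketch below builds a confusion set from edit-distance-1 neighbors restricted to a vocabulary, with an invented toy vocabulary:

```python
def edits1(word):
    """All strings within edit distance 1 of word
    (deletes, transposes, replaces, inserts)."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

# Toy vocabulary of words learners commonly confuse.
VOCAB = {"advice", "advise", "quiet", "quite", "affect", "effect"}

def confusion_set(word):
    """Correction candidates: in-vocabulary words within edit distance 1."""
    return sorted(edits1(word) & (VOCAB - {word}))

print(confusion_set("advice"))
print(confusion_set("quite"))
```

A context model (e.g. surrounding words and POS tags) would then score each candidate in place of the original word to decide whether an error is present.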

HMM Based Part of Speech Tagging for Hadith Isnad

  • Abdelkarim Abdelkader
    • International Journal of Computer Science & Network Security / v.23 no.3 / pp.151-160 / 2023
  • The Hadith is the second source of Islamic jurisprudence after the Qur'an. Both sources are indispensable for Muslims practicing Islam. All Ahadith have been collected and written down, but most books of Hadith contain Ahadith that may be weak or rejected. For a long time, therefore, scholars of Hadith have defined laws, rules, and principles of Hadith to distinguish correct Hadith (Sahih) from fair (Hassen) and weak (Dhaif). Until now, however, these rules, laws, and principles have been applied manually by specialists or students. The work presented in this paper is part of the automatic treatment of Hadith; more specifically, it aims to automatically process the chain of narrators (Hadith Isnad) to find its different components and assign each component its own tag using a statistical method: Hidden Markov Models (HMM). This method is a powerful abstraction for time-series data and a robust tool for representing probability distributions over sequences of observations. In this paper, we describe an important tool in Hadith Isnad processing: a chunker based on HMM. The role of this tool is to decompose the chain of narrators (Isnad) and determine the tag of each part of the Isnad (POI). First, we compiled a tagset containing 13 tags. Then we used these tags to manually build a corpus of 100 chains of narrators from "Sahih Alboukhari" and extracted a lexicon from this corpus. This lexicon is a set of XML documents based on HPSG features and contains the information of 134 narrators. After that, we designed and implemented an HMM-based analyzer that assigns each part of the Isnad its proper tag and each narrator its features. The system was tested on 2661 non-duplicated Isnad from "Sahih Alboukhari". The result achieved an F-score of 93%.
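HMM tagging of this kind is typically decoded with the Viterbi algorithm; the toy model below uses two invented tags (NAR for a narrator name, TRA for a transmission term) and made-up probabilities, not the paper's 13-tag set or trained parameters:

```python
# Toy HMM over hypothetical isnad-part tags.
states = ["NAR", "TRA"]
start = {"NAR": 0.3, "TRA": 0.7}
trans = {"NAR": {"NAR": 0.2, "TRA": 0.8}, "TRA": {"NAR": 0.9, "TRA": 0.1}}
emit = {"NAR": {"Malik": 0.6, "Nafi": 0.4, "haddathana": 0.0},
        "TRA": {"Malik": 0.05, "Nafi": 0.05, "haddathana": 0.9}}

def viterbi(obs):
    """Most probable tag sequence for the observed token sequence."""
    V = [{s: (start[s] * emit[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        row = {}
        for s in states:
            p, path = max(
                (V[-1][r][0] * trans[r][s] * emit[s][o], V[-1][r][1])
                for r in states)
            row[s] = (p, path + [s])
        V.append(row)
    return max(V[-1].values())[1]

# "haddathana" ("narrated to us") is a transmission term,
# followed by a narrator name.
print(viterbi(["haddathana", "Malik"]))
```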

A Study On Generation and Reduction of the Notation Candidate for the Notation Restoration of Korean Phonetic Value (한국어 음가의 표기 복원을 위한 표기 후보 생성 및 감소에 관한 연구)

  • Rhee, Sang-Burm;Park, Sung-Hyun
    • The KIPS Transactions:PartB / v.11B no.1 / pp.99-106 / 2004
  • Syllable restoration is the process of restoring a phonetic value recognized by a speech recognition device to the notation form it had before vocalization. In this paper, syllable restoration rules were composed based on standard pronunciation for the syllable restoration process. Using these rules, a method for generating the set of notation candidates was studied, together with methods for reducing the number of generated notation candidates. Three phases of reduction were suggested, which eliminate candidates containing non-notation syllables, non-vocabulary syllables, and non-stem syllables. Experiments showed an average notation-candidate reduction rate of 74%.
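The generate-then-reduce scheme can be sketched abstractly; the mapping and lexicon below are romanized toy stand-ins (Korean standard pronunciation maps, e.g., several written codas to one pronounced sound), not the paper's actual rules:

```python
from itertools import product

# Hypothetical restoration rules: each recognized phone maps to the
# notations that could have produced it under pronunciation rules.
RESTORE = {"t": ["t", "s", "ss", "j", "ch"], "a": ["a"], "k": ["k", "kk"]}
LEXICON = {"sak", "tak"}  # toy notation-syllable dictionary

def candidates(phones):
    """Generate every restoration candidate (the expansion step), then
    reduce by keeping only forms attested in the lexicon."""
    all_forms = {"".join(p) for p in product(*(RESTORE[x] for x in phones))}
    return sorted(all_forms & LEXICON)

print(candidates(["t", "a", "k"]))
```

Here 10 candidates are generated and 8 are discarded by the lexicon filter, which is the same shape of reduction the paper measures (an average 74% reduction over its three phases).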

A Splog Detection System Using Support Vector Machines (지지벡터기계를 이용한 스팸 블로그(Splog) 판별 시스템)

  • Lee, Song-Wook
    • Journal of the Korea Institute of Information and Communication Engineering / v.15 no.1 / pp.163-168 / 2011
  • Blogs are an easy way to publish information, engage in discussions, and form communities on the Internet. Recently, several varieties of spam blog have appeared whose purpose is to host ads or raise the PageRank of target sites. Our purpose is to develop a system that automatically detects these spam blogs (splogs) among blogs in the Web environment. After HTML is removed from the blogs, they are tagged by a part-of-speech (POS) tagger. Words and their POS tags are used as features. Among these features, we select useful ones with the $\chi^2$ statistic and train an SVM on the selected features. Our system achieved an F1 measure of 90.5% on the SPLOG data set.

Porting POSTAG using Part-Of-Speech TagSet Mapping (품사 태그 세트의 매핑을 이용한 한국어 품사 태거 (POSTAG) 이식)

  • Kim, Jun-Seok;Shim, Jun-Hyuk;Lee, Geun-Bae
    • Annual Conference on Human and Language Technology / 1999.10e / pp.484-490 / 1999
  • Part-of-speech tagset mapping is useful for increasing the reusability of corpora, by both obtaining information from and providing information to large corpora tagged with different POS tagsets. This paper deals with two-way tagset mapping between the tagset used by the POS tagger (POSTAG) of the natural language processing engine (SKOPE) of POSTECH's natural language processing laboratory and the standard tagset of ETRI. Through this mapping, a large amount of training data for POSTAG can be obtained from corpora tagged with the standard tagset, and POSTAG can output its results in both tagsets. In particular, we examine the problems that can arise in Korean tagset mapping, such as differences in dictionary headwords (morpheme segmentation differences), tag assignment differences, and contraction handling differences, together with mechanical solutions for them. To measure the accuracy of the tagset mapping, we conducted an experiment comparing the accuracy of the tagging system before and after mapping. POSTAG with this automatic mapping method was applied to, and successfully used in, the first morphological analyzer evaluation contest (MATEC '99).
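The core of a tagset mapping, ignoring the harder segmentation and contraction mismatches the paper discusses, is a translation table applied corpus-wide; the tag names below are invented for illustration (the real POSTAG and ETRI standard tagsets are larger):

```python
# Hypothetical two-way tag mapping tables (toy tag names, not the
# actual POSTAG or ETRI standard tagsets).
TO_STANDARD = {"nc": "NN", "pv": "VV", "jc": "JKO"}
FROM_STANDARD = {v: k for k, v in TO_STANDARD.items()}

def remap(tagged, table):
    """Re-tag a corpus; unmapped tags are kept so mismatches stay visible."""
    return [(w, table.get(t, t)) for w, t in tagged]

sent = [("사과", "nc"), ("를", "jc"), ("먹", "pv")]
std = remap(sent, TO_STANDARD)
print(std)
print(remap(std, FROM_STANDARD))  # round-trips when the mapping is 1-to-1
```

The mapping is only lossless where it is one-to-one; many-to-one tag assignments and morpheme-segmentation differences are exactly the cases the paper handles with additional machinery.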

Integrated Indexing Method using Compound Noun Segmentation and Noun Phrase Synthesis (복합명사 분할과 명사구 합성을 이용한 통합 색인 기법)

  • Won, Hyung-Suk;Park, Mi-Hwa;Lee, Geun-Bae
    • Journal of KIISE:Software and Applications / v.27 no.1 / pp.84-95 / 2000
  • In this paper, we propose an integrated indexing method with compound noun segmentation and noun phrase synthesis. Statistical information is used in the compound noun segmentation, and natural language processing techniques are carefully utilized in the noun phrase synthesis. First, we choose index terms from simple words using morphological analysis and part-of-speech tagging results. Second, noun phrases are automatically synthesized from syntactic analysis results; if syntactic analysis fails, only the morphological analysis and tagging results are applied. Third, we select compound nouns from the tagging results and then segment and re-synthesize them using statistical information. In this way, the segmented and synthesized terms are used together as index terms to supplement the single terms. We demonstrate the effectiveness of the proposed integrated indexing method for Korean compound noun processing using KTSET2.0 and KRIST SET, which are standard test collections for Korean information retrieval.
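A common statistical approach to the segmentation step, offered here as a hedged sketch rather than the paper's exact model, scores each split of a compound noun by the product of corpus frequencies of its parts; the frequency table is invented:

```python
# Illustrative corpus frequencies (toy numbers, not corpus-derived).
FREQ = {"정보": 50, "검색": 40, "시스템": 60, "정보검색": 5}

def segment(noun):
    """Best segmentation by product of unigram frequencies,
    via dynamic programming over character positions."""
    best = {0: (1.0, [])}  # position -> (score, segmentation up to here)
    for end in range(1, len(noun) + 1):
        for start in range(end):
            piece = noun[start:end]
            if start in best and piece in FREQ:
                score = best[start][0] * FREQ[piece]
                if end not in best or score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best.get(len(noun), (0.0, [noun]))[1]

# "정보검색시스템" (information-retrieval system) splits into three nouns.
print(segment("정보검색시스템"))
```

Each resulting segment can then serve as an index term alongside the original compound, which is the supplementation the paper describes.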
