• Title/Abstract/Keywords: Hindi Language

9 results (processing time: 0.02 seconds)

Part-of-speech Tagging for Hindi Corpus in Poor Resource Scenario

  • Modi, Deepa;Nain, Neeta;Nehra, Maninder
    • Journal of Multimedia Information System / Vol. 5, No. 3 / pp. 147-154 / 2018
  • Natural language processing (NLP) is an emerging research area that studies how machines can be used to perceive and alter text written in natural languages. Different tasks can be performed on natural languages by analyzing them through annotation tasks such as parsing, chunking, part-of-speech tagging, and lexical analysis. These annotation tasks depend on the morphological structure of a particular natural language. The focus of this work is part-of-speech (POS) tagging for the Hindi language. POS tagging, also known as grammatical tagging, is the process of assigning a grammatical category to each word of a given text. These categories can be noun, verb, time, date, number, etc. Hindi is the most widely used and official language of India. It is also among the five most spoken languages of the world. A diverse range of POS taggers is available for English and other languages, but these taggers cannot be applied to Hindi, which is one of the most morphologically rich languages. Furthermore, there is a significant difference between the morphological structures of these languages. Thus, this work presents a POS tagger for Hindi based on a hybrid approach that combines probability-based and rule-based methods. Known words are tagged with a unigram probability model, whereas unknown words are tagged using various lexical and contextual features. Finite-state automata are constructed to represent the different rules, and regular expressions are then used to implement these rules. A tagset containing 29 standard part-of-speech tags is also prepared for this task. The tagset also includes two unique tags, a date tag and a time tag, which support all possible formats. Regular expressions are used to implement all pattern-based tags such as time, date, number, and special symbols. The aim of the presented approach is to increase the correctness of automatic Hindi POS tagging while limiting the need for a large manually built corpus. The hybrid approach uses the probability-based model to increase automatic tagging and the rule-based model to limit the need for an already trained corpus. The approach relies on a very small labeled training set (around 9,000 words) and yields a best precision of 96.54% and an average precision of 95.08%. It also yields a best accuracy of 91.39% and an average accuracy of 88.15%.
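
The hybrid idea in this abstract (a unigram lookup for known words, regular expressions for pattern-based tags) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tag names, the regular expressions, and the `UNK` fallback are assumptions made for illustration, and the lexical and contextual features the paper uses for unknown words are not modeled here.

```python
import re
from collections import Counter, defaultdict

# Hypothetical rules for pattern-based tags. The tag names and the
# expressions are illustrative assumptions, not the paper's tagset.
PATTERN_RULES = [
    ("DATE", re.compile(r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}")),
    ("TIME", re.compile(r"\d{1,2}:\d{2}(:\d{2})?")),
    ("NUM", re.compile(r"\d+([.,]\d+)*")),
    ("SYM", re.compile(r"[^\w\s]+")),
]


def train_unigram(tagged_words):
    """Map each known word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_tokens(tokens, model, default_tag="UNK"):
    """Tag known words with the unigram model, then fall back to the pattern rules."""
    tagged = []
    for token in tokens:
        if token in model:
            tagged.append((token, model[token]))
            continue
        for name, pattern in PATTERN_RULES:
            if pattern.fullmatch(token):
                tagged.append((token, name))
                break
        else:
            # No rule matched: a fuller tagger would consult lexical and
            # contextual features here instead of giving up.
            tagged.append((token, default_tag))
    return tagged


# Tiny made-up training sample (hypothetical tags).
training = [("राम", "NNP"), ("घर", "NN"), ("गया", "VM"), ("घर", "NN")]
model = train_unigram(training)
result = tag_tokens(["राम", "घर", "गया", "12/05/2018", "10:30", "42", "।", "किताब"], model)
```

In this sketch the rule order matters: the date and time patterns are tried before the generic number pattern so that a token such as `10:30` is not partially claimed by a looser rule.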

Optical Character Recognition for Hindi Language Using a Neural-network Approach

  • Yadav, Divakar;Sanchez-Cuadrado, Sonia;Morato, Jorge
    • Journal of Information Processing Systems / Vol. 9, No. 1 / pp. 117-140 / 2013
  • Hindi is the most widely spoken language in India, with more than 300 million speakers. Because characters in Hindi text are not separated from one another as they are in English, the Optical Character Recognition (OCR) systems developed for Hindi have a very poor recognition rate. In this paper we propose an OCR system for printed Hindi text in Devanagari script that uses an Artificial Neural Network (ANN) to improve recognition efficiency. One of the major reasons for the poor recognition rate is error in character segmentation. The presence of touching characters in scanned documents further complicates segmentation, creating a major problem when designing an effective character segmentation technique. The major steps followed by a general OCR system are preprocessing, character segmentation, feature extraction, and finally classification and recognition. The preprocessing tasks considered in the paper are conversion of grayscale images to binary images, image rectification, and segmentation of the document's textual contents into paragraphs, lines, words, and then basic symbols. The basic symbols, obtained as the fundamental unit from the segmentation process, are recognized by the neural classifier. In this work, three feature extraction techniques are used to improve the recognition rate: histogram of projection based on mean distance, histogram of projection based on pixel value, and vertical zero crossing. These techniques are powerful enough to extract features of even distorted characters and symbols. The neural classifier is a back-propagation neural network with two hidden layers. The classifier is trained and tested on printed Hindi texts. A correct recognition rate of approximately 90% is achieved.
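
Two of the feature types named in this abstract, projection histograms and vertical zero crossings, can be sketched in a few lines of Python. The abstract does not give exact definitions, so the functions below are one plausible reading under stated assumptions: the symbol is a binary image stored as a list of rows (1 = ink, 0 = background), a projection counts ink pixels per row or column, and a zero crossing is any change between ink and background along a column. The function names are hypothetical.

```python
def horizontal_projection(image):
    """Count foreground (1) pixels in each row of a binary image."""
    return [sum(row) for row in image]


def vertical_projection(image):
    """Count foreground (1) pixels in each column of a binary image."""
    return [sum(column) for column in zip(*image)]


def vertical_zero_crossings(image):
    """Count, per column, the top-to-bottom changes between background and foreground."""
    return [
        sum(1 for above, below in zip(column, column[1:]) if above != below)
        for column in zip(*image)
    ]


# A made-up 4x4 symbol with a full top row, loosely echoing the
# headline stroke of Devanagari characters.
glyph = [
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

# Concatenate the three descriptors into one feature vector for a classifier.
features = (
    horizontal_projection(glyph)
    + vertical_projection(glyph)
    + vertical_zero_crossings(glyph)
)
```

A vector like `features` is the kind of fixed-length input a back-propagation network could be trained on, although the paper's actual feature layout may differ.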

An Artificial Intelligence Approach for Word Semantic Similarity Measure of Hindi Language

  • Younas, Farah;Nadir, Jumana;Usman, Muhammad;Khan, Muhammad Attique;Khan, Sajid Ali;Kadry, Seifedine;Nam, Yunyoung
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 15, No. 6 / pp. 2049-2068 / 2021
  • AI combined with NLP techniques has promoted the use of Virtual Assistants and has made people rely on them for many diverse uses. Conversational Agents are the most promising technique for assisting computer users in their operation. An important challenge in developing Conversational Agents globally is transferring the groundbreaking expertise obtained in English to other languages. AI is making it possible to transfer this learning. There is a dire need to develop systems that understand secular languages. One such difficult language is Hindi, which is the fourth most spoken language in the world. Semantic similarity is an important part of Natural Language Processing, with applications such as ontology learning and information extraction, for developing conversational agents. Most of the research is concentrated on English and other European languages. This paper presents a corpus-based word semantic similarity measure for Hindi. An experiment involving the translation of the English benchmark dataset to Hindi is performed, investigating the incorporation of the corpus with human and machine similarity ratings. A significant correlation between human intuition and the algorithm ratings has been calculated to analyze the accuracy of the proposed similarity measures. The method can be adapted for various applications of word semantic similarity or as a module for any other language.

Hindi version of short form of douleur neuropathique 4 (S-DN4) questionnaire for assessment of neuropathic pain component: a cross-cultural validation study

  • Gudala, Kapil;Ghai, Babita;Bansal, Dipika
    • The Korean Journal of Pain / Vol. 30, No. 3 / pp. 197-206 / 2017
  • Background: Pain with neuropathic characteristics is generally more severe and associated with a lower quality of life compared to nociceptive pain (NcP). The short form of the Douleur Neuropathique en 4 Questions (S-DN4) is one of the most used and reliable screening questionnaires and is reported to have good diagnostic properties. This study aimed to cross-culturally validate the Hindi version of the S-DN4 in patients with various chronic pain conditions. Methods: The S-DN4 has already been translated into Hindi by Mapi Research Trust. This study assessed the psychometric properties of the Hindi version of the S-DN4, including internal consistency and test-retest reliability 3 days after the baseline assessment. Diagnostic performance was also assessed. Results: One hundred sixty patients with chronic pain, 80 each in the neuropathic pain (NeP) present and NeP absent groups, were recruited. Patients with NeP present reported significantly higher S-DN4 scores than patients in the NeP absent group (mean (SD), 4.7 (1.7) vs. 1.8 (1.6), P < 0.01). The S-DN4 was found to have an AUC of 0.88 with adequate internal consistency (Cronbach's α = 0.80) and test-retest reliability (ICC = 0.92), with an optimal cut-off value of 3 (Youden's index = 0.66, sensitivity and specificity of 88.7% and 77.5%). The diagnostic concordance rate between clinician diagnosis and the S-DN4 questionnaire was 83.1% (kappa = 0.66). Conclusions: Overall, the Hindi version of the S-DN4 has good internal consistency and test-retest reliability along with good diagnostic accuracy.
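
The reported Youden's index can be checked against the reported sensitivity and specificity using the standard definition J = sensitivity + specificity - 1. The short Python sketch below does only that arithmetic; the function name is hypothetical, and the inputs are the percentages quoted in the abstract.

```python
def youden_index(sensitivity, specificity):
    """Youden's J statistic: sensitivity + specificity - 1, with both given as fractions."""
    return sensitivity + specificity - 1.0


# Values quoted in the abstract for the cut-off of 3.
j = youden_index(0.887, 0.775)
print(f"Youden's index: {j:.2f}")
```

The result rounds to 0.66, which agrees with the figure given in the abstract.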

The Role of Contrast in Prosodically Induced Acoustic Variation

  • Choi, Han-Sook
    • 말소리와 음성과학 / Vol. 1, No. 3 / pp. 29-37 / 2009
  • This paper presents results from speech production experiments on English, Korean, and Hindi that compare variation in the acoustic expression of dissimilar phonological laryngeal contrasts in stops conditioned by prosodic prominence. Target stops are analyzed in utterance-initial, -medial, and -final positions, with variation in contrastive focal accent, using speech data from six male American English speakers, five male Seoul Korean speakers, and five male Delhi Hindi speakers. The results show that prosodic prominence conditions enhanced distinctiveness between contrastive segments in the three languages. The manner in which prosodic prominence and prosodic phrase structure are marked at the level of segmental variation is, however, found to be language-specific to some extent. In addition, a correlation between the size of the phonological inventory and the corresponding acoustic variation was found, but a linear correlation was not strongly supported by the findings of the present study.

A Text Processing Method for Devanagari Scripts in Android

  • 김재혁;맹승렬
    • 한국콘텐츠학회논문지 / Vol. 11, No. 12 / pp. 560-569 / 2011
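  • This paper proposes a method for processing Hindi text on Android, an open operating system. The core of text processing consists of an automaton that defines the rules for composing alphabet letters into characters, and font rendering, which looks up the image corresponding to a character in the font file and displays it on screen. Because an automaton depends on the types and number of input characters, we propose a Unicode-based automaton that uses 14 consonants and 34 vowels as its alphabet. Each composed syllable is converted into a glyph index by table mapping, and the index serves as a handle for loading the corresponding font. By adding the proposed method as a separate module within the multilingual support framework of the FreeType font engine, Hindi can be supported at the system level. The validity of the proposed method is demonstrated with a messaging application.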

Korean NPIs amu-(N)-to and amu-(N)-rato

  • Yoon, Young-Eun
    • 한국언어정보학회지:언어와정보 / Vol. 12, No. 2 / pp. 21-47 / 2008
  • This paper reviews the analysis of the so-called Korean NPIs, amu-(N)-to and amu-(N)-rato, proposed by An (2007). An proposes that the two so-called polarity items are identical semantically, tantamount to English even, but they are in complementary distribution due to the opposite scope properties of the emphatic particles to and rato contained in the NPIs in question. Resorting to Karttunen and Peters' (1979) and Wilkinson's (1996) scope analysis of even, Lahiri's (1998) analysis of Hindi NPIs, and Guerzoni's (2002) analysis of the negative bias of yes/no-questions containing minimizers, An accounts for the distributional properties of the two Korean NPIs. Given this, however, it is observed that unlike amu-(N)-to, amu-(N)-rato could be licensed in much broader contexts. Based on this observation, this paper proposes that the two particles to and rato are two different particles with different meanings.

The Interpretation of the Indian Stupa as Origin of Korean Pagoda: Based on Field Surveys across India

  • 이희봉
    • 건축역사연구 / Vol. 18, No. 6 / pp. 103-126 / 2009
  • This study aims to identify historical trends and changes in the form of stupas throughout India through field observation that is as direct as possible, by classifying, analyzing, and synthesizing the stupas. The study of Indian stupas in Korea has a number of shortcomings, since only an introductory, partial approach has been taken in order to seek the origin of the Korean pagoda. This study also aims to correct errors in the Chinese-character stupa terminology that arose from misinterpretation of the Hindi language and that was established by earlier Japanese scholars several decades ago. Piled-up stupas were totally destroyed by pagans, so their remains tell us only about structure, material, size, and disposition. However, the remains of carved stone at the torana and drum give us clues to the original form of the stupa and to worshipping activity, as well as to the change to a more luxurious form. Many rock cave stupas of India show both simple forms matching the ascetic age of early Buddhism and luxurious changes in the Mahayana era, when statues of Buddha were introduced. Indians recovered the spherical form of 'anda,' a Hindi term meaning cosmic egg, from the hemispheric form of the piled-up stupa. Therefore we might discard the erroneous term 'bokbal', which means an upset vessel. Railings and parasols became main elements of stupa design. Carved railings around the stupa became a sign of divinity. Serious worshipping activity made drums long or high and created multi-embossed stripes. The bases of the circular drums of some cave stupas changed shape to rectangular or octagonal. Single parasols became multiple parasols with affluent, flowerlike curved stems on carved stupas. The multistoried, elongated, and high parasols of Gandhara stupas are closely related to such factors as the diverse changes of form in the Indian subcontinent. The four-sided torana gates and ayaka columns of the circular original stupas suggest the rectangular form of the subsequent East Asian pagoda, and the higher and wider base of Indian stupas became the origin of the East Asian rectangular pagoda.

The Status of Ramsar Wetlands in India: A Review of Ecosystem Benefits, Threats, and Management Strategies

  • ;;전민수;김이형
    • 한국습지학회지 / Vol. 24, No. 2 / pp. 123-141 / 2022
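  • Worldwide, natural wetlands are a natural resource that provides diverse economic benefits and supports healthy ecosystems. This study analyzes the current status of wildlife ecosystems and conservation in Ramsar wetlands, known in India as "Jheelon". As of 2022, India has 49 Ramsar wetlands covering approximately 1,09363.6 km2, ranging from the largest, the Sundarbans wetland, to the small Chandertal wetland. Although research on the scale of damage, such as wetland shrinkage and loss of function caused by human activity in India and in developed countries, remains insufficient, the importance of maintaining, conserving, and restoring wetlands has been reported. National policymakers and the relevant local governments should prepare legislation on building ecosystem services through wetlands, wetland conservation and restoration, pollutant reduction, and discharge regulation, and should maintain an active stake in wetlands.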