• Title/Abstract/Keywords: Corpus-based Study

Search results: 204 items (processing time: 0.022 seconds)

Trends and Future of Digital Personal Assistant

  • 권오욱;이기영;이요한;노윤형;조민수;황금하;임수종;최승권;김영길
    • 전자통신동향분석 / Vol. 36, No. 1 / pp. 1-11 / 2021
  • In this study, we introduce trends in and the future of digital personal assistants. Recently, digital personal assistants have begun to handle many tasks as humans do by communicating with users in human language on smart devices such as smartphones, smart speakers, and smart cars. Their capabilities range from simple voice commands and chitchat to complex tasks such as device control, reservation, ordering, and scheduling. The digital personal assistants of the future will certainly speak like a person, have a person-like personality, see, hear, and analyze situations like a person, and become more human. Dialogue processing technology that makes them more human-like has developed into end-to-end learning models based on deep neural networks in recent years. In addition, language models pre-trained on large corpora make dialogue processing more natural and improve understanding. Advances in artificial intelligence such as dialogue processing technology will enable digital personal assistants to serve users more naturally and with better performance in various areas.

Building Hybrid Stop-Words Technique with Normalization for Pre-Processing Arabic Text

  • Atwan, Jaffar
    • International Journal of Computer Science & Network Security / Vol. 22, No. 7 / pp. 65-74 / 2022
  • In natural language processing, commonly used words such as prepositions are referred to as stop-words; they have no inherent meaning and are therefore ignored in indexing and retrieval tasks. The removal of stop-words from Arabic text has a significant impact in terms of reducing the size of a corpus text, which leads to an improvement in the effectiveness and performance of Arabic-language processing systems. This study investigated the effectiveness of applying stop-word list elimination with normalization as a preprocessing step. The idea was to merge a statistical method with a linguistic method to attain the best efficacy, and to compare the effects of this two-pronged approach in reducing corpus size for Arabic natural language processing systems. Three stop-word lists were considered: an Arabic Text Lookup Stop-list, a Frequency-based Stop-list using Zipf's law, and a Combined Stop-list. An experiment was conducted using a selected file from the Arabic Newswire data set. In the experiment, the size of the corpus was compared after removing the words contained in each list. The results showed that the best reduction in size was achieved by using the Combined Stop-list with normalization, with a word count reduction of 452930 and a compression rate of 30%.
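
The combined stop-list idea described in this abstract can be illustrated with a minimal sketch. It assumes a toy English token list and a hypothetical lookup list; the function names, the `top_k` cutoff, and the sample data are illustrative assumptions, not the paper's implementation, and the Arabic normalization step is not modeled.

```python
from collections import Counter


def frequency_stoplist(tokens, top_k):
    """Return the top_k most frequent tokens as a frequency-based stop-list."""
    return {word for word, _ in Counter(tokens).most_common(top_k)}


def remove_stopwords(tokens, stoplist):
    """Drop every token that appears in the stop-list."""
    return [t for t in tokens if t not in stoplist]


def compression_rate(original, reduced):
    """Fraction of tokens removed from the original corpus."""
    return (len(original) - len(reduced)) / len(original)


# Toy corpus (assumption): English stands in for the Arabic Newswire text.
tokens = "the cat sat on the mat and the dog sat on the rug".split()

lookup_list = {"on", "and"}                        # hypothetical lookup stop-list
freq_list = frequency_stoplist(tokens, top_k=1)    # most frequent token only
combined = lookup_list | freq_list                 # union of both lists

reduced = remove_stopwords(tokens, combined)
print(reduced, round(compression_rate(tokens, reduced), 3))
```

Taking the union of the two lists is only one plausible way to combine them; the abstract does not state how the Combined Stop-list was built.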

Burmese Sentiment Analysis Based on Transfer Learning

  • Mao, Cunli;Man, Zhibo;Yu, Zhengtao;Wu, Xia;Liang, Haoyuan
    • Journal of Information Processing Systems / Vol. 18, No. 4 / pp. 535-548 / 2022
  • Using a rich resource language to classify sentiments in a language with few resources is a popular subject of research in natural language processing. Burmese is a low-resource language. In light of the scarcity of labeled training data for sentiment classification in Burmese, in this study, we propose a method of transfer learning for sentiment analysis of a language that uses the feature transfer technique on sentiments in English. This method generates a cross-language word-embedding representation of Burmese vocabulary to map Burmese text to the semantic space of English text. A model to classify sentiments in English is then pre-trained using a convolutional neural network and an attention mechanism, where the network shares the model for sentiment analysis of English. The parameters of the network layer are used to learn the cross-language features of the sentiments, which are then transferred to the model to classify sentiments in Burmese. Finally, the model was tuned using the labeled Burmese data. The results of the experiments show that the proposed method can significantly improve the classification of sentiments in Burmese compared to a model trained using only a Burmese corpus.

Korean Text to Gloss: Self-Supervised Learning approach

  • Thanh-Vu Dang;Gwang-hyun Yu;Ji-yong Kim;Young-hwan Park;Chil-woo Lee;Jin-Young Kim
    • 스마트미디어저널 / Vol. 12, No. 1 / pp. 32-46 / 2023
  • Natural Language Processing (NLP) has grown tremendously in recent years. Bilingual and multilingual translation models have been deployed widely in machine translation and have gained vast attention from the research community. In contrast, few studies have focused on translating between spoken and sign languages, especially for non-English languages. Prior works on Sign Language Translation (SLT) have shown that a mid-level sign gloss representation enhances translation performance. Therefore, this study presents a new large-scale Korean sign language dataset, the Museum-Commentary Korean Sign Gloss (MCKSG) dataset, including 3828 pairs of Korean sentences and their corresponding sign glosses used in Museum-Commentary contexts. In addition, we propose a translation framework based on self-supervised learning, where the pretext task is a text-to-text task from a Korean sentence to its back-translation versions; the pre-trained network is then fine-tuned on the MCKSG dataset. Using self-supervised learning helps to overcome the shortage of sign language data. In our experiments, the proposed model outperforms a baseline BERT model by 6.22%.

Arabic Words Extraction and Character Recognition from Picturesque Image Macros with Enhanced VGG-16 based Model Functionality Using Neural Networks

  • Ayed Ahmad Hamdan Al-Radaideh;Mohd Shafry bin Mohd Rahim;Wad Ghaban;Majdi Bsoul;Shahid Kamal;Naveed Abbas
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 17, No. 7 / pp. 1807-1822 / 2023
  • Innovation and rapidly increasing functionality in user-friendly smartphones have encouraged shutterbugs to capture picturesque image macros at work or while traveling. Formal signboards are placed with marketing objectives and are enriched with text to attract people. Extracting and recognizing text from natural images is an emerging research issue that needs consideration. Compared to conventional optical character recognition (OCR), the complex background, implicit noise, lighting, and orientation of these scenic text photos make this problem more difficult. Extracting and recognizing Arabic scene text adds a number of further complications and difficulties. The method described in this paper uses a two-phase methodology to extract Arabic text, with word-boundary awareness, from scenic images with varying text orientations. The first stage uses a convolutional autoencoder, and the second uses Arabic Character Segmentation (ACS), which is followed by traditional two-layer neural networks for recognition. This study presents how an Arabic training and synthetic dataset can be created to exemplify superimposed text in different scene images. For this purpose, a dataset of 10K cropped images containing Arabic text was created for the detection phase, and a dataset of 127K Arabic characters for the recognition phase. The phase-1 labels were generated from an Arabic corpus of 15K quotes and sentences. The Arabic Word Awareness Region Detection (AWARD) approach, which is highly flexible in identifying complex Arabic scene text such as arbitrarily oriented, curved, or deformed text, is used to detect these texts. Our experiments show that the system achieves a 91.8% word segmentation accuracy and a 94.2% character recognition accuracy. We believe that future researchers will advance the field of image processing by enhancing the functionality of VGG-16-based models with neural networks to reduce noise when processing scene text images in any language.

Relationship between the Conception Rate after Estrus Induction using CIDR and Other Parameters in Dairy Cows

  • 박철호;손창호
    • 한국수정란이식학회지 / Vol. 26, No. 1 / pp. 1-7 / 2011
  • The purpose of this study was to determine the relationship between conception rate and other parameters (body condition score, BCS; progesterone concentration; and follicle size) before estrus induction with CIDR (intravaginal progesterone-releasing controlled internal drug release). Regardless of artificial insemination (AI) time, the conception rate in cows with a BCS of < 2.75, 2.75 to 3.25, and > 3.25 at CIDR insertion was 46.6%, 63.3%, and 46.6%, respectively. Regardless of BCS, the conception rate was 54.9% in cows inseminated based on detected estrus and 48.7% in cows inseminated at 72 to 80 hours after removal of CIDR (timed artificial insemination, TAI). Regardless of AI time, the conception rate was 40.0% in cows with low progesterone concentrations (less than 1.0 ng/ml) and 56.6% in cows with high progesterone concentrations (more than 1.0 ng/ml) at CIDR insertion. Regardless of progesterone concentration, the conception rate was 53.8% in cows inseminated based on detected estrus and 38.0% in cows receiving TAI after removal of CIDR. Regardless of AI time, the conception rate was 43.3% in cows with a small follicle (less than 5 mm), 53.3% in cows with a follicle of 5 mm to 10 mm, and 63.3% in cows with a large follicle (more than 10 mm) at CIDR insertion. Regardless of follicle size, the conception rate was 58.4% in cows inseminated based on detected estrus and 45.9% in cows receiving TAI after removal of CIDR. These results indicate that the conception rate will be greater if cows with a BCS of 2.75 to 3.25, an active corpus luteum, and/or a large dominant follicle (more than 10 mm) are used for estrus induction.

A Study on the Law2Vec Model for Searching Related Law

  • 김나리;김형중
    • 디지털콘텐츠학회 논문지 / Vol. 18, No. 7 / pp. 1419-1425 / 2017
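  • The ultimate goal of legal knowledge search is to obtain the best legal information on the basis of statutes and precedents. Recently, text mining research has been actively pursued to achieve efficient search over large-scale data. A representative method is the word embedding algorithm, a neural-network-based learning method. In this paper, we studied a method for searching related information by applying word embedding to Korean statute information. First, the statutes referenced in each precedent were extracted in order and used as the input to the model. From the extracted referenced statutes, we built the Law2Vec model, which learns and embeds the surrounding statutes relative to a center statute. Using this model, we trained on the statutes and inferred the relationships among them. To evaluate the model, we computed precision and recall to verify whether the results derived as related statutes were closely related to the keyword. The experimental results showed that the proposed method is more useful than the existing keyword search method for inferring related statutes.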
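
As a rough illustration of the two steps named in this abstract, the sketch below builds (center, context) training pairs from an ordered list of referenced statutes and then computes precision and recall for a retrieved set. It assumes a skip-gram-style window, which the abstract does not confirm; the statute identifiers, window size, and function names are hypothetical.

```python
def center_context_pairs(sequence, window=1):
    """Pair each center item with its neighbours inside the window."""
    pairs = []
    for i, center in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs


def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical statutes referenced, in order, by one precedent.
refs = ["law_A", "law_B", "law_C"]
pairs = center_context_pairs(refs, window=1)
print(pairs)

# Hypothetical evaluation: two statutes retrieved, two actually relevant.
print(precision_recall(["law_B", "law_D"], ["law_B", "law_C"]))
```

Pairs of this kind would typically be fed to an embedding trainer; that training step is omitted here because the abstract gives no details about it.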

LSTM based sequence-to-sequence Model for Korean Automatic Word-spacing

  • 이태석;강승식
    • 스마트미디어저널 / Vol. 7, No. 4 / pp. 17-23 / 2018
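  • We present an RNN model based on LSTM (Long Short-Term Memory Neural Networks) that can effectively handle the characteristics of automatic word spacing, and we analyze the results of applying it. To address the difficulty of training a neural network when sentences are long or contain some noise, we define an input data format and a decoding data format, and we propose a method that improves performance by applying dropout, bidirectional multi-layer LSTM cells, layer normalization, and an attention mechanism during training. The Sejong corpus was used as training data. Although the training data contained partially incorrect spacing, the patterns of Korean word spacing were learned meaningfully from the large volume of training data. This is because dropout kept the model from overfitting, which made it robust to noise. In the experiments, the LSTM sequence-to-sequence model achieved an F1 score, an evaluation measure that considers both recall and precision, of 0.94, outperforming the rule-based method and the deep learning GRU-CRF model.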
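
To make the F1 measure mentioned in this abstract concrete, here is a minimal sketch that scores predicted word spacing against a reference by comparing the positions of spaces. Scoring at the level of space positions is an assumption; the abstract does not say how the paper computed F1, and the function names and sample sentences are illustrative.

```python
def space_positions(text):
    """Return the set of character counts (ignoring spaces) at which a space occurs."""
    positions = set()
    count = 0
    for ch in text:
        if ch == " ":
            positions.add(count)
        else:
            count += 1
    return positions


def spacing_f1(predicted, gold):
    """F1 over space positions: harmonic mean of precision and recall."""
    pred, ref = space_positions(predicted), space_positions(gold)
    hits = len(pred & ref)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)


gold = "나는 학교에 간다"        # reference spacing
predicted = "나는 학교 에 간다"  # one extra (wrong) space
print(spacing_f1(predicted, gold))
```

In this example the prediction recovers both reference spaces but adds one spurious space, so recall is 1 and precision is 2/3.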

Gliomatosis Cerebri : Clinical Features and Prognosis

  • 조대철;황정현;성주경;황성규;함인석;박연묵;변승열;김승래
    • Journal of Korean Neurosurgical Society / Vol. 30, No. 12 / pp. 1399-1405 / 2001
  • Objectives : Gliomatosis cerebri is an uncommon primary brain tumor characterized by diffuse neoplastic proliferation of glial cells, with preservation of the underlying cytoarchitecture. The aim of this study is to evaluate the clinical features and the outcomes of surgical treatment and adjuvant therapy of gliomatosis cerebri. Methods : Between Jan. 1990 and Dec. 2000, 12 patients were diagnosed with gliomatosis cerebri based on characteristic radiological and histological findings. The patients' ages ranged from 18 to 77 (mean 44) years, and the male to female ratio was 7 : 5. Nine patients underwent decompressive surgery and three underwent biopsy only. Postoperative radiation therapy was given in all cases except three. In addition to radiation therapy, four patients received chemotherapy. The mean follow-up period was 18.8 months. Results : The most common presenting symptoms were seizure and motor weakness. The mean duration of symptoms was 5.9 months. There were 5 bilateral lesions, and the tumor involved the corpus callosum in 5 patients, the basal ganglia-thalamus in 4, and the brain stem in 2. There was no operative mortality, but four patients died during follow-up. The mean survival period for 11 patients was 20.5 months from the time of diagnosis. In univariate analysis, lesions involving the corpus callosum, basal ganglia-thalamus, and brain stem correlated significantly with shorter survival (p<0.05). Also, postoperative radiation as an adjuvant therapy prolonged patient survival (p<0.05). Conclusions : In the management of gliomatosis cerebri patients, early detection by MR imaging, active management of increased intracranial pressure, decompressive surgical removal, and postoperative adjuvant therapy such as radiation are thought to be a good treatment modality.

An MRI-Based Quantification for Correlation of Imaging Biomarker and Clinical Performance in Chronic Phase of Carbon Monoxide Poisoning

  • Lee, Aleum;Hwang, Ji-sun;Bae, Won-kyung;Park, Jai-soung;Goo, Dong Erk;Park, Sung-Tae
    • Investigative Magnetic Resonance Imaging / Vol. 23, No. 3 / pp. 241-250 / 2019
  • Purpose: The purpose of this study was to determine the relation between quantitative magnetic resonance imaging biomarkers and clinical performance in the chronic phase of carbon monoxide intoxication. Materials and Methods: Eighteen magnetic resonance scans and cognitive evaluations were performed on patients with carbon monoxide intoxication in the chronic phase. Apparent diffusion coefficient (ADC) ratios of affected versus unaffected centrum semiovale and corpus callosum were obtained. Signal intensity (SI) ratios between affected centrum semiovale and normal pons in T2-FLAIR (fluid-attenuated inversion recovery) images were obtained. The Mini-Mental State Exam and clinical outcome scores were assessed. Correlation coefficients were calculated between MRI and clinical markers. Patients were further classified into poor-outcome and good-outcome groups based on clinical performance, and imaging parameters were compared. The T2-SI ratio of the centrum semiovale was compared with that of 18 sex-matched and age-matched controls. Results: The T2-SI ratio of the centrum semiovale was significantly higher in the poor-outcome group than in the good-outcome group and was strongly inversely correlated with results from the Mini-Mental State Exam. ADC ratios of the centrum semiovale were significantly lower in the poor-outcome group than in the good-outcome group and were moderately correlated with the Mini-Mental State Exam score. Conclusion: A higher T2-SI ratio and a lower ADC ratio in the centrum semiovale may indicate the presence of more severe white matter injury and clinical impairment. The T2-SI ratio and ADC values in the centrum semiovale are useful quantitative imaging biomarkers for correlation with clinical performance in individuals with carbon monoxide intoxication.