• Title/Summary/Keyword: Corpus-based Study


Trends and Future of Digital Personal Assistant (디지털 개인비서 동향과 미래)

  • Kwon, O.W.;Lee, K.Y.;Lee, Y.H.;Roh, Y.H.;Cho, M.S.;Huang, J.X.;Lim, S.J.;Choi, S.K.;Kim, Y.K.
    • Electronics and Telecommunications Trends
    • /
    • v.36 no.1
    • /
    • pp.1-11
    • /
    • 2021
  • In this study, we introduce trends in and the future of digital personal assistants. Recently, digital personal assistants have begun to handle many tasks as humans do by communicating with users in human language on smart devices such as smartphones, smart speakers, and smart cars. Their capabilities range from simple voice commands and chitchat to complex tasks such as device control, reservation, ordering, and scheduling. The digital personal assistants of the future will certainly speak like a person, have a person-like personality, see, hear, and analyze situations like a person, and become more human. Dialogue processing technology that makes them more human-like has developed into an end-to-end learning model based on deep neural networks in recent years. In addition, language models pre-trained on a large corpus make dialogue processing more natural and improve understanding. Advances in artificial intelligence such as dialogue processing technology will enable digital personal assistants to provide more familiar service with better performance in various areas.

Building Hybrid Stop-Words Technique with Normalization for Pre-Processing Arabic Text

  • Atwan, Jaffar
    • International Journal of Computer Science & Network Security
    • /
    • v.22 no.7
    • /
    • pp.65-74
    • /
    • 2022
  • In natural language processing, commonly used words such as prepositions are referred to as stop-words; they have no inherent meaning and are therefore ignored in indexing and retrieval tasks. The removal of stop-words from Arabic text has a significant impact in terms of reducing the size of a text corpus, which leads to an improvement in the effectiveness and performance of Arabic-language processing systems. This study investigated the effectiveness of applying stop-word list elimination with normalization as a preprocessing step. The idea was to merge a statistical method with a linguistic method to attain the best efficacy, and to compare the effects of this two-pronged approach in reducing corpus size for Arabic natural language processing systems. Three stop-word lists were considered: an Arabic Text Lookup Stop-list, a Frequency-based Stop-list using Zipf's law, and a Combined Stop-list. An experiment was conducted using a selected file from the Arabic Newswire data set. In the experiment, the size of the corpus was compared after removing the words contained in each list. The results showed that the best reduction in size was achieved by using the Combined Stop-list with normalization, with a word count reduction of 452930 and a compression rate of 30%.
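
The combined approach in this abstract can be illustrated with a minimal sketch. It is not the author's implementation: the three-word lookup list, the two normalization rules, and the `top_k` cutoff are all assumptions chosen for illustration, and the names `normalize`, `frequency_stoplist`, and `remove_stopwords` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical lookup list; a real Arabic stop-list would be far longer.
LOOKUP_STOPLIST = {"في", "من", "على"}

def normalize(token):
    """Apply two common Arabic normalization rules (an assumed rule set)."""
    token = re.sub(r"[\u064B-\u0652\u0640]", "", token)  # strip diacritics and tatweel
    token = re.sub(r"[أإآ]", "ا", token)                  # unify alef variants
    return token

def frequency_stoplist(tokens, top_k):
    """Return the top_k most frequent tokens (the head of a Zipf-like curve)."""
    return {word for word, _ in Counter(tokens).most_common(top_k)}

def remove_stopwords(tokens, top_k=1):
    """Normalize, drop words on the combined list, and report the size reduction."""
    tokens = [normalize(t) for t in tokens]
    combined = {normalize(w) for w in LOOKUP_STOPLIST} | frequency_stoplist(tokens, top_k)
    kept = [t for t in tokens if t not in combined]
    reduction = 1 - len(kept) / len(tokens) if tokens else 0.0
    return kept, reduction
```

Here `reduction` is the fraction of tokens removed, which is one plausible reading of the compression rate reported above. Note that a purely frequency-based list can also remove frequent content words, which is one reason to compare it against a curated lookup list.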

Burmese Sentiment Analysis Based on Transfer Learning

  • Mao, Cunli;Man, Zhibo;Yu, Zhengtao;Wu, Xia;Liang, Haoyuan
    • Journal of Information Processing Systems
    • /
    • v.18 no.4
    • /
    • pp.535-548
    • /
    • 2022
  • Using a rich resource language to classify sentiments in a language with few resources is a popular subject of research in natural language processing. Burmese is a low-resource language. In light of the scarcity of labeled training data for sentiment classification in Burmese, in this study, we propose a method of transfer learning for sentiment analysis of a language that uses the feature transfer technique on sentiments in English. This method generates a cross-language word-embedding representation of Burmese vocabulary to map Burmese text to the semantic space of English text. A model to classify sentiments in English is then pre-trained using a convolutional neural network and an attention mechanism, where the network shares the model for sentiment analysis of English. The parameters of the network layer are used to learn the cross-language features of the sentiments, which are then transferred to the model to classify sentiments in Burmese. Finally, the model was tuned using the labeled Burmese data. The results of the experiments show that the proposed method can significantly improve the classification of sentiments in Burmese compared to a model trained using only a Burmese corpus.

Korean Text to Gloss: Self-Supervised Learning approach

  • Thanh-Vu Dang;Gwang-hyun Yu;Ji-yong Kim;Young-hwan Park;Chil-woo Lee;Jin-Young Kim
    • Smart Media Journal
    • /
    • v.12 no.1
    • /
    • pp.32-46
    • /
    • 2023
  • Natural Language Processing (NLP) has grown tremendously in recent years. Typically, bilingual and multilingual translation models have been deployed widely in machine translation and have gained vast attention from the research community. In contrast, few studies have focused on translating between spoken and sign languages, especially for non-English languages. Prior works on Sign Language Translation (SLT) have shown that a mid-level sign gloss representation enhances translation performance. Therefore, this study presents a new large-scale Korean sign language dataset, the Museum-Commentary Korean Sign Gloss (MCKSG) dataset, including 3828 pairs of Korean sentences and their corresponding sign glosses used in Museum-Commentary contexts. In addition, we propose a translation framework based on self-supervised learning, where the pretext task is text-to-text translation from a Korean sentence to its back-translated versions; the pre-trained network is then fine-tuned on the MCKSG dataset. Using self-supervised learning helps to overcome the shortage of sign language data. In our experiments, the proposed model outperforms a baseline BERT model by 6.22%.

Arabic Words Extraction and Character Recognition from Picturesque Image Macros with Enhanced VGG-16 based Model Functionality Using Neural Networks

  • Ayed Ahmad Hamdan Al-Radaideh;Mohd Shafry bin Mohd Rahim;Wad Ghaban;Majdi Bsoul;Shahid Kamal;Naveed Abbas
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.17 no.7
    • /
    • pp.1807-1822
    • /
    • 2023
  • Innovation and rapidly increasing functionality in user-friendly smartphones have encouraged shutterbugs to capture picturesque image macros in work environments or during travel. Formal signboards are placed with marketing objectives and are enriched with text to attract people. Extracting and recognizing text from natural images is an emerging research issue that needs consideration. Compared to conventional optical character recognition (OCR), the complex background, implicit noise, lighting, and orientation of these scenic text photos make this problem more difficult. Scene text extraction and recognition for the Arabic language adds a number of complications and difficulties. The method described in this paper uses a two-phase methodology to extract Arabic text with word-boundary awareness from scenic images with varying text orientations. The first stage uses a convolutional autoencoder, and the second uses Arabic Character Segmentation (ACS), which is followed by traditional two-layer neural networks for recognition. This study presents how a synthetic Arabic training dataset can be created to exemplify superimposed text in different scene images. For this purpose, a dataset of 10K cropped images in which Arabic text was found was created for the detection phase, and a dataset of 127K Arabic characters was created for the recognition phase. The phase-1 labels were generated from an Arabic corpus of 15K quotes and sentences. The Arabic Word Awareness Region Detection (AWARD) approach, which is highly flexible in identifying complex Arabic scene text such as text that is arbitrarily oriented, curved, or deformed, is used to detect these texts. Our experiments show that the system achieves a 91.8% word segmentation accuracy and a 94.2% character recognition accuracy. We believe that future research in image processing can build on this work to reduce noise when processing scene text images in any language by enhancing the functionality of VGG-16-based models using neural networks.

Relationship between the Conception Rate after Estrus Induction using CIDR and Other Parameters in Dairy Cows (젖소에서 CIDR 투여에 의한 발정 유도 후 수태율과 다른 인자와의 관계)

  • Park, Chul-Ho;Son, Chang-Ho
    • Journal of Embryo Transfer
    • /
    • v.26 no.1
    • /
    • pp.1-7
    • /
    • 2011
  • The purpose of this study was to determine the relationship between conception rate and other parameters (body condition score, BCS; progesterone concentration; and follicle size) before estrus induction with CIDR (intravaginal progesterone-releasing controlled internal drug release). The conception rate regardless of AI (artificial insemination) time was 46.6%, 63.3%, and 46.6% in cows with a BCS of < 2.75, 2.75 to 3.25, and > 3.25 at CIDR insertion, respectively. The conception rate regardless of BCS was 54.9% in cows inseminated based on detected estrus, and 48.7% in cows inseminated at 72 to 80 hours (timed artificial insemination, TAI) after removal of CIDR. The conception rate regardless of AI time was 40.0% in cows with low progesterone concentrations (less than 1.0 ng/ml), and 56.6% in cows with high progesterone concentrations (more than 1.0 ng/ml) at CIDR injection. The conception rate regardless of progesterone concentrations was 53.8% in cows inseminated based on detected estrus, and 38.0% in cows subjected to TAI after removal of CIDR. The conception rate regardless of AI time was 43.3% in cows with small follicles (less than 5 mm), 53.3% in cows with follicles of 5 mm to 10 mm, and 63.3% in cows with large follicles (more than 10 mm) at CIDR injection, respectively. The conception rate regardless of follicle size was 58.4% in cows inseminated based on detected estrus, and 45.9% in cows subjected to TAI after removal of CIDR. These results indicated that if cows with a BCS of 2.75 to 3.25, an active corpus luteum, and/or a large dominant follicle (more than 10 mm) are used for estrus induction, the conception rate will be greater.

A Study on the Law2Vec Model for Searching Related Law (연관법령 검색을 위한 워드 임베딩 기반 Law2Vec 모형 연구)

  • Kim, Nari;Kim, Hyoung Joong
    • Journal of Digital Contents Society
    • /
    • v.18 no.7
    • /
    • pp.1419-1425
    • /
    • 2017
  • The ultimate goal of legal knowledge search is to obtain optimal legal information based on laws and precedents. Text mining research is actively being undertaken to meet the need for efficient retrieval from large-scale data. A typical method is to use a word embedding algorithm based on neural networks. This paper demonstrates how to search for relevant information by applying word embedding to Korean law information. First, we extract the reference laws from precedents in order and take them as the input of Law2Vec. The model learns a law by predicting its surrounding context laws. The algorithm then moves over each law in the corpus and repeats the training step. After training finished, we could infer the relationships between the laws via the embedding method. The search performance was evaluated based on precision and recall, which are computed from how closely the results are associated with the search terms. The test results showed that the proposed method is much more useful for extracting related laws than existing systems that rely on keyword search alone.
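
The training signal and evaluation described in this abstract can be sketched as follows. This is a hedged illustration and not the authors' code: it assumes a skip-gram style setup in which each law cited in a precedent is paired with the laws cited near it, and the window size, the law identifiers, and the function names `skipgram_pairs` and `precision_recall` are all hypothetical.

```python
def skipgram_pairs(precedents, window=2):
    """Build (center_law, context_law) pairs from ordered law citations.

    `precedents` is a list of lists; each inner list holds the laws
    referenced by one precedent, in citation order.
    """
    pairs = []
    for laws in precedents:
        for i, center in enumerate(laws):
            lo, hi = max(0, i - window), min(len(laws), i + window + 1)
            pairs.extend((center, laws[j]) for j in range(lo, hi) if j != i)
    return pairs

def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical identifiers standing in for real statutes.
pairs = skipgram_pairs([["CivilAct", "CommercialAct", "LaborAct"]], window=1)
```

An embedding model trained on such pairs places laws that are cited together close to each other, so related laws can then be retrieved by nearest-neighbor search in the embedding space.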

LSTM based sequence-to-sequence Model for Korean Automatic Word-spacing (LSTM 기반의 sequence-to-sequence 모델을 이용한 한글 자동 띄어쓰기)

  • Lee, Tae Seok;Kang, Seung Shik
    • Smart Media Journal
    • /
    • v.7 no.4
    • /
    • pp.17-23
    • /
    • 2018
  • We proposed an LSTM-based RNN model that can effectively perform automatic word spacing. For long or noisy sentences, which are known to be difficult to handle in neural network learning, we defined a proper input data format and decoding data format, and added dropout, bidirectional multi-layer LSTM, layer normalization, and an attention mechanism to improve the performance. Despite the fact that the Sejong corpus contains some spacing errors, the noise-robust learning model developed in this study, which avoids overfitting through dropout, trained well and returned meaningful results for Korean word spacing and its patterns. The experimental results showed that the LSTM sequence-to-sequence model achieves an F1-measure of 0.94, which is better than the rule-based deep-learning method of GRU-CRF.
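
The abstract does not specify its input and decoding formats. The sketch below shows one common way to frame word spacing as sequence labeling, under the assumption that the input is the unspaced character sequence and the target is a binary label per character (1 if a space follows). The function names and the F1 definition over space labels are illustrative assumptions, not the authors' setup.

```python
def to_spacing_example(sentence):
    """Split a correctly spaced sentence into unspaced characters and space labels."""
    chars, labels = [], []
    for ch in sentence.strip():
        if ch == " ":
            if labels:
                labels[-1] = 1  # a space follows the previous character
        else:
            chars.append(ch)
            labels.append(0)
    return chars, labels

def apply_spacing(chars, labels):
    """Rebuild a spaced sentence from characters and predicted labels."""
    return "".join(ch + (" " if lab else "") for ch, lab in zip(chars, labels)).rstrip()

def f1_score(gold, pred):
    """F1 over positions labeled 1 (a space follows the character)."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

chars, labels = to_spacing_example("나는 학교에 간다")
```

A model trained on such pairs predicts the label sequence from the unspaced characters, and `apply_spacing` turns the predictions back into text.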

Gliomatosis Cerebri : Clinical Features and Prognosis (대뇌 신경교종증 : 임상특징 및 예후에 관한 연구)

  • Jo, Dae-Chuol;Hwang, Jeong-Hyun;Sung, Joo-Kyung;Hwang, Sung-Kyu;Hamm, In-Suk;Park, Yeun-Mook;Byun, Seung-Yul;Kim, Seung-Lae
    • Journal of Korean Neurosurgical Society
    • /
    • v.30 no.12
    • /
    • pp.1399-1405
    • /
    • 2001
  • Objectives: Gliomatosis cerebri is an uncommon primary brain tumor characterized by diffuse neoplastic proliferation of glial cells, with the preservation of the underlying cytoarchitecture. The aim of this study is to evaluate the clinical features and the outcomes of surgical treatment and adjuvant therapy of gliomatosis cerebri. Methods: Between Jan. 1990 and Dec. 2000, 12 patients were diagnosed with gliomatosis cerebri based on characteristic radiological and histological findings. The patients' ages ranged from 18 to 77 (mean 44) years and the male to female ratio was 7:5. Nine patients underwent decompressive surgery and three underwent biopsy only. Postoperative radiation therapy was given in all cases except three. In addition to radiation therapy, four patients received chemotherapy. The mean duration of the follow-up period was 18.8 months. Results: The most common presenting symptoms were seizure and motor weakness. The mean duration of symptoms was 5.9 months. There were 5 bilateral lesions, and the tumor involved the corpus callosum in 5 patients, the basal ganglia-thalamus in 4, and the brain stem in 2. There was no operative mortality, but four patients died during the follow-up. The mean survival period for 11 patients was 20.5 months from the time of diagnosis. In univariate analysis, lesions involving the corpus callosum, basal ganglia-thalamus, and brain stem correlated significantly with shorter survival (p<0.05). Also, postoperative radiation as an adjuvant therapy prolonged the patients' survival (p<0.05). Conclusions: In the management of gliomatosis cerebri patients, early detection by MR imaging, active management of increased intracranial pressure, decompressive surgical removal, and postoperative adjuvant therapy such as radiation are thought to constitute a good treatment modality.


An MRI-Based Quantification for Correlation of Imaging Biomarker and Clinical Performance in Chronic Phase of Carbon Monoxide Poisoning

  • Lee, Aleum;Hwang, Ji-sun;Bae, Won-kyung;Park, Jai-soung;Goo, Dong Erk;Park, Sung-Tae
    • Investigative Magnetic Resonance Imaging
    • /
    • v.23 no.3
    • /
    • pp.241-250
    • /
    • 2019
  • Purpose: The purpose of this study was to determine the relation between quantitative magnetic resonance imaging biomarkers and clinical performance in the chronic phase of carbon monoxide intoxication. Materials and Methods: Eighteen magnetic resonance scans and cognitive evaluations were performed on patients with carbon monoxide intoxication in the chronic phase. Apparent diffusion coefficient (ADC) ratios of affected versus unaffected centrum semiovale and corpus callosum were obtained. Signal intensity (SI) ratios between affected centrum semiovale and normal pons in T2-FLAIR (fluid-attenuated inversion recovery) images were obtained. The Mini-Mental State Exam and clinical outcome scores were assessed. Correlation coefficients were calculated between MRI and clinical markers. Patients were further classified into poor-outcome and good-outcome groups based on clinical performance, and imaging parameters were compared. The T2-SI ratio of the centrum semiovale was compared with that of 18 sex-matched and age-matched controls. Results: The T2-SI ratio of the centrum semiovale was significantly higher in the poor-outcome group than in the good-outcome group and was strongly inversely correlated with results from the Mini-Mental State Exam. ADC ratios of the centrum semiovale were significantly lower in the poor-outcome group than in the good-outcome group, and were moderately correlated with the Mini-Mental State Exam score. Conclusion: A higher T2-SI ratio and a lower ADC ratio in the centrum semiovale may indicate the presence of more severe white matter injury and clinical impairment. The T2-SI ratio and ADC values in the centrum semiovale are useful quantitative imaging biomarkers for correlation with clinical performance in individuals with carbon monoxide intoxication.