• Title/Summary/Keyword: sentence processing


The Method of Deriving Japanese Keyword Using Dependence (의존관계에 기초한 일본어 키워드 추출방법)

  • Lee, Tae-Hun; Jung, Kyu-Cheol; Park, Ki-Hong
    • The KIPS Transactions:PartB / v.10B no.1 / pp.41-46 / 2003
  • This paper proposes a method for extracting indexing keywords from Japanese text by composing compound nouns from individual words, using word- and sentence-level information together with rules over the sentences. Unlike previous approaches, it constructs generative rules for compound nouns based on dependency relations, derived from an analysis of the characteristics of keywords in text. To capture further keywords and reflect sentence content, it also suggests how to decide importance, taking into account restrictions on, and repetition of, the words involved in the generative rules. To verify the validity of the extracted keywords, we used the titles and abstracts of 65 Japanese papers on natural language and/or speech processing, and obtained 63% accuracy for the top-ranked output. (A toy sketch of the dependency-based merging idea follows.)
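
A minimal sketch of the dependency-based compound-noun idea, under assumptions of our own: tokens arrive with POS tags and dependency heads (e.g., from a morphological analyzer), head-final noun chains are merged into compound candidates, and candidates are ranked by repetition as a crude stand-in for the paper's importance measure. The authors' actual rule set is richer than this.

```python
# Toy dependency-based compound-noun keyword extraction (illustrative only).
from collections import Counter
from dataclasses import dataclass

@dataclass
class Token:
    text: str   # surface form
    pos: str    # "NOUN", "VERB", ...
    head: int   # index of dependency head, -1 for root

def compound_candidates(tokens: list) -> list:
    """Merge runs of nouns in which each noun depends on the next one
    (a head-final chain, as in Japanese) into compound-noun candidates."""
    candidates = []
    i = 0
    while i < len(tokens):
        j = i
        # extend while the current noun's head is the immediately following noun
        while (j + 1 < len(tokens)
               and tokens[j].pos == "NOUN"
               and tokens[j + 1].pos == "NOUN"
               and tokens[j].head == j + 1):
            j += 1
        if j > i:
            candidates.append("".join(t.text for t in tokens[i:j + 1]))
        i = j + 1
    return candidates

def rank_keywords(sentences: list, top_k: int = 10) -> list:
    """Score candidates by repetition across sentences and return the top_k."""
    counts = Counter()
    for sent in sentences:
        counts.update(compound_candidates(sent))
    return [w for w, _ in counts.most_common(top_k)]
```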

Intrusion Detection System based on Packet Payload Analysis using Transformer

  • Woo-Seung Park; Gun-Nam Kim; Soo-Jin Lee
    • Journal of the Korea Society of Computer and Information / v.28 no.11 / pp.81-87 / 2023
  • Intrusion detection systems that learn the metadata of network packets have been proposed recently. However, these approaches need time to analyze packets and generate the metadata for model training, plus time to pre-process that metadata before learning. In addition, a model trained on specific metadata cannot detect intrusions directly from the original packets flowing into the network. To address these problems, this paper proposes a natural language processing-based intrusion detection system that detects intrusions by learning each packet payload as a single sentence, without an additional conversion process. To verify the performance of the approach, we used the UNSW-NB15 dataset and Transformer models. First, the PCAP files of the dataset were labeled, and then two Transformer models (BERT, DistilBERT) were trained directly on the payloads in sentence form to analyze detection performance. The experimental results showed binary classification accuracies of 99.03% and 99.05%, respectively, similar or superior to the detection performance of techniques proposed in previous studies. For multi-class classification, the models achieved 86.63% and 86.36%, respectively, again improving on prior techniques. (A minimal payload-as-sentence sketch follows.)
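
A minimal sketch of the payload-as-sentence idea using Hugging Face Transformers: raw packet bytes are rendered as space-separated hex tokens and fed to a DistilBERT sequence classifier. The byte-to-token scheme and model name are assumptions for illustration; the paper's exact pre-processing and fine-tuning setup may differ, and the model below would still need fine-tuning on labeled payloads before its outputs mean anything.

```python
# Treat a packet payload as a "sentence" for a Transformer classifier.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)  # benign vs. attack

def payload_to_sentence(payload: bytes, max_bytes: int = 256) -> str:
    # "4a 6f 68 6e ..." -- each byte becomes one pseudo-word
    return " ".join(f"{b:02x}" for b in payload[:max_bytes])

def classify(payload: bytes) -> int:
    inputs = tokenizer(payload_to_sentence(payload),
                       truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1))  # 0 = benign, 1 = attack (after fine-tuning)
```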

Korean Word Sense Disambiguation using Dictionary and Corpus (사전과 말뭉치를 이용한 한국어 단어 중의성 해소)

  • Jeong, Hanjo; Park, Byeonghwa
    • Journal of Intelligence and Information Systems / v.21 no.1 / pp.1-13 / 2015
  • As opinion mining in big data applications has been highlighted, a great deal of research on unstructured data has been conducted. Social media on the Internet generate unstructured or semi-structured data every second, largely in the natural languages we use in daily life. Many words in human languages have multiple meanings or senses, which makes it very difficult for computers to extract useful information from these datasets. Traditional web search engines are usually based on keyword search, which can produce results far from users' intentions, and although much progress has been made over the years in improving them, there is still considerable room for improvement. Word sense disambiguation plays a very important role in natural language processing and is considered one of the most difficult problems in the area. Major approaches can be classified as knowledge-based, supervised corpus-based, and unsupervised corpus-based. This paper presents a method that automatically generates a corpus for word sense disambiguation from the examples in existing dictionaries, avoiding expensive sense-tagging processes. We evaluate the method with a Naïve Bayes model, a supervised learning algorithm, using the Standard Korean Unabridged Dictionary and the Sejong Corpus. The dictionary contains approximately 57,000 sentences; the Sejong Corpus contains about 790,000 sentences tagged with both part-of-speech and sense information. For the experiments, the two resources were used both combined and separately, with cross-validation. Only nouns, the target category for word sense disambiguation here, were selected: 93,522 word senses among 265,655 nouns, plus 56,914 sentences from related proverbs and examples, were added to the corpus. The Sejong Corpus merged easily with the dictionary because it is tagged with the sense indices the dictionary defines. Sense vectors were formed from the merged corpus, and the terms used in creating them were added to the named-entity dictionary of a Korean morphological analyzer. Using this extended dictionary, term vectors were extracted from input sentences, and sense-tagged terms were determined by vector-space-model-based word sense disambiguation against the sense vectors built during pre-processing. The experiments show that the merged corpus yields better precision and recall than either resource alone. The study suggests this approach can practically enhance the performance of Internet search engines and support more accurate sentence understanding in natural language processing applications such as search, opinion mining, and text mining. The Naïve Bayes classifier used here is a supervised learning algorithm based on Bayes' theorem and assumes that all senses are independent. Even though this independence assumption is unrealistic and ignores correlations between attributes, the Naïve Bayes classifier is widely used because of its simplicity, and in practice it is known to be very effective in applications such as text classification and medical diagnosis. However, further research is needed to consider all possible combinations, or partial combinations, of the senses in a sentence. The effectiveness of word sense disambiguation may also improve if rhetorical structures or morphological dependencies between words are analyzed through syntactic analysis. (An illustrative Naïve Bayes sketch follows.)
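
An illustrative Naïve Bayes disambiguator in the spirit of the paper, built with scikit-learn: dictionary example sentences serve as sense-labeled training data, and bag-of-words features feed a multinomial Naïve Bayes classifier. The tiny English data below is purely hypothetical; the authors work on Korean nouns with morphological analysis and sense vectors.

```python
# Dictionary examples as a sense-tagged corpus for Naïve Bayes WSD.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# (context sentence, sense id of the ambiguous target word) pairs,
# e.g. harvested from dictionary example sentences -- hypothetical data
train = [
    ("the bank raised its interest rates", "bank/finance"),
    ("deposits are insured by the bank", "bank/finance"),
    ("we picnicked on the river bank", "bank/river"),
    ("erosion wore away the bank of the stream", "bank/river"),
]
texts, senses = zip(*train)

wsd = make_pipeline(CountVectorizer(), MultinomialNB())
wsd.fit(texts, senses)

print(wsd.predict(["she walked along the bank of the river"])[0])  # bank/river
```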

Natural Photography Generation with Text Guidance from Spherical Panorama Image (360 영상으로부터 텍스트 정보를 이용한 자연스러운 사진 생성)

  • Kim, Beomseok; Jung, Jinwoong; Hong, Eunbin; Cho, Sunghyun; Lee, Seungyong
    • Journal of the Korea Computer Graphics Society / v.23 no.3 / pp.65-75 / 2017
  • Because a 360-degree image carries information from all directions, it often contains too much information. Moreover, to examine a 360-degree image on a 2D display, a user has to either click and drag the image with a mouse or project it to a 2D panorama, which inevitably introduces severe distortions. Consequently, inspecting a 360-degree image and finding an object of interest in it can be tedious. To resolve this issue, this paper proposes a method that finds a region of interest in a given 360-degree image and produces a natural-looking 2D image that best matches a description given by the user in a natural language sentence. The method also considers photo composition so that the resulting image is aesthetically pleasing. It first converts the 360-degree image to a 2D cubemap. Since objects in a 360-degree image may appear distorted or split into multiple pieces in a typical cubemap, causing detection failures, we introduce a modified cubemap. The method then applies a Long Short-Term Memory (LSTM) network based object detection method to find the region of interest matching the given sentence. Finally, it produces an image that contains the detected region and has aesthetically pleasing composition. (A minimal sketch of the underlying cubemap projection follows.)
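
For reference, a minimal equirectangular-to-cubemap face sampler. This is the standard projection; the paper's modified cubemap alters the face layout to avoid splitting objects, which is not reproduced here. numpy only, nearest-neighbour sampling, front face only.

```python
# Sample the front (+z) cubemap face from an equirectangular panorama.
import numpy as np

def cubemap_face(pano: np.ndarray, face_size: int = 256) -> np.ndarray:
    """pano: equirectangular image of shape (H, W, 3); returns the front face."""
    h, w = pano.shape[:2]
    # pixel grid on the face plane z = 1, with x, y in [-1, 1]
    xs = np.linspace(-1, 1, face_size)
    ys = np.linspace(-1, 1, face_size)
    x, y = np.meshgrid(xs, ys)
    z = np.ones_like(x)
    # viewing direction -> spherical coordinates
    lon = np.arctan2(x, z)                      # azimuth, [-pi/4, pi/4] on this face
    lat = np.arctan2(y, np.sqrt(x**2 + z**2))   # elevation
    # spherical -> equirectangular pixel coordinates (nearest neighbour)
    px = ((lon / np.pi + 1.0) * 0.5 * (w - 1)).astype(int)
    py = ((lat / (np.pi / 2) + 1.0) * 0.5 * (h - 1)).astype(int)
    return pano[py, px]
```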

The Design of Keyword Spotting System based on Auditory Phonetical Knowledge-Based Phonetic Value Classification (청음 음성학적 지식에 기반한 음가분류에 의한 핵심어 검출 시스템 구현)

  • Kim, Hack-Jin; Kim, Soon-Hyub
    • The KIPS Transactions:PartB / v.10B no.2 / pp.169-178 / 2003
  • This study addresses two issues: the classification of phone-likely units (PLUs), the foundation of Korean large-vocabulary speech recognition, and the effectiveness of the Korean Chiljongseong (7 final consonants) and Paljongseong (8 final consonants) systems. PLU classification groups phonemes phonetically according to the place and manner of articulation, and about 50 phone-likely units are typically used in Korean speech recognition. In this study, auditory-phonetic knowledge was applied to PLU classification to produce a set of 45 units: the vowels 'ㅔ, ㅐ' were merged into the phone-likely unit [ee], 'ㅒ, ㅖ' into [ye], and 'ㅚ, ㅙ, ㅞ' into [we]. Secondly, the Chiljongseong system of the draft unified spelling system currently in use and the Paljongseonggajokyong of the Hunminjeongeum Haerye were examined. Whether the phonetic values of 'ㄷ' and 'ㅅ' are the same when used as final consonants in Korean has long been debated in the academic community. This study traces the transition stages of Korean consonants, applies both Chiljongseong and Paljongseonggajokyong to speech recognition, and verifies their effectiveness. The experiments covered isolated word recognition and continuous speech recognition. For isolated word recognition, the PBW452 word set was used: about 50 men and women, divided into 5 groups, each vocalized 50 words. For the continuous speech recognition experiment, intended for a stock exchange system, a corpus of 71 stock exchange sentences and a matching speech corpus were collected, with 5 men and 5 women each vocalizing the sentences twice. In the results, using Paljongseonggajokyong for the final consonants raised recognition performance by about 1.45% on average, and applying the auditory-phonetics-based PLUs together with Paljongseonggajokyong increased the recognition rate by 1.5% to 2.02% on average. In the continuous speech recognition experiment, recognition performance improved by about 1% to 2% on average over the existing 49 or 56 phone-likely units. (A toy PLU-merging sketch follows.)
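
A toy rendering of the PLU-merging step named in the abstract: the listed vowels collapse into shared phone-likely units via a lookup table. The bracketed unit names follow the abstract; everything else is an illustrative assumption, not the authors' full 45-unit inventory.

```python
# Collapse acoustically similar Korean vowels into shared phone-likely units,
# covering only the merges named in the abstract.
PLU_MERGE = {
    "ㅔ": "[ee]", "ㅐ": "[ee]",
    "ㅒ": "[ye]", "ㅖ": "[ye]",
    "ㅚ": "[we]", "ㅙ": "[we]", "ㅞ": "[we]",
}

def to_plu(jamo_seq: list) -> list:
    """Map a sequence of jamo symbols to phone-likely units,
    collapsing the merged vowel classes; other symbols pass through."""
    return [PLU_MERGE.get(j, j) for j in jamo_seq]

print(to_plu(["ㄱ", "ㅐ"]))  # ['ㄱ', '[ee]']
```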

Linguistic Productivity and Chomskyan Grammar: A Critique (언어창조성과 춈스키 문법 비판)

  • Bong-rae Seok
    • Lingua Humanitatis / v.1 no.1 / pp.235-251 / 2001
  • According to Chomskyan grammar, humans can generate and understand an unbounded number of grammatical sentences. This linguistic productivity is posited against the background of a pure, idealized linguistic competence. In actual utterances, however, productivity faces many limitations, which are said to stem from general performance constraints such as short-term memory capacity or attention. In this paper I discuss a problem with this idealized productivity, arguing that it idealizes our linguistic competence too much. By separating idealized competence from the various constraints on performance, Chomskyan theorists can argue for unlimited productivity. However, the absolute distinction between grammar (pure competence) and parser (actual psychological processes) makes little sense when we explain the low acceptability (intelligibility) of center-embedded sentences. Usually, the problem of center-embedded sentences is explained in terms of memory shortage or other performance constraints. To explain the low acceptability, however, we need to assume a specialized memory structure, because the low acceptability occurs only with a specific type of syntactic pattern. I argue that this special memory structure should not be considered a general performance constraint: it is a domain-specific (specifically linguistic) constraint and an intrinsic part of human language processing. A recent development of Chomskyan grammar, the minimalist approach, seems to close the gap between pure competence and this type of specialized constraint. Chomsky's earlier generative grammar focuses on the end result of a generative derivation, whereas the economy principle of the minimalist approach focuses on the actual derivational process. With a less mathematical, less idealized grammar, we come closer to the actual computational processes that build the syntactic structure of a sentence, and thus to a more concrete picture of our linguistic competence, one not detached from actual computational processes.


Building Sentence Meaning Identification Dataset Based on Social Problem-Solving R&D Reports (사회문제 해결 연구보고서 기반 문장 의미 식별 데이터셋 구축)

  • Hyeonho Shin; Seonki Jeong; Hong-Woo Chun; Lee-Nam Kwon; Jae-Min Lee; Kanghee Park; Sung-Pil Choi
    • KIPS Transactions on Software and Data Engineering / v.12 no.4 / pp.159-172 / 2023
  • In general, social problem-solving research aims to create important social value by offering meaningful answers to pressing social issues using scientific technologies. Although numerous and extensive research attempts have been made to alleviate social problems nationwide, many important social challenges remain. To facilitate the entire process of social problem-solving research and maximize its efficacy, it is vital to clearly identify and grasp the important and pressing problems to focus on. The problem-discovery step could be drastically improved if current social issues could be identified automatically from existing R&D resources such as technical reports and articles. This paper introduces a comprehensive dataset for building machine learning models that automatically detect social problems and their solutions in national research reports. We first collected a total of 700 research reports on social problems and issues. Through an intensive annotation process, we then built 24,022 sentences in total, each labeled with a category closely related to social problem-solving, such as problem, purpose, solution, or effect. Furthermore, we implemented four sentence classification models based on various neural language models and conducted a series of performance experiments on our dataset. In the experiments, the model fine-tuned from the KLUE-BERT pre-trained language model performed best, with an accuracy of 75.853% and an F1 score of 63.503%. (A fine-tuning skeleton follows.)
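
A skeleton of the sentence-classification setup, assuming the KLUE-BERT checkpoint published as klue/bert-base on the Hugging Face hub. The label set below paraphrases the categories named in the abstract, and real use would require fine-tuning on the 24,022-sentence dataset before the predictions mean anything.

```python
# Classify a report sentence into a social problem-solving category.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["problem", "purpose", "solution", "effect", "other"]  # illustrative
tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "klue/bert-base", num_labels=len(LABELS))  # fine-tune before use

def predict(sentence: str) -> str:
    inputs = tokenizer(sentence, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]
```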

The Effect of Encoding strategy and Transfer Appropriate Processing on Prospective Memory Performance (부호화 전략 유형과 동시과제 처리 적절성이 미래계획기억 수행에 미치는 효과)

  • Park, Youngshin
    • Korean Journal of Cognitive Science / v.27 no.1 / pp.101-127 / 2016
  • The present study examined the effects of meta-cognitive strategy and transfer-appropriate processing (TAP) on prospective memory (PM) performance. In two experiments, the encoding strategy for PM target words was manipulated by instruction. Participants assigned to the meta-strategic condition rated task difficulty (EOL) and predicted their own performance (JOL), while participants in the cognitive-strategy condition memorized the target words through pleasantness ratings and sentence generation. In Experiments 1 and 2, participants in both conditions performed both a TAP ongoing task and a TIP ongoing task. The results revealed benefits of meta-cognition and transfer-appropriate processing on PM performance; furthermore, the TAP benefit was diminished in the cognitive-strategy condition. There were no costs on the judgment tasks across conditions. The findings suggest that meta-cognition helps sustain PM targets and intentions regardless of cognitive resources.


Processing of syntactic dependency in Korean relative clauses: Evidence from an eye-tracking study (안구이동추적을 통해 살펴본 관계절의 통사처리 과정)

  • Lee, Mi-Seon; Yong, Nam-Seok
    • Korean Journal of Cognitive Science / v.20 no.4 / pp.507-533 / 2009
  • This paper examines the time course and processing patterns of filler-gap dependencies in Korean relative clauses using an eye-tracking method. Participants listened to a short story while viewing four pictures of entities mentioned in the story. Each story was followed by an auditorily presented question involving a relative clause (subject relative or dative relative), and participants' eye movements in response to the question were recorded. The results showed that the proportion of looks to the picture corresponding to the filler noun increased significantly at the relative verb affixed with a relativizer and was largest at the filler, where fixation duration on the filler picture also increased significantly. These results suggest that online resolution of the filler-gap dependency begins only at the relative verb marked with the relativizer and is finally completed at the filler position. Accordingly, they partly support a filler-driven parsing strategy for Korean, as in head-initial languages. In addition, the different patterns of eye movements between subject relatives and dative relatives indicate the role of case markers in parsing Korean sentences.


A Method for Spelling Error Correction in Korean Using a Hangul Edit Distance Algorithm (한글 편집거리 알고리즘을 이용한 한국어 철자오류 교정방법)

  • Bak, Seung Hyeon; Lee, Eun Ji; Kim, Pan Koo
    • Smart Media Journal / v.6 no.1 / pp.16-21 / 2017
  • Considerable time has passed since computers, once primarily research tools, were commercialized and made available to the general public. Before then, people wrote with writing instruments; today, a growing number write with computers instead. Computerized word processing is faster and less tiring for the hands than handwriting, making it better suited to long texts. However, word processors also make it easy for users to introduce spelling errors by mistake. Errors that distort the shape of a word are easy for the writer to find and correct directly, but errors caused by gaps in the user's knowledge, or errors that are simply hard to spot, can make it nearly impossible to produce a document free of spelling mistakes. Spelling errors in important documents such as theses or business proposals can undermine their credibility, so research on high-quality spelling correction for the general public is necessary. This study presents a system that corrects sentence-level spelling errors using a Hangul similarity (edit distance) algorithm. Based on findings in the related literature that correct words are significantly similar in form to their misspellings, spelling errors were extracted from a corpus, and misspelled words detected by the error-detection algorithm were replaced with the extracted correct words. (A jamo-level edit distance sketch follows.)
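
A sketch of a jamo-level Hangul edit distance, one plausible reading of the paper's similarity measure: each precomposed syllable is decomposed into initial/medial/final jamo using the Unicode Hangul layout, and Levenshtein distance is computed over the jamo sequence, so that 한 vs. 핸 costs one edit rather than a whole-syllable substitution. The candidate-selection policy is simplified relative to the paper.

```python
# Jamo-level edit distance for Korean spelling correction.
SYL_BASE = 0xAC00   # first precomposed Hangul syllable (가)
N_SYL = 11172       # number of precomposed syllables
PER_LEAD = 588      # 21 medial vowels x 28 final-consonant slots
PER_VOWEL = 28      # 28 final-consonant slots (incl. "no final")

def decompose(text: str) -> list:
    """Split precomposed Hangul syllables into (lead, vowel, tail) jamo indices;
    non-Hangul characters pass through unchanged."""
    jamo = []
    for ch in text:
        code = ord(ch) - SYL_BASE
        if 0 <= code < N_SYL:
            jamo += [("L", code // PER_LEAD),
                     ("V", (code % PER_LEAD) // PER_VOWEL),
                     ("T", code % PER_VOWEL)]
        else:
            jamo.append(("C", ch))
    return jamo

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed over decomposed jamo sequences."""
    x, y = decompose(a), decompose(b)
    prev = list(range(len(y) + 1))
    for i, xi in enumerate(x, 1):
        cur = [i]
        for j, yj in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                 # delete xi
                           cur[-1] + 1,                 # insert yj
                           prev[j - 1] + (xi != yj)))   # substitute
        prev = cur
    return prev[-1]

def best_correction(word: str, vocabulary: list) -> str:
    """Choose the vocabulary word with the smallest jamo edit distance."""
    return min(vocabulary, key=lambda v: edit_distance(word, v))

print(edit_distance("한글", "핸글"))  # 1: only the vowel jamo differs
```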