• Title/Summary/Keyword: word filtering (단어 필터링)


Korean Mobile Spam Filtering System Considering Characteristics of Text Messages (문자메시지의 특성을 고려한 한국어 모바일 스팸필터링 시스템)

  • Sohn, Dae-Neung;Lee, Jung-Tae;Lee, Seung-Wook;Shin, Joong-Hwi;Rim, Hae-Chang
    • Journal of the Korea Academia-Industrial cooperation Society / v.11 no.7 / pp.2595-2602 / 2010
  • This paper introduces a mobile spam filtering system that considers the style of short text messages sent to mobile phones when detecting spam. The proposed system relies not only on the occurrence of content words, as previously suggested, but also on style information, to reduce critical cases in which legitimate messages containing spam words are misclassified as spam. Moreover, the accuracy of spam classification is improved by normalizing the messages through the correction of word spacing and spelling errors. Experimental results using real-world Korean text messages show that the proposed system is effective for Korean mobile spam filtering.

Competition Relation Extraction based on Combining Machine Learning and Filtering (기계학습 및 필터링 방법을 결합한 경쟁관계 인식)

  • Lee, ChungHee;Seo, YoungHoon;Kim, HyunKi
    • Journal of KIISE / v.42 no.3 / pp.367-378 / 2015
  • This study designs a hybrid algorithm for competition relation extraction. Previous works on relation extraction have relied on various lexical and deep parsing indicators and have mostly used machine learning alone. We present a new algorithm integrating machine learning with various filtering methods. Some simple but useful features for competition relation extraction are also introduced, and an optimum feature set is proposed. The goal of this paper was to increase the precision of competition relation extraction by combining supervised learning with various filtering methods. Filtering methods were employed for classifying whether a competition relation occurs, for filtering feature pairs using a distance restriction, and for classifying whether or not the candidate entity pair is spam. For evaluation, a test set consisting of 2,565 sentences was examined. The proposed method was compared with the rule-based method and a general relation extraction method. The rule-based method achieved a positive precision of 0.812 and an accuracy of 0.568, while the general relation extraction method achieved 0.612 and 0.563, respectively. The proposed system obtained a positive precision of 0.922 and an accuracy of 0.713. These results demonstrate that the developed method is effective for competition relation extraction.

A Phoneme-based Approximate String Searching System for Restricted Korean Character Input Environments (제한된 한글 입력환경을 위한 음소기반 근사 문자열 검색 시스템)

  • Yoon, Tai-Jin;Cho, Hwan-Gue;Chung, Woo-Keun
    • Journal of KIISE:Software and Applications / v.37 no.10 / pp.788-801 / 2010
  • Mobile devices are advancing remarkably, so research on mobile input methods is becoming increasingly important. Many input devices exist, such as keypads, QWERTY keypads, touch screens, and speech recognizers, but they are less convenient than typical keyboard-based desktop input devices, so input strings usually contain many typing errors. These input errors do not hinder communication between people, but they are a critical problem when searching a database such as a dictionary or an address book, because correct results cannot be obtained. In particular, Hangeul has more than 10,000 distinct characters because each Hangeul character is formed by combining consonants and vowels, so errors are more frequent than in English. The suffix tree is generally the most widely used data structure for handling errors in queries, but it is not sufficient for the variety of errors that occur. In this paper, we propose a fast approximate Korean word searching system that tolerates a variety of typing errors. The system includes several algorithms for applying general approximate string searching to Hangeul. We also present a profanity filter built on the proposed system, which filters more than 90% of coined profanities.
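
To illustrate why phoneme-level matching helps with Hangeul typing errors, the following is a minimal Python sketch, not the authors' system: it decomposes precomposed Hangeul syllables into their consonant and vowel letters using the standard Unicode arithmetic and compares words by plain edit distance over those letters. The function names (`decompose`, `edit_distance`, `phoneme_search`) and the linear scan over the word list are assumptions made for illustration; the paper's own algorithms and index structure are not reproduced here.

```python
# Illustrative sketch only: phoneme-level approximate matching for Hangeul.
# All names are hypothetical; this is not the algorithm from the paper.

HANGUL_BASE = 0xAC00  # code point of the first precomposed syllable
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"        # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"   # 21 vowels
JONGSEONG = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ",
             "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ",
             "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]  # 28 finals, "" = none


def decompose(text: str) -> str:
    """Replace each precomposed Hangeul syllable with its letters."""
    out = []
    for ch in text:
        idx = ord(ch) - HANGUL_BASE
        if 0 <= idx < 19 * 21 * 28:
            initial, rest = divmod(idx, 21 * 28)
            medial, final = divmod(rest, 28)
            out.append(CHOSEONG[initial] + JUNGSEONG[medial] + JONGSEONG[final])
        else:
            out.append(ch)  # leave non-Hangeul characters unchanged
    return "".join(out)


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def phoneme_search(query: str, words: list[str], max_dist: int = 1) -> list[str]:
    """Return the words whose letter sequence is within max_dist of the query."""
    q = decompose(query)
    return [w for w in words if edit_distance(q, decompose(w)) <= max_dist]
```

For example, `phoneme_search("한굴", ["한글", "학교"])` keeps only `"한글"`, because the two words differ in a single vowel once decomposed. A real system would replace the linear scan with an index, which is where the data-structure discussion in the abstract becomes relevant.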

An Analysis Method of User Preference by using Web Usage Data in User Device (사용자 기기에서 이용한 웹 데이터 분석을 통한 사용자 취향 분석 방법)

  • Lee, Seung-Hwa;Choi, Hyoung-Kee;Lee, Eun-Seok
    • Journal of KIISE:Computing Practices and Letters / v.15 no.3 / pp.189-199 / 2009
  • The amount of information on the Web is explosively growing as the Internet gains in popularity. However, only a small portion of the information on the Web is truly relevant or useful to the user. Thus, offering suitable information according to user demand is an important subject in information retrieval. In e-commerce, the recommender system is essential to revitalize commercial transactions, raise user satisfaction and loyalty towards the information provider. The existing recommender systems are mostly based on user data collected at servers, so user data are dispersed over several servers. Therefore, web servers that lack sufficient user behavior data cannot easily infer user preferences. Also, if the user visits the server infrequently, it may be hard to reflect the dynamically changing user's interest. This paper proposes a novel personalization system analyzing the user preference based on web documents that are accessed by the user on a user device. The system also identifies non-content blocks appearing repeatedly in the dynamically generated web documents, and adds weight to the keywords extracted from the hyperlink sentence selected by the user. Therefore, the system establishes at an early stage recommendation strategies for the web server that has little user data. Also, user profiles are generated rapidly and more accurately by identifying the information blocks. In order to evaluate the proposed system, this study collected web data and purchase history from users who have current purchase activity. Then, we computed the similarity between purchase data and the user profile. We confirm the accuracy of the generated user profile since the web page containing the purchased item has higher correlation than other item pages.

Robust Feature Extraction Based on Image-based Approach for Visual Speech Recognition (시각 음성인식을 위한 영상 기반 접근방법에 기반한 강인한 시각 특징 파라미터의 추출 방법)

  • Gyu, Song-Min;Pham, Thanh Trung;Min, So-Hee;Kim, Jing-Young;Na, Seung-You;Hwang, Sung-Taek
    • Journal of the Korean Institute of Intelligent Systems / v.20 no.3 / pp.348-355 / 2010
  • Despite advances in speech recognition technology, speech recognition in noisy environments remains a difficult task. To address this problem, researchers have proposed methods that use visual information in addition to audio information. However, visual information is also subject to noise, just as audio information is, and this visual noise degrades visual speech recognition. How to extract visual feature parameters that enhance visual speech recognition performance is therefore a field of active interest. In this paper, we propose a method for visual feature parameter extraction based on an image-based approach to enhance the recognition performance of an HMM-based visual speech recognizer. For the experiments, we constructed an audio-visual database consisting of 105 speakers, each of whom uttered 62 words. We applied histogram matching, lip folding, RASTA filtering, a linear mask, DCT, and PCA. The experimental results show that the proposed method improves recognition performance by about 21% over the baseline method.

Decision Method of Importance of E-Mail based on User Profiles (사용자 프로파일에 기반한 전자 메일의 중요도 결정)

  • Lee, Samuel Sang-Kon
    • The KIPS Transactions:PartB / v.15B no.5 / pp.493-500 / 2008
  • Although people today gather large amounts of data from the network, users want only the information they need. Using information extraction technology, users can extract the data that satisfy a query. Because previous studies rely on a single attribute of the document, such as term frequency, they cannot be considered effective data clustering methods. What is needed is an effective clustering technology that can process electronic network documents, such as e-mail or XML, that contain tags in various formats. This paper describes a study on extracting information for a user query based on multiple attributes. It proposes a method of extracting data such as the sender, text type, time-limit expressions in the text, and title from an e-mail and using these data for filtering. It also describes an experiment verifying that the multi-attribute clustering method is more accurate than existing clustering methods that use only word frequency.

Customized Recipe Recommendation System Implemented in the form of a Chatbot (챗봇 형태로 구현한 사용자 맞춤형 레시피 추천 시스템)

  • Ahn, Ye-Jin;Cho, Ha-Young;Kang, Shin-Jae
    • Journal of the Korea Academia-Industrial cooperation Society / v.21 no.5 / pp.543-550 / 2020
  • Interest in food recipe retrieval systems has been increasing recently. Most computer-based recipe retrieval systems search by dish name or ingredient name. Since each recipe provides information in different weighing units, recalculating to the desired amount is necessary and inconvenient. This paper introduces a computer system that addresses these inconveniences. The system is a chatbot, built on web-based recipe recommendations, for users familiar with messenger conversation systems. After selecting the most popular recipes by name and pre-processing them to extract only the required information, the system recommends recipes based on 100,000 data items. Recipes are then searched by the names of food ingredients (included and excluded). Recalculations are performed based on the number of servings entered by the user. The satisfaction rate for the system's recommendations was 90.5%.
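
The two operations the abstract names, searching by included and excluded ingredients and recalculating amounts for a requested number of servings, can be sketched in Python as follows. This is a hedged illustration only: the recipe dictionary layout, the function names (`filter_recipes`, `scale_ingredients`), and exact-name ingredient matching are assumptions, not details taken from the paper.

```python
# Illustrative sketch only; data layout and names are hypothetical.

def scale_ingredients(ingredients: dict[str, float],
                      base_servings: int,
                      target_servings: int) -> dict[str, float]:
    """Scale every ingredient amount linearly to the requested servings."""
    if base_servings <= 0 or target_servings <= 0:
        raise ValueError("servings must be positive")
    factor = target_servings / base_servings
    return {name: amount * factor for name, amount in ingredients.items()}


def filter_recipes(recipes: list[dict],
                   include: tuple[str, ...] = (),
                   exclude: tuple[str, ...] = ()) -> list[dict]:
    """Keep recipes that contain every included and no excluded ingredient."""
    wanted, banned = set(include), set(exclude)
    result = []
    for recipe in recipes:
        names = set(recipe["ingredients"])  # assumed: mapping name -> amount
        if wanted <= names and not (banned & names):
            result.append(recipe)
    return result
```

For instance, a recipe written for 2 servings with 200 g of rice scales to 400 g when the user asks for 4 servings. Unit conversion between different weighing units, which the abstract also mentions, is deliberately left out of this sketch.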

Modified File Title Normalization Techniques for Copyright Protection (저작권 보호를 위한 변형된 파일 제목 정규화 기법)

  • Hwang, Chan Woong;Ha, Ji Hee;Lee, Tea Jin
    • Convergence Security Journal / v.19 no.4 / pp.133-142 / 2019
  • Torrent sites, P2P sites, and web hard services are frequently used because content can be downloaded easily, either for free or at low prices. Domestic torrent, P2P, and web hard sites are very sensitive to copyright issues, and protection techniques have been researched and applied. Among these, filtering techniques based on title and string comparison, which block enumerated cases such as file titles or combinations of keywords, are easily bypassed by changing the title and its spacing. In order to detect and block illegal works for copyright protection, a technique for normalizing modified file titles is essential. In this paper, we compared the detection rates obtained when searching for illegal works before and after normalizing their modified file titles. Before normalization the detection rate was 77.72%, whereas after normalization it was 90.23%. In the future, better handling of meaningless terms, such as common date and quality indicators, is expected to yield better results.
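
The idea of normalizing a modified file title before comparing it against known titles can be illustrated with a small Python sketch. It is not the technique evaluated in the paper: the normalization steps chosen here (Unicode NFKC folding, lowercasing, and removal of spaces and punctuation) and the names `normalize_title` and `matches_blocklist` are assumptions made for illustration.

```python
# Illustrative sketch only; the paper's actual normalization rules are not given here.
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Fold width/compatibility variants, lowercase, and drop separators."""
    text = unicodedata.normalize("NFKC", title).lower()
    # \W matches anything that is not a letter, digit, or underscore;
    # the underscore is added explicitly so it is stripped as well.
    return re.sub(r"[\W_]+", "", text)


def matches_blocklist(title: str, blocked_titles: list[str]) -> bool:
    """True if any normalized blocked title occurs inside the normalized title."""
    normalized = normalize_title(title)
    for blocked in blocked_titles:
        key = normalize_title(blocked)
        if key and key in normalized:  # skip entries that normalize to nothing
            return True
    return False
```

With this sketch, a title such as `"[1080p] 무 비.타_이-틀 2019"` still matches the blocked title `"무비 타이틀"`, although a plain string comparison would miss it. Date and quality tags like `2019` or `1080p` survive normalization, which corresponds to the residual noise the abstract leaves for future work.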

Automatic Vowel Onset Point Detection Based on Auditory Frequency Response (청각 주파수 응답에 기반한 자동 모음 개시 지점 탐지)

  • Zang, Xian;Kim, Hag-Tae;Chong, Kil-To
    • Journal of the Korea Academia-Industrial cooperation Society / v.13 no.1 / pp.333-342 / 2012
  • This paper presents a vowel onset point (VOP) detection method based on the human auditory system. The method maps the "perceptual" frequency scale, i.e., the Mel scale, onto the linear acoustic frequency scale, and then builds a bank of triangular Mel-weighted filters to simulate the band-pass filtering of the human ear. This nonlinear critical-band filter bank greatly reduces the data dimensionality and eliminates the effect of harmonics, making the formants more prominent in the nonlinearly spaced Mel spectrum. The summed energy of the Mel spectrum peaks is extracted as the feature for each frame, and the instant at which the energy amplitude starts rising sharply is detected as the VOP by convolving with a Gabor window. For a single-word database containing 12 vowels articulated with different kinds of consonants, the experimental results showed an average detection rate of 72.73%, higher than that of other vowel detection methods based on short-time energy and zero-crossing rate.
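
The mapping between the Mel scale and linear frequency, and the triangular filters placed on it, can be sketched in Python as below. This is a minimal illustration under stated assumptions: it uses the commonly cited conversion $m = 2595 \log_{10}(1 + f/700)$, which may differ from the variant used in the paper, and the function names (`hz_to_mel`, `mel_to_hz`, `filter_edges`, `triangular_weight`) are hypothetical.

```python
# Illustrative sketch only; the Mel formula is one common variant, assumed here.
import math


def hz_to_mel(f_hz: float) -> float:
    """Convert linear frequency in Hz to the Mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def filter_edges(num_filters: int, f_min: float, f_max: float) -> list[float]:
    """Edge frequencies in Hz, evenly spaced on the Mel scale.

    Filter k spans edges[k] .. edges[k + 2] and peaks at edges[k + 1].
    """
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (num_filters + 1)
    return [mel_to_hz(lo + i * step) for i in range(num_filters + 2)]


def triangular_weight(f_hz: float, left: float, center: float, right: float) -> float:
    """Weight of one triangular filter at f_hz (1 at center, 0 outside the span)."""
    if left < f_hz <= center:
        return (f_hz - left) / (center - left)
    if center < f_hz < right:
        return (right - f_hz) / (right - center)
    return 0.0
```

Because the edges are equally spaced in Mel, the filters are narrow at low frequencies and progressively wider at high frequencies, which is the nonlinear spacing the abstract refers to. The per-frame peak-energy feature and the Gabor-window convolution are not covered by this sketch.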

The Development of the Korean Medicine Symptom Diagnosis System Using Morphological Analysis to Refine Difficult Medical Terminology (전문용어 정제를 위한 형태소 분석을 이용한 한의학 증상 진단 시스템 개발)

  • Lee, Sang-Baek;Son, Yun-Hee;Jang, Hyun-Chul;Lee, Kyu-Chul
    • KIISE Transactions on Computing Practices / v.22 no.2 / pp.77-82 / 2016
  • This paper presents the development of a Korean medicine symptom diagnosis system. In Korean medicine symptom diagnosis, patients explain their symptoms and a doctor of Korean medicine makes a diagnosis based on them. Natural language processing is required to make a diagnosis automatically from patients' reports of symptoms. We use morphological analysis to obtain understandable information from the natural language itself. We developed a diagnosis system built on MongoDB, a document-oriented NoSQL database. NoSQL databases handle unstructured and semi-structured data better than relational databases. We collect patient symptom reports in MongoDB to refine difficult medical terminology and provide understandable terminology to patients.