• Title/Summary/Keyword: Text comparing

Layout Analysis for Calculation of Web Page Similarity as Image

  • Mitsuhashi, Noriaki;Yamaguchi, Toru;Takama, Yasufumi
    • Proceedings of the Korean Institute of Intelligent Systems Conference / 2003.09a / pp.142-145 / 2003
  • When we search for information on the Web, search engines analyze only the text collected from the source files of Web pages. However, the layout of a Web page can be analyzed only to a limited extent from its source file, even though page design is the most important factor for a user in evaluating a page. In particular, pages of similar design often offer similar information on the Web. We propose a method that analyzes layout, in order to compare the design of pages, by treating the displayed page as an image.

Fluency and Speech Rate for the Standard Korean Speakers (한국 표준어 화자의 유창성과 말속도에 관한 연구)

  • Shim, Hong-Im
    • Speech Sciences / v.11 no.3 / pp.193-200 / 2004
  • This was a preliminary study aimed at standardizing the speech rate and fluency of normal adult Korean speakers and at comparing them with those of professional speakers. The purposes of this study were to investigate (a) the speech rates (the overall speech rate and the articulation rate) and the disfluency characteristics of normal adult speakers and (b) the differences in speech rates (the overall speech rate and the articulation rate) and disfluency characteristics between normal adult speakers and professional speakers. The results were as follows: the most frequent disfluency type was 'interjection' in story-telling and 'revision' in text reading and in the announcing of professional speakers. The professional speakers had the fastest speech rates (overall speech rate and articulation rate) among the 3 groups.

Probabilistic Model for Performance Analysis of a Heuristic with Multi-byte Suffix Matching

  • Choi, Yoon-Ho
    • KSII Transactions on Internet and Information Systems (TIIS) / v.7 no.4 / pp.711-725 / 2013
  • A heuristic with multi-byte suffix matching plays an important role in real pattern matching algorithms. By skipping many characters at a time while comparing a given pattern with the text, a pattern matching algorithm based on a heuristic with multi-byte suffix matching achieves a faster average search time than algorithms based on deterministic finite automata. Based on various experimental results and simulations, previous work shows that pattern matching algorithms with multi-byte suffix matching perform well. However, there have been few studies on a mathematical model for analyzing the performance in a standard manner. In this paper, we propose a new probabilistic model that evaluates the performance of a heuristic with multi-byte suffix matching in an average-case search. When the theoretical analysis results and experimental results were compared, the proposed probabilistic model was found to be sufficient for evaluating the performance of a heuristic with suffix matching in real pattern matching algorithms.
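
  The skipping idea in this abstract can be illustrated with a minimal sketch. It assumes a Wu-Manber-style shift table keyed on the last 2 bytes of the search window; this is one common multi-byte suffix heuristic and is not necessarily the one the paper analyzes. The function names and the `block` parameter are hypothetical.

```python
def build_shift_table(pattern: bytes, block: int = 2) -> dict:
    """Map each `block`-byte substring of the pattern to its distance
    from the pattern's end (rightmost occurrence wins)."""
    m = len(pattern)
    if m < block:
        raise ValueError("pattern must be at least `block` bytes long")
    shift = {}
    for end in range(block - 1, m):
        # Later (more rightward) occurrences overwrite earlier ones,
        # so each block keeps its smallest, safe shift.
        shift[pattern[end - block + 1:end + 1]] = m - 1 - end
    return shift


def search(text: bytes, pattern: bytes, block: int = 2) -> list:
    """Return the start offsets of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    shift = build_shift_table(pattern, block)
    default_shift = m - block + 1  # block does not occur in the pattern
    matches = []
    pos = m - 1  # text index aligned with the pattern's last byte
    while pos < n:
        suffix = text[pos - block + 1:pos + 1]
        step = shift.get(suffix, default_shift)
        if step == 0:
            # The window ends with the pattern's suffix: verify fully.
            start = pos - m + 1
            if text[start:pos + 1] == pattern:
                matches.append(start)
            pos += 1
        else:
            pos += step  # skip several bytes without comparing them
    return matches


if __name__ == "__main__":
    print(search(b"abracadabra abracadabra", b"abra"))
```

  Keying the table on two bytes instead of one makes a zero shift rarer on typical text, which is the source of the larger average skips that the abstract refers to.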

Comparing Feature Selection Methods in Spam Mail Filtering

  • Kim, Jong-Wan;Kang, Sin-Jae
    • Proceedings of the Korea Society of Information Technology Applications Conference / 2005.11a / pp.17-20 / 2005
  • In this work, we compared several feature selection methods in the field of spam mail filtering. The proposed fuzzy inference method outperforms the information gain and chi-squared test methods as a feature selection method in terms of error rate. In the case of junk mail, the mail body contains little text, so it provides insufficient hints to distinguish spam from legitimate mail. To address this problem, we follow hyperlinks contained in the email body, fetch the contents of the remote web pages, and extract hints from both the original email body and the fetched web pages. A two-phase approach is applied to filter spam, in which definite hints are used first and less definite textual information is used second. In our experiment, the proposed two-phase method improved recall by 32.4% on average over using only the $1^{st}$ phase or only the $2^{nd}$ phase.
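
  As a rough illustration of one baseline named in this abstract, the sketch below scores terms with the standard 2x2 chi-squared statistic and ranks them. It does not reproduce the paper's fuzzy inference method; the toy documents, labels, and function names are assumptions made for the example.

```python
def chi_squared(a: int, b: int, c: int, d: int) -> float:
    """Chi-squared statistic for a 2x2 term/class table.
    a: spam docs with the term      b: legitimate docs with the term
    c: spam docs without the term   d: legitimate docs without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or class is constant: no discriminative power
    return n * (a * d - c * b) ** 2 / denom


def rank_terms(docs: list, labels: list) -> list:
    """Rank terms by chi-squared score, highest first.
    docs: list of token sets; labels: 1 for spam, 0 for legitimate."""
    vocab = set().union(*docs)
    n_spam = sum(labels)
    n_ham = len(labels) - n_spam
    scores = {}
    for term in vocab:
        a = sum(1 for doc, y in zip(docs, labels) if y and term in doc)
        b = sum(1 for doc, y in zip(docs, labels) if not y and term in doc)
        scores[term] = chi_squared(a, b, n_spam - a, n_ham - b)
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    # Hypothetical toy corpus, not data from the paper.
    docs = [{"free", "win"}, {"free", "click"},
            {"meeting", "notes"}, {"meeting", "lunch"}]
    labels = [1, 1, 0, 0]
    for term, score in rank_terms(docs, labels):
        print(f"{term}: {score:.3f}")
```

  A filter would then keep only the top-ranked terms as features; how many to keep is a tuning choice that the abstract does not specify.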

Personalized Anti-spam Filter Considering Users' Different Preferences

  • Kim, Jong-Wan
    • Journal of Korea Multimedia Society / v.13 no.6 / pp.841-848 / 2010
  • Conventional filters that use email header and body information apply the same judgment to every user when deciding whether an incoming email is spam. However, this is unrealistic in everyday life because each person has different criteria for what counts as spam. To resolve this problem, we consider user preference information as well as email category information derived from the email content. In this paper, we have developed a personalized anti-spam system using ontologies constructed from rules derived in a data mining process. We describe why traditional content-based filters are not applicable to the proposed experimental situation. In addition, we perform several experiments that construct classifiers to decide the email category and compare classification rule learners. In particular, an ID3 decision tree algorithm improved overall accuracy by around 17% compared to a conventional SVM text miner in deciding the email category. The axioms generated from the experimental dataset are also discussed.

Study on HuatuoXuanmenNeizhaotu in Processing of Medicinal ("화타현문내조도(華陀玄門內照圖)"의 약물포제(藥物炮製)에 대한 고찰(考察))

  • Sim, Hyun-A;Hwang, Seong-Yeon;Eom, Dong-Myung
    • Journal of Korean Medical classics / v.25 no.2 / pp.75-88 / 2012
  • Objective : Huatuoxuanmenneizhaotu(華陀玄門內照圖) is Huatuo's book in two volumes. The second volume classifies poisonous and nonpoisonous medicines and explains the processing of medicinals. We focus on the processing of medicinals in Huatuoxuanmenneizhaotu. Methods : Through translation of the Huatuoxuanmenneizhaotu text, we categorize its content in four ways: 1) classification of poisonous and nonpoisonous medicines, 2) methods of making medicines, 3) processing of medicinals using water and fire and 4) methods of supplements in processing of medicinals. Result : There are some mismatches in the classification of poisonous and nonpoisonous medicines in Huatuoxuanmenneizhaotu compared with Bencaogangmu. There are several methods of making medicines, processing medicinals and using supplements in processing of medicinals. Conclusion : These results show that the processing of medicinals in Huatuoxuanmenneizhaotu was highly diverse.

User-Created Content Recommendation Using Tag Information and Content Metadata

  • Rhie, Byung-Woon;Kim, Jong-Woo;Lee, Hong-Joo
    • Management Science and Financial Engineering / v.16 no.2 / pp.29-38 / 2010
  • As the Internet becomes more embedded in people's lives, Internet users draw on new Internet applications to express themselves through "user-created content (UCC)." In addition, there is a noticeable shift from text-centered content mainly posted on bulletin boards to multimedia content such as images and videos on UCC web sites. These changes require a different approach to recommendation compared with traditional product or content recommendation on the Internet. This paper aims to design UCC recommendation methods using user behavior data and content metadata such as tags and titles, and to compare the performance of the suggested methods. Real web log data from a major Korean video UCC site were used for the empirical experiments. The results of the experiments show that a collaborative filtering technique based on the similarity of UCC customers' preferences performs better than content-based recommendation methods based on tag information and content metadata.

An experiment in automatic indexing with korean texts : a comparison of syntactico-statistical and manual methods (구문 . 통계적 기법을 이용한 한국어 자동색인에 관한 연구)

  • 서은경
    • Journal of the Korean Society for information Management / v.10 no.1 / pp.97-124 / 1993
  • This study was undertaken in order to develop practical automatic indexing techniques suitable for Korean natural language texts. It takes a modest step toward this goal by developing an automatic syntactico-statistical indexing method and evaluating the method by comparing the results with manual indexing. For this experimental study, the Korean text database was constructed manually from 300 abstracts covering business subjects. The experimental results showed that the performance of the automatic syntactico-statistical indexing system was comparable to that reported in other studies that have compared automatic indexing with manual indexing.

A Study on DamEum(Phlegm-fluid retention) in Shingan Hyemin Eoyakwonbang(新刊惠民御藥院方) ("신간혜민어약원방(新刊惠民御藥院方).담음문(痰飮門)"에 대한 연구(硏究))

  • Eom, Dong-Myung;Song, Ji-Chung;Keum, Kyung-Soo
    • Journal of Korean Medical classics / v.25 no.4 / pp.115-121 / 2012
  • Objective : Yayaoyuanfang(御藥院方) is a prescription book compiled by Xu Guozhen(許國楨) in 1267. Yayaoyuanfang was published in the Chosun dynasty under the name Singan Hyemin Eoyakwonbang. We are therefore interested in the differences between the two books. Method : We analyze the differences between the two texts through physical bibliography and by comparing their contents, limited to the DamEum chapter. Result : The differences concern the names, order, materia medica, effects, medicine doses, methods of use and medicine processing of the prescriptions. Conclusion : There are several differences between Yayaoyuanfang and Singan Hyemin Eoyakwonbang. However, because no full text of Singan Hyemin Eoyakwonbang is available so far, continued study of Yayaoyuanfang and Singan Hyemin Eoyakwonbang is needed.

A Study on the Methodology of Traceability Analysis and Visualization between Non-standardized documents (비정형화된 문서간 추적성 분석 및 그 가시화 방안 제시)

  • Kim, EunHee;Song, Duck Yong;Hwang, Jin Sang;Jung, Jea Cheon
    • Journal of the Korean Society of Systems Engineering / v.10 no.1 / pp.57-64 / 2014
  • We propose a methodology to automatically extract requirements from documents and check the traceability between them. The documents include not only text files but also PDF and image files. We also suggest a method to visualize the result with maps, numbers, and graphs. By comparing the results with those of expert reviews, we show that a knowledge-based method should be used in the future instead of the word-based method to improve reliability. The results are more valuable when the methodology is applied to already existing documents than to those of a newly developed product.