• Title/Summary/Keyword: Neighbor Document

Skew Detection for Thai Printed Document Images

  • Premchaiswad, Wichian;Duangphasuk, Surakarn
    • Proceedings of the IEEK Conference / 2000.07a / pp.326-328 / 2000
  • The paper proposes a skew detection scheme for Thai printed document images based on linear regression. It is intended for use with Thai character recognition systems to reduce skew detection time. The scheme begins by finding the center of gravity of the document image, which serves as the starting point for gathering data. The data are obtained by scanning vertically in one-pixel increments with a width of 20 pixels. After scanning, any data point that differs from its neighbor by more than $\pm 15$ pixels is considered noise or data from another line and is deleted. The last step applies linear regression to the selected data to obtain the skew angle. The proposed method was tested on 45 document images with different fonts, sizes, and skew angles. The experimental results show that the method detects the skew angle with an error of less than one degree. The average processing time is about 19 times faster than that of the Hough transform method.
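
The last two steps of the abstract above (neighbor-based outlier removal, then linear regression) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the choice to compare each point with the previously kept point, and the assumption that the first sample is valid are all hypothetical.

```python
import math

def filter_neighbors(points, max_jump=15):
    """Keep points whose y differs from the previously kept point by at most max_jump pixels.

    Assumption: the first point is a valid sample on the text line.
    """
    kept = [points[0]]
    for x, y in points[1:]:
        if abs(y - kept[-1][1]) <= max_jump:
            kept.append((x, y))
    return kept

def skew_angle_degrees(points):
    """Fit y = a*x + b by ordinary least squares and return atan(a) in degrees."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    # atan2(sxy, sxx) equals atan(slope) whenever sxx > 0.
    return math.degrees(math.atan2(sxy, sxx))

# Synthetic samples every 20 pixels along a line skewed by 2 degrees,
# with one sample displaced by 60 pixels to mimic noise from another line.
slope = math.tan(math.radians(2.0))
samples = [(x, 100 + slope * x) for x in range(0, 420, 20)]
samples[5] = (samples[5][0], samples[5][1] + 60)

clean = filter_neighbors(samples)
angle = skew_angle_degrees(clean)
```

On this synthetic input the displaced sample is dropped and the fitted angle recovers the 2-degree skew. How real samples are gathered from the image (the 20-pixel-wide vertical scan) is not modeled here.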

An ECG Document Imaging System based on Neural Network and Graphic Techniques (신경망과 그래픽 기법을 이용한 심전도 결과지 이미징 시스템)

  • Kim Jin-Sang;Choi Sang-Yeol;Bae In-Ho;Kim Yun-Nyeon
    • Proceedings of the Korean Institute of Intelligent Systems Conference / 2006.05a / pp.269-272 / 2006
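  • There is strong demand for imaging systems that scan the result sheets printed by hospital measurement equipment and the records written by physicians and store them as images. This paper discusses the design and implementation of an imaging system that uses neural network and graphic techniques to store and retrieve, as images, the six types of ECG printouts used in the ECG room of a university hospital. The implemented system classifies the six types of ECG printouts, recognizes the important measurement data printed on each classified printout, and stores the data in a database. To classify the printouts, the average histogram of each sample form is computed, and a new printout is assigned by the nearest-neighbor method to the form whose average histogram is closest. To recognize the printed data, the scanned image is first segmented into regions based on per-form extraction information written in XML. The segmented regions then undergo character recognition with a neural network, and the recognized characters are stored as the corresponding attribute values in the database. The scanned printouts are stored as images so that physicians can annotate them or run conditional searches.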
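
The form classification step described in the abstract above (one average histogram per form, nearest-neighbor assignment) might look like the following sketch. The Euclidean distance, the three-bin histograms, and the form labels are assumptions made for illustration; the abstract does not state which distance or histogram resolution the system uses.

```python
import math

def mean_histogram(histograms):
    """Average several equal-length histograms bin by bin."""
    n = len(histograms)
    return [sum(column) / n for column in zip(*histograms)]

def euclidean(a, b):
    """Euclidean distance between two equal-length histograms (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_form(histogram, templates):
    """Return the label of the template histogram nearest to the input."""
    return min(templates, key=lambda label: euclidean(histogram, templates[label]))

# Hypothetical sample histograms for two form types.
templates = {
    "form_a": mean_histogram([[8, 1, 1], [6, 3, 1]]),
    "form_b": mean_histogram([[1, 2, 7], [1, 4, 5]]),
}
label = classify_form([7, 1, 2], templates)
```

A nearest-mean rule like this needs only one distance computation per form type, which keeps classification cheap when the number of form types is small (six in the abstract).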

IPv6 Stateless Address Autoconfiguration for Mobile Ad Hoc Networks (Ad hoc 망을 위한 IPv6기반 비상태형 자동 주소설정 프로토콜)

  • 박정수;인민교;홍용근;김용진;박성우
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2001.10a / pp.61-64 / 2001
  • The concept of IPv6 stateless address autoconfiguration (SAA) lends itself easily to mobile ad hoc networks. However, the Neighbor Discovery Protocol (NDP)-based mechanism described in [1] does not fit the multi-link environment of a mobile ad hoc network well. In this document, we extend the current SAA mechanism to make it suitable for mobile ad hoc networks.
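
For background, the first step of ordinary stateless address autoconfiguration, forming a tentative link-local address from an interface identifier, can be sketched as below. This shows only the standard modified EUI-64 construction on a single link; it is not the ad hoc extension the paper proposes, and the function name and example MAC address are hypothetical. The duplicate address detection that NDP performs afterwards is omitted.

```python
import ipaddress

def link_local_from_mac(mac):
    """Build an fe80::/64 address from a 48-bit MAC using modified EUI-64."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    octets[0] ^= 0x02  # invert the universal/local bit
    # Insert ff:fe between the two halves of the MAC to get a 64-bit identifier.
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    prefix = int(ipaddress.IPv6Address("fe80::"))
    return ipaddress.IPv6Address(prefix | int.from_bytes(interface_id, "big"))

address = link_local_from_mac("00:1a:2b:3c:4d:5e")
```

The resulting address is only tentative until uniqueness has been checked, which is the part that becomes difficult when neighbors are spread across multiple links.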

Empirical Analysis & Comparisons of Web Document Classification Methods (문서분류 기법을 이용한 웹 문서 분류의 실험적 비교)

  • Lee, Sang-Soon;Choi, Jung-Min;Jang, Geun;Lee, Byung-Soo
    • Proceedings of the Korean Information Science Society Conference / 2002.10d / pp.154-156 / 2002
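  • With the growth of the Internet, a great deal of information and knowledge is available online in the form of web documents such as HTML pages, newsgroup articles, and e-mail. These web documents need to be classified for various purposes, and systems that do so include Personal WebWatcher, InfoFinder, Webby, and NewT. A web document classification system uses document classification techniques to determine the class of a web document; representative algorithms are Naive Bayesian, k-NN (k-Nearest Neighbor), and TFIDF (Term Frequency Inverse Document Frequency). This paper compares and evaluates the performance of each of these document classification algorithms on web documents.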
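
To make two of the techniques named above concrete, the sketch below runs a k-NN vote over TF-IDF weighted term vectors with cosine similarity. Combining the two in this way, the particular IDF formula, and the toy documents are assumptions for illustration; the abstract treats k-NN and TFIDF as separate classifiers and does not give their exact formulations.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: tf * idf} dict per tokenized document, plus the idf table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(query, docs, labels, k=3):
    """Label the query by majority vote among its k most similar training documents."""
    vectors, idf = tfidf_vectors(docs)
    # Terms unseen in training get zero weight.
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(query).items()}
    ranked = sorted(range(len(docs)), key=lambda i: cosine(q, vectors[i]), reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training set of tokenized documents.
docs = [
    ["goal", "match", "team"],
    ["team", "score", "goal"],
    ["stock", "market", "price"],
    ["market", "trade", "stock"],
]
labels = ["sports", "sports", "finance", "finance"]
predicted = knn_classify(["goal", "team", "win"], docs, labels, k=3)
```

The query shares terms only with the two sports documents, so they rank first and win the vote even though a finance document fills the third neighbor slot.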

Automatic Document Categorization Using K-Nearest Neighbor Algorithm and Object-Oriented Thesaurus (K-NN과 객체 지향 시소러스를 이용한 웹 문서 자동 분류)

  • 방선이;양재동
    • Proceedings of the Korean Information Science Society Conference / 2001.10b / pp.145-147 / 2001
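  • Many statistical and machine learning algorithms are used for automatic document classification. Classification with statistical algorithms performs well, but its precision drops sharply when documents frequently fall into two or more candidate categories. This paper proposes a new technique for improving automatic document classification: a first-pass classification is performed with the K-NN algorithm, and when assignment to a specific category is ambiguous, the generalization and association relationships of a thesaurus are used to reduce the ambiguity.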

Improving the Accuracy of Document Classification by Learning Heterogeneity (이질성 학습을 통한 문서 분류의 정확성 향상 기법)

  • Wong, William Xiu Shun;Hyun, Yoonjin;Kim, Namgyu
    • Journal of Intelligence and Information Systems / v.24 no.3 / pp.21-44 / 2018
  • In recent years, the rapid development of Internet technology and the spread of smart devices have produced massive amounts of text data. These data are produced and distributed through media platforms such as the World Wide Web, Internet news feeds, microblogs, and social media. However, this enormous amount of easily obtained information lacks organization, which has drawn the interest of many researchers in managing it. It also calls for professionals capable of classifying relevant information, and hence text classification was introduced. Text classification is a challenging task in modern data analysis: it assigns a text document to one or more predefined categories or classes. Many techniques are available, such as K-Nearest Neighbor, Naïve Bayes, Support Vector Machine, Decision Tree, and Artificial Neural Network. However, when dealing with huge amounts of text data, model performance and accuracy become a challenge. The performance of a text classification model can vary with the type of words used in the corpus and the type of features created for classification. Most previous attempts have proposed a new algorithm or modified an existing one, and this line of research can be said to have reached its limits for further improvement. In this study, rather than proposing a new algorithm or modifying an existing one, we focus on finding a way to modify the use of data. It is widely known that classifier performance is influenced by the quality of the training data on which the classifier is built. Real-world datasets usually contain noise, and this noisy data can affect the decisions made by classifiers built from them. We consider that data from different domains, that is, heterogeneous data, may have characteristics of noise that can be utilized in the classification process. Machine learning algorithms build a classifier on the assumption that the characteristics of the training data and the target data are the same or very similar. However, for unstructured data such as text, the features are determined by the vocabulary included in the documents. If the viewpoints of the training data and the target data differ, the features may differ between the two. We attempt to improve classification accuracy by strengthening the robustness of the document classifier through artificially injecting noise into the process of constructing it. Data that come from various kinds of sources are likely to be formatted differently. This causes difficulties for traditional machine learning algorithms, which were not developed to recognize different types of data representation at once and to combine them in the same generalization. Therefore, to utilize heterogeneous data in the learning process of the document classifier, we apply semi-supervised learning. However, unlabeled data may degrade the performance of the document classifier. We therefore propose a method called the Rule Selection-Based Ensemble Semi-Supervised Learning Algorithm (RSESLA), which selects only the documents that contribute to improving the accuracy of the classifier. RSESLA creates multiple views by manipulating the features using different types of classification models and different types of heterogeneous data. The most confident classification rules are selected and applied for the final decision. Three different types of real-world data sources were used: news, Twitter, and blogs.

Automatic Document Classification Based on k-NN Classifier and Object-Based Thesaurus (k-NN 분류 알고리즘과 객체 기반 시소러스를 이용한 자동 문서 분류)

  • Bang Sun-Iee;Yang Jae-Dong;Yang Hyung-Jeong
    • Journal of KIISE:Software and Applications / v.31 no.9 / pp.1204-1217 / 2004
  • Numerous statistical and machine learning techniques have been studied for automatic text classification. However, because they train classifiers using only the feature vectors of documents, ambiguity between two possible categories significantly degrades classification precision. To remedy this drawback, we propose a new method that incorporates relationship information about categories into existing classifiers. In this paper, we first classify documents with the k-NN classifier, which is generally known for relatively good performance in spite of its simplicity. We then use relationship information from an object-based thesaurus to reduce the ambiguity. By referencing the various relationships in the thesaurus that correspond to the structured categories, the method removes the ambiguity and drastically improves the precision of k-NN classification. Experimental results show that the method improves precision by up to 13.86% over plain k-NN classification while preserving its recall.

A Study on the 16th Century Food Culture of Chosun Dynasty Nobility in "Miam's Diary" (『미암일기(眉巖日記)』분석을 통한 16세기 사대부가(士大夫家) 음식문화 연구 - 정묘년(丁卯年)(1567년(年)) 10월(月)~무진년(戊辰年)(1568년(年)) 9월(月) -)

  • Kim, Mi-Hye
    • Journal of the Korean Society of Food Culture / v.28 no.5 / pp.425-437 / 2013
  • The aim of this study was to establish the identity of Korean traditional food based on the food preferences recorded during the Chosun Dynasty. Our primary source was the invaluable historical document known as "Miam's Diary," which reveals details of such food preferences from October 1567 to September 1568. By analyzing the income-expenditure trends of virtually every household, the diary was used to give a vivid description of the traditional food preferences of people in that period. A detailed analysis of the diary summarizes the characteristics of families in the 16th century. First, it records that expenditure on food was based mainly on stipends and gifts received. The food people preferred was diverse, including rice, beans, chicken, pheasant, and seafood; some of it was dried or pickled to prevent decay. Second, it shows that people expended food mainly as salary for servants. People used the income from selling such food items to purchase goods and land, and used the same for donations toward funerals or weddings. Third, it records that day-to-day purchases of groceries were mostly for gifts to someone close, such as a neighbor, colleague, relative, or student; such gifts included small groceries, food items, and clothes. Fourth, based on the data in the diary, it seems likely that gentry families emphasized the customary formalities of a family as early as the late 16th century. Finally, the document records that noblemen of the Chosun Dynasty felt they had to extend warmth and affection by presenting generous gifts to their guests at home. Noblemen of that period were very particular about welcoming their guests, believing that this alone would testify to their status as noblemen.

Combining Multiple Classifiers for Automatic Classification of Email Documents (전자우편 문서의 자동분류를 위한 다중 분류기 결합)

  • Lee, Jae-Haeng;Cho, Sung-Bae
    • Journal of KIISE:Software and Applications / v.29 no.3 / pp.192-201 / 2002
  • Automated text classification is considered an important method for managing and processing the huge and continuously growing volume of documents in digital form. Recently, text classification has been addressed with machine learning techniques such as k-nearest neighbor, decision trees, support vector machines, and neural networks. However, few studies address real problems rather than well-organized text corpora, so they do not demonstrate practical usefulness. This paper proposes and analyzes text classification methods for a real application, the e-mail document classification task. First, we propose a method of combining multiple neural networks that improves performance through combination by maximum and by neural networks. Second, we present another strategy that combines multiple machine learning classifiers. Voting, Borda count, and neural networks improve the overall classification performance. Experimental results show the usefulness of the proposed methods in a real application domain, yielding precision rates of more than 90%.
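
Two of the combination rules named in the abstract above, voting and Borda count, can be sketched as follows. The scoring convention (a label ranked at position p among n labels earns n - 1 - p points), the tie handling, and the example labels are assumptions; the abstract does not specify the variants the authors used.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most classifiers (ties go to the first seen)."""
    return Counter(predictions).most_common(1)[0][0]

def borda_count(rankings):
    """Combine per-classifier rankings (best label first) by summing positional points."""
    scores = Counter()
    for ranking in rankings:
        n = len(ranking)
        for position, label in enumerate(ranking):
            scores[label] += n - 1 - position
    return max(scores, key=scores.get)

# Four hypothetical classifiers ranking three e-mail categories.
rankings = [
    ["work", "personal", "spam"],
    ["work", "personal", "spam"],
    ["personal", "spam", "work"],
    ["spam", "personal", "work"],
]
top_choices = [ranking[0] for ranking in rankings]
by_vote = majority_vote(top_choices)
by_borda = borda_count(rankings)
```

Here the two rules disagree: "work" collects the most first-place votes, while "personal" wins the Borda count because every classifier ranks it first or second. This is the kind of situation in which using full rankings rather than top choices can change the combined decision.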

Design and Implementation of Web Crawler utilizing Unstructured data

  • Tanvir, Ahmed Md.;Chung, Mokdong
    • Journal of Korea Multimedia Society / v.22 no.3 / pp.374-385 / 2019
  • A web crawler is a program commonly used by search engines to discover new content on the Internet. Crawlers have made the web easier for users. In this paper, we structure unstructured data in order to collect data from web pages. Our system can select words near a keyword in more than one document in an unstructured way. Neighbor data for the keyword are collected through word2vec. The system's goal is filtering at the data acquisition level and for a large taxonomy. The main problem in text taxonomy is how to improve classification accuracy. To improve accuracy, we propose a new TF-IDF weighting method. We modified the TF algorithm to calculate the accuracy of unstructured data. Finally, our system proposes an efficient web page crawling algorithm, derived from TF-IDF and the RL Web search algorithm, to make searching for relevant information more efficient. This paper also examines the working nature of crawlers and crawling algorithms in search engines for efficient information retrieval.