• Title/Summary/Keyword: data extraction


FIGURE ALPHABET HYPOTHESIS INSPIRED NEURAL NETWORK RECOGNITION MODEL

  • Ohira, Ryoji;Saiki, Kenji;Nagao, Tomoharu
    • Proceedings of the Korean Society of Broadcast Engineers Conference
    • /
    • 2009.01a
    • /
    • pp.547-550
    • /
    • 2009
  • The mechanism of human object recognition is not yet well understood. In animal experiments on apes, however, neurons that respond to simple shapes (e.g., circles, triangles, and squares) have been found, which led to the hypothesis that humans may recognize an object as a combination of such simple shapes. This mechanism is called the Figure Alphabet Hypothesis, and the simple shapes are called the Figure Alphabet. As one approach to researching object recognition algorithms, we focused on this hypothesis and, drawing on it, proposed a feature extraction algorithm for object recognition. In this paper, we describe the recognition of binarized images of multifont alphabet characters by a recognition model that combines the feature extraction algorithm with a three-layered neural network. First, the feature extraction algorithm calculates the difference between the training image data set and the template; this computed difference is the feature quantity. The feature quantity is input to the neural network model, which is trained by backpropagation (BP method). We then had the recognition model recognize an unknown image data set and measured the correct answer rate. To evaluate the performance of the proposed recognition model, we also had a conventional neural network recognize the same unknown image data set. The proposed recognition model showed a higher correct answer rate than the conventional neural network model, which confirmed its validity. In future work, we plan to study the recognition of natural images with the proposed recognition model.
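
The abstract does not say how the difference between an image and the template is computed. As a minimal sketch, assuming a pixel-wise mismatch count against a small set of shape templates (the function name, the templates, and the toy image are all hypothetical), the feature quantity could look like this:

```python
def difference_features(image, templates):
    """Return one feature per template: the fraction of pixels at which
    the binarized image and the template differ (an assumed definition)."""
    n_pixels = len(image) * len(image[0])
    features = []
    for template in templates:
        mismatches = sum(
            1
            for img_row, tpl_row in zip(image, template)
            for a, b in zip(img_row, tpl_row)
            if a != b
        )
        features.append(mismatches / n_pixels)
    return features


# Toy 3x3 binarized image (a ring) and two made-up shape templates.
ring = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
square = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
cross = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]

features = difference_features(ring, [square, cross])
```

In a model of the kind the abstract describes, a vector such as `features` would be the input to the three-layered neural network rather than the raw pixels.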


Development of Information Extraction System from Multi Source Unstructured Documents for Knowledge Base Expansion (지식베이스 확장을 위한 멀티소스 비정형 문서에서의 정보 추출 시스템의 개발)

  • Choi, Hyunseung;Kim, Mintae;Kim, Wooju;Shin, Dongwook;Lee, Yong Hun
    • Journal of Intelligence and Information Systems
    • /
    • v.24 no.4
    • /
    • pp.111-136
    • /
    • 2018
  • In this paper, we propose a methodology for extracting answer information for queries from various types of unstructured documents collected from multiple web sources in order to expand a knowledge base. The proposed methodology consists of the following steps. 1) Collect relevant documents from Wikipedia, Naver encyclopedia, and Naver news for queries separated into "subject-predicate" form, and classify the suitable documents. 2) Determine whether each sentence is suitable for information extraction and derive a confidence score. 3) Based on the predicate feature, extract the information from the suitable sentences and derive the overall confidence of the extraction result. To evaluate the performance of the information extraction system, we selected 400 queries from SK Telecom's artificial intelligence speaker. The proposed system showed higher performance than the baseline model. The contribution of this study is a sequence tagging model based on a bi-directional LSTM-CRF that uses the predicate feature of the query; with it, we developed a robust model that maintains high recall even on various types of unstructured documents collected from multiple sources. Information extraction for knowledge base expansion must take into account the heterogeneous characteristics of source-specific document types. The proposed methodology extracted information from various types of unstructured documents more effectively than the baseline model. Previous research has the limitation that performance is poor when extracting information from document types that differ from the training data.
In addition, by predicting the suitability of documents and sentences before the extraction step, this study prevents unnecessary extraction attempts on documents that do not contain the answer information. It is meaningful that we provide a method that maintains precision even in an actual web environment. Information extraction for knowledge base expansion targets unstructured documents on the real web, so there is no guarantee that a given document contains the correct answer. When question answering is performed on the real web, previous machine reading comprehension studies show low precision because they frequently attempt to extract an answer even from documents that contain no correct answer. The policy of predicting the suitability of documents and sentences for information extraction is meaningful in that it helps maintain extraction performance even in a real web environment. The limitations of this study and future research directions are as follows. The first limitation concerns data preprocessing. In this study, the unit of knowledge extraction is determined through morphological analysis based on the open-source KoNLPy Python package, and the extraction result can be wrong when the morphological analysis is not performed properly. To improve the information extraction results, a more advanced morphological analyzer needs to be developed. The second limitation is entity ambiguity. The information extraction system of this study cannot distinguish between entities that share a name but are intended differently. If several people with the same name appear in the news, the system may not extract information about the person intended by the query.
In future research, measures are needed to distinguish people with the same name. The third limitation concerns the evaluation query data. In this study, we selected 400 user queries collected from SK Telecom's interactive artificial intelligence speaker to evaluate the performance of the information extraction system, and we built an evaluation data set of 2,800 documents (400 questions × 7 articles per question: 1 Wikipedia, 3 Naver encyclopedia, and 3 Naver news) by judging whether each document contains a correct answer. To ensure the external validity of the study, it is desirable to use more queries to measure the performance of the system, but building such data is a costly activity that must be done manually. Future research needs to evaluate the system on more queries. It is also necessary to develop a Korean benchmark data set for information extraction from multi-source web documents so that results can be evaluated more objectively.

Research on Data Acquisition Strategy and Its Application in Web Usage Mining (웹 사용 마이닝에서의 데이터 수집 전략과 그 응용에 관한 연구)

  • Ran, Cong-Lin;Joung, Suck-Tae
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology
    • /
    • v.12 no.3
    • /
    • pp.231-241
    • /
    • 2019
  • Web Usage Mining (WUM) is a branch of Web mining and an application of data mining techniques. It identifies and analyzes users' access patterns using the web server log data generated when users access a web site. It is therefore important that the data be acquired in a reasonable way before data mining techniques are applied to discover user access patterns from the web log. The main task of data acquisition is to efficiently obtain users' detailed click behavior as they visit a Web site. This paper focuses on the data acquisition stage that precedes the first stage of the web usage mining process, covering the data acquisition strategy and a field extraction algorithm. The field extraction algorithm separates the fields in each single line of the log files, and it also works well in practical applications involving large amounts of user data.
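
To illustrate what separating the fields of a single log line can involve, here is a minimal sketch. The abstract does not name a log format, so the NCSA Common Log Format is assumed, and the pattern, the function name `extract_fields`, and the sample line are illustrative rather than the authors' algorithm:

```python
import re

# Assumed layout: NCSA Common Log Format (not specified in the abstract).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def extract_fields(line):
    """Split one log line into named fields; return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A "-" size means no response body was sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


# Made-up sample line for illustration.
sample = ('192.0.2.10 - alice [10/Oct/2019:13:55:36 +0900] '
          '"GET /index.html HTTP/1.1" 200 2326')
record = extract_fields(sample)
```

Returning `None` for lines that do not match lets a caller count or log malformed entries instead of failing partway through a large log file.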

Feature Parameter Extraction and Speech Recognition Using Matrix Factorization (Matrix Factorization을 이용한 음성 특징 파라미터 추출 및 인식)

  • Lee Kwang-Seok;Hur Kang-In
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.10 no.7
    • /
    • pp.1307-1311
    • /
    • 2006
  • In this paper, we propose a new speech feature parameter that uses matrix factorization to obtain part-based features of the speech spectrum. The proposed parameter is an effective dimension-reduced representation of multi-dimensional feature data, obtained through a matrix factorization procedure under the constraint that all matrix elements are non-negative. The reduced feature data represents part-based features of the input data. We verify the usefulness of the NMF (Non-Negative Matrix Factorization) algorithm for speech feature extraction by applying it to Mel-scaled filter bank outputs to obtain the feature parameter. According to the recognition experiment results, the proposed feature parameter outperforms the commonly used MFCC (Mel-Frequency Cepstral Coefficient) in recognition performance.
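
The non-negativity constraint can be made concrete with a small sketch. The abstract does not state which NMF update rule was used, so the widely known multiplicative updates for the Frobenius-norm objective are assumed here, and the toy matrix merely stands in for Mel-scaled filter bank outputs:

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (n x m) as W (n x rank) times H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random start so every entry can be rescaled.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


def frobenius_error(v, w, h):
    """Frobenius norm of V - W H."""
    approx = matmul(w, h)
    return sum((a - b) ** 2
               for row_v, row_a in zip(v, approx)
               for a, b in zip(row_v, row_a)) ** 0.5


# Toy non-negative matrix (made-up values): rows as channels, columns as frames.
spectrum = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [1.0, 1.0, 1.0],
    [3.0, 5.0, 7.0],
]
w, h = nmf(spectrum, rank=2)
```

Because the updates only multiply non-negative quantities, `w` and `h` stay non-negative throughout. In this orientation each column of `h` is a reduced, rank-dimensional description of one frame.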

A Study on Fast Extraction of Endmembers from Hyperspectral Image Data (초분광 영상자료의 Endmember 추출 속도 향상에 관한 연구)

  • Kim, Kwang-Eun
    • Korean Journal of Remote Sensing
    • /
    • v.28 no.4
    • /
    • pp.347-355
    • /
    • 2012
  • This study proposes a fast algorithm for endmember extraction that takes the minimum and maximum pixels of each band after the MNF transform as candidate pixels for endmembers. The method searches for endmembers not among all image pixels but only among the previously extracted candidate pixels. Experimental results with N-FINDR on simulated hyperspectral image data and AVIRIS Cuprite image data showed that the proposed fast algorithm extracts the same endmembers as the conventional method. Further study of the effect of noise and of more adaptive criteria for extracting candidate pixels is expected to increase the usability of this method for faster and more efficient analysis of hyperspectral image data.
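
The candidate selection step can be sketched as follows. This is only an illustration under stated assumptions: the input is taken to be already transformed (the MNF transform and the N-FINDR search are not implemented), and the function name and toy values are hypothetical:

```python
def candidate_pixels(image):
    """Return sorted indices of pixels that hold the minimum or maximum
    value in at least one band.

    image: list of pixels, each a list of band values, assumed to be
    the output of a transform such as MNF.
    """
    n_bands = len(image[0])
    candidates = set()
    for band in range(n_bands):
        values = [pixel[band] for pixel in image]
        candidates.add(values.index(min(values)))
        candidates.add(values.index(max(values)))
    return sorted(candidates)


# Four made-up pixels with two bands each.
pixels = [
    [0.1, 5.0],
    [0.9, 2.0],
    [0.5, 9.0],
    [0.4, 3.0],
]
candidates = candidate_pixels(pixels)
```

With B bands this yields at most 2B candidates, so a subsequent endmember search would examine a far smaller set than the full image.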

Extraction of Different Types of Geometrical Features from Raw Sensor Data of Two-dimensional LRF (2차원 LRF의 Raw Sensor Data로부터 추출된 다른 타입의 기하학적 특징)

  • Yan, Rui-Jun;Wu, Jing;Yuan, Chao;Han, Chang-Soo
    • Journal of Institute of Control, Robotics and Systems
    • /
    • v.21 no.3
    • /
    • pp.265-275
    • /
    • 2015
  • This paper describes methods for extracting five different types of geometrical features (line, arc, corner, polynomial curve, and NURBS curve) from raw data obtained with a two-dimensional laser range finder (LRF). Natural features with their covariance matrices play a key role in feature-based simultaneous localization and mapping (SLAM), where they can be used to represent the environment and correct the pose of a mobile robot. The covariance matrices of these geometrical features are derived in detail from the raw sensor data and the uncertainty of the LRF. Several comparisons are made and discussed to highlight the advantages and drawbacks of each type of geometrical feature. Finally, features extracted from raw sensor data obtained with an LRF in an indoor environment are used to validate the proposed extraction methods.
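
For the simplest of the five feature types, the line, a rough sketch is given below. The abstract does not give the authors' fitting procedure, so a standard total least squares fit in the normal form x·cos(θ) + y·sin(θ) = ρ is assumed, and the covariance derivation is not covered. All function names and values are hypothetical:

```python
import math


def polar_to_cartesian(ranges, start_angle, angle_step):
    """Convert raw LRF range readings (one per beam) to (x, y) points."""
    return [
        (r * math.cos(start_angle + i * angle_step),
         r * math.sin(start_angle + i * angle_step))
        for i, r in enumerate(ranges)
    ]


def fit_line(points):
    """Total least squares line fit.

    Returns (theta, rho) such that x*cos(theta) + y*sin(theta) = rho
    minimizes the sum of squared perpendicular distances to the points.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    theta = 0.5 * math.atan2(-2.0 * sxy, syy - sxx)
    rho = mean_x * math.cos(theta) + mean_y * math.sin(theta)
    return theta, rho


# Made-up points lying exactly on the line y = x.
wall = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
theta, rho = fit_line(wall)
residuals = [abs(x * math.cos(theta) + y * math.sin(theta) - rho) for x, y in wall]
```

The normal form is used because it remains well defined for vertical lines, which a slope-intercept fit cannot represent.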

Using a Cellular Automaton to Extract Medical Information from Clinical Reports

  • Barigou, Fatiha;Atmani, Baghdad;Beldjilali, Bouziane
    • Journal of Information Processing Systems
    • /
    • v.8 no.1
    • /
    • pp.67-84
    • /
    • 2012
  • A substantial amount of clinical data concerning the medical history of a patient is in the form of clinical reports written by doctors. They describe patients, their pathologies, their personal and medical histories, findings made during interviews or during procedures, and so forth. They represent a source of precious information that can be used in several applications such as research information to diagnose new patients, epidemiological studies, decision support, statistical analysis, and data mining. But this information is difficult to access, as it is often in unstructured text form. To make access to patient data easy, our research aims to develop a system for extracting information from unstructured text. In a previous work, a rule-based approach was applied to a corpus of clinical reports on infectious diseases to extract structured data in the form of named entities and properties. In this paper, we propose the use of a Boolean inference engine, which is based on a cellular automaton, to perform the extraction. Our motivation for adopting this Boolean modeling approach is twofold: first, to optimize storage, and second, to reduce the response time of entity extraction.

Conjugate Point Extraction for High-Resolution Stereo Satellite Images Orientation

  • Oh, Jae Hong;Lee, Chang No
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography
    • /
    • v.37 no.2
    • /
    • pp.55-62
    • /
    • 2019
  • Establishing the stereo geometry based on precise sensor modeling is a prerequisite for accurate stereo data processing. Ground control points are generally required for accurate sensor modeling, but they cannot be obtained over areas where accessibility is limited or reference data is not available. For such areas, relative orientation should be carried out to improve the geometric consistency between the stereo data, though it does not improve the absolute positional accuracy. Relative orientation requires conjugate points that are well distributed over the entire image region. Automatic conjugate point extraction is therefore required because manual operation is labor-intensive. In this study, we applied a method consisting of key point extraction, search space minimization based on the epipolar line, and rigorous outlier detection based on RPCs (Rational Polynomial Coefficients) bias compensation modeling. We tested different window sizes on Kompsat-2 across-track stereo data and analyzed the RPCs precision after bias compensation, with and without the epipolar line information. The experimental results showed that matching outliers were inevitable across the different matching parameterizations, but they were successfully detected and removed by the rigorous method, yielding sub-pixel stereo RPCs precision.

Text Extraction In WWW Images (웹 영상에 포함된 문자 영역의 추출)

  • 김상현;심재창;김중수
    • Proceedings of the IEEK Conference
    • /
    • 2000.06d
    • /
    • pp.15-18
    • /
    • 2000
  • In this paper, we propose a method for text extraction in Web images. Our approach is based on contrast detection and pixel component ratio analysis at the mouse position. The extracted data, combined with OCR, can be used for real-time dictionary lookup or language translation applications in a Web browser.


Emotion Recognition System Using Neural Networks in Textile Images (신경망을 이용한 텍스타일 영상에서의 감성인식 시스템)

  • Kim, Na-Yeon;Shin, Yun-Hee;Kim, Soo-Jeong;Kim, Jee-In;Jeong, Karp-Joo;Koo, Hyun-Jin;Kim, Eun-Yi
    • Journal of KIISE:Software and Applications
    • /
    • v.34 no.9
    • /
    • pp.869-879
    • /
    • 2007
  • This paper proposes a neural network based approach for automatic human emotion recognition in textile images. To investigate the correlation between emotion and pattern, a survey was conducted with 20 people, which showed that emotion is deeply affected by pattern. Accordingly, a neural network based classifier is used to recognize the pattern in textiles. In our system, two schemes are used to describe the pattern: a raw-pixel data extraction scheme using an auto-regressive method (RDES) and a wavelet transformed data extraction scheme (WTDES). To assess the validity of the proposed method, it was applied to recognize human emotions in 100 textiles, and the results show that WTDES performs better than RDES: RDES produced an accuracy of 71%, while WTDES produced an accuracy of 90%. Although there are some differences depending on the data extraction scheme, the proposed method shows an accuracy of 80% on average. This result confirms that our system has the potential to be applied in various applications such as the textile industry and e-business.