• Title/Summary/Keyword: Preprocessing Process

Selective Word Embedding for Sentence Classification by Considering Information Gain and Word Similarity (문장 분류를 위한 정보 이득 및 유사도에 따른 단어 제거와 선택적 단어 임베딩 방안)

  • Lee, Min Seok;Yang, Seok Woo;Lee, Hong Joo
    • Journal of Intelligence and Information Systems
    • /
    • v.25 no.4
    • /
    • pp.105-122
    • /
    • 2019
  • Dimensionality reduction is one of the standard ways to handle large text data in text mining. High-dimensional, sparse representations require heavy computation and can lead to overfitting, so a dimension reduction step is often necessary to improve model performance in sentence classification, one of the core tasks in Natural Language Processing. Proposed methods range from simply reducing noise in the data, such as misspellings and informal text, to incorporating semantic and syntactic information, and both the representation and the selection of text features affect classifier performance. The common goal of dimension reduction is to find a latent space that represents the raw data observed in the observation space. Existing approaches use feature extraction and feature selection algorithms, as well as word embeddings, which learn low-dimensional vector representations of words that capture semantic and syntactic information. To improve performance, recent studies have modified the word dictionary according to positive and negative scores of pre-defined words. The basic idea of this study is that similar words have similar vector representations: once a feature selection algorithm identifies unimportant words, we assume that words similar to them also have little impact on sentence classification. This study proposes two methods for more accurate classification that eliminate words under specific rules and construct word embeddings based on Word2Vec. To find words of low importance in the text, we use information gain to measure importance and cosine similarity to search for similar words. In the first method, we eliminate words with comparatively low information gain from the raw text and build word embeddings. In the second, we additionally remove words that are similar to the low-information-gain words before building the embeddings. The filtered text and embeddings are then fed into two deep learning models: a Convolutional Neural Network and an attention-based bidirectional LSTM. The study uses customer reviews of Kindle products on Amazon.com, IMDB, and Yelp as datasets. Reviews with more than five helpful votes and a helpful-vote ratio over 70% were classified as helpful; since Yelp shows only the number of helpful votes, we randomly sampled 100,000 reviews with more than five helpful votes from 750,000 Yelp reviews. Minimal preprocessing, such as removing numbers and special characters, was applied to each dataset. To evaluate the proposed methods, we compared them against Word2Vec and GloVe embeddings built from all words, and showed that one of the proposed methods outperforms the full-vocabulary embeddings: removing unimportant words improves performance, although removing too many words lowers it. Future research should consider diverse preprocessing schemes and an in-depth analysis of word co-occurrence for measuring similarity between words. We also applied the proposed method only with Word2Vec; other embedding methods such as GloVe, fastText, and ELMo could be combined with the proposed elimination methods to identify effective combinations of embedding and elimination strategies.
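
The two elimination rules combine information gain with embedding-space similarity. Below is a minimal sketch of that idea, assuming a binary-labeled corpus of token sets and using gensim's Word2Vec for the neighbour search; the quantile and similarity threshold are illustrative values, not figures from the paper.

```python
import math
from gensim.models import Word2Vec

def entropy(pos, neg):
    """Binary class entropy from counts."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

def information_gain(docs, labels, word):
    """IG(w) = H(C) - H(C | w present/absent); docs are token sets."""
    pos = sum(labels)
    base = entropy(pos, len(labels) - pos)
    groups = {True: [], False: []}
    for d, y in zip(docs, labels):
        groups[word in d].append(y)
    h_cond = 0.0
    for ys in groups.values():
        if ys:
            h_cond += len(ys) / len(docs) * entropy(sum(ys), len(ys) - sum(ys))
    return base - h_cond

def words_to_remove(docs, labels, vocab, w2v, ig_quantile=0.1, sim_threshold=0.8):
    """Stage 1: lowest-IG words; stage 2: their near neighbours in embedding space."""
    scores = {w: information_gain(docs, labels, w) for w in vocab}
    cutoff = sorted(scores.values())[int(len(scores) * ig_quantile)]
    low_ig = {w for w, s in scores.items() if s <= cutoff}
    neighbours = set()
    for w in low_ig:
        if w in w2v.wv:
            neighbours |= {n for n, sim in w2v.wv.most_similar(w, topn=10)
                           if sim >= sim_threshold}
    return low_ig | neighbours
```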

Development of Information System based on GIS for Analyzing Basin-Wide Pollutant Washoff (유역오염원 수질거동해석을 위한 GIS기반 정보시스템 개발)

  • Park, Dae-Hee;Ha, Sung-Ryong
    • Journal of the Korean Association of Geographic Information Studies
    • /
    • v.9 no.4
    • /
    • pp.34-44
    • /
    • 2006
  • Simulation models allow researchers to model large hydrological catchments for comprehensive water resource management and for explaining diffuse pollution processes, such as land-use changes under regional development plans. Recently, many studies have examined water quality using Geographic Information Systems (GIS) and dynamic watershed models such as AGNPS, HSPF, and SWAT, which require handling large amounts of data. The aim of this study is to develop a watershed-based water quality estimation system for assessing impacts on stream water quality. KBASIN-HSPF, proposed in this study, simplifies data compilation for HSPF by facilitating the setup and simulation process. It also supports the spatial interpretation of point and non-point pollutant information, Thiessen rainfall creation, and pre- and post-processing of large environmental datasets. An integration methodology coupling GIS with the water quality model was designed for preprocessing the geo-morphologic data. The KBASIN-HSPF interface comprises four modules: registration and modification of basic environmental information, a watershed delineation generator, a watershed geo-morphologic index calculator, and a model input file processor. KBASIN-HSPF was applied to simulate the water quality impact of variations in subbasin pollution discharge structure.
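
Among the interface's preprocessing tasks, Thiessen rainfall creation distributes gauge rainfall over the basin. Below is a minimal sketch of Thiessen (nearest-gauge) weighting, approximated on a raster of basin cells; the function and variable names are hypothetical and not taken from KBASIN-HSPF.

```python
import numpy as np

def thiessen_weights(cell_xy, gauge_xy):
    """cell_xy: (N, 2) basin cell centers; gauge_xy: (G, 2) gauge locations.
    Each cell is assigned its nearest gauge; a gauge's weight is the fraction
    of basin cells it covers, approximating its Thiessen polygon area."""
    d = np.linalg.norm(cell_xy[:, None, :] - gauge_xy[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    return np.bincount(nearest, minlength=len(gauge_xy)) / len(cell_xy)

def basin_rainfall(cell_xy, gauge_xy, gauge_rain):
    """Area-weighted basin rainfall from per-gauge measurements."""
    return float(thiessen_weights(cell_xy, gauge_xy) @ gauge_rain)
```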

Hardware-Based High Performance XML Parsing Technique Using an FPGA (FPGA를 이용한 하드웨어 기반 고성능 XML 파싱 기법)

  • Lee, Kyu-hee;Seo, Byeong-seok
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.40 no.12
    • /
    • pp.2469-2475
    • /
    • 2015
  • Structured XML is widely used to describe services in various Web services, as well as for digital documents, digital signatures, and the representation of multimedia files in email systems. An XML document must first be parsed before its elements can be accessed, and parsing is the most compute-intensive task in using XML documents. Most previous work has focused on hardware-based XML parsers to improve parsing performance, while little work has studied the parsing technique itself. We present a high-performance parsing technique that can be applied to any XML parser and design a hardware-based XML parser on an FPGA. The proposed technique uses element analyzers instead of a state machine and performs multibyte-based element matching. As a result, it reduces the number of clock cycles per byte (CPB) and requires no preprocessing, such as loading XML data into memory. Compared to other parsers, our parser achieves a 1.33~1.82 times improvement in system performance. The proposed technique can therefore process XML documents in real time and is suitable for use in any XML parser.
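
The paper's element analyzers are hardware units, but the multibyte matching idea can be illustrated in software: instead of advancing a state machine one byte at a time, candidate element names are compared as whole byte slices. A minimal sketch with illustrative tag names; this is not the FPGA design itself.

```python
def match_elements(xml_bytes, names):
    """Return (offset, name) pairs for start tags whose name is in `names`."""
    targets = [n.encode() for n in names]
    hits, i = [], 0
    while True:
        i = xml_bytes.find(b"<", i)
        if i < 0:
            return hits
        for t in targets:
            # multibyte compare: the whole name plus its delimiter in one shot
            cand = xml_bytes[i + 1 : i + 2 + len(t)]
            if cand[:-1] == t and cand[-1:] in (b" ", b">", b"/"):
                hits.append((i, t.decode()))
                break
        i += 1

# e.g. match_elements(b"<order><id>7</id></order>", ["id"]) -> [(7, "id")]
```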

Seismic interval velocity analysis on prestack depth domain for detecting the bottom simulating reflector of gas-hydrate (가스 하이드레이트 부존층의 하부 경계면을 규명하기 위한 심도영역 탄성파 구간속도 분석)

  • Ko Seung-Won;Chung Bu-Heung
    • Proceedings of the Korean Society for New and Renewable Energy Conference
    • /
    • 2005.06a
    • /
    • pp.638-642
    • /
    • 2005
  • For gas hydrate exploration, long-offset multichannel seismic data were acquired using a 4 km streamer in the Ulleung Basin of the East Sea. The dataset was processed to define BSRs (Bottom Simulating Reflectors) and to estimate the amount of gas hydrate. Confirming the presence of a BSR on the seismic section and investigating its physical properties are important for gas hydrate detection. In particular, a faster interval velocity overlying a slower interval velocity indicates the likely presence of gas hydrate above the BSR and free gas beneath it. Consequently, estimating correct interval velocities and analyzing their spatial variations are critical for detecting gas hydrate with seismic reflection data. Using Dix's equation, Root Mean Square (RMS) velocities can be converted into interval velocities. However, this is not an adequate way to investigate interval velocities above and below the BSR, given the poor resolution and accuracy of RMS velocities and the equation's assumption that interval velocity increases with depth. We therefore used Migration Velocity Analysis (MVA) software from Landmark Co. to estimate detailed, accurate interval velocities. MVA derives the velocities of sediment layers from Common Mid Point (CMP) gathered seismic data; the CMP gathers should be produced after basic processing steps that enhance the signal-to-noise ratio of the primary reflections. The prestack depth migrated section is produced using the interval velocities, which are the key parameters governing its quality. The correctness of the interval velocities can be examined from the Residual Move Out (RMO) on the CMP gathers: if there is no RMO, the primary reflection events are flat across all offsets of the Common Reflection Point (CRP) gathers, which proves that the prestack depth migration was done with a correct velocity field. The tomographic inversion used in this study needs two initial inputs: a partially stacked dataset preprocessed to remove multiples and noise, and a depth-domain velocity model built by smoothing and editing the interval velocities converted from RMS velocities. After three iterations of the tomographic inversion, the optimal interval velocity field was obtained. In conclusion, the final interval velocity around the BSR drops abruptly from 2500 m/s to 1400 m/s, and the BSR appears at a depth of about 200 m below the sea bottom.
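
For reference, the Dix conversion mentioned above turns RMS velocities into interval velocities. A minimal sketch, assuming zero-offset two-way times in seconds; as the abstract notes, this is only a starting point that MVA and tomographic inversion then refine.

```python
import math

def dix_interval_velocities(v_rms, t0):
    """v_rms: RMS velocities (m/s) and t0: zero-offset two-way times (s),
    both sampled at successive reflectors. Returns one interval velocity
    per layer via Dix's equation."""
    v_int = [v_rms[0]]  # top layer: interval velocity equals RMS velocity
    for n in range(1, len(v_rms)):
        num = v_rms[n] ** 2 * t0[n] - v_rms[n - 1] ** 2 * t0[n - 1]
        v_int.append(math.sqrt(num / (t0[n] - t0[n - 1])))
    return v_int
```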

Effect of Glucose Level on Brain FDG-PET Images (FDG를 이용한 Brain PET에서 Glucose Level이 영상에 미치는 영향)

  • Kim, In-Yeong;Lee, Yong-ki;Ahn, Sung-Min
    • Journal of radiological science and technology
    • /
    • v.40 no.2
    • /
    • pp.275-280
    • /
    • 2017
  • In addition to tumors, normal tissues such as the brain and myocardium can take up $^{18}F$-FDG, and the amount taken up by normal tissues can be altered by the surrounding environment. A process that enhances the contrast between tumor and normal tissue is therefore necessary. This study examines the effect of glucose levels on FDG PET images of brain tissue, which exhibits high glucose activity at all times, in small animals. Micro PET scans were performed on fourteen mice after injecting $^{18}F$-FDG, and the images were compared with respect to fasting. The mean SUV was 0.84 higher in fasted mice than in non-fasted mice. The images from non-fasted mice showed high accumulation in organs other than the brain, with increased surrounding noise. In addition, the fasted mice showed higher early uptake and a steeper uptake curve than the non-fasted mice. These findings suggest that fasting is important when assessing brain function with $^{18}F$-FDG brain PET. Further studies on whether caffeine levels and other preprocessing items affect the acquired images would contribute to reducing radiation exposure in patients.
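
The SUV values compared above are the standard dose- and weight-normalized uptake measure; a minimal sketch of that textbook computation (not code from the study):

```python
def suv(tissue_activity_bq_per_ml, injected_dose_bq, body_weight_g):
    """SUV = tissue activity concentration / (injected dose / body weight).
    With activity in Bq/mL, dose in Bq, and weight in g (assuming a tissue
    density of ~1 g/mL), the result is dimensionless."""
    return tissue_activity_bq_per_ml / (injected_dose_bq / body_weight_g)
```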

Traffic Attributes Correlation Mechanism based on Self-Organizing Maps for Real-Time Intrusion Detection (실시간 침입탐지를 위한 자기 조직화 지도(SOM)기반 트래픽 속성 상관관계 메커니즘)

  • Hwang, Kyoung-Ae;Oh, Ha-Young;Lim, Ji-Young;Chae, Ki-Joon;Nah, Jung-Chan
    • The KIPS Transactions:PartC
    • /
    • v.12C no.5 s.101
    • /
    • pp.649-658
    • /
    • 2005
  • Since network-based attacks cause extensive real-world damage, it is very important to detect intrusions quickly at their onset. However, intrusion detection using supervised learning requires either preprocessing enormous amounts of data or manual analysis by an administrator, and such analysis has two drawbacks for detecting abnormal traffic: it may be incorrect, and it can miss real-time detection. In this paper, we propose a traffic attribute correlation analysis mechanism based on self-organizing maps (SOM) for real-time intrusion detection. The proposed mechanism has three steps. First, unsupervised learning builds a map of clusters composed of similar traffic. Second, each map cluster is labeled to divide the map into normal and abnormal traffic, using a rule created through correlation analysis with the SOM. Finally, the mechanism performs real-time detection and updates the map gradually. Extensive experiments show that the proposed mechanism, which combines unsupervised and supervised learning, performs better for real-time intrusion detection than supervised learning alone.
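
The first (unsupervised) step trains a SOM so that similar traffic records map to nearby units. A minimal from-scratch sketch of that training loop; the map size, learning rate, and neighbourhood width are illustrative, and real traffic features would need normalization first.

```python
import numpy as np

def train_som(X, rows=10, cols=10, epochs=20, lr0=0.5, sigma0=3.0, seed=0):
    """X: (n_samples, n_features) traffic feature vectors. Returns the
    (rows, cols, n_features) codebook; each unit is a traffic prototype."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(rows, cols, X.shape[1]))
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)
    step, n_steps = 0, epochs * len(X)
    for _ in range(epochs):
        for x in rng.permutation(X):
            decay = 1.0 - step / n_steps          # linear rate/width decay
            lr, sigma = lr0 * decay, sigma0 * decay + 1e-3
            bmu = np.unravel_index(np.argmin(((W - x) ** 2).sum(-1)),
                                   (rows, cols))  # best-matching unit
            dist2 = ((grid - np.array(bmu)) ** 2).sum(-1)
            h = np.exp(-dist2 / (2 * sigma ** 2))[..., None]
            W += lr * h * (x - W)                 # pull neighbourhood toward x
            step += 1
    return W

def bmu_of(W, x):
    """Map a traffic record to its cluster (unit) on the trained map."""
    return np.unravel_index(np.argmin(((W - x) ** 2).sum(-1)), W.shape[:2])
```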

Normalization of Face Images Subject to Directional Illumination using Linear Model (선형모델을 이용한 방향성 조명하의 얼굴영상 정규화)

  • Ko, Jae-Pil;Kim, Eun-Ju;Byun, Hye-Ran
    • Journal of KIISE:Software and Applications
    • /
    • v.31 no.1
    • /
    • pp.54-60
    • /
    • 2004
  • Face recognition is one of the problems to be solved by appearance-based matching techniques, but the appearance of a face image is very sensitive to variations in illumination. One of the easiest ways to obtain better performance is to collect more training samples acquired under variable lighting, but this is impractical in the real world. In object recognition, it is desirable to focus on feature extraction or normalization techniques rather than on the classifier. This paper presents a simple approach to the normalization of faces under directional illumination, one of the significant sources of error in the face recognition process. The proposed method, ICR (Illumination Compensation based on Multiple Linear Regression), finds the plane that best fits the intensity distribution of the face image using multiple linear regression, then uses this plane to normalize the face image. The advantages of our method are its simplicity and practicality: the planar approximation of a face image is mathematically defined by a simple linear model. We provide experimental results on public face databases and our own database, which show a significant improvement in recognition accuracy.
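
The core of ICR is an ordinary least-squares plane fit to the pixel intensities. A minimal sketch of that step, assuming a grayscale image as a 2D array; restoring the mean level after subtraction is one reasonable choice, not necessarily the paper's exact normalization.

```python
import numpy as np

def icr_normalize(img):
    """Fit I(x, y) ~ a*x + b*y + c by multiple linear regression and remove
    the fitted illumination plane from the face image."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    A = np.column_stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    coeffs, *_ = np.linalg.lstsq(A, img.ravel().astype(float), rcond=None)
    plane = (A @ coeffs).reshape(h, w)
    return img - plane + plane.mean()  # keep the original overall brightness
```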

Interactive Projection by Closed-loop based Position Tracking of Projected Area for Portable Projector (이동 프로젝터 투사영역의 폐회로 기반 위치추적에 의한 인터랙티브 투사)

  • Park, Ji-Young;Rhee, Seon-Min;Kim, Myoung-Hee
    • Journal of KIISE:Software and Applications
    • /
    • v.37 no.1
    • /
    • pp.29-38
    • /
    • 2010
  • We propose an interactive projection technique that displays details of a large image at high resolution and brightness by tracking a portable projector. A closed-loop tracking method updates the projected image while the user moves the portable projector to change the position of the detail area. A marker embedded in the large image indicates the position to be occupied by the detail image projected by the portable projector. The marker is extracted from sequential images acquired by a camera attached to the projector, and its position in the large display image is updated under the constraint that the centers of the marker and the camera frame coincide in every frame. The projected image and the projective transformation for warping are calculated using the marker's position and shape in the camera frame. The marker's four corner points are determined by a four-step segmentation process consisting of HSI-based camera image preprocessing, edge extraction by Hough transformation, a quadrangle test, and a cross-ratio test. The interactive projection system implemented with the proposed method runs at about 24 fps, and in a user study the overall feedback on the system's usability was very positive.
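
The cross-ratio test in the final step relies on a projective invariant: the cross-ratio of four collinear points is unchanged by the camera's projective transformation, so candidate marker points can be checked against the known model value. A minimal sketch of that check; the point layout and tolerance are assumptions, not the paper's exact test.

```python
import numpy as np

def cross_ratio(a, b, c, d):
    """Cross-ratio (AC * BD) / (BC * AD) of four collinear 2D points."""
    ac, bd = np.linalg.norm(c - a), np.linalg.norm(d - b)
    bc, ad = np.linalg.norm(c - b), np.linalg.norm(d - a)
    return (ac * bd) / (bc * ad)

def passes_cross_ratio_test(pts_image, pts_model, tol=0.05):
    """pts_*: (4, 2) arrays of corresponding collinear points, same order."""
    return abs(cross_ratio(*pts_image) - cross_ratio(*pts_model)) < tol
```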

Vehicle Area Segmentation from Road Scenes Using Grid-Based Feature Values (격자 단위 특징값을 이용한 도로 영상의 차량 영역 분할)

  • Kim Ku-Jin;Baek Nakhoon
    • Journal of Korea Multimedia Society
    • /
    • v.8 no.10
    • /
    • pp.1369-1382
    • /
    • 2005
  • Vehicle segmentation, which extracts vehicle areas from road scenes, is one of the fundamental operations in many application areas, including Intelligent Transportation Systems. We present a vehicle segmentation approach for still images captured by outdoor CCD cameras mounted on supporting poles. We first divide the input image into a set of two-dimensional grids and then calculate edge feature values for each grid. By analyzing these feature values statistically, we can find the optimal rectangular grid area of the vehicle. A preprocessing step computes statistics of the feature values from background images captured under various circumstances. For a car image, we compare its feature values to the background statistics to decide whether each grid belongs to the vehicle area, and we use dynamic programming to find the optimal rectangular grid area among these candidate grids. Based on statistical analysis and a global search, our method is more systematic than previous methods, which usually rely on heuristics, and the statistical analysis provides high robustness against noise and errors due to brightness changes, camera tremors, etc. Our prototype implementation performs vehicle segmentation in 0.150 seconds on average for $1280\times960$ car images and achieves strict success in $97.03\%$ of 270 images with various kinds of noise.
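
The abstract does not spell out the dynamic programming formulation, but one standard way to find an optimal rectangular grid area is a maximum-sum subrectangle search over per-grid scores (positive where the edge features deviate from the background statistics, negative otherwise). A minimal sketch under that assumption:

```python
import numpy as np

def best_rectangle(score):
    """score: (R, C) per-grid values. Returns (top, left, bottom, right, total)
    for the rectangle of grid cells with the maximum total score."""
    R, C = score.shape
    best = (0, 0, 0, 0, float("-inf"))
    for top in range(R):
        col_sum = np.zeros(C)
        for bottom in range(top, R):
            col_sum += score[bottom]      # rectangles spanning rows top..bottom
            run, left = 0.0, 0            # 1D Kadane scan over the column sums
            for right in range(C):
                run += col_sum[right]
                if run > best[4]:
                    best = (top, left, bottom, right, run)
                if run < 0:
                    run, left = 0.0, right + 1
    return best
```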

An Efficient Numeric Character Segmentation of Metering Devices for Remote Automatic Meter Reading (원격 자동 검침을 위한 효과적인 계량기 숫자 분할)

  • Toan, Vo Van;Chung, Sun-Tae;Cho, Seong-Won
    • Journal of Korea Multimedia Society
    • /
    • v.15 no.6
    • /
    • pp.737-747
    • /
    • 2012
  • Recently, to support automatic meter reading for conventional metering devices, image processing-based approaches that recognize the numeric meter readings in captured meter images have attracted many researchers' interest. Numeric character segmentation is a critical step for successful recognition. In this paper, we propose an efficient numeric character segmentation method that works well for any metering device type under diverse illumination environments. The proposed method consists of two consecutive stages: detecting the number area containing all digits as a tight ROI (Region of Interest), and segmenting the numeric characters within the ROI. Detection of the tight ROI proceeds in two steps: extracting a rough ROI using horizontal line segments after illumination enhancement preprocessing, and then tightening it by clipping based on the vertical and horizontal projections of the binarized ROI. Numeric character segmentation in the detected ROI is achieved stably in two stages: vertical segmentation of each digit region, followed by digit segmentation within each vertically segmented region. Experiments on a homegrown meter image database containing various meter types with low contrast, low intensity, shadow, and saturation show that the proposed method performs effectively for any metering device type under diverse illumination environments.
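
The ROI-tightening step described above clips rows and columns using projection profiles of the binarized ROI. A minimal sketch of that clipping, assuming foreground pixels are 1; the 5% threshold is illustrative.

```python
import numpy as np

def tighten_roi(binary, frac=0.05):
    """binary: 2D array with foreground pixels as 1.
    Returns (r0, r1, c0, c1) bounds that drop rows/columns whose projection
    falls below `frac` of the profile's peak."""
    def span(profile):
        if profile.max() == 0:
            return 0, len(profile)
        keep = np.flatnonzero(profile >= frac * profile.max())
        return int(keep[0]), int(keep[-1]) + 1
    r0, r1 = span(binary.sum(axis=1).astype(float))
    c0, c1 = span(binary.sum(axis=0).astype(float))
    return r0, r1, c0, c1
```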