Title/Summary/Keyword: improved algorithm


How to improve the accuracy of recommendation systems: Combining ratings and review texts sentiment scores (평점과 리뷰 텍스트 감성분석을 결합한 추천시스템 향상 방안 연구)

  • Hyun, Jiyeon;Ryu, Sangyi;Lee, Sang-Yong Tom
    • Journal of Intelligence and Information Systems / v.25 no.1 / pp.219-239 / 2019
  • As providing customized services to individuals becomes increasingly important, research on personalized recommendation systems is being carried out constantly. Collaborative filtering is one of the most popular approaches in academia and industry. However, it has the limitation that recommendations are based mostly on quantitative information such as users' ratings, which lowers accuracy. To solve this problem, many studies have attempted to improve recommendation performance by using other information besides quantitative ratings; a good example is sentiment analysis of customer review texts. Nevertheless, existing research has not directly combined the results of sentiment analysis with quantitative rating scores in the recommendation system. This study therefore aims to reflect the sentiments expressed in reviews in the rating scores. In other words, we propose a new algorithm that converts a user's own review into quantitative information and feeds it directly into the recommendation system. To do this, we needed to quantify users' reviews, which are originally qualitative information. Sentiment scores were calculated with text-mining sentiment analysis techniques. The data were movie reviews, from which a domain-specific sentiment dictionary was constructed using regression analysis: positive/negative dictionaries were built with Lasso regression, Ridge regression, and ElasticNet. The accuracy of each dictionary was verified with a confusion matrix: 70% for the Lasso-based dictionary, 79% for the Ridge-based dictionary, and 83% for ElasticNet (α=0.3). The sentiment score of each review was therefore calculated with the ElasticNet-based dictionary and combined with the rating to create a new rating. We show that collaborative filtering that reflects the sentiment scores of user reviews is superior to the traditional method that considers only the existing ratings. To demonstrate this, the proposed algorithm was applied to memory-based user-based collaborative filtering (UBCF), item-based collaborative filtering (IBCF), and the model-based matrix factorization methods SVD and SVD++. The mean absolute error (MAE) and root mean square error (RMSE) were calculated to compare the recommendation system using the combined score against the one that considers ratings alone. On MAE, the combined score improved results by 0.059 for UBCF, 0.0862 for IBCF, 0.1012 for SVD, and 0.188 for SVD++; on RMSE, by 0.0431 for UBCF, 0.0882 for IBCF, 0.1103 for SVD, and 0.1756 for SVD++. As a result, the prediction performance of the rating that reflects the sentiment score is superior to that of the conventional rating. In other words, collaborative filtering that reflects the sentiment scores of user reviews shows superior accuracy compared with conventional collaborative filtering that considers only the quantitative rating.
A paired t-test confirmed that the proposed model is the better approach. To overcome the limitation of previous research that judges a user's sentiment only by the quantitative rating, this study quantified the review numerically so that the user's opinion is considered in the recommendation system in a more refined way, improving its accuracy. The findings have managerial implications for recommendation system developers, who need to consider both quantitative and qualitative information; the way the combined system is constructed in this paper might be used directly by developers.
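The combination step this abstract describes lends itself to a short illustration. Below is a minimal Python sketch of the idea: a regression-built sentiment lexicon scores each review, and the score is blended with the explicit rating before the collaborative filtering stage. The `tanh` squashing, the mixing weight `alpha`, and the tiny lexicon are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def sentiment_score(review, lexicon):
    """Score a review with a domain-specific sentiment dictionary.

    `lexicon` maps a word to a regression-estimated polarity weight
    (the paper builds such weights with Lasso/Ridge/ElasticNet).
    Returns a score squashed onto the rating scale [1, 5].
    """
    raw = sum(lexicon.get(w, 0.0) for w in review.lower().split())
    return 3.0 + 2.0 * np.tanh(raw)  # assumed squashing, not from the paper

def combined_rating(rating, review, lexicon, alpha=0.5):
    """Blend the explicit rating with the review's sentiment score.

    `alpha` is an assumed mixing weight; the paper combines the two
    into a new rating but the exact weighting is not given here.
    """
    return alpha * rating + (1 - alpha) * sentiment_score(review, lexicon)

# The blended rating would then replace the raw rating inside
# UBCF/IBCF/SVD/SVD++ training, and MAE/RMSE compared as in the paper.
lexicon = {"great": 0.8, "boring": -0.7, "masterpiece": 1.1}
print(combined_rating(4.0, "A great film, truly a masterpiece", lexicon))
```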

The Comparison of Image Quality and Quantitative Indices by Wide Beam Reconstruction Method and Filtered Back Projection Method in Tl-201 Myocardial Perfusion SPECT (Tl-201 심근관류 SPECT 검사에서 광대역 재구성(Wide Beam Reconstruction: WBR) 방법과 여과 후 역투영법에 따른 영상의 질 및 정량적 지표 값 비교)

  • Yoon, Soon-Sang;Nam, Ki-Pyo;Shim, Dong-Oh;Kim, Dong-Seok
    • The Korean Journal of Nuclear Medicine Technology / v.14 no.2 / pp.122-127 / 2010
  • Purpose: The Xpress3.cardiac™, a wide beam reconstruction (WBR) method developed by UltraSPECT (Haifa, Israel), enables acquisition at a quarter of the usual time while maintaining image quality. The purpose of this study is to investigate the usefulness of the WBR method for decreasing scan times and to compare it with filtered back projection (FBP), the routinely used method. Materials and Methods: Phantom and clinical studies were performed. An anthropomorphic torso phantom was prepared to match the count levels of a patient's body. The Tl-201 concentrations in the compartments were 74 kBq (2 µCi)/cc in myocardium, 11.1 kBq (0.3 µCi)/cc in soft tissue, and 2.59 kBq (0.07 µCi)/cc in lung. Non-gated Tl-201 myocardial perfusion SPECT data were acquired with the phantom: one study was scanned for 50 seconds per frame with the FBP method, and the other for 13 seconds per frame with the WBR method. Using Xeleris ver. 2.0551, full width at half maximum (FWHM) and average image contrast were compared. In the clinical studies, we analyzed 30 patients examined by Tl-201 gated myocardial perfusion SPECT in the department of nuclear medicine at Asan Medical Center from January to April 2010. The patients were imaged at full time (50 seconds per frame) with the FBP algorithm and again at quarter time (13 seconds per frame) with the WBR algorithm. Using the 4D MSPECT (4DM), Quantitative Perfusion SPECT (QPS), and Quantitative Gated SPECT (QGS) software, the summed stress score (SSS), summed rest score (SRS), summed difference score (SDS), end-diastolic volume (EDV), end-systolic volume (ESV), and ejection fraction (EF) were analyzed for correlation and compared statistically by paired t-test. Results: In the phantom study, the WBR method improved FWHM by about 30% compared with the FBP method (WBR 5.47 mm, FBP 7.07 mm), and the WBR method's average image contrast was also higher than FBP's. However, for the quantitative indices SSS, SDS, SRS, EDV, ESV, and EF, there were statistically significant differences between WBR and FBP (p<0.01). SSS, SDS, and SRS correlated poorly between WBR and FBP (0.18, 0.34, 0.08), whereas EDV, ESV, and EF showed good correlation (0.88, 0.89, 0.71). Conclusion: The phantom study confirmed that the WBR method reduces acquisition time while improving image quality compared with the FBP method. However, the significant differences in the quantitative indices must be considered, and further evaluation is needed before clinical application to find the cause of the differences between the phantom and clinical results.
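For readers unfamiliar with the FWHM metric quoted above (WBR 5.47 mm vs. FBP 7.07 mm), here is a minimal sketch of how it can be measured from a 1-D line profile through a reconstructed phantom image. The linear interpolation of the half-maximum crossings and the synthetic Gaussian test profile are illustrative assumptions, not the Xeleris procedure.

```python
import numpy as np

def fwhm(positions_mm, profile):
    """Full width at half maximum of a 1-D line profile.

    Finds the samples above half of the peak value and linearly
    interpolates the half-maximum crossing on each side.
    """
    profile = np.asarray(profile, dtype=float)
    half = profile.max() / 2.0
    above = np.where(profile >= half)[0]
    lo, hi = above[0], above[-1]

    def cross(i0, i1):  # interpolate the exact half-max position
        x0, x1 = positions_mm[i0], positions_mm[i1]
        y0, y1 = profile[i0], profile[i1]
        return x0 + (half - y0) * (x1 - x0) / (y1 - y0)

    return cross(hi, hi + 1) - cross(lo - 1, lo)

# Synthetic Gaussian profile whose true FWHM is 5.47 mm
x = np.linspace(-20, 20, 401)
sigma = 5.47 / (2 * np.sqrt(2 * np.log(2)))
print(fwhm(x, np.exp(-x**2 / (2 * sigma**2))))  # ~5.47
```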


Accelerometer-based Gesture Recognition for Robot Interface (로봇 인터페이스 활용을 위한 가속도 센서 기반 제스처 인식)

  • Jang, Min-Su;Cho, Yong-Suk;Kim, Jae-Hong;Sohn, Joo-Chan
    • Journal of Intelligence and Information Systems / v.17 no.1 / pp.53-69 / 2011
  • Vision- and voice-based technologies are commonly utilized for human-robot interaction. But it is widely recognized that the performance of vision- and voice-based interaction systems deteriorates by a large margin in real-world situations due to environmental and user variances. Human users need to be very cooperative to get reasonable performance, which significantly limits the usability of vision- and voice-based human-robot interaction technologies. As a result, touch screens are still the major medium of human-robot interaction for real-world applications. To improve the usability of robots for various services, alternative interaction technologies should be developed to complement the problems of vision- and voice-based technologies. In this paper, we propose the use of an accelerometer-based gesture interface as one such alternative, because accelerometers are effective in detecting the movements of the human body, while their performance is not limited by environmental context such as lighting conditions or a camera's field of view. Moreover, accelerometers are widely available nowadays in many mobile devices. We tackle the problem of classifying the acceleration signal patterns of the 26 English alphabet letters, which is one of the essential repertoires for realizing robot-based education services. Recognizing 26 English handwriting patterns from accelerometers is a very difficult task to undertake because of the large number of pattern classes and the complexity of each pattern. The most difficult comparable problem previously undertaken was recognizing the acceleration signal patterns of 10 handwritten digits; most previous studies dealt with pattern sets of 8~10 simple and easily distinguishable gestures useful for controlling home appliances, computer applications, robots, etc. Good features are essential for the success of pattern recognition. To promote discriminative power over the complex English alphabet patterns, we extracted 'motion trajectories' from the input acceleration signal and used them as the main feature. Investigative experiments showed that classifiers based on trajectories performed 3%~5% better than those with raw features, e.g., the acceleration signal itself or statistical figures. To minimize the distortion of trajectories, we applied a simple but effective set of smoothing filters and band-pass filters. It is well known that acceleration patterns for the same gesture vary greatly among performers. To tackle this problem, online incremental learning is applied to make our system adaptive to each user's distinctive motion properties. Our system is based on instance-based learning (IBL), where each training sample is memorized as a reference pattern. Brute-force incremental learning in IBL continuously accumulates reference patterns, which is a problem because it not only slows down classification but also degrades recall performance. Regarding the latter phenomenon, we observed a tendency that as the number of reference patterns grows, some reference patterns contribute more to false positive classifications. Thus, we devised an algorithm for optimizing the reference pattern set based on the positive and negative contribution of each reference pattern. The algorithm is performed periodically to remove reference patterns that have a very low positive contribution or a high negative contribution.
Experiments were performed on 6,500 gesture patterns collected from 50 adults aged 30~50. Each letter was performed 5 times per participant using a Nintendo® Wii™ remote. The acceleration signal was sampled at 100 Hz on 3 axes. The mean recall rate over all the letters was 95.48%. Some letters recorded a very low recall rate and exhibited a very high pairwise confusion rate; major confusion pairs are D (88%) and P (74%), I (81%) and U (75%), and N (88%) and W (100%). Though W was recalled perfectly, it contributed much to the false positive classification of N. By comparison with major previous results from VTT (96% for 8 control gestures), CMU (97% for 10 control gestures), and Samsung Electronics (97% for 10 digits and a control gesture), we find the performance of our system superior given the number of pattern classes and the complexity of the patterns. Using our gesture interaction system, we conducted 2 case studies of robot-based edutainment services. The services were implemented on various robot platforms and mobile devices including the iPhone™. The participating children exhibited improved concentration and active reaction to the service with our gesture interface. To prove the effectiveness of the gesture interface, a test was taken by the children after experiencing an English teaching service: those who played with the gesture interface-based robot content scored 10% better than those given conventional teaching. We conclude that the accelerometer-based gesture interface is a promising technology for flourishing real-world robot-based services and content by complementing the limits of today's conventional interfaces, e.g., touch screen, vision, and voice.
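The periodic reference-set optimization described above can be sketched as follows. This is a hedged illustration only: the contribution counters and the pruning thresholds `pos_min`/`neg_max` are assumptions, since the abstract does not spell out the exact bookkeeping.

```python
def prune_references(references, validation_set, classify,
                     pos_min=1, neg_max=3):
    """Periodically prune an instance-based learner's reference set.

    For each reference pattern, count how often it is the nearest
    neighbor in a correct classification (positive contribution)
    versus an incorrect one (negative contribution), then drop
    patterns with very low positive or high negative contribution.
    `classify(sample, references)` is assumed to return the predicted
    label and the winning reference pattern.
    """
    pos = {id(r): 0 for r in references}
    neg = {id(r): 0 for r in references}
    for sample, true_label in validation_set:
        predicted, nearest_ref = classify(sample, references)
        if predicted == true_label:
            pos[id(nearest_ref)] += 1
        else:
            neg[id(nearest_ref)] += 1
    return [r for r in references
            if pos[id(r)] >= pos_min and neg[id(r)] <= neg_max]
```

Run after every batch of newly memorized gestures, this keeps the reference set small (fast classification) while removing patterns that, like the W examples confused with N above, mostly cause false positives.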

L-band SAR-derived Sea Surface Wind Retrieval off the East Coast of Korea and Error Characteristics (L밴드 인공위성 SAR를 이용한 동해 연안 해상풍 산출 및 오차 특성)

  • Kim, Tae-Sung;Park, Kyung-Ae;Choi, Won-Moon;Hong, Sungwook;Choi, Byoung-Cheol;Shin, Inchul;Kim, Kyung-Ryul
    • Korean Journal of Remote Sensing / v.28 no.5 / pp.477-487 / 2012
  • Sea surface winds off the east coast of Korea were derived from L-band ALOS (Advanced Land Observing Satellite) PALSAR (Phased Array type L-band Synthetic Aperture Radar) data, and their error characteristics were analyzed. We could retrieve high-resolution wind vectors off the east coast of Korea including the coastal region, which has been substantially unavailable from satellite scatterometers. Retrieved SAR wind speeds showed good agreement with in-situ buoy measurements, with a relatively small root-mean-square (RMS) error of 0.67 m/s. Comparisons of the wind vectors from SAR and scatterometer gave RMS errors of 2.16 m/s and 19.24° for the L-band GMF (Geophysical Model Function) 2009 algorithm and 3.62 m/s and 28.02° for the 2007 algorithm, which tended to be somewhat higher than the expected error limits of satellite scatterometer winds. The L-band SAR-derived wind field exhibited a characteristic dependence on wind direction and incidence angle. The previous version (L-band GMF 2007) revealed large errors at small incidence angles of less than 21°. By contrast, the L-band GMF 2009, which improved the treatment of incidence angle in the model function by using a quadratic function instead of a linear relationship, greatly reduced the wind speed error from 6.80 m/s to 1.14 m/s at small incidence angles. This study suggests that the causes of wind retrieval errors should be studied intensively for diverse applications of L-band SAR-derived winds, especially regarding the effects of wind direction and incidence angle and other potential error sources.
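The linear-versus-quadratic incidence-angle distinction between the two GMF versions can be made concrete with a schematic sketch. All coefficients below are placeholders, not the published L-band GMF coefficients, and the wind-direction term is omitted; only the structure of the retrieval (forward model plus brute-force inversion) follows common scatterometry practice.

```python
import numpy as np

def sigma0_db(wind_speed, incidence_deg, quadratic=True):
    """Schematic L-band GMF: backscatter (dB) vs. wind speed and
    incidence angle. The 2007-style model treats incidence angle
    linearly; the 2009-style model adds a quadratic term, which is
    what reduced errors at small incidence angles. Placeholder
    coefficients only.
    """
    a0, a1, a2, b = -28.0, -0.25, 0.004, 10.0
    angle = a1 * incidence_deg + (a2 * incidence_deg**2 if quadratic else 0.0)
    return a0 + angle + b * np.log10(wind_speed)

def retrieve_wind(sigma0_obs, incidence_deg, quadratic=True):
    """Invert the GMF for wind speed by grid search over candidates."""
    candidates = np.linspace(0.5, 30.0, 600)
    model = sigma0_db(candidates, incidence_deg, quadratic)
    return candidates[np.argmin(np.abs(model - sigma0_obs))]

# Wind speed (m/s) retrieved at a 19 deg incidence angle
print(retrieve_wind(-22.0, 19.0))
```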

GPU Based Feature Profile Simulation for Deep Contact Hole Etching in Fluorocarbon Plasma

  • Im, Yeon-Ho;Chang, Won-Seok;Choi, Kwang-Sung;Yu, Dong-Hun;Cho, Deog-Gyun;Yook, Yeong-Geun;Chun, Poo-Reum;Lee, Se-A;Kim, Jin-Tae;Kwon, Deuk-Chul;Yoon, Jung-Sik;Kim, Dae-Woong;You, Shin-Jae
    • Proceedings of the Korean Vacuum Society Conference / 2012.08a / pp.80-81 / 2012
  • Recently, one of the critical issues in the etching processes of nanoscale devices is achieving an ultra-high aspect ratio contact (UHARC) profile without anomalous behaviors such as sidewall bowing and twisting. To achieve this goal, fluorocarbon plasmas, whose major advantage is sidewall passivation, have been used commonly with numerous additives to obtain ideal etch profiles. However, they still suffer from formidable challenges such as tight limits on sidewall bowing and controlling randomly distorted features in the nanoscale etch profile. Furthermore, the absence of available plasma simulation tools has made it difficult to develop revolutionary technologies to overcome these process limitations, including novel plasma chemistries and plasma sources. As an effort to address these issues, we performed fluorocarbon surface kinetic modeling based on experimental plasma diagnostic data for the silicon dioxide etching process in inductively coupled C4F6/Ar/O2 plasmas. For this work, the SiO2 etch rates were investigated with bulk plasma diagnostic tools such as a Langmuir probe, a cutoff probe, and a Quadrupole Mass Spectrometer (QMS). The surface chemistries of the etched samples were measured by X-ray Photoelectron Spectroscopy. To measure plasma parameters in this polymer-depositing environment, a self-cleaning RF Langmuir probe was used to cope with deposition on the probe tip, double-checked by the cutoff probe, which is known to be a precise diagnostic tool for electron density measurement. In addition, neutral and ion fluxes from the bulk plasma were monitored with appearance methods using the QMS signal. Based on these experimental data, we proposed a phenomenological and realistic two-layer surface reaction model of the SiO2 etch process under the overlying polymer passivation layer, considering the material balance of deposition and etching through a steady-state fluorocarbon layer. The predicted surface reaction modeling results showed good agreement with the experimental data. With the above studies of plasma surface reactions, we developed a 3D topography simulator using a multi-layer level set algorithm and a new memory-saving technique suitable for 3D UHARC etch simulation. Ballistic transport of neutral and ion species inside the feature profile was treated by deterministic and Monte Carlo methods, respectively. For ultra-high aspect ratio contact hole etching, it is well known that a huge computational burden is required to treat this ballistic transport realistically. To address this issue, the related computational codes were efficiently parallelized for GPU (Graphics Processing Unit) computing, so that the total computation time improved by more than a few hundred times compared to the serial version. Finally, the 3D topography simulator was integrated with the ballistic transport module and the etch reaction model, and realistic etch-profile simulations accounting for the sidewall polymer passivation layer were demonstrated.
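A toy sketch of the Monte Carlo side of such ballistic transport: sampling ion trajectories into an idealized cylindrical contact hole and tallying the fraction that reaches the bottom rather than the sidewall. The geometry, angular distribution, and absence of wall re-emission are simplifying assumptions; a real simulator would perform this per level-set cell on the GPU.

```python
import numpy as np

rng = np.random.default_rng(0)

def ion_transmission(aspect_ratio, angular_spread_deg, n_ions=100_000):
    """Fraction of ions reaching the bottom of an idealized cylindrical
    hole of given aspect ratio (depth/diameter), assuming a Gaussian
    angular distribution about the surface normal. Toy model only.
    """
    theta = np.abs(rng.normal(0.0, np.radians(angular_spread_deg), n_ions))
    # area-weighted random entry radius, in units of the hole radius
    r = np.sqrt(rng.uniform(0.0, 1.0, n_ions))
    # lateral displacement after traversing the depth (2*AR hole radii)
    lateral = 2.0 * aspect_ratio * np.tan(theta)
    # random azimuth: add displacement to entry position as 2-D vectors
    phi = rng.uniform(0.0, 2 * np.pi, n_ions)
    r_exit = np.hypot(r + lateral * np.cos(phi), lateral * np.sin(phi))
    return np.mean(r_exit <= 1.0)

for ar in (5, 20, 50):
    print(ar, ion_transmission(ar, angular_spread_deg=2.0))
```

Even this toy model shows why UHARC simulation is costly: at high aspect ratios almost all off-normal ions strike the sidewall, so enormous particle counts are needed for converged bottom fluxes, motivating the GPU parallelization described above.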


The Study of Land Surface Change Detection Using Long-Term SPOT/VEGETATION (장기간 SPOT/VEGETATION 정규화 식생지수를 이용한 지면 변화 탐지 개선에 관한 연구)

  • Yeom, Jong-Min;Han, Kyung-Soo;Kim, In-Hwan
    • Journal of the Korean Association of Geographic Information Studies / v.13 no.4 / pp.111-124 / 2010
  • Monitoring land surface change is considered an important research field, since the related parameters bear on land use, climate change, meteorological study, agriculture, surface energy balance, and the surface environment system. Many change detection methods have been presented to deliver more detailed information, with tools ranging from ground-based measurement to satellite multi-spectral sensors. Recently, using high-resolution satellite data has been regarded as the most efficient way to monitor an extensive land environmental system, especially at higher spatial and temporal resolution. In this study, we use satellites with two different spatial resolutions: SPOT/VEGETATION, with 1 km spatial resolution, to detect coarse-resolution area change and determine an objective threshold; and Landsat, with high resolution, to identify detailed land environmental change. According to their spatial resolutions, they show different observation characteristics such as repeat cycle and global coverage. By correlating the two satellites, we can detect land surface change from mid resolution to high resolution. The K-means clustering algorithm is applied to detect the changed area between two images from different dates. When using solar spectral bands, complicated surface reflectance scattering makes surface change detection difficult and can lead to serious problems when interpreting surface characteristics; for example, even though a surface's intrinsic reflectance is constant, the observed value changes with the relative geometry of the sun and the sensor. To reduce these effects, long-term Normalized Difference Vegetation Index (NDVI) data, derived from SPOT/VEGETATION solar spectral channels with atmospheric and bi-directional corrections applied, are utilized to provide an objective threshold for detecting land surface change, since NDVI is less sensitive to solar geometry than the individual solar channels. Surface change detection based on long-term NDVI shows improved results compared with using Landsat alone.
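A minimal sketch of the K-means change-detection step on two-date NDVI imagery. The synthetic grids and the rule for deciding which cluster is "changed" (larger mean absolute difference) are illustrative assumptions; the study additionally derives an objective threshold from the long-term NDVI record, which this sketch omits.

```python
import numpy as np
from sklearn.cluster import KMeans

def detect_change(ndvi_t1, ndvi_t2, n_clusters=2):
    """Cluster the NDVI difference image into changed vs. unchanged
    pixels with K-means. Inputs are 2-D arrays of the same shape.
    """
    diff = (ndvi_t2 - ndvi_t1).reshape(-1, 1)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(diff)
    # the cluster with the larger mean |difference| is called 'changed'
    means = [np.abs(diff[labels == k]).mean() for k in range(n_clusters)]
    return labels.reshape(ndvi_t1.shape) == int(np.argmax(means))

# Example with synthetic 1 km NDVI grids and a simulated cleared patch
rng = np.random.default_rng(1)
t1 = rng.uniform(0.2, 0.8, (100, 100))
t2 = t1.copy()
t2[40:60, 40:60] -= 0.4
print(detect_change(t1, t2).sum(), "pixels flagged as changed")
```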

Latent topics-based product reputation mining (잠재 토픽 기반의 제품 평판 마이닝)

  • Park, Sang-Min;On, Byung-Won
    • Journal of Intelligence and Information Systems / v.23 no.2 / pp.39-70 / 2017
  • Data-driven analytics techniques have recently been applied to public surveys. Instead of simply gathering survey results or expert opinions to research the preference for a recently launched product, enterprises need a way to collect and analyze various types of online data and accurately figure out customer preferences. In the main concept of existing data-based survey methods, the sentiment lexicon for a particular domain is first constructed by domain experts, who judge the positive, neutral, or negative meanings of the words frequently used in the collected text documents. To research the preference for a particular product, the existing approach (1) collects review posts related to the product from several product review web sites; (2) extracts sentences (or phrases) from the collection after pre-processing steps such as stemming and stop-word removal; (3) classifies the polarity (positive or negative) of each sentence (or phrase) based on the sentiment lexicon; and (4) estimates the positive and negative ratios of the product by dividing the numbers of positive and negative sentences (or phrases) by the total number of sentences (or phrases) in the collection. Furthermore, the existing approach automatically finds important sentences (or phrases) carrying positive or negative meaning toward the product. As a motivating example, given a product like the Sonata made by Hyundai Motors, customers often want to see a summary note of the positive and negative points in the 'car design' aspect, and likewise for other aspects such as 'car quality', 'car performance', and 'car service.' Such information will enable customers to make a good choice when they attempt to purchase brand-new vehicles. In addition, automobile makers will be able to figure out the preference and positive/negative points for new models on the market, and in the near future the weak points of the models can be improved based on the sentiment analysis. For this, the existing approach computes the sentiment score of each sentence (or phrase) and then selects the top-k sentences (or phrases) with the highest positive and negative scores. However, the existing approach has several shortcomings that limit its application to real problems: (1) The main aspects (e.g., car design, quality, performance, and service) of a product (e.g., Hyundai Sonata) are not considered. Without aspects, the sentiment analysis merely reports the positive and negative ratios of the product and the top-k sentences (or phrases) with the highest sentiment scores over the entire corpus; this is not enough, and the main aspects of the target product need to be considered. (2) In general, since the same word has different meanings across domains, a sentiment lexicon proper to each domain needs to be constructed; an efficient way to construct the lexicon per domain is required because lexicon construction is labor intensive and time consuming.
To address the above problems, in this article we propose a novel product reputation mining algorithm that (1) extracts topics hidden in the review documents written by customers; (2) mines main aspects based on the extracted topics; (3) measures the positive and negative ratios of the product using the aspects; and (4) presents a digest in which a few important sentences with positive and negative meanings are listed for each aspect. Unlike the existing approach, using hidden topics lets experts construct the sentiment lexicon easily and quickly. Furthermore, by reinforcing topic semantics, we can improve the accuracy of product reputation mining well beyond that of the existing approach. In the experiments, we collected large sets of review documents for the domestic vehicles K5, SM5, and Avante; measured the positive and negative ratios of the three cars; showed top-k positive and negative summaries per aspect; and conducted statistical analysis. Our experimental results clearly show the effectiveness of the proposed method compared with the existing method.
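A compact sketch of the pipeline's shape: extract hidden topics from reviews, treat the dominant topic of each review as its aspect, and compute per-aspect positive/negative counts with a lexicon. The sample reviews, the tiny lexicon, and the one-topic-per-review simplification are invented for illustration; the paper's topic model, aspect mapping, and lexicon construction are more involved.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "sleek design and modern exterior styling",
    "engine performance is strong but fuel economy disappoints",
    "service center was slow and unhelpful",
    "interior quality feels cheap for the price",
]

# 1) extract hidden topics from the review documents
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(reviews)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)

# 2) top words per topic; an expert would map topics to aspects
terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-4:][::-1]]
    print(f"topic {k}: {top}")

# 3) per-aspect sentiment ratios with a tiny illustrative lexicon
lexicon = {"sleek": 1, "modern": 1, "strong": 1,
           "disappoints": -1, "slow": -1, "unhelpful": -1, "cheap": -1}
topic_of = lda.transform(X).argmax(axis=1)  # dominant topic per review
for k in range(3):
    scores = [sum(lexicon.get(w, 0) for w in r.split())
              for r, t in zip(reviews, topic_of) if t == k]
    pos, neg = sum(s > 0 for s in scores), sum(s < 0 for s in scores)
    print(f"aspect/topic {k}: {pos} positive, {neg} negative")
```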

Rear Vehicle Detection Method in Harsh Environment Using Improved Image Information (개선된 영상 정보를 이용한 가혹한 환경에서의 후방 차량 감지 방법)

  • Jeong, Jin-Seong;Kim, Hyun-Tae;Jang, Young-Min;Cho, Sang-Bok
    • Journal of the Institute of Electronics and Information Engineers / v.54 no.1 / pp.96-110 / 2017
  • Most vehicle detection studies that use an existing general or wide-angle lens suffer from a blind spot in the rear detection situation, and the image is vulnerable to noise and a variety of external environments. In this paper, we propose a method that performs detection in harsh external environments with noise, blind spots, etc. First, using a fish-eye lens helps minimize blind spots compared to a wide-angle lens. As the lens angle grows, nonlinear radial distortion also increases, so calibration was applied after initializing and optimizing the distortion constant in order to ensure accuracy. In addition, alongside calibration, the original image was analyzed to remove fog and correct brightness, thereby enabling detection even when visibility is obstructed by light and dark adaptation in foggy situations or by sudden changes in illumination. Fog removal generally takes a considerable amount of time to calculate, so to reduce the calculation time, the well-known fog removal algorithm Dark Channel Prior was used. Gamma correction was used to calibrate brightness, and a brightness and contrast evaluation was conducted on the image to determine the gamma value needed for correction. The evaluation used only a part of the image instead of its entirety in order to reduce calculation time; once the brightness and contrast values were calculated, they were used to decide the gamma value and correct the entire image. The brightness correction and fog removal were processed in parallel, and the results were registered as a single image to minimize the total calculation time. Then the feature extraction method HOG was used to detect the vehicle in the corrected image. As a result, it took 0.064 seconds per frame to detect a vehicle using the proposed image correction, a 7.5% improvement in detection rate compared to the existing vehicle detection method.
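A condensed sketch of the two preprocessing stages combined above, Dark Channel Prior dehazing and gamma-based brightness correction. The patch size, the atmospheric-light estimate from the brightest dark-channel pixels, and the unrefined transmission map are simplifying assumptions; the paper runs these stages in parallel and then applies a HOG-based detector to the corrected image.

```python
import cv2
import numpy as np

def dark_channel(img, patch=15):
    """Dark channel of a BGR image: per-pixel channel minimum followed
    by a local min-filter (He et al.'s Dark Channel Prior)."""
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (patch, patch))
    return cv2.erode(img.min(axis=2), kernel)

def dehaze(img, omega=0.95, t_min=0.1):
    """Simplified Dark Channel Prior dehazing (no transmission-map
    refinement). `img` is a uint8 BGR image."""
    img_f = img.astype(np.float64) / 255.0
    dark = dark_channel(img_f)
    # atmospheric light from the brightest ~0.1% dark-channel pixels
    flat = dark.ravel()
    idx = flat.argsort()[-max(1, flat.size // 1000):]
    A = img_f.reshape(-1, 3)[idx].max(axis=0)
    t = np.clip(1.0 - omega * dark_channel(img_f / A), t_min, 1.0)
    out = (img_f - A) / t[..., None] + A
    return (np.clip(out, 0, 1) * 255).astype(np.uint8)

def gamma_correct(img, gamma):
    """Brightness correction via a gamma lookup table."""
    lut = ((np.arange(256) / 255.0) ** (1.0 / gamma) * 255).astype(np.uint8)
    return cv2.LUT(img, lut)
```

In use, the gamma value would be chosen from a brightness/contrast evaluation of an image sub-region, as the abstract describes, before correcting the full frame.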

Reconstruction of Metabolic Pathway for the Chicken Genome (닭 특이 대사 경로 재확립)

  • Kim, Woon-Su;Lee, Se-Young;Park, Hye-Sun;Baik, Woon-Kee;Lee, Jun-Heon;Seo, Seong-Won
    • Korean Journal of Poultry Science / v.37 no.3 / pp.275-282 / 2010
  • The chicken is an important livestock species, both as a valuable biomedical model and as food for humans, and there is a strong rationale for improving our understanding of the metabolism and physiology of this organism. The first draft of the chicken genome assembly was released in 2004, which enables elaboration of the linkage between genetic and metabolic traits of the chicken. The objectives of this study were thus to reconstruct the metabolic pathways of the chicken genome and to construct a chicken-specific pathway genome database (PGDB). We developed a comprehensive genome database for the chicken by integrating all the known annotations for chicken genes and proteins using a pipeline written in Perl. Based on these comprehensive genome annotations, metabolic pathways of the chicken genome were reconstructed using the PathoLogic algorithm in the Pathway Tools software. We identified a total of 212 metabolic pathways, 2,709 enzymes, 71 transporters, 1,698 enzymatic reactions, 8 transport reactions, and 1,360 compounds in the current chicken genome build, Gallus_gallus-2.1. Comparative metabolic analysis with the human, mouse, and cattle genomes revealed that core metabolic pathways are highly conserved in the chicken genome. The results indicate that the quality of the assembly and annotations of the chicken genome needs to be improved, and that more research is required to improve our understanding of the function of genes and metabolic pathways of avian species. We conclude that the chicken PGDB is useful for studies of avian and chicken metabolism and provides a platform for comparative genomic and metabolic analysis in animal biology and biomedicine.

Target Word Selection Disambiguation using Untagged Text Data in English-Korean Machine Translation (영한 기계 번역에서 미가공 텍스트 데이터를 이용한 대역어 선택 중의성 해소)

  • Kim Yu-Seop;Chang Jeong-Ho
    • The KIPS Transactions:PartB / v.11B no.6 / pp.749-758 / 2004
  • In this paper, we propose a new method that utilizes only a raw corpus, without additional human effort, to disambiguate target word selection in English-Korean machine translation. We use two data-driven techniques: Latent Semantic Analysis (LSA) and Probabilistic Latent Semantic Analysis (PLSA). These techniques can represent complex semantic structures in given contexts, such as text passages. We construct linguistic semantic knowledge using the two techniques and apply that knowledge to target word selection in English-Korean machine translation, utilizing grammatical relationships stored in a dictionary. We use the k-nearest neighbor (k-NN) learning algorithm to resolve the data sparseness problem in target word selection, estimating the distance between instances based on these models. In the experiments, we used TREC AP news data to construct the latent semantic space and the Wall Street Journal corpus to evaluate target word selection. With the latent semantic analysis methods, the accuracy of target word selection improved by over 10%, and PLSA showed better accuracy than LSA. Finally, using correlation calculations, we showed how the accuracy relates to two important factors: the dimensionality of the latent space and the k value of k-NN learning.
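A toy sketch of the LSA-plus-k-NN idea: project context vectors into a low-dimensional latent space and let the nearest labeled context choose the Korean target word for an ambiguous English word. The example contexts for "bank" and its candidate translations (은행 'financial bank' vs. 둑 'river bank') are invented; the paper builds its latent space from TREC AP news and additionally exploits dictionary grammatical relations, which this sketch omits.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy labeled contexts for the ambiguous word "bank"
contexts = [
    ("deposit money at the bank account", "은행"),
    ("the bank raised interest rates", "은행"),
    ("fishing on the bank of the river", "둑"),
    ("the river bank was flooded", "둑"),
]
texts, labels = zip(*contexts)

# LSA: tf-idf context vectors projected into a latent semantic space
vec = TfidfVectorizer()
X = vec.fit_transform(texts)
lsa = TruncatedSVD(n_components=2, random_state=0)
Z = lsa.fit_transform(X)

# k-NN (k=1 here) over the latent space picks the target word
knn = NearestNeighbors(n_neighbors=1).fit(Z)
query = lsa.transform(vec.transform(["she opened an account at the bank"]))
_, idx = knn.kneighbors(query)
print(labels[idx[0][0]])  # expected: 은행
```

The two factors studied in the paper correspond directly to `n_components` (latent dimensionality) and `n_neighbors` (the k of k-NN) in this sketch.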