• Title/Summary/Keyword: web based system

Search Result 5,297, Processing Time 0.037 seconds

An Analysis of IT Trends Using Tweet Data (트윗 데이터를 활용한 IT 트렌드 분석)

  • Yi, Jin Baek;Lee, Choong Kwon;Cha, Kyung Jin
    • Journal of Intelligence and Information Systems
    • /
    • v.21 no.1
    • /
    • pp.143-159
    • /
    • 2015
  • Predicting IT trends has been a long and important subject for information systems research. IT trend prediction makes it possible to acknowledge emerging eras of innovation and allocate budgets to prepare against rapidly changing technological trends. Towards the end of each year, various domestic and global organizations predict and announce IT trends for the following year. For example, Gartner Predicts 10 top IT trend during the next year, and these predictions affect IT and industry leaders and organization's basic assumptions about technology and the future of IT, but the accuracy of these reports are difficult to verify. Social media data can be useful tool to verify the accuracy. As social media services have gained in popularity, it is used in a variety of ways, from posting about personal daily life to keeping up to date with news and trends. In the recent years, rates of social media activity in Korea have reached unprecedented levels. Hundreds of millions of users now participate in online social networks and communicate with colleague and friends their opinions and thoughts. In particular, Twitter is currently the major micro blog service, it has an important function named 'tweets' which is to report their current thoughts and actions, comments on news and engage in discussions. For an analysis on IT trends, we chose Tweet data because not only it produces massive unstructured textual data in real time but also it serves as an influential channel for opinion leading on technology. Previous studies found that the tweet data provides useful information and detects the trend of society effectively, these studies also identifies that Twitter can track the issue faster than the other media, newspapers. Therefore, this study investigates how frequently the predicted IT trends for the following year announced by public organizations are mentioned on social network services like Twitter. IT trend predictions for 2013, announced near the end of 2012 from two domestic organizations, the National IT Industry Promotion Agency (NIPA) and the National Information Society Agency (NIA), were used as a basis for this research. The present study analyzes the Twitter data generated from Seoul (Korea) compared with the predictions of the two organizations to analyze the differences. Thus, Twitter data analysis requires various natural language processing techniques, including the removal of stop words, and noun extraction for processing various unrefined forms of unstructured data. To overcome these challenges, we used SAS IRS (Information Retrieval Studio) developed by SAS to capture the trend in real-time processing big stream datasets of Twitter. The system offers a framework for crawling, normalizing, analyzing, indexing and searching tweet data. As a result, we have crawled the entire Twitter sphere in Seoul area and obtained 21,589 tweets in 2013 to review how frequently the IT trend topics announced by the two organizations were mentioned by the people in Seoul. The results shows that most IT trend predicted by NIPA and NIA were all frequently mentioned in Twitter except some topics such as 'new types of security threat', 'green IT', 'next generation semiconductor' since these topics non generalized compound words so they can be mentioned in Twitter with other words. To answer whether the IT trend tweets from Korea is related to the following year's IT trends in real world, we compared Twitter's trending topics with those in Nara Market, Korea's online e-Procurement system which is a nationwide web-based procurement system, dealing with whole procurement process of all public organizations in Korea. The correlation analysis show that Tweet frequencies on IT trending topics predicted by NIPA and NIA are significantly correlated with frequencies on IT topics mentioned in project announcements by Nara market in 2012 and 2013. The main contribution of our research can be found in the following aspects: i) the IT topic predictions announced by NIPA and NIA can provide an effective guideline to IT professionals and researchers in Korea who are looking for verified IT topic trends in the following topic, ii) researchers can use Twitter to get some useful ideas to detect and predict dynamic trends of technological and social issues.

Latent topics-based product reputation mining (잠재 토픽 기반의 제품 평판 마이닝)

  • Park, Sang-Min;On, Byung-Won
    • Journal of Intelligence and Information Systems
    • /
    • v.23 no.2
    • /
    • pp.39-70
    • /
    • 2017
  • Data-drive analytics techniques have been recently applied to public surveys. Instead of simply gathering survey results or expert opinions to research the preference for a recently launched product, enterprises need a way to collect and analyze various types of online data and then accurately figure out customer preferences. In the main concept of existing data-based survey methods, the sentiment lexicon for a particular domain is first constructed by domain experts who usually judge the positive, neutral, or negative meanings of the frequently used words from the collected text documents. In order to research the preference for a particular product, the existing approach collects (1) review posts, which are related to the product, from several product review web sites; (2) extracts sentences (or phrases) in the collection after the pre-processing step such as stemming and removal of stop words is performed; (3) classifies the polarity (either positive or negative sense) of each sentence (or phrase) based on the sentiment lexicon; and (4) estimates the positive and negative ratios of the product by dividing the total numbers of the positive and negative sentences (or phrases) by the total number of the sentences (or phrases) in the collection. Furthermore, the existing approach automatically finds important sentences (or phrases) including the positive and negative meaning to/against the product. As a motivated example, given a product like Sonata made by Hyundai Motors, customers often want to see the summary note including what positive points are in the 'car design' aspect as well as what negative points are in thesame aspect. They also want to gain more useful information regarding other aspects such as 'car quality', 'car performance', and 'car service.' Such an information will enable customers to make good choice when they attempt to purchase brand-new vehicles. In addition, automobile makers will be able to figure out the preference and positive/negative points for new models on market. In the near future, the weak points of the models will be improved by the sentiment analysis. For this, the existing approach computes the sentiment score of each sentence (or phrase) and then selects top-k sentences (or phrases) with the highest positive and negative scores. However, the existing approach has several shortcomings and is limited to apply to real applications. The main disadvantages of the existing approach is as follows: (1) The main aspects (e.g., car design, quality, performance, and service) to a product (e.g., Hyundai Sonata) are not considered. Through the sentiment analysis without considering aspects, as a result, the summary note including the positive and negative ratios of the product and top-k sentences (or phrases) with the highest sentiment scores in the entire corpus is just reported to customers and car makers. This approach is not enough and main aspects of the target product need to be considered in the sentiment analysis. (2) In general, since the same word has different meanings across different domains, the sentiment lexicon which is proper to each domain needs to be constructed. The efficient way to construct the sentiment lexicon per domain is required because the sentiment lexicon construction is labor intensive and time consuming. To address the above problems, in this article, we propose a novel product reputation mining algorithm that (1) extracts topics hidden in review documents written by customers; (2) mines main aspects based on the extracted topics; (3) measures the positive and negative ratios of the product using the aspects; and (4) presents the digest in which a few important sentences with the positive and negative meanings are listed in each aspect. Unlike the existing approach, using hidden topics makes experts construct the sentimental lexicon easily and quickly. Furthermore, reinforcing topic semantics, we can improve the accuracy of the product reputation mining algorithms more largely than that of the existing approach. In the experiments, we collected large review documents to the domestic vehicles such as K5, SM5, and Avante; measured the positive and negative ratios of the three cars; showed top-k positive and negative summaries per aspect; and conducted statistical analysis. Our experimental results clearly show the effectiveness of the proposed method, compared with the existing method.

Finding Weighted Sequential Patterns over Data Streams via a Gap-based Weighting Approach (발생 간격 기반 가중치 부여 기법을 활용한 데이터 스트림에서 가중치 순차패턴 탐색)

  • Chang, Joong-Hyuk
    • Journal of Intelligence and Information Systems
    • /
    • v.16 no.3
    • /
    • pp.55-75
    • /
    • 2010
  • Sequential pattern mining aims to discover interesting sequential patterns in a sequence database, and it is one of the essential data mining tasks widely used in various application fields such as Web access pattern analysis, customer purchase pattern analysis, and DNA sequence analysis. In general sequential pattern mining, only the generation order of data element in a sequence is considered, so that it can easily find simple sequential patterns, but has a limit to find more interesting sequential patterns being widely used in real world applications. One of the essential research topics to compensate the limit is a topic of weighted sequential pattern mining. In weighted sequential pattern mining, not only the generation order of data element but also its weight is considered to get more interesting sequential patterns. In recent, data has been increasingly taking the form of continuous data streams rather than finite stored data sets in various application fields, the database research community has begun focusing its attention on processing over data streams. The data stream is a massive unbounded sequence of data elements continuously generated at a rapid rate. In data stream processing, each data element should be examined at most once to analyze the data stream, and the memory usage for data stream analysis should be restricted finitely although new data elements are continuously generated in a data stream. Moreover, newly generated data elements should be processed as fast as possible to produce the up-to-date analysis result of a data stream, so that it can be instantly utilized upon request. To satisfy these requirements, data stream processing sacrifices the correctness of its analysis result by allowing some error. Considering the changes in the form of data generated in real world application fields, many researches have been actively performed to find various kinds of knowledge embedded in data streams. They mainly focus on efficient mining of frequent itemsets and sequential patterns over data streams, which have been proven to be useful in conventional data mining for a finite data set. In addition, mining algorithms have also been proposed to efficiently reflect the changes of data streams over time into their mining results. However, they have been targeting on finding naively interesting patterns such as frequent patterns and simple sequential patterns, which are found intuitively, taking no interest in mining novel interesting patterns that express the characteristics of target data streams better. Therefore, it can be a valuable research topic in the field of mining data streams to define novel interesting patterns and develop a mining method finding the novel patterns, which will be effectively used to analyze recent data streams. This paper proposes a gap-based weighting approach for a sequential pattern and amining method of weighted sequential patterns over sequence data streams via the weighting approach. A gap-based weight of a sequential pattern can be computed from the gaps of data elements in the sequential pattern without any pre-defined weight information. That is, in the approach, the gaps of data elements in each sequential pattern as well as their generation orders are used to get the weight of the sequential pattern, therefore it can help to get more interesting and useful sequential patterns. Recently most of computer application fields generate data as a form of data streams rather than a finite data set. Considering the change of data, the proposed method is mainly focus on sequence data streams.

Preliminary Report of the $1998{\sim}1999$ Patterns of Care Study of Radiation Therapy for Esophageal Cancer in Korea (식도암 방사선 치료에 대한 Patterns of Care Study ($1998{\sim}1999$)의 예비적 결과 분석)

  • Hur, Won-Joo;Choi, Young-Min;Lee, Hyung-Sik;Kim, Jeung-Kee;Kim, Il-Han;Lee, Ho-Jun;Lee, Kyu-Chan;Kim, Jung-Soo;Chun, Mi-Son;Kim, Jin-Hee;Ahn, Yong-Chan;Kim, Sang-Gi;Kim, Bo-Kyung
    • Radiation Oncology Journal
    • /
    • v.25 no.2
    • /
    • pp.79-92
    • /
    • 2007
  • [ $\underline{Purpose}$ ]: For the first time, a nationwide survey in the Republic of Korea was conducted to determine the basic parameters for the treatment of esophageal cancer and to offer a solid cooperative system for the Korean Pattern of Care Study database. $\underline{Materials\;and\;Methods}$: During $1998{\sim}1999$, biopsy-confirmed 246 esophageal cancer patients that received radiotherapy were enrolled from 23 different institutions in South Korea. Random sampling was based on power allocation method. Patient parameters and specific information regarding tumor characteristics and treatment methods were collected and registered through the web based PCS system. The data was analyzed by the use of the Chi-squared test. $\underline{Results}$: The median age of the collected patients was 62 years. The male to female ratio was about 91 to 9 with an absolute male predominance. The performance status ranged from ECOG 0 to 1 in 82.5% of the patients. Diagnostic procedures included an esophagogram (228 patients, 92.7%), endoscopy (226 patients, 91.9%), and a chest CT scan (238 patients, 96.7%). Squamous cell carcinoma was diagnosed in 96.3% of the patients; mid-thoracic esophageal cancer was most prevalent (110 patients, 44.7%) and 135 patients presented with clinical stage III disease. Fifty seven patients received radiotherapy alone and 37 patients received surgery with adjuvant postoperative radiotherapy. Half of the patients (123 patients) received chemotherapy together with RT and 70 patients (56.9%) received it as concurrent chemoradiotherapy. The most frequently used chemotherapeutic agent was a combination of cisplatin and 5-FU. Most patients received radiotherapy either with 6 MV (116 patients, 47.2%) or with 10 MV photons (87 patients, 35.4%). Radiotherapy was delivered through a conventional AP-PA field for 206 patients (83.7%) without using a CT plan and the median delivered dose was 3,600 cGy. The median total dose of postoperative radiotherapy was 5,040 cGy while for the non-operative patients the median total dose was 5,970 cGy. Thirty-four patients received intraluminal brachytherapy with high dose rate Iridium-192. Brachytherapy was delivered with a median dose of 300 cGy in each fraction and was typically delivered $3{\sim}4\;times$. The most frequently encountered complication during the radiotherapy treatment was esophagitis in 155 patients (63.0%). $\underline{Conclusion}$: For the evaluation and treatment of esophageal cancer patients at radiation facilities in Korea, this study will provide guidelines and benchmark data for the solid cooperative systems of the Korean PCS. Although some differences were noted between institutions, there was no major difference in the treatment modalities and RT techniques.

System Development for Measuring Group Engagement in the Art Center (공연장에서 다중 몰입도 측정을 위한 시스템 개발)

  • Ryu, Joon Mo;Choi, Il Young;Choi, Lee Kwon;Kim, Jae Kyeong
    • Journal of Intelligence and Information Systems
    • /
    • v.20 no.3
    • /
    • pp.45-58
    • /
    • 2014
  • The Korean Culture Contents spread out to Worldwide, because the Korean wave is sweeping in the world. The contents stand in the middle of the Korean wave that we are used it. Each country is ongoing to keep their Culture industry improve the national brand and High added value. Performing contents is important factor of arousal in the enterprise industry. To improve high arousal confidence of product and positive attitude by populace is one of important factor by advertiser. Culture contents is the same situation. If culture contents have trusted by everyone, they will give information their around to spread word-of-mouth. So, many researcher study to measure for person's arousal analysis by statistical survey, physiological response, body movement and facial expression. First, Statistical survey has a problem that it is not possible to measure each person's arousal real time and we cannot get good survey result after they watched contents. Second, physiological response should be checked with surround because experimenter sets sensors up their chair or space by each of them. Additionally it is difficult to handle provided amount of information with real time from their sensor. Third, body movement is easy to get their movement from camera but it difficult to set up experimental condition, to measure their body language and to get the meaning. Lastly, many researcher study facial expression. They measures facial expression, eye tracking and face posed. Most of previous studies about arousal and interest are mostly limited to reaction of just one person and they have problems with application multi audiences. They have a particular method, for example they need room light surround, but set limits only one person and special environment condition in the laboratory. Also, we need to measure arousal in the contents, but is difficult to define also it is not easy to collect reaction by audiences immediately. Many audience in the theater watch performance. We suggest the system to measure multi-audience's reaction with real-time during performance. We use difference image analysis method for multi-audience but it weaks a dark field. To overcome dark environment during recoding IR camera can get the photo from dark area. In addition we present Multi-Audience Engagement Index (MAEI) to calculate algorithm which sources from sound, audience' movement and eye tracking value. Algorithm calculates audience arousal from the mobile survey, sound value, audience' reaction and audience eye's tracking. It improves accuracy of Multi-Audience Engagement Index, we compare Multi-Audience Engagement Index with mobile survey. And then it send the result to reporting system and proposal an interested persons. Mobile surveys are easy, fast, and visitors' discomfort can be minimized. Also additional information can be provided mobile advantage. Mobile application to communicate with the database, real-time information on visitors' attitudes focused on the content stored. Database can provide different survey every time based on provided information. The example shown in the survey are as follows: Impressive scene, Satisfied, Touched, Interested, Didn't pay attention and so on. The suggested system is combine as 3 parts. The system consist of three parts, External Device, Server and Internal Device. External Device can record multi-Audience in the dark field with IR camera and sound signal. Also we use survey with mobile application and send the data to ERD Server DB. The Server part's contain contents' data, such as each scene's weights value, group audience weights index, camera control program, algorithm and calculate Multi-Audience Engagement Index. Internal Device presents Multi-Audience Engagement Index with Web UI, print and display field monitor. Our system is test-operated by the Mogencelab in the DMC display exhibition hall which is located in the Sangam Dong, Mapo Gu, Seoul. We have still gotten from visitor daily. If we find this system audience arousal factor with this will be very useful to create contents.

A Survey of Korean Consumers' Awareness on Animal Welfare of Laying Hens (산란계 동물복지에 대한 국내 소비자의 인지도 조사)

  • Hong, Eui-Chul;Kang, Hwan-Ku;Park, Ki-Tae;Jeon, Jin-Joo;Kim, Hyun-Soo;Kim, Chan-Ho;Kim, Sang-Ho
    • Korean Journal of Poultry Science
    • /
    • v.45 no.3
    • /
    • pp.219-228
    • /
    • 2018
  • This study was conducted twice to investigate egg purchase behavior and perception on animal welfare of Korean consumers. This study included women, who were the main decision makers and caretakers in the household, and men with one-person household. This survey was conducted with by the Computer Assisted Web Interview and Gang Survey methods. On the key considerations factor, the highest response rate was considered to be 'price', and the response rate of considering 'packing date' increased in the second survey. At a reasonable price based on 10 eggs, the response rate was the highest at 53.8% and 42.9% in both the first and second surveys and the appropriate price averages were 2,482 won and 2,132 won, respectively. The highest rate of purchase of egg consumers from 'Large Mart' followed by 'Medium sized supermarket' and 'Chain supermarket'. As for the awareness about animal welfare, the recognition ratio (73.5%) was higher in the result of the second survey than the first. The cognitive period of animal welfare was 59.0% before the insecticide egg crisis and 41.0% thereafter. Regarding whether or not they have ever seen an animal welfare certification mark and an animal welfare animal farm certification mark, 59.6% of respondents said that they saw it for the first time and 37.6% answered that they knew the animal welfare certification mark. On the animal welfare system, the 'free-range' response rate was the highest at 85.8%. The 'free-range' fit response decreased by 34.2%p, while the 'barn' and 'European type' fit response increased by 13.2%p and 24.1%p, respectively. The number of 'I have never seen' and 'I have ever eaten' responses to the recognition and eating experience of animal welfare certified eggs decreased while the number of those who answered 'Have ever seen' and 'Have eaten' increased. The answer of purchasing animal welfare certified eggs at department stores, organic farming cooperatives, and internet shopping malls was higher than that of buying conventional eggs. Of the total respondents, 92.0% were willing to purchase an animal welfare egg before the price was offered, but after offering the prices of animal welfare eggs, the intention to purchase was 62.7%, which was about 30%p lower than before. The reason for purchasing an animal welfare certified egg was the highest score of 71.0% for 'I think it is likely to be high in food safety', and 38.1% for 'I think the price is high' for lack of intention to purchase. In the sensory evaluation of animal welfare eggs, egg color and skin texture of conventional eggs were significantly higher than those of certified welfare eggs (P<0.05), and boiled eggs showed that egg whites of animal welfare certified eggs were more (P<0.05). As a result, the results of this study will contribute to the activation of the animal welfare certification system for laying hens by providing basic data on consumer awareness to animal welfare certified farmers.

Effects of firm strategies on customer acquisition of Software as a Service (SaaS) providers: A mediating and moderating role of SaaS technology maturity (SaaS 기업의 차별화 및 가격전략이 고객획득성과에 미치는 영향: SaaS 기술성숙도 수준의 매개효과 및 조절효과를 중심으로)

  • Chae, SeongWook;Park, Sungbum
    • Journal of Intelligence and Information Systems
    • /
    • v.20 no.3
    • /
    • pp.151-171
    • /
    • 2014
  • Firms today have sought management effectiveness and efficiency utilizing information technologies (IT). Numerous firms are outsourcing specific information systems functions to cope with their short of information resources or IT experts, or to reduce their capital cost. Recently, Software-as-a-Service (SaaS) as a new type of information system has become one of the powerful outsourcing alternatives. SaaS is software deployed as a hosted and accessed over the internet. It is regarded as the idea of on-demand, pay-per-use, and utility computing and is now being applied to support the core competencies of clients in areas ranging from the individual productivity area to the vertical industry and e-commerce area. In this study, therefore, we seek to quantify the value that SaaS has on business performance by examining the relationships among firm strategies, SaaS technology maturity, and business performance of SaaS providers. We begin by drawing from prior literature on SaaS, technology maturity and firm strategy. SaaS technology maturity is classified into three different phases such as application service providing (ASP), Web-native application, and Web-service application. Firm strategies are manipulated by the low-cost strategy and differentiation strategy. Finally, we considered customer acquisition as a business performance. In this sense, specific objectives of this study are as follows. First, we examine the relationships between customer acquisition performance and both low-cost strategy and differentiation strategy of SaaS providers. Secondly, we investigate the mediating and moderating effects of SaaS technology maturity on those relationships. For this purpose, study collects data from the SaaS providers, and their line of applications registered in the database in CNK (Commerce net Korea) in Korea using a questionnaire method by the professional research institution. The unit of analysis in this study is the SBUs (strategic business unit) in the software provider. A total of 199 SBUs is used for analyzing and testing our hypotheses. With regards to the measurement of firm strategy, we take three measurement items for differentiation strategy such as the application uniqueness (referring an application aims to differentiate within just one or a small number of target industry), supply channel diversification (regarding whether SaaS vendor had diversified supply chain) as well as the number of specialized expertise and take two items for low cost strategy like subscription fee and initial set-up fee. We employ a hierarchical regression analysis technique for testing moderation effects of SaaS technology maturity and follow the Baron and Kenny's procedure for determining if firm strategies affect customer acquisition through technology maturity. Empirical results revealed that, firstly, when differentiation strategy is applied to attain business performance like customer acquisition, the effects of the strategy is moderated by the technology maturity level of SaaS providers. In other words, securing higher level of SaaS technology maturity is essential for higher business performance. For instance, given that firms implement application uniqueness or a distribution channel diversification as a differentiation strategy, they can acquire more customers when their level of SaaS technology maturity is higher rather than lower. Secondly, results indicate that pursuing differentiation strategy or low cost strategy effectively works for SaaS providers' obtaining customer, which means that continuously differentiating their service from others or making their service fee (subscription fee or initial set-up fee) lower are helpful for their business success in terms of acquiring their customers. Lastly, results show that the level of SaaS technology maturity mediates the relationships between low cost strategy and customer acquisition. That is, based on our research design, customers usually perceive the real value of the low subscription fee or initial set-up fee only through the SaaS service provide by vender and, in turn, this will affect their decision making whether subscribe or not.

Comparative Study on the Methodology of Motor Vehicle Emission Calculation by Using Real-Time Traffic Volume in the Kangnam-Gu (자동차 대기오염물질 산정 방법론 설정에 관한 비교 연구 (강남구의 실시간 교통량 자료를 이용하여))

  • 박성규;김신도;이영인
    • Journal of Korean Society of Transportation
    • /
    • v.19 no.4
    • /
    • pp.35-47
    • /
    • 2001
  • Traffic represents one of the largest sources of primary air pollutants in urban area. As a consequence. numerous abatement strategies are being pursued to decrease the ambient concentration of pollutants. A characteristic of most of the these strategies is a requirement for accurate data on both the quantity and spatial distribution of emissions to air in the form of an atmospheric emission inventory database. In the case of traffic pollution, such an inventory must be compiled using activity statistics and emission factors for vehicle types. The majority of inventories are compiled using passive data from either surveys or transportation models and by their very nature tend to be out-of-date by the time they are compiled. The study of current trends are towards integrating urban traffic control systems and assessments of the environmental effects of motor vehicles. In this study, a methodology of motor vehicle emission calculation by using real-time traffic data was studied. A methodology for estimating emissions of CO at a test area in Seoul. Traffic data, which are required on a street-by-street basis, is obtained from induction loops of traffic control system. It was calculated speed-related mass of CO emission from traffic tail pipe of data from traffic system, and parameters are considered, volume, composition, average velocity, link length. And, the result was compared with that of a method of emission calculation by VKT(Vehicle Kilometer Travelled) of vehicles of category.

  • PDF

Pre-Evaluation for Prediction Accuracy by Using the Customer's Ratings in Collaborative Filtering (협업필터링에서 고객의 평가치를 이용한 선호도 예측의 사전평가에 관한 연구)

  • Lee, Seok-Jun;Kim, Sun-Ok
    • Asia pacific journal of information systems
    • /
    • v.17 no.4
    • /
    • pp.187-206
    • /
    • 2007
  • The development of computer and information technology has been combined with the information superhighway internet infrastructure, so information widely spreads not only in special fields but also in the daily lives of people. Information ubiquity influences the traditional way of transaction, and leads a new E-commerce which distinguishes from the existing E-commerce. Not only goods as physical but also service as non-physical come into E-commerce. As the scale of E-Commerce is being enlarged as well. It keeps people from finding information they want. Recommender systems are now becoming the main tools for E-Commerce to mitigate the information overload. Recommender systems can be defined as systems for suggesting some Items(goods or service) considering customers' interests or tastes. They are being used by E-commerce web sites to suggest products to their customers who want to find something for them and to provide them with information to help them decide which to purchase. There are several approaches of recommending goods to customer in recommender system but in this study, the main subject is focused on collaborative filtering technique. This study presents a possibility of pre-evaluation for the prediction performance of customer's preference in collaborative filtering before the process of customer's preference prediction. Pre-evaluation for the prediction performance of each customer having low performance is classified by using the statistical features of ratings rated by each customer is conducted before the prediction process. In this study, MovieLens 100K dataset is used to analyze the accuracy of classification. The classification criteria are set by using the training sets divided 80% from the 100K dataset. In the process of classification, the customers are divided into two groups, classified group and non classified group. To compare the prediction performance of classified group and non classified group, the prediction process runs the 20% test set through the Neighborhood Based Collaborative Filtering Algorithm and Correspondence Mean Algorithm. The prediction errors from those prediction algorithm are allocated to each customer and compared with each user's error. Research hypothesis : Two research hypotheses are formulated in this study to test the accuracy of the classification criterion as follows. Hypothesis 1: The estimation accuracy of groups classified according to the standard deviation of each user's ratings has significant difference. To test the Hypothesis 1, the standard deviation is calculated for each user in training set which is divided 80% from MovieLens 100K dataset. Four groups are classified according to the quartile of the each user's standard deviations. It is compared to test the estimation errors of each group which results from test set are significantly different. Hypothesis 2: The estimation accuracy of groups that are classified according to the distribution of each user's ratings have significant differences. To test the Hypothesis 2, the distributions of each user's ratings are compared with the distribution of ratings of all customers in training set which is divided 80% from MovieLens 100K dataset. It assumes that the customers whose ratings' distribution are different from that of all customers would have low performance, so six types of different distributions are set to be compared. The test groups are classified into fit group or non-fit group according to the each type of different distribution assumed. The degrees in accordance with each type of distribution and each customer's distributions are tested by the test of ${\chi}^2$ goodness-of-fit and classified two groups for testing the difference of the mean of errors. Also, the degree of goodness-of-fit with the distribution of each user's ratings and the average distribution of the ratings in the training set are closely related to the prediction errors from those prediction algorithms. Through this study, the customers who have lower performance of prediction than the rest in the system are classified by those two criteria, which are set by statistical features of customers ratings in the training set, before the prediction process.

Building a Korean Sentiment Lexicon Using Collective Intelligence (집단지성을 이용한 한글 감성어 사전 구축)

  • An, Jungkook;Kim, Hee-Woong
    • Journal of Intelligence and Information Systems
    • /
    • v.21 no.2
    • /
    • pp.49-67
    • /
    • 2015
  • Recently, emerging the notion of big data and social media has led us to enter data's big bang. Social networking services are widely used by people around the world, and they have become a part of major communication tools for all ages. Over the last decade, as online social networking sites become increasingly popular, companies tend to focus on advanced social media analysis for their marketing strategies. In addition to social media analysis, companies are mainly concerned about propagating of negative opinions on social networking sites such as Facebook and Twitter, as well as e-commerce sites. The effect of online word of mouth (WOM) such as product rating, product review, and product recommendations is very influential, and negative opinions have significant impact on product sales. This trend has increased researchers' attention to a natural language processing, such as a sentiment analysis. A sentiment analysis, also refers to as an opinion mining, is a process of identifying the polarity of subjective information and has been applied to various research and practical fields. However, there are obstacles lies when Korean language (Hangul) is used in a natural language processing because it is an agglutinative language with rich morphology pose problems. Therefore, there is a lack of Korean natural language processing resources such as a sentiment lexicon, and this has resulted in significant limitations for researchers and practitioners who are considering sentiment analysis. Our study builds a Korean sentiment lexicon with collective intelligence, and provides API (Application Programming Interface) service to open and share a sentiment lexicon data with the public (www.openhangul.com). For the pre-processing, we have created a Korean lexicon database with over 517,178 words and classified them into sentiment and non-sentiment words. In order to classify them, we first identified stop words which often quite likely to play a negative role in sentiment analysis and excluded them from our sentiment scoring. In general, sentiment words are nouns, adjectives, verbs, adverbs as they have sentimental expressions such as positive, neutral, and negative. On the other hands, non-sentiment words are interjection, determiner, numeral, postposition, etc. as they generally have no sentimental expressions. To build a reliable sentiment lexicon, we have adopted a concept of collective intelligence as a model for crowdsourcing. In addition, a concept of folksonomy has been implemented in the process of taxonomy to help collective intelligence. In order to make up for an inherent weakness of folksonomy, we have adopted a majority rule by building a voting system. Participants, as voters were offered three voting options to choose from positivity, negativity, and neutrality, and the voting have been conducted on one of the largest social networking sites for college students in Korea. More than 35,000 votes have been made by college students in Korea, and we keep this voting system open by maintaining the project as a perpetual study. Besides, any change in the sentiment score of words can be an important observation because it enables us to keep track of temporal changes in Korean language as a natural language. Lastly, our study offers a RESTful, JSON based API service through a web platform to make easier support for users such as researchers, companies, and developers. Finally, our study makes important contributions to both research and practice. In terms of research, our Korean sentiment lexicon plays an important role as a resource for Korean natural language processing. In terms of practice, practitioners such as managers and marketers can implement sentiment analysis effectively by using Korean sentiment lexicon we built. Moreover, our study sheds new light on the value of folksonomy by combining collective intelligence, and we also expect to give a new direction and a new start to the development of Korean natural language processing.