Search | Korea Science

Online news-based stock price forecasting considering homogeneity in the industrial sector (산업군 내 동질성을 고려한 온라인 뉴스 기반 주가예측)

Seong, Nohyoon;Nam, Kihwan
- Journal of Intelligence and Information Systems / v.24 no.2 / pp.1-19 / 2018
Because forecasting stock movements matters both academically and practically, stock price prediction has been studied actively. Stock price forecasting research is classified by whether it uses structured or unstructured data and, in more detail, is divided into technical analysis, fundamental analysis, and media effect analysis. In the big data era, research that combines stock price prediction with big data is actively underway, and such research mainly relies on machine learning techniques applied to large volumes of data. Methods that incorporate media effects have recently attracted particular attention, and among them studies that analyze online news and use it to forecast stock prices are becoming mainstream. Previous studies that predict stock prices from online news mostly perform sentiment analysis of the news, building a separate corpus for each company and constructing a dictionary that predicts stock prices by recording how past stock prices responded. Existing studies have therefore examined the impact of online news on individual companies. For example, stock movements of Samsung Electronics are predicted using only online news about Samsung Electronics. A method that considers influences among highly related companies has also been studied recently. For example, stock movements of Samsung Electronics are predicted using news about Samsung Electronics and a highly related company such as LG Electronics. These previous studies examine the effects of news from a homogeneous industrial sector on the individual company. In these studies, homogeneous industries are classified according to the Global Industry Classification Standard (GICS). In other words, the existing studies assume that industries defined by GICS are homogeneous.
However, existing studies are limited in that they do not account for influential, highly relevant companies or for heterogeneity within the same Global Industry Classification Standard (GICS) sector. Our examination of various sectors shows that some industrial sectors are not homogeneous groups. To overcome this limitation, our study proposes a methodology that applies k-means clustering to reflect the heterogeneous effects within an industrial sector that affect stock prices. Multiple Kernel Learning is mainly used to integrate data with different characteristics: it has several kernels, each of which receives different data and makes predictions from it. We used Multiple Kernel Learning to incorporate the effects of the target firm and its relevant firms simultaneously. Each kernel was assigned to predict stock prices from financial news variables of one group of firms, with the groups defined by the target firm, its industrial sector, and k-means cluster analysis. To show that the proposed methodology is appropriate, we conducted experiments on three years of online news and stock prices. The results are as follows. (1) We confirmed that news about the industrial sector related to the target company also contains meaningful information for predicting the target company's stock movements, and that the machine learning algorithm has better predictive power when news about the relevant companies and the target company is considered together. (2) It is important to vary the number of clusters used for prediction according to the level of homogeneity in the industrial sector. In other words, when stock prices within an industrial sector are homogeneous, it is important to use the relational effect at the industry-group level without clustering, or with a small number of clusters.
When stock prices within an industry group are heterogeneous, it is important to cluster the firms into groups. This study contributes by showing that firms classified under the Global Industry Classification Standard are heterogeneous and by suggesting that relevance should be defined through machine learning and statistical analysis rather than simply by that classification. It also contributes by demonstrating the efficiency of a prediction model that reflects this heterogeneity.
https://doi.org/10.13088/jiis.2018.24.2.001
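The clustering step described in the abstract above can be illustrated with a minimal sketch. It assumes each firm in a sector is represented by a short vector of daily returns and groups firms with plain Lloyd's k-means. The firm names, the return values, and the choice of k = 2 are hypothetical, and the paper's actual features and its Multiple Kernel Learning stage are not reproduced here.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_vector(members)
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels


# Hypothetical daily returns for four firms in the same sector.
returns = {
    "FIRM_A": [0.010, 0.020, -0.010, 0.015],
    "FIRM_B": [0.012, 0.018, -0.008, 0.014],
    "FIRM_C": [-0.020, -0.010, 0.030, -0.025],
    "FIRM_D": [-0.018, -0.012, 0.028, -0.022],
}
names = list(returns)
labels = kmeans([returns[name] for name in names], k=2)
clusters = dict(zip(names, labels))
```

In this toy data, firms A and B move together and firms C and D move together, so the two pairs end up in different clusters. News from a firm's own cluster, rather than from the whole sector, could then feed one of the kernels.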

Analysis Corrosion Products Formed on the Great Buddha Image of Kotokuin Temple in Kamakura (고덕원 국보 동조아미타여래좌상의 표면에 생성한 부식생성물의 해석)

Matsuda Shiro;Aoki Shigeo;Kang, Dai-il
- 보존과학연구 / s.17 / pp.161-182 / 1996
In a natural atmosphere, copper and copper alloys have long been used for Buddha statues and ornaments of historic buildings because these metals are corrosion resistant to some extent and because the patina formed on their surface is aesthetically pleasing. In an atmosphere polluted by $SO_x$ and $NO_x$, however, the patina layer does not act as a protective film and allows the metal to be damaged. Since 1992, the Tokyo National Research Institute of Cultural Properties (TNRICP) has studied the influence of atmospheric pollution on metal cultural properties kept in the open air. The Great Buddha Image, located in Kamakura about 50 km west of Tokyo, was selected as one of the objects of study because it is made of copper alloy and has stood exposed to the air for several hundred years. Another reason is that many cultural properties are located in its surroundings. We have analysed the components and structure of the corrosion products formed on the surface of the Buddha, carried out exposure tests using alloy samples that simulate the composition of the Great Image, and monitored the climate and air pollution in order to discuss the relation between the corrosion of metals in the open air and atmospheric conditions. In this paper, the authors describe the components and structure of the corrosion products formed on the surface of the Great Image, as determined by X-ray fluorescence spectroscopy and X-ray diffraction. The conclusions are as follows. (1) Sulfate patina composed mainly of brochantite was detected on all sides of the Image, with more patina found on the back of the Image, which faces north. (2) Antlerite was detected on the back and on a part of the left side, which faces west, and its formation was considered to be closely related to the harmful atmosphere.
(3) A large amount of chloride patina composed mainly of atacamite was observed on the front, which faces south. (4) Carbonate patina composed mainly of malachite was detected in areas where brochantite was also often detected. This suggests that malachite had been transformed into brochantite by the deteriorated atmosphere. (5) On all sides of the Image, patina was observed together with copper oxides composed mainly of cuprous oxide. This shows that the surface of the Image consists of two layers: an inner layer of oxide and an outer layer of patina. (6) Corrosion products of lead, a component of the copper alloy, were detected on all sides: the main lead product found on the front was chlorophosphate, whereas the one on the back was sulfate.

Prediction of Correct Answer Rate and Identification of Significant Factors for CSAT English Test Based on Data Mining Techniques (데이터마이닝 기법을 활용한 대학수학능력시험 영어영역 정답률 예측 및 주요 요인 분석)

Park, Hee Jin;Jang, Kyoung Ye;Lee, Youn Ho;Kim, Woo Je;Kang, Pil Sung
- KIPS Transactions on Software and Data Engineering / v.4 no.11 / pp.509-520 / 2015
The College Scholastic Ability Test (CSAT) is the primary test for evaluating the academic achievement of high-school students and is used by most universities in South Korea for admission decisions. Because its level of difficulty is a significant issue for both students and universities, the government makes a huge effort to keep the difficulty level consistent every year. However, the actual level of difficulty has fluctuated significantly, which causes many problems with university admission. In this paper, unlike traditional methods that depend on experts' judgments, we build two types of data-driven prediction models from accumulated CSAT test data to predict the correct answer rate and to identify significant factors for the CSAT English test. First, we derive candidate question-specific factors that can influence the correct answer rate, such as position, EBS-relation, and readability, from 10 years of annual CSAT practice tests and CSATs. In addition, we derive context-specific factors by employing topic modeling, which identifies the underlying topics in the text. Then, the correct answer rate is predicted by multiple linear regression and the level of difficulty is predicted by a classification tree. The experimental results show that the difficulty-level (difficult/easy) classification model achieves 90% accuracy, whereas the error rate for the correct answer rate is below 16%. Points and problem category are found to be critical for predicting the correct answer rate. In addition, the correct answer rate is also influenced by some of the topics discovered by topic modeling. Based on our study, it will be possible to predict the range of the expected correct answer rate at both the question level and the whole-test level, which will help CSAT examiners control the level of difficulty.
https://doi.org/10.3745/KTSDE.2015.4.11.509

Study on the Current Status of Smart Garden (스마트가든의 인식경향에 관한 연구)

Woo, Kyung-Sook;Suh, Joo-Hwan
- Journal of the Korean Institute of Landscape Architecture / v.49 no.2 / pp.51-60 / 2021
Modern society is becoming more informed and intelligent with the development of digital technology, in which humans, objects, and networks relate to each other. In step with these changes, garden systems have emerged that make it easy to supply the ideal temperature, humidity, sunlight, and moisture conditions for growing plants. This study therefore attempted to grasp the concept, perception, and trends of smart gardens, a recent concept. To this end, previous studies were reviewed and text mining was used, and the results are as follows. First, the core characteristic of smart gardens is that they are new gardens in which IoT technology and gardening techniques are fused in indoor and outdoor spaces, a result of technological development and changes in people's lifestyles. As technology advances and the importance of the environment increases, smart gardens are becoming a reality because of the need for living spaces where humans and nature can co-exist. The advent of smart gardens can help revitalize gardens in response to changes in garden-related industries and people's lifestyles. Second, in current research related to smart gardens and users' experiences, the technical aspects of smart gardens attract the most interest. People value smart garden functions and technical aspects that enable a safe, comfortable, and convenient life, and subjective uses are emerging depending on individual tastes and comfort with digital devices. Third, in terms of usage behavior, smart gardens are mainly used in indoor spaces, where edible plants are grown. Because of the growing importance of the environment and concerns about climate change and a possible food crisis, users tend to prefer cultivating food-related plants, but expanding garden functions with various technologies that allow flowers to be grown can also satisfy users' needs.
In addition, because users feel that the shapes of smart gardens are new and sophisticated, design is an essential factor in satisfying users. Currently, smart gardens are developing in terms of technology. However, the main component of the smart garden is the combination of humans, nature, and technology, rather than simply connecting potted plants to smart devices to grow plants conveniently. It strengthens connectivity with various city services and smart homes. Smart gardens interact with the landscape of the architect's ideas rather than reproducing nature through science and technology. Therefore, a design that considers the functions of the garden and the needs of users is necessary. In addition, providing smart gardens to citizens in indoor spaces, urban parks, and public facilities makes it possible to share the functions of communication and gardening across generations, to reach those who do not enjoy 'smart' services because of age, and to bridge the digital device and information gap. Smart gardens have potential as a new landscaping space.
https://doi.org/10.9715/KILA.2021.49.2.051

Prediction of a hit drama with a pattern analysis on early viewing ratings (초기 시청시간 패턴 분석을 통한 대흥행 드라마 예측)

Nam, Kihwan;Seong, Nohyoon
- Journal of Intelligence and Information Systems / v.24 no.4 / pp.33-49 / 2018
The success of a TV drama has a very large impact on TV ratings and on the effectiveness of channel promotion. Its cultural and business impact has also been demonstrated through the Korean Wave. Early prediction of a drama's blockbuster success is therefore very important from the strategic perspective of the media industry. Previous studies have tried to predict audience ratings and drama success using various methods. However, most of them made simple predictions from intuitive factors such as the lead actor and time slot, which limits their predictive power. In this study, we propose a model that predicts the popularity of a drama by analyzing viewers' viewing patterns on the basis of various theories. This is a theoretical contribution and also a practical one, since the model can be used by actual broadcasting companies. We collected data on 280 TV mini-series dramas broadcast on terrestrial channels over the 10 years from 2003 to 2012. From these data, we selected the most highly ranked and the least highly ranked 45 TV dramas and analyzed their viewing patterns in 11 steps. The assumptions and conditions for modeling are based on existing studies, on the opinions of actual broadcasters, and on data mining techniques. We then developed a prediction model by measuring the viewing-time distance (difference) with the Euclidean and correlation methods; in our study, the sum of distances is termed similarity. Using this similarity measure, we predicted the success of dramas from viewers' initial viewing-time pattern distribution over episodes 1 to 5. To check whether the model is sensitive to the measurement method, we applied various distance measures and checked the model's robustness. Once the model was established, we made it more predictive using a grid search.
Furthermore, we classified viewers who had watched more than 70% of a TV drama's total airtime as "passionate viewers" when a new drama is broadcast. We then compared the percentage of passionate viewers between the most highly ranked and the least highly ranked dramas, so that we can determine the possibility of a blockbuster TV mini-series. We find that the initial viewing-time pattern is the key factor in predicting blockbuster dramas. With the initial viewing-time pattern analysis, our model correctly classified blockbuster dramas with 75.47% accuracy. This paper shows a high prediction rate while suggesting an audience rating method different from existing ones. Currently, broadcasters rely heavily on a few famous actors, the so-called star system, and they face more severe competition than ever because of rising production costs of broadcasting programs, a long-term recession, and aggressive investment in comprehensive programming channels and large corporations. Everyone is in a financially difficult situation. The basic revenue model of these broadcasters is advertising, and advertising is executed with audience ratings as the basic index. The drama market is uncertain in that demand is difficult to forecast because of the nature of the commodity, while dramas contribute heavily to the financial success of a broadcasting company's content. It is therefore necessary to minimize the risk of failure. Analyzing the distribution of initial viewing time can thus be of practical help to the companies concerned in establishing a response strategy (organization, marketing, story changes, etc.). In this paper, we also found that audience behavior is crucial to the success of a program. We define TV viewing as a measure of how enthusiastically TV is watched.
We can predict the success of a program by calculating the loyalty of these passionate viewers. This way of calculating loyalty can also be applied to various platforms. It can also be used for marketing programs such as highlights, script previews, making-of films, characters, games, and other marketing projects.
https://doi.org/10.13088/jiis.2018.24.4.033
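The distance-based comparison described in the abstract above can be sketched roughly as follows. The sketch assumes each drama is summarized by the share of viewers falling into each viewing-time bin over the early episodes, and it labels a new drama by whichever reference group (hits or flops) has the smaller sum of distances. The five-bin patterns, the group contents, and the function names are hypothetical, and the two reference groups are assumed to be the same size so that the sums are comparable. The paper's 11-step analysis and grid search are not reproduced.

```python
import math


def euclidean_distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def correlation_distance(p, q):
    """1 - Pearson correlation; assumes neither pattern is constant."""
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    norm_p = math.sqrt(sum((a - mean_p) ** 2 for a in p))
    norm_q = math.sqrt(sum((b - mean_q) ** 2 for b in q))
    return 1.0 - cov / (norm_p * norm_q)


def total_distance(pattern, group, distance):
    # Sum of distances from one pattern to every member of a reference group.
    return sum(distance(pattern, member) for member in group)


def predict(pattern, hits, flops, distance=euclidean_distance):
    """Label a new drama by the reference group it is closer to in total."""
    to_hits = total_distance(pattern, hits, distance)
    to_flops = total_distance(pattern, flops, distance)
    return "hit" if to_hits < to_flops else "flop"


# Hypothetical viewer shares per viewing-time bin (shortest to longest).
hits = [[0.10, 0.10, 0.15, 0.25, 0.40], [0.12, 0.08, 0.15, 0.30, 0.35]]
flops = [[0.40, 0.25, 0.15, 0.10, 0.10], [0.35, 0.30, 0.15, 0.12, 0.08]]
new_drama = [0.15, 0.10, 0.15, 0.25, 0.35]

label_euclidean = predict(new_drama, hits, flops)
label_correlation = predict(new_drama, hits, flops, correlation_distance)
```

Swapping the `distance` argument is one simple way to check whether the label depends on the choice of measure, which is the kind of sensitivity check the abstract mentions.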

Asbestos Trend in Korea from 1918 to 2027 Using Text Mining Techniques in a Big Data Environment (빅데이터환경에서 텍스트마이닝 기법을 활용한 한국의 석면 트렌드 (1918년~2027년))

Yul Roh;Hyeonyi Jeong;Byungno Park;Chaewon Kim;Yumi Kim;Mina Seo;Haengsoo Shin;Hyunwook Kim;Yeji Sung
- Economic and Environmental Geology / v.56 no.4 / pp.457-473 / 2023
Asbestos has been produced, imported, and used in various industries in Korea over the past decades. Since asbestos causes fatal diseases such as malignant mesothelioma and lung cancer, its use has been generally banned in Korea since 2009. However, there are still many asbestos-containing materials around us, and safe management is urgently needed. This study examines changes in asbestos-related trends using major asbestos-related keywords, based on an asbestos trend analysis of big data covering the past 32 years (1991 to 2022) in Korea. In addition, we reviewed both domestic trends related to the production, import, and use of asbestos before 1990 and asbestos-related policies from 2023 to 2027. From 1991 to 2000, the main keywords related to asbestos were research, workers, carcinogens, and the environment, because the carcinogenicity of asbestos was highlighted due to the domestic production, import, and use of asbestos. From 2001 to 2010, the main keywords related to asbestos were lung cancer, litigation, carcinogens, exposure, and companies, because lawsuits over asbestos carcinogenicity were initiated in the US and Japan. From 2011 to 2020, the high-ranking keywords related to asbestos were carcinogen, baseball field, school, slate, building, and abandoned asbestos mine, reflecting the seriousness of the asbestos problem in Korea. From 2021 to the present (2023), the main asbestos-related search keywords were school, slate (asbestos cement), buildings, landscape stone, environmental impact assessment, apartment, and cement.
https://doi.org/10.9719/EEG.2023.56.4.457

Investigating Topics of Incivility Related to COVID-19 on Twitter: Analysis of Targets and Keywords of Hate Speech (트위터에서의 COVID-19와 관련된 반시민성 주제 탐색: 혐오 대상 및 키워드 분석)

Kim, Kyuli;Oh, Chanhee;Zhu, Yongjun
- Journal of the Korean Society for information Management / v.39 no.1 / pp.331-350 / 2022
This study aims to understand topics of incivility related to COVID-19 by analyzing Twitter posts that include COVID-19-related hate speech. To this end, a total of 63,802 tweets created between December 1st, 2019, and August 31st, 2021, covering three targets of hate speech (region and public facilities, groups of people, and religion), were analyzed. Frequency analysis, dynamic topic modeling, and keyword co-occurrence network analysis were used to explore topics and keywords. 1) Results of frequency analysis revealed that hate against regions and public facilities showed a relatively increasing trend while hate against specific groups of people and religion showed a relatively decreasing trend. 2) Results of dynamic topic modeling analysis showed keywords of each of the three targets of hate speech. Keywords of the region and public facilities included "Daegu, Gyeongbuk local hate", "interregional hate", and "public facility hate"; groups of people included "China hate", "virus spreaders", and "outdoor activity sanctions"; and religion included "Shincheonji", "Christianity", "religious infection", "refusal of quarantine", and "places visited by confirmed cases". 3) Similarly, results of keyword co-occurrence network analysis revealed keywords of three targets: region and public facilities (Corona, Daegu, confirmed cases, Shincheonji, Gyeongbuk, region); specific groups of people (Coronavirus, Wuhan pneumonia, Wuhan, China, Chinese, People, Entry, Banned); and religion (Corona, Church, Daegu, confirmed cases, infection). This study attempted to grasp public opinion on incivility related to COVID-19 by identifying domestic COVID-19 hate targets and keywords using social media. In particular, it is meaningful in that it grasps public opinion on incivility topics and hate emotions expressed on social media by applying data mining techniques to COVID-19-related hate, which has not been attempted in previous studies.
In addition, the results of this study have practical implications in that they can serve as basic data for establishing systems and policies for cultural communication measures in preparation for the post-COVID-19 era.
https://doi.org/10.3743/KOSIM.2022.39.1.331
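The keyword co-occurrence analysis mentioned in the abstract above rests on counting how often two keywords appear in the same post. A minimal sketch of that counting step follows. It assumes the tweets have already been tokenized into keyword lists; the three example token lists are hypothetical and only reuse keywords named in the abstract, and the function name is illustrative.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count how often each unordered keyword pair shares a document."""
    counts = Counter()
    for tokens in documents:
        # Deduplicate and sort so each pair has one canonical ordering.
        unique = sorted(set(tokens))
        counts.update(combinations(unique, 2))
    return counts


# Hypothetical, already-tokenized posts.
tweets = [
    ["Corona", "Daegu", "Shincheonji"],
    ["Corona", "Daegu", "confirmed cases"],
    ["Corona", "Church", "infection"],
]
edges = cooccurrence_counts(tweets)
strongest_pair, strongest_weight = edges.most_common(1)[0]
```

Each counted pair can be read as a weighted edge of the co-occurrence network, with the count as the edge weight.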

An Analysis of the Internal Marketing Impact on the Market Capitalization Fluctuation Rate based on the Online Company Reviews from Jobplanet (직원을 위한 내부마케팅이 기업의 시가 총액 변동률에 미치는 영향 분석: 잡플래닛 기업 리뷰를 중심으로)

Kichul Choi;Sang-Yong Tom Lee
- Information Systems Review / v.20 no.2 / pp.39-62 / 2018
Thanks to the growth of computing power and the recent development of data analytics, researchers have started to work on the data produced by users through the Internet or social media. This study is in line with these recent research trends and adopts data analytic techniques. We focus on the impact of "internal marketing" factors on firm performance, which is typically studied through survey methodologies. We looked into the job review platform Jobplanet (www.jobplanet.co.kr), a website where current and former employees anonymously review companies and their management. Through web crawling, we collected over 40K data points and performed morphological analysis to classify employees' reviews into internal marketing data. We then used econometric analysis to examine the relationship between internal marketing and market capitalization. Contrary to the findings of extant survey studies, internal marketing is positively related to a firm's market capitalization only within a limited area. In most of the areas, the relationships are negative. In particular, a female-friendly environment and human resource development (HRD) are the areas exhibiting positive relations with market capitalization in the manufacturing industry. In the service industry, most of the areas, such as employee welfare and work-life balance, are negatively related to market capitalization. When firm size is small (or the history is short), a female-friendly environment positively affects firm performance. In contrast, when firm size is large (or the history is long), most of the internal marketing factors are either negative or insignificant. We explain the theoretical contributions and managerial implications of these results.
https://doi.org/10.14329/isr.2018.20.2.039

A Study of Factors Associated with Software Developers Job Turnover (데이터마이닝을 활용한 소프트웨어 개발인력의 업무 지속수행의도 결정요인 분석)

Jeon, In-Ho;Park, Sun W.;Park, Yoon-Joo
- Journal of Intelligence and Information Systems / v.21 no.2 / pp.191-204 / 2015
According to the '2013 Performance Assessment Report on the Financial Program' from the National Assembly Budget Office, the unfilled recruitment ratio of software (SW) developers in South Korea was 25% in the 2012 fiscal year. Moreover, the unfilled recruitment ratio of highly qualified SW developers reaches almost 80%. This phenomenon is more pronounced in small and medium enterprises with fewer than 300 employees. Young job-seekers in South Korea increasingly avoid becoming SW developers, and even current SW developers want to change careers, which hinders the national development of the IT industry. The Korean government has recently recognized the problem and implemented policies to foster young SW developers. Thanks to this effort, it has become easier to find young, entry-level SW developers. However, many IT companies still find it hard to recruit highly qualified SW developers, because becoming an expert SW developer requires long-term experience. Thus, improving the job continuity intentions of current SW developers is more important than fostering new ones. This study therefore surveyed the job continuity intentions of SW developers and analyzed the factors associated with them. We carried out a survey from September 2014 to October 2014, targeting 130 SW developers working in the IT industry in South Korea. We gathered the demographic information and characteristics of the respondents, the work environment of the SW industry, and the social position of SW developers. A regression analysis and a decision tree method were then applied to the data. These two methods are widely used data mining techniques that are explainable and mutually complementary. We first performed a linear regression to find the important factors associated with the job continuity intention of SW developers.
The results showed that the 'expected age' to work as a SW developer was the most significant factor associated with the job continuity intention. We suppose that the major cause of this phenomenon is a structural problem of the IT industry in South Korea, which requires SW developers to move from development to management as they are promoted. In addition, the 'motivation' to become a SW developer and the 'personality (introverted tendency)' of a SW developer are highly important factors associated with the job continuity intention. Next, the decision tree method was applied to extract the characteristics of highly motivated developers and less motivated ones. We used the well-known C4.5 algorithm for the decision tree analysis. The results showed that 'motivation', 'personality', and 'expected age' were again important factors influencing job continuity intentions, consistent with the regression analysis. In addition, the 'ability to learn' new technology was a crucial factor in the decision rules for job continuity. In other words, a person with a high ability to learn new technology tends to work as a SW developer for a longer period. The decision rules also showed that the 'social position' of SW developers and the 'prospects' of the SW industry were minor factors influencing job continuity intentions. On the other hand, 'type of employment (regular position/non-regular position)' and 'type of company (ordering company/service providing company)' did not affect the job continuity intention in either method. In this research, we examined the job continuity intentions of SW developers actually working at IT companies in South Korea and analyzed the factors associated with them. These results can be used for human resource management in IT companies when recruiting or fostering highly qualified SW experts.
It can also help to build SW developer fostering policy and to solve the problem of unfilled recruitment of SW Developers in South Korea.
https://doi.org/10.13088/jiis.2015.21.2.191
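The abstract above names C4.5 as the decision tree algorithm. C4.5 ranks candidate split attributes by gain ratio, that is, information gain divided by the split information of the attribute. The sketch below computes that criterion for categorical attributes only. The four survey-style rows and the attribute names are toy values invented for illustration, not the study's data, and a full C4.5 implementation (continuous attributes, pruning, missing values) is out of scope.

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def gain_ratio(rows, attribute, target):
    """Information gain of `attribute` divided by its split information."""
    total = len(rows)
    base = entropy([row[target] for row in rows])
    # Group the target labels by the value of the candidate attribute.
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    remainder = sum(
        len(part) / total * entropy(part) for part in partitions.values()
    )
    split_info = entropy([row[attribute] for row in rows])
    if split_info == 0:  # attribute has a single value; no useful split
        return 0.0
    return (base - remainder) / split_info


# Toy rows (hypothetical): does the developer intend to stay in the job?
rows = [
    {"motivation": "high", "employment": "regular", "stay": "yes"},
    {"motivation": "high", "employment": "non-regular", "stay": "yes"},
    {"motivation": "low", "employment": "regular", "stay": "no"},
    {"motivation": "low", "employment": "non-regular", "stay": "no"},
]
ratio_motivation = gain_ratio(rows, "motivation", "stay")
ratio_employment = gain_ratio(rows, "employment", "stay")
```

In this toy table, 'motivation' separates the classes perfectly and 'employment' carries no information, so the tree would split on 'motivation' first.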

A New Approach to Automatic Keyword Generation Using Inverse Vector Space Model (키워드 자동 생성에 대한 새로운 접근법: 역 벡터공간모델을 이용한 키워드 할당 방법)

Cho, Won-Chin;Rho, Sang-Kyu;Yun, Ji-Young Agnes;Park, Jin-Soo
- Asia pacific journal of information systems / v.21 no.1 / pp.103-122 / 2011
Recently, numerous documents have been made available electronically. Internet search engines and digital libraries commonly return query results containing hundreds or even thousands of documents. In this situation, it is virtually impossible for users to examine complete documents to determine whether they might be useful. For this reason, some online documents are accompanied by a list of keywords specified by the authors to guide users and facilitate filtering. A set of keywords is thus often considered a condensed version of the whole document and therefore plays an important role in document retrieval, Web page retrieval, document clustering, summarization, text mining, and so on. Since many academic journals ask authors to provide a list of five or six keywords on the first page of an article, keywords are most familiar in the context of journal articles. However, many other types of documents could also benefit from the use of keywords, including Web pages, email messages, news reports, magazine articles, and business papers. Although the potential benefit is large, implementation is the obstacle: manually assigning keywords to all documents is a daunting, even impractical, task because it is extremely tedious and time-consuming and requires a certain level of domain knowledge. It is therefore highly desirable to automate the keyword generation process. There are two main approaches to this aim: keyword assignment and keyword extraction. Both use machine learning methods and require, for training purposes, a set of documents with keywords already attached. In the former approach, there is a given vocabulary, and the aim is to match its terms to the texts. In other words, the keyword assignment approach seeks to select the words from a controlled vocabulary that best describe a document.
Although this approach is domain dependent and is not easy to transfer and expand, it can generate implicit keywords that do not appear in a document. In the latter approach, by contrast, the aim is to extract keywords according to their relevance in the text without a prior vocabulary. Here, automatic keyword generation is treated as a classification task, and keywords are commonly extracted using supervised learning techniques. Thus, keyword extraction algorithms classify candidate keywords in a document into positive or negative examples. Several systems, such as Extractor and Kea, were developed using the keyword extraction approach. The most indicative words in a document are selected as its keywords, so keyword extraction is limited to terms that appear in the document and cannot generate implicit keywords that are not included in it. According to Turney's experimental results, about 64% to 90% of the keywords assigned by authors can be found in the full text of an article. Conversely, 10% to 36% of the author-assigned keywords do not appear in the article and cannot be generated by keyword extraction algorithms. Our preliminary experiment also shows that 37% of the keywords assigned by authors are not included in the full text. This is why we decided to adopt the keyword assignment approach. In this paper, we propose a new approach to automatic keyword assignment, IVSM (Inverse Vector Space Model). The model is based on the vector space model, a conventional information retrieval model that represents documents and queries as vectors in a multidimensional space. IVSM generates an appropriate keyword set for a specific document by measuring the distance between the document and the keyword sets.
The keyword assignment process of IVSM is as follows: (1) calculating the vector length of each keyword set based on each keyword weight; (2) preprocessing and parsing a target document that does not have keywords; (3) calculating the vector length of the target document based on term frequency; (4) measuring the cosine similarity between each keyword set and the target document; and (5) generating keywords that have high similarity scores. Two keyword generation systems applying IVSM were implemented: an IVSM system for a Web-based community service and a stand-alone IVSM system. The first is implemented in a community service for sharing knowledge and opinions on current trends such as fashion, movies, social problems, and health information. The stand-alone IVSM system is dedicated to generating keywords for academic papers and has been tested on a number of academic papers, including those published by the Korean Association of Shipping and Logistics, the Korea Research Academy of Distribution Information, the Korea Logistics Society, the Korea Logistics Research Association, and the Korea Port Economic Association. We measured the performance of IVSM by the number of matches between the IVSM-generated keywords and the author-assigned keywords. In our experiment, the precision of IVSM applied to the Web-based community service and to academic journals was 0.75 and 0.71, respectively. Both systems perform much better than baseline systems that generate keywords based on simple probability. IVSM also shows performance comparable to Extractor, a representative keyword extraction system developed by Turney. As electronic documents increase, we expect that the IVSM proposed in this paper can be applied to many electronic documents in Web-based communities and digital libraries.
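Steps (1) to (5) above can be sketched roughly as follows. The sketch assumes that each candidate keyword is represented by a dictionary of term weights and that the target document is reduced to raw term frequencies by whitespace tokenization. The keyword labels, weights, sample sentence, and function names are all hypothetical; how IVSM actually builds and weights its keyword-set vectors is not specified in the abstract.

```python
import math
from collections import Counter


def vector_length(weights):
    # Steps (1) and (3): Euclidean length of a term-weight vector.
    return math.sqrt(sum(w * w for w in weights.values()))


def cosine_similarity(a, b):
    # Step (4): dot product over shared terms, normalized by both lengths.
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    length = vector_length(a) * vector_length(b)
    return dot / length if length else 0.0


def assign_keywords(document_text, keyword_sets, top_n=1):
    """Rank keyword sets by cosine similarity to the document's term counts."""
    # Step (2): a deliberately simple preprocessing and parsing stand-in.
    doc_vector = Counter(document_text.lower().split())
    ranked = sorted(
        keyword_sets,
        key=lambda name: cosine_similarity(keyword_sets[name], doc_vector),
        reverse=True,
    )
    # Step (5): keep the keywords with the highest similarity scores.
    return ranked[:top_n]


# Hypothetical keyword sets with illustrative term weights.
keyword_sets = {
    "logistics": {"port": 2.0, "shipping": 3.0, "cargo": 1.0},
    "fashion": {"dress": 2.0, "style": 3.0, "trend": 1.0},
}
document = "Shipping delays at the port raised cargo costs"
assigned = assign_keywords(document, keyword_sets)
```

Note that the assigned keyword "logistics" never appears in the sample sentence, which illustrates the ability of the assignment approach to produce implicit keywords.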
