• Title/Summary/Keyword: texts


Kim Eung-hwan's Official Excursion for Drawing Scenic Spots in 1788 and his Album of Complete Views of Seas and Mountains (1788년 김응환의 봉명사경과 《해악전도첩(海嶽全圖帖)》)

  • Oh, Dayun
    • MISULJARYO - National Museum of Korea Art Journal
    • /
    • v.96
    • /
    • pp.54-88
    • /
    • 2019
  • The Album of Complete Views of Seas and Mountains comprises sixty real scenery landscape paintings depicting Geumgangsan Mountain, the Haegeumgang River, and the eight scenic views of the Gwandong region, as well as fifty-one pieces of writing. It is a rare example in terms of its size and painting style. The paintings in this album, which are densely packed with natural features, follow the painting style of the Southern School yet employ crude and unconventional elements. In them, stones on the mountains are depicted both geometrically and three-dimensionally. Since 1973, parts of this album have been published in some exhibition catalogues. The entire album was opened to the public at the special exhibition "Through the Eyes of Joseon Painters: Real Scenery Landscapes of Korea" held at the National Museum of Korea in 2019. The Album of Complete Views of Seas and Mountains was attributed to Kim Eung-hwan (1742-1789) due to the signature on the final leaf of the album and the seal reading "Bokheon"(the painter's pen name) on the currently missing album leaf of Chilbodae Peaks. However, there is a strong possibility that this signature and seal may have been added later. This paper intends to reexamine the creator of this album based on a variety of related factors. In order to understand the production background of the Album of Complete Views of Seas and Mountains, I investigated the eighteenth-century tradition of drawing scenic spots while travelling, in which scenery was depicted during private travels or official excursions. Jeong Seon(1676-1759), Sim Sa-jeong(1707-1769), Kim Yun-gyeom(1711-1775), Choe Buk(1712-after 1786), and Kang Se-hwang(1713-1791) all went on a journey to Geumgangsan Mountain, the most famous travel destination in the late Joseon period, and created paintings of the mountain, including Album of Pungak Mountain in the Sinmyo Year(1711) by Jeong Seon. These painters presented their versions of the traditional scenic spots of Inner Geumgangsan and newly depicted vistas they discovered for themselves. To commemorate their private visits, they produced paintings for their fellow travelers or sponsors in an album format that could include several scenes. While the production of paintings of private travels to Geumgangsan Mountain increased, King Jeongjo(r. 1776-1800) ordered Kim Eung-hwan and Kim Hong-do, court painters at the Dohwaseo(Royal Bureau of Painting), to paint scenic spots in the nine counties of the Yeongdong region and around Geumgangsan Mountain. King Jeongjo selected these two as the painters for the official excursion taking into account their relationship, their administrative experience as regional officials, and their distinct painting styles. Starting in the reign of King Yeongjo(r. 1724-1776), Kim Eung-hwan and Kim Hong-do served as court painters at the Dohwaseo, maintained a close relationship as senior and junior colleagues, and served as chalbang(chiefs in charge of post stations) in the Yeongnam region. While Kim Hong-do was proficient at applying soft and delicate brushstrokes, Kim Eung-hwan was skilled at depicting the beauty of robust and luxuriant landscapes. Both painters produced about 100 scenes of original drawings over the fifty days of the official excursion. Based on these original drawings, they created around seventy album leaves or handscrolls. 
Their paintings enriched the tradition of depicting scenic spots, particularly Inner Geumgang and the eight scenic views of Gwandong around Geumgangsan Mountain, during private journeys in the eighteenth century. Moreover, they newly discovered places of scenic beauty in the Outer Geumgang and Yeongdong regions, establishing them as new painting themes. The Album of Complete Views of Seas and Mountains consists of four volumes. Volumes I and II include twenty-nine paintings of Inner Geumgangsan; Volume III, seventeen scenes of Outer Geumgangsan; and Volume IV, fourteen images of Maritime Geumgangsan and the eight scenic views of Gwandong. These paintings produced on silk show crowded compositions, geometrical depictions of the stones and the mountains, and distinct presentation of the rocky peaks of Geumgangsan Mountain using white and grayish-blue pigments. This album reflects the Joseon painting style of the mid- and late eighteenth century, integrating influences from Jeong Seon, Kang Se-hwang, Sim Sa-jeong, Jeong Chung-yeop(1725-after 1800), and Kim Hong-do. In particular, some paintings in the album show similarities to Kim Hong-do's Album of Famous Mountains in Korea in terms of their compositions and painterly motifs. However, "Yeongrangho Lake," "Haesanjeong Pavilion," and "Wolsongjeong Pavilion" in Kim Eung-hwan's album differ from the versions by Kim Hong-do. Thus, Kim Eung-hwan was influenced by Kim Hong-do, but produced his own distinctive album. The Album of Complete Views of Seas and Mountains includes scenery of "Jaundam Pool," "Baegundae Peak," "Viewing Birobong Peak at Anmunjeom groove," and "Baekjeongbong Peak," all of which are not depicted in other albums. In his version, Kim Eung-hwan portrayed the characteristics of the natural features in each scenic spot in a detailed and refreshing manner. Moreover, he illustrated stones on the mountains using geometric shapes and added a sense of three-dimensionality using lines and planes. Based on the painting traditions of the Southern School, he established his own characteristics. He also turned natural features into triangular or rectangular chunks. All sixty paintings in this album appear rough and unconventional, but maintain their internal consistency. Each of the fifty-one writings included in the Album of Complete Views of Seas and Mountains is paired with a painting of a scenic spot. It explains the depicted landscape, thus helping viewers to understand and appreciate the painting. Intimately linked to each painting, the related text notes information on traveling from one scenic spot to the next, the origins of the place names, geographic features, and other related information. Such encyclopedic documentation began in the early nineteenth century and was common in painting albums of Geumgangsan Mountain in the mid-nineteenth century. The text following the painting of Baekhwaam Hermitage in the Album of Complete Views of Seas and Mountains documents the reconstruction of the Baekhwaam Hermitage in 1845, which provides crucial evidence for dating the text. Therefore, the owner of the Album of Complete Views of Seas and Mountains might have written the texts or asked someone else to transcribe them in the mid- or late nineteenth century. In this paper, I have inferred the producer of the Album of Complete Views of Seas and Mountains to be Kim Eung-hwan based on the painting style and the tradition of drawing scenic spots during official trips. 
Moreover, its affinity with the Handscroll of Pungak Mountain created by Kim Ha-jong(1793-after 1878) after 1865 is another decisive factor in attributing the album to Kim Eung-hwan. In contrast to the Album of Famous Mountains in Korea by Kim Hong-do, the Album of Complete Views of Seas and Mountains exerted only a minor influence on other painters. The Handscroll of Pungak Mountain by Kim Ha-jong is the sole example that employs the subject matter from the Album of Complete Views of Seas and Mountains and follows its painting style. In the Handscroll of Pungak Mountain, Kim Ha-jong demonstrated a painting style completely different from that in the Album of Seas and Mountains that he had produced fifty years earlier, in 1816, for Yi Gwang-mun, the magistrate of Chuncheon. He emphasized the idea of "scholar thoughts" by following the compositions, painterly elements, and depictions of figures in the painting manual style from Kim Eung-hwan's Album of Complete Views of Seas and Mountains. Kim Ha-jong, a member of the Gaeseong Kim clan and the eldest grandson of Kim Eung-hwan, is presumed to have appreciated the paintings in the Album of Complete Views of Seas and Mountains, which had been passed down within the family, and newly transformed them. Furthermore, the contents and narrative styles of Yi Yu-won's writings attached to the paintings in the Handscroll of Pungak Mountain are similar to those of the fifty-one writings in Kim Eung-hwan's album. This suggests a possible influence of the inscriptions in Kim Eung-hwan's album, or the original texts from which these inscriptions were quoted, on the writings in Kim Ha-jong's handscroll. However, a closer examination will be needed to determine the order of the transcription of the writings. The Album of Complete Views of Seas and Mountains differs from Kim Hong-do's paintings of his official trips and other painting albums he influenced. This album is a significant artwork in that it broadens the understanding of the art world of Kim Eung-hwan and illustrates another layer of real scenery landscape paintings in the late eighteenth century.

Influence analysis of Internet buzz to corporate performance : Individual stock price prediction using sentiment analysis of online news (온라인 언급이 기업 성과에 미치는 영향 분석 : 뉴스 감성분석을 통한 기업별 주가 예측)

  • Jeong, Ji Seon;Kim, Dong Sung;Kim, Jong Woo
    • Journal of Intelligence and Information Systems
    • /
    • v.21 no.4
    • /
    • pp.37-51
    • /
    • 2015
  • Due to the development of internet technology and the rapid increase of internet data, various studies are actively conducted on how to use and analyze internet data for various purposes. In particular, in recent years, a number of studies have applied text mining techniques in order to overcome the limitations of analyses restricted to structured data. Especially, various studies use sentiment analysis to score opinions based on the polarity, such as positivity or negativity, of the words or sentences in documents. As a part of such studies, this study tries to predict the ups and downs of companies' stock prices by performing sentiment analysis on online news about those companies. A variety of news on companies is produced online by different economic agents, and it is diffused quickly and accessed easily on the Internet. So, based on the inefficient market hypothesis, we can expect that news information about an individual company can be used to predict the fluctuations of that company's stock price if we apply proper data analysis techniques. However, since companies operate in different areas of business, machine-learning-based analysis of text data must consider the characteristics of each company. In addition, since news containing positive or negative information on certain companies affects other companies or industry fields in various ways, prediction needs to be carried out for each individual company's stock price. Therefore, this study attempted to predict changes in the stock prices of individual companies by applying sentiment analysis to online news data. Accordingly, this study chose top companies in the KOSPI 200 as the subjects of analysis, and collected and analyzed two years of online news data on each company from Naver, a representative domestic search portal service. In addition, considering that the meanings of words differ across economic subjects, the study aims to improve performance by building a lexicon for each individual company and applying it to the analysis. As a result of the analysis, prediction accuracy differs by company, and the prediction accuracy turned out to be 56% on average. Comparing the accuracy of stock price prediction across industry sectors, 'energy/chemical', 'consumer goods for living', and 'consumer discretionary' showed relatively higher prediction accuracy than other industries, while sectors such as 'information technology' and 'shipbuilding/transportation' had lower prediction accuracy. Since the number of representative companies collected in each industry was five, it is somewhat difficult to generalize, but it could be confirmed that prediction accuracy differs depending on the industry sector. In addition, at the individual company level, companies such as 'Kangwon Land', 'KT&G', and 'SK Innovation' showed relatively higher prediction accuracy compared to other companies, while companies such as 'Young Poong', 'LG', 'Samsung Life Insurance', and 'Doosan' had a low prediction accuracy of less than 50%. 
In this paper, we analyzed stock price movements against company-level predictions made with pre-built, company-specific sentiment lexicons in order to take advantage of online news information, aiming to improve stock price prediction performance through prediction for individual companies. Building on this, future work can increase prediction accuracy by addressing the problem of unnecessary words being added to the sentiment dictionary.
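
The pipeline described above (company-specific lexicon, news scoring, up/down prediction) can be illustrated with a minimal sketch. The lexicon entries, article texts, and the simple sign-of-the-sum decision rule below are hypothetical placeholders rather than the authors' actual dictionaries or model:

```python
# Minimal sketch of lexicon-based direction prediction for one company.
# Lexicon terms, weights, and the toy articles are hypothetical; the study
# builds a separate lexicon per company from two years of Naver news.

# Hypothetical company-specific sentiment lexicon (term -> polarity weight).
lexicon = {"record profit": 1.0, "expansion": 0.5, "lawsuit": -1.0, "recall": -0.8}

def score_article(text: str) -> float:
    """Sum the polarity weights of lexicon terms that occur in the article."""
    t = text.lower()
    return sum(w for term, w in lexicon.items() if term in t)

def predict_direction(daily_articles: list) -> str:
    """Aggregate one trading day's article scores and predict up or down."""
    total = sum(score_article(a) for a in daily_articles)
    return "up" if total > 0 else "down"

# Toy evaluation against known next-day movements (hypothetical data).
days = [
    (["Company posts record profit", "Analysts praise overseas expansion"], "up"),
    (["Regulator files lawsuit", "Large-scale product recall announced"], "down"),
]
correct = sum(predict_direction(arts) == actual for arts, actual in days)
print(f"accuracy: {correct / len(days):.2f}")
```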

A Study of 'Emotion Trigger' by Text Mining Techniques (텍스트 마이닝을 이용한 감정 유발 요인 'Emotion Trigger'에 관한 연구)

  • An, Juyoung;Bae, Junghwan;Han, Namgi;Song, Min
    • Journal of Intelligence and Information Systems
    • /
    • v.21 no.2
    • /
    • pp.69-92
    • /
    • 2015
  • The explosion of social media data has led researchers to apply text-mining techniques to analyze big social media data in a more rigorous manner. Even though social media text analysis algorithms have improved, previous approaches to social media text analysis have some limitations. In the field of sentiment analysis of social media written in Korean, there are two typical approaches. One is the linguistic approach using machine learning, which is the most common approach. Some studies have been conducted by adding grammatical factors to feature sets for training classification models. The other approach adopts semantic analysis for sentiment analysis, but it has mainly been applied to English texts. To overcome these limitations, this study applies the Word2Vec algorithm, an extension of neural network algorithms, to deal with more extensive semantic features that were underestimated in existing sentiment analysis. The result of adopting the Word2Vec algorithm is compared to the result of co-occurrence analysis to identify the difference between the two approaches. The results show that the Word2Vec algorithm extracts about three times as many words expressing emotion about the keyword as co-occurrence analysis does. The reason for the difference between the two results is Word2Vec's vectorization of semantic features. Therefore, it is possible to say that the Word2Vec algorithm is able to catch hidden related words that have not been found in traditional analysis. In addition, part-of-speech (POS) tagging for Korean is used to detect adjectives as "emotional words" in Korean. The emotion words extracted from the text are then converted into word vectors by the Word2Vec algorithm to find related words. Among these related words, nouns are selected because each of them may have a causal relationship with the emotional word in the sentence. The process of extracting these trigger factors of emotional words is named "Emotion Trigger" in this study. As a case study, the datasets used in the study were collected by searching with three keywords: professor, prosecutor, and doctor, in that these keywords attract rich public emotion and opinion. Preliminary data collection was conducted to select secondary keywords for data gathering. The secondary keywords used to gather the data for the actual analysis are as follows: Professor (sexual assault, misappropriation of research money, recruitment irregularities, polifessor), Doctor (Shin Hae-chul's Sky Hospital, drinking and plastic surgery, rebate), Prosecutor (lewd behavior, sponsor). The size of the text data is about 100,000 documents (Professor: 25,720; Doctor: 35,110; Prosecutor: 43,225), and the data were gathered from news, blogs, and Twitter to reflect various levels of public emotion in the text data analysis. Gephi (http://gephi.github.io) was used as the visualization method, and all programs used in text processing and analysis were written in Java. The contributions of this study are as follows: First, different approaches to sentiment analysis are integrated to overcome the limitations of existing approaches. Second, finding Emotion Triggers can detect hidden connections to public emotion that existing methods cannot detect. Finally, the approach used in this study can be generalized regardless of the type of text data. 
The limitation of this study is that it is hard to claim that a word extracted by Emotion Trigger processing has a significant causal relationship with the emotional word in its sentence. Future work will clarify the causal relationship between emotional words and the words extracted by Emotion Trigger by comparing them with manually tagged relationships. Furthermore, much of the text data used in Emotion Trigger comes from Twitter, which has a number of distinct features that we did not deal with in this study. These features will be considered in further work.
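
The core step of the Emotion Trigger procedure, finding nouns close to an emotional adjective in a Word2Vec space, can be sketched as follows. The toy corpus, the noun set, and all parameter values are assumptions for illustration; the study trains on Korean news, blog, and Twitter text with Korean POS tagging:

```python
# Sketch of the Emotion Trigger lookup: train Word2Vec on tokenized sentences,
# then keep the nouns among the words most similar to an emotional adjective.
from gensim.models import Word2Vec

# Pre-tokenized (and, in the study, POS-tagged) sentences; hypothetical data.
sentences = [
    ["professor", "accused", "angry", "students", "protest"],
    ["students", "angry", "about", "recruitment", "irregularities"],
    ["doctor", "grateful", "patients", "recovered"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, seed=1)

emotion_word = "angry"  # adjective detected by POS tagging
nouns = {"professor", "students", "recruitment", "doctor", "patients"}

# Candidate triggers: nouns among the words most similar to the emotion word.
related = model.wv.most_similar(emotion_word, topn=10)
triggers = [(w, round(sim, 3)) for w, sim in related if w in nouns]
print(triggers)
```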

Construction of Event Networks from Large News Data Using Text Mining Techniques (텍스트 마이닝 기법을 적용한 뉴스 데이터에서의 사건 네트워크 구축)

  • Lee, Minchul;Kim, Hea-Jin
    • Journal of Intelligence and Information Systems
    • /
    • v.24 no.1
    • /
    • pp.183-203
    • /
    • 2018
  • News articles are the most suitable medium for examining events occurring at home and abroad. In particular, as the development of information and communication technology has brought various kinds of online news media, news about events occurring in society has increased greatly. Automatically summarizing key events from massive amounts of news data will therefore help users look at many events at a glance. In addition, if we build and provide an event network based on the relevance of events, it can greatly help readers understand current events. In this study, we propose a method for extracting event networks from large news text data. To this end, we first collected Korean political and social articles from March 2016 to March 2017, and, through preprocessing using NPMI and Word2Vec, kept only meaningful words and merged synonyms. Latent Dirichlet allocation (LDA) topic modeling was used to calculate the topic distribution by date, find the peaks of each topic's distribution, and detect events. A total of 32 topics were extracted from the topic modeling, and the time of occurrence of each event was deduced by looking at the point at which its topic distribution surged. As a result, a total of 85 events were detected, of which a final 16 events were retained and presented after filtering with the Gaussian smoothing technique. We also calculated relevance scores between the detected events to construct the event network. Using the cosine coefficient between co-occurring events, we calculated the relevance between the events and connected them to construct the event network. Finally, we set up the event network by treating each event as a vertex and the relevance score between events as the weight of the edge connecting the vertices. The event network constructed by our method helped us sort the major political and social events in Korea over the past year in chronological order and, at the same time, identify which events are related to which. Our approach differs from existing event detection methods in that LDA topic modeling makes it possible to easily analyze large amounts of data and to identify relationships between events that were difficult to detect with existing event detection. In text preprocessing, we applied various text mining techniques and the Word2Vec technique to improve the accuracy of extracting proper nouns and compound nouns, which have been difficult to handle in analyzing Korean texts. The event detection and network construction techniques in this study have the following advantages in practical application. First, LDA topic modeling, which is unsupervised learning, can easily extract topics, topic words, and their distributions from huge amounts of data. Also, by using the date information of the collected news articles, it is possible to express the distribution of each topic as a time series. Second, by calculating relevance scores and constructing an event network from the co-occurrence of topics, which is difficult to grasp with existing event detection, we can present the connections between events in a summarized form. This can be seen from the fact that the inter-event, relevance-based event network proposed in this study was actually constructed in order of occurrence time. The event network also makes it possible to identify which event served as the starting point for a series of events. 
The limitation of this study is that LDA topic modeling produces different results depending on the initial parameters and the number of topics, and the topic and event names in the results must be assigned by the subjective judgment of the researcher. Also, since each topic is assumed to be exclusive and independent, the model does not take into account the relevance between topics. Subsequent studies need to calculate the relevance between events that are not covered in this study or that belong to the same topic.
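
A minimal sketch of the event-detection and network-construction steps is given below. The daily topic-share matrix is random stand-in data rather than real LDA output, the spike threshold is an assumption, and the cosine link between events is computed here over topic time-series profiles as a simplified proxy for the paper's co-occurrence-based relevance score:

```python
# Sketch: smooth each topic's daily share with a Gaussian filter, flag days
# where the share spikes as events, then link events by cosine similarity.
import numpy as np
from scipy.ndimage import gaussian_filter1d
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
days, topics = 120, 5
share = rng.dirichlet(np.ones(topics), size=days)  # per-day topic distribution
share[40, 2] += 3.0  # inject an artificial spike (breaks normalization; fine for a sketch)

events = []  # (topic, day) pairs
for t in range(topics):
    smoothed = gaussian_filter1d(share[:, t], sigma=2)
    threshold = smoothed.mean() + 2 * smoothed.std()
    for day in np.where(smoothed > threshold)[0]:
        events.append((t, int(day)))

# Event network: connect events whose topic time-series profiles are similar.
profiles = np.array([share[:, t] for t, _ in events])
similarity = cosine_similarity(profiles)
edges = [(i, j, round(float(similarity[i, j]), 2))
         for i in range(len(events)) for j in range(i + 1, len(events))
         if similarity[i, j] > 0.7]
print(events, edges)
```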

The Mediating Effect of Experiential Value on Customers' Perceived Value of Digital Content: China's Anti-virus Program Market (경험개치대소비자대전자내용적인지개치적중개영향(经验价值对消费者对电子内容的认知价值的中介影响): 중국살독연건시장(中国杀毒软件市场))

  • Jia, Weiwei;Kim, Sae-Bum
    • Journal of Global Scholars of Marketing Science
    • /
    • v.20 no.2
    • /
    • pp.219-230
    • /
    • 2010
  • Digital content makes big changes to our daily lives while bringing opportunities and challenges for companies. Creative firms integrate pictures, texts, videos, audios, and data by digitalization to develop new products or services and create digital experiences to promote their brands. Most articles on digital content in the literature address its basic concept or the development of its marketing. Actually, compared with traditional value chains for common products or services, the digital content industry seems to have more potential value. Because quite a bit of digital content is free to the consumer, price is not necessarily perceived as an indicator of the quality or value of information (Rowley 2008). It becomes evident that a current theme in digital content is the issue of "value," and research on customers' perceived value of digital content is a necessity. This article argues that experiential value has an advantage in customers' evaluations of digital content. Two different but related contributions to the understanding of the "value" of digital content are made here. First, based on the comparison of digital content with products and services, the article proposes two key characteristics that make an experiential strategy available for digital content: intangibility and near-zero reproduction cost. On top of that, based on the discussion of the gap between a company's idealized value and customers' perceived value, this article emphasizes that the pricing of digital content differs from that of products and services. As a result of intangibility, prices may not reflect customer value. Moreover, the cost of digital content in the development stage may be very high while reproduction costs shrink dramatically. In addition, because of the value gap mentioned before, pricing policies vary for different kinds of digital content. For example, a flat price policy is generally used for movies and music (Magiera 2001; Netherby 2002), while digital content with continuous demand, such as online games and anti-virus programs, involves a more complicated matter of utility and competitive price levels. Digital content companies have to explore various kinds of strategies to overcome this gap. Rethinking marketing solutions such as advertisements, images, and word-of-mouth and their effect on customers' perceived value becomes essential. China's digital content industry is becoming more and more globalized and drawing special attention from different countries and regions that have respective competitive advantages. The 2008-2009 Annual Report on the Development of China's Digital Content Industry (CCIDConsulting 2009) indicates that, driven by domestic demand and governmental policy support, the country's digital content industry maintained fast growth of some 30 percent in 2008, clearly indicating the initial stage of industry expansion. In China, anti-virus programs and other software programs which need to be updated use a quarter-based pricing policy. Customers can download a trial version for free and use it for six months or a year. If they want to use it longer, continuous payment is needed. They evaluate the quality of the digital content during this trial period and decide whether to pay for continued usage. 
In China's music and movie industries, which are still at an early stage of development, the experiential strategy has not been much applied, even though firms in other countries consider the trial experience important and explore related strategies (such as letting customers listen to music for several seconds for free before downloading it). For the above reasons, anti-virus programs can be considered representative of the digital content industry in China, and an exploratory study of the advantage of experiential value in customers' perceived value of digital content is conducted in the anti-virus market of China. In order to enhance the reliability of the survey data, this study focused on people who were experienced users of anti-virus programs. The empirical results revealed that experiential value has a positive effect on customers' perceived value of digital content. In other words, because digital content is intangible and the reproduction costs are nearly zero, customers' evaluations are based heavily on their experience. Moreover, image and word-of-mouth do not have a positive effect on perceived value, only on experiential value. That is to say, a digital content value chain is different from that of a general product or service. Experiential value has a notable advantage and mediates the effect of image and word-of-mouth on perceived value. The results of this study help provide an understanding of why free digital content downloads exist in developing countries. Customers can perceive the value of digital content only by using and experiencing it. This is also why such governments support the development of digital content. Other developing countries whose digital content business is also in the beginning stage can make use of the suggestions here. Moreover, based on the advantage of the experiential strategy, companies should make more of an effort to invest in customers' experience. Because of the characteristics and value gap of digital content, customers perceive more value in intangible digital content only by experiencing what they really want. Moreover, because of the near-zero reproduction costs, companies can perhaps use an experiential strategy to enhance customer understanding of digital content.

Automatic Quality Evaluation with Completeness and Succinctness for Text Summarization (완전성과 간결성을 고려한 텍스트 요약 품질의 자동 평가 기법)

  • Ko, Eunjung;Kim, Namgyu
    • Journal of Intelligence and Information Systems
    • /
    • v.24 no.2
    • /
    • pp.125-148
    • /
    • 2018
  • Recently, as the demand for big data analysis increases, cases of analyzing unstructured data and using the results are also increasing. Among the various types of unstructured data, text is used as a means of communicating information in almost all fields. In addition, text attracts many analysts because the amount of available data is very large and it is relatively easy to collect compared to other unstructured and structured data. Among the various text analysis applications, document classification, which classifies documents into predetermined categories; topic modeling, which extracts major topics from a large number of documents; sentiment analysis or opinion mining, which identifies emotions or opinions contained in texts; and text summarization, which summarizes the main contents of one or several documents, have been actively studied. In particular, text summarization is actively applied in business through news summary services, privacy policy summary services, etc. In addition, much research has been done in academia on the extraction approach, which selectively provides the main elements of the document, and the abstraction approach, which extracts elements of the document and composes new sentences by combining them. However, the technique of evaluating the quality of automatically summarized documents has not made much progress compared to the technique of automatic text summarization. Most existing studies dealing with the quality evaluation of summarization manually summarized documents, used them as reference documents, and measured the similarity between the automatic summary and the reference document. Specifically, automatic summarization is performed on the full text through various techniques, and comparison with the reference document, which is an ideal summary document, is performed to measure the quality of the automatic summarization. Reference documents are provided in two major ways; the most common way is manual summarization, in which a person creates an ideal summary by hand. Since this method requires human intervention in the process of preparing the summary, it takes a lot of time and cost to write the summary, and there is the limitation that the evaluation result may differ depending on the subjectivity of the summarizer. Therefore, in order to overcome these limitations, attempts have been made to measure the quality of summary documents without human intervention. As a representative attempt, a method has recently been devised that reduces the size of the full text and measures the similarity between the reduced full text and the automatic summary. In this method, the more frequently the terms of the full text appear in the summary, the better the quality of the summary is judged to be. However, since summarization essentially means condensing a large amount of content while minimizing omissions, it is unreasonable to say that a "good summary" based only on frequency always means a "good summary" in its essential meaning. In order to overcome the limitations of this previous study of summarization evaluation, this study proposes an automatic quality evaluation method for text summarization based on the essential meaning of summarization. Specifically, the concept of succinctness is defined as an element indicating how little content is duplicated among the sentences of the summary, and completeness is defined as an element indicating how little of the original content is missing from the summary. 
In this paper, we propose a method for the automatic quality evaluation of text summarization based on the concepts of succinctness and completeness. To evaluate the practical applicability of the proposed methodology, 29,671 sentences were extracted from TripAdvisor hotel reviews, the reviews were summarized for each hotel, and the quality of the summaries was evaluated according to the proposed methodology. The paper also provides a way to integrate completeness and succinctness, which are in a trade-off relationship, into an F-score, and proposes a method to perform optimal summarization by changing the sentence-similarity threshold.
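
One plausible reading of the two measures can be sketched with TF-IDF sentence vectors: completeness as the share of source sentences matched by at least one summary sentence, succinctness as the share of summary sentences that do not duplicate another, and the two combined into an F-score. The exact formulas and the 0.3/0.6 similarity thresholds below are assumptions, not the paper's definitions:

```python
# Sketch of completeness / succinctness / F-score for a sentence-level summary.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def summary_quality(source, summary, cover_th=0.3, dup_th=0.6):
    """Return (completeness, succinctness, f_score) for lists of sentences."""
    vec = TfidfVectorizer().fit(source + summary)
    S, M = vec.transform(source), vec.transform(summary)
    # Completeness: fraction of source sentences covered by some summary sentence.
    completeness = float((cosine_similarity(S, M).max(axis=1) >= cover_th).mean())
    # Succinctness: fraction of summary sentences not duplicating another one.
    dup = cosine_similarity(M) - np.eye(len(summary))
    succinctness = 1.0 - float((dup.max(axis=1) >= dup_th).mean())
    f_score = 2 * completeness * succinctness / (completeness + succinctness + 1e-9)
    return completeness, succinctness, f_score

source = ["The room was clean.", "Staff were friendly.", "Breakfast was cold.",
          "The hotel is close to the beach."]
summary = ["Clean room and friendly staff.", "Breakfast was served cold."]
print(summary_quality(source, summary))
```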

A New Approach to Automatic Keyword Generation Using Inverse Vector Space Model (키워드 자동 생성에 대한 새로운 접근법: 역 벡터공간모델을 이용한 키워드 할당 방법)

  • Cho, Won-Chin;Rho, Sang-Kyu;Yun, Ji-Young Agnes;Park, Jin-Soo
    • Asia pacific journal of information systems
    • /
    • v.21 no.1
    • /
    • pp.103-122
    • /
    • 2011
  • Recently, numerous documents have been made available electronically. Internet search engines and digital libraries commonly return query results containing hundreds or even thousands of documents. In this situation, it is virtually impossible for users to examine complete documents to determine whether they might be useful for them. For this reason, some on-line documents are accompanied by a list of keywords specified by the authors in an effort to guide the users by facilitating the filtering process. In this way, a set of keywords is often considered a condensed version of the whole document and therefore plays an important role in document retrieval, Web page retrieval, document clustering, summarization, text mining, and so on. Since many academic journals ask the authors to provide a list of five or six keywords on the first page of an article, keywords are most familiar in the context of journal articles. However, many other types of documents, including Web pages, email messages, news reports, magazine articles, and business papers, do not yet benefit from the use of keywords although they could. Although the potential benefit is large, the implementation itself is the obstacle; manually assigning keywords to all documents is a daunting task, or even impractical, in that it is extremely tedious and time-consuming, requiring a certain level of domain knowledge. Therefore, it is highly desirable to automate the keyword generation process. There are mainly two approaches to achieving this aim: the keyword assignment approach and the keyword extraction approach. Both approaches use machine learning methods and require, for training purposes, a set of documents with keywords already attached. In the former approach, there is a given vocabulary, and the aim is to match its terms to the texts. In other words, the keyword assignment approach seeks to select the words from a controlled vocabulary that best describe a document. Although this approach is domain dependent and is not easy to transfer and expand, it can generate implicit keywords that do not appear in a document. On the other hand, in the latter approach, the aim is to extract keywords with respect to their relevance in the text without a prior vocabulary. In this approach, automatic keyword generation is treated as a classification task, and keywords are commonly extracted based on supervised learning techniques. Thus, keyword extraction algorithms classify candidate keywords in a document into positive or negative examples. Several systems such as Extractor and Kea were developed using the keyword extraction approach. The most indicative words in a document are selected as keywords for that document, and as a result keyword extraction is limited to terms that appear in the document. Therefore, keyword extraction cannot generate implicit keywords that are not included in a document. According to the experimental results of Turney, about 64% to 90% of keywords assigned by the authors can be found in the full text of an article. Conversely, this also means that 10% to 36% of the keywords assigned by the authors do not appear in the article and cannot be generated through keyword extraction algorithms. Our preliminary experimental result also shows that 37% of keywords assigned by the authors are not included in the full text. This is the reason why we have decided to adopt the keyword assignment approach. In this paper, we propose a new approach for automatic keyword assignment, namely IVSM(Inverse Vector Space Model). The model is based on the vector space model, 
which is a conventional information retrieval model that represents documents and queries by vectors in a multidimensional space. IVSM generates an appropriate keyword set for a specific document by measuring the distance between the document and the keyword sets. The keyword assignment process of IVSM is as follows: (1) calculating the vector length of each keyword set based on each keyword weight; (2) preprocessing and parsing a target document that does not have keywords; (3) calculating the vector length of the target document based on the term frequency; (4) measuring the cosine similarity between each keyword set and the target document; and (5) generating keywords that have high similarity scores. Two keyword generation systems were implemented applying IVSM: an IVSM system for a Web-based community service and a stand-alone IVSM system. First, the IVSM system is implemented in a community service for sharing knowledge and opinions on current trends such as fashion, movies, social problems, and health information. The stand-alone IVSM system is dedicated to generating keywords for academic papers, and, indeed, it has been tested on a number of academic papers including those published by the Korean Association of Shipping and Logistics, the Korea Research Academy of Distribution Information, the Korea Logistics Society, the Korea Logistics Research Association, and the Korea Port Economic Association. We measured the performance of IVSM by the number of matches between the IVSM-generated keywords and the author-assigned keywords. According to our experiment, the precision of IVSM applied to the Web-based community service and to academic journals was 0.75 and 0.71, respectively. The performance of both systems is much better than that of baseline systems that generate keywords based on simple probability. Also, IVSM shows performance comparable to Extractor, a representative keyword extraction system developed by Turney. As electronic documents increase, we expect that the IVSM proposed in this paper can be applied to many electronic documents in Web-based communities and digital libraries.
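
The keyword assignment step of IVSM, scoring a new document against a vector built for each keyword set and returning the closest keywords, can be sketched as follows. The training pairs, the plain summation used to build keyword-set vectors, and the top-k cutoff are simplifying assumptions; the paper derives keyword-set vectors from keyword weights and vector lengths as described in its five-step procedure:

```python
# Sketch of IVSM-style keyword assignment via cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical training documents with author-assigned keywords.
train = [
    ("port throughput and container logistics in Busan", ["logistics", "port"]),
    ("supply chain cost model for shipping companies", ["logistics", "shipping"]),
    ("automatic keyword extraction from journal articles", ["text mining"]),
]
vec = TfidfVectorizer().fit([doc for doc, _ in train])
X = vec.transform([doc for doc, _ in train])

# One vector per keyword: the sum of the vectors of its training documents.
keyword_vecs = {}
for i, (_, keywords) in enumerate(train):
    for kw in keywords:
        keyword_vecs[kw] = X[i] if kw not in keyword_vecs else keyword_vecs[kw] + X[i]

def assign_keywords(text, k=2):
    """Return the k keyword sets whose vectors are closest to the document."""
    d = vec.transform([text])
    scored = [(kw, cosine_similarity(d, v)[0, 0]) for kw, v in keyword_vecs.items()]
    return sorted(scored, key=lambda x: -x[1])[:k]

print(assign_keywords("container shipping costs at the port of Incheon"))
```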

A Study of The Medical Classics in the '$\bar{A}yurveda$' ('아유르베다'($\bar{A}yurveda$)의 의경(醫經)에 관한 연구)

  • Kim, Ki-Wook;Park, Hyun-Kuk;Seo, Ji-Young
    • Journal of Korean Medical classics
    • /
    • v.20 no.4
    • /
    • pp.91-117
    • /
    • 2007
  • Through a simple study of the medical classics in the '$\bar{A}yurveda$', we have summarized them as follows. 1) Traditional Indian medicine started in the Ganges river area at about 1500 B. C. E. and traces of medical science can be found in the "Rigveda" and "Atharvaveda". 2) The "Charaka" and "$Su\acute{s}hruta$(妙聞集)", ancient texts from India, are not the work of one person, but the result of the work and errors of different doctors and philosophers. Due to the lack of historical records, the times when Charaka or $Su\acute{s}hruta$(妙聞) lived are not exactly known. So the completion of the "Charaka" is estimated at the 1st${\sim}$2nd century C. E. in northwestern India, and the "$Su\acute{s}hruta$" is estimated to have been completed in the 3rd${\sim}$4th century C. E. in central India. Also, the "Charaka" contains details on internal medicine, while the "$Su\acute{s}hruta$" contains more details on surgery by comparison. 3) '$V\bar{a}gbhata$', one of the revered Vriddha Trayi(triad of the ancients, 三醫聖) of the '$\bar{A}yurveda$', lived and worked in about the 7th century and wrote the "$A\d{s}\d{t}\bar{a}nga$ $h\d{r}daya$ $sa\d{m}hit\bar{a}$(八支集)" and "$A\d{s}\d{t}\bar{a}nga$ Sangraha $samhit\bar{a}$(八心集)", where he tried to compromise and unify the "Charaka" and "$Su\acute{s}hruta$". The "$A\d{s}\d{t}\bar{a}nga$ Sangraha $samhit\bar{a}$" was translated into Tibetan and Arabic at about the 8th${\sim}$9th century, and if we generalize the medicinal plants recorded in each of the "Charaka", "$Su\acute{s}hruta$" and the "$A\d{s}\d{t}\bar{a}nga$ Sangraha $samhit\bar{a}$", there are 240, 370, and 240 types, respectively. 4) The 'Madhava' focused on one of the subjects of Indian medicine, '$Nid\bar{a}na$', meaning "the cause of diseases(病因論)", and in one of the copies found by Bower in the 4th century C. E. we can see that it uses prescriptions from the "BuHaLaJi(布哈拉集)", "Charaka", and "$Su\acute{s}hruta$". 5) According to the "Charaka", there were 8 branches of ancient medicine in India: treatment of the body(kayacikitsa), special surgery(salakya), removal of alien substances(salyapahartka), treatment of poison or mis-combined medicines(visagaravairodhikaprasamana), the study of ghosts(bhutavidya), pediatrics(kaumarabhrtya), perennial youth and long life(rasayana), and the strengthening of the essence of the body(vajikarana). 6) The '$\bar{A}yurveda$', which originated from ancient experience, was recorded in Sanskrit, which was a theorization of knowledge, and also was written in verses to make memorizing easy, and made medicine the exclusive possession of the Brahmin. The first annotations were 1060 for the "Charaka", 1200 for the "$Su\acute{s}hruta$", 1150 for the "$A\d{s}\d{t}\bar{a}nga$ Sangraha $samhit\bar{a}$", and 1100 for the "$Nid\bar{a}na$". The use of various mineral medicines in the "Charaka" or the use of mercury as internal medicine in the "$A\d{s}\d{t}\bar{a}nga$ Sangraha $samhit\bar{a}$", and the palpation of the pulse for diagnosing in the '$\bar{A}yurveda$' and 'XiZhang(西藏)' medicine are similar to TCM's pulse diagnostics. The coexistence with Arabian 'Unani' medicine, the compromise with western medicine, and the reactionism trend restored the '$\bar{A}yurveda$' today. 
7) The "Charaka" is a book inclined to internal medicine that investigates the origin of human disease which used the dualism of the 'Samkhya', the natural philosophy of the 'Vaisesika' and the logic of the 'Nyaya' in medical theories, and its structure has 16 syllables per line, 2 lines per poem and is recorded in poetry and prose. Also, the "Charaka" can be summarized into the introduction, cause, judgement, body, sensory organs, treatment, pharmaceuticals, and end, and can be seen as a work that strongly reflects the moral code of Brahmin and Aryans. 8) In extracting bloody pus, the "Charaka" introduces a 'sharp tool' bloodletting treatment, while the "$Su\scute{s}hruta$" introduces many surgical methods such as the use of gourd dippers, horns, sucking the blood with leeches. Also the "$Su\acute{s}hruta$" has 19 chapters specializing in ophthalmology, and shows 76 types of eye diseases and their treatments. 9) Since anatomy did not develop in Indian medicine, the inner structure of the human body was not well known. The only exception is 'GuXiangXue(骨相學)' which developed from 'Atharvaveda' times and the "$A\d{s}\d{t}\bar{a}nga$ Sangraha $samhit\bar{a}$". In the "$A\d{s}\d{t}\bar{a}nga$ Sangraha $samhit\bar{a}$"'s 'ShenTiLun(身體論)' there is a thorough listing of the development of a child from pregnancy to birth. The '$\bar{A}yurveda$' is not just an ancient traditional medical system but is being called alternative medicine in the west because of its ability to supplement western medicine and, as its effects are being proved scientifically it is gaining attention worldwide. We would like to say that what we have researched is just a small fragment and a limited view, and would like to correct and supplement any insufficient parts through more research of new records.


A Study of The Medical Classics in the '$\bar{A}yurveda$' (아유르베다'($\bar{A}yurveda$) 의경(醫經)에 관한 연구)

  • Kim, Ki-Wook;Park, Hyun-Kuk;Seo, Ji-Young
    • The Journal of Dong Guk Oriental Medicine
    • /
    • v.10
    • /
    • pp.119-145
    • /
    • 2008
  • Through a simple study of the medical classics in the '$\bar{A}yurveda$', we have summarized them as follows. 1) Traditional Indian medicine started in the Ganges river area at about 1500 B. C. E. and traces of medical science can be found in the "Rigveda" and "Atharvaveda". 2) The "Charaka(閣羅迦集)" and "$Su\acute{s}hruta$(妙聞集)", ancient texts from India, are not the work of one person, but the result of the work and errors of different doctors and philosophers. Due to the lack of historical records, the times when Charaka(閣羅迦) or $Su\acute{s}hruta$(妙聞) lived are not exactly known. So the completion of the "Charaka" is estimated at the 1st$\sim$2nd century C. E. in northwestern India, and the "$Su\acute{s}hruta$" is estimated to have been completed in the 3rd$\sim$4th century C. E. in central India. Also, the "Charaka" contains details on internal medicine, while the "$Su\acute{s}hruta$" contains more details on surgery by comparison. 3) '$V\bar{a}gbhata$', one of the revered Vriddha Trayi(triad of the ancients, 三醫聖) of the '$\bar{A}yurveda$', lived and worked in about the 7th century and wrote the "$Ast\bar{a}nga$ hrdaya $samhit\bar{a}$(八支集)" and "$Ast\bar{a}nga$ Sangraha $samhit\bar{a}$(八心集)", where he tried to compromise and unify the "Charaka" and "$Su\acute{s}hruta$". The "$Ast\bar{a}nga$ Sangraha $samhit\bar{a}$" was translated into Tibetan and Arabic at about the 8th$\sim$9th century, and if we generalize the medicinal plants recorded in each of the "Charaka", "$Su\acute{s}hruta$" and the "$Ast\bar{a}nga$ Sangraha $samhit\bar{a}$", there are 240, 370, and 240 types, respectively. 4) The 'Madhava' focused on one of the subjects of Indian medicine, '$Nid\bar{a}na$', meaning "the cause of diseases(病因論)", and in one of the copies found by Bower in the 4th century C. E. we can see that it uses prescriptions from the "BuHaLaJi(布唅拉集)", "Charaka", and "$Su\acute{s}hruta$". 5) According to the "Charaka", there were 8 branches of ancient medicine in India: treatment of the body(kayacikitsa), special surgery(salakya), removal of alien substances(salyapahartka), treatment of poison or mis-combined medicines(visagaravairodhikaprasamana), the study of ghosts(bhutavidya), pediatrics(kaumarabhrtya), perennial youth and long life(rasayana), and the strengthening of the essence of the body(vajikarana). 6) The '$\bar{A}yurveda$', which originated from ancient experience, was recorded in Sanskrit, which was a theorization of knowledge, and also was written in verses to make memorizing easy, and made medicine the exclusive possession of the Brahmin. The first annotations were 1060 for the "Charaka", 1200 for the "$Su\acute{s}hruta$", 1150 for the "$Ast\bar{a}nga$ Sangraha $samhit\bar{a}$", and 1100 for the "$Nid\bar{a}na$". The use of various mineral medicines in the "Charaka" or the use of mercury as internal medicine in the "$Ast\bar{a}nga$ Sangraha $samhit\bar{a}$", and the palpation of the pulse for diagnosing in the '$\bar{A}yurveda$' and 'XiZhang(西藏)' medicine are similar to TCM's pulse diagnostics. The coexistence with Arabian 'Unani' medicine, the compromise with western medicine and the reactionism trend restored the '$\bar{A}yurveda$' today. 7) The "Charaka" is a book inclined to internal medicine that investigates the origin of human disease which used the dualism of the 'Samkhya', the natural philosophy of the 'Vaisesika' and the logic of the 'Nyaya' in medical theories, and its structure has 16 syllables per line, 2 lines per poem and is recorded in poetry and prose. 
Also, the "Charaka" can be summarized into the introduction, cause, judgement, body, sensory organs, treatment, pharmaceuticals, and end, and can be seen as a work that strongly reflects the moral code of Brahmin and Aryans. 8) In extracting bloody pus, the "Charaka" introduces a 'sharp tool' bloodletting treatment, while the "$Su\acute{s}hruta$" introduces many surgical methods such as the use of gourd dippers, horns, sucking the blood with leeches. Also the "$Su\acute{s}hruta$" has 19 chapters specializing in ophthalmology, and shows 76 types of eye diseases and their treatments. 9) Since anatomy did not develop in Indian medicine, the inner structure of the human body was not well known. The only exception is 'GuXiangXue(骨相學)' which developed from 'Atharvaveda' times and the "$Ast\bar{a}nga$ Sangraha $samhit\bar{a}$". In the "$Ast\bar{a}nga$ Sangraha $samhit\bar{a}$"'s 'ShenTiLun(身體論)' there is a thorough listing of the development of a child from pregnancy to birth. The '$\bar{A}yurveda$' is not just an ancient traditional medical system but is being called alternative medicine in the west because of its ability to supplement western medicine and, as its effects are being proved scientifically it is gaining attention worldwide. We would like to say that what we have researched is just a small fragment and a limited view, and would like to correct and supplement any insufficient parts through more research of new records.


Twitter Issue Tracking System by Topic Modeling Techniques (토픽 모델링을 이용한 트위터 이슈 트래킹 시스템)

  • Bae, Jung-Hwan;Han, Nam-Gi;Song, Min
    • Journal of Intelligence and Information Systems
    • /
    • v.20 no.2
    • /
    • pp.109-122
    • /
    • 2014
  • People are nowadays creating a tremendous amount of data on Social Network Services (SNS). In particular, the incorporation of SNS into mobile devices has resulted in massive amounts of data generation, thereby greatly influencing society. This is an unmatched phenomenon in history, and now we live in the Age of Big Data. SNS data is defined as Big Data in that the amount of data (volume), the data input and output speed (velocity), and the variety of data types (variety) are all satisfied. If someone intends to discover the trend of an issue in SNS Big Data, this information can be used as an important new source for creating value because it covers the whole of society. In this study, a Twitter Issue Tracking System (TITS) is designed and established to meet the needs of analyzing SNS Big Data. TITS extracts issues from Twitter texts and visualizes them on the web. The proposed system provides the following four functions: (1) provide the topic keyword set corresponding to the daily ranking; (2) visualize the daily time-series graph of a topic for the duration of a month; (3) provide the importance of a topic through a treemap based on a scoring system and frequency; and (4) visualize the daily time-series graph of keywords retrieved by keyword search. The present study analyzes the Big Data generated by SNS in real time. SNS Big Data analysis requires various natural language processing techniques, including the removal of stop words and noun extraction, for processing various unrefined forms of unstructured data. In addition, such analysis requires the latest big data technology to rapidly process a large amount of real-time data, such as the Hadoop distributed system or NoSQL, which is an alternative to relational databases. We built TITS based on Hadoop to optimize the processing of big data because Hadoop is designed to scale up from single-node computing to thousands of machines. Furthermore, we use MongoDB, which is classified as a NoSQL database. MongoDB is an open-source, document-oriented database that provides high performance, high availability, and automatic scaling. Unlike existing relational databases, MongoDB has no schemas or tables, and its most important goals are data accessibility and data processing performance. In the Age of Big Data, the visualization of Big Data is attractive to the Big Data community because it helps analysts examine such data easily and clearly. Therefore, TITS uses the d3.js library as a visualization tool. This library is designed for the purpose of creating Data-Driven Documents that bind the document object model (DOM) to data; interaction with the data is easy, and it is useful for managing real-time data streams with smooth animation. In addition, TITS uses Bootstrap, which consists of pre-configured plug-in style sheets and JavaScript libraries, to build the web system. The TITS Graphical User Interface (GUI) is designed using these libraries, and it is capable of detecting issues on Twitter in an easy and intuitive manner. The proposed work demonstrates the superiority of our issue detection techniques by matching detected issues with corresponding online news articles. The contributions of the present study are threefold. First, we suggest an alternative approach to real-time big data analysis, which has become an extremely important issue. Second, we apply a topic modeling technique that is used in various research areas, including Library and Information Science (LIS). 
Based on this, we can confirm the utility of storytelling and time series analysis. Third, we develop a web-based system, and make the system available for the real-time discovery of topics. The present study conducted experiments with nearly 150 million tweets in Korea during March 2013.
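
As an illustration of the daily ranking and treemap-importance functions, the sketch below fits LDA on one day's tweets and emits one record per topic with its top keywords and an importance value, in a shape that could be stored in a document database and rendered by d3.js. The tweets, the two-topic setting, and the importance rule (total document weight of the topic) are hypothetical rather than the system's actual scoring:

```python
# Sketch: one day's topic ranking from tweets, shaped for a NoSQL store / treemap.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = ["subway fare hike announced today", "protest against subway fare hike",
          "new phone release rumors spread", "phone release date leaked by carrier"]

vec = CountVectorizer()
X = vec.fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
doc_topic = lda.transform(X)
terms = vec.get_feature_names_out()

daily_ranking = []
for t, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:3]]
    importance = float(doc_topic[:, t].sum())  # share of the day's tweet volume
    daily_ranking.append({"date": "2013-03-01", "topic": t,
                          "keywords": top_terms, "importance": importance})

daily_ranking.sort(key=lambda d: -d["importance"])
print(daily_ranking)  # e.g., insert into MongoDB and render as a treemap
```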