• Title/Summary/Keyword: document classification

Search Result 451, Processing Time 0.029 seconds

Web Site Keyword Selection Method by Considering Semantic Similarity Based on Word2Vec (Word2Vec 기반의 의미적 유사도를 고려한 웹사이트 키워드 선택 기법)

  • Lee, Donghun;Kim, Kwanho
    • The Journal of Society for e-Business Studies / v.23 no.2 / pp.83-96 / 2018
  • Extracting keywords that represent a document is important because such keywords support automated services such as document search, classification, and recommendation, and convey a document's content quickly. However, methods that extract keywords from word frequencies in web site documents, or from graph algorithms based on word co-occurrence, face two problems: web page structure tends to contain many words unrelated to the topic, and the limited performance of Korean tokenizers makes semantic keywords difficult to extract. In this paper, we propose a method that selects candidate keywords based on semantic similarity, addressing both the failure to extract semantic keywords and the poor accuracy of Korean tokenizer analysis. A filtering step then removes inconsistent keywords to produce the final semantic keywords. Experiments on real web pages of small businesses show that the proposed method improves performance by 34.52% over a keyword selection technique based on statistical similarity. These results confirm that considering semantic similarity between words and removing inconsistent keywords improves keyword extraction from documents.
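
The idea of selecting keywords by semantic similarity and then filtering inconsistent ones can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the centroid-based scoring, the threshold, and the toy three-dimensional vectors (standing in for trained Word2Vec embeddings) are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def select_keywords(candidates, vectors, threshold=0.5):
    """Keep candidates whose vector is close to the centroid of all candidates.

    `vectors` maps a word to its embedding; words without a vector are skipped.
    The threshold is an illustrative value, not one taken from the paper.
    """
    known = [w for w in candidates if w in vectors]
    if not known:
        return []
    dim = len(vectors[known[0]])
    centroid = [sum(vectors[w][i] for w in known) / len(known) for i in range(dim)]
    scored = [(w, cosine(vectors[w], centroid)) for w in known]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Filtering step: drop words that are inconsistent with the overall topic.
    return [w for w, score in scored if score >= threshold]

# Hypothetical embeddings: three topic words and one page-structure word.
TOY_VECTORS = {
    "bakery": [0.9, 0.1, 0.0],
    "bread": [0.8, 0.2, 0.1],
    "cake": [0.85, 0.15, 0.05],
    "login": [0.0, 0.1, 0.9],
}

keywords = select_keywords(["bakery", "bread", "cake", "login"], TOY_VECTORS)
```

With these toy vectors, the structural word "login" falls below the threshold and is filtered out, while the three topic words are kept. In practice the vectors would come from a trained embedding model.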

Discussion about the Self Disposal Guideline of Medical Radioactive Waste (의료용 방사성폐기물 자체처분 가이드라인에 관한 고찰)

  • Lee, Kyung-Jae;Sul, Jin-Hyung;Lee, In-Won;Park, Young-Jae
    • The Korean Journal of Nuclear Medicine Technology / v.21 no.2 / pp.13-27 / 2017
  • Purpose: In the domestic procedure for self-disposal of medical radioactive waste, the screening process involves many requests for supplementation and other difficulties. Presenting a basic guideline should therefore improve the efficiency with which medical institutions process radioactive waste. We reviewed and compared the supplementary requests made on the self-disposal Plan & Procedure manuals of fifteen domestic medical institutions from 2015 to 2016, and derived the details required in radioactive waste documents from the relevant regulations of the Nuclear Safety Act. The representative supplementary requests from the Korea Institute of Nuclear Safety concern the disposal method for non-flammable radioactive waste, the storage method for waste scheduled for self-disposal, the legitimacy and pre-treatment of self-disposal, the reference radioactivity of disused filters and the calculation of the storage period, and the attachment of evidence of measurement efficiency when a gamma counter is used. Establishing a medical radioactive waste guideline makes it possible to clearly present a classification standard by radionuclide and type of occurrence. As a result, the examination processing period for self-disposal documents is reduced and no expenses are incurred for a business agency. The storage efficiency of facilities also improves and economic costs fall. We expect this guideline to improve the work efficiency of officials who face working-level difficulties with radioactive waste self-disposal.

A Korean Community-based Question Answering System Using Multiple Machine Learning Methods (다중 기계학습 방법을 이용한 한국어 커뮤니티 기반 질의-응답 시스템)

  • Kwon, Sunjae;Kim, Juae;Kang, Sangwoo;Seo, Jungyun
    • Journal of KIISE / v.43 no.10 / pp.1085-1093 / 2016
  • A community-based question answering (QA) system provides answers to questions from documents uploaded to web communities. To improve question analysis, earlier methods developed rules tailored to a target domain or applied machine learning to only part of the process. However, these methods incur excessive cost when extended to new fields, or yield systems overfitted to a specific field. This paper proposes a multiple machine learning method that automates the overall process of a community-based QA system by applying an appropriate machine learning technique at each step. The system is divided into a question analysis part and an answer selection part. The question analysis part consists of a question focus extractor, which identifies the focused phrases in questions using conditional random fields, and a question type classifier, which classifies the topics of questions using a support vector machine. In the answer selection part, we train the weights used by the similarity estimation models through an artificial neural network. In addition, the results of morphological analysis are often unreliable for data uploaded to web communities. We therefore suggest a method that minimizes the impact of morphological analysis by using character features in the question analysis stage. The proposed system outperforms the former system, achieving a Mean Average Precision of 0.765 and an R-Precision of 0.872.
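
The two evaluation measures reported above follow standard information retrieval definitions, which can be sketched as below. The function names and the example ranking are illustrative assumptions; the paper's own evaluation code and data are not shown in the abstract.

```python
def average_precision(ranked, relevant):
    """Average of precision values at each rank where a relevant item appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs):
    """Mean of average precision over (ranked list, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)

def r_precision(ranked, relevant):
    """Precision within the top R results, where R is the number of relevant items."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for item in ranked[:r] if item in relevant) / r

# Hypothetical ranking of candidate answers for one question.
ranked_answers = ["a", "b", "c", "d"]
relevant_answers = {"a", "c"}
ap = average_precision(ranked_answers, relevant_answers)
rp = r_precision(ranked_answers, relevant_answers)
```

For this example, relevant answers appear at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2, and the R-Precision looks only at the top two results.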

A Study on the Development of a Korean Traditional Food Data Integration System (한국 전통음식 통합검색 시스템 개발에 관한 연구)

  • Shin, Seung-Mee
    • The Korean Journal of Food And Nutrition / v.21 no.4 / pp.545-552 / 2008
  • This study attempts to develop a Korean traditional food data integration system backed by a food database. We collected all kinds of traditional Korean foods, consulted the literature, and classified the foods according to food type and cooking method. We also classified traditional Korean foods into six types: traditional common, royal, local, festival, rites, and Buddhist temple foods, and we integrated all of these databases for use by specialists and non-specialists alike. We studied Korean traditional foods by cooking type, planned a standardized code, and constructed a database of Korean traditional foods. These were combined to build the Korean traditional food data integration system. Korean traditional foods are classified into local foods of 10 provinces, 18 festival foods by seasonal division reflecting traditional Korean holidays, and 9 classes of rites foods. Using this classification system, a total of 7,289 kinds of food were surveyed by food type: 2,585 kinds of traditional common foods, 142 kinds of royal foods, 2,137 kinds of local foods, 515 kinds of festival foods, 403 kinds of rites foods, and 1,507 kinds of Buddhist temple foods. By dish category, they comprised 980 kinds of main dishes, 4,456 kinds of side dishes, 873 kinds of tteok-ryu (rice cakes), 515 kinds of hangwa-ryu (traditional confections), and 465 kinds of eumcheong-ryu (beverages). It is therefore recommended that knowledge of traditional Korean foods be preserved, that their excellence be developed, and that further studies be carried out.

A Study of LOD(Level of Detail) for BIM Model applied the Design Process (설계 프로세스를 반영한 BIM 작성 기준(LOD)에 대한 연구)

  • Cho, Hyun-Jung;Kim, Yeon-Soo;Ma, Young-Kyun
    • Journal of KIBIM / v.3 no.1 / pp.1-10 / 2013
  • With the recent spread of BIM (Building Information Modeling), BIM ordering manuals and guidelines are becoming widespread. However, because the standards for creating and managing BIM are vague and overly broad in meaning, they cause drawbacks such as increased work at each design stage and a lower level of BIM application. This study therefore aims to secure a BIM work standard by establishing a BIM authoring standard based on an LOD (Level of Detail) classification that reflects the domestic design process. It compares the definitions of LOD by analyzing domestic and foreign BIM guideline examples, and identifies the shortcomings of existing domestic and foreign design processes and BIM guidelines. It also derives architects' work items to promote an efficient design process, and analyzes the BIM requirements of the design process, dividing the scope of BIM application by field. Through this analysis, it establishes a BIM authoring standard classified by design process. The expected effects include a better environment for division of labor and cooperation through an integrated BIM model across design stages, higher work efficiency through preventing repeated and increased work, and improved project completeness and design quality. In addition, it can secure a BIM work standard by clarifying responsibility at each working step. The BIM authoring standard established in this study will serve as base data for developing future BIM work standard documents and BIM guidelines.

An Experimental Study on Automatic Summarization of Multiple News Articles (복수의 신문기사 자동요약에 관한 실험적 연구)

  • Kim, Yong-Kwang;Chung, Young-Mee
    • Journal of the Korean Society for information Management / v.23 no.1 s.59 / pp.83-98 / 2006
  • This study proposes a template-based method for automatic summarization of multiple news articles using the semantic categories of sentences. First, the semantic categories of core information to be included in a summary are identified from a training set of documents and their summaries. Then, cue words for each slot of the template are selected for later classification of news sentences into the relevant slots. When a news article is input, its event/accident category is identified, and key sentences are extracted from the article and filled into the relevant slots. The template, filled with simple sentences rather than the original long sentences, is used to generate a summary of an event/accident. In a user evaluation of the generated summaries, the results showed a 54.1% recall ratio and a 58.1% precision ratio in essential information extraction, and an 11.6% redundancy ratio.
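
The step of classifying news sentences into template slots by cue words might look roughly like the sketch below. The slot names, cue words, and the rule of choosing the slot with the most cue-word matches are assumptions for illustration only; the abstract does not specify how the authors score or select sentences.

```python
def fill_template(sentences, slot_cues):
    """Assign each sentence to the slot whose cue words it matches most.

    `slot_cues` maps a slot name to a set of lowercase cue words.
    Sentences that match no cue word are left out of the template.
    """
    template = {slot: [] for slot in slot_cues}
    for sentence in sentences:
        cleaned = sentence.lower().replace(".", " ").replace(",", " ")
        tokens = set(cleaned.split())
        best_slot, best_hits = None, 0
        for slot, cues in slot_cues.items():
            hits = len(tokens & cues)
            if hits > best_hits:
                best_slot, best_hits = slot, hits
        if best_slot is not None:
            template[best_slot].append(sentence)
    return template

# Hypothetical slots and cue words for an accident-type template.
ACCIDENT_SLOTS = {
    "cause": {"caused", "because", "due"},
    "damage": {"injured", "killed", "destroyed"},
}

article = [
    "The fire was caused by a gas leak.",
    "Three people were injured.",
    "Officials will hold a briefing.",
]
filled = fill_template(article, ACCIDENT_SLOTS)
```

Here the first sentence lands in the "cause" slot, the second in "damage", and the third matches no cue word and is dropped. A summary generator would then read the slots in a fixed order.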

A Study on the Construction Method of Collaboration Environment for Web (Web에서의 협력 환경 구축 방안 연구)

  • Lee, Jae-Ho
    • Journal of The Korean Association of Information Education / v.1 no.1 / pp.74-81 / 1997
  • The World Wide Web (Web) is currently one of the most popular Internet tools. For this reason, most ordinary users equate the Web with the Internet, and Web content has become an important issue. However, Web content is usually created by a single content creator, who sometimes refers and links to other documents. Such environments cause the delivery of incorrect information or links to other Web users. There are many ways to protect Web users from the delivery of incorrect information, and the best known is Computer Supported Cooperative Work (CSCW). CSCW supports a multi-user environment on a single system, but it needs additional support in the current Internet environment, which is defined as a distributed information network rather than the traditional client-server environment. In particular, Intranet environments need to support heterogeneous systems, such as several kinds of databases and systems like PC, Mac, and UNIX workstations. For this reason, collaboration is needed, and it should provide a common user interface to all Web users. In this paper, we review the current concepts of CSCW and groupware, which are the major concepts of collaboration, together with the definition, classification, and problem analysis of collaboration. Finally, we suggest a construction method for a collaboration environment for the Web.

Distribution of riparian vegetation in Ian Stream (이안천의 식생분포)

  • Kim, Ho-Joon;Lee, Hye-Keun;Choi, Kwang-Soon
    • Proceedings of the Korea Water Resources Association Conference / 2005.05b / pp.1274-1279 / 2005
  • The complex vegetation and plant species distributions within riparian corridors influence plant species diversity patterns at both local and regional scales and further reflect both natural and anthropogenic disturbances. Because of these characteristics, riparian zones are often the ecosystem-level component that is most sensitive to changes in the surrounding environment; they provide early indications of environmental change and can be viewed as an important source in the watershed. This study had two objectives: first, to document the composition and dominance of plant communities in riparian areas of the stream; second, to compare species composition and temporal diversity between stations in riparian areas of the Ian Stream. The flora comprised a total of 158 kinds of vascular plants: 49 families, 54 genera, 145 species, 12 varieties, and 1 forma. When the recent classification system of 280 kinds of naturalized plants was applied, the naturalization rate was $10.8\%$, higher than the mean value ($10.3\%$) of Korean mountain districts. Furthermore, the urbanization index (UI) was $6.1\%$. The dominant vegetation communities were distributed in three habitat compartments from upstream to downstream. The vegetation included Phragmites japonica, Salix gracilistyla, S. hulteni and Robinia pseudo-acacia in the riparian area; Persicaria sieboldii, Stellaria alsine var. undulata, Draba nemorosa var. hebecarpa, Capsella bursa-pastoris, Lepidium apetalum, Bidens frondosa, Trigonotis peduncularis and Hemistepta lyrata in the sandbank or the riparian area; and Equisetum arvense, Humulus japonicus, Persicaria perfoliata, Trifolium repens, Artemisia princeps var. orientalis, Lactuca indica var. laciniata, Avena fatua, Agropyron yesoense, Oenothera odorata, Viola mandshurica and Rumex crispus in banksides.

Feature Extraction to Detect Hoax Articles (낚시성 인터넷 신문기사 검출을 위한 특징 추출)

  • Heo, Seong-Wan;Sohn, Kyung-Ah
    • Journal of KIISE / v.43 no.11 / pp.1210-1215 / 2016
  • Readership of online newspapers has grown with the proliferation of smart devices. However, fierce competition between Internet newspaper companies has resulted in a large increase in the number of hoax articles. Hoax articles are those whose title does not convey the content of the main story, giving readers the wrong information about the contents. We note that hoax articles have certain characteristics, such as unnecessary celebrity quotations, a mismatch between the title and content, or incomplete sentences. Based on these, we extract and validate features to identify hoax articles. We build a large-scale training dataset by analyzing text keywords in replies to articles, and from it extract five effective features. We evaluate the performance of a support vector machine classifier on the extracted features and observe 92% accuracy on our validation set. In addition, we present a selective bigram model to measure the consistency between the title and content, which can be effectively used to analyze short texts in general.
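
As a rough illustration of measuring title and content consistency with bigrams, the sketch below scores a title by the fraction of its word bigrams that also occur in the body. This is a plain bigram overlap, not the authors' selective bigram model, whose selection criteria the abstract does not describe; the function names and example texts are hypothetical.

```python
def bigrams(text):
    """Set of adjacent lowercase word pairs in a whitespace-tokenized text."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))

def title_body_consistency(title, body):
    """Fraction of title bigrams that also appear in the body (0.0 to 1.0)."""
    title_bigrams = bigrams(title)
    if not title_bigrams:
        return 0.0
    return len(title_bigrams & bigrams(body)) / len(title_bigrams)

# Hypothetical articles: one consistent title, one clickbait-style title.
body = "the mayor opens new bridge downtown today"
consistent = title_body_consistency("mayor opens new bridge", body)
mismatched = title_body_consistency("you will not believe this", body)
```

A low score suggests the title shares little phrasing with the story, which could serve as one input feature to a classifier such as the support vector machine mentioned above.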

A Study on the Development of Search Algorithm for Identifying the Similar and Redundant Research (유사과제파악을 위한 검색 알고리즘의 개발에 관한 연구)

  • Park, Dong-Jin;Choi, Ki-Seok;Lee, Myung-Sun;Lee, Sang-Tae
    • The Journal of the Korea Contents Association / v.9 no.11 / pp.54-62 / 2009
  • To avoid redundant investment in the project selection process, it is necessary to check whether submitted research topics have already been proposed or carried out at other institutions. This can be done with search engines that apply a keyword matching algorithm based on Boolean techniques to a national-scale database of research results. Although the accuracy and speed of information retrieval have improved, such engines still have fundamental limits caused by keyword matching. This paper examines an implemented TFIDF-based algorithm and presents a search engine experiment that retrieves and ranks documents similar or redundant to research proposals. In addition to the generic TFIDF algorithm, feature weighting and K-Nearest Neighbors classification methods are implemented in the algorithm. The test documents are extracted from the NDSL (National Digital Science Library) web directory service.
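
A generic TFIDF ranking of existing documents against a new proposal can be sketched as follows. This covers only the generic TFIDF and cosine similarity part; the feature weighting and K-Nearest Neighbors steps mentioned in the abstract are not reproduced. The smoothed IDF formula, the function names, and the sample texts are assumptions chosen for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dicts) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            # Smoothed IDF so that terms present in every document keep some weight.
            term: (count / length) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_similar(proposal, corpus):
    """Return (corpus index, similarity) pairs, most similar first."""
    vectors = tfidf_vectors(corpus + [proposal])
    query = vectors[-1]
    scored = [(i, cosine(query, vec)) for i, vec in enumerate(vectors[:-1])]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical existing projects and a newly submitted proposal.
existing = [
    "deep learning for image classification",
    "soil erosion in river basins",
    "image classification with convolutional networks",
]
ranking = rank_similar("image classification using deep learning", existing)
```

Unlike Boolean keyword matching, this produces a graded ordering: a document sharing several weighted terms with the proposal ranks above one sharing only a few, and a document with no shared terms scores zero.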