• Title/Abstract/Keywords: Similar Documents

Search results: 283 items; processing time: 0.025 seconds

A Comparative Study on Topic Modeling of LDA, Top2Vec, and BERTopic Models Using LIS Journals in WoS

  • 이용구;김선욱
    • 한국문헌정보학회지 / Vol. 58, No. 1 / pp. 5-30 / 2024
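  • This study aims to identify the characteristics of and differences among the topic modeling models LDA, Top2Vec, and BERTopic by extracting topics from experimental data with each model and comparing the results. The experimental data consisted of 55,442 papers published in 85 library and information science journals indexed in the Web of Science (WoS). In the experiment, first-stage topic modeling results were obtained using each model's default parameters, second-stage results were obtained after setting the optimal number of topics, and the results were compared by model and by stage. In the first stage, LDA, Top2Vec, and BERTopic generated 100, 350, and 550 topics, respectively, so the three models yielded very different numbers of topics; relative to LDA, Top2Vec and BERTopic subdivided the topics 3 times and 5 times more finely. The three models also differed considerably in the mean and standard deviation of the number of documents per topic. Specifically, LDA assigned many documents to a relatively small number of topics, whereas BERTopic showed the opposite tendency. In the second stage, which generated 25 topics, Top2Vec assigned more documents per topic on average than the other models and distributed documents evenly across topics, resulting in a relatively small deviation. Comparing whether the models generated similar topics, LDA and Top2Vec produced 18 common topics out of 25 (72%), so these two models yielded results more similar to each other than to BERTopic. Future work requires a more complete analysis through expert evaluation of whether each topic and its assigned documents are thematically well formed.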
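The comparison above rests on the mean and standard deviation of documents per topic. As a minimal sketch of how such statistics could be computed from one topic label per document, the following may help; the function name `topic_size_stats` and the toy label lists are assumptions for illustration and are not taken from the study.

```python
from collections import Counter
from statistics import mean, pstdev


def topic_size_stats(assignments):
    """Return (number of topics, mean docs per topic, population std dev)."""
    sizes = Counter(assignments)        # topic id -> number of documents
    counts = list(sizes.values())
    return len(counts), mean(counts), pstdev(counts)


# Hypothetical toy assignments (one topic id per document), not data from the study.
few_large_topics = [0, 0, 0, 0, 0, 0, 1, 1]    # few topics, uneven sizes
many_small_topics = [0, 0, 1, 1, 2, 2, 3, 3]   # more topics, even sizes

print(topic_size_stats(few_large_topics))
print(topic_size_stats(many_small_topics))
```

A smaller standard deviation here corresponds to documents being spread more evenly across topics, which is the property the abstract reports for Top2Vec in the 25-topic stage.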

A Study On Characteristics of the International Standby Practices - Focused on the comparison with UCP 500 -

  • 이충열
    • 무역상무연구 / Vol. 14 / pp. 257-287 / 2000
  • Applying the UCP to standby credits has caused many problems and complaints. To solve these problems, the International Standby Practices (ISP) were established. The ISP and the UCP are similar in that both generally regulate credit transactions. However, when the ISP is compared with the UCP, the following features are found: 1. Under the UCP, when force majeure such as acts of God or strikes causes a temporary work stoppage, the expiration date cannot be extended. Under the ISP, in the same situation, the expiration date can be extended to 30 days after the place for presentation re-opens for business. 2. The UCP does not specify who the issuer of a document must be, because there can be many issuers of documents. The ISP specifies that all required documents are to be issued by the beneficiary. 3. The UCP requires consistency among the presented documents. The ISP allows a discrepancy between presented documents. 4. Under the UCP, if drawings and/or shipments are required by a credit to be made in instalments and a required drawing/instalment is not made, the credit ceases to be available for any subsequent instalment. Under the ISP, even in the same situation, there is no loss of effect and no influence on the rights of beneficiaries. 5. Under the UCP, multiple transfers are not permitted, but partial transfers are. The ISP states just the opposite: multiple transfers are permitted, but partial transfers are not. 6. The UCP obligates each bank (issuer, confirming and nominated bank) to complete its review within a 'reasonable time' not exceeding seven banking days. Under the ISP, less than three business days is deemed not unreasonable and more than seven days is deemed unreasonable. 7. The ISP, unlike the UCP, recognizes that issuers and confirmers may spread their risk through syndication and participation of standby credits. However, the ISP should be reviewed carefully before it is applied. If necessary, partial additions or modifications can be made. In general, the ISP gives the greatest advantage to issuers. Issuers can make positive use of the ISP, but applicants should consider using the UCP in light of their rights and duties.

Some Views for the Buddhist Culture of Southeast Asia at Middle Ages through the Chinese Description (I): Focused on the documents of Faxian and Ichong

  • 주수완
    • 수완나부미 / Vol. 2, No. 1 / pp. 55-94 / 2010
  • Although Faxian(法顯)'s Gaosengfaxianchuan (『高僧法顯傳』) and Iching(義淨)'s Nanhaijiguineifachuan (『南海寄歸內法傳』) are regarded as very important and useful documents for studying Southeast Asian Buddhist culture, it is very difficult to grasp the contemporary state of the region from them because their descriptions are very brief and implicit. This essay therefore aims at an in-depth reading of these documents as original texts for the modern understanding of the region, and tries to offer new views for approaching Southeast Asian Buddhist culture more historically and concretely. In the early 5th century, when Faxian(法顯) arrived, Buddhism was flourishing in Sri Lanka. Because a long time had already passed since the Saṅgha split into conservative and progressive orders around the beginning of the Common Era, he mentioned nothing about conflict or disharmony between the two orders. The faith in the Buddha tooth relic, which had arisen about 50 years before Faxian's visit, was firmly established as a representative religion of Sri Lanka. According to his record, the ritual procession of the Buddha tooth was performed very magnificently, in a manner similar to the present-day Korean Youngsan ceremony(靈山齋). Meanwhile, it appears that there were many Buddha images sculpted from precious stones, a special product of Sri Lanka. The faith in the Buddha-pāda (the Buddha's footprints) was also widespread at that time. The most famous monk of contemporary Sri Lanka was Buddhaghosa, the author of the Visuddhi-magga, but it is not certain that Faxian met him. It may be suspected that the funeral in which Faxian participated was Buddhaghosa's, or that the writing of the Visuddhi-magga was at its peak during Faxian's stay. On his way back to China, Faxian boarded an indigenous ship around Indonesia, which means that there was no Chinese trade ship he could use. Trade between China and Southeast Asia was thus carried by South Asian ships, and Chinese ships were not yet so actively involved at that time. At least until that time, it appears that there was no remarkable Buddhist movement in the Southeast Asian countries where he stopped. In contrast, the Southeast Asian world seen by Iching had already experienced many changes. He was impressed by the high-quality Buddhist culture of the region and urged that it be adopted in China. Further, he analyzed the Buddhist sects prevalent in Southeast Asia in his time and tried to build good relationships with native monks in order to learn from them. The center of those exchanges appears to have been Śrīvijaya in Indonesia. He also mentioned the situation of late-7th-century Funan(扶南) in Cambodia, where the Buddhist Saṅgha was being oppressed by the newly rising Khmer(眞臘). He also described in detail the similarities and differences between Indian and Southeast Asian Buddhist culture in the field of ritual, such as the practical use of garments, Buddha images, and daily recited scriptures. There must be many other aspects that this essay could not gather or catch from these documents. Nevertheless, I hope this essay can help researchers in this field, and I welcome any advice and comments from them.

A Method for Evaluating News Value based on Supply and Demand of Information Using Text Analysis

  • 이동훈;최호창;김남규
    • 지능정보연구 / Vol. 22, No. 4 / pp. 45-67 / 2016
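  • Although many studies have focused on the differences in characteristics between Internet news and social networking services (SNS), the major media of information distribution today, relatively few have examined the differences between the two from the perspective of the demand for and supply of information. In general, new information is exposed to the public through news articles from media outlets, and the public shares opinions or additional information about these articles through SNS, thereby both accepting and spreading the information. From this viewpoint, the act of providing news by media outlets can be regarded as the supply of information, and the public can be understood as expressing consumption demand for that information by actively showing interest in it through SNS. This is similar to the principle by which the price of goods and services is determined by the relationship between demand and supply, and it suggests that the value of information can be measured on the basis of the relationship between information demand and information supply. In this study, we select Internet news articles as the representative medium of information supply and Twitter as the representative medium of information demand, and we devise and present the News Value Index (NVI), which evaluates the value of news about a specific issue as information by the volume of related tweets. Specifically, the proposed methodology derives an NVI for each issue and uses it to visualize changes in information value over time. To evaluate the practical applicability of the proposed methodology, we conducted an experiment on 387,018 Internet news articles and 31,674,795 tweets. The results showed that most issues changed in a way that converged to the average value of the overall information market, while some interesting issues exhibited unusual patterns, such as consistently holding above-average value and dominating the information market.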

Discovery Methods of Similar Web Service Operations by Learning Ontologies

  • 이용주
    • 정보처리학회논문지D / Vol. 18D, No. 2 / pp. 133-142 / 2011
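  • To ensure the success of semantic web service technology, the use of high-quality ontologies is essential. Despite the importance of ontologies, however, few ontologies for web services currently exist, and building them is not easy. This problem is a major obstacle to the spread and development of web services today. This paper proposes a method for discovering similar web service operations that automatically builds an ontology by uncovering the semantic information hidden among items, using only the WSDL documents that are automatically generated when web services are developed. The core idea is to group semantically identical concepts from the WSDL input and output items and to form hierarchical relationships among the items, thereby automatically constructing a semantic ontology. A new similarity measure is then used to discover similar operations in order of priority, and a web service operation search system is implemented that selects the most suitable of the discovered operations so that it can be used directly for web service composition.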

Unstructured Data Analysis and Multi-pattern Storage Technique for Traffic Information Inference

  • 김용훈;김부일;정목동
    • 한국멀티미디어학회논문지 / Vol. 21, No. 2 / pp. 211-223 / 2018
  • Understanding the meaning of data is a common goal of research on unstructured data. Among unstructured data, text such as corpora and sentences is particularly difficult to analyze for meaning. Existing studies have used LSA to select the sentences whose meaning is most similar to specific words in the sentences. However, examining many sentences continuously in this way is problematic. To address the unstructured data classification problem, several search sites classify word frequencies and serve the results to users. In this paper, we propose a method that classifies documents using the frequency of similar words, applies the frequency of non-relevant words as weights, and stores the results in a multi-pattern storage. We apply TensorFlow's softmax to nearby sentences for machine learning, and use it for unstructured data analysis and the inference of traffic information.

Towards an Ideology of Agricultural Extension as a Philosophy of Lifelong Education

  • 이종만
    • 농촌지도와개발 / Vol. 11, No. 1 / pp. 1-19 / 2004
  • The objective of this study was to find the ideological link between agricultural extension education and lifelong education. The study was conducted by analyzing prior studies related to agricultural extension and lifelong education; a review of literature and documents was its main method. The study reviewed and analyzed the concepts, characteristics, and ideology of lifelong education, and presented some general characteristics of lifelong education in the context of educational ideology. As a result, the following five characteristics of lifelong education in the context of educational ideology were presented: 1) lifelong education is the supreme concept of education and includes all kinds of education; 2) lifelong education is the future direction of educational ideology and philosophy rather than a kind of educational practice; 3) lifelong education means securing the right to learn throughout an individual's entire life span; 4) lifelong education has an innovative function with respect to the existing educational situation, namely the viewpoint, contents, and methodology of learning; 5) lifelong education ultimately moves towards a 'learning society'. Agricultural extension and lifelong education share a similar ideological background in general and have a similar basic philosophy. The ideology and philosophy of lifelong education should be reflected in the ideology of agricultural extension to broaden the perspectives of agricultural extension in the future.

The Characteristics of Blue Color Combination Shown in Men's Fashion

  • 장정임;조주연;이연희
    • 복식문화연구 / Vol. 17, No. 2 / pp. 309-319 / 2009
  • The goal of this study is to analyze the color characteristics of blue used in men's fashion for use in the design development process. First, we reviewed previous studies and documents on the color characteristics of blue in general, as well as on coloration in fashion design and men's fashion. We composed color samples by collecting two-color combinations used in men's fashion collections over 5 years, from 2004 S/S to 2008 F/W, through specialized fashion information websites. We limited the colors to the range from Blue Green (BG) to Purple Blue (PB). Second, we analyzed the characteristics of hue combination and tone combination. A total of 351 pictures were collected, and RGB and HV/C values were converted with the Munsell Conversion program (ver. 8.0.1). The color data were sorted into 10 hues and 12 PCCS tones. From this, we found that, for blue, similar/same hue combinations were used more than contrasting hue combinations, and similar/same tone combinations were used more than contrasting tone combinations. We limited this analysis of blue coloration in men's fashion to two-color combinations; subsequent studies will need to examine the characteristics of multi-color combinations and detailed blue coloration images by garment type.

Detecting Local Text Reuse in the Texts of East Asian Traditional Medicine

  • 오준호
    • 대한한의학원전학회지 / Vol. 34, No. 1 / pp. 37-45 / 2021
  • Objectives : The purpose of this paper was to examine quantitative methods for estimating and detecting local text reuse in the texts of East Asian Traditional Medicine. Methods : We introduce techniques that estimate the volume of local text reuse with n-grams and techniques that directly detect the reuse with the Smith-Waterman algorithm (SW algorithm). Based on these, local text reuse was estimated and detected for 『Donguibogam』 and 『Huangdineijing·Suwen』. Results : Estimates with n-grams had more errors than the SW algorithm. The SW algorithm directly detected the strings suspected of local text reuse, producing more accurate results. Conclusions : Although n-grams do not accurately find local text reuse, their high speed makes them preferable for certain purposes, such as screening similar documents. The SW algorithm, on the other hand, is relatively good at finding similar phrases suspected of local text reuse even if the strings do not match completely. However, because it consumes excessive time and computing resources, its benefits are limited to cases where precise results are required.
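To make the contrast between the two techniques concrete, here is a minimal sketch of a character n-gram overlap estimate next to a basic Smith-Waterman local alignment. It is not the paper's implementation: the function names, the choice of bigrams, and the scoring values (match 2, mismatch -1, gap -1) are assumptions chosen only for illustration.

```python
def ngrams(text, n=2):
    """Set of character n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_overlap(a, b, n=2):
    """Fraction of a's n-grams that also occur in b (fast, approximate)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a) if grams_a else 0.0


def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best local alignment score, end index in a, end index in b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]   # scoring matrix, first row/column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below 0, so an alignment can restart anywhere.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    return best, best_pos[0], best_pos[1]


# Hypothetical toy strings, not taken from the texts studied in the paper.
print(ngram_overlap("abcde", "xxbcdyy"))
print(smith_waterman("abcde", "xxbcdyy"))
```

The n-gram estimate needs only set operations, while the alignment fills a table of size len(a) x len(b), which reflects the speed versus precision trade-off described in the conclusions.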

History Document Image Background Noise and Removal Methods

  • Ganchimeg, Ganbold
    • International Journal of Knowledge Content Development & Technology / Vol. 5, No. 2 / pp. 11-24 / 2015
  • Archive libraries commonly provide public access to collections of historical and ancient document images, and such images often require specialized processing to remove background noise and become more legible. Document images may be contaminated with noise during transmission, scanning, or conversion to digital form. We can categorize noise by identifying its features and can search for similar patterns in a document image to choose appropriate methods for its removal. In this paper, we propose a hybrid binarization approach for improving the quality of old documents using a combination of global and local thresholding. This article also reviews the kinds of noise that might appear in scanned document images and discusses some noise removal methods.
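The abstract does not say how the global and local thresholds are combined, so the following is only one hypothetical way to sketch the idea in pure Python: pixels far from a global threshold are decided globally, and ambiguous pixels are compared with their neighbourhood mean. The function names, the use of the image mean as the global threshold, and the `margin` and `radius` parameters are all assumptions, not the paper's method.

```python
def global_threshold(img):
    """Mean intensity of the whole image (a simple stand-in for e.g. Otsu's method)."""
    pixels = [p for row in img for p in row]
    return sum(pixels) / len(pixels)


def local_mean(img, r, c, radius=1):
    """Mean intensity in a window of side 2*radius+1, clipped to the image borders."""
    rows = range(max(0, r - radius), min(len(img), r + radius + 1))
    cols = range(max(0, c - radius), min(len(img[0]), c + radius + 1))
    window = [img[i][j] for i in rows for j in cols]
    return sum(window) / len(window)


def hybrid_binarize(img, margin=40, radius=1):
    """Return a 0/1 image where 1 marks ink (dark) and 0 marks background."""
    t = global_threshold(img)
    out = []
    for r, row in enumerate(img):
        out_row = []
        for c, p in enumerate(row):
            if abs(p - t) > margin:
                # Clearly dark or clearly bright: trust the global threshold.
                out_row.append(1 if p < t else 0)
            else:
                # Ambiguous pixel: compare with its neighbourhood instead.
                out_row.append(1 if p < local_mean(img, r, c, radius) else 0)
        out.append(out_row)
    return out


# Hypothetical 3x3 grayscale patch: one dark stroke pixel on a bright page.
patch = [[200, 200, 200],
         [200, 20, 200],
         [200, 200, 200]]
print(hybrid_binarize(patch))
```

The local step is what lets a method of this kind cope with uneven background noise, at the cost of examining a window around each ambiguous pixel.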