• Title/Summary/Keyword: Web-Crawling


Method of Related Document Recommendation with Similarity and Weight of Keyword (키워드의 유사도와 가중치를 적용한 연관 문서 추천 방법)

  • Lim, Myung Jin; Kim, Jae Hyun; Shin, Ju Hyun
    • Journal of Korea Multimedia Society / v.22 no.11 / pp.1313-1323 / 2019
  • With the development of the Internet and the spread of smartphones, services designed for user convenience are increasing, so users can check news in real time anytime and anywhere. However, online news is organized by media outlet and category and offers only a few related search terms, making it difficult to find news related to specific keywords. To solve this problem, we propose a method that recommends related documents more accurately by applying Doc2Vec similarity to the specific keywords of news articles and weighting the title and contents of news articles. We collect news articles from the Naver politics category by web crawling in a Java environment, preprocess them, extract topics using LDA modeling, and compute similarities using Doc2Vec. To supplement Doc2Vec, we apply TF-IDF to obtain TC (Title Contents) weights for the title and contents of news articles. We then combine the Doc2Vec similarity and the TC weight to generate a TC weight-similarity, and evaluate the similarity between words using the PMI technique to confirm the keyword association.
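
The abstract names PMI (pointwise mutual information) as the word-association measure but does not say how the probabilities are estimated. The following is a minimal sketch, assuming document-level co-occurrence, where PMI(x, y) = log2(p(x, y) / (p(x) p(y))). The function name `pmi_table` and the toy articles are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(docs):
    """Return {(w1, w2): PMI} using document-level co-occurrence.

    docs is a list of token lists; a word counts at most once per document.
    Assumption: probabilities are estimated as document frequencies / N.
    """
    n_docs = len(docs)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words
    for tokens in docs:
        vocab = sorted(set(tokens))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))  # pairs in alphabetical order

    table = {}
    for (w1, w2), joint in pair_df.items():
        p_xy = joint / n_docs
        p_x = word_df[w1] / n_docs
        p_y = word_df[w2] / n_docs
        table[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return table


# Hypothetical, already-tokenized toy articles (not data from the paper).
docs = [
    ["election", "candidate", "vote"],
    ["election", "vote", "poll"],
    ["budget", "tax"],
    ["budget", "tax", "vote"],
]
scores = pmi_table(docs)
```

Pairs that always appear together, such as `("budget", "tax")` here, receive a positive score, while pairs that co-occur less often than chance would predict receive a negative one. Pairs that never co-occur are simply absent from the table.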

A Study of Realtime Malware URL Detection & Prevention in Mobile Environment (모바일 환경에서 실시간 악성코드 URL 탐지 및 차단 연구)

  • Park, Jae-Kyung
    • Journal of the Korea Society of Computer and Information / v.20 no.6 / pp.37-42 / 2015
  • In this paper, we propose a malware database held in mobile memory for real-time malware URL detection, together with a real-time malware URL detection engine that controls web service access for more secure mobile service. Recently, mobile malware has been on the rise and has become a new threat in the mobile environment. Given the characteristics of mobile devices, malware damage is especially serious because it leads to monetary loss for the user. There has been much research on cybercrime prevention and malware detection, but it is still insufficient. Additionally, we propose a method for preventing smishing via SMS and MMS. In the near future, mobile vendors must build a secure mobile environment with fundamental measures based on our research.

A Study on Political Attitude Estimation of Korean OSN Users (온라인 소셜네트워크를 통한 한국인의 정치성향 예측 기법의 연구)

  • Wijaya, Muhammad Eka; Ahn, Heejune
    • Journal of Korea Society of Industrial Information Systems / v.21 no.4 / pp.1-11 / 2016
  • Recently, numerous studies have been conducted to estimate human personality from online social activities. This paper develops a comprehensive model for political attitude estimation that leverages the Facebook Like information of users. We designed a Facebook crawler that efficiently collects data, overcoming the difficulties of crawling Ajax-enabled Facebook pages. We show that category-level selection can reduce the complexity of data analysis by exploiting the sparsity of the huge like-attitude matrix. In the context of Korean Facebook users, only 28 criteria (3% of the total) can estimate the political polarity of a user with high accuracy (AUC of 0.82).
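
The accuracy figure above is reported as AUC. As a reminder of what that metric measures, here is a minimal sketch that computes the area under the ROC curve as the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as half. The function name `roc_auc` and the labels and scores are hypothetical illustrations, not the paper's data or code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: iterable of 0/1 class labels; scores: predicted scores, same length.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy predictions (1 = one polarity, 0 = the other).
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. The pairwise loop is quadratic in the number of examples, which is fine for illustration but not for large datasets.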

Factors affecting the number of citations in papers published in the Journal of Korean Society of Dental Hygiene (한국치위생학회지 게재논문의 피인용수에 영향을 미친 요인)

  • Jeon, Se-Jeong
    • Journal of Korean society of Dental Hygiene / v.21 no.5 / pp.639-644 / 2021
  • Objectives: The purpose of this study was to analyze the factors that affected the number of citations for articles published in the Journal of Korean Society of Dental Hygiene, based on previous studies. Methods: Information on papers, including the number of citations, was collected using a web crawling technique. The effects of the number of author keywords, the number of Medical Subject Headings (MeSH) keywords, the MeSH match rate, the abstract word count, and the keyword-abstract ratio on the number of citations were analyzed by multiple regression analysis. Results: The use of MeSH keywords did not have a significant effect on the number of citations. Among the other factors, only the keyword-abstract ratio was statistically significant. Conclusions: Select a topic of constant interest in the field, write the title in detail using colons or asterisks if necessary, and do not repeat the words used in the title in the keywords. Select specific keywords deeply related to the topic. In particular, choose words or phrases that are frequently used in the abstract. If the MeSH keyword selection contradicts the previous strategies, boldly give up the MeSH keyword.

Deep Unsupervised Learning for Rain Streak Removal using Time-varying Rain Streak Scene (시간에 따라 변화하는 빗줄기 장면을 이용한 딥러닝 기반 비지도 학습 빗줄기 제거 기법)

  • Cho, Jaehoon; Jang, Hyunsung; Ha, Namkoo; Lee, Seungha; Park, Sungsoon; Sohn, Kwanghoon
    • Journal of Korea Multimedia Society / v.22 no.1 / pp.1-9 / 2019
  • Single image rain removal is a typical inverse problem that decomposes an image into a background scene and a rain streak layer. Recent work has made substantial progress on the task thanks to the development of convolutional neural networks (CNNs). However, existing CNN-based approaches train the network with synthetically generated training examples, which tend to bias the network toward synthetic scenes. In this paper, we present an unsupervised framework for removing rain streaks from real-world rainy images. We focus on the natural phenomenon that static rainy scenes share a common background but contain different rain streaks. Based on this observation, we train a Siamese network with pairs of real rain images so that it outputs identical backgrounds for each pair. To train our network, a real rainy dataset is constructed via web crawling. We show that our unsupervised framework outperforms recent CNN-based approaches that are trained in a supervised manner. Experimental results demonstrate the effectiveness of our framework on both synthetic and real-world datasets, showing improved performance over previous approaches.

A Study on Fruits Characteristics of the Chosen Dynasty through the Analysis of Chosenwangjoeshirok Big Data (빅데이터 분석을 통한 조선시대 과실류 특성 연구)

  • Kim, Mi-Hye
    • Journal of the Korean Society of Food Culture / v.36 no.2 / pp.168-183 / 2021
  • Using big data analysis of the Choseonwangjosilrok, this research aimed to identify the types of fruits, their prevalence, their seasonal appearances, and the royalty's perspective on fruits during the Choseon period. The Choseonwangjosilrok included nineteen kinds of fruits and five kinds of nuts, totaling 1,601 cases (72.8%) and 533 cases (24.2%) respectively. The text recorded fruits being used as tributes for kings, gifts from kings to palace officials, tomb offerings, county specialties, trade goods or gifts to foreign ambassadors, and medicinal ingredients in oriental pharmacy. Seasonally, the appearances of fruits were evenly distributed. By period, the number of appearances decreased chronologically: from the fifteenth century to the nineteenth century, fruits appeared 804 times (36.6%), 578 times (26.3%), 490 times (22.3%), 248 times (11.3%), and 78 times (3.5%) respectively. In the fifteenth century, citrons, quinces, pomegranates, cherries, persimmons, watermelons, Korean melons, omija, walnuts, chestnuts, and pine nuts appeared most frequently. In the sixteenth century, pears, grapes, apricots, peaches, and hazelnuts appeared most frequently. In the seventeenth century, tangerines and dates appeared most frequently. In the eighteenth century, trifoliate orange was the most frequently mentioned fruit.

A Study on Key Factors Influencing Customers' Ratings of Restaurants by Using Data Mining Method (데이터 마이닝을 활용한 외식업체의 평점에 영향을 미치는 선행 요인)

  • Kim, Seon Ju; Kim, Byoung Soo
    • The Journal of Information Systems / v.31 no.2 / pp.1-18 / 2022
  • Purpose: Customer reviews are a major factor in choosing a restaurant. This study investigates the key factors affecting customers' evaluations of restaurants. With the recent intensification of competition among restaurants in the service industry, the analysis results are expected to provide in-depth insights for enhancing customer experiences. Design/methodology/approach: We collected restaurant information and reviews from the Kakao Map platform. The collected data cover 3,785 restaurants in Daegu registered on Kakao Map. From these data, seven independent variables were used: number of ratings registered, number of reviews, presence or absence of safe-restaurant status, presence or absence of a posting about holding facilities, presence or absence of a posting about business hours, presence or absence of a posting about hashtags, and presence or absence of break times. The dependent variable is the restaurant rating. Multiple regression between the independent variables and the restaurant rating was carried out. Findings: The results confirmed that the number of ratings registered, the presence or absence of a posting about business hours, and the presence or absence of a posting about hashtags have positive effects on the restaurant rating. The number of reviews had a negative effect on the restaurant rating. In addition, to confirm the role of customer reviews, we carried out LDA topic modeling and divided the topics into positive and negative reviews.

Text-Mining Analyses of News Articles on Schizophrenia (조현병 관련 주요 일간지 기사에 대한 텍스트 마이닝 분석)

  • Nam, Hee Jung; Ryu, Seunghyong
    • Korean Journal of Schizophrenia Research / v.23 no.2 / pp.58-64 / 2020
  • Objectives: In this study, we conducted an exploratory analysis of current media trends on schizophrenia using text-mining methods. Methods: First, web-crawling techniques were used to extract text data from 575 news articles published in 10 major newspapers between 2018 and 2019, which were selected by searching "schizophrenia" in Naver News. We developed a document-term matrix (DTM) and/or term-document matrix (TDM) through pre-processing techniques. Using the DTM and TDM, frequency analysis, co-occurrence network analysis, and topic model analysis were conducted. Results: Frequency analysis showed that keywords such as "police," "mental illness," "admission," "patient," "crime," "apartment," "lethal weapon," "treatment," "Jinju," and "residents" were frequently mentioned in news articles on schizophrenia. Within the article text, many of these keywords were highly correlated with the term "schizophrenia" and were also interconnected with each other in the co-occurrence network. The latent Dirichlet allocation model presented 10 topics comprising a combination of keywords: "police-Jinju," "hospital-admission," "research-finding," "care-center," "schizophrenia-symptom," "society-issue," "family-mind," "woman-school," and "disabled-facilities." Conclusion: The results of the present study highlight that in recent years, the media has been reporting violence in patients with schizophrenia, thereby raising an important issue of hospitalization and community management of patients with schizophrenia.
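
The Methods portion above relies on a document-term matrix, its transpose (the term-document matrix), term frequencies, and keyword co-occurrence. The following is a minimal sketch of those data structures using only the Python standard library. The helper names and the tokenized toy documents are hypothetical, and the study's actual pre-processing (tokenization, stop-word removal, and so on) is not specified in the abstract.

```python
from collections import Counter


def build_dtm(docs):
    """Return (vocab, matrix) where matrix[i][j] counts vocab[j] in docs[i]."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: j for j, tok in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for tok in doc:
            matrix[i][index[tok]] += 1
    return vocab, matrix


def transpose(matrix):
    """A term-document matrix is the transpose of the document-term matrix."""
    return [list(col) for col in zip(*matrix)]


def term_frequencies(vocab, matrix):
    """Sum each DTM column to get corpus-wide term counts."""
    return Counter({term: sum(row[j] for row in matrix)
                    for j, term in enumerate(vocab)})


def cooccurrence(vocab, matrix, a, b):
    """Count the documents in which terms a and b both appear."""
    ja, jb = vocab.index(a), vocab.index(b)
    return sum(1 for row in matrix if row[ja] > 0 and row[jb] > 0)


# Hypothetical, already-tokenized toy articles (not the study's corpus).
docs = [
    ["police", "patient", "crime"],
    ["patient", "treatment", "patient"],
    ["police", "crime", "residents"],
]
vocab, dtm = build_dtm(docs)
tdm = transpose(dtm)
freq = term_frequencies(vocab, dtm)
police_crime = cooccurrence(vocab, dtm, "police", "crime")
```

Co-occurrence counts like `police_crime` for every keyword pair would form the weighted edges of a co-occurrence network. Topic modeling with latent Dirichlet allocation would normally be run on the DTM with a dedicated library and is left out of this sketch.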

Research on the Drinking Culture of the Choseon dynasty's Ruling Class using Semantic Network Analysis

  • Mi-Hye, Kim; Yeon-Hee, Kim
    • CELLMED / v.13 no.2 / pp.3.1-3.21 / 2023
  • In this study, the drinking culture of the Choseon dynasty is examined by applying a text frequency analysis technique to the entire 『Choseonwangjosilok (朝鮮王朝實錄)』. This study examined a total of 1,968 volumes and 948 books about the 27 kings of Choseon, spanning a total of 518 years, through web crawling on the National Institute of Korean History website. Python 3.8 was used to extract sentences related to alcohol, Rhino 1.4.5 was used for morphological analysis to extract nouns, and Gephi 0.9.2 was used for semantic network analysis. The results of the analysis of alcohol culture in the 『Choseonwangjosilok (朝鮮王朝實錄)』 are as follows. The alcoholic beverages mentioned were more often those used in court or in ritual ceremonies than those defined by the specific ingredients or manufacturing methods commonly used by the general public. Regarding the ruling class, semantic network analysis of the 『Choseonwangjosilok (朝鮮王朝實錄)』 found drinking in the Choseon dynasty to be highly associated with political issues related to maintaining the power relations within the Korean royal court system. At times, alcohol was used to maintain personal relationships, while at other times it was seen as an essential item in state ceremonies. It was also used as a highly political means to maintain and strengthen national power.

Development of Dataset Items for Commercial Space Design Applying AI

  • Jung Hwa SEO; Segeun CHUN; Ki-Pyeong, KIM
    • Korean Journal of Artificial Intelligence / v.11 no.1 / pp.25-29 / 2023
  • The purpose of this paper is to create a standard for AI training dataset types for commercial space design. As the market for space design continues to grow and time spent indoors has increased since COVID-19, interest in space is expanding throughout society. In addition, more and more consumers are getting used to the digital environment. Therefore, if you identify trends and preemptively propose, quickly and easily, the atmosphere and specifications that customers require, you can increase customer trust and conduct effective sales. As for the dataset type, commercial districts were divided into a total of 8 categories, and images that could be processed were derived by refining 4,009,30MB JPG format images collected through web crawling. Then, by performing bounding and labeling operations, we developed a 'Dataset for AI Training' of 3,356 commercial space image data in CSV format with a size of 2.08MB. Through this study, elements of spatial images such as place type, space classification, and furniture can be extracted and used when developing AI algorithms, and it is expected that images requested by clients can be easily and quickly collected through spatial image input information.