• Title/Summary/Keyword: We Crawler

Search Result 82, Processing Time 0.025 seconds

A Study on the Hyperlink Network Analysis of Library Web Sites (도서관 웹사이트의 하이퍼링크 네트워크 분석)

  • Roh, Yoon-Ju;Kim, Seong-Hee
    • Journal of the Korean BIBLIA Society for Library and Information Science / v.28 no.2 / pp.99-117 / 2017
  • This study empirically analyzed the hyperlinks of 32 web sites in order to examine the hyperlink network structure of web sites for each type of domestic library. After collecting the hyperlink data with a crawler, we analyzed the overall characteristics of the web sites in the network based on the characteristics of each library. The results are as follows. 1) Among all analyzed libraries, Yonsei scored the highest in degree centrality, betweenness centrality, closeness centrality, and eigenvector centrality. 2) By library type, Sejong among national libraries, Seoul among public libraries, and Yonsei among college libraries appeared relatively influential. These results can serve as basic data for establishing an operation strategy that improves the efficiency and effectiveness of library web sites in the future.
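
As an illustration of the simplest of the centrality measures named above, the following minimal sketch computes normalized degree centrality for an undirected hyperlink graph. The site names and links are hypothetical, and treating links as undirected is an assumption; this is not the authors' data or tooling.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return normalized degree centrality for an undirected link graph."""
    neighbors = defaultdict(set)
    for src, dst in edges:
        if src == dst:
            continue  # ignore self-links
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    # A node linked to every other node scores 1.0.
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Hypothetical hyperlinks between four library sites (illustrative only).
links = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
scores = degree_centrality(links)
most_central = max(scores, key=scores.get)
```

Here site A links to all three other sites, so it receives the maximum score. Betweenness, closeness, and eigenvector centrality need shortest-path or iterative computations and are usually taken from a graph library.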

Mask Wearing Detection System using Deep Learning (딥러닝을 이용한 마스크 착용 여부 검사 시스템)

  • Nam, Chung-hyeon;Nam, Eun-jeong;Jang, Kyung-Sik
    • Journal of the Korea Institute of Information and Communication Engineering / v.25 no.1 / pp.44-49 / 2021
  • Recently, due to COVID-19, many studies have applied neural networks to automatic mask-wearing detection systems. Such systems use either 1-stage or 2-stage detection methods, and when sufficient data cannot be collected, pretrained neural network models are adapted with fine-tuning techniques. In this paper, the system uses a 2-stage detection method consisting of an MTCNN model for face recognition and a ResNet model for mask detection. The mask detector was evaluated with five ResNet models to improve accuracy and fps in various environments. The training data consisted of 17,217 images collected with a web crawler, and for inference we used 1,913 images and two one-minute videos. The experiment showed a high accuracy of 96.39% for images and 92.98% for video, and the inference speed for video was 10.78 fps.

Analysis of Text Mining of Consumer's Personality Implication Words in Review of Used Transaction Application (중고거래 어플리케이션 <당근마켓> 리뷰텍스트에 나타난 소비자의 인성 함축단어 텍스트마이닝 분석)

  • Jung, Yea-Rin;Ju, Young-Ae
    • The Journal of the Korea Contents Association / v.21 no.11 / pp.1-10 / 2021
  • This study analyzes the use and meaning of words implying consumer personality in the review text of the used-goods trading application 당근마켓 (Danggeun Market). As of May 2021, data from the preceding six months were collected by our Web crawler in Seoul and Gyeonggi Province; a total of 1,368 cases were first collected by random sampling, and finally 570 cases were preprocessed. The results are as follows. First, 48.2% of the review texts were related to the personality of consumers, even though the application is a commercial platform for products. Second, the review text is mainly positive and forms a text network structure centered on the keyword 'gratitude'. Third, the review text implying consumer character was divided into two groups: 'extrovert personality' and 'introvert personality' of consumers. The characteristics of the two groups worked together on the platform. In conclusion, we suggest that consumer personality plays an important role in the platform transaction process, that it will play a role in the platform's services in the future, and that it should be studied from various perspectives.

Reviews Analysis of Korean Clinics Using LDA Topic Modeling (토픽 모델링을 활용한 한의원 리뷰 분석과 마케팅 제언)

  • Kim, Cho-Myong;Jo, A-Ram;Kim, Yang-Kyun
    • The Journal of Korean Medicine / v.43 no.1 / pp.73-86 / 2022
  • Objectives: In the health care industry, the influence of online reviews is growing. As medical services are provided mainly by providers, those services have been managed by hospitals and clinics. However, direct promotion of medical services by providers is legally forbidden. For this reason, consumers such as patients and clients search many reviews on the Internet for information about hospitals, treatments, prices, etc. Online reviews can therefore be regarded as indicators of hospital quality, and they should be analyzed for sustainable hospital marketing. Method: Using a Python-based crawler, we collected more than 14,000 reviews written by real patients who had experienced Korean medicine. To extract the most representative words, the reviews were divided into positive and negative; they were then pre-processed to keep only nouns and adjectives and to compute TF (Term Frequency), DF (Document Frequency), and TF-IDF (Term Frequency - Inverse Document Frequency). Finally, to identify topics in the reviews, the extracted words were analyzed using LDA (Latent Dirichlet Allocation). To avoid overlap, the number of topics was set using Davis visualization. Results and Conclusions: Six and three topics were extracted from the positive and negative reviews, respectively, using the LDA topic model. The main factors making up the topics were 1) response to patients and customers, 2) customized treatment (consultation) and management, and 3) the hospital/clinic environment.
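
The TF, DF, and TF-IDF quantities mentioned in the Method can be sketched as follows. This is a minimal illustration that assumes the common `tf * log(N / df)` weighting and already-tokenized reviews; the abstract does not state which variant the authors used, and the sample tokens are invented.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    df = Counter()  # document frequency: number of documents containing a term
    for tokens in docs:
        df.update(set(tokens))
    weights = []
    for tokens in docs:
        if not tokens:
            weights.append({})
            continue
        tf = Counter(tokens)  # raw term counts within this document
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Hypothetical tokenized reviews (nouns and adjectives only).
reviews = [
    ["kind", "doctor", "clean"],
    ["kind", "staff"],
    ["long", "wait", "kind"],
]
weights = tf_idf(reviews)
```

A word that appears in every review, such as "kind" here, gets a weight of zero, which is why TF-IDF surfaces words that distinguish one review from the others.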

HTML Text Extraction Using Tag Path and Text Appearance Frequency (태그 경로 및 텍스트 출현 빈도를 이용한 HTML 본문 추출)

  • Kim, Jin-Hwan;Kim, Eun-Gyung
    • Journal of the Korea Institute of Information and Communication Engineering / v.25 no.12 / pp.1709-1715 / 2021
  • To accurately extract the necessary text from a web page, one method specifies to the web crawler the tags and style attributes where the main content is located; its problem is that the extraction logic must be modified whenever the web page configuration changes. To solve this problem, a previous study proposed extracting the text by analyzing the frequency of appearance of the text, but that method was limited in that its performance varied widely depending on the collection channel of the web page. Therefore, in this paper, we propose a method that extracts text with high accuracy from various collection channels by analyzing not only the frequency of appearance of text but also the parent tag paths of text nodes extracted from the DOM tree of web pages.
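
The idea of pairing each text node with its parent tag path can be sketched with Python's standard `html.parser`. The heuristic below (keep the text under the tag path that carries the most characters) is a simplified stand-in chosen for illustration, not the scoring proposed in the paper; the class name, function name, and sample page are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
SKIP_TAGS = {"script", "style"}


class TagPathCollector(HTMLParser):
    """Record (parent tag path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not SKIP_TAGS & set(self.stack):
            self.records.append(("/".join(self.stack), text))


def main_text(html):
    """Return the text under the tag path holding the most characters."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    chars_per_path = Counter()
    for path, text in collector.records:
        chars_per_path[path] += len(text)
    if not chars_per_path:
        return ""
    best_path, _ = chars_per_path.most_common(1)[0]
    return " ".join(t for p, t in collector.records if p == best_path)


page = """<html><body>
<div><ul><li>Home</li><li>News</li></ul></div>
<div><p>First paragraph of the article body.</p>
<p>Second paragraph with more detail.</p></div>
</body></html>"""
body = main_text(page)
```

Because the selection depends on where text accumulates rather than on a hard-coded tag or style attribute, the same code keeps working when a site renames its container classes.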

Building an SNS Crawling System Using Python (Python을 이용한 SNS 크롤링 시스템 구축)

  • Lee, Jong-Hwa
    • Journal of Korea Society of Industrial Information Systems / v.23 no.5 / pp.61-76 / 2018
  • Everything in the world where modern people live is becoming networked. The Internet of Things, which attaches sensors to objects, allows real-time data transfer to and from the network. Mobile devices, essential for modern people, play an important role in recording all traces of everyday life in real time. Through social network services, information acquisition and communication activities are left in a huge network in real time. From a business point of view, customer needs analysis begins with SNS data. In this research, we build a system that uses Python to automatically collect SNS content from the web in real time. We aim to support customer needs analysis through a data collection system for Instagram, Twitter, and YouTube, which have large numbers of users worldwide. Using a virtual web browser in a Python web server environment, the collected data goes through an exploitation process and an NLP process and is stored in a database. Based on the results of this study, we plan to offer a service through a site on which the desired data is automatically collected by a search function and netizens' responses can be confirmed in real time through time series data analysis. In addition, since search results were returned within 5 seconds of execution, the advantage of the proposed algorithm is confirmed.

A study on Digital Agriculture Data Curation Service Plan for Digital Agriculture

  • Lee, Hyunjo;Cho, Han-Jin;Chae, Cheol-Joo
    • Journal of the Korea Society of Computer and Information / v.27 no.2 / pp.171-177 / 2022
  • In this paper, we propose a service method that can provide insight into multi-source agricultural data, cluster environmental factors to support data analysis over time, and curate crop environmental factors. The proposed curation service consists of four steps: collection, preprocessing, storage, and analysis. First, in the collection step, the service system collects and organizes multi-source agricultural data by using an OpenAPI-based web crawler. Second, in the preprocessing step, the system performs data smoothing to reduce data measurement errors. Here, we adopt a smoothing method for each type of facility in consideration of the error rate according to facility characteristics such as greenhouses and open fields. Third, in the storage step, an agricultural data integration schema and a Hadoop HDFS-based storage structure are proposed for large-scale agricultural data. Finally, in the analysis step, the service system performs DTW-based time series classification in consideration of the characteristics of agricultural digital data. Through the DTW-based classification, the accuracy of prediction results is improved by reflecting the characteristics of time series data without any loss. As future work, we plan to implement the proposed service method and apply it to a smart farm greenhouse for testing and verification.
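
The DTW-based time series classification in the analysis step can be illustrated with a minimal sketch: a textbook dynamic time warping distance plus a nearest-neighbor label lookup. The 1-nearest-neighbor rule, the absolute-difference cost, and the temperature-like sample series are assumptions for illustration; the abstract does not describe the authors' classifier in this detail.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a[i-1] aligned with an already matched b[j-1]
                cost[i][j - 1],      # b[j-1] aligned with an already matched a[i-1]
                cost[i - 1][j - 1],  # one-to-one match
            )
    return cost[n][m]


def classify(series, labeled):
    """Return the label of the reference series closest under DTW (1-NN)."""
    return min(labeled, key=lambda item: dtw_distance(series, item[0]))[1]


# Hypothetical reference series, e.g. greenhouse temperature trends.
references = [
    ([20, 22, 25, 27], "warming"),
    ([27, 25, 22, 20], "cooling"),
]
label = classify([20, 20, 22, 25, 27], references)
```

DTW tolerates series of different lengths and local shifts in time, which is why the five-point query still matches the four-point "warming" reference with zero distance.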

A Study on the necessity of Open Source Software Intermediaries in the Software Distribution Channel (소프트웨어 유통에 있어 공개소프트웨어 중개자의필요성에 대한 연구)

  • Lee, Seung-Chang;Suh, Eung-Kyo;Ahn, Sung-Hyuck;Park, Hoon-Sung
    • Journal of Distribution Science / v.11 no.2 / pp.45-55 / 2013
  • Purpose - The development and implementation of OSS (Open Source Software) led to a dramatic change in corporate IT infrastructure, from system server to smart phone, because the performance, reliability, and security functions of OSS are comparable to those of commercial software. Today, OSS has become an indispensable tool to cope with the competitive business environment and the constantly-evolving IT environment. However, the use of OSS is insufficient in small and medium-sized companies and software houses. This study examines the need for OSS Intermediaries in the Software Distribution Channel. It is expected that the role of the OSS Intermediary will be reduced with the improvement of the distribution process. The purpose of this research is to prove that OSS Intermediaries increase the efficiency of the software distribution market. Research design, Data, and Methodology - This study presents the analysis of data gathered online to determine the extent of the impact of the intermediaries on the OSS market. Data was collected using an online survey, conducted by building a personal search robot (web crawler). The survey period lasted 9 days during which a total of 233,021 data points were gathered from sourceforge.net and Apple's App store, the two most popular software intermediaries in the world. The data collected was analyzed using Google's Motion Chart. Results - The study found that, beginning 2006, the production of OSS in the Sourceforge.net increased rapidly across the board, but in the second half of 2009, it dropped sharply. There are many events that can explain this causality; however, we found an appropriate event to explain the effect. It was seen that during the same period of time, the monthly production of OSS in the App store was increasing quickly. The App store showed a contrasting trend to software production. Our follow-up analysis suggests that appropriate intermediaries like App store can enlarge the OSS market. 
The increase was caused by the appearance of B2C software intermediaries like App store. The results imply that OSS intermediaries can accelerate OSS software distribution, while development of a better online market is critical for corporate users. Conclusion - In this study, we analyzed 233,021 data points on the online software marketplace at Sourceforge.net. The analysis indicates that OSS Intermediaries are needed in the software distribution market for its vitality. It is also critical that OSS intermediaries satisfy certain qualifications to play a key role as market makers. This study has several interesting implications. One implication of this research is that the OSS intermediary should make an effort to create a complementary relationship between OSS and Proprietary Software. The second implication is that the OSS intermediary must possess a business model that shares the benefits with all the participants (developer, intermediary, and users). The third implication is that the intermediary should provide OSS of high quality, like proprietary software with a high level of complexity. Thus, it is worthwhile to examine this study, which shows that open source software intermediaries are essential in the software distribution channel.

Do Not Just Talk, Show Me in Action: Investigating the Effect of OSSD Activities on Job Change of IT Professional (오픈소스 소프트웨어 개발 플랫폼 활동이 IT 전문직 취업에 미치는 영향)

  • Jang, Moonkyoung;Lee, Saerom;Baek, Hyunmi;Jung, Yoonhyuk
    • The Journal of Society for e-Business Studies / v.26 no.1 / pp.43-65 / 2021
  • With the advancement of information and communications technology, the means of recruiting IT professionals have fundamentally changed. Nowadays recruiters search for candidate information on the Web as well as in traditional information sources such as résumés or interviews. In particular, open-source software development (OSSD) platforms have become an opportunity for developers to demonstrate their IT capabilities, and thus a way for recruiters to find the candidates they need. Therefore, this study aims to investigate the impact of developers' profiles on an OSSD platform on their finding a job. This study examined four antecedents of developer information that can accelerate the job search: job-seeking status, personal-information posting, learning activities, and knowledge contribution activities. For the empirical analysis, we developed a Web crawler and gathered a dataset on 4,005 developers from GitHub, which is a well-known OSSD platform. Proportional hazards regression was used for data analysis because a shorter job-seeking period implies a more successful job change. Our results indicate that developers who explicitly posted their job-seeking status had shorter job-seeking periods than those who did not. The other antecedents (i.e., personal-information posting, learning, and knowledge contribution activities) also contributed to reducing the job-seeking period. These findings imply the value of OSSD platforms for recruiters seeking suitable candidates and for developers seeking to find a job successfully.

User-Perspective Issue Clustering Using Multi-Layered Two-Mode Network Analysis (다계층 이원 네트워크를 활용한 사용자 관점의 이슈 클러스터링)

  • Kim, Jieun;Kim, Namgyu;Cho, Yoonho
    • Journal of Intelligence and Information Systems / v.20 no.2 / pp.93-107 / 2014
  • In this paper, we report what we have observed with regard to user-perspective issue clustering based on multi-layered two-mode network analysis. This work is significant in the context of data collection by companies about customer needs. Most companies have failed to uncover such needs for products or services properly in terms of demographic data such as age, income levels, and purchase history. Because of excessive reliance on limited internal data, most recommendation systems do not provide decision makers with appropriate business information for current business circumstances. However, part of the problem is the increasing regulation of personal data gathering and privacy. This makes demographic or transaction data collection more difficult, and is a significant hurdle for traditional recommendation approaches because these systems demand a great deal of personal data or transaction logs. Our motivation for presenting this paper to academia is our strong belief, and evidence, that most customers' requirements for products can be effectively and efficiently analyzed from unstructured textual data such as Internet news text. In order to derive users' requirements from textual data obtained online, the proposed approach in this paper attempts to construct double two-mode networks, such as a user-news network and news-issue network, and to integrate these into one quasi-network as the input for issue clustering. One of the contributions of this research is the development of a methodology utilizing enormous amounts of unstructured textual data for user-oriented issue clustering by leveraging existing text mining and social network analysis. In order to build multi-layered two-mode networks of news logs, we need some tools such as text mining and topic analysis. We used not only SAS Enterprise Miner 12.1, which provides a text miner module and cluster module for textual data analysis, but also NetMiner 4 for network visualization and analysis. 
Our approach for user-perspective issue clustering is composed of six main phases: crawling, topic analysis, access pattern analysis, network merging, network conversion, and clustering. In the first phase, we collect visit logs for news sites by crawler. After gathering unstructured news article data, the topic analysis phase extracts issues from each news article in order to build an article-issue network. For simplicity, 100 topics are extracted from 13,652 articles. In the third phase, a user-article network is constructed with access patterns derived from web transaction logs. The double two-mode networks are then merged into a quasi-network of user-issue. Finally, in the user-oriented issue-clustering phase, we classify issues through structural equivalence, and compare these with the clustering results from statistical tools and network analysis. An experiment with a large dataset was performed to build a multi-layer two-mode network. After that, we compared the results of issue clustering from SAS with that of network analysis. The experimental dataset was from a web site ranking site, and the biggest portal site in Korea. The sample dataset contains 150 million transaction logs and 13,652 news articles of 5,000 panels over one year. User-article and article-issue networks are constructed and merged into a user-issue quasi-network using NetMiner. Our issue-clustering results applied the Partitioning Around Medoids (PAM) algorithm and Multidimensional Scaling (MDS), and are consistent with the results from SAS clustering. In spite of extensive efforts to provide user information with recommendation systems, most projects are successful only when companies have sufficient data about users and transactions. Our proposed methodology, user-perspective issue clustering, can provide practical support to decision-making in companies because it enhances user-related data from unstructured textual data. 
To overcome the problem of insufficient data from traditional approaches, our methodology infers customers' real interests by utilizing web transaction logs. In addition, we suggest topic analysis and issue clustering as a practical means of issue identification.
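
The network-merging phase described above, in which a user-article network and an article-issue network are combined into a user-issue quasi-network, can be sketched as a sparse matrix product over nested dictionaries. The weighting (visit count times issue strength, summed over shared articles) and all sample names are assumptions for illustration; the authors performed this step in NetMiner.

```python
def merge_two_mode(user_article, article_issue):
    """Combine user->article and article->issue weights into user->issue weights."""
    user_issue = {}
    for user, articles in user_article.items():
        weights = {}
        for article, visits in articles.items():
            # Every issue attached to a visited article is credited to the user.
            for issue, strength in article_issue.get(article, {}).items():
                weights[issue] = weights.get(issue, 0) + visits * strength
        user_issue[user] = weights
    return user_issue


# Hypothetical access logs: how often each user visited each article.
user_article = {"u1": {"a1": 2, "a2": 1}, "u2": {"a2": 3}}
# Hypothetical topic analysis output: which issues each article covers.
article_issue = {"a1": {"economy": 1}, "a2": {"economy": 1, "sports": 1}}

merged = merge_two_mode(user_article, article_issue)
```

The resulting user-issue weights can then be passed to a clustering step, since each issue is now described by the users who read about it.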