• Title/Summary/Keyword: large-scale search engines

Implementation of a Parallel Web Crawler for the Odysseus Large-Scale Search Engine (오디세우스 대용량 검색 엔진을 위한 병렬 웹 크롤러의 구현)

  • Shin, Eun-Jeong;Kim, Yi-Reun;Heo, Jun-Seok;Whang, Kyu-Young
    • Journal of KIISE: Computing Practices and Letters / v.14 no.6 / pp.567-581 / 2008
  • As the size of the web is growing explosively, search engines are becoming increasingly important as the primary means to retrieve information from the Internet. A search engine periodically downloads web pages and stores them in a database to provide users with up-to-date search results. The web crawler is the program that downloads and stores web pages for this purpose. A large-scale search engine uses a parallel web crawler to retrieve the collection of web pages while maximizing the download rate. However, neither the service architecture nor the experimental analysis of parallel web crawlers has been fully discussed in the literature. In this paper, we propose an architecture for a parallel web crawler and discuss implementation issues in detail. The proposed parallel web crawler is based on the coordinator/agent model, which uses multiple machines to download web pages in parallel. The coordinator/agent model consists of multiple agent machines that collect web pages and a single coordinator machine that manages them. The parallel web crawler consists of three components: a crawling module for collecting web pages, a converting module for transforming the web pages into a database-friendly format, and a ranking module for rating web pages based on their relative importance. We explain each component of the parallel web crawler and its implementation methods in detail. Finally, we conduct extensive experiments to analyze the effectiveness of the parallel web crawler. The experimental results clarify the merit of our architecture: the proposed parallel web crawler scales with both the number of web pages to crawl and the number of machines used.
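The abstract does not say how the coordinator divides work among the agent machines. The following is a minimal sketch, assuming the coordinator partitions URLs by a hash of the host name so that each site is always handled by the same agent. This partitioning rule and the names `assign_agent` and `partition_urls` are illustrative assumptions, not the authors' implementation.

```python
import zlib
from collections import defaultdict
from urllib.parse import urlsplit


def assign_agent(url, num_agents):
    """Pick an agent index for a URL (assumed rule: hash of the host name)."""
    host = urlsplit(url).hostname or ""
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(host.encode("utf-8")) % num_agents


def partition_urls(urls, num_agents):
    """Coordinator side: group URLs into one batch per agent index."""
    batches = defaultdict(list)
    for url in urls:
        batches[assign_agent(url, num_agents)].append(url)
    return dict(batches)


if __name__ == "__main__":
    seeds = [
        "http://example.com/a",
        "http://example.com/b",
        "http://example.org/index.html",
    ]
    for agent_id, batch in sorted(partition_urls(seeds, 3).items()):
        print(agent_id, batch)
```

Keeping every URL of a host on one agent is one way to let that agent enforce per-site politeness delays without cross-machine coordination; other partitioning schemes are equally possible.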

PDFindexer: Distributed PDF Indexing system using MapReduce

  • Murtazaev, JAziz;Kihm, Jang-Su;Oh, Sangyoon
    • International Journal of Internet, Broadcasting and Communication / v.4 no.1 / pp.13-17 / 2012
  • Indexing converts a raw document collection into an easily searchable representation. Web search by Google or Yahoo provides subsecond response times, which are made possible by efficient indexing of web pages over the entire Web. The indexing process becomes challenging as the scale grows. Parallel techniques, such as the MapReduce framework, can assist in efficient large-scale indexing. In this paper we propose PDFindexer, a system for indexing scientific papers in PDF using the MapReduce programming model. Unlike Web search engines, our target domain is scientific papers, which have a predefined structure, such as title, abstract, sections, and references. Our proposed system parses scientific papers in PDF, recreates their structure, and performs efficient distributed indexing with the MapReduce framework on a cluster of nodes. We provide an overview of the system, its components, and the interactions among them. We discuss some issues related to the design of the system and the use of MapReduce in parsing and indexing large document collections.
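To illustrate the general idea of indexing with the MapReduce programming model, here is a single-process sketch that builds an inverted index in map, shuffle, and reduce steps. It is not PDFindexer's code: the function names, the tokenizer, and the in-memory shuffle are all assumptions made for illustration, and a real deployment would run the phases on a cluster framework.

```python
import re
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit one (term, doc_id) pair per distinct term in a document."""
    for term in set(re.findall(r"[a-z0-9]+", text.lower())):
        yield term, doc_id


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for term, doc_id in pairs:
        groups[term].append(doc_id)
    return groups


def reduce_phase(term, doc_ids):
    """Reduce: turn the grouped document ids into a sorted posting list."""
    return term, sorted(set(doc_ids))


def build_index(docs):
    """Run map, shuffle, and reduce over a {doc_id: text} collection."""
    pairs = (
        pair
        for doc_id, text in docs.items()
        for pair in map_phase(doc_id, text)
    )
    return dict(reduce_phase(t, ids) for t, ids in shuffle(pairs).items())


if __name__ == "__main__":
    papers = {
        "p1": "Parallel web crawler",
        "p2": "Distributed web indexing",
    }
    print(build_index(papers)["web"])
```

To index structured papers, the map step could emit keys that combine a field name (title, abstract, and so on) with the term, which is one possible way to preserve the document structure the abstract mentions.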

Users' Understanding of Search Engine Advertisements

  • Lewandowski, Dirk
    • Journal of Information Science Theory and Practice / v.5 no.4 / pp.6-25 / 2017
  • In this paper, a large-scale study on users' understanding of search-based advertising is presented. It is based on (1) a survey, (2) a task-based user study, and (3) an online experiment. Data were collected from 1,000 users representative of the German online population. Findings show that users generally lack an understanding of Google's business model and the workings of search-based advertising. 42% of users self-report that they either do not know that it is possible to pay Google for preferred listings for one's company on the SERPs or do not know how to distinguish between organic results and ads. In the task-based user study, we found that only 1.3 percent of participants were able to mark all areas correctly. 9.6 percent had all their identifications correct but did not mark all results they were required to mark. For none of the screenshots given were more than 35% of users able to mark all areas correctly. In the experiment, we found that users who are not able to distinguish between the two results types choose ads around twice as often as users who can recognize the ads. The implications are that models of search engine advertising and of information seeking need to be amended, and that there is a severe need for regulating search-based advertising.

SoFA: A Distributed File System for Search-Oriented Systems (SoFA: 검색 지향 시스템을 위한 분산 파일 시스템)

  • Choi, Eun-Mi;Tran, Doan Thanh;Upadhyaya, Bipin;Azimov, Fahriddin;Luu, Hoang Long;Truong, Phuong;Kim, Sang-Bum;Kim, Pil-Sung
    • Journal of the Korea Society for Simulation / v.17 no.4 / pp.229-239 / 2008
  • A Distributed File System (DFS) provides a mechanism by which a file can be stored across several physical computer nodes while ensuring replication transparency and failure transparency. Applications that process large volumes of data (such as search engines, grid computing applications, and data mining applications) require a backend infrastructure for storing data, and the distributed file system is the central component of such a storage infrastructure. Many projects focused on network computing have designed and implemented distributed file systems with a variety of architectures and functionalities. In this paper, we describe a complete distributed file system that can be used in large-scale search-oriented systems.

Hazelcast Vs. Ignite: Opportunities for Java Programmers

  • Maxim, Bartkov;Tetiana, Katkova;S., Kruglyk Vladyslav;G., Murtaziev Ernest;V., Kotova Olha
    • International Journal of Computer Science & Network Security / v.22 no.2 / pp.406-412 / 2022
  • Storing large amounts of data has been a major problem since the beginning of computing history. Big Data has made huge advancements in improving business processes by identifying customers' needs with prediction models based on web and social media search. The main purpose of Big Data stream processing frameworks is to allow programmers to query a continuous stream directly without dealing with the lower-level mechanisms. In other words, programmers write the code to process streams using these runtime libraries (also called Stream Processing Engines). This is achieved by taking large volumes of data and analyzing them using Big Data frameworks. Streaming platforms are an emerging technology that deals with continuous streams of data. Several Big Data streaming platforms are freely available on the Internet. However, selecting the most appropriate one is not easy for programmers. In this paper, we present a detailed description of two state-of-the-art and highly popular streaming frameworks: Apache Ignite and Hazelcast. In addition, the performance of these frameworks is compared using selected attributes. Different types of databases are commonly used to store data. To process data continuously in real time, data streaming technologies have been developed. With the development of today's large-scale distributed applications, which handle very large volumes of data, these databases are not viable. Consequently, Big Data technologies were introduced to store, process, and analyze data at high speed and to cope with the day-by-day growth in users and data.

An Analysis of Clinical Research Trends on Interventions of Oriental Medicine for Postpartum Disease and Postpartum Care (산후병 및 산후관리에 대한 국내 한의학 임상 연구 동향 분석)

  • Kim, Nu-Ree;Lee, Eun-Hee
    • The Journal of Korean Obstetrics and Gynecology / v.35 no.1 / pp.34-58 / 2022
  • Objectives: This study was performed to analyze the interventions of Oriental Medicine that have been commonly used for postpartum disease and postpartum care. Methods: We searched 4 domestic search engines for research on interventions for postpartum disease and postpartum care. We then conducted eligibility screening based on inclusion and exclusion criteria. Results: 1. We selected a total of 50 studies. There were 2 randomized controlled trials (RCTs), 5 non-RCTs, 35 case reports, and 8 case series within the 6~8 weeks after childbirth. 2. In the 35 case reports, several interventions were used: acupuncture (22), moxibustion (11), cupping therapy (7), pharmacopuncture (5), chuna manipulation (4), and herbal medicine (34). The most common symptoms were musculoskeletal symptoms (8), followed by postpartum depression (7). Various prescriptions and acupoints of oriental medicine were used depending on the diseases or symptoms. 3. Of the 8 case series, 382 subjects in 5 case series had taken Saenghwa-tang-gagam. Acupuncture, moxibustion, cupping therapy (5), and pharmacopuncture (1) were also used as interventions. 4. The most commonly used acupoints were 腎兪 (BL23) for pain, including postpartum back pain, and 三陰交 (SP6) and 關元 (CV4) for postpartum care. 關元 (CV4) was the most commonly used moxibustion point not only for postpartum disease but also for postpartum care. Conclusions: In clinical studies of oriental medicine related to postpartum disease and postpartum care, pain-related clinical studies that belong to or progress to Sanhupung were the most common (30%), and among them, postpartum low back pain studies were the most common (20%). Based on this, we believe that large-scale, high-quality clinical studies using oriental interventions including chuna and pharmacopuncture are needed to establish guidelines for the management of pain treatment, including postpartum back pain.