• Title/Summary/Keyword: dynamic query


Crawling algorithm design and experiment for automatic deep web document collection (심층 웹 문서 자동 수집을 위한 크롤링 알고리즘 설계 및 실험)

  • Yun-Jeong, Kang; Min-Hye, Lee; Dong-Hyun, Won
    • Journal of the Korea Institute of Information and Communication Engineering, v.27 no.1, pp.1-7, 2023
  • Deep web collection means entering a query into a search form and collecting the response results. The deep web is estimated to hold about 450 to 550 times more information than the statically constructed surface web. A static page does not show changed information until it is refreshed, whereas a dynamic page updates the necessary information in real time without reloading; however, crawlers have difficulty accessing this updated information. A way to collect such deep web information automatically with a crawler is therefore needed. This paper proposes treating scripts as ordinary links and, to that end, proposes and experiments with an algorithm that can follow client scripts like regular URLs. The proposed algorithm focuses on collecting web information through menu navigation and script execution rather than the usual method of entering data into search forms.
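As a rough illustration of the idea described in this abstract (not the paper's actual implementation), the following Python sketch treats script-bearing page elements as if they were ordinary links: it clicks them in a headless browser and snapshots the dynamically updated DOM. The use of Playwright and the CSS selectors are assumptions made here for illustration.

```python
# Minimal sketch: crawl dynamic content by executing client-side scripts,
# treating script triggers (onclick handlers, menu items) like ordinary links.
# Assumes the Playwright library; selectors and URLs are illustrative only.
from playwright.sync_api import sync_playwright

def crawl_dynamic(start_url: str, max_pages: int = 50) -> list[str]:
    collected = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(start_url)
        # Enumerate elements whose scripts update content in place.
        triggers = page.query_selector_all("[onclick], nav a, .menu li")
        for el in triggers[:max_pages]:
            try:
                el.click()                        # execute the client script
                page.wait_for_load_state("networkidle")
                collected.append(page.content())  # snapshot the updated DOM
            except Exception:
                continue                          # skip stale or failing triggers
        browser.close()
    return collected
```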

Modeling and Implementation of Multilingual Meta-search Service using Open APIs and Ajax (Open API와 Ajax를 이용한 다국어 메타검색 서비스의 모델링 및 구현)

  • Kim, Seon-Jin; Kang, Sin-Jae
    • Journal of Korea Society of Industrial Information Systems, v.14 no.5, pp.11-18, 2009
  • Ajax, based on JavaScript, has received attention as an alternative to ActiveX technology. Most portal sites in Korea tend to relaunch existing services using the technology, because it supports most web browsers and offers a rich interface, excellent speed, and traffic reduction through asynchronous interaction. This paper models and implements a multilingual meta-search service using Ajax and open APIs provided by well-known international sites. First, a Korean query is translated into one of the languages of 54 countries around the world by the Google translation API, and the translated result is then used to search social web sites such as Flickr, YouTube, Daum, and Naver. Search results are displayed quickly by dynamically loading only a portion of the screen with Ajax. Our system can reduce server traffic and per-packet communication charges by preventing redundant transmission of unnecessary information.
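The browser-side Ajax behavior is hard to show outside a web page, but the core pattern, fanning a translated query out to several site APIs concurrently and merging the results, can be sketched server-side in Python. This is an assumption-laden sketch only: the endpoint URLs and the "q" parameter are placeholders, not the real Google/Flickr/YouTube/Daum/Naver APIs.

```python
# Sketch of concurrent meta-search fan-out; assumes the aiohttp library.
import asyncio
import aiohttp

SITES = {
    "site_a": "https://api.example.com/a/search",  # placeholder endpoints
    "site_b": "https://api.example.com/b/search",
}

async def fetch(session, name, url, query):
    # Each request runs asynchronously, like an Ajax call updating one panel.
    async with session.get(url, params={"q": query}) as resp:
        return name, await resp.json()

async def meta_search(translated_query: str) -> dict:
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, name, url, translated_query)
                 for name, url in SITES.items()]
        return dict(await asyncio.gather(*tasks))
```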

HummingBird: A Similar Music Retrieval System using Improved Scaled and Warped Matching (HummingBird: 향상된 스케일드앤워프트 매칭을 이용한 유사 음악 검색 시스템)

  • Lee, Hye-Hwan; Shim, Kyu-Seok; Park, Hyoung-Min
    • Journal of KIISE: Databases, v.34 no.5, pp.409-419, 2007
  • The database community has focused on similar-music retrieval systems that answer humming queries against a music database. One approach converts MIDI data into time series, builds indexes on them, and performs similarity search. Humming queries can be transformed into time series using known pitch-detection algorithms. The recently proposed algorithm, scaled and warped matching, is based on dynamic time warping and uniform scaling. This paper proposes Humming BIRD (Humming Based sImilaR mini music retrieval system) using a sliding window and center-aligned scaled and warped matching. Center-aligned scaled and warped matching is a mixed distance measure combining center-aligned uniform scaling and time warping. The newly proposed measure gives a tighter lower bound than previous ones, which reduces the search space. The empirical results show the superiority of this algorithm in pruning power while it returns the same results.
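A minimal sketch of the underlying scaled-and-warped matching idea follows: the query is uniformly rescaled over a range of scaling factors and compared to a candidate with dynamic time warping, keeping the best distance. The paper's center-aligned variant, sliding window, and lower-bound pruning are not reproduced here; the scale range is an illustrative assumption.

```python
# Sketch: combine uniform scaling with dynamic time warping (DTW).
import numpy as np

def dtw(a, b):
    # Classic O(n*m) DTW with squared point-wise cost.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def scaled_and_warped(query, candidate, scales=np.linspace(0.7, 1.3, 7)):
    # Try several uniform scalings of the query and keep the smallest DTW distance.
    query = np.asarray(query, dtype=float)
    best = np.inf
    for s in scales:
        new_len = max(2, int(round(len(query) * s)))
        rescaled = np.interp(np.linspace(0, len(query) - 1, new_len),
                             np.arange(len(query)), query)
        best = min(best, dtw(rescaled, np.asarray(candidate, dtype=float)))
    return best
```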

Semantic Process Retrieval with Similarity Algorithms (유사도 알고리즘을 활용한 시맨틱 프로세스 검색방안)

  • Lee, Hong-Joo; Klein, Mark
    • Asia Pacific Journal of Information Systems, v.18 no.1, pp.79-96, 2008
  • One of the roles of Semantic Web services is to execute dynamic intra-organizational services, including the integration and interoperation of business processes. Since different organizations design their processes differently, retrieving similar semantic business processes is necessary in order to support inter-organizational collaboration. Most approaches for finding services that have certain features and support certain business processes have relied on some type of logical reasoning and exact matching. This paper presents our approach of using imprecise matching to expand results from an exact-matching engine when querying the OWL (Web Ontology Language) MIT Process Handbook. The MIT Process Handbook is an electronic repository of best-practice business processes, intended to help people (1) redesign organizational processes, (2) invent new processes, and (3) share ideas about organizational practices. In order to use the MIT Process Handbook for process retrieval experiments, we had to export it into an OWL-based format: we model the Process Handbook meta-model in OWL and export the processes in the Handbook as instances of that meta-model. Next, we need a sizable number of queries and their corresponding correct answers from the Process Handbook. Many previous studies devised artificial datasets composed of randomly generated numbers without real meaning and used subjective ratings as correct answers and similarity values between processes. To generate a semantics-preserving test data set, we create, using mutation operators, 20 variants of each target process that are syntactically different but semantically equivalent; these variants represent the correct answers for the target process. We devise diverse similarity algorithms based on the values of process attributes and the structures of business processes. We use simple text-retrieval similarity measures such as TF-IDF and Levenshtein edit distance, and utilize a tree edit distance measure because semantic processes appear to have a graph structure. We also design similarity algorithms considering the similarity of process structure, such as part processes, goals, and exceptions. Since we can identify relationships between a semantic process and its subcomponents, this information can be utilized for calculating similarities between processes; Dice's coefficient and the Jaccard similarity measure are used to calculate the portion of overlap between processes in diverse ways. We perform retrieval experiments to compare the performance of the devised similarity algorithms, measuring retrieval performance in terms of precision, recall, and the F-measure, the harmonic mean of precision and recall. The tree edit distance shows the poorest performance on all measures. TF-IDF and the method combining TF-IDF with Levenshtein edit distance perform better than the other devised methods; these two measures focus on the similarity between process names and descriptions. In addition, we calculate a rank correlation coefficient, Kendall's tau-b, between the number of process mutations and the ranking of similarity values among the mutation sets. In this experiment, similarity measures based on process structure, such as Dice's, Jaccard, and their derivatives, show a greater coefficient than measures based on the values of process attributes. However, the Lev-TFIDF-JaccardAll measure, which considers process structure and attribute values together, performs reasonably well in both experiments. For semantic process retrieval, it therefore seems better to consider diverse aspects of process similarity, such as process structure and the values of process attributes. We generate semantic process data and a corresponding test set for retrieval experiments from the MIT Process Handbook repository, suggest imprecise query algorithms that expand retrieval results from an exact-matching engine such as SPARQL, and compare the retrieval performance of the similarity algorithms. As limitations and future work, experiments with datasets from other domains are needed, and since the diverse measures yield many similarity values, better ways of identifying relevant processes may be found by applying these values simultaneously.
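As a concrete illustration of the simpler measures named above, the sketch below computes Levenshtein edit distance over process names and Jaccard/Dice overlap over sub-process sets, then combines them with illustrative weights. It is not the paper's Lev-TFIDF-JaccardAll measure; the dictionary fields and weights are assumptions.

```python
# Sketch: text and set-overlap similarity measures for process retrieval.
def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def jaccard(s, t) -> float:
    s, t = set(s), set(t)
    return len(s & t) / len(s | t) if s | t else 1.0

def dice(s, t) -> float:
    s, t = set(s), set(t)
    return 2 * len(s & t) / (len(s) + len(t)) if s or t else 1.0

def process_similarity(p: dict, q: dict, w_name=0.5, w_parts=0.5) -> float:
    # Name similarity from normalized edit distance; structural similarity
    # from overlap of sub-process identifiers (weights are illustrative).
    name_sim = 1 - levenshtein(p["name"], q["name"]) / max(len(p["name"]), len(q["name"]), 1)
    return w_name * name_sim + w_parts * jaccard(p["parts"], q["parts"])
```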

Smart Browser based on Semantic Web using RFID Technology (RFID 기술을 이용한 시맨틱 웹 기반 스마트 브라우저)

  • Song, Chang-Woo; Lee, Jung-Hyun
    • The Journal of the Korea Contents Association, v.8 no.12, pp.37-44, 2008
  • Data entered into RFID tags are used to save costs and enhance competitiveness in the development of applications in various industrial areas. RFID readers perform the identification and search of hundreds of objects, i.e., tags. RFID technology, which identifies objects on request for dynamic linking and tracking, is composed of application components that support the information infrastructure. Despite their many advantages, existing applications, which do not consider elements related to real-time data communication among remote RFID devices, cannot effectively support connections among heterogeneous devices. Because different network devices are installed in applications separately and go through different query-analysis processes, monitoring delays or data-conversion errors occur. The present study implements an RFID database handling system in a Semantic Web environment for the integrated management of information extracted from RFID tags, regardless of the application. A user's RFID tag is identified by an RFID reader mounted on an application, the data are sent to the RFID database processing system, and the system then converts the information into a Semantic Web language. Data transmitted in the standardized Semantic Web format are translated by a smart browser and displayed on the screen. The use of a Semantic Web language enables reasoning over meaningful relations, which in turn makes it easy to expand functionality by adding modules.
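A minimal sketch of the tag-to-Semantic-Web conversion step might look like the following, assuming the rdflib library; the namespace and property names are illustrative placeholders, not the paper's ontology.

```python
# Sketch: convert an RFID tag read into RDF triples (Turtle) so a
# semantic-web-aware browser can reason over and render it.
from rdflib import Graph, Literal, Namespace, RDF, XSD

EX = Namespace("http://example.org/rfid#")  # illustrative vocabulary

def tag_read_to_rdf(tag_id: str, reader_id: str, timestamp: str) -> str:
    g = Graph()
    g.bind("ex", EX)
    event = EX[f"read-{tag_id}"]
    g.add((event, RDF.type, EX.TagReadEvent))
    g.add((event, EX.tagId, Literal(tag_id)))
    g.add((event, EX.readBy, EX[reader_id]))
    g.add((event, EX.readAt, Literal(timestamp, datatype=XSD.dateTime)))
    return g.serialize(format="turtle")

# Example usage:
# print(tag_read_to_rdf("E200-3412", "reader-01", "2008-11-02T10:15:00"))
```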

A Load Balancing Method using Partition Tuning for Pipelined Multi-way Hash Join (다중 해시 조인의 파이프라인 처리에서 분할 조율을 통한 부하 균형 유지 방법)

  • Mun, Jin-Gyu; Jin, Seong-Il; Jo, Seong-Hyeon
    • Journal of KIISE: Databases, v.29 no.3, pp.180-192, 2002
  • We investigate the effect of data skew in join attributes on the performance of a pipelined multi-way hash join method, and propose two new hash join methods for the shared-nothing multiprocessor environment. The first proposed method allocates buckets statically in round-robin fashion, and the second allocates buckets dynamically based on a frequency distribution. Using hash-based joins, multiple joins can be pipelined so that early results from a join are sent on to the next join before the whole join is completed, without being staged on disk. The shared-nothing multiprocessor architecture is known to scale better for very large databases; however, this hardware structure is very sensitive to data skew. Unless the pipelined execution of multiple hash joins includes a dynamic load-balancing mechanism, skew can severely degrade system performance. In this paper, we derive an execution model of the pipeline segment and a cost model, and develop a simulator for the study. As shown by our simulation over a wide range of parameters, join selectivities, and relation sizes, system performance deteriorates as the degree of data skew grows. However, the proposed method, which uses a large number of buckets and a tuning technique, offers substantial robustness across a wide range of skew conditions.
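The two bucket-allocation strategies can be sketched as follows; this is an illustrative reconstruction, not the paper's exact algorithm. Round-robin assigns buckets to nodes statically, while the frequency-based variant weighs each bucket by its tuple count and greedily gives the heaviest remaining bucket to the least-loaded node.

```python
# Sketch: static round-robin vs. frequency-based bucket-to-node assignment.
from collections import Counter
import heapq

def round_robin_assign(buckets, num_nodes):
    # Static allocation: bucket i goes to node i mod num_nodes, ignoring skew.
    return {b: i % num_nodes for i, b in enumerate(buckets)}

def frequency_assign(join_keys, num_buckets, num_nodes):
    # Dynamic allocation: weigh each hash bucket by its tuple frequency and
    # greedily give the heaviest remaining bucket to the least-loaded node.
    load = Counter(hash(k) % num_buckets for k in join_keys)
    heap = [(0, n) for n in range(num_nodes)]   # (current load, node id)
    heapq.heapify(heap)
    assignment = {}
    for bucket, weight in load.most_common():
        node_load, node = heapq.heappop(heap)
        assignment[bucket] = node
        heapq.heappush(heap, (node_load + weight, node))
    return assignment
```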

Dynamic Management of Equi-Join Results for Multi-Keyword Searches (다중 키워드 검색에 적합한 동등조인 연산 결과의 동적 관리 기법)

  • Lim, Sung-Chae
    • The KIPS Transactions: Part A, v.17A no.5, pp.229-236, 2010
  • With the increasing number of documents on the Internet and in enterprises, it becomes crucial to support users' queries on those documents efficiently. In this situation, the full-text search technique is generally adopted, because it can answer uncontrolled ad-hoc queries by automatically indexing all the keywords found in the documents. The size of the index files built for full-text search grows with the number of indexed documents, so the disk cost of processing multi-keyword queries against those enlarged index files may become too large. To solve this problem, we propose both an index file structure and a management scheme suited to processing multi-keyword queries against a large volume of index files. We adopt the inverted-file structure, widely used for multi-keyword search, as the basic index structure and modify it into a hierarchical structure for the join and ranking operations performed during query processing. To save disk costs with that index structure, we dynamically store in main memory the results of join operations between two keywords when they are highly likely to appear together in users' queries. We also perform comparisons using a disk cost model to show the performance advantage of the proposed scheme.
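A toy sketch of the idea, caching the intersection (equi-join) of two keywords' posting lists in memory so later multi-keyword queries can reuse it, is shown below. The in-memory dictionaries, the pairing of adjacent sorted keywords, and the absence of an eviction policy are simplifications, not the paper's hierarchical index structure.

```python
# Sketch: inverted index with a cache of pairwise equi-join (intersection) results.
from collections import defaultdict

class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # keyword -> set of doc ids
        self.join_cache = {}               # (kw1, kw2) -> cached intersection

    def index(self, doc_id, text):
        for kw in text.lower().split():
            self.postings[kw].add(doc_id)

    def search(self, *keywords):
        keywords = sorted(set(keywords))
        if len(keywords) == 1:
            return set(self.postings.get(keywords[0], set()))
        docs = None
        for i in range(len(keywords) - 1):
            pair = (keywords[i], keywords[i + 1])
            if pair not in self.join_cache:
                # Compute and keep the two-keyword join result for reuse.
                self.join_cache[pair] = self.postings[pair[0]] & self.postings[pair[1]]
            docs = self.join_cache[pair] if docs is None else docs & self.join_cache[pair]
        return docs
```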