• Title/Summary/Keyword: multiple query

Search Result 253, Processing Time 0.028 seconds

A Method to Solve the Entity Linking Ambiguity and NIL Entity Recognition for efficient Entity Linking based on Wikipedia (위키피디아 기반의 효과적인 개체 링킹을 위한 NIL 개체 인식과 개체 연결 중의성 해소 방법)

  • Lee, Hokyung;An, Jaehyun;Yoon, Jeongmin;Bae, Kyoungman;Ko, Youngjoong
    • Journal of KIISE
    • /
    • v.44 no.8
    • /
    • pp.813-821
    • /
    • 2017
  • Entity Linking find the meaning of an entity mention, which indicate the entity using different expressions, in a user's query by linking the entity mention and the entity in the knowledge base. This task has four challenges, including the difficult knowledge base construction problem, multiple presentation of the entity mention, ambiguity of entity linking, and NIL entity recognition. In this paper, we first construct the entity name dictionary based on Wikipedia to build a knowledge base and solve the multiple presentation problem. We then propose various methods for NIL entity recognition and solve the ambiguity of entity linking by training the support vector machine based on several features, including the similarity of the context, semantic relevance, clue word score, named entity type similarity of the mansion, entity name matching score, and object popularity score. We sequentially use the proposed two methods based on the constructed knowledge base, to obtain the good performance in the entity linking. In the result of the experiment, our system achieved 83.66% and 90.81% F1 score, which is the performance of the NIL entity recognition to solve the ambiguity of the entity linking.

A Data Model for an Object-based Faceted Thesaurus System Supporting Multiple Dimensions of View in a Visualized Environment (시각화된 환경에서 다차원 관점을 지원하는 객체기반 패싯 시소러스 관리 시스템 모델의 정형화 및 구현)

  • Kim, Won-Jung;Yang, Jae-Dong
    • Journal of KIISE:Software and Applications
    • /
    • v.34 no.9
    • /
    • pp.828-847
    • /
    • 2007
  • In this paper we propose a formal data model of an object-based thesaurus system supporting multi-dimensional facets. According to facets reflecting on respective user perspectives, it supports systematic construction, browsing, navigating and referencing of thesauri. Unlike other faceted thesaurus systems, it systematically manages its complexity by appropriately ing sophisticated conceptual structure through visualized browsing and navigation as well as construction. The browsing and navigation is performed by dynamically generating multi-dimensional virtual thesaurus hierarchies called "faceted thesaurus hierarchies." The hierarchies are automatically constructed by combining facets, each representing a dimension of view. Such automatic construction may make it possible the flexible extension of thesauri for they can be easily upgraded by pure insertion or deletion of facets. With a well defined set of self-referential queries, the thesauri can also be effectively referenced from multiple view points since they are structured by appropriately interpreting the semantics of instances based on facets. In this paper, we first formalize the underlying model and then implement its prototype to demonstrate its feasibility.

Efficient Rotation-Invariant Boundary Image Matching Using the Envelope-based Lower Bound (엔빌로프 기반 하한을 사용한 효율적인 회전-불변 윤곽선 이미지 매칭)

  • Kim, Sang-Pil;Moon, Yang-Sae;Hong, Sun-Kyong
    • The KIPS Transactions:PartD
    • /
    • v.18D no.1
    • /
    • pp.9-22
    • /
    • 2011
  • In this paper we present an efficient solution to rotation?invariant boundary image matching. Computing the rotation-invariant distance between image time-series is a time-consuming process since it requires a lot of Euclidean distance computations for all possible rotations. In this paper we propose a novel solution that significantly reduces the number of distance computations using the envelope-based lower bound. To this end, we first present how to construct a single envelope from a query sequence and how to obtain a lower bound of the rotation-invariant distance using the envelope. We then show that the single envelope-based lower bound can reduce a number of distance computations. This approach, however, may cause bad performance since it may incur a larger lower bound by considering all possible rotated sequences in a single envelope. To solve this problem, we present a concept of rotation interval, and using the rotation interval we generalize the envelope-based lower bound by exploiting multiple envelopes rather than a single envelope. We also propose equi-width and envelope minimization divisions as the method of determining rotation intervals in the multiple envelope approach. Experimental results show that our envelope-based solutions outperform existing solutions by one or two orders of magnitude.

Gramene database: A resource for comparative plant genomics, pathways and phylogenomics analyses

  • Tello-Ruiz, Marcela K.;Stein, Joshua;Wei, Sharon;Preece, Justin;Naithani, Sushma;Olson, Andrew;Jiao, Yinping;Gupta, Parul;Kumari, Sunita;Chougule, Kapeel;Elser, Justin;Wang, Bo;Thomason, James;Zhang, Lifang;D'Eustachio, Peter;Petryszak, Robert;Kersey, Paul;Lee, PanYoung Koung;Jaiswal, kaj;Ware, Doreen
    • Proceedings of the Korean Society of Crop Science Conference
    • /
    • 2017.06a
    • /
    • pp.135-135
    • /
    • 2017
  • The Gramene database (http://www.gramene.org) is a powerful online resource for agricultural researchers, plant breeders and educators that provides easy access to reference data, visualizations and analytical tools for conducting cross-species comparisons. Learn the benefits of using Gramene to enrich your lectures, accelerate your research goals, and respond to your organismal community needs. Gramene's genomes portal hosts browsers for 44 complete reference genomes, including crops and model organisms, each displaying functional annotations, gene-trees with orthologous and paralogous gene classification, and whole-genome alignments. SNP and structural diversity data, available for 11 species, are displayed in the context of gene annotation, protein domains and functional consequences on transcript structure (e.g., missense variant). Browsers from multiple species can be viewed simultaneously with links to community-driven organismal databases. Thus, while hosting the underlying data for comparative studies, the portal also provides unified access to diverse plant community resources, and the ability for communities to upload and display private data sets in multiple standard formats. Our BioMart data mining interface enable complex queries and bulk download of sequence, annotation, homology and variation data. Gramene's pathway portal, the Plant Reactome, hosts over 240 pathways curated in rice and inferred in 66 additional plant species by orthology projection. Users may compare pathways across species, query and visualize curated expression data from EMBL-EBI's Expression Atlas in the context of pathways, analyze genome-scale expression data, and conduct pathway enrichment analysis. Our integrated search database and modern user interface leverage these diverse annotations to facilitate finding genes through selecting auto-suggested filters with interactive views of the results.

  • PDF

A Load Balancing Method using Partition Tuning for Pipelined Multi-way Hash Join (다중 해시 조인의 파이프라인 처리에서 분할 조율을 통한 부하 균형 유지 방법)

  • Mun, Jin-Gyu;Jin, Seong-Il;Jo, Seong-Hyeon
    • Journal of KIISE:Databases
    • /
    • v.29 no.3
    • /
    • pp.180-192
    • /
    • 2002
  • We investigate the effect of the data skew of join attributes on the performance of a pipelined multi-way hash join method, and propose two new harsh join methods in the shared-nothing multiprocessor environment. The first proposed method allocates buckets statically by round-robin fashion, and the second one allocates buckets dynamically via a frequency distribution. Using harsh-based joins, multiple joins can be pipelined to that the early results from a join, before the whole join is completed, are sent to the next join processing without staying in disks. Shared nothing multiprocessor architecture is known to be more scalable to support very large databases. However, this hardware structure is very sensitive to the data skew. Unless the pipelining execution of multiple hash joins includes some dynamic load balancing mechanism, the skew effect can severely deteriorate the system performance. In this parer, we derive an execution model of the pipeline segment and a cost model, and develop a simulator for the study. As shown by our simulation with a wide range of parameters, join selectivities and sizes of relations deteriorate the system performance as the degree of data skew is larger. But the proposed method using a large number of buckets and a tuning technique can offer substantial robustness against a wide range of skew conditions.

Trajectory Indexing for Efficient Processing of Range Queries (영역 질의의 효과적인 처리를 위한 궤적 인덱싱)

  • Cha, Chang-Il;Kim, Sang-Wook;Won, Jung-Im
    • The KIPS Transactions:PartD
    • /
    • v.16D no.4
    • /
    • pp.487-496
    • /
    • 2009
  • This paper addresses an indexing scheme capable of efficiently processing range queries in a large-scale trajectory database. After discussing the drawbacks of previous indexing schemes, we propose a new scheme that divides the temporal dimension into multiple time intervals and then, by this interval, builds an index for the line segments. Additionally, a supplementary index is built for the line segments within each time interval. This scheme can make a dramatic improvement in the performance of insert and search operations using a main memory index, particularly for the time interval consisting of the segments taken by those objects which are currently moving or have just completed their movements, as contrast to the previous schemes that store the index totally on the disk. Each time interval index is built as follows: First, the extent of the spatial dimension is divided onto multiple spatial cells to which the line segments are assigned evenly. We use a 2D-tree to maintain information on those cells. Then, for each cell, an additional 3D $R^*$-tree is created on the spatio-temporal space (x, y, t). Such a multi-level indexing strategy can cure the shortcomings of the legacy schemes. Performance results obtained from intensive experiments show that our scheme enhances the performance of retrieve operations by 3$\sim$10 times, with much less storage space.

Development of Information Extraction System from Multi Source Unstructured Documents for Knowledge Base Expansion (지식베이스 확장을 위한 멀티소스 비정형 문서에서의 정보 추출 시스템의 개발)

  • Choi, Hyunseung;Kim, Mintae;Kim, Wooju;Shin, Dongwook;Lee, Yong Hun
    • Journal of Intelligence and Information Systems
    • /
    • v.24 no.4
    • /
    • pp.111-136
    • /
    • 2018
  • In this paper, we propose a methodology to extract answer information about queries from various types of unstructured documents collected from multi-sources existing on web in order to expand knowledge base. The proposed methodology is divided into the following steps. 1) Collect relevant documents from Wikipedia, Naver encyclopedia, and Naver news sources for "subject-predicate" separated queries and classify the proper documents. 2) Determine whether the sentence is suitable for extracting information and derive the confidence. 3) Based on the predicate feature, extract the information in the proper sentence and derive the overall confidence of the information extraction result. In order to evaluate the performance of the information extraction system, we selected 400 queries from the artificial intelligence speaker of SK-Telecom. Compared with the baseline model, it is confirmed that it shows higher performance index than the existing model. The contribution of this study is that we develop a sequence tagging model based on bi-directional LSTM-CRF using the predicate feature of the query, with this we developed a robust model that can maintain high recall performance even in various types of unstructured documents collected from multiple sources. The problem of information extraction for knowledge base extension should take into account heterogeneous characteristics of source-specific document types. The proposed methodology proved to extract information effectively from various types of unstructured documents compared to the baseline model. There is a limitation in previous research that the performance is poor when extracting information about the document type that is different from the training data. In addition, this study can prevent unnecessary information extraction attempts from the documents that do not include the answer information through the process for predicting the suitability of information extraction of documents and sentences before the information extraction step. It is meaningful that we provided a method that precision performance can be maintained even in actual web environment. The information extraction problem for the knowledge base expansion has the characteristic that it can not guarantee whether the document includes the correct answer because it is aimed at the unstructured document existing in the real web. When the question answering is performed on a real web, previous machine reading comprehension studies has a limitation that it shows a low level of precision because it frequently attempts to extract an answer even in a document in which there is no correct answer. The policy that predicts the suitability of document and sentence information extraction is meaningful in that it contributes to maintaining the performance of information extraction even in real web environment. The limitations of this study and future research directions are as follows. First, it is a problem related to data preprocessing. In this study, the unit of knowledge extraction is classified through the morphological analysis based on the open source Konlpy python package, and the information extraction result can be improperly performed because morphological analysis is not performed properly. To enhance the performance of information extraction results, it is necessary to develop an advanced morpheme analyzer. Second, it is a problem of entity ambiguity. The information extraction system of this study can not distinguish the same name that has different intention. If several people with the same name appear in the news, the system may not extract information about the intended query. In future research, it is necessary to take measures to identify the person with the same name. Third, it is a problem of evaluation query data. In this study, we selected 400 of user queries collected from SK Telecom 's interactive artificial intelligent speaker to evaluate the performance of the information extraction system. n this study, we developed evaluation data set using 800 documents (400 questions * 7 articles per question (1 Wikipedia, 3 Naver encyclopedia, 3 Naver news) by judging whether a correct answer is included or not. To ensure the external validity of the study, it is desirable to use more queries to determine the performance of the system. This is a costly activity that must be done manually. Future research needs to evaluate the system for more queries. It is also necessary to develop a Korean benchmark data set of information extraction system for queries from multi-source web documents to build an environment that can evaluate the results more objectively.

Improving Performance of Search Engine By Using WordNet-based Collaborative Evaluation and Hyperlink (워드넷 기반 협동적 평가와 하이퍼링크를 이용한 검색엔진의 성능 향상)

  • Kim, Hyun-Gil;Kim, Jun-Tae
    • The KIPS Transactions:PartB
    • /
    • v.11B no.3
    • /
    • pp.369-380
    • /
    • 2004
  • In this paper, we propose a web page weighting scheme based on WordNet-based collaborative evaluation and hyperlink to improve the precision of web search engine. Generally search engines use keyword matching to decide web page ranking. In the information retrieval from huge data such as the Web, simple word comparison cannot distinguish important documents because there exist too many documents with similar relevancy. In this paper, we implement a WordNet-based user interface that helps to distinguish different senses of query word, and constructed a search engine in which the implicit evaluations by multiple users are reflected in ranking by accumulating the number of clicks. In accumulating click counts, they are stored separately according to lenses, so that more accurate search is possible. Weighting of each web page by using collaborative evaluation and hyperlink is reflected in ranking. The experimental results with several keywords show that the precision of proposed system is improved compared to conventional search engines.

Long-term Location Data Management for Distributed Moving Object Databases (분산 이동 객체 데이타베이스를 위한 과거 위치 정보 관리)

  • Lee, Ho;Lee, Joon-Woo;Park, Seung-Yong;Lee, Chung-Woo;Hwang, Jae-Il;Nah, Yun-Mook
    • Journal of Korea Spatial Information System Society
    • /
    • v.8 no.2 s.17
    • /
    • pp.91-107
    • /
    • 2006
  • To handling the extreme situation that must manage positional information of a very large volume, at least millions of moving objects. A cluster-based sealable distributed computing system architecture, called the GALIS which consists of multiple data processors, each dedicated to keeping records relevant to a different geographical zone and a different time zone, was proposed. In this paper, we proposed a valid time management and time-zone shifting scheme, which are essential in realizing the long-term location data subsystem of GALIS, but missed in our previous prototype development. We explain how to manage valid time of moving objects to avoid ambiguity of location information. We also describe time-zone shifting algorithm with three variations, such as Real Time-Time Zone Shifting, Batch-Time Zone Shifting, Table Partitioned Batch-Time Zone Shifting, Through experiments related with query processing time and CPU utilization, we show the efficiency of the proposed time-zone shifting schemes.

  • PDF

An Analysis of High School Korean Language Instruction Regarding Universal Design for Learning: Social Big Data Analysis and Survey Analysis (보편적 학습설계 측면에서의 고등학교 국어과 교수 실태: 소셜 빅데이터 및 설문조사 분석)

  • Shin, Mikyung;Lee, Okin
    • Journal of the Korea Academia-Industrial cooperation Society
    • /
    • v.21 no.1
    • /
    • pp.326-337
    • /
    • 2020
  • This study examined the public interest in high school Korean language instruction and the universal design for learning (UDL) using the social big data analysis method. The observations from 10,339 search results led to the conclusion that public interest in UDL was significantly lower than that of high school Korean language instruction. The results of the Big Data Association analysis showed that 17.22% of the terms were found to be related to "curriculum." In addition, a survey was conducted on a total of 330 high school students to examine how their teachers apply UDL in the classroom. High school students perceived computers as the most frequently used technology tool in daily classes (38.79%). Teacher-led lectures (52.12%) were the most frequently observed method of instruction. Compared to the second-year and third-year students, the first-year students appreciated the usage of technology tools and various instruction mediums more frequently (ps<.05). Students were relatively more positive in their response to the query on the provision of multiple means of representation. Consequently, the lesson contents became easier to understand for students with the availability of various study methods and materials. The first-year students were generally more positive towards teachers' incorporation of UDL.