• Title/Summary/Keyword: Data Dictionary

A Study on Small-sized Index Structure and Fast Retrieval Method Using the RCB Trie (RCB트라이를 이용한 빠른 검색과 소용량 색인 구조에 관한 연구)

  • Jung, Kyu-Cheol
    • Journal of the Korea Society of Computer and Information / v.12 no.4 / pp.11-19 / 2007
  • This paper proposes the RCB (Reduced Compact Binary) trie to correct the faults of both the CB (Compact Binary) trie and the HCB (Hierarchical Compact Binary) trie. The CB trie was the first attempt at a compact structure, but as the amount of data grew, the storage needed for the input data grew with it, and insertion became difficult because of the dummy nodes used to balance the trees. The HCB trie, by contrast, is organized hierarchically: a depth limit keeps the map from growing to the right, and when that depth is reached a new tree is created and linked to the existing one. This improved insertion and search speed, but at the cost of more storage space, because it uses dummy nodes like the CB trie and also many tree links. The RCB trie proposed in this paper eliminates dummy nodes entirely, reducing the tree map by about 35% and the overall size by half compared with the HCB trie.

Target Word Selection for English-Korean Machine Translation System using Multiple Knowledge (다양한 지식을 사용한 영한 기계번역에서의 대역어 선택)

  • Lee, Ki-Young;Kim, Han-Woo
    • Journal of the Korea Society of Computer and Information / v.11 no.5 s.43 / pp.75-86 / 2006
  • Target word selection is one of the most important and difficult tasks in English-Korean machine translation, and it directly affects the translation accuracy of machine translation systems. In this paper, we present a new approach to selecting the Korean target word for an English noun with translation ambiguities, using multiple kinds of knowledge: verb frame patterns, sense vectors based on collocations, statistical Korean local context information, and co-occurring POS information. Verb frame patterns, constructed from a dictionary and a corpus, play an important role in resolving the sparseness problem of collocation data. A sense vector is the set of collocation data observed when an English word with target-selection ambiguity is translated to a specific Korean target word. Statistical Korean local context information is N-gram information generated from a Korean corpus. The co-occurring POS information is a statistically significant POS clue that appears with the ambiguous word. The experiment showed promising results for diverse sentences from web documents.

Bi-directional Maximal Matching Algorithm to Segment Khmer Words in Sentence

  • Mao, Makara;Peng, Sony;Yang, Yixuan;Park, Doo-Soon
    • Journal of Information Processing Systems / v.18 no.4 / pp.549-561 / 2022
  • Khmer script, the official script of Cambodia, is written from left to right without a space separator; it is complicated and requires further analysis. Without clear standard guidelines, spaces in Khmer are used inconsistently and informally to separate words in sentences. A segmentation method is therefore needed, together with future Khmer natural language processing (NLP) work, to define appropriate rules for Khmer sentences. NLP techniques capable of large-scale language data analysis need to be applied in this scenario. One of the essential components of Khmer language processing is splitting a sentence into a series of words and counting the words used. Currently, Microsoft Word cannot count Khmer words correctly. This study therefore presents a library that segments Khmer phrases using the bi-directional maximal matching (BiMM) method. BiMM combines forward maximal matching (FMM) and backward maximal matching (BMM) to improve word segmentation accuracy. A digital tree or prefix tree, also known as a trie, supports segmentation by finding the children of each word's parent node. The accuracy of BiMM is higher than that of FMM or BMM used independently; moreover, the proposed approach improves dictionary structures and reduces the number of errors. On 94,807 Khmer words, the study reduces errors by 8.57% compared with the FMM and BMM algorithms.
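
The combination of forward and backward maximal matching described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' library: it stores the dictionary in a Python `set` rather than a trie, treats each code point as one character (real Khmer text would likely need cluster-aware handling), and uses a common tie-breaking heuristic (fewer tokens, then fewer single-character tokens) that the abstract does not specify. All function names are hypothetical.

```python
def forward_max_match(text, dictionary, max_len):
    """Greedy left-to-right segmentation: take the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan can move on.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, dictionary, max_len):
    """Greedy right-to-left segmentation: take the longest dictionary word ending at each position."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def bimm(text, dictionary):
    """Run both directions and keep the segmentation with the lower cost."""
    max_len = max(map(len, dictionary), default=1)
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)

    def cost(tokens):
        # Assumed heuristic: prefer fewer tokens, then fewer single-character tokens.
        return (len(tokens), sum(1 for t in tokens if len(t) == 1))

    # On a tie, min() returns its first argument, so the backward result wins.
    return min(bwd, fwd, key=cost)
```

With the toy English dictionary `{"the", "theta", "table", "down", "there"}` and the input `"thetabledownthere"`, the forward pass commits to `theta` and is left with stray single letters, while the backward pass recovers `the / table / down / there`, which is the segmentation `bimm` keeps.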

Analysis of Research Trends on Interactions between Herbal Formula and Conventional Drugs Using Papers from PubMed (PubMed 수록 논문을 활용한 한약 처방과 양약 상호작용에 관한 연구 동향 분석)

  • Sang Jun Yea
    • Herbal Formula Science / v.32 no.3 / pp.365-375 / 2024
  • Objectives: Herbal formulas consist of multiple herbs, which can potentially interact with conventional drugs. If these interactions are not properly understood, they may reduce treatment efficacy or cause unexpected side effects. This study therefore collected papers on herbal formula and conventional drug interactions from PubMed and analyzed research trends. Methods: We first created search queries using a dictionary of herbal formula terms and collected related papers from PubMed using the Entrez API. The PubTator API was applied to identify compound names in the abstracts, and compounds registered in DrugBank were treated as conventional drugs. Sentences describing interactions between herbal formulas and drugs were extracted using pattern matching, and relevant papers were selected. Trends were then analyzed by year, country, major formulas, major drugs, and interaction networks. Results: Yearly analysis showed a gradual increase in paper counts, with a significant rise after 2010. Country analysis revealed that China published the most papers (53), followed by Japan (19) and South Korea (8). Formula analysis identified 'sosiho-tang' and 'siryung-tang' as the most frequently mentioned (7 times each). Drug analysis showed that '5-fluorouracil', 'acetaminophen', 'entecavir', and 'streptozotocin' were the most frequently mentioned (4 times each). Network analysis revealed 'sosiho-tang and tolbutamide' and 'siryung-tang and prednisolone' as the most frequently mentioned interactions (3 times each). Disease analysis indicated that 'urogenital diseases' were the most discussed (32 mentions), followed by 'pathological conditions, signs, and symptoms' and 'digestive system diseases' (25 mentions each). Conclusions: Analyzing research trends on herbal formula and conventional drug interactions provides basic data for subsequent research, aiming to reduce side effects and enhance treatment efficacy in clinical settings.
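
The pattern-matching step in the Methods could look something like the sketch below. It is only an illustration under assumptions: the formula and drug names are taken from the abstract, but the cue-word regular expression, the sentence splitter, and the function names are hypothetical and are not the patterns used in the paper. The PubMed retrieval and PubTator annotation steps are not shown.

```python
import re

# Small term lists taken from names mentioned in the abstract; a real run
# would use the full formula dictionary and DrugBank-derived drug names.
FORMULAS = {"sosiho-tang", "siryung-tang"}
DRUGS = {"tolbutamide", "prednisolone"}

# Hypothetical interaction cue words; the paper's actual patterns are not given.
CUE = re.compile(r"\b(interact\w*|co-?administ\w*|combin\w*)\b", re.IGNORECASE)


def split_sentences(abstract):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", abstract) if s.strip()]


def interaction_sentences(abstract):
    """Return (sentence, formulas, drugs) for sentences naming both plus a cue word."""
    hits = []
    for sentence in split_sentences(abstract):
        lower = sentence.lower()
        formulas = sorted(f for f in FORMULAS if f in lower)
        drugs = sorted(d for d in DRUGS if d in lower)
        if formulas and drugs and CUE.search(sentence):
            hits.append((sentence, formulas, drugs))
    return hits
```

For a made-up input such as "We examined whether sosiho-tang interacts with tolbutamide in rats. Samples were collected weekly.", only the first sentence is returned, paired with the formula and drug it mentions.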

Electronic-Composite Consumer Sentiment Index (CCSI) Development by Social Big Data Analysis (소셜빅데이터를 이용한 온라인 소비자감성지수(e-CCSI) 개발)

  • Kim, Yoosin;Hong, Sung-Gwan;Kang, Hee-Joo;Jeong, Seung-Ryul
    • Journal of Internet Computing and Services / v.18 no.4 / pp.121-131 / 2017
  • With the emergence of the Internet, social media, and mobile services, consumers actively express their opinions and sentiment, which then spread in real time. User-generated text on the Internet and social media is not only communication among users but also a valuable resource for understanding users' intent and sentiment. In particular, economic participants have strongly asked that social big data and its analytics help recognize and forecast future economic trends. Accordingly, governments and businesses are trying to apply social big data to social and economic solutions. This study therefore aims to show the capability of social big data analysis for economic use. It proposes a social big data analysis model and an online consumer sentiment index. To test the model and index, the researchers developed an economic survey ontology, defined a sentiment dictionary for sentiment analysis, conducted classification and sentiment analysis, and calculated the online consumer sentiment index. In addition, the online consumer sentiment index was compared with and validated against the composite consumer survey index of the Bank of Korea.
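
To make the dictionary-then-index pipeline concrete, here is a minimal sketch. The abstract does not give the index formula, so the balance-style formula below (100 plus 100 times the net share of positive documents) is an assumption chosen only because it centers on 100; the tiny English lexicon and the function names are likewise hypothetical.

```python
# Hypothetical sentiment dictionary: word -> polarity (+1 positive, -1 negative).
SENTIMENT = {"good": 1, "improve": 1, "rise": 1, "bad": -1, "worse": -1, "fall": -1}


def doc_polarity(text):
    """Classify one document as +1, -1, or 0 from the sum of its word polarities."""
    score = sum(SENTIMENT.get(tok, 0) for tok in text.lower().split())
    return (score > 0) - (score < 0)


def sentiment_index(docs):
    """Assumed balance-style index: 100 is neutral, above 100 is net positive."""
    labels = [doc_polarity(d) for d in docs]
    pos, neg = labels.count(1), labels.count(-1)
    if pos + neg == 0:
        return 100.0  # no opinionated documents: report the neutral value
    return 100.0 + 100.0 * (pos - neg) / (pos + neg)
```

Under this assumed formula the index ranges from 0 (all opinionated documents negative) to 200 (all positive), and neutral documents do not move it.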

Determination of Fire Risk Assessment Indicators for Building using Big Data (빅데이터를 활용한 건축물 화재위험도 평가 지표 결정)

  • Joo, Hong-Jun;Choi, Yun-Jeong;Ok, Chi-Yeol;An, Jae-Hong
    • Journal of the Korea Institute of Building Construction / v.22 no.3 / pp.281-291 / 2022
  • This study attempts to use big data to determine the indicators necessary for a fire risk assessment of buildings. Because most of the causes affecting the fire risk of buildings are fixed as indicators that consider only the building itself, previous assessments have been limited and subjective. If various internal and external indicators can be considered using big data, effective measures can be taken to reduce the fire risk of buildings. To collect the data needed to determine indicators, a query language was first selected, and professional literature was collected as unstructured data using a web crawling technique. To collect the words in the literature, pre-processing was performed, including user dictionary registration, handling of duplicate literature, and stopword processing. Then, through a review of previous research, words were classified into four components, and representative keywords related to risk were selected from each component. Risk-related indicators were collected by analyzing the words related to the representative keywords. By examining the indicators against their selection criteria, 20 indicators were determined. This research methodology indicates the applicability of big data analysis for establishing measures to reduce fire risk in buildings, and the determined risk indicators can be used as reference material for assessment.

Sensitivity Identification Method for New Words of Social Media based on Naive Bayes Classification (나이브 베이즈 기반 소셜 미디어 상의 신조어 감성 판별 기법)

  • Kim, Jeong In;Park, Sang Jin;Kim, Hyoung Ju;Choi, Jun Ho;Kim, Han Il;Kim, Pan Koo
    • Smart Media Journal / v.9 no.1 / pp.51-59 / 2020
  • From the era of PC communication through the development of the Internet, new terms have been coined on social media; with the spread of smartphones a social media culture has formed, and newly coined words are becoming part of that culture. With social networking sites and smartphones serving as a bridge, the volume of data has increased in real time. New words have advantages, such as allowing shorter sentences that fit within the character limits of various messengers and reduce data. However, new words have no dictionary meaning, which limits and degrades algorithms such as data mining. Therefore, in this paper, the opinion of a document is determined by collecting data through web crawling, extracting the new words contained in the text, and establishing a sentiment classification. The experiment proceeds in three stages. First, new words collected from social media are learned as positive or negative. Next, to derive and verify sentiment values using standard-language documents, TF-IDF is used to score the sentiment of nouns and assign sentiment values to the data. As with the new words, the classified sentiment values are applied to verify that sentiment is classified in standard-language documents. Finally, a combination of the newly coined words and the standard sentiment values is used to perform a comparative analysis of the technique.
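
The first stage, learning whether new words lean positive or negative, is the kind of task a Naive Bayes classifier handles. The sketch below is a generic multinomial Naive Bayes with add-one (Laplace) smoothing written from scratch; it is not the authors' implementation, it omits the TF-IDF scoring stage, and the function names and token examples are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(samples):
    """samples: iterable of (tokens, label). Returns per-class counts and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in samples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the label with the highest log posterior, using add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        logp = math.log(n_docs / total_docs)  # class prior
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                logp += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label
```

A newly coined word that appears mostly in posts labeled positive pulls a document toward the positive class even though it has no dictionary entry, which is the property the abstract relies on.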

A Study on Conversion Methods for Generating RDF Ontology from Structural Terminology Net (STNet) based on RDB (관계형 데이터베이스 기반 구조적학술용어사전(STNet)의 RDF 온톨로지 변환 방식 연구)

  • Ko, Young Man;Lee, Seung-Jun;Song, Min-Sun
    • Journal of the Korean Society for Information Management / v.32 no.2 / pp.131-152 / 2015
  • This study describes the results of converting an RDB to an RDF ontology using an R2RML method and a Non-R2RML method. It measured the size of the converted data, the conversion time per tuple, and the response speed to queries. STNet, a structured terminology dictionary based on an RDB, served as the test bed for the conversion. In terms of converted data size, the Non-R2RML method was superior to the R2RML method in the number of converted triples, including expressive diversity. In conversion time per tuple, Non-R2RML was slightly faster than R2RML, while in response speed to queries both methods showed similar speed and stable performance beyond 300 queries. Overall, Non-R2RML is judged more appropriate for converting a dynamic RDB system such as STNet, in which new data accumulate steadily, data are transformed frequently, and relationships between data change continuously.
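
As a rough picture of what converting relational tuples to RDF triples involves, the sketch below turns each row of a table into subject-predicate-object triples, loosely in the spirit of a direct table-to-triple mapping. It is neither R2RML nor the paper's Non-R2RML method; the base IRI, the table layout, and the function name are hypothetical, and literals are emitted without escaping or datatypes.

```python
import sqlite3

BASE = "http://example.org/stnet/"  # hypothetical base IRI


def rows_to_triples(conn, table, key):
    """Emit one (subject, predicate, object) triple per non-null, non-key column value."""
    # The table name is interpolated directly; pass only trusted identifiers.
    cur = conn.execute(f"SELECT * FROM {table}")
    columns = [c[0] for c in cur.description]
    triples = []
    for row in cur:
        record = dict(zip(columns, row))
        subject = f"<{BASE}{table}/{record[key]}>"
        for col, value in record.items():
            if col == key or value is None:
                continue  # the key becomes the subject IRI; NULLs produce no triple
            triples.append((subject, f"<{BASE}{table}#{col}>", f'"{value}"'))
    return triples
```

Measuring the number of triples produced and the time per tuple for a function like this is the kind of comparison the abstract reports between the two conversion methods.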

Study on the Methodology for Extracting Information from SNS Using a Sentiment Analysis (SNS 감성분석을 이용한 정보 추출 방법론에 관한 연구)

  • Hong, Doopyo;Jeong, Harim;Park, Sangmin;Han, Eum;Kim, Honghoi;Yun, Ilsoo
    • The Journal of The Korea Institute of Intelligent Transport Systems / v.16 no.6 / pp.141-155 / 2017
  • As SNS use grows, many people post their thoughts about specific events on SNS as text. As a result, SNS is used in fields such as finance and distribution for service satisfaction surveys and consumer monitoring. In the transportation field, however, there are few cases that use unstructured data analysis such as sentiment analysis. In this study, we developed a sentiment analysis methodology for transportation using highway VOC data, unstructured data collected by Korea Expressway Corporation. The methodology consists of morpheme analysis, sentiment dictionary construction, and sentiment discrimination of the collected unstructured data. It was verified using highway-related tweet data. The analysis suggests that much of the highway-related information during the analysis period concerned construction and accidents, and that users complained about the delays they caused.

Crafting a Quality Performance Evaluation Model Leveraging Unstructured Data (비정형데이터를 활용한 건축현장 품질성과 평가 모델 개발)

  • Lee, Kiseok;Song, Taegeun;Yoo, Wi Sung
    • Journal of the Korea Institute of Building Construction / v.24 no.1 / pp.157-168 / 2024
  • The frequent occurrence of structural failures at building construction sites in Korea has underscored the critical role of rigorous oversight in the inspection and management of construction projects. As mandated by prevailing regulations and standards, onsite supervision by designated supervisors encompasses thorough documentation of construction quality, material standards, and the history of any reconstructions, among other factors. These reports, predominantly consisting of unstructured data, constitute approximately 80% of the data amassed at construction sites and serve as a comprehensive repository of quality-related information. This research introduces the SL-QPA model, which employs text mining techniques to preprocess supervision reports and establish a sentiment dictionary, thereby enabling the quantification of quality performance. The study's findings, demonstrating a statistically significant Pearson correlation between the quality performance scores derived from the SL-QPA model and various legally defined indicators, were substantiated through a one-way analysis of variance of the correlation coefficients. The SL-QPA model, as developed in this study, offers a supplementary approach to evaluating the quality performance of building construction projects. It holds the promise of enhancing quality inspection and management practices by harnessing the wealth of unstructured data generated throughout the lifecycle of construction projects.
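
The Pearson correlation mentioned in this abstract can be computed as in the short sketch below. It shows only the standard sample correlation coefficient between two score lists; the SL-QPA scoring itself and the significance testing are not reproduced, and the function name is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)
```

A value near 1 would mean that sites with higher model-derived quality scores also score higher on the comparison indicator, and a value near -1 would mean the opposite.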