• Title/Summary/Keyword: dataset

Search Result 3,827, Processing Time 0.026 seconds

KOMUChat: Korean Online Community Dialogue Dataset for AI Learning (KOMUChat : 인공지능 학습을 위한 온라인 커뮤니티 대화 데이터셋 연구)

  • YongSang Yoo;MinHwa Jung;SeungMin Lee;Min Song
    • Journal of Intelligence and Information Systems
    • /
    • v.29 no.2
    • /
    • pp.219-240
    • /
    • 2023
  • Conversational AI which allows users to interact with satisfaction is a long-standing research topic. To develop conversational AI, it is necessary to build training data that reflects real conversations between people, but current Korean datasets are not in question-answer format or use honorifics, making it difficult for users to feel closeness. In this paper, we propose a conversation dataset (KOMUChat) consisting of 30,767 question-answer sentence pairs collected from online communities. The question-answer pairs were collected from post titles and first comments of love and relationship counsel boards used by men and women. In addition, we removed abuse records through automatic and manual cleansing to build high quality dataset. To verify the validity of KOMUChat, we compared and analyzed the result of generative language model learning KOMUChat and benchmark dataset. The results showed that our dataset outperformed the benchmark dataset in terms of answer appropriateness, user satisfaction, and fulfillment of conversational AI goals. The dataset is the largest open-source single turn text data presented so far and it has the significance of building a more friendly Korean dataset by reflecting the text styles of the online community.

Mobile database server management system (모바일 데이트베이스 서버 관리 시스템)

  • Zheng, Bao-Wei;Park, Sang-Gug
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference
    • /
    • 2007.06a
    • /
    • pp.645-648
    • /
    • 2007
  • This paper describes database server management system by mobile device. Our system can control database server immediately in any where by mobile device. We have developed the dataset server CE which interface with mobile device, database management system which rons in mobile device and web services which connect the mobile database with desktop database. External mobile device communicate with web service agent via wireless equipment. Our system includes dataset server CE client engine, web service server agent and mobile device management system. Through the test by an application program, we have confirmed that our system operates very well each others.

  • PDF

Ensemble-By-Session Method on Keystroke Dynamics based User Authentication

  • Ho, Jiacang;Kang, Dae-Ki
    • International Journal of Internet, Broadcasting and Communication
    • /
    • v.8 no.4
    • /
    • pp.19-25
    • /
    • 2016
  • There are many free applications that need users to sign up before they can use the applications nowadays. It is difficult to choose a suitable password for your account. If the password is too complicated, then it is hard to remember it. However, it is easy to be intruded by other users if we use a very simple password. Therefore, biometric-based approach is one of the solutions to solve the issue. The biometric-based approach includes keystroke dynamics on keyboard, mice, or mobile devices, gait analysis and many more. The approach can integrate with any appropriate machine learning algorithm to learn a user typing behavior for authentication system. Preprocessing phase is one the important role to increase the performance of the algorithm. In this paper, we have proposed ensemble-by-session (EBS) method which to operate the preprocessing phase before the training phase. EBS distributes the dataset into multiple sub-datasets based on the session. In other words, we split the dataset into session by session instead of assemble them all into one dataset. If a session is considered as one day, then the sub-dataset has all the information on the particular day. Each sub-dataset will have different information for different day. The sub-datasets are then trained by a machine learning algorithm. From the experimental result, we have shown the improvement of the performance for each base algorithm after the preprocessing phase.

A Measure for Improvement in Quality of Association Rules in the Item Response Dataset (문항 응답 데이터에서 문항간 연관규칙의 질적 향상을 위한 도구 개발)

  • Kwak, Eun-Young;Kim, Hyeoncheol
    • The Journal of Korean Association of Computer Education
    • /
    • v.10 no.3
    • /
    • pp.1-8
    • /
    • 2007
  • In this paper, we introduce a new measure called surprisal that estimates the informativeness of transactional instances and attributes in the item response dataset and improve the quality of association rules. In order to this, we set artificial dataset and eliminate noisy and uninformative data using the surprisal first, and then generate association rules between items. And we compare the association rules from the dataset after surprisal-based pruning with support-based pruning and original dataset unpruned. Experimental result that the surprisal-based pruning improves quality of association rules in question item response datasets significantly.

  • PDF

Spatial Statistic Data Release Based on Differential Privacy

  • Cai, Sujin;Lyu, Xin;Ban, Duohan
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.13 no.10
    • /
    • pp.5244-5259
    • /
    • 2019
  • With the continuous development of LBS (Location Based Service) applications, privacy protection has become an urgent problem to be solved. Differential privacy technology is based on strict mathematical theory that provides strong privacy guarantees where it supposes that the attacker has the worst-case background knowledge and that knowledge has been applied to different research directions such as data query, release, and mining. The difficulty of this research is how to ensure data availability while protecting privacy. Spatial multidimensional data are usually released by partitioning the domain into disjointed subsets, then generating a hierarchical index. The traditional data-dependent partition methods need to allocate a part of the privacy budgets for the partitioning process and split the budget among all the steps, which is inefficient. To address such issues, a novel two-step partition algorithm is proposed. First, we partition the original dataset into fixed grids, inject noise and synthesize a dataset according to the noisy count. Second, we perform IH-Tree (Improved H-Tree) partition on the synthetic dataset and use the resulting partition keys to split the original dataset. The algorithm can save the privacy budget allocated to the partitioning process and obtain a more accurate release. The algorithm has been tested on three real-world datasets and compares the accuracy with the state-of-the-art algorithms. The experimental results show that the relative errors of the range query are considerably reduced, especially on the large scale dataset.

Emotion Recognition of Low Resource (Sindhi) Language Using Machine Learning

  • Ahmed, Tanveer;Memon, Sajjad Ali;Hussain, Saqib;Tanwani, Amer;Sadat, Ahmed
    • International Journal of Computer Science & Network Security
    • /
    • v.21 no.8
    • /
    • pp.369-376
    • /
    • 2021
  • One of the most active areas of research in the field of affective computing and signal processing is emotion recognition. This paper proposes emotion recognition of low-resource (Sindhi) language. This work's uniqueness is that it examines the emotions of languages for which there is currently no publicly accessible dataset. The proposed effort has provided a dataset named MAVDESS (Mehran Audio-Visual Dataset Mehran Audio-Visual Database of Emotional Speech in Sindhi) for the academic community of a significant Sindhi language that is mainly spoken in Pakistan; however, no generic data for such languages is accessible in machine learning except few. Furthermore, the analysis of various emotions of Sindhi language in MAVDESS has been carried out to annotate the emotions using line features such as pitch, volume, and base, as well as toolkits such as OpenSmile, Scikit-Learn, and some important classification schemes such as LR, SVC, DT, and KNN, which will be further classified and computed to the machine via Python language for training a machine. Meanwhile, the dataset can be accessed in future via https://doi.org/10.5281/zenodo.5213073.

Automatic Dataset Generation of Object Detection and Instance Segmentation using Mask R-CNN (Mask R-CNN을 이용한 물체인식 및 개체분할의 학습 데이터셋 자동 생성)

  • Jo, HyunJun;Kim, Dawit;Song, Jae-Bok
    • The Journal of Korea Robotics Society
    • /
    • v.14 no.1
    • /
    • pp.31-39
    • /
    • 2019
  • A robot usually adopts ANN (artificial neural network)-based object detection and instance segmentation algorithms to recognize objects but creating datasets for these algorithms requires high labeling costs because the dataset should be manually labeled. In order to lower the labeling cost, a new scheme is proposed that can automatically generate a training images and label them for specific objects. This scheme uses an instance segmentation algorithm trained to give the masks of unknown objects, so that they can be obtained in a simple environment. The RGB images of objects can be obtained by using these masks, and it is necessary to label the classes of objects through a human supervision. After obtaining object images, they are synthesized with various background images to create new images. Labeling the synthesized images is performed automatically using the masks and previously input object classes. In addition, human intervention is further reduced by using the robot arm to collect object images. The experiments show that the performance of instance segmentation trained through the proposed method is equivalent to that of the real dataset and that the time required to generate the dataset can be significantly reduced.

OryzaGP: rice gene and protein dataset for named-entity recognition

  • Larmande, Pierre;Do, Huy;Wang, Yue
    • Genomics & Informatics
    • /
    • v.17 no.2
    • /
    • pp.17.1-17.3
    • /
    • 2019
  • Text mining has become an important research method in biology, with its original purpose to extract biological entities, such as genes, proteins and phenotypic traits, to extend knowledge from scientific papers. However, few thorough studies on text mining and application development, for plant molecular biology data, have been performed, especially for rice, resulting in a lack of datasets available to solve named-entity recognition tasks for this species. Since there are rare benchmarks available for rice, we faced various difficulties in exploiting advanced machine learning methods for accurate analysis of the rice literature. To evaluate several approaches to automatically extract information from gene/protein entities, we built a new dataset for rice as a benchmark. This dataset is composed of a set of titles and abstracts, extracted from scientific papers focusing on the rice species, and is downloaded from PubMed. During the 5th Biomedical Linked Annotation Hackathon, a portion of the dataset was uploaded to PubAnnotation for sharing. Our ultimate goal is to offer a shared task of rice gene/protein name recognition through the BioNLP Open Shared Tasks framework using the dataset, to facilitate an open comparison and evaluation of different approaches to the task.

Synthesizing Image and Automated Annotation Tool for CNN based Under Water Object Detection (강건한 CNN기반 수중 물체 인식을 위한 이미지 합성과 자동화된 Annotation Tool)

  • Jeon, MyungHwan;Lee, Yeongjun;Shin, Young-Sik;Jang, Hyesu;Yeu, Taekyeong;Kim, Ayoung
    • The Journal of Korea Robotics Society
    • /
    • v.14 no.2
    • /
    • pp.139-149
    • /
    • 2019
  • In this paper, we present auto-annotation tool and synthetic dataset using 3D CAD model for deep learning based object detection. To be used as training data for deep learning methods, class, segmentation, bounding-box, contour, and pose annotations of the object are needed. We propose an automated annotation tool and synthetic image generation. Our resulting synthetic dataset reflects occlusion between objects and applicable for both underwater and in-air environments. To verify our synthetic dataset, we use MASK R-CNN as a state-of-the-art method among object detection model using deep learning. For experiment, we make the experimental environment reflecting the actual underwater environment. We show that object detection model trained via our dataset show significantly accurate results and robustness for the underwater environment. Lastly, we verify that our synthetic dataset is suitable for deep learning model for the underwater environments.

ETLi: Efficiently annotated traffic LiDAR dataset using incremental and suggestive annotation

  • Kang, Jungyu;Han, Seung-Jun;Kim, Nahyeon;Min, Kyoung-Wook
    • ETRI Journal
    • /
    • v.43 no.4
    • /
    • pp.630-639
    • /
    • 2021
  • Autonomous driving requires a computerized perception of the environment for safety and machine-learning evaluation. Recognizing semantic information is difficult, as the objective is to instantly recognize and distinguish items in the environment. Training a model with real-time semantic capability and high reliability requires extensive and specialized datasets. However, generalized datasets are unavailable and are typically difficult to construct for specific tasks. Hence, a light detection and ranging semantic dataset suitable for semantic simultaneous localization and mapping and specialized for autonomous driving is proposed. This dataset is provided in a form that can be easily used by users familiar with existing two-dimensional image datasets, and it contains various weather and light conditions collected from a complex and diverse practical setting. An incremental and suggestive annotation routine is proposed to improve annotation efficiency. A model is trained to simultaneously predict segmentation labels and suggest class-representative frames. Experimental results demonstrate that the proposed algorithm yields a more efficient dataset than uniformly sampled datasets.