• 제목/요약/키워드: Text data

Search Result 2,953, Processing Time 0.043 seconds

A bio-text mining system using keywords and patterns in a grid environment

  • Kwon, Hyuk-Ryul;Jung, Tae-Sung;Kim, Kyoung-Ran;Jahng, Hye-Kyoung;Cho, Wan-Sup;Yoo, Jae-Soo
    • Proceedings of the Korea Society for Industrial Systems Conference
    • /
    • 2007.02a
    • /
    • pp.48-52
    • /
    • 2007
  • As huge amount of literature including biological data is being generated after post genome era, it becomes difficult for researcher to find useful knowledge from the biological databases. Bio-text mining and related natural language processing technique are the key issues in the intelligent knowledge retrieval from the biological databases. We propose a bio-text mining technique for the biologists who find Knowledge from the huge literature. At first, web robot is used to extract and transform related literature from remote databases. To improve retrieval speed, we generate an inverted file for keywords in the literature. Then, text mining system is used for extracting given knowledge patterns and keywords. Finally, we construct a grid computing environment to guarantee processing speed in the text mining even for huge literature databases. In the real experiment for 10,000 bio-literatures, the system shows 95% precision and 98% recall.

  • PDF

Two - Handed Hangul Input Performance Prediction Model for Mobile Phone (모바일 폰에서의 양 손을 이용한 한글 입력 수행도 예측 모델에 대한 연구)

  • Lee, Joo-Woo;Myung, Ro-Hae
    • Journal of the Ergonomics Society of Korea
    • /
    • v.27 no.4
    • /
    • pp.73-83
    • /
    • 2008
  • With a rapid extension of functions in mobile phones, text input method has become very important for mobile phone users. Previous studies for text input methods were focused on Fitts' law, emphasizing expert's behaviors with one-handed text input method. However, it was observed that 97% of Korean mobile phone users input texts with two-hands. Therefore, this study was designed to develop a prediction model of two-handed Hangul text entry method including novice users as well as experts for mobile phone. For this study, Fitts' law was hypothesized to predict experts' movement time(MT) whereas Hick-Hyman law for visual search time was hypothesized to be added to MT for novices. The results showed that the prediction model was well fitted with the empirical data for both experts and novices with less than 3% error rates. In conclusion, this prediction model of two-handed Hangul text entry including novice users was proven to be a very effective model for modeling two-handed Hangul text input behavior for both experts.

A Study on the Effectiveness of Bigrams in Text Categorization (바이그램이 문서범주화 성능에 미치는 영향에 관한 연구)

  • Lee, Chan-Do;Choi, Joon-Young
    • Journal of Information Technology Applications and Management
    • /
    • v.12 no.2
    • /
    • pp.15-27
    • /
    • 2005
  • Text categorization systems generally use single words (unigrams) as features. A deceptively simple algorithm for improving text categorization is investigated here, an idea previously shown not to work. It is to identify useful word pairs (bigrams) made up of adjacent unigrams. The bigrams it found, while small in numbers, can substantially raise the quality of feature sets. The algorithm was tested on two pre-classified datasets, Reuters-21578 for English and Korea-web for Korean. The results show that the algorithm was successful in extracting high quality bigrams and increased the quality of overall features. To find out the role of bigrams, we trained the Na$\"{i}$ve Bayes classifiers using both unigrams and bigrams as features. The results show that recall values were higher than those of unigrams alone. Break-even points and F1 values improved in most documents, especially when documents were classified along the large classes. In Reuters-21578 break-even points increased by 2.1%, with the highest at 18.8%, and F1 improved by 1.5%, with the highest at 3.2%. In Korea-web break-even points increased by 1.0%, with the highest at 4.5%, and F1 improved by 0.4%, with the highest at 4.2%. We can conclude that text classification using unigrams and bigrams together is more efficient than using only unigrams.

  • PDF

The Effects of Inferential Reading Strategy Program on Text Comprehension and Korean Language Academic Achievements of Vocational High School Students (추론적 읽기전략 프로그램이 전문계 고등학생의 텍스트 이해와 국어과 학업성취에 미치는 효과)

  • Kim, Seon-Kyung;Yune, So-Jung;Kim, Jung-Sub
    • Journal of Fisheries and Marine Sciences Education
    • /
    • v.23 no.1
    • /
    • pp.1-12
    • /
    • 2011
  • The purpose of this study was to examine the effects of inferential reading strategy program on text comprehension and Korean language academic achievements of vocational high school students. We developed the program of inferential reading strategy, applied it to an educational spot, and examined the effects of it on text comprehension ability and Korean language academic achievements of learners. ANCOVA was used for data analysis with SPSS ver.12.0 statistic program. The main findings of this study were as follows. First, the experimental group which had been conducted with the inferential reading strategy program showed statistically significant difference in their text comprehension ability from controlled group. Second, the experimental group showed statistically significant difference in their Korean language academic achievements ability from controlled group. The study shows that the inferential reading strategy program had effect on the text comprehension and Korean language academic achievements of vocational high school students.

On the Development of Risk Factor Map for Accident Analysis using Textmining and Self-Organizing Map(SOM) Algorithms (재해분석을 위한 텍스트마이닝과 SOM 기반 위험요인지도 개발)

  • Kang, Sungsik;Suh, Yongyoon
    • Journal of the Korean Society of Safety
    • /
    • v.33 no.6
    • /
    • pp.77-84
    • /
    • 2018
  • Report documents of industrial and occupational accidents have continuously been accumulated in private and public institutes. Amongst others, information on narrative-texts of accidents such as accident processes and risk factors contained in disaster report documents is gaining the useful value for accident analysis. Despite this increasingly potential value of analysis of text information, scientific and algorithmic text analytics for safety management has not been carried out yet. Thus, this study aims to develop data processing and visualization techniques that provide a systematic and structural view of text information contained in a disaster report document so that safety managers can effectively analyze accident risk factors. To this end, the risk factor map using text mining and self-organizing map is developed. Text mining is firstly used to extract risk keywords from disaster report documents and then, the Self-Organizing Map (SOM) algorithm is conducted to visualize the risk factor map based on the similarity of disaster report documents. As a result, it is expected that fruitful text information buried in a myriad of disaster report documents is analyzed, providing risk factors to safety managers.

A WWMBERT-based Method for Improving Chinese Text Classification Task (중국어 텍스트 분류 작업의 개선을 위한 WWMBERT 기반 방식)

  • Wang, Xinyuan;Joe, Inwhee
    • Annual Conference of KIPS
    • /
    • 2021.05a
    • /
    • pp.408-410
    • /
    • 2021
  • In the NLP field, the pre-training model BERT launched by the Google team in 2018 has shown amazing results in various tasks in the NLP field. Subsequently, many variant models have been derived based on the original BERT, such as RoBERTa, ERNIEBERT and so on. In this paper, the WWMBERT (Whole Word Masking BERT) model suitable for Chinese text tasks was used as the baseline model of our experiment. The experiment is mainly for "Text-level Chinese text classification tasks" are improved, which mainly combines Tapt (Task-Adaptive Pretraining) and "Multi-Sample Dropout method" to improve the model, and compare the experimental results, experimental data sets and model scoring standards Both are consistent with the official WWMBERT model using Accuracy as the scoring standard. The official WWMBERT model uses the maximum and average values of multiple experimental results as the experimental scores. The development set was 97.70% (97.50%) on the "text-level Chinese text classification task". and 97.70% (97.50%) of the test set. After comparing the results of the experiments in this paper, the development set increased by 0.35% (0.5%) and the test set increased by 0.31% (0.48%). The original baseline model has been significantly improved.

Text Classification Using Parallel Word-level and Character-level Embeddings in Convolutional Neural Networks

  • Geonu Kim;Jungyeon Jang;Juwon Lee;Kitae Kim;Woonyoung Yeo;Jong Woo Kim
    • Asia pacific journal of information systems
    • /
    • v.29 no.4
    • /
    • pp.771-788
    • /
    • 2019
  • Deep learning techniques such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) show superior performance in text classification than traditional approaches such as Support Vector Machines (SVMs) and Naïve Bayesian approaches. When using CNNs for text classification tasks, word embedding or character embedding is a step to transform words or characters to fixed size vectors before feeding them into convolutional layers. In this paper, we propose a parallel word-level and character-level embedding approach in CNNs for text classification. The proposed approach can capture word-level and character-level patterns concurrently in CNNs. To show the usefulness of proposed approach, we perform experiments with two English and three Korean text datasets. The experimental results show that character-level embedding works better in Korean and word-level embedding performs well in English. Also the experimental results reveal that the proposed approach provides better performance than traditional CNNs with word-level embedding or character-level embedding in both Korean and English documents. From more detail investigation, we find that the proposed approach tends to perform better when there is relatively small amount of data comparing to the traditional embedding approaches.

A Group based Privacy-preserving Data Perturbation Technique in Distributed OSN (분산 OSN 환경에서 프라이버시 보호를 위한 그룹 기반의 데이터 퍼튜베이션 기법)

  • Lee, Joohyoung;Park, Seog
    • KIISE Transactions on Computing Practices
    • /
    • v.22 no.12
    • /
    • pp.675-680
    • /
    • 2016
  • The development of various mobile devices and mobile platform technology has led to a steady increase in the number of online social network (OSN) users. OSN users are free to communicate and share information through activities such as social networking, but this causes a new, user privacy issue. Various distributed OSN architectures are introduced to address the user privacy concern, however, users do not obtain technically perfect control over their data. In this study, the control rights of OSN user are maintained by using personal data storage (PDS). We propose a technique to improve data privacy protection that involves making a group with the user's friend by generating and providing fake text data based on user's real text data. Fake text data is generated based on the user's word sensitivity value, so that the user's friends can receive the user's differential data. As a result, we propose a system architecture that solves possible problems in the tradeoff between service utility and user privacy in OSN.

Application of a Topic Model on the Korea Expressway Corporation's VOC Data (한국도로공사 VOC 데이터를 이용한 토픽 모형 적용 방안)

  • Kim, Ji Won;Park, Sang Min;Park, Sungho;Jeong, Harim;Yun, Ilsoo
    • Journal of Information Technology Services
    • /
    • v.19 no.6
    • /
    • pp.1-13
    • /
    • 2020
  • Recently, 80% of big data consists of unstructured text data. In particular, various types of documents are stored in the form of large-scale unstructured documents through social network services (SNS), blogs, news, etc., and the importance of unstructured data is highlighted. As the possibility of using unstructured data increases, various analysis techniques such as text mining have recently appeared. Therefore, in this study, topic modeling technique was applied to the Korea Highway Corporation's voice of customer (VOC) data that includes customer opinions and complaints. Currently, VOC data is divided into the business areas of Korea Expressway Corporation. However, the classified categories are often not accurate, and the ambiguous ones are classified as "other". Therefore, in order to use VOC data for efficient service improvement and the like, a more systematic and efficient classification method of VOC data is required. To this end, this study proposed two approaches, including method using only the latent dirichlet allocation (LDA), the most representative topic modeling technique, and a new method combining the LDA and the word embedding technique, Word2vec. As a result, it was confirmed that the categories of VOC data are relatively well classified when using the new method. Through these results, it is judged that it will be possible to derive the implications of the Korea Expressway Corporation and utilize it for service improvement.

Social media big data analysis of Z-generation fashion (Z세대 패션에 대한 소셜미디어의 빅데이터 분석)

  • Sung, Kwang-Sook
    • Journal of the Korea Fashion and Costume Design Association
    • /
    • v.22 no.3
    • /
    • pp.49-61
    • /
    • 2020
  • This study analyzed the social media accounts and performed a Big Data analysis of Z-generation fashion using Textom Text Mining Techniques program and Ucinet Big Data analysis program. The research results are as follows: First, as a result of keyword analysis on 67.646 Z-generation fashion social media posts over the last 5 years, 220,211 keywords were extracted. Among them, 67 major keywords were selected based on the frequency of co-occurrence being greater than more than 250 times. As the top keywords appearing over 1000 times, were the most influential as the number of nodes connected to 'Z generation' (29595 times) are overwhelmingly, and was followed by 'millennials'(18536 times), 'fashion'(17836 times), and 'generation'(13055 times), 'brand'(8325 times) and 'trend'(7310 times) Second, as a result of the analysis of Network Degree Centrality between the key keywords for the Z-generation, the number of nodes connected to the "Z-generation" (29595 times) is overwhelmingly large. Next, many 'millennial'(18536 times), 'fashion'(17836 times), 'generation'(13055 times), 'brand'(8325 times), 'trend'(7310 times), etc. appear. These texts are considered to be important factors in exploring the reaction of social media to the Z-generation. Third, through the analysis of CONCOR, text with the structural equivalence between major keywords for Gen Z fashion was rearranged and clustered. In addition, four clusters were derived by grouping through network semantic network visualization. Group 1 is 54 texts, 'Diverse Characteristics of Z-Generation Fashion Consumers', Group 2 is 7 Texts, 'Z-Generation's teenagers Fashion Powers', Group 3 is 8 Texts, 'Z-Generation's Celebrity Fashions' Interest and Fashion', Group 4 named 'Gucci', the most popular luxury fashion of the Z-generation as one text.