• Title/Summary/Keyword: Text information


A Methodology for Customer Core Requirement Analysis by Using Text Mining : Focused on Chinese Online Cosmetics Market (텍스트 마이닝을 활용한 사용자 핵심 요구사항 분석 방법론 : 중국 온라인 화장품 시장을 중심으로)

  • Shin, Yoon Sig;Baek, Dong Hyun
    • Journal of Korean Society of Industrial and Systems Engineering
    • /
    • v.44 no.2
    • /
    • pp.66-77
    • /
    • 2021
  • Companies widely use surveys to identify customer requirements, but surveys have several problems. First, responses are passive because the questionnaire is pre-designed by the company conducting the survey. Second, the surveyor needs good preliminary knowledge to produce a high-quality survey. Text mining, on the other hand, is an excellent way to compensate for these limitations. The importance of online reviews has grown steadily, and the amount of text data has increased enormously as Internet usage has risen; at the same time, the techniques for extracting high-quality information from text data, collectively called text mining, keep improving. However, previous studies have tended to focus on improving the accuracy of individual analysis techniques. This study proposes a methodology that combines several text mining techniques and makes three main contributions. First, it extracts information from text data without any preliminary design by the surveyor. Second, it requires no prior knowledge to extract information. Finally, it provides a quantitative sentiment score that can be used in decision-making.
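The paper's actual scoring pipeline is not described in the abstract; as a minimal sketch of what a quantitative review sentiment score could look like, assuming a simple hand-made polarity lexicon (all words and weights below are illustrative, not from the study):

```python
# Hypothetical lexicon: word -> polarity weight (illustrative values only).
POLARITY = {"good": 1.0, "great": 2.0, "smooth": 1.0, "bad": -1.0, "sticky": -1.5}

def sentiment_score(review: str) -> float:
    """Average polarity of lexicon words found in the review (0.0 if none match)."""
    tokens = review.lower().split()
    hits = [POLARITY[t] for t in tokens if t in POLARITY]
    return sum(hits) / len(hits) if hits else 0.0

# Scores for two toy cosmetics reviews: one positive, one negative.
scores = [sentiment_score(r) for r in ["great smooth texture", "bad sticky finish"]]
```

A real system would replace the toy lexicon with one learned or curated for the domain, but the decision-relevant output is the same: a number per review that can be averaged or compared.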

Future and Directions for Research in Full Text Databases (본문 데이타베이스 연구에 관한 고찰과 그 전망)

  • Ro Jung Soon
    • Journal of the Korean Society for Library and Information Science
    • /
    • v.17
    • /
    • pp.49-83
    • /
    • 1989
  • A full-text retrieval system is a natural-language document retrieval system in which the full text of every document in a collection is stored on a computer, so that every word in every sentence of every document can be located by the machine. This kind of IR system has recently become widely available online in the fields of legal, newspaper, journal, and reference-book indexing, and research interest in the field has increased. In this paper, research on full-text databases and retrieval systems is reviewed, directions for research in the field are discussed, open questions are considered, and the variables affecting online full-text retrieval and the various roles those variables play in a research study are described. Two obvious research questions in full-text retrieval have been how full-text retrieval performs and how to improve the retrieval performance of full-text databases. Research to improve retrieval performance has incorporated ranking or weighting algorithms based on word occurrences, combined menu-driven and query-driven systems, and improvements in computer architectures and database record structures. The recent increase in the number of full-text databases of various sizes, forms, and subject matters, together with recent developments in computer architecture, artificial intelligence, and videodisc technology, promises new directions for research and scholarly growth. Studies on the interrelationships among the elements of the full-text retrieval situation, and on the relationship between each element and retrieval performance, may give a professional view of the theory and practice of full-text retrieval.
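The occurrence-based ranking mentioned above can be sketched minimally: score each full-text document by how often the query terms occur in it and sort by that score (this is the simplest raw term-frequency variant, not any specific algorithm from the reviewed literature):

```python
def rank_by_term_frequency(query: str, documents: list[str]) -> list[str]:
    """Rank documents by total occurrences of the query's terms in each document."""
    terms = query.lower().split()

    def score(doc: str) -> int:
        words = doc.lower().split()
        return sum(words.count(t) for t in terms)

    return sorted(documents, key=score, reverse=True)
```

Practical systems refine this with document-frequency weighting and length normalization, but the word-occurrence signal is the same.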


Enhancing Text Document Clustering Using Non-negative Matrix Factorization and WordNet

  • Kim, Chul-Won;Park, Sun
    • Journal of information and communication convergence engineering
    • /
    • v.11 no.4
    • /
    • pp.241-246
    • /
    • 2013
  • A classic document clustering technique may incorrectly classify documents into different clusters when documents that should belong to the same cluster share no terms. Recently, internal and external knowledge-based approaches have been used to overcome this problem in text document clustering. However, the clustering results of these approaches are influenced by the inherent structure and topical composition of the documents, and organizing knowledge into an ontology is expensive. In this paper, we propose an enhanced text document clustering method using non-negative matrix factorization (NMF) and WordNet. The semantic terms extracted as cluster labels by NMF represent the inherent structure of a document cluster well. The proposed method also improves the quality of document clustering by using cluster labels and term weights based on the term mutual information of WordNet. The experimental results demonstrate that the proposed method achieves better performance than other text clustering methods.
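As a sketch of the NMF clustering idea: given a factorization of the term-document matrix V ≈ W·H, each document joins the cluster of its largest weight in W, and each cluster is labeled by its highest-weighted terms in H (the matrices below are toy values, not a computed factorization, and this omits the WordNet re-weighting step):

```python
TERMS = ["car", "engine", "film", "actor"]

# W: document-by-cluster weights; H: cluster-by-term weights (illustrative numbers).
W = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
H = [[0.7, 0.6, 0.0, 0.1], [0.1, 0.0, 0.8, 0.7]]

def assign_clusters(w: list[list[float]]) -> list[int]:
    """Each document goes to the cluster with its largest weight."""
    return [row.index(max(row)) for row in w]

def cluster_labels(h: list[list[float]], terms: list[str], top: int = 2) -> list[list[str]]:
    """Label each cluster with its top-weighted semantic terms."""
    labels = []
    for row in h:
        ranked = sorted(range(len(terms)), key=lambda j: row[j], reverse=True)
        labels.append([terms[j] for j in ranked[:top]])
    return labels
```

In the actual method the labels would then be refined with WordNet-based term mutual information before final cluster assignment.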

A Study on the Implementation and Performance Evaluation of Full-text Information Retrieval System based on Scientific Paper's Content Structure (학술논문의 내용구조에 의한 전문검색시스템 구현과 성능평가에 관한 연구)

  • 이두영;이병기
    • Journal of the Korean Society for information Management
    • /
    • v.15 no.3
    • /
    • pp.73-93
    • /
    • 1998
  • Conventional full-text information retrieval systems have been shown to yield high recall but low precision. One disadvantage of such systems is that they are not designed to reflect the user's information need, because they are built on the physical and logical structure of documents without considering their content. The purpose of this study is to develop a more effective full-text IR system by resolving these disadvantages of conventional systems. The study developed a new method of designing a full-text IR system using a Content Structure Markup Language (CSML) rather than conventional SGML.


Extending TextAE for annotation of non-contiguous entities

  • Lever, Jake;Altman, Russ;Kim, Jin-Dong
    • Genomics & Informatics
    • /
    • v.18 no.2
    • /
    • pp.15.1-15.6
    • /
    • 2020
  • Named entity recognition tools identify mentions of biomedical entities in free text and are essential components of high-quality information retrieval and extraction systems. Without good entity recognition, methods will mislabel searched text and will miss important information or identify spurious text that frustrates users. Most tools do not capture non-contiguous entities: separate spans of text that together refer to an entity, e.g., the entity "type 1 diabetes" in the phrase "type 1 and type 2 diabetes." This type of entity is common in biomedical texts, especially in lists, where multiple biomedical entities are named in shortened form to avoid repeating words. Most text annotation systems that enable users to view and edit entity annotations do not support non-contiguous entities, so experts cannot even visualize them, let alone annotate them to build valuable datasets for machine learning methods. To address this problem, and as part of the BLAH6 hackathon, we extended the TextAE platform to allow visualization and annotation of non-contiguous entities. Users can add new subspans to existing entities by selecting additional text. We integrate this new functionality with TextAE's existing editing features to allow easy changes to entity annotations and editing of relation annotations involving non-contiguous entities, with importing and exporting in the PubAnnotation format. Finally, we roughly quantify the problem across the entire accessible biomedical literature to show that a substantial number of non-contiguous entities appear in lists and would be missed by most text mining systems.
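The abstract's own example can be made concrete. Below is a hedged sketch, not the exact TextAE/PubAnnotation schema, of representing the non-contiguous entity "type 1 diabetes" inside "type 1 and type 2 diabetes" as two subspans that jointly denote one entity (the field names `spans`, `begin`, `end` are illustrative):

```python
text = "type 1 and type 2 diabetes"

# One entity made of two character ranges: "type 1" and "diabetes".
entity = {
    "id": "T1",
    "obj": "Disease",
    "spans": [{"begin": 0, "end": 6}, {"begin": 18, "end": 26}],
}

def surface_form(text: str, spans: list[dict]) -> str:
    """Join the subspan texts to recover the entity's surface form."""
    return " ".join(text[s["begin"]:s["end"]] for s in spans)
```

A single-span annotation model simply cannot express this entity, which is the gap the TextAE extension fills.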

An Exploratory Analysis of Online Discussion of Library and Information Science Professionals in India using Text Mining

  • Garg, Mohit;Kanjilal, Uma
    • Journal of Information Science Theory and Practice
    • /
    • v.10 no.3
    • /
    • pp.40-56
    • /
    • 2022
  • This paper aims to implement a topic modeling technique for extracting the topics of online discussions among library professionals in India. Topic modeling is an established text mining technique popularly used to model text data from Twitter, Facebook, Yelp, and other social media platforms. The present study modeled the online discussions of Library and Information Science (LIS) professionals posted on Lis Links. The text of these posts was extracted with an R program based on the package "rvest." The data was pre-processed to remove blank posts, posts in non-English fonts, punctuation, URLs, emails, etc. Topic modeling with the Latent Dirichlet Allocation algorithm was applied to the pre-processed corpus to identify the topic associated with each post, and a frequency analysis of word occurrences in the corpus was performed. The most frequent words included: library, information, university, librarian, book, professional, science, research, paper, question, answer, and management. This shows that LIS professionals actively discussed exams, research, and library operations on the Lis Links forum. The study categorized the online discussions on Lis Links into ten topics: "LIS Recruitment," "LIS Issues," "Other Discussion," "LIS Education," "LIS Research," "LIS Exams," "General Information related to Library," "LIS Admission," "Library and Professional Activities," and "Information Communication Technology (ICT)." The majority of the posts belonged to "LIS Exams," followed by "Other Discussion" and "General Information related to Library."
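The pre-processing and frequency-analysis step described above can be sketched briefly. The study used R's "rvest" for scraping; this is an illustrative Python equivalent that strips URLs and punctuation from forum posts and counts the most frequent words (the stopword list is a placeholder):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "in", "for", "is", "to", "and"}

def top_words(posts: list[str], n: int = 3) -> list[str]:
    """Most frequent non-stopword words across posts, after stripping URLs/punctuation."""
    counts = Counter()
    for post in posts:
        # Remove URLs first, then any remaining non-letter characters.
        text = re.sub(r"https?://\S+|[^a-z\s]", " ", post.lower())
        counts.update(w for w in text.split() if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(n)]
```

Topic modeling with LDA would then run on this same cleaned corpus; the frequency counts alone already hint at the dominant themes.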

Implementation of Text Summarize Automation Using Document Length Normalization (문서 길이 정규화를 이용한 문서 요약 자동화 시스템 구현)

  • 이재훈;김영천;이성주
    • Proceedings of the Korean Institute of Intelligent Systems Conference
    • /
    • 2001.12a
    • /
    • pp.51-55
    • /
    • 2001
  • With the rapid growth of the World Wide Web and electronic information services, information is becoming available online at an incredible rate. One result is the oft-decried information overload: no one has time to read everything, yet we often have to make critical decisions based on what we are able to assimilate. The technology of automatic text summarization is becoming indispensable for dealing with this problem. Text summarization is the process of distilling the most important information from a source to produce an abridged version for a particular user or task. Information retrieval (IR) is the task of searching a set of documents for query-relevant documents; text summarization, by analogy, can be considered the task of searching a document, viewed as a set of sentences, for topic-relevant sentences. In this paper, we show that document length normalization, as used in information retrieval, yields document information that is more reliable and better suited to the query. Experimental results on newspaper articles show that the document length normalization method is superior to methods that use the query alone.
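As a hedged sketch of length-normalized sentence scoring for extractive summarization (the paper's exact normalization scheme is not given in the abstract): divide each sentence's query-term count by its length, so long sentences are not favored merely for containing more words.

```python
def normalized_score(sentence: str, query_terms: set[str]) -> float:
    """Fraction of the sentence's words that are query terms."""
    words = sentence.lower().split()
    if not words:
        return 0.0
    matches = sum(1 for w in words if w in query_terms)
    return matches / len(words)

def summarize(sentences: list[str], query_terms: set[str], k: int = 1) -> list[str]:
    """Return the k sentences with the highest length-normalized scores."""
    ranked = sorted(sentences, key=lambda s: normalized_score(s, query_terms),
                    reverse=True)
    return ranked[:k]
```

Without the division by sentence length, a rambling sentence with one extra query-term hit could outrank a short, dense one.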


A Study on the Eye-Hand Coordination for Korean Text Entry Interface Development (한글 문자 입력 인터페이스 개발을 위한 눈-손 Coordination에 대한 연구)

  • Kim, Jung-Hwan;Hong, Seung-Kweon;Myung, Ro-Hae
    • Journal of the Ergonomics Society of Korea
    • /
    • v.26 no.2
    • /
    • pp.149-155
    • /
    • 2007
  • Recently, various devices requiring text input, such as mobile phones, IPTVs, PDAs, and UMPCs, have emerged, and the frequency of text entry on them is increasing. This study focused on the evaluation of Korean text entry interfaces. Various models for evaluating text entry interfaces have been proposed, most of them based on the human cognitive process for text input. That process is divided into two components: visual scanning and finger movement. The time spent on visual scanning is modeled by the Hick-Hyman law, while the time for finger movement is determined by Fitts' law. Three questions arise in model-based evaluation of text entry interfaces. First, do the two cognitive processes (visual scanning and finger movement) occur sequentially during text entry, as the models assume? Second, can real text input time be predicted by the previous models? Third, does the cognitive process for text input vary with the user's text entry speed? There was a gap between the measured and predicted text input times, and the gap was larger for participants who entered text quickly. The reason was found by investigating eye-hand coordination during text input. Contrary to the assumption that a visual scan of the keyboard is followed by a finger movement, the experienced group performed visual scanning and finger movement simultaneously. Arrival lead time, the interval between the eye fixation on the target button and the button click, was measured to quantify the overlap between the two processes. In addition, the experienced group used fewer fixations during text entry than the novice group. These results will contribute to improving model-based evaluation of text entry interfaces.
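The sequential model the study tests can be written out directly: key-search time from the Hick-Hyman law plus finger-movement time from Fitts' law. The coefficients a and b below are illustrative placeholders, not values fitted in the study:

```python
import math

def hick_hyman_time(n_choices: int, a: float = 0.2, b: float = 0.15) -> float:
    """Visual scanning time grows with the log of the number of keys."""
    return a + b * math.log2(n_choices)

def fitts_time(distance: float, width: float, a: float = 0.1, b: float = 0.1) -> float:
    """Movement time grows with the index of difficulty log2(D/W + 1)."""
    return a + b * math.log2(distance / width + 1)

def predicted_keystroke_time(n_keys: int, distance: float, width: float) -> float:
    # The sequential model assumes scanning finishes before movement begins;
    # the study found experts overlap the two, so this sum over-predicts for them.
    return hick_hyman_time(n_keys) + fitts_time(distance, width)
```

The arrival lead time measured in the study is, in these terms, the portion of the Fitts'-law movement that happens before the scanning term has ended.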

A MVC Framework for Visualizing Text Data (텍스트 데이터 시각화를 위한 MVC 프레임워크)

  • Choi, Kwang Sun;Jeong, Kyo Sung;Kim, Soo Dong
    • Journal of Intelligence and Information Systems
    • /
    • v.20 no.2
    • /
    • pp.39-58
    • /
    • 2014
  • As the importance of big data and related technologies grows in industry, visualizing the results of big data processing and analysis has become a highlighted concern. Visualization gives people an effective and clear understanding of analysis results, and it also serves as the GUI (Graphical User Interface) that supports communication between people and analysis systems. To ease development and maintenance, these GUI parts should be loosely coupled from the parts that process and analyze data, which calls for design patterns such as MVC (Model-View-Controller) that minimize coupling between the UI and the data processing parts. Big data can be classified into structured and unstructured data, and structured data is relatively easy to visualize compared with unstructured data. Even so, as the analysis of unstructured data has spread, practitioners usually build a visualization system anew for each project to overcome the limitations of traditional visualization systems designed for structured data. For text data, which makes up a huge portion of unstructured data, visualization is even more difficult. This stems from the complexity of text analysis technologies such as linguistic analysis, text mining, and social network analysis; these technologies are also not standardized, which makes it harder to reuse one project's visualization system in other projects. We assume the reason is a lack of commonality design that would let a visualization system extend to other systems. In this research, we suggest a common information model for visualizing text data and propose a comprehensive, reusable framework for it, TexVizu.
We first survey representative research in text visualization, identify common elements and patterns across its various cases, and review and analyze them from three viewpoints: structural, interactive, and semantic. We then design an integrated model of text data that represents the elements for visualization. The structural viewpoint identifies structural elements of text documents, such as title, author, and body. The interactive viewpoint identifies the types of relations and interactions between text documents, such as post, comment, and reply. The semantic viewpoint identifies semantic elements that are extracted by linguistic analysis of the text and represented as tags classifying entity types such as person, place or location, time, and event. We then extract common requirements for visualizing text data, categorized into four types: structure information, content information, relation information, and trend information. Each type of requirement comprises the required visualization techniques, the data, and the goal (what to know). These are the key requirements for designing a framework in which the visualization system stays loosely coupled from the data processing and analysis systems. Finally, we designed TexVizu, a common text visualization framework that is reusable and extensible across visualization projects: it collaborates with various Text Data Loaders and Analytical Text Data Visualizers through common interfaces such as ITextDataLoader and IATDProvider, and it comprises the Analytical Text Data Model, Analytical Text Data Storage, and Analytical Text Data Controller. In this framework, external components are specified by the interfaces required to collaborate with it.
As an experiment, we adopted the framework in two text visualization systems: a social opinion mining system and an online news analysis system.
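The loose-coupling idea can be sketched minimally: visualization code depends only on abstract loader/provider interfaces, so concrete data sources and analyzers can be swapped without touching the view. The real ITextDataLoader and IATDProvider signatures in TexVizu are not given in the abstract; the method names below are illustrative:

```python
from abc import ABC, abstractmethod

class ITextDataLoader(ABC):
    """Abstract loader: any text source behind one interface (hypothetical signature)."""
    @abstractmethod
    def load(self, source: str) -> list[str]: ...

class IATDProvider(ABC):
    """Abstract analytical-text-data provider (hypothetical signature)."""
    @abstractmethod
    def analyze(self, documents: list[str]) -> dict: ...

class NewsLoader(ITextDataLoader):
    def load(self, source: str) -> list[str]:
        return [f"article from {source}"]   # stand-in for real scraping

class WordCountProvider(IATDProvider):
    def analyze(self, documents: list[str]) -> dict:
        return {"doc_count": len(documents)}

def visualize(loader: ITextDataLoader, provider: IATDProvider, source: str) -> str:
    model = provider.analyze(loader.load(source))  # Model: analyzed text data
    return f"docs={model['doc_count']}"            # View: string stand-in for a chart
```

Swapping `NewsLoader` for a social-media loader changes nothing in `visualize`, which is the reuse the framework is after.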

Deep-Learning Approach for Text Detection Using Fully Convolutional Networks

  • Tung, Trieu Son;Lee, Gueesang
    • International Journal of Contents
    • /
    • v.14 no.1
    • /
    • pp.1-6
    • /
    • 2018
  • Text, as one of the most influential inventions of humanity, has played an important role in human life since ancient times. The rich and precise information embodied in text is very useful in a wide range of vision-based applications: text extracted from images can support automatic annotation, indexing, language translation, and assistance systems for impaired persons. Natural-scene text detection, an active research topic in computer vision and document analysis, is therefore very important. Previous methods perform poorly due to numerous false-positive and false-negative regions. In this paper, a fully convolutional network (FCN)-based method with a supervised architecture is used to localize textual regions. The model was trained directly on images, with pixel values as inputs and a binary ground truth as the label. The method was evaluated on the ICDAR-2013 dataset and proved comparable to other feature-based methods. It could expedite future research on deep-learning-based text detection.
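The fully-convolutional idea above can be reduced to a toy sketch: a filter slides over the image and a per-pixel threshold yields a binary text/non-text mask, so the output is a map rather than a single label. A real FCN stacks many learned layers; the single hand-set filter here is only illustrative:

```python
def conv2d_valid(image: list[list[float]], kernel: list[list[float]]) -> list[list[float]]:
    """Valid (no-padding) 2D convolution of a nested-list image with a kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

def text_mask(feature_map: list[list[float]], threshold: float = 1.0) -> list[list[int]]:
    """Per-pixel binary decision: 1 where the response suggests text."""
    return [[1 if v > threshold else 0 for v in row] for row in feature_map]
```

Training replaces the hand-set kernel with learned filters and the threshold with a final classification layer, but the pixel-wise output structure is the same.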