Multi-Vector Document Embedding Using Semantic Decomposition of Complex Documents (복합 문서의 의미적 분해를 통한 다중 벡터 문서 임베딩 방법론)

  • Park, Jongin;Kim, Namgyu
    • Journal of Intelligence and Information Systems / v.25 no.3 / pp.19-41 / 2019
  • With the rapidly increasing demand for text data analysis, research and investment in text mining are being actively pursued not only in academia but also in various industries. Text mining is generally conducted in two steps. In the first step, the text of the collected documents is tokenized and structured to convert the original documents into a computer-readable form. In the second step, tasks such as document classification, clustering, and topic modeling are conducted according to the purpose of analysis. Until recently, text mining studies have focused on the second step. However, with the discovery that the text structuring process substantially influences the quality of the analysis results, various embedding methods have been actively studied to improve that quality by preserving the meaning of words and documents when representing text data as vectors. Unlike structured data, which can be directly applied to a variety of operations and traditional analysis techniques, unstructured text must first undergo a structuring task that transforms the original document into a form the computer can understand. Mapping arbitrary objects into a space of a specific dimension while maintaining their algebraic properties, in order to structure text data, is called "embedding." Recently, attempts have been made to embed not only words but also sentences, paragraphs, and entire documents. In particular, as the demand for document embedding analysis increases rapidly, many algorithms have been developed to support it. Among them, doc2Vec, which extends word2Vec and embeds each document into one vector, is the most widely used. However, the traditional document embedding method represented by doc2Vec generates a vector for each document using all the words included in the document. As a result, the document vector is affected not only by core words but also by miscellaneous words. Additionally, traditional document embedding schemes usually map each document to a single vector. It is therefore difficult to represent a complex document with multiple subjects accurately as a single vector using the traditional approach. In this paper, we propose a new multi-vector document embedding method to overcome these limitations. This study targets documents that explicitly separate body content and keywords. For a document without keywords, the method can be applied after extracting keywords through various analysis methods. However, since this is not the core subject of the proposed method, we describe the process of applying it to documents with predefined keywords. The proposed method consists of (1) Parsing, (2) Word Embedding, (3) Keyword Vector Extraction, (4) Keyword Clustering, and (5) Multiple-Vector Generation. The specific process is as follows. First, all text in a document is tokenized and each token is represented as an N-dimensional real-valued vector through word embedding. Then, to overcome the limitation that traditional document embedding is affected by miscellaneous words as well as core words, the vectors corresponding to the keywords of each document are extracted to form a set of keyword vectors for each document. Next, clustering is conducted on the set of keyword vectors for each document to identify the multiple subjects included in the document. Finally, a multi-vector is generated from the keyword vectors constituting each cluster. Experiments on 3,147 academic papers revealed that the single vector-based traditional approach cannot properly map complex documents because of interference among subjects in each vector. With the proposed multi-vector based method, we ascertained that complex documents can be vectorized more accurately by eliminating the interference among subjects.
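
    The keyword-vector extraction, keyword clustering, and multi-vector generation steps named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which clustering algorithm is used, how the number of subjects is chosen, or how each cluster is turned into a vector. The sketch assumes a small k-means with a fixed `k` and uses the mean of each cluster's keyword vectors. The `word_vectors` dictionary stands in for the output of any word embedding model, and all function names are hypothetical.

```python
from math import dist


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def kmeans_labels(vectors, k, iterations=10):
    """Tiny k-means (an assumed choice) with deterministic farthest-point init."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Next centroid: the vector farthest from all centroids chosen so far.
        centroids.append(
            max(vectors, key=lambda v: min(dist(v, c) for c in centroids))
        )
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid, then recompute centroids.
        labels = [
            min(range(k), key=lambda j: dist(v, centroids[j])) for v in vectors
        ]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = mean_vector(members)
    return labels


def multi_vector_embedding(keywords, word_vectors, k):
    """Return one vector per keyword cluster (assumption: the cluster mean)."""
    # Keyword vector extraction: keep only keywords that have a word vector.
    keyword_vectors = [word_vectors[w] for w in keywords if w in word_vectors]
    if not keyword_vectors or k < 1:
        return []
    k = min(k, len(keyword_vectors))
    labels = kmeans_labels(keyword_vectors, k)
    clusters = [
        [v for v, lab in zip(keyword_vectors, labels) if lab == j]
        for j in range(k)
    ]
    return [mean_vector(members) for members in clusters if members]


# Toy 2-dimensional word vectors covering two subjects (illustrative values).
word_vectors = {
    "embedding": [1.0, 0.0],
    "word2vec": [0.9, 0.1],
    "clustering": [0.0, 1.0],
    "kmeans": [0.1, 0.9],
}
keywords = ["embedding", "word2vec", "clustering", "kmeans"]

single_vector = mean_vector([word_vectors[w] for w in keywords])
multi_vectors = multi_vector_embedding(keywords, word_vectors, k=2)
```

    In this toy example the single averaged vector lies midway between the two subjects, while the two cluster vectors each stay close to one subject, which is the kind of interference among subjects the abstract describes.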

Proposal of Brand Evaluation Map through Big Data : Focus on The Hyundai Motor's Product Evaluation (빅데이터를 통한 브랜드 평가 맵 제안 : 현대자동차 제품 평가 중심으로)

  • Youn, Dae Myung;Lee, Yong Hyuck;Lee, Bong Gyou
    • Journal of Information Technology Services / v.19 no.4 / pp.1-11 / 2020
  • Through text mining, sentiment analysis, and semiotic analysis, this study aims to reinterpret the meaning of users' emotional words and related words in order to derive strategic elements of brand and design. After selecting a domestic car manufacturer whose brand is a clear topic of user opinion, we crawl the online comments that users have written about the manufacturer's cars. We then analyze the extracted morphemes and their associated words and map them onto the marketing mix theory. Through this process, we propose a methodology that allows brand elements with negative sentiment to be supplemented and improved and elements with positive sentiment to be carried forward, so that brands can be managed rationally. In particular, the map presented in this study is expected to be useful as information for overall brand management.

A Study on Agile Transformation in the New Digital Age

  • Lee, Jee Young
    • International Journal of Advanced Culture Technology / v.8 no.1 / pp.82-88 / 2020
  • In the face of the recent digital transformation, companies and industries are trying to be agile in order to adapt and respond to change. The agile paradigm is spreading beyond the boundaries of its existing applications, such as IT-related projects and software development. In this regard, we analyzed the diffusion of the agile paradigm by text mining the abstracts of research papers from 2001 to 2019. In addition, we discussed agile transformation in the Fourth Industrial Revolution. Through this study, we confirmed that agile transformation is being studied in various fields such as the business environment, corporate organizational culture, the manufacturing industry, and the supply chain. The results of this study will contribute to understanding the meaning and role of agile as a basic paradigm for digital transformation in the Fourth Industrial Revolution.

Sustainable Industry-Academia-Government Collaborative Education Focusing on Advantages of Industry: Long-term Internship after 5years Practice

  • Morimoto, Emi;Yamanaka, Hideo
    • Journal of Engineering Education Research / v.15 no.5 / pp.47-53 / 2012
  • Practical problem-solving studies in a company or organization have provided great advantages for our university and students. For example, such studies can help them build stronger relationships with local governments and companies as well as develop their research through collaborative studies. On the other hand, comments from the companies and organizations that accepted our students showed that the hosts did not always benefit. This study seeks ways to establish a sustainable long-term internship program that also offers advantages for companies. Each company writes the advantages and disadvantages of the internship on an evaluation sheet, and these feedback comments are analyzed using a text-mining approach. The analysis shows that there are three types of companies and organizations, depending on their reasons for accepting students. Suitable internship programs for each type, including their period and expense distribution, are then presented.

Topic Model Analysis of Research Trend on Spatial Big Data (공간빅데이터 연구 동향 파악을 위한 토픽모형 분석)

  • Lee, Won Sang;Sohn, So Young
    • Journal of Korean Institute of Industrial Engineers / v.41 no.1 / pp.64-73 / 2015
  • The recent emergence of spatial big data has attracted the attention of various research groups. This paper analyzes the research trend on spatial big data by text mining the related records in the Scopus DB. We apply a topic model and network analysis to the extracted abstracts of articles related to spatial big data. We observed that optics, astronomy, and computer science are the major areas of spatial big data analysis. The major topics discovered from the articles relate to mobile, cloud, and smart services of spatial big data in urban settings. Trends of the discovered topics over time are provided along with the results of the topic network analysis. We expect that uncovered areas of spatial big data research can be further explored.

Recommended Chocolate Applications Based On The Propensity To Consume Dining outside Using Big Data On Social Networks

  • Lee, Tae-gyeong;Moon, Seok-jae;Ryu, Gihwan
    • International Journal of Advanced Culture Technology / v.8 no.3 / pp.325-333 / 2020
  • In the past, dining out was mainly about the meal itself. Recently, however, it has expanded into a restaurant culture market. In particular, a dessert culture is taking hold in which people can talk and enjoy themselves. Each consumer buys chocolate for different reasons, such as health, taste, and atmosphere. It is therefore timely to recommend chocolate according to consumers' dining-out tendencies. In this paper, we propose a chocolate recommendation application based on the propensity to dine out, using data from social networks. To collect keyword-based chocolate information, Textom is used as a text mining big data analysis solution. Text mining analysis is performed, and related topics are extracted and modeled to shorten the time needed to recommend chocolate to users. In addition, the analysis of dining-out propensity builds on prior research. Finally, the application is implemented as a hybrid app.

Data Standardization for the Enhanced Utilization of Public Government Data (활용성 제고를 위한 공공데이터 표준화 연구)

  • Kim, Eun Jin;Kim, Minsu;Kim, Hee-Woong
    • Knowledge Management Research / v.20 no.4 / pp.23-38 / 2019
  • The Korean government has been trying to create new economic value and jobs by opening and utilizing government data. However, most open government data is poorly utilized. Although problems with the standardization of open government data are a major cause of this underuse, empirical research on open government data itself has been insufficient. Against this background, this paper aims to find the priority areas for opening data and suggests realistic directions for the standardization of open government data. Text mining and social network analysis approaches are used to analyze open government data and its standardization. This research offers practical guidance to open government data managers, from the selection of data to the direction of standardization. In addition, it has academic implications for knowledge management systems in that it suggests standardization directions using various techniques.

Continuous Audits Using Decision Support Systems

  • Mohammadi, Shaban
    • The Journal of Industrial Distribution & Business / v.6 no.3 / pp.5-8 / 2015
  • Purpose - This article examines how the use of existing and future decision-support systems will change the auditing process. Research design, data, and methodology - An information system is a special decision-support system that combines information obtained from various sources and communicates among them to help in assessing complex financial decisions. This paper analyzes techniques such as data and text mining as components of decision-support systems to be used in the auditing process. Results - We present views on how existing decision-support systems will change audits. Auditors, who currently collect significant data manually, will in the future move towards management through complex decision-support systems. Conclusions - Although some internal audit functions are integrated into systems of continuous monitoring, the use of such systems remains limited. Thus, instead of multiple decision-support systems, a unified decision-support system can be deployed that includes sensors integrated within a company in different contexts (e.g., production, sales, and accounting) and that continually monitors violations of controls, unusual patterns, and unusual transactions.

Mining Parallel Text from the Web based on Sentence Alignment

  • Li, Bo;Liu, Juan;Zhu, Huili
    • Proceedings of the Korean Society for Language and Information Conference / 2007.11a / pp.285-292 / 2007
  • The parallel corpus is an important resource in data-driven natural language processing research, but only a few parallel corpora are publicly available, mostly because of the heavy manual labor needed to construct this kind of resource. This paper proposes a novel strategy for automatically fetching parallel text from the web, which may help to address the lack of high-quality parallel corpora. The system we developed first downloads the web pages from certain hosts. Candidate parallel page pairs are then prepared from the page set based on the outer features of the web pages. In the last step, the candidate page pairs are evaluated: the sentences in each candidate pair are extracted and aligned, and the similarity of the two web pages is then evaluated based on the similarities of the aligned sentences. Experiments on a multilingual web site show satisfactory performance of the system.
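
    The final evaluation step, scoring a candidate page pair from the similarities of its aligned sentences, can be sketched as follows. The abstract does not specify the sentence similarity measure or how the sentence scores are combined, so both are assumptions here: a crude character-length ratio serves as a placeholder measure, and the page score is the plain average over aligned pairs. The alignment itself is taken as given, and all names are hypothetical.

```python
def length_ratio_similarity(source_sentence, target_sentence):
    """Placeholder sentence similarity: shorter length divided by longer length."""
    a, b = len(source_sentence), len(target_sentence)
    if max(a, b) == 0:
        return 0.0
    return min(a, b) / max(a, b)


def page_similarity(aligned_pairs, sentence_similarity=length_ratio_similarity):
    """Average the sentence similarity over aligned pairs (0.0 if there are none)."""
    if not aligned_pairs:
        return 0.0
    total = sum(sentence_similarity(src, tgt) for src, tgt in aligned_pairs)
    return total / len(aligned_pairs)


# Aligned sentence pairs from one candidate page pair (illustrative strings).
aligned = [("abcd", "ab"), ("xy", "xy")]
score = page_similarity(aligned)
```

    A real system would replace `length_ratio_similarity` with a measure suited to the language pair, and would accept a candidate pair only when the score exceeds a threshold chosen on held-out data.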

A Study on Prediction of Patent Registration using Text Mining (텍스트 마이닝을 이용한 특허 등록 예측에 관한 연구)

  • Koo, Jung-Min;Park, Sang-Sung;Shin, Young-Geon;Jung, Won-Kyo;Jang, Dong-Sik
    • Proceedings of the KAIS Fall Conference / 2009.05a / pp.325-328 / 2009
  • Recently, as the importance of intellectual property rights has risen, patents have become an issue. A patent is an exclusive right over knowledge or a technique, and it must be registered for the right to be granted. Therefore, the prediction of patent registration can be important information for companies or individuals that profit from patents. In this paper, we propose a method for predicting patent registration using text mining, together with an algorithm for constructing the database.