• Title/Summary/Keyword: Data mining analysis


Multi-Dimensional Analysis Method of Product Reviews for Market Insight (마켓 인사이트를 위한 상품 리뷰의 다차원 분석 방안)

  • Park, Jeong Hyun;Lee, Seo Ho;Lim, Gyu Jin;Yeo, Un Yeong;Kim, Jong Woo
    • Journal of Intelligence and Information Systems / v.26 no.2 / pp.57-78 / 2020
  • With the development of the Internet, consumers can easily check product information through e-commerce. Product reviews consulted during purchasing are based on user experience, so consumers act as producers of information as well as users of it. Reviews can make purchasing decisions more efficient for consumers, and they can help sellers develop products and strengthen their competitiveness. However, when consumers compare products, reading the vast number of reviews offered by e-commerce sites to grasp the overall assessment and the assessment dimensions they consider important takes a lot of time and effort. This is because product reviews are unstructured information, so the sentiment and the assessment dimension of a review are difficult to read at a glance. For example, consumers who want to purchase a laptop would like to compare candidate products on each dimension, such as performance, weight, delivery, speed, and design. This paper therefore proposes a method that automatically generates multi-dimensional assessment scores from the reviews of the products to be compared. The method consists of two phases: a preparation phase and an individual product scoring phase. In the preparation phase, a dimension classification model and a sentiment analysis model are built from reviews of a large product category. The dimension classification model combines word embedding with association analysis, which compensates for a limitation of existing studies in which word embedding methods for finding the relevance between dimensions and words consider only the distance between words within sentences. The sentiment analysis model is a CNN trained on data tagged as positive or negative at the phrase level for accurate polarity detection. In the individual product scoring phase, the prepared models are applied to reviews split into phrases. Multi-dimensional assessment scores are then obtained by grouping the phrases judged to describe each dimension and aggregating them by assessment dimension according to their proportion. In the experiment, approximately 260,000 reviews of a large product category were collected to build the dimension classification model and the sentiment analysis model. In addition, reviews of laptops from companies S and L sold through e-commerce were collected and used as experimental data. The dimension classification model classified individual product reviews, broken down into phrases, into six assessment dimensions and combined the existing word embedding method with an association analysis indicating the frequency between words and dimensions. Combining word embedding and association analysis increased the accuracy of the model by 13.7%. The sentiment analysis model analyzed assessments more closely when trained at the phrase level rather than the sentence level; its accuracy was 29.4% higher than that of the sentence-based model. Because the method enables multi-dimensional comparison of products, both sellers and consumers can expect more efficient decision making in purchasing and product development. In addition, unstructured text reviews were transformed into objective values such as frequency and morpheme, and analyzed using both word embedding and association analysis to make the multi-dimensional analysis more objective and precise. The resulting analysis model is attractive because it enables more effective service deployment in an evolving and fiercely competitive e-commerce market and can satisfy both sellers and consumers.
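The scoring phase described in this abstract can be illustrated with a minimal sketch. It assumes each phrase has already been labeled with a dimension and a polarity by the two models; the function name `dimension_scores`, the label strings, and the sample data are hypothetical, and scoring a dimension by its share of positive phrases is one plausible reading of the aggregation, not necessarily the authors' exact formula.

```python
from collections import defaultdict


def dimension_scores(phrases):
    """Return the share of positive phrases for each assessment dimension.

    `phrases` is an iterable of (dimension, polarity) pairs, where polarity
    is "pos" or "neg". Both labels are assumed to come from upstream models.
    """
    counts = defaultdict(lambda: [0, 0])  # dimension -> [positives, total]
    for dimension, polarity in phrases:
        counts[dimension][1] += 1
        if polarity == "pos":
            counts[dimension][0] += 1
    return {dim: pos / total for dim, (pos, total) in counts.items()}


# Hypothetical phrase-level predictions for one laptop.
labeled = [
    ("performance", "pos"),
    ("performance", "pos"),
    ("performance", "neg"),
    ("weight", "neg"),
    ("design", "pos"),
]
scores = dimension_scores(labeled)
for dim, score in sorted(scores.items()):
    print(f"{dim}: {score:.2f}")
```

Running the same function over the phrases of two products gives per-dimension scores that can be compared side by side.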

Production of a hypothetical polyene substance by activating a cryptic fungal PKS-NRPS hybrid gene in Monascus purpureus (홍국Monascus purpureus에서 진균 PKS-NRPS 하이브리드 유전자의 발현 유도를 통한 미지 polyene 화합물의 생성)

  • Suh, Jae-Won;Balakrishnan, Bijinu;Lim, Yoon Ji;Lee, Doh Won;Choi, Jeong Ju;Park, Si-Hyung;Kwon, Hyung-Jin
    • Journal of Applied Biological Chemistry / v.61 no.1 / pp.83-91 / 2018
  • Advances in bacterial and fungal genome mining have uncovered a plethora of cryptic secondary metabolite biosynthetic gene clusters. Guided by genome information, targeted transcriptional derepression can be employed to determine the product of a cryptic gene cluster and to explore its biological role. Monascus spp. are food-grade filamentous fungi popular in eastern Asia, and several genome datasets belonging to them are now available. We achieved transcriptional activation of a cryptic fungal polyketide synthase-nonribosomal peptide synthetase gene, Mpfus1, in Monascus purpureus ΔMpPKS5 by inserting the Aspergillus gpdA promoter upstream of Mpfus1 through double-crossover gene replacement. The gene cluster containing Mpfus1 shows high similarity to clusters for the biosynthesis of conjugated polyene derivatives with a 2-pyrrolidone ring, of which the mycotoxin fusarin is the representative member. The ΔMpPKS5 strain is incapable of producing azaphilone pigment, providing an excellent background for identifying chromogenic and UV-absorbing compounds. Activation of Mpfus1 resulted in a yellow hue on mycelia, and the methanol extract exhibited maximum absorption at 365 nm. HPLC analysis of the organic extracts indicated the presence of a variety of yellow compounds, which implies that the product of MpFus1 is metabolically or chemically unstable. LC-MS analysis guided us to predict the MpFus1 product and to propose that the Mpfus1-containing gene cluster encodes the biosynthesis of a desmethyl analogue of fusarin. This study showcases genome mining in Monascus and the possibility of unveiling new biological activities embedded in its genome.

A Study on Automatic Classification Model of Documents Based on Korean Standard Industrial Classification (한국표준산업분류를 기준으로 한 문서의 자동 분류 모델에 관한 연구)

  • Lee, Jae-Seong;Jun, Seung-Pyo;Yoo, Hyoung Sun
    • Journal of Intelligence and Information Systems / v.24 no.3 / pp.221-241 / 2018
  • As we enter the knowledge society, the importance of information as a new form of capital is being emphasized. Information classification is also becoming more important for efficiently managing digital information, which is being produced at an exponential rate. In this study, we sought to automatically classify and provide tailored information that can help companies make decisions on technology commercialization. To that end, we propose a method for classifying information based on the Korean Standard Industrial Classification (KSIC), which indicates the business characteristics of enterprises. Classification of information or documents has largely relied on machine learning, but there is not enough training data categorized by KSIC. This study therefore applied a method of calculating similarity between documents. Specifically, we propose a method and a model that present the most appropriate KSIC code by collecting the explanatory texts of each KSIC code and calculating their similarity to the document to be classified using the vector space model. IPC data were collected and classified by KSIC, and the methodology was verified by comparing the results with the KSIC-IPC concordance table provided by the Korean Intellectual Property Office. In the verification, the highest agreement was obtained when the LT method, a kind of TF-IDF calculation formula, was applied. In that case, the top-ranked KSIC code matched in 53% of cases, and the cumulative match rate within the top five ranks was 76%. This confirms that technology, industry, and market information needed by SMEs can be classified by KSIC more quantitatively and objectively. In addition, the methods and results of this study can serve as basic data to support the qualitative judgment of experts in creating a linkage table between heterogeneous classification systems.
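The similarity-based classification in this abstract can be sketched as a small vector space model. The sketch below is illustrative only: the weighting is a common log-scaled TF-IDF variant and is not claimed to be the paper's "LT" formula, and the codes (`code_A` and so on), their descriptions, and the function names are invented placeholders rather than real KSIC entries.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def weight(tf, idf):
    """Log-scaled TF times IDF for every term that has an IDF value."""
    return {t: (1.0 + math.log(c)) * idf[t] for t, c in tf.items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_codes(code_descriptions, document, top_k=5):
    """Rank classification codes by similarity of their text to `document`."""
    codes = list(code_descriptions)
    tokenized = [tokenize(code_descriptions[c]) for c in codes]
    n_docs = len(tokenized)
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    idf = {t: math.log(n_docs / df) + 1.0 for t, df in doc_freq.items()}
    vectors = [weight(Counter(tokens), idf) for tokens in tokenized]
    query = weight(Counter(tokenize(document)), idf)
    scored = sorted(
        ((cosine(query, vec), code) for code, vec in zip(codes, vectors)),
        reverse=True,
    )
    return [(code, score) for score, code in scored[:top_k]]


# Placeholder codes and descriptions, not actual KSIC data.
descriptions = {
    "code_A": "manufacture of semiconductor devices and integrated circuits",
    "code_B": "growing of cereal crops and vegetables",
    "code_C": "software publishing and computer programming services",
}
doc = "method for fabricating integrated circuits on a semiconductor wafer"
ranking = rank_codes(descriptions, doc, top_k=3)
print(ranking[0][0])
```

Returning the top five candidates rather than a single code mirrors the abstract's evaluation, which reports both the first-rank match and the cumulative match within the fifth rank.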

EEG Classification for depression patients using decision tree and possibilistic support vector machines (뇌파의 의사 결정 트리 분석과 가능성 기반 서포트 벡터 머신 분석을 통한 우울증 환자의 분류)

  • Sim, Woo-Hyeon;Lee, Gi-Yeong;Chae, Jeong-Ho;Jeong, Jae-Seung;Lee, Do-Heon
    • Bioinformatics and Biosystems / v.1 no.2 / pp.134-138 / 2006
  • Depression is the most common and widespread mood disorder. About 20% of the population may suffer a major, incapacitating episode of depression during their lifetime. The disorder can be classified into two types: major depressive disorder and bipolar disorder. Because pharmaceutical treatments differ by type of depressive disorder, correct and fast classification is critical for patients. Yet classical statistical methods, such as the Minnesota Multiphasic Personality Inventory (MMPI), are difficult to apply to depression patients because the patients have trouble concentrating. We used electroencephalogram (EEG) analysis for the classification of depression. We extracted the nonlinearity of information flows between channels and estimated the approximate entropy (ApEn) of the EEG at each channel. Using these attributes, we applied two data mining classification methods: a decision tree and possibilistic support vector machines (PSVM). The decision tree showed 85.19% accuracy and the PSVM 77.78% accuracy in classifying 30 patients with major depressive disorder and 24 patients with bipolar disorder.
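Approximate entropy, one of the per-channel features named in this abstract, can be computed as in the following sketch. It follows the standard definition (Chebyshev distance between templates, self-matches included); the embedding dimension `m=2`, the tolerance of 0.2 times the standard deviation, and the synthetic test signals are conventional choices assumed here, not values taken from the paper.

```python
import math
import statistics


def approximate_entropy(series, m=2, r=0.2):
    """Approximate entropy ApEn(m, r) of a 1-D sequence of numbers."""
    n = len(series)

    def phi(dim):
        # All overlapping templates of length `dim`.
        templates = [series[i:i + dim] for i in range(n - dim + 1)]
        total = 0.0
        for a in templates:
            # Count templates within tolerance r (Chebyshev distance),
            # including the template itself, so the count is never zero.
            matches = sum(
                1
                for b in templates
                if max(abs(x - y) for x, y in zip(a, b)) <= r
            )
            total += math.log(matches / len(templates))
        return total / len(templates)

    return phi(m) - phi(m + 1)


# A perfectly regular signal versus a chaotic one (logistic map).
regular = [0.0, 1.0] * 50
chaotic = [0.4]
for _ in range(199):
    chaotic.append(3.9 * chaotic[-1] * (1.0 - chaotic[-1]))

apen_regular = approximate_entropy(regular, r=0.2 * statistics.pstdev(regular))
apen_chaotic = approximate_entropy(chaotic, r=0.2 * statistics.pstdev(chaotic))
print(apen_regular, apen_chaotic)
```

A regular signal yields a value near zero, while an irregular one yields a clearly larger value, which is what makes the measure usable as a classification attribute.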


Analysis of Abroad Mid- to Long-Term R&D Themes and Market Information in the Geological Information and Mineral Resources Fields (지질정보 및 광물자원 분야 국외 중장기 연구개발 주제 및 시장정보 분석)

  • Ahn, Eun-Young
    • Economic and Environmental Geology / v.52 no.6 / pp.637-645 / 2019
  • Due to the transition to an intelligent information society, rapid changes in our lives and environment are expected. The Ministry of Science and ICT (MSIT) and the National Research Council of Science and Technology (NST) introduced five-year planning and evaluation for government-supported research institutions based on a mid- to long-term perspective. This study gathers international benchmarking information covering industry, academia, and research by collecting mid- and long-term strategy reports from public research institutes, surveying experts at overseas universities and research institutes, and analyzing overseas market information reports. The British Geological Survey (BGS), the U.S. Geological Survey (USGS), and Japan's geological survey organization (AIST-GSJ) plan for three-dimensional national geological information, prediction of geological environmental disasters, and development of important metals and materials for the low-carbon economic transformation and the era of the Fourth Industrial Revolution. The survey of overseas experts, such as those at IPGP-CNRS, indicates that mid- and long-term programs emphasize basic and public research on geological information. The market analysis of the mining automation and digital map sectors identified fields in which public research institutes are expected to play a role, such as data collection on land and in the air; mobile or three-dimensional information production; smooth, fast, real-time maps; custom map design; mapping support for various platforms; and geological environmental risk assessment and disaster management information and maps.

A Study on the Soil Contamination(Maps) Using the Handheld XRF and GIS in Abandoned Mining Areas (휴대용 XRF와 GIS를 이용한 폐광산 지역의 토양오염에 관한 연구)

  • Lee, Hyeon-Gyu;Choi, Yo-Soon
    • Journal of the Korean Association of Geographic Information Studies / v.17 no.3 / pp.195-206 / 2014
  • In this study, soil contamination maps for Cu and Pb were created at the Busan abandoned mine in Korea using a handheld X-ray fluorescence (XRF) analyzer and Geographic Information Systems (GIS). Hydrological analysis was performed using the Digital Elevation Model (DEM) of the study area to identify the flow directions of surface runoff along which pollutants can disperse from the soil contamination sources. Based on the results of the hydrological analysis, 24 locations were selected for measuring Cu and Pb contamination. The handheld XRF measurements at the 24 locations showed a maximum Cu concentration of 8,255 ppm and a maximum Pb concentration of 2,146 ppm. The field investigation data were entered into ArcGIS software, and soil contamination maps for Cu and Pb with a 5 m grid spacing were created by spatial interpolation using the ordinary kriging method. The maps showed that high concentrations of Cu and Pb are present at the waste and tailings dumps around the abandoned mine openings. This study also showed that handheld XRF and GIS can be used to create soil contamination maps for Cu and Pb in the field.

Analysis of Leaf Node Ranking Methods for Spatial Event Prediction (의사결정트리에서 공간사건 예측을 위한 리프노드 등급 결정 방법 분석)

  • Yeon, Young-Kwang
    • Journal of the Korean Association of Geographic Information Studies / v.17 no.4 / pp.101-111 / 2014
  • Spatial events can be predicted using data mining classification algorithms. Decision trees are among the representative classification algorithms and have normally been used in classification tasks with labeled class values. With rule ranking methods, however, decision trees have also been applied to spatial prediction problems. This paper compares rule ranking methods for spatial prediction using a decision tree. For the comparison experiment, the C4.5 decision tree algorithm and the rule ranking methods Laplace, M-estimate, and m-branch were implemented. Landslides, a representative spatial event occurring in the natural environment, were used as the spatial prediction case study. In the accuracy evaluation, m-branch showed better accuracy than the other rule ranking methods. However, m-branch and M-estimate required an additional, time-consuming procedure to search for optimal parameter values. The methods can therefore be used selectively depending on the application area. A decision tree can be used not only for spatial prediction but also for causal analysis at the locations where specific events occur.
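Two of the rule ranking methods named in this abstract, Laplace and M-estimate, are simple smoothed probability estimates and can be sketched briefly. The formulas below are the standard textbook forms; the leaf counts, the leaf names, and the choice of `m=2.0` are made-up illustrations, and m-branch is omitted because it needs the full path from root to leaf.

```python
def laplace_score(positives, total, n_classes=2):
    """Laplace-corrected probability of the positive class at a leaf."""
    return (positives + 1) / (total + n_classes)


def m_estimate_score(positives, total, prior, m=2.0):
    """M-estimate: shrink the leaf frequency toward the class prior."""
    return (positives + m * prior) / (total + m)


# Hypothetical leaves: name -> (event cells, all cells reaching the leaf).
leaves = {"leaf_a": (2, 2), "leaf_b": (40, 50), "leaf_c": (1, 20)}

# Class prior estimated from all leaves combined.
prior = sum(p for p, _ in leaves.values()) / sum(t for _, t in leaves.values())

by_frequency = sorted(
    leaves, key=lambda k: leaves[k][0] / leaves[k][1], reverse=True
)
by_laplace = sorted(
    leaves, key=lambda k: laplace_score(*leaves[k]), reverse=True
)
by_m_estimate = sorted(
    leaves, key=lambda k: m_estimate_score(*leaves[k], prior=prior), reverse=True
)
print(by_frequency, by_laplace, by_m_estimate)
```

The raw frequency puts the tiny pure leaf first, while the Laplace correction ranks the well-supported leaf above it; the parameter `m` controls how strongly M-estimate applies the same kind of shrinkage, which is why it has to be tuned.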

Identification of Emerging Research at the national level: Scientometric Approach using Scopus (국가적 차원의 유망연구영역 탐색: Scopus 데이터베이스를 이용한 과학계량학적 접근)

  • Yeo, Woon-Dong;Sohn, Eun-Soo;Jung, Eui-Seob;Lee, Chang-Hoan
    • Journal of Information Management / v.39 no.3 / pp.95-113 / 2008
  • In today's environment, in which science and technology are changing faster than ever, companies have to monitor and search for emerging technologies to gain competitiveness, and many nations try to do the same. Most of them use the Delphi approach, based on expert review, as the search method. However, expert review has been criticized for its potential bias and the problems that follow from it, since it depends solely on the experts' subjectivity. To overcome these problems, we used a scientometric method to identify emerging technologies, a task that has as a rule been done with Delphi. We made three particular efforts to improve the quality of the results. First, we chose between the SCI and Scopus databases, selecting Scopus in the hope of obtaining results distributed evenly across a wide range of fields of current interest. Second, we used fractional citation counting when counting citations in the linear regression analysis stage. Finally, we verified the scientometric results against expert opinions to minimize probable errors in scientometric research. As a result, we derived 290 emerging technologies from the scientometric analysis of the Scopus database and visualized them on a two-dimensional map with KnowledgeMatrix, a data mining system developed by KISTI.

Research Suggestion for Disaster Prediction using Safety Report of Korea Government (안전신문고를 이용한 재난 예측 방법론 제안)

  • Lee, Jun;Shin, Jindong;Cho, Sangmyeong;Lee, Sanghwa
    • Journal of Korean Society of Disaster and Security / v.12 no.4 / pp.15-26 / 2019
  • Anjunshinmungo (the Safety e-Report) has been in operation since 2014, and about 1 million reports had accumulated by June 2019. This study analyzes the contents of more than 1 million safety reports filed in the current information age to determine how powerful and meaningful the people's voice and interest are. In particular, we are interested in their forecasting ability: we wanted to check whether safety reports are related to possible disasters. To this end, we obtained the reported data as text and analyzed it with natural language analysis methods. We then analyzed newspaper articles from the analysis period and examined the correlation between the contents of the safety reports and the newspaper articles. As a result, accidents occurred within a few months as the number of reports related to response and confirmation increased, and analyzing the contents of previously filed safety reports on social instability can be used to predict future disasters.

Factors analysis of the cyanobacterial dominance in the four weirs installed in the Nakdong River (낙동강의 중·하류 4개보에서 남조류 우점 환경 요인 분석)

  • Kim, Sung jin;Chung, Se woong;Park, Hyung seok;Cho, Young cheol;Lee, Hee suk
    • Proceedings of the Korea Water Resources Association Conference / 2019.05a / pp.413-413 / 2019
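  • Abnormal overgrowth of cyanobacteria in rivers and lakes (hereafter, the algal bloom problem) reduces the biodiversity of freshwater ecosystems and produces taste-and-odor compounds in drinking water, hindering water use. In addition, mass proliferation of toxin-producing harmful cyanobacteria can be fatally harmful to the health of livestock and humans. In Korea, algal blooms used to cause intermittent problems in stagnant waters such as dam reservoirs and estuarine lakes, but since 16 weirs were installed under the Four Major Rivers Project (2010-2011), they have occurred widely in large rivers such as the Nakdong, Geum, and Yeongsan Rivers and have become an important social and environmental issue. Various explanations have been offered for the frequent blooms in the weir sections of large rivers, including the effect of climate change accompanying global temperature rise, excessive nutrient inflow from the watershed, reduced flow due to drought, and increased residence time due to weir installation. However, a universal interpretation is difficult because the causes of blooms differ with the characteristics of the watershed and water body, or because multiple factors act in combination. Accurate cause analysis and effective countermeasures for each river system and each weir therefore require a more scientific and objective approach based on intensive experimental data and data mining techniques. This study targets four weirs on the Nakdong River (Gangjeong-Goryeong, Dalseong, Hapcheon-Changnyeong, and Changnyeong-Haman Weirs), where cyanobacterial blooms have occurred frequently since the weirs were installed in 2012. It conducts intensive field surveys and laboratory analyses and applies statistical analysis and various data modeling techniques to the collected meteorological, hydrological, water quality, and algal data in order to identify, for each weir, the environmental conditions under which cyanobacteria dominate and the key control variables for managing them. Qualitative and quantitative experiments on water quality and phytoplankton at each weir were carried out over two years, from May 2017 to November 2018. The correlation between cyanobacterial cell density and environmental factors was analyzed, and the environmental factors behind cyanobacterial dominance at each weir were interpreted comprehensively on the basis of various parametric and nonparametric data mining results, including step-wise multiple linear regression (SMLR), variable importance evaluation using random forests (RF) and recursive feature elimination using random forest (RFE-RF), decision trees (DT), and principal component analysis (PCA).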
