• Title/Summary/Keyword: 의사 결정 (decision making)

Search Results: 8,026

Improving the Accuracy of Document Classification by Learning Heterogeneity (이질성 학습을 통한 문서 분류의 정확성 향상 기법)

  • Wong, William Xiu Shun;Hyun, Yoonjin;Kim, Namgyu
    • Journal of Intelligence and Information Systems, v.24 no.3, pp.21-44, 2018
  • In recent years, the rapid development of Internet technology and the popularization of smart devices have produced massive amounts of text data, distributed through media platforms such as the World Wide Web, Internet news feeds, microblogs, and social media. This enormous amount of easily obtained information, however, lacks organization, which has drawn the interest of many researchers and created a need for professionals capable of classifying relevant information; hence, text classification was introduced. Text classification is a challenging task in modern data analysis: a text document must be assigned to one or more predefined categories or classes. Different techniques are available for it, such as K-Nearest Neighbor, the Naïve Bayes algorithm, Support Vector Machines, Decision Trees, and Artificial Neural Networks. When dealing with huge amounts of text data, however, model performance and accuracy become a challenge, and the performance of a text classification model can vary with the type of words used in the corpus and the type of features created for classification. Most previous attempts propose a new algorithm or modify an existing one, and this line of research can be said to have reached its limits for further improvement. In this study, rather than proposing or modifying an algorithm, we focus on finding a way to modify how the data are used. It is widely known that classifier performance is influenced by the quality of the training data on which the classifier is built. Real-world datasets usually contain noise, i.e., noisy data, which can affect the decisions made by classifiers built from them. We consider that data from different domains, that is, heterogeneous data, may carry noise-like characteristics that can be utilized in the classification process. Classifiers are built under the assumption that the characteristics of the training data and the target data are the same or very similar. In the case of unstructured data such as text, however, the features are determined by the vocabulary of the documents, so if the viewpoints of the training data and the target data differ, their features may also differ. We therefore attempt to improve classification accuracy by strengthening the robustness of the document classifier through artificially injecting noise into its construction. Because data from various sources are likely to be formatted differently, they pose difficulties for traditional machine learning algorithms, which were not developed to recognize different types of data representation at the same time and combine them into a single generalization. To utilize heterogeneous data in training the document classifier, we apply semi-supervised learning. However, unlabeled data may degrade the performance of the document classifier. We therefore propose a method called the Rule Selection-Based Ensemble Semi-Supervised Learning Algorithm (RSESLA) to select only the documents that contribute to improving the accuracy of the classifier. RSESLA creates multiple views by manipulating the features using different types of classification models and different types of heterogeneous data; the most confident classification rules are then selected and applied to the final decision. In this paper, three types of real-world data sources were used: news, Twitter, and blogs.
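The following Python sketch illustrates the general idea of confidence-based selection of unlabeled documents in a multi-view, semi-supervised setting. It is not the authors' RSESLA implementation; the base models, TF-IDF features, agreement rule, and confidence threshold are assumptions made purely for illustration.

```python
# Minimal sketch of confidence-based selection of unlabeled documents for
# semi-supervised ensemble learning. NOT the authors' RSESLA; the models,
# agreement rule, and threshold are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

def select_confident_documents(labeled_docs, labels, unlabeled_docs, threshold=0.9):
    """Keep only unlabeled documents on which two different base models
    (a simple proxy for multiple views) agree with high predicted probability."""
    vec = TfidfVectorizer(max_features=5000)
    X_lab = vec.fit_transform(labeled_docs)
    X_unl = vec.transform(unlabeled_docs)

    models = [LogisticRegression(max_iter=1000), MultinomialNB()]
    probas = []
    for m in models:
        m.fit(X_lab, labels)
        probas.append(m.predict_proba(X_unl))

    selected = []
    for i in range(X_unl.shape[0]):
        preds = [p[i].argmax() for p in probas]
        confs = [p[i].max() for p in probas]
        # require agreement between the views and confidence above the threshold
        if len(set(preds)) == 1 and min(confs) >= threshold:
            selected.append((unlabeled_docs[i], models[0].classes_[preds[0]]))
    return selected
```

Documents selected this way would then be added to the labeled pool before retraining, which is the usual self-training loop such a rule-selection step plugs into.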

A Study on Automatic Classification Model of Documents Based on Korean Standard Industrial Classification (한국표준산업분류를 기준으로 한 문서의 자동 분류 모델에 관한 연구)

  • Lee, Jae-Seong;Jun, Seung-Pyo;Yoo, Hyoung Sun
    • Journal of Intelligence and Information Systems, v.24 no.3, pp.221-241, 2018
  • As we enter the knowledge society, the importance of information as a new form of capital is being emphasized, and the importance of information classification is also increasing for the efficient management of the exponentially growing volume of digital information. In this study, we attempt to automatically classify and provide tailored information that can help companies decide on technology commercialization. We therefore propose a method for classifying information based on the Korea Standard Industry Classification (KSIC), which reflects the business characteristics of enterprises. The classification of information or documents has largely relied on machine learning, but there is not enough training data categorized by KSIC, so this study instead calculates similarity between documents. Specifically, we propose a method and model that present the most appropriate KSIC code by collecting the explanatory text of each KSIC code and calculating its similarity to the document to be classified using the vector space model. IPC data were collected and classified by KSIC, and the methodology was verified by comparison with the KSIC-IPC concordance table provided by the Korean Intellectual Property Office. The verification showed the highest agreement when the LT method, a variant of the TF-IDF weighting formula, was applied: the first-ranked KSIC code matched in 53% of cases, and the cumulative match within the top five ranks was 76%. This confirms that the technology, industry, and market information that SMEs need can be classified by KSIC more quantitatively and objectively. In addition, the methods and results of this study can serve as basic data supporting the qualitative judgment of experts when creating concordance tables between heterogeneous classification systems.
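A minimal sketch of the vector-space matching step described above, assuming scikit-learn's default TF-IDF weighting rather than the LT variant reported by the authors; the KSIC descriptions and codes passed in are placeholders.

```python
# Illustrative sketch of matching a document against KSIC code descriptions
# in a TF-IDF vector space. The paper's specific "LT" weighting is not
# reproduced; scikit-learn defaults are used as a stand-in.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_ksic_codes(document, ksic_descriptions, ksic_codes, top_n=5):
    """Return the top_n KSIC codes whose descriptions are most similar
    to the input document under cosine similarity of TF-IDF vectors."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(ksic_descriptions + [document])
    sims = cosine_similarity(X[-1], X[:-1]).ravel()
    ranked = sims.argsort()[::-1][:top_n]
    return [(ksic_codes[i], float(sims[i])) for i in ranked]
```

Returning a ranked top-5 list mirrors the paper's evaluation of first-rank and cumulative fifth-rank agreement against the KSIC-IPC concordance table.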

The Effect of Meta-Features of Multiclass Datasets on the Performance of Classification Algorithms (다중 클래스 데이터셋의 메타특징이 판별 알고리즘의 성능에 미치는 영향 연구)

  • Kim, Jeonghun;Kim, Min Yong;Kwon, Ohbyung
    • Journal of Intelligence and Information Systems, v.26 no.1, pp.23-45, 2020
  • Big data is being created in a wide variety of fields such as medical care, manufacturing, logistics, sales, and SNS, and dataset characteristics are correspondingly diverse. To secure corporate competitiveness, it is necessary to improve decision-making capacity using classification algorithms, yet most practitioners do not have sufficient knowledge of which classification algorithm is appropriate for a specific problem area. In other words, determining which classification algorithm suits the characteristics of a dataset has been a task requiring expertise and effort, because the relationship between dataset characteristics (called meta-features) and the performance of classification algorithms has not been fully understood. Moreover, there has been little research on meta-features reflecting multi-class characteristics. The purpose of this study is therefore to empirically analyze whether the meta-features of multi-class datasets have a significant effect on the performance of classification algorithms. The meta-features of multi-class datasets were grouped into two factors, data structure and data complexity, and seven representative meta-features were selected. Among them, we included the Herfindahl-Hirschman Index (HHI), originally a market-concentration measure, to replace the Imbalance Ratio (IR), and we developed a new index, the Reverse ReLU Silhouette Score, for the meta-feature set. Six representative datasets from the UCI Machine Learning Repository (Balance Scale, PageBlocks, Car Evaluation, User Knowledge-Modeling, Wine Quality (red), and Contraceptive Method Choice) were selected, and each was classified using the algorithms chosen for the study (KNN, Logistic Regression, Naïve Bayes, Random Forest, and SVM). For each dataset, 10-fold cross-validation was applied; oversampling from 10% to 100% was applied within each fold and the meta-features of the dataset were measured. The selected meta-features are HHI, Number of Classes, Number of Features, Entropy, Reverse ReLU Silhouette Score, Nonlinearity of Linear Classifier, and Hub Score, with the F1-score as the dependent variable. The results show that the six meta-features, including the Reverse ReLU Silhouette Score and the HHI proposed in this study, have a significant effect on classification performance: (1) the HHI meta-feature proposed in this study is significant for classification performance; (2) the number of variables has a significant effect on classification performance and, unlike the number of classes, its effect is positive; (3) the number of classes has a negative effect on classification performance; (4) entropy has a significant effect on classification performance; (5) the Reverse ReLU Silhouette Score also significantly affects classification performance at the 0.01 significance level; and (6) the nonlinearity of linear classifiers has a significant negative effect on classification performance. The analyses by classification algorithm were also consistent, except that in the per-algorithm regressions the number of variables, unlike for the other classification algorithms, was not significant for the Naïve Bayes algorithm. This study makes two theoretical contributions: (1) two new meta-features (HHI and the Reverse ReLU Silhouette Score) were shown to be significant, and (2) the effects of data characteristics on classification performance were investigated through meta-features. Its practical contributions are that (1) the results can be used to develop a system that recommends classification algorithms according to dataset characteristics, and (2) because data characteristics differ, data scientists often search for the optimal algorithm by repeatedly tuning algorithm parameters, a process that wastes hardware, cost, time, and manpower; this study is expected to be useful to machine learning and data mining researchers, practitioners, and developers of machine-learning-based systems. The paper consists of an introduction, related research, the research model, experiments, and conclusion and discussion.
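As a concrete illustration of one of the meta-features above, the sketch below computes the Herfindahl-Hirschman Index over class proportions as a class-imbalance measure. The exact scaling used in the paper (e.g. 0-1 versus 0-10,000) is not stated here, so the 0-1 convention is an assumption.

```python
# Sketch of the HHI over class proportions, used as an imbalance meta-feature.
from collections import Counter

def class_hhi(labels):
    """HHI of class shares: sum of squared class proportions.
    Equals 1/k for perfectly balanced k classes and 1.0 for a single class."""
    counts = Counter(labels)
    n = sum(counts.values())
    return sum((c / n) ** 2 for c in counts.values())

# Example: a 3-class dataset with one dominant class
print(class_hhi(["a"] * 80 + ["b"] * 15 + ["c"] * 5))  # 0.665
```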

Development of Estimation System for Housing Remodeling Cost through Influence Analysis by Design Elements (설계요소별 영향분석을 통한 공동주택 리모델링 공사비개산견적 산출 시스템 개발)

  • Kim, Jun;Cha, Heesung
    • Korean Journal of Construction Engineering and Management, v.19 no.6, pp.65-78, 2018
  • As urban apartment buildings age, the need for reconstruction or remodeling to extend building life is increasing. In such cases, a co-housing association is formed to carry out decisions on the reconstruction or remodeling project, and the most important issue for the association is business feasibility, which depends on the construction cost. For reconstruction, the construction cost can be estimated from accumulated cost data and feasibility can then be evaluated; for remodeling, however, accumulated cost data are scarce, so it is difficult to calculate an accurate construction cost. In addition, non-specialist clients often request estimates for various design options, which negatively affects both the accuracy of the estimates and the time needed to produce them. This study therefore proposes a method to reflect the opinions of the non-expert owner as design elements, together with a method for calculating the expected construction cost for each design element, and implements them in a system that non-specialist owners can use easily. To reflect the owner's requirements clearly in the estimates, design elements are extracted from existing remodeling cases, classified, and presented as options for the client to choose from. To reflect these design elements in the estimates, existing apartment remodeling cases were surveyed and the design elements with a large effect on construction cost were extracted. Finally, the system was developed on MS Excel so that it can be used easily by a non-specialist client. To verify the accuracy of the proposed estimates, remodeling quotation cases were substituted into the system and about 80% accuracy was confirmed, and a questionnaire survey on ease of use by non-specialist clients returned positive results. The proposed estimation method is based on four cases; as more remodeling cases are accumulated, the expected effect of this study will be higher.
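A minimal sketch of the kind of rough-estimate calculation the abstract describes, assuming a base unit cost adjusted by the cost impacts of owner-selected design elements; the element names and rates are hypothetical, not values from the study or its Excel system.

```python
# Hypothetical rough estimate: base cost adjusted by the impact of selected
# design elements extracted from past remodeling cases (illustrative only).
def remodeling_estimate(floor_area_m2, base_cost_per_m2, selected_elements, impact_rates):
    """Estimated cost = base cost * (1 + sum of impact rates of the
    design elements the non-expert owner selected)."""
    adjustment = sum(impact_rates[e] for e in selected_elements)
    return floor_area_m2 * base_cost_per_m2 * (1.0 + adjustment)

# Example with hypothetical design elements and rates
rates = {"balcony_extension": 0.06, "bathroom_upgrade": 0.04, "floor_plan_change": 0.12}
print(remodeling_estimate(84, 1_500_000, ["balcony_extension", "floor_plan_change"], rates))
```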

Implementation of Reporting Tool Supporting OLAP and Data Mining Analysis Using XMLA (XMLA를 사용한 OLAP과 데이타 마이닝 분석이 가능한 리포팅 툴의 구현)

  • Choe, Jee-Woong;Kim, Myung-Ho
    • Journal of KIISE: Computing Practices and Letters, v.15 no.3, pp.154-166, 2009
  • Database query and reporting tools, OLAP tools, and data mining tools are typical front-end tools in a Business Intelligence (BI) environment, which supports gathering, consolidating, and analyzing data produced by business operations and provides enterprise users with access to the results. Traditional reporting tools have the advantage of creating sophisticated dynamic reports that include SQL query result sets, look like documents produced by word processors, and can be published to the Web, but their data source is limited to an RDBMS. OLAP tools and data mining tools, on the other hand, provide powerful analysis functions in their own ways, but their built-in visualization components are limited to tables and some charts. This paper therefore presents a system that integrates the three typical front-end tools so that they complement one another in a BI environment. Traditional reporting tools only have a query editor for generating SQL statements against an RDBMS; the reporting tool presented here can also extract data from OLAP and data mining servers, because editors for OLAP and data mining query requests have been added. Traditional systems produce all documents on the server side, which lets reporting tools avoid regenerating the same dynamic document repeatedly when many clients access it. Because this system instead targets a small number of users generating documents for data analysis, it generates documents on the client side, and the tool therefore includes a processing mechanism for handling large amounts of data despite the limited memory of the client-side report viewer. The tool also has a data structure for integrating data from the three kinds of data sources into one document. Finally, most traditional BI front-end tools depend on the data source architecture of a specific vendor; to overcome this, the system uses XMLA, a web-service-based protocol, to access OLAP and data mining services from various vendors.
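For readers unfamiliar with XMLA, the sketch below shows one way a client-side tool might post an XMLA Execute request carrying an MDX statement over HTTP. The server URL, catalog name, and query are placeholders, and real deployments typically add authentication and vendor-specific properties; this is not the paper's implementation.

```python
# Hedged sketch: posting an XMLA Execute request (SOAP over HTTP) to an
# OLAP server. Endpoint, catalog, and MDX are placeholders.
import urllib.request

XMLA_EXECUTE = """<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <Execute xmlns="urn:schemas-microsoft-com:xml-analysis">
      <Command>
        <Statement>{mdx}</Statement>
      </Command>
      <Properties>
        <PropertyList>
          <Catalog>{catalog}</Catalog>
          <Format>Multidimensional</Format>
        </PropertyList>
      </Properties>
    </Execute>
  </soap:Body>
</soap:Envelope>"""

def execute_mdx(url, catalog, mdx):
    """POST an XMLA Execute request and return the raw SOAP response body."""
    body = XMLA_EXECUTE.format(catalog=catalog, mdx=mdx).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "text/xml",
            "SOAPAction": '"urn:schemas-microsoft-com:xml-analysis:Execute"',
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```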

Spatio-temporal enhancement of forest fire risk index using weather forecast and satellite data in South Korea (기상 예보 및 위성 자료를 이용한 우리나라 산불위험지수의 시공간적 고도화)

  • KANG, Yoo-Jin;PARK, Su-min;JANG, Eun-na;IM, Jung-ho;KWON, Chun-Geun;LEE, Suk-Jun
    • Journal of the Korean Association of Geographic Information Studies, v.22 no.4, pp.116-130, 2019
  • In South Korea, forest fires are increasing in size and duration due to factors such as the accumulation of fuel materials and frequent dry conditions in forests. It is therefore necessary to minimize fire damage by appropriately providing the probability of forest fire risk. The purpose of this study is to improve the Daily Weather Index (DWI) provided by the current forest fire forecasting system in South Korea. A new Fire Risk Index (FRI) is proposed, provided on a 5 km grid through the synergistic use of numerical weather forecast data, satellite-based drought indices, and forest fire-prone areas. The FRI is calculated from the product of the Fine Fuel Moisture Code (FFMC) optimized for Korea, an integrated drought index, and spatio-temporal weights. To improve the temporal accuracy of the forest fire risk, monthly weights were applied based on the monthly frequency of forest fires; similarly, spatial weights based on forest fire density were applied to improve the spatial accuracy. In a time series analysis of the monthly number of forest fires and the FRI, the relationship between the two was well reproduced. In addition, the 5 km FRI provides spatially more detailed information on forest fire risk than the DWI based on administrative units. The findings of this study can support appropriate decisions before and after forest fire occurrences.
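A minimal sketch of how the multiplicative structure described above could be composed per grid cell, assuming all components are pre-scaled; the actual FFMC calibration, drought index, and weighting schemes used by the authors are not reproduced here.

```python
# Illustrative composition of a grid-cell fire risk index from the components
# named in the abstract. Inputs and normalisation are assumptions.
import numpy as np

def fire_risk_index(ffmc, drought, month_weight, density_weight):
    """FRI per 5 km grid cell as the product of a fuel-moisture term (FFMC),
    a drought term, a monthly weight (fire seasonality), and a spatial weight
    (historical fire density). Inputs are scalars or arrays broadcastable to
    the grid shape, assumed scaled to [0, 1] before weighting."""
    fri = ffmc * drought * month_weight * density_weight
    return np.clip(fri, 0.0, 1.0)

# Example: a 2x2 grid in a high-risk month
ffmc = np.array([[0.8, 0.6], [0.9, 0.4]])
drought = np.array([[0.7, 0.5], [0.9, 0.3]])
print(fire_risk_index(ffmc, drought, month_weight=1.2, density_weight=0.8))
```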

The Effect of Firm Characteristics on the Relationship between Managerial Ability and Firm Performance (기업특성이 경영자능력과 경영성과의 관계에 미치는 영향)

  • Cho, Sang-Min;Yoo, Ji-Yeon
    • Management & Information Systems Review, v.37 no.1, pp.103-122, 2018
  • This paper extends previous findings that managerial ability positively affects business performance by analyzing whether the degree to which managerial ability improves performance differs according to firm characteristics. The characteristics considered are whether a firm faces high funding constraints and whether it is a late mover in its market. Firms with high funding constraints rely heavily on their managers not only for the efficient use of funds but also for smooth financing, while late movers require more judgment from professional managers to overcome limited resources and low profitability. For firms whose characteristics imply high dependence on the manager, business performance is expected to vary strongly with managerial ability. The empirical analysis covers listed companies from 2010 to 2014. Managerial ability was measured following the methodology of Demerjian et al. (2012): firm-level efficiency was first estimated through data envelopment analysis (DEA), and firm-characteristic factors were then removed. Business performance was measured by the return on industrial fixed assets. The results indicate that the degree to which managerial ability improves business performance is higher for firms with high funding constraints and for late movers. Performance is considered to improve further when managerial ability is high because smooth funding allows more efficient investment and, in the case of late movers operating in relatively poor environments, because high managerial ability induces efficient decision making. This paper thus extends prior research showing that managerial ability improves performance by confirming that the effect can differ with a firm's situation, and it is meaningful in that it empirically identifies which firms need managerial ability more.
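The sketch below illustrates the second step of the Demerjian et al. (2012) style measurement mentioned above: regressing DEA efficiency scores on firm characteristics and treating the residual as a managerial-ability proxy. The OLS specification is a simplification (the original approach uses a Tobit regression by industry), and the inputs are illustrative.

```python
# Simplified second step of the Demerjian et al. (2012) approach: the part of
# firm efficiency not explained by firm characteristics proxies managerial
# ability. Efficiency scores are assumed to come from a prior DEA stage.
import numpy as np

def managerial_ability(efficiency, firm_characteristics):
    """Residual of regressing DEA efficiency on firm characteristics.
    efficiency: (n,) array; firm_characteristics: (n, k) array."""
    X = np.column_stack([np.ones(len(efficiency)), firm_characteristics])
    beta, *_ = np.linalg.lstsq(X, efficiency, rcond=None)
    return efficiency - X @ beta  # residual = managerial-ability proxy
```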

Analysis of Actual Conditions of Unnatural Death Cases and Questionnaire for Initial Crime Scene Investigation of Police (변사체 발생실태 및 경찰의 현장 초동조치에 관한 설문 분석 - 경북지역을 중심으로 -)

  • Cho, Doo-Won;Chae, Jong-Min
    • Journal of forensic and investigative science, v.1 no.1, pp.11-30, 2006
  • Preliminary investigative activities by police officers play a critical role in identifying the cause of death in unnatural death investigations. Failure to secure the crime scene leads to the destruction of significant evidence, which makes it difficult or impossible to identify the cause of death. To prevent such loss of crucial evidence and to assess the level of preliminary investigation at the scene, this research surveyed and analyzed questionnaires from 300 police first responders and 100 detectives. The results revealed a possibility that first responders fail to ensure scene security, scene observation, and canvass interviews. In addition, when medical personnel have no choice but to disturb the crime scene in order to save lives, they need to take photographs and other proper measures before entering the scene. The importance of scene-control education cannot be emphasized enough in order to prevent the media from entering and destroying evidence. A review of unnatural death cases that occurred in Kyongbook Province over the last five years produced statistics on the different types of death, showing that homicides, suicides, accidental deaths, and disaster deaths have increased year by year. It is therefore necessary for the government to adopt multilateral policies to reduce them and for the police to reinforce their investigative skills. Furthermore, because of the insufficient number of autopsy facilities and forensic pathologists, autopsies to identify the cause of death were conducted in only 13% of cases (1,237) over the last five years; the remaining 87.3% (8,496 cases) were handled through simple postmortem examination, which means the causes of some unjust deaths may still go unrevealed. It is therefore necessary to provide police agencies with reasonable funding for autopsies and to maintain enough forensic pathologists.


The Surgical Diagnosis for Detecting Early Gastric Cancer and Lymph Node Metastasis: Its Role for Making the Decision of the Limited Surgery (조기위암 및 림프절 전이에 대한 수술 중 외과적 병기판정의 정확도 및 유용성)

  • Park, Eun-Kyu;Jeong, Oh;Ryu, Seong-Yeop;Ju, Jae-Kyun;Kim, Dong-Yi;Jeong, Mi-Ran;Kim, Ho-Goon;Kim, Hoe-Won;Park, Young-Kyu
    • Journal of Gastric Cancer, v.9 no.3, pp.104-109, 2009
  • Purpose: The aim of this study is to evaluate the accuracy of the intraoperative surgical diagnosis of early gastric cancer (EGC) and lymph node metastasis, and to determine its role in deciding on limited surgery for EGC. Materials and Methods: We reviewed 369 patients who underwent gastrectomy for primary gastric carcinoma. The surgical diagnosis was evaluated in terms of sensitivity, specificity, and accuracy, and compared with the preoperative examinations. Results: The sensitivity, specificity, and accuracy of the intraoperative diagnosis of EGC were 74.5%, 95.7% and 83.7%, respectively, and its predictive value for EGC was 95.7%. The surgical diagnosis of EGC showed higher specificity and a higher predictive value than the preoperative examinations, significantly reducing the risk of underestimating advanced gastric cancer (AGC) as EGC. The sensitivity, specificity, and accuracy of the surgical diagnosis for lymph node metastasis were 73.2%, 78.1% and 76.4%, respectively. Among the 70 patients with a discrepancy between the pre- and intraoperative diagnoses of EGC, the surgical diagnosis was correct in 63 (90%) patients, whereas the preoperative examinations were correct in only 7 (10%). Conclusion: The surgical diagnosis was more accurate than the preoperative examinations for detecting EGC and lymph node metastasis. Our results suggest that basing the decision for limited surgery on the surgical diagnosis may reduce the risk of under-treating AGC as EGC compared with relying on the preoperative examinations.
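For reference, the metrics reported above follow from a standard 2x2 confusion table; the sketch below shows the computation with hypothetical counts, not the study's data.

```python
# Diagnostic metrics from a 2x2 confusion table (e.g. EGC vs. AGC).
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, accuracy, and positive predictive value."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    ppv = tp / (tp + fp)
    return sensitivity, specificity, accuracy, ppv

# Example with hypothetical counts
print(diagnostic_metrics(tp=80, fp=10, fn=20, tn=90))  # (0.8, 0.9, 0.85, 0.888...)
```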


Analysis of Potential Infection Site by Highly Pathogenic Avian Influenza Using Model Patterns of Avian Influenza Outbreak Area in Republic of Korea (국내 조류인플루엔자 발생 지역의 모델 패턴을 활용한 고병원성조류인플루엔자(HPAI)의 감염가능 지역 분석)

  • EOM, Chi-Ho;PAK, Sun-Il;BAE, Sun-Hak
    • Journal of the Korean Association of Geographic Information Studies, v.20 no.2, pp.60-74, 2017
  • To facilitate the prevention of highly pathogenic avian influenza (HPAI), GIS is widely used for monitoring, investigating epidemics, managing HPAI-infected farms, and eradicating the disease. After the outbreaks of foot-and-mouth disease in 2010 and 2011, the government of the Republic of Korea (ROK) established the GIS-based Korean Animal Health Integrated System (KAHIS) to avert livestock epidemics, including HPAI. However, KAHIS is not sufficient for controlling HPAI outbreaks because it does not adequately cover field-level responsibilities such as disinfection of HPAI-infected poultry farms and regions, control of infected animal movement, and implementation of an eradication strategy. An outbreak prediction model to support efficient HPAI control in the ROK is proposed here, constructed by analyzing HPAI outbreak patterns in the ROK. The results show that 82% of HPAI outbreaks occurred in the Jeolla and Chungcheong Provinces, where the density of poultry farms was 2.2±1.1/km² and 4.2±5.6/km², respectively, and the number of reared birds on poultry farms in HPAI outbreak regions ranged from 6,537 to 24,250. Based on the identification of poultry farms in HPAI outbreak regions, an HPAI outbreak prediction model was designed using factors such as the habitat range of migratory birds (HMB), freshwater system characteristics, and local road networks. Using these factors, poultry farms rearing 6,500-25,000 birds were filtered and compared with the farms actually affected by HPAI outbreaks in the ROK. For the poultry farms with actual HPAI outbreaks reported in 2014, 90.0% of the farm counts and 54.8% of the farm locations overlapped with the farms estimated by the prediction model. These results clearly show that the HPAI outbreak prediction model is applicable for estimating HPAI outbreak regions in the ROK.
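A hedged sketch of the screening step described above: candidate farms are filtered by flock size and by proximity to the three risk features (migratory bird habitat, freshwater, roads). The column names, distance thresholds, and tabular representation are assumptions for illustration only, not the authors' GIS workflow.

```python
# Illustrative filter for candidate HPAI-risk farms; column names and the
# 3 km proximity threshold are hypothetical.
import pandas as pd

def candidate_hpai_farms(farms: pd.DataFrame,
                         min_birds=6500, max_birds=25000,
                         max_dist_km=3.0) -> pd.DataFrame:
    """Return farms whose flock size falls in the outbreak-associated range
    and which lie within max_dist_km of all three risk features."""
    mask = (
        farms["n_birds"].between(min_birds, max_birds)
        & (farms["dist_to_hmb_km"] <= max_dist_km)
        & (farms["dist_to_water_km"] <= max_dist_km)
        & (farms["dist_to_road_km"] <= max_dist_km)
    )
    return farms[mask]
```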