Search | Korea Science

A Study on the Application of Outlier Analysis for Fraud Detection: Focused on Transactions of Auction Exception Agricultural Products (부정 탐지를 위한 이상치 분석 활용방안 연구 : 농수산 상장예외품목 거래를 대상으로)

Kim, Dongsung;Kim, Kitae;Kim, Jongwoo;Park, Steve
- Journal of Intelligence and Information Systems
- v.20 no.3
- pp.93-108
- 2014
Interest in analyzing and using transaction data from different perspectives to support business decision making is increasing. Such efforts are not limited to customer management or marketing; they also extend to monitoring and detecting fraudulent transactions. Fraudulent transactions are evolving into various patterns by taking advantage of information technology. In response, much work has gone into fraud detection methods and advanced application systems that improve the accuracy and ease of fraud detection. As a case of fraud detection, this study aims to provide effective fraud detection methods for auction exception agricultural products in the largest Korean agricultural wholesale market. The auction exception products policy exists to complement auction-based trades in the agricultural wholesale market. That is, most trades of agricultural products are performed by auction; however, specific products are designated as auction exception products when the total volume of the product is relatively small, the number of wholesalers is small, or wholesalers have difficulty purchasing the product. However, the auction exception products policy raises several problems regarding the fairness and transparency of transactions, which calls for fraud detection. In this study, to generate fraud detection rules, real agricultural product transaction data from the market for 2008 to 2010 are analyzed, amounting to more than 1 million transactions and 1 billion US dollars in transaction volume. Agricultural transaction data have unique characteristics such as frequent changes in supply volumes and turbulent time-dependent changes in price. Since this was the first attempt to identify fraudulent transactions in this domain, there was no training data set for supervised learning. Therefore, fraud detection rules are generated using an outlier detection approach.
We assume that outlier transactions are more likely to be fraudulent than normal transactions. Outlier transactions are identified by comparing the daily, weekly, and quarterly average unit prices of product items. The quarterly average unit prices of product items for specific wholesalers are also used to identify outlier transactions. The reliability of the generated fraud detection rules is confirmed by domain experts. To determine whether a transaction is fraudulent, the normal distribution and the normalized Z-value concept are applied. That is, the unit price of a transaction is transformed into a Z-value to calculate its occurrence probability when the distribution of unit prices is approximated by a normal distribution. A modified Z-value of the unit price is used rather than the original Z-value. The reason is that, for auction exception agricultural products, Z-values are influenced by the outlier fraudulent transactions themselves because the number of wholesalers is small. The modified Z-values are called Self-Eliminated Z-scores because they are calculated excluding the unit price of the specific transaction being checked for fraud. To show the usefulness of the proposed approach, a prototype fraud transaction detection system is developed using Delphi. The system consists of five main menus and related submenus. The first function of the system is to import transaction databases. The next important function is to set up fraud detection parameters. By changing fraud detection parameters, system users can control the number of potential fraudulent transactions. Execution functions provide the fraud detection results found based on the fraud detection parameters. The potential fraudulent transactions can be viewed on screen or exported as files. The study is an initial attempt to identify fraudulent transactions in auction exception agricultural products.
Many research topics on this issue remain. First, the scope of the analyzed data was limited by data availability. More data on transactions, wholesalers, and producers need to be included to detect fraudulent transactions more accurately. Next, the scope of fraud detection needs to be extended to fishery products. There are also many possibilities for applying different data mining techniques to fraud detection. For example, a time series approach is a potential technique to apply to the problem. Although outlier transactions are detected here based on the unit prices of transactions, it is also possible to derive fraud detection rules based on transaction volumes.
https://doi.org/10.13088/jiis.2014.20.3.093
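
The Self-Eliminated Z-score described in the abstract above can be illustrated with a minimal sketch. The abstract only states that the Z-value of a transaction is computed excluding that transaction's own unit price; the function names, the threshold of 3, the use of the sample standard deviation, and the price list below are all assumptions made for illustration, not the authors' implementation or data.

```python
from statistics import mean, stdev


def self_eliminated_z(prices, i):
    """Z-score of prices[i] against the mean and spread of all OTHER prices."""
    others = prices[:i] + prices[i + 1:]
    mu, sigma = mean(others), stdev(others)  # sample standard deviation
    if sigma == 0:
        raise ValueError("remaining prices are identical; Z-score undefined")
    return (prices[i] - mu) / sigma


def ordinary_z(prices, i):
    """Conventional Z-score, where prices[i] also shapes the mean and spread."""
    return (prices[i] - mean(prices)) / stdev(prices)


def flag_outliers(prices, threshold=3.0):
    """Indices whose absolute Self-Eliminated Z-score exceeds the threshold."""
    return [i for i in range(len(prices))
            if abs(self_eliminated_z(prices, i)) > threshold]


# Hypothetical unit prices for one product item (not taken from the paper).
prices = [1000, 1020, 980, 1010, 990, 2500]
print(flag_outliers(prices))            # prints [5]
print(round(ordinary_z(prices, 5), 2))  # prints 2.04
```

With so few prices, the extreme value inflates the ordinary mean and standard deviation enough to keep its own Z-score below 3, while the self-eliminated version flags it clearly. This is the effect the abstract attributes to the small number of wholesalers.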

Efficient Topic Modeling by Mapping Global and Local Topics (전역 토픽의 지역 매핑을 통한 효율적 토픽 모델링 방안)

Choi, Hochang;Kim, Namgyu
- Journal of Intelligence and Information Systems
- v.23 no.3
- pp.69-94
- 2017
Recently, growing demand for big data analysis has been driving the vigorous development of related technologies and tools. In addition, advances in IT and the increased penetration of smart devices are producing large amounts of data. As a result, data analysis technology is rapidly becoming popular, and attempts to gain insights through data analysis have been continuously increasing. This means that big data analysis will become more important in various industries for the foreseeable future. Big data analysis has generally been performed by a small number of experts and delivered to those who request the analysis. However, growing interest in big data analysis has stimulated computer programming education and the development of many data analysis programs. Accordingly, the entry barriers to big data analysis are gradually lowering and data analysis technology is spreading. As a result, big data analysis is expected to be performed by the requesters of analysis themselves. Along with this, interest in various kinds of unstructured data is continually increasing, and text data in particular are attracting much attention. The emergence of new web-based platforms and techniques has brought about the mass production of text data and active attempts to analyze it. Furthermore, the results of text analysis have been utilized in various fields. Text mining is a concept that embraces various theories and techniques for text analysis. Among the many text mining techniques used for various research purposes, topic modeling is one of the most widely used and studied. Topic modeling is a technique that extracts the major issues from a large number of documents, identifies the documents that correspond to each issue, and provides the identified documents as a cluster. It is regarded as a very useful technique because it reflects the semantic elements of the documents.
Traditional topic modeling is based on the distribution of key terms across the entire document set. Thus, the entire document set must be analyzed at once to identify the topic of each document. This requirement makes the analysis time-consuming when topic modeling is applied to a large number of documents. In addition, it causes a scalability problem: the processing time increases exponentially with the number of analysis objects. This problem is particularly noticeable when the documents are distributed across multiple systems or regions. To overcome these problems, a divide-and-conquer approach can be applied to topic modeling. This means dividing a large number of documents into sub-units and deriving topics by repeating topic modeling on each unit. This method allows topic modeling on a large number of documents with limited system resources and can improve the processing speed of topic modeling. It can also significantly reduce analysis time and cost because documents can be analyzed at each location without first being combined. However, despite these advantages, the method has two major problems. First, the relationship between the local topics derived from each unit and the global topics derived from the entire document set is unclear. That is, local topics can be identified for each document, but global topics cannot. Second, a method for measuring the accuracy of such a methodology needs to be established. That is, taking the global topics as the ideal answer, the deviation of a local topic from a global topic needs to be measured. Because of these difficulties, this approach has not been studied sufficiently compared with other topic modeling studies. In this paper, we propose a topic modeling approach that solves the above two problems.
First, we divide the entire document cluster (global set) into sub-clusters (local sets) and generate a reduced global set (RGS) that consists of representative documents extracted from each local set. We address the first problem by mapping RGS topics to local topics. Along with this, we verify the accuracy of the proposed methodology by checking whether each document is assigned to the same topic in the global and local results. Using 24,000 news articles, we conduct experiments to evaluate the practical applicability of the proposed methodology. In addition, through an additional experiment, we confirm that the proposed methodology can provide results similar to topic modeling on the entire set. We also propose a reasonable method for comparing the results of both methods.
https://doi.org/10.13088/jiis.2017.23.3.069
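
The two steps named in the abstract above, mapping local topics to RGS topics and checking whether documents land in the same topic, might look roughly like the sketch below. The abstract does not say how the mapping is computed, so cosine similarity over term weights is purely an assumption here, as are all function names, topic ids, document ids, and term weights.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term -> weight dictionaries."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_local_to_rgs(local_topics, rgs_topics):
    """Map each local topic id to its most similar RGS topic id."""
    return {
        lid: max(rgs_topics, key=lambda gid: cosine(terms, rgs_topics[gid]))
        for lid, terms in local_topics.items()
    }


def agreement(doc_global, doc_local, mapping):
    """Share of documents whose mapped local topic equals their global topic."""
    same = sum(1 for doc, g in doc_global.items()
               if mapping[doc_local[doc]] == g)
    return same / len(doc_global)


# Hypothetical topics and document assignments, for illustration only.
rgs_topics = {
    "G0": {"election": 0.5, "vote": 0.3, "party": 0.2},
    "G1": {"stock": 0.6, "market": 0.4},
}
local_topics = {
    "L0": {"market": 0.5, "stock": 0.3, "bank": 0.2},
    "L1": {"vote": 0.6, "party": 0.4},
}
mapping = map_local_to_rgs(local_topics, rgs_topics)

doc_global = {"d1": "G0", "d2": "G1", "d3": "G1"}
doc_local = {"d1": "L1", "d2": "L0", "d3": "L1"}
print(mapping, agreement(doc_global, doc_local, mapping))
```

Picking the single best match keeps the sketch short; a one-to-one assignment or a similarity cutoff would be alternative design choices when several local topics compete for the same RGS topic.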

Development of Analytical Method for Detection of Fungicide Validamycin A Residues in Agricultural Products Using LC-MS/MS (LC-MS/MS를 이용한 농산물 중 살균제 Validamycin A의 시험법 개발)

Park, Ji-Su;Do, Jung-Ah;Lee, Han Sol;Park, Shin-min;Cho, Sung Min;Shin, Hye-Sun;Jang, Dong Eun;Cho, Myong-Shik;Jung, Yong-hyun;Lee, Kangbong
- Journal of Food Hygiene and Safety
- v.34 no.1
- pp.22-29
- 2019
Validamycin A is an aminoglycoside fungicide produced by Streptomyces hygroscopicus that inhibits trehalase. The purpose of this study was to develop a method for detecting validamycin A in agricultural samples in order to establish maximum residue limit (MRL) values for use in Korea. Validamycin A residues in samples were extracted using methanol/water (50/50, v/v) and purified with a hydrophilic-lipophilic balance (HLB) cartridge. The analyte was quantified and confirmed by liquid chromatography-tandem mass spectrometry (LC-MS/MS) in positive ion mode using multiple reaction monitoring (MRM). Matrix-matched calibration curves, prepared in blank extract, were linear over the calibration range (0.005~0.5 ng) with $R^2$ > 0.99. The limits of detection and quantification were 0.005 and 0.01 mg/kg, respectively. To validate the method, recovery studies were carried out at three concentration levels (LOQ, $LOQ{\times}10$, $LOQ{\times}50$) with five replicates at each level (n = 5). The average recoveries ranged from 72.5 to 118.3%, with relative standard deviations (RSD) less than 10.3%. All values were consistent with the criteria ranges requested in the Codex guidelines (CAC/GL 40-1993, 2003) and the NIFDS (National Institute of Food and Drug Safety) guideline (2016). Therefore, the proposed analytical method is accurate, effective and sensitive for validamycin A determination in agricultural commodities.
https://doi.org/10.13103/JFHS.2019.34.1.22
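
The recovery and RSD figures reported in the abstract above follow the usual definitions: recovery is the measured concentration as a percentage of the spiked concentration, and RSD is the sample standard deviation of the recoveries as a percentage of their mean. The sketch below shows that arithmetic. The five replicate values are hypothetical and are not data from the study; only the spiking level of 0.1 mg/kg (LOQ x 10, with LOQ = 0.01 mg/kg) comes from the abstract.

```python
from statistics import mean, stdev


def recovery_stats(measured, spiked):
    """Return (mean recovery %, RSD %) for replicate measurements."""
    recoveries = [100.0 * m / spiked for m in measured]
    avg = mean(recoveries)
    rsd = 100.0 * stdev(recoveries) / avg  # relative standard deviation
    return avg, rsd


# Hypothetical replicate results (mg/kg) at the LOQ x 10 spiking level.
measured = [0.092, 0.095, 0.101, 0.088, 0.097]
avg, rsd = recovery_stats(measured, spiked=0.1)
print(f"mean recovery {avg:.1f}%, RSD {rsd:.1f}%")
```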

A Deep Learning Based Approach to Recognizing Accompanying Status of Smartphone Users Using Multimodal Data (스마트폰 다종 데이터를 활용한 딥러닝 기반의 사용자 동행 상태 인식)

Kim, Kilho;Choi, Sangwoo;Chae, Moon-jung;Park, Heewoong;Lee, Jaehong;Park, Jonghun
- Journal of Intelligence and Information Systems
- v.25 no.1
- pp.163-177
- 2019
As smartphones have become widely used, human activity recognition (HAR) tasks that recognize the personal activities of smartphone users from multimodal data have been actively studied. The research area is expanding from the recognition of simple body movements of an individual user to the recognition of low-level and high-level behavior. However, HAR tasks for recognizing interaction behavior with other people, such as whether the user is accompanying or communicating with someone else, have received less attention so far. Moreover, previous research on recognizing interaction behavior has usually depended on audio, Bluetooth, and Wi-Fi sensors, which are vulnerable to privacy issues and require much time to collect enough data. In contrast, physical sensors such as accelerometer, magnetic field, and gyroscope sensors are less vulnerable to privacy issues and can collect a large amount of data within a short time. In this paper, we propose a method for detecting accompanying status with a deep learning model that uses only multimodal physical sensor data from the accelerometer, magnetic field sensor, and gyroscope. The accompanying status was defined as a subset of user interaction behavior, covering whether the user is accompanying an acquaintance at close distance and whether the user is actively communicating with the acquaintance. We propose a framework based on convolutional neural networks (CNN) and long short-term memory (LSTM) recurrent networks for classifying accompanying and conversation. First, we introduce a data preprocessing method that consists of time synchronization of multimodal data from different physical sensors, data normalization, and sequence data generation. We applied nearest-neighbor interpolation to synchronize the timestamps of data collected from different sensors.
Normalization was performed on each x, y, and z axis value of the sensor data, and the sequence data were generated with the sliding window method. The sequence data then became the input to the CNN, which extracts feature maps representing local dependencies of the original sequence. The CNN consisted of 3 convolutional layers and had no pooling layer, in order to preserve the temporal information of the sequence data. Next, the LSTM recurrent networks received the feature maps, learned long-term dependencies from them, and extracted features. The LSTM recurrent networks consisted of two layers, each with 128 cells. Finally, the extracted features were classified by a softmax classifier. The loss function of the model was the cross-entropy function, and the weights of the model were randomly initialized from a normal distribution with a mean of 0 and a standard deviation of 0.1. The model was trained using the adaptive moment estimation (ADAM) optimization algorithm with a mini-batch size of 128. We applied dropout to the input values of the LSTM recurrent networks to prevent overfitting. The initial learning rate was set to 0.001 and was decayed exponentially by a factor of 0.99 at the end of each training epoch. An Android smartphone application was developed and released to collect data. We collected smartphone data from a total of 18 subjects. Using these data, the model classified accompanying and conversation with accuracies of 98.74% and 98.83%, respectively. Both the F1 score and the accuracy of the model were higher than those of the majority vote classifier, support vector machine, and deep recurrent neural network. In future research, we will focus on more rigorous multimodal sensor data synchronization methods that minimize timestamp differences. In addition, we will further study transfer learning methods that enable models trained on the training data to be transferred to evaluation data that follow a different distribution.
We expect this to yield a model that maintains robust recognition performance against changes in data not considered during model training.
https://doi.org/10.13088/jiis.2019.25.1.163
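
The preprocessing steps and the learning-rate schedule named in the abstract above can be sketched in plain Python. Only the step names, the initial rate of 0.001, and the decay factor of 0.99 come from the abstract. The function names, the choice of z-score normalization, and the window and stride values are assumptions for illustration, since the abstract does not specify them.

```python
def nearest_sync(target_times, times, values):
    """Nearest interpolation: for each target timestamp, take the sample
    whose own timestamp is closest."""
    return [
        values[min(range(len(times)), key=lambda i: abs(times[i] - t))]
        for t in target_times
    ]


def normalize(axis_values):
    """Z-score normalisation of one axis (an assumed normalisation scheme)."""
    n = len(axis_values)
    mu = sum(axis_values) / n
    sd = (sum((v - mu) ** 2 for v in axis_values) / n) ** 0.5 or 1.0
    return [(v - mu) / sd for v in axis_values]


def sliding_windows(samples, window, stride):
    """Cut a sample sequence into fixed-length, possibly overlapping windows."""
    return [samples[s:s + window]
            for s in range(0, len(samples) - window + 1, stride)]


def learning_rate(epoch, initial=0.001, decay=0.99):
    """Learning rate after `epoch` completed epochs of exponential decay."""
    return initial * decay ** epoch


# Hypothetical gyroscope samples aligned to accelerometer timestamps.
acc_times = [0.0, 1.0, 2.0]
gyro_synced = nearest_sync(acc_times, [0.1, 0.9, 2.2], [10, 20, 30])
windows = sliding_windows(list(range(10)), window=4, stride=2)
print(gyro_synced, len(windows), learning_rate(2))
```

Each window produced this way would be one input sequence for the CNN-LSTM model; the model itself is omitted because its layer details beyond those in the abstract are not given.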

Comparison of One-day and Two-day Protocol of $^{11}C$-Acetate and $^{18}F$-FDG Scan in Hepatoma (간암환자에 있어서 $^{11}C$-Acetate와 $^{18}F$-FDG PET/CT 검사의 당일 검사법과 양일 검사법의 비교)

Kang, Sin-Chang;Park, Hoon-Hee;Kim, Jung-Yul;Lim, Han-Sang;Kim, Jae-Sam;Lee, Chang-Ho
- The Korean Journal of Nuclear Medicine Technology
- v.14 no.2
- pp.3-8
- 2010
Purpose: $^{11}C$-Acetate PET/CT is useful in detecting liver lesions, with a reported sensitivity of 87.3%. In contrast, $^{18}F$-FDG PET/CT has a sensitivity of 47.3%, and it has been reported that when $^{18}F$-FDG and $^{11}C$-Acetate PET/CT are performed together, their combined sensitivity is around 100%. However, the normal uptake of $^{11}C$-Acetate in the pancreas and the spleen can influence the $^{18}F$-FDG PET/CT images, leading to an inaccurate diagnosis. This study aimed to verify how much these two radiopharmaceuticals influence the images through a comparative analysis of the one-day and two-day protocols. Materials and Methods: The study included 46 patients who were diagnosed with liver cancer and underwent PET/CT (35 male, 11 female; mean age $54{\pm}10.6$ years; age range 29-69 years). The scanner used was the Biograph TruePoint40 PET/CT (Siemens Medical Systems, USA). The 21 participants in the one-day protocol group first underwent $^{11}C$-Acetate PET/CT and then $^{18}F$-FDG PET/CT one hour later. The other 25 participants, in the two-day protocol group, underwent $^{11}C$-Acetate PET/CT on the first day and $^{18}F$-FDG PET/CT on the next day. The two groups were compared by drawing identical regions of interest on the pancreas and the spleen in the $^{18}F$-FDG images and measuring the standardized uptake value (SUV). SPSS Ver.17 (SPSS Inc., USA) was used for statistical analysis, and statistical significance was assessed with the unpaired t-test.
Results: In the two-day protocol group, the mean${\pm}$standard deviation of the SUV was as follows: pancreas head $1.62{\pm}0.32$ g/mL, body $1.57{\pm}0.37$ g/mL, tail $1.49{\pm}0.33$ g/mL, and spleen $1.53{\pm}0.28$ g/mL. In the one-day protocol group, the values were as follows: pancreas head $1.65{\pm}0.35$ g/mL, body $1.58{\pm}0.27$ g/mL, tail $1.49{\pm}0.28$ g/mL, and spleen $1.66{\pm}0.29$ g/mL. Conclusion: No statistically significant difference was found between the one-day and two-day protocol SUVs in the pancreas and the spleen at the 0.05 significance level, and nothing that could be mistaken for a false positive was found in the PET/CT image analysis. It was also found that $^{11}C$-Acetate caused no overestimation of the SUV in the $^{18}F$-FDG images when the two scans were performed on the same day. This result was supported by the statistical analysis of the measured SUVs. If $^{11}C$-Acetate becomes commercially available in the future, the diagnosis of liver disease can be improved by combining it with $^{18}F$-FDG in a one-day protocol. These results indicate that both scans can be completed in one day without interference between the two radiopharmaceuticals, which could also reduce waiting time and improve patient satisfaction.
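
As a rough plausibility check of the comparison above, an unpaired t statistic can be recomputed from the rounded summary values in the abstract. This is a sketch, not the authors' SPSS analysis: it assumes the equal-variance (pooled) form of the unpaired t-test and uses the reported group sizes (21 one-day, 25 two-day) with the spleen SUV, which shows the largest gap between groups. The function name is hypothetical.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Student's unpaired t statistic and degrees of freedom from summaries."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_err, df


# Spleen SUV: one-day group (n = 21) versus two-day group (n = 25).
t, df = pooled_t(1.66, 0.29, 21, 1.53, 0.28, 25)
print(round(t, 2), df)  # prints 1.54 44
```

A statistic of about 1.54 with 44 degrees of freedom lies below the two-sided 5% critical value of roughly 2.02, which is consistent with the reported absence of a significant difference.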

Assessment of Bone Metastasis using Nuclear Medicine Imaging in Breast Cancer : Comparison between PET/CT and Bone Scan (유방암 환자에서 골전이에 대한 핵의학적 평가)

Cho, Dae-Hyoun;Ahn, Byeong-Cheol;Kang, Sung-Min;Seo, Ji-Hyoung;Bae, Jin-Ho;Lee, Sang-Woo;Jeong, Jin-Hyang;Yoo, Jeong-Soo;Park, Ho-Young;Lee, Jae-Tae
- Nuclear Medicine and Molecular Imaging
- v.41 no.1
- pp.30-41
- 2007
Purpose: Bone metastasis in breast cancer patients is usually assessed by conventional Tc-99m methylene diphosphonate whole-body bone scan, which has high sensitivity but poor specificity. However, positron emission tomography with $^{18}F-2-deoxyglucose$ (FDG-PET) can offer superior spatial resolution and improved specificity. FDG-PET/CT can offer more information for assessing bone metastasis than PET alone by providing anatomical information from the non-enhanced CT image. We evaluated the usefulness of FDG-PET/CT for detecting bone metastasis in breast cancer and compared FDG-PET/CT results with bone scan findings. Materials and Methods: The study group comprised 157 female patients (range: $28{\sim}78$ years old, $mean{\pm}SD=49.5{\pm}8.5$) with biopsy-proven breast cancer who underwent bone scan and FDG-PET/CT within a 1-week interval. The final diagnosis of bone metastasis was established by histopathological findings, radiological correlation, or clinical follow-up. Bone scans were acquired over 4 hours after administration of 740 MBq Tc-99m MDP. Bone scan images were interpreted as normal, low, intermediate, or high probability for osseous metastasis. FDG PET/CT was performed after 6 hours of fasting. 370 MBq F-18 FDG was administered intravenously 1 hour before imaging. PET data were obtained in 3D mode, and CT data, used for transmission correction, were acquired during shallow respiration. PET images were evaluated by visual interpretation, and FDG accumulation in bone lesions was quantified by maximal SUV (SUVmax) and relative SUV (SUVrel). Results: Six patients (4.4%) showed metastatic bone lesions. Four (66.6%) of the 6 patients with osseous metastasis were detected by bone scan, and all 6 patients (100%) were detected by PET/CT. A total of 135 bone lesions found on either FDG-PET or bone scan consisted of 108 osseous metastatic lesions and 27 benign bone lesions.
Osseous metastatic lesions had higher SUVmax and SUVrel than benign bone lesions ($4.79{\pm}3.32$ vs $1.45{\pm}0.44$, p=0.000; $3.08{\pm}2.85$ vs $0.30{\pm}0.43$, p=0.000). Among the 108 osseous metastatic lesions, 76 showed abnormal uptake on bone scan, and 76 showed increased FDG uptake on PET/CT. There was good agreement between FDG uptake and abnormal bone scan findings (Kendall tau-b: 0.689, p=0.000). Lesions with increased bone tracer uptake had higher SUVmax and SUVrel than lesions with no abnormal bone scan finding ($6.03{\pm}3.12$ vs $1.09{\pm}1.49$, p=0.000; $4.76{\pm}3.31$ vs $1.29{\pm}0.92$, p=0.000). In order of frequency, the osseous metastatic sites were vertebra, pelvis, rib, skull, sternum, scapula, femur, clavicle, and humerus. Metastatic lesions in the skull had the highest SUVmax, and metastatic lesions in the rib had the highest SUVrel. Osteosclerotic metastatic lesions had the lowest SUVmax and SUVrel. Conclusion: These results suggest that FDG-PET/CT is more sensitive than bone scan for detecting breast cancer patients with osseous metastasis. The CT images must be reviewed carefully with a bone window, because osteosclerotic metastatic lesions frequently did not show abnormal FDG accumulation.

Influences of Air Pollution on the Growth of Ornamental Trees - With Particular Reference to SO₂ - (대기오염(大氣汚染)이 조경수목(造景樹木)의 생육(生育)에 미치는 영향(影響) - 아황산(亞黃酸)가스에 대(對)하여 -)

Kim, Tae Wook
- Journal of Korean Society of Forest Science
- v.29 no.1
- pp.20-53
- 1976
To assess the capability of trees to resist air pollution and to determine the tree species best suited for purifying polluted air, particularly with regard to $SO_2$ contamination, the following six ornamental tree species were selected as experimental materials: Hibiscus syriacus L., Ginkgo biloba L., Forsythia koreana Nak., Syringa dilatata Nak., Larix leptolepis Gordon, and Pinus rigida Miller. The susceptibilities of the trees were observed and analyzed on the basis of the ratio of the area of smoke injury spots to the total leaf area. The results of the experiments are as follows: I. The Susceptibilities to Sulfur Dioxide. (1) The decreasing order of tolerance to $SO_2$ by species was as follows: 1. Hibiscus syriacus, 2. Ginkgo biloba, 3. Forsythia koreana, 4. Syringa dilatata, 5. Larix leptolepis, and 6. Pinus rigida. In general, Hibiscus syriacus and Ginkgo biloba can be grouped as the most resistant, Larix leptolepis and Pinus rigida as the least resistant, and Forsythia koreana and Syringa dilatata as of intermediate resistance. (2) The sulfur content of the leaves treated with $SO_2$ increased in proportion to the fumigation concentration. The content in the coniferous species proved to be less than that in the broad-leaved species, but Ginkgo biloba proved to contain as much sulfur as the broad-leaved species. (3) In the earlier-stage leaves fumigated in June with $SO_2$ concentrations up to 1 ppm, the sulfur content increased in proportion to the fumigation concentration, but the differences between concentrations were not very significant. (4) The later-stage leaves fumigated in October showed higher sulfur content than the earlier-stage leaves, and a wider range of difference in sulfur content was detected among different concentrations.
The fumigation concentration at which sulfur absorption peaked in broad-leaved species such as Syringa dilatata, Hibiscus syriacus, and Forsythia koreana proved to be around 0.6 ppm. (5) Owing to sprouting ability and adventitious bud formation, recovery from $SO_2$ fumigation was prominent in Hibiscus syriacus, Syringa dilatata, and Forsythia koreana. (6) Differences in smoke spot color were recognized by species: dirt-brown in Syringa dilatata, brilliant yellowish-brown in Pinus rigida and Ginkgo biloba, whitish-yellow in Hibiscus syriacus, and reddish-brown in Forsythia koreana. (7) The leaf margins proved to be most susceptible, and the leaf bases of the mid-rib most tolerant. In both Ginkgo biloba and Larix leptolepis, the younger leaves were more resistant to $SO_2$ than the older ones. II. The Sulfur Content of the Leaves of the Ornamental Trees Growing in the City of Seoul. (1) The sulfur contents in the leaves of the Seoul ornamental trees were remarkably higher than those of leaves in the non-polluted areas. The sulfur content of the leaves in the non-polluted area was in the following descending order: Salix pseudo-lasiogyne Leveille, Ginkgo biloba L., Ailanthus altissima Swingle, Platanus orientalis L., and Populus deltoides Marsh. (2) Judging from the sulfur contents in the leaves of the ornamental trees in the city of Seoul, air pollution was worst in the areas of Seoul Railroad Station, the Ahyun Pass, and the Entrance to Ewha Womans University. The areas of Deogsu Palace, Gyeongbog Palace, Changdeog Palace, Changgyeong Park and the Hyehwa Intersection were least polluted, and the areas of the East Gate, the Ulchi Intersection and the Seodaemun Intersection were in the intermediate state.

A Proposal of a Keyword Extraction System for Detecting Social Issues (사회문제 해결형 기술수요 발굴을 위한 키워드 추출 시스템 제안)

Jeong, Dami;Kim, Jaeseok;Kim, Gi-Nam;Heo, Jong-Uk;On, Byung-Won;Kang, Mijung
- Journal of Intelligence and Information Systems
- v.19 no.3
- pp.1-23
- 2013
To discover significant social issues such as unemployment, economic crisis, and social welfare that urgently need to be solved in modern society, researchers in the existing approach usually collect opinions from professional experts and scholars through online or offline surveys. However, such a method is not always effective. Because of the expense, a large number of survey replies are seldom gathered. In some cases, it is also hard to find experts dealing with specific social issues. Thus, the sample set is often small and may be biased. Furthermore, several experts may reach totally different conclusions about a social issue because each expert has a subjective point of view and a different background. In this case, it is considerably hard to figure out what the current social issues are and which of them are really important. To overcome the shortcomings of the current approach, in this paper we develop a prototype system that semi-automatically detects keywords representing social issues and problems from about 1.3 million news articles published by about 10 major domestic news outlets in Korea from June 2009 until July 2012. Our proposed system consists of (1) collecting news articles and extracting their texts, (2) identifying only the news articles related to social issues, (3) analyzing the lexical items of Korean sentences, (4) finding a set of topics regarding social keywords over time based on probabilistic topic modeling, (5) matching relevant paragraphs to a given topic, and (6) visualizing social keywords for easy understanding. In particular, we propose a novel matching algorithm relying on generative models. The goal of our proposed matching algorithm is to best match paragraphs to each topic.
Technically, using a topic model such as Latent Dirichlet Allocation (LDA), we can obtain a set of topics, each of which has relevant terms and their probability values. In our problem, given a set of text documents (e.g., news articles), LDA produces a set of topic clusters, and each topic cluster is then labeled by human annotators, where each topic label stands for a social keyword. For example, suppose there is a topic (e.g., Topic1 = {(unemployment, 0.4), (layoff, 0.3), (business, 0.3)}) and a human annotator labels Topic1 as "Unemployment Problem". In this example, it is non-trivial to understand what happened regarding the unemployment problem in our society. In other words, by looking only at social keywords, we have no idea of the detailed events occurring in our society. To tackle this matter, we develop a matching algorithm that computes the probability value of a paragraph given a topic, relying on (i) topic terms and (ii) their probability values. Specifically, given a set of text documents, we segment each text document into paragraphs. In the meantime, using LDA, we extract a set of topics from the text documents. Based on our matching process, each paragraph is assigned to the topic that it best matches. As a result, each topic has several best-matched paragraphs. For example, suppose there is a topic (e.g., Unemployment Problem) and its best-matched paragraph (e.g., Up to 300 workers lost their jobs in XXX company at Seoul). In this case, we can grasp the detailed information of the social keyword, such as "300 workers", "unemployment", "XXX company", and "Seoul". In addition, our system visualizes social keywords over time. Therefore, through our matching process and keyword visualization, most researchers will be able to detect social issues easily and quickly.
Through this prototype system, we have detected various social issues appearing in our society and have also shown the effectiveness of our proposed methods through our experimental results. Note that our proof-of-concept system is available at http://dslab.snu.ac.kr/demo.html.
https://doi.org/10.13088/jiis.2013.19.3.001
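
The paragraph-to-topic matching described in the abstract above scores a paragraph under a topic using the topic's terms and their probability values. The abstract does not give the exact formula, so the sketch below is only one plausible reading: the average per-token log-probability, with a small floor probability for words that are not topic terms. The function names, the floor value, the second topic, and the sample paragraph are assumptions; Topic1's terms and weights are taken from the abstract's own example.

```python
import math


def paragraph_log_prob(tokens, topic, floor=1e-6):
    """Average per-token log-probability of a paragraph under one topic.
    Words missing from the topic get the (assumed) floor probability."""
    if not tokens:
        return float("-inf")
    return sum(math.log(topic.get(tok, floor)) for tok in tokens) / len(tokens)


def best_topic(tokens, topics):
    """Label of the topic under which the paragraph scores highest."""
    return max(topics, key=lambda label: paragraph_log_prob(tokens, topics[label]))


topics = {
    # Topic1 from the abstract, with its human-assigned label.
    "Unemployment Problem": {"unemployment": 0.4, "layoff": 0.3, "business": 0.3},
    # Hypothetical second topic, added only so there is something to compare.
    "Housing Problem": {"rent": 0.5, "housing": 0.3, "loan": 0.2},
}
paragraph = "layoff at business raises unemployment".split()
print(best_topic(paragraph, topics))  # prints Unemployment Problem
```

Averaging over tokens keeps long paragraphs from being penalized merely for their length; a real system would also need Korean morphological analysis before matching, as step (3) of the abstract indicates.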
