• Title/Summary/Keyword: Preprocessing Methods


Generating Data and Applying Machine Learning Methods for Music Genre Classification (음악 장르 분류를 위한 데이터 생성 및 머신러닝 적용 방안)

  • Bit-Chan Eom;Dong-Hwi Cho;Choon-Sung Nam
    • Journal of Internet Computing and Services / v.25 no.4 / pp.57-64 / 2024
  • This paper aims to improve the accuracy of music genre classification for tracks whose genre information is not provided, by using machine learning to classify a large amount of music data. Instead of the GTZAN dataset commonly employed in previous genre-classification research, the paper proposes collecting and preprocessing its own data. To create a dataset with better classification performance than GTZAN, we extract the specific segments with the highest onset energy from each track. We use 57 features, including Mel-Frequency Cepstral Coefficients (MFCC), as the main characteristics of the music data used for training. Classifying the preprocessed data into the Classical, Jazz, Country, Disco, Soul, Rock, Metal, and Hiphop genres with a Support Vector Machine (SVM) model, we achieved a training accuracy of 85% and a testing accuracy of 71%.
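The segment-selection step described above can be sketched independently of any audio library: given a per-frame energy (or onset-strength) envelope, pick the contiguous window with the largest total energy. A minimal numpy sketch under that reading; the function name and toy envelope are illustrative, and the real pipeline would first compute the envelope (and the 57 MFCC-based features) from audio.

```python
import numpy as np

def highest_energy_segment(energy, seg_len):
    """Return the start index of the contiguous window of seg_len frames
    with the largest total energy (a stand-in for the onset-energy pick)."""
    # Sliding-window sums via a cumulative sum.
    c = np.concatenate(([0.0], np.cumsum(energy)))
    sums = c[seg_len:] - c[:-seg_len]
    return int(np.argmax(sums))

# Toy envelope: the loudest three-frame stretch starts at index 5.
env = np.array([0.1, 0.2, 0.1, 0.1, 0.3, 2.0, 2.5, 2.2, 0.4, 0.2])
start = highest_energy_segment(env, seg_len=3)
```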

A Study on AI-Based Real Estate Rate of Return Decision Models of 5 Sectors for 5 Global Cities: Seoul, New York, London, Paris and Tokyo (인공지능 (AI) 기반 섹터별 부동산 수익률 결정 모델 연구- 글로벌 5개 도시를 중심으로 (서울, 뉴욕, 런던, 파리, 도쿄) -)

  • Wonboo Lee;Jisoo Lee;Minsang Kim
    • Journal of Korean Society for Quality Management / v.52 no.3 / pp.429-457 / 2024
  • Purpose: This study aims to provide useful information to real estate investors by developing a profit determination model using artificial intelligence. The model analyzes the real estate markets of the five selected cities from multiple perspectives, incorporating characteristics of the real estate market, economic indicators, and policies to determine potential profits. Methods: Data on real estate markets, economic indicators, and policies for the five cities were collected and cleaned. The data were then normalized and split into training and testing sets. An AI model was developed using machine learning algorithms and trained with these data. The model was applied to the five cities, and its accuracy was evaluated by comparing predicted profits to actual outcomes using metrics such as Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and R-squared. Results: The profit determination model was successfully applied to the real estate markets of the five cities, showing high accuracy and predictability in profit forecasts. The study provided valuable insights for real estate investors, demonstrating the model's utility for informed investment decisions. Conclusion: The study identified areas for future improvement, suggesting the integration of diverse data sources and advanced machine learning techniques to enhance predictive capabilities.
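The three evaluation metrics named in Methods (MAE, RMSE, R-squared) can be computed directly; a minimal numpy sketch with hypothetical predicted and actual rates of return, not the study's data:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MAE, RMSE and R-squared, the three metrics named in Methods."""
    err = y_true - y_pred
    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err ** 2)))
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Hypothetical annual rates of return (%): actual vs. model-predicted.
y_true = np.array([3.0, 5.0, 4.0, 6.0])
y_pred = np.array([2.5, 5.5, 4.0, 6.0])
mae, rmse, r2 = regression_metrics(y_true, y_pred)
```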

Location Generalization Method of Moving Object using $R^*$-Tree and Grid ($R^*$-Tree와 Grid를 이용한 이동 객체의 위치 일반화 기법)

  • Ko, Hyun;Kim, Kwang-Jong;Lee, Yon-Sik
    • Journal of the Korea Society of Computer and Information / v.12 no.2 s.46 / pp.231-242 / 2007
  • Existing pattern-mining methods [1,2,3,4,5,6,11,12,13] do not apply location generalization to the location history data of moving objects; they simply extract frequent patterns that carry no spatio-temporal constraint over a specific space. It is therefore difficult to apply these methods to frequent-pattern mining with spatio-temporal constraints, such as finding optimal movement or scheduling paths among specific points. These methods also require large amounts of memory because they keep a pattern tree in memory to reduce repeated database scans. A more effective pattern-mining technique is therefore needed. In this paper, we propose a new location generalization method that converts detailed-level data into meaningful spatial information, reducing both the processing time for pattern mining over a massive set of moving-object history data and the space required. In the preprocessing stage of pattern mining, the proposed method generalizes the location attributes of a moving object into 2D spatial areas based on an $R^*$-Tree and an Area Grid Hash Table (AGHT), and creates moving sequences from them, enabling efficient mining of the object's spatial moving patterns.
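As a rough illustration of what location generalization buys, the sketch below maps raw coordinates onto grid cells and collapses consecutive duplicates into a moving sequence. The paper's actual method uses an $R^*$-Tree and an Area Grid Hash Table; the fixed-size grid here is a simplifying assumption, and the trajectory is hypothetical.

```python
def generalize(points, cell):
    """Map raw (x, y) positions to grid-cell ids and collapse consecutive
    duplicates, turning a detailed location history into a moving sequence."""
    seq = []
    for x, y in points:
        cid = (int(x // cell), int(y // cell))
        if not seq or seq[-1] != cid:
            seq.append(cid)
    return seq

# A hypothetical trajectory generalized over 2.0 x 2.0 cells.
track = [(1.2, 0.5), (1.8, 0.9), (5.1, 0.4), (5.5, 4.2)]
seq = generalize(track, cell=2.0)
```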


A Study on the Efficiency of Internet Keyword Advertisement According to CPM and CPC Methods by Analyzing Transactional Data (키워드 검색 광고 운영 DB 데이터 분석을 통한 CPM와 CPC방식의 광고효과 연구)

  • Kim, Do-Yeon;Lim, Gyoo-Gun;Lee, Dae-Chul
    • The Journal of Society for e-Business Studies / v.16 no.4 / pp.139-154 / 2011
  • Recently, Internet keyword search service providers have tended to use the CPC advertising method rather than the CPM method. However, research on how much more beneficial the CPC method is to advertisers than the CPM method has been insufficient and unsystematic. This paper therefore performs a comparative analysis of the two methods using real transactional DB data from an advertising agency. Because direct comparison between the two methods is difficult, owing to their different exposure positions on the Web and different attribute types in the DB, we applied a preprocessing step to the transactional data. The analysis shows that the click-through rate of CPC is higher than that of CPM by 1.3 percentage points, and the unit cost per click of CPC is lower than that of CPM by 51 Won. This indicates that the CPC method is more effective than the CPM method in terms of both advertising effectiveness (CTR) and advertising cost (CPC). We hope this research gives advertisers and marketing managers useful information for advertising strategy, marketing decisions, and budgeting.
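The two headline figures, click-through rate and unit cost per click, are simple ratios over the transactional data; a minimal sketch with hypothetical campaign numbers (not the paper's data):

```python
def ctr(clicks, impressions):
    """Click-through rate: clicks per impression."""
    return clicks / impressions

def cost_per_click(total_cost, clicks):
    """Effective unit cost (e.g. in Won) per click."""
    return total_cost / clicks

# Hypothetical campaign figures, not the paper's data.
cpm_ctr = ctr(130, 10_000)
cpc_ctr = ctr(260, 10_000)
cpc_unit_cost = cost_per_click(52_000, 260)
```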

Hyperspectral Imaging Information System for Analyzing the Urchin Barren Phenomenon to Ensure the Safety of Seaweed-Derived Biomass (해조류 유래 바이오매스 안전성 확보를 위한 갯녹음 현상 분석 초분광영상 정보 시스템)

  • Yong-Suk Kim;Sang-Mok Chang
    • Clean Technology / v.30 no.3 / pp.175-187 / 2024
  • Seaweeds are widely distributed along coastlines around the world, and the biomass derived from them is an important marine biological resource. Seaweed is a crucial component of a healthy marine ecosystem. However, changes in marine environments have led to the occurrence of urchin barrens, and the damage caused by this phenomenon is steadily increasing. Investigations into the distribution and spread of urchin barrens in the coastal areas of South Korea are accordingly conducted regularly, so efficient detection technologies are essential. One technology that can swiftly and accurately analyze extensive areas is detection based on hyperspectral image information systems. This study presents the latest hyperspectral imaging technology for investigating the current status of urchin barrens, including its principles, preprocessing techniques, and correction methods. It also proposes a classification technique for urchin barrens along the coast of Jeju Island that uses hyperspectral images and categorizes the urchin barrens into initial, intermediate, and advanced stages. The results showed that approximately 17.5% of the experimental areas were in the advanced stage. Based on this, various management and restoration methods tailored to each category of urchin barren can be proposed.

Hierarchical Overlapping Clustering to Detect Complex Concepts (중복을 허용한 계층적 클러스터링에 의한 복합 개념 탐지 방법)

  • Hong, Su-Jeong;Choi, Joong-Min
    • Journal of Intelligence and Information Systems / v.17 no.1 / pp.111-125 / 2011
  • Clustering is a process of grouping similar or relevant documents into a cluster and assigning a meaningful concept to that cluster. Clustering thereby facilitates fast and accurate search for relevant documents by narrowing the search range to the collection of documents belonging to related clusters. Effective clustering requires techniques for identifying similar documents and grouping them into a cluster, and for discovering the concept most relevant to the cluster. One problem that often appears in this context is the detection of a complex concept that overlaps with several simple concepts at the same hierarchical level. Previous clustering methods were unable to identify and represent a complex concept that belongs to several different clusters at the same level in the concept hierarchy, and could not validate the semantic hierarchical relationship between a complex concept and each of the simple concepts. To solve these problems, this paper proposes a new clustering method that identifies and represents complex concepts efficiently. We developed the Hierarchical Overlapping Clustering (HOC) algorithm, which modifies the traditional agglomerative hierarchical clustering algorithm to allow overlapping clusters at the same level in the concept hierarchy. The HOC algorithm represents the clustering result not by a tree but by a lattice in order to detect complex concepts. We developed a system that employs the HOC algorithm to detect complex concepts. This system operates in three phases: 1) preprocessing of the documents, 2) clustering using the HOC algorithm, and 3) validation of the semantic hierarchical relationships among the concepts in the lattice obtained as a result of clustering. The preprocessing phase represents the documents as x-y coordinate values in a 2-dimensional space by considering the weights of the terms appearing in the documents.
First, the documents go through a refinement process in which stopword removal and stemming are applied to extract index terms. Each index term is then assigned a TF-IDF weight, and the x-y coordinate value for each document is determined by combining the TF-IDF values of the terms in it. The clustering phase uses the HOC algorithm, in which the similarity between documents is calculated by the Euclidean distance. Initially, a cluster is generated for each document by grouping the documents closest to it. Then the distance between any two clusters is measured, and the closest clusters are grouped into a new cluster. This process is repeated until the root cluster is generated. In the validation phase, feature selection is applied to check whether the cluster concepts built by the HOC algorithm have meaningful hierarchical relationships. Feature selection is a method of extracting key features from a document by identifying and weighting its important and representative terms. To select key features correctly, a method is needed to determine how much each term contributes to the class of the document. Among several methods achieving this goal, this paper adopts the $\chi^2$ statistic, which measures the degree of dependency of a term t on a class c and represents the relationship between t and c by a numerical value. To demonstrate the effectiveness of the HOC algorithm, a series of performance evaluations was carried out using the well-known Reuters-21578 news collection. The results showed that the HOC algorithm greatly contributes to detecting and producing complex concepts by generating the concept hierarchy in a lattice structure.
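The term-class dependency statistic used in the validation phase can be written directly from the 2x2 term-class contingency table; a minimal sketch (the variable names are illustrative, not the paper's notation):

```python
def chi_square(A, B, C, D):
    """Chi-square dependency of term t on class c from a 2x2 contingency
    table: A = docs in c containing t, B = docs outside c containing t,
    C = docs in c without t, D = docs outside c without t."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    if denom == 0:
        return 0.0
    return N * (A * D - C * B) ** 2 / denom

# A term that appears in every document of the class and nowhere else
# gets the maximum score N; an independent term scores 0.
score = chi_square(A=10, B=0, C=0, D=10)
```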

Improved Social Network Analysis Method in SNS (SNS에서의 개선된 소셜 네트워크 분석 방법)

  • Sohn, Jong-Soo;Cho, Soo-Whan;Kwon, Kyung-Lag;Chung, In-Jeong
    • Journal of Intelligence and Information Systems / v.18 no.4 / pp.117-127 / 2012
  • Due to the recent expansion of Web 2.0-based services, along with the widespread adoption of smartphones, online social network services are being popularized among users. Online social network services are online community services that enable users to communicate with each other, share information, and expand human relationships. In social network services, the relations between users are represented by a graph consisting of nodes and links. As the number of users of online social network services increases rapidly, SNS are actively utilized in enterprise marketing, analysis of social phenomena, and so on. Social Network Analysis (SNA) is the systematic way to analyze social relationships among the members of a social network using network theory. In general, a social network consists of nodes and arcs and is often depicted in a social network diagram, in which nodes represent individual actors within the network and arcs represent relationships between the nodes. With SNA, we can measure relationships among people, such as degree of intimacy and intensity of connection, and classify groups. Ever since Social Networking Services (SNS) drew the attention of millions of users, numerous studies have been made to analyze their user relationships and messages. Typical representative SNA methods are degree centrality, betweenness centrality, and closeness centrality. Degree centrality analysis does not consider the shortest path between nodes, but the shortest path is a crucial factor in betweenness centrality, closeness centrality, and other SNA methods. In previous SNA research, computation time was not a major expense because the social networks studied were small. Unfortunately, most SNA methods require significant time to process the relevant data, which makes it difficult to apply them to the ever-increasing SNS data in social network studies.
For instance, if the number of nodes in an online social network is n, the maximum number of links is n(n-1)/2; with 10,000 nodes there can be up to 49,995,000 links, so analyzing the network is very expensive. We therefore propose a heuristic-based method for finding the shortest path among users in the SNS user graph. Using this shortest-path-finding method, we show how efficient the proposed approach is by conducting betweenness centrality and closeness centrality analyses, both of which are widely used in social network studies. Moreover, we devised an enhanced method that adds a best-first-search step and a preprocessing step to reduce computation time and rapidly search for shortest paths in a huge online social network. Best-first search finds the shortest path heuristically, generalizing human experience. Since a large number of links is shared by only a few nodes in online social networks, most nodes have relatively few connections, and a node with multiple connections functions as a hub. When searching for a particular node, looking first at users with numerous links, instead of searching all users indiscriminately, has a better chance of finding the desired node quickly. In this paper, we employ the degree of a user node vn as the heuristic evaluation function in a graph G = (N, E), where N is a set of vertices and E is a set of links between two different nodes. With this heuristic evaluation function, the worst case occurs when the target node is situated at the bottom of a skewed tree. To avoid such cases, a preprocessing step is conducted. We then find the shortest path between two nodes in the social network efficiently and analyze the network. To verify the proposed method, we crawled 160,000 people online and constructed a social network.
We then compared the proposed method with previous best-first-search and breadth-first-search methods in terms of search and analysis time. The suggested method takes 240 seconds to search nodes, whereas the breadth-first-search-based method takes 1,781 seconds, making the suggested method 7.4 times faster. Moreover, for social network analysis, the suggested method is 6.8 times faster in betweenness centrality analysis and 1.8 times faster in closeness centrality analysis. The proposed method shows that a large social network can be analyzed with better time performance. As a result, our method would improve the efficiency of social network analysis, making it particularly useful in studying social trends and phenomena.
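The degree heuristic can be sketched as a greedy best-first search over an adjacency list, expanding high-degree hubs first. This is an illustrative simplification of the paper's method (it omits the preprocessing step and, being greedy, does not guarantee the shortest path); the toy graph is hypothetical.

```python
import heapq

def best_first_path(graph, start, goal):
    """Greedy best-first search that expands high-degree (hub) nodes first,
    mirroring the heuristic that hubs reach target nodes quickly."""
    # Priority = -degree, so hubs pop first; carry the path taken so far.
    frontier = [(-len(graph[start]), start, [start])]
    visited = {start}
    while frontier:
        _, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        for nxt in graph[node]:
            if nxt not in visited:
                visited.add(nxt)
                heapq.heappush(frontier, (-len(graph[nxt]), nxt, path + [nxt]))
    return None

g = {
    "a": ["b", "c"],
    "b": ["a", "c", "d", "e"],   # hub node
    "c": ["a", "b"],
    "d": ["b"],
    "e": ["b"],
}
path = best_first_path(g, "a", "e")
```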

Prediction and analysis of acute fish toxicity of pesticides to the rainbow trout using 2D-QSAR (2D-QSAR방법을 이용한 농약류의 무지개 송어 급성 어독성 분석 및 예측)

  • Song, In-Sik;Cha, Ji-Young;Lee, Sung-Kwang
    • Analytical Science and Technology / v.24 no.6 / pp.544-555 / 2011
  • The acute toxicity of pesticides to the rainbow trout (Oncorhynchus mykiss) was analyzed and predicted using quantitative structure-activity relationships (QSAR). The aquatic toxicity data, 96-h $LC_{50}$ (median lethal concentration) values for 275 organic pesticides, were obtained from the EU-funded project DEMETRA. Prediction models were derived from 558 2D molecular descriptors calculated in PreADMET. The linear (multiple linear regression) and nonlinear (support vector machine and artificial neural network) learning methods were optimized by taking into account the statistical parameters between the experimental and predicted p$LC_{50}$. After preprocessing, population-based forward selection was used to select the best subsets of descriptors in the learning methods, including a 5-fold cross-validation procedure. The support vector machine model was selected as the best model ($R^2_{CV}$=0.677, RMSECV=0.887, MSECV=0.674) and correctly classified 87% of the training set according to EU regulation criteria. The MLR model could describe the structural characteristics of toxic chemicals and their interaction with the lipid membrane of fish. All the developed models were validated by 5-fold cross-validation and a Y-scrambling test.
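The 5-fold cross-validation split behind the reported $R^2_{CV}$ and RMSECV can be sketched in a few lines; a minimal numpy version using the dataset's sample count, with everything else (function name, seed) illustrative:

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    """Shuffle n sample indices and split them into k folds, as in a
    k-fold cross-validation procedure."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

folds = kfold_indices(275, k=5)   # 275 pesticides, 55 per fold
```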

A High Performance License Plate Recognition System (고속처리 자동차 번호판 인식시스템)

  • 남기환;배철수
    • Journal of the Korea Institute of Information and Communication Engineering / v.6 no.8 / pp.1352-1357 / 2002
  • This paper describes an algorithm to extract license plates from vehicle images. Conventional methods perform preprocessing on the entire vehicle image to produce an edge image and binarize it; the Hough transform is applied to the binary image to find horizontal and vertical lines, and the license plate area is extracted using the characteristics of license plates. The problems with this approach are that real-time processing is not feasible due to the long processing time, and that the license plate area is not extracted when lighting is irregular, such as at night, or when the plate boundary does not show up in the image. This research uses the gray-level transition characteristics of license plates: it verifies the digit area by examining the digit width and the level difference between the background area and the digit area, and then extracts the plate area by testing the distance between the verified digits. This approach solves the conventional methods' failure to extract license plates with degraded boundaries, and resolves the processing-time problem by operating in real time, so that practical application is possible. The paper presents a powerful automated license plate recognition system that is able to read the license numbers of cars even under circumstances far from ideal. In a real-life test, the percentage of rejected plates was 13%, whereas 0.4% of the plates were misclassified. Suggestions for further improvements are given.
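The gray-level transition idea can be illustrated by counting dark/bright transitions along one scan line: digit regions produce a characteristic run of transitions against the plate background. The threshold and the row of gray levels below are hypothetical, not the paper's data.

```python
def transition_count(row, threshold):
    """Count dark/bright transitions along one scan line; a run of such
    transitions is a cue for a candidate digit region."""
    binary = [1 if v >= threshold else 0 for v in row]
    return sum(1 for a, b in zip(binary, binary[1:]) if a != b)

# Hypothetical gray levels across a plate row: bright background, dark digits.
row = [200, 200, 40, 45, 200, 50, 48, 210, 200]
n = transition_count(row, threshold=128)
```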

Artificial Intelligence Algorithms, Model-Based Social Data Collection and Content Exploration (소셜데이터 분석 및 인공지능 알고리즘 기반 범죄 수사 기법 연구)

  • An, Dong-Uk;Leem, Choon Seong
    • The Journal of Bigdata / v.4 no.2 / pp.23-34 / 2019
  • Recently, crimes that exploit digital platforms have been continuously increasing, with about 140,000 cases in 2015 and about 150,000 in 2016. There is therefore a limit to handling these online crimes with old-fashioned investigation techniques. Investigators' manual online searches and cognitive investigation methods, which are broadly used today, are not enough to cope proactively with rapidly changing crimes. In addition, the characteristics of content posted to unspecified users on social media make investigations more difficult. Considering the characteristics of the online media where infringement crimes occur, this study suggests site-based collection and Open API collection among the methods for collecting web content. Since illegal content is published and deleted quickly, and new words and variant spellings are generated quickly and in great variety, it is difficult to recognize them promptly with dictionary-based morphological analysis whose entries are registered manually. To solve this problem, we propose adding a tokenizing method based on the WPM (Word Piece Model), a data preprocessing method, to the existing dictionary-based morphological analysis, in order to quickly recognize and respond to illegal content posted in online infringement crimes. In the data analysis, optimal precision is verified through a vote-based ensemble of supervised classification models for the investigation of illegal content. This study uses a classification model centered on cases of illegal multi-level marketing to proactively recognize crimes that harm the public economy, and presents an empirical study on effectively handling social data collection and content investigation.
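The vote-based ensemble step reduces to majority voting over per-classifier predictions; a minimal sketch with hypothetical classifier outputs (1 = illegal content, 0 = legitimate), not the study's models or data:

```python
from collections import Counter

def majority_vote(predictions):
    """Hard-voting ensemble: each classifier casts one label per sample,
    and the majority label wins."""
    return [Counter(sample).most_common(1)[0][0]
            for sample in zip(*predictions)]

# Hypothetical outputs of three classifiers over four posts.
clf_a = [1, 0, 1, 0]
clf_b = [1, 1, 1, 0]
clf_c = [0, 0, 1, 0]
voted = majority_vote([clf_a, clf_b, clf_c])
```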
