• Title/Summary/Keyword: domain-specific model

Nonlinear Vector Alignment Methodology for Mapping Domain-Specific Terminology into General Space (전문어의 범용 공간 매핑을 위한 비선형 벡터 정렬 방법론)

  • Kim, Junwoo;Yoon, Byungho;Kim, Namgyu
    • Journal of Intelligence and Information Systems / v.28 no.2 / pp.127-146 / 2022
  • Recently, as word embedding has shown excellent performance in various deep learning-based natural language processing tasks, research on the advancement and application of word, sentence, and document embedding has been actively conducted. Among these topics, cross-language transfer, which enables semantic exchange between different languages, is growing along with the development of embedding models. Academic interest in vector alignment is growing with the expectation that it can be applied to various embedding-based analyses. In particular, vector alignment is expected to be applied to mapping between specialized domains and generalized domains. In other words, it is expected that it will be possible to map the vocabulary of specialized fields such as R&D, medicine, and law into the space of a pre-trained language model trained on a huge volume of general-purpose documents, or to provide a clue for mapping vocabulary between mutually different specialized fields. However, the linear vector alignment that has mainly been studied in academia assumes statistical linearity and therefore tends to oversimplify the vector space. It essentially assumes that different types of vector spaces are geometrically similar, which causes inevitable distortion in the alignment process. To overcome this limitation, we propose a deep learning-based vector alignment methodology that effectively learns the nonlinearity of data. The proposed methodology consists of sequential learning of a skip-connected autoencoder and a regression model to align the specialized word embedding expressed in each space to the general embedding space. Finally, through the inference of the two trained models, the specialized vocabulary can be aligned in the general space.
To verify the performance of the proposed methodology, an experiment was performed on a total of 77,578 documents in the field of 'health care' among national R&D tasks performed from 2011 to 2020. As a result, it was confirmed that the proposed methodology showed superior performance in terms of cosine similarity compared to the existing linear vector alignment.
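
The abstract above reports alignment quality in terms of cosine similarity. A minimal sketch of that evaluation step follows, assuming the aligned specialized vectors and their general-space targets are already available as plain lists of floats. The function names and sample vectors are hypothetical illustrations, not the authors' code or data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def mean_alignment_score(aligned, targets):
    """Average cosine similarity between aligned vectors and their targets."""
    scores = [cosine_similarity(a, t) for a, t in zip(aligned, targets)]
    return sum(scores) / len(scores)


# Toy vectors (assumed values, for illustration only).
aligned = [[1.0, 0.0], [0.0, 2.0]]
targets = [[2.0, 0.0], [1.0, 1.0]]
score = mean_alignment_score(aligned, targets)
print(round(score, 4))
```

A higher mean score indicates that the aligned vectors point in directions closer to their targets; the same function could be applied to the output of either a linear or a nonlinear aligner to compare them.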

Study on the Effect of Self-Disclosure Factor on Exposure Behavior of Social Network Service (자기노출 요인이 소셜 네트워크 서비스의 노출행동에 미치는 영향에 관한 연구)

  • Do Soon Kwon;Seong Jun Kim;Jung Eun Kim;Hye In Jeong;Ki Seok Lee
    • Information Systems Review / v.18 no.3 / pp.209-233 / 2016
  • Internet companies that utilize social networks have increased in number. The introduction of diverse social media services has facilitated innovative changes in e-business. Social network service (SNS), which is a domain of social media, is a web-based service designed to strengthen human relations on the Internet and build new social relations. The remarkable growth of social network services and the profit generation and perception of this service are the new growth engines of this digital age. Given this development, many global IT companies view SNS as the most powerful form of social media. Thus, they invest efforts to develop business models using SNS. This study verifies the impact of privacy exposure in SNS as a result of privacy invasion. This study examines the purpose of using the SNS and users' awareness of the significance of personal information, which are key factors that affect self-disclosure of personal information. This study utilizes the theory of reasoned action (TRA) to provide a theoretical platform that describes the specific behavior and emotional response of individuals. This study presents a research model that considers negative attitude (negatude). In this model, self-disclosure in SNS is considered a TRA. TRA is a subjective norm, a behavioral intention, and a key variable of exposure behavior. A survey was conducted on college students at Y university in Seoul to empirically verify the research model. The students have experience in using SNS. A total of 198 samples were collected. Path analysis was applied to analyze the relations of factors. The results of path analysis show the statistically insignificant impact of privacy invasion on negatude, subjective norm, behavioral intention, and exposure behavior. The impact of unrecognized privacy invasion was also considered insignificant. The impact of intention to use SNS on negatude, subjective norm, behavioral intention, and exposure behavior was significant.
A significant impact was also found for the significance of personal information on subjective norm, behavioral intention, and exposure behavior, whereas the impact on negatude was insignificant. The impact of subjective norm on behavioral intention was significant. Lastly, the impact of behavioral intention on exposure behavior was insignificant. These findings are significant because the study examined the process of self-disclosure by integrating psychological and social factors based on theoretical discussion.

Optimization of Multiclass Support Vector Machine using Genetic Algorithm: Application to the Prediction of Corporate Credit Rating (유전자 알고리즘을 이용한 다분류 SVM의 최적화: 기업신용등급 예측에의 응용)

  • Ahn, Hyunchul
    • Information Systems Review / v.16 no.3 / pp.161-177 / 2014
  • Corporate credit rating assessment consists of complicated processes in which various factors describing a company are taken into consideration. Such assessment is known to be very expensive since domain experts should be employed to assess the ratings. As a result, data-driven corporate credit rating prediction using statistical and artificial intelligence (AI) techniques has received considerable attention from researchers and practitioners. In particular, statistical methods such as multiple discriminant analysis (MDA) and multinomial logistic regression analysis (MLOGIT), and AI methods including case-based reasoning (CBR), artificial neural network (ANN), and multiclass support vector machine (MSVM) have been applied to corporate credit rating. Among them, MSVM has recently become popular because of its robustness and high prediction accuracy. In this study, we propose a novel optimized MSVM model and apply it to corporate credit rating prediction in order to enhance the accuracy. Our model, named 'GAMSVM (Genetic Algorithm-optimized Multiclass Support Vector Machine),' is designed to simultaneously optimize the kernel parameters and the feature subset selection. Prior studies such as Lorena and de Carvalho (2008) and Chatterjee (2013) show that proper kernel parameters may improve the performance of MSVMs. Also, the results from studies such as Shieh and Yang (2008) and Chatterjee (2013) imply that appropriate feature selection may lead to higher prediction accuracy. Based on these prior studies, we propose to apply GAMSVM to corporate credit rating prediction. As a tool for optimizing the kernel parameters and the feature subset selection, we suggest the genetic algorithm (GA). GA is known as an efficient and effective search method that attempts to simulate the biological evolution phenomenon. By applying genetic operations such as selection, crossover, and mutation, it is designed to gradually improve the search results.
In particular, the mutation operator prevents GA from falling into local optima; thus, we can find the globally optimal or near-optimal solution using it. GA has widely been applied to search for optimal parameters or feature subsets of AI techniques including MSVM. For these reasons, we also adopt GA as an optimization tool. To empirically validate the usefulness of GAMSVM, we applied it to a real-world case of credit rating in Korea. Our application is in bond rating, which is the most frequently studied area of credit rating for specific debt issues or other financial obligations. The experimental dataset was collected from a large credit rating company in South Korea. It contained 39 financial ratios of 1,295 companies in the manufacturing industry, and their credit ratings. Using various statistical methods including the one-way ANOVA and the stepwise MDA, we selected 14 financial ratios as the candidate independent variables. The dependent variable, i.e., credit rating, was labeled as four classes: 1(A1); 2(A2); 3(A3); 4(B and C). For each class, 80 percent of the data was used for training, and the remaining 20 percent was used for validation. To overcome the small sample size, we applied five-fold cross validation to our dataset. In order to examine the competitiveness of the proposed model, we also experimented with several comparative models including MDA, MLOGIT, CBR, ANN, and MSVM. In the case of MSVM, we adopted the One-Against-One (OAO) and DAGSVM (Directed Acyclic Graph SVM) approaches because they are known to be the most accurate among the various MSVM approaches. GAMSVM was implemented using LIBSVM, an open-source software package, and Evolver 5.5, a commercial software package that enables GA. The other comparative models were run using various statistical and AI packages such as SPSS for Windows, Neuroshell, and Microsoft Excel VBA (Visual Basic for Applications). Experimental results showed that the proposed model, GAMSVM, outperformed all the competing models.
In addition, the model was found to use fewer independent variables while showing higher accuracy. In our experiments, five variables, namely X7 (total debt), X9 (sales per employee), X13 (years after founded), X15 (accumulated earning to total asset), and X39 (the index related to the cash flows from operating activity), were found to be the most important factors in predicting the corporate credit ratings. However, the values of the finally selected kernel parameters were found to be almost the same among the data subsets. To examine whether the predictive performance of GAMSVM was significantly greater than that of the other models, we used the McNemar test. As a result, we found that GAMSVM was better than MDA, MLOGIT, CBR, and ANN at the 1% significance level, and better than OAO and DAGSVM at the 5% significance level.
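
The abstract above compares classifiers with the McNemar test. A minimal sketch of that test follows, assuming the common chi-square form with continuity correction; the abstract does not state which variant was used. The counts `b` and `c` are hypothetical: `b` is the number of validation cases only model A classified correctly, and `c` the number only model B classified correctly.

```python
import math


def mcnemar_test(b, c):
    """McNemar chi-square test with continuity correction (1 degree of freedom).

    b: cases model A got right and model B got wrong
    c: cases model B got right and model A got wrong
    Returns (statistic, p_value).
    """
    if b + c == 0:
        return 0.0, 1.0  # the two models never disagree
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Made-up disagreement counts, for illustration only.
stat, p = mcnemar_test(25, 10)
print(round(stat, 3), round(p, 4))
```

With these illustrative counts the difference would be judged significant at the 5% level but not at the 1% level, which mirrors how the two thresholds in the abstract separate the comparisons.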

Development and Application of Scientific Inquiry-based STEAM Education Program for Free-Learning Semester in Middle School (중학교 자유학기제에 적합한 과학 탐구 중심의 융합인재교육 프로그램 개발 및 적용)

  • Jeong, Hyeondo;Lee, Hyonyong
    • Journal of Science Education / v.41 no.3 / pp.334-350 / 2017
  • The purposes of this study are to develop a scientific inquiry-based STEAM education program and to investigate the effects of the program on middle-school students' interest, self-efficacy, and career choice regarding science, technology/engineering, and mathematics. To develop this program, a literature review of previous studies was conducted, the developmental direction was set on scientific inquiry, and the developmental theme and model were selected. A total of 92 first-graders at G middle school in Daegu participated in this study. A single-group pre-post paired t-test was conducted to identify changes in students' interest, self-efficacy, and career choice before and after applying this program. In addition, in-depth interviews were conducted with 14 students to find their specific responses. The results of this study were as follows. First, a STEAM education program on the theme of 'RC Airplane' was developed on the basis of the 'ADBA' model. Second, the developed STEAM education program had statistically significant effects on the interest, self-efficacy, and career choice in science, technology/engineering, and mathematics of middle-school students involved in the free-semester program, across the overall affective domain. In conclusion, the STEAM education program in this study could be meaningful for middle-school students during the free semester. It could contribute to facilitating middle-school students' education for happiness and to fostering creative STEAM talent.

How to improve the accuracy of recommendation systems: Combining ratings and review texts sentiment scores (평점과 리뷰 텍스트 감성분석을 결합한 추천시스템 향상 방안 연구)

  • Hyun, Jiyeon;Ryu, Sangyi;Lee, Sang-Yong Tom
    • Journal of Intelligence and Information Systems / v.25 no.1 / pp.219-239 / 2019
  • As providing customized services to individuals becomes more important, research on personalized recommendation systems is constantly being carried out. Collaborative filtering is one of the most popular approaches in academia and industry. However, it has a limitation in that recommendations are mostly based on quantitative information such as users' ratings, which lowers accuracy. To solve this problem, many studies have attempted to improve the performance of recommendation systems by using information other than the quantitative information. A good example is the use of sentiment analysis on customer review text data. Nevertheless, existing research has not directly combined the results of sentiment analysis and quantitative rating scores in the recommendation system. Therefore, this study aims to reflect the sentiments shown in the reviews in the rating scores. In other words, we propose a new algorithm that can directly convert a user's own review into quantitative information and reflect it directly in the recommendation system. To do this, we needed to quantify users' reviews, which were originally qualitative information. In this study, the sentiment score was calculated through a sentiment analysis technique of text mining. Movie review data were used. Based on the data, a domain-specific sentiment dictionary was constructed for the movie reviews. Regression analysis was used as the method to construct the sentiment dictionary. Positive and negative dictionaries were constructed using the Lasso regression, Ridge regression, and ElasticNet methods. The accuracy of each constructed sentiment dictionary was verified through a confusion matrix. The accuracy of the Lasso-based dictionary was 70%, the accuracy of the Ridge-based dictionary was 79%, and that of the ElasticNet ($\alpha=0.3$) dictionary was 83%.
Therefore, in this study, the sentiment score of the review is calculated based on the dictionary of the ElasticNet method. It was combined with the rating to create a new rating. In this paper, we show that collaborative filtering that reflects the sentiment scores of user reviews is superior to the traditional method that only considers the existing rating. To show this, the proposed algorithm was applied to memory-based user-based collaborative filtering (UBCF), item-based collaborative filtering (IBCF), and the model-based matrix factorization methods SVD and SVD++. For each of these algorithms, the mean absolute error (MAE) and the root mean square error (RMSE) were calculated to compare the recommendation system that uses a score combining sentiment scores with a system that only considers ratings. When the evaluation index was MAE, it improved by 0.059 for UBCF, 0.0862 for IBCF, 0.1012 for SVD, and 0.188 for SVD++. When the evaluation index was RMSE, it improved by 0.0431 for UBCF, 0.0882 for IBCF, 0.1103 for SVD, and 0.1756 for SVD++. As a result, it can be seen that the prediction performance of the rating reflecting the sentiment score proposed in this paper is superior to that of the conventional rating. In other words, it is confirmed that collaborative filtering that reflects the sentiment score of the user review shows superior accuracy compared with conventional collaborative filtering that only considers the quantitative score. We then performed a paired t-test to check that the proposed model was a better approach and concluded that it is. In this study, to overcome the limitation of previous research that judges a user's sentiment only by the quantitative rating score, the review was numerically scored so that the user's opinion was reflected in a more refined way in the recommendation system to improve its accuracy.
The findings of this study have managerial implications for recommendation system developers who need to consider both quantitative and qualitative information. The way of constructing the combined system in this paper might be directly used by developers.
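
The abstract above evaluates each recommender with MAE and RMSE. A minimal sketch of the two metrics follows, using their standard definitions. The rating values are made up for illustration and are not taken from the study.

```python
import math


def mean_absolute_error(actual, predicted):
    """MAE: average absolute difference between actual and predicted ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_square_error(actual, predicted):
    """RMSE: square root of the average squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical held-out ratings and a recommender's predictions for them.
actual = [4.0, 3.0, 5.0]
predicted = [3.5, 3.0, 4.0]
mae_value = mean_absolute_error(actual, predicted)
rmse_value = root_mean_square_error(actual, predicted)
print(mae_value, round(rmse_value, 4))
```

Both metrics are errors, so lower is better; an "improvement" of the kind reported above corresponds to a decrease in the metric when the sentiment-adjusted rating replaces the raw rating. RMSE penalizes large individual errors more heavily than MAE.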

A Characterization of Oil Sand Reservoir and Selections of Optimal SAGD Locations Based on Stochastic Geostatistical Predictions (지구통계 기법을 이용한 오일샌드 저류층 해석 및 스팀주입중력법을 이용한 비투멘 회수 적지 선정 사전 연구)

  • Jeong, Jina;Park, Eungyu
    • Economic and Environmental Geology / v.46 no.4 / pp.313-327 / 2013
  • In this study, three-dimensional geostatistical simulations of the McMurray Formation, the largest oil sand reservoir in the Athabasca area of Canada, were performed, and the optimal site for steam-assisted gravity drainage (SAGD) was selected based on the predictions. In the selection, factors related to the vertical extendibility of the steam chamber were considered as the criteria for an optimal site. For the predictions, data from 110 boreholes acquired in the study area were analyzed in the Markovian transition probability (TP) framework, and three-dimensional distributions of the composing media were predicted stochastically through an existing TP-based geostatistical model. The potential of a specific medium at a position within the prediction domain was estimated from the ensemble probability based on multiple realizations. From the ensemble map, the cumulative thickness of the permeable media (i.e., breccia and sand) was analyzed, and the locations with the highest potential for SAGD applications were delineated. As a supporting criterion for an optimal SAGD site, the mean vertical extension of a unit of permeable media was also delineated through transition rate-based computations. The mean vertical extension of the permeable media shows rough agreement with the cumulative thickness in general distribution. However, the two distributions disagree distinctly at a few locations where the cumulative thickness was higher due to highly alternating juxtaposition of the permeable and the less permeable media. This observation implies that the cumulative thickness alone may not be a sufficient criterion for an optimal SAGD site and that the mean vertical extension of the permeable media needs to be jointly considered for sound site selection.

Use of Human Serum Albumin Fusion Tags for Recombinant Protein Secretory Expression in the Methylotrophic Yeast Hansenula polymorpha (메탄올 자화효모 Hansenula polymorpha에서의 재조합 단백질 분비발현을 위한 인체 혈청 알부민 융합단편의 활용)

  • Song, Ji-Hye;Hwang, Dong Hyeon;Oh, Doo-Byoung;Rhee, Sang Ki;Kwon, Ohsuk
    • Microbiology and Biotechnology Letters / v.41 no.1 / pp.17-25 / 2013
  • The thermotolerant methylotrophic yeast Hansenula polymorpha is an attractive model organism for various fundamental studies, such as the genetic control of enzymes involved in methanol metabolism, peroxisome biogenesis, nitrate assimilation, and resistance to heavy metals and oxidative stresses. In addition, H. polymorpha has been highlighted as a promising recombinant protein expression host, especially due to the availability of strong and tightly regulatable promoters. In this study, we investigated the possibility of employing human serum albumin (HSA) as the fusion tag for the secretory expression of heterologous proteins in H. polymorpha. A set of four expression cassettes, which contained the methanol oxidase (MOX) promoter, translational HSA fusion tag, and the terminator of MOX, were constructed. The expression cassettes were also designed to contain sequences for accessory elements including His8-tag, $2\times(Gly_4Ser_1)$ linkers, tobacco etch virus protease recognition sites (Tev), multi-cloning sites, and strep-tags. To determine the effects of the size of the HSA fusion tag on the secretory expression of the target protein, each cassette contained the HSA gene fragment truncated at a specific position based on its domain structure. By using the green fluorescent protein gene as the reporter, the properties of each expression cassette were compared under various conditions. Our results suggest that the translational HSA fusion tag is an efficient tool for the secretory expression of recombinant proteins in H. polymorpha.

Academic Enrichment beginning from the Great Learning(大學, Dae Hak, or Da Xue in Chinese) toward the Essentials of the Studies of the Sages(聖學輯要, Seong Hak Jibyo) in the respect of Cultivating Oneself(修己, sugi) (수기(修己)의 측면에서 본 『대학(大學)』에서 『성학집요(聖學輯要)』로의 학문적 심화)

  • Shin, Chang Ho
    • (The)Study of the Eastern Classic / no.34 / pp.63-88 / 2009
  • This paper examines how the Great Learning (大學, Dae Hak) was received during the Joseon period by investigating the characteristics of the Essentials of the Studies of the Sages (聖學輯要, Seong Hak Jibyo), compiled by Lee I, as a reinterpretation and academic enrichment of the Great Learning. During the Joseon Dynasty, the Great Learning held the most important position as the core scripture in the intellectual society that pursued Seong Hak (聖學, sage learning). Throughout the Joseon period, the Great Learning was the essential text for the Emperorship Learning (帝王學, Jewang Hak) as well as Seong Hak, and it can also be said that Seong Hak Jibyo, compiled by Yulgok (the courtesy name of Lee I), was the comprehensive collection thereof. While compiling Seong Hak Jibyo, Yulgok presented a model of the Seong Hak of Joseon, which was based on the Great Learning. Yulgok organized the system of Seong Hak Jibyo largely in five parts, and properly arranged the Three Cardinal Principles (三綱領, samgangryeong) and Eight Articles or Steps (八條目, paljomok) therein. In particular, in Chapter Two, "Cultivating Oneself (修己, sugi)," Yulgok deals with 'being able to manifest one's bright virtue' (明明德, myeong myeong deok) among the Three Cardinal Principles as the core curriculum. Meanwhile, Yulgok also covers "Investigation of things, gyeongmul (格物)," "Extension of knowledge, chiji (致知)," "Sincerity of the will, Seongui (誠意)," "Rectification of the mind, Jeongshim (正心)," and "Cultivation of the personal life, susin (修身)" among the Paljomok (eight steps), with 'Stopping in perfect goodness' (止於至善, jieojiseon) as the ultimate purpose. These well preserve the principal system of Confucianism, where "Cultivating oneself and regulating others (修己治人, sugichiin)" is the core value, and his instructions also logically back up academic validity by presenting specific guidelines for practice according to each domain.
The reinterpretation of the Great Learning by Yulgok in Seong Hak Jibyo is an arena for investigating the characteristics of Confucianism in the Joseon period, which differed from that of China. Furthermore, such guidelines might serve as criteria for understanding the characteristics of humans and learning held by Korean people.

A Ranking Algorithm for Semantic Web Resources: A Class-oriented Approach (시맨틱 웹 자원의 랭킹을 위한 알고리즘: 클래스중심 접근방법)

  • Rho, Sang-Kyu;Park, Hyun-Jung;Park, Jin-Soo
    • Asia pacific journal of information systems / v.17 no.4 / pp.31-59 / 2007
  • We frequently use search engines to find relevant information in the Web but still end up with too much information. In order to solve this problem of information overload, ranking algorithms have been applied to various domains. As more information will be available in the future, effectively and efficiently ranking search results will become more critical. In this paper, we propose a ranking algorithm for the Semantic Web resources, specifically RDF resources. Traditionally, the importance of a particular Web page is estimated based on the number of key words found in the page, which is subject to manipulation. In contrast, link analysis methods such as Google's PageRank capitalize on the information which is inherent in the link structure of the Web graph. PageRank considers a certain page highly important if it is referred to by many other pages. The degree of the importance also increases if the importance of the referring pages is high. Kleinberg's algorithm is another link-structure based ranking algorithm for Web pages. Unlike PageRank, Kleinberg's algorithm utilizes two kinds of scores: the authority score and the hub score. If a page has a high authority score, it is an authority on a given topic and many pages refer to it. A page with a high hub score links to many authoritative pages. As mentioned above, the link-structure based ranking method has been playing an essential role in World Wide Web(WWW), and nowadays, many people recognize the effectiveness and efficiency of it. On the other hand, as Resource Description Framework(RDF) data model forms the foundation of the Semantic Web, any information in the Semantic Web can be expressed with RDF graph, making the ranking algorithm for RDF knowledge bases greatly important. The RDF graph consists of nodes and directional links similar to the Web graph. As a result, the link-structure based ranking method seems to be highly applicable to ranking the Semantic Web resources. 
However, the information space of the Semantic Web is more complex than that of WWW. For instance, WWW can be considered as one huge class, i.e., a collection of Web pages, which has only a recursive property, i.e., a 'refers to' property corresponding to the hyperlinks. However, the Semantic Web encompasses various kinds of classes and properties, and consequently, ranking methods used in WWW should be modified to reflect the complexity of the information space in the Semantic Web. Previous research addressed the ranking problem of query results retrieved from RDF knowledge bases. Mukherjea and Bamba modified Kleinberg's algorithm in order to apply their algorithm to rank the Semantic Web resources. They defined the objectivity score and the subjectivity score of a resource, which correspond to the authority score and the hub score of Kleinberg's, respectively. They concentrated on the diversity of properties and introduced property weights to control the influence of a resource on another resource depending on the characteristic of the property linking the two resources. A node with a high objectivity score becomes the object of many RDF triples, and a node with a high subjectivity score becomes the subject of many RDF triples. They developed several kinds of Semantic Web systems in order to validate their technique and showed some experimental results verifying the applicability of their method to the Semantic Web. Despite their efforts, however, there remained some limitations which they reported in their paper. First, their algorithm is useful only when a Semantic Web system represents most of the knowledge pertaining to a certain domain. In other words, the ratio of links to nodes should be high, or overall resources should be described in detail, to a certain degree for their algorithm to properly work. 
Second, the Tightly-Knit Community (TKC) effect, the phenomenon in which pages that are less important but densely connected have higher scores than ones that are more important but sparsely connected, remains problematic. Third, a resource may have a high score, not because it is actually important, but simply because it is very common and as a consequence has many links pointing to it. In this paper, we examine such ranking problems from a novel perspective and propose a new algorithm which can solve the problems identified in the previous studies. Our proposed method is based on a class-oriented approach. In contrast to the predicate-oriented approach adopted by the previous research, a user, under our approach, determines the weight of a property by comparing its relative significance to the other properties when evaluating the importance of resources in a specific class. This approach stems from the idea that most queries are supposed to find resources belonging to the same class in the Semantic Web, which consists of many heterogeneous classes in RDF Schema. This approach closely reflects the way that people, in the real world, evaluate something, and we expect it to prove superior to the predicate-oriented approach for the Semantic Web. Our proposed algorithm can resolve the TKC effect, and can further shed light on other limitations posed by the previous research. In addition, we propose two ways to incorporate data-type properties, which have not been employed even when they have some significance for resource importance. We designed an experiment to show the effectiveness of our proposed algorithm and the validity of the ranking results, which had not been attempted in previous research. We also conducted a comprehensive mathematical analysis, which was overlooked in previous research. The mathematical analysis enabled us to simplify the calculation procedure.
Finally, we summarize our experimental results and discuss further research issues.
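
The abstract above builds on Kleinberg's algorithm, which assigns each node an authority score and a hub score. A minimal sketch of that baseline iteration on a tiny directed graph follows. It is plain Kleinberg-style scoring, not the class-oriented algorithm the paper proposes, and the edge list and function name are hypothetical.

```python
import math


def hits(edges, iterations=50):
    """Iteratively compute (authority, hub) scores for a directed edge list."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of a node: sum of hub scores of the nodes linking to it.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub of a node: sum of authority scores of the nodes it links to.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        # Normalize both score vectors to unit length so they stay bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in hub.items()}
    return auth, hub


# Toy graph: "a" and "b" link to "c"; "a" also links to "d".
edges = [("a", "c"), ("b", "c"), ("a", "d")]
auth, hub = hits(edges)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this toy graph the node with the most incoming links receives the highest authority score and the node linking to the most authorities receives the highest hub score. The objectivity and subjectivity scores described in the abstract play the analogous roles for the objects and subjects of RDF triples.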