• Title/Summary/Keyword: Authorship of a text

Text Categorization for Authorship based on the Features of Lingual Conceptual Expression

  • Zhang, Quan;Zhang, Yun-liang;Yuan, Yi
    • Proceedings of the Korean Society for Language and Information Conference / 2007.11a / pp.515-521 / 2007
  • Text categorization is an important field in automatic text information processing, and authorship identification of a text can be treated as a special case of text categorization. This paper adopts an expression of conceptual primitives based on the Hierarchical Network of Concepts (HNC) theory, which describes word meanings in hierarchical symbols, in order to avoid the data sparseness caused by natural-language surface features in text categorization. The KNN algorithm is used as the classifier. An experiment on authorship identification of Chinese texts shows that the proposed approach achieves a high accuracy, so it is feasible for text authorship identification.

A Comparative Study of Feature Extraction Methods for Authorship Attribution in the Text of Traditional East Asian Medicine with a Focus on Function Words (한의학 고문헌 텍스트에서의 저자 판별 - 기능어의 역할을 중심으로 -)

  • Oh, Junho
    • Journal of Korean Medical classics / v.33 no.2 / pp.51-59 / 2020
  • Objectives: We study which "feature" is most appropriate for effective authorship attribution of texts of Traditional East Asian Medicine. Methods: The authorship attribution performance of a Support Vector Machine (SVM) was compared by cross-validation, depending on whether function words or content words were used, whether single words or collocations were used, and whether IDF weights were applied, using 'Variorum of the Nanjing' as the experimental corpus. Results: The combination 'function words/uni-bigram/TF' performed best, with an accuracy of 0.732, and the combination 'content words/unigram/TFIDF' showed the lowest accuracy, 0.351. Conclusions: The authorship attribution of texts of East Asian traditional medicine shows the following. First, function words play a more important role than content words. Second, collocations were relatively important among content words, but single words were more important among function words. Third, unlike in general text analysis, IDF weighting resulted in worse performance.
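
    The 'function words/uni-bigram/TF' feature setting described in the abstract above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the word list FUNCTION_WORDS, the helper name function_word_tf, the choice to form bigrams over the filtered function-word sequence, and the normalization by the total feature count are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical function-word list for illustration only; the list used in
# the paper for classical medical texts is not given in the abstract.
FUNCTION_WORDS = {"the", "of", "and", "in", "to", "is"}


def function_word_tf(tokens, function_words=FUNCTION_WORDS):
    """Return relative frequencies of function-word unigrams and bigrams.

    Keys are tuples: ("the",) for a unigram, ("of", "the") for a bigram.
    Bigrams are formed over the filtered function-word sequence (assumption).
    """
    kept = [t for t in tokens if t in function_words]
    counts = Counter((t,) for t in kept)   # unigram counts
    counts.update(zip(kept, kept[1:]))     # bigram counts
    total = sum(counts.values())
    if total == 0:
        return {}
    # Plain term frequency: no IDF weighting is applied.
    return {feature: n / total for feature, n in counts.items()}


tokens = "the cause of the illness is in the blood".split()
tf = function_word_tf(tokens)
```

    The resulting dictionary could be turned into a fixed-length vector and passed to any classifier; content words such as "illness" never appear among the features.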

Authorship Attribution of Web Texts with Korean Language Applying Deep Learning Method (딥러닝을 활용한 웹 텍스트 저자의 남녀 구분 및 연령 판별 : SNS 사용자를 중심으로)

  • Park, Chan Yub;Jang, In Ho;Lee, Zoon Ky
    • Journal of Information Technology Services / v.15 no.3 / pp.147-155 / 2016
  • With the rapid development of technology, web text is growing explosively and attracting attention in many fields as a substitute for surveys. Facebook reaches up to 113 million users per month, and Twitter is used by various institutions and companies as a behavioral analysis tool. However, much research has focused on the meaning of the text itself, and there is a lack of studies on the creator of the text. Therefore, this research performs sex/age text classification using the posts of 20,187 Facebook users that reveal the sex and age of the writer. This research applied Convolutional Neural Networks, a type of deep learning algorithm that recently came into the spotlight as an image classifier, to web text analysis. The results, with 92% accuracy, confirmed its potential as a text classifier. In addition, this research minimized Korean morpheme analysis and applied authorship attribution to Korean web text. Based on these features, this study can support multiple uses, such as web text management as an information resource for workers and non-grammatical analysis systems for researchers. Thus, this study proposes a new method for web text analysis.

Authorship Attribution in Korean Using Frequency Profiles (빈도 정보를 이용한 한국어 저자 판별)

  • Han, Na-Rae
    • Korean Journal of Cognitive Science / v.20 no.2 / pp.225-241 / 2009
  • This paper presents an authorship attribution study in Korean conducted on a corpus of newspaper column texts. Based on the data set consisting of a total of 160 columns written by four columnists of Chosun Daily, the approach utilizes relative frequencies of various lexical units in Korean such as fully inflected words, morphemes, syllables and their bigrams in an attempt to establish authorship of a blind text selected from the set. Among these various lexical units, "the morpheme" is found to be most effective in predicting who among the four potential candidates authored a text, reporting accuracies of over 93%. The results indicate that quantitative and statistical techniques in authorship attribution and computational stylistics can be successfully applied to Korean texts.

Identifying Mobile Owner based on Authorship Attribution using WhatsApp Conversation

  • Almezaini, Badr Mohammd;Khan, Muhammad Asif
    • International Journal of Computer Science & Network Security / v.21 no.7 / pp.317-323 / 2021
  • Social media is increasingly becoming a part of our daily life for communicating with each other. There are various tools and applications for communication, and therefore identity theft is a common issue among users of such applications. A new style of identity theft occurs when cybercriminals break into a WhatsApp account, pretend to be real friends, and demand money or blackmail emotionally. To prevent such issues, data mining can be used for text classification (TC) in authorship attribution (AA) analysis to recognize the original sender of a message. Arabic is one of the most spoken languages in the world, with different variants. In this research, we built a machine learning model for mining and analyzing Arabic messages to identify the author of messages in the Saudi dialect. Several points are addressed regarding authorship attribution mining and analysis: collection of Arabic messages in the Saudi dialect and filtering of the messages' tokens. The classification uses a cross-validation technique and different machine-learning algorithms (Naïve Bayes, Support Vector Machine). Average accuracy results for Naïve Bayes and Support Vector Machine are presented, along with suggestions for future work.
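
    As a rough illustration of how a Naïve Bayes classifier can attribute a message to a sender, the following Python sketch trains a multinomial model with add-one smoothing on tokenized messages. It is a sketch under stated assumptions, not the authors' model: the function names train_nb and predict_nb, the smoothing choice, and the toy English messages are all hypothetical (the paper works with Arabic messages in the Saudi dialect, whose data and preprocessing are not given in the abstract).

```python
import math
from collections import Counter, defaultdict


def train_nb(samples):
    """Train on a list of (tokens, author) pairs; return the model tuple."""
    doc_counts = Counter()               # messages per author
    word_counts = defaultdict(Counter)   # token counts per author
    vocab = set()
    for tokens, author in samples:
        doc_counts[author] += 1
        word_counts[author].update(tokens)
        vocab.update(tokens)
    return doc_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the author with the highest log-posterior for the tokens."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_author, best_score = None, -math.inf
    for author in sorted(doc_counts):
        score = math.log(doc_counts[author] / total_docs)  # log prior
        denom = sum(word_counts[author].values()) + len(vocab)
        for t in tokens:
            if t in vocab:  # tokens unseen in training are ignored
                # Add-one (Laplace) smoothing, an assumption of this sketch.
                score += math.log((word_counts[author][t] + 1) / denom)
        if score > best_score:
            best_author, best_score = author, score
    return best_author


# Toy data invented for illustration only.
samples = [
    ("hello my friend how are you".split(), "A"),
    ("send money now please".split(), "B"),
    ("hello friend are you well".split(), "A"),
    ("please send the money".split(), "B"),
]
model = train_nb(samples)
```

    In a cross-validation setting, train_nb would be called on each training fold and predict_nb on the held-out messages, with accuracy averaged over the folds.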

A Classification Model for Attack Mail Detection based on the Authorship Analysis (작성자 분석 기반의 공격 메일 탐지를 위한 분류 모델)

  • Hong, Sung-Sam;Shin, Gun-Yoon;Han, Myung-Mook
    • Journal of Internet Computing and Services / v.18 no.6 / pp.35-46 / 2017
  • Recently, cyber attacks that attach malicious code to a mail and induce the user to execute it have been increasing. These attacks are especially dangerous because the code is easy to execute when attached as a document-type file. Author analysis is a research area studied in NLP (Natural Language Processing) and text mining; it studies methods of analyzing authors by analyzing sentences, texts, and documents in a specific language. An attack mail is created by the attacker. Therefore, by analyzing the contents of the mail and the attached document file and identifying the corresponding author, it is possible to discover features that are more distinct from normal mail and improve detection accuracy. In this paper, we propose the IADA2 (Intelligent Attack mail Detection based on Authorship Analysis) model for attack mail detection. We propose a feature vector that can classify and detect attack mail, drawn from the features used in existing machine-learning-based spam detection models and the features used in author analysis of documents, together with the IADA2 detection model. We improve on attack mail detection models that simply detect term features by extracting features that reflect the sequence characteristics of words through n-grams. Experimental results show that the proposed method improves performance depending on feature combinations, feature selection techniques, and appropriate models.

Construction of Shakespeare Authorship in the Eighteenth Century: An Example of Edmond Malone's Edition. (18세기 셰익스피어 저자론-말로운의 편집서 중심으로)

  • Han, Younglim
    • Journal of English Language & Literature / v.59 no.4 / pp.645-666 / 2013
  • In the history of the study of Shakespeare's texts the eighteenth century marked the emergence of editors, and in the history of Shakespearean editing Edmond Malone's emphasis on documentary evidence inaugurated a new stage. Malone's antiquarian scholarship sought to establish Shakespeare in the theatrical context of his age and a historically informed view of the physical circumstances under which he wrote his plays. Malone's editorial use of historical sources in terms of Shakespeare's past formulated a new mode of ascertaining his authorship: the construction of Shakespeare as a man of the theatre as well as of literature. Malone was the first scholar to recognize Shakespeare's merits as an actor, and to introduce the concept of the theatrical Shakespeare, which has become the scholarly norm since. In this respect this paper is designed to demonstrate that Malone's editorial principle and practice are characteristic of the identification of the factual documents of Shakespeare's biography, the authentication of his material to attain his true text, and the construction of his personal experiences through intensive readings of his plays. In conclusion, Malone's new criteria laid the foundation for the progress towards authorizing Shakespeare, thereby canonizing him as a figure of the theatrical and literary authority.

The Identification Framework for source code author using Authorship Analysis and CNN (작성자 분석과 CNN을 적용한 소스 코드 작성자 식별 프레임워크)

  • Shin, Gun-Yoon;Kim, Dong-Wook;Hong, Sung-sam;Han, Myung-Mook
    • Journal of Internet Computing and Services / v.19 no.5 / pp.33-41 / 2018
  • As Internet technology has developed, various programs are being created, and therefore a wide variety of code is being written by many authors. As a result, some authors pass off a program or code written by another author as their own, use other writers' code indiscriminately, or fail to indicate exactly which code they have used. This makes it more and more difficult to protect code. In this paper, we propose an author identification framework using Authorship Analysis theory and Natural Language Processing (NLP) based on a Convolutional Neural Network (CNN). We apply Authorship Analysis theory to extract features for author identification from source code, and combine them with features used in text mining to perform author identification using machine learning. In addition, we apply a CNN-based natural language processing method to source code for code author classification. To identify the author, features specific to author identification are needed, and the CNN-based NLP method can be applied to a language with a special system, such as source code, to identify the author. The identification accuracy based on Authorship Analysis theory is 95.1%, and the identification accuracy with the CNN is 98%.

People Re-identification: A Multidisciplinary Challenge (사람 재식별: 학제간 연구 과제)

  • Cheng, Dong-Seon
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.12 no.6 / pp.135-139 / 2012
  • The wide diffusion of the internet and the overall increased reliance on technology for information communication, dissemination and gathering have created an unparalleled mass of data. Sifting through this data is defining and will define in the foreseeable future a big part of contemporary computer science. Within this data, a growing proportion is given by personal information, which represents a unique opportunity to study human activities extensively and live. One important recurring challenge in many disciplines is the problem of people re-identification. In its broadest definition, re-identification is the problem of newly recognizing previously identified people, such as following an unknown person while he walks through many different surveillance cameras in different locations. Our goal is to review how several diverse disciplines define and meet this challenge, from person re-identification in video-surveillance to authorship attribution in text samples to distinguishing users based on their preferences of pictures. We further envision a situation where multidisciplinary solutions might be beneficial.

A Symphony of Language

  • Kim, Chin W.
    • Lingua Humanitatis / v.2 no.2 / pp.5-50 / 2002
  • This paper aims to illustrate and illuminate the relationship between language and its neighbor disciplines, in particular between language and literature, language and religion, and language and music. 1. Language and literature. Literature is an art of language. Therefore, linguistics, the science of language, should be able to explain how the grammar of literature elevates an ordinary language into a literary language. I illustrate poetic syntax with examples from Shelley, Coleridge, and Wordsworth. 2. Language and religion. I show how a linguistic analysis of a religious text can illuminate the background, authorship, chronology, etc., of a religious text with an example from the Book of Daniel. I also illustrate how a misanalysis of a poetic meter led to a mistranslation with an example from the Book of Psalms. 3. Language and music. First I trace an epochal event in the history of Western music, i.e., the change of the musical style from the liturgical music of Latin in which the rhythm was created by the alternation of syllable duration into the liberated music of German in which the rhythm was generated by the alternation of lexical stress. I then illustrate a parallelism between linguistic and musical structures with several musical pieces including Gregorian chant, the 16th century music of Palestrina, the 17th century music of Schütz, the 18th century music of Mozart, and the 19th century Viennese music. Finally, the importance of text-tune (verse-melody) association is discussed with examples of mismatches in translated Korean hymns and contemporary Korean lyrical songs. In the concluding part, I speculate on some factors that are responsible for the same organizational devices in three different modes of human communication. An answer may be that all are under the same laws of mind that govern the way man perceives and organizes nature, i.e., the same cognitive abilities of man, in particular, the capacity to organize and impose structure on their respective inputs.