• Title/Summary/Keyword: Newspaper Corpus

Aspects of Language Use in Newspaper Articles: A Corpus Linguistic Perspective (신문 기사의 언어 사용 양상: 코퍼스언어학적 접근)

  • Song, Kyung-Hwa;Kang, Beom-Mo
    • Korean Journal of Cognitive Science / v.17 no.4 / pp.255-269 / 2006
  • The purpose of this study is to analyze newspaper articles from a corpus linguistic point of view. We used a large corpus of newspaper articles built from the <21st Century Sejong Project> and counted occurrences of certain expressions. A newspaper article is divided into the headline, the lead and the body. We tried to work out how to measure the characteristics of indication and compression that are typical of headlines. Then, we focused on the differences between the headline and the lead. Finally, we analyzed the sentence structure and measured the ratio of the frequency of common nouns in the body. This study verifies the existing stylistic theories of newspapers and shows new aspects of language use in newspaper articles. Texts like newspaper articles are the results of human language processing, and they in turn affect the development of the cognitive ability of language.

A Spelling Error Correction Model in Korean Using a Correction Dictionary and a Newspaper Corpus (교정사전과 신문기사 말뭉치를 이용한 한국어 철자 오류 교정 모델)

  • Lee, Se-Hee;Kim, Hark-Soo
    • The KIPS Transactions:PartB / v.16B no.5 / pp.427-434 / 2009
  • With the rapid evolution of the Internet and mobile environments, texts containing spelling errors, such as newly coined words and abbreviated words, are widely used. These spelling errors make it difficult to develop NLP (natural language processing) applications because they decrease the readability of texts. To resolve this problem, we propose a spelling error correction model using a spelling error correction dictionary and a newspaper corpus. The proposed model has the advantage that the cost of data construction is not high because it uses a newspaper corpus, which is easy to obtain, as a training corpus. In addition, the proposed model does not require additional external modules such as a morphological analyzer or a word-spacing error correction system because it uses a simple string matching method based on a correction dictionary. In experiments with a newspaper corpus and a short message corpus collected from real mobile phones, the proposed model showed good performance (a miscorrection rate of 7.3%, an F1-measure of 97.3%, and a false positive rate of 1.1%) across the various evaluation measures.
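
The dictionary-based string matching described in this abstract can be illustrated with a minimal sketch. This is not the authors' model: the abstract does not say how the newspaper corpus is used, so the sketch assumes it only supplies token counts for choosing among candidate corrections. The names `build_counts` and `correct`, the English toy data, and whitespace tokenization are all hypothetical.

```python
from collections import Counter


def build_counts(corpus_sentences):
    """Count whitespace-separated tokens in a reference corpus (assumed: newspaper text)."""
    counts = Counter()
    for sentence in corpus_sentences:
        counts.update(sentence.split())
    return counts


def correct(text, correction_dict, counts):
    """Replace each token found in correction_dict with its best-attested candidate.

    correction_dict maps a misspelled token to a list of candidate corrections
    (a hypothetical format). Tokens not in the dictionary are left unchanged.
    """
    corrected = []
    for token in text.split():
        candidates = correction_dict.get(token)
        if candidates:
            # Assumption: prefer the candidate seen most often in the corpus.
            # max() keeps the first candidate when counts are tied.
            token = max(candidates, key=lambda c: counts[c])
        corrected.append(token)
    return " ".join(corrected)


# Toy data, invented for illustration only.
corpus = ["you did a great job", "great news for you"]
dictionary = {"u": ["you"], "gr8": ["grate", "great"]}
result = correct("u did gr8", dictionary, build_counts(corpus))
print(result)
```

Because the lookup is a plain dictionary match, no morphological analyzer or word-spacing module is involved, which mirrors the advantage the abstract claims. A real system for Korean would need substring matching rather than whitespace tokens.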

Critical Discourse Analysis of '5.18' in 'Honam' and 'Yeongnam' Local Newspapers by Using Corpus (코퍼스를 이용한 '호남'과 '영남' 지역신문에서의 '5.18'에 대한 비판적 담화분석)

  • Lee, Sukeui;Jin, Duhyeon
    • Korean Linguistics / v.76 / pp.83-112 / 2017
  • In this paper, newspaper articles were collected through '5.18' keyword search results, and a news corpus was constructed from the collected data. The ideological differences regarding '5.18' in articles from local newspapers in 'Honam' and 'Yeongnam' were investigated. The ideological differences in local newspaper discourse were analyzed using objective figures. The subjects of the newspaper articles and the frequency of nouns and predicates were analyzed. The use and meaning of the intended vocabulary were examined. Analysis of the article titles showed that the discourse written in 'Honam' emphasized the necessity of re-recognition of 5.18. In both regions, the word 'Gwangju' is often used. However, 'Gwangju' in 'Honam' newspapers means a spiritual space, not a physical space. In Honam regional newspapers, there are many words describing the events, such as 'shoot' and 'fire', which call for recollection and memory of '5.18'. In newspaper discourse analysis, contrastive analysis of local newspapers has received very little attention, so this study was conducted to analyze the discourse among local newspapers.

A Corpus-driven Approach to Korean and English Newspaper Obituaries (빈도 분석을 활용한 한·영 사망기사 특징 비교)

  • Shin, Hyejung
    • The Journal of the Korea Contents Association / v.14 no.11 / pp.592-601 / 2014
  • This study examines newspaper obituaries in Korean media and English media. Initially, 100 Korean obituaries were collected from the JoongAng Ilbo, spanning more than three years, from May 2011 to August 2014. After that, another 50 Korean obituaries were gathered from the DongA Ilbo, published over the same time period as those from the JoongAng Ilbo. As for English newspapers, obituaries from the New York Times and the Guardian were included in the corpus for comparison. First, the structure and composition of obituaries in each language (Korean and English) are compared. Korean obituaries show a pattern that combines a death notice and an obituary. Second, distinct features of each newspaper are discussed. The JoongAng Ilbo has its obituary section titled "Life and Memories", and the DongA Ilbo's obituaries are under the heading of "Rest in Peace." Obituaries in the New York Times appear in print on different pages of the paper according to the deceased's field of interest. Following the discussion of the formal structure and characteristics of each newspaper, Korean and English obituaries are compared in terms of content and cultural context.

A Study on the general language use of ROOJIN: in Headline Database of Newspaper Articles and Balanced Corpus of Contemporary Written Japanese by KOKKEN (현대일본문장어의 「노인(老人)」사용실태 - 国硏「ことばに関する新聞記事見出しデータベース」 「現代日本語書き言葉均衡コーパス」를 분석대상으로)

  • Oh, Mi sun
    • Cross-Cultural Studies / v.25 / pp.627-648 / 2011
  • The study analyzed the diachronic distribution, social meanings and social evaluations of ROOJIN. The 'Headline Database of Newspaper Articles' and the 'Balanced Corpus of Contemporary Written Japanese' by KOKKEN were used as research data. There were 305 newspaper articles (about 0.2%) containing the word ROOJIN in the 'Headline Database of Newspaper Articles'. The number of newspaper articles related to ROOJIN started to increase rapidly in 1972 and 1973. It also increased in 1976, from 1981 to 1987, and in 1992 and 1993. The reasons for the increase in newspaper articles related to ROOJIN in those four periods can be summarized as follows. Firstly, there was an increase in ROOJIN who are lonely, are not able to move about freely or live alone. Secondly, an understanding of the symptom of aging called BOKE was necessary. Thirdly, there were negative evaluations of ROOJIN in society. There were 453 cases containing the word ROOJIN in the 'Balanced Corpus of Contemporary Written Japanese' in the data since 2000. The most frequently used words were those related to senior care facilities. There were 109 cases (24%) containing those words. '~SISETSU', '~SENTA-' and '~HO-MU' were presented as words related to senior care facilities. Among them, 78 cases contained the word '~HO-MU', which was similar to a home with family members. The second most frequently used words were those related to 'welfare for the aged', led by 'medical care for the aged'. They accounted for about 8%. Institutionalization of medical care for the aged, medical expenses and nursing were presented as words related to 'medical care for the aged'. Words related to 'welfare for the aged', led by 'senior care facilities' and 'medical care for the aged', accounted for about 32% of the research data.
As mentioned above, problems of the aged in modern Japan, such as negative evaluations of ROOJIN in society, ROOJIN who are lonely, are not able to move about freely or live alone, and BOKE, could be identified by analyzing the data. Also, the frequent usage of words such as 'Home for the aged', 'medical care for the aged' and 'nursing' could be identified. The outcome of the analysis suggested that the family traditionally had the function of solving problems of the aged but that this function has been reduced in modern Japan. It also suggested that there was a tendency to outsource problems of the aged as much as possible.

A Genre Analysis of Newspaper Articles for Korean Language Education -Based on the linguistic analysis of newspaper articles and reading materials in Korean language textbooks- (한국어 읽기 교육을 위한 기사문 장르분석 -신문기사 및 교재 기사문의 언어학적 분석을 바탕으로-)

  • Lee, Seungyeon;Sim, Jiyeon;Shin, Jungha
    • Journal of Korean language education / v.28 no.3 / pp.53-83 / 2017
  • The goal of this study is to examine whether the genre characteristics of newspaper articles are appropriately reflected in Korean language textbooks. For this purpose, two corpora were built with 17 textbook articles and 60 newspaper articles respectively. The average sentence length and frequency of vocabulary in each corpus were measured. It was found that sentences in textbook articles tended to be longer and to have more complicated structures than those in newspaper articles. For instance, sentences in the textbook articles had more verbal endings, such as conjunctive and transforming endings. On the other hand, in the case of vocabulary representing 'timeliness', there was a high frequency of adverbs and nouns related to year, month, and time in actual articles, while these were found to be very limited in textbooks. Also, typical translative styles such as '-ko itta' and '-e ttareumyun' were more prominent in textbooks than in newspaper articles. Abbreviated and omitted forms of particles appeared only in actual articles because of space constraints. It is significant that this paper offers suggestions for the development of reading materials for Korean language education by revealing that the genre typology of actual newspaper articles is not adequately reflected in current textbooks.

Environment for Translation Domain Adaptation and Continuous Improvement of English-Korean Machine Translation System

  • Kim, Sung-Dong;Kim, Namyun
    • International Journal of Internet, Broadcasting and Communication / v.12 no.2 / pp.127-136 / 2020
  • This paper presents an environment for a rule-based English-Korean machine translation system, which supports translation domain adaptation and continuous translation quality improvement. For these purposes, a corpus is essential, from which the information necessary for translation is acquired. The environment consists of a corpus construction part and a translation knowledge extraction part. The corpus construction part crawls news articles from newspaper sites. The extraction part builds translation knowledge such as newly created words, compound words, collocation information, distributional word representations, and so on. For translation domain adaptation, a corpus for the domain should be built and the translation knowledge should be constructed from that corpus. For continuous improvement, the corpus needs to be continuously expanded and the translation knowledge should be enhanced from the expanded corpus. The proposed web-based environment is expected to facilitate the tasks of domain adaptation and translation system improvement.

Enhancement of a language model using two separate corpora of distinct characteristics

  • Cho, Sehyeong;Chung, Tae-Sun
    • Journal of the Korean Institute of Intelligent Systems / v.14 no.3 / pp.357-362 / 2004
  • Language models are essential in predicting the next word in a spoken sentence, thereby enhancing speech recognition accuracy, among other things. However, spoken language domains are too numerous, and therefore developers suffer from a lack of sufficiently large corpora. This paper proposes a method of combining two n-gram language models, one constructed from a very small corpus of the right domain of interest, the other constructed from a large but less adequate corpus, resulting in a significantly enhanced language model. This method is based on the observation that a small corpus from the right domain has high-quality n-grams but a serious sparseness problem, while a large corpus from a different domain has more n-gram statistics but is incorrectly biased. In our approach, the two sets of n-gram statistics are combined by extending the idea of Katz's backoff, and the method is therefore called a dual-source backoff. We ran experiments with 3-gram language models constructed from newspaper corpora of several million to tens of millions of words together with models from smaller broadcast news corpora. The target domain was broadcast news. We obtained a significant improvement (30%) by incorporating a small corpus around one thirtieth the size of the newspaper corpus.
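
The idea of combining a small in-domain model with a large out-of-domain model can be illustrated with a short sketch. Note that it uses plain linear interpolation of two bigram models, a simpler and different technique from the dual-source backoff the abstract describes, whose details are not given here. The function names, the toy sentences and the weight `lam` are all assumptions made for illustration.

```python
from collections import Counter


def bigram_model(sentences):
    """Return a maximum-likelihood bigram probability function P(word | prev)."""
    history_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        for prev, word in zip(tokens, tokens[1:]):
            history_counts[prev] += 1
            pair_counts[(prev, word)] += 1

    def prob(word, prev):
        # An unseen history yields 0.0; this sketch applies no smoothing.
        if history_counts[prev] == 0:
            return 0.0
        return pair_counts[(prev, word)] / history_counts[prev]

    return prob


def interpolate(p_small, p_large, lam=0.5):
    """Mix two models: lam weights the small in-domain model (lam is an assumed value)."""
    def prob(word, prev):
        return lam * p_small(word, prev) + (1 - lam) * p_large(word, prev)

    return prob


# Toy corpora, invented for illustration only.
broadcast = ["the anchor reports"]                      # small, in-domain
newspaper = ["the paper reports", "the paper prints"]   # large, out-of-domain
mixed = interpolate(bigram_model(broadcast), bigram_model(newspaper), lam=0.5)
print(mixed("anchor", "the"), mixed("paper", "the"))
```

The mixed model gives non-zero probability both to in-domain bigrams that the large corpus never saw and to bigrams that only the large corpus covers, which is the sparseness-versus-bias trade-off the abstract points to.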

Relative Clauses in a Modern Diachronic Corpus of Singapore English

  • Lee, Kit Mun
    • Asia Pacific Journal of Corpus Research / v.1 no.1 / pp.31-60 / 2020
  • This paper investigates changes in relativization in Singapore English broadsheet newspapers from 1993 to 2016. One of the first diachronic studies in Singapore English (SgE), it also explores corresponding data from the diachronic Siena-Bologna (SiBol) news corpus. As SgE is in the endonormative stabilization phase in Schneider's (2007) Dynamic Model of postcolonial Englishes, divergence from British English (BrE) is to be expected. In this study, the dataset is a new Singapore English Newspaper (SEN) corpus compiled from local news articles in 1993, 2005 and 2016, and the corpus tool employed is Sketch Engine. The results reveal changes in relativization practices in SEN over the given period, many of which follow a pattern similar to those identified in SiBol, albeit at varying rates of change. The most significant of these include a sharp decline in the 'which' relativizer in restrictive relative clauses with non-animate antecedents, complemented by a rise in 'that'. The change has been so rapid that although 'which' relative clauses were more common than 'that' clauses in 1993, 'that' has subsequently overtaken 'which' in both corpora. One shift in SEN that is different from SiBol is the increase in frequency of non-restrictive relative clauses in SgE. The likely motivators for the changes in the two varieties are identified as colloquialization, densification and prescriptivism. The effect each of these factors could have had on the varieties is discussed, as well as the implications that the findings have for our understanding of the evolutionary status of SgE as a postcolonial variety.

Collocation Networks and Covid-19 in Letters to the Editor: A Malaysian Case Study

  • Joharry, Siti Aeisha;Turiman, Syamimi
    • Asia Pacific Journal of Corpus Research / v.1 no.1 / pp.1-30 / 2020
  • The present study examines language used to talk about the global coronavirus pandemic during a three-month period of movement control order in Malaysia. More specifically, a corpus of online letters to the editor of a popular national newspaper was collected during the time in which the official quarantine instruction was initiated, resulting in a total of 303 online letters written by Malaysians that were analyzed using corpus linguistics techniques. For this purpose, the latest version of #LancsBox 5.0 (Brezina et al., 2020) is used to analyze patterns of language surrounding the portrayal of Covid-19 and to visualize them using collocation networks. Findings present 25 statistically significant collocates that share an interesting relationship in revealing what the letters are about, thus reflecting how Malaysians perceive and receive news about the pandemic during this time. Recurring topics and expressions include describing the virus through metaphorical language (Covid-19 does not discriminate), preparing for an economic fallout (Prihatin Economic Stimulus Package), and a preference for describing Covid-19 as a pandemic (impacts of the Covid19 pandemic) rather than an outbreak (first/second/third wave of the outbreak). Implications of the study resonate with findings from Azizan et al. (2020), where constructions of positive discourse among Malaysian writers may reflect the culture and society that make up the nation.