Proceedings of the Korean Society for Language and Information Conference (한국언어정보학회:학술대회논문집)
Korean Society for Language and Information (KSLI)
- Annual
Domain
- Linguistics > Linguistics, General
2002.02a
-
This talk provides an overview of current work in my research group on the syntactic annotation of the Tübingen corpus of spoken German and of the German Reference Corpus (Deutsches Referenzkorpus: DEREKO) of written texts. Morpho-syntactic and syntactic annotation, as well as annotation of function-argument structure, is performed automatically for these corpora by a hybrid architecture that combines robust symbolic parsing using finite-state methods ("chunk parsing" in the sense of Abney) with memory-based parsing (in the sense of Daelemans). The resulting robust annotations can be used by theoretical linguists, who are interested in large-scale empirical data, and by computational linguists, who need training material for a wide range of language technology applications. To aid retrieval of annotated trees from the treebank, a query tool, VIQTORYA, with a graphical user interface and a logic-based query language has been developed. VIQTORYA allows users to query the treebanks for linguistic structures at the word level, at the level of individual phrases, and at the clausal level.
-
As part of a long-range project that aims at establishing database-theoretic semantics as a model of computational semantics, this presentation focuses on the development of a syntactic component for processing strings of words or sentences to construct semantic data structures. For design and modeling purposes, the present treatment will be restricted to the analysis of some problematic constructions of Korean involving semi-free word order, conjunction and temporal anchoring, and adnominal modification and antecedent binding. The present work relies heavily on Hausser's (1999, 2000) SLIM theory of language, which is based on surface compositionality, time-linearity and two other conditions on natural language processing. Time-linear syntax for natural language has been shown to be conceptually simple and computationally efficient. The associated semantics is complex, however, because it must deal with situated language involving interactive multi-agents. Nevertheless, by processing input word strings in a time-linear mode, the syntax can incrementally construct the necessary semantic structures for relevant queries and valid inferences. The fragment of Korean syntax will be implemented in Malaga, a C-type implementation language that was enriched for both programming and debugging purposes and that was made particularly suitable for implementing Left-Associative Grammar. This presentation will show how the system of syntactic rules with constraining subrules processes Korean sentences in a step-by-step time-linear manner to incrementally construct semantic data structures that mainly specify relations with their argument, temporal, and binding structures.
-
Automatic identification of Chinese personal names in unrestricted texts is a key task in Chinese word segmentation, and can affect other NLP tasks such as word segmentation and information retrieval if it is not properly addressed. This paper (1) demonstrates the problems of Chinese personal name identification in some IT applications, (2) analyzes the structure of Chinese personal names, and (3) presents the relevant processing strategies. The geographical differences in Chinese personal names between Beijing and Hong Kong are highlighted at the end. The paper shows that variation in names across different Chinese communities constitutes a critical factor in designing Chinese personal name identification algorithms.
-
One main complexity of copula constructions concerns a mismatch between morphology and syntactic constituency: the copula seems to form a morphological unit with the immediately preceding element, whereas in terms of syntax the copula appears to take this element as its syntactic complement. In capturing such mismatches, we show that the copula is treated as an independent verb at the level of tectogrammatical structure (or syntax tree), but as a bound morpheme at the level of phenogrammatical structure (or domain tree), in the terms of Dowty 1992 (or Reape 1994). This paper, adopting the notion of DOMAIN in HPSG, shows that copula constructions are a subtype of compacting constructions. These constructions compact the domain value of the copula and that of its preceding element into one domain unit, eventually making it inert to syntactic phenomena such as scrambling, deletion and pro-form substitution. This construction-based approach provides a clean analysis of the formation of the copula construction and related phenomena.
-
Information extraction delimits in advance, as part of the task specification, the semantic range of the output and filters information from large volumes of text. The most representative words of a document are named entities and pronouns, so resolving coreference is important for extracting meaningful information. Coreference resolution finds the named entities in a document that refer to the same real-world entity. Its results are used for named entity detection and template generation. This paper presents a heuristic-based approach to coreference resolution in Korean. We built the heuristics by expanding them gradually using a corpus, and derived salience factors of antecedents as a measure of importance in Korean. Our approach consists of antecedent selection and antecedent weighting. We used three kinds of salience factors to weight each antecedent of the anaphor. The experimental result shows 80% precision.
-
In this paper, we address two questions concerning negative imperatives in Korean: (i) what is the morpho-syntactic nature of mal in negative imperatives? and (ii) why is it impossible to form negative imperatives with short negation an? We will argue that the clause structure of imperatives includes a projection of deontic modality and a projection of an imperative operator encoding illocutionary force, and that mal is a lexicalization of long negation and deontic modality. We then propose that a negative imperative with short negation is ruled out because such a construction maps onto an incoherent interpretation, which can be spelled out as 'I direct you to bring about a negative state or a negative event.'
-
This paper discusses issues in building a 54-thousand-word Korean Treebank using a phrase structure annotation, along with developing annotation guidelines based on the morpho-syntactic phenomena represented in the corpus. The methods employed for quality control and the evaluation of the Treebank are also presented.
-
Structural analysis of compound words is a necessary and important process in natural language processing. Proposed here is a corpus- and statistics-based method for the structural analysis of compound words in Japanese. We determine the structure of a compound word by using an Internet corpus and calculating the strength of word association among its constituent words. Experiments with 5-, 6-, 7-, and 8-kanji compound words show that our method works well and that its performance is better than that of other comparable studies.
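The idea of deriving a compound's structure from association strength can be illustrated with a minimal sketch. It assumes a greedy bottom-up strategy (repeatedly merging the most strongly associated adjacent pair) and an average-pairwise association measure; both are assumptions made for illustration, not the procedure reported in the paper. The romanized constituents and the scores in `example_scores` are hypothetical values, not corpus statistics.

```python
from itertools import product

def leaves(node):
    """Return the constituent words covered by a tree node."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)

def association(a, b, scores):
    """Average pairwise score between the words of two adjacent units (assumed measure)."""
    pairs = list(product(leaves(a), leaves(b)))
    return sum(scores.get(frozenset(p), 0.0) for p in pairs) / len(pairs)

def bracket(words, scores):
    """Greedily merge the most strongly associated adjacent pair until one tree remains."""
    units = list(words)
    while len(units) > 1:
        best = max(range(len(units) - 1),
                   key=lambda i: association(units[i], units[i + 1], scores))
        units[best:best + 2] = [(units[best], units[best + 1])]
    return units[0]

# Hypothetical association scores for a three-constituent compound.
example_scores = {
    frozenset({"shizen", "gengo"}): 0.9,
    frozenset({"gengo", "shori"}): 0.4,
}
example_tree = bracket(["shizen", "gengo", "shori"], example_scores)
print(example_tree)
```

With these made-up scores the first two constituents are grouped before the third is attached. In practice the scores would have to be estimated from corpus co-occurrence counts.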
-
The issue of adjuncts has long been a neglected field of linguistic study, whether syntactic or semantic. It is only in Pustejovsky (1995) that we find a brief mention of adjuncts. In addition to what the author calls true arguments, default arguments, and shadow arguments, he sets up a class of true adjuncts, citing the following sentence: Mary drove down to New York on Tuesday. We will take up a small lexical item, sugiru, in Japanese, and we will argue that we should posit the notion of implicit adjuncts in describing the properties of this item. Throughout the discussion that follows we will demonstrate how the notion is independently motivated irrespective of which linguistic theory we adopt.
-
This paper aims to explain the combination of certain nouns with the verb ha- 'do'. Although the verb ha- 'do' normally takes an event-type argument, it also takes some substantival nouns such as paiolin 'violin', umsikcem 'restaurant', and so on. A substantival noun undergoes type shifting because the governing verb ha- 'do' coerces an entity-type noun to an event reading, taking the missing information from the qualia of the entity-type noun. In addition, some nouns like ppallay 'laundry' are dot objects. A verb taking a dot object selects the proper type from among the multiple subtypes of the dot object. The type pumping operation makes that selection possible.
-
Following Herburger (2000), I will develop an event-based semantics for Japanese emphatic particles which can address the issue of the mechanism of association with focus involving the emphatic particles. The proposed semantics makes use of Herburger's three key ideas: events as basic entities, decomposition of predicates into subatomic formulas, and separation of backgrounded and foregrounded information.
-
This paper compares secondary predication constructions such as small clause complements, resultatives, and depictives in English and Korean. It argues that these two typologically different languages employ different modes of satisfying the Case Filter with regard to the Case of the subjects of small clauses. More specifically, it is argued that the subject of a small clause in English is Accusative Case-marked by the higher governing verb, while that in Korean is Nominative Case-marked by default.
-
The so-called overapplication of Coda Neutralization in Korean, the occurrence of a neutralized consonant in a non-neutralizing environment, is often considered evidence for serial derivation. In this paper I propose that the neutralization effect at the surface is not the result of a phonological process at an intermediate level in serial derivation, but is due to a constraint requiring the integrity of the morphological constituent: EDGE-INTEGRITY. It is argued that this is not reducible to an alignment constraint, but is a genuine faithfulness constraint on the edge of a morphological constituent. The putative opacity related to coda neutralization is shown to be an epiphenomenon arising from the ambisyllabic representation of a consonant at a morphological juncture, satisfying both EDGE-INTEGRITY and Syllabic Conditions. Consonant Copy in the Jeju dialect provides further evidence for EDGE-INTEGRITY, the only difference being that the conflict between Syllabic Conditions and EDGE-INTEGRITY is resolved by insertion of a copied consonant.
-
An Alignment based technique for Text Translation between Traditional Chinese and Simplified Chinese
Aligned parallel corpora have proved very useful in many natural language processing tasks, including statistical machine translation and word sense disambiguation. In this paper, we describe an alignment technique for extracting transfer mappings from a parallel corpus. While building our system and collecting data, we observed that there are three types of translation approaches that can be used. We focus especially on lexical translation between Traditional Chinese and Simplified Chinese text and on a method for extracting transfer mappings for machine translation.
-
This paper describes our ongoing Korean-Chinese machine translation system, which is based on verb patterns. A verb pattern consists of a source language pattern part for analysis and a target language pattern part for generation. Knowledge description at the lexical level makes it easy to achieve accurate analyses and natural, correct generation. These features are very important and effective in machine translation between languages with quite different linguistic structures, such as Korean and Chinese. We performed a preliminary evaluation of our current system and report the result in the paper.
-
A homonym can be disambiguated by other words in its context, such as the nouns and predicates used with it. Using semantic information (co-occurrence data) obtained from the definitions in the part-of-speech (POS) tagged UMRD-S, we analyzed the results of an experiment on a homonym disambiguation system based on a statistical model to which Bayes' theorem is applied, and we suggest a model that adds a weight for sense rate and a weight for distance to the adjacent words in order to improve accuracy. Applying the homonym disambiguation system using semantic information to homonyms appearing in dictionary definition sentences yielded an average accuracy of 98.32% for the 200 most frequent homonyms. We selected 49 (31 substantives and 18 predicates) of the 200 homonyms used in that experiment and performed an experiment on 50,703 sentences, each containing one of the 49 homonyms, extracted from the Sejong Project tagged corpus (i.e. a corpus of morphologically analyzed words) of 3.5 million words. Assigning the weight of sense rate (prior probability) and the weight of distance for the 5 words before and after the homonym to be disambiguated improved accuracy by 2.93% over disambiguation systems based on existing statistical models.
-
Korean predicative verb forms obligatorily denote three categories (speech level, mood and sentence type) that are not handled by most automatic word form recognition systems for this language. These categories are marked by special endings. This paper examines predicative verb forms, concentrating on the lexical description of these endings in the framework of Left-Associative Grammar (LAG). Additionally, this paper suggests a system to analyse verb forms in these respects. The results of this study have been implemented using Malaga and integrated into an automatic word form recognition system for Korean called KMM (Korean Malaga Morphology).
-
This paper proposes that concession should be analysed as involving scalar implicatures and that an alternative set of situations has to be assumed to account for the relative nature of the likelihood of event occurrence. This paper also claims that the notion of likelihood is the basis of the corresponding pragmatic inference and of a universal quantification effect. Unexpectedness, which is conceptually tied to concession, on the other hand involves the same kind of pragmatic inference but presupposes the existence of an alternative set of individuals instead of an alternative set of situations.
-
The purpose of this paper is to analyze the lexical-semantic structure of morphologically derived passive verbs in Korean based on Pustejovsky's (1995) Generative Lexicon Theory (GL) and to explain the change in the root verb's lexical-semantic structure brought about by passivization. Passivization in this paper is defined as unaccusativization. In the Argument Structure of derived passive verbs, the agent argument is deleted and the theme argument is realized as a syntactic subject. As for Event Structure, derived passives express a left-headed event (achievement), whereas their roots denote a right-headed event (accomplishment). In Qualia Structure, passive verbs and root verbs have the same Formal Role, but in the Agentive Role of passive verbs, an act weakens to a process. Both Formal and Agentive Roles have the same theme argument.
-
ECM across a CP in Korean poses difficulties from the standpoint of the locality of A-movement/agreement. A phase-based analysis is developed which requires two steps: (i) in the embedded CP, vP/VP containing its VP-internal subject first moves to Spec-CP, which renders the subject accessible to the matrix v, in accordance with Chomsky's Phase Impenetrability Condition; (ii) ECM takes place in a local relation between the matrix v and the embedded subject. It is shown that the otherwise puzzling fact that ECM across a CP, but not Passivization across a CP, is affected by the type of the embedded verb in Korean is accounted for in a principled way, based on the assumption that CP and vP, but not TP and VP, are phases.
-
In this paper, we propose a method of generating a proper categorization of morphemes, given a hierarchical part-of-speech system and a corpus tagged using this part-of-speech system. Our method uses hierarchical information in the part-of-speech system and statistical information in the corpus to generate a category set. The statistical information is based on the context of occurrence of categories. First, we specify the format of the given information. Then, we describe an algorithm to generate a proper categorization. Finally, we present the results of our experiments in applying this method. We obtained a moderately proper categorization and found several candidates for improvement.
-
Japanese coordinate noun phrases formed with the particle TO are often ambiguous, at the level of deep case structure, between two parallel propositions (and) and a mutual case relation (with or against). Although such phrases are widely used, resolving this ambiguity has been a hard problem. We propose a method of resolving the ambiguity by analyzing the mutualness of verbs and adjectives. Mutualness is determined by three features of each verb or adjective. The first feature indicates whether mutual expression is permitted in the subject, and the second whether it is permitted in the object. The last shows whether a verb is voluntary. Using this method we design a parsing mechanism in which the matching of features is represented as neutralization between predicate arguments.
-
This paper investigates the internal structure of finite small clauses (FSC). I will propose that an FSC is base-generated at Spec-CP and that a null operator is involved to check the formal features of the embedded T and turn a sentence into a predicate.
-
We propose an algorithm for the automatic acquisition of a bilingual lexicon in the legal domain. We make use of a parallel corpus of bilingual court judgments, aligned to the sentence level, and analyse the bilingual context profiles to extract corresponding legal terms in both languages. Our method is different from those in past studies as it does not require any prior knowledge source, and naturally extends to multi-word terms in either language. A pilot test was done with a sample of ten legal terms, each with ten or more occurrences in the data. Encouraging results of about 75% average accuracy were obtained. This figure does not only reflect the effectiveness of the method for bilingual lexicon acquisition, but also its potential for bilingual alignment at the word or expression level.
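How translation candidates can be read off a sentence-aligned corpus without any prior dictionary can be sketched as follows. The abstract does not spell out how the bilingual context profiles are compared, so this sketch swaps in a plain Dice coefficient over sentence-level co-occurrence as an assumed stand-in. The function name `dice_candidates`, the toy corpus, and the `x_...` target tokens are all hypothetical.

```python
from collections import Counter

def dice_candidates(term, aligned_pairs, top_n=3):
    """Rank target-side tokens by Dice association with a source-side term."""
    term_count = 0               # aligned pairs whose source side contains the term
    target_counts = Counter()    # aligned pairs containing each target token
    joint_counts = Counter()     # aligned pairs containing both term and token
    for source_tokens, target_tokens in aligned_pairs:
        target_set = set(target_tokens)
        target_counts.update(target_set)
        if term in source_tokens:
            term_count += 1
            joint_counts.update(target_set)
    scored = [(2 * joint / (term_count + target_counts[tok]), tok)
              for tok, joint in joint_counts.items()]
    # Highest score first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(tok, round(score, 3)) for score, tok in scored[:top_n]]

# Toy sentence-aligned corpus with placeholder target-language tokens.
toy_pairs = [
    (["the", "appellant", "appealed"], ["x_appellant", "x_appeal"]),
    (["the", "appellant", "lost"], ["x_appellant", "x_lose"]),
    (["the", "court", "agreed"], ["x_court", "x_agree"]),
]
toy_ranking = dice_candidates("appellant", toy_pairs)
print(toy_ranking)
```

On this toy data the placeholder `x_appellant` ranks first because it occurs in exactly the aligned sentences that contain the source term. Extending the sketch to multi-word terms would require matching token sequences rather than single tokens.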
-
I examine various controversial aspects of Chinese prosody (tone structure, syllable structure, stress, and intonation) and stress the need to view all of these as interacting systems, aspects of a hierarchical prosodic structure. I examine various proposals at these various levels of the hierarchy and suggest which are most appropriate. Specifically, I suggest the adoption of Bao's version of syllable and tone, and Chen's account of stress. As for intonation, it is still not possible to make any definitive claims regarding an optimal model, but I examine work done by Kratochvil, Shih, and Gårding et al., and suggest promising directions for future work.
-
This paper reports a system that assists telephone operators at the Directorate General of Telecommunications (DGT) in Taiwan with the call routing task. The system was developed on the basis of the DGT organization profile, with descriptions of its six divisions, instead of a corpus of recorded and transcribed call-routing dialogs. An acoustic module and an information retrieval (IR) module were built specifically for this task. The construction of the IR module was based on term extraction and thesaurus discovery processes. By integrating the acoustic and IR modules, the system achieves satisfactory performance and provides a promising approach to call routing. Simulation results indicated that the proposed algorithm outperforms standard classification methods. A working system based on the proposed approach has been implemented and experimental results are presented.
-
This paper aims to account for backward anaphora that seem to violate the c-command requirement on anaphor-antecedent relations. It has been claimed that the binding conditions should apply at LF for the backward binding cases involving psych-verbs and causatives. Under recent developments in minimalism, where the concept of levels is abandoned in favor of cyclic derivation, the data that show backward binding phenomena have not been discussed within binding theory. In this paper, I argue that the backward binding cases can be incorporated into the core binding phenomena with general assumptions about thematic prominence. I discuss how the dependency between NPs involving backward anaphora is determined by thematic prominence. The Agree operation takes place between the probe T and the goal with the uninterpretable u[a] and [prominent] feature, by which an anaphor is valued, producing a proper interpretation.
-
Since a thesaurus is used as a knowledge resource in many natural language processing systems, it is very useful and necessary for high-quality systems, especially those dealing with semantics. In this paper, we introduce a semi-automatic method for constructing a Korean noun semantic hierarchy by utilizing a monolingual MRD and an existing thesaurus.
-
This paper provides computational algorithms for the Korean reflexive caki, for which both sentence-bound and long-distance readings are possible. Its analyses are based on Chierchia's theory in Categorial Grammar, and a CCG-like system is introduced for the implementation. In this system, we can get both readings of caki with the same resolution mechanisms; the difference lies in where the reflexive is resolved. These algorithms enable us to account for the distributions and characteristics of the long-distance reflexive caki in a more unified way.
-
Taiwanese abounds with verbal complexes. Among them, phasal complexes, resultative complexes, and directional complexes are alike in that their second component denotes some sort of result. Moreover, they behave similarly in that they can occur in V-ho-Y, V-e/be-Y, and V-bo-V forms. Despite the similarities, they still differ from one another in several aspects, such as whether objects are allowed inside or after the verbal complex, whether infixing changes their basic meaning, etc. This paper examines their individual properties carefully and proposes that these three types of complexes are all different from one another in their formation and thus the difference in their syntactic behavior. Directional complexes are syntactic phrases, resultative complexes are compounds derived in syntax, and while some phasal complexes are also syntactically derived compounds, others are compounds formed in the lexicon. This paper aims to argue that words (or compounds in this case) can be formed in syntax as well as in the lexicon.
-
This paper discusses issues in polysemy with respect to the verbs in WordNet. The hypernymy/hyponymy structure of multiple senses is observed when we try to build a bilingual network for Chinese and English. There are several types of polysemic patterns, and a co-hypernym may have the same word form as its subordinates. Fellbaum (2000) dubbed "autotroponymy" the case in which verbs linked by a manner relation share the same verb form. However, her syntactic criteria do not seem compatible with the hierarchies in WN. Either the criteria or the network should be reconsidered. For most verbs in WN 1.7, polysemous relations are unlikely to extend over 3 levels of the IS-A relation. Highly polysemous verbs are more complicated and may be involved in certain semantic structures. Semi-automatic sense grouping may be helpful for multilingual information retrieval.
-
No Abstract(See Full Text)
-
This paper discusses constraints on grammaticalization, a primarily diachronic process through which lexical elements take on grammatical functions. In particular, it will argue that two constraints on this process, namely Persistence and Layering, explain the different distributional patterns of time-relationship adverbs in Japanese, Korean, English and German. Furthermore, it will suggest that the distributional difference between Japanese and Korean time-relationship adverbs is not an isolated phenomenon but is a reflection of the overall semantic typological differences between the two languages in the sense of Hawkins (1986).
-
This paper presents a unified account of three kinds of constructions in which more than one NP can show up with the same case in simple sentences in Japanese and Korean: double subject, double nominative object and double accusative constructions. Noting that the second NPs in these constructions are functional or relational, this paper proposes to assign them the category and type different from the first NPs. We show the derivations of these three constructions in a parallel manner, and explain the asymmetries in extractability between possessor and possessed NPs in relativization.
-
This paper puts forward an analysis of scope interactions between Japanese adverbial quantifiers like mainichi 'everyday' and tokidoki 'sometimes' and a negative morpheme nai 'not' on the basis of f(ocus)-structures. In this analysis, three f-structures are assigned to a sentence with an adverbial quantifier and a negative morpheme. One of them represents a negation-wide reading, and the other two represent quantifier-wide readings. Some f-structures, however, are unacceptable due to semantic or pragmatic factors. Different scope behaviors of the two quantifiers mentioned above can then be ascribed to acceptability of f-structures.
-
Shi (2000) claims that topics must be related to a syntactic position in the comment, thus denying the existence of dangling topics in Chinese. Under Shi's analysis, the dangling topic sentences in Chinese are not topic-comment but subject-predicate sentences. However, Shi's arguments are not without problems. In this paper we argue that topics in Chinese can be licensed not only by a syntactic gap but also by a semantic gap/variable without syntactic realization. Under our analysis, all the dangling topics discussed in Shi (2000) are, in fact, not subjects but topics licensed by a semantic gap/variable that can turn the relevant comment into an open predicate, thus licensing dangling topics and deriving well-formed topic-comment constructions. Our analysis fares better than Shi's not only in unifying the licensing mechanism of a topic as an open predicate, regardless of how the open predicate is derived, but also in unifying the treatment of normal and dangling topics in Chinese.
-
I argue that the so-called psychological predicates like komapta 'thankful,' mwusepta 'fearful,' silhta 'loathsome,' or kulipta 'missing' require a nominative subject and a locative or dative complement, challenging the conventional wisdom, originating with Kuno (1973), that they are two-place "transitive adjectives" requiring a nominative direct object. I also show that those adjectives allow the locative-dative complement to be extracted, which is ultimately realized as a focused subject or a topic. Thus, in this type of double nominative construction, the first nominative is a focused subject, and the second nominative forms an embedded clause with the psychological predicate, which functions as the predicate of the whole sentence.
-
No Abstract(See Full Text)
-
The left-associative grammar model (LAG) has been applied successfully to the morphological and syntactic analysis of various European and Asian languages. The algebraic definition of LAG is very well suited to natural language processing, as it inherently obeys de Saussure's second law (de Saussure, 1913, p. 103) on the linear nature of language, which phrase-structure grammar (PSG) and categorial grammar (CG) do not. This paper describes the so-called Loom-LAGs (LLAG), a specialization of LAGs for the analysis of natural language. Whereas the only means of language-independent abstraction in ordinary LAG is the principle of possible continuations, LLAGs introduce a set of more detailed language-independent generalizations that form the so-called loom of a Loom-LAG. Every LLAG uses the very same loom and adds the language-specific information in the form of a declarative description of the language, much as an ancient mechanised Jacquard loom would take a program card providing the specific pattern for the cloth to be woven. The linguistic information is formulated declaratively in so-called syntax plans that describe the sequential structure of clauses and phrases. This approach introduces the explicit notion of phrases and sentence structure to LAG without violating de Saussure's second law and without leaving the ground of the original algebraic definition of LAG. LLAGs can in fact be shown to be just a notational variant of LAG, but one that is much better suited to the manual development of syntax grammars for the robust analysis of free texts.
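The principle of possible continuations in ordinary LAG can be illustrated with a toy recognizer. This is a minimal sketch and not a Loom-LAG or a Malaga program: the rule names `r1` to `r3`, the list-based category encoding, and the choice of the formal language a^k b^k c^k as the example are assumptions made for illustration. Each rule combines the category of the sentence start with the next word, strictly left to right, and names the rule package that may apply next.

```python
def r1(cat, word):
    """Another "a": wrap the current category in one more b ... c pair."""
    return ["b"] + cat + ["c"] if word == "a" else None

def r2(cat, word):
    """A "b" cancels the leading b slot of the category."""
    return cat[1:] if word == "b" and cat[:1] == ["b"] else None

def r3(cat, word):
    """A "c" cancels the leading c slot of the category."""
    return cat[1:] if word == "c" and cat[:1] == ["c"] else None

# Rule packages: which rules may be tried after a given rule has applied.
PACKAGES = {r1: (r1, r2), r2: (r2, r3), r3: (r3,)}

def accepts(words):
    """Time-linear recognition: combine the sentence start with one next word at a time."""
    if not words or words[0] != "a":
        return False
    cat, package = ["b", "c"], (r1, r2)   # start state after reading the first "a"
    for word in words[1:]:
        for rule in package:
            new_cat = rule(cat, word)
            if new_cat is not None:
                cat, package = new_cat, PACKAGES[rule]
                break
        else:
            return False                  # no possible continuation for this word
    return cat == []                      # final state: every slot has been cancelled

print(accepts(list("aabbcc")), accepts(list("aabbc")))
```

In this toy grammar at most one rule applies at each step, so a single derivation path suffices; a grammar for natural language would generally need to keep several paths in parallel.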
-
No Abstract(See Full Text)
-
No Abstract(See Full Text)
-
This paper proposes that argument structures be stated using probabilities derived from a corpus in place of a Boolean-valued system of subcategorization. To do this, we build a cognitive model that leads from a situation to an utterance in order to explain the phenomenon of argument ellipsis, although the traditional term ellipsis is not suitable under our new concepts. We claim that the binary distinction is neither rational nor suitable for real syntactic analysis. To solve this problem, we propose two new concepts, argumentness and probabilistic Case structures, by adapting prototype theory. We believe that these concepts are effective for syntactic analysis in NLP.
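Replacing a Boolean subcategorization frame with corpus-derived probabilities can be sketched as a relative-frequency estimate. This is only an illustration under stated assumptions: the abstract does not say how the probabilities are computed, and the function name `argument_probabilities`, the verb label, the case labels, and the observations below are hypothetical.

```python
from collections import Counter, defaultdict

def argument_probabilities(observations):
    """Estimate P(case slot is realized | verb) by relative frequency."""
    verb_totals = Counter()
    slot_counts = defaultdict(Counter)
    for verb, realized_slots in observations:
        verb_totals[verb] += 1
        slot_counts[verb].update(set(realized_slots))
    return {
        verb: {slot: count / verb_totals[verb] for slot, count in slots.items()}
        for verb, slots in slot_counts.items()
    }

# Hypothetical corpus observations: a verb and the case slots overtly realized with it.
toy_observations = [
    ("eat", {"NOM", "ACC"}),
    ("eat", {"ACC"}),
    ("eat", {"ACC"}),
    ("eat", {"NOM", "ACC"}),
]
toy_probs = argument_probabilities(toy_observations)
print(toy_probs)
```

A Boolean frame would simply list both slots as required; the graded values instead record that, in this made-up sample, the nominative slot is left unexpressed half of the time.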
-
No Abstract(See Full Text)
-
The Korean government has adopted the French TGV as a high-speed transportation system, and the first service is scheduled for the end of 2003. TGV-relevant documents comprise huge volumes, of which more than 76% have been translated into English. A large part of the English version is, however, incomprehensible without referring to the original French version. The goal of this paper is to demonstrate how DiET 2.5, a lexicon builder, makes it possible to build with ease a domain-specific terminology lexicon that may contain multimedia and multilingual data with multi-layered logical information. We believe our work represents an important step in enlarging the language scope and the development of electronic lexica, and in providing the flexibility of defining any type of DTD and interconnectivity among collaborators. As an application of DiET 2.5, we would like to build a TGV-relevant lexicon in the near future.