
Subject-Balanced Intelligent Text Summarization Scheme (주제 균형 지능형 텍스트 요약 기법)

  • Yun, Yeoil; Ko, Eunjung; Kim, Namgyu
    • Journal of Intelligence and Information Systems, v.25 no.2, pp.141-166, 2019
  • Recently, channels such as social media and SNS have been generating enormous amounts of data, and the portion of unstructured data represented as text has grown exponentially. Because it is impractical to read all of this text, it is important to access it quickly and grasp its key points. To meet this need for efficient understanding, many studies on text summarization have been proposed for handling and exploiting huge volumes of text data. In particular, many recent methods use machine learning and artificial intelligence algorithms to generate summaries objectively and effectively, an approach called "automatic summarization". However, most text summarization methods proposed to date construct summaries based on the frequency of contents in the original documents. Such summaries struggle to include minor subjects that are mentioned only rarely in the original text. If a summary covers only the major subjects, bias occurs and information is lost, making it hard to ascertain every subject the documents contain. To avoid this bias, one can summarize with attention to the balance between the topics a document contains, so that all of its subjects are represented; even then, an imbalanced distribution across those subjects may remain. To retain subject balance in a summary, it is necessary to consider the proportion of every subject the documents originally have and to allocate space to subjects equally, so that even sentences on minor subjects are sufficiently included. In this study, we propose a "subject-balanced" text summarization method that preserves balance among all subjects and minimizes the omission of low-frequency subjects. For subject-balanced summarization, we use two summary evaluation criteria: "completeness" and "succinctness".
Completeness means that the summary should fully cover the contents of the original documents, and succinctness means that the summary should contain minimal internal duplication. The proposed method has three phases. The first phase constructs subject term dictionaries. Topic modeling is used to calculate topic-term weights, which indicate how strongly each term is related to each topic. From the derived weights, the terms most related to each topic can be identified, and the subjects of the documents emerge from topics composed of semantically similar terms. A few terms that represent each subject well are then selected; we call these "seed terms". However, these terms alone are too few to characterize each subject, so the dictionaries must be enriched with terms similar to the seed terms. Word2Vec is used for this word expansion: after training, word vectors are obtained, and the similarity between any two terms can be derived from their cosine similarity. The higher the cosine similarity between two terms, the stronger their relationship is taken to be. Terms with high similarity to the seed terms of each subject are selected, and after filtering these expanded terms, the subject dictionaries are finally constructed. The next phase allocates a subject to every sentence of the original documents. To grasp the content of each sentence, a frequency analysis is first conducted over the terms in the subject dictionaries. A TF-IDF weight for each subject is then calculated, indicating how much each sentence is about each subject. However, TF-IDF weights can grow without bound, so the per-subject weights of each sentence are normalized to values between 0 and 1.
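The seed-term expansion step of the first phase can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the tiny hand-made 2-d vectors stand in for trained Word2Vec embeddings, and the 0.7 threshold is a hypothetical filtering cutoff.

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense word vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def expand_seed_terms(seeds, vectors, threshold=0.7):
    # Grow a subject dictionary: keep every vocabulary term whose vector
    # is close (cosine >= threshold) to at least one seed term.
    expanded = set(seeds)
    for term, vec in vectors.items():
        if term not in expanded and any(
            cosine(vec, vectors[s]) >= threshold for s in seeds if s in vectors
        ):
            expanded.add(term)
    return expanded

# Toy 2-d vectors standing in for real Word2Vec output.
vectors = {
    "room":  [0.9, 0.1],
    "bed":   [0.8, 0.2],
    "staff": [0.1, 0.9],
    "clerk": [0.2, 0.8],
}
room_subject = expand_seed_terms({"room"}, vectors)  # {"room", "bed"}
```

With a real model, the vectors and vocabulary would come from Word2Vec training on the review corpus; the greedy any-seed criterion is one simple filtering choice among several.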
Each sentence is then assigned the subject with the maximum normalized TF-IDF weight, finally yielding a sentence group for each subject. The last phase is summary generation. Sen2Vec is used to measure the similarity between the sentences of each subject, forming a similarity matrix. By repeatedly selecting sentences, a summary can be generated that fully covers the contents of the original documents while minimizing duplication within the summary itself. For evaluation of the proposed method, 50,000 TripAdvisor reviews were used to construct the subject dictionaries and 23,087 reviews were used to generate summaries. A comparison between summaries from the proposed method and frequency-based summaries verified that the proposed method better preserves the subject balance the documents originally have.
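The repetitive sentence selection of the last phase can be illustrated with a greedy, maximal-marginal-relevance-style loop. The abstract does not give the exact criterion, so the `relevance` scores, the trade-off weight `lam`, and the toy Sen2Vec-style similarity matrix below are all hypothetical.

```python
def greedy_select(relevance, sim, k, lam=0.7):
    # Repeatedly pick the sentence with the best trade-off between
    # relevance (completeness) and similarity to sentences already
    # chosen (succinctness, i.e. low duplication).
    selected, candidates = [], list(range(len(relevance)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Sentences 0 and 1 are near-duplicates; sentence 2 covers another subject.
relevance = [0.90, 0.85, 0.50]
sim = [
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.10],
    [0.10, 0.10, 1.00],
]
picks = greedy_select(relevance, sim, k=2)  # [0, 2]
```

Note how the redundancy penalty skips the near-duplicate sentence 1 in favor of the less relevant but novel sentence 2, which is the effect the completeness/succinctness pair is after.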

A study on the Greeting's Types of Ganchal in Joseon Dynasty (간찰(簡札)의 안부인사(安否人事)에 대한 유형(類型) 연구(硏究))

  • Jeon, Byeong-yong
    • (The)Study of the Eastern Classic, no.57, pp.467-505, 2014
  • For many years I have been working on a series of Korean linguistic studies of Ganchal (old Korean letters), and this study, as part of that series, offers a typology of the [Safety Expression]. For this purpose, [Safety Expression]s were divided into formal types and semantic types, targeting the Chinese Ganchal and Hangul Ganchal of the Modern Korean period (16th-19th centuries). Formal types can be distinguished by whether the elements appear in their normal position, whether anything is omitted, whether the letter is a sending letter or a reply, and whether the correspondents are of unequal status. The normal-position, complete form, which best reveals the typical shape of the [Safety Expression], was made the first type. Normal position with [Own Safety] omitted is the second type, normal position with [Opposite Safety] omitted is the third type, and normal position with the entire [Safety Expression] omitted is the fourth type. The inverted form, the most severe breach of convention in the [Safety Expression], was made the fifth type. The first type refers to the normal-position form, in which [Opposite Safety] precedes [Own Safety], in its completion with every semantic element present; it can be regarded as the most typical and normative in that it contains all components of the [Safety Expression]. The second type consists of [Opposite Safety] alone. Although inferior to the first type as a set pattern, it is by no means outdone in frequency of appearance: because it asks after [Opposite Safety] faithfully, omitting [Own Safety] does not greatly offend politeness, and it makes the Ganchal easier to write, so it is widely used. The third type is a normal-position form showing the configuration [Opposite Safety + Own Safety] with [Opposite Safety] omitted. The fourth type is a normal-position form showing the configuration [Opposite Safety + Own Safety] with the entire [Safety Expression] omitted.
This type is divided into subtype A, in which the [Safety Expression] is omitted entirely, and subtype B, in which a conventional expression such as 'saving trouble' replaces the [Safety Expression]. The fifth type is the inverted form, showing the structure [Own Safety + Opposite Safety], unlike the normal-position types. It is the most severe breach of convention, and real examples are very rare, because putting [Own Safety] first and only afterward asking after [Opposite Safety], to save one's own face, offends against common decency. In addition, the types can be divided into a direct form, in which [Opposite Safety] and [Own Safety] are directly connected, and an indirect form, in which they are separated by the [story]. The semantic types of the [Safety Expression] can be classified by whether the letter is a sending letter or a reply, fast or slow, intimate or not, and isolated or not. In a sending letter, the [Safety Expression] consists of [Opposite Safety (Climate + Inquiry after health + Mental state)] + [Own Safety (Status + Inquiry after health + Mental state)]. Within [Opposite Safety], [Climate] can be subdivided into [Season] information and [Climate (weather)] information, and [Mental state] is divided into the receiver's [Family Safety Mental state] and [Individual Safety Mental state]. In [Own Safety], [Status] is divided into the sender's past situation, [Recent condition], and ongoing situation, [Present condition]. [Inquiry after health] is likewise subdivided into [Family Safety] and [Individual Safety], as is [Safety]. Thus [Inquiry after health] and [Safety] are usually used in pairs, at the level of the [Family] and the [Individual]. This phenomenon seems to have arisen from the extended family system, in which taking care of one's parents or grandparents was the norm.
As for the written reply, the [Safety Expression] consists of [Opposite Safety (Reception + Inquiry after health + Mental state)] + [Own Safety (Status + Inquiry after health + Mental state)]; only in [Opposite Safety] does the semantic structure differ from the sending letter. In [Opposite Safety], [Reception] is divided into [Letter], a Ganchal received directly, and [Message], news received indirectly through other people. [Safety] is divided into [Family Safety] and [Individual Safety], and [Mental state] likewise into [Family Safety Mental state] and [Individual Safety Mental state].