• Title/Summary/Keyword: N-GRAM

Measurement for License Identification of Open Source Software (오픈소스 소프트웨어 라이선스 파일 식별 기술)

  • Yun, Ho-Yeong;Joe, Yong-Joon;Jung, Byung-Ok;Shin, Dong-Myung
    • Journal of Software Assessment and Valuation / v.12 no.2 / pp.1-8 / 2016
  • In this paper, we study extracting and identifying the license file of a package, in order to prevent unintentional intellectual property infringement caused by lost, modified, or conflicting license information when open source software is redistributed. To investigate the characteristics of license files, we analyzed 322 licenses with n-gram and TF-IDF methods and extracted license files from the packages. We then identified the license by measuring the cosine similarity between each extracted file and the registered licenses.
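The combination of n-grams, TF-IDF, and cosine similarity named in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names (`word_ngrams`, `tfidf_vectors`, `identify`), the choice of word bigrams, and the smoothed IDF formula are all assumptions made for the example.

```python
import math
from collections import Counter


def word_ngrams(text, n=2):
    """Return the word n-grams (as tuples) of a lowercased text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs, n=2):
    """Build one sparse TF-IDF vector (a dict) per document."""
    counts = [Counter(word_ngrams(d, n)) for d in docs]
    df = Counter(g for c in counts for g in c)  # document frequency per n-gram
    total = len(docs)
    # Smoothed IDF (an assumption): n-grams present in every document
    # still keep a nonzero weight.
    return [
        {g: tf * (math.log((1 + total) / (1 + df[g])) + 1) for g, tf in c.items()}
        for c in counts
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def identify(candidate, registered, n=2):
    """Return the best-matching registered name and all similarity scores."""
    names = list(registered)
    vectors = tfidf_vectors([registered[k] for k in names] + [candidate], n)
    cand = vectors[-1]
    scores = {name: cosine(cand, v) for name, v in zip(names, vectors)}
    return max(scores, key=scores.get), scores
```

Here `registered` is a hypothetical mapping from license name to license text; a candidate file is assigned to whichever registered text it is most similar to.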

DGA-DNS Similarity Analysis and APT Attack Detection Using N-gram (N-gram을 활용한 DGA-DNS 유사도 분석 및 APT 공격 탐지)

  • Kim, Donghyeon;Kim, Kangseok
    • Journal of the Korea Institute of Information Security & Cryptology / v.28 no.5 / pp.1141-1151 / 2018
  • In an APT attack, communication between infected hosts and the C&C (Command and Control) server is the key stage for intruding into the target. Attackers control multiple infected hosts through the C&C server and direct intrusion and exploitation from it. If the C&C server is exposed at this stage, the attack fails. In recent years, attackers have therefore used Domain Generation Algorithms (DGA) to replace the C&C server's DNS name at short intervals, which makes detection difficult. In particular, it is very difficult to verify and detect all newly registered DNS names, of which there are more than 5 million a day. To address these problems, this paper proposes a model that detects DGA-DNS by analyzing the morphological similarity between normal DNS and DGA-DNS names and uses the result to identify signs of an APT attack; we then verify its validity.
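One simple way to compare the character make-up of domain names with n-grams is sketched below. It is an assumption-laden illustration and not the model proposed in the paper: the function names, the use of character bigrams, the restriction to the first DNS label, and the "fraction of known bigrams" score are all hypothetical choices.

```python
def char_ngrams(label, n=2):
    """Return the character n-grams of a lowercased string."""
    label = label.lower()
    return [label[i:i + n] for i in range(len(label) - n + 1)]


def build_profile(normal_domains, n=2):
    """Collect every character n-gram seen in the first label of normal domains."""
    profile = set()
    for domain in normal_domains:
        profile.update(char_ngrams(domain.split(".")[0], n))
    return profile


def familiarity(domain, profile, n=2):
    """Fraction of the domain's n-grams that also occur in the normal profile.

    Values near 0 mean the name shares almost no n-grams with normal names.
    """
    grams = char_ngrams(domain.split(".")[0], n)
    if not grams:
        return 0.0
    return sum(g in profile for g in grams) / len(grams)
```

With a profile built from a list of known-benign names, a low `familiarity` score flags a name for closer inspection; any threshold would have to be chosen empirically.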

A Design on Informal Big Data Topic Extraction System Based on Spark Framework (Spark 프레임워크 기반 비정형 빅데이터 토픽 추출 시스템 설계)

  • Park, Kiejin
    • KIPS Transactions on Software and Data Engineering / v.5 no.11 / pp.521-526 / 2016
  • Because online informal text data is massive in volume and unstructured in nature, traditional relational data model technologies are of limited use for storing and analyzing it. Moreover, it is hard to analyze social users' real-time reactions from dynamically generated, massive social data. In this paper, to capture the semantics of massive, informal online documents easily with an unsupervised learning mechanism, we design and implement an automatic topic extraction system based on the mass of the words that make up a document. The input data set for the proposed system is generated first, using an N-gram algorithm to build multi-word terms that capture the meaning of sentences more precisely, and Hadoop and Spark (an in-memory distributed computing framework) are adopted to run the topic model. In the experiments, TB-level input data is preprocessed and the proposed topic extraction steps are applied. We conclude that the proposed system performs well at extracting meaningful topics in time, because intermediate results are read directly from main memory instead of from an HDD.

An Approach to Detect Spam E-mail with Abnormal Character Composition (비정상 문자 조합으로 구성된 스팸 메일의 탐지 방법)

  • Lee, Ho-Sub;Cho, Jae-Ik;Jung, Man-Hyun;Moon, Jong-Sub
    • Journal of the Korea Institute of Information Security & Cryptology / v.18 no.6A / pp.129-137 / 2008
  • As the use of the internet increases, the distribution of spam mail has also vastly increased. Email was originally used mainly to exchange information, but it is now used more and more for advertising and malware distribution. This is a serious problem because spam consumes a large amount of limited internet resources. Furthermore, extensive computer, network, and human resources are consumed to prevent it. As a result, much research is being done to prevent and filter spam. Currently, research is being done on spam composed of readable sentences that do not use proper grammar. This type of spam cannot be classified by earlier vocabulary analysis or document classification methods. This paper proposes a method to filter spam that uses the subject of the mail with N-gram indexing and Bayesian and SVM algorithms for classification.
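The idea of indexing mail subjects by N-grams and classifying them with a Bayesian method can be sketched as below. This is only an illustration under stated assumptions: character trigrams, multinomial naive Bayes with add-one smoothing, and the class name `NgramNaiveBayes` are choices made for the example, not details taken from the paper (which also evaluates SVM, not shown here).

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-grams (hypothetical helper)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}       # label -> Counter of n-gram occurrences
        self.docs = Counter()  # label -> number of training subjects
        self.vocab = set()     # every n-gram seen during training

    def fit(self, subjects, labels):
        for subject, label in zip(subjects, labels):
            grams = char_ngrams(subject, self.n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.docs[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, subject):
        grams = char_ngrams(subject, self.n)
        total_docs = sum(self.docs.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, counter in self.counts.items():
            total = sum(counter.values())
            # Log prior plus add-one smoothed log likelihood of each n-gram.
            score = math.log(self.docs[label] / total_docs)
            for g in grams:
                score += math.log((counter[g] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Because the features are character sequences rather than dictionary words, subjects with deliberately distorted spelling still share n-grams with earlier spam.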

Analysis of Traffic Improvement Measures in Transportation Impact Assessment Using Text Mining : Focusing on City Development Projects in Gyeonggi Province (텍스트마이닝을 활용한 교통영향평가 교통개선대책 분석 : 경기도 도시개발사업을 대상으로)

  • Eun Hye Yang;Hee Chan Kang;Woo-Young Ahn
    • The Journal of The Korea Institute of Intelligent Transport Systems / v.22 no.2 / pp.182-194 / 2023
  • Traffic impact assessment plays a crucial role in resolving traffic issues that may arise during the implementation of urban and transportation projects. However, reported results diverge, presumably because the items reviewed differ. In this study, we analyze traffic improvement measures approved for traffic impact assessment, identify key items, and present items that should be included in assessments. Specifically, TF-IDF and N-gram analysis and text mining were performed with focus on urban development projects approved in Gyeonggi Province. The results obtained show that keywords associated with newly established transportation infrastructure, such as roads and intersections, were essential assessment items, followed by the locations of entrances and exits and pedestrian connectivity. We recommend that considerations of the items presented in this study be incorporated into future traffic impact assessment guidelines and standards to improve the consistency and objectivity of the assessment process.

Sentiment Classification considering Korean Features (한국어 특성을 고려한 감성 분류)

  • Kim, Jung-Ho;Kim, Myung-Kyu;Cha, Myung-Hoon;In, Joo-Ho;Chae, Soo-Hoan
    • Science of Emotion and Sensibility / v.13 no.3 / pp.449-458 / 2010
  • As the need grows to obtain information efficiently from the many documents and reviews on the Internet across many fields, automatic classification of opinions or thoughts is required. This automatic classification is called sentiment classification, and it can be divided into three steps: subjective expression classification, which extracts subjective sentences from documents; sentiment classification, which classifies the polarity of documents as positive or negative; and strength classification, which classifies whether the polarity is weak or strong. The latest studies in opinion mining have used N-gram words, lexical phrase patterns, and syntactic phrase patterns as features, rather than single words. Patterns in particular have been used frequently because they are more flexible than N-gram words and more deterministic than single words. These studies mainly concern English; studies using patterns for Korean are still at an early stage. Although in Korean a change in the ending ('Eomi') of a declinable word slightly changes the meaning of a predicate, earlier studies of Korean opinion classification removed endings from predicates and kept only the stems. This study reviews earlier studies and pattern-based methods for English, extracts sentiment patterns from Korean documents, and uses them to classify the polarity of those documents. It also analyzes how changes in endings affect the performance of opinion classification.

A Study on the Changes in Perspectives on Unwed Mothers in S.Korea and the Direction of Government Polices: 1995~2020 Social Media Big Data Analysis (한국미혼모에 대한 관점 변화와 정부정책의 방향: 1995년~2020년 소셜미디어 빅데이터 분석)

  • Seo, Donghee;Jun, Boksun
    • Journal of the Korea Convergence Society / v.12 no.12 / pp.305-313 / 2021
  • This study collected and analyzed big data from 1995 to 2020, focusing on the keywords "unwed mother", "single mother," and "single mom", to propose appropriate directions for government support policy in light of changing perspectives on unwed mothers. The big data collection platform Textom was used to collect data from the portal search sites Naver and Daum and to refine it. The refined data were then subjected to word frequency analysis, TF-IDF analysis, and N-gram analysis, all provided by Textom. In addition, network analysis and CONCOR analysis were conducted with the UCINET6 program. Word frequency analysis and TF-IDF analysis yielded similar words, but the words differed by year. In the N-gram analysis, similar words appeared, but the frequency and form of consecutive word sequences differed considerably. CONCOR analysis showed that different clusters formed by year. This study confirms the change in perspectives on unwed mothers through big data analysis and suggests the need for policies that give unwed mothers various options as independent women, and for policies that embrace pregnancy, childbirth, and parenting without discrimination within new family forms.

Analysis of dieting practices in 2016 using big data (빅데이터를 통한 2016년의 다이어트 실태 분석)

  • Jung, Eun-Jin;Chang, Un-Jae;Jo, Kyungae
    • Korean Journal of Food Science and Technology / v.51 no.2 / pp.176-181 / 2019
  • The aim of this study was to analyze dieting practices and tendencies in 2016 using big data. Keywords related to diet were collected from the portal site Naver and analyzed by simple frequency, N-gram, keyword network, and seasonality analysis. In the simple frequency analysis, exercise had the highest frequency, whereas in the N-gram analysis, diet menu appeared most frequently. The seasonality analysis showed that interest in diet increased steadily from February to July and peaked in October 2016. The monthly frequency of the keyword high-fat diet was highest in October, because the 'Low Carbohydrate High Fat' TV program was shown that month. Although diet showed a certain pattern on a yearly basis, the emergence of new trendy diets in mass media also affects the pattern. Therefore, continuous monitoring and analysis of diet is considered necessary, rather than periodic monitoring.
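The abstract above reports that simple frequency analysis and N-gram analysis ranked different terms first. The sketch below shows how that can happen; the function name `top_unigram_and_bigram` and the toy documents in the usage note are hypothetical and are not the study's data.

```python
from collections import Counter


def top_unigram_and_bigram(docs):
    """Return the most frequent single word and the most frequent word pair."""
    unigrams, bigrams = Counter(), Counter()
    for doc in docs:
        tokens = doc.lower().split()
        unigrams.update(tokens)
        # Adjacent word pairs within the same document.
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams.most_common(1)[0][0], bigrams.most_common(1)[0][0]
```

For the made-up inputs `["diet menu exercise", "exercise diet menu", "morning exercise", "exercise routine diet menu"]`, the word "exercise" occurs most often on its own, while the pair ("diet", "menu") is the most frequent bigram, because "exercise" appears next to a different neighbor each time.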

An Analysis on Media Trends in Public Agency for Social Service Applying Text Mining (텍스트 마이닝을 적용한 사회서비스원 언론보도기사 분석)

  • Park, Hae-Keung;Youn, Ki-Hyok
    • Journal of Internet of Things and Convergence / v.8 no.2 / pp.41-48 / 2022
  • This study empirically explored which issues, that is, which social perceptions, have formed around the public agency for social service (hereafter SSA), using mass media coverage of the SSA. The study is meaningful in that it identifies the overall social perception of and trends around the SSA through public opinion. To extract media trend data, the big data analysis system Textom was used to collect data from the representative portals Naver News and Daum News. The collected texts numbered 1,299 in 2020 and 1,410 in 2021, for a total of 2,709. First, the most frequent words in the texts were 'SSA', 'establishment', and 'operation'. Second, the N-gram analysis yielded word pairs directly related to the SSA, such as 'SSA and public', 'SSA and opening', 'SSA and launch', 'SSA and Department Director', 'SSA and Staff', and 'SSA and Caregiver'. Third, the TF-IDF analysis and word network analysis, like the word frequency and N-gram results, yielded 'establishment', 'operation', 'public', 'launch', 'provided', 'opened', 'Holding', and 'Care'. Based on these results, the study suggests strengthening the emergency care support group, commercializing it in detail, and stabilizing jobs.

Synthesis and Biological Activity of 5-S-GAD(N-${\beta}$-alanyl-5-S-glutathionyl-3,4-dihydroxyphenylalanine), a Novel Antibacterial Substance (신규 항균물질 5-S-GAD(N-${\beta}$-alanyl-5-S-glutathionyl-3,4-dihydroxyphenylalanine)의 합성 및 생리활성)

  • Leem, Jae-Yoon;Park, Ho-Yong;Natori, Shunji
    • YAKHAK HOEJI / v.42 no.3 / pp.248-256 / 1998
  • We previously reported the purification of N-${\beta}$-alanyl-5-S-glutathionyl-3,4-dihydroxyphenylalanine (5-S-GAD), a novel antibacterial substance, from immunized adult Sarcophaga peregrina (flesh fly). We found that the antibacterial activity of synthetic 5-S-GAD equals that of authentic 5-S-GAD, with no specificity between Gram-positive and Gram-negative bacteria. Significant synergism was detected between 5-S-GAD and streptomycin against the streptomycin-resistant strain E. coli K12 594. 5-S-GAD shows antitumor activity against several tumor cell lines at a concentration of $100{\mu}M$. However, no cytotoxic activity against murine macrophages was detected at a concentration of $500{\mu}M$. Furthermore, no haemolytic activity against sheep erythrocytes was detected at the same concentration. We suggest that the S-conjugation of glutathione with dihydroxyphenylalanine might be important for increasing the antibacterial activity of dihydroxyphenylalanine.
