• Title/Summary/Keyword: off-topic document

A Focused Crawler by Segmentation of Context Information (주변정보 분할을 이용한 주제 중심 웹 문서 수집기)

  • Cho, Chang-Hee;Lee, Nam-Yong;Kang, Jin-Bum;Yang, Jae-Young;Choi, Joong-Min
    • The KIPS Transactions:PartB / v.12B no.6 s.102 / pp.697-702 / 2005
  • The focused crawler is a topic-driven document-collecting crawler that has been suggested as a promising alternative for maintaining up-to-date web document indices in search engines. A major problem inherent in previous focused crawlers is their tendency to miss highly relevant documents that are linked from off-topic documents. This problem mainly originates from the lack of consideration of structural information in a document. Traditional weighting methods such as TF-IDF, employed in document classification, can lead to this problem. In order to improve the performance of focused crawlers, this paper proposes a scheme of locality-based document segmentation to determine the relevance of a document to a specific topic. We segment a document into a set of sub-documents using contextual features around the hyperlinks. This information is used to determine whether the crawler should fetch the documents that are linked from hyperlinks in an off-topic document.
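
The idea of scoring each hyperlink by its local context, rather than by the page as a whole, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the fixed word window, the keyword-overlap score, the threshold, and the names `link_contexts`, `relevance`, and `links_to_fetch` are all assumptions made for illustration.

```python
import re

# Simplified patterns; a real crawler would use a proper HTML parser.
ANCHOR_RE = re.compile(r'<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)</a>', re.I | re.S)
TAG_RE = re.compile(r"<[^>]+>")


def tokens(text):
    """Lowercase alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def link_contexts(html, window=4):
    """Return (url, context_tokens) for every hyperlink.

    The context (one "sub-document") is the anchor text plus up to
    `window` words on each side of the link. This windowing rule is an
    assumption, not the segmentation scheme from the paper.
    """
    results = []
    for match in ANCHOR_RE.finditer(html):
        before = tokens(TAG_RE.sub(" ", html[:match.start()]))[-window:]
        anchor = tokens(TAG_RE.sub(" ", match.group(2)))
        after = tokens(TAG_RE.sub(" ", html[match.end():]))[:window]
        results.append((match.group(1), before + anchor + after))
    return results


def relevance(context, topic_terms):
    """Fraction of context tokens that are topic terms (hypothetical score)."""
    if not context:
        return 0.0
    return sum(1 for tok in context if tok in topic_terms) / len(context)


def links_to_fetch(html, topic_terms, threshold=0.2, window=4):
    """URLs whose local context looks on-topic, even on an off-topic page."""
    return [url for url, ctx in link_contexts(html, window)
            if relevance(ctx, topic_terms) >= threshold]


page = (
    "<p>Cheap flights and hotel deals for your holiday.</p>"
    '<p>Off topic, but this <a href="http://example.org/crawler">focused '
    "crawler survey</a> covers search engine indexing.</p>"
    '<p>Do not forget to book your <a href="http://example.org/hotel">hotel '
    "room</a> today.</p>"
)
topic = {"focused", "crawler", "search", "engine", "indexing"}
print(links_to_fetch(page, topic))
```

On this mostly off-topic travel page, only the link whose surrounding words mention the topic passes the threshold, which is the behaviour the abstract describes.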

On-Line Topic Segmentation Using Convolutional Neural Networks (합성곱 신경망을 이용한 On-Line 주제 분리)

  • Lee, Gyoung Ho;Lee, Kong Joo
    • KIPS Transactions on Software and Data Engineering / v.5 no.11 / pp.585-592 / 2016
  • A topic segmentation module divides statements or conversations into topic units. Until now, topic segmentation has progressed in the direction of finding an optimized set of segments for a whole document, considering it all together. However, some applications need topic segmentation for a part of a document that is not yet finished. In this paper, we propose a model that performs topic segmentation while the statement is still in progress, using a supervised learning model based on a convolutional neural network. To show the effectiveness of our model, we perform topic segmentation experiments in both on-line and off-line settings using the C99 algorithm. Our model achieves Pk scores of 17.8 and 11.95, respectively.
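
For readers unfamiliar with the Pk score mentioned above, the following Python sketch shows how the metric is commonly computed: a window of width k slides over the text, and Pk is the fraction of window positions where the reference and the hypothesis disagree on whether the two ends fall in the same segment, so lower is better. The representation of a segmentation as a list of segment lengths and the default choice of k (half the mean reference segment length) are conventions assumed here, not details taken from the paper. The function returns a fraction between 0 and 1.

```python
def segment_ids(lengths):
    """Expand segment lengths, e.g. [2, 3], into per-unit ids [0, 0, 1, 1, 1]."""
    ids = []
    for seg, n in enumerate(lengths):
        ids.extend([seg] * n)
    return ids


def pk(reference, hypothesis, k=None):
    """Pk error between two segmentations given as lists of segment lengths."""
    ref, hyp = segment_ids(reference), segment_ids(hypothesis)
    if len(ref) != len(hyp):
        raise ValueError("segmentations must cover the same number of units")
    if k is None:
        # Conventional default: half the average reference segment length.
        k = max(1, round(len(ref) / (2 * len(reference))))
    comparisons = len(ref) - k
    if comparisons <= 0:
        raise ValueError("text is too short for window size k")
    errors = sum(
        (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
        for i in range(comparisons)
    )
    return errors / comparisons


reference = [4, 4, 4]      # three true segments of four sentences each
print(pk(reference, [4, 4, 4]))   # identical segmentation
print(pk(reference, [12]))        # hypothesis with no boundaries at all
```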

Automatic Detection of Off-topic Documents using ConceptNet and Essay Prompt in Automated English Essay Scoring (영어 작문 자동채점에서 ConceptNet과 작문 프롬프트를 이용한 주제-이탈 문서의 자동 검출)

  • Lee, Kong Joo;Lee, Gyoung Ho
    • Journal of KIISE / v.42 no.12 / pp.1522-1534 / 2015
  • This work presents a new method that can predict, without the use of training data, whether an input essay is written on a given topic. ConceptNet is a common-sense knowledge base that is generated automatically from sentences extracted from a variety of document types. An essay prompt is the topic that an essay should be written about. The method proposed in this paper uses ConceptNet and an essay prompt to decide whether or not an input essay is off-topic. We introduce a way to find the shortest path between two nodes in ConceptNet, as well as a way to calculate the semantic similarity between two nodes. Both an essay prompt and a student's essay can be represented by concept nodes in ConceptNet. The semantic similarity between the concepts that represent an essay prompt and the concepts that represent a student's essay can be used to rank "on-topicness"; if the ranking is low, the essay is regarded as off-topic. We used eight different essay prompts and a student-essay collection for the performance evaluation, in which our proposed method performs better than the methods of previous studies. As ConceptNet enables simple text inference, our new method looks very promising for essay prompts that require a simple inference.
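
The two ingredients named in the abstract, a shortest path between concept nodes and a similarity derived from it, can be sketched in Python as follows. This is only an illustration: the toy graph is invented rather than taken from ConceptNet, breadth-first search stands in for whatever path-finding the paper uses, and the `1 / (1 + distance)` similarity and the averaging in `on_topicness` are assumptions, not the paper's formulas.

```python
from collections import deque


def build_graph(edges):
    """Undirected adjacency sets from (concept, concept) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path_length(graph, start, goal):
    """Number of edges on a shortest path (BFS), or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def similarity(graph, a, b):
    """Hypothetical similarity: closer nodes score higher, unreachable scores 0."""
    dist = shortest_path_length(graph, a, b)
    return 0.0 if dist is None else 1.0 / (1 + dist)


def on_topicness(graph, prompt_concepts, essay_concepts):
    """Average, over essay concepts, of the best similarity to any prompt concept."""
    if not prompt_concepts or not essay_concepts:
        return 0.0
    best = [max(similarity(graph, e, p) for p in prompt_concepts)
            for e in essay_concepts]
    return sum(best) / len(best)


# Invented toy edges, not real ConceptNet data.
toy = build_graph([
    ("school", "teacher"), ("teacher", "education"), ("school", "student"),
    ("student", "homework"), ("beach", "sand"), ("beach", "vacation"),
])
print(on_topicness(toy, ["education"], ["teacher", "school"]))
print(on_topicness(toy, ["education"], ["beach", "sand"]))
```

An essay whose concepts sit near the prompt concepts in the graph receives a higher score than one whose concepts are unreachable from them; ranking essays by this score and flagging the lowest ones mirrors the procedure the abstract outlines.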

Automatic Quality Evaluation with Completeness and Succinctness for Text Summarization (완전성과 간결성을 고려한 텍스트 요약 품질의 자동 평가 기법)

  • Ko, Eunjung;Kim, Namgyu
    • Journal of Intelligence and Information Systems / v.24 no.2 / pp.125-148 / 2018
  • Recently, as the demand for big data analysis increases, cases of analyzing unstructured data and using the results are also increasing. Among the various types of unstructured data, text is used as a means of communicating information in almost all fields. In addition, many analysts are interested in text because the amount of data is very large and it is relatively easy to collect compared to other unstructured and structured data. Among the various text analysis applications, the following have been actively studied: document classification, which classifies documents into predetermined categories; topic modeling, which extracts major topics from a large number of documents; sentiment analysis or opinion mining, which identifies emotions or opinions contained in texts; and text summarization, which summarizes the main contents of one or several documents. In particular, text summarization is actively applied in business through news summary services, privacy policy summary services, etc. In addition, much research has been done in academia on the extraction approach, which selectively provides the main elements of the document, and the abstraction approach, which extracts the elements of the document and composes new sentences by combining them. However, techniques for evaluating the quality of automatically summarized documents have not made as much progress as techniques for automatic text summarization. Most existing studies on the quality evaluation of summarization manually summarize documents, use these summaries as reference documents, and measure the similarity between the automatic summary and the reference document. Specifically, automatic summarization is performed on the full text through various techniques, and the result is compared with the reference document, which is an ideal summary, to measure the quality of the automatic summarization.
Reference documents are provided in two major ways; the most common is manual summarization, in which a person creates an ideal summary by hand. Since this method requires human intervention in preparing the summary, writing the summary takes a lot of time and cost, and the evaluation result may differ depending on the subjectivity of the summarizer. Therefore, in order to overcome these limitations, attempts have been made to measure the quality of summary documents without human intervention. As a representative attempt, a method has recently been devised that reduces the size of the full text and measures the similarity between the reduced full text and the automatic summary. In this method, the more of the full text's frequent terms appear in the summary, the better the quality of the summary is judged to be. However, since summarization essentially means condensing a lot of content while minimizing content omissions, it is unreasonable to say that a "good summary" based only on frequency always means a "good summary" in its essential meaning. In order to overcome the limitations of this previous work on summarization evaluation, this study proposes an automatic quality evaluation method for text summarization based on the essential meaning of summarization. Specifically, succinctness is defined as an element indicating how little duplicated content there is among the sentences of the summary, and completeness is defined as an element indicating how little of the content is missing from the summary. In this paper, we propose a method for automatic quality evaluation of text summarization based on the concepts of succinctness and completeness.
In order to evaluate the practical applicability of the proposed methodology, 29,671 sentences were extracted from TripAdvisor's hotel reviews, the reviews were summarized for each hotel, and the results of experiments evaluating the quality of the summaries according to the proposed methodology are presented. The paper also provides a way to integrate completeness and succinctness, which are in a trade-off relationship, into an F-score, and proposes a method to perform optimal summarization by changing the threshold of sentence similarity.
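
How completeness and succinctness might be combined into an F-score can be sketched in Python under stated assumptions. The abstract does not give the formulas, so everything concrete here is hypothetical: word-level Jaccard similarity as the sentence similarity, "covered" and "duplicate" decided by a single threshold, and the harmonic mean as the F-score.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two sentences (assumed measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def completeness(source, summary, threshold):
    """Share of source sentences that some summary sentence covers."""
    if not source:
        return 0.0
    covered = sum(
        any(jaccard(s, t) >= threshold for t in summary) for s in source
    )
    return covered / len(source)


def succinctness(summary, threshold):
    """Share of summary sentences that do not duplicate an earlier one."""
    if not summary:
        return 0.0
    unique = sum(
        all(jaccard(s, earlier) < threshold for earlier in summary[:i])
        for i, s in enumerate(summary)
    )
    return unique / len(summary)


def f_score(c, s):
    """Harmonic mean of completeness and succinctness."""
    return 0.0 if c + s == 0 else 2 * c * s / (c + s)


reviews = [
    "the room was clean",
    "the room was very clean",
    "staff were friendly",
    "breakfast was cold",
]
summary = ["the room was clean", "staff were friendly"]
c = completeness(reviews, summary, threshold=0.5)
s = succinctness(summary, threshold=0.5)
print(c, s, f_score(c, s))
```

Raising or lowering `threshold` changes which sentences count as covered or duplicated, which is the kind of trade-off the abstract says can be tuned to pick a summary.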

Evaluation on the Implementation of Girl Friendly Science Activity (여학생 친화적 과학활동 프로그램의 운영 평가)

  • Jhun, Young-Seok;Shin, Young-Joon
    • Journal of The Korean Association For Science Education / v.24 no.3 / pp.442-458 / 2004
  • This study was conducted to develop a plan for a large-scale implementation of the Girl Friendly Science Program based on the results of analysis and investigation of its current pilot implementation. The Girl Friendly Science Program materials, which were first developed in 1999 with support from the Ministry of Gender Equality, consist of 1) five theme-based units that specifically target individual students' unique abilities, aptitudes, and career choices, and 2) differentiated learning materials for 7th through 10th grade female students. All the materials are available at the homepage (http://tes.or.kr/gfsp.cgi) of 'Teachers for Exciting Science' (an organization of science teachers in the Seoul area). Since the materials are well organized by topic and grade level and presented in both Korean word-processor document and HTML formats, anyone can easily access the materials for their own instructional use. Ever since its launch, the number of visitors to the homepage has been constantly increasing. The evaluation of the current pilot implementation of the materials targeting individual students' ability and aptitude showed that it scored high in terms of its alignment with the original purpose, content, level, and effectiveness for classroom implementation. However, its evaluation scores were low in terms of the convenience for teachers in guiding students through the materials, and in terms of its organization and operation. The results also showed a significant change in students' perception of science, and positive student experiences of science through various interdisciplinary activities. On the other hand, the evaluation of students' experiences with the materials showed that students' assessment of an activity largely depended on whether their experience succeeded or failed. Overall, students' evaluation scores were low for simple activities such as cutting or pasting paper.
According to students' achievement test results, the difference between pre- and post-test scores was statistically significant in the Affective Domain (p<0.05), but not in the Inquiry Domain. Based on teachers' observations, numerous schools that have run this program reported that students' abilities to cooperate, discuss, observe, and reason with evidence improved. In order to implement this program on a larger scale, it is critical to have strong support from teachers and to encourage them to change their teaching strategies by building a community of teachers and developing ongoing teacher professional development programs. Finally, there remains a strong need to develop more programs, to actively discover and train more domestic women scientists and engineers, and to collaborate with them to develop more educational materials for girls of all ages.