Academic Conference Categorization According to Subjects Using Topical Information Extraction from Conference Websites

An Automatic Subject-Based Conference Classification Technique Using Topic Information Extraction from Conference Websites

  • Lee, Sue Kyoung (Department of Industrial and Management Engineering, Incheon National University)
  • Kim, Kwanho (Department of Industrial and Management Engineering, Incheon National University)
  • Received : 2017.03.17
  • Accepted : 2017.05.19
  • Published : 2017.05.31

Abstract

The amount of academic conference information on the Internet has increased rapidly in recent years, and automatically classifying this information by research subject helps researchers find relevant conferences efficiently. Most conference listing services provide only the title, date, location, and website URL of each conference. Among these features, only the title contains topical words, which leads to an information insufficiency problem. We therefore propose methods that address this problem by utilizing web content. Specifically, the proposed methods extract the main content from an HTML document collected via the conference website URL. Topical keywords are then selected based on the similarity between the conference title and the main content, so that the important keywords in the main content are reinforced. Experiments on a real-world dataset showed that the additional information extracted from conference websites improves conference classification performance. We plan to further improve classification accuracy by considering the structure of websites.

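As the amount of conference information posted online has recently surged, automatic classification of conference information by subject enables researchers to search for relevant conferences efficiently. However, because most conference listing services provide only information such as the conference name, date, location, and URL, the information from which a conference's subject can be identified is limited to its name. This study therefore proposes a technique that resolves the shortage of conference information by extracting topic information from conference websites via their URLs, while also improving learning performance with higher-quality information. Specifically, the main content is extracted from the HTML document collected through the website URL, and topical keywords similar to the conference name are selected and given additional weight. Experiments on real data confirmed that the proposed use of additional web content information successfully improves the performance of subject-based conference classification. In future work, we plan to further improve classification accuracy through topic information extraction that takes website structure into account.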
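
The pipeline described in the abstract (collect the HTML, extract its content, select keywords related to the conference title, and give them additional weight) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the extraction method or the similarity measure, so the sketch makes several assumptions: visible text is taken by simply skipping `<script>` and `<style>` blocks, "similarity" to the title is reduced to exact word overlap, and the `boost` factor, the stopword list, and all function names are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"a", "an", "and", "the", "of", "on", "for", "in", "to"}


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> blocks."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_text(html):
    """Crude stand-in for main-content extraction from an HTML document."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def tokenize(text):
    """Lowercase alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def weighted_term_counts(title, html, boost=2.0):
    """Term counts from the page, with terms that also occur in the title boosted.

    Exact word overlap is an assumed proxy for title/content similarity.
    """
    title_terms = set(tokenize(title))
    counts = Counter(tokenize(extract_text(html)))
    return {t: c * (boost if t in title_terms else 1.0) for t, c in counts.items()}
```

The resulting dictionary could serve as a weighted bag-of-words feature vector for a text classifier. A semantic similarity measure could replace the exact-match test without changing the rest of the sketch.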
