Text Classification for Patents: Experiments with Unigrams, Bigrams and Different Weighting Methods

  • Received : 2017.06.09
  • Accepted : 2017.06.21
  • Published : 2017.06.28

Abstract

Patent classification is becoming more critical as the number of patent filings increases year over year. Despite comprehensive studies in the area, several issues remain in classifying patents at the hierarchical levels of the International Patent Classification (IPC). Both the structural complexity of the hierarchy and the shortage of patents at its lower levels degrade classification performance. We therefore propose a classification method based on different criteria: the categories that domain experts define in trend analysis reports, i.e., Patent Landscape Reports (PLR). We conducted several experiments to identify the feature types and weighting methods that yield the best classification performance with a Support Vector Machine (SVM). Two types of features (nouns and noun phrases) and five weighting schemes (TF-idf, TF-rf, TF-icf, TF-icf-based, and TF-idcef-based) were evaluated.
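To make the difference between the weighting schemes concrete, the following minimal sketch computes three of the collection-level factors named above (idf, icf, and rf) on a toy corpus. The formulas are commonly cited textbook forms and are an assumption here, since the abstract does not state the exact variants used in the experiments. The corpus, the category names, and the helper names (`DOCS`, `idf`, `icf`, `rf`, `weigh`) are hypothetical and serve only as illustration. The TF-icf-based and TF-idcef-based schemes are not covered.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (tokens, expert-defined category).
DOCS = [
    (["lens", "polymer", "coating"], "materials"),
    (["lens", "hydrogel", "polymer"], "materials"),
    (["lens", "packaging", "blister"], "packaging"),
    (["lens", "packaging", "solution"], "packaging"),
]


def idf(term, docs):
    """Inverse document frequency, assumed form: log(N / df)."""
    df = sum(1 for tokens, _ in docs if term in tokens)
    return math.log(len(docs) / df) if df else 0.0


def icf(term, docs):
    """Inverse category frequency, assumed form: log(|C| / cf),
    where cf is the number of categories in which the term occurs."""
    categories = {cat for _, cat in docs}
    cf = len({cat for tokens, cat in docs if term in tokens})
    return math.log(len(categories) / cf) if cf else 0.0


def rf(term, docs, positive):
    """Relevance frequency, assumed form: log2(2 + a / max(1, c)),
    where a and c count the documents containing the term inside and
    outside the positive category."""
    a = sum(1 for tokens, cat in docs if term in tokens and cat == positive)
    c = sum(1 for tokens, cat in docs if term in tokens and cat != positive)
    return math.log2(2 + a / max(1, c))


def weigh(tokens, factor):
    """Multiply each term's raw frequency by a collection-level factor."""
    tf = Counter(tokens)
    return {term: count * factor(term) for term, count in tf.items()}


# Weight the first document under each scheme.
first_doc = DOCS[0][0]
tf_idf = weigh(first_doc, lambda t: idf(t, DOCS))
tf_icf = weigh(first_doc, lambda t: icf(t, DOCS))
tf_rf = weigh(first_doc, lambda t: rf(t, DOCS, "materials"))
print(tf_idf, tf_icf, tf_rf, sep="\n")
```

In this toy data, "lens" occurs in every document and every category, so its idf and icf factors are zero, whereas "polymer" occurs only in one category and receives a higher weight under all three schemes. The resulting weight dictionaries would then be turned into feature vectors for a classifier such as an SVM.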
