Crowdsourcing Identification of License Violations

  • Lee, Sanghoon (Department of Computer Science and Engineering, Pohang University of Science and Technology (POSTECH))
  • German, Daniel M. (Department of Computer Science, University of Victoria)
  • Hwang, Seung-won (Department of Computer Science, Yonsei University)
  • Kim, Sunghun (Department of Computer Science and Engineering, Hong Kong University of Science and Technology)
  • Received : 2015.11.02
  • Accepted : 2015.11.23
  • Published : 2015.12.30

Abstract

Free and open source software (FOSS) has created a large pool of source code that can be easily copied to create new applications. However, a copy should preserve the copyright notice and license of the original file unless the license explicitly permits changing them. As software evolves, it is difficult to preserve original licenses or to choose proper ones. As a result, there are many potential license violations. Although violations can have a high impact on copyright protection, identifying them is highly complex and relies on manual inspection by experts. Such inspection cannot scale to the volume of open source software released daily worldwide. To make this process scalable, we propose two methods: (1) use machine-based algorithms to narrow down the potential violations, and (2) guide non-experts to manually inspect violations. Using the first method, we found 219 projects (76.6%) with potential violations. Using the second method, we show that the accuracy of crowds is comparable to that of experts. Our techniques may help developers identify potential violations, understand their causes, and resolve them.
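
The two-stage idea in the abstract (machine filtering, then crowd inspection) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the data shapes, the license strings, and the majority-vote rule are all assumptions made for illustration. It assumes that a clone detector has already produced pairs of copied files and that a license identifier has already labeled each file. A license mismatch is only a *potential* violation, since the original license may permit the change.

```python
from collections import Counter


def flag_license_mismatches(clone_pairs, licenses):
    """Stage 1 (hypothetical): keep clone pairs whose detected licenses differ.

    clone_pairs: iterable of (original_path, copy_path) tuples.
    licenses: dict mapping a file path to a detected license name.
    Pairs with an unknown license on either side are skipped here.
    """
    flagged = []
    for original, copy in clone_pairs:
        license_original = licenses.get(original)
        license_copy = licenses.get(copy)
        if license_original is None or license_copy is None:
            continue
        if license_original != license_copy:
            flagged.append((original, copy))
    return flagged


def majority_label(votes):
    """Stage 2 (hypothetical): aggregate non-expert votes by simple majority.

    Returns "unsure" when there are no votes or the top two labels tie.
    """
    counts = Counter(votes).most_common()
    if not counts:
        return "unsure"
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "unsure"
    return counts[0][0]


if __name__ == "__main__":
    pairs = [("a/x.c", "b/x.c"), ("a/y.c", "b/y.c")]
    detected = {"a/x.c": "GPL-2.0", "b/x.c": "MIT",
                "a/y.c": "MIT", "b/y.c": "MIT"}
    for pair in flag_license_mismatches(pairs, detected):
        # Each flagged pair would then be shown to several crowd workers.
        print(pair, majority_label(["violation", "violation", "ok"]))
```

In this sketch the machine stage only shrinks the candidate set, and the crowd stage makes the final call, which mirrors the division of labor the abstract describes.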
