1. INTRODUCTION
Advancements in mobile phone hardware and increased network connectivity have made video ever more popular, yet several aspects of the experience still need improvement. For example, live video streaming apps are used to share experiences with other viewers, and viewers in turn can provide feedback such as comments. These comments usually appear in a list below or beside the shared video, separated from the visual context that the viewer is commenting on. This becomes a problem when the person capturing the video changes his or her viewpoint. Compared with static media, people now prefer to record a video with their phone to convey information, but they cannot highlight important parts of a video in the way they can with static media.
A better solution is to provide a way to comment directly on a specific frame of the video; we call this process video annotation. Annotation is an important learning strategy: after reading a document or textbook, learners write down their ideas and summarize important content by underlining or circling passages, which helps them recall the material later or provides hints for future reading. In explainer videos, the producer uses a variety of annotations and virtual contents to help people understand better. At present, adding such annotations to a video requires professional video editing software: the editor has to set the position of the virtual contents in every frame in order to fix 3D contents in place and produce animations. This takes enormous cost and time, so it is not easy for the general public to do.
There are several real-time annotation systems based on static media [1, 2, 3], but when a static image is replaced by a sequence of images, the original intention of an annotation can be misinterpreted when viewed from novel perspectives. It is therefore of great significance to research a lightweight real-time video annotation system that can adapt to everyday environments. A real-time annotation system faces two difficulties: annotations must stay fixed to the corresponding scene of the video, and annotation processing must meet real-time requirements.
In this work, we investigate how annotations can be displayed directly on a streamed video in real time using a mobile device, focusing in particular on Augmented Reality (AR) techniques. ARCore is a recent AR platform for mobile devices. It combines the advanced feature matching algorithm ORB with the smartphone's IMU to provide motion tracking, environment understanding, and light estimation on mobile phones [4]. We use ARCore to track the motion of objects in the video, associate this movement with the annotations, and adjust the shape of the annotations in real time with OpenGL.
This paper is organized as follows. Earlier studies are discussed in Section 2. Section 3 presents our real-time video annotation method. A real-time collaboration system based on the proposed method is described in Section 4. Finally, we summarize and evaluate our work in Section 5.
2. RELATED WORKS
There have been several earlier studies that explored real-time video annotation systems for different purposes. Lai et al. [5] developed a web-based video annotation system for online classrooms: students enter answers in a pop-up dialog box, and the teacher can browse the students' answers at the corresponding video position. The system is developed entirely with HTML5 features, so the annotations simply float on top of the video. Cho et al. [6] proposed a real-time interactive AR system for broadcasting. The system supports real-time interaction between the augmented virtual contents and the cast: it perceives the indoor space using an RGB-D camera, separates the area of each object through clustering, and then replaces it with virtual content. Venerella et al. [7] proposed a lightweight annotation system for collaboration. The system uses a client-server architecture: the client side generates a 3D mesh using 6D.AI, and based on the textured mesh the server side can annotate virtual objects and add virtual landmarks in the client's environment. These previous works have notable limitations: they do not support free-hand drawings or are not lightweight.
With the rapid increase in mobile device performance and network capability in recent years, there have been many attempts to build video annotation systems entirely on mobile phones. Choi et al. [8] proposed a collaboration system that allows multiple users to create and situate contents on a live video stream. This system uses an image-based AR platform called Vuforia to implement these features. However, such systems only support annotation on live video streams [9]. Nassani et al. [10] proposed a system in which comments are displayed over the background video, but it only supports text-type annotations.
3. REAL-TIME VIDEO ANNOTATION
3.1 System design
The proposed system uses the ARCore SDK and the OpenGL library to render annotations directly on the video in real time. ARCore can recognize a natural image as a target object using pre-registered dictionary data, and we create this dictionary data from the frames of the video. When annotations are drawn on the video, the feature points in the video frame are saved into a dictionary at the same time. By tracking these feature points, a motion matrix can be calculated, and according to the motion matrix, OpenGL renders the specified 3D models on the screen (Fig. 1). The feature matching process is introduced in Section 3.2, and the calculation of the pose matrix is presented in Section 3.3.
Fig. 1. AR-based real-time video annotation processing.
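The processing in Fig. 1 can be illustrated with a minimal sketch based on ARCore's Java API; this is not the paper's exact code, and names such as AnnotationPlacer and annotationAnchor are hypothetical. The annotation drawn by the user is attached to the scene with a hit test, and its pose is queried every frame to drive the OpenGL rendering.

```java
// Minimal sketch (assumed usage of the ARCore Java API): anchoring an annotation to the scene.
import com.google.ar.core.Anchor;
import com.google.ar.core.Camera;
import com.google.ar.core.Frame;
import com.google.ar.core.HitResult;
import com.google.ar.core.Session;
import com.google.ar.core.TrackingState;

public class AnnotationPlacer {
    private final Session session;   // ARCore session created by the host activity
    private Anchor annotationAnchor; // pose of the drawn annotation in the scene

    public AnnotationPlacer(Session session) {
        this.session = session;
    }

    // Called when the user finishes drawing at screen position (x, y).
    public void placeAnnotation(Frame frame, float x, float y) {
        for (HitResult hit : frame.hitTest(x, y)) {
            // Attach the annotation to the first surface hit by the screen ray.
            annotationAnchor = hit.createAnchor();
            break;
        }
    }

    // Called once per rendered frame: returns the annotation's model matrix for OpenGL, or null.
    public float[] currentModelMatrix(Frame frame) {
        Camera camera = frame.getCamera();
        if (camera.getTrackingState() != TrackingState.TRACKING || annotationAnchor == null) {
            return null; // tracking lost or nothing drawn yet: skip rendering this frame
        }
        float[] model = new float[16];
        annotationAnchor.getPose().toMatrix(model, 0); // pose tracked by ARCore
        return model;
    }
}
```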
3.2 Feature matching
Feature descriptors are a fundamental part of the computer vision and image processing fields. In this paper, the binary keypoint descriptor algorithm ORB [11] is used for feature detection. First, the feature points of two images are found with the FAST algorithm; then the attributes of these feature points are described with the BRIEF descriptor; finally, the feature point attributes of the two images are compared. If there are enough feature points with matching attributes, the match is considered successful. The feature matching result is shown in Fig. 2.
Fig. 2. Feature matching result using ORB algorithm.
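The matching step described above can be sketched with the OpenCV Java bindings as follows; this is an illustrative sketch rather than the paper's exact implementation.

```java
// Hedged sketch: ORB feature matching between two frames with the OpenCV Java bindings.
import org.opencv.core.Mat;
import org.opencv.core.MatOfDMatch;
import org.opencv.core.MatOfKeyPoint;
import org.opencv.features2d.DescriptorMatcher;
import org.opencv.features2d.ORB;

public class OrbMatcher {
    // Returns the matches between two grayscale frames.
    public static MatOfDMatch match(Mat frameA, Mat frameB) {
        ORB orb = ORB.create();                       // FAST keypoints + rotated BRIEF descriptors
        MatOfKeyPoint kpA = new MatOfKeyPoint();
        MatOfKeyPoint kpB = new MatOfKeyPoint();
        Mat descA = new Mat();
        Mat descB = new Mat();
        orb.detectAndCompute(frameA, new Mat(), kpA, descA);
        orb.detectAndCompute(frameB, new Mat(), kpB, descB);

        // Hamming distance is appropriate for ORB's binary descriptors.
        DescriptorMatcher matcher =
                DescriptorMatcher.create(DescriptorMatcher.BRUTEFORCE_HAMMING);
        MatOfDMatch matches = new MatOfDMatch();
        matcher.match(descA, descB, matches);
        return matches;
    }
}
```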
Research has shown that ARCore's performance and its ability to detect feature points can depend on lighting conditions, surface texture, and the angle at which the device is held [12].
3.3 Estimating the pose of the virtual object in real time
A real-time video stream generated by the device camera is processed in the ARCore recognition module for feature point detection. Together with the device's orientation and accelerometer sensors, ARCore estimates in real time the pose Mrecog of the recognized sparse point cloud of physical objects. Mrecog represents the motion of the recognized feature points' spatial positions as the viewpoint changes; it is a 4 × 4 matrix consisting of a translation and a rotation. The derivation of the Mrecog matrix is as follows:
In Fig. 3, since C0, C1, and p define a single plane, the projected points p0 and p1 also lie in this plane. The coplanarity of the corresponding vectors can be expressed by the following equation:
\(\overrightarrow{C_{0} p_{0}} \cdot\left(\overrightarrow{C_{0} C_{1}} \times \overrightarrow{C_{1} p_{1}}\right)=0\) (1)
\(\because p_{0}=\left[\begin{array}{l} x_{0} \\ y_{0} \\ 1 \end{array}\right]_{C_{0}}, p_{1}=\left[\begin{array}{l} x_{1} \\ y_{1} \\ 1 \end{array}\right]_{C_{1}} \quad \therefore \overrightarrow{p_{0}} \cdot\left(\vec{t} \times \overrightarrow{R p_{1}}\right)=0\) (2)
\(\because \vec{a} \times \vec{b}=[a]_{X} \vec{b} \quad \therefore p_{0}^{T}[t]_{X} R p_{1}=0\) (3)
\(E=[t]_{X} R\) (4)
Here \(E\) is a \(3 \times 3\) matrix:
\(\therefore\left[\begin{array}{lll} x_{0} & y_{0} & 1 \end{array}\right]\left[\begin{array}{ccc} E_{11} & E_{12} & E_{13} \\ E_{21} & E_{22} & E_{23} \\ E_{31} & E_{32} & E_{33} \end{array}\right]\left[\begin{array}{c} x_{1} \\ y_{1} \\ 1 \end{array}\right]=0\) (5)
\(\left[\begin{array}{ccccccccc} x_{0} x_{1} & x_{0} y_{1} & x_{0} & y_{0} x_{1} & y_{0} y_{1} & y_{0} & x_{1} & y_{1} & 1 \end{array}\right]\left[\begin{array}{c} E_{11} \\ E_{12} \\ \vdots \\ E_{33} \end{array}\right]=0\) (6)
\(\left[\begin{array}{ccccccccc} x_{0} x_{1} & x_{0} y_{1} & x_{0} & y_{0} x_{1} & y_{0} y_{1} & y_{0} & x_{1} & y_{1} & 1 \end{array}\right] E_{33}\left[\begin{array}{c} \frac{E_{11}}{E_{33}} \\ \frac{E_{12}}{E_{33}} \\ \vdots \\ 1 \end{array}\right]=0\) (7)
Fig. 3. Epipolar geometry model for matrix Mrecog estimation.
Since there are eight unknown variables \(\left\{\frac{E_{11}}{E_{33}}, \frac{E_{12}}{E_{33}}, \ldots, \frac{E_{32}}{E_{33}}\right\}\), at least eight pairs of feature points are required to solve Eq. (7). Fig. 2 shows that the ORB algorithm yields more than 50 matching pairs, and in most of the scenarios we tested there were more than eight pairs of feature points, so AR drawings can be kept in place steadily.
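As a concrete illustration (not the paper's exact implementation), the essential matrix E from Eq. (4) can be estimated from the matched points and decomposed into the rotation R and translation t using the OpenCV Java bindings; the camera intrinsic matrix K is assumed to be known from the device calibration.

```java
// Hedged sketch: estimating E = [t]x R from matched feature points with OpenCV (Java bindings).
import org.opencv.calib3d.Calib3d;
import org.opencv.core.Mat;
import org.opencv.core.MatOfPoint2f;

public class PoseFromMatches {
    // pointsA/pointsB: >= 8 matched image points from the two viewpoints C0 and C1.
    // K: 3x3 camera intrinsic matrix of the device camera (assumed known).
    public static void estimate(MatOfPoint2f pointsA, MatOfPoint2f pointsB, Mat K) {
        // RANSAC makes the estimate robust to mismatched ORB pairs.
        Mat E = Calib3d.findEssentialMat(pointsA, pointsB, K, Calib3d.RANSAC, 0.999, 1.0);

        // Decompose E into the rotation R and translation direction t (Eq. (4): E = [t]x R).
        Mat R = new Mat();
        Mat t = new Mat();
        Calib3d.recoverPose(E, pointsA, pointsB, K, R, t);
        // R and t can then be assembled into the 4x4 motion matrix Mrecog.
    }
}
```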
Using Mrecog, the pose of the device, Mdevice, and of the virtual canvas, Mvirtualcanvas, can easily be calculated (Fig. 4).
Fig. 4. Matrix definition.
\(M_{\text {device }}=M_{\text {recog }} \cdot T_{\text {device center }}\) (8)
\(M_{\text {virtualcanvas }}=M_{\text {recog }} \cdot T_{\text {offset }}\) (9)
Here, Toffset is a translation matrix from the camera coordinate system to the virtual object coordinate system, mainly defined by the device's camera parameters. Tdevice center is a translation matrix from the camera coordinate system to the device coordinate system, mainly defined by the physical position of the built-in camera. By dynamically updating the parameter Mvirtualcanvas, OpenGL is able to render the specified 3D models during movement.
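As an illustrative sketch (assuming Android's android.opengl.Matrix utilities and ARCore's camera API; the offset values stand in for Toffset and are placeholders), Eq. (9) can be combined with the camera's view and projection matrices to obtain the MVP matrix that OpenGL uses each frame:

```java
// Hedged sketch: composing the OpenGL MVP matrix from ARCore's camera pose and Mvirtualcanvas.
import android.opengl.Matrix;
import com.google.ar.core.Camera;
import com.google.ar.core.Frame;

public class CanvasRenderer {
    private final float[] view = new float[16];
    private final float[] projection = new float[16];
    private final float[] model = new float[16];      // Mvirtualcanvas
    private final float[] modelView = new float[16];
    private final float[] mvp = new float[16];

    // recogMatrix is the 4x4 pose Mrecog; offsetX/Y/Z encode Toffset (values are placeholders).
    public float[] computeMvp(Frame frame, float[] recogMatrix,
                              float offsetX, float offsetY, float offsetZ) {
        Camera camera = frame.getCamera();
        camera.getViewMatrix(view, 0);
        camera.getProjectionMatrix(projection, 0, 0.1f, 100.0f);

        // Mvirtualcanvas = Mrecog * Toffset (Eq. (9)).
        float[] tOffset = new float[16];
        Matrix.setIdentityM(tOffset, 0);
        Matrix.translateM(tOffset, 0, offsetX, offsetY, offsetZ);
        Matrix.multiplyMM(model, 0, recogMatrix, 0, tOffset, 0);

        // MVP = projection * view * model, passed to the annotation shader each frame.
        Matrix.multiplyMM(modelView, 0, view, 0, model, 0);
        Matrix.multiplyMM(mvp, 0, projection, 0, modelView, 0);
        return mvp;
    }
}
```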
4. REAL-TIME COLLABORATIVE AR SYSTEM
4.1 System overview
For practical use, we propose a novel real-time collaboration system based on the proposed annotation method. The integration of remote collaboration with a co-located collaborative mode is one of the novel points of the proposed system. Remote collaboration allows multiple users to simultaneously view and annotate three-dimensional virtual information on shared real-time video streams. The co-located collaborative service allows a user to persist virtual objects in the same place as before (i.e., relocalization), so that AR experiences can be shared with other users in the same environment. The system overview is shown in Fig. 5.
Fig. 5. System overview.
4.2 System architecture
The ARCore team recently added native support for Firebase's Realtime Database, which enables a stable shared AR experience [13]. The system architecture is shown in Fig. 6. The live video stream is encoded with H.264 and shared over the Internet using the RTMP protocol, while guidance data created on the remote helper side are transmitted via the TCP protocol.
Fig. 6. System architecture.
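As a hedged sketch of the anchor-sharing step (assuming ARCore's Cloud Anchor API; not the paper's exact code), one side hosts the annotation anchor and the other resolves it by its identifier, which can be exchanged through the Firebase Realtime Database or the TCP guidance channel:

```java
// Hedged sketch: sharing an annotation anchor between devices via ARCore Cloud Anchors.
import com.google.ar.core.Anchor;
import com.google.ar.core.Anchor.CloudAnchorState;
import com.google.ar.core.Session;

public class AnchorSharing {
    // Host side: upload the local annotation anchor and obtain an ID to send to peers.
    public static String host(Session session, Anchor localAnchor) {
        Anchor hosted = session.hostCloudAnchor(localAnchor);
        // In practice the state must be polled until hosting completes.
        if (hosted.getCloudAnchorState() == CloudAnchorState.SUCCESS) {
            return hosted.getCloudAnchorId(); // shared via Firebase or the TCP channel
        }
        return null;
    }

    // Remote side: recreate the anchor in the local session from the shared ID.
    public static Anchor resolve(Session session, String cloudAnchorId) {
        return session.resolveCloudAnchor(cloudAnchorId);
    }
}
```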
5. IMPLEMENTATION AND EVALUATION
The experiments in this paper were conducted on a OnePlus 3T, which is one of ARCore's officially supported devices. First, we drew a 2D arrow pointing at the printer cartridges; the 2D arrow is immediately augmented into a 3D arrow in the environment (Fig. 7), which clearly expresses the intention of replacing the printer cartridge.
Fig. 7. Implementation results of the proposed real-time video annotation method from two viewpoints. (a) Front view. (b) Side view.
To quantitatively evaluate our work, and in particular its success rate, we performed the same annotation task 100 times in indoor and outdoor environments; each task produces ten virtual annotations, and the number of failures in each task was recorded. As the evaluation criterion, if an augmented annotation shows a large displacement or suddenly disappears, it is counted as an annotation error. Fig. 8 plots the CDF of annotation errors for the two environments. The result demonstrates that the proposed system can complete the annotation task with about four failures at most in the different environments.
Fig. 8. Video annotation performance in various environments.
In this paper, we proposed a novel AR-based collaboration system. For the implementation, we developed a method of rendering 2D drawing annotations in 3D. Compared with 2D annotations, the semantics of 3D annotations are more explicit and can greatly improve the efficiency of collaboration systems. In the experimental evaluation, the average annotation success rate was about 90.21%, which shows the stability of the system. Future work has two main focuses: (1) improving the annotation method to automatically infer depth for 2D drawings in 3D space, and (2) using an integrated IoT system to search for users for real-time collaboration [14].
※ This research was supported by the BK21 Plus project of the Ministry of Education and the Korea Research Foundation (SW Human Resources Development Team for the realization of Smart Life at Kyungpook National University) (21A20131600005).
References
- H. Attiya, S. Burckhardt, A. Gotsman, A. Morrison, H. Yang, M. Zawirski, et al., "Specification and Complexity of Collaborative Text Editing," Proceedings of the ACM Symposium on Principles of Distributed Computing, pp. 259-268, 2016.
- L. Gao, D. Gao, N. Xiong, and C. Lee, "CoWeb Draw: A Real-time Collaborative Graphical Editing System Supporting Multi-clients Based on HTML5," Journal of Multimedia Tools and Applications, Vol. 77, No. 4, pp. 5067-5082, 2018. https://doi.org/10.1007/s11042-017-5242-4
- X. Wang, J. Bu, and C. Chen, "Achieving Undo in Bitmap-based Collaborative Graphics Editing Systems," Proceedings of the Conference on Computer Supported Cooperative Work, pp. 68-76, 2015.
- Shared AR Experiences with Cloud Anchors, https://developers.google.com/ar/develop/java/cloud-anchors/overview-android (accessed August 20, 2019).
- A.F. Lai, W.H. Li, and H.Y. Lai, "A Study of Developing a Web-based Video Annotation System and Evaluating Its Suitability on Learning," Proceedings of the Second International Conference on Education and Multimedia Technology, pp. 44-48, 2018.
- H. Cho, S.U. Jung, and H.K. Jee, "Real-time Interactive AR System for Broadcasting," Proceedings of IEEE Virtual Reality, pp. 353-354, 2017.
- J. Venerella, L. Sherpa, H. Tang, and Z. Zhui, "A Lightweight Mobile Remote Collaboration Using Mixed Reality," Proceedings of Computer Vision and Pattern Recognition 2019, pp. 1-4, 2019.
- S.H. Choi, M. Kim, and J.Y. Lee, "Situation-dependent Remote AR Collaborations: Image-based Collaboration Using a 3D Perspective Map and Live Video-based Collaboration with a Synchronized VR Mode," Journal of Computers in Industry, Vol. 101, pp. 51-66, 2018. https://doi.org/10.1016/j.compind.2018.06.006
- S. Lukosch, M. Billinghurst, and L. Alem, "Collaboration in Augmented Reality," Journal of Computer Supported Cooperative Work, Vol. 24, No. 6, pp. 515-525, 2015. https://doi.org/10.1007/s10606-015-9239-0
- A. Nassani, H. Kim, G. Lee, M. Billinghurst, T. Langlotz, R.W. Lindeman, et al., "Augmented Reality Annotation for Social Video Sharing," Proceedings of SIGGRAPH Asia 2016 Mobile Graphics and Interactive Applications, pp. 1-5, 2016.
- E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An Efficient Alternative to SIFT or SURF," Proceedings of IEEE International Conference on Computer Vision, pp. 2564-2571, 2011.
- A. Eklind, An Exploratory Research of ARCore's Feature Detection, Master's Thesis, KTH Royal Institute of Technology, 2018.
- Working with Anchors, https://developers.google.com/ar/develop/developerguides/anchors (accessed December 26, 2019).
- S. Ryu and S. Kim, "Development of an Integrated IoT System for Searching Dependable Device Based on User Property," Journal of Korea Multimedia Society, Vol. 20, No. 5, pp. 791-799, 2017. https://doi.org/10.9717/kmms.2017.20.5.791