
A Method for Text Information Separation from Floorplan Using SIFT Descriptor

  • Shin, Yong-Hee (Department of Civil and Environmental Engineering, Seoul National University) ;
  • Kim, Jung Ok (Institute of Construction and Environmental Engineering, Seoul National University) ;
  • Yu, Kiyun (Department of Civil and Environmental Engineering, Seoul National University)
  • Received : 2018.08.12
  • Accepted : 2018.08.23
  • Published : 2018.08.31

Abstract

With the development of data analysis methods and data processing capabilities, semantic analysis of floorplans has been actively studied, and studies for extracting text information from drawings have been conducted to support it. However, existing research that separates rasterized text from floorplans suffers from loss of text information, because text cannot be extracted where graphic and text components overlap. To solve this problem, this study defines the morphological characteristics of text in the floorplan and classifies each region by applying SVM models to the classes of the SIFT key points it contains. The algorithm developed in this study separated text components with a recall of 94.3% in five sample drawings.


1. Introduction

As data processing capability improves through advances in machine-learning-based data analysis and hardware performance, studies that analyze the semantic characteristics of objects in floorplans, in addition to structural analysis, are being actively conducted. In the structural analysis of space, the text component is merely noise to be removed. In the semantic analysis of indoor space (de las Heras et al., 2012), however, the text information in the floorplan image can be analyzed. By combining the textual information with other information, such as the location of objects, semantic analysis can create new information useful for decision-making, such as topology information of the indoor space (Lam et al., 2015). Therefore, separating text information from indoor floorplan images needs to be studied to expand the data available for semantic analysis.

Methods for separating text from floorplan images can be roughly divided into three groups: morphological methods, which filter and classify the morphological differences between text and graphic components; connected component analysis methods, which separate text components through connected component analysis; and component characteristic expression methods, which express the local characteristics of the image. All of these methods share the problem that when text and graphic elements overlap, the text cannot be separated. Therefore, recent studies have tried to solve this problem by combining the methods, as shown in Table 1.

Table 1. Text separation studies and their methods


However, even in these precedent studies, when the text component and the graphic component completely overlap, the text cannot be separated. In this study, we address this problem by combining the morphological method, connected component analysis, and the component characteristic expression method. First, non-overlapping text is separated from the floorplan through connected component analysis and the morphological method, and the morphological characteristics of the text are defined. Then, the characteristics of regions where text overlaps graphic components are expressed using the SIFT descriptor, which is invariant to scale and rotation, and classified into text and graphics through SVM models. Finally, we propose a method for separating text that completely overlaps graphics by utilizing the morphological characteristics of text in the floorplan defined above.

2. Construction of Text Classification Models

In order to separate the text components from the floorplan image, we first define the difference between text components and graphic components in the floorplan image using the SIFT descriptor. We learn this difference from training data and construct two classification models. The constructed classification models are used to classify text components and graphic components in the subsequent text separation process.

1) Construction of Training Data

To construct a classification model for classifying textual and graphical elements, a floorplan database is used that consists of 2,334 floorplans provided by the Ministry of Land, Infrastructure and Transport of Korea and 1,179 floorplans of Seoul National University buildings. From this database, the training data is generated by selecting 55 floorplan images, excluding floorplans of similar type, such as other floors of the same building.

Two support vector machine (SVM) models are used to separate text components and graphic components from binary floorplan images. The objective function of the SVM model is defined to classify the differences between the morphological characteristics of textual and graphical components with the largest margin (Cortes and Vapnik, 1995). Therefore, to create SVM models that classify floorplan images into text components and graphic components, the characteristics of the two component types must be defined in a form suitable as input to the SVM model, and training data is needed to learn the differences between them. For this purpose, the morphological characteristics of the text and graphic components are defined using the SIFT descriptor generated by the scale-invariant feature transform (SIFT) algorithm. Then the text and graphic elements of the floorplan images are manually separated to learn the differences in the characteristics of each element.

The first SVM model is the SVM-BoW model, which is used to classify connected components, i.e., sets of connected pixels of the same class in the binary image. We construct training data to learn the SVM-BoW model by manually dividing the connected components into text connected components and graphic connected components, as shown in Fig. 1.


Fig. 1. (a) The text connected components and (b) graphic connected components.


Fig. 2. Number of clusters and variance of the K-means clustering.

The second SVM model is the SVM-descriptor model, which is used to classify SIFT key points. A text component that overlaps a graphic component cannot be separated through connected component analysis, because it forms a single connected component together with the graphic component. Therefore, for text components overlapping graphic components, the SVM-descriptor model classifies whether each SIFT key point belongs to a text component or a graphic component, and the text component is then separated by utilizing the morphological characteristics of the text in the floorplan image. The training data of the SVM-descriptor model is generated by applying the SIFT algorithm to the text images and graphic images generated above and labeling the resulting SIFT descriptors.

2) SVM-BoW Model Construction

First, to learn the SVM-BoW model, the connected component image data generated in 1) is converted into a low-dimensional vector that can be used as training data for the SVM-BoW model. The entire image cannot be described by a single SIFT descriptor. However, if a high-dimensional image is transformed into a low-dimensional vector by utilizing the SIFT descriptors of all key points in the image, that vector can express the overall characteristics of the image. The low-dimensional vectorization of a connected component image is performed by collecting the SIFT descriptors of the connected component images in the training data and assigning each descriptor to a visual word.

To do this, we perform K-means clustering of the SIFT descriptors and find the cluster centers. The cluster centroids found through K-means clustering are called visual words, and the set of K visual words is called a bag of words (BoW) (Csurka et al., 2004; Lowe, 2004).

As the number of clusters increases, the overall variance decreases, which improves performance, but the computational cost increases. Therefore, it is important to determine an appropriate number of clusters. In this study, clustering the 312,882 SIFT descriptors extracted from the training data showed that clustering is efficient at 60 to 70 clusters (Fig. 2), so the number of clusters was set to K = 64.

By constructing the BoW, an image can be vectorized by assigning the SIFT descriptors of the connected component image to the BoW. Each SIFT descriptor is assigned to the closest visual word, and a histogram of visual words is generated. The image vector is then obtained by normalizing the histogram so that its bins sum to 1.
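The vocabulary construction and vectorization described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: scikit-learn's KMeans stands in for the clustering step, and random 128-dimensional vectors stand in for real SIFT descriptors (which would come from an image library such as OpenCV).

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, k, seed=0):
    """Cluster SIFT descriptors; the k cluster centers are the visual words."""
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(descriptors)
    return km.cluster_centers_

def bow_vector(descriptors, vocabulary):
    """Assign each descriptor to its nearest visual word; return the normalized histogram."""
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()   # bins sum to 1

# toy demonstration with random stand-in "descriptors" (128-d, like SIFT)
rng = np.random.default_rng(0)
train_desc = rng.random((500, 128))
vocab = build_vocabulary(train_desc, k=8)   # small k for the toy data; the paper uses K = 64
vec = bow_vector(rng.random((40, 128)), vocab)
print(vec.shape, vec.sum())
```

The resulting fixed-length vector is what the SVM-BoW model consumes, regardless of how many key points the original connected component image contained.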

When the connected component images of the training data are vectorized, vectors of the same class lie close together in the vector space, because they are generated from morphologically similar images, while vectors of different classes lie far apart. Based on this property, there exist planes that can divide the vectors of the two classes, and the plane with the largest margin is learned as the decision boundary of the SVM-BoW model.

We constructed the BoW with 64 visual words and learned the decision plane using 11,180 connected component images. Table 2 shows the result of testing the SVM-BoW model with 3,260 connected component images. The SVM model using the RBF kernel shows the best performance, so it is used in the subsequent process.
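The kernel comparison can be sketched as below, using scikit-learn's SVC. The synthetic 64-dimensional vectors are a stand-in for the real BoW vectors; they are offset so the two classes are separable, which is an assumption made purely for the demonstration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# synthetic stand-in for 64-d BoW vectors of the two classes
# (graphic = 0, text = 1); offset so the classes are separable
X_graphic = rng.random((200, 64)) * 0.5
X_text = rng.random((200, 64)) * 0.5 + 0.5
X = np.vstack([X_graphic, X_text])
y = np.array([0] * 200 + [1] * 200)

# compare kernels, as in Table 2; the RBF kernel performed best in the paper
for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel).fit(X, y)
    print(kernel, clf.score(X, y))
```

In practice the comparison would of course be made on held-out test vectors, as in Table 2, rather than on the training set.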

Table 2. Performance of the SVM-BoW model by kernel


3) SVM-descriptor Model Construction

In order to learn the SVM-descriptor model, the SIFT algorithm is applied to the text images and graphic images generated in 1) to find the key points, and the resulting SIFT descriptors are labeled as text or graphic. Since a SIFT descriptor is a 128-dimensional vector, the decision plane that classifies the class of a SIFT descriptor is learned directly in the 128-dimensional space, without the vectorization process of 2). Through this process, an SVM-descriptor model that classifies the classes of key points is created.
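The contrast with the SVM-BoW model can be illustrated as follows: the 128-dimensional descriptors are fed to the SVM directly, with no BoW step. The labeled Gaussian clouds below are hypothetical stand-ins for real labeled SIFT descriptors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# synthetic stand-in for labeled 128-d SIFT descriptors
# (graphic = 0, text = 1); well-separated Gaussian clouds
desc_graphic = rng.normal(0.0, 1.0, (300, 128))
desc_text = rng.normal(2.0, 1.0, (300, 128))
X = np.vstack([desc_graphic, desc_text])
y = np.array([0] * 300 + [1] * 300)

# descriptors are already fixed-length 128-d vectors, so no vectorization
# step is needed before training the SVM
model = SVC(kernel="rbf").fit(X, y)

# classify key points drawn from the "text" region
pred = model.predict(rng.normal(2.0, 1.0, (5, 128)))
print(pred)
```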

The SVM-descriptor model was learned from 267,301 SIFT descriptors generated from the training data. Table 3 shows the results of testing the constructed model with 45,581 SIFT descriptors. Based on these results, the SVM-descriptor model using the RBF kernel is used in the following process.

Table 3. Performance of the SVM-Descriptor model by kernel


3. Method for separating text from floorplan images

In this section, the text components are separated from the floorplan image using the existing text component separation method and the constructed classification models. First, text separation is performed by applying the Tombre method to the floorplan image, which separates the floorplan into two images. These two images are then passed to the two SVM models constructed above, and text separation is performed.

1) Text component separation through the modified Tombre method

According to Fletcher and Kasturi (1988), when connected component analysis is performed on a floorplan in which text and graphic components are mixed, large connected components express graphic components, while small connected components express text components. Therefore, large graphic components can be separated from the floorplan by limiting the relative size of the connected components. In addition, since the width and height of a text component are morphologically similar to each other, unlike those of a graphic component, graphic elements can also be separated by limiting the width-to-height ratio.
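The size and aspect-ratio filtering described above can be sketched as follows. The bounding-box statistics would in practice come from a connected component analysis of the binary floorplan (e.g. cv2.connectedComponentsWithStats); here they are made-up values, and the thresholds are illustrative.

```python
import numpy as np

# (width, height) of each connected component's bounding box; in practice
# these would come from connected component analysis of the binary
# floorplan image -- the values below are made up for illustration
boxes = np.array([
    (12, 14), (10, 15), (11, 13),   # small, roughly square: text-like
    (400, 8), (350, 300),           # large or elongated: graphic-like
])

widths = boxes[:, 0].astype(float)
heights = boxes[:, 1].astype(float)

# keep components that are small relative to the average width and whose
# height-to-width ratio is near 1 (illustrative thresholds)
n, t2 = 1.5, 20.0
ratio = heights / widths
text_like = (widths < n * widths.mean()) & (1.0 / t2 < ratio) & (ratio < t2)
print(text_like)
```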

Based on this, Tombre et al. (2002) constructed constraints from the morphological characteristics of textual elements and separated the textual components from floorplan images using those constraints. The constraints on text connected components constructed by Tombre et al. (2002) are as follows.

– The area of the bounding box of the connected component is smaller than T1 = n × max(Amt, Aavg).

– The height and width of the bounding box of the connected component are both smaller than \(\sqrt{T_{1}}\).

– The ratio (\(\frac{\text { Height }}{\text { Width }}\)) of the height and width of the bounding box of the connected component is in the range \([\frac{1}{T_{2}}, T_2]\).

– The density of black pixels (\(\frac{\sum \text { Area of black pixels }}{\text { Area of bounding box }}\)) in the bounding box of the connected component is smaller than T3.

– The ratio (\(\frac{\text { Height }}{\text { Width }}\)) of the height and width of the optimal bounding box of the connected component is in the range \([\frac{1}{T_{4}}, T_4]\).

– The density of black pixels (\(\frac{\sum \text { Area of black pixels }}{\text { Area of optimal bounding box }}\)) in the optimal bounding box of the connected component is less than 0.5.

where Aavg is the average area of the bounding boxes, Amt is the midpoint of the most frequent interval of the histogram of bounding-box areas, T1 is the size threshold, T2 is the aspect-ratio threshold of the bounding box, T3 is the density threshold, and T4 is the aspect-ratio threshold of the optimal bounding box. The optimal bounding box is the enclosing rectangle of minimum area for the connected component (Fig. 3). n, T2, T3, and T4 are parameters to be set according to the floorplan. Tombre et al. (2002) used n = 1.5, T2 = 20, T3 = 0.5, and T4 = 2.
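A minimal sketch of the constraint check, assuming (as in Tombre et al., 2002) that T1 is derived from the histogram of bounding-box areas. The component format and toy values are hypothetical, and the optimal-bounding-box constraints are omitted for brevity.

```python
import numpy as np

def tombre_text_filter(comps, n=1.5, t2=20.0, t3=0.5):
    """Apply the first four constraints to connected components.

    comps: list of dicts with bounding-box width 'w', height 'h', and
    black-pixel count 'black' (a hypothetical input format).
    """
    areas = np.array([c['w'] * c['h'] for c in comps], dtype=float)
    a_avg = areas.mean()
    # A_mt: midpoint of the most frequent interval of the area histogram
    hist, edges = np.histogram(areas, bins='auto')
    i = int(hist.argmax())
    a_mt = 0.5 * (edges[i] + edges[i + 1])
    t1 = n * max(a_mt, a_avg)

    keep = []
    for c in comps:
        ratio = c['h'] / c['w']
        density = c['black'] / (c['w'] * c['h'])
        keep.append(
            c['w'] * c['h'] < t1                       # area below T1
            and c['h'] < t1 ** 0.5 and c['w'] < t1 ** 0.5
            and 1.0 / t2 < ratio < t2                  # aspect ratio in [1/T2, T2]
            and density < t3                           # black-pixel density below T3
        )
    return keep

# toy components: two text-like, one large graphic, one long thin line
comps = [
    {'w': 12, 'h': 14, 'black': 60},
    {'w': 10, 'h': 15, 'black': 55},
    {'w': 300, 'h': 200, 'black': 5000},
    {'w': 400, 'h': 6, 'black': 2000},
]
flags = tombre_text_filter(comps)
print(flags)
```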


Fig. 3. (a) The bounding box and (b) the optimal bounding box.

The Tombre et al. (2002) method is useful because, for non-overlapping text components, it can extract the text connected components completely. However, it cannot be applied directly to floorplan images containing Korean text. As shown in Fig. 4, when connected component analysis is performed on a Korean character, one character is separated into two or more connected components rather than one. Therefore, floorplans containing Korean text cannot use the thresholds calculated by Tombre et al. (2002).


Fig. 4. Connected components of Korean characters.

To solve this problem, the text connected components of Korean characters must be clustered, and the components extracted in cluster units. First, after connected component analysis of the floorplan image, the connected components that do not satisfy the size threshold and the height-to-width ratio constraint of the Tombre method are filtered out. Then, the mean height \(\bar{h} \) of the remaining connected components is obtained, and connected components whose center points are within a distance of 2.5 × \(\bar{h} \) of each other are clustered. Finally, the constraints of Tombre et al. (2002) are applied to extract the text clusters.
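The distance-based clustering of the surviving components can be sketched as a single-link grouping. The centers and heights below are made-up values standing in for real connected component statistics; the paper does not specify a grouping algorithm, so the union-find approach here is one straightforward choice.

```python
import numpy as np

# center points (x, y) and heights of the connected components that
# survived the size and aspect-ratio constraints (made-up values:
# three strokes of one label and two strokes of another)
centers = np.array([(10.0, 10.0), (18.0, 11.0), (26.0, 10.0),
                    (200.0, 10.0), (208.0, 9.0)])
heights = np.array([12.0, 11.0, 12.0, 13.0, 12.0])

thresh = 2.5 * heights.mean()   # 2.5 x mean height, as in the text

# single-link grouping via union-find: components whose centers lie
# within the threshold of each other end up in the same cluster
parent = list(range(len(centers)))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

for i in range(len(centers)):
    for j in range(i + 1, len(centers)):
        if np.linalg.norm(centers[i] - centers[j]) <= thresh:
            parent[find(i)] = find(j)

groups = [find(i) for i in range(len(centers))]
print(groups)   # two clusters: one per label
```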

Using the modified Tombre method proposed in this study, the floorplan image is divided into a text image and a graphic image, as shown in Fig. 5. The text image generated in this process still includes graphic components that are morphologically similar to text components, and the graphic image includes text components that overlap graphic components. In the subsequent process, the SVM-BoW model is used to remove the graphic components from the text image, and the overlapping text components in the graphic image are separated using the SVM-descriptor model.


Fig. 5. (a) The text image and (b) graphic image separated by modified Tombre method.

2) Separation of text-connected components using the SVM-BoW model

The SVM-BoW model is used to separate text components from morphologically similar graphic components in the text image generated by the modified Tombre method. To do this, we first perform connected component analysis on the text image and extract each connected component into a separate image. To use the extracted connected component images as input to the SVM-BoW model, image vectorization is performed as in Section 2. The SVM-BoW model then classifies the class of each connected component image, and the images are divided into those classified as text and those classified as graphics. As a result, the text image is separated as shown in Fig. 6.


Fig. 6. (a) The text image and (b) graphic image separated by SVM-BoW model.

3) Separation of overlapping text components using the SVM-Descriptor model

The SVM-descriptor model is used to separate the overlapping text components in the graphic image. First, the SIFT algorithm is applied to the graphic image to extract key points and SIFT descriptors, and the class of each SIFT descriptor is classified by the SVM-descriptor model. Fig. 7 shows the SIFT key points in an area where text and graphic elements overlap in the graphic image; red points are key points classified as text components by the SVM-descriptor model, and green points are key points classified as graphic components.


Fig. 7. SIFT key points classified by the SVM-descriptor model in a region where text and graphic elements overlap (red: text, green: graphic).

As shown in Fig. 8, text components in a floorplan image exist concentrated in one area rather than as single characters. Likewise, key points classified as text by the SVM-descriptor model are also concentrated, so the locations of text elements can be estimated from the locations of the text key points. However, since the size of the text component at that location is unknown, it is difficult to separate the text component from the image. To solve this problem, the morphological information of the text obtained in the previous process can be used, because the size of the text within a floorplan image is constant throughout.


Fig. 8. (a) The text image and (b) graphic image

First, we perform connected component analysis on the text image obtained in Section 3-2 and obtain the center point positions and the average height \(h_{ave}\) of the connected components. These center points and the text key points obtained in Section 3-3 are collected, and points whose mutual distance is ≤ 2.5 × \(h_{ave}\) are grouped together. Finally, a bounding box including all points of each group is generated, and the image inside the bounding box is extracted from the original floorplan image. As a result, the separation of text from the floorplan image is completed, as shown in Fig. 8.
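The final bounding-box extraction for one group can be sketched as below. The coordinates are made up, and padding the box by half the average text height is our assumption; the paper only states that a box including the whole group is generated.

```python
import numpy as np

# one group of points: connected component centers from the text image
# together with key points classified as text (made-up coordinates)
group = np.array([(34.0, 120.0), (46.0, 118.0), (58.0, 121.0), (52.0, 125.0)])
h_ave = 14.0   # average connected component height from the text image

# bounding box enclosing all points of the group, padded by half the
# average text height so whole glyphs are covered (padding amount is
# our assumption, not specified in the paper)
pad = h_ave / 2.0
x_min, y_min = group.min(axis=0) - pad
x_max, y_max = group.max(axis=0) + pad
bbox = (x_min, y_min, x_max, y_max)
print(bbox)   # region to crop from the original floorplan image
```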

4. Experiments and results

1) Experimental data

The floorplan database consists of 2,334 indoor floorplans of Seoul and 1,179 floorplans of Seoul National University buildings. From the 55 selected floorplans, training data for the SVM-BoW model consisting of 9,031 text connected component images and 5,409 graphic connected component images was generated, and a total of 312,882 SIFT descriptors, including 143,455 text SIFT descriptors and 169,427 graphic SIFT descriptors, were generated as training data for the SVM-descriptor model. The experiments in this study were performed with the image analysis library OpenCV 3.4 in a Python 2.7 environment.

2) Text Separation Result in the floorplans

To evaluate the text separation results of this study, they are compared with Tombre's method, which is generally used for text separation, and Hoang's method, which performs well when text and graphic elements overlap; the text separation algorithm is applied to five images used in previous research. These five images are widely used to evaluate text separation performance because they mix text and graphic elements, covering floorplans, maps, and mechanical drawings. The evaluation index is recall, which indicates how many of the text characters in the image are extracted. As shown in Table 4, 317 of the 336 characters in the images were separated, showing better separation performance than the existing text separation studies. A Wilcoxon signed rank test was performed for significance. This study showed better text separation performance than Tombre et al. (2002) at the 96% confidence level (p-value = 0.03125), and better than Hoang and Tabbone (2010) at the 60% level (p-value = 0.3932).
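The evaluation can be reproduced in outline as follows. The per-image counts and per-image recalls below are hypothetical (only the totals, 317 of 336 characters, are reported in the paper); scipy's wilcoxon performs the signed rank test. With five images and all differences in the same direction, the exact one-sided p-value is 1/32 = 0.03125, matching the reported comparison with Tombre et al. (2002).

```python
from scipy.stats import wilcoxon

# hypothetical per-image character counts consistent with the reported
# totals (317 of 336 characters separated across the five test images)
separated = [62, 70, 61, 64, 60]
total = [66, 74, 65, 68, 63]
recall = sum(separated) / sum(total)
print(f"recall = {recall:.1%}")   # 94.3%

# one-sided Wilcoxon signed rank test against hypothetical per-image
# recalls of a baseline method
ours = [0.94, 0.95, 0.93, 0.94, 0.95]
baseline = [0.80, 0.85, 0.78, 0.82, 0.84]
stat, p = wilcoxon(ours, baseline, alternative='greater')
print(p)
```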

Table 4. Comparison of the text component separation accuracy



Fig. 9. The result of applying the text separation algorithm to the five images used in Tombre et al. (2002) and Hoang and Tabbone (2010) to prove text separation performance.

5. Conclusion

In the study of floorplan images, research has mainly focused on extracting building objects, but semantic analysis of the images is also increasing with the development of image analysis algorithms and the improvement of computational capability. In this context, this study is meaningful because it automatically separates text from the image for semantic analysis, solving the problem of precedent studies in which text could not be separated when it completely overlapped graphic elements. To this end, two classification models using SIFT descriptors are used: one to separate text from morphologically similar graphic elements, and one to separate text elements that completely overlap graphic elements. Compared with other studies, this study can separate textual elements even from completely overlapping regions and extracts Korean text without loss. As a result, the text components were separated from the five images with a recall of 94.3%, which is higher than the precedent studies.

A limitation of this study is that the algorithm cannot be applied to connected components without SIFT key points; the numbers '1' and '7' and the letters 'i' and 'l' may have no key points on the connected component. In addition, when a text component is separated, the surrounding graphic components are also extracted, so the text extracted by this method may not be recognized by a character recognition program such as OCR. To solve these problems, a pixel-based deep learning algorithm such as a CNN, which expresses image characteristics at the pixel level and can therefore analyze general regions rather than only key points, or alternative local descriptors such as SURF or VLAD, could be used. This is expected to improve the performance of text component separation in drawing images.

Acknowledgment

This research was supported by a grant (18NSIPB135746-02) from the National Spatial Information Research Program (NSIP) funded by the Ministry of Land, Infrastructure and Transport of the Korean government.

References

  1. Ahmed, S., M. Liwicki, M. Weber, and A. Dengel, 2011. Text/graphics segmentation in architectural floor plans, Proc. of 2011 International Conference on Document Analysis and Recognition (ICDAR), IEEE, Beijing, China, Sep. 18-21, pp. 734-738.
  2. Cortes, C. and V. Vapnik, 1995. Support-vector networks, Machine learning, 20(3): 273-297. https://doi.org/10.1007/BF00994018
  3. Csurka, G., C. Dance, L. Fan, J. Willamowski, and C. Bray, 2004. Visual categorization with bags of keypoints, Proc. of 2004 Workshop on statistical learning in computer vision, 8th European Conference on Computer Vision, Prague, Czech Republic, May 10-16, vol. 1, pp. 1-22.
  4. de las Heras, L. P., S. Ahmed, M. Liwicki, E. Valveny, and G. Sanchez, 2014. Statistical segmentation and structural recognition for floor plan interpretation, International Journal on Document Analysis and Recognition, 17(3): 221-237. https://doi.org/10.1007/s10032-013-0215-2
  5. Fletcher, L. A. and R. Kasturi, 1988. A robust algorithm for text string separation from mixed text/graphics images, IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(6): 910-918. https://doi.org/10.1109/34.9112
  6. Hoang, T. V. and S. Tabbone, 2010. Text extraction from graphical document images using sparse representation, Proc. of 2010 Document Analysis Systems (DAS), 9th IAPR international workshop on document analysis systems, Cambridge, USA, Jun. 9-11, pp. 143-150.
  7. Lam, O., F. Dayoub, R. Schulz, and P. Corke, 2015. Automated topometric graph generation from floor plan analysis, Proc. of 2015 Australasian Conference on Robotics and Automation, ACRA, Canberra, Australia, Dec. 2-4, pp. 1-8.
  8. Le, D. X., G. R. Thoma, and H. Wechsler, 1995. Classification of binary document images into textual or nontextual data blocks using neural network models, Machine Vision and Applications, 8(5): 289-304. https://doi.org/10.1007/BF01211490
  9. Lowe, D. G., 2004. Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision, 60(2): 91-110. https://doi.org/10.1023/B:VISI.0000029664.99615.94
  10. Roy, P. P., U. Pal, and J. Llados, 2009. Touching text character localization in graphical documents using SIFT, Proc. of 2009 International Workshop on Graphics Recognition, Springer, La Rochelle, France, Jul. 22-23, pp. 199-211.
  11. Tombre, K., S. Tabbone, L. Pélissier, B. Lamiroy, and P. Dosch, 2002. Text/graphics separation revisited, Proc. of 2002 International Workshop on Document Analysis Systems, Springer, Princeton, NJ, USA, Aug. 19-21, pp. 200-211.