1. Introduction
In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and their arrangement in the correct reading order [34]. The quality of the layout analysis therefore determines the quality and feasibility of the subsequent steps, and hence of the whole document processing activity.
In the field of document layout analysis, a number of approaches have been proposed, but their results remain limited. Top-down methods [7-10] look for global information on the entire document page, split it into blocks, then split the blocks into text lines and the text lines into words. These methods use different strategies to separate the input document into distinct regions, together with many heuristic conditions to identify non-text elements. They prove efficient when the document has a Manhattan layout, i.e. region boundaries consisting of vertical and horizontal line segments. Bottom-up methods [1-3], [5], [11], [13] start with local information: they first determine the words, then merge the words into text lines, and the text lines into blocks or paragraphs. These methods are more effective on non-Manhattan layouts, but they require more computation time and memory. Besides, some of their thresholds are complex and sometimes inaccurate. Hybrid methods [19], [22] combine the two approaches above. However, most of these methods pay little attention to classifying text and non-text elements before grouping them, which limits the precision of their results.
As we can see, one of the main reasons for this limitation is that these methods do not pay much attention to the classification of text and non-text components in the document, even though this classification plays a very important role in document layout analysis. Neglecting it leads to wrong segmentation of text and misclassification of non-text elements.
Besides, the distribution of text and non-text elements in documents is highly irregular and does not follow fixed rules, especially in non-Manhattan documents (a non-Manhattan layout may include arbitrarily shaped components in any arrangement). Therefore, classifying text and non-text in the original document by bottom-up methods, or in the regions obtained by top-down methods, even homogeneous regions, becomes complex and often gives unexpected results. These methods typically separate the original document into many different regions and then apply many filters to classify each region [5], [18], [20] (only one layer of homogeneous regions is used). Besides requiring many filters, they are only effective when the region is not too complicated. The winner of the ICDAR 2009 page segmentation competition [21], the Fraunhofer Newspaper Segmenter, also uses this approach, and its precision in non-text identification is unsatisfactory.
On the other hand, some methods use the wavelet transform and multi-scale resolution [15], [16] to identify non-text elements. This approach yields positive results and is very general; however, the multi-scale resolution makes the computation quite slow. In addition, wavelet methods run into difficulties on documents with little information or with overlapping structures. Another difficulty is that the wavelet transform introduces considerable noise when the document is rescaled. The most common mistakes are misclassifying images whose structure resembles text, and confusing small images with large-font text.
In [27], Chen and Wu proposed a multi-plane approach for text segmentation. This method proves efficient on document images with complex backgrounds. However, it cannot control the number of planes and uses many thresholds, so the computation time is quite long in some cases. Besides, it focuses only on text extraction and does not address the identification of non-text elements.
One important feature of a document image is its large amount of information: there are many components inside the document. This creates favorable conditions for algorithms that use a statistical approach. In this paper, we propose an effective method to separate the input document into two distinct binary documents, a text document and a non-text document. Our method applies a recursive filter over multiple layers of homogeneous regions to carry out the classification. Based on this statistical approach, the proposed method achieves high accuracy in text and non-text classification. The overall procedure is as follows (Fig. 1).
Fig. 1. Flowchart of the proposed system
Firstly, the colored input document is binarized by a combination of the Sauvola technique and the integral image. Then, skew estimation [28], [31] is performed; this is an optional step for skewed documents. Secondly, based on the binary document f, we extract the connected components and their properties. A heuristic filter is applied to identify some non-text elements, and these are removed from the binary document via the label matrix to obtain a new image. Then, based on this image, the recursive filter is performed to identify all remaining non-text components. This process combines whitespace analysis with multi-layer homogeneous regions: non-text elements are eliminated layer by layer. All elements remaining after the recursive filter are text elements; they are reshaped by their coordinates to form the text document. Finally, the non-text document is obtained by subtracting the text document from the original binary image f.
An overview of the proposed method and its performance is given next. In Section 2, the method is described in detail. Section 3 presents the experimental results and the evaluation using the ICDAR2009 page segmentation dataset [21] and six methods from that competition. Finally, the paper is concluded in Section 4.
2. Proposed Method
The proposed method for text and non-text classification is described as follows, given a colored input image:
2.1. Image Binarization
Like other image processing methods, our method operates on a binary image, so we first need to convert the colored input document into a binary document. Image binarization is the process that converts an input grayscale image g(x,y) into a bi-level representation (if the input document is colored, its RGB components are combined to give a grayscale image). Generally, there are two approaches to binarizing a grayscale image: algorithms based on a global threshold and algorithms based on local thresholds. In our method, we use the Sauvola technique [14] to obtain the local threshold at each pixel (x,y).
where m(x,y), s(x,y) are the local mean and standard deviation values in a W×W window centered on the pixel (x,y) respectively. R is the maximum value of the standard deviation and k is a parameter which takes positive values in the range [0.2,0.5].
In the field of document binarization, algorithms using local thresholds often give better results than those using a global threshold. Nevertheless, they require longer computation time.
For example, the computational complexity of the Sauvola algorithm for an image of size a × b is O(W2ab). In order to reduce its computation time, we use integral images for computing the local means and variances. The integral image, or summed area table, was first introduced in 1984 [24] but was not popularized in computer vision until 2001 by Viola and Jones [6]. The value of the integral image at location (x,y) is the sum of the pixels above and to the left of it. The value of the integral image I at position (x,y) can be written as:
The integral image of any grayscale image can be computed efficiently in a single pass [6]. Then the local mean m(x,y) and local standard deviation s(x,y) for any window size W can be computed simply by the formulas in [23]:
where
The computational complexity of binarization is now only O(ab), and the computation time does not depend on the window size. In our algorithm, we choose the window size W = 1/2 × min(a,b). The parameter k controls the value of the threshold in the local window: the higher the value of k, the lower the threshold relative to the local mean m(x,y). Experiments with different values found that k = 0.34 gives the best results for small window sizes [23], though the difference is small. In our system the window size is large, so the effect of the parameter k is very small; in general, the algorithm is not very sensitive to its value. The experimental results show that our system gives similar results for k ∈ [0.2,0.5] and the best results for k ∈ [0.2,0.34]. We assign foreground pixels a value of 1 and background pixels a value of 0. This means the binary image f (Fig. 6(a), 6(d) and Fig. 7(a), 7(d)) is calculated by:
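The binarization step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function and variable names are ours, R defaults to the common value 128 for 8-bit images, and the window of size W is applied with simple border clipping.

```python
import numpy as np

def sauvola_binarize(gray, k=0.34, R=128):
    """Sketch of Sauvola thresholding accelerated with integral images.
    gray: 2-D uint8 array; window size W = min(a, b) // 2 as in the text."""
    a, b = gray.shape
    W = max(min(a, b) // 2, 3)
    g = gray.astype(np.float64)
    # Integral images of the image and of its square, padded with a
    # leading zero row/column so that window sums are simple differences.
    I = np.pad(g.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    I2 = np.pad((g * g).cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    h = W // 2
    ys = np.clip(np.arange(a) - h, 0, a)       # window top rows
    ye = np.clip(np.arange(a) + h + 1, 0, a)   # window bottom rows
    xs = np.clip(np.arange(b) - h, 0, b)
    xe = np.clip(np.arange(b) + h + 1, 0, b)
    area = (ye - ys)[:, None] * (xe - xs)[None, :]
    def winsum(S):
        # Sum over each pixel's local window via the integral image.
        return S[ye][:, xe] - S[ye][:, xs] - S[ys][:, xe] + S[ys][:, xs]
    m = winsum(I) / area                                   # local mean
    s = np.sqrt(np.maximum(winsum(I2) / area - m * m, 0))  # local std
    T = m * (1 + k * (s / R - 1))                          # Sauvola threshold
    return (g <= T).astype(np.uint8)                       # foreground = 1
```

Each local mean and variance costs O(1) regardless of W, giving the O(ab) total complexity stated above.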
2.2. Connected component analysis and whitespace extraction
2.2.1. Connected component analysis
Connected component labelling is the process of extracting and labelling connected components from a binary image. All pixels that are connected and have the same value are extracted and assigned to a separate component. Many algorithms use this step, such as [5], [17-20]. Let L be the label matrix, CCs be the set of all connected components, and CCi be the ith connected component of f. Every CCi is characterized by the following set of features:
Fig. 2. Example of connected component analysis and whitespace extraction.
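The labelling step can be sketched as below. This is an illustration only: the property names are ours, chosen to mirror the features used later in the text (bounding box, size, height, width), and 8-connectivity is assumed.

```python
import numpy as np
from scipy import ndimage

def extract_components(f):
    """Label the binary image f (foreground = 1) and collect
    per-component properties. Returns the label matrix L and a list
    of component dictionaries."""
    L, n = ndimage.label(f, structure=np.ones((3, 3)))  # 8-connectivity
    ccs = []
    for i, sl in enumerate(ndimage.find_objects(L), start=1):
        ys, xs = sl
        ccs.append({
            "label": i,
            # bounding box as (Xl, Yt, Xr, Yb)
            "bbox": (xs.start, ys.start, xs.stop - 1, ys.stop - 1),
            "size": int((L[sl] == i).sum()),   # foreground pixel count
            "height": ys.stop - ys.start,
            "width": xs.stop - xs.start,
        })
    return L, ccs
```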
2.2.2. Whitespace extraction
For every connected component CCi in the region, we find the right nearest neighbor and the left nearest neighbor. Based on these neighbors, we extract the length of the whitespace (distance) between them. In our system, we use the technique proposed in [19] to achieve linear computation time. CCj, j ≠ i, is called the right nearest neighbor of CCi if Volap(CCi) ≠ ∅, CCj ∈ Volap(CCi), CCj is not located inside CCi, Xlj > Xri, and
where Xlj − Xri is the whitespace (distance) between CCi and CCj. Besides, if CCj is the right nearest neighbor of CCi, then CCi is the left nearest neighbor of CCj.
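The neighbor search above can be sketched as follows. For clarity this sketch uses a quadratic scan, whereas the paper uses the linear-time technique of [19]; the dictionary layout and function name are ours.

```python
def right_nearest_neighbors(ccs):
    """For each component (a dict with bbox = (Xl, Yt, Xr, Yb)), find
    the right nearest neighbour among components that overlap it
    vertically, and record the whitespace Xl_j - Xr_i between them."""
    rnn = {}
    for i, ci in enumerate(ccs):
        xl_i, yt_i, xr_i, yb_i = ci["bbox"]
        best, best_ws = None, None
        for j, cj in enumerate(ccs):
            if j == i:
                continue
            xl_j, yt_j, xr_j, yb_j = cj["bbox"]
            v_overlap = min(yb_i, yb_j) - max(yt_i, yt_j)
            # Require vertical overlap and CCj strictly to the right.
            if v_overlap < 0 or xl_j <= xr_i:
                continue
            ws = xl_j - xr_i          # whitespace between CCi and CCj
            if best_ws is None or ws < best_ws:
                best, best_ws = j, ws
        rnn[i] = (best, best_ws)
    return rnn
```

By symmetry, each entry also yields a left-nearest-neighbor relation in the opposite direction.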
2.3. Heuristic filter
In this filter, we find the elements that cannot be text, without considering their relative positions in the document image. Clearly, these conditions must be precise and very stringent, because they strongly influence whether an element is treated as a separate region or not. This filter not only reduces the computation time of the whole process but also increases the accuracy of the proposed system.
Let CCs be all the connected components of the input binary document f, CCi be the ith connected component, and B(CCi) be its bounding box. In order to improve performance, CCi is considered a non-text component if it satisfies one of the following conditions:
The thresholds in the above conditions were carefully calculated and checked many times against a variety of document types, including noisy binary images. Let CCs′ be the set of non-text elements found by the conditions above; then ∀CCi ∈ CCs′
Then, CCs = CCs ∖ CCs′
2.4. Recursive filter
This is the most important process of our method. After the heuristic filter step, some non-text elements have been identified and eliminated. However, many non-text elements still exist in the document, and these components are often not very different from text elements. In this section, based on a statistical method, we propose an efficient filter to identify the non-text elements in the document, see Fig. 1. Let CCs be the set of all connected components of the binary document obtained after the heuristic filter. This is an iterative method with three main steps:
Firstly, we extract the homogeneous regions of the document. To do this, vertical homogeneous regions are extracted by vertical projection, and then each vertical region is segmented horizontally again to obtain the homogeneous regions HRk (k = 1, …, m, where m is the number of regions).
Secondly, the whitespace analysis process is performed to identify the non-text components and their labels in all homogeneous regions HRk. Let CCs′ ⊂ CCs be the set of these components. Once again, these components are removed via the label matrix to obtain a new binary text document.
The two steps above are repeated until no non-text component can be found, i.e. CCs′ = ∅. At that point, all regions HRk are text homogeneous regions HRk∗ and are reshaped by their coordinates to form the text document; the non-text document is the subtraction of f and the text document.
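The control flow of the iterative procedure above can be sketched as follows. The three step functions are passed in as parameters here, with hypothetical names of our choosing, since their internals are described in the subsections below.

```python
def recursive_filter(doc, ccs, extract_regions, find_non_text, remove):
    """Fixed-point loop of the recursive filter: split the document
    into homogeneous regions, find non-text components in them, remove
    those components, and repeat until none are found."""
    while True:
        regions = extract_regions(doc)          # step 1
        bad = find_non_text(regions, ccs)       # step 2 (CCs')
        if not bad:                             # CCs' is empty: done
            return doc, ccs, regions            # regions are all text
        doc, ccs = remove(doc, ccs, bad)        # step 3: eliminate CCs'
```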
2.4.1. Homogeneous regions extraction
A document is usually divided into various regions, including text regions (or paragraphs), image regions, lines, etc. Moreover, within a paragraph, the text is often homogeneous horizontally or vertically, and the whitespaces between lines are almost the same. Based on these properties, we segment the input document into many different regions, which we call homogeneous regions. In this section, we present a method to perform this segmentation. There are two kinds of homogeneity, horizontal and vertical; the difference between them is the direction in which we take the projection.
Suppose the binary image under consideration has size a × b. In order to obtain the horizontal homogeneous regions, we perform the following steps:
Firstly, we use run-length encoding (abbreviated RLE) to calculate the run lengths of all elements in LP. RLE is a data compression method in which a sequence of identical data values can be represented by a single value and a count. It is especially useful for binary image data with only two values, zero and one (in our case, we use −1 instead of 1). For example, the RLE of the sequence {−1,−1,−1,0,0,−1,−1,0,0,0,−1,−1,−1,−1} is RLE = {−1,3,0,2,−1,2,0,3,−1,4}. Let bi and wi be the widths of the ith black run and white run, respectively. Let
Let μB, μW be the means and nB, nW the numbers of elements of B and W, respectively. The variances of the black runs and white runs are as follows:
If VB and VW are low, the region is homogeneous. Conversely, if VB or VW is high, the region is heterogeneous and should be segmented. In our method, with skew taken into account, we choose a threshold value of 1.3 for both VB and VW: if VW > 1.3 or VB > 1.3, the region is segmented. This threshold still depends somewhat on the language (Korean or English), because the variation in letter width and height differs between languages. However, its impact is small, and the experimental results show that one fixed threshold works for both languages.
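The homogeneity test can be sketched as below. The paper's exact variance formula is not reproduced here; as a labelled assumption, this sketch uses the variance of the run widths normalised by their mean, compared against the threshold 1.3 given in the text.

```python
import numpy as np

def run_lengths(profile):
    """Run-length encode a 1-D projection profile whose entries are
    0 (white) or -1 (black); return the black-run and white-run widths."""
    runs, vals = [], []
    for v in profile:
        if vals and vals[-1] == v:
            runs[-1] += 1
        else:
            vals.append(v)
            runs.append(1)
    black = [r for r, v in zip(runs, vals) if v == -1]
    white = [r for r, v in zip(runs, vals) if v == 0]
    return black, white

def is_homogeneous(profile, t=1.3):
    """Assumed homogeneity test: the mean-normalised variance of both
    the black and the white run widths must stay below t."""
    black, white = run_lengths(profile)
    def norm_var(runs):
        if len(runs) < 2:
            return 0.0
        mu = np.mean(runs)
        return float(np.var(runs) / mu) if mu > 0 else 0.0
    return norm_var(black) <= t and norm_var(white) <= t
```

A profile of evenly sized text lines passes the test, while a profile mixing very thin and very wide runs fails it and triggers segmentation.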
When the region is heterogeneous, it should be segmented. There are two cases requiring further division. First, the region under consideration contains a white line whose width is much larger than the other white lines, Fig. 3(a). Second, it contains a black line whose width is much larger than the others, Fig. 3(b).
Fig. 3. Examples of: (a) white segmentation (white line 4); (b) black segmentation (white lines 1 and 2)
In order to find the position at which to split, we need to find the most distinctive black or white space. The splitting position is determined as follows:
The steps above are repeated until every resulting region is homogeneous. In the same way, we can perform all the steps vertically to obtain the homogeneous regions in that direction.
2.4.2. Whitespace analysis
As mentioned above, in the recursive filter we focus more carefully on the structure of the text and examine the relationships between the CCi. Text is usually structured in rows or columns, and its dispersion within a homogeneous region is relatively low. Therefore, by using two important statistical properties, the median and the variance, we can identify the non-text elements in each horizontal homogeneous region. Suppose HRk is the region being considered and CCk ⊂ CCs is the set of connected components of HRk. Put
where CCsize(CCj), Hj, Wj are the area, height and width of CCj, respectively. In order to classify the text and non-text elements in a horizontal homogeneous region, we follow these steps.
CCi ∈ CCk is a non-text candidate if it satisfies
and one of the two following conditions:
where tj, j = 1,2,3 is the threshold for the difference between the size, height and width of CCi and their medians. The value of tj can be chosen as
In each homogeneous region, the difference between components is always small; in other words, their variance is usually small if the region contains only text elements. Equation (28) calculates tj by considering the relationship between the median and the mean: in statistics, if a distribution has finite variance, the distance between its median and mean is bounded by one standard deviation [29].
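A sketch of the candidate test is given below. The paper's exact inequalities and threshold formula (28) are not reproduced; as an illustration, and relying on the median-mean bound just mentioned, this sketch flags a component whose height or width deviates from the region median by more than one standard deviation. The function name and `t_factor` parameter are ours.

```python
import numpy as np

def non_text_candidates(ccs, t_factor=1.0):
    """Flag components whose height or width deviates from the region
    medians by more than t_factor standard deviations (an assumed
    stand-in for the thresholds t_j of equation (28))."""
    hs = np.array([c["height"] for c in ccs], dtype=float)
    ws = np.array([c["width"] for c in ccs], dtype=float)
    med_h, med_w = np.median(hs), np.median(ws)
    t_h = t_factor * hs.std()   # threshold for height deviation
    t_w = t_factor * ws.std()   # threshold for width deviation
    return [i for i, c in enumerate(ccs)
            if abs(c["height"] - med_h) > t_h
            or abs(c["width"] - med_w) > t_w]
```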
Suppose CCi is a non-text candidate found in step 1. Using the whitespace rectangle extraction (13), we find its left nearest neighbor LNN(CCi), its right nearest neighbor RNN(CCi), and the distances li = LNWS(CCi) and ri = RNWS(CCi) to them.
CCi is classified as a non-text element if
Or
where medianWS, meanWS, maxWS are the median, mean and maximum whitespace in the region under consideration, and numLN(CCi), numRN(CCi) are the total numbers of left and right nearest neighbors in each row of CCi. Experimentation showed that using the median generates the desired results, especially when there are many connected components in the region under consideration. If the region has little information, we can further analyze the min, max and variance of its elements.
2.4.3. Non-text elements removal
Let CCs′ ⊂ CCs, CCs′ ≠ ∅, be the set of non-text elements in all regions HRk found by the whitespace analysis stage. Applying (14), we remove the non-text elements to obtain a new text document: ∀CCi ∈ CCs′
and CCs = CCs ∖ CCs′.
2.5. Reshape regions and post processing
2.5.1. Reshape regions
When the whitespace analysis process cannot find any non-text component in the homogeneous regions, i.e. CCs′ = ∅, every HRk contains only text elements; we call these the text homogeneous regions HRk∗. Reshaping all regions by their coordinates gives the text document.
Then, the non-text document can be obtained by subtracting the text document from the original binary document.
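The reshape step can be sketched as below. This is a minimal illustration with names of our choosing: `text_regions` stands for the bounding boxes of the text homogeneous regions HRk∗, and the non-text document is computed as the original foreground minus the text foreground.

```python
import numpy as np

def split_documents(f, text_regions):
    """Keep the foreground of f inside the text homogeneous regions
    (boxes (x0, y0, x1, y1)) to form the text document; the non-text
    document is the remaining foreground of f."""
    text = np.zeros_like(f)
    for x0, y0, x1, y1 in text_regions:
        text[y0:y1 + 1, x0:x1 + 1] = f[y0:y1 + 1, x0:x1 + 1]
    ntext = np.logical_and(f, np.logical_not(text)).astype(f.dtype)
    return text, ntext
```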
2.5.2. Post processing
Binarized images always contain a lot of noise, especially when the original document has many large figures. Besides, a binary image often has missing components or unexpected components (components stuck together), especially for low-resolution or non-standard documents.
To reduce the noise, we first apply a morphological operation with a small kernel to the non-text document, and then fill all holes inside this image. On the other hand, we extract the bounding boxes B(CCi) of all connected components in the text document. Let CCstext be the set of all connected components CCi of the bounding box image of the text document,
Let CCsntext be the set containing all connected components CCj of the non-text document. ∀(x,y) ∈ a × b, if
Then, the final output of non-text document and the final output of text document can be calculated by
3. Experimental Results
3.1. System Environment and Database
Our system was implemented in MATLAB R2014a on a workstation with an Intel® Core™ i5-3470, 4 GB RAM, and Windows 7 64-bit. Our database consists of 114 document images with various layouts and languages, as in Fig. 4: 55 English documents from the ICDAR2009 page segmentation competition dataset [21] and 59 Korean documents from the dataset of the Diotek company [35].
Fig. 4. Examples from our database: (a-d) colored document images; (e-h) ground truth of each region type (black: text, gray: non-text); (i-l) text elements extracted from text regions; (m-p) non-text elements extracted from non-text regions
3.2. Method for Performance Evaluation
The performance of our method is evaluated with two measures: the percentage of ground-truth foreground pixels that are preserved (recall), and the percentage of pixels remaining after processing that belong to the foreground (precision). The balanced F-score (F-measure) is the harmonic mean of these two measures,
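The pixel-level evaluation just described can be sketched as follows; the function name is ours, and both inputs are binary arrays of the same shape.

```python
import numpy as np

def pixel_scores(result, ground_truth):
    """Pixel-level recall, precision and F-measure: recall is the
    fraction of ground-truth foreground preserved in the result,
    precision the fraction of result pixels that are true foreground,
    and F the harmonic mean of the two."""
    result = result.astype(bool)
    gt = ground_truth.astype(bool)
    tp = np.logical_and(result, gt).sum()
    recall = tp / gt.sum() if gt.sum() else 0.0
    precision = tp / result.sum() if result.sum() else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```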
3.3. Experimental Results and Evaluation Profile
Our method has been tested on many different datasets, and the results obtained are very encouraging. It does not depend on the language and still gives good results when the input document has a skew of less than 5 degrees; see Fig. 6, Fig. 7. We can also apply a skew detection algorithm [28] after the binarization step to estimate the skew of the document and deskew it before our algorithm is performed.
Fig. 5. F-measure comparison of our method with the four algorithms and two state-of-the-art methods with the highest performance.
Fig. 6.Example of proposed method with Korean document: (a,d) Input binary document, (b,e) Text document, (c,f) Non-text document.
Fig. 7.Example of proposed method with English document: (a,d) Input binary document, (b,e) Text document, (c,f) Non-text document.
For the evaluation, we first use the text document and non-text document datasets (Fig. 4) extracted from the ground truth of our database. The success rates of our method on these two datasets are shown in Table 1 and Table 2. Secondly, Table 3 and Table 4 show the success rates for the text regions and non-text regions (Fig. 4) given by the ground truth of our database. Before this evaluation, our text and non-text document results are smoothed by mathematical morphology. In the text document, we extract the homogeneous text regions (Section 2.4.1), and the morphology kernel in each region is calculated from the mean height and width of all elements in that region. In the non-text document, all elements are filled and smoothed with a small kernel (which depends on the size of each non-text element). Table 5 shows the success rates of our method over all foreground regions in the document.
Table 1.The success rates of text detection
Table 2.The success rates of non-text detection
Table 3.The success rates of text region
Table 4.The success rates of non-text region
Table 5.The success rates of overall region
According to the experimental results, the success rate on Korean documents is lower than expected. This is because the Korean dataset contains many noisy documents, so the binarization result is quite poor in some cases. Besides, the ground truth of this dataset always returns a rectangular shape (a segmentation result) for each region, as in Fig. 4(g), whereas our algorithm only provides an efficient classification of text and non-text. This also reduces the measured success rate, especially for the text evaluation.
Comparing document classification methods is always complicated, because results depend heavily on the dataset and ground truth, and there are many different datasets and evaluation procedures. In this paper, we use the ICDAR2009 page segmentation competition dataset [21] for evaluation, because it is published and very popular in our field. Even though this dataset was published in 2009, it is still a basis for evaluating document analysis algorithms. Due to the complexity of the layouts in this dataset, not many algorithms have been evaluated on it; up to now, the results of the competitors' algorithms on this dataset are the best known performances.
In 2013, Chen et al. [18] also evaluated their method on this dataset, but their results are still low and cannot match the best performance of the ICDAR2009 page segmentation competition.
We chose the four algorithms and two state-of-the-art methods with the highest results in [21] to compare with our algorithm. The F-measure evaluation profile is given in Fig. 5. Note that all methods in the ICDAR2009 page segmentation competition are full document layout analysis systems.
After performing many evaluation processes, we find our method very promising. The use of connected component analysis combined with a multilevel homogeneity structure is an effective way to classify text and non-text elements. The success rates for non-text detection and non-text regions are the highest, which demonstrates that our method can identify and classify most of the non-text elements in a document. Meanwhile, although we do not pay much attention to page segmentation (using just a simple mathematical morphology), the results for text regions are very encouraging. This also creates favorable conditions for subsequent document layout analysis.
4. Conclusion
In this paper, we proposed an efficient text and non-text classification method based on the combination of whitespace analysis and multi-layer homogeneous regions. In our approach, we first label the connected components and extract their properties. Then, a heuristic filter is applied to identify elements that are certainly non-text in the binary document. In the third stage, all elements in each homogeneous region are carefully classified by the recursive filter; not one but multiple layers of homogeneous regions are used to identify the non-text components, which makes this filter effective at separating text from non-text. Finally, all text homogeneous regions are reshaped by their coordinates to form the text document. A simple post-processing step using mathematical morphology removes noise and increases the performance of the proposed method.
Our algorithm is not too complicated and is easy to extend. Besides, it does not depend on the size of the document, meaning it can run on images of any resolution; for large document images, we can therefore reduce the resolution before running the algorithm to save computation time. The experimental results show that our algorithm achieves higher precision on images with a resolution greater than 1 megapixel. Like most document layout algorithms, our method is sensitive to skewed documents, because a top-down approach is used in our system. However, skew correction is a separate field of interest: in further document layout processing, OCR reading systems also require deskewed documents, and all published, well-known document image datasets are non-skewed, so our goal is to focus on the complexity of the document layout.
Experimental results on ICDAR2009 and other databases show high performance and are very encouraging. The proposed method performs well not only on the English dataset but also on the Korean dataset, which many other algorithms cannot handle. In future work, we will implement page segmentation: the text document will be separated into corresponding text regions, and the elements of the non-text document will be classified in more detail (i.e. figure regions, table regions, separator regions, etc.).