Real-time Text Tracking in Natural Scenes

Carlos Merino Gracia         Majid Mirmehdi


We present a system that automatically detects, recognises and tracks text in natural scenes in real time. The focus of our method is on large text found in outdoor environments, such as shop signs, street names, billboards and so on. Built on top of our previously developed techniques for scene text detection and orientation estimation, the main contribution of this work is to present a complete end-to-end scene text reading system based around text tracking. We propose to use a set of Unscented Kalman Filters (UKF) to maintain each text region's identity and to continuously track the homography transformation of the text into a fronto-parallel view, thereby being resilient to erratic camera motion and wide baseline changes in orientation. The system is designed for continuous, unsupervised operation in a handheld or wearable system over long periods of time. It is completely automatic and features quick failure recovery and interactive text reading. It is also highly parallelised to maximise usage of available processing power and achieve real-time operation. We demonstrate the performance of the system on sequences recorded in outdoor scenarios.


This work is part of our project to develop a text reading system for blind people.

The TextTrack dataset

These are the full video sequences of the samples in the paper that form the TextTrack dataset. The links to download the ground-truth data for the sequences are next to each video. A brief explanation of the ground-truth data format is given below.

Sequence HOSPITAL (Figure 7)





Sequence MERCHANT (Figure 8)



Sequence QUEEN (Figure 9)



The TextTrack ground-truth format

For each sequence there are two files available: the raw video file and the regions file. The format of the regions file is very simple: it is a text file in which each line is a semicolon-separated list of fields, where frame is the frame number; id is the region identity (which is maintained from frame to frame); (x0,y0) to (x3,y3) are the quadrilateral corner coordinates, defined clockwise starting at the top-left corner; and text is the ground-truth text.
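
As an illustration of how a regions file could be read, here is a minimal Python sketch. It assumes that the fields appear on each line in the order they are described above (frame, id, the eight corner coordinates as x0, y0, ..., x3, y3, then text), that coordinates are plain numbers, and that the text is the last field. The names Region, parse_region_line and group_by_id are hypothetical helpers introduced for this example and are not part of the dataset.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class Region:
    frame: int                          # frame number
    region_id: str                      # identity, kept as a string (assumption)
    corners: List[Tuple[float, float]]  # four (x, y) corners, clockwise from top left
    text: str                           # ground-truth text


def parse_region_line(line: str) -> Region:
    """Parse one line, assuming the order frame;id;x0;y0;...;x3;y3;text."""
    # Split into at most 11 parts so that semicolons inside the text survive.
    parts = line.rstrip("\r\n").split(";", 10)
    if len(parts) != 11:
        raise ValueError(f"expected 11 fields, got {len(parts)}: {line!r}")
    coords = [float(p) for p in parts[2:10]]
    # Pair up the flat list x0, y0, x1, y1, ... into (x, y) tuples.
    corners = list(zip(coords[0::2], coords[1::2]))
    return Region(int(parts[0]), parts[1].strip(), corners, parts[10])


def group_by_id(lines: Iterable[str]) -> Dict[str, List[Region]]:
    """Group parsed regions by identity, skipping blank lines."""
    tracks: Dict[str, List[Region]] = defaultdict(list)
    for line in lines:
        if line.strip():
            region = parse_region_line(line)
            tracks[region.region_id].append(region)
    return dict(tracks)


if __name__ == "__main__":
    # Made-up sample lines for illustration, not taken from the dataset.
    sample = [
        "12;3;10;20;110;20;110;60;10;60;HOSPITAL",
        "13;3;12;21;112;21;112;61;12;61;HOSPITAL",
    ]
    for region_id, regions in group_by_id(sample).items():
        print(region_id, [r.frame for r in regions], regions[0].text)
```

Grouping by id gives one list of quadrilaterals per text region across frames, which is the natural unit for comparing a tracker's output against the ground truth. If the actual files use a different field order or an integer-only id, only parse_region_line would need to change.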

Sequences from the qualitative study

These are the sequences used in the qualitative study (Figure 10).

Sequence CLIFTON (Figure 10a) 


Sequence HANNOVER (Figure 10b) 

Sequence WOLFGANG (Figure 10c)


Sequence BYRON PLACE (Figure 10d)

Sequence UOB (Figure 10e)