Racing Bib Number Recognition Method using Deep Learning

Mass running events have gained popularity ever since recreational running became common, as they are often held annually by various organizers. Because image documentation plays a huge part in showcasing an event, many thousands of images are generated during it, and a participant is unlikely to find an image of themselves among them. To solve this problem, image annotation can be performed to tag images with specific participant attributes such as the racing bib number (RBN). Manually annotating thousands of images would be inefficient in both time and labor. To tackle this problem, this paper proposes an automatic image annotation system using a YOLOv3-based RBN recognition method. The experiment results show 83.0% precision, 81.5% recall, and an 82.2% F1 score for our proposed method on a running event dataset. The implemented method therefore promotes efficiency in solving the image annotation problem, because it does not require manual annotation over thousands of running event images.


INTRODUCTION
Running was originally practiced only by competitive athletes through extracurricular school and university programs. Recreational running has been common since the late 1960s, as more people enjoy running as a leisure-time pursuit, and it can be characterized as a mass movement [1]. From its roots in the US, recreational running has spread across continents, with participant counts growing over time. This boost in popularity encourages the organizers of running events to attract more participants with various benefits such as a medallion, a shirt, and documentation as memorabilia for participants to take home. Documentation plays a big part in showcasing the event after it is done. With the huge amount of documentation generated from one event, a participant is unlikely to find a picture of themselves, as it is scattered among thousands of images.
To address this problem, we can annotate images with participant attributes such as the RBN. An RBN is a unique number composed of multiple numerical digits used as a participant identity. The RBN can then be used as a tag that allows participants to find their related images with ease. Automatic image annotation can eliminate the inefficiency in labor and time caused by manual annotation of an enormous amount of image data. The idea of recognizing the numerical digits within an RBN and using them as annotations for running event images forms the basis of the problem. With the rapid growth of machine learning research, this problem can be solved using state-of-the-art methods.
Much research has been done on multi-digit recognition for a variety of similar problems. Liu et al. designed an R-CNN based framework as an all-in-one solution for person detection, body keypoint prediction, and jersey number recognition to identify players within sport match videos [2]. Goodfellow et al. used a CNN-based model to recognize multi-digit numbers from street numbers using the Street View House Numbers (SVHN) dataset [3]. To solve the jersey number recognition problem, Li et al. used a CNN model along with a Spatial Transformer Network (STN) to further boost the CNN model's performance [4].
Because our problem is mainly about RBN recognition, numerous RBN recognition studies have also been done with various methods. Ben-Ami et al., the earliest to research RBN recognition, used face detection to approximate the RBN position, and combined the stroke width transform (SWT) and optical character recognition (OCR) to identify the numerical digits within that position [5]. One of its limitations is that when the participant's face is not detected in the image, due to an inconsistent photo angle, the RBN recognition cannot be done. Following Ben-Ami et al.'s work, Roy et al. proposed a multi-modal technique that combines face, skin, and text detection to extract text candidate regions: since a participant generally displays skin, skin is used as a cue to identify the person's body, which gives the general position of the RBN within an image [6]. The extracted text candidate region was processed with a text detection technique that utilizes a combination of wavelet and color features and k-means clustering to detect text lines. The detected text lines were then binarized, and the binarized result was processed by the Tesseract OCR engine to recognize the RBN. Shivakumara et al. also proposed a multi-modal technique that combines torso and text detection for RBN detection and eliminates the need for face detection, as many marathon images do not show face features well [7]. The torso detection technique used is a combination of a linear support vector machine (SVM) classifier over histogram of oriented gradients (HOG) features and an appearance model based on a pictorial structural model (PSM), which takes the SVM result and outputs the required torso region.
The region output from torso detection was then processed with a HOG-based text detection method to extract candidate text region height and width information using an SVM based on Fourier features, Pseudo-Zernike moments, a polar descriptor, and T-HOG. The extracted text region was then used as input for scene text segmentation based on an inverse rendering method, which binarized the RBN while preserving character shapes to improve the recognition result of the Tesseract OCR engine. Apap et al. proposed a two-stage method for RBN detection and recognition based on a deep learning approach [8]. The first stage, the detection stage, used convolutional-neural-network-based image segmentation to localize RBNs in complex marathon images and extract a segmentation map as the result. The second stage, the recognition stage, used a convolutional recurrent neural network (CRNN) composed of convolutional layers for feature extraction, two bidirectional gated recurrent units (GRUs) as the recurrent layer, and a transcription module that used the bidirectional GRU output to produce the number sequence [9]. Apap et al.'s deep learning model eliminates the need for torso and face detection while still achieving a good F-score on complex images. Wong et al. proposed a deep learning cascade model that implements the YOLOv3 algorithm and a CRNN model to solve the RBN recognition problem, as work to increase efficiency in sorting marathon images [10]. In Wong et al.'s model, the YOLOv3 algorithm was used to detect the runner, racing bib, and numbers within images to produce the required input for the CRNN model. The YOLOv3 result was then used by the CRNN in a recognition process to generate a label sequence containing the RBN. The proposed network's prediction performance was evaluated using the YOLOv3 algorithm's output.
The evaluation was done by calculating the correct class predictions and the mean average precision (mAP), measuring how well the predicted bounding boxes produced by the object detection model compare to the ground truth bounding boxes.
Our proposed method uses a state-of-the-art object detection model based on the YOLOv3 algorithm to solve the RBN recognition problem. You Only Look Once (YOLO) is a unified object detection model proposed by Redmon et al. that uses a single convolutional network to predict multiple bounding boxes and the class probabilities of those boxes at the same time [11]. At test time, YOLO inference is fast because the model is not composed of a complex pipeline, unlike methods such as R-CNN that use region proposals [12]. However, the original YOLO model lacked accuracy in localizing small objects. Over the following years, the YOLO authors made incremental improvements to the model, called YOLO9000 (YOLOv2) and YOLOv3 [13] [14]. These incremental improvements give YOLO an mAP improvement on smaller objects, and the use of their latest network architecture also contributed to the mAP improvement. Darknet-53, a feature extraction network deeper than the Darknet-19 used in YOLOv2, was adopted in YOLOv3 as one of many improvements over the original YOLO model. Darknet-53 has 53 convolutional layers with the addition of shortcut connections between layers; it performs much more powerfully than Darknet-19 while remaining more efficient than ResNet-101 or ResNet-152 on the ImageNet dataset. In this work, we try to eliminate the use of CRNN and OCR as the number recognition technique in RBN recognition, as previous works often did. This is done without sacrificing prediction accuracy, by using the state-of-the-art YOLOv3 model with an additional algorithm that produces a numerical digit sequence as the RBN label for image annotation. The proposed model is optimized for the racing bib design and font used in our dataset only; different designs and fonts used in other racing bibs are not considered in this work.

RESEARCH METHODOLOGY
In this work, the research method consists of several steps, as shown in Figure 1. Our proposed RBN recognition method uses two YOLOv3 models that detect racing bibs and numerical digits separately. The output from each model is then combined in the RBN recognition step to produce a complete RBN label for image annotation.

Figure 2. A Subset of Our Dataset
As mentioned in the introduction, the dataset in this research consists of images from a marathon event. We collected 2194 raw images taken from various angles, in JPG format with two width-and-height resolutions, 3068 x 2048 and 1367 x 2048, from an event called BNI ITB Ultra Marathon 2019. We chose this event because it provides a sufficient amount of image data to train our YOLOv3 models. Each image gathered for our dataset has at least one visible racing bib that is mostly unobstructed. This was done to provide more data for training the racing bib and numerical digit detection models. A subset of the images we collected is shown in Figure 2.

Dataset annotation
Because YOLOv3 training requires a ground-truth bounding box coordinate and a class label for each object in the dataset, we manually annotated each image using labelImg, a Python-based graphical image annotation tool specifically made to annotate bounding boxes and classes for image datasets used in YOLO models. We split the dataset into two annotation types: racing bib detection and numerical digit detection. We define the names of the object classes in a text file named "class.txt" in each dataset directory. For racing bib detection, we annotated each image with a class and bounding box over each fully identified or obstructed racing bib, labeled 'valid' and 'invalid' respectively, giving the class set Cbib = {'valid', 'invalid'}. This trains the model to avoid mixing obstructed racing bibs into the detection process. For numerical digit detection, we annotated each image with a class and bounding box over each identified numerical digit within the RBN, from 0 to 9, giving the class set Cnumdigit = {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'}. After the annotation process was done, each image has a text file listing the ground truth bounding box annotation of each object within the image.
For the YOLOv3 model to make predictions, we need to pre-define several anchor box sizes as the dimension clusters the model uses to generate predictions at its network parameters, following equations (1) and (2) from the YOLO author. Anchor boxes can be generated using k-means clustering over the object bounding box sizes within our image dataset. Before each model training, we have to initialize the weights of our network model. This was done by utilizing pre-trained weights for the Darknet-53 network architecture. By using pre-trained weights as a starting point, we eliminated the work of finding an optimal weight initialization method.
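The anchor generation step described above can be sketched as a minimal k-means over the (width, height) pairs of the annotated bounding boxes. This is an illustrative sketch only: it uses plain Euclidean distance for simplicity, whereas the YOLO authors cluster with a 1 − IoU distance.

```python
import random

def kmeans_anchors(box_sizes, k, iters=100, seed=0):
    """Cluster (width, height) pairs into k anchor sizes.

    box_sizes: list of (w, h) tuples, e.g. in pixels.
    Plain Euclidean k-means for illustration; the YOLO
    authors use a 1 - IoU distance instead.
    """
    rng = random.Random(seed)
    centers = rng.sample(box_sizes, k)
    for _ in range(iters):
        # Assign every box to its nearest current center.
        clusters = [[] for _ in range(k)]
        for w, h in box_sizes:
            i = min(range(k),
                    key=lambda c: (w - centers[c][0]) ** 2
                                  + (h - centers[c][1]) ** 2)
            clusters[i].append((w, h))
        # Recompute each center as the mean of its cluster.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                new_centers.append(
                    (sum(w for w, _ in members) / len(members),
                     sum(h for _, h in members) / len(members)))
            else:
                new_centers.append(centers[i])
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)
```

The resulting centers would then be written into the model configuration as the anchor sizes.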
As mentioned before, Darknet-53 has 53 convolutional layers with the addition of shortcut connections at the residual layers. The network uses a leaky ReLU activation unit for the convolutional layers and a linear activation unit for the residual layers; the leaky ReLU function is shown in equation (3), where α is a small constant.
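Equation (3) can be written out as a one-line function: positive inputs pass through unchanged, negative inputs are scaled by the small constant α (Darknet's convolutional layers use α = 0.1).

```python
def leaky_relu(x, alpha=0.1):
    """Leaky ReLU as in equation (3): f(x) = x for x > 0,
    f(x) = alpha * x otherwise, with alpha a small constant."""
    return x if x > 0 else alpha * x
```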
The model performs backpropagation using stochastic gradient descent at each convolutional layer to compute the gradients needed to adjust the weights of our network model. This is done using a binary cross-entropy loss function calculated at each iteration. To further generalize our model to a wider variety of data, the YOLOv3 network model uses a random network size every 10th iteration. Bounding box and class predictions in the YOLOv3 network model are made using logistic regression at the end of the network.
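The binary cross-entropy loss mentioned above can be sketched as follows; this is a generic formulation, not the paper's training code, and the clipping constant `eps` is an assumption added to keep the logarithm finite.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over a batch.

    y_true: ground-truth labels (0 or 1).
    y_pred: predicted probabilities in (0, 1).
    eps clips predictions away from 0 and 1 so log() is finite.
    """
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)
```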

Training Evaluation
Training evaluation is key to measuring how well our model performs on the test data. To evaluate our network model's training performance, we used an object detection metric called mAP. The mAP is the mean of the average precision (AP) over each object class in our model. AP is an object detector evaluation metric used at the PASCAL VOC Challenge [15]. AP summarizes the shape of the precision/recall curve and is calculated as the mean of the interpolated precision over 11 recall levels [0, 0.1, ..., 1]. Precision is the ratio of true object predictions to the total number of object predictions made by our model. Recall is the ratio of true object predictions to the total number of objects in the ground truth labels. The precision (p), recall (r), and AP calculations are shown in equations (4), (5), and (6):
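The 11-point interpolated AP of equation (6) can be sketched directly from the definition: for each recall level r in {0, 0.1, ..., 1}, take the maximum precision observed at any recall ≥ r, then average the 11 values.

```python
def eleven_point_ap(recalls, precisions):
    """PASCAL VOC 11-point interpolated average precision.

    recalls, precisions: parallel lists tracing the
    precision/recall curve of one object class.
    """
    ap = 0.0
    for i in range(11):
        r = i / 10
        # Interpolated precision: best precision at recall >= r.
        candidates = [p for rec, p in zip(recalls, precisions)
                      if rec >= r]
        ap += max(candidates) if candidates else 0.0
    return ap / 11
```

The mAP is then simply the mean of this value over all object classes.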

Component     Description
<class-id>    The predicted bounding box class, as the integer index of its order in "class.txt"
<x-center>    The x-center coordinate of the object bounding box, relative to the image resolution
<y-center>    The y-center coordinate of the object bounding box, relative to the image resolution
<width>       The width of the object bounding box, relative to the image resolution
<height>      The height of the object bounding box, relative to the image resolution

To determine whether an object prediction is true or false, intersection over union (IoU), also known as the Jaccard index [16], was used to measure the ratio between the intersection and the union of the predicted boxes and the ground truth boxes. We then set an IoU threshold above which an object prediction is counted as true. How IoU was calculated in this work is shown in equation (7). On the bottom line, the mAP can be used as an indicator of model training performance over all object classes. If the mAP is low, the model training is not going well, due to low generalization over the test data or an incorrect choice of network parameters.
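The IoU ratio of equation (7) can be sketched as below. For simplicity the sketch takes boxes in (x_min, y_min, x_max, y_max) corner form, which the center/width/height annotation format above can be converted to.

```python
def iou(box_a, box_b):
    """Intersection over union for two boxes given as
    (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero: non-overlapping boxes intersect nowhere.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A prediction whose IoU with a ground truth box exceeds the chosen threshold (0.5 in this work) counts as a true prediction.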
RBN recognition is the core of our method for recognizing an RBN within an image. We predict the RBN annotation label for each racing bib detected in our test data based on the inference results of the trained racing bib and numerical digit detection models. The RBN recognition scheme used in this work is shown in Figure 3, and Figure 4 further illustrates how it works. First, we used the trained weights that produced the best mAP score during model training for the model inference step. Model inference outputs a list of predictions consisting of the class and the bounding box coordinates and size of each detected object within an image; these output components are the same as in our image dataset annotation. Because we process only one image at a time during inference, unlike during model training, we then tuned the network size of our model to a much bigger value to increase the mAP of the inference predictions.
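The combination step can be sketched as follows. The source does not spell out the exact arrangement rule, so this is an assumed minimal version: each detected digit whose center falls inside a 'valid' bib box is assigned to that bib, the digits are sorted left to right, and their class labels are concatenated into the RBN string.

```python
def arrange_rbn(bib_boxes, digit_boxes):
    """Combine bib and digit detections into RBN strings.

    bib_boxes:   list of (x_min, y_min, x_max, y_max) boxes
                 the bib model labeled 'valid'.
    digit_boxes: list of (digit_str, x_min, y_min, x_max, y_max)
                 from the numerical digit model.
    """
    labels = []
    for bx1, by1, bx2, by2 in bib_boxes:
        inside = []
        for digit, dx1, dy1, dx2, dy2 in digit_boxes:
            cx = (dx1 + dx2) / 2
            cy = (dy1 + dy2) / 2
            # Keep digits whose centers lie inside this bib box.
            if bx1 <= cx <= bx2 and by1 <= cy <= by2:
                inside.append((cx, digit))
        if inside:
            # Sort left to right and concatenate the digits.
            labels.append(''.join(d for _, d in sorted(inside)))
    return labels
```

Note how a missed bib box drops all of its digits, matching the failure mode discussed in the evaluation.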

RBN Recognition Evaluation
Our main goal in this work is to create a model that can efficiently perform automatic image annotation over the detected RBNs within an image. To ensure the model is capable of predicting RBN labels from an image, we need to evaluate the model's RBN predictions against the ground-truth RBN labels from our image dataset. Precision, recall, and the F1 score are used to measure how accurately RBN predictions were made and how many of the true labels were found. The F1 score describes the balance between precision and recall in the RBN predictions, so it serves as the final evaluation for our RBN recognition method.
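These three metrics can be sketched over lists of RBN strings. The matching rule below (a prediction is true only when it exactly equals a yet-unmatched ground-truth label) is an assumption, since the source does not spell it out.

```python
def rbn_prf1(ground_truth, predicted):
    """Precision, recall, and F1 for RBN label prediction.

    ground_truth, predicted: lists of RBN strings for one image
    (or pooled over the whole test set).
    """
    tp = 0
    remaining = list(ground_truth)
    for label in predicted:
        # Exact-match rule (assumed); each GT label matches once.
        if label in remaining:
            remaining.remove(label)
            tp += 1
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Under this rule a partial prediction such as "1745" for the label "61745" counts as both a false positive and a missed label, lowering precision and recall at once.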

Dataset and Hardware Specification
We performed dataset annotation on 2194 images from various angles obtained from the running event. After the dataset annotation was done, the objects annotated within our dataset consisted of 11788 numerical digit and 2733 racing bib bounding box annotations. A subset of bounding box annotations and images from the numerical digits dataset is shown in Table 2 and Figure 4.

Figure 5. A Subset Of Bounding Box Annotated Image
The most significant hardware for our research is an RTX GPU, whose tensor cores speed up model training and inference compared to other GPUs. The computer testing setup is listed in Table 3.

Model Training and Evaluation
First, we trained our numerical digit and racing bib detection models separately. Both models were trained on the same 80%-20% training and test dataset split, using mAP (mean average precision) to evaluate network model performance. Before training, we customized some network parameters to achieve the best results for our dataset, including the number of convolutional filters, the number of iterations, the network size, the number of images within a batch, the learning rate, the IoU threshold, the anchor sizes, and the network architecture. The numbers of classes used in the racing bib and numerical digit detection models were 2 and 10 respectively. As mentioned before, the number of convolutional filters and the number of iterations are calculated from the total number of classes used in a model. For the racing bib detection model, we used 21 convolutional filters before the YOLO layer and 4000 iterations to train the model; for the numerical digit detection model, we used 45 convolutional filters before the YOLO layer and 20000 iterations. Other parameters were also tuned for both models to achieve the best mAP on our dataset. We used a 640x640 network size, meaning that our image dataset is resized to fit the network size. The batch defines how many images are used in each iteration of backpropagation, mainly to speed up and generalize training; we used 64 images per batch. The adjustment of both the network size and the batch size was constrained by GPU memory capacity, to prevent it from being overloaded. We used a learning rate of 0.001, because larger and smaller values did not improve our model. For the anchor sizes, we tried the anchors calculated by running k-means clustering over the object bounding boxes in our dataset; however, training with the customized anchor sizes did not perform well, so we used the default YOLO anchor sizes instead.
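The filter and iteration counts above match the usual Darknet convention, which can be sketched as arithmetic: each of the 3 anchor boxes per YOLO layer predicts 4 box coordinates, 1 objectness score, and one score per class, and training runs 2000 iterations per class. The source does not state this rule explicitly, but both models' numbers agree with it.

```python
def yolo_layer_params(num_classes):
    """Darknet convention for class-dependent settings:
    filters = (classes + 4 box coords + 1 objectness) * 3 anchors,
    iterations = 2000 per class."""
    filters = (num_classes + 5) * 3
    max_batches = num_classes * 2000
    return filters, max_batches
```

With 2 classes this gives (21, 4000) for the racing bib model and with 10 classes (45, 20000) for the numerical digit model, matching the values used above.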
We set the IoU threshold to 0.5 during model training. We also tested a few different network architectures to see how Darknet-53 compares to the others. While testing the ResNeXt-50 architecture, GPU memory usage was huge, so we had to tune the batch size down to 16. The weights of the network model are updated at every iteration, and the trained weights are saved every 1000th iteration as a backup. Training the racing bib and numerical digit detection models took roughly 5 hours and 49 hours respectively. The training graphs for the racing bib and numerical digit detection models, with their evaluation, are shown in Figure 6. The graphs show the model training performance, with the blue dots as the model's average loss and the red line as the model's mAP on the test data. The numerical digit and racing bib detection models peaked at 84% and 79% mAP with our customized network parameter settings. Despite the decrease in average loss toward the end of training, the mAP on the test data did not increase significantly. To avoid model overfitting, we take the trained weights that produce the best mAP on the test data for model inference.

RBN Recognition and Evaluation
At the RBN recognition step, the first thing we did was perform model inference for both models on the test data. Before inference, we increased the network size of our models to two different sizes for comparison, 864x864 and 1280x1280. Figure 7 shows a subset of both models' inference results on the test data. After performing inference with both models, we took the detected class labels and bounding box coordinates from both inference results and combined them with the RBN bounding box arrangement algorithm to produce a complete RBN label. Figure 8 shows the result of our RBN recognition method annotated onto the image. To evaluate the proposed method's inference results combined with the RBN bounding box arrangement algorithm, we calculated recall, precision, and F1 score using the ground truth RBN labels and the predicted RBN labels at the different network sizes, as shown in Table 5. Tweaking the network size to the two different values shows a minor increase in precision and a decrease in recall; this is due to the decreased chance of false positives being detected by the racing bib and numerical digit detection models as the network size increases. Based on the results, our proposed method still suffers from the same problem as previous works: incorrect RBN predictions, such as "1745" where it should be "61745", as shown in Figure 9. Undetected racing bib regions also occur during racing bib model inference; as a result, numerical digits detected within an undetected racing bib region are not processed by the RBN bounding box arrangement algorithm.

CONCLUSION
RBN annotation on running event images is often done manually, which requires a lot of hard labor and time when images are available in abundance. As work to automate a simple task and promote work efficiency, we can use machine learning to minimize work inefficiency with a deep-learning-based method, by constructing a two-step structure that implements object detection and recognition. We implemented YOLOv3 with its Darknet-53 neural network architecture as the backbone of the detection and recognition models to detect and recognize the numerical digits and racing bib regions for the RBN recognition method. This paper proposed a YOLOv3-based RBN recognition method for an efficient deep-learning-based RBN annotation system, which achieved an 82.2% F1 score, higher than the previous work using YOLOv3 + CRNN, which achieved a 64.25% F1 score. The proposed method provides an efficient RBN annotation method that recognizes the numerical digits on a racing bib within an image without the use of OCR. As for future work, the problems that occurred in both numerical digit and racing bib region detection were mainly caused by the small dataset, as the object detection model was unable to generalize to more varied data. Using more images in the dataset and more computing power for bigger network input sizes may improve overall network performance.