Custom data generators in TensorFlow
How to use a custom generator in TensorFlow and the benefits it brings
MACHINE LEARNING
9/11/2023 · 4 min read


Introduction
In the ever-evolving landscape of machine learning and data science, flexibility is not just a perk; it's a necessity. Engineers and researchers continually face unique challenges that require custom-tailored solutions, and TensorFlow is one of the libraries with the flexibility needed to meet them. You may be familiar with TensorFlow's high-level APIs like Keras, which provide a more accessible entry point into machine learning. These APIs are fantastic for getting up and running quickly, but what happens when you need to tweak your training pipeline for a specialized loss function or implement class balancing at the batch level? That's where custom TensorFlow generators come into play.
In this article, we'll delve deep into the architecture of TensorFlow's generator structures, demystify the mechanics behind them, and demonstrate how you can bend them to your will. Whether you're dealing with non-standard input data, implementing complex sampling techniques, or simply trying to optimize your model's performance down to the last decimal point, understanding how to craft custom generators in TensorFlow will empower you to tackle a wide array of problems head-on.
The fit method and Keras generators
The standard fit method in Keras is often the go-to option for many due to its simplicity and ease of use. By just providing your training and validation data sets, you can set your model to train with minimal lines of code. This high-level API manages batching, epochs, shuffling, and reporting under the hood, making it a convenient tool for quick prototyping and straightforward machine learning tasks. Keras also allows for the creation of image data generators, providing some flexibility in data augmentation and preprocessing. However, these built-in generators come with their own set of constraints and often lack the ability to implement highly customized data manipulation routines. For those who require greater control over data sampling or specialized augmentations, the fit method and Keras generators may fall short of expectations. Let's have a look at the following two examples:
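A minimal sketch of the two approaches: a plain fit call, and a built-in Keras image data generator. Here `model`, the training arrays, and the data directory are placeholders for your own setup, not code from this article.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Example 1: the standard fit call. `model`, `x_train`, `y_train`,
# `x_val`, and `y_val` are placeholders for your own model and data.
model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    batch_size=32,
    epochs=10,
    shuffle=True,
)

# Example 2: a built-in Keras image data generator with basic augmentations.
datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=15,
    horizontal_flip=True,
    validation_split=0.2,
)
train_flow = datagen.flow_from_directory(
    "data/images",            # hypothetical directory of class subfolders
    target_size=(224, 224),
    batch_size=32,
    subset="training",
)
model.fit(train_flow, epochs=10)
```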
In these examples, the allure of the standard fit method and Keras generators lies in their simplicity and user-friendliness. However, this ease comes at the cost of customizability, particularly when intricate data manipulation is required. With a custom generator, you gain the flexibility to integrate third-party packages like Albumentations for sophisticated image augmentations. Furthermore, it offers unparalleled control over the composition of your training batches, allowing you to dynamically balance classes or implement complex sampling strategies. This opens up a realm of possibilities for fine-tuning your machine-learning models to suit specialized or challenging scenarios.
An example of a custom image generator
Let's suppose we have available a Pandas DataFrame with at least two columns, 'label' and 'img', where 'label' is the actual label we want to predict and 'img' is a base64-encoded image. An example of a custom image generator can be:
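A minimal sketch matching the walkthrough below. The helper `decode_img`, the probability constant `EVENT`, and the module-level `percentage_in_batch` and `transforms` objects (both shown further down) are assumptions rather than the exact original code.

```python
import base64
import random

import cv2
import numpy as np

# Probability of applying the augmentation pipeline; the name EVENT and
# its value are assumptions.
EVENT = 0.5

def decode_img(b64_string):
    """Decode a base64-encoded image string into a BGR NumPy array."""
    buf = np.frombuffer(base64.b64decode(b64_string), dtype=np.uint8)
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)

def image_generator(df, batch_size, sz, is_training):
    # `sz` is kept for API symmetry but unused here: the images are assumed
    # to already share a common shape, so they can be stacked into one array.
    while True:
        batch_x, batch_y = [], []
        if is_training:
            # Build a class-balanced batch according to percentage_in_batch.
            for label, pct in percentage_in_batch.items():
                n = int(batch_size * pct)
                rows = df[df["label"] == label].sample(n, replace=True)
                for _, row in rows.iterrows():
                    img = decode_img(row["img"])
                    # A random event decides whether to augment this image.
                    if random.random() < EVENT:
                        img = transforms(image=img)["image"]
                    # Skip images that came out fully black (all zeros).
                    if img.any():
                        batch_x.append(img)
                        batch_y.append(label)
        else:
            # Validation: plain random sampling, no augmentation.
            for _, row in df.sample(batch_size).iterrows():
                batch_x.append(decode_img(row["img"]))
                batch_y.append(row["label"])
        # Labels are yielded as-is; encode them upstream if the model
        # expects numeric targets.
        yield np.array(batch_x), np.array(batch_y)
```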
The image_generator function is designed to produce batches of image data for training or validation. It takes as input:
df: a dataframe containing image information and labels.
batch_size: the size of each generated batch.
sz: the size to which images should be resized (though this parameter isn't used in the code).
is_training: a boolean flag to indicate if the generator is used for training or validation.
Training Mode
When is_training=True:
The function iterates through each class, as defined by percentage_in_batch, selecting a certain number of samples from the dataframe df.
For each selected row, it decodes the base64-encoded image to a NumPy array.
A random event (EVENT) determines whether the image should undergo additional transformations using Albumentations.
If the transformed image is not a black image (all zeros), it is appended to the batch_x list, and its corresponding label is appended to batch_y.
Validation Mode
When is_training=False:
The function randomly samples batch_size rows from the dataframe df.
For each row, it decodes the base64-encoded image and appends it and its label to batch_x and batch_y, respectively.
Final Steps
Finally, the generator yields batches of batch_x and batch_y as NumPy arrays. These batches can be fed directly into a machine learning model for training or validation.
This custom generator allows for advanced data augmentation strategies, flexible sampling of classes, and the ability to use other Python libraries, thereby providing granular control over the data pipeline.
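As a quick sanity check, you can pull a single batch by hand before wiring the generator into training (`train_df` is a placeholder for your own DataFrame):

```python
# Pull one training batch from the generator and inspect its shapes.
gen = image_generator(train_df, batch_size=32, sz=224, is_training=True)
batch_x, batch_y = next(gen)
print(batch_x.shape, batch_y.shape)
```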
For completeness, this is the transforms object:
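A plausible definition using the Albumentations package; the specific transforms and their parameters here are illustrative assumptions, not necessarily the ones used originally.

```python
import albumentations as A

# A shared augmentation pipeline applied inside the generator.
transforms = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.3),
    A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1,
                       rotate_limit=15, p=0.5),
    A.GaussNoise(p=0.2),
])
```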
One critical feature of this custom data generator is the percentage_in_batch dictionary, which allows for fine-grained control over the class distribution within each batch. The dictionary would look something like this:
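For instance, with four hypothetical classes 'A' through 'D' in equal proportion:

```python
# Hypothetical class labels; the fractions must sum to 1.
percentage_in_batch = {
    "A": 0.25,
    "B": 0.25,
    "C": 0.25,
    "D": 0.25,
}
```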
Note that the sum of all percentages should equal 1 to ensure a complete and balanced batch. In the generator, these percentages are multiplied by batch_size to determine the number of samples of each class to include in the training batch; for example, with a batch size of 32 and four classes at 0.25 each, every batch contains 8 samples per class. This ensures that each batch is balanced at the batch level, not just across the entire dataset, which can be crucial for training stable and effective models.
Additionally, data augmentation techniques can be applied at the class level, offering an extra layer of customization. For instance, if class 'A' requires different augmentation strategies than class 'B', this can easily be implemented within the custom generator, as sketched below, providing a level of control not available in standard Keras generators.
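A hypothetical sketch of that idea: keep one pipeline per class and, inside the generator, look up `transforms_by_class[label]` instead of the shared `transforms` object.

```python
# Reuses `import albumentations as A` from above. Hypothetical per-class
# pipelines: geometric transforms for class "A", photometric for class "B".
transforms_by_class = {
    "A": A.Compose([A.HorizontalFlip(p=0.5), A.Rotate(limit=20, p=0.5)]),
    "B": A.Compose([A.RandomBrightnessContrast(p=0.5), A.GaussNoise(p=0.3)]),
}
```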
Required changes for the 'fit' function
To integrate our custom image_generator into the model training, we first calculate the number of steps per epoch for both the training and validation sets. This is accomplished by taking the total number of samples in each set and dividing it by the batch size. Then, we add 1 to ensure that any remaining samples are also included. With these steps calculated, we proceed to the fit function, which now takes our custom generators (train_generator for training and val_generator for validation) as its data sources. Here's how you set it up:
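A sketch under the assumption that the training and validation DataFrames are called `train_df` and `val_df`, and that `model` is already built and compiled:

```python
batch_size = 32  # assumed value

# Integer division plus one extra step to cover any final partial batch.
steps_per_epoch = len(train_df) // batch_size + 1
validation_steps = len(val_df) // batch_size + 1

train_generator = image_generator(train_df, batch_size, sz=224, is_training=True)
val_generator = image_generator(val_df, batch_size, sz=224, is_training=False)

model.fit(
    train_generator,
    steps_per_epoch=steps_per_epoch,
    validation_data=val_generator,
    validation_steps=validation_steps,
    epochs=10,
)
```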
This setup leverages all the custom features we've built into our generator—class-level balancing, complex data augmentations, and more—right within the standard TensorFlow fit function. The best of both worlds!