Continuous annotation of user data is a challenge when deploying large-scale NLU techniques in commercial applications. Models need to be re-trained and updated to keep performance at peak levels, but the process is expensive, labor-intensive, and time-consuming. Moreover, given growing privacy concerns, the manual review of user data that annotation requires is far from ideal.
Researchers from Amazon and the University of Massachusetts Lowell have proposed a generative model to produce labeled synthetic data. The idea is to improve the robustness and performance of NLU models by generating synthetic utterances and augmenting the original training data with them.
Synthetic augmentation with GIT
The Generative Insertion Transformer (GIT) is based on the non-autoregressive Insertion Transformer and extends it to solve the inverse NLU problem: given an annotation pattern, it produces a valid labeled utterance that matches that annotation.
In this generative model, the decoder generates a sequence by inserting tokens between previously generated tokens: carrier tokens are iteratively inserted between the label tokens. The insertion at each position of the utterance is independent of all other positions and stops once the EOS token is generated at every position, yielding a fully annotated synthetic utterance that can be pooled directly with real data for model building.
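The insertion loop described above can be sketched in plain Python. This is a toy illustration of the assumed decoding mechanics, not the actual GIT implementation: `propose` stands in for the trained model and is replaced here by a hypothetical rule table.

```python
# Toy sketch of non-autoregressive insertion decoding. A "model" proposes
# one token for each gap between current tokens; EOS means "insert nothing
# at this gap". Decoding stops when every gap proposes EOS.
EOS = "<eos>"

def insertion_decode(template, propose):
    """Iteratively insert carrier tokens into the gaps of `template`."""
    tokens = list(template)
    while True:
        # One proposal per gap; there are len(tokens) + 1 gaps.
        proposals = [propose(tokens, i) for i in range(len(tokens) + 1)]
        if all(p == EOS for p in proposals):
            return tokens
        # Insert right-to-left so earlier gap indices stay valid.
        for i in reversed(range(len(proposals))):
            if proposals[i] != EOS:
                tokens.insert(i, proposals[i])

# Hypothetical deterministic "model": a rule table mapping the tokens on
# either side of a gap to the carrier token to insert there.
RULES = {
    ("<start>", "jazz"): "play",
    ("jazz", "<end>"): "music",
}

def toy_propose(tokens, i):
    left = tokens[i - 1] if i > 0 else "<start>"
    right = tokens[i] if i < len(tokens) else "<end>"
    return RULES.get((left, right), EOS)

utterance = insertion_decode(["jazz"], toy_propose)
```

Starting from the slot-only template `["jazz"]`, the loop fills in carrier tokens until every gap proposes EOS, producing a complete utterance.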
The process can be divided into three sections:
Pre-training: GIT is pre-trained with a BERT encoder and the KERMIT objective on an unsupervised LM task: given a sentence with some tokens hidden, GIT is trained to insert the hidden tokens. Two pre-training configurations are evaluated:
- Pre-training using only English Wikipedia
- Pre-training on an internal corpus of 800 million unlabeled utterances randomly sampled from anonymized Alexa queries, initialized from the Wikipedia-pre-trained model.
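The pre-training task above can be illustrated with a minimal data-preparation sketch. The token-dropping scheme and the gap-index encoding of targets are assumptions for illustration; the paper's exact masking procedure may differ.

```python
import random

# Sketch of an insertion-LM training example: drop a random subset of
# tokens from a sentence; the model must learn to insert each dropped
# token back at the correct gap of the remaining sequence.
def make_insertion_example(tokens, drop_prob=0.5, rng=random):
    kept, targets = [], []
    for tok in tokens:
        if rng.random() < drop_prob:
            # Record the dropped token with the gap index (position in
            # the kept sequence) where it must be re-inserted.
            targets.append((len(kept), tok))
        else:
            kept.append(tok)
    return kept, targets

rng = random.Random(0)  # fixed seed for a reproducible example
sent = "the cat sat on the mat".split()
kept, targets = make_insertion_example(sent, rng=rng)
```

Here `kept` is the model input and `targets` lists the hidden tokens the model is trained to insert.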
Fine-tuning: The pre-trained GIT model is then fine-tuned for each domain using annotated real data. For each utterance, its template is provided as input and the complete utterance as the target output. During training, each insertion slot has multiple candidate tokens from the ground truth, unlike autoregressive generation, which has a single target token per generation step. The ground-truth distribution sets non-candidate token probabilities to zero and weights all candidate token probabilities evenly.
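The target distribution described above is simple to express. This sketch assumes a toy vocabulary; in the real model the distribution would be over the full subword vocabulary.

```python
# Ground-truth distribution for one insertion slot: uniform probability
# over the candidate tokens, zero for every other vocabulary token.
def target_distribution(vocab, candidates):
    cand = set(candidates)
    p = 1.0 / len(cand)
    return {tok: (p if tok in cand else 0.0) for tok in vocab}

vocab = ["play", "some", "jazz", "music", "<eos>"]
# Hypothetical slot where either "play" or "some" is a valid insertion.
dist = target_distribution(vocab, ["play", "some"])
```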
Generation: To generate synthetic data for NLU, a template is constructed that contains the desired intent, slot types, and slot values for the synthetic example. This seed sequence is provided as input to the decoder, which inserts carrier tokens iteratively to form a coherent utterance. The generation process thereby addresses both the label-projection and entity-control challenges. The templates used at inference are built from the reduced real data.
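The seed sequence fed to the decoder might be assembled as follows. The bracketed tag format and the function name are hypothetical; the paper's exact template encoding is not reproduced here.

```python
# Hypothetical seed-template construction: lay out the desired intent,
# slot types, and slot values as a token sequence, leaving the decoder
# only the job of inserting carrier tokens between them.
def build_seed(intent, slots):
    tokens = [f"<{intent}>"]
    for slot_type, value in slots:
        tokens += [f"<{slot_type}>", *value.split(), f"</{slot_type}>"]
    return tokens

seed = build_seed("PlayMusicIntent", [("genre", "smooth jazz")])
```

Because the slot labels travel with their values in the seed, the generated utterance comes out fully annotated, with no separate label-projection step.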
To study the effectiveness of synthetically generated data, NLU model performance was evaluated in a reduced-data regime. For each domain, several IC-NER models are built: one using all real data, one using a reduced set of real data, and one using a combination of reduced real and synthetic data. All models in a domain share the same training hyper-parameters, architecture, and encoder; they differ only in the composition of the training data.
The researchers demonstrated that data augmentation with GIT is a feasible technique for mitigating reduced annotation volumes on IC and NER tasks. NLU models trained on 33% real data plus synthetic data performed on par with models trained on the full real data. Additionally, on the domains with the highest SemER regressions, synthetic data quality was improved by filtering generated utterances with model confidence scores. Among the domains that benefit from synthetic data, the insertion of appropriate carrier tokens improved the semantics of utterances and their value as training samples. Future work includes data generation with entities replaced via knowledge-base sampling; such finer-grained control over entities supports the rollout of new features and improves customer privacy.
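The confidence-score filtering mentioned above can be sketched as a simple thresholding step. Here `score_fn` stands in for the trained NLU model's confidence score, replaced by a hypothetical lookup table, and the threshold value is illustrative.

```python
# Keep only synthetic utterances that the (assumed) NLU model scores
# above a confidence threshold; low-confidence generations are dropped.
def filter_synthetic(examples, score_fn, threshold=0.8):
    return [ex for ex in examples if score_fn(ex) >= threshold]

# Hypothetical confidence scores for two generated utterances.
scores = {"play jazz music": 0.95, "jazz play the": 0.30}
kept_synthetic = filter_synthetic(list(scores), scores.get, threshold=0.8)
```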