Assignment 7: Sequence-to-sequence networks
Deadline: Aug 17, 08:00 CEST.
In this assignment we are going to experiment with sequence-to-sequence (or seq2seq) encoder-decoder networks. For simplicity, we will build a basic seq2seq model, and work on a toy problem. However, variations of these networks are used in a large number of natural language processing applications.
There are quite a few online tutorials showing step-by-step implementations of these models. You are allowed to make use of these tutorials. However, make sure you understand what you are doing, and your submission should follow the instructions below carefully.
The problem we want to solve is “translating words”: given the pronunciation of a word in one language, predict its pronunciation in a (related) language. The model should work well if the words are cognates, where mapping the word in the source language to the word in the target language follows regular patterns. Hence, we expect better results on related languages.
It is likely that your model will perform poorly on this data set, even for closely related languages. In practice, it is more common to use “data augmentation” techniques in combination with more complex models (e.g., using attention). You are not required, but encouraged, to try methods to improve the base system defined below. Your assignment will not be evaluated based on the scores, but on the correctness of the implementation.
Please implement all exercises in the provided template.
Data
The data we use in this assignment is from NorthEuraLex, which includes IPA-encoded pronunciations of a set of words in multiple languages. We only need the file containing word forms, northeuralex-0.9-forms.tsv, for this assignment. The template already includes the code necessary to read and extract the relevant information from the data file.
In NorthEuraLex, the words in different languages are related through their ‘concept ID’. Words belonging to the same concept tend to be cognates for closely related languages, but they are not necessarily cognates. Hence, depending on the pair of languages you work on, a significant proportion of the word pairs in the training and test sets will be noise (one may use some cognate detection method/heuristic to refine the data set, but you are not required to do this).
Exercises
6.1 Data preparation
The template already includes a function that reads the NorthEuraLex data and returns the lists of word pairs that belong to the same concept in the indicated source and target languages. Furthermore, you are also given an example encoder for these sequences, which transforms a given word list to numeric representations, where each IPA segment is represented by an integer.
Obtain the list of word pairs for two related languages, and split the data set into a training (80%) and test (20%) set. Encode the data sets appropriately using the encoder class provided.
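For the split itself, something along the following lines should work. This is a minimal sketch only: the function name split_pairs, the fixed seed, and how the provided encoder class is applied afterwards are assumptions, since the template's actual interface may differ.

```python
import random

def split_pairs(pairs, test_ratio=0.2, seed=42):
    """Shuffle the (source, target) word pairs and split them 80/20.
    Fixing the seed keeps the split reproducible across runs."""
    random.seed(seed)
    pairs = list(pairs)
    random.shuffle(pairs)
    cut = int(len(pairs) * (1 - test_ratio))
    return pairs[:cut], pairs[cut:]

# train_pairs, test_pairs = split_pairs(word_pairs)
# Each split is then passed through the provided encoder class so that every
# IPA segment is mapped to an integer (the exact call depends on the template).
```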
6.2 Building and training a seq2seq model
Implement the function train_seq2seq() in the template, which builds a seq2seq model as described below, trains the model, and returns the trained model object.
Your model should include:
- A bidirectional GRU encoder, which takes an encoded sequence of IPA symbols (a word of the source language) and builds a representation of the whole word.
- Your encoder should use an embedding layer before the recurrent layer(s).
- Your input words to the encoder should be padded with 0’s on the left.
- A GRU decoder, whose hidden state is initialized with the final hidden state of the encoder, and which predicts the characters of the word in the target language that belongs to the same concept.
- The decoder’s input is the previous character of the target-language word. Use a special “end-of-sequence” symbol as the first input character.
- Similar to the encoder, use an embedding layer before the decoder recurrent layer. An interesting option here is to use a shared embedding layer in the encoder and decoder (not required for the exercise).
- The target words should be padded from the right.
- Make sure to train your model properly, employing at least one method for preventing overfitting. (A rough sketch of such a model and a training loop is given after this list.)
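If the template is built on PyTorch (an assumption; adapt the idea to whatever framework the course uses), the encoder/decoder pair and a teacher-forced training step could look roughly as follows. The class names, the layer sizes, and the convention that index 0 is the padding symbol and index 1 the end-of-sequence symbol are illustrative choices, not part of the template.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Bidirectional GRU encoder over left-padded IPA-symbol sequences."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        # padding_idx=0 matches the convention of padding with 0's.
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True,
                          bidirectional=True)

    def forward(self, src):                      # src: (batch, src_len)
        emb = self.embedding(src)                # (batch, src_len, emb_dim)
        _, hidden = self.gru(emb)                # hidden: (2, batch, hidden_dim)
        # Concatenate the final forward and backward states; this vector
        # initializes the decoder's hidden state.
        return torch.cat([hidden[0], hidden[1]], dim=-1)   # (batch, 2*hidden_dim)

class Decoder(nn.Module):
    """GRU decoder predicting the target word one IPA symbol at a time."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_symbols, hidden):     # prev_symbols: (batch, tgt_len)
        emb = self.embedding(prev_symbols)
        outputs, hidden = self.gru(emb, hidden)
        return self.out(outputs), hidden         # logits over target symbols

def train_seq2seq(train_src, train_tgt, n_src_symbols, n_tgt_symbols,
                  epochs=20, eos_idx=1):
    """Rough training skeleton with teacher forcing. train_src and train_tgt
    are assumed to be padded LongTensors; mini-batching, a validation split,
    and dropout/early stopping (required against overfitting) are omitted."""
    enc, dec = Encoder(n_src_symbols), Decoder(n_tgt_symbols)
    optimizer = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()))
    loss_fn = nn.CrossEntropyLoss(ignore_index=0)    # ignore right-padding

    for _ in range(epochs):
        hidden = enc(train_src).unsqueeze(0)         # (1, batch, 2*hidden_dim)
        # Decoder input: the EOS symbol followed by the target word shifted
        # right by one position (teacher forcing).
        eos = torch.full((train_tgt.size(0), 1), eos_idx, dtype=torch.long)
        dec_in = torch.cat([eos, train_tgt[:, :-1]], dim=1)
        logits, _ = dec(dec_in, hidden)
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), train_tgt.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return enc, dec
```

Note that in this sketch the decoder's hidden size equals twice the encoder's hidden size, so the concatenated forward and backward encoder states can initialize it directly; projecting the encoder state through a linear layer would work as well.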
6.3 Decoding
Implement the function predict() in the template, which returns the (encoded) target words given the source words and a trained seq2seq model.
You are only required to return the most likely word, but it is recommended to implement a beam-search option that returns the top-k predictions.
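A greedy version of predict() matching the sketch above could look like this (again assuming PyTorch and the illustrative Encoder/Decoder classes; stopping at the end-of-sequence symbol and stripping padding are left out).

```python
import torch

def predict(enc, dec, src, max_len=30, eos_idx=1):
    """Greedy decoding sketch: pick the most likely symbol at every step.
    src is assumed to be a left-padded LongTensor of source words."""
    with torch.no_grad():
        hidden = enc(src).unsqueeze(0)              # initialize from the encoder
        prev = torch.full((src.size(0), 1), eos_idx, dtype=torch.long)
        outputs = []
        for _ in range(max_len):
            logits, hidden = dec(prev, hidden)
            prev = logits.argmax(dim=-1)            # (batch, 1) best next symbol
            outputs.append(prev)
        return torch.cat(outputs, dim=1)            # truncate at EOS afterwards
```

For the optional beam search, you would instead keep the k highest-scoring partial hypotheses at each step, extend each of them, and re-rank by their accumulated log-probabilities.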
6.4 Evaluation
Implement the function evaluate(), which returns the “mean edit distance” between two sequences of words (the predicted and gold-standard sequences), as well as the mean edit distance between the gold standard and a trivial baseline that predicts the source word without any modification.
You can use a library for edit distance calculations.
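A possible sketch of evaluate(), assuming the editdistance package (any edit-distance library would do) and that words are given as strings or as lists of symbols.

```python
import editdistance  # e.g., pip install editdistance

def evaluate(predicted, gold, source):
    """Mean edit distance of the predictions against the gold standard,
    and of the trivial copy-the-source baseline against the gold standard."""
    pred_dist = sum(editdistance.eval(p, g) for p, g in zip(predicted, gold))
    base_dist = sum(editdistance.eval(s, g) for s, g in zip(source, gold))
    n = len(gold)
    return pred_dist / n, base_dist / n
```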