Assignment 6: Segmentation as Sequence Labeling
Deadline: Aug 03, 2020 @ 08:00 CEST (extended from July 27)
In this assignment we will work with (gated) recurrent networks for segmentation. In particular, we treat the segmentation task as a sequence labeling task, where we want to label each character as ‘beginning of a word’ (B, or 1) or ‘inside a word’ (I, or 0). This is very similar to BIO tagging used for many problems in NLP, e.g., tokenization or named entity recognition. In our case, we assume that there will be no ‘other’ (O) tag (for example space characters for tokenization, or words that do not belong to any named entity).
In this particular form, the problem is relevant for segmenting languages whose written text does not include word boundaries, or for processing spoken language transcripts without (clear) word boundary information. Although you can work with any data set you like, for your convenience (and for comparing the success of your systems) your repository contains a data set which has been used in many studies on child language acquisition. In particular, the data comes from CHILDES (originally collected by Bernstein-Ratner (1987) and processed by Brent & Cartwright (1996) into the present form). The data comes as phonemic transcriptions and as regular text. In child language acquisition research, the phonemic transcriptions are used. You are free to experiment with either; you can also try both and compare the performance of your system on the two data sets.
Each line in the data files contains an utterance, and the word boundaries are indicated with space characters.
As usual, implement your system in the provided template. You are expected to use Keras for defining the neural network in this assignment.
Problem definition
6.1 Reading the data
Not surprisingly, our assignment starts with reading the data file.
Implement the function read_data() that reads a data file formatted as described, and returns the utterances without spaces, as well as 0 and 1 labels corresponding to ‘inside a word’ and ‘beginning of a word’, respectively. Also include a special ‘end-of-utterance’ symbol (#), which is treated as a word.
For example, for the input file
night night
daddy
a kitty
the function should return unsegmented input like
["nightnight#", "daddy#", "akitty#"]
and labels like
[[1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 1],
[1, 1, 0, 0, 0, 0, 1]]
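A minimal sketch of read_data() along these lines is shown below, assuming one space-separated utterance per line; the exact signature in the template may differ, and the eos parameter is an illustrative addition:

```python
def read_data(filename, eos='#'):
    """Read utterances and produce per-character B/I (1/0) labels."""
    utterances, labels = [], []
    with open(filename, encoding='utf-8') as f:
        for line in f:
            words = line.split()
            if not words:
                continue  # skip empty lines
            words.append(eos)  # the end-of-utterance symbol counts as a word
            utterances.append(''.join(words))
            # 1 = beginning of a word (B), 0 = inside a word (I)
            labels.append([int(i == 0) for w in words for i in range(len(w))])
    return utterances, labels
```

For the three-line example above, this returns exactly the unsegmented utterances and label sequences shown.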
6.2 RNN sequence labeller
Implement the function segment() in the template that trains a (gated) RNN to predict the boundary labels (B/I, or 1/0 tags) from the unsegmented input returned by read_data() described above. The function should return sequences of segmented test utterances.
For the input
["nightnight#", "daddy#", "akitty#"]
the output should be predicted segmentations similar to:
[["night", "night", "#"], ["daddy#"], ["ak", "it", "ty", "#"]]
Note that we assume the model made some mistakes in the example above.
You are free to choose the architecture and the data encoding (you may want to reuse your WordEncoder from assignment 5 if you opt for one-hot encoding of the input characters), and whether and how to train/tune your network.
You are not required to demonstrate your tuning. However, you should be careful to follow correct practices.
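One possible shape of segment() is sketched below. The signature (training utterances and labels in, segmented test utterances out), the bidirectional GRU, and all hyperparameters are illustrative assumptions, not tuned or required choices; check the template for the actual interface.

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.sequence import pad_sequences

def segment(train_utterances, train_labels, test_utterances):
    # Index characters from 1; 0 is reserved for padding.
    chars = sorted({c for u in train_utterances + test_utterances for c in u})
    idx = {c: i + 1 for i, c in enumerate(chars)}

    def encode(utterances):
        return pad_sequences([[idx[c] for c in u] for u in utterances],
                             padding='post')

    x_train = encode(train_utterances)
    y_train = pad_sequences(train_labels, maxlen=x_train.shape[1],
                            padding='post')[..., np.newaxis]

    # Embedding -> bidirectional GRU -> per-character boundary probability.
    model = models.Sequential([
        layers.Embedding(len(idx) + 1, 32, mask_zero=True),
        layers.Bidirectional(layers.GRU(64, return_sequences=True)),
        layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')
    model.fit(x_train, y_train, epochs=10, batch_size=32)

    # Convert per-character probabilities into lists of words.
    segmented = []
    for u, probs in zip(test_utterances,
                        model.predict(encode(test_utterances))):
        boundaries = [i for i in range(1, len(u)) if probs[i, 0] > 0.5]
        spans = zip([0] + boundaries, boundaries + [len(u)])
        segmented.append([u[b:e] for b, e in spans])
    return segmented
```

Note that position 0 is always treated as a word beginning here, so only the remaining positions are effectively predicted.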
6.3 Evaluating the output
Implement the function evaluate() in the template that takes gold-standard segmentations and predicted segmentations, and produces the following metrics:
- Boundary precision (BP), recall (BR) and F1 score: these scores should indicate the ratio of the predicted boundaries that are correct according to the gold standard (precision), the ratio of the gold-standard boundaries that were correctly predicted (recall), and their harmonic mean (F1 score).
- Word precision (WP), recall (WR) and F1 score: similar to above, but to count a true positive, both boundaries of a word should be identified correctly.
- Lexicon precision (LP), recall (LR) and F1 score: similar to word precision, but each word type (unique word) counts only once.
For example, for predicted segmentation
[["night", "night", "#"], ["daddy#"], ["ak", "itty", "#"]]
and the gold standard
[["night", "night", "#"], ["daddy", "#"], ["a" "kitty", "#"]]
BP = (2+0+1)/4, BR = (2+0+1)/5, WP = (2+0+0)/5, WR = (2+0+0)/5, LP = (1+0+0)/4, LR = (1+0+0)/4.
Our treatment of the ‘end-of-utterance’ symbol may seem somewhat arbitrary. The general guiding principle is that we do not want to credit the system for predicting the obvious (so, a boundary before the first word or after # should not add to true positives), but predicting the end of the last word (a ‘beginning-of-word’ label on #) is part of the evaluation.
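A sketch of evaluate() under these rules is given below. It assumes both arguments are lists of utterances, each a list of word strings ending with the # symbol (as in the examples above); the return format is an assumption, not prescribed by the template.

```python
def evaluate(gold, predicted):
    def spans(utterance):
        # (start, end) character offsets of each word in the utterance
        out, pos = [], 0
        for w in utterance:
            out.append((pos, pos + len(w)))
            pos += len(w)
        return out

    b_tp = b_pred = b_gold = 0          # boundary counts
    w_tp = w_pred = w_gold = 0          # word-token counts
    lex_pred, lex_gold = set(), set()   # word types for the lexicon scores
    for g, p in zip(gold, predicted):
        gs, ps = spans(g), spans(p)
        # Boundaries are word-initial positions; position 0 is 'obvious'
        # and excluded, while the boundary before '#' is included.
        gb = {s for s, _ in gs if s > 0}
        pb = {s for s, _ in ps if s > 0}
        b_tp += len(gb & pb); b_pred += len(pb); b_gold += len(gb)
        # A word is correct if both of its boundaries are correct; the
        # '#' token itself is not counted as a word.
        gw = {(w, s) for w, (s, _) in zip(g, gs) if w != '#'}
        pw = {(w, s) for w, (s, _) in zip(p, ps) if w != '#'}
        w_tp += len(gw & pw); w_pred += len(pw); w_gold += len(gw)
        lex_pred |= {w for w, _ in pw}
        lex_gold |= {w for w, _ in gw}

    def prf(tp, n_pred, n_gold):
        pr = tp / n_pred if n_pred else 0.0
        rc = tp / n_gold if n_gold else 0.0
        f1 = 2 * pr * rc / (pr + rc) if pr + rc else 0.0
        return pr, rc, f1

    lex_tp = len(lex_pred & lex_gold)
    return (prf(b_tp, b_pred, b_gold),                  # BP, BR, BF1
            prf(w_tp, w_pred, w_gold),                  # WP, WR, WF1
            prf(lex_tp, len(lex_pred), len(lex_gold)))  # LP, LR, LF1
```

On the worked example above, this yields BP = 3/4, BR = 3/5, WP = WR = 2/5, and LP = LR = 1/4, matching the counts given.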