Let’s assume we have a sequence of words, and we want to predict, as accurately as possible, whether each word is a name, verb, or some other part of speech.

This is equivalent to predicting a sequence of labels (where each label represents a part-of-speech tag) from a sequence of observations (where each observation is a word from a natural language vocabulary).

In mathematical terms, this equates to finding the sequence of labels that is most probable for the given sequence of words.

## Brute force

We could do this by determining p(Y|X) - the conditional probability of the label sequence (Y) given the observation sequence (X).

Assuming we had some model that could give us p(X), p(Y) and p(X|Y), we could explicitly calculate this by using Bayes Rule:

$p(Y|X) = \frac{p(X|Y)p(Y)}{p(X)}$

Our label prediction would then simply be the sequence with the maximum probability.

This would, however, require us to explicitly model both $p(Y)$ and $p(X)$ - the probability of the observed word sequence (over all possible word sequences), and the probability of every possible label sequence (over all possible label sequences).

This is equivalent to modelling the joint probability of both label and word sequences.

## Reducing the event space

This is intractable for all but the most simple cases - from the binomial theorem, the number of possible word sequences grows factorially as the sequence grows larger.

This brute force approach is of complexity $O(n!)$, so we want to find a better approach.

What if we disregarded the probability of the observation itself?

We know intuitively that many combinations of words, even if combinatorially possible, never actually appear in the English language.

We don’t want to waste computational time modelling nonsensical phrases like “trumpet cat the bag overlay”.

Ideally, we would ignore the probability of the observation and consider only $p(Y|X)$, the probability of each label sequence conditioned on that sequence of words.

This seems intuitively reasonable for our label prediction task.

We only really care about labelling word sequences that we are given, so avoiding modelling $p(X)$ delivers a significant reduction in computational complexity.

## Assumption of independence

Could we similarly reduce the complexity of our model for p(Y)?

Without any simplifying assumptions, we would again need to consider every possible sequence of labels - a set that is too large to work with in general.

What if we assumed that each label in the sequence is independent of the others?

This would allow us to focus on the probability of each label in its respective position, separately and independently of anything that comes before or after it.

This would reduce the number of calculations for $p(Y)$ from “all possible K-length sequences from N possible labels” ($\binom{N}{K}$) to a more manageable (“K independent draws of N possible labels” (NK) - a clear reduction.

The question then becomes whether or not this is a reasonable assumption for our task at hand1.

In English, at least, we know that adjectives tend to appear before nouns, adverbs don’t follow prepositions, and so on. The labels are clearly not independent.

## Relaxed assumption of interdepedence

What if we adopted a more relaxed independence assumption - that each label is conditionally independent, given its immediate predecessors?

For example, say we want to predict the blank in the following sequence:

NUM ORG VB _

We want to model the probability of the label at position i, given every preceding label in the sequence:

$p(y_{i} | y_{i-1}, y_{i-2}, y_{i-3})$

which is equivalent to

$p(y_{i} | y_{i-1}).p(y_{i-1}|y_{i-2}, y_{i-3}).p(y_{i-2}|y_{i-3})$

If we assume that each label is independent of everything except its immediate predecessor, then:

$p(y_{i}) = p(y_{i} | y_{i-1})$

This is a more realistic assumption than naive/strong independence.

Now, we can capture some first-order interactions and interdependence between labels, which fits much better with our prior knowledge of linguistic structure2.

If we were explicitly modelling the probability of each observation (i.e. p(X)), reintroducing p(Y|X) would give us a Hidden Markov Model, where each label corresponds to the hidden state, and each word in the sequence corresponds to the observation.

Without explicit consideration of consideration of P(X), this then gives us a Conditional Random Field.

This model is equivalent to a Markov Random Field, conditioned on a sequence of observations (hence the name - Conditional Random Field).

CRFs are a restricted form of Hidden Markov Model that allow more tractable probability models, by avoiding the need to explicitly consider the probability of the observations.

This is important for our label sequence prediction task, where the probability input space - that is, the space of all possible word sequence combinations - is simply too large to explicitly consider.

For discriminative tasks such as this, CRFs deliver competitive performance by making weaker independence assumptions about a smaller search space.3

So far, I’ve only talked about what probability we’re trying to model - not how we might actually model it.

In the next post, I’ll give a breakdown of what it actually means to model the probability under a linear chain CRF.

3. The downside is that, by not modelling $p(X)$, we can no longer calculate $p(X|Y)$. This means we can only assign labels to words (a classification/discriminative objective), and we are unable to create a generative model that can generate the most likely word given a label.