6th March, 2024

Decoding Language: A Deep Dive into Part of Speech Tagging with Dynamic Algorithms

Comparing the accuracy of part-of-speech tagging using the Eager, Viterbi, and Individually Most Probable Tags algorithms across English, Korean, and Swedish.

By Emma Horton

Introduction

In the realm of computational linguistics, understanding each word’s grammatical role through Part-of-Speech (POS) tagging is essential. This process not only aids in syntactic parsing but is also vital for applications such as machine translation and information extraction. For my recent project in the ‘Language and Computation’ module at the University of St Andrews, I delved into the Viterbi algorithm and its efficacy in POS tagging. I aimed to implement and evaluate the Viterbi algorithm alongside two related methods—the Eager and Individually Most Probable Tags algorithms. I compared their effectiveness across three languages—English, Korean, and Swedish—to uncover insights into their morphological and syntactic characteristics.

Diving into Data: Setting the Stage for POS Tagging

In my journey through computational linguistics, the first step was to gather the right data. I turned to the Universal Dependencies treebanks, a rich source of linguistically annotated text, and used the conllu Python package to parse and divide the data into training and test sets. Picture this: each sentence neatly arranged into a 2D list of (word, tag) pairs, bracketed by start-of-sentence and end-of-sentence markers. Why? To capture those crucial boundary effects that can make or break the transition probability calculations. From there, it was all about the numbers. I calculated transition probabilities by tallying the frequency of POS bigrams, applying Witten-Bell smoothing to deal with data sparsity. Emission probabilities followed, based on how frequently each word appeared with a particular tag. And to top it all off, I determined the probability of each POS tag starting a sentence, setting the stage for what was to come. This preparation was foundational, ensuring that the algorithms had a solid base to work from and promising more accurate tagging results.
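To make that pipeline concrete, here is a minimal sketch of the preparation step, assuming the treebank files are in CoNLL-U format and read with the conllu package's parse_incr function. The function names and the <s>/</s> boundary markers are illustrative rather than lifted from my script, and the Witten-Bell smoothing layer is applied on top of these raw counts (not shown here).

```python
from collections import Counter
from conllu import parse_incr


def load_sentences(path):
    """Read a CoNLL-U file and return each sentence as a list of
    (word, POS) pairs wrapped in start/end boundary markers."""
    sentences = []
    with open(path, encoding="utf-8") as f:
        for tokenlist in parse_incr(f):
            tagged = [("<s>", "<s>")]
            tagged += [(tok["form"], tok["upos"]) for tok in tokenlist]
            tagged.append(("</s>", "</s>"))
            sentences.append(tagged)
    return sentences


def count_statistics(sentences):
    """Collect the raw counts behind the transition and emission
    probabilities; smoothing is applied on top of these counts."""
    transition_counts = Counter()  # (previous tag, current tag) bigrams
    emission_counts = Counter()    # (tag, word) pairs
    tag_counts = Counter()
    for sentence in sentences:
        for (_, prev_tag), (word, tag) in zip(sentence, sentence[1:]):
            transition_counts[(prev_tag, tag)] += 1
            emission_counts[(tag, word)] += 1
            tag_counts[tag] += 1
    return transition_counts, emission_counts, tag_counts
```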

Unpacking the Algorithms

The Eager Approach

I started with the Eager algorithm—a straightforward, no-frills approach to POS tagging. This method handles one word at a time, determining the most likely tag by considering only the tag of the preceding word and the word itself. In my script, this process unfolds within the eager_algorithm function. For each word, I sift through every possible tag, crunch the numbers by multiplying each tag’s emission probability with the transition probability from the previous tag, and then select the tag with the highest score. It’s all about making the best choice, one step at a time.

For those curious about the math behind it, the Eager algorithm relies on a straightforward formula to make these tagging decisions. The formula looks like this:

\[ \hat{t}_i = \arg\max_{t_i} P(t_i \mid \hat{t}_{i-1}) \cdot P(w_i \mid t_i) \]

Let me break it down:

  • \(\hat{t}_i\): This represents the most likely POS tag for the \(i\)-th word in the sentence. It’s what the algorithm is trying to determine.
  • \(P(t_i \mid \hat{t}_{i-1})\): This is the transition probability—it tells us how likely the current tag \(t_i\) is, given that the previous word was tagged as \(\hat{t}_{i-1}\). Essentially, it’s about how well the current tag flows from the previous one.
  • \(P(w_i \mid t_i)\): This is the emission probability, which measures how likely it is that the current word \(w_i\) would be associated with the tag \(t_i\). This helps the algorithm decide which tag makes sense based on the word itself.

In essence, for each word, the algorithm selects the tag that maximizes the combination of how well it fits with the previous tag and how well it matches the current word.
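To illustrate, here is a minimal sketch of that greedy, left-to-right selection. The transition_prob and emission_prob helpers are assumed lookup functions backed by the smoothed counts from the preparation step; they are placeholders for illustration, not the exact interface of my implementation.

```python
def eager_algorithm(words, tags, transition_prob, emission_prob):
    """Greedy left-to-right tagging: each word gets the tag that maximises
    P(tag | previous tag) * P(word | tag), given the tag already chosen
    for the previous word."""
    predicted = []
    prev_tag = "<s>"  # sentence-start marker
    for word in words:
        best_tag = max(
            tags,
            key=lambda t: transition_prob(prev_tag, t) * emission_prob(t, word),
        )
        predicted.append(best_tag)
        prev_tag = best_tag
    return predicted
```

Because each decision is final, a single early mistake can propagate through the rest of the sentence, which is exactly the weakness the Viterbi algorithm addresses next.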

The Viterbi Algorithm

Moving on to the Viterbi algorithm, things get a bit more intricate. This algorithm isn’t satisfied with just the immediate context; it looks at the entire sentence to decide the sequence of tags that makes the most sense. In my script, I implement this within the viterbi_algorithm function. It begins by filling a matrix with log probabilities for the first word across all possible tags. It then iterates over each word in the sentence, updating the matrix based on the transition probabilities, emission probabilities, and previously computed log probabilities. Finally, it backtracks from the last word to determine the most probable sequence of tags. To avoid the pitfalls of multiplying very small probabilities—which can cause numerical underflow—I use log probabilities instead. By transforming probabilities into log space, multiplication operations become additions, significantly mitigating the underflow risk. This approach is critical for the Viterbi algorithm's reliability, especially in processing long sentences where the product of many small probabilities would otherwise approach zero.

Now let's break down the formula that the Viterbi algorithm uses to find the most likely sequence of tags for a given sentence:

\[ \hat{t}_1 \cdots \hat{t}_n = \arg\max_{t_1 \cdots t_n} \left( \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \cdot P(w_i \mid t_i) \right) \cdot P(t_{n+1} \mid t_n) \]

Here’s what the terms mean:

  • \(\hat{t}_1 \cdots \hat{t}_n\): This represents the most likely sequence of tags for the sentence, from the first word to the last.
  • \(P(t_i \mid t_{i-1})\): The transition probability, which tells us how likely the current tag \(t_i\) is, given the previous tag \(t_{i-1}\).
  • \(P(w_i \mid t_i)\): The emission probability, which measures how likely it is that the current word \(w_i\) corresponds to the current tag \(t_i\).
  • \(t_0 = \langle \mathit{s} \rangle\): This is the start-of-sentence marker.
  • \(t_{n+1} = \langle /\mathit{s} \rangle\): This is the end-of-sentence marker.

In simple terms, the Viterbi algorithm finds the sequence of tags that maximizes the product of the transition probabilities (from one tag to the next) and the emission probabilities (how well each word fits its tag) for the entire sentence. It also factors in the start and end markers to ensure the tagging process is coherent from beginning to end.
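The sketch below shows the shape of that dynamic programme. As before, the transition_prob and emission_prob helpers and the <s>/</s> markers are assumptions for illustration, and smoothed, non-zero probabilities are assumed so that taking logarithms is safe.

```python
import math


def viterbi_algorithm(words, tags, transition_prob, emission_prob):
    """Dynamic-programming search for the single best tag sequence,
    working in log space to avoid numerical underflow."""
    n = len(words)
    # best[i][t]: best log probability of any tag sequence for words[:i+1] ending in t
    best = [dict() for _ in range(n)]
    back = [dict() for _ in range(n)]

    # Initialisation: transition from the start marker to the first tag.
    for t in tags:
        best[0][t] = (math.log(transition_prob("<s>", t))
                      + math.log(emission_prob(t, words[0])))

    # Recursion: extend each partial sequence by one word at a time.
    for i in range(1, n):
        for t in tags:
            prev_best = max(
                tags,
                key=lambda p: best[i - 1][p] + math.log(transition_prob(p, t)),
            )
            best[i][t] = (best[i - 1][prev_best]
                          + math.log(transition_prob(prev_best, t))
                          + math.log(emission_prob(t, words[i])))
            back[i][t] = prev_best

    # Termination: include the transition into the end-of-sentence marker.
    last = max(tags, key=lambda t: best[n - 1][t] + math.log(transition_prob(t, "</s>")))

    # Backtrack to recover the most probable sequence.
    sequence = [last]
    for i in range(n - 1, 0, -1):
        sequence.append(back[i][sequence[-1]])
    return list(reversed(sequence))
```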

Individually Most Probable Tags

Lastly, the individually most probable tags algorithm represents the most detailed method in my toolkit. Implemented in the most_probable_tags_algorithm function, this algorithm leverages forward and backward probabilities to optimize the accuracy of POS tagging for each word based on the context provided by the other tags in the sentence. It relies on two support functions, forward_algorithm and backward_algorithm, which compute the forward and backward probabilities for each word being assigned each possible tag.

Here’s how it works: The most_probable_tags_algorithm function first calculates the cumulative forward probabilities—those that move from the start of the sentence up to the previous word, factoring in both emission and transition probabilities. Then, it computes the backward probabilities, which run from the end of the sentence to the next word. Using the logsumexp() function, I ensure accurate summation of these probabilities. Finally, by multiplying these summed forward and backward probabilities together, the algorithm identifies the most probable tag for each word.

In more formal terms, for each \(i\), the algorithm computes:

\[ \hat{t}_i = \arg\max_{t_i} \sum_{t_1 \cdots t_{i-1},\, t_{i+1} \cdots t_n} \left( \prod_{k=1}^{n} P(t_k \mid t_{k-1}) \cdot P(w_k \mid t_k) \right) \cdot P(t_{n+1} \mid t_n) \]

However, computing this directly would be inefficient. Instead, we break it down using forward and backward probabilities, similar to the Baum-Welch algorithm. The result is:

\[ \hat{t}_i = \arg\max_{t_i} \begin{array}{l} \sum_{t_1 \cdots t_{i-1}} \Bigl( \prod_{k=1}^{i} P(t_k \mid t_{k-1}) \cdot P(w_k \mid t_k) \Bigr) \cdot \\[2ex] \sum_{t_{i+1} \cdots t_n} \Bigl( \prod_{k=i+1}^{n} P(t_k \mid t_{k-1}) \cdot P(w_k \mid t_k) \Bigr) \cdot P(t_{n+1} \mid t_n) \end{array} \]

This formula ensures that we not only consider the context of each word but also maximize the probability of each individual tag, making it a highly robust tagging method.
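To show how the pieces fit together, here is a sketch of the forward pass, the backward pass, and the final per-word argmax, using scipy's logsumexp for stable summation in log space. The probability lookups and boundary markers are the same illustrative assumptions as in the earlier sketches, not the exact code from my project.

```python
import math
from scipy.special import logsumexp


def forward_algorithm(words, tags, transition_prob, emission_prob):
    """alpha[i][t]: log probability of the words up to position i, summed
    over all tag sequences that end with tag t at position i."""
    alpha = [dict() for _ in words]
    for t in tags:
        alpha[0][t] = (math.log(transition_prob("<s>", t))
                       + math.log(emission_prob(t, words[0])))
    for i in range(1, len(words)):
        for t in tags:
            alpha[i][t] = logsumexp(
                [alpha[i - 1][p] + math.log(transition_prob(p, t)) for p in tags]
            ) + math.log(emission_prob(t, words[i]))
    return alpha


def backward_algorithm(words, tags, transition_prob, emission_prob):
    """beta[i][t]: log probability of the words after position i, plus the
    end-of-sentence transition, given tag t at position i."""
    n = len(words)
    beta = [dict() for _ in words]
    for t in tags:
        beta[n - 1][t] = math.log(transition_prob(t, "</s>"))
    for i in range(n - 2, -1, -1):
        for t in tags:
            beta[i][t] = logsumexp(
                [math.log(transition_prob(t, s))
                 + math.log(emission_prob(s, words[i + 1]))
                 + beta[i + 1][s]
                 for s in tags]
            )
    return beta


def most_probable_tags_algorithm(words, tags, transition_prob, emission_prob):
    """For each position, pick the tag with the highest posterior score
    alpha[i][t] + beta[i][t] (a product in log space)."""
    alpha = forward_algorithm(words, tags, transition_prob, emission_prob)
    beta = backward_algorithm(words, tags, transition_prob, emission_prob)
    return [max(tags, key=lambda t: alpha[i][t] + beta[i][t])
            for i in range(len(words))]
```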

Measuring Success: How The Algorithms Stacked Up

When it came to evaluating the algorithms’ performance, accuracy was my guiding star. This straightforward metric was calculated by comparing each algorithm’s predicted tags to the true tags from the test sets, measuring how often each algorithm got it right. I laid out the results in a simple table format, making it easy to see how each algorithm fared across different languages.
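For reference, a token-level accuracy computation can be as simple as the sketch below; the argument layout (parallel lists of predicted and gold tag sequences) is illustrative.

```python
def accuracy(predicted_sentences, gold_sentences):
    """Token-level accuracy: the proportion of words whose predicted tag
    matches the gold-standard tag from the treebank."""
    correct = total = 0
    for predicted, gold in zip(predicted_sentences, gold_sentences):
        for p_tag, g_tag in zip(predicted, gold):
            correct += (p_tag == g_tag)
            total += 1
    return correct / total
```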

Here’s a snapshot of how each algorithm performed:

Language   Eager (%)   Viterbi (%)   Individually Most Probable Tags (%)
English    88.6        91.3          88.6
Swedish    85.7        90.2          85.7
Korean     80.8        79.2          80.8

Interestingly, the Individually Most Probable Tags algorithm did not surpass the Viterbi algorithm in performance as I had originally anticipated. This result hints at possible implementation issues, warranting a thorough reevaluation to pinpoint any errors or missteps. Consequently, this algorithm was not included in the final discussion.

Deep Dive: Analyzing Algorithm Performance and Linguistic Challenges

Comparing the Performance of Two Algorithms.

Why Viterbi Performs Well in English and Swedish, but Falls Short in Korean.

English and Swedish both thrive under the Viterbi algorithm, which excels by considering the full context of a sentence. These languages rely on strict word order to define grammatical roles, making Viterbi’s sentence-wide analysis highly effective. However, Korean presents a different challenge. As an agglutinative language, Korean packs a lot of grammatical information directly into word forms through affixes, providing strong independent cues for POS tagging. The flexible word order in Korean reduces the need for a detailed sentence-wide analysis, which is why Viterbi doesn’t perform as well here.

Comparing Performance Across Languages.

English Shows the Highest Accuracy Overall.

English’s straightforward morphology, especially when compared to Swedish and Korean, makes it easier for the algorithms to tag accurately. With fewer inflections and affixes to handle, the task is simplified. Plus, the rigid sentence structure in English helps create predictable word sequences, something both the Viterbi and Eager algorithms can easily take advantage of for more reliable tagging results.

The Challenge with Swedish: A Case for the Viterbi Algorithm.

While Swedish isn’t as morphologically complex as Korean, it does have more inflections than English and allows for greater sentence structure variation. This is where the Viterbi algorithm shines. Its ability to analyze entire sentence structures makes it particularly suited for languages like Swedish, where understanding long-distance dependencies between words and tags is key. In comparison, the Eager algorithm struggles to capture these nuances, giving Viterbi the edge.

Unpacking the Challenges of POS Tagging For Korean

Korean posed a significant challenge in POS tagging, with lower accuracy compared to English and Swedish. This is largely due to Korean’s agglutinative nature, where words are formed and grammatical relationships are expressed through numerous affixes. This creates a wide range of possible word forms from a single root, making the tagging process more complex. On the other hand, English and Swedish, being analytical languages, have simpler morphologies. With fewer word forms to manage, it’s much easier to categorize and tag them, leading to higher accuracy.

My Reflections

Key challenges

Cracking the Viterbi Algorithm

Building the Viterbi algorithm was no easy feat. The mathematical complexity behind it made the development process challenging. Translating theoretical formulas into workable Python code took multiple iterations and a lot of deep dives into lecture notes, textbook pseudocode, and other online references.

Fine-Tuning the Individually Most Probable Tags Algorithm

The Individually Most Probable Tags algorithm presented its own set of difficulties, especially since it wasn’t covered in our classes or textbooks. It demanded a solid grasp of mathematical principles to accurately integrate forward and backward probabilities into a predictive model. This algorithm still has room for improvement, and with further iterations, I believe it can achieve the results I initially expected.

So What Did I Learn?

  • Viterbi vs. Eager Algorithm:
    • Viterbi performed better in Swedish and English.
    • Eager had slightly better results in Korean.
  • Language Accuracy Rankings:
    • Highest accuracy observed in English.
    • Swedish showed significant improvement with the Viterbi algorithm.
    • Korean exhibited the lowest accuracy.
  • Effect of Morphology:
    • The simpler morphology in Swedish and English may explain better algorithm performance compared to the complex morphology of Korean.

Looking ahead

Looking ahead, there's a clear opportunity to tailor POS tagging algorithms more effectively for Korean. Leveraging its rich morphological features could enhance accuracy significantly. Ensuring robust POS tagging across all languages is crucial, as it lays the groundwork for more advanced natural language processing applications, from speech generation to language translation.

Try It Out

Curious to see how the algorithms perform in action? Pull the repository and try it for yourself.

Get the Code on GitHub

Want to find out more?

For a deeper dive into the Viterbi algorithm and its applications in POS tagging, check out the recommended books below:
