Predicting a Court’s decision: the natural language processing steps

Earlier we introduced a machine learning study that predicted decisions of a European court regarding human rights violations.  The data preparation started with two kinds of text analysis.  My plain-English translation of their article borrows heavily from their own wording.

Finding frequent words or combinations of words — N-gram features: “The Bag-of-Words (BOW) model is a popular semantic representation of text used in NLP.  In a BOW model, … text is represented as the bag … of its words (unigrams) or N-grams without taking into account grammar, syntax and word order.”

In short, as is common in Natural Language Processing [NLP], they decomposed the text of each decision into a “bag” of single words and short word combinations.  Doing so ignores the actual meaning, order, and part of speech of the words in the original.  A “unigram” is a single word, whereas an N-gram is any run of N words in a row: two, three, four, or more.  The researchers went as far as 4-grams.  So, you could go through a U.S. Supreme Court decision and find each word (unigram), each pair of adjacent words (2-gram), each triplet (3-gram), and each four-in-a-row combination (4-gram), and create a bag of words with no noun designations, verb designations, and so on.
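To make the decomposition concrete, here is a minimal sketch in Python (not the researchers’ code; the sample sentence is invented) that breaks a string into unigrams through 4-grams:

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a single string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "the court finds a violation of article three"
tokens = text.split()

bag = []
for n in range(1, 5):  # N comes from the set {1, 2, 3, 4}
    bag.extend(ngrams(tokens, n))

print(bag[:5])               # ['the', 'court', 'finds', 'a', 'violation']
print(ngrams(tokens, 4)[0])  # 'the court finds a'
```

Note that the bag records which word runs occur, not where they occur or what grammatical role they play.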

Creating a vector space representation:  Then the researchers created what you might visualize as a row in a spreadsheet for each opinion and a column for each word, doublet, triplet, etc. in it.  If you think of a 0 in a cell where the particular opinion did NOT contain the N-gram and a 1 where it DID, you have what statisticians call a matrix (a rectangle of zeros and ones as tall as your number of texts (opinions) and as wide as your total number of N-grams).  It would likely be called a “sparse matrix” because so many of the N-grams would show up in only one or two opinions; lots and lots of cells would hold a 0, hence the “sparse” descriptor.  As they succinctly stated this step, “That results in a vector space representation where documents are represented as m-dimensional variables over a set of m N-grams.”  The term “vector space representation” describes a huge multi-dimensional space where each axis corresponds to one N-gram and each opinion sits as a single point, located according to which N-grams it contains.  Many N-grams would appear in just one opinion; some would appear in several; a few words (like articles and prepositions) would appear in nearly every opinion (incidentally, NLP researchers usually strip out those “stop words” since they add little information).
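In practice, a library handles the bookkeeping.  Here is a minimal sketch using scikit-learn’s CountVectorizer (the paper does not say which software the researchers used, and the three toy “opinions” are invented) that builds exactly this kind of sparse document-by-N-gram matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three toy "opinions" standing in for full court decisions.
opinions = [
    "the applicant alleged a violation of article three",
    "the court finds no violation of the convention",
    "the court holds that article three was violated",
]

# Count every unigram through 4-gram in each opinion.
vectorizer = CountVectorizer(ngram_range=(1, 4))
matrix = vectorizer.fit_transform(opinions)  # a SciPy sparse matrix

print(matrix.shape)  # (3, ...): one row per opinion, one column per N-gram
print(matrix.nnz)    # count of nonzero cells; most cells are 0, hence "sparse"
```

scikit-learn stores the result as a sparse matrix for exactly the reason described above: only the nonzero cells are kept in memory.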

For machine learning, the case opinions (the points in hyperspace) are usually called “observations” and the columns (the axes, one per N-gram) are usually called “features.”  As the researchers wrote, “N-gram features have been shown to be effective in various supervised learning tasks,” which refers to the supervised machine learning algorithm described later and its prediction task.

“For each set of cases in our data set, we compute the top-2000 most frequent N-grams where N ∈ {1, 2, 3, 4}.”  Hence, they went only as far as combinations of four words and kept only the 2,000 N-grams used most often in the opinions.  [The trident-like symbol ∈ is set notation for saying “The values of N come from the set {1, 2, 3, 4}.”]  “Each feature represents the normalized frequency of a particular N-gram in a case or a section of a case.”  “Normalized” means they converted the raw count of each N-gram to a common scale, such as from 0 to 1, so that opinions of different lengths can be compared fairly.
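Reusing the toy opinions from the sketch above, the “top-2000, normalized” step might look like the following (the paper does not give an implementation; max_features and L2 normalization are plausible stand-ins for what it describes):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

# Keep only the 2000 most frequent N-grams across the whole collection.
vectorizer = CountVectorizer(ngram_range=(1, 4), max_features=2000)
counts = vectorizer.fit_transform(opinions)  # raw N-gram counts per opinion

# Rescale each row so long opinions don't dominate simply by containing
# more words; the resulting values fall between 0 and 1.
features = normalize(counts, norm="l2")
```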

“This can be considered as a feature matrix, C ∈ ℝ^(c×m), where c is the number of the cases and m = 2,000.”  The researchers extracted N-gram features for each of five sections in the opinions, since all opinions of the Court are written in the same format, as well as for the full text.  They refer to this huge array of normalized frequencies (real numbers, which is what the ℝ signifies) as their C feature matrix (case features from N-grams).
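Since every opinion follows the same format, the same recipe can be run once per section and once on the full text, yielding one C matrix per portion of the opinion.  A hypothetical illustration (the section names and toy snippets below are invented placeholders, not the paper’s actual labels or data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

# Hypothetical: in the real study each list would hold one section of all
# c cases; here two toy snippets stand in for each section.
sections = {
    "procedure": ["the application was lodged with the commission",
                  "the chamber relinquished jurisdiction"],
    "facts":     ["the applicant alleged a violation of article three",
                  "the court finds no violation of the convention"],
    "full_text": ["full text of case one", "full text of case two"],
}

feature_matrices = {}
for name, texts in sections.items():
    vec = CountVectorizer(ngram_range=(1, 4), max_features=2000)
    # Each matrix C has shape (c, m): c cases by up to m = 2000 N-gram
    # columns (a toy corpus has far fewer than 2000 distinct N-grams).
    feature_matrices[name] = normalize(vec.fit_transform(texts), norm="l2")
```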
