We’ve been designing a new Natural Language Processing course recently, which we will be delivering in our Masters Series over the coming months. The Irish court system provides transcripts of its proceedings online at www.courts.ie, and we used these as the corpus for our analysis.  Some really interesting insights were uncovered, but that’s for another post.  This one is about the surprising number of queries I came across online about the custom corpus building facility in NLTK.  While researching the material I found a blog where the author noted something interesting: ‘While looking through the NLTK forum I’ve noticed a lot of people looking for information on how to build a corpus’.  Problematically, he goes on to provide a misleading step-by-step guide to building a plaintext corpus that involves tokenising as a preliminary step in the process.  This is unnecessary: NLTK’s plaintext corpus reader class will do that for you.  Furthermore, although he did load some text for analysis, he didn’t build a corpus at all!

He was right in one respect though: there are lots of queries on this topic in the NLTK forum, and more on stackoverflow.com, like ‘Creating a New Corpus with NLTK’, where the author says ‘I reckoned that often the answer to my title is to go and read the documentations, but I ran through the NLTK book but it doesn’t give the answer.’   This question even had a bounty.  Both on the forum and on stackoverflow there are some very impressive NLTK experts giving excellent technical answers.  So, why does this question keep recurring?

The misunderstanding seems to be about the general idea of how NLTK deals with custom corpora.  The reality is that a corpus is just a collection of texts, or text files.  As such, there is no need to create a new corpus with NLTK: you don’t need to call a corpus constructor.  Instead, NLTK makes life really easy, and what most users will actually create are two things:

  1. Your corpus, which is just a collection of files stored in an appropriate structure on your filesystem, and,
  2. A means to manipulate your corpus in NLTK, which is a corpus reader that matches your corpus structure.

This means that, in the case of plain text files for example, you simply create a directory of files on your filesystem, and then a PlaintextCorpusReader can access them, parse them, and provide methods for reading and manipulating them.

You may be wondering what is meant by ‘an appropriate structure’.  Given that the NLTK corpus readers are going to provide some complex functions for your corpus, they rely upon the corpus being in a specific format so that it can be parsed correctly.  For example, when dealing with plain text, the plain text corpus reader assumes that paragraphs are separated by white space.  It uses this assumption to process the documents into the paragraphs that are returned by its .paras() function.
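To make this concrete, here is a minimal sketch (the folder and file names are invented for illustration) showing how the blank-line convention drives .paras().  It assumes the Punkt sentence model is installed, which you can fetch with nltk.download('punkt'):

import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Write a small file containing two paragraphs separated by a blank line
os.makedirs('demoCorpus', exist_ok=True)
with open(os.path.join('demoCorpus', 'sample.txt'), 'w') as f:
    f.write('First paragraph. It has two sentences.\n\nSecond paragraph.\n')

reader = PlaintextCorpusReader('demoCorpus', r'.*\.txt')
print(reader.paras('sample.txt'))
# [[['First', 'paragraph', '.'], ['It', 'has', 'two', 'sentences', '.']],
#  [['Second', 'paragraph', '.']]]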

For a categorised corpus, the categorised corpus reader assumes either that a naming convention has been applied to the files to tell it how to categorise their content, for example as positive or negative text, or as belonging to specific newsgroups (e.g. politics, religion, sports), or, alternatively, that the positive files will all be in one folder and the negative files in another.  Both conventions are sketched below.
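As a hypothetical illustration (all names invented), the same review corpus could be laid out in either of these two ways:

reviews/pos/review_001.txt      # category taken from the folder name,
reviews/neg/review_002.txt      #   matched by e.g. cat_pattern=r'(\w+)/.*'

reviews/pos_review_001.txt      # category taken from a filename prefix,
reviews/neg_review_002.txt      #   matched by e.g. cat_pattern=r'(pos|neg)_.*\.txt'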

From within your code, and for most intents and purposes, your corpus reader object functions as a model of your corpus and of the operations you perform on it.  This is why the reader is often conflated with the corpus itself, and it is also why the question ‘how do I create a corpus in NLTK?’ is so often answered with ‘use a corpus reader’.


Breaking Down the Process

For a plain text corpus, creating files in the appropriate structure is easy: you just drop them in a directory.  Let’s look a little more closely at how the second piece of work is undertaken and see how the PlaintextCorpusReader performs its magic.  Here is the syntax for the reader:

PlaintextCorpusReader(
    root,
    fileids,
    word_tokenizer=WordPunctTokenizer(),
    sent_tokenizer=PunktSentenceTokenizer(),
    para_block_reader=read_blankline_block,
    encoding='utf8')

This reader class is defined with six parameters, as follows:

root: The root directory for this corpus.  This should be the path of the directory that holds the plain text files from which you are building the corpus.

fileids: This can be either a regular expression that pattern-matches the names of the files you wish to select from the root directory, or simply a list of filenames.

word_tokenizer: This specifies the tokenizer for breaking sentences or paragraphs into words.  By default, the reader class uses the in-built WordPunctTokenizer, a rule-based tokenizer that uses a regular expression to split text into sequences of alphabetic and non-alphabetic characters.  A user can specify that a different tokenizer be used.  Either way, the tokenizer ensures that once the user has created the corpus reader they can call methods that rely upon it to return results, i.e. the word tokens.

sent_tokenizer: This specifies the tokenizer for breaking paragraphs into sentences.  By default, the reader class uses the pre-trained Punkt sentence tokenizer, which NLTK ships as a pickle file.  Unlike the default word tokenizer, Punkt is not rule-based: it is an unsupervised, statistically trained model for identifying sentence boundaries within each of the files.  A user can specify that a different tokenizer be used.  Either way, the tokenizer ensures that once the user has created the corpus reader they can call methods that rely upon it to return results, i.e. lists of sentences.

para_block_reader: The block reader used to divide the corpus into paragraph blocks; by default, paragraphs are assumed to be separated by blank lines.

encoding: The character encoding used to read the files, defaulting to ‘utf8’.
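If the defaults don’t suit your text, you can pass your own tokenizer when constructing the reader.  Here is a minimal sketch (the folder name is invented for illustration) that swaps in a regular-expression tokenizer which keeps only alphanumeric tokens and drops punctuation:

from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.tokenize import RegexpTokenizer

# Any object with a .tokenize() method will do; RegexpTokenizer is built in
reader = PlaintextCorpusReader(
    'testCorpus',
    r'.*\.txt',
    word_tokenizer=RegexpTokenizer(r'\w+'))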


We can see, then, that the corpus reader handles much of the preprocessing that you would normally need before performing any analysis, e.g. tokenisation.  As such, the corpus reader is your way of accessing the corpus, which you have built simply by gathering together (either physically, or by reference) the files that you are interested in analysing.


Let’s look at a specific example.  I have my file system set up so that my code and a folder called ‘testCorpus’ are stored in the same directory.  The testCorpus folder contains 100 individual text files, each containing the text of an individual judgement from the Irish courts system, all named similarly to judgement33837F580258168003E5BDA.txt.  The code to manipulate this corpus is:

import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Your working directory should have both this notebook and the text folder within it
# os.getcwd() will return the filepath to the current working directory
cwd = os.getcwd()
# Check it has been successful
print(cwd)
completeFolder = os.path.join(cwd, "testCorpus")
# Check it has successfully found the corpus directory created by the scraping script
print(completeFolder)

# Build the corpus from all files contained within the corpus directory
newcorpus = PlaintextCorpusReader(completeFolder, '.*')
print("done")

# Print the paragraphs of the corpus, which is a list of lists of lists of strings:
# each element of the outermost list is a paragraph, each paragraph contains
# sentence(s), and each sentence contains token(s).
# To access the paragraphs of a specific file:
print(newcorpus.paras('judgement33837F580258168003E5BDA.txt'))

--------------------------
This will create the following output:
[[['Determination']], [['Title', ':', 'R', '-', 'v', '-', 'The', 'Governor', 'of', 'Cloverhill', 'Prison', 'Neutral', 'Citation', ':', '[', '2017', ']', 'IESCDET', '79', 'Supreme', 'Court', 'Record', 'Number', ':', 'S', ':', 'AP', ':', 'IE', ':', '2017', ':', '000106', 'Court', 'of', 'Appeal', 'Record', 'Number', ':', 'A', ':', 'AP', ':', 'IE', ':', '2017', ':', '000311', 'High', 'Court', 'Record', 'Number', ':', '2017', 'No', '.'], ['668', 'SS', 'Date', 'of', 'Determination', ':', '07', '/', '20', '/', '2017', 'Composition', 'of', 'Court', ':', 'Denham', 'C', '.', 'J', '.,', 'Clarke', 'J', ',', 'Dunne', 'J', '.'], ['Status', ':', 'Approved']], ...]
# Access the sentences of the corpus (a list of lists of strings).
# NOTE: the texts are flattened into sentences that contain tokens.
# To access the sentences of a specific file:
print(newcorpus.sents('judgement33837F580258168003E5BDA.txt'))

Which will provide the following results:

--------------------------
[['Determination'], ['Title', ':', 'R', '-', 'v', '-', 'The', 'Governor', 'of', 'Cloverhill', 'Prison', 'Neutral', 'Citation', ':', '[', '2017', ']', 'IESCDET', '79', 'Supreme', 'Court', 'Record', 'Number', ':', 'S', ':', 'AP', ':', 'IE', ':', '2017', ':', '000106', 'Court', 'of', 'Appeal', 'Record', 'Number', ':', 'A', ':', 'AP', ':', 'IE', ':', '2017', ':', '000311', 'High', 'Court', 'Record', 'Number', ':', '2017', 'No', '.'], ...]
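Once the reader exists, the rest of NLTK’s toolkit is immediately available.  As a quick sketch (continuing with the same newcorpus object from above), you could count the most frequent tokens across all 100 judgements:

from nltk import FreqDist

# All word tokens across every file in the corpus
all_words = newcorpus.words()

# The twenty most frequent tokens; note that punctuation appears too,
# because the default WordPunctTokenizer keeps it as separate tokens
fdist = FreqDist(w.lower() for w in all_words)
print(fdist.most_common(20))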


Categorised Corpus

If you were attempting to create a machine learning solution for a text classification problem, you would probably use the categorised corpus reader.  In this case, the first step would be to load your training data corpus, already in the appropriate structure to allow NLTK to understand which text belongs to which class.

The categorised corpus reader class has the following syntax:

CategorizedPlaintextCorpusReader(
    root,
    fileids,
    cat_pattern,
    word_tokenizer=WordPunctTokenizer(),
    sent_tokenizer=PunktSentenceTokenizer(),
    encoding='utf8')

Where fileids and cat_pattern are both regular expressions that allow the corpus reader to recognise the files that it is to parse, and the pattern of the categorisation label that has been used to name them.  Let’s look at an example:

Given that the file system is set up so that the 20 Newsgroups corpus is contained in a set of folders with one folder per category, and that the files have no file extension but are named using numeric characters, then the code would be:

news_grps_20_corpus = nltk.corpus.CategorizedPlaintextCorpusReader(
    root='./data/20news-bydate/20news-bydate-train/',
    fileids=r'.+/\d+',
    cat_pattern=r'((\w|\.)+)/*',
    encoding='latin1')
news_grps_20_corpus.categories()
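With the categories in place, the reader lets you slice the corpus by label.  A brief sketch (‘sci.space’ is one of the standard 20 Newsgroups labels; substitute whatever your categories() call returned):

# List the files belonging to one category
space_files = news_grps_20_corpus.fileids(categories='sci.space')

# Pull the word tokens for just that category
space_words = news_grps_20_corpus.words(categories='sci.space')
print(len(space_files), space_words[:10])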

Alternatively, given the other structure, where the categories are indicated in the filenames, the code would use a regular expression to work out which files belong to which category:

reader = CategorizedPlaintextCorpusReader(
    '~/MainFolder/',
    r'.*\.txt',
    cat_pattern=r'\d+_(\w+)\.txt')

The categorised reader is actually an extension of the plaintext corpus reader, and includes all its functionality.  Once the categorised reader is built you can call its functions on your text in the same manner.   There are many other corpus reader classes, each suited to a particular type of natural language task.  Now that you know the underlying concept, you can review the documentation for any of them and pick the one suited to your needs.
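To close the loop on the text classification use case mentioned earlier, here is a hedged sketch of how a categorised reader typically feeds a machine learning pipeline.  It follows the labelled-documents pattern from the NLTK book, where reader stands for whichever categorised reader you have built:

# Build (tokens, label) pairs suitable for training a classifier
documents = [(list(reader.words(fileid)), category)
             for category in reader.categories()
             for fileid in reader.fileids(categories=category)]

From here, each document’s token list can be converted into features and passed to whichever classifier you prefer.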

We hope that this blog gives you some idea of the power provided by NLTK for working with your own custom corpus, and how to get started.  To learn more about this, and how to perform the very latest natural language processing using Deep Learning techniques, take a look at our upcoming Masters Series.

The Analytics Store is the leading data analytics training and consultancy firm in Ireland.  We assist companies in developing and implementing data and analytics strategies, from people to data environments.  Depending on your analytics maturity, we offer products that bring you to the next level.