The filename contains the date, chatroom, and number of posts.

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University.
This corpus contains text from 500 sources, and the sources have been categorized by genre.

Next, we need to obtain counts for each genre of interest.
We examined some small text collections in 1., such as the speeches known as the US Presidential Inaugural Addresses.
This particular corpus actually contains dozens of individual texts — one per address — but for convenience we glued them end-to-end and treated them as a single text. We also used various pre-defined texts that we accessed by typing a single import command.

This program displays three statistics for each text: average word length, average sentence length, and the number of times each vocabulary item appears in the text on average (our lexical diversity score).
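Since the program itself is not reproduced here, the following is a minimal sketch of how the three statistics could be computed over a plain Python string rather than a corpus reader. The function name, naive tokenization rules, and sample sentence are all illustrative assumptions, not the original program:

```python
import re

def text_statistics(text):
    """Compute average word length, average sentence length, and
    lexical diversity (tokens per distinct vocabulary item)."""
    words = text.split()
    # Naive sentence split on ., !, ? -- a real program would use a proper tokenizer.
    sents = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    vocab = set(w.lower().strip(".,!?") for w in words)
    avg_word_len = sum(len(w) for w in words) / len(words)
    avg_sent_len = len(words) / len(sents)
    lexical_diversity = len(words) / len(vocab)
    return avg_word_len, avg_sent_len, lexical_diversity

print(text_statistics("The cat sat. The cat ran. The dog sat."))
```

On the sample sentence this reports three sentences of three words each, and a lexical diversity of 1.8 (nine tokens over five distinct vocabulary items).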
NLTK's corpus readers support efficient access to a variety of corpora, and can be used to work with new corpora.
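To illustrate the idea of a corpus reader without depending on any particular corpus being installed, here is a toy reader over a directory of plain-text files. The class name and methods are invented for this sketch and are not NLTK's actual API, though they mimic its familiar `fileids()` / `raw()` / `words()` pattern:

```python
import os
import re
import tempfile

class SimplePlaintextReader:
    """Toy corpus reader: one .txt file per document under a root directory."""
    def __init__(self, root):
        self.root = root

    def fileids(self):
        # Each .txt file in the root directory counts as one document.
        return sorted(f for f in os.listdir(self.root) if f.endswith(".txt"))

    def raw(self, fileid):
        with open(os.path.join(self.root, fileid)) as fh:
            return fh.read()

    def words(self, fileid):
        # Crude word tokenization: runs of word characters.
        return re.findall(r"\w+", self.raw(fileid))

# Usage with a throwaway one-document corpus:
root = tempfile.mkdtemp()
with open(os.path.join(root, "a.txt"), "w") as fh:
    fh.write("Hello corpus world.")
reader = SimplePlaintextReader(root)
print(reader.fileids(), reader.words("a.txt"))
```

The point of the pattern is that once a corpus is wrapped in a reader, the same downstream code (counting, frequency distributions) works regardless of where the text came from.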
We'll use NLTK's support for conditional frequency distributions.
These are presented systematically in 2, where we also unpick the following code line by line.
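A conditional frequency distribution pairs a condition (such as a genre) with a frequency count of events (such as words). The idea can be sketched in plain Python with one counter per condition; the two-genre toy corpus below is invented for illustration and stands in for real (genre, word) pairs:

```python
from collections import Counter, defaultdict

# Toy (condition, word) pairs standing in for (genre, word) pairs from a corpus.
pairs = [
    ("news", "the"), ("news", "market"), ("news", "the"),
    ("romance", "the"), ("romance", "love"), ("romance", "love"),
]

# A conditional frequency distribution: one Counter per condition.
cfd = defaultdict(Counter)
for condition, word in pairs:
    cfd[condition][word] += 1

print(cfd["news"]["the"])       # count of "the" under the "news" condition
print(cfd["romance"]["love"])   # count of "love" under the "romance" condition
```

Looking up a condition gives back an ordinary frequency distribution, which is exactly what we need to compare counts for each genre of interest.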
Observe that average word length appears to be a general property of English, since it has a recurrent value across texts. (Note that this figure is slightly inflated, because the character count used to compute it also counts the space characters between words.) By contrast, average sentence length and lexical diversity appear to be characteristics of particular authors.
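The inflation from space characters is easy to demonstrate: dividing the raw character count by the word count charges roughly one extra character per word. A small sketch (the sample text is illustrative):

```python
text = "the cat sat on the mat"
words = text.split()

num_chars = len(text)                                  # includes the spaces between words
inflated = num_chars / len(words)                      # what raw-count division reports
true_avg = sum(len(w) for w in words) / len(words)     # average over the words themselves

print(inflated, true_avg)
```

Here the raw-count figure exceeds the true average by 5/6 of a character, one space for each of the five word boundaries spread over six words; in a long text the gap approaches exactly one character per word.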
The previous example also showed how we can access the "raw" text of the book.

Although Project Gutenberg contains thousands of books, it represents established literature.