This site contains documentation for DocuScope Corpus Analysis (or DocuScope CA), as well as information for those using the docuscospacy Python package.

🛠 This page is currently under construction.

Load a Corpus

The first step in using DocuScope CA is to load a corpus of plain text files.

If you’re not familiar with the basics of corpus building and file preparation, see these guidelines.

Otherwise, follow these steps.

1. Upload files

First, select the load corpus tab on the left side of the window.

Scroll down to the bottom of the page, where you will find the browse files button. Select it.

The widget for browsing and selecting files.

A dialogue window will open. Navigate to your corpus files and select them. You can select multiple files by holding down the shift key, or the command key (the control key on Windows). You can also use select all if you have organized the files in their own directory.

A dialogue window for selecting files.

Alternatively, you can simply drag and drop your corpus files into the uploader.
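Before uploading, it can help to confirm that a directory really contains only readable plain text files. The following is a minimal sketch in standard-library Python, not part of DocuScope CA or docuscospacy. The function name `find_plain_text_files`, the `.txt` extension, and the UTF-8 encoding are all assumptions made for illustration.

```python
from pathlib import Path


def find_plain_text_files(corpus_dir):
    """Return (readable, unreadable) lists of .txt paths in corpus_dir.

    Assumption: corpus files use the .txt extension and UTF-8 encoding.
    """
    readable, unreadable = [], []
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        try:
            # Reading the whole file forces a full decode check.
            path.read_text(encoding="utf-8")
            readable.append(path)
        except UnicodeDecodeError:
            unreadable.append(path)
    return readable, unreadable
```

Calling `find_plain_text_files("my_corpus")` (a hypothetical directory name) returns two lists. Any file in the second list may need to be re-saved as plain text before it is uploaded.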

2. Choose a dictionary

From the drop-down menu, select the dictionary you would like to use to tag the corpus. The “dictionaries” are actually models that have been trained on different tagsets.

Part-of-speech tagging is the same for all models; what differs is the DocuScope tagset each model was trained on.

Broadly, the models are differentiated by their number of DocuScope categories:

  • Large Dictionary = many categories
  • Medium Dictionary = some categories
  • Common Dictionary = fewer categories

Your choice of model largely depends on your data and your research questions. More categories mean greater coverage (fewer tokens left “Untagged”), but also more variables with finer-grained distinctions.

For more information, consult the DocuScope tagset descriptions.
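The trade-off between coverage and number of categories can be made concrete with a small calculation. The sketch below is illustrative only: the tag labels (`"A"`, `"B"`, and so on) and the two tag sequences are invented, and the helper names `coverage` and `category_count` are hypothetical rather than part of DocuScope CA or docuscospacy. It assumes only that unmatched tokens carry the label “Untagged”, as described above.

```python
from collections import Counter


def coverage(tags, untagged_label="Untagged"):
    """Share of tokens that received a category other than untagged_label."""
    if not tags:
        return 0.0
    counts = Counter(tags)
    return 1 - counts[untagged_label] / len(tags)


def category_count(tags, untagged_label="Untagged"):
    """Number of distinct categories in use, excluding untagged_label."""
    return len(set(tags) - {untagged_label})


# Invented tag sequences for the same six tokens under two models.
fewer_categories = ["A", "Untagged", "Untagged", "B", "Untagged", "Untagged"]
more_categories = ["A", "C", "Untagged", "B", "D", "Untagged"]

for name, tags in [("fewer", fewer_categories), ("more", more_categories)]:
    print(name, round(coverage(tags), 2), category_count(tags))
```

In this toy example the model with more categories tags four of six tokens instead of two, but it also yields four variables to interpret rather than two.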

3. Process a corpus

Once you have selected your files, they will appear in the widget. To process them, select the “process corpus” button below the widget.

Widget showing selected files and the process button.

Processing time depends on the size of your corpus. A corpus of 250,000 words should process in just a few seconds. A corpus of 2 million words will take roughly 30 seconds.
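For a rough sense of what to expect from a corpus of a different size, the figures above can be turned into a back-of-the-envelope estimate. This sketch assumes that processing time scales linearly with word count, which the documentation does not state. It uses the 2 million words in roughly 30 seconds figure as its only reference point. The function names are hypothetical, and words are counted by simple whitespace splitting.

```python
# Reference point taken from the figures quoted above.
REFERENCE_WORDS = 2_000_000
REFERENCE_SECONDS = 30


def count_words(texts):
    """Count whitespace-separated words across an iterable of strings."""
    return sum(len(text.split()) for text in texts)


def estimate_seconds(word_count):
    """Estimate processing time, ASSUMING it scales linearly with size."""
    return word_count * REFERENCE_SECONDS / REFERENCE_WORDS


print(estimate_seconds(250_000))  # 3.75
```

Under this assumption a 250,000-word corpus comes out at 3.75 seconds, which is consistent with "just a few seconds". Actual times will vary with the machine and the model selected.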