🛠 This page is currently under construction.
Load a Corpus
The first step in using DocuScope CA is to load a corpus of plain text files.
If you’re not familiar with the basics of corpus building and file preparation, see these guidelines.
Otherwise, follow these steps.
1. Upload files
First, select the load corpus tab on the left side of the window.
Scroll down to the bottom of the page, where you will find the browse files button. Select it.
A dialogue window will open. Navigate to your corpus files and select them. You can select multiple files by holding down the shift key (to select a range) or the command key (Ctrl on Windows, to pick individual files). You can also use select all if the files are in their own directory.
Alternatively, you can simply drag and drop your corpus files into the uploader.
2. Choose a dictionary
From the drop-down menu, select the dictionary you would like to use to tag the corpus. The “dictionaries” are actually models, each trained on a different DocuScope tagset.
Part-of-speech tagging is the same for all models; only the DocuScope tagging differs.
Broadly, the models are differentiated by their number of DocuScope categories:
- Large Dictionary = many categories
- Medium Dictionary = some categories
- Common Dictionary = fewer categories
Your choice of model largely depends on your data and your research questions. More categories means greater coverage (fewer tokens left “Untagged”), but also more variables with finer-grained distinctions.
For more information, consult the DocuScope tagset descriptions.
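To make the idea of coverage concrete, here is a minimal sketch in Python that computes the share of tokens that received a tag other than “Untagged”. It is not part of DocuScope CA. The function name `coverage` and the tag labels `CategoryA` through `CategoryD` are hypothetical placeholders, and the two tag sequences are made-up illustrations rather than real output.

```python
from collections import Counter


def coverage(tags, untagged_label="Untagged"):
    """Return the share of tokens whose tag is not the untagged label."""
    if not tags:
        return 0.0
    counts = Counter(tags)
    return 1 - counts[untagged_label] / len(tags)


# Hypothetical tags for the same five tokens under two models.
# The category names are placeholders, not real DocuScope categories.
fewer_categories = ["CategoryA", "Untagged", "Untagged", "CategoryB", "Untagged"]
more_categories = ["CategoryA", "CategoryC", "Untagged", "CategoryB", "CategoryD"]

print(f"{coverage(fewer_categories):.0%}")  # 2 of 5 tokens tagged
print(f"{coverage(more_categories):.0%}")   # 4 of 5 tokens tagged
```

In this toy example the model with more categories tags four of five tokens instead of two, but it also yields four variables to interpret rather than two.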
3. Process a corpus
Once you have selected your files, they will appear in the widget. To start processing, select the “process corpus” button below the widget.
Processing time depends on the size of your corpus. A corpus with 250,000 words should process in just a few seconds. A corpus with 2 million words will take roughly 30 seconds.
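If you are unsure how large your corpus is, a rough word count can help you judge which of these cases applies before you upload. The sketch below is a standalone Python helper, not a feature of DocuScope CA. It assumes your files have a `.txt` extension and are UTF-8 encoded, and it counts whitespace-separated words, which may differ somewhat from the tool's own token count. The function name `count_words` is hypothetical.

```python
from pathlib import Path


def count_words(corpus_dir, pattern="*.txt", encoding="utf-8"):
    """Return (number of files, total whitespace-separated words) in a folder."""
    n_files = 0
    n_words = 0
    for path in sorted(Path(corpus_dir).glob(pattern)):
        # errors="replace" keeps the count going if a file has stray bytes
        # (assumption: files are meant to be UTF-8 plain text).
        text = path.read_text(encoding=encoding, errors="replace")
        n_words += len(text.split())
        n_files += 1
    return n_files, n_words
```

For example, `count_words("my_corpus")` would return the number of `.txt` files in a folder named `my_corpus` (a placeholder path) and their combined word count.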