Documents

Printing the document shows the document text and indicates that there are no document features and no annotations, which is expected because the document was just loaded from a plain text file.

In a Jupyter notebook, a gatenlp document can also be visualized graphically by either just using the document as the last value of a cell or by using the IPython "display" function:

This shows the document in a layout that has three areas: the document text in the upper left, the list of annotation set and type names in the upper right and document or annotation features at the bottom. In the example above only the text is shown because there are no document features or annotations.

Document features

Let's add some document features:

Document features map feature names to feature values and behave a lot like a Python dictionary. Feature names should always be strings. Feature values can be anything, but a document can only be stored or exchanged with Java GATE if feature values are restricted to what can be serialized as JSON: dictionaries, lists, numbers, strings and booleans.
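Whether a set of feature values meets this restriction can be checked with Python's standard json module. The following is a minimal sketch and not part of gatenlp: the helper name non_json_features and the sample feature names are made up for illustration, and an ordinary dict stands in for the document features.

```python
import json


def non_json_features(features):
    """Return the names of features whose values json.dumps cannot serialize."""
    bad = []
    for name, value in features.items():
        try:
            json.dumps(value)
        except (TypeError, ValueError):
            bad.append(name)
    return bad


# Hypothetical feature values; a plain dict stands in for the document features.
features = {
    "loaded_from": "example.txt",
    "counts": [1, 2, 3],
    "meta": {"reviewed": True, "score": 0.5},
    "tags": {"a", "b"},  # a Python set is not JSON-serializable
}

print(non_json_features(features))
```

Running such a check before saving makes it easier to spot the feature that would prevent a document from being exchanged with Java GATE.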

Now that we have created document features, the document is shown like this:

Annotations

Let's add some annotations too. Annotations are items of information for some range of characters within the document. They can be used to represent information about things like tokens, entities, sentences, paragraphs, or anything that corresponds to some contiguous range of offsets in the document.

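Annotations consist of the following parts: a start and an end offset that identify the range of characters the annotation refers to, a type name that identifies what kind of thing the annotation describes (for example "Token"), and features, which map feature names to feature values in the same way as document features.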

Annotations can be organized in "annotation sets". Each annotation set has a name and a set of annotations. There can be as many sets as needed.

Annotations can overlap arbitrarily and there can be as many as needed.
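To make the idea of overlapping annotations grouped into named sets concrete, here is a simplified stand-in written with plain Python dataclasses. It is not gatenlp's implementation: the class SketchAnnotation, its field names, the function overlaps and the sample offsets are all assumptions chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SketchAnnotation:
    """Hypothetical stand-in for an annotation: a typed character range with features."""
    start: int  # offset of the first character in the range
    end: int    # offset one past the last character in the range
    type: str
    features: dict = field(default_factory=dict)


def overlaps(a, b):
    """True if the two half-open ranges [start, end) share at least one character."""
    return a.start < b.end and b.start < a.end


# Hypothetical annotation sets, keyed by set name.
sets = {
    "Set1": [
        SketchAnnotation(0, 4, "Token"),
        SketchAnnotation(5, 7, "Token"),
        SketchAnnotation(0, 24, "Sentence", {"lang": "en"}),
    ],
}

token, token2, sentence = sets["Set1"]
print(overlaps(token, sentence))  # the sentence range covers the first token
print(overlaps(token, token2))    # the two token ranges are disjoint
# Number of annotations per type within the set.
print(Counter(ann.type for ann in sets["Set1"]))
```

Nothing in this model prevents two annotations from covering the same characters, which is the property that lets tokens, sentences and entities coexist over one text.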

Let us manually add a few annotations to the document:

Add an annotation to the set which refers to the first word in the document "This". The range of characters for this word starts at offset 0 and the length of the annotation is 4, so the "start" offset is 0 and the "end" offset is 0+4=4. Note that the end offset always points to the offset after the last character of the range.
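Python slicing uses the same start/end convention, which makes the arithmetic easy to check. In this sketch the text is a made-up sample that merely begins with "This"; it is not necessarily the text of the document used above.

```python
text = "This is a sample text."  # hypothetical sample text

start = 0
length = len("This")
end = start + length  # one past the last character of the range

print(start, end)
print(repr(text[start:end]))
# The covered text always has length end - start.
print(len(text[start:end]) == end - start)
```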

If we visualize the document now, the newly created set "Set" is shown in the right part of the display. It shows the different annotation types that exist in the set, and how many annotations for each type are in the set. If you click the check box, the annotation ranges are shown in the text with the colour associated with the annotation type. You can then click on a range / annotation in the text and the features of the annotation are shown in the lower part. To show the features for a different annotation click on the coloured range for the annotation in the text. To show the document features, click on "Document".

If you have selected more than one type, a range can have more than one overlapping annotation. This is shown by mixing the colours. If you click on such a location, a dialog appears which lets you select the overlapping annotation whose features you want to display.

Loading a larger document

Let's load a larger document, this time from an HTML file: the Wikipedia page for "Natural language processing":

The markup present in the original HTML file is converted into annotations in the annotation set with the name "Original markups". For example, all the HTML links are present as annotations of type "a" (there are 449 of those), the level 3 headings are present as annotations of type "h3", and so on.

Loading and saving using various document formats

GateNLP documents can be loaded from a number of different text representations. When you run Document.load(filepath), gatenlp tries to determine the format of the document automatically from the file extension. If that fails, the format can be specified explicitly with the fmt= keyword argument, which takes either a mnemonic or a MIME type specification for the format.
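The idea of guessing from the extension and falling back to an explicit setting can be illustrated with Python's standard mimetypes module. This is only a sketch of the general approach, not gatenlp's actual detection logic; the function resolve_format and the file names are hypothetical.

```python
import mimetypes


def resolve_format(filepath, fmt=None):
    """Return an explicit format if given, else a MIME type guessed from the extension."""
    if fmt is not None:
        return fmt
    guessed, _encoding = mimetypes.guess_type(filepath)
    if guessed is None:
        raise ValueError(f"cannot determine format of {filepath!r}; pass fmt= explicitly")
    return guessed


print(resolve_format("page.html"))
print(resolve_format("notes.txt"))
# An explicit fmt takes precedence over whatever the extension suggests.
print(resolve_format("download.dat", fmt="text/html"))
```

Giving the explicit argument priority means a file with a misleading or missing extension can still be read as the intended format.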

The following formats are known. The list shows first the mnemonic, if one exists, then the MIME type, and then the description of the format. All the following formats can be loaded and saved:

The following formats can only be loaded:

The following formats can only be saved:

Documents can also be saved and loaded using Python pickle.

Documents can also be converted to and from a Python-only representation using the methods doc.to_dict() and Document.from_dict(thedict), which can be used to serialize or transfer the document in many other formats.
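A Python-only representation can be handed to whichever serializer suits the transport. The sketch below uses a hand-written dict as a stand-in, because the exact keys produced by doc.to_dict() are not described here; the dict layout and variable names are assumptions, and the JSON step assumes all feature values are JSON-serializable. With a real document, thedict would come from doc.to_dict() and the restored value would be passed to Document.from_dict.

```python
import json
import pickle

# Hypothetical stand-in for the value returned by doc.to_dict().
thedict = {
    "text": "This is a sample text.",
    "features": {"loaded_from": "example.txt"},
}

# Round trip through JSON text, for example to send the document over a network.
as_json = json.dumps(thedict)
from_json = json.loads(as_json)

# Round trip through pickle bytes.
from_pickle = pickle.loads(pickle.dumps(thedict))

print(from_json == thedict, from_pickle == thedict)
```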