Text Source Collections

Introduction
Step-by-step instructions
Case Study

Introduction

Knowtator is very flexible with respect to the kinds of texts that can be loaded and annotated. Texts may come from simple text files, XML files, or a database table. Rather than trying to predict all the kinds of texts that will be loaded into Knowtator, we provide a simple API for defining a text source collection. Two implementations of this interface are provided with Knowtator:

Local Files: This text source collection simply gathers text files located in a local directory and presents the contents in the text source viewer. The identifier for an individual text source is the name of the file that it corresponds to.
Lines from File: This text source collection opens a file that has one line per text source. Each line of the file has an identifier for the text source followed by a bar ('|') followed by the text to be annotated.
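
As a rough illustration of the "Lines from File" format, the following Java sketch splits one line into its identifier and its text. The class BarDelimitedLines and its method are hypothetical names and are not part of Knowtator. Splitting at the first bar only is an assumption made here so that the text itself may contain bars.

```java
// Hypothetical helper, not part of Knowtator: splits "identifier|text" lines.
final class BarDelimitedLines {

    /** Returns {identifier, text}, splitting at the first bar only (an assumption). */
    static String[] parseLine(String line) {
        int bar = line.indexOf('|');
        if (bar < 0) {
            throw new IllegalArgumentException("No '|' separator in line: " + line);
        }
        return new String[] { line.substring(0, bar), line.substring(bar + 1) };
    }

    public static void main(String[] args) {
        // Made-up sample line, for illustration only.
        String[] parts = parseLine("note-1|The text to be annotated.");
        System.out.println("identifier: " + parts[0]);
        System.out.println("text:       " + parts[1]);
    }
}
```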

If these two implementations are not sufficient, we encourage you to write your own text source collection implementation using our simple API. The API makes the following assumptions about the text sources in a text source collection:

The text sources are ordered and indexed starting at 0 (zero).
The text sources are plain text. That is, a call to TextSource.getText() returns plain text. The underlying source may be XML or anything else, but Knowtator is none the wiser.
The offsets are based on the plain text returned by TextSource.getText(), starting at 0 (zero). If these are not the "true" offsets from the underlying source, then Knowtator assumes that the text source implementation handles any discrepancies.
The display of the text is not formatted, extensible, or configurable.
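
The offset assumption can be illustrated with a minimal, self-contained sketch. It assumes an underlying source with simple angle-bracket markup, strips the tags to produce the plain text, and records where each plain-text character sits in the original markup. The class TagStrippingOffsets is hypothetical and does not use the Knowtator API. It also ignores entities, comments, and CDATA sections, which a real implementation would need to handle.

```java
import java.util.Arrays;

// Hypothetical sketch, not part of Knowtator: maps plain-text offsets
// back to offsets in an underlying marked-up source.
final class TagStrippingOffsets {
    final String plainText;
    // sourceOffsets[i] is the index in the markup of plainText.charAt(i).
    final int[] sourceOffsets;

    TagStrippingOffsets(String markup) {
        StringBuilder plain = new StringBuilder();
        int[] map = new int[markup.length()];
        boolean inTag = false;
        for (int i = 0; i < markup.length(); i++) {
            char c = markup.charAt(i);
            if (c == '<') {
                inTag = true;              // start of a tag: skip until '>'
            } else if (c == '>' && inTag) {
                inTag = false;             // end of the tag
            } else if (!inTag) {
                map[plain.length()] = i;   // remember the "true" offset
                plain.append(c);
            }
        }
        plainText = plain.toString();
        sourceOffsets = Arrays.copyOf(map, plain.length());
    }

    public static void main(String[] args) {
        TagStrippingOffsets t = new TagStrippingOffsets("<p>Hi <b>you</b></p>");
        int plainOffset = t.plainText.indexOf("you");
        System.out.println("plain text:    " + t.plainText);
        System.out.println("plain offset:  " + plainOffset);
        System.out.println("source offset: " + t.sourceOffsets[plainOffset]);
    }
}
```

Annotations made in Knowtator would refer to the plain offsets. Translating them back to the underlying source, as sourceOffsets does here, is the responsibility of the text source implementation.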

If you need to annotate PDFs, web pages, or anything other than plain text received from TextSource.getText(), then we need to talk about how Knowtator can be reengineered to handle these special requirements.

A text source collection is an arbitrarily defined set of text sources. Each text source corresponds to some text that comes from a file, database, etc. and is accessed via implementations of TextSourceCollection and TextSource. Each text source also corresponds to a Protégé instance. The Protégé instance corresponding to a text source will inherit from the class "knowtator text source" (in knowtator.pprj) or one of its descendants. The Protégé instance does not store the text from the text source but rather provides a sort of pointer to the text of the text source. The Protégé text source instance may also contain additional information about the text source that is relevant to and displayed for the annotators.

If you have not already done so, please visit the User Interface Tour.

Step-by-step instructions

Write some code: you need to extend DefaultTextSource.java and TextSourceCollection.java. Please consult the javadocs for these two classes and the interface TextSource.java and look at the implementations that are already available.
Add an instance to knowtator.pprj for the class called "knowtator text source collection implementations" which is a direct subclass of "knowtator support class". This can be done with the following steps:
- Click on the "Instances Tab" from the tabs along the top (e.g. Classes, Slots, Forms, Knowtator).
- The instances tab is split vertically into three main sections. In the left hand side there is a view of the classes defined in your Protégé project. Expand the class node labelled "knowtator support class" and select the class called "knowtator text source collection implementations".
- The middle section lists the current instances for the selected class. There is a button at the top of the list for creating instances. Its tooltip reads "Create Instance" if you hover your mouse over it for a couple of seconds. Click this button to create an instance.
- When the new instance is created, the rightmost section will have a single slot that must be filled in. The value of this slot should be the full name of the text source collection you implemented in step 1 and should include the package name and the class name.
- Save your Protégé project.
Add the compiled class files created in step 1 to the Protégé classpath. This can be done by putting the class files into a jar file and placing it in the Knowtator plugin directory, which is located at <protege-home>/plugins/edu.uchsc.ccp.knowtator.
Shut down and restart Protégé.
When you open a new text source collection, the "Text source type selection" dialog should give your implementation as an option. Please see the tour page if the last sentence made no sense.

Case Study

In one of our annotation projects we are annotating specialized texts called GeneRIFs. There is information associated with GeneRIFs that is very valuable to annotators for the task that we have given them. For example, each GeneRIF is associated with one or more Entrez IDs and a PubMed abstract. We created a subclass of the "Lines from File" implementation (see the filelines package). Because our annotators need to see the Entrez IDs and PubMed abstract IDs when they are annotating GeneRIFs, we created a subclass of "file line text source" (which is a subclass of "knowtator text source", which is in turn a subclass of "knowtator support class") called "generif text source". This class has slots for holding the information (Entrez IDs and PubMed IDs) that we wanted displayed. The slot values of a text source are displayed in the upper right-hand corner of Knowtator, just to the right of the text viewer (see the tour page).

The general approach we took to implement this text source collection was the following:

First, we subclassed FileLineTextSource.java, defining a new constructor and overriding the method createTextSourceInstance to handle the extra data being passed in (a String[] for the Entrez IDs and a String for the PubMed ID).
Second, we subclassed FileLineTextSourceCollection.java and overrode createTextSource so that it would create a text source with the Entrez IDs and PubMed ID. This method handles a variation on the file format used for FileLineTextSourceCollection. The original file format is:
text_source_id|text
while the updated file format is something like:
text_source_id|entrez_id:entrez_id|pubmed_id|text
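
To make the updated format concrete, here is a minimal Java sketch of how one such line might be parsed. It is not the code from our project: the class GeneRifLine and its field names are hypothetical, and it assumes exactly four bar-separated fields with the Entrez IDs separated by colons, as in the format line above. A real createTextSource override would pass the parsed values on to the text source constructor.

```java
// Hypothetical sketch, not part of Knowtator: parses one line of the form
// text_source_id|entrez_id:entrez_id|pubmed_id|text
final class GeneRifLine {
    final String textSourceId;
    final String[] entrezIds;
    final String pubmedId;
    final String text;

    private GeneRifLine(String textSourceId, String[] entrezIds, String pubmedId, String text) {
        this.textSourceId = textSourceId;
        this.entrezIds = entrezIds;
        this.pubmedId = pubmedId;
        this.text = text;
    }

    static GeneRifLine parse(String line) {
        // A limit of 4 keeps any further bars inside the text field.
        String[] fields = line.split("\\|", 4);
        if (fields.length != 4) {
            throw new IllegalArgumentException("Expected 4 bar-separated fields: " + line);
        }
        return new GeneRifLine(fields[0], fields[1].split(":"), fields[2], fields[3]);
    }

    public static void main(String[] args) {
        // Made-up identifiers, for illustration only.
        GeneRifLine rif = parse("rif-1|111:222|333|Some GeneRIF text.");
        System.out.println(rif.textSourceId + " has " + rif.entrezIds.length
                + " Entrez IDs and PubMed ID " + rif.pubmedId);
        System.out.println(rif.text);
    }
}
```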