Text Source Collections
Introduction
Knowtator is very flexible with respect to what kind of texts can be loaded and annotated.
Texts may come from simple text files, XML files or a database table.
Rather than trying to predict all the kinds of texts that will be loaded into Knowtator, we provide
a simple API for defining a text source collection. A couple of implementations of this interface are
provided with Knowtator:
- Local Files: This text source collection simply gathers text files located in a local directory and presents
the contents in the text source viewer. The identifier for an individual text source is the name of the file that
it corresponds to.
- Lines from File: This text source collection opens a file that has one line per text source. Each line
of the file has an identifier for the text source followed by a bar ('|') followed by the text to be annotated.
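As a rough illustration of the "Lines from File" format, the sketch below splits one line into an identifier and its text. The class and method names are hypothetical and not part of Knowtator. Splitting at the first bar only, so that bars inside the text survive, is an assumption about how such a file could sensibly be read, not a description of the shipped implementation.

```java
class LinesFromFileSketch {

    /** Hypothetical holder for one parsed line: an identifier plus the text to annotate. */
    static final class Entry {
        final String id;
        final String text;

        Entry(String id, String text) {
            this.id = id;
            this.text = text;
        }
    }

    /** Splits a line at the first bar only (assumption), keeping later bars in the text. */
    static Entry parseLine(String line) {
        int bar = line.indexOf('|');
        if (bar < 0) {
            throw new IllegalArgumentException("Expected 'id|text' but found: " + line);
        }
        return new Entry(line.substring(0, bar), line.substring(bar + 1));
    }

    public static void main(String[] args) {
        Entry entry = parseLine("doc1|Some text to annotate");
        System.out.println(entry.id);   // prints "doc1"
        System.out.println(entry.text); // prints "Some text to annotate"
    }
}
```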
If these two implementations are not sufficient, we encourage you to write your own text source collection
implementation using our simple API. The API makes the following assumptions about text sources in a text source collection:
- The text sources are ordered and indexed starting at 0 (zero).
- The text sources are plain text. That is, a call to TextSource.getText() returns plain text. The underlying source may be XML or anything else,
but Knowtator is none the wiser.
- The offsets are based on the plain text returned by TextSource.getText() starting at 0 (zero). If these are not the "true" offsets from the
underlying source, then Knowtator assumes that the text source implementation handles any discrepancies.
- The display of the text is not formatted or extensible/configurable.
If you need to annotate PDFs, web pages, or anything other than plain text received from TextSource.getText(), then we need to talk about
how Knowtator can be re-engineered to handle these special requirements.
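The offset assumption can be made concrete with a small sketch. It uses a naive tag stripper as a stand-in for whatever a text source implementation does to produce plain text. The class and method names are hypothetical, and treating the end offset as exclusive (as String.substring does) is an assumption rather than something stated in the API.

```java
class PlainTextOffsets {

    /** Naive tag stripper, standing in for the plain text a text source would return. */
    static String toPlainText(String markup) {
        return markup.replaceAll("<[^>]+>", "");
    }

    /** Returns the text covered by a span; start is inclusive, end is exclusive (assumption). */
    static String coveredText(String plainText, int start, int end) {
        return plainText.substring(start, end);
    }

    public static void main(String[] args) {
        String raw = "<p>BRCA1 binds BARD1.</p>";
        String plain = toPlainText(raw);

        // Offsets are counted from 0 in the plain text, not in the raw markup.
        System.out.println(coveredText(plain, 0, 5)); // prints "BRCA1"
        System.out.println(raw.indexOf("BRCA1"));     // prints 3, the "true" offset in the markup
    }
}
```

Mapping between the two offset systems, if it is needed at all, is left to the text source implementation.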
A text source collection is an arbitrarily defined set of text sources. Each text source corresponds to some
text that comes from a file, database, etc. and is accessed via implementations of TextSourceCollection and TextSource. Each text source
also corresponds to a Protégé instance. The Protégé instance corresponding to a text source
will be an instance of the class "knowtator text source" (in knowtator.pprj) or one of its descendants.
The Protégé instance does not store the text from the text source but rather provides a sort of pointer to
the text of the text source. The Protégé text source instance may also contain additional information about the
text source that is relevant to and displayed for the annotators.
If you have not already done so, please visit the User Interface Tour.
Step-by-step instructions
- Write some code: you need to extend DefaultTextSource.java and TextSourceCollection.java.
Please consult the javadocs for these two classes and the interface TextSource.java, and look at the implementations
that are already available.
- Add an instance to knowtator.pprj for the class called "knowtator text source collection implementations"
which is a direct subclass of "knowtator support class". This can be done with the following steps:
- Click on the "Instances Tab" from the tabs along the top (e.g. Classes, Slots, Forms, Knowtator).
- The instances tab is split vertically into three main sections. On the left-hand side there is
a view of the classes defined in your Protégé project. Expand the class node labelled
"knowtator support class" and select the class called "knowtator text source collection implementations".
- The middle section lists the current instances for the selected class. There is a button at the
top of the list for creating instances. Its tooltip reads "Create Instance"
if you hover your mouse over it for a couple of seconds. Click this button to create an instance.
- When the new instance is created, the rightmost section will have a single slot that must be filled in.
The value of this slot should be the full name of the text source collection you implemented in step 1
and should include the package name and the class name.
- Save your Protégé project.
- Add the compiled class files created in step 1 to the Protégé classpath. This can be done by putting the class files into
a jar file and placing it in the Knowtator plugin directory, which is located at <protege-home>/plugins/edu.uchsc.ccp.knowtator.
- Shut down and restart Protégé.
- When you open a new text source collection, the "Text source type selection" dialog should give your
implementation as an option. Please see the tour page if the last
sentence made no sense.
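The slot value entered for the new instance has to be a fully qualified class name, that is, the package name followed by the class name. How Knowtator loads the class is not described here. The sketch below only uses plain Java reflection on standard library classes to illustrate why the package prefix matters when a class is looked up by name. The class ClassNameLookup is hypothetical.

```java
class ClassNameLookup {

    /** Returns true if a class with exactly this name can be found on the classpath. */
    static boolean isLoadable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // With the package name the class is found.
        System.out.println(isLoadable("java.util.ArrayList")); // prints "true"
        // Without the package name the lookup fails.
        System.out.println(isLoadable("ArrayList"));           // prints "false"
    }
}
```

If your implementation does not show up in the dialog, a missing package prefix or a jar that is not in the plugin directory are two things worth checking first.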
Case Study
In one of our annotation projects we are annotating specialized texts called GeneRIFs.
There is information associated with GeneRIFs that is very valuable to annotators for the task that we
have given them. For example, each GeneRIF is associated with one or more Entrez IDs and a PubMed abstract.
We created a subclass of the "Lines from File" implementation (see the filelines package).
Because our annotators need to see the Entrez IDs and PubMed abstract IDs when they are annotating
GeneRIFs, we created a subclass of "file line text source" (which is a subclass of "knowtator text source", which
is in turn a subclass of "knowtator support class") called "generif text source". This class has slots
for holding the information (Entrez IDs and PubMed IDs) that we wanted displayed. The slot values of a text source
are displayed in the upper right-hand corner of Knowtator, just to the right of the text viewer (see the
tour page).
The general approach we took to implement this text source collection was the following:
- First we subclassed FileLineTextSource.java and overrode the constructor and the method createTextSourceInstance to handle the extra data being passed in (a String[] for Entrez IDs and a String for the PubMed ID).
- Second we subclassed FileLineTextSourceCollection.java and overrode createTextSource so that it would create a text source with the Entrez IDs and PubMed ID. This
method handles a variation on the file format used for FileLineTextSourceCollection. The original file format is:
text_source_id|text
while the updated file format is something like:
text_source_id|entrez_id:entrez_id|pubmed_id|text
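A minimal sketch of how a line in the updated format could be split into its parts is shown below. It is not the code from our subclasses. The class name, field names and the sample IDs are hypothetical, and limiting the split to four fields (so that bars inside the text are preserved) is an assumption.

```java
class GeneRifLineSketch {
    final String id;
    final String[] entrezIds;
    final String pubmedId;
    final String text;

    private GeneRifLineSketch(String id, String[] entrezIds, String pubmedId, String text) {
        this.id = id;
        this.entrezIds = entrezIds;
        this.pubmedId = pubmedId;
        this.text = text;
    }

    /** Parses text_source_id|entrez_id:entrez_id|pubmed_id|text into its four parts. */
    static GeneRifLineSketch parse(String line) {
        // A limit of 4 keeps any bars that occur inside the text itself (assumption).
        String[] fields = line.split("\\|", 4);
        if (fields.length != 4) {
            throw new IllegalArgumentException("Expected four bar-separated fields: " + line);
        }
        // Entrez IDs are separated by colons; a single ID yields a one-element array.
        String[] entrezIds = fields[1].split(":");
        return new GeneRifLineSketch(fields[0], entrezIds, fields[2], fields[3]);
    }

    public static void main(String[] args) {
        // The identifiers below are made-up placeholders, not real GeneRIF data.
        GeneRifLineSketch rif = parse("rif1|111:222|33333|Example text to annotate");
        System.out.println(rif.id);               // prints "rif1"
        System.out.println(rif.entrezIds.length); // prints 2
        System.out.println(rif.pubmedId);         // prints "33333"
        System.out.println(rif.text);             // prints "Example text to annotate"
    }
}
```

The String[] of Entrez IDs and the single String PubMed ID mirror the extra data passed to the text source constructor described above.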
Maintained by Philip V. Ogren.
This file last modified Monday, 08-Dec-2008 21:56:51 UTC