A Personal Knowledge Base

Clone the git repository: git clone https://github.com/uholzer/pkb4unix.git

What this is all about and Motivation

The following things have happened to me several times:

I remember having read something on some website but can't find it anymore. This is especially embarrassing when a friend is watching and Google just does not want to return the right result.
I visit a website and consider it worth visiting again. But where should I put a note so that I find it again later?
For a thesis I have to work with several papers. Of course, I store the metadata as BibTeX. Such papers often come in handy in other situations, so it would be useful to manage the metadata centrally and in a searchable form.
I discover some piece of software for a certain task. Next time I need to perform a similar task, I want to remember what software I chose.
I find a solution for some problem which turns up only rarely. I want to be able to look up this solution whenever I encounter a similar problem. Or, stated the other way around: given a problem, I want to be able to look up whether I already solved it in the past.

These events all sound like they should be solvable with currently available software. Maybe so, but I am the kind of person who wants to build his own tailored solution.

Let us look at the requirements. The data I want to store is diverse: I have problems, solutions, tasks, websites, papers and topics. Furthermore, I will likely be tempted to introduce more and more properties for all of them. Take papers as an example. There are many BibTeX properties: authors, publisher, date, journal (issue, page and so on), title, URL, DOI. The list is virtually open-ended, and one is tempted to add further properties, for example one that records in which way the paper has been useful in the past. The things I add to my knowledge base should also have relations between each other: A paper is related to a topic and an author. An author is related to her email address and her university. A solution is related to the problem it solves. I want to be able to build the knowledge base gradually and introduce new concepts and properties when needed.

Clearly, it is very difficult to do all this using relational databases, since every new concept or property would require a schema change. Semantic Web enthusiasts already know what the right tool is in this case. I will build upon Semantic Web standards and plug together readily available software. I also hope to benefit from integration with the Semantic Web. For example, many publishers provide metadata about papers in RDF, so I don't need to enter that data myself.
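
To make the idea of gradually growing, interrelated data a little more concrete, here is a minimal sketch in plain Python. It only mimics the subject-predicate-object shape of RDF statements. All identifiers (paper1, alice, usefulFor and so on) are invented for illustration, and a real knowledge base would use IRIs and a proper RDF store instead of a Python set.

```python
# Hypothetical statements; a real knowledge base would use IRIs, not bare strings.
TRIPLES = {
    ("paper1", "type", "Paper"),
    ("paper1", "title", "An Example Paper"),
    ("paper1", "author", "alice"),
    ("paper1", "topic", "rdf"),
    ("alice", "email", "alice@example.org"),
    ("solution1", "solves", "problem1"),
}


def match(triples, s=None, p=None, o=None):
    """Return all triples that agree with the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# A new property can be introduced at any time without changing a schema.
TRIPLES.add(("paper1", "usefulFor", "thesis"))

# Who wrote paper1?
print(match(TRIPLES, s="paper1", p="author"))
```

The point of the sketch is only that adding the usefulFor property required no change to any table definition or to the match function.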

This page is organized as a blog. I will report here from time to time how my knowledge base evolves. Everything mentioned above will be fleshed out and I will show how it works in practice.

Following the UNIX Philosophy

The requirements for my knowledge base are not very clear yet. In fact, I have to explicitly allow them to change over time. Because of this, it is important to stick to the UNIX philosophy. Among its principles are "Separate mechanism from policy", "Write simple parts connected by clean interfaces", "Fold knowledge into data, so program logic can be stupid and robust", and "Prototype before polishing". I will adhere to these principles as follows:

Separate mechanism from policy: I am going to use a SPARQL endpoint to store and query the RDF data. Several libraries to handle RDF data and to create presentations from it are already available or will be developed by myself. All of this implements the mechanism. The policy will be determined by the small tools I am going to write and by how they are configured.
Write simple parts connected by clean interfaces: My toolkit is going to be split up into the SPARQL endpoint and several small tools interacting with the SPARQL endpoint, acting as filters, or providing a GUI.
Fold knowledge into data, so program logic can be stupid and robust: In fact, a SPARQL endpoint is as stupid as it can get. It basically stores quadruples of meaningless identifiers and is able to query them equally meaninglessly. Indeed, it has no knowledge of what these identifiers mean. With a little luck, a SPARQL endpoint knows the meaning of identifiers in the RDFS or OWL namespace and is able to do reasoning. Also, I will use a library which presents RDF data in a human-digestible way. What these presentations look like will also be described in RDF. So, program logic will really be stupid.
Prototype before polishing: I am going to use Python, which is an excellent language for prototyping. Efficiency will not be a concern at the beginning, since this is a personal knowledge base which will likely not need to handle a huge flood of data.
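
As an illustration of what one of these small filter tools might look like, here is a rough sketch. It is not one of the actual tools. It assumes the endpoint returns results in the standard SPARQL JSON results format and turns such a document into tab-separated lines. The function names and the sample data are made up for this example.

```python
import io
import json


def results_to_rows(document):
    """Turn a parsed SPARQL JSON result document into (variables, rows)."""
    variables = document["head"]["vars"]
    rows = [
        # Unbound variables are simply absent from a binding.
        [binding.get(name, {}).get("value", "") for name in variables]
        for binding in document["results"]["bindings"]
    ]
    return variables, rows


def write_tsv(stream_in, stream_out):
    """Read a JSON result document from stream_in, write TSV to stream_out."""
    variables, rows = results_to_rows(json.load(stream_in))
    for row in [variables] + rows:
        stream_out.write("\t".join(row) + "\n")


# Hypothetical result document, standing in for real endpoint output.
SAMPLE = json.dumps({
    "head": {"vars": ["problem", "solution"]},
    "results": {"bindings": [
        {"problem": {"type": "literal", "value": "printer offline"},
         "solution": {"type": "literal", "value": "restart cups"}},
        {"problem": {"type": "literal", "value": "disk full"}},
    ]},
})

output = io.StringIO()
write_tsv(io.StringIO(SAMPLE), output)
print(output.getvalue(), end="")
```

Used as a real filter, the same function would be called as write_tsv(sys.stdin, sys.stdout), so that it can sit in a pipeline behind whatever tool talks to the endpoint.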

A SPARQL Endpoint

This time, I'll install a SPARQL endpoint. I decided on Sesame 2.7.0-beta2.

Installation instructions can be found in Sesame's documentation. Sesame is a Java application and needs a servlet container in order to run as a SPARQL endpoint. I'll use the Tomcat from my Debian distribution, found in the package tomcat7-user. This package is special in that it allows a user to easily set up and run Tomcat on his machine for testing purposes. Setting up a Tomcat instance is very easy:

urs@speedy:~/p/knowledge$ tomcat7-instance-create --help
Usage: tomcat7-instance-create [options] <directoryname>
  directoryname: name of the tomcat instance directory to create
Options:
  -h, --help       Display this help message
  -p httpport      HTTP port to be used by Tomcat (default is 8080)
  -c controlport   Server shutdown control port (default is 8005)
  -w magicword     Word to send to trigger shutdown (default is SHUTDOWN)
urs@speedy:~/p/knowledge$ tomcat7-instance-create tomcat-instance
You are about to create a Tomcat instance in directory 'tomcat-instance'
nc: unable to connect to address localhost, service 8080
nc: unable to connect to address localhost, service 8005
* New Tomcat instance created in tomcat-instance
* You might want to edit default configuration in tomcat-instance/conf
* Run tomcat-instance/bin/startup.sh to start your Tomcat instance

Unpacking Sesame reveals two war files, one for Sesame itself and one for the Sesame Workbench. These must be dropped into Tomcat's webapps directory.

urs@speedy:~/p/knowledge$ cd sesame/
urs@speedy:~/p/knowledge/sesame$ tar -xzf openrdf-sesame-2.7.0-beta2-sdk.tar.gz
urs@speedy:~/p/knowledge/sesame$ ls openrdf-sesame-2.7.0-beta2/war/
openrdf-sesame.war  openrdf-workbench.war
urs@speedy:~/p/knowledge/sesame$ cd ..
urs@speedy:~/p/knowledge$ cp sesame/openrdf-sesame-2.7.0-beta2/war/* tomcat-instance/webapps/

This does the trick. Starting and stopping Tomcat is easy with ./tomcat-instance/bin/startup.sh and ./tomcat-instance/bin/shutdown.sh. Note that Sesame's data is stored in ~/.aduna/. After starting, Sesame's Workbench can be accessed via http://localhost:8080/openrdf-workbench/ and the server is at http://localhost:8080/openrdf-sesame. Using the workbench, one can create new repositories. For my Personal Knowledge Base I create one of type Native Java Store RDF Schema with ID pkb and title Personal Knowledge Base.
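
Once the repository exists, it can also be queried from a script instead of the workbench. The following is a minimal sketch using only Python's standard library. It assumes that Tomcat listens on the default port 8080, that the repository ID is pkb as above, and that the repository is exposed at repositories/pkb below the server URL, as described in Sesame's HTTP protocol documentation. The function names are my own.

```python
import json
import urllib.parse
import urllib.request

# Assumption: default Tomcat port and the repository ID "pkb" chosen above.
REPOSITORY_URL = "http://localhost:8080/openrdf-sesame/repositories/pkb"


def build_query_request(query, repository_url=REPOSITORY_URL):
    """Build (but do not send) an HTTP POST request carrying a SPARQL query."""
    body = urllib.parse.urlencode({"query": query}).encode("ascii")
    return urllib.request.Request(
        repository_url,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            # Ask for results in the SPARQL JSON results format.
            "Accept": "application/sparql-results+json",
        },
    )


def run_query(query, repository_url=REPOSITORY_URL):
    """Send the query to the endpoint and return the decoded JSON document."""
    request = build_query_request(query, repository_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


EXAMPLE_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"
request = build_query_request(EXAMPLE_QUERY)
print(request.get_method(), request.full_url)
```

Only build_query_request is exercised here, so the sketch can be read and tried without a running server. Calling run_query(EXAMPLE_QUERY) requires the Tomcat instance to be up.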

Using the workbench, one can load and query graphs easily. It is useful for simple tasks, but for more complex ones I will have to write my own tools.
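
As a first step towards such tools, loading data does not have to go through the workbench either. The sketch below prepares a request that adds a few Turtle statements to the repository. It assumes the statements of the repository are available under repositories/pkb/statements and that the server accepts the text/turtle media type. Both should be checked against the Sesame documentation for the installed version. The ex: vocabulary and the function names are invented for illustration.

```python
import urllib.request

# Assumption: statements of repository "pkb" live under this URL.
STATEMENTS_URL = "http://localhost:8080/openrdf-sesame/repositories/pkb/statements"

# Hypothetical example data; the ex: vocabulary is made up.
EXAMPLE_TURTLE = """\
@prefix ex: <http://example.org/pkb/> .
ex:solution1 ex:solves ex:problem1 .
"""


def build_upload_request(turtle_text, statements_url=STATEMENTS_URL):
    """Build (but do not send) a POST request that adds Turtle statements."""
    return urllib.request.Request(
        statements_url,
        data=turtle_text.encode("utf-8"),
        # Assumption: the server accepts this media type for Turtle.
        headers={"Content-Type": "text/turtle;charset=UTF-8"},
        method="POST",
    )


def upload(turtle_text, statements_url=STATEMENTS_URL):
    """Send the statements and return the HTTP status code of the response."""
    request = build_upload_request(turtle_text, statements_url)
    with urllib.request.urlopen(request) as response:
        return response.status


upload_request = build_upload_request(EXAMPLE_TURTLE)
print(upload_request.get_method(), upload_request.full_url)
```

As before, only the request is built here. Calling upload(EXAMPLE_TURTLE) needs the running Tomcat instance.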