g-pypi 0.2.1 has been released. This is a bug fix release with changes that allow it to work with the portage 2.2 API changes and with older portage versions. There is also a change for a reported amd64 bug, so amd64 users please try 0.2.1 and file a ticket if you encounter a bug.

What is g-pypi? g-pypi creates Python package ebuilds for Gentoo Linux (think g-cpan of the Python world). Unlike CPAN, the Python Package Index is very liberal about the packages you can upload and the metadata you supply. You can also use any crazy versioning scheme you can think of and PyPI probably won’t flinch. This makes it difficult to create ebuilds and has kept g-pypi in my overlay for a few years now.

I’ve only had a few actual bug reports over the last couple of years, but I’m afraid that if I put it in the official tree people may expect it to perform as well as g-cpan and file reports saying it couldn’t figure out the license or dependencies when the upstream maintainer didn’t provide any of that information. There are a few other problems that won’t get fixed until things change with the Python Package Index and distutils. Maybe Py4k.

So, give me some feedback: let me know if it’s been reliable enough for you to warrant going in the official tree. Go to Ohloh and let me know you’re using g-pypi so I get motivated again.

You can find g-pypi in my overlay (layman -a pythonhead)

In the next major version I’ll have g-pypi search the semantic web for missing metadata. If PyPI doesn’t have the license, for instance, but Ohloh or Freshmeat or SourceForge does, it’ll figure it out.
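As a rough illustration of that fallback idea (this is not g-pypi's code), the sketch below walks a list of metadata sources in priority order and returns the first usable value. The function name, the sample dictionaries, and the choice to treat "UNKNOWN" as a missing value are all assumptions made for the example.

```python
def first_known(field, sources):
    """Return (value, source_name) from the first source that supplies `field`.

    `sources` is a list of (name, metadata_dict) pairs in priority order.
    Empty strings and the placeholder "UNKNOWN" are treated as missing
    (an assumption for this sketch).
    """
    for source_name, metadata in sources:
        value = (metadata.get(field) or "").strip()
        if value and value.upper() != "UNKNOWN":
            return value, source_name
    return None, None


# Hypothetical data: PyPI has no license, Freshmeat does.
sources = [
    ("pypi", {"license": "UNKNOWN", "homepage": "http://example.org/pkgfoo"}),
    ("ohloh", {"license": ""}),
    ("freshmeat", {"license": "BSD"}),
]
license_name, origin = first_known("license", sources)
```

Returning the source name along with the value keeps it visible where a piece of metadata came from, which matters when an upstream maintainer disputes it.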

I created a tool that uses FOAF to create home pages for Gentoo developers.

ternate takes Gentoo Linux developer information from our LDAP server on http://dev.gentoo.org and creates a FOAF file.

We use an XML format called GuideXML for all our official documentation and I thought that would give me an easy way to create a consistent look. I created an XSLT style sheet that takes FOAF and outputs GuideXML which can then be converted to HTML using www-servers/gorg.

For now it’s fairly basic since our LDAP servers don’t contain too much info. I’ll add features to parse our herds.xml file and add herds, parse our blog roll to find your blog URL etc.

You can take the FOAF and manually add quite a bit of info that can be included in GuideXML. Take a look at mine to see what I did with my FOAF file.

I was planning on setting up a SPARQL endpoint so we could do interesting things like checking if your fellow developers have accounts for games etc. but Apache on dev.gentoo.org is ignoring my .htaccess and refuses to serve .rdf files as anything but text.

dev-python/ternate can be found in my overlay (layman -a pythonhead).

It’s fairly easy to use -

Get your LDAP info:

$ ssh USERNAME@dev.gentoo.org '/usr/local/bin/perl_ldap -b anon -s USERNAME' > ldap.txt

Convert it to FOAF:

$ ternate --foaf ldap.txt > foaf.rdf

Convert FOAF to GuideXML:

$ ternate --guidexml foaf.rdf > index.xml

To convert it to HTML you’ll need www-servers/gorg:

$ gorg < index.xml > index.html
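If you would rather run the four steps in one go, a small wrapper along these lines might do. It is only a sketch: the function names are made up for the example, it assumes the ternate options are spelled --foaf and --guidexml, and it assumes ssh, ternate and gorg are all on your PATH.

```python
import subprocess


def ternate_steps(username):
    """Describe each step as (argv, stdin_file_or_None, stdout_file)."""
    remote_cmd = "/usr/local/bin/perl_ldap -b anon -s %s" % username
    return [
        (["ssh", "%s@dev.gentoo.org" % username, remote_cmd], None, "ldap.txt"),
        (["ternate", "--foaf", "ldap.txt"], None, "foaf.rdf"),
        (["ternate", "--guidexml", "foaf.rdf"], None, "index.xml"),
        (["gorg"], "index.xml", "index.html"),
    ]


def run_steps(steps):
    """Run the steps in order, stopping at the first failure."""
    for argv, stdin_path, stdout_path in steps:
        stdin = open(stdin_path, "rb") if stdin_path else None
        try:
            with open(stdout_path, "wb") as stdout:
                subprocess.run(argv, stdin=stdin, stdout=stdout, check=True)
        finally:
            if stdin:
                stdin.close()


steps = ternate_steps("USERNAME")
# run_steps(steps)  # uncomment on a machine with access to dev.gentoo.org
```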

I first became interested in DOAP while working on a project to help Gentoo developers get new upstream package release information by ‘herd’. I was taking Freshmeat names and matching them to Gentoo package names and hoped they mapped correctly. It worked pretty well, actually, and I came up with a way to let developers re-map names if they were incorrect. I wanted to expand to other package indexes and needed a common format to let people create their own scrapers and plugins. That’s when I discovered DOAP and Karl Fogel’s idea for a Galactic Project Registry.

I got in touch with Karl last year but he hadn’t done any work on it and it didn’t seem like he had a plan for actually implementing the GPR. So, I thought I’d see if I could come up with an implementation.

My first idea, which hasn’t really gone anywhere, was to use a website (doapurl.org) that would allow people to register PURLs for DOAP, allowing an easy way for users to find DOAP that was verifiably authentic. The problem with that is that only semantic web geeks create DOAP, and who’s going to want to create DOAP if nobody (i.e. SourceForge, Freshmeat etc.) is importing or exporting it?

As I tweaked and perfected my algorithm for smushing DOAP I came to the conclusion that I should ditch doapurl.org and just use doapspace.org. Rather than try to get people to understand how cool DOAP can be, I’ll make it easy to create DOAP for developers from existing sources and claim it as their own.

doapspace

Currently you can get DOAP for a single project using several different URLs. If there is DOAP for a project via Freshmeat, SourceForge and The Python Package Index, they can all be reached by the following URLs:

http://doapspace.org/doap/fm/pkgfoo

http://doapspace.org/doap/sf/pkgfoo

http://doapspace.org/doap/py/pkgfoo

These are all separate DOAP profiles with most of the information duplicated. When I upgrade the current doapspace server code to my next version you’ll be able to click on a ‘smush’ link and either view a webpage or download RDF that is a single graph of an aggregation of all DOAP for a single project.

While I could have the three previously mentioned URLs all automatically present the smushed DOAP profiles, I like the idea of being able to see exactly what metadata comes from which source.
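To make the idea concrete, here is a minimal sketch of merging per-source profiles while remembering which source asserted what. It is not the doapspace server code: real smushing operates on RDF graphs, whereas this flattens each profile to a plain dictionary, and the helper names and sample values are invented for the example. Only the URL pattern comes from the URLs listed above.

```python
from collections import defaultdict

DOAPSPACE = "http://doapspace.org/doap/%s/%s"


def doap_url(source, name):
    """Build the per-source URL, e.g. source 'fm', 'sf' or 'py'."""
    return DOAPSPACE % (source, name)


def smush(profiles):
    """Merge {source: {property: [values]}} into
    {property: {value: set_of_sources}} so provenance is kept."""
    merged = defaultdict(lambda: defaultdict(set))
    for source, properties in profiles.items():
        for prop, values in properties.items():
            for value in values:
                merged[prop][value].add(source)
    return merged


# Hypothetical profiles for 'pkgfoo' from three indexes.
profiles = {
    "fm": {"homepage": ["http://example.org/pkgfoo"], "license": ["GPL"]},
    "sf": {"homepage": ["http://example.org/pkgfoo"], "release": ["1.0", "1.1"]},
    "py": {"license": ["GPL"], "release": ["1.1"]},
}
merged = smush(profiles)
```

Duplicated values collapse into a single entry, while the set of sources attached to each value still shows where it came from.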

So my plan is to have a unique URL for each project out there. It will use the same scheme I decided to use for doapurl.org, but instead of waiting for people to come register a new URL and create DOAP, I can start with what metadata we have, and possibly create the URL automatically based on categories from existing indexes and OS distros.

The URL scheme for all the smushed DOAP would be in a single-level category and use the ‘shortname’ from DOAP. If you’re familiar with Gentoo’s packaging system, the categories are actually two-level but separated by a ‘-’, but as far as a URL goes, it’s one level. In our example for ‘pkgfoo’, it would have a unique URL which returns the smushed DOAP graph or webpage, based on content negotiation, here:

http://doapspace.org/app-backup/pkgfoo
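Content negotiation here means the server looks at the client's Accept header to decide whether to send the RDF graph or the webpage. The sketch below shows one simplified way to make that choice. It is an assumption about how such a server could work, not doapspace's implementation, and it takes the highest q-value among matching patterns rather than following the full precedence rules of HTTP.

```python
def parse_accept(header):
    """Parse an Accept header into a list of (media_range, q) pairs."""
    prefs = []
    for part in header.split(","):
        fields = part.strip().split(";")
        media = fields[0].strip().lower()
        if not media:
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media, q))
    return prefs


# Representations the hypothetical server can produce, in tie-break order.
OFFERED = ("text/html", "application/rdf+xml")


def choose_representation(accept_header, offered=OFFERED):
    """Return the offered media type the client prefers, or None."""
    prefs = parse_accept(accept_header or "*/*")
    best, best_q = None, 0.0
    for media in offered:
        wildcard = media.split("/")[0] + "/*"
        q = 0.0
        for pattern, pattern_q in prefs:
            if pattern in (media, wildcard, "*/*"):
                q = max(q, pattern_q)
        if q > best_q:
            best, best_q = media, q
    return best
```

A browser sending "text/html,application/xhtml+xml,*/*;q=0.8" would get the webpage, while a Semantic Web client asking for "application/rdf+xml" would get the graph.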

Doapspace’s spiders are automatically gathering all this metadata and monitoring pingthesemanticweb.com for any new or updated DOAP found out there. After a project is claimed by people who work on it, they’ll be able to see a list of URLs for DOAP that has been spidered and choose to aggregate it one-time, or authorize URLs to be automatically aggregated and updated each time they are updated.

Using OpenId, we’ll be able to authenticate users using delegation by various means. SourceForge now allows you to use it as an OpenId delegation point, which is a great start. I’ll go more into this in another article but, suffice to say, doapspace will use as much automated authentication as possible instead of verifying submissions manually, like Freshmeat does, for instance.

Operating system distributions will be allowed to make certain semantic assertions about any projects using OpenId delegation. A Gentoo developer would delegate against a file in their user directory (http://dev.gentoo.org/~pythonhead/openid.html). They could then say that pkgfoo is packaged in Gentoo as http://packages.gentoo.org/app-foo/foopkg. Debian and Ubuntu developers could then make assertions using http://packages.debian.org etc. I’m running tests using Gentoo’s entire portage tree and importing any matches into doapspace using <doap:packaged-as rdf:resource="http://packages.gentoo.org/cat/pkgname"/> (packaged-as isn’t part of the official DOAP vocabulary, don’t try this at home).
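For the curious, an element like that can be produced with nothing more than Python's standard library. This is a sketch for illustration only: the packaged_as helper is hypothetical, and as noted, packaged-as is not part of the official DOAP vocabulary.

```python
import xml.etree.ElementTree as ET

DOAP = "http://usefulinc.com/ns/doap#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Use the conventional prefixes when serializing.
ET.register_namespace("doap", DOAP)
ET.register_namespace("rdf", RDF)


def packaged_as(category, package):
    """Build the experimental doap:packaged-as element for a Gentoo package."""
    elem = ET.Element("{%s}packaged-as" % DOAP)
    url = "http://packages.gentoo.org/%s/%s" % (category, package)
    elem.set("{%s}resource" % RDF, url)
    return elem


elem = packaged_as("app-foo", "foopkg")
xml_bytes = ET.tostring(elem)
```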

I think having doap:packaged-as or something similar would be quite handy for Baetle, no? You’d be able to find all the bug trackers for every distribution easily.

So my plan is to make doapspace into something like dbpedia.org, except that all the input will be strictly limited to parties who have the authorization to assert specific semantic statements and by taking metadata automatically from places where the information is known to have originated from developers and maintainers (SourceForge, RubyForge etc.).

Is this so crazy it just might work?

This is the second of several articles describing my work on a Semantic Web project I hope becomes helpful to Gentoo developers and users. I use Gentoo as an example distribution in this article but creating doapfiend plugins for other distributions is trivial.

In my first article in this series I described DOAP, a vocabulary for describing Open Source project metadata and described some imaginary tools that could be made using Semantic Web technology. Since then I’ve created a tool that does much of what my imaginary tools promised.


Introducing Doapfiend

Some of the metadata DOAP can describe is a project’s name, homepage, description, bug tracker URL, file release URLs and changelogs, screenshot URLs, VCS URLs (svn, cvs, bzr, Mercurial etc.), wiki URLs, programming languages, licenses, names and email addresses for developers, documenters, translators and more.

Doapfiend is a command-line client and Python library that displays or serializes DOAP in several formats. You can also search the Semantic Web with relative ease and actually do stuff with that metadata.

Doapfiend is entirely plugin based. Some of the plugins available let you search for DOAP using a Gentoo package name, a SourceForge, Freshmeat, Ohloh or Python Package Index project name, or a project’s homepage.

You can simply display DOAP in human readable format or serialize it in various formats. One simple plugin takes DOAP and creates a skeleton ebuild, and another creates a webpage with HTML and CSS.

The simplest usage example, and quite boring:
doapfiend -u /path/to/somedoap.rdf

or

doapfiend -u http://example.com/some.rdf

This will display the project metadata in human readable format, nicely formatted. It’s not terribly exciting, but you get concise information quickly.

If you only want some metadata about the project, use the --fields option. Say you want to find out the project’s Subversion location URL and the homepage URL:

doapfiend -u some-project.rdf --fields svn.location,homepage

That’s a little more exciting, but what if you don’t know what kind of version control system they’re using? And you don’t know where their DOAP file is on the web or even the homepage of the project, but you know the name your package manager knows the project as.

doapfiend --gentoo dev-python/doapfiend --vcs checkout

This looks in the portage tree for the doapfiend ebuild, gets the homepage, searches the Semantic Web for DOAP with that project homepage, fetches it and gets the repository. The VCS plugin determines the project uses Subversion and sends the ‘checkout’ command to the repository using the URL it found.
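The first step of that chain, getting the homepage out of an ebuild, can be approximated with a regular expression. This is a deliberately naive sketch and not how Doapfiend does it: it reads the raw text, does not expand shell variables such as ${PN}, and the function name and sample ebuild are made up for the example.

```python
import re

# Match HOMEPAGE="..." or HOMEPAGE='...' at the start of a line.
HOMEPAGE_RE = re.compile(r'^HOMEPAGE=(["\'])(.*?)\1', re.MULTILINE | re.DOTALL)


def ebuild_homepages(ebuild_text):
    """Return the list of URLs found in an ebuild's HOMEPAGE variable."""
    match = HOMEPAGE_RE.search(ebuild_text)
    if not match:
        return []
    # HOMEPAGE may hold several whitespace-separated URLs.
    return match.group(2).split()


sample_ebuild = (
    'DESCRIPTION="An example project"\n'
    'HOMEPAGE="http://trac.example.org/projectx"\n'
    'SRC_URI="http://example.org/projectx-1.0.tar.gz"\n'
)
homepages = ebuild_homepages(sample_ebuild)
```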

When you create an ebuild, rpm, deb file etc., you need the basic project metadata. I’ve written a very basic plugin that generates a Gentoo ebuild. Say you know the SourceForge name (or Ohloh, Freshmeat etc.):

doapfiend --sf project_name --ebuild

This prints an ebuild to stdout, nothing too fancy, but it only took about 30 minutes to write the plugin. A more sophisticated plugin would start by showing you file releases and letting you choose, naming the ebuild accordingly etc. We could also determine the programming language from the DOAP and if we have a more suitable ebuild generator, like g-cpan for Perl or g-pypi for Python, call those.
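To give a feel for how little is involved, here is a sketch of a skeleton generator driven by a plain dictionary. It is not the plugin's source: the template, the field names in the dictionary and the quoting helper are assumptions for the example, and KEYWORDS is left empty for a human to fill in.

```python
EBUILD_TEMPLATE = '''\
# Skeleton generated from project metadata; review before use.
DESCRIPTION="{description}"
HOMEPAGE="{homepage}"
SRC_URI="{src_uri}"

LICENSE="{license}"
SLOT="0"
KEYWORDS=""
'''


def shell_quote_value(value):
    """Escape characters that are special inside a double-quoted shell string."""
    for char in ("\\", '"', "$", "`"):
        value = value.replace(char, "\\" + char)
    return value


def skeleton_ebuild(meta):
    """Fill the template from a metadata dict; missing fields become empty."""
    keys = ("description", "homepage", "src_uri", "license")
    fields = dict((key, shell_quote_value(meta.get(key, ""))) for key in keys)
    return EBUILD_TEMPLATE.format(**fields)


ebuild_text = skeleton_ebuild({
    "description": 'A "quoted" tool',
    "homepage": "http://example.org/pkgfoo",
    "src_uri": "http://example.org/pkgfoo-1.0.tar.gz",
})
```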

Doapfiend isn’t strictly limited to DOAP files. You can throw any RDF file at Doapfiend and it will try to do something with it. If you have a FOAF (Friend of a Friend) file and it has the person’s Open Source projects listed in it, Doapfiend will print all those projects’ homepages. You can add the -f switch and it will search for DOAP for each project and display all the metadata.

Doapfiend API - Don’t Panic

Doapfiend contains a library with a simple API designed to be easy to use for coders who have little or no RDF experience. It’s based on RDFAlchemy, an ORM which uses rdflib. If you’re familiar with SQLAlchemy, you’re all set. The RDFAlchemy API was created to let you create code that uses SQLAlchemy or RDFAlchemy with little to no code changes. If you’re an RDF guru you can drop down to rdflib and access triples after using Doapfiend’s API to search the Semantic Web if you prefer.

If all that means nothing but you know a little Python, here’s how you’d fetch metadata for a project with a SourceForge name of ‘nut’.

from doapfiend.doaplib import get_by_pkg_index
print(get_by_pkg_index('sf', 'nut'))

That will print out all of the project’s metadata in plain text, but say you just want a few pieces of information, using the Freshmeat project name:

from doapfiend.doaplib import get_by_pkg_index, load_graph
doap = load_graph(get_by_pkg_index('fm', 'nut'))
print(doap.name)     # nut
print(doap.created)  # 2008-04-19

So there you have a taste of what you can do with DOAP today. Of course you’re wondering how much DOAP is out there, who’s creating it and how you can create it for your own projects.

Where Does DOAP Come From? Who’s Using It?

Doapfiend uses doapspace.org to search for most DOAP. I get new and updated DOAP URLs from PingTheSemanticWeb.com and re-spider them daily. The Ohloh plugin uses the RDFOhloh website. I started work on doapspace.org last year and have been spidering DOAP, creating DOAP by scraping HTML from SourceForge and other package indexes, using metadata from FLOSSMole, and importing and converting Freshmeat’s publicly available data. The Python Package Index provides DOAP for every project listed. All this DOAP is made freely available on doapspace.org.

Today I have approximately 54,000 DOAP files hosted on doapspace.org. That isn’t DOAP for 54,000 different projects; there are duplicates because it’s common to have metadata for a single project from SourceForge, Ohloh, Freshmeat and PyPI, for instance. I’ve monitored the last 17,000 SourceForge project releases and created DOAP for each. I’m about 99% happy with my SourceForge spider. When it’s ready I’ll spider all of SourceForge and keep that metadata up to date ‘in real time’.

When I started doapspace.org, having all that duplicated metadata was worrisome. I was trying to figure out which data to serve up. Ohloh doesn’t provide file release info, but Freshmeat does, though only for current releases. That’s handy, but SourceForge has all the file releases. Was I going to have to figure out what data the client wants then serve up the ‘best’ RDF file?

That was before I realised how flexible RDF is and how easy it’s going to be to aggregate all that metadata about a single project into a single graph. I’m not there yet, but I’m getting there.

In My Next Article in the Series…

In my next article I’ll show you how to create DOAP, put it on the web and get it spidered by a Semantic Web crawler. You’ll learn how to add a few lines to a FOAF file for each project you’re involved with. I’ll also discuss some other vocabularies that tie in with DOAP such as SIOC and BAETLE, the Bug And Enhancement Tracking LanguagE, which will allow a semantic interface to existing bug trackers.

This series of articles will also explore using FOAF and DOAP to make Gentoo’s metadata more easily available to developers and users. For instance, our LDAP information is only accessible to developers. How about taking that info, creating a FOAF file for every developer in their dev.gentoo.org/~user accounts? We could add DOAP for projects or herds they’re involved with automatically, then let them edit from there to add as much personal information as they’d like. See SPARQLbot, an IRC bot (#sparqlbot on Freenode) to see where I’m headed with this.

I don’t know if there’s anything else like this out there, but here is an online collaborative editor.

While that may not sound unique, what sets it apart is that it works like a pastebin such as rafb.net, where there’s no registration and you just paste your code and give out the URL, except that everyone can edit the source code in real time with syntax highlighting.

And it seems to be running on CherryPy, which of course makes it even cooler.

I haven’t been a big fan of social networking sites. I joined Facebook at my sister’s insistence but the only thing I got out of it was the discovery of Scrabulous.

I’ve joined several others just to check them out, but I rarely return.

There’s probably some agreement I clicked on when I got the invitation to Twine.com, so I won’t say much until I’ve actually read it. I mean “read again”. Ahem.

It’s not really fair to compare Twine to Facebook, LinkedIn etc., but they are the closest things to it. But generally when you sign up for those you go through the same thing. They want to hook you up with everyone you know. Give them all your accounts. You know the drill. And then they’re all pretty much the same. They’ve got your personal data, locked in, no exporting.

On Twine, what I really get out of it is meeting new contacts based on what I’m interested in. It learns from what I’m interested in and suggests new contacts. More on that in a future article.

Twine exposes every item as RDF to the web. After a few hours hacking on a tool to play with this feature, I give you my quick summary of Facebook compared to Twine from a curious hacker’s point of view:

Here’s what you get when you play around with Facebook:

http://scobleizer.com/2008/01/03/ive-been-kicked-off-of-facebook/

And here’s what I got when I played around with Twine:

“Yes, this is the first Twine application written by a third-party (not on our
development team). Congratulations! This is a really cool milestone for Twine.
I look forward to seeing what else you may come up with in the future.”

That was a comment on my blog from Nova Spivack, Twine’s creator.

And inside Twine I’ve received a couple of notes from their developers asking me if there’s anything I need with regard to their API and giving me nothing but encouragement.

I have no idea if this is actually the first Twine app or not, but it’s the only one I’ve seen.

entwine is a command-line client written in Python using rdflib. It simply logs in to twine.com and is able to fetch anyone’s profile in RDF, parse it and print it out.

I got a nice email and connection invitation from Peter Royal, a Senior Architect for Twine, asking what I’d like to see in the API.

I was a little startled that anyone had found entwine so quickly. I had only set up the project on Google Code hosting last night and it wasn’t visible if you searched for ‘twine’ because it was still inactive.

I’m still exploring Twine so I’m not sure exactly what I’d like out of the API yet. If you’re in on the beta, check out the new Twine I started to discuss just that.

This is the first of several articles describing my work on a Semantic Web project that should be helpful to Gentoo developers and users in many new and exciting ways. Some tools I describe are specific to Gentoo but should be of interest to users and developers of all Open Source software.

It’s the metadata, DOAP!

DOAP (Description of a Project) is an RDF vocabulary that describes Open Source projects. The metadata found in DOAP is rich with information we can use to create and patch existing tools to make life easier for developers and users.

Here are tools that would be possible with little work for the most part:

  • A command-line client you give the name of a Gentoo package and it shows you all versions available ‘upstream’, in portage or not. It can also show you the URLs of each file release or even the ChangeLog entry for any version.
  • Patch pybugz, our Bugzilla command-line tool, to take a Gentoo package name and automatically find the URL for that package’s Bugzilla interface and query it.
  • Use DOAP metadata to create a basic ebuild with basic dependency info.
  • A command-line client you give a package name and a search term and it figures out the project’s forum URL and searches it, returning all the results.
  • QA Tools: Run a tool against an ebuild and it’ll show any license changes, homepage change or new or dropped dependencies.
  • Firefox search bar plugin that lets you enter b:package-name and it takes you to the project’s bug tracker or w:package-name and takes you to that project’s wiki, h:package-name takes you to their homepage, etc.
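As a taste of how thin some of these tools could be, the search bar shortcut from the last item boils down to a lookup table from prefix to DOAP property. The sketch below is an assumption about one way to do it: the resolve function and the sample data are hypothetical, and a real tool would fetch the DOAP instead of using a hard-coded dictionary.

```python
# Map the one-letter prefixes to DOAP property names.
PREFIXES = {"b": "bug-database", "w": "wiki", "h": "homepage"}


def resolve(query, doap_by_package):
    """Turn 'b:package-name' into the matching URL, or None if unknown."""
    prefix, sep, package = query.partition(":")
    if not sep or prefix not in PREFIXES:
        return None
    return doap_by_package.get(package, {}).get(PREFIXES[prefix])


# Hypothetical metadata, already extracted from a project's DOAP.
doap_by_package = {
    "projectx": {
        "homepage": "http://trac.example.org/projectx",
        "bug-database": "http://trac.example.org/projectx/report",
        "wiki": "http://trac.example.org/projectx/wiki",
    },
}
bug_url = resolve("b:projectx", doap_by_package)
```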

Before I get into the details of how these tools will be built, I’ll describe how we find DOAP for a particular project and how we parse it.

The Homepage is Key

Think of the Semantic Web as a database parallel to the normal Web, describing the contents of websites in machine-readable metadata. Each DOAP file describes an Open Source project. Because of the de-centralized nature of the Semantic Web, DOAP files are stored wherever the project wants. A DOAP index would need some way to link DOAP files to projects in a reliable way. We can’t rely on a ‘project name’ field, because uniqueness can’t be enforced. The URL of the homepage is this unique key.

http://trac.example.org/projectx -> http://example.org/doap.rdf

Our tools will take a Gentoo package name, look in the ebuild(s) for the homepage URL then query a DOAP index and fetch the DOAP.

The obstacle here of course is making sure our ebuilds have the correct homepage, which can be a problem, especially when every KDE ebuild has http://www.kde.org/ as the homepage, for example.

This can be overcome without manually editing hundreds of ebuilds, but I’ll get to that in another article.

What if a project moves? DOAP also allows multiple ‘old-homepage’ fields, each of which, if you’re familiar with SQL, works like an alternate key.
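Here is a minimal sketch of such an index, with the homepage as the primary key and any old homepages as alternate keys. The record layout, function names and normalization rules (lowercased host, trailing slash dropped) are assumptions for the example, not a description of any existing DOAP index.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a homepage URL to scheme, lowercased host and path
    without a trailing slash, so trivial variants compare equal."""
    parts = urlsplit(url.strip())
    return "%s://%s%s" % (parts.scheme.lower(), parts.netloc.lower(),
                          parts.path.rstrip("/"))


def build_index(records):
    """Map every current and old homepage to the project's DOAP URL."""
    index = {}
    for record in records:
        pages = [record["homepage"]] + list(record.get("old_homepages", []))
        for page in pages:
            index[normalize(page)] = record["doap_url"]
    return index


def find_doap(index, homepage):
    """Look up the DOAP URL for a homepage, or None if it is unknown."""
    return index.get(normalize(homepage))


# Hypothetical record for a project that moved from old.example.org.
index = build_index([{
    "homepage": "http://trac.example.org/projectx",
    "old_homepages": ["http://old.example.org/projectx/"],
    "doap_url": "http://example.org/doap.rdf",
}])
```

With this in place, an ebuild that still lists the old homepage resolves to the same DOAP file as one that lists the current homepage.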

In Part 2 of Gentoo and the Semantic Web I’ll discuss where DOAP is today, who’s using it and how close we are to having my imaginary tools.