Sunday, August 18, 2013

Signs your product is doomed #42314

It enables toasters or refrigerators to communicate with anything.

Stop. They shouldn't do that.

Toasters toast. Refrigerators refrigerate.

Friday, April 13, 2012

New Internet Draft: Semantic Content Packages

"Blobs, Triples, and a URI. Bring Your Own Vocabulary."


Sound interesting? Then have a look at http://www.ietf.org/id/draft-wilper-semantic-content-pkgs-00.txt Is this the next logical step in semi-structured data management, or is Chris just trying to stir up a hornet's nest? You decide.



Seriously though, I've been thinking in this area for a while and I know others have, too. Particularly in the preservation & archiving community with things like BagIt and ORE and various combinations. I think something a bit more generic that has RDF at its core and, critically, acknowledges that copies of the same content can be made available from multiple locations...could have quite a lot of potential. 


Anyway, I thought it would be interesting to get down to business and actually specify something for people to bang on. So if this is an area you've got an interest in, and you've got a few minutes, I'd appreciate your giving it a read.


Public comments or email to me are fine for now. I can also set up a group if there's sufficient interest.

Tuesday, November 09, 2010

A simple file-level dedupe utility in Python

At home, I've been working on organizing my photo library and found FastDup to be a great little utility. You point it at a directory and it finds duplicate files with surprising speed. It works well because it's smart about not doing more work than it needs to. A naive dedupe utility (which, ahem, I may have written in Java a couple years ago to do similar work with my audio library) works like this:
  1. Compute the checksum of all files
  2. List files with matching checksums
A smarter approach is to:
  1. Group all files by size
  2. Do a partial comparison of all files of a given size, quickly excluding obvious non-matches
  3. Complete the comparison for files that look equivalent so far, listing matches
FastDup, which is written in C++, takes this approach. I compiled and ran it fine on my file server (an Ubuntu machine) and tried to compile it on my Mac, too...no luck. The author states in the README that it works in Linux and nowhere else, and the last release was a couple years ago, so it seemed I was out of luck.

Well, not really. I've been wanting to get re-acquainted with Python for a while now (for various reasons), and I figured this was a good excuse. How hard could it be? As it turns out, not very.

  • qdupe - A command-line utility to quickly find duplicate files, written in Python and inspired by FastDup.
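For the curious, the three-step strategy can be sketched in a page of Python. This is a simplified sketch of the idea, not qdupe's actual code; the function name, chunk size, and choice of MD5 are mine:

```python
import hashlib
import os
from collections import defaultdict

def find_dupes(root, chunk=4096):
    """Return groups of duplicate files under root, using the
    size -> partial-hash -> full-hash strategy described above."""
    # Step 1: group all files by size; a unique size can't be a dupe.
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                by_size[os.path.getsize(path)].append(path)

    dupes = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        # Step 2: cheap pass -- hash only the first chunk of each file,
        # quickly excluding obvious non-matches.
        by_head = defaultdict(list)
        for path in paths:
            with open(path, 'rb') as f:
                by_head[hashlib.md5(f.read(chunk)).hexdigest()].append(path)
        # Step 3: full hash, but only for files still indistinguishable.
        for group in by_head.values():
            if len(group) < 2:
                continue
            by_full = defaultdict(list)
            for path in group:
                h = hashlib.md5()
                with open(path, 'rb') as f:
                    for block in iter(lambda: f.read(1 << 20), b''):
                        h.update(block)
                by_full[h.hexdigest()].append(path)
            dupes.extend(g for g in by_full.values() if len(g) > 1)
    return dupes
```

A real tool would probably finish with a byte-for-byte comparison rather than trusting a checksum alone, but the payoff is the same: most files are eliminated after a single stat() call, and most of the rest after reading one small chunk.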

So how does it compare to FastDup?

Out of curiosity, I ran both over my DVD library, which is currently at about half a terabyte. I ran each twice, back to back, in order to see the effects of the OS's buffer cache. They both found 911 dupes, adding up to about 500MB. The first time I ran them, they each took about a minute. The second time, FastDup took 3.0 seconds and qdupe took 3.6 seconds.

Thursday, December 03, 2009

Dot Plan from 1995

Before the inter-twitter-facebook-blogweb, or whatever you kids call it, there was Finger. Finger was cool because only geeks knew about it. You'd post your status to your .plan file and people anywhere in the world could type "finger some-obscure-userid@some-obscure-host.edu" to see it.

It was like blogging, but with an even slimmer chance of having an audience. Great stuff.

Anyway, I was rooting around my old account at csh tonight and found this in my .plan:

class CS2

creation
   brain_washing

feature -- Global variables

   student: STUDENT
   clean: INTEGER is unique
   warped: INTEGER is unique

feature -- Main program

   brain_washing is
      do
         from
            !!student.make
            student.mind := clean
         until
            student.mind = warped or world.end_of
         loop
            student.io.putstring( "EIFFEL is Good%N" )
            student.io.putstring( "Don't worry that your executables " )
            student.io.putstring( "are usually over 20,000 times larger " )
            student.io.putstring( "than the source code.%N" )
            if student.resists then
               student.attend_lecture
               student.attend_lecture
               student.attend_lecture
               student.attend_lab
            end
         end -- loop
      end -- brain_washing

end -- CS2

Clearly, this is an important digital artifact to preserve.

By posting it here, I feel I have played an important role in format migration for future generations. Thank you.

Friday, September 04, 2009

An extra cent?

It often happens that my flight price goes up while I'm in the process of booking. I thought it was pretty shady the first few times it happened. Now I just accept it and move on. But I thought this one was a little bizarre today:



I can't help but wonder if Peter Gibbons is behind this in some way.

Monday, August 31, 2009

Discovery of *content* metadata on the web

A thought experiment...

I recently read an entertaining old article on various things people have been shoving into HTTP response headers. Some for utility (X-XRDS-Location), and some for fun (Slashdot's random X-Fry and X-Bender quotes). One site actually put a bunch of DC.title, DC.etc headers in their responses. Not that anyone's looking for them there, but *just* in case...

This got me thinking (again) about ways to provide richer metadata, especially RDF, about resources on the web. We have RDFa now, which is a big step forward, but there are a couple key problems we still don't have worked out:
ISSUE 1: How do we discover publisher-sanctioned resource descriptions for arbitrary resources on the web (e.g., resources that aren't XHTML)?
I think the HTTP Link: response header is the right way forward on this: an isDescribedBy link, pointing to a resource whose representation encodes an RDF graph describing this resource.
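A user-agent that bought into this convention would only need to pull that link out of the response headers. Here's a minimal Python sketch; the parsing is deliberately simplified (it assumes no commas inside quoted parameters), and the isDescribedBy relation name just follows this post's convention rather than any registered relation:

```python
import re

def parse_link_header(value):
    """Parse an HTTP Link response header value into (target, params)
    pairs. Simplified: assumes no commas inside quoted parameters."""
    links = []
    for part in value.split(','):
        m = re.match(r'\s*<([^>]*)>\s*(.*)', part)
        if m:
            target, rest = m.groups()
            params = dict(re.findall(r';\s*(\w+)="?([^";]*)"?', rest))
            links.append((target, params))
    return links

def find_description_link(headers):
    """Return the target of an isDescribedBy link, if the response had one."""
    for target, params in parse_link_header(headers.get('Link', '')):
        if params.get('rel', '').lower() == 'isdescribedby':
            return target
    return None
```

Given a response carrying `Link: <http://example.org/Picture1.rdf>; rel="isDescribedBy"`, the second function hands back the description URI and the user-agent can GET it like anything else.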
ISSUE 2: Given that a resource and the content of a representation of that resource are distinct things, how do we make statements about the latter on the web?
This one deserves more explanation.

If I access http://example.org/Picture1, and my browser uses content negotiation to request the image/jpeg representation, and gets it, I want to be able to discover this kind of info:

@prefix    : <http://dear.lazyweb/please/write/this/ontology/> .
@prefix  dc: <http://purl.org/dc/elements/1.1/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# The file is a JPEG and here's some basic info about it

_:myFile a :OctetStream ;
    :name "Picture1.jpg" ;
    :mediaType "image/jpeg" ;
    :format <info:pronom/fmt/42> ;
    :length 105124 ;
    :md5sum "7846df5ced300e9543a267a856c4ab6e" ;
    :sha1sum "e3b5112b24e793f41fc5b843a505a83a80aaf776" ;
    :created "2009-08-31T10:12:00.342Z"^^xsd:dateTime ;
    :modified "2009-08-31T16:28:00.921Z"^^xsd:dateTime ;
    :renditionOf <http://example.org/Picture1> .

# The file is one of any number of renditions of a picture

<http://example.org/Picture1>
    dc:title "Best Picture Ever" ;
    dc:description "This is a picture of my cat, Lucky" ;
    dc:creator "Bob Dobbs" .

What would be cool is if my browser knew about the HTTP Link response header, and the metadata was just a click away, in an RDFa document.

The trick would be for user-agents to be able to associate the particular rendition I got by GETting the resource with the appropriate resource in this graph. Notice it's a bNode in the example above. It might have a URI, it might not; but the URI of the rendition isn't known by the user-agent when it retrieves this graph...and the relation expressed by the HTTP Link header is to be interpreted as "(the resource identified by this URI) isDescribedBy (the graph resource over there)".

So, absent some additional information, in the general case, the user-agent is going to have to do the association via some distinctive property matching: Did the response of the original GET request on the picture include a Content-MD5 header? If so, that's a good clue. Hmmm.
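That property-matching step could look something like this hypothetical sketch, where the candidate dicts stand in for rendition descriptions pulled from the graph (the keys mirror the example ontology above; none of this is a real API):

```python
import hashlib

def match_rendition(body, candidates):
    """Given the bytes a GET just returned and candidate rendition
    descriptions from the graph (dicts keyed by the graph's property
    names), pick the one whose distinctive properties match."""
    md5 = hashlib.md5(body).hexdigest()
    for desc in candidates:
        # Length is cheap to check; a missing length is not disqualifying.
        if desc.get('length') not in (None, len(body)):
            continue
        # The checksum clinches it.
        if desc.get('md5sum') == md5:
            return desc
    return None
```

If the server also sent Content-MD5, the user-agent could skip hashing the body and compare header to graph directly; same idea, fewer cycles.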

Monday, May 04, 2009

That's Classy

Here's a simple program to report on Java .class versions. I'm sure some variant of this has been written a thousand times, but Google wouldn't give me what I wanted right away, so here it is again :)

The program takes one argument: a path to a .class file, .jar file, or directory containing a mixture of both, and produces a report of each class file's major .class format version (50 for Java 6, 49 for Java 5, and so on). Handy if you want to track down those newfangled classes and avoid the dreaded java.lang.UnsupportedClassVersionError.

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;

public abstract class ThatsClassy {

    // Recurse into directories; dispatch .jar and .class files.
    static void classyFile(File file) throws Exception {
        if (file.isDirectory())
            for (File child : file.listFiles())
                classyFile(child);
        else if (file.getName().endsWith(".jar"))
            classyJar(file);
        else if (file.getName().endsWith(".class"))
            classyClass(file.getPath(), new FileInputStream(file), true);
    }

    // Report on every .class entry in a jar, reusing the jar stream.
    static void classyJar(File jarFile) throws Exception {
        JarInputStream jarStream = new JarInputStream(new FileInputStream(jarFile));
        JarEntry entry = jarStream.getNextJarEntry();
        while (entry != null) {
            if (entry.getName().endsWith(".class"))
                classyClass(jarFile.getName() + "#" + entry.getName(), jarStream, false);
            entry = jarStream.getNextJarEntry();
        }
        jarStream.close();
    }

    // Skip the 4-byte magic, the 2-byte minor version, and the high byte
    // of the 2-byte major version, then read the byte that matters.
    static void classyClass(String id, InputStream in, boolean close) throws Exception {
        in.skip(7);
        int majorClassVersion = in.read();
        if (close) in.close();
        System.out.println(id + " " + majorClassVersion);
    }

    public static void main(String[] args) throws Exception {
        classyFile(new File(args[0]));
    }
}