URL uniqueness

Post by **Jonathan** » Thu Jun 09, 2005 1:24 am

This is an issue I've been thinking about a lot recently.

Places like http://boingboing.net and sites that use MovableType have auto-uniquefying URLs with some semblance of human-readableness.

http://boingboing.net/2005/05/31/americ ... ized_.html

Obviously, this is a step forward from numbered URLs

http://www.joelonsoftware.com/articles/ ... 00017.html

or manually selecting names

http://www.wired.com/wired/archive/13.06/war.html

What is the best way to auto-uniquefy a URL? Assume a blog-style post with a title, date, and text body, but the title and date are optional.

To wit: http://www.jonathan.pearce.name/news

I have no way to link to a single news update I have made. I've considered making target="" insertions, but I find those annoying. Also, they don't work on all platforms (read Hiptop). Fortunately, my corpus is regular so breaking it into different files is no problem, but what should I call them?

I was thinking "title_with_underscores" was good, unless I have conflicting titles. Which I might. Or no titles, which I do. I don't really want to use the date, because it's not unique. I could use some of the first words in the message body, but they're not unique either. This gradually leads me to some kind of "form a string, check for collisions, and loop" plan, I think.
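A minimal Python sketch of that form-check-loop plan. The function names, the `taken` set, and the choice to lengthen a colliding slug with words from the body are all assumptions for illustration; the post does not settle on a rule.

```python
import re

def words(text):
    """Lowercase alphanumeric words, in order of appearance."""
    return re.findall(r"[a-z0-9]+", text.lower())

def unique_slug(title, body, taken):
    """Form a candidate, check it for a collision, and loop until one is free."""
    base = words(title)
    extra = words(body)
    # Start from the title, or from the first body word when there is no title.
    if not base:
        base, extra = extra[:1], extra[1:]
    candidate = "_".join(base)
    # On a collision (or an empty candidate), add the next unused body word.
    while candidate in taken or not candidate:
        if not extra:
            raise ValueError("ran out of words before finding a unique slug")
        base.append(extra.pop(0))
        candidate = "_".join(base)
    taken.add(candidate)
    return candidate
```

Calling `unique_slug("Housekeeping", "Some cleanup today", taken)` twice with the same `taken` set gives `housekeeping` and then `housekeeping_some`.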

Post by **quantus** » Thu Jun 09, 2005 3:00 am

dude, just concatenate date and post time (down to the second or millisecond if you so choose)...

Post by **Jonathan** » Thu Jun 09, 2005 4:46 am

I might as well increment integers if I'm going to do that. I'm trying to do something that is meaningful, so when I see a link I know what it refers to.

Post by **Jonathan** » Thu Jun 09, 2005 5:10 am

Also, sometimes all I have is the year.

Post by **Jonathan** » Thu Jun 09, 2005 5:30 am

Actually, just to be perfectly clear, I have 3 cases.

Title & Date.
Title only.
Date only.

I'm thinking if there's a title, concatenate first two words in title and test for collisions. If collision or no title, concatenate first two words in

And everything gets put in a subdirectory for its year. Collisions only get checked within a year.

So, this:

http://www.livejournal.com/users/dwindlehop/14605.html

becomes this:

http://jonathan.pearce.name/2005/housekeeping
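A rough Python sketch of the year-subdirectory scheme, with collisions tracked per year. The names `first_two`, `url_path`, and `taken_by_year` are hypothetical, and falling back to the first two words of the body is an assumption, since the sentence describing the fallback is cut off in the post.

```python
import re
from collections import defaultdict

def first_two(text):
    """Concatenate the first two words of text, lowercased, with no separator."""
    return "".join(re.findall(r"[a-z0-9]+", text.lower())[:2])

def url_path(year, title, body, taken_by_year):
    """Return '/<year>/<slug>', checking collisions only within that year."""
    taken = taken_by_year[year]
    # Title first; the body fallback is an assumption, not stated in the post.
    for candidate in (first_two(title), first_two(body)):
        if candidate and candidate not in taken:
            taken.add(candidate)
            return f"/{year}/{candidate}"
    raise ValueError("both candidates collide; a longer slug is needed")

taken_by_year = defaultdict(set)
print(url_path(2005, "Housekeeping", "Cleaning up old links.", taken_by_year))
# prints /2005/housekeeping
```

Because the set of used slugs is keyed by year, the same title can be reused in a different year without a collision.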

Post by **quantus** » Thu Jun 09, 2005 8:31 am

sounds reasonable... You may want to have a small dictionary of words to ignore, like: the, a, it, you, me, is, to, etc... Then just get the first two non-ignored words.
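As a sketch of the ignore-list idea in Python: the `IGNORED` set holds only the words listed in the post, and the function name is made up for illustration. A real list would need tuning against the actual corpus.

```python
import re

# Small hand-picked ignore list, taken from the words suggested in the post.
IGNORED = {"the", "a", "it", "you", "me", "is", "to"}

def first_significant_words(text, count=2, ignored=IGNORED):
    """Return the first `count` words of text that are not in the ignore list."""
    found = []
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        if word not in ignored:
            found.append(word)
        if len(found) == count:
            break
    return found

print("".join(first_significant_words("It is good to reunite")))
# prints goodreunite
```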

Post by **Jason** » Thu Jun 09, 2005 1:50 pm

Another good trick that's fairly quick is to use the longest words in your corpus, or in this case your text. They're usually the most unique unless you keep writing about the same things.
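One way the longest-words trick might look in Python. The function name and the tie-breaking rule (earlier words win) are assumptions; the post only says to use the longest words.

```python
import re

def longest_words(text, count=2):
    """Return the `count` longest distinct words in text, longest first.

    Words of equal length keep their order of first appearance,
    because sorted() is stable even with reverse=True.
    """
    distinct = dict.fromkeys(re.findall(r"[a-z0-9]+", text.lower()))
    return sorted(distinct, key=len, reverse=True)[:count]
```

For the text "It is good to reunite with old housemates" this returns `["housemates", "reunite"]`.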

If you've got time on your hands you might just want to keep cumulative statistics on your corpus and choose the most prominent words from the results.
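A possible reading of the corpus-statistics idea, sketched in Python. Treating "prominent" as "appears in the fewest posts overall" is an assumption, as are the function names; other scoring rules would fit the description just as well.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase alphanumeric words, in order of appearance."""
    return re.findall(r"[a-z0-9]+", text.lower())

def document_frequency(corpus):
    """Count, for each word, how many posts in the corpus contain it."""
    counts = Counter()
    for post in corpus:
        counts.update(set(tokenize(post)))
    return counts

def distinctive_words(post, counts, count=2):
    """Pick the words of post that appear in the fewest posts overall."""
    candidates = dict.fromkeys(tokenize(post))  # distinct, in text order
    # Rarest first; longer words break ties; remaining ties keep text order.
    return sorted(candidates, key=lambda w: (counts[w], -len(w)))[:count]
```

The counts only need recomputing when posts are added, so the statistics can be kept cumulatively as suggested.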

Post by **Peijen** » Thu Jun 09, 2005 2:16 pm

Write your own URL parser that intercepts every URL that goes to j.p.n/news/*

Use * to perform a search and return the best match.

For j.p.n/news/2005/12/30/the_future.html
you would search by the date 2005/12/30 and the phrase "the future"

or for j.p.n/news/1/7/winter_or_summer.html
you would search by the date 1/7 or 7/1 and the phrase "winter or summer"

Keep your actual documents in sequential numbers, and the rest is easy to figure out. Sort of an extension of Jason's idea.
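A sketch of that lookup in Python, leaving out the web server part. The `Entry` record, the `(year, month, day)` tuple, and the simple hit-counting score are all hypothetical; the post only says to search by date and phrase and return the best match.

```python
import re
from dataclasses import dataclass

@dataclass
class Entry:
    number: int   # sequential id the document is actually stored under
    date: tuple   # (year, month, day); an assumed storage format
    text: str

def word_set(text):
    """Distinct lowercase words in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def parse_path(path):
    """Split '/news/2005/12/30/the_future.html' into date numbers and a phrase."""
    parts = path.strip("/").split("/")[1:]   # drop the leading 'news'
    *date_parts, last = parts
    phrase = last.rsplit(".", 1)[0].replace("_", " ")
    return tuple(int(p) for p in date_parts), phrase

def best_match(path, entries):
    """Return the entry whose date and text best match the requested path."""
    date_parts, phrase = parse_path(path)
    wanted = word_set(phrase)

    def score(entry):
        # Each requested date number found anywhere in the entry's date counts
        # once, so '1/7' matches both January 7 and July 1.
        date_hits = sum(1 for p in date_parts if p in entry.date)
        return date_hits + len(wanted & word_set(entry.text))

    return max(entries, key=score)
```

Since the documents are stored by number, the descriptive URL is only a search query, and a slightly wrong URL can still resolve to the right entry.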

Post by **Jonathan** » Thu Jun 09, 2005 4:01 pm

Small words are a problem. I tried getting rid of them, but sometimes it just doesn't look right. Instead, I think I'll add an additional word for each small word in the proposed url. Thus, I'd get

/itisgoodtoreunite

instead of

/itis

or

/goodreunite
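The "one additional word per small word" rule can be sketched in Python like this. The `SMALL` set reuses the words quantus listed, which is an assumption; the post does not say which words count as small.

```python
import re

# Assumed list of "small" words, borrowed from the ignore list suggested earlier.
SMALL = {"the", "a", "it", "you", "me", "is", "to"}

def slug_with_small_words(text, count=2, small=SMALL):
    """Take `count` words, plus one extra word for every small word taken."""
    budget = count
    taken = []
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if budget == 0:
            break
        taken.append(word)
        # A small word does not use up the budget, so another word follows it.
        if word not in small:
            budget -= 1
    return "".join(taken)

print(slug_with_small_words("It is good to reunite"))
# prints itisgoodtoreunite
```

This is the same as reading words until two non-small words have been collected, while keeping every small word met along the way.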

I'll have to look at my corpus and see what the longest word is for each entry. The title thing works pretty well.

Mod_rewrite is a pain. I'll be fine with static URLs, I think.

Post by **Martin** » Fri Jun 10, 2005 3:38 am

Longest word sounds contrived. Amazon uses something called statistically improbable phrases. I don't know what their algorithm is (it's apparently proprietary), but Language Log has analyzed it and doesn't like it much.

Do you not have any sort of timestamp on these entries? Maybe you could go back and give each of them a rough month according to your memory. I think a consistent "datatype" (a [date, string] pair) for everything is a much better idea than something less consistent. How the string is generated will differ depending on whether you gave something a title. I would slug (entitle) everything that doesn't have a title with the first several words, which is what Livejournal does when you comment on something without a title, if I recall correctly. Then your datatype will also have a consistent meaning: it will be [creationDate:date, entryTitle:string].

As for how to slug the documents, I'll think of something if you want. I don't think it's necessary for each entryTitle to be unique, but the unique case is somewhat more interesting to think about. In the non-unique case, I would take the first sentence that is four or more words (to rule out starting off with a greeting or short exclamation).
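A small Python sketch of the [creationDate, entryTitle] pair and the first-sentence rule. The sentence splitting on `.`, `!`, and `?` is a crude assumption, as is falling back to the whole body when no sentence is long enough; the function names are made up for illustration.

```python
import re
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Entry:
    creation_date: date
    entry_title: str

def slug_sentence(body, min_words=4):
    """Return the first sentence of body with at least `min_words` words."""
    for sentence in re.split(r"(?<=[.!?])\s+", body.strip()):
        if len(sentence.split()) >= min_words:
            return sentence
    return body.strip()  # assumed fallback when every sentence is short

def make_entry(creation_date, title, body):
    """Use the given title, or slug the entry from its body when untitled."""
    return Entry(creation_date, title or slug_sentence(body))
```

With the body "Hi there! It is good to reunite. More later." and no title, the entry title becomes "It is good to reunite.", skipping the short greeting.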

Post by **Jason** » Fri Jun 10, 2005 3:23 pm

Martin wrote: Longest word sounds contrived.

Why is it contrived? Does everything have to be complex for you? It's effectively a hashcode, and the longer it is the less likely there will be a collision (although since the domain is words, that isn't necessarily the whole story). Still, it's a quick hash for a millisecond of time.

True, if the first sentence of the post covers the full context of the post then your idea is better.

Post by **Jonathan** » Fri Jun 10, 2005 7:31 pm

Martin wrote: Do you not have any sort of timestamp on these entries?

Like I said, I got a year on all of them. If I assigned a month I'd be making it up. Manually. I don't want to. When I record a date, I record the day and month. I do not record the time of day.

I think titling untitled things with the first several words is a good plan. I'm actually leaning away from including small words now. Instead, I can add things like "well" and "hmph" to my list of words to remove and hopefully pick out some good ones. I might have pithy four-word sentences. You don't know.

I just want the date+title string to be unique. I think a good format is

http://jonathan.pearce.name/2005/housekeeping
http://jonathan.pearce.name/2005/goodreunite
http://jonathan.pearce.name/2003/mightystephen
http://jonathan.pearce.name/2003/updatesupdates

Longest word appears to be a problem with my corpus, because I keep writing about the same things. The longest words either tend to crop up more than once or be apropos of nothing.