URL uniqueness

Post by Jonathan » Thu Jun 09, 2005 12:24 am

This is an issue I've been thinking about a lot recently.

Places like http://boingboing.net and sites that use Movable Type have auto-uniquefying URLs with some semblance of human readability.

http://boingboing.net/2005/05/31/americ ... ized_.html

Obviously, this is a step forward from numbered URLs

http://www.joelonsoftware.com/articles/ ... 00017.html

or manually selecting names

http://www.wired.com/wired/archive/13.06/war.html

What is the best way to auto-uniquefy a URL? Assume a blog-style post with a title, date, and text body, but the title and date are optional.

To wit: http://www.jonathan.pearce.name/news

I have no way to link to a single news update I have made. I've considered making target="" insertions, but I find those annoying. Also, they don't work on all platforms (read: Hiptop). Fortunately, my corpus is regular, so breaking it into different files is no problem, but what should I call them?

I was thinking "title_with_underscores" was good, unless I have conflicting titles. Which I might. Or no titles, which I do. I don't really want to use the date, because it's not unique. I could use some of the first words in the message body, but they're not unique either. This gradually leads me to some kind of "form a string, check for collisions, and loop" plan, I think.
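A minimal sketch of that form-string, check, loop idea, in Python. The function names and the `taken` set are made up for illustration, and growing the name by one word per pass is just one assumed way to loop; nothing above settles on it.

```python
import re

def words(text):
    """Lowercase alphanumeric words, in order of appearance."""
    return re.findall(r"[a-z0-9]+", text.lower())

def unique_name(text, taken, start=2):
    """Form a string, check it against `taken`, and loop on a collision.

    Each pass uses one more word from the text than the previous one.
    `taken` is a hypothetical set of names already in use.
    """
    ws = words(text)
    if not ws:
        raise ValueError("no words to build a name from")
    for n in range(min(start, len(ws)), len(ws) + 1):
        candidate = "_".join(ws[:n])
        if candidate not in taken:
            taken.add(candidate)
            return candidate
    raise ValueError("ran out of words before finding a unique name")

taken = set()
print(unique_name("Updates and more updates", taken))  # updates_and
print(unique_name("Updates and fixes", taken))         # updates_and_fixes
```

The loop gives up with an error once the text runs out of words, which is exactly the conflicting-titles case that still needs a decision.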
Jonathan
Grand Pooh-Bah
 
Posts: 6031
Joined: Tue Sep 19, 2006 7:45 pm
Location: Portland, OR

Post by quantus » Thu Jun 09, 2005 2:00 am

dude, just concatenate date and post time (down to the second or millisecond if you so choose)...
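For reference, a rough Python sketch of the date-plus-time name. The function name and the digits-only layout are assumptions; it also assumes every post has a full timestamp.

```python
from datetime import datetime

def timestamp_name(posted, millis=False):
    """Concatenate date and post time, down to the second (or millisecond)."""
    name = posted.strftime("%Y%m%d%H%M%S")
    if millis:
        # strftime has no millisecond code, so derive it from microseconds.
        name += f"{posted.microsecond // 1000:03d}"
    return name

print(timestamp_name(datetime(2005, 6, 9, 2, 0, 0)))  # 20050609020000
```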
quantus
Tenth Dan Procrastinator
 
Posts: 4653
Joined: Fri Jul 18, 2003 2:09 am
Location: San Jose, CA

Post by Jonathan » Thu Jun 09, 2005 3:46 am

I might as well increment integers if I'm going to do that. I'm trying to do something that is meaningful, so when I see a link I know what it refers to.

Post by Jonathan » Thu Jun 09, 2005 4:10 am

Also, sometimes all I have is the year.

Post by Jonathan » Thu Jun 09, 2005 4:30 am

Actually, just to be perfectly clear, I have 3 cases.

Title & Date.
Title only.
Date only.

I'm thinking if there's a title, concatenate first two words in title and test for collisions. If collision or no title, concatenate first two words in

And everything gets put in a subdirectory named for the year. Collisions only get checked within a year.

So, this:

http://www.livejournal.com/users/dwindlehop/14605.html

becomes this:

http://jonathan.pearce.name/2005/housekeeping
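Here is a rough Python sketch of the title case with per-year collision checks. The names are invented for illustration, and it returns None when there is no title or the name is already taken, since the fallback for those cases isn't spelled out above.

```python
import re
from collections import defaultdict

# Hypothetical registry: year -> names already used within that year.
taken_by_year = defaultdict(set)

def title_name(year, title):
    """Return 'year/firstsecond' built from the first two title words.

    Returns None if there is no title or the name collides within the
    same year; what to do next is left open here.
    """
    ws = re.findall(r"[a-z0-9]+", (title or "").lower())
    if not ws:
        return None
    name = "".join(ws[:2])
    if name in taken_by_year[year]:
        return None
    taken_by_year[year].add(name)
    return f"{year}/{name}"

print(title_name(2005, "Housekeeping"))  # 2005/housekeeping
print(title_name(2003, "Housekeeping"))  # 2003/housekeeping
```

The same name is fine in two different years because each year keeps its own set.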

Post by quantus » Thu Jun 09, 2005 7:31 am

sounds reasonable... You may want to have a small dictionary of words to ignore, like: the, a, it, you, me, is, to, etc... Then just get the first two non-ignored words.
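Something like this, as a quick Python sketch. The ignore list is only the handful of words named above, and the function name is made up.

```python
import re

# Small dictionary of words to ignore; extend to taste.
IGNORED = {"the", "a", "it", "you", "me", "is", "to"}

def first_significant_words(text, count=2):
    """Return the first `count` words that are not in the ignore list."""
    ws = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in ws if w not in IGNORED][:count]

print("".join(first_significant_words("The state of the site")))  # stateof
```

Note that "of" survives in that example because it isn't in the list, so the list will probably need tuning against the real titles.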

Post by Jason » Thu Jun 09, 2005 12:50 pm

Another good trick that's fairly quick is to use the longest words in your corpus, or in this case your text. They're usually the most unique unless you keep writing about the same things.

If you've got time on your hands you might just want to keep cumulative statistics on your corpus and choose the most prominent words from the results.
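A small Python sketch of the longest-words trick only (not the corpus statistics). The function name and the tie-breaking rule are assumptions.

```python
import re

def longest_words(text, count=2):
    """Return the `count` longest distinct words, longest first.

    Ties keep the order in which the words first appear, because
    Python's sort is stable even with reverse=True.
    """
    seen = []
    for w in re.findall(r"[a-z]+", text.lower()):
        if w not in seen:
            seen.append(w)
    return sorted(seen, key=len, reverse=True)[:count]

print(longest_words("Some housekeeping and miscellaneous updates"))
# ['miscellaneous', 'housekeeping']
```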
Jason
Veteran Doodler
 
Posts: 1518
Joined: Thu Jul 17, 2003 11:53 pm
Location: Fairfax, VA

Post by Peijen » Thu Jun 09, 2005 1:16 pm

Write your own URL parser that intercepts all URLs that go to j.p.n/news/*

Use * to perform a search and return the result for the best match.

For j.p.n/news/2005/12/30/the_future.html
you would search by date 2005/12/30 and the phrase "the future"

or for j.p.n/news/1/7/winter_or_summer.html
you would search by date 1/7 or 7/1 and the phrase "winter or summer"

Keep your actual documents in sequential numbers, and the rest is easy to figure out. Sort of an extension of Jason's idea.
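Roughly, in Python: a sketch assuming an in-memory store keyed by sequential number. The DOCS contents, the function names, and the word-overlap scoring are all invented for illustration, and the ambiguous 1/7 versus 7/1 form is not handled.

```python
import re

# Hypothetical store: documents kept under sequential numbers.
DOCS = {
    1: {"date": (2005, 12, 30), "text": "Thoughts on the future of this site"},
    2: {"date": (2005, 12, 30), "text": "Housekeeping notes"},
    3: {"date": (2005, 6, 9), "text": "The future is now"},
}

def parse_news_path(path):
    """Split '/news/2005/12/30/the_future.html' into a date tuple and words."""
    parts = path.strip("/").split("/")
    if len(parts) < 2 or parts[0] != "news":
        raise ValueError("not a news URL")
    date = tuple(int(p) for p in parts[1:-1])
    phrase = re.sub(r"\.html$", "", parts[-1])
    return date, phrase.lower().split("_")

def best_match(path):
    """Return the number of the best-matching document, or None."""
    date, phrase_words = parse_news_path(path)
    best_id, best_score = None, 0
    for doc_id, doc in DOCS.items():
        # The date in the URL may be partial, so compare it as a prefix.
        if doc["date"][:len(date)] != date:
            continue
        doc_words = set(re.findall(r"[a-z0-9]+", doc["text"].lower()))
        score = sum(1 for w in phrase_words if w in doc_words)
        if score > best_score:
            best_id, best_score = doc_id, score
    return best_id

print(best_match("/news/2005/12/30/the_future.html"))  # 1
```

In a real setup the server would have to route every /news/ request to this lookup, which is the part that needs URL rewriting.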
Peijen
Minion to the Exalted Pooh-Bah
 
Posts: 2778
Joined: Fri Jul 18, 2003 1:28 pm
Location: Irvine, CA

Post by Jonathan » Thu Jun 09, 2005 3:01 pm

Small words are a problem. I tried getting rid of them, but sometimes it just doesn't look right. Instead, I think I'll add an additional word for each small word in the proposed url. Thus, I'd get

/itisgoodtoreunite

instead of

/itis

or

/goodreunite

I'll have to look at my corpus and see what the longest word for each entry is. The title thing works pretty well.

mod_rewrite is a pain. I'll be fine with static URLs, I think.
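The add-a-word-per-small-word rule could look like this in Python. The SMALL list and the function name are assumptions; the idea is to keep taking words until two non-small ones are in.

```python
import re

# Assumed list of small words; the real one would come from the corpus.
SMALL = {"it", "is", "to", "the", "a"}

def padded_name(text, count=2):
    """Take words in order until `count` non-small words are included.

    Small words are kept, so each one effectively adds an extra word.
    """
    picked, significant = [], 0
    for w in re.findall(r"[a-z0-9]+", text.lower()):
        picked.append(w)
        if w not in SMALL:
            significant += 1
            if significant == count:
                break
    return "".join(picked)

print(padded_name("It is good to reunite"))  # itisgoodtoreunite
```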

Post by Martin » Fri Jun 10, 2005 2:38 am

Longest word sounds contrived. Amazon uses something called statistically improbable phrases. I don't know what their algorithm is (it's apparently proprietary), but Language Log has analyzed it and doesn't like it much.

Do you not have any sort of timestamp on these entries? Maybe you could go back and give each of them a rough month according to your memory. I think a consistent "datatype" (a [date, string] pair) for everything is a much better idea than something less consistent. The string's generation will differ depending on whether you gave something a title. I would slug (entitle) everything that doesn't have a title with the first several words, which is what LiveJournal does when you comment on something without a title, if I recall correctly. Then your datatype will also have consistent meaning: it will be [creationDate:date, entryTitle:string]. As for how to slug the documents, I'll think of something if you want. I don't think it's necessary for each entryTitle to be unique, but it is somewhat more interesting to think about that case. In the non-unique case, I would take the first sentence that is four or more words (to rule out starting off with a greeting or short exclamation).
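To make the [date, string] pair concrete, here is a hedged Python sketch. The class and function names, the five-word cutoff for "first several words", and the sentence-splitting regex are all assumptions, and the field names are just snake_case versions of creationDate and entryTitle.

```python
import re
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Entry:
    creation_date: date
    entry_title: str

def make_entry(creation_date, title, body, lead_words=5):
    """Use the title if there is one, else the first several body words."""
    entry_title = title or " ".join(body.split()[:lead_words])
    return Entry(creation_date, entry_title)

def first_long_sentence(body, min_words=4):
    """Return the first sentence of at least `min_words` words, else None.

    Skips greetings and short exclamations at the start of an entry.
    """
    for sentence in re.split(r"(?<=[.!?])\s+", body.strip()):
        if len(sentence.split()) >= min_words:
            return sentence
    return None

body = "Well. Hmph! I finally moved the site to a new host. More later."
print(first_long_sentence(body))  # I finally moved the site to a new host.
```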
Martin
Chump
 
Posts: 124
Joined: Thu Jul 17, 2003 11:30 pm
Location: Los Gatos, CA

Post by Jason » Fri Jun 10, 2005 2:23 pm

Martin wrote: Longest word sounds contrived.

Why is it contrived? Does everything have to be complex for you? It's effectively a hashcode and the longer it is the less likely there will be a collision (although since the domain is words that isn't necessarily the full case), but still. It's a quick hash for a millisecond of time.

True, if the first sentence of the post covers the full context of the post then your idea is better.

Post by Jonathan » Fri Jun 10, 2005 6:31 pm

Martin wrote: Do you not have any sort of timestamp on these entries?

Like I said, I got a year on all of them. If I assigned a month I'd be making it up. Manually. I don't want to. When I record a date, I record the day and month. I do not record the time of day.

I think titling untitled things with the first several words is a good plan. I'm actually leaning away from including small words now. Instead, I can add things like "well" and "hmph" to my list of words to remove and hopefully pick out some good ones. I might have pithy four-word sentences. You don't know.

I just want the date+title string to be unique. I think a good format is

http://jonathan.pearce.name/2005/housekeeping
http://jonathan.pearce.name/2005/goodreunite
http://jonathan.pearce.name/2003/mightystephen
http://jonathan.pearce.name/2003/updatesupdates

Longest word appears to be a problem with my corpus, because I keep writing about the same things. The longest words either tend to crop up more than once or be apropos of nothing.

