URL uniqueness

Posts you want to find years later go here.
Jonathan
Grand Pooh-Bah
Posts: 6722
Joined: Tue Sep 19, 2006 8:45 pm
Location: Portland, OR
Contact:


Post by Jonathan »

This is an issue I've been thinking about a lot recently.

Places like http://boingboing.net and sites that use MovableType have auto-uniquefying URLs with some semblance of human-readableness.

http://boingboing.net/2005/05/31/americ ... ized_.html

Obviously, this is a step forward from numbered URLs

http://www.joelonsoftware.com/articles/ ... 00017.html

or manually selecting names

http://www.wired.com/wired/archive/13.06/war.html

What is the best way to auto-uniquefy a URL? Assume a blog-style post with a title, date, and text body, but the title and date are optional.

To wit: http://www.jonathan.pearce.name/news

I have no way to link to a single news update I have made. I've considered making target="" insertions, but I find those annoying. Also, they don't work on all platforms (read: Hiptop). Fortunately, my corpus is regular, so breaking it into different files is no problem, but what should I call them?

I was thinking "title_with_underscores" was good, unless I have conflicting titles. Which I might. Or no titles, which I do. I don't really want to use the date, because it's not unique. I could use some of the first words in the message body, but they're not unique either. This gradually leads me to some kind of plan where I form a string, check for collisions, and loop, I think.
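A minimal sketch of that form-a-string, check-for-collisions, loop plan, in Python. The function names, the "untitled" fallback, and the numeric suffix used to break ties are all assumptions made for illustration; nothing in the post says how a collision should be resolved.

```python
import re

def make_slug(title):
    """Lowercase the title and join its words with underscores."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "_".join(words)

def unique_slug(title, taken):
    """Form a string, check it against `taken`, and loop until it is free.

    `taken` is a set of slugs already in use; it is updated in place.
    The numeric suffix is a hypothetical tie-breaker, not the poster's rule.
    """
    base = make_slug(title) or "untitled"
    candidate = base
    n = 2
    while candidate in taken:
        candidate = f"{base}_{n}"
        n += 1
    taken.add(candidate)
    return candidate
```

Calling `unique_slug("The Future", taken)` twice with the same set gives `the_future` and then `the_future_2`, which shows the weakness of the tie-breaker: the second link is no more meaningful than a counter.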

quantus
Tenth Dan Procrastinator
Posts: 4891
Joined: Fri Jul 18, 2003 3:09 am
Location: San Jose, CA

Post by quantus »

dude, just concatenate date and post time (down to the second or millisecond if you so choose)...

Jonathan

Post by Jonathan »

I might as well increment integers if I'm going to do that. I'm trying to do something that is meaningful, so when I see a link I know what it refers to.

Jonathan

Post by Jonathan »

Also, sometimes all I have is the year.

Jonathan

Post by Jonathan »

Actually, just to be perfectly clear, I have 3 cases.

Title & Date.
Title only.
Date only.

I'm thinking if there's a title, concatenate the first two words in the title and test for collisions. If there's a collision or no title, concatenate the first two words in the message body.

And everything gets put in a subdirectory by year. Collisions only get checked within a year.

So, this:

http://www.livejournal.com/users/dwindlehop/14605.html

becomes this:

http://jonathan.pearce.name/2005/housekeeping
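Roughly, the scheme above could look like this in Python. This is only a sketch: it assumes the fallback source is the message body (the post's sentence is cut off at that point), and the names `first_words`, `assign_path`, and `used_by_year` are made up for the example.

```python
import re

def first_words(text, count=2):
    """Concatenate the first `count` words of `text`, lowercased."""
    words = re.findall(r"[a-z0-9]+", (text or "").lower())
    return "".join(words[:count])

def assign_path(year, title, body, used_by_year):
    """Try the title first, then the body; collisions are checked per year only.

    `used_by_year` maps a year to the set of slugs already taken in it.
    """
    used = used_by_year.setdefault(year, set())
    for source in (title, body):
        slug = first_words(source)
        if slug and slug not in used:
            used.add(slug)
            return f"/{year}/{slug}"
    # The post does not say what happens when both candidates collide.
    raise ValueError("title and body slugs both collide; another rule is needed")
```

With an empty `used_by_year`, a 2005 entry titled "Housekeeping" lands at `/2005/housekeeping`, and the same title in 2004 is free to reuse the slug because each year has its own set.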

quantus

Post by quantus »

sounds reasonable... You may want to have a small dictionary of words to ignore, like: the, a, it, you, me, is, to, etc... Then just get the first two non-ignored words.
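A short Python sketch of the ignore-list idea. The set below contains only the words named in the post; the function name and the `count` parameter are assumptions for the example.

```python
import re

# Words to skip, taken from the list in the post above.
IGNORED = {"the", "a", "it", "you", "me", "is", "to"}

def significant_words(text, count=2, ignored=IGNORED):
    """Return the first `count` words of `text` that are not in `ignored`."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    kept = [w for w in words if w not in ignored]
    return kept[:count]
```

For example, `"".join(significant_words("It is good to reunite"))` yields `goodreunite`.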

Jason
Veteran Doodler
Posts: 1520
Joined: Fri Jul 18, 2003 12:53 am
Location: Fairfax, VA

Post by Jason »

Another good trick that's fairly quick is to use the longest words in your corpus, or in this case your text. They're usually the most unique unless you keep writing about the same things.

If you've got time on your hands you might just want to keep cumulative statistics on your corpus and choose the most prominent words from the results.
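Both suggestions can be sketched in a few lines of Python. Reading "most prominent" as "appears in the fewest entries" is an assumption on my part, as are all the function names; the post does not pin down which statistic to use.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def longest_words(text, count=2):
    """Pick the longest distinct words; ties are broken alphabetically."""
    unique = set(tokenize(text))
    return sorted(unique, key=lambda w: (-len(w), w))[:count]

def document_frequency(corpus):
    """Count, for each word, how many entries in the corpus contain it."""
    df = Counter()
    for text in corpus:
        df.update(set(tokenize(text)))
    return df

def rarest_words(text, df, count=2):
    """Pick the words of `text` that occur in the fewest entries overall."""
    unique = set(tokenize(text))
    return sorted(unique, key=lambda w: (df[w], -len(w), w))[:count]
```

The longest-word pick needs only the entry itself, while the frequency-based pick needs one pass over the whole corpus first, which is the "time on your hands" part.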

Peijen
Minion to the Exalted Pooh-Bah
Posts: 2790
Joined: Fri Jul 18, 2003 2:28 pm
Location: Irvine, CA

Post by Peijen »

write your own URL parser that intercepts all URLs that go to j.p.n/news/*

use * to perform a search and return a result for the best match.

for j.p.n/news/2005/12/30/the_future.html
You would search by date 2005/12/30 and the phrase "the future"

or j.p.n/news/1/7/winter_or_summer.html
You would search by date 1/7 or 7/1 and the phrase "winter or summer"

Keep your actual documents in sequential number order, and the rest is easy to figure out. Sort of an extension of Jason's idea.
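One way to picture the search-based lookup, sketched in Python. The `ENTRIES` store, its sample contents, and the scoring rule (one point per matching date number or phrase word) are hypothetical; the post only says to search by date and phrase and return the best match.

```python
import re
from datetime import date

# Hypothetical store: the real documents are kept under sequential numbers.
ENTRIES = {
    1: {"date": date(2005, 12, 30), "text": "Thoughts on the future of this site"},
    2: {"date": date(2005, 7, 1), "text": "Winter or summer, which is better?"},
    3: {"date": date(2005, 12, 30), "text": "Housekeeping notes"},
}

def parse_path(path):
    """Split '/news/2005/12/30/the_future.html' into date numbers and phrase words."""
    parts = path.strip("/").split("/")[1:]  # drop the leading 'news'
    numbers = [int(p) for p in parts[:-1]]
    phrase = re.sub(r"\.html$", "", parts[-1]).lower().split("_")
    return numbers, phrase

def score(entry, numbers, phrase):
    """One point per date number or phrase word found in the entry."""
    d = entry["date"]
    date_fields = {d.year, d.month, d.day}  # order-free, so 1/7 also matches 7/1
    date_hits = sum(1 for n in numbers if n in date_fields)
    text_words = set(re.findall(r"[a-z0-9]+", entry["text"].lower()))
    word_hits = sum(1 for w in phrase if w in text_words)
    return date_hits + word_hits

def resolve(path, entries=ENTRIES):
    """Return the sequential number of the best-scoring entry."""
    numbers, phrase = parse_path(path)
    return max(entries, key=lambda k: score(entries[k], numbers, phrase))
```

Because the link is only a search query, it never has to be unique; the cost is that the best match can change as new entries are added.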

Jonathan

Post by Jonathan »

Small words are a problem. I tried getting rid of them, but sometimes it just doesn't look right. Instead, I think I'll add an additional word for each small word in the proposed url. Thus, I'd get

/itisgoodtoreunite

instead of

/itis

or

/goodreunite

I'll have to look at my corpus and see what the longest word is for each entry. The title thing works pretty well.

Mod_rewrite is a pain. I'll be fine with static URLs, I think.
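The "one additional word for each small word" rule from this post can be sketched like so in Python. The `SMALL` set reuses the words quantus listed earlier in the thread, and the function name and the starting budget of two words are assumptions for the example.

```python
import re

# Small words, borrowed from the ignore list suggested earlier in the thread.
SMALL = {"the", "a", "it", "you", "me", "is", "to"}

def extended_slug(text, wanted=2, small=SMALL):
    """Take `wanted` words, plus one extra word for every small word taken."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    budget = wanted
    taken = []
    for w in words:
        if len(taken) >= budget:
            break
        taken.append(w)
        if w in small:
            budget += 1  # a small word earns one more word
    return "".join(taken)
```

On "It is good to reunite" this produces `itisgoodtoreunite`: "it" and "is" each extend the budget, and so does "to" once it is reached.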

Martin
Chump
Posts: 124
Joined: Fri Jul 18, 2003 12:30 am
Location: Los Gatos, CA
Contact:

Post by Martin »

Longest word sounds contrived. Amazon uses something called statistically improbable phrases. I don't know what their algorithm is (it's apparently proprietary), but Language Log has analyzed it and doesn't like it much.

Do you not have any sort of timestamp on these entries? Maybe you could go back and give each of them a rough month according to your memory. I think a consistent "datatype" (a [date, string] pair) for everything is a much better idea than something less consistent. The string's generation will differ depending on whether you gave something a title. I would slug (entitle) everything that doesn't have a title with the first several words, which is what Livejournal does when you comment on something without a title, if I recall correctly. Then your datatype will also have consistent meaning: it will be [creationDate:date, entryTitle:string]. As for how to slug the documents, I'll think of something if you want. I don't think it's necessary for each entryTitle to be unique, but it is somewhat more interesting to think about that case. In the non-unique case, I would take the first sentence that is four or more words (to rule out starting off with a greeting or short exclamation).
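A rough Python sketch of that [creationDate, entryTitle] pair. The class and function names, the five-word slug length, and the `seen` set used to detect repeats are assumptions; it also assumes every entry has a full date, which would mean filling in a rough month and day as suggested.

```python
import re
from dataclasses import dataclass
from datetime import date

@dataclass
class Entry:
    creation_date: date
    entry_title: str

def slug_title(body, count=5):
    """Entitle an untitled entry with its first several words."""
    return " ".join(body.split()[:count])

def first_long_sentence(body, min_words=4):
    """Return the first sentence with at least `min_words` words, or None."""
    for sentence in re.split(r"(?<=[.!?])\s+", body.strip()):
        if len(sentence.split()) >= min_words:
            return sentence
    return None

def make_entry(creation_date, title, body, seen):
    """Build the pair; on a repeated pair, fall back to the first long sentence."""
    entry_title = title or slug_title(body)
    if (creation_date, entry_title) in seen:
        entry_title = first_long_sentence(body) or entry_title
    seen.add((creation_date, entry_title))
    return Entry(creation_date, entry_title)
```

For a body such as "Well. Hmph! I finally cleaned up the site. More later." the four-word rule skips the two exclamations and lands on "I finally cleaned up the site."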

Jason

Post by Jason »

Martin wrote: Longest word sounds contrived.
Why is it contrived? Does everything have to be complex for you? It's effectively a hashcode and the longer it is the less likely there will be a collision (although since the domain is words that isn't necessarily the full case), but still. It's a quick hash for a millisecond of time.

True, if the first sentence of the post covers the full context of the post then your idea is better.

Jonathan

Post by Jonathan »

Martin wrote: Do you not have any sort of timestamp on these entries?
Like I said, I got a year on all of them. If I assigned a month I'd be making it up. Manually. I don't want to. When I record a date, I record the day and month. I do not record the time of day.

I think titling untitled things with the first several words is a good plan. I'm actually leaning away from including small words now. Instead, I can add things like "well" and "hmph" to my list of words to remove and hopefully pick out some good ones. I might have pithy four word sentences. You don't know.

I just want the date+title string to be unique. I think a good format is

http://jonathan.pearce.name/2005/housekeeping
http://jonathan.pearce.name/2005/goodreunite
http://jonathan.pearce.name/2003/mightystephen
http://jonathan.pearce.name/2003/updatesupdates

Longest word appears to be a problem with my corpus, because I keep writing about the same things. The longest words either tend to crop up more than once or be apropos of nothing.
