URL uniqueness
-
- Grand Pooh-Bah
- Posts: 6722
- Joined: Tue Sep 19, 2006 8:45 pm
- Location: Portland, OR
- Contact:
This is an issue I've been thinking about a lot recently.
Places like http://boingboing.net and sites that use MovableType have auto-uniquefying URLs with some semblance of human-readableness.
http://boingboing.net/2005/05/31/americ ... ized_.html
Obviously, this is a step forward from numbered URLs
http://www.joelonsoftware.com/articles/ ... 00017.html
or manually selecting names
http://www.wired.com/wired/archive/13.06/war.html
What is the best way to auto-uniquefy a URL? Assume a blog-style post with a title, date, and text body, but the title and date are optional.
To wit: http://www.jonathan.pearce.name/news
I have no way to link to a single news update I have made. I've considered making target="" insertions, but I find those annoying. Also, they don't work on all platforms (read Hiptop). Fortunately, my corpus is regular so breaking it into different files is no problem, but what should I call them?
I was thinking "title_with_underscores" was good, unless I have conflicting titles. Which I might. Or no titles, which I do. I don't really want to use the date, because it's not unique. I could use some of the first words in the message body, but they're not unique either. This gradually leads me to some kind of "form a string, check for collisions, and loop" plan, I think.
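For what it's worth, here is a minimal Python sketch of that "form a string, check for collisions, and loop" idea. The function names, the `taken` set, and the numeric suffix used to break ties are all assumptions for illustration; nothing in the post settles on them.

```python
import re

def slugify(title):
    """Lowercase the title and join its words with underscores."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "_".join(words)

def unique_slug(title, taken):
    """Form a string, check it against `taken`, and loop until it is free."""
    # "untitled" is a hypothetical placeholder for posts with no title.
    base = slugify(title) or "untitled"
    candidate = base
    n = 2
    while candidate in taken:
        # Assumption: break collisions with a numeric suffix.
        candidate = f"{base}_{n}"
        n += 1
    taken.add(candidate)
    return candidate

taken = set()
print(unique_slug("Title With Underscores", taken))  # title_with_underscores
print(unique_slug("Title with underscores", taken))  # title_with_underscores_2
print(unique_slug("", taken))                        # untitled
```

The suffix keeps the loop guaranteed to terminate, at the cost of sneaking a number back into the URL.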
-
- Grand Pooh-Bah
- Posts: 6722
- Joined: Tue Sep 19, 2006 8:45 pm
- Location: Portland, OR
- Contact:
Actually, just to be perfectly clear, I have 3 cases.
Title & Date.
Title only.
Date only.
I'm thinking if there's a title, concatenate first two words in title and test for collisions. If collision or no title, concatenate first two words in
And everything gets put in a subdirectory for the year. Collisions only get checked within each year.
So, this:
http://www.livejournal.com/users/dwindlehop/14605.html
becomes this:
http://jonathan.pearce.name/2005/housekeeping
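A rough Python sketch of that plan, with collisions tracked per year. The sentence about the fallback is cut off above, so falling back to the first two words of the body is an assumption here, as are the function names and the numeric suffix used as a last resort.

```python
import re

def first_two_words(text):
    """Concatenate the first two words of the text, lowercased."""
    words = re.findall(r"[a-z0-9]+", (text or "").lower())
    return "".join(words[:2])

def assign_path(year, title, body, taken_by_year):
    """Pick a /year/slug path; collisions are only checked within the year."""
    taken = taken_by_year.setdefault(year, set())
    # Assumption: try the title first, then the body.
    candidates = [first_two_words(title), first_two_words(body)]
    for slug in candidates:
        if slug and slug not in taken:
            taken.add(slug)
            return f"/{year}/{slug}"
    # Assumption: if both collide, append a number as a last resort.
    base = next((c for c in candidates if c), "entry")
    n = 2
    while f"{base}{n}" in taken:
        n += 1
    taken.add(f"{base}{n}")
    return f"/{year}/{base}{n}"

taken_by_year = {}
print(assign_path(2005, "Housekeeping", "Cleaning up old links", taken_by_year))  # /2005/housekeeping
print(assign_path(2005, "Housekeeping", "More cleanup today", taken_by_year))     # /2005/morecleanup
print(assign_path(2003, "Housekeeping", "", taken_by_year))                       # /2003/housekeeping
```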
Another good trick that's fairly quick is to use the longest words in your corpus, or in this case your text. They're usually the most unique unless you keep writing about the same things.
If you've got time on your hands you might just want to keep cumulative statistics on your corpus and choose the most prominent words from the results.
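The longest-words trick is only a few lines of Python. This is a sketch under assumptions: the function name, taking two words, and keeping them in their original order are my choices, not part of the suggestion.

```python
import re

def longest_words_slug(text, count=2):
    """Build a slug from the `count` longest distinct words in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    unique = list(dict.fromkeys(words))  # drop repeats, keep first-seen order
    # sorted() is stable, so ties in length keep their order in the text.
    longest = sorted(unique, key=len, reverse=True)[:count]
    # Put the chosen words back in the order they appear in the text.
    longest.sort(key=unique.index)
    return "".join(longest)

print(longest_words_slug("It is good to reunite with old housemates"))
# reunitehousemates
```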
-
- Minion to the Exalted Pooh-Bah
- Posts: 2790
- Joined: Fri Jul 18, 2003 2:28 pm
- Location: Irvine, CA
Write your own URL parser that intercepts every URL that goes to j.p.n/news/*
Use * to perform a search and return the best match.
For j.p.n/news/2005/12/30/the_future.html
you would search by the date 2005/12/30 and the phrase "the future".
Or for j.p.n/news/1/7/winter_or_summer.html
you would search by the date 1/7 or 7/1 and the phrase "winter or summer".
Keep your actual documents under sequential numbers, and the rest is easy to figure out. Sort of an extension of Jason's idea.
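One way this could look in Python, as a sketch only. The `DOCS` store, its field names, and the scoring (one point per matching date number plus a fuzzy title ratio from the standard library's `difflib`) are all assumptions made up for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical store: documents kept under sequential numbers.
DOCS = {
    1: {"date": (2005, 12, 30), "title": "The future"},
    2: {"date": (2005, 1, 7), "title": "Winter or summer"},
    3: {"date": (2005, 12, 30), "title": "Housekeeping"},
}

def parse_news_path(path):
    """Split '/news/2005/12/30/the_future.html' into ([2005, 12, 30], 'the future')."""
    parts = [p for p in path.split("/") if p]
    if parts and parts[0] == "news":
        parts = parts[1:]
    numbers = [int(p) for p in parts if p.isdigit()]
    words = [p for p in parts if not p.isdigit()]
    phrase = ""
    if words:
        phrase = re.sub(r"\.html$", "", words[-1]).replace("_", " ")
    return numbers, phrase

def best_match(path, docs=DOCS):
    """Return the sequential number of the document that best fits the path."""
    numbers, phrase = parse_news_path(path)

    def score(doc):
        # Date numbers match in any position, so 1/7 and 7/1 both work.
        date_hits = sum(1 for n in numbers if n in doc["date"])
        text = SequenceMatcher(None, phrase.lower(), doc["title"].lower()).ratio()
        return date_hits + text

    return max(docs, key=lambda k: score(docs[k]))

print(best_match("/news/2005/12/30/the_future.html"))  # 1
print(best_match("/news/1/7/winter_or_summer.html"))   # 2
```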
-
- Grand Pooh-Bah
- Posts: 6722
- Joined: Tue Sep 19, 2006 8:45 pm
- Location: Portland, OR
- Contact:
Small words are a problem. I tried getting rid of them, but sometimes it just doesn't look right. Instead, I think I'll add an additional word for each small word in the proposed url. Thus, I'd get
/itisgoodtoreunite
instead of
/itis
or
/goodreunite
I'll have to look at my corpus and see what the longest word is for each entry. The title thing works pretty well.
Mod_rewrite is a pain. I'll be fine with static URLs, I think.
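The "add an additional word for each small word" rule can be sketched in Python like this. Treating anything of three letters or fewer as a small word is an assumption, as are the function and parameter names.

```python
import re

def slug_with_small_words(text, target=2, small_max_len=3):
    """Take words until `target` non-small words are in; small words ride along for free."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    chosen = []
    big = 0
    for word in words:
        chosen.append(word)
        # Assumption: a "small" word has at most `small_max_len` letters.
        if len(word) > small_max_len:
            big += 1
            if big == target:
                break
    return "".join(chosen)

print(slug_with_small_words("It is good to reunite"))  # itisgoodtoreunite
```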
Longest word sounds contrived. Amazon uses something called statistically improbable phrases. I don't know what their algorithm is (it's apparently proprietary), but Language Log has analyzed it and doesn't like it much.
Do you not have any sort of timestamp on these entries? Maybe you could go back and give each of them a rough month according to your memory. I think a consistent "datatype" (a [date, string] pair) for everything is a much better idea than something less consistent. The string's generation will differ depending on whether you gave something a title. I would slug (entitle) everything that doesn't have a title with the first several words, which is what Livejournal does when you comment on something without a title, if I recall correctly. Then your datatype will also have consistent meaning: it will be [creationDate:date, entryTitle:string]. As for how to slug the documents, I'll think of something if you want. I don't think it's necessary for each entryTitle to be unique but it is somewhat more interesting to think of this case. In the case of non unique, I would take the first sentence that is four or more words (to rule out starting off with a greeting or short exclamation).
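Here is one reading of that [creationDate, entryTitle] pair as a Python sketch. The class and function names, the six-word cap, and applying the four-word-sentence rule to every untitled entry are assumptions; the example date is made up.

```python
import re
from dataclasses import dataclass
from datetime import date

@dataclass
class Entry:
    creation_date: date
    entry_title: str

def slug_title(title, body, min_words=4, max_words=6):
    """Use the real title if there is one, else entitle the entry from its body."""
    if title:
        return title
    # Split into sentences after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", body.strip())
    for sentence in sentences:
        words = sentence.split()
        # Skip short greetings and exclamations.
        if len(words) >= min_words:
            return " ".join(words[:max_words]).rstrip(".!?")
    # Assumption: with no long-enough sentence, fall back to the first words.
    return " ".join(body.split()[:max_words]).rstrip(".!?")

body = "Well. Hmph! It is good to reunite with everyone after the move."
entry = Entry(date(2005, 5, 1), slug_title("", body))
print(entry.entry_title)  # It is good to reunite with
```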
Martin wrote: Longest word sounds contrived.
Why is it contrived? Does everything have to be complex for you? It's effectively a hashcode, and the longer it is the less likely there will be a collision (although since the domain is words, that isn't necessarily the full case). But still, it's a quick hash for a millisecond of time.
True, if the first sentence of the post covers the full context of the post then your idea is better.
-
- Grand Pooh-Bah
- Posts: 6722
- Joined: Tue Sep 19, 2006 8:45 pm
- Location: Portland, OR
- Contact:
Martin wrote: Do you not have any sort of timestamp on these entries?
Like I said, I got a year on all of them. If I assigned a month I'd be making it up. Manually. I don't want to. When I record a date, I record the day and month. I do not record the time of day.
I think titling untitled things with the first several words is a good plan. I'm actually leaning away from including small words now. Instead, I can add things like "well" and "hmph" to my list of words to remove and hopefully pick out some good ones. I might have pithy four word sentences. You don't know.
I just want the date+title string to be unique. I think a good format is
http://jonathan.pearce.name/2005/housekeeping
http://jonathan.pearce.name/2005/goodreunite
http://jonathan.pearce.name/2003/mightystephen
http://jonathan.pearce.name/2003/updatesupdates
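URLs in that format could be generated along these lines. This is a sketch: the contents of `STOP_WORDS` beyond "well" and "hmph", the two-word limit, and the function name are assumptions, and no collision check is shown here.

```python
import re

# Hypothetical list of words to remove; only "well" and "hmph" come from the post.
STOP_WORDS = {"well", "hmph", "it", "is", "to", "a", "an", "the", "of", "and"}

def year_url(year, text, count=2):
    """Drop the stop words, then join the first `count` remaining words under the year."""
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]
    return f"http://jonathan.pearce.name/{year}/{''.join(words[:count])}"

print(year_url(2005, "Well, it is good to reunite"))
# http://jonathan.pearce.name/2005/goodreunite
print(year_url(2003, "Updates, updates"))
# http://jonathan.pearce.name/2003/updatesupdates
```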
Longest word appears to be a problem with my corpus, because I keep writing about the same things. The longest words either tend to crop up more than once or be apropos of nothing.