If you came to pubcon this week you probably noticed some very odd questions being asked at some sessions. My favorite was when somebody asked @mattcutts for a better method of doing doorway pages. Yeah, seriously!
My 2nd favorite question though was also Matt’s fault. At the end of one of Alan K’necht’s sessions somebody asked about inverse document frequency and how they can improve theirs to get better rankings. If you’re scratching your head right now and saying “dude, WTF?” then you’re reacting precisely how you SHOULD react.
So where did this question come from? Well, at the pubcon mixer Matt, somebody I can’t remember, and I were bullshitting about mostly non-SEO related stuff. Anyway, we started talking about Amit Singhal’s background and how smart he is, and somehow gravitated to computer science research. That’s when we got onto the topic of Inverse Document Frequency (idf) and how big an expert Amit is in information retrieval.
I take some of the blame because as Matt was attempting to explain IDF in technical terms I jokingly said “basically it means that keyword stuffing is more effective where there aren’t many competing documents.” Sadly only Matt got the joke because it wasn’t just a reference to inverse document frequency, it was also a reference to how Google most likely uses inverse document frequency. Everybody else seemed confused, but giddy like they just stumbled upon some secret ranking sauce. Sorry guys, you didn’t :’(
Ok, so what does all this stuff mean? It’s quite simple but let’s start with the basics. When we say collection frequency we’re referring to how many times the given term occurs across all the documents on the web. When we say document frequency we’re referring to the number of pages on the web that contain the term. Pretty simple, right? (side note: comparing how often a term shows up in your document against those collection-wide numbers is most likely one way Google detects both relevance AND keyword stuffing)
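If the difference between those two counts is still fuzzy, here’s a quick sketch in Python. The three-page “web” and the naive whitespace tokenizing are made up purely for illustration; a real search engine does a lot more work than this.

```python
from collections import Counter

# Hypothetical three-page "web" (illustrative data only).
docs = [
    "cheap flights cheap hotels cheap cars",
    "flights to vegas",
    "vegas hotels and shows",
]

# Naive tokenizing: split on whitespace. This is an assumption for the sketch.
tokenized = [doc.split() for doc in docs]

# Collection frequency: total occurrences of a term across ALL documents.
collection_freq = Counter(term for tokens in tokenized for term in tokens)

# Document frequency: number of documents containing the term at least once.
document_freq = Counter(term for tokens in tokenized for term in set(tokens))

print(collection_freq["cheap"], document_freq["cheap"])  # 3 1
print(collection_freq["vegas"], document_freq["vegas"])  # 2 2
```

Notice “cheap” shows up three times in the collection but only on one page, while “vegas” shows up twice on two different pages. Same ballpark collection frequency, very different document frequency.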
But to do that, they need the inverse document frequency. Inverse Document Frequency is simply a measure of the importance of a term. It’s calculated by dividing the total number of documents by the number of documents containing that term, and then taking the logarithm. An even simpler explanation is to say idf is computed such that rare terms have a higher idf than common terms.
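Here’s that calculation as a tiny Python sketch. The function name, the natural log, and the document counts are my own picks for illustration (the base of the logarithm only scales the numbers, and real systems usually add some smoothing); nobody outside Google knows what they actually run.

```python
import math

def idf(total_docs, docs_with_term):
    """Inverse document frequency: log(total docs / docs containing the term).

    Natural log and no smoothing are assumptions made for this sketch.
    """
    return math.log(total_docs / docs_with_term)

# Hypothetical collection of one million documents.
total_docs = 1_000_000

rare_idf = idf(total_docs, 10)         # term appears in only 10 documents
common_idf = idf(total_docs, 500_000)  # term appears in half the collection

print(round(rare_idf, 2))    # 11.51
print(round(common_idf, 2))  # 0.69
```

The rare term scores way higher than the common one, which is the whole point.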
So why? Basically, if only a few documents contain a term, they should get a bigger relevance boost for it than they would if lots of documents contained that term. That’s all idf really is. It’s nothing you need to worry about – unless you’re writing algorithms or dealing with document retrieval.
But I’m pretty sure Google doesn’t just use idf alone anyway. That’s way too simple of a way to return results. So let’s get more technical. Matt was getting really technical in our talk and although I don’t think he said the term, he was actually describing tf-idf, and that’s what my keyword stuffing joke was about. So, sorry if that got people thinking along the wrong lines.
So what the hell is tf-idf? Don’t let the hyphen fool you; it isn’t a minus sign. tf-idf is actually calculated by multiplying a term’s frequency in a document by that term’s inverse document frequency. This means a document that uses a term a lot, where there aren’t many other documents containing that term, will have a much higher tf-idf; hence my joke about keyword stuffing.
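To make the joke concrete, here’s a rough Python sketch of that multiplication. The `tf_idf` helper, the raw-count term frequency, the natural log, and the four made-up pages are all assumptions for illustration. It’s the textbook version, not anything Google has confirmed using.

```python
import math

def tf_idf(term, doc_tokens, all_docs):
    """Textbook tf-idf: raw term count in the doc times log(N / df)."""
    tf = doc_tokens.count(term)                           # term frequency
    df = sum(1 for tokens in all_docs if term in tokens)  # document frequency
    if df == 0:
        return 0.0  # term appears nowhere, so there is nothing to score
    return tf * math.log(len(all_docs) / df)

# Hypothetical four-page collection (illustrative data only).
docs = [text.split() for text in [
    "widgets widgets widgets widgets for sale",
    "the history of the internet",
    "the best pizza in the city",
    "the weather for the week",
]]

# "widgets" is repeated four times on one page and appears nowhere else.
stuffed_score = tf_idf("widgets", docs[0], docs)

# "the" appears twice on this page, but three of the four pages contain it.
common_score = tf_idf("the", docs[1], docs)

print(round(stuffed_score, 2))  # 5.55
print(round(common_score, 2))   # 0.58
```

The stuffed page wins big on “widgets” precisely because nobody else is using the word. Under plain tf-idf, with nothing else in the mix, that’s all my keyword stuffing crack amounted to.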
And that’s basically all there is to it. Sadly, inverse document frequency won’t be the new buzzword at next year’s pubcon, but that doesn’t mean it isn’t interesting. Information retrieval was a once-boring area of computer science that suddenly became the most interesting thing in the world to geeks once search engines came about. I loved that talk, and it made me pine for my days in college studying computer science.
Hope that helped clear everything up. Pubcon was a blast and I’m sure many of us look forward to implementing the various theories we learned. Well, all of them except inverse document frequency of course.