October 25, 2014

Over Thinking SEO: Inverse Document Frequency

If you came to pubcon this week you probably noticed some very odd questions being asked at some sessions. My favorite question was when somebody asked @mattcutts for a better method of doing doorway pages. Yeah, seriously!

My 2nd favorite question though was also Matt’s fault. At the end of one of Alan K’necht’s sessions somebody asked about inverse document frequency and how they can improve theirs to get better rankings. If you’re scratching your head right now and saying “dude, WTF?” then you’re reacting precisely how you SHOULD react.

So where did this question come from? Well, at the pubcon mixer Matt, somebody I can’t remember, and I were bullshitting about mostly non-SEO related stuff. Anyway, we started talking about Amit Singhal’s background and how smart he is and somehow gravitated to computer science research. That’s when we got talking about the topic of Inverse Document Frequency (idf) and how big an expert Amit is in information retrieval.

I take some of the blame because as Matt was attempting to explain IDF in technical terms I jokingly said “basically it means that keyword stuffing is more effective where there aren’t many competing documents.” Sadly only Matt got the joke because it wasn’t just a reference to inverse document frequency, it was also a reference to how Google most likely uses inverse document frequency. Everybody else seemed confused, but giddy like they had just stumbled upon some secret ranking sauce. Sorry guys, you didn’t :'(

Ok, so what does all this stuff mean? It’s quite simple but let’s start with the basics. When we say collection frequency we’re referring to how many times the given term occurs across all the documents on the web. When we say document frequency we’re referring to the number of pages on the web that contain the term. Pretty simple, right? (Side note: comparing how often a term appears in your document to its collection frequency is most likely one way Google detects both relevance AND keyword stuffing.)
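If the difference between those two counts is still fuzzy, here is a minimal Python sketch. The three-page "web" and the function names are made up purely for illustration, and splitting on whitespace is a simplifying assumption, not how a real search engine tokenizes text.

```python
# Hypothetical three-page "web" (illustrative data only).
docs = [
    "ford mustang specs and mustang engine specs",
    "ford truck specs",
    "hockey scores",
]


def collection_frequency(term, docs):
    """Total number of times `term` occurs across every document."""
    return sum(doc.split().count(term) for doc in docs)


def document_frequency(term, docs):
    """Number of documents that contain `term` at least once."""
    return sum(1 for doc in docs if term in doc.split())


for term in ("mustang", "specs"):
    print(term, collection_frequency(term, docs), document_frequency(term, docs))
```

Here "mustang" appears twice overall but on only one page, while "specs" appears three times across two pages. A term repeated many times on one page pushes collection frequency up without moving document frequency at all.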

But to do that, they need the inverse document frequency. Inverse Document Frequency is simply a measure of the importance of a term. It’s calculated by dividing the total number of documents by the number of documents containing that term, and then taking the logarithm. An even simpler explanation is to say idf is computed such that rare terms have a higher idf than common terms.
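That calculation fits in a couple of lines of Python. This is only a sketch of the textbook formula described above: the natural log is my assumption (the base just rescales the numbers), and the document counts are invented.

```python
import math


def idf(total_docs, docs_with_term):
    """Inverse document frequency: log(total docs / docs containing the term)."""
    if docs_with_term <= 0:
        raise ValueError("term must appear in at least one document")
    return math.log(total_docs / docs_with_term)


total = 1_000_000                  # made-up collection size
rare = idf(total, 10)              # term found in only 10 documents
common = idf(total, 500_000)       # term found in half of all documents
everywhere = idf(total, total)     # term found in every document

print(rare > common > everywhere)
```

The rare term gets the biggest number, and a term that shows up in every single document gets an idf of zero, which is exactly the "rare terms matter more" idea.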

So why? Basically, if only a few documents contain a term, they should get a higher relevance boost than they would if many documents contained it. That’s all idf really is. It’s nothing you need to worry about, unless you’re writing algorithms or dealing with document retrieval.

But I’m pretty sure Google doesn’t just use idf alone anyway. That’s far too simple a way to return results. So let’s get more technical. Matt was getting really technical in our talk and although I don’t think he said the term, he was actually describing tf-idf, and that’s what my keyword stuffing joke was about. So, sorry if that got people thinking along the wrong lines.

So what the hell is tf-idf? Don’t let the hyphen fool you into reading it as a minus sign. tf-idf is actually calculated by multiplying how often a term appears in a document (its term frequency) by the term’s inverse document frequency. This means a document that uses a term a lot, when few other documents contain that term, will have a much higher tf-idf; hence my joke about keyword stuffing.
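To see why the joke works, here is a toy Python sketch of plain tf-idf. The four pages and their names are hypothetical, term frequency is just a raw count (an assumption on my part; real systems usually dampen or normalize it), and none of this claims to be what Google actually runs.

```python
import math

# Hypothetical mini-collection: page name -> page text.
docs = {
    "stuffed": "mustang mustang mustang mustang specs",
    "normal": "ford mustang specs",
    "other1": "ford truck specs",
    "other2": "ford dealer specs",
}


def tf(term, text):
    """Raw term frequency: how many times `term` appears in one document."""
    return text.split().count(term)


def idf(term, docs):
    """log(total docs / docs containing the term); term must occur somewhere."""
    containing = sum(1 for text in docs.values() if term in text.split())
    return math.log(len(docs) / containing)


def tf_idf(term, name, docs):
    """Term frequency in one document multiplied by the term's idf."""
    return tf(term, docs[name]) * idf(term, docs)


print(tf_idf("mustang", "stuffed", docs))
print(tf_idf("mustang", "normal", docs))
print(tf_idf("specs", "stuffed", docs))
```

With raw counts, the stuffed page scores four times what the normal page does for "mustang", because only half the pages mention the term at all. Meanwhile "specs" is on every page, so its idf is zero and no amount of repetition helps.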

And that’s basically all there is to it. Sadly, inverse document frequency won’t be the new buzzword at next year’s pubcon, but that doesn’t mean it isn’t interesting. Information retrieval was a once-boring area of computer science that suddenly became the most interesting thing in the world to geeks once search engines came about. I loved that talk, and it made me pine for my days in college studying computer science.

Hope that helped clear everything up. Pubcon was a blast and I’m sure many of us look forward to implementing the various theories we learned. Well, all of them except inverse document frequency of course.

About Ryan Jones

Ryan Jones is an SEO from Detroit. By day he works as a manager of SEO & Analytics at SapientNitro where his team performs SEO for Fortune 500 clients. By night he's either playing hockey or attempting to take over the world with his own websites - which he would have already succeeded in doing had it not been for those meddling kids and their dog. The views expressed here have not been paid for and belong only to Ryan, not any of his employers or clients. Follow Ryan on Twitter at: @RyanJones, add him on Google+ or visit his personal website: www.RyanMJones.com

Comments

  1. Ryan, the tf-IDF Wikipedia link helped me understand what you were getting at:

    http://en.wikipedia.org/wiki/Tf%E2%80%93idf

    Interesting stuff, hadn’t heard of this term before, but it makes sense to me in terms of Search Engines deciding which internal page of a particular domain should rank for a given term.

    For example, when you do a search for “Ford Mustang Specifications” the Ford.com page that ranks highest is not the most authoritative “Mustang engine specs” page on Ford.com, but it does have the highest tf-IDF for those terms.

    Enlightening, now get back to work, mustangheaven.com is eating your lunch :)

  2. People really do sometimes overthink the SEO process. They tend to think it needs to look like a science project, which is not true. It has to be approached from an influence and marketing standpoint while building your brand online. Of course during all this you use elements the search engines like to see.