April 21, 2014

You Can’t Reverse Engineer The Google Algorithm

As most SEOs are aware, Google just launched the latest Penguin update. Amid the mass panic that follows every algorithm update, several SEOs started discussing the Google algorithm and their theories about it. Don’t worry, I’m not going to tell you how to recover from it or anything like that. Instead, I want to focus on some of the discussion points I saw flying around the web. It’s become clear to me that an overwhelming majority of SEOs have very little computer science training or understanding of computer algorithms. I posted a rant a few weeks ago that briefly touched on this, but now that I can actually type (no more elbow cast!) I’d like to delve a bit deeper into some misconceptions about the Google algorithm.

It’s always been my belief that SEOs should know how to program and now I’d like to give a few examples about how programming knowledge shapes SEO thought processes. I’d also like to add the disclaimer that I don’t work at Google (although I was a quality rater many years ago) and I don’t actually know the Google algorithm. I do have a computer science background though, and still consider myself a pretty good programmer. I’ll also argue (as you will see in this post) that nobody really knows the Google algorithm – at least not in the sense you’re probably accustomed to thinking of.

SEOs who’ve been at it for a while remember the days of reverse engineering the algorithm. Back in the late 90s, it was still pretty easy. Search engines weren’t that complex and could easily be manipulated. Unfortunately, that’s no longer the case. We need to evolve our thinking beyond the typical static formula. There’s just no way the algorithm is as simple as a set of weights and variables.

You can’t reverse engineer a dynamic algorithm unless you have the same crawl data.

The algorithm isn’t static. As I mentioned in my rant, many theories in information retrieval talk about dynamic factor weights based on the corpus of results. Quite simply, that means that search results aren’t ranked according to a flat scale; they’re ranked relative to the other sites that are relevant to that query. Example: If every site for a given query has the same two-word phrase in its title tag, then that phrase being in the title won’t contribute much to the ranking. For a different search though, where only 20% of the results have that term in the title, it would be a heavy ranking factor.
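
To make that concrete, here’s a toy sketch (my own illustration, not anything Google actually runs) of a factor whose weight shrinks as it becomes more common among the candidate results:

```python
# Toy sketch of corpus-dependent factor weighting: a factor contributes
# less when nearly every candidate result shares it. All names and
# numbers here are invented for illustration.

def dynamic_weight(results, has_factor, base_weight=1.0):
    """Scale a factor's weight down as it becomes common in the result set."""
    matching = sum(1 for r in results if has_factor(r))
    prevalence = matching / len(results)
    # If every result has the factor, it can't differentiate anything;
    # its effective weight approaches zero.
    return base_weight * (1.0 - prevalence)

# Hypothetical result sets for two different queries:
query_a = [{"title_match": True} for _ in range(10)]        # all 10 match
query_b = [{"title_match": i < 2} for i in range(10)]       # only 2 match

w_a = dynamic_weight(query_a, lambda r: r["title_match"])   # 0.0: no power
w_b = dynamic_weight(query_b, lambda r: r["title_match"])   # 0.8: strong signal
```

Same factor, two queries, completely different effective weight – which is why a flat “title tag = X points” model can’t describe the rankings you observe.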

What we do know is that there are three main parts to a Google search: indexing (which happens before you search, so we won’t cover it here), result fetching, and result ranking. Result fetching is pretty simple at a high level: it goes through the index and looks for all documents that match your query. (There’s probably some vector-type stuff going on with multiple vectors for relevancy, authority, and whatnot, but that’s way out of this scope.) Then, once all the pages are returned, they’re ranked based on factors. When those factors are evaluated, they’re most likely evaluated based only on the corpus of sites returned.
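
The fetch-then-rank structure can be sketched in a few lines. This is a textbook inverted index, purely illustrative – the documents and queries are made up, and real fetching is vastly more sophisticated:

```python
# Minimal fetch step: build an inverted index (term -> doc ids), then
# return only the documents containing every query term. Ranking would
# then operate on just this candidate set.

from collections import defaultdict

def build_index(docs):
    """Map each term to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def fetch(index, query):
    """All docs containing every query term (boolean AND retrieval)."""
    sets = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*sets) if sets else set()

docs = {
    "a": "fresh apple pie recipe",
    "b": "apple laptop review",
    "c": "pear and apple tart",
}
fetch(build_index(docs), "apple pie")  # → {"a"}
```

The key point for SEOs: whatever the ranking factors are, they’re applied to the set `fetch` returns, not to the whole web – which is exactly why their weights can shift from query to query.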

I want to talk about T-trees and vector intersections and such, but I’m going to use an analogy here instead. In my earlier rant I used the example of car shopping: first you sort by class, then color, etc. – but if all the cars are red SUVs, you then sort by different factors.

Perhaps a better way is to think of applying ranking factors like we alphabetize words. Assume each letter in a word is a ranking factor. For example, in the word “apple” the “a” might be keyword in the title tag, the “p” might be number of links, and the “e” might be something less important like page speed. (Remember when Cutts said “all else being equal, we’ll return the faster result”? That fits here.) Using this method, ranking some queries would be easy. We don’t need many factors to see that apple comes before avocado. But what about pear and pearl? In the apple/avocado example, the most significant (and important) ranking factor is the 2nd letter. In the pear/pearl example, though, the first four factors are less important than the “l” at the end of the word. Ranking factors are the same way: they change based on the set of sites being ranked! (And they get more complicated when you factor in location, personalization, etc. – but we’ll tackle all that in another post.)
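
The analogy fits in a few lines of code. This toy function (my own illustration) finds which “factor” – which letter position – actually decides the order between two words:

```python
# Each letter position stands in for one ranking factor. The factor that
# decides the order is the first position where the candidates differ,
# so "which factor matters" depends entirely on the pair being compared.

def deciding_factor(word_a, word_b):
    """Index of the first position where the two words differ."""
    for i, (a, b) in enumerate(zip(word_a, word_b)):
        if a != b:
            return i
    # One word is a prefix of the other; the shorter one sorts first,
    # so the "deciding factor" is the extra trailing letter.
    return min(len(word_a), len(word_b))

deciding_factor("apple", "avocado")  # → 1: the 2nd letter decides
deciding_factor("pear", "pearl")     # → 4: only the trailing 'l' decides
```

Nothing about the letters themselves changed between the two comparisons – only the competition did. That’s the whole argument against “factor X is worth Y points” thinking.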

It’s not just dynamic, it’s constantly learning too!

For a few years now I’ve had the suspicion that Google is really just one large-scale neural network. When I read things like this and then see the features they just released for Google+ images, I know they’ve got large-scale neural nets mastered.

What’s a neural network? Well, you can go read about it on Wikipedia if you want, but quite simply a neural network is a different type of algorithm. It’s one where you give it the inputs and the desired outputs, and it uses some very sophisticated math to calculate the best and most reliable way to get from those inputs to those outputs. Once it does that, you can give it a larger set of inputs and it can use the same logic to expand the set of outputs. In my college artificial intelligence class (back in 2003) I used a rudimentary one to play simple games like Nim and even to ask smart questions to determine which type of sandwich you were eating. (I fed it a list of known ingredients and sandwich definitions, and it came up with the shortest batch of questions to ask to determine what you had. Pretty cool.) The point is that if I could code a basic neural net in Lisp on a Pentium 1 laptop 10 years ago, I’m pretty sure Google can use way more advanced types of learning algorithms to do way cooler things. Also, ranking link signals is WAY less complicated than finding faces and cats in photos.
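
To show the “inputs plus desired outputs” idea at its absolute simplest, here’s a single perceptron – the most rudimentary building block of a neural net. Everything here is invented for illustration (the feature names, the labels, the numbers); it’s nowhere near what Google does, but it is the same learning pattern:

```python
# A minimal perceptron: given (features, desired label) pairs, it learns
# weights that map inputs to outputs, then generalizes to inputs it has
# never seen. Features and labels below are hypothetical.

def train_perceptron(samples, epochs=50, lr=0.1):
    """samples: list of (features, label) with label 0 (spam) or 1 (quality)."""
    n = len(samples[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        for features, label in samples:
            activation = bias + sum(w * x for w, x in zip(weights, features))
            error = label - (1 if activation > 0 else 0)
            # Nudge the weights toward the desired output.
            weights = [w + lr * error * x for w, x in zip(weights, features)]
            bias += lr * error
    return weights, bias

def predict(weights, bias, features):
    return 1 if bias + sum(w * x for w, x in zip(weights, features)) > 0 else 0

# Hypothetical training set: (natural_link_score, keyword_stuffing_score)
training = [
    ((0.9, 0.1), 1), ((0.8, 0.0), 1),   # rater-labeled "quality"
    ((0.1, 0.9), 0), ((0.2, 0.8), 0),   # rater-labeled "spam"
]
weights, bias = train_perceptron(training)
predict(weights, bias, (0.85, 0.05))  # a new, unseen site gets classified
```

Notice that nobody hand-coded a rule like “stuffing above 0.5 is spam” – the rule fell out of the labeled examples. Swap in thousands of quality-rater labels and millions of signals and you have the shape of the argument below.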

Anyway… when I think of Penguin and Panda and hear that they have to be run independently of the main search ranking algorithm, my gut instantly screams that these are neural nets or similar technology. Here’s some more evidence: from leaked documents we know that Google uses human quality raters and that some of their tasks involve rating documents as relevant, vital, useful, spam, etc. Many SEOs instantly thought, “OMG, actual humans are rating my site and hurting my rankings.” The clever SEOs, though, saw this as a perfect way to create a training set for a neural network type algorithm.

By the way, there’s no brand bias either

Here’s another example. Some time ago @mattcutts said “We actually came up with a qualifier to say OK NYT or Wikipedia or IRS on this side, low quality sites over on this side.” Many SEOs took that to mean that Google has a brand bias. I don’t think that’s the case. I think what Matt was talking about here was using these brand sites as part of the algorithm training set for what an “authoritative” site is. They probably looked at what sites people were clicking on most or what quality raters chose as most vital and then fed them in as a training set.

There’s nothing in the algorithm that says “rank brands higher” (I mean, how does an algorithm know what a brand is? Wouldn’t it be very easy to fake?) – it’s most likely, though, that the types of signals that brand sites have are also the types of signals Google wants to reward. You’ve heard me say at countless conferences: “Google doesn’t prefer brands, people searching Google do.” That’s still true, and that’s why brand sites make a good training set for authority signals. When people stop preferring brands over small sites, Google will most likely stop ranking them above smaller sites.

We need to change our thought process

We really need to stop reacting literally to everything Google tells us and start thinking about it critically. I keep thinking of Danny Sullivan’s epic rant about directories. When Matt said “get directory links” he meant get links from directories people actually visit. Instead, we falsely took that as “Google has a flag that says this site is a directory and gives links on it more weight, so we need to create millions of directories.” We focused on the what, not the why.

We can use our knowledge of computer science here. It’s crucial. We need to stop thinking of the algorithm as a static formula and start thinking bigger. We need to stop trying to reverse engineer it and focus more on the intent and logic behind it. When Google announces they’re addressing a problem we should think about how we’d also solve that problem in a robust and scalable way. We shouldn’t concern ourselves so much with exactly what they’re doing to solve it but instead look at the why. That’s the only true way to stay ahead of the algorithm.

Ok, that’s a lot of technical stuff. What should I take away?

  1. You can’t reverse engineer the algorithm. Neither could most Googlers.
  2. The algorithm, ranking factors, and their importance change based on the query and the result set.
  3. The algorithm learns based on training data.
  4. There’s no coded-in “brand” variable.
  5. Human raters are probably (a) creating training sets and (b) evaluating result sets of the neural-network-style algorithm.

About Ryan Jones

Ryan Jones is an SEO from Detroit. By day he works as a manager of SEO & Analytics at SapientNitro, where his team performs SEO for Fortune 500 clients. By night he's either playing hockey or attempting to take over the world with his own websites - which he would have already succeeded in doing had it not been for those meddling kids and their dog. The views expressed here have not been paid for and belong only to Ryan, not any of his employers or clients. Follow Ryan on Twitter at: @RyanJones, add him on Google+ or visit his personal website: www.RyanMJones.com

Comments

  1. Hey Ryan, great piece here, but I wanted to drop in a few ideas…

    (1) “We shouldn’t concern ourselves so much with exactly what they’re doing to solve it but instead look at the why.” – This statement is largely true for white hat SEO, but largely untrue for black hat / gray hat SEO. As long as Google’s algorithms are unable to address tactics in real time, there will be an opportunity in uncovering short-term wins that can come from looking more at the what than the why.

    (2) “When Google announces they’re addressing a problem we should think about how we’d also solve that problem in a robust and scalable way”. Yes, but that doesn’t necessarily lead to the conclusion that we should focus merely on creating content and campaigns that match Google’s goals. In fact, one could argue that creating content and campaigns that match Google’s goals is self-negating. The information from that thought process is just as valuable for teaching us how to slip through the cracks as it is in teaching us how to “stay ahead”.

    (3) While I agree that “reverse engineering” the algorithm is largely useless, I don’t think that we should ignore the amazing training set that we have at our fingertips – Google’s search results. We can determine what Google missed and hit, especially when they announce an update, and use that to build reasonable approximations of risk. These approximations can be valuable in determining a risk threshold to inform clients.

    Anyway, once again, a great post. Looking forward to hearing more!

  2. aaron wall says:

    One wonders if the “geniuses” that claim there is no brand bias have done a side by side comparison of the longevity of penalties between small ecommerce sites that were hit by Panda and big brands that intentionally egregiously violated the guidelines like Interflora.

    Reality can smack a person in the face, but “It is difficult to get a man to understand something, when his salary depends upon his not understanding it!” – Upton Sinclair

  3. For the purpose of this post I think we should differentiate between algorithm brand bias, and spam team penalty brand bias. I’m arguing that the algorithm does not have a brand bias, nor does it even have a factor that recognizes “this is a brand.” It’s only looking at other signals, which brands generally have more of or stronger ones.

    Now, when it comes to things like manual penalties, I will agree with you that Google is less harsh on brands than they are on smaller sites. There’s plenty of evidence to show us that.

    I think this is partly because when Google bans a small site, general searchers don’t notice. When Google bans BMW though, hundreds of thousands of people searching “bmw” don’t see the site they expect to see in the results and suddenly think search is broken.

    Google has to balance punishing spammers with serving their real customers (webmasters, site owners, SEOs, etc. aren’t Google’s customers). In that case, sometimes you have to let the brand still rank, because not doing so makes the search results less useful to searchers. It’s a delicate balance, and I’m not sure Google has found the best solution to that yet.

  4. aaron wall says:

    So the direct explicit bias from the humans that write the code for the algorithm doesn’t apply to the algorithm itself? Why wouldn’t it carry across there if it is so overt elsewhere?

    Also, it is a myth that the only outcomes are escaping completely unscathed or complete & absolute death. Some of their efforts of the past few years were allegedly about working in that gray area between those 2 absolutes. They can allow BMW to rank for keywords with BMW in them and disallow ranking for other terms, or dampen/demote across other terms.

    Believing the black & white, hit or miss, all or nothing “solutions” are the only solutions in existence is giving into the propaganda of the central network operators, even as those same network operators are working on coding up alternative solutions inside “the algorithm.”

    Some manual penalties are harsher than others. Some algorithmic penalties are harsher than others. There isn’t a significant difference between a whitelist & an overly aggressive algorithm that has lots of collateral damage followed up by select manual overrides.

  5. There is no “brand bias” in the algorithm. Some people, given to psychotic rantings and ravings, took Eric Schmidt’s “brands can clean up the cesspool that is the Web” comment from a few years back and extrapolated a crackpot theory that Bryson Meunier has shot down in so many ways with so much data that no one should even be having this discussion any more.

    Brands are born every day. Brands die every day. No “brand bias” built into the algorithmic process would be able to cope with that kind of dynamic environment. How do you know when a brand is a brand? If people don’t know they cannot program algorithms to know.

    Don’t argue with the monkeys, Ryan. They only listen to their own chatter.

    Exactly, Michael. How do you define a brand? Does Google maintain a list of the Fortune 500 that the algorithm references? No way.

    It’s more likely that the signals Google wants to reward (trust, authority, popularity) are also highly correlated with brands. Consumers trust brands more than the little guys. As long as they do, Google is going to reward what consumers want.

    All brands had to start somewhere though. They didn’t just file their incorporation papers and suddenly get rewarded by people and Google. They were once little guys. They created value and made themselves stand out through quality or awareness or offline marketing, etc.

    The road to success isn’t short, and some people get a head start – but that’s how life works. Everybody has an opportunity, some just have to work harder to achieve it. Google’s goal is to provide a more useful experience to searchers – and that is best done by matching their algorithms to searchers needs, wants, and expectations – not by trying to appear unbiased.

    (Note: I said appear, not be. I get why people think there’s a bias, but it’s not a real bias, just one perceived by people who are only looking at one side of the argument)

  7. aaron wall says:

    Surprise, surprise, Michael Martinez is still an angry ignorant asshole.

    You’re right…some things never change.

  8. Stop whining, Aaron, and try doing some real SEO for a change.

  9. aaron wall says:

    Sorry Michael, but unlike you, I don’t have any websites dedicated to analyzing important business topics like Xena warrior princess or such.

    … a man can dream … one day … be like Mike … I wish that I could be like Mike …

  10. That’s okay, Aaron. If you had, you might have learned NOT to blow $50,000 on a stupid black hat trick that got an otherwise good Website penalized.

    Try not to spoil this blog with any more pointless nonsense, okay?

  11. aaron wall says:

    > “$50,000 … Try not to spoil this blog with any more pointless nonsense, okay?” <

    And here I was misinformed to believe Xena was foundational business knowledge.

    My bad!

  12. You have admitted to being misinformed on many issues in the past. Your first ebook was an exceptional example of just how wrong you can be on so many fine points. You sure really want to revisit the past like this? There are so many Dumb Aaron Wall moments to share….

  13. aaron wall says:

    Do you mean like the time I showed you screenshots of the search results & you claimed they were “smoke and mirrors” ;)

    I think the difference between an intelligent person and a willfully ignorant one is the ability to accept new information & believe one’s own lying eyes … but then there might be some Xena secrets about the universe that I am not aware of. I defer to your superiority on that front.

  14. Aaron, did Google do you wrong again this week? Are you just hurting from yet another avoidable penalty? Let it out, Bunky: we’re all listening.

  15. aaron wall says:

    Hahaha.

    Thankfully, not in the least.

    And, more importantly, I didn’t waste thousands of hours of my life on Xena either.

  16. You seem to be wasting quite a bit of your life on being toxic and petty. That’s your choice to make.

  17. aaron wall says:

    Let me get this straight … you can call other people “monkeys,” but if people call you out on that & behave like you do they are both “toxic and petty.”

    Well at least we both agree about your fundamental nature.

    Wow – look at that – we agreed on something.

    First time ever!

    Here’s a cordial goodbye to you and the Xena clan.

  18. Aaron, you were toxic and petty long before I called you a monkey. And you’ll obviously continue to be toxic and petty. Despite your cordial goodbye, I suggest you give up on the Google conspiracy theories and learn to do real SEO for a change.