How Google Works: A Google Ranking Engineer’s Story #SMX

Google Software Engineer Paul Haahr has been at Google for more than 14 years. For two of them, he shared an office with Matt Cutts. He’s taking the SMX West 2016 stage to share how Google works from a Google engineer’s perspective – or, at least, share as much as he can in 30 minutes. After, Webmaster Trends Analyst Gary Illyes will join him onstage and the two will field questions from the SMX audience with Search Engine Land Editor Danny Sullivan moderating.

From left: Google Webmaster Trends Analyst Gary Illyes, Google Software Engineer Paul Haahr and Search Engine Land Editor Danny Sullivan on the SMX West 2016 stage in San Jose.

How Google Works

Haahr opens by telling us what Google engineers do. Their job includes:

  • Writing code for searches
  • Optimizing metrics
  • Looking for new signals
  • Combining old signals in new ways
  • Moving results with good ratings up
  • Moving results with bad ratings down
  • Fixing rating guidelines
  • Developing new metrics when necessary

Two parts of a search engine:

  • Ahead of time (before the query)
  • Query processing

Before the Query

  • Crawl the web
  • Analyze the crawled pages
    • Extract links
    • Render contents
    • Annotate semantics
  • Build an index

The Index

  • Like the index of a book
  • For each word, a list of pages it appears on
  • Broken up into groups of millions of pages
  • Plus per-document metadata
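The inverted-index idea above can be sketched in a few lines of Python. This is a toy illustration of the data structure, not Google's implementation; a real index is sharded across machines and also stores word positions and per-document metadata.

```python
from collections import defaultdict

def build_index(pages):
    """Build a toy inverted index: word -> sorted list of page IDs it appears on."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical mini-corpus
pages = {
    1: "texas farm fertilizer",
    2: "farm equipment dealers in texas",
    3: "organic fertilizer guide",
}
index = build_index(pages)
print(index["farm"])        # -> [1, 2]
print(index["fertilizer"])  # -> [1, 3]
```

Just like a book's index, a lookup for a word returns the list of pages containing it, which is what makes retrieval at query time fast.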

Query Processing

  • Query understanding and expansion
    Does the query name any known entities?
  • Retrieval and scoring
    • Send the query to all the shards
Each shard:

      • Finds the matching pages
      • Computes a score for query+page
      • Sends back the top N pages by score
    • Combine all the top pages
    • Sort by score
  • Post-retrieval adjustments
    • Host clustering
    • Is there duplication?
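The retrieval-and-scoring steps above (fan the query out to every shard, score query+page on each, return each shard's top N, then merge and re-sort) can be sketched like this. The scoring function here is a made-up term-overlap stand-in, not Google's actual signals:

```python
import heapq

def score(query, page_text):
    """Hypothetical query+page score: count of shared terms (illustration only)."""
    q_terms = set(query.lower().split())
    p_terms = set(page_text.lower().split())
    return len(q_terms & p_terms)

def shard_top_n(query, shard, n):
    """Each shard scores only its own pages and returns its top N."""
    scored = ((score(query, text), page_id) for page_id, text in shard.items())
    return heapq.nlargest(n, scored)

def search(query, shards, n=3):
    """Send the query to all shards, combine their top pages, sort by score."""
    candidates = []
    for shard in shards:
        candidates.extend(shard_top_n(query, shard, n))
    return heapq.nlargest(n, candidates)

# Two hypothetical shards, each holding a group of pages
shards = [
    {1: "texas farm fertilizer", 2: "farm equipment"},
    {3: "fertilizer prices in texas", 4: "cooking recipes"},
]
print(search("texas fertilizer", shards))
```

The design point is that no shard ever sees the whole index: each does a small local top-N, and only those small candidate lists cross the network to be merged.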

Scoring Signals

A signal is:

  • A piece of information used in scoring
  • Query independent – a feature of the page alone
  • Query dependent – a feature of both the query and the page

Metrics

“If you cannot measure it, you cannot improve it” – Lord Kelvin

  • Relevance
    • Does a page usefully answer the user’s query
    • Ranking’s top-line metric
  • Quality
    • How good are the results we show
  • Time to result (faster is better)

Google measures itself with live experiments:

  • A/B experiments on real traffic
  • Look for changes in click patterns
  • A lot of traffic is in one experiment or another
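A common way to run this kind of A/B bucketing on live traffic is to hash the user together with the experiment name, so a user stays in the same arm for the life of one experiment while assignments remain independent across experiments. This is a sketch of the general technique, not Google's actual mechanism:

```python
import hashlib

def experiment_arm(user_id, experiment_name, arms=("control", "treatment")):
    """Deterministically assign a user to an experiment arm.

    Hashing (experiment, user) together means the same user always sees the
    same arm within one experiment, but lands in uncorrelated arms across
    different experiments running at the same time.
    """
    digest = hashlib.sha256(f"{experiment_name}:{user_id}".encode()).hexdigest()
    return arms[int(digest, 16) % len(arms)]

# Hypothetical user and experiment names
print(experiment_arm("user-42", "blue-shade-17"))
print(experiment_arm("user-42", "snippet-length-test"))
```

Because assignment is a pure function of the IDs, no per-user state needs to be stored, and click patterns can later be compared between the arms.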

At one time, Google tested 41 different blues to see which was best.

Google also does human rater experiments:

  • Show real people experimental search results
  • Ask how the results are
  • Aggregate ratings across raters
  • Publish guidelines explaining criteria for raters
  • Tools support doing this in an automated way, similar to Mechanical Turk

Google judges pages on two main factors:

  • Needs Met (where mobile is front and center)
  • Page Quality

Needs Met grades:

  • Fully Meets
  • Very Highly Meets
  • Highly Meets
  • Moderately Meets
  • Slightly Meets
  • Fails to Meet
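Aggregating these grades across raters might look like the sketch below. The numeric weights are invented for illustration; Google does not publish how grades map to scores:

```python
# Illustrative numeric scale for the Needs Met grades (weights are assumptions)
NEEDS_MET = {
    "Fails to Meet": 0.0,
    "Slightly Meets": 0.25,
    "Moderately Meets": 0.5,
    "Highly Meets": 0.75,
    "Very Highly Meets": 0.9,
    "Fully Meets": 1.0,
}

def aggregate(ratings):
    """Average one result's Needs Met ratings across multiple raters."""
    scores = [NEEDS_MET[r] for r in ratings]
    return sum(scores) / len(scores)

print(aggregate(["Highly Meets", "Very Highly Meets", "Moderately Meets"]))
```

Averaging across raters smooths out individual disagreement, which is why the guidelines that define each grade matter so much: they keep raters scoring against the same criteria.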

Page quality concepts:

  • Expertise
  • Authoritativeness
  • Trustworthiness

Google engineer development process:

  • Idea
  • Repeat until ready
    • Write code
    • Generate data
    • Run experiments
    • Analyze
  • Launch report by quantitative analyst
  • Launch review
  • Launch

What goes wrong?

There are two kinds of problems:

  • Systematically bad ratings
  • Metrics don’t capture the things we care about

Here’s an example of a bad rating. Someone searches for [Texas farm fertilizer] and the search result provides a map to the manufacturer’s headquarters. It’s very unlikely that that’s what the searcher wants, and Google can detect this through live experiments. But if a rater sees the map and rates it as “Highly Meets” the need, then the failure is at the point of rating – a systematically bad rating.

Or, what if the metrics are missing something? From 2009 to 2011, there were lots of complaints about low-quality content, yet relevance metrics kept going up, because content farms were producing pages that matched queries well despite being low quality. Conclusion: Google wasn’t measuring everything it needed to measure. Thus, the quality metric was developed as separate from relevance.

Here’s Paul Haahr’s slide deck, which is worth a look:
Update 7/19: Presentation has now been marked private by the author.

Gary Illyes and Paul Haahr Answer Questions from the SMX Audience

SMX: How does RankBrain fit into all of this?

Haahr: RankBrain gets to see a subset of the signals. I can’t go into too much detail about how RankBrain works. We understand how it works but not as much what it’s doing. It uses a lot of the stuff that we’ve published about deep learning.

How would RankBrain know the authority of a page?

Haahr: It’s all a function of the training that it gets. It sees queries and other signals. I can’t say that much more that would be useful.

SMX: When you are logged into a Google app, do you differentiate by the information you gather? If you’re in Google Now vs. Chrome can that impact what you’re seeing?

Haahr: It’s really a question of whether you’re logged in or not. We provide a consistent experience, and your browsing history follows you in either case.

Does Google deliver different results for the same queries at different times in the day?

Illyes: I’m not sure. In Maps, for example, if we display something maps-related we will show the business hours. But to my knowledge, the time of day doesn’t change what shows up.

SMX: What’s going on with Panda and Penguin?

Illyes: I gave up on giving a date or timeline on Penguin. We are working on it, thinking about how to launch it, but I honestly don’t know a date and I don’t want to say a date because I was already wrong three or four times, and it’s bad for business.

SMX: Post-Google Authorship, how are you tracking author authority?

Haahr: There I’m not going to go into any detail. What I will say is the raters are expected to review that manually for a page that they are seeing. What we measure is: are we able to do a good job of serving results that the raters think are good authorities.

SMX: Does that mean authority is used as a direct or indirect factor?

Haahr: I wouldn’t say yes or no. It’s much more complicated than that and I can’t give a direct answer.

SMX: When explicit authorship ended, Google did say to keep having bylines. Should you bother with rel=author at all?

Illyes: There is at least one team that is still looking into using the rel=author tag for the sake of future developments. If I were an SEO, I would still leave the tag in place; it doesn’t hurt to have it. On new pages, however, it’s probably not worth adding, though we might use it for something in the future.

SMX: What are you reading right now?

Haahr: I read a lot of journalism and very few books. However, I just finished “City on Fire” – it’s about New York in the ’70s. There are 900 pages and I was disappointed when it ended. I’ve just started “It Can’t Happen Here.”


Kristi Kellogg is a journalist, news hound, professional copywriter, and social (media) butterfly. Currently, she is a senior SEO content writer for Conde Nast. Her articles appear in newspapers, magazines, across the Internet and in books such as "Content Marketing Strategies for Professionals" and "The Media Relations Guidebook." Formerly, she was the social media editor at Bruce Clay Inc.

