eMarketing101.net: Traffic Means Business

eMarketing news

/// Posted by Alexandre Brabant on Monday, November 30, 2009

Google’s Indexation Cap

From: google.com

Posted by randfish

Over the past 2 years, SEOmoz has worked with quite a number of websites whose primary goal (or primary problem) in SEO has been indexation - getting more of their pages included in Google’s index so they have the opportunity to rank well. These are, obviously, long tail focused sites that earn the vast majority of their visits from queries that bring in 5 or fewer searches each day. In this post, I’m going to tackle the question of how Google determines the quantity of pages to index on a site and how sites can go about improving this metric.

First, a quick introduction to a truth that I’m not sure Google’s shared very publicly (though they may have discussed it on panels or formally on the web somewhere I haven’t seen) - that is - the concept that there’s an “indexation cap” on the number of URLs from a website that Google will maintain in their main index. I was skeptical about this until I heard it firsthand from a Googler being described to a webmaster. Even then, I didn’t feel like the principle was “confirmed,” but after talking to a lot of SEOs working at very large companies, some of whom have more direct interactions with the search quality team, this is, apparently, a common point of discussion and something Google’s been more open about recently.

The “indexation cap” makes sense, particularly as the web keeps growing exponentially in size, often due to the production of spam and of more legitimate, but no more index-worthy, content on sites of all sizes and shapes. I believe that many site owners started noticing that the more pages they produced, even with very little “unique” content, the more traffic Google would send and thus, abuse was born. As an example, try searching using Google’s “last 24 hours” function:


SEOmoz blog post search on Google in the past 24 hours

Seriously, go have a look; the quantity of “junk” you wouldn’t want in your search engine’s index is remarkable.

Since Tom published the post on Xenu’s Link Sleuth last night, Google’s already discovered more than 250 pages around the web that include that content or mentions of it. If, according to Technorati, the blogosphere is still producing 1.5 million+ posts each week, and each post spawns a similar 250 or so pages, that’s conservatively growing the web by ~20 billion pages each year. It should come as no surprise that Google, along with every other search engine, has absolutely no desire to keep more than, possibly, 10-20% of this type of content (and anyone who’s tried re-publishing in this fashion for SEO has likely felt that effect). Claiming to have the biggest index size may actually be a strike against relevancy in this world (according to Danny Sullivan, it’s been a dead metric for a long time).
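The ~20 billion figure can be checked with a quick back-of-the-envelope calculation. This is only a sketch: the 1.5 million posts per week and the 250 pages come from the paragraph above, while the assumption that every post spawns roughly 250 copies or mentions is an extrapolation from a single example, and the function name is made up for illustration.

```python
# Rough estimate of how many pages re-published blog content adds each year.
POSTS_PER_WEEK = 1_500_000   # Technorati figure quoted above
PAGES_PER_POST = 250         # assumption: every post behaves like the Xenu example
WEEKS_PER_YEAR = 52


def pages_per_year(posts_per_week, pages_per_post, weeks=WEEKS_PER_YEAR):
    """Return the estimated number of new pages created per year."""
    return posts_per_week * weeks * pages_per_post


total = pages_per_year(POSTS_PER_WEEK, PAGES_PER_POST)
print(f"{total / 1e9:.1f} billion pages per year")  # prints "19.5 billion pages per year"
```

Halving the pages-per-post assumption still leaves roughly 10 billion pages a year, so the conclusion does not depend much on the exact number.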

So - long story short - Google (very likely) has a limit it places on the number of URLs it will keep in its main index and potentially return in the search results for domains.

The interesting part is that, in the past 3 months, the number of big websites (I’ll use that to refer to sites in excess of 1 million unique pages) we’ve talked to, helped through Q+A or consulted with that have lost wide swaths of indexation has skyrocketed, and we’re not alone. The pattern is usually the same:


  • One morning, you wake up, and 40% of your search traffic is gone with no signal as to what’s happened
  • Cue panicking executives, investors and employees (oh, and usually the poor SEO team, too)
  • Enter statistics data, showing that rankings for big terms aren’t down (or, maybe down a little), but that the long tail has gotten a lot shorter
  • Re-consideration request goes to Google
  • Somewhere between 10 and 40 days later, a message arrives saying:


We’ve processed your reconsideration request for http://xyz.com.

We received a request from a site owner to reconsider how we index the following site: http://xyz.com

We’ve now reviewed your site. When we review a site, we check to see if it’s in violation of our Webmaster Guidelines. If we don’t find any problems, we’ll reconsider our indexing of your site. If your site still doesn’t appear in our search results, check our Help Center for steps you can take.



  • This email, soon to be recognized by the Academy of Nonsense for its pre-eminent place among the least helpful collection of words ever assembled, spurs bouts of cursing and sometimes, tragically, termination of SEO or marketing managers. Hence, we at SEOmoz take it pretty personally (as this group includes many close friends & colleagues).
  • Calls go out to the Google AdWords reps, typically consisting of a conversation that goes something like:
    Exec: “We spent $10 million @#$%ing dollars with you last month and you can’t help?”
    AdWords Rep: “I’m sorry. We wish we could help. We just don’t have any influence on that side of the business. We don’t know anyone there or talk to anyone there.”
    Exec: “Get me your boss on the phone. Now.”
    Repeat ad nauseam until you reach a level of management commensurate with the spend of the exec’s company (or their connections)
    Exec: “Can you get me some answers?”
    AdWords Boss: “They won’t tell me much, but apparently they’re not keeping as many pages in the index from your site as they were before.”
    Exec: “Yeah, we kind of figured that part out. Are they going to put us back in?”
    AdWords Boss: “My understanding is no.”
    Exec: “So what am I supposed to do? We’re not going to have money to buy those $10 million in ads next month, you know.”
    AdWords Boss: “You might try talking to someone who does SEO.”
  • At this point, consultants receive desperate email or phone messages

To help site owners facing these problems, let’s examine some of the potential metrics Google looks at to determine indexation (note that these are my opinions, and I don’t have statistical or quantitative data to back them up at this time):


  1. Importance on the Web’s Link Graph
    We’ve talked previously about metrics like a domain-level calculation of PageRank (Domain mozRank is an example of this). It’s likely that Google would make this a backbone of the indexation cap estimate, as sites that tend to be more important and well-linked-to by other important sites tend to also have content worthy of being in the index.
  2. Backlink Profile of the Domain
    The profile of a site’s links can look at metrics like where those links come from, the diversity of the different domains sending links (more is better) and why those links might exist (methods that violate guidelines are often getting caught and filtered so as not to provide value).
  3. Trustworthiness of the Domain
    Calculations like TrustRank (or Domain mozTrust in Linkscape) may make their way into the determination. You may not have as many links, but if they come from sites and pages that Google trusts heavily, your chances for raising the indexation cap likely go up.
  4. Rate of Growth in Pages vs. Backlinks
    If your site’s content is growing dramatically, but you’re not earning many new links, this can be a signal to the engine that your content isn’t “worthy” of ongoing attention and inclusion.
  5. Depth & Frequency of Linking to Pages on the Domain
    If your home page and a few pieces of link-targeted content are earning external links while the rest of the site flounders in link poverty, that may be a signal to Google that although users like your site, they’re not particularly keen on the deep content - which is why the index may toss it out.
  6. Content Uniqueness
    Uniqueness is a constantly moving target and hard to nail down, but basically, if you don’t have a solid chunk of words and images that are uniquely found on one URL (ignoring scrapers and spam publishers), you’re at risk. Google likely runs a number of sophisticated calculations to help determine uniqueness, and they’re also, in my experience, much tougher on pages and sites that don’t earn high quantities of external links to their deep content with this analysis.
  7. Visitor, CTR and Usage Data Metrics
    If Google sees that clicks to your site frequently result in a click of a back button, a return to the SERPs and the selection of another result (or another query) in a very short time frame, that can be a negative signal. Likewise, metrics they gather from the Google toolbar, from ISP data and other web surfing analyses could enter into this mix. While CTR and usage metrics are noisy signals (one spammer with a Mechanical Turk account can swing the usage graph pretty significantly), they may be useful to decide which sites need higher levels of scrutiny.
  8. Search Quality Rater Analysis + Manual Spam Reports
    If your content is consistently reported as being low value or spam by users and/or quality raters, expect a visit from the low indexation cap fairy. This may even be done on a folder-by-folder basis if certain portions of your site are particularly egregious while other material is index-worthy (and that phenomenon probably holds true for all of the criteria above as well).
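To make factor 4 (rate of growth in pages vs. backlinks) a little more concrete, here is a minimal sketch of how a site owner might compare the two growth rates from their own numbers. Nothing here reflects how Google actually computes anything; the `Snapshot` type, the `pages_outpacing_links` function and the `tolerance` threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Site size at one point in time (hypothetical structure)."""
    pages: int            # URLs published on the site
    linking_domains: int  # distinct external domains linking in


def growth(old, new):
    """Relative growth between two positive counts, e.g. 0.5 for +50%."""
    if old <= 0:
        raise ValueError("old count must be positive")
    return (new - old) / old


def pages_outpacing_links(before, after, tolerance=2.0):
    """True when page growth exceeds `tolerance` times link growth.

    The tolerance is an arbitrary illustrative threshold, not a known value.
    """
    page_growth = growth(before.pages, after.pages)
    link_growth = growth(before.linking_domains, after.linking_domains)
    return page_growth > tolerance * max(link_growth, 0.0)


# Example: pages tripled while linking domains grew by 10%.
last_year = Snapshot(pages=1_000_000, linking_domains=5_000)
this_year = Snapshot(pages=3_000_000, linking_domains=5_500)
print(pages_outpacing_links(last_year, this_year))  # prints True
```

The same comparison could be run folder by folder, since the post suggests a cap may be applied to portions of a site rather than the whole domain.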

Now let’s talk about some leading indicators that can help to show if you’re at risk:


  • Deep pages rarely receive external links - if you’re producing hundreds or thousands of pages of new content and fewer than “dozens” earn any external link at all, you’re in a sticky situation. Sites like Wikipedia, the NYTimes, About.com, Facebook, Twitter and Yahoo! have millions of pages, but they also have dozens to hundreds of millions of links, and relatively few pages that have no external links. Compare that against your 10 million page site with 400K pages in the index (which is more pages than what Google reports indexing on Adobe.com, one of the best linked-to domains on the web).
  • Deep pages don’t appear in Google Alerts - if Google Alerts is consistently passing you by (not reporting your pages), this can be (but isn’t universally) an indication that they’re not perceiving your pages as being unique or worthy enough of the main index in the long run.
  • Rate of crawling is slow - if you’re updating content, links and launching new pages multiple times per day, and Google’s coming by every week, you’re likely in trouble. XML Sitemaps might help, but it’s likely you’re going to need to improve some of those factors described above to get in good graces for the long term.
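Two of the indicators above can be measured from data most sites already have. The sketch below assumes a server access log in the common "combined" format and a mapping of deep URLs to external link counts exported from whatever link tool you use; both inputs, and the function names, are assumptions for illustration. Matching on the string "Googlebot" is also a simplification, since user agents can be spoofed.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the date part of a combined-log timestamp, e.g. [30/Nov/2009:10:15:32 +0000]
LOG_DATE = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}):")


def googlebot_hits_per_day(log_lines):
    """Count requests per day whose log line mentions Googlebot."""
    hits = Counter()
    for line in log_lines:
        if "Googlebot" not in line:
            continue
        match = LOG_DATE.search(line)
        if match:
            day = datetime.strptime(match.group(1), "%d/%b/%Y").date()
            hits[day] += 1
    return hits


def linked_share(external_links_by_url):
    """Fraction of URLs that have at least one external link."""
    if not external_links_by_url:
        return 0.0
    linked = sum(1 for count in external_links_by_url.values() if count > 0)
    return linked / len(external_links_by_url)


sample_log = [
    '66.249.66.1 - - [30/Nov/2009:10:15:32 +0000] "GET /a HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '10.0.0.1 - - [30/Nov/2009:10:16:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
sample_links = {"/a": 3, "/b": 0, "/c": 0, "/d": 1}

print(googlebot_hits_per_day(sample_log))
print(linked_share(sample_links))
```

If the daily Googlebot count stays far below the number of pages you publish or update each day, or the linked share of deep pages is tiny, those are the warning signs described in the list above.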

There’s no doubt that indexation can be a vexing problem, and one that’s tremendously challenging to conquer. When the answer to “how do we get those pages back?” is “make the content better, more unique, stickier and get a good number of diverse domains to link regularly to each of those millions of URLs,” there’s going to be resistance and a search for easier answers. But, like most things in life, what’s worth having is hard to get.

As always, I’m looking forward to your thoughts (and your shared experiences) on this tough issue. I’m also hopeful that, at some point in the future, we’ll be able to run some correlations on sites that aren’t fully indexed to show how metrics like link counts or domain importance may relate to indexation numbers.





Read Original: http://feedproxy.google.com/~r/seomoz/~3/GhAmDMIjn_Y/googles-indexation-cap
