Question: How to get tens of millions of pages indexed by Googlebot?


We are developing a site that currently has 8 million unique pages. That will grow to about 20 million right away, and eventually to about 50 million or more.

Before you criticize... Yes, it provides unique, useful content. We continually process raw data from public records and by doing some data scrubbing, entity rollups, and relationship mapping, we've been able to generate quality content, developing a site that's quite useful and also unique, in part due to the breadth of the data.

Its PR is 0 (new domain, no links), and we're getting spidered at a rate of about 500 pages per day, putting us at about 30,000 pages indexed thus far. At this rate, it would take roughly 270 years to index 50 million pages.
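
For a rough sense of scale, the crawl-time arithmetic can be sketched from the figures quoted above (500 pages per day today, a desired 100,000 pages per day, and page counts of 8, 20, and 50 million). This is only a back-of-the-envelope estimate: it assumes a constant daily rate, which real crawling will not follow, and the function name is made up for illustration.

```python
# Back-of-the-envelope crawl-time estimate using the figures quoted in the question.
# Assumption: the crawl rate stays constant, which real crawling will not do.
DAYS_PER_YEAR = 365


def years_to_index(total_pages: int, pages_per_day: int) -> float:
    """Return how many years it takes to crawl total_pages at a constant daily rate."""
    return total_pages / pages_per_day / DAYS_PER_YEAR


if __name__ == "__main__":
    for total in (8_000_000, 20_000_000, 50_000_000):
        for rate in (500, 100_000):
            years = years_to_index(total, rate)
            print(f"{total:>11,} pages at {rate:>7,}/day: {years:7.1f} years")
```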

I have two questions:

    Is the rate of indexing directly correlated with PR? By that I mean, is it correlated strongly enough that purchasing an old domain with good PR would get us to a workable indexing rate (in the neighborhood of 100,000 pages per day)?
    Are there any SEO consultants who specialize in aiding the indexing process itself? We're otherwise doing very well with SEO, on-page especially. Besides, the competition for our "long-tail" keyword phrases is pretty low, so our success hinges mostly on the number of pages indexed.

Our main competitor has achieved approx 20MM pages indexed in just over one year's time, along with an Alexa 2000-ish ranking.

Noteworthy qualities we have in place:

    page download speed is pretty good (250-500 ms)
    no errors (no 404 or 500 errors when getting spidered)
    we use Google Webmaster Tools and log in daily
    friendly URLs in place
    I'm afraid to submit sitemaps. Some SEO community postings suggest a new site with millions of pages and no PR is suspicious. There is a Google video of Matt Cutts speaking of a staged on-boarding of large sites, too, in order to avoid increased scrutiny (at approx 2:30 in the video).

    Clickable site links deliver all pages, no more than four pages deep and typically no more than 250(-ish) internal links on a page.
    Anchor text for internal links is logical and adds relevance hierarchically to the data on the detail pages.
    We had previously set the crawl rate to the highest setting in Webmaster Tools (only about a page every two seconds, max). I recently turned it back to "let Google decide", which is what is advised.
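
The link-structure claims in the list above (every page reachable by clickable links, no more than four pages deep, roughly 250 internal links per page) can be checked mechanically. Below is a minimal sketch, assuming the internal link graph has already been exported as a dictionary mapping each page to the pages it links to. The function names, the toy URLs, and the choice to count the home page as depth 0 are all assumptions made for illustration.

```python
from collections import deque


def click_depths(links: dict[str, list[str]], start: str) -> dict[str, int]:
    """Breadth-first search from start; returns the minimum click depth of each reachable page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def audit(links: dict[str, list[str]], start: str,
          max_depth: int = 4, max_links: int = 250) -> dict[str, list[str]]:
    """Report pages that are too deep, unreachable from start, or carry too many links."""
    depths = click_depths(links, start)
    all_pages = set(links) | {t for targets in links.values() for t in targets}
    return {
        "too_deep": sorted(p for p, d in depths.items() if d > max_depth),
        "unreachable": sorted(all_pages - set(depths)),
        "too_many_links": sorted(p for p, t in links.items() if len(t) > max_links),
    }


if __name__ == "__main__":
    # Toy link graph with made-up URLs.
    toy = {
        "/": ["/a", "/b"],
        "/a": ["/a/1"],
        "/a/1": ["/a/1/x"],
        "/a/1/x": ["/a/1/x/y"],
        "/a/1/x/y": ["/a/1/x/y/z"],
        "/orphan": ["/"],
    }
    print(audit(toy, "/"))
```

Breadth-first search is used because it yields the shortest click path to each page, which is the depth that matters when a crawler starts from the home page.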

asked Sep 13, 2013 in Java Interview Questions by rajesh
edited Sep 12, 2013
0 votes

4 Answers

0 votes
How to get tens of millions of pages indexed by Googlebot?

It won't happen overnight. However, I guarantee that you would see more of your pages spidered sooner if inbound links to deep content (particularly sitemap pages or directory indexes that point to yet deeper content) were added from similarly large sites that have been around for a while.

    Will an older domain be sufficient to get 100,000 pages indexed per day?

Doubtful, unless you're talking about an older domain that has had a significant amount of activity on it (i.e. accumulated content and inbound links) over the years.

    Are there any SEO consultants who specialize in aiding the indexing process itself?

When you pose the question that way, I'm sure you'll find plenty of SEOs who loudly proclaim "yes!" But at the end of the day, Virtuosi Media's suggestions are as good as any advice you'll get from them (to say nothing of the potentially bad advice).

From the sound of it, you should consider using business development and public relations channels to build your site's ranking at this point. Get more links to your content, preferably by partnering with an existing site (for example, one that offers regionally targeted content and can link to your regionally divided content). Get more people browsing your site; some will have the Google Toolbar installed, so their traffic may help with page discovery. And, if possible, get your business talked about in the news or in communities of people who need it (if you plan to charge for certain services, consider advertising a free trial period to draw interest).
answered Sep 13, 2013 by rajesh
edited Sep 12, 2013
0 votes
Some potential strategies:

    Google Webmaster Tools allows you to request an increased crawl rate. Try doing that if you haven't already.
    Take another look at your navigation architecture to see if you can't improve access to more of your content. Look at it from a user's perspective: If it's hard for a user to find a specific piece of information, it may be hard for search engines as well.
    Make sure you don't have duplicate content because of inconsistent URL parameters or improper use of slashes. By eliminating duplicate content, you cut down on the time Googlebot spends crawling something it has already indexed.
    Use related content links and in-site linking within your content whenever possible.
    Randomize some of your links. A sidebar with random internal content is a great pattern to use.
    Use dates and other microformats.
    Use RSS feeds wherever possible. RSS feeds will function much the same as a sitemap (in fact, Webmaster Tools allows you to submit a feed as a sitemap).
    Regarding sitemaps, see this question.
    Find ways to get external links to your content. This may accelerate the process of it getting indexed. If it's appropriate to the type of content, making it easy to share socially or through email will help with this.
    Provide an API to incentivize use of your data and external links to your data. You can have an attribution link as a requirement to the data use.
    Embrace the community. If you reach out to the right people in the right way, you'll get external links via blogs and Twitter.
    Look for ways to create a community around your data. Find a way to make it social. APIs, mashups, and social widgets all help, but so do a blog, community showcases, forums, and gaming mechanics (also, see this video).
    Prioritize which content you have indexed. With that much data, not all of it is going to be absolutely vital. Make a strategic decision as to what content is most important, e.g., it will be most popular, it has the best chance at ROI, it will be the most useful, etc. and make sure that that content is indexed first.
    Do a detailed analysis of what your competitor is doing to get their content indexed. Look at their site architecture, their navigation, their external links, etc.
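
As an illustration of the duplicate-content point in the list above (inconsistent URL parameters and slashes), here is a minimal sketch of URL canonicalization using Python's standard `urllib.parse`. The ignored-parameter list, the "no trailing slash" policy, and the example URLs are assumptions chosen for the example, not a recommendation from the answer. A real site would pick one policy and enforce it consistently, for instance with redirects or canonical link tags.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that do not change page content on this imaginary site.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}


def canonical_url(url: str) -> str:
    """Map URL variants that serve the same content onto one canonical form."""
    parts = urlsplit(url)
    # Scheme and host are case-insensitive, so lowercase them.
    scheme = parts.scheme.lower()
    host = parts.netloc.lower()
    # Collapse repeated slashes and drop any trailing slash (the root stays "/").
    segments = [s for s in parts.path.split("/") if s]
    path = "/" + "/".join(segments)
    # Drop ignored parameters and sort the rest so ordering differences vanish.
    params = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in IGNORED_PARAMS
    )
    # The fragment is discarded because it is never sent to the server.
    return urlunsplit((scheme, host, path, urlencode(params), ""))


def group_duplicates(urls: list[str]) -> dict[str, list[str]]:
    """Group URLs by canonical form; any group with more than one entry is a duplicate set."""
    groups: dict[str, list[str]] = {}
    for url in urls:
        groups.setdefault(canonical_url(url), []).append(url)
    return groups


if __name__ == "__main__":
    sample = [
        "http://example.com/records/123/",
        "HTTP://Example.com//records/123?utm_source=feed",
        "http://example.com/records/123?b=2&a=1",
        "http://example.com/records/123/?a=1&b=2",
    ]
    for canonical, variants in group_duplicates(sample).items():
        print(canonical, len(variants))
```

Running `group_duplicates` over a crawl log or URL export gives a quick list of pages that are reachable under more than one address.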

Finally, I should say this. SEO and indexing are only small parts of running a business site. Don't lose focus on ROI for the sake of SEO. Even if you have a lot of traffic from Google, it doesn't matter if you can't convert it. SEO is important, but it needs to be kept in perspective.
answered Sep 13, 2013 by rajesh
edited Sep 12, 2013

...