This morning on Twitter, someone asked how to prevent your web pages from being indexed by the search engines’ spiders. This is one of the few areas where standards have truly been adopted, so we have several options for doing this.
Why would I want to exclude my pages from search?
There are many reasons, really. I have created an online puzzle, for example, and I do not want people to find the levels through a search engine. Another example would be a private Wiki. If you haven’t considered the power of a Wiki to manage your life’s data, I’ll be writing an article on that shortly. Finally, there are times when certain pages of your site just shouldn’t be found. I recently made a transition from .html pages to .aspx pages on my blog, and I wanted to exclude the old HTML pages from the search engines.
The point is that there are plenty of reasons to hide a single page or an entire site from search engines. What’s great about this is that it’s SUPER easy.
So how do we do it?
There are actually three different ways, depending on what you’re trying to accomplish: a robots.txt file on your server, meta tags on your pages, and an additional rel attribute on your <a href=""> tags.
I’m starting with meta tags, the approach I feel is the best, and if you don’t like it, there are certainly alternatives below. Meta tags are supported by all of the major search engines, and give you page-by-page control. The robots meta tag can take four different states. Here they are:
<meta name="robots" content="noindex,nofollow" />
When you add this tag to your page, not only will the page be ignored by the search engines, but any links on the page will also not be followed. This is the tag I would recommend using on any page you want to be ignored.
<meta name="robots" content="noindex,follow" />
In this case, we’ve changed “nofollow” to “follow.” This means that while THIS page will be ignored, any pages this one links to will be checked out. However, if those other pages also have this meta tag on them, they will still be ignored.
<meta name="robots" content="index,nofollow" />
Now we’ve flipped the bit on the “noindex” value. This value will allow the search engines to index the page, but because we have “nofollow” specified, any pages it links to will not be found. (If other pages without this tag link to those pages, however, they WILL be found. You need to cover ALL of the possible paths to each page.)
<meta name="robots" content="index,follow" />
Finally, we have the complete opposite of what this article is about. These values will not only allow your page to be indexed, but will also let the spider follow all of the links on the page and try to index those pages as well. (Again, each page will be evaluated for meta tags as the spider finds it.)
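To make the four states concrete, here is a minimal Python sketch of how a spider might read the robots meta tag. The class and function names (RobotsMetaParser, robots_policy) are my own for illustration, and I'm assuming that a page with no robots meta tag is treated as index,follow. Real search engines have their own parsers, so treat this as an approximation of the idea rather than how any particular engine works.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for part in attr_map.get("content", "").split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_policy(html):
    """Return (may_index, may_follow); both default to True (assumption)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    may_index = "noindex" not in parser.directives
    may_follow = "nofollow" not in parser.directives
    return may_index, may_follow


if __name__ == "__main__":
    page = '<head><meta name="robots" content="noindex,follow" /></head>'
    print(robots_policy(page))
```

Feeding it each of the four tags above gives the four combinations of the two flags, which is really all the tag encodes.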
Nofollow links are a bit of an extension of the meta tag. Instead of telling the spider not to follow any of the links on a page, you’re able to tell it not to follow one specific link.
<a href="http://jeffblankenburg.com/secret.aspx" rel="nofollow">Jeff's Secret Page</a>
This option is technically the most flexible, because it allows you to manage what is found on a link-by-link basis. The downside is that you need to manage it on a link-by-link basis.
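As a rough illustration of the link-by-link idea, here is a small Python sketch that sorts the links on a page into ones a well-behaved spider would follow and ones marked rel="nofollow". The names LinkCollector and split_links are made up for this example, and the second link in the sample markup (public.aspx) is hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Separates <a href> links by whether they carry rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.followable = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        href = attr_map.get("href")
        if not href:
            return
        # rel can hold several space-separated tokens, e.g. "external nofollow".
        rel_tokens = attr_map.get("rel", "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(href)
        else:
            self.followable.append(href)


def split_links(html):
    """Return (followable_links, nofollow_links) for a chunk of HTML."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.followable, collector.nofollow


if __name__ == "__main__":
    sample = (
        '<a href="http://jeffblankenburg.com/secret.aspx" rel="nofollow">'
        "Jeff's Secret Page</a> "
        '<a href="/public.aspx">A hypothetical public page</a>'
    )
    print(split_links(sample))
```

Keep in mind this only covers the links on the one page you feed it. If any other page links to the same target without the attribute, the spider can still get there.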
The robots.txt file is the most all-encompassing option: it allows you to exclude your entire site with one simple file placed in the root of your server. Here’s an example robots.txt file.
The “user-agent” section of this file allows you to even specify WHICH search engines you want to exclude. In our example file, we’re letting everyone in, but not allowing them in two specific directories, and one specific page.
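A file matching that description might look something like the sketch below. The directory names, the page name, and the example.com host are all hypothetical, picked only for illustration. I've wrapped the file in a few lines of Python that use the standard library's urllib.robotparser, so you can check how a compliant spider would interpret the rules before you upload them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every spider is welcome, except in two
# directories and on one specific page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/
Disallow: /secret.aspx
"""


def build_parser(robots_txt):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, url, user_agent="*"):
    """True if a spider that honors robots.txt may fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for path in ("/index.aspx", "/private/notes.aspx", "/secret.aspx"):
        url = "http://example.com" + path
        print(path, is_allowed(rules, url))
```

To single out one search engine, you would add a separate block with that engine's user-agent name in place of the asterisk.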
The upside to this approach is that we can manage our search engine inclusions and exclusions in one centralized file. The downside is that if a search engine ignores your robots.txt file, your whole site will be indexed.
It is my opinion that you should use a combination of these methods if you really want to keep some of your pages from being found.
The guaranteed solution
It may seem simple, but many developers still use the idea of “security by obscurity.” Just because nobody knows your page is there does NOT mean that it can’t be found. If you need your content protected, the only guaranteed way to do that is by password protecting it. Create a wall, and make sure you don’t leave any holes. This will most certainly deter the spiders from finding your pages of content.
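How you build that wall depends on your server, but the simplest form is HTTP Basic authentication. Here is a minimal Python sketch of the check a server would run on each request. The function name is_authorized and the sample credentials are made up for this example, and in real life you would store a password hash rather than the password itself and serve everything over HTTPS.

```python
import base64
import hmac


def is_authorized(authorization_header, expected_user, expected_password):
    """Check an HTTP Basic 'Authorization' header against known credentials."""
    if not authorization_header:
        return False
    scheme, _, encoded = authorization_header.partition(" ")
    if scheme.lower() != "basic" or not encoded.strip():
        return False
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError:
        # Covers malformed base64 and bytes that are not valid UTF-8.
        return False
    user, separator, password = decoded.partition(":")
    if not separator:
        return False
    # compare_digest avoids leaking information through timing differences.
    user_ok = hmac.compare_digest(user.encode(), expected_user.encode())
    password_ok = hmac.compare_digest(password.encode(), expected_password.encode())
    return user_ok and password_ok


if __name__ == "__main__":
    token = base64.b64encode(b"jeff:s3cret").decode("ascii")
    print(is_authorized("Basic " + token, "jeff", "s3cret"))
```

A spider never sends that header, so the server turns it away before any content is served, no matter how it found the URL.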
Removing pages that have already been indexed
This is a trickier conundrum. Your pages are already showing up in search engines, and you need to get them “unlisted.” The good news? It’s relatively easy, and will happen automatically the next time the spiders visit if you used one of the methods listed above. The bad news? If you need your pages taken down ASAP, you have to go to each individual search engine. Here are the places to start:
Live Search – http://blogs.msdn.com/webmaster/archive/2009/01/28/removing-content-from-the-live-search-index.aspx
Google – http://www.google.com/support/webmasters/bin/answer.py?answer=61062
Yahoo – http://help.yahoo.com/l/us/yahoo/search/siteexplorer/delete/siteexplorer-46.html