
Avoid Duplicate Content with Robots text

How to avoid a dreaded Google duplicate content penalty is a hot topic amongst loyal users of Wordpress.

Because of the link structure of Wordpress, and as a blog continues to grow, search engines that spider a site will at times rank identical articles under two separate topics. Likewise, duplicate blog posts from, say, the Archives and Tag categories might end up listed at Google.

The YLSY Permalink Redirect plug-in is always activated in the Plug-Ins directory of all my WP blogs and has worked with great success to avoid duplicate content penalties.

This blog is nearly 4 years old and the only time duplicate articles appeared in the SERPs was when this plug-in was accidentally deactivated for 3 weeks.  Purely an oversight on my part.

Rough way to find out that the plug-in really works, eh!

Use Robots.txt File to Prevent Google Duplicate Content Penalty

Applying this permanent method will help keep duplicate blog posts from erroneously finding their way onto Google’s SERPs.

Using the Robots.txt file to exclude some of a Blog’s directories from Googlebot and other SE spiders effectively protects a Wordpress installation from having duplicate content indexed.

As the permalink is the ONLY link relating to a post that we DO want spidered and indexed, the directories listed below should be disallowed from the SE spiders’ crawl by including the block of code in your site’s robots.txt file. [If you don’t have one as yet, it is a simple text file created in Notepad, saved as robots.txt, and uploaded to your public_html or www folder.]

FYI - the first line in the code, User-agent: *, applies the rules to ALL search engine robots, so they are all blocked from crawling the directories disallowed in the list directly under it. To block or disallow an entire directory in the robots.txt file, be sure to add the forward slash after the name of the directory. Please note that URL paths are case sensitive.

User-agent: *
Disallow: /blog/archives/
Disallow: /blog/author/
Disallow: /blog/category/
Disallow: /blog/page/
Disallow: /blog/tag/
Disallow: /blog/wp-includes/
Disallow: /blog/search/
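Before uploading the file, it can help to check which URLs the rules above actually block. The following is only a minimal sketch using Python's standard urllib.robotparser module; the domain www.example.com and the sample paths are made-up placeholders, and a real crawler such as Googlebot may interpret rules slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The same rules as the robots.txt block in this post.
RULES = """\
User-agent: *
Disallow: /blog/archives/
Disallow: /blog/author/
Disallow: /blog/category/
Disallow: /blog/page/
Disallow: /blog/tag/
Disallow: /blog/wp-includes/
Disallow: /blog/search/
"""

# Hypothetical site address, used for illustration only.
SITE = "http://www.example.com"


def build_parser(rules_text):
    """Parse robots.txt text from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser


def is_crawlable(parser, url, agent="Googlebot"):
    """Return True if the given user agent may crawl the URL."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    parser = build_parser(RULES)
    samples = [
        "/blog/2008/05/sample-post/",  # made-up permalink
        "/blog/tag/seo/",              # tag archive
        "/blog/Tag/seo/",              # different case, so a different path
    ]
    for path in samples:
        status = "crawlable" if is_crawlable(parser, SITE + path) else "blocked"
        print(path, "->", status)
```

In this sketch the tag archive is reported as blocked, while the permalink and the differently cased /blog/Tag/ path stay crawlable, which is the case sensitivity mentioned above.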

I’d also include directories like “wp-includes” and “search” in the robots.txt file, as I’ve personally seen files from these directories indexed in Google’s search pages.

Following is a free tool offering a speedy way to get a listing of all your web pages listed at Google, to check for any duplicate or unwanted listings. Just extract the “directory” from such URLs and insert it into your robots.txt file as per the above examples.
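If the list of unwanted URLs is long, the "extract the directory" step can be scripted. This is a rough sketch only: it assumes the blog lives under /blog/ as in the example rules, that you have already picked out the unwanted URLs by hand, and the function name disallow_lines and the sample URLs are invented for illustration.

```python
from urllib.parse import urlparse


def disallow_lines(unwanted_urls, blog_root="/blog/"):
    """Build Disallow lines from the first directory under blog_root.

    URLs outside blog_root, and URLs that point at a file directly
    under blog_root (no directory after it), are skipped.
    """
    directories = set()
    for url in unwanted_urls:
        path = urlparse(url).path
        if not path.startswith(blog_root):
            continue
        remainder = path[len(blog_root):]
        first, slash, _ = remainder.partition("/")
        if first and slash:
            # Keep the trailing slash so the whole directory is blocked.
            directories.add(blog_root + first + "/")
    return ["Disallow: " + d for d in sorted(directories)]


if __name__ == "__main__":
    # Made-up examples of unwanted listings.
    unwanted = [
        "http://www.example.com/blog/tag/seo/",
        "http://www.example.com/blog/tag/google/",
        "http://www.example.com/blog/author/rox/",
    ]
    print("User-agent: *")
    for line in disallow_lines(unwanted):
        print(line)
```

Only feed it URLs you do NOT want indexed. If you pass in a permalink, the sketch will happily disallow that directory as well.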

Free Tool Checks Google PR of Internal Pages

Insert the URL of your blog into this handy free Google page rank tool that checks the page rank of internal pages… you’ll get a quick but comprehensive listing of all the internal pages of your blog indexed by Google.

The results will be a proverbial case of “do you want the good news or the bad”?

The unexpected healthy page rank of some of the internal pages is the bearer of good news…

But… the bad news comes in the form of the shock when faced with the unwanted directory pages Google has indexed from your Blog.

Thankfully though… the results offer a comprehensive listing of ALL the directories you will need to disallow via inclusion in the robots.txt file… [as applies to your Wordpress installation].

I have just started using the excellent SEO Wordpress plug-in Headspace2, which allows input of the “nofollow” and “noindex” commands in the actual head meta tags of individual blog pages or posts. This rounds out the protection that must be applied to a blog in a webmaster’s plan to avoid Google duplicate content penalties.
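To confirm that a page really carries those commands, you can look for the standard robots meta tag in its head section. The sketch below uses Python's built-in html.parser. It does not show how Headspace2 works internally, and the sample HTML and the names RobotsMetaParser and robots_directives are made up for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html):
    """Return the set of robots meta directives found in the HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    # Made-up page source for illustration.
    sample = (
        "<html><head><title>Tag archive</title>"
        '<meta name="robots" content="noindex, nofollow">'
        "</head><body></body></html>"
    )
    print(sorted(robots_directives(sample)))
```

Paste in the page source from your browser's View Source and check that the pages you want kept out of the index report noindex.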

NOTEWORTHY: I have temporarily lost the PR5 this blog used to enjoy, and had most of the blog’s internal pages completely de-ranked, as nearly 2 years’ worth of this blog’s earlier posts, originally hosted on Blogger, were deleted past the More link. So I have decided to bite the bullet and redesign the entire blog into the current Magazine theme, and at the same time manually recompose hundreds of broken posts.

I then intend to submit the blog’s Site Map to Google, along with a written explanation to Google webmaster support of the changes made to the blog. I also intend to ask them whether my change of web host to Hostgator around a year ago has anything to do with the lowered page rank…

Reason being…

You might recall that one of Google’s recent patent applications indicated they now monitor any change of web host and would penalize a blog if its web host provided web space/hosting to other dubious spam type sites, particularly on a shared IP address. [It’s certainly worth reading the SEO implications of that patent at Secrets to Google Top 10 Rankings]

[I have taken the precaution of paying for a separate IP address for this site - but, as the above referenced patent also revealed...  Google is not going to favourably view the changes I've made to a large number of pages as I fix missing text... in large batches at the same time, without a suitable written explanation.]

Please comment on your experiences with duplicate content. Have any of your blogs suffered as a result?

Categories: Google, Links, SEO, Wordpress

2 Responses to “ Avoid Duplicate Content with Robots text ”

  1. Maybe this is a question with an absurdly obvious answer, but . . .

    How do I know whether or not my blog is currently being penalized for duplicate content issues?

  2. Call me old fashioned, but it takes away credibility as an “authority” in your niche, when you omit your real name as signatory to a comment. Particularly with “nofollow” in use on most blogs. No self thinking webmaster will risk years of hard work, to have their blog penalized by letting comments through with hyperlinks to unrelated and questionable domains [This is no reflection on the credibility of the comments of webmaster appearing above].

    Most bloggers take the time to check every domain before a comment gets approved.

    Why? Here’s just a couple of reasons.

    The topic is still a “grey area” at Google… but many webmasters have confirmed their blogs have been penalized for choosing to link to questionable sites, included in Comment widgets and the Comments field under blog posts.

    Also… if you have the dubious honour of being the top contributor on a “most comments” widget and have used a domain name with different keyphrases as the link back on each of those comments, then Google might penalize you directly. [Which gets back to the point made: that no credible webmaster will bother to think up different keyphrases on comment fields, but will choose instead to sign off with their own name and get a live link to their domain. Honestly, it’s only a matter of time before a Google algorithm will start to heavily discount the relevance of inbound links from the comment fields of blogs.]

    Also… just like thousands of other bloggers, I am just so darn tired of spam comments, and glaring attempts to simply get a link back to a blog.

    @Spot cool stuff

    Posted your comment above as your question warrants an answer. Will post the reply shortly, but more than likely via a full post. The answer to how to recognize a Google duplicate content penalty is not an obvious one-line reply.

    cheers
    Rox
