What are the effects of duplicate content?

Duplicate content

Duplicate content (also referred to as "double content" or "duplicated content") is a common problem with websites and can lead to poor rankings in search engines (e.g. Google and Bing), especially because identical pages or pages with largely similar content are indexed insufficiently.

Duplicate content refers to identical or very similar content that can be found under different URLs. A distinction is made between internal duplicate content (duplicated content on your own website) and external duplicate content (mostly duplicates on third-party websites). Finding duplicate content and avoiding it as far as possible is the basis of every OnPage optimization!


Duplicate content: causes & effects

Let's start with the most common question: what is duplicate content? "Duplicate content" or "double content" refers to identical or very similar page content that can be found via different URLs.

A distinction is made between:

  • Internal duplicate content: When the same content can be accessed (consciously or unconsciously) via several URL variants of a domain. For example, on overview pages for tags, filters and categories, paginations, internal search results pages or articles and pages that are assigned to several categories.
  • External duplicate content: When the same content can be found on different domains, for example on different (own) web projects or on language versions of your own website. Press releases distributed to third-party websites, as well as plagiarism through content theft and content scraping, can also lead to external duplicate content.

Although Google is now largely able to identify duplicate content itself and evaluate it correctly, it is still advisable to avoid duplicate content as far as possible and to help Google prioritize the content.

 

How does duplicate content affect ranking?

If you are concerned with the topic of search engine optimization, you will inevitably deal with duplicate content. Search engines, above all Google, prefer unique and high-quality content. At the top of the list is the searcher, who expects to receive the most suitable search hits for their query on the first page. To meet this high standard, Google uses considerable resources to crawl, analyze, evaluate and index millions of websites every day. If several pages deliver the same or almost identical content, it is no longer possible to clearly determine which of these pages has the highest relevance. As a result, the pages have to share the relevance, which can have a negative effect on the ranking. In the worst case, pages are not indexed at all because other pages that provide the same content are already in the index.

The more pages are affected by duplicate content, the more serious the effects on the ranking can be. In particular, external duplicate content can become a problem if Google can no longer clearly determine the origin.

 

What is Google doing with duplicate content?

Since Google cannot know in advance whether a page contains duplicate content, it is initially crawled and indexed as normal. Regardless of the fact that indexing duplicate content costs Google time and eats up unnecessary resources, it still checks whether the origin of the content can be determined.

If several identical or similar pieces of content appear only on your own site, Google determines whether it is a necessary duplication or a manipulative measure such as spam. Necessary duplications (e.g. repetitive legal information in a web shop) do not have a negative effect on the ranking. If, on the other hand, it is spammy keyword content, in the best case all affected pages simply lose relevance and sink into the endless expanses of the Goooooooooogle index. In the worst case, namely when it is obvious that duplicated content is being distributed permanently and deliberately, there is even the threat of a penalty and a reset in the ranking! The situation is similar if the content turns out to have been copied or "stolen" from another site: the result is an immediate downgrade in the ranking or, if it happens repeatedly, even a penalty.

 

How is duplicate content rated by Google?

Sometimes duplicate content cannot be avoided or is even an integral part of a project in which individual sections of text are repeated: for example, content that should be easily available to every user on every page, or legally relevant texts (terms & conditions) that have to be repeated for different offers.

This is exactly the topic Matt Cutts addresses in the following video, answering the questions: Is duplicate content really that bad? What consequences can it have for the ranking on Google?

[dsgvo-youtube url="https://www.youtube.com/watch?v=Vi-wkEeOKxM"][/dsgvo-youtube]

In a nutshell: [blockquote]Duplicate content is not fundamentally a bad thing. Only those who use duplicate content as a manipulation tool have to fear a penalty from Google.[/blockquote]

 

Common causes of duplicate content

Content is very often duplicated without the site operator really being aware of it. On the one hand, technical aspects can be the cause of duplicate content; on the other hand, content-related processes that are often taken for granted:

  • Your homepage can be reached with and without www
    Example: https://www.ihredomain.de & https://ihredomain.de
  • Your homepage can be reached via http and https
    Example: http://ihredomain.de & https://ihredomain.de
  • An old page is replaced by a new page
    The old page is still in the search engine's index and has possibly built up a good ranking, but the new page now offers the same content under a new permalink (same domain, but new link structure).
  • Domain transfer
    All pages can still be reached under the old domain and indexed by the search engine, and the new domain now delivers the same content under a completely new URL. When a domain is moved, the old URL should be redirected to the new URL immediately.
  • Categories, tags, archives, pagination
    One and the same page can be reached both directly and via different categories, tags, page numbers, etc. (example: /duplicate-content/, /seo/duplicate-content/, /2/duplicate-content/). This is a common problem with the popular tag clouds.
  • Print versions of page content
    If you give your users the option of printing out page content (be it a separate printable page or a PDF document), then Google will also find this version and possibly classify it as duplicate content.
  • Mobile website with identical content
    If you deliver a mobile website to users of mobile devices (smartphone, tablet), you should ensure that Google can recognize this. Google uses a special crawler / user agent for mobile websites or mobile search, so this should always be directed to the mobile version of the website:
    [dsgvo-youtube url="https://www.youtube.com/watch?v=mY9h3G8Lv4k"][/dsgvo-youtube]
  • Different language versions of a website
    Many online shops and websites of international companies are available in different languages, with the available content (products, services and descriptions) usually differing only slightly. As long as Google can determine the geographical orientation of the page, the ranking is not affected - because depending on the country, the appropriate language version of the website is always delivered as a search hit. In the case of international, multilingual pages, the geographical orientation of the page should therefore always be communicated (e.g. using the hreflang attribute). Matt Cutts explains how Google deals with duplicate content on different TLDs (top-level domains): [dsgvo-youtube url="https://www.youtube.com/watch?v=Ets7nHOV1Yo"][/dsgvo-youtube]
  • URL parameters and session IDs
    Are often used to track the origin and behavior of website visitors (?sid=82). These tracking parameters result in different URLs for the search engine that deliver the same content and thus duplicate content.
  • Pagination (numbering) of comments
    Many content management systems offer the option of splitting comments into several pages once a certain number of comments is reached. This creates new URLs for the search engine (?comments-1, &comments-2, ...) with duplicate content.
  • Multiple domains posting the same content
    It happens that page content is (deliberately) published on several websites. If no source is indicated on these websites with a corresponding link, the search engine can no longer recognize where the original comes from. These pages may then even rank better for the same content than your own website.
  • Upper and lower case in URLs
    Only lowercase letters should be used in URLs in order to avoid multiple indexing (e.g. /duplicate-content/ and /Duplicate-Content/).
  • Identical or very similar product descriptions
    A common problem with web shops that adopt manufacturer article descriptions or automatically place the same descriptions on many similar products (e.g. color variants). The content as well as the META description and the page title differ only minimally here. Especially when using affiliate feeds or manufacturer product descriptions, there is the problem that the same content is published on hundreds of e-commerce sites and you can no longer stand out - where is the added value supposed to come from? There is a very nice video from Google in which Matt Cutts explains the facts: [dsgvo-youtube url="https://www.youtube.com/watch?v=z07IfCtYbLw"][/dsgvo-youtube]

There are certainly other causes of duplicate content, but they are hardly significant. One point that I have not mentioned, but which I consider self-evident: never knowingly duplicate content in the mistaken belief that the page will then be found better! Exactly the opposite will be the case!

 

Duplicate content analysis

There are several ways to check your website for duplicate content. The easiest way is to use free SEO tools, which I present to you below:

 

Find internal duplicate content

Internal duplicate content occurs when identical or similar content is delivered within a domain. This problem occurs particularly frequently with content management systems such as WordPress, for example when pages or posts are assigned to different categories or are tagged.

The most common reasons for internal duplicate content are:

  • Website can be reached with and without www
  • The website can be accessed via http and https
  • Archive and Category Pages
  • Overview pages for filters or tags
  • internal search results pages
  • Pages or posts that are assigned to multiple categories and / or tags
  • Pagination (page numbering) e.g. of comments
  • URL parameters and session IDs
  • Print versions of page content
  • Identical or very similar product descriptions
  • Mobile website with identical content

Find internal duplicate content with Siteliner:

A very useful and free tool for finding internal duplicate content on your own website is Siteliner. With the help of Siteliner you can check your entire website for duplicate content and receive a detailed list of the individual pages with information on the degree of content overlap (matching words, percentage agreement of the page content, number of pages with similar content, relevance of the page for search engines).

Pages with a high level of content agreement should urgently be examined more closely and analyzed. Mostly these are categories or archive pages that are irrelevant for search engines and that can be safely excluded from indexing using “noindex, follow” in the META data.

 

Find external duplicate content

External duplicate content always arises when the same content can be found on different domains. These can be your own web projects, on which the same content is published in part, or third-party websites such as press portals, news services, forums, RSS or news aggregators, etc.

The most common reasons for external duplicate content are:

  • Web projects on which partly similar or identical content is published
  • Content theft ("stolen content")
  • Content scraping (scraped content)
  • Dissemination / publication of press releases
  • Adoption of manufacturer product or article descriptions
  • Publish your own content on news portals or in forums
  • Content import from newsletters or via RSS feeds

Find external duplicate content with Copyscape

Copyscape is the counterpart to Siteliner and specializes in searching for copies and plagiarism on the Internet. After entering your own URL, the tool scans its database and external data sources for similar or identical content and then outputs a text excerpt for the search hits together with the websites on which the duplicates were found.

 

Duplicate content analysis with Google search

Another option for duplicate content analysis is to query Google search for distinctive page content, since this finds both internal and external duplicate content. To do this, copy a text excerpt from your website (one that you know should not appear on any other website) and paste it into the Google search field with quotation marks at the beginning and at the end of the excerpt. The excerpt should not contain more than 32 words, as search queries on Google are limited to 32 words; anything longer is automatically truncated.
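A hypothetical example (both the phrase and the domain are placeholders): the exact-phrase search finds internal and external copies, and appending -site: excludes your own domain so that only external duplicates remain:

"identical or very similar content that can be found using different urls" -site:ihredomain.de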

Ideally, if no duplicate content is found, Google returns exactly one search hit: the page from which the excerpt was taken.

If you get more than one search hit, then there is duplicate content. You can easily recognize external duplicates by the fact that different websites (domains) are listed on which the same text excerpt was found. Internal duplicates, on the other hand, are usually filtered out by Google, which shows the following message instead of the additional search hits:

[blockquote source="DC notice in the Google search results"]In order to show you only the most relevant results, some entries that are very similar to the 2 hits displayed have been omitted. If necessary, you can repeat the search and include the omitted results.[/blockquote]

In such a case, have the supposed duplicate content displayed (repeat the search) and check whether it affects your own website or whether plagiarism of your page content appears on the Internet.

 

Avoid duplicate content

1.) Domain redirection with ModRewrite in .htaccess

With the following entry in the .htaccess file (in the root directory of your web server) you can redirect the domain without www to the domain with www:

RewriteEngine on
RewriteCond %{HTTP_HOST} ^beispiel\.de$ [NC]
RewriteRule ^(.*)$ https://www.beispiel.de/$1 [R=301,L]

Of course, the whole thing also works the other way around:

RewriteEngine on
RewriteCond %{HTTP_HOST} ^www\.beispiel\.de$ [NC]
RewriteRule ^(.*)$ https://beispiel.de/$1 [R=301,L]
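Since reachability via both http and https (see the list of causes above) also creates duplicate content, it usually makes sense to force https as well. A minimal sketch, assuming an SSL certificate is already installed; the rule redirects every http request to the same URL on https:

RewriteEngine on
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://%{HTTP_HOST}/$1 [R=301,L]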

 

2.) Permanent, server-side forwarding with 301 redirect via .htaccess

Permanent redirection should only be used if you want to redirect an old (no longer existing) file, or even an entire domain, to a new destination. The big advantage of a 301 redirect is that the PageRank is also transferred to the new target! The setting is again made via the .htaccess file in the root directory of your web server.

If you want to redirect a single file:

RedirectPermanent /seite-alt.html https://ihredomain.de/seite-neu.html

If you want to redirect an entire domain:

RedirectPermanent / https://domain-neu.de

 

3.) 301 Redirect with the header () function in PHP

As an alternative to permanent forwarding in the .htaccess, there is also the option of placing the following redirect directly in the HTML or PHP file:
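The following is a minimal sketch, assuming https://ihredomain.de/seite-neu.html is a placeholder for your own target page; the header() calls must be sent before any other output:

<?php
// Permanent (301) redirect to the new URL, then stop script execution.
header("HTTP/1.1 301 Moved Permanently");
header("Location: https://ihredomain.de/seite-neu.html");
exit;
?>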

 

4.) Canonical tags / rel=canonical

Canonical tags are a very good way to group different pages with very similar or identical content by telling the search engines which URL is the preferred or representative one. At this point I like to quote Google, who put it in a nutshell: "A canonical page is the preferred version of several pages with similar content."

The canonical tag is placed in the <head> area of the page as follows:
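A minimal example, with a placeholder URL standing in for the preferred (canonical) version of the page:

<link rel="canonical" href="https://ihredomain.de/duplicate-content/" />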

A very detailed explanation of canonical tags, as well as easy-to-understand examples of how best to use them, can be found directly at Google. For web shops or print versions of pages, rel=canonical is usually the best and most effective way to prevent duplicate content.

 

5.) noindex reference in the META tags

Via a noindex note in the META tags you can tell the search engines that the page should not be indexed. The rest is self-explanatory: what is not indexed cannot cause duplicate content.

The following entry ensures that the page is not indexed, but can still be crawled by the search engine without restriction (recommended):
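<meta name="robots" content="noindex, follow" />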

The nofollow attribute is used if the given page should neither be indexed nor should the links on it to other pages be followed:
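<meta name="robots" content="noindex, nofollow" />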

 

6.) Do not copy any content, do not duplicate pages, no text modules!

Be creative and take your time with the content of your page! Refrain from copying content from other sites and do not use identical or similar text modules. If a page is not yet finished, do without so-called "placeholders" or exclude it from indexing with "noindex" until it is complete. And worst of all: never simply take over the content of other pages; always include a source reference.
7.) Configure the URL parameters via the Google Search Console

The Google Search Console (formerly Webmaster Tools) contains an extensive collection of tools for webmasters and is the best (free) way to keep track of the indexing of your own website. Here you can, among other things, specify how the domain should be indexed (with or without www), see immediately if problems occur and define how various URL parameters should be handled. But be careful: incorrectly configured parameters can cause pages of your website to be removed from the Google index, so you should only use this tool when necessary.
8.) Define language version (rel="alternate" hreflang="x")

Different language versions of a website - especially in online shops with only marginal differences in products and descriptions - can be defined using the rel="alternate" hreflang="x" link attribute. This attribute is defined in the HTML header of the website as follows (German / English / Spanish):
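A minimal sketch, assuming the three language versions live at the placeholder URLs below; each version should reference all alternatives, including itself:

<link rel="alternate" hreflang="de" href="https://ihredomain.de/" />
<link rel="alternate" hreflang="en" href="https://ihredomain.de/en/" />
<link rel="alternate" hreflang="es" href="https://ihredomain.de/es/" />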

You can find more information about the hreflang attribute in Google's documentation on multi-regional and multilingual websites.

 

9.) Follow Google's recommendations

Google itself provides webmasters with various aids to avoid duplicate content. Among other things, Google recommends:

  • Use 301 redirects
  • Pay attention to the consistency of the internal linking
  • Syndicate carefully
  • Minimize recurring text modules
  • Avoid publishing placeholders
  • Analyze your content management system
  • Minimize similar content
  • Use the Google Search Console

Please refer to the Google Search Console Help for details

 

Which pages shouldn't be indexed?

Basically, everyone has to decide for themselves to what extent indexing of their own website should be allowed. However, it is advisable to exclude the following content, as it can be useful for navigation and interesting for the user, but not for search engines:

  • All pages that provide identical or very similar content.
    Example: socks-blue.html, socks-green.html, socks-yellow.html
  • Categories and tags that deliver the same content on several levels
    Example: .../socks/, .../socks/blue/, .../socks/blue/knit/
  • Page numbers and archives serving the same content
    Example: .../socks-blue.html, .../archive/1/socks-blue.html
  • Affiliate URLs / Session IDs / Tracking parameters
    Example: ?partner_id=123, ?session_id=123, ?tracking_id=123
  • Search filters or results pages with search parameters
    Example: ?ort=stuttgart, ?color=blue, ?ort=stuttgart&color=blue

The easiest way to exclude these pages from indexing is to define URL parameters in the Google Search Console and / or to specify noindex in the header.

 

What not to do to avoid duplicate content:

 

Don't forbid the search engines to crawl! For example, if you exclude pages or areas via robots.txt, you forbid the search engine from visiting these pages and getting an overview of them. Google, Bing & Co. are quite capable of correctly assessing the relevance of individual pages, and thus also of similar content, and they learn new things every day - but that won't work if you shut the door completely. Google in particular strongly advises against using robots.txt to avoid duplicate content.

That too "Remove URLs" tool in the Google Search Console is completely unsuitable for avoiding duplicate content, because it does not remove the cause but only leads to a temporary removal of the page from the search results.

Conclusion

The problem of duplicate content affects a lot of website operators, and I can only recommend that you study it in detail, as it can have a direct impact on your placement and visibility in search engines.

In general, you should be aware that duplicate content is not fundamentally bad - because Google is able to differentiate between good and bad (spammy) content. If you have followed my article carefully, you now know how to identify and eliminate the most common causes and where to start. I wish you success!