This is a guest post from Robert Padgett, Associate Director of Technical SEO at Adapt Partners.
What is thin content?
Thin content can be defined as content with little or no added value. Pages that fall into this classification include pages with duplicate content, automatically generated content, affiliate content, scraped content, and doorway pages. These techniques don’t provide users with substantially unique or valuable content, and they violate Google’s Webmaster Guidelines.
Sometimes webmasters do not use spammy tactics to generate content, but their pages still fall into the thin content category because they do not answer users’ queries: the content isn’t relevant, it is too short, or it doesn’t match what the page title or meta description promises. When this is the case, your pages need more and better content. Several tools can help you find ideas and keywords to complement and improve your original content. Some of the tools I like using are Market Muse, Ahrefs and Semrush.
Here is a video of Matt Cutts explaining thin content back in 2013:
Sometimes Google will send a message to you via Google Search Console about a manual action due to detecting low-quality pages or shallow pages on your site, but other times your site can be affected by bad content and Google will not warn you about it.
The length of your content isn’t necessarily an indicator of thinness. Google can also judge the value of content with the “time to long click” metric, which measures how long a user stays on a page after finding it on a SERP before going back to the search engine and clicking on another result or making another search.
Below you can find a checklist of technical issues on your site that can cause a problem with thin or duplicate content.
Technical issues that cause thin / duplicate content
1. Faceted Navigation
- Problem: Faceted navigation allows users to filter and narrow content, and each combination of facets typically has at least one unique URL. As a result, faceted navigation can create duplicate content, eat up valuable crawl budget, and dilute link equity by passing link authority to URLs that we don’t want indexed.
- Solution: Use meta robots tags to “noindex” the URLs generated by the faceted navigation that show duplicate content. If this is not possible, you can implement canonical tags so that all of the URL variations point to the main canonical URL. You can also add “nofollow” to the internal links pointing to these faceted URLs to reduce the chances of them being crawled. To save crawl budget, you can also disallow these URLs in robots.txt, but keep in mind that Googlebot cannot read a noindex or canonical tag on a URL it is not allowed to crawl, so add the disallow only once the URLs have dropped out of the index.
There is a great article that goes into depth about this issue that you might want to read: https://moz.com/blog/large-site-seo-basics-faceted-navigation
2. Search result pages
- Problem: It is great to allow users to search for information on your site. However, it is a bad idea to allow these results pages to get indexed. They do not have unique content, only repurposed snippets of content from other pages on your site.
- Solution: Add a disallow line in the robots.txt file to keep bots from crawling search result pages. These pages should also contain a meta robots “noindex” tag.
This includes pages created by search pagination, search sorts or filters.
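Before deploying a disallow rule, it can help to check which URLs it would actually block. The sketch below uses Python's standard `urllib.robotparser` for that. The `/search` path and the sample URLs are assumptions for illustration, and Python's parser only does simple prefix matching, so it will not reproduce Google's wildcard handling.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the /search path is an assumption,
# use whatever path your internal search results live under.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def is_crawlable(url, user_agent="Googlebot"):
    """Return True if the rules above allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.domain.com/search?q=shoes",
        "http://www.domain.com/page-slug",
    ):
        print(url, "->", "crawlable" if is_crawlable(url) else "blocked")
```

Because the match is by prefix, a rule like this would also block a path such as /searchlight, so it is worth testing a few legitimate URLs as well.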
3. URL Parameters (filters, order, etc)
- Problem: Parameters can be used for filtering, narrowing and ordering content on a page. They are added at the end of the canonical URL and look something like http://www.domain.com/page-slug?dir=asc&order=price, so each combination of parameters creates a new URL that shows largely the same content.
- Solution: Add a canonical tag that points to the canonical URL. This tag tells Google that all of these parameter variations are the same page, so the variations should not be indexed. It might also be beneficial to disallow crawling of commonly used URL parameters via the robots.txt file, in order to maximize crawl budget.
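As a rough illustration of how the canonical URL can be derived from a parameterized one, here is a minimal Python sketch. The `IGNORED_PARAMS` set is an assumption based on the `dir` and `order` parameters in the example URL above; on a real site you would list the parameters that only sort or filter and leave out any that change the content itself.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these parameters only reorder the page, they never change it.
IGNORED_PARAMS = {"dir", "order"}


def canonical_url(url):
    """Strip sort/filter parameters and the fragment from url."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_tag(url):
    """Build the canonical link tag to place in the page head."""
    return '<link rel="canonical" href="{}" />'.format(canonical_url(url))


if __name__ == "__main__":
    print(canonical_tag("http://www.domain.com/page-slug?dir=asc&order=price"))
```

Every parameter variation of the page would then output the same tag, which is what lets Google consolidate them.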
4. Photo / Video Gallery
- Problem: Sometimes the way photo and video galleries are structured can create thin content. For example, creating one page for each photo without any text around the photo.
- Solution: Add content that complements the photo on each of those pages. This can be a really complicated task if you have thousands of photos on your site, each on a different URL. Changing the structure of your gallery and using carousels to display photos on the same URL can be an easier solution.
5. Www vs non-www URLs
- Problem: Having several variations of the same URL can cause duplicate content issues, so it is critical that only one version of the URL is chosen for every page on your website. For example: http://domain.com vs http://www.domain.com
- Solution: 301-redirect the non-preferred version to the canonical version.
6. Uppercase vs lowercase URLs
- Problem: As above, allowing URLs that contain capital letters to respond with a 200 status can create duplicate content issues. For example: http://www.domain.com/Capital/Letters vs http://www.domain.com/capital/letters
- Solution: 301-redirect the capital letter version to the lowercase version.
7. Trailing slash vs no trailing slash
- Problem: This is another item on the list that can cause multiple versions of a URL. When a page renders both with and without a trailing slash, search engines treat the two as different URLs, which causes duplicate content.
- Solution: 301-redirect the non-preferred version to the canonical version.
8. HTTP vs HTTPS
- Problem: With Google encouraging sites to be secure, HTTP vs HTTPS URLs have become a common cause of duplicate content.
- Solution: Add a redirect rule in your server that 301-redirects every HTTP version to the HTTPS canonical version. It is also important to update every internal link on the site that points to one of the HTTP versions, so that Googlebot does not crawl those 301s every time it visits the site. Furthermore, internal HTTP references to resources such as images and scripts can cause browsers to display a non-secure (mixed content) warning.
9. Index.htm, default.asp, etc
- Problem: Sometimes homepages or category pages resolve at both the root URL and the actual file name. For example: www.domain.com vs www.domain.com/index.htm or www.domain.com vs www.domain.com/default.asp
- Solution: Add a redirect rule in your server that 301-redirects the index or default pages to the root.
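Items 5 through 9 all come down to the same idea: pick one form of each URL and 301 everything else to it. The Python sketch below models that decision so you can test a list of URLs against your rules before writing the actual server configuration. The preferences it encodes (HTTPS, the www host, lowercase paths, no trailing slash) and the `redirect_target` helper are assumptions for illustration, not a recommendation for every site; lowercasing paths in particular is only safe if your server treats them case-insensitively.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed preferences for this sketch; adjust to your own canonical form.
BARE_HOST = "domain.com"
CANONICAL_HOST = "www.domain.com"
DEFAULT_DOCS = ("index.htm", "index.html", "default.asp")


def redirect_target(url):
    """Return the URL to 301 to, or None if url is already canonical."""
    parts = urlsplit(url)

    # Item 5: www vs non-www.
    host = parts.netloc.lower()
    if host == BARE_HOST:
        host = CANONICAL_HOST

    # Item 6: uppercase vs lowercase paths.
    path = parts.path.lower()

    # Item 9: index.htm, default.asp and similar default documents.
    for doc in DEFAULT_DOCS:
        if path.endswith("/" + doc):
            path = path[: -len(doc)]
            break

    # Item 7: trailing slash (the root itself stays "/").
    path = path.rstrip("/") or "/"

    # Item 8: always HTTPS. The query string is kept, the fragment dropped.
    canonical = urlunsplit(("https", host, path, parts.query, ""))
    return None if canonical == url else canonical


if __name__ == "__main__":
    for url in (
        "http://domain.com",
        "http://www.domain.com/Capital/Letters",
        "https://www.domain.com/capital/letters/",
        "https://www.domain.com/index.htm",
        "https://www.domain.com/capital/letters",
    ):
        print(url, "->", redirect_target(url) or "no redirect needed")
```

Resolving all five rules in one step also avoids redirect chains, where a URL hops through several 301s before reaching its final form.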
10. Session IDs
- Problem: Session IDs are used to track user behavior, and some systems fall back to passing session IDs in URLs as parameters, which creates a new URL for the same page with every session.
- Solution: As explained above, duplicate content issues with parameters can be fixed by including a canonical tag on the page. It is also important to disallow crawling of session ID URLs via robots.txt so that Google does not spend crawl budget on them.
11. Shopping Cart Pages
- Problem: Some e-commerce systems might not block shopping cart pages from getting indexed. Shopping cart pages list products that a user has selected for purchase, and these products are already listed elsewhere on the site.
- Solution: Keep these pages out of the index with a meta robots “noindex” tag, and block them from crawling via robots.txt once they are no longer indexed.
12. Thin Category Pages
- Problem: Some sites, like blogs or e-commerce sites, can get caught up in categorization. Managers can create so many categories that some of the category pages are left with only two or three products or posts.
- Solution: Make sure that you have only as many category pages as you need, and that each of them has enough content to provide value to users.
13. Product Review Pages
- Problem: Some e-commerce CMSs create separate URLs for product reviews, so reviews appear in two places on the site: below the product and on a separate URL. If a lot of your products have reviews, you’ll have a lot of extra URLs for Google to crawl that do not provide any value by themselves.
- Solution: Block these review pages via robots.txt and add a noindex meta tag.
14. Dev Subdomains or Site Copies
- Problem: It is normal to have a copy of the site where developers test changes before implementing them on the production site. If Google can crawl and index that copy, it will cause duplicate content issues.
- Solution: Block via robots.txt and also add a meta noindex tag.
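Several of the fixes above rely on a meta robots “noindex” tag, and it is easy to assume the tag is in place when a template change has quietly removed it. As a minimal sketch, the Python below checks a page's HTML for the directive using only the standard library. The `has_noindex` helper and the sample markup are hypothetical; in practice you would feed it the HTML you fetched from the page.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def has_noindex(html):
    """Return True if the page asks search engines not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    sample = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
    print(has_noindex(sample))
```

Note that this only inspects the HTML. A noindex can also be sent as an X-Robots-Tag HTTP header, which this sketch does not look at.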
15. Comment Pagination
- Problem: Some CMSs, like WordPress, provide an option to allow comment pagination. This leads to new URLs being created for each page of comments, all showing the same article.
- Solution: If your CMS offers this option, leave it turned off, or make sure each page in the pagination series includes a canonical tag pointing to the main article’s URL.
16. Mobile sites — e.g., m.example.com and www.example.com
- Problem: Having a version of your site for mobile users on a subdomain can cause duplicate content issues if it’s not set up properly.
- Solution: Add canonical tags on the mobile version of the site that point back to the desktop version. Also include rel="alternate" tags on the desktop version that point to the mobile version of the site.
Read more about how to properly set up mobile sites on a different URL here.
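To make the two-way annotation concrete, here is a small Python sketch that builds the tag for each side. The `media` value follows the pattern used in Google's documentation for separate mobile URLs, but the 640px breakpoint, the function names and the example URLs are assumptions you should adapt to your own site.

```python
from html import escape

# Assumption: the breakpoint at which your site serves the mobile version.
MOBILE_MEDIA = "only screen and (max-width: 640px)"


def desktop_page_tag(mobile_url):
    """Tag for the desktop page, pointing at its mobile equivalent."""
    return '<link rel="alternate" media="{}" href="{}" />'.format(
        MOBILE_MEDIA, escape(mobile_url, quote=True)
    )


def mobile_page_tag(desktop_url):
    """Tag for the mobile page, pointing back at the desktop page."""
    return '<link rel="canonical" href="{}" />'.format(escape(desktop_url, quote=True))


if __name__ == "__main__":
    print(desktop_page_tag("http://m.example.com/page-1"))
    print(mobile_page_tag("http://www.example.com/page-1"))
```

The annotation has to be page to page: each desktop URL points at its own mobile equivalent, not at the mobile homepage.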
17. International sites without correct geo-targeting
- Problem: These duplicates occur when you have similar or same content on the site targeting different locations, like US and UK.
- Solution: Use hreflang for language and regional URLs:
<link rel="alternate" href="http://example.com/en-ie" hreflang="en-ie" />
<link rel="alternate" href="http://example.com/en-ca" hreflang="en-ca" />
<link rel="alternate" href="http://example.com/en-au" hreflang="en-au" />
<link rel="alternate" href="http://example.com/en" hreflang="en" />
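Writing these tags by hand gets error-prone as the number of regions grows, so a small generator can help. The Python sketch below builds the set from a mapping of hreflang codes to URLs, using the four example URLs above. The `hreflang_tags` helper is hypothetical. The same full set, including the entry for the page itself, should go on every page in the group.

```python
from html import escape

# The language/region versions from the example above.
ALTERNATES = {
    "en-ie": "http://example.com/en-ie",
    "en-ca": "http://example.com/en-ca",
    "en-au": "http://example.com/en-au",
    "en": "http://example.com/en",
}


def hreflang_tags(alternates):
    """Return one <link rel="alternate"> tag per language/region version."""
    return [
        '<link rel="alternate" href="{}" hreflang="{}" />'.format(
            escape(url, quote=True), code
        )
        for code, url in alternates.items()
    ]


if __name__ == "__main__":
    print("\n".join(hreflang_tags(ALTERNATES)))
```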
18. AMP pages
- Problem: If AMP pages are implemented incorrectly, they can generate duplicate content. An AMP page is a stripped-down form of HTML that allows the page to load faster for mobile users.
- Solution: Your regular page should include a rel="amphtml" link tag pointing to the AMP version. The AMP version should include a canonical tag pointing back to the regular page.
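The pairing only works if both directions are present, so it is worth checking them together. As a rough sketch, the Python below reads the link tags from both pages and confirms that each one points at the other. The URLs, the sample markup and the `amp_pair_is_valid` helper are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page pair used for illustration.
PAGE_URL = "https://www.example.com/article"
AMP_URL = "https://www.example.com/article/amp"
PAGE_HTML = '<head><link rel="amphtml" href="https://www.example.com/article/amp"></head>'
AMP_HTML = '<head><link rel="canonical" href="https://www.example.com/article"></head>'


class LinkCollector(HTMLParser):
    """Map each <link> tag's rel value to its href."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "link" and attr.get("rel") and attr.get("href"):
            self.links[attr["rel"].lower()] = attr["href"]


def link_targets(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def amp_pair_is_valid(page_url, page_html, amp_url, amp_html):
    """True if the page links to its AMP version and the AMP page links back."""
    points_to_amp = link_targets(page_html).get("amphtml") == amp_url
    points_back = link_targets(amp_html).get("canonical") == page_url
    return points_to_amp and points_back


if __name__ == "__main__":
    print(amp_pair_is_valid(PAGE_URL, PAGE_HTML, AMP_URL, AMP_HTML))
```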
19. CMS-generated multiple URLs
- Problem: Some CMSs can create archive URLs that should not be indexed, since they do not show any original content.
- Solution: Make sure these archive URLs are not crawled or indexed by Google by adding a disallow rule to the robots.txt file and a meta robots “noindex” tag.
20. Print pages
- Problem: If your site includes print-friendly pages that are linked from the article pages, they can cause duplicate content issues.
- Solution: Block print-friendly URLs with a robots meta tag and robots.txt. If print pages are created using parameters, they can be handled with canonical tags instead of meta robots tags.
21. Product Variations
- Problem: Some e-commerce systems will create different URLs for product variations, such as colors, sizes or quantity. This generates a lot of duplicate content, since only minor details differ between the pages. It is also not user friendly, since it would be better to show a user all of the product variations on the same page.
- Solution: Pick one of the product variations and make it the canonical. Add a rel canonical tag on the other product variations that points to that main page. Also, try to show the user on the canonical page that there are other options to choose from.
Tools That Can Help You Identify These Issues
Google Search Console
The new Google Search Console provides more data about how the pages on your site are getting indexed, and it also explains why some pages have been excluded. In the images below, you can see how the new Google Search Console displays this data:
Pages not being indexed by Google can be a signal of duplicate or thin content on the site. However, since Google has not rolled out this new dashboard for every site, you might need to use other tools like the ones listed below.
In Google Search Console, you can also use some of the features to accelerate results:
- Google URL Removal – You can ask Google to remove URLs from the search results after you’ve blocked them.
- URL Parameters Function – You can tell Google how to treat the different URL parameters on your site, which helps optimize crawl budget.
Siteliner and Copyscape
Siteliner is a great tool for finding issues of internal duplicate content. The free Siteliner service is limited to monthly analyses of sites up to 250 pages. The Siteliner Premium service allows you to analyze websites up to 25,000 pages, with no limitations on how often analyses are run. Simply enter your domain name, and Siteliner begins crawling your site by following internal links. It rapidly retrieves and analyzes your content, generating a live report as the crawl proceeds.
Copyscape is a tool from the same company as Siteliner. It checks for duplicate content across different domains, so you can easily find out if your content has been scraped and posted on another site.
Screaming Frog can also help you find pages with thin and duplicate content. There are different data points that you need to consider:
- Word Count: Sort the ‘Word Count’ column from low to high to find pages with low text content.
- Organic Traffic: Connect the crawl with Google Analytics and find pages that are not getting any organic visits. Some pages might have a lot of text but are not getting any organic traffic. This could be an indication of Google qualifying the content as thin.
- Duplicate Titles: Check the duplicate titles report in the crawl overview panel.
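If you export the crawl to CSV, the first and last of these checks can be scripted. The Python sketch below is one way to do it. The column names (`Address`, `Title 1`, `Word Count`) are assumed to match your export and should be checked against the actual file, the sample rows are made up, and the 100-word threshold is an arbitrary starting point rather than a Google rule.

```python
import csv
import io
from collections import defaultdict

# Made-up sample standing in for a crawl export.
SAMPLE_EXPORT = """\
Address,Title 1,Word Count
https://www.example.com/,Home,850
https://www.example.com/gallery/1,Gallery,12
https://www.example.com/gallery/2,Gallery,9
https://www.example.com/guide,Guide,1400
"""


def load_rows(csv_text):
    """Parse the export into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def low_word_count(rows, threshold=100):
    """Pages under the threshold, lowest word count first."""
    thin = [row for row in rows if int(row["Word Count"]) < threshold]
    return sorted(thin, key=lambda row: int(row["Word Count"]))


def duplicate_titles(rows):
    """Map each title used on more than one page to the URLs that share it."""
    by_title = defaultdict(list)
    for row in rows:
        by_title[row["Title 1"]].append(row["Address"])
    return {title: urls for title, urls in by_title.items() if len(urls) > 1}


if __name__ == "__main__":
    rows = load_rows(SAMPLE_EXPORT)
    for row in low_word_count(rows):
        print(row["Word Count"], row["Address"])
    print(duplicate_titles(rows))
```

To run it on a real export, read the file's contents and pass them to `load_rows` in place of `SAMPLE_EXPORT`.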
Reviewing your site and making sure you don’t have any of these issues can help you diagnose if your site’s rankings are affected by having pages that do not provide any value to users.
If you’ve received a message from Google about a manual action due to thin content, request reconsideration of your site once you’re sure that it is no longer in violation of Google’s guidelines. After you’ve submitted a reconsideration request, be patient and watch for a message in your Search Console account. Google will let you know when they’ve reviewed your site and whether they will revoke the manual action.
[Featured image by Dom J from Pexels]
Robert Padgett has been working in the search marketing industry since 2009. Robert’s current role is Associate Director of Technical SEO at Adapt Partners, where he focuses on analyzing sites to find technical issues that impact our customers’ website performance, crawling, CTR, and conversions.
Prior to joining Adapt, Robert worked at two previous online marketing agencies: one with offices in Spain and the second located in the United States.
Post from State of Digital Guest Contributor
Original post: Thin Content: what is it, what can cause it, and how do you find it?