There are some pages on sites that you don't necessarily want to have indexed. Some of them are easy to find because they fulfill important purposes on sites, but still shouldn't be crawled and indexed. Others can be considerably more difficult to find.
For example, on one site, I found widgets that appeared on only 28 pages but inflated the number of URLs for that site in Google's index from around 3,500 to over 90,000.
It was one of the first assignments I had at an agency I went to work for years ago. Looking at the URLs returned by a site search at Google [site:www.example.com], it seemed as if the site had an almost infinite number of URLs. The URLs weren't very search friendly, which made understanding the architecture of the site extremely difficult. Was this a bug, or was the site truly that large? The first part of SEO for a site is often a matter of discovery.
On some sites, there are pages you want to keep from being crawled and/or indexed. These can include "Email to a Friend" pages, "Write a Review" pages, "Compare Products" pages, and other pages that have little unique text on them, are unlikely to be linked to by most people, and will rarely if ever be shared socially or "emailed to a friend." Sometimes these pages show only one or two lines of unique information, and then something like a map or a calendar.
Sometimes there are pages that are near duplicates of other pages, with substantially similar content that differs only in how the content (often products) on the page is sorted or displayed:
- Best selling
- Alphabetic order a-z
- Alphabetic order z-a
- Price high-low
- Price low-high
- 15 items per page
- 30 items per page
- 50 items per page
- List display
- Grid display
Limiting which versions of those sorted (and filtered) pages get crawled and indexed, and using canonical link elements and pagination link elements wisely, can be helpful.
For example, when a product category spans multiple pages, you may want to limit crawling and indexing to one combination of sorting order and other parameters (using robots.txt disallows, robots noindex meta elements, hash URLs (#), JavaScript, option dropdowns, etc.). That one version might be the pages that display (1) best sellers, (2) in a grid display, (3) with 15 items per page, and it becomes the sorted version that gets crawled and indexed by search engines.
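As a rough illustration of picking one version, here is a minimal Python sketch. The parameter names (`sort`, `view`, `per_page`) and their preferred values are hypothetical stand-ins, since every content management system names these differently. It assumes the URL without any of these parameters shows the preferred version, so that URL is the one a canonical link element would point to.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names and preferred values; real sites will differ.
PREFERRED = {"sort": "best-selling", "view": "grid", "per_page": "15"}


def canonical_url(url: str) -> str:
    """Return the URL with every sorting/display parameter removed.

    Other parameters (for example a pagination parameter) are kept as-is.
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in PREFERRED
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def is_preferred_version(url: str) -> bool:
    """True when each sorting/display parameter is absent or set to its preferred value."""
    params = dict(parse_qsl(urlsplit(url).query, keep_blank_values=True))
    return all(params.get(key, value) == value for key, value in PREFERRED.items())
```

A page template could use `canonical_url()` to fill in the canonical link element, and `is_preferred_version()` to decide which variants need one of the other treatments listed above.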
Sometimes the way elements of a content management system work can cause issues beyond those sorting pages, or beyond pages that add features but shouldn't be indexed (such as email to a friend pages). For example, as I noted above, I once worked on a site that had approximately 3,500 URLs that I wanted indexed. I didn't know that when I started working on the site. All I knew was that Google estimated that there were about 90,000 URLs on the site.
Shortly after I started on the site, I tried to crawl it using a program intended to find broken links (Xenu Link Sleuth), and I noticed that there were times during the crawl when URLs just started to get really ugly. Some of the pages on the site had widgets on them that expanded or contracted sections of content on those pages.
Every time a section was expanded or contracted, the URL for the page changed. For example, if the page had 21 of these widgets on it, and the first three widgets were clicked to expand them, the URL would change to look like this:
If I then contracted the first widget and expanded the 15th widget, the URL might then look like:
Notice how the order of the different parameters can change, and the number of parameters can change as well. All of these can be listed in any order, and there was a very large number of variations of that URL on pages where there were 21 of these expansion/contraction widgets.
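To get a feel for how quickly this grows, here is a small Python sketch. It rests on a simplifying assumption that is mine, not something confirmed by the site: each expanded widget adds exactly one parameter to the URL. The `normalize()` helper shows one way to treat reordered parameters as the same URL when analyzing a crawl.

```python
from math import factorial
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def unordered_variations(widgets: int) -> int:
    """Each widget is either expanded or not, so there are 2**n possible states."""
    return 2 ** widgets


def ordered_variations(widgets: int) -> int:
    """Count every ordered selection of k expanded widgets, for k = 0..n.

    This is the number of distinct URLs if parameter order also varies.
    """
    return sum(
        factorial(widgets) // factorial(widgets - k) for k in range(widgets + 1)
    )


def normalize(url: str) -> str:
    """Sort the query parameters so that reordered URLs compare as equal."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))
```

Under that assumption, `unordered_variations(21)` is 2,097,152 for a single page before parameter order is even considered, and `ordered_variations(21)` is vastly larger. A crawler following every widget link on such a page would effectively never finish.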
I ended up spending about 15 hours running Xenu Link Sleuth on the site. I disallowed crawling of the pages that had those widgets on them, one at a time, to find that there were 28 pages on the site that did include those widgets with the changing URLs. If I kept the extra parameters from those pages from being crawled, I ended up with only 3,500 or so pages.
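I did this discovery by hand, but the same idea can be sketched in a few lines of Python, assuming you have a plain list of crawled URLs exported from a crawler. The function names and the threshold of 10 are illustrative choices, not part of any tool. The sketch groups URLs by path and flags paths that appear with many different query strings, which is the signature those widget pages left behind.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def query_variants_by_path(urls):
    """Map each path to the set of distinct query strings seen for it."""
    variants = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        if parts.query:
            variants[parts.path].add(parts.query)
    return variants


def suspicious_paths(urls, threshold=10):
    """Return (path, variant_count) pairs with at least `threshold` variants.

    Results are sorted with the most heavily inflated paths first.
    """
    counts = {
        path: len(queries) for path, queries in query_variants_by_path(urls).items()
    }
    flagged = [(path, n) for path, n in counts.items() if n >= threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

Paths that come back from `suspicious_paths()` are candidates for a closer look. Whether their parameters should be kept out of the crawl is still a judgment call for each site.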
Within a month or so, Google had removed many of those expanding and contracting URLs that appeared only because of the widgets. Instead of Google estimating 90,000 pages for the site in its index, it reported "about" 3,500.
Having a manageable number of URLs on the site, and understanding why the inflation of URLs had happened in the first place, made it a lot easier to manage and make changes to the remaining URLs and content on the site. It was like the sun was shining on the site.
One of the first steps that I take these days is to look for URLs of unnecessary pages, so that I can keep pages that don't provide value in a meaningful way from being indexed. Pages that allow you to email a friend the URL of a page, or that enable you to rate and review a specific product, are important, but there's usually no value in letting those pages get indexed by search engines.
Once unnecessary pages are kept out of the index, I can move on to the pages that should be indexed and make them even more valuable with added content, faster loading and rendering times, and other improvements. Those changes improve the experience of visitors on those pages, help meet their informational and situational needs, and help meet the objectives of the owners of the site.