Site Crawling Options

This tab of the project settings window contains options related to connection to a website's pages, its links and content analysis:

Website Crawl Options

Obey robots.txt

This simple text file contains the rules for search bots and other site crawlers to follow. You can find more details about the robots.txt protocol specifications at the official site.

This option allows crawling a site's pages just like a search bot does when indexing the site. If the option is checked (it is checked by default), the Site Visualizer bot won't try to access any forbidden URLs. Otherwise, the program's bot will crawl all the links that will be found within an HTML text of a page.

Allow Redirects

Defines whether to access an URL a link redirects to.

Go Outside of Start Directory

If the option is checked, the whole website will be crawled. Uncheck this option if you need to crawl particular directory of a website only. In this case, specify the Site URL on General tab including the directory to crawl and a trailing slash, e.g. ''. Specifying the Site URL without trailing slash (e.g., '') means that you want to crawl the root directory of only, omitting /dir/ and others directories.

Count Word Number

This feature is very helpful for SEO analysis of a website's pages. Content size is one of the most important ranking factors. If the option is checked, the number of words on every page will be counted and saved to the Word count field of the pages table. In other words, Site Visualizer will count the content length of every page in words. The rules of the counting algorithm are as follows:

  • Only the content enclosed in the <BODY> HTML tag is counted. Any text enclosed in other tags (<TITLE>, <META>, etc.) is omitted.
  • The text of scripts (enclosed in the <SCRIPT>...</SCRIPT> tag) won't be counted.
  • Any other character sequence (except for spaces, tabs, and line breaks) enclosed in any tags is considered as a word.

For instance:

<a alt="Some ALT text" title="Some Title">Only these 8 words will affect word count</a>

Include URLs

The feature allows to crawl certain URLs (or directories) only. This may be useful for extremely large websites crawling.

At the Crawling tab of the project settings window, click Include link. In the dialog box appears, type URLs (or part of a URL, such as directory, file name, etc.) that should be crawled – one per a line. Only links that contain at least one of the specified string in its URL address will be crawled.

Exclude URLs

The feature allows to define a list of URLs or directories that should be skipped while crawling. The behaviour is in a similar to the above inclusion URLs process. Links that have at least one of the specified strings in its URL address will not be crawled:

Do not crawl certain URLs

The home page of the website will be crawled in any case regardless of Include URLs and Exclude URLs lists.

Delete Parameters from URLs

Opens the dialog window that allows to specify parameters to be deleted from a URL string. In most of cases such parameters can be session IDs, category IDs, various preview modes, and many others that make a URL different, but it still references to the same page of the website. Removing of such parameters makes the sitemap cleaner and smaller that in turn simplifies site analyzing. And at last, this allows to submit to Google and other search engines URLs without such "garbage" parameters.

Type parameters you want to exclude from URLs in the dialog window appears. A line should contain single parameter (without '?', '&', or any other delimiters). For instance, the picture below contains set of parameters for a phpBB forum crawling:

Remove parameters from URL

Click OK button to save the parameters list to current project.

Crawl Secured Pages

On access to protected area, the program will ask for your credentials:

Crawl protected website

In the dialog box appears, specify the username and password. If the credentials are correct, Site Visualizer will crawl the secured pages.

Check Tag Types

These parameters allow you to define which tags should be parsed and added to the Links table. This is useful when you need to exclude, for example, occurrences of links to images, CSS files, or JavaScript files:

  • <IMG> – uncheck it if you need to exclude the <IMG> image tags from the site's HTML text parsing and prevent adding it to the Links table.
  • <LINK> – these tags in most cases are used for linking from a page HTML text to one or more Cascading Style Sheets (*.CSS) files. Like the <IMG> tag, the <LINK> tag is just a reference to a file, not a hyperlink to a page of a website. It doesn't transfer any "page weight," so this option can be usually unchecked in case of SEO analysis.
  • <SCRIPT> – this tag, just like <IMG> and <LINK>, is a reference to a JavaScript (*.JS) file. If you don't want to see such references in the Links table, keep this option unchecked.

Remain this option switched on to check links from a website you'll crawl to external sites. This allows you to find broken external links on your website.

Check Bookmarks

When turning this option on, links to the same page but with different bookmarks will be recognized as different as well. For instance, and would be recognized as links to two different pages, and would be added to the Pages table separately. Every of these bookmarks (#b1 and #b2) will be checked for presence on In case of fail, the corresponding Response column will contain #b1 Not Found message.

Thread Number

This parameter defines the number of parallel threads for crawling a website. The maximum value is 25, and the default is 5.

Please note that increasing this value does not always lead to increasing the site crawling speed. And the higher the thread number, the more system resources (RAM and CPU) are consumed by the Site Visualizer crawl bot, which might have a negative impact on the performance of other running applications. Try finding an optimal thread number value for your PC while working with the program.

Page Access Delay

This option is only available when the number of threads is decreased to 1, because only in this case the delay between URL requests has any sense. This feature allows you to avoid a possible ban by particular sites because of too frequent URL requests. A delay of 2 or 3 seconds should be enough.

Limit Crawl Depth

Allows restricting the depth of URLs crawling:

  • 0 - only the main page of a site will be crawled, e.g.
  • 1 - the main page and all pages of the first level will be crawled, e.g.,, etc.
  • And so on up to 9, which is the maximum page level.

URL Access Timeout

The connection speed to various websites, as well as to various pages within one website, is usually different. This feature allows you to define the maximum number of seconds for an URL access. Decreasing this value will make crawling faster, since no connection to "slow" pages will occur in the timeout, and a "Read timed out" message will appear. If you are sure that the connection to some pages or to a site is slow and you're ready to wait, increase this value.

User Agent

This option allows you to crawl a site on behalf of a search bot: Google bot, Yahoo! Slurp, etc. Or you can enter your custom user-agent string. By default, the Site Visualizer user agent is used.


See Also:

Visualize Website Structure

Visual Sitemap Parameters