Site Crawling Options

This tab of the project settings window contains options that control how the program connects to a website's pages and how its links and content are analyzed:

Website Crawl Options

Obey robots.txt

This simple text file contains the rules that search bots and other site crawlers are expected to follow. You can find more details about the robots.txt protocol specification on its official site.

This option allows crawling a site's pages just like a search bot does when indexing the site. If the option is checked (it is checked by default), the Site Visualizer bot won't try to access any forbidden URLs. Otherwise, the program's bot will crawl every link found in a page's HTML text.

Store blocked URLs

This option allows you to get the list of URLs disallowed by the robots.txt file. To do this, turn the option on and start crawling. After the process is complete, run the Blocked by robots.txt report.
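
As an illustration of what obeying robots.txt means, here is a minimal Python sketch (not the program's own code) that checks URLs against a site's robots.txt and keeps a list of the blocked ones; the example.com URLs are placeholders:

  from urllib import robotparser

  rp = robotparser.RobotFileParser("http://example.com/robots.txt")
  rp.read()  # download and parse the robots.txt rules

  blocked = []  # plays the role of the "Store blocked URLs" list
  for url in ["http://example.com/", "http://example.com/admin/secret.html"]:
      if rp.can_fetch("*", url):   # "*" stands for any user agent
          print("allowed:", url)
      else:
          blocked.append(url)      # would appear in the Blocked by robots.txt report

  print("blocked:", blocked)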

Allow Redirects

Defines whether to access the URL a link redirects to.
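
For illustration, a small Python sketch (placeholder URL, not the program's implementation) of what it means not to follow a redirect: the redirect target is read from the Location header instead of being fetched:

  import urllib.error, urllib.request

  class NoRedirect(urllib.request.HTTPRedirectHandler):
      def redirect_request(self, req, fp, code, msg, headers, newurl):
          return None   # returning None tells urllib not to follow the redirect

  opener = urllib.request.build_opener(NoRedirect)
  try:
      opener.open("http://example.com/old-page", timeout=10)
  except urllib.error.HTTPError as e:
      if e.code in (301, 302, 303, 307, 308):
          print("redirects to:", e.headers.get("Location"))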

Go Outside of Start Directory

If the option is checked, the whole website will be crawled. Uncheck this option if you need to crawl only a particular directory of the website. In this case, specify the Site URL on the General tab including the directory to crawl and a trailing slash, e.g. 'http://example.com/dir/'. Specifying the Site URL without a trailing slash (e.g. 'http://example.com/page') means that you want to crawl only the root directory of http://example.com/, omitting /dir/ and other directories.
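
For illustration, a small Python check (placeholder URLs, not the program's code) of what staying inside the start directory means, assuming the trailing slash is present as described above:

  from urllib.parse import urlsplit

  def within_start_directory(url, start_url="http://example.com/dir/"):
      # A URL stays inside when it is on the same host and its path
      # begins with the start directory's path.
      base, target = urlsplit(start_url), urlsplit(url)
      return target.netloc == base.netloc and target.path.startswith(base.path)

  print(within_start_directory("http://example.com/dir/page.html"))    # True
  print(within_start_directory("http://example.com/other/page.html"))  # False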

Count Word Number

This feature is very helpful for SEO analysis of a website's pages, since content length is considered an important ranking factor. If the option is checked, the number of words on every page will be counted and saved to the Word count field of the pages table. In other words, Site Visualizer will count the content length of every page in words. The counting algorithm follows these rules:

  • Only the content enclosed in the <BODY> HTML tag is counted. Any text enclosed in other tags (<TITLE>, <META>, etc.) is omitted.
  • The text of scripts (enclosed in the <SCRIPT>...</SCRIPT> tag) won't be counted.
  • Any other character sequence (except for spaces, tabs, and line breaks) enclosed in any tag is considered a word.

For instance:

<a alt="Some ALT text" title="Some Title">Only these 8 words will affect word count</a>

Include URLs

This feature allows you to crawl only certain URLs (or directories). This may be useful when crawling extremely large websites.

On the Crawling tab of the project settings window, click the Include link. In the dialog box that appears, type the URLs (or parts of a URL, such as a directory or file name) that should be crawled, one per line. Only links whose URL contains at least one of the specified strings will be crawled.

Exclude URLs

This feature allows you to define a list of URLs or directories that should be skipped while crawling. It works similarly to the Include URLs option described above: links whose URL contains at least one of the specified strings will not be crawled:

Do not crawl certain URLs

The home page of the website is always crawled, regardless of the Include URLs and Exclude URLs lists.
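
Both lists amount to plain substring matching against each link's URL, as in the Python sketch below (an illustration only; how the program combines the two lists, beyond the home-page exception above, is an assumption here):

  def should_crawl(url, include=(), exclude=()):
      # Exclusions win; an empty include list lets every URL pass the include test.
      if any(s in url for s in exclude):
          return False
      return not include or any(s in url for s in include)

  print(should_crawl("http://example.com/blog/post-1", include=["/blog/"]))  # True
  print(should_crawl("http://example.com/tag/news", exclude=["/tag/"]))      # False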

Delete Parameters from URLs

Opens a dialog window that allows you to specify parameters to be deleted from URL strings. In most cases such parameters are session IDs, category IDs, preview modes, and other values that make URLs differ while still referring to the same page of the website. Removing such parameters makes the sitemap cleaner and smaller, which in turn simplifies site analysis. Finally, it allows you to submit URLs without such "garbage" parameters to Google and other search engines.

In the dialog window that appears, type the parameters you want to exclude from URLs. Each line should contain a single parameter (without '?', '&', or any other delimiters). For instance, the picture below shows a set of parameters for crawling a phpBB forum:

Remove parameters from URL

Click the OK button to save the parameter list to the current project.
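
In terms of URL handling, removing such parameters looks like the Python sketch below (illustrative only; the parameter names 'sid' and 'view' are examples, not a definitive phpBB list):

  from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

  def strip_parameters(url, drop=("sid", "view")):
      parts = urlsplit(url)
      kept = [(name, value)
              for name, value in parse_qsl(parts.query, keep_blank_values=True)
              if name not in drop]
      return urlunsplit(parts._replace(query=urlencode(kept)))

  print(strip_parameters("http://example.com/viewtopic.php?t=12&sid=abc123"))
  # http://example.com/viewtopic.php?t=12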

Crawl Secured Pages

This feature allows you to crawl pages of a website protected with a login and password. In the dialog box that appears, type the URL of the login page and click the Connect link:

Crawl protected website

At the second step, specify your login (username) and password, then click the Test link. Your credentials will be sent to the login page you've specified, and the resulting page will open in your browser. If the credentials are correct and logging in to the website was successful, the page in your browser will contain the secured content (or a message like 'You've been successfully logged in').

Click the Save button and start crawling. Once the spider reaches the login page, it will use your credentials to log in to the secured area and continue crawling.
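
Conceptually, this is ordinary form-based authentication with a session cookie, roughly as in the Python sketch below; the login URL and the 'username'/'password' form field names are hypothetical and depend on the actual login page:

  import urllib.parse, urllib.request
  from http.cookiejar import CookieJar

  login_url = "http://example.com/login.php"            # hypothetical login page
  form = urllib.parse.urlencode({"username": "user",    # field names depend on the site
                                 "password": "secret"}).encode()

  opener = urllib.request.build_opener(
      urllib.request.HTTPCookieProcessor(CookieJar()))  # keeps the session cookie
  opener.open(login_url, data=form, timeout=10)

  # Later requests through the same opener send the cookie, so pages in the
  # secured area can be fetched like any other page.
  page = opener.open("http://example.com/members/", timeout=10).read()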

Check Tag Types

These parameters allow you to define which tags should be parsed and added to the Links table. This is useful when you need to exclude, for example, occurrences of links to images, CSS files, or JavaScript files:

  • <IMG> – uncheck it if you need to exclude <IMG> image tags from parsing of the site's HTML text and prevent them from being added to the Links table.
  • <LINK> – these tags are in most cases used for linking from a page's HTML text to one or more Cascading Style Sheets (*.CSS) files. Like the <IMG> tag, the <LINK> tag is just a reference to a file, not a hyperlink to a page of the website. It doesn't transfer any "page weight," so this option can usually be unchecked for SEO analysis.
  • <SCRIPT> – this tag, just like <IMG> and <LINK>, is a reference to a JavaScript (*.JS) file. If you don't want to see such references in the Links table, keep this option unchecked.
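
For illustration, unchecking a tag type corresponds to dropping it from the set of parsed tags, roughly as in this Python sketch (the tag-to-attribute mapping is an assumption made for the example, not the program's internals):

  from html.parser import HTMLParser

  # Tags treated as links and the attribute that holds the URL;
  # unchecking <IMG>, <LINK>, or <SCRIPT> amounts to removing its entry here.
  CHECKED_TAGS = {"a": "href", "img": "src", "link": "href", "script": "src"}

  class LinkCollector(HTMLParser):
      def __init__(self):
          super().__init__()
          self.links = []

      def handle_starttag(self, tag, attrs):
          wanted = CHECKED_TAGS.get(tag)
          for name, value in attrs:
              if name == wanted and value:
                  self.links.append((tag, value))

  collector = LinkCollector()
  collector.feed('<a href="/page">text</a><img src="/logo.png">'
                 '<script src="/app.js"></script>')
  print(collector.links)  # [('a', '/page'), ('img', '/logo.png'), ('script', '/app.js')]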

Check External Links

Keep this option switched on to check links pointing from the website you crawl to external sites. This allows you to find broken external links on your website.

Thread Number

This parameter defines the number of parallel threads for crawling a website. The maximum value is 25, and the default is 5.

Please note that increasing this value does not always increase crawling speed. The higher the thread number, the more system resources (RAM and CPU) the Site Visualizer crawl bot consumes, which might have a negative impact on the performance of other running applications. Try to find an optimal thread number for your PC while working with the program.
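
Conceptually, the thread number is the size of a worker pool that fetches URLs in parallel, as in the Python sketch below (placeholder URLs; this illustrates the idea, not the program's implementation):

  from concurrent.futures import ThreadPoolExecutor
  import urllib.request

  def fetch(url):
      with urllib.request.urlopen(url, timeout=10) as resp:
          return url, resp.status

  urls = ["http://example.com/"] * 5   # placeholder list of pages to fetch

  # max_workers plays the role of the Thread Number setting: more workers mean
  # more simultaneous connections, but also higher RAM and CPU usage.
  with ThreadPoolExecutor(max_workers=5) as pool:
      for url, status in pool.map(fetch, urls):
          print(status, url)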

Page Access Delay

This option is only available when the number of threads is set to 1, because only in this case does a delay between URL requests make sense. The feature allows you to avoid a possible ban from particular sites caused by too-frequent URL requests. A delay of 2 or 3 seconds should be enough.
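
In a single-threaded crawl the delay is simply a pause between consecutive requests, as in this short sketch (placeholder URLs, illustrative Python only):

  import time, urllib.request

  delay_seconds = 2   # the Page Access Delay value

  for url in ["http://example.com/", "http://example.com/"]:
      urllib.request.urlopen(url, timeout=10)
      time.sleep(delay_seconds)   # wait before requesting the next page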

Limit Crawl Depth

Allows you to restrict the crawl depth:

  • 0 - only the main page of a site will be crawled, e.g. http://example.com.
  • 1 - the main page and all pages of the first level will be crawled, e.g. http://example.com/page1, http://example.com/page2, etc.
  • And so on up to 9, which is the maximum page level.
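
A depth limit of this kind can be pictured as a breadth-first traversal that stops expanding links beyond the chosen level, as in the Python sketch below (get_links is a hypothetical helper that returns the URLs found on a page):

  from collections import deque

  def crawl_breadth_first(start_url, get_links, max_depth=1):
      # Depth 0 visits only start_url; depth 1 adds pages it links to, and so on.
      seen = {start_url}
      queue = deque([(start_url, 0)])
      while queue:
          url, depth = queue.popleft()
          yield url
          if depth < max_depth:
              for link in get_links(url):
                  if link not in seen:
                      seen.add(link)
                      queue.append((link, depth + 1))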

URL Access Timeout

The connection speed to various websites, as well as to various pages within one website, usually differs. This feature allows you to define the maximum number of seconds allowed for accessing a URL. Decreasing this value makes crawling faster, since connections to "slow" pages are abandoned once the timeout expires and a "Read timed out" message appears. If you are sure that the connection to some pages or to the whole site is slow and you're ready to wait, increase this value.
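
For illustration, the timeout is the number of seconds a single request is allowed to take before it is abandoned (placeholder URL; illustrative Python, not the program's code):

  import socket, urllib.error, urllib.request

  try:
      urllib.request.urlopen("http://example.com/slow-page", timeout=15)
  except (urllib.error.URLError, socket.timeout) as err:
      print("Request failed or timed out:", err)   # reported instead of waiting indefinitely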

User Agent

This option allows you to crawl a site on behalf of a search bot: Googlebot, Yahoo! Slurp, etc. You can also enter your own custom user-agent string. By default, the Site Visualizer user agent is used.
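
Technically this just sets the User-Agent request header, as in the sketch below; the user-agent string shown is a made-up example, not the program's real one:

  import urllib.request

  request = urllib.request.Request(
      "http://example.com/",
      headers={"User-Agent": "Mozilla/5.0 (compatible; ExampleCrawler/1.0)"},
  )
  html = urllib.request.urlopen(request, timeout=10).read()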

 

See Also:


Visualize Website Structure

Visual Sitemap Parameters