Site Crawling Options
This tab of the project settings window contains options that control how the program connects to a website's pages and analyzes their links and content.
The robots.txt file is a simple text file that contains the rules for search bots and other site crawlers to follow. You can find more details about the robots.txt protocol specification at the official site.
This option allows crawling a site's pages just like a search bot does when indexing the site. If the option is checked (it is checked by default), the Site Visualizer bot won't try to access any forbidden URLs. Otherwise, the program's bot will crawl all the links found in a page's HTML text.
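The effect of this option can be sketched with Python's standard `urllib.robotparser` module. This is only an illustration, not Site Visualizer's implementation: the robots.txt content, the `ExampleBot` user agent, and the `respect_robots` flag are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed_by_robots(parser: RobotFileParser, url: str,
                      respect_robots: bool = True,
                      user_agent: str = "ExampleBot") -> bool:
    """Return True if the crawler may request the URL.

    respect_robots stands in for the checkbox described above (an assumed
    name): when it is False, every URL is allowed.
    """
    if not respect_robots:
        return True
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
print(allowed_by_robots(parser, "http://example.com/public/a.html"))
print(allowed_by_robots(parser, "http://example.com/private/a.html"))
print(allowed_by_robots(parser, "http://example.com/private/a.html",
                        respect_robots=False))
```

With the option on, the second URL is rejected because it falls under the `Disallow: /private/` rule; with the option off, the same URL is crawled.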
Defines whether to access the URL that a link redirects to.
Go Outside of Start Directory
If the option is checked, the whole website will be crawled. Uncheck this option if you need to crawl only a particular directory of a website. In this case, specify the Site URL on the General tab, including the directory to crawl and a trailing slash, e.g. 'http://example.com/dir/'. Specifying the Site URL without a trailing slash (e.g., 'http://example.com/page') means that you want to crawl only the root directory of http://example.com/, omitting /dir/ and other directories.
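One way to read the trailing-slash rule is that the start directory is everything in the Site URL up to and including the last slash of its path. The sketch below illustrates that reading only; the function name `start_directory` is hypothetical and the program's actual logic may differ.

```python
from urllib.parse import urlsplit, urlunsplit


def start_directory(site_url: str) -> str:
    """Return the directory part of a Site URL (assumed interpretation).

    Everything after the last '/' in the path is treated as a page name
    and dropped; an empty path is treated as the root directory.
    """
    parts = urlsplit(site_url)
    path = parts.path
    directory = path[: path.rfind("/") + 1] or "/"
    return urlunsplit((parts.scheme, parts.netloc, directory, "", ""))


print(start_directory("http://example.com/dir/"))  # http://example.com/dir/
print(start_directory("http://example.com/page"))  # http://example.com/
```

The second call shows why the trailing slash matters: without it, the last path segment is read as a page in the root directory rather than as a directory to crawl.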
Count Word Number
This feature is very helpful for SEO analysis of a website's pages. Content size is one of the most important ranking factors. If the option is checked, the number of words on every page will be counted and saved to the Word count field of the pages table. In other words, Site Visualizer will count the content length of every page in words. The rules of the counting algorithm are as follows:
- Only the content enclosed in the <BODY> HTML tag is counted. Any text enclosed in other tags (<TITLE>, <META>, etc.) is omitted.
- The text of scripts (enclosed in the <SCRIPT>...</SCRIPT> tag) won't be counted.
- Any other character sequence (other than spaces, tabs, and line breaks) enclosed in any tag is counted as a word. For example:
<a alt="Some ALT text" title="Some Title">Only these 8 words will affect word count</a>
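The three rules above can be approximated with Python's standard `html.parser` module. This is a minimal sketch under the stated rules, not the program's actual algorithm; the class and function names are made up for the example.

```python
from html.parser import HTMLParser


class WordCounter(HTMLParser):
    """Counts whitespace-separated words inside <body>, skipping <script>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.in_script = False
        self.words = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attribute values are never counted.
        if tag == "body":
            self.in_body = True
        elif tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Text outside <body> or inside <script> is ignored.
        if self.in_body and not self.in_script:
            self.words += len(data.split())


def count_words(html: str) -> int:
    counter = WordCounter()
    counter.feed(html)
    counter.close()
    return counter.words


SAMPLE = """<html><head><title>Ignored title</title></head>
<body>
<script>var ignored = 1;</script>
<a alt="Some ALT text" title="Some Title">Only these 8 words will affect word count</a>
</body></html>"""

print(count_words(SAMPLE))  # 8
```

The title, the script text, and the `alt` and `title` attribute values are all skipped, so only the eight words of link text are counted.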
This feature lets you crawl only certain URLs (or directories). This can be useful when crawling extremely large websites.
On the Crawling tab of the project settings window, click the Include link. In the dialog box that appears, type the URLs (or parts of a URL, such as a directory or file name) that should be crawled, one per line. Only links whose URL contains at least one of the specified strings will be crawled.
This feature lets you define a list of URLs or directories that should be skipped while crawling. It works in a similar way to the inclusion list described above. Links whose URL contains at least one of the specified strings will not be crawled.
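The substring matching described for both lists can be sketched as follows. How the two lists interact is not stated in the text, so this example assumes that an empty inclusion list means "no restriction" and that an exclusion match always wins; the function name and sample strings are hypothetical.

```python
def passes_url_filters(url: str, include=(), exclude=()) -> bool:
    """Return True if the URL should be crawled.

    Assumptions: an empty include list places no restriction, and a match
    in the exclude list overrides a match in the include list.
    """
    if include and not any(part in url for part in include):
        return False
    return not any(part in url for part in exclude)


include = ["/blog/", "/docs/"]
exclude = ["/docs/archive/"]

print(passes_url_filters("http://example.com/blog/post1", include, exclude))
print(passes_url_filters("http://example.com/shop/item", include, exclude))
print(passes_url_filters("http://example.com/docs/archive/old", include, exclude))
```

Only the first URL passes: the second matches no inclusion string, and the third matches an exclusion string.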
Delete Parameters from URLs
Opens a dialog window that lets you specify parameters to be deleted from URL strings. In most cases these are session IDs, category IDs, various preview modes, and other parameters that make a URL different even though it still refers to the same page of the website. Removing such parameters makes the sitemap cleaner and smaller, which in turn simplifies site analysis. Finally, it lets you submit URLs to Google and other search engines without such "garbage" parameters.
In the dialog window that appears, type the parameters you want to exclude from URLs. Each line should contain a single parameter (without '?', '&', or any other delimiters). For instance, the picture below shows a set of parameters for crawling a phpBB forum:
Click the OK button to save the parameter list to the current project.
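The effect of deleting parameters can be illustrated with Python's `urllib.parse`. This is a sketch of the idea rather than the program's own code; the function name is hypothetical, and `sid` is used only as an example of a session-ID parameter.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def delete_parameters(url: str, names) -> str:
    """Return the URL with the named query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in names
    ]
    # Note: urlencode re-encodes the remaining values, so unusual escaping
    # in the original query string may be normalized.
    return urlunsplit(parts._replace(query=urlencode(kept)))


to_delete = {"sid"}  # example parameter list, one name per entry

a = delete_parameters("http://example.com/viewtopic.php?f=2&t=15&sid=abc123", to_delete)
b = delete_parameters("http://example.com/viewtopic.php?f=2&t=15&sid=zzz999", to_delete)
print(a)       # http://example.com/viewtopic.php?f=2&t=15
print(a == b)  # True
```

Two URLs that differ only in the session ID collapse into one, which is what keeps the resulting sitemap smaller.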
Use Chromium Rendering
Crawl Secured Pages
When it accesses a protected area, the program will ask for your credentials:
In the dialog box that appears, specify the username and password. If the credentials are correct, Site Visualizer will crawl the secured pages.
Check Tag Types
- <IMG> – uncheck it if you need to exclude <IMG> image tags from the parsing of the site's HTML text and prevent them from being added to the Links table.
- <LINK> – in most cases these tags are used to link a page's HTML text to one or more Cascading Style Sheets (*.CSS) files. Like the <IMG> tag, the <LINK> tag is just a reference to a file, not a hyperlink to a page of a website. It doesn't transfer any "page weight," so this option can usually be unchecked for SEO analysis.
Check External Links
Leave this option switched on to check the links from the website you crawl to external sites. This allows you to find broken external links on your website.
When this option is turned on, links to the same page with different bookmarks are treated as different links. For instance, example.com/somepage#b1 and example.com/somepage#b2 would be recognized as links to two different pages and added to the Pages table separately. Each of these bookmarks (#b1 and #b2) will be checked for presence on example.com/somepage. If a bookmark is not found, the corresponding Response column will contain a "#b1 Not Found" message.
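A bookmark check of this kind might look like the sketch below. It assumes a bookmark is present when some element has a matching `id` attribute or an `<a>` tag has a matching `name` attribute. The text does not say how the program detects bookmarks, and the "OK" result string is likewise an assumption; only the "Not Found" message format comes from the description above.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class AnchorCollector(HTMLParser):
    """Collects candidate bookmark targets: any id, plus <a name="...">."""

    def __init__(self):
        super().__init__()
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (name == "id" or (tag == "a" and name == "name")):
                self.anchors.add(value)


def check_bookmark(url: str, page_html: str) -> str:
    """Return an assumed 'OK' or a '#fragment Not Found' message."""
    fragment = urldefrag(url).fragment
    if not fragment:
        return "OK"
    collector = AnchorCollector()
    collector.feed(page_html)
    collector.close()
    if fragment in collector.anchors:
        return "OK"
    return f"#{fragment} Not Found"


PAGE = '<html><body><h2 id="b1">First section</h2></body></html>'

print(check_bookmark("http://example.com/somepage#b1", PAGE))  # OK
print(check_bookmark("http://example.com/somepage#b2", PAGE))  # #b2 Not Found
```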
Store Response Headers
Leave this option checked (the default) to store the headers received in response to each request for an internal URL during crawling.
This parameter defines the number of parallel threads for crawling a website. The maximum value is 25, and the default is 5.
Page Access Delay
This option is only available when the number of threads is decreased to 1, because only in this case does a delay between URL requests make sense. This feature helps you avoid a possible ban by particular sites because of too-frequent URL requests. A delay of 2 or 3 seconds should be enough.
Limit Crawl Depth
Restricts the depth to which URLs are crawled:
- 0 - only the main page of a site will be crawled, e.g. http://example.com.
- 1 - the main page and all pages of the first level will be crawled, e.g. http://example.com/page1, http://example.com/page2, etc.
- And so on up to 9, which is the maximum page level.
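The levels listed above can be illustrated with a breadth-first traversal over a small in-memory link graph. The sketch assumes that a page's level is the number of link hops from the main page; the text does not define "level" precisely, so treat this as one plausible reading. The link graph and function name are hypothetical.

```python
from collections import deque

# Hypothetical link graph: page URL -> URLs it links to.
LINKS = {
    "http://example.com": ["http://example.com/page1", "http://example.com/page2"],
    "http://example.com/page1": ["http://example.com/page3"],
    "http://example.com/page2": [],
    "http://example.com/page3": [],
}


def crawl_with_depth_limit(start: str, links: dict, max_depth: int) -> dict:
    """Breadth-first crawl returning {url: level} for pages within max_depth."""
    level_of = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if level_of[url] >= max_depth:
            continue  # do not follow links from pages at the depth limit
        for target in links.get(url, []):
            if target not in level_of:
                level_of[target] = level_of[url] + 1
                queue.append(target)
    return level_of


print(sorted(crawl_with_depth_limit("http://example.com", LINKS, 0)))
print(sorted(crawl_with_depth_limit("http://example.com", LINKS, 1)))
```

With a limit of 0 only the main page is visited; with a limit of 1 the crawl also reaches page1 and page2 but stops before page3.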
URL Access Timeout
The connection speed to various websites, as well as to various pages within one website, is usually different. This feature lets you define the maximum number of seconds to wait when accessing a URL. Decreasing this value will make crawling faster, since "slow" pages that do not respond within the timeout are skipped with a "Read timed out" message. If you are sure that the connection to some pages or to a site is slow and you're ready to wait, increase this value.
This option allows you to crawl a site on behalf of a search bot (Googlebot, Yahoo! Slurp, etc.) or to enter a custom user-agent string. By default, the Site Visualizer user agent is used.