Crawl statistics
The Yandex indexing robot regularly crawls site pages and loads them into the search database. The robot can fail to download a page if it is unavailable.
Yandex.Webmaster lets you know which pages of your site are crawled by the robot. You can view the URLs of the pages the robot failed to download because the hosting server was unavailable or because of errors in the page content.
Information about pages is available on the Crawl statistics page in Yandex.Webmaster. The information is updated daily, within six hours after the robot visits the page. By default, the service provides data on the site as a whole. To view information about a specific section, choose it from the list in the site URL field. The available sections reflect the site structure known to Yandex (except for manually added sections).
If the list doesn't contain any pages that should be included in the search results, use the Reindex pages tool to let Yandex know about them.
You can download the information about pages in XLS or CSV format. The exported file reflects the filters you have set.
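For example, if you export the list in CSV format, you can tally the server response codes offline. Below is a minimal sketch; the file name, the semicolon delimiter, and the "HTTP code" column header are assumptions, so check them against your actual export:

```python
# Tally server response codes in a CSV export from the Crawl statistics page.
# Assumptions: the export is named crawl_statistics.csv, uses a semicolon
# delimiter, and has a column labeled "HTTP code"; verify against your file.
import csv
from collections import Counter

codes = Counter()
with open("crawl_statistics.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter=";"):
        codes[row.get("HTTP code", "unknown")] += 1

for code, count in codes.most_common():
    print(code, count)
```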
- Page status dynamics
- Page changes in the search database
- List of pages crawled by the robot
- Data filtering
Page status dynamics
Page information is presented as follows:
- New and changed — The number of pages the robot crawled for the first time and pages that changed their status after they were crawled by the robot.
- Crawl statistics — The number of pages crawled by the robot, with the server response code.
Page changes in the search database
Changes are displayed if the HTTP response code changed when the robot accessed the page again. For example, 200 OK becomes 404 Not Found. If only the page content changed, this won't be shown in Yandex.Webmaster.
To view the changes, set the option to Recent changes. Up to 50,000 changes can be displayed.
Yandex.Webmaster shows the following information about the pages:
- The date when the page was last visited by the robot (the crawl date).
- The page path from the root directory of the site.
- The server response code received at the crawl.
Based on this information, you can find out how often the robot crawls the site pages. You can also see which pages were just added to the database and which ones were re-crawled.
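If you want to reproduce the response code the robot receives (for example, to confirm that a page really returns 404 Not Found), you can request the page yourself. A minimal sketch, assuming a placeholder URL; the user agent below is the one the main Yandex indexing robot identifies itself with:

```python
# Check the HTTP status code a page returns, sending YandexBot's documented
# user agent so that any agent-dependent behavior is visible. A HEAD request
# avoids downloading the page body; some servers may require GET instead.
import urllib.error
import urllib.request

UA = "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"

def check_status(url: str) -> int:
    req = urllib.request.Request(url, headers={"User-Agent": UA}, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code  # e.g. 404 Not Found or 500 Internal Server Error

print(check_status("https://example.com/tariff/"))  # placeholder URL
```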
- Pages added to the search database

  If a page is crawled for the first time, the Was column displays the N/a status, and the Currently column displays the server response (for example, 200 OK).

  After the page is successfully loaded into the search database, it can be displayed in the search results once the database is updated. Information about it is shown on the Pages in search page.
- Pages reindexed by the robot

  If the robot crawled the page before, its status can change when it's re-crawled: the Was column shows the server response received during the previous visit, and the Currently column shows the server response received during the last crawl.

  Assume that a page included in the search became unavailable to the robot. In this case, it is excluded from the search. After some time, you can find it in the list of excluded pages on the Pages in search page.

  A page excluded from the search can stay in the search database so that the robot can check its availability. The robot usually keeps requesting the page as long as there are links to it and it isn't prohibited in the robots.txt file.
List of pages crawled by the robot
To view the list of pages, set the option to All pages. The list can contain up to 50,000 pages.
You can view the list of site pages crawled by the robot and the following information about them:
- The date when the page was last visited by the robot (the crawl date).
- The page path from the root directory of the site.
- The server response code received when the page was last downloaded by the robot.
Data filtering
You can filter the information about the pages and the changes in the search database by any parameter (the crawl date, the page URL, the server response code) using the filter icon. When setting several conditions, you can combine them in two ways:

- Match any of the conditions (corresponds to the “OR” operator).
- Match all conditions (corresponds to the “AND” operator).

Here are a few examples:
You can create a list of pages that the robot crawled but failed to download because of the 404 Not Found server response.
To view only the new pages that were unavailable to the robot, set the radio button to Recent changes.

To get the full list of pages that were unavailable to the robot, set the radio button to All pages.
You can create a list of pages with the URL containing a certain fragment. To do this, choose Contains from the list and enter the fragment in the field.
You can use special characters to match the beginning of a string or a substring, and set more complex conditions using regular expressions. To do this, choose URL matches from the list and enter the condition in the field. You can add multiple conditions by putting each one on a new line.

The following rules are available for conditions:
Character | Description | Example
---|---|---
* | Matches any number of any characters | Display data for all pages that start with https://example.com/tariff/, including the specified page: /tariff/*. The * character can also be useful when searching for URLs that contain two or more specific elements, for example news or announcements for a certain year.
@ | The filtered results contain the specified string (but don't necessarily strictly match it) | Display information for all pages with URLs containing the specified string: @tariff
~ | Condition is a regular expression | Display data for pages with URLs that match a regular expression. For example, you can filter all pages whose address contains the fragment table, sofa, or bed once or several times: ~table\|sofa\|bed
! | Negative condition | Exclude pages with URLs starting with https://example.com/tariff/: !/tariff/*
Character matching isn't case sensitive.

The @, !, and ~ characters can be used only at the beginning of the string. The following combinations are available:

Operator | Example
---|---
!@ | Exclude pages with URLs containing tariff: !@tariff
!~ | Exclude pages with URLs that match a regular expression
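To make the matching rules concrete, here is how the documented semantics could be expressed in code. This is an illustrative sketch of the rules above, not Yandex's implementation; the sample paths are invented:

```python
# Evaluate a filter condition against a page path, following the rules
# documented above: matching is case-insensitive, * matches any number of
# any characters, @ tests for a substring, ~ applies a regular expression,
# and a leading ! negates the rest (which also covers !@ and !~).
import re

def matches(condition: str, path: str) -> bool:
    cond, path = condition.lower(), path.lower()
    if cond.startswith("!"):   # negative condition
        return not matches(cond[1:], path)
    if cond.startswith("@"):   # path contains the given string
        return cond[1:] in path
    if cond.startswith("~"):   # condition is a regular expression
        return re.search(cond[1:], path) is not None
    # Plain condition: expand * into ".*" and match the whole path
    pattern = ".*".join(re.escape(part) for part in cond.split("*"))
    return re.fullmatch(pattern, path) is not None

paths = ["/tariff/basic", "/news/2019/june", "/shop/sofa-bed"]
print([p for p in paths if matches("/tariff/*", p)])        # ['/tariff/basic']
print([p for p in paths if matches("@tariff", p)])          # ['/tariff/basic']
print([p for p in paths if matches("~table|sofa|bed", p)])  # ['/shop/sofa-bed']
print([p for p in paths if matches("!@tariff", p)])         # the other two
```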
FAQ
- Pages take longer to get into the search results: see the Why is it taking so long for pages to appear in search results? section.
- The robot creates an additional load on the site, and you want to reduce it: follow the recommendations.
Perhaps too little time has passed since you created the website. To inform the robot about the website, add the website to Yandex.Webmaster and verify your rights to manage it. Also check whether there were any server failures. In case of a server error, the Yandex robot stops indexing and makes another attempt when it crawls the website next time.
Yandex employees can't speed up how fast pages are added to the search database.
We don't forecast website indexing timeframes and can't guarantee that a website will be indexed. It usually takes from several days to two weeks from when the robot finds the website until its pages are shown in search results.
The number of pages crawled by the Yandex robot may be higher or lower on different days. These changes don't affect site indexing or ranking in search results.
The robot follows links from other pages. This means that some other page contains links to confidential sections of your website. You can either protect them with a password or block them from indexing by the Yandex robot in the robots.txt file. In both cases, the robot won't download confidential information.
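For example, a minimal robots.txt sketch that blocks Yandex robots from a confidential section, assuming it lives under the placeholder path /private/:

```
User-agent: Yandex
Disallow: /private/
```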