How does Yandex search work?
- Stage 1. Crawling the site
- Stage 2. Loading and processing (indexing) the data
- Stage 3. Creating a database of the pages that can be included in the search results
- Stage 4. Generating search results
- FAQ
To start displaying your site in search results, Yandex must find out about its existence using robots.
A robot is a system that crawls site pages and loads them into its database. Yandex has lots of robots. Saving pages to the database and their further processing using algorithms is called indexing. The loaded data is used to generate search results. They are regularly updated and may affect the site ranking.
There are several stages before a site appears in search results:
- Stage 1. Crawling the site
- Stage 2. Loading and processing (indexing) the data
- Stage 3. Creating a database of the pages that can be included in the search results
- Stage 4. Generating search results
Stage 1. Crawling the site
The robot determines which sites to crawl and how often, as well as how many pages to crawl on each of them. When doing so, it takes into account:
- Links specified in the Sitemap file.
- Directives in the robots.txt file.
- Page size (pages larger than 10 MB are not indexed).

The robot can discover and crawl a page if:
- The link to it is placed on your own or a third-party site.
- The page is not prohibited from indexing in the robots.txt file.
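A minimal robots.txt sketch illustrating the two files mentioned above (the domain, paths, and sitemap URL are hypothetical examples):

```
User-agent: *
Disallow: /admin/
Allow: /

Sitemap: https://example.com/sitemap.xml
```

Here the Disallow directive prohibits crawling of everything under /admin/, and the Sitemap line tells robots where to find the list of pages you want crawled.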
When the robot tries to load a site page, it receives a response from the server with the HTTP status code:
| HTTP status code | Note |
|---|---|
| 200 OK | The robot will crawl the page. |
| 3XX | The robot will crawl the page that is the redirect target. Learn more about handling redirects. |
| 4XX and 5XX | A page with this code won't be included in the search; if it was in the search before, it will be removed. To prevent the page from dropping out of the search, configure the server to respond with the 429 code: the robot will keep accessing the page and checking the response code. This can be useful if a site page looks incorrect due to problems with the CMS; after the error is fixed, change the server response back. Note: if the page responds with the 429 code for a long time, this indicates that the server is struggling with the load, which can reduce the site crawl rate. |
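The handling rules in the table above can be sketched as a small function (an illustrative simplification, not Yandex's actual crawler logic):

```python
def crawl_action(status_code: int) -> str:
    """Map an HTTP status code to the robot behavior described in the table.

    A simplified illustration; the real crawler logic is more involved.
    """
    if status_code == 200:
        return "index"                # page is crawled and processed
    if 300 <= status_code < 400:
        return "follow redirect"      # the redirect target is crawled instead
    if status_code == 429:
        return "retry later"          # page stays in search; a long-lived 429
                                      # can reduce the crawl rate
    if 400 <= status_code < 600:
        return "exclude from search"  # page is dropped from the results
    return "unknown"

print(crawl_action(301))   # follow redirect
print(crawl_action(404))   # exclude from search
```

Note that 429 must be checked before the generic 4XX branch, mirroring the exception described in the table.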
These Yandex Webmaster tools can help at this stage:
- Troubleshooting — Helps check the quality of a site and fix errors, if any.
- Crawl statistics — Shows which pages the robot has crawled and how often it accesses the site.
- How to reindex a site — Allows you to report a new page on the site or an update of a page already included in the search.
- Region — Helps the robot correctly determine the region of the site and display it for location-dependent queries.
- Server response check — Indicates whether the page to be indexed is accessible to the robot.
Stage 2. Loading and processing (indexing) the data
During indexing, the robot analyzes the loaded page data, including:
- The contents of the Description meta tag, the title element, and the Schema.org micro markup, which can be used to generate a page snippet.
- The noindex directive in the robots meta tag. If it's found, the page won't be included in the search results.
- The rel="canonical" attribute indicating the address that you consider a priority for displaying in the search results for a group of pages with the same content.
- Text, images, and videos. If the robot determines that the content of several pages matches, it may treat them as duplicates.
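For illustration, the elements listed above might appear in a page's markup as follows (the URL, title, and description are hypothetical, and noindex and rel="canonical" are shown together only to keep the example short; a real page would typically use one or the other):

```html
<head>
  <title>Widget catalog</title>
  <!-- May be used to generate the page snippet: -->
  <meta name="description" content="A catalog of widgets with prices and photos.">
  <!-- Excludes the page from search results: -->
  <meta name="robots" content="noindex">
  <!-- Preferred address for a group of pages with the same content: -->
  <link rel="canonical" href="https://example.com/widgets/">
</head>
```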
These Yandex Webmaster tools can help at this stage:
- Troubleshooting — Helps check the quality of a site and fix errors, if any.
- Crawl statistics — Shows which pages the robot has crawled and how often it accesses the site.
- How to reindex a site — Allows you to report a new page on the site or an update of a page already included in the search.
Stage 3. Creating a database of the pages that can be included in the search results
Based on the information collected by the robot, the algorithms determine which pages can be included in the search results, taking into account a variety of ranking and indexing factors. For example, the database won't include pages with indexing disabled or duplicate pages.

A page may contain original, well-structured text, yet the algorithm may still not add it to the database if it's highly unlikely that the page would ever be shown to users in the search results, for example because of a lack of demand from users or high competition in the topic.
These Yandex Webmaster tools can help at this stage:
- Pages in search — Helps you track the status of site pages, for example, HTTP response status codes or duplicate pages.
- Site security — Provides information about violations and infected files.
To find out if a site subdomain appears in the search results, subscribe to notifications.
Stage 4. Generating search results
When generating the search results, the algorithms evaluate:
- To what extent the page content matches the search query (that is, whether it's relevant).
- Whether the page content is clear and useful to the user.
- Whether the page is convenient to use (how the text is structured, how paragraphs and headers of different levels are arranged, and so on).
How do I improve the site ranking in the search?
These Yandex Webmaster tools can help:
- Pages in search — Allows you to find out which site pages are included in or excluded from the search results. You can also track the pages that are most important to you.
- Query statistics — Helps you track the number of impressions of your site and clicks on the snippet.
- All queries and groups — Shows the search queries for which your site is displayed in the search results.
- Troubleshooting — Provides information about pages that are missing the Description meta tag or the title element.
- Sitelinks — Helps you check whether there are sitelinks in the snippet and configure them.
FAQ
If your server doesn't provide the last modified dates of documents, your website will still be indexed. However, keep in mind the following:
- The date won't be displayed in the search results next to your website pages.
- The robot won't know whether a website page has been updated since it was last indexed. Modified pages may therefore be reindexed less often, because the number of pages the robot gets from a website on each crawl is limited.

If you use HTTP/2, the Yandex robot indexes your site using the HTTP/1.1 protocol. However, there will be no conflicts with your server settings: the HTTP/2 version doesn't affect the crawl speed and doesn't change the site's position in Yandex search results.
A large number of parameters and nested directories in the URL, or overly long URLs, may interfere with the site indexing.
The URL can be up to 1024 characters.
The Yandex robot doesn't index the anchor (fragment) part of page URLs, except for AJAX pages (URLs containing #!). For example, the page http://example.com/page/#title won't get into the robot's database as a separate page; the robot will index http://example.com/page/ (the URL before the # character).
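The fragment-handling rule above can be sketched with Python's standard library (an illustrative simplification; the #! exception refers to the legacy AJAX crawling scheme):

```python
from urllib.parse import urldefrag

def indexable_url(url: str) -> str:
    """Return the URL a robot would store: everything before the '#'.

    URLs containing '#!' (legacy AJAX pages) are left untouched in this
    simplified sketch, mirroring the exception described above.
    """
    if "#!" in url:
        return url  # handled via the AJAX crawling scheme
    base, _fragment = urldefrag(url)  # split off the part after '#'
    return base

print(indexable_url("http://example.com/page/#title"))  # http://example.com/page/
```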