Using robots.txt

What is robots.txt?

Robots.txt is a text file that contains site indexing parameters for search engine robots.

We recommend watching How to manage site indexation.

How to set up robots.txt

  1. Create a file named robots.txt in a text editor and fill it in using the guidelines below (a minimal example follows this list).
  2. Check the file in the Yandex.Webmaster service (Robots.txt analysis in the menu).
  3. Upload the file to your site's root directory.
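
A minimal sketch of such a file, for illustration only; the disallowed paths and the sitemap URL are assumptions, not recommendations:

User-agent: *           # applies to all robots
Disallow: /admin        # assumed private section
Disallow: /tmp          # assumed temporary files

Sitemap: https://example.com/sitemap.xml # assumed sitemap location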

The User-agent directive

The Yandex robot supports the robots exclusion standard with enhanced capabilities described below.

The Yandex robot's work is based on sessions: for every session, there is a pool of pages for the robot to download.

A session begins with the download of the robots.txt file. If the file is missing, is not a text file, or the robot's request returns an HTTP status other than 200 OK, the robot assumes that it has unrestricted access to the site's documents.
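
In other words, a missing or inaccessible robots.txt has the same effect as a file that allows everything, for example:

User-agent: *
Disallow: # empty value: access to the whole site is allowed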

In the robots.txt file, the robot checks for records starting with User-agent: and looks for either the substring Yandex (the case doesn't matter) or *. If the string User-agent: Yandex is detected, directives for User-agent: * are ignored. If neither the User-agent: Yandex nor the User-agent: * string is found, the robot is considered to have unlimited access.

You can enter separate directives for specific Yandex robots.

If there are directives for a specific robot, the User-agent: Yandex and User-agent: * directives aren't used.

Example:

User-agent: YandexBot # will be used only by the main indexing robot
Disallow: /*id=

User-agent: Yandex # will be used by all Yandex robots
Disallow: /*sid= # except for the main indexing robot

User-agent: * # won't be used by Yandex robots
Disallow: /cgi-bin

Disallow and Allow directives

To prohibit the robot from accessing your site or certain sections of it, use the Disallow directive.

Examples:

User-agent: Yandex
Disallow: / # blocks access to the whole site

User-agent: Yandex
Disallow: /cgi-bin # blocks access to the pages
                   # starting with '/cgi-bin'

According to the standard, you should insert a blank line before every User-agent directive.

The # character designates a comment. Everything following this character, up to the first line break, is disregarded.

Use the Allow directive to allow the robot to access specific parts of the site or the entire site.

Examples:

User-agent: Yandex
Allow: /cgi-bin
Disallow: /
# prohibits downloading anything except for the pages
# starting with '/cgi-bin'
Note. Empty line breaks aren't allowed between the User-agent, Disallow and Allow directives.
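
For instance, a sketch of correct grouping (the paths are purely illustrative):

User-agent: Yandex
Disallow: /cgi-bin
Allow: /cgi-bin/public # no empty lines inside the group

User-agent: * # a blank line precedes the next User-agent group
Disallow: /tmp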

Combining directives

The Allow and Disallow directives from the corresponding User-agent block are sorted according to URL prefix length (from shortest to longest) and applied in order. If several directives match a particular site page, the robot selects the last one in the sorted list. This way the order of directives in the robots.txt file doesn't affect the way they are used by the robot. Examples:

# Source robots.txt:
User-agent: Yandex
Allow: /catalog
Disallow: /
# Sorted robots.txt:
User-agent: Yandex
Disallow: /
Allow: /catalog
# only allows downloading pages
# starting with '/catalog'
# Source robots.txt:
User-agent: Yandex
Allow: /
Allow: /catalog/auto
Disallow: /catalog
# Sorted robots.txt:
User-agent: Yandex
Allow: /
Disallow: /catalog
Allow: /catalog/auto
# prohibits downloading pages starting with '/catalog',
# but allows downloading pages starting with '/catalog/auto'.
Note. If there is a conflict between two directives with prefixes of the same length, the Allow directive takes precedence.
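
For instance, a sketch of such a conflict (the '/catalog' prefix is only illustrative):

User-agent: Yandex
Allow: /catalog
Disallow: /catalog
# both prefixes are the same length, so the Allow directive takes precedence:
# pages starting with '/catalog' can be downloaded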

Allow and Disallow directives without parameters

If the directives don't contain parameters, the robot handles the data as follows:

User-agent: Yandex
Disallow: # same as Allow: /

User-agent: Yandex
Allow: # isn't taken into account by the robot

Using the special characters * and $

You can use the special characters * and $ to set regular expressions when specifying paths for the Allow and Disallow directives. The * character indicates any sequence of characters (or none). Examples:

User-agent: Yandex
Disallow: /cgi-bin/*.aspx # prohibits '/cgi-bin/example.aspx'
                          # and '/cgi-bin/private/test.aspx'
Disallow: /*private # prohibits both '/private'
                    # and '/cgi-bin/private'

The $ character

By default, the * character is appended to the end of every rule described in the robots.txt file. Example:

User-agent: Yandex
Disallow: /cgi-bin* # blocks access to pages
                    # starting with '/cgi-bin'
Disallow: /cgi-bin # the same

To cancel * at the end of the rule, use the $ character, for example:

User-agent: Yandex
Disallow: /example$ # prohibits '/example',
                    # but allows '/example.html'

User-agent: Yandex
Disallow: /example # prohibits both '/example'
                   # and '/example.html'

The $ character doesn't cancel an explicit * at the end of the rule, that is:

User-agent: Yandex
Disallow: /example$  # prohibits only '/example'
Disallow: /example*$ # exactly the same as 'Disallow: /example'
                     # prohibits both '/example.html' and '/example'

The Sitemap directive

If you use a Sitemap file to describe your site's structure, indicate the path to the file as a parameter of the Sitemap directive (if you have multiple files, indicate all paths). Example:

User-agent: Yandex
Allow: /
sitemap: https://example.com/site_structure/my_sitemaps1.xml
sitemap: https://example.com/site_structure/my_sitemaps2.xml

The directive is intersectional, meaning it is used by the robot regardless of its location in robots.txt.
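
For instance, a sketch (with an assumed sitemap URL) where the directive is placed outside the User-agent group and is still taken into account:

User-agent: *
Disallow: /tmp

Sitemap: https://example.com/sitemap.xml # located outside the User-agent group,
                                         # but still used by the robot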

The robot remembers the path to your file, processes your data and uses the results during the next visit to your site.

Host directive

If your site has mirrors, a special mirror bot (Mozilla/5.0 (compatible; YandexBot/3.0; MirrorDetector; +http://yandex.com/bots)) detects them and forms a mirror group for your site. Only the main mirror is included in the search. You can indicate which site is the main mirror in the robots.txt file. The name of the main mirror should be listed in the Host directive.

The Host directive does not guarantee that the specified main mirror will be selected. However, the decision-making algorithm will assign it a high priority. Example:

#If https://www.main-mirror.com is your site's main mirror, then
#robots.txt for all sites from the mirror group will look like this:
User-Agent: *
Disallow: /forum
Disallow: /cgi-bin
Host: https://www.main-mirror.com

To maintain compatibility with robots that may deviate from the standard when processing robots.txt, add the Host directive to the group that starts with the User-Agent record right after the Disallow and Allow directives. The Host directive argument is the domain name with the port number (80 by default), separated by a colon.

#Example of a well-formed robots.txt file, where
#the Host directive will be taken into account during processing
User-Agent: *
Disallow:
Host: https://www.myhost.ru

However, the Host directive is intersectional and is used by the robot regardless of its location in robots.txt.

Note. For every robots.txt file, only one Host directive is processed. If several directives are indicated in the file, the robot will use the first one.

Example:

Host: myhost.ru # is used

User-agent: *
Disallow: /cgi-bin

User-agent: Yandex
Disallow: /cgi-bin
Host: https://www.myhost.ru # isn't used

The Host directive should contain:

  • The HTTPS protocol if the mirror is available only over a secure channel. If you use the HTTP protocol, there is no need to indicate it.

  • One valid domain name that conforms to RFC 952 and is not an IP address.

  • The port number, if necessary (Host: myhost.com:8080).

An incorrectly formed Host directive is ignored.

# Examples of Host directives that will be ignored
Host: www.myhost-.com
Host: www.-myhost.com
Host: www.myhost.com:100000
Host: www.my_host.com
Host: .my-host.com:8000
Host: my-host.com.
Host: my..host.com
Host: www.myhost.com:8080/
Host: 213.180.194.129
Host: www.firsthost.ru,www.secondhost.com
Host: www.firsthost.ru www.secondhost.com

Examples of the Host directive usage:

# domain.myhost.ru is the main mirror for
# www.domain.myhost.com, so the correct use of
# the Host directive is:
User-Agent: *
Disallow:
Host: domain.myhost.ru

The Crawl-delay directive

If the server is overloaded and it isn't possible to process downloading requests, use the Crawl-delay directive. You can specify the minimum interval (in seconds) for the search robot to wait after downloading one page, before starting to download another.

To maintain compatibility with robots that may deviate from the standard when processing robots.txt, add the Crawl-delay directive to the group that starts with the User-Agent entry right after the Disallow and Allow directives.

The Yandex search robot supports fractional values for Crawl-Delay, such as "0.5". This doesn't mean that the search robot will access your site every half a second, but it may speed up the site processing.

Examples:

User-agent: Yandex
Crawl-delay: 2 # sets a 2-second timeout

User-agent: *
Disallow: /search
Crawl-delay: 4.5 # sets a 4.5-second timeout

The Clean-param directive

If your site page addresses contain dynamic parameters that don't affect the content (for example, identifiers of sessions, users, referrers, and so on), you can describe them using the Clean-param directive.

The Yandex robot uses this information to avoid reloading duplicate information. This improves the robot's efficiency and reduces the server load.

For example, your site contains the following pages:

www.example.com/some_dir/get_book.pl?ref=site_1&book_id=123
www.example.com/some_dir/get_book.pl?ref=site_2&book_id=123
www.example.com/some_dir/get_book.pl?ref=site_3&book_id=123

The ref parameter is only used to track which resource the request was sent from. It doesn't change the page content: all three URLs display the same page for the book with book_id=123. Then, if you specify the directive in the following way:

User-agent: Yandex
Disallow:
Clean-param: ref /some_dir/get_book.pl

the Yandex robot will consolidate all the page addresses into one:

www.example.com/some_dir/get_book.pl?ref=site_1&book_id=123

If a page without parameters is available on the site:

www.example.com/some_dir/get_book.pl?book_id=123

all other URLs are replaced with it after the robot indexes it. Other pages of your site will be crawled more often, because there will be no need to update the pages:

www.example.com/some_dir/get_book.pl?ref=site_2&book_id=123
www.example.com/some_dir/get_book.pl?ref=site_3&book_id=123

Directive syntax

Clean-param: p0[&p1&p2&..&pn] [path]

In the first field, list the parameters that must be disregarded, separated by the & character. In the second field, indicate the path prefix for the pages the rule should apply to.

Note. The Clean-Param directive is intersectional, so it can be indicated anywhere within the robots.txt file. If several directives are specified, all of them will be taken into account by the robot.

The prefix can contain a regular expression in a format similar to the one used in the robots.txt file, but with some restrictions: you can only use the characters A-Za-z0-9.-/*_. However, * is interpreted in the same way as in robots.txt. A * is always implicitly appended to the end of the prefix. For example:

Clean-param: s /forum/showthread.php

means that the s parameter is disregarded for all URLs that begin with /forum/showthread.php. The second field is optional; if it is omitted, the rule applies to all pages on the site. The rule is case sensitive. The maximum length of a rule is 500 characters. For example:

Clean-param: abc /forum/showthread.php
Clean-param: sid&sort /forum/*.php
Clean-param: someTrash&otherTrash

Additional examples

#for addresses like:
www.example1.com/forum/showthread.php?s=681498b9648949605&t=8243
www.example1.com/forum/showthread.php?s=1e71c4427317a117a&t=8243

#robots.txt will contain the following:
User-agent: Yandex
Disallow:
Clean-param: s /forum/showthread.php
#for addresses like:
www.example2.com/index.php?page=1&sort=3a&sid=2564126ebdec301c607e5df
www.example2.com/index.php?page=1&sort=3a&sid=974017dcd170d6c4a5d76ae

#robots.txt will contain the following:
User-agent: Yandex
Disallow:
Clean-param: sid /index.php
#if there are several of these parameters:
www.example1.com/forum_old/showthread.php?s=681498605&t=8243&ref=1311
www.example1.com/forum_new/showthread.php?s=1e71c417a&t=8243&ref=9896

#robots.txt will contain the following:
User-agent: Yandex
Disallow:
Clean-param: s&ref /forum*/showthread.php
#if the parameter is used in multiple scripts:
www.example1.com/forum/showthread.php?s=681498b9648949605&t=8243
www.example1.com/forum/index.php?s=1e71c4427317a117a&t=8243

#robots.txt will contain the following:
User-agent: Yandex
Disallow:
Clean-param: s /forum/index.php
Clean-param: s /forum/showthread.php

Using Cyrillic characters

The use of the Cyrillic alphabet is not allowed in the robots.txt file and in HTTP server headers.

For domain names, use Punycode. For page addresses, use the same encoding as the one used for the current site structure.

Example of the robots.txt file:

#Incorrect:
User-agent: Yandex
Disallow: /корзина
Host: интернет-магазин.рф

#Correct:
User-agent: Yandex
Disallow: /%D0%BA%D0%BE%D1%80%D0%B7%D0%B8%D0%BD%D0%B0
Host: xn----8sbalhasbh9ahbi6a2ae.xn--p1ai

Additional information

The Yandex robot supports only the robots.txt directives listed on this page. The file processing rules described above represent an extension of the basic standard. Other robots may interpret robots.txt contents in a different way.

The results of using the extended robots.txt format may differ from results based on the basic standard, in particular:

User-agent: Yandex
Allow: /
Disallow: /
# without extensions, everything was prohibited because 'Allow: /' was ignored;
# with extensions supported, everything is allowed

User-agent: Yandex
Disallow: /private*html
# without extensions, '/private*html' was prohibited;
# with extensions supported, '/private*html',
# '/private/test.html', '/private/html/test.aspx', and so on are prohibited as well

User-agent: Yandex
Disallow: /private$
# without extensions supported, '/private$', '/private$test', and so on were prohibited;
# with extensions supported, only '/private' is prohibited

User-agent: *
Disallow: /
User-agent: Yandex
Allow: /
# without extensions supported, because of the missing line break,
# 'User-agent: Yandex' would be ignored
# and the result would be 'Disallow: /', but the Yandex robot
# parses strings based on the 'User-agent:' substring.
# In this case, the result for the Yandex robot is 'Allow: /'

User-agent: *
Disallow: /
# comment1...
# comment2...
# comment3...
User-agent: Yandex
Allow: /
# same as in the previous example (see above)

Examples using the extended robots.txt format:

User-agent: Yandex
Allow: /archive
Disallow: /
# allows everything that contains '/archive'; the rest is prohibited

User-agent: Yandex
Allow: /obsolete/private/*.html$ # allows HTML files
                                 # in the '/obsolete/private/...' path
Disallow: /*.php$  # prohibits all '*.php' on the site
Disallow: /*/private/ # prohibits all subpaths containing
                      # '/private/', but the Allow above negates
                      # part of the prohibition
Disallow: /*/old/*.zip$ # prohibits all '*.zip' files containing
                        # '/old/' in the path

User-agent: Yandex
Disallow: /add.php?*user= # prohibits all 'add.php?' scripts with the 'user' option

When forming the robots.txt file, you should keep in mind that the robot places a reasonable limit on its size. If the file size exceeds 32 KB, the robot assumes it allows everything, meaning it is interpreted the same way as:

User-agent: Yandex
Disallow:

Similarly, robots.txt is assumed to allow everything if it couldn't be downloaded (for example, if HTTP headers are not set properly or a 404 Not found status is returned).

Exceptions

A number of Yandex robots download web documents for purposes other than indexing. To avoid being unintentionally blocked by site owners, they may ignore robots.txt directives intended for arbitrary robots (User-agent: *).

In addition, robots may ignore some robots.txt restrictions for certain sites if there is an agreement between Yandex and the owners of those sites.

Attention. If such a robot downloads a document that the main Yandex robot can't access, this document will never be indexed and won't be found in search results.

Yandex robots that don't follow common disallow directives in robots.txt:

  • YaDirectFetcher downloads ad landing pages to check their availability and content. This is needed for placing ads in the Yandex search results and on partner sites. When crawling a site, the robot does not use the robots.txt file and ignores the directives set for it.
  • YandexCalendar regularly downloads calendar files at users' request. These files are often located in directories prohibited from indexing.
  • YandexDirect downloads information about the content of Yandex Advertising network partner sites to identify their topic categories to match relevant advertising.
  • YandexDirectDyn is the robot that generates dynamic banners.
  • YandexMobileBot downloads documents to determine if their layout is suitable for mobile devices.
  • YandexAccessibilityBot downloads pages to check their accessibility for users.
  • YandexScreenshotBot takes a screenshot of a page.
  • YandexMetrika is the Yandex.Metrica robot.
  • YandexVideoParser is the Yandex video indexer.
  • YandexSearchShop regularly downloads product catalogs in YML files at users' request. These files are often placed in directories prohibited from indexing.

To prevent this behavior, you can restrict access for these robots to some pages or the whole site using the robots.txt directives, for example:

User-agent: YandexCalendar
Disallow: /

User-agent: YandexMobileBot
Disallow: /private/*.txt$