What is robots.txt?
Robots.txt is a text file that contains site indexing parameters for search engine robots.
How to set up robots.txt
- Create a file named robots.txt in a text editor and fill it in following the guidelines below.
- Check the file using the Yandex.Webmaster service (Robots.txt analysis in the menu).
- Upload the file to your site's root directory.
The Yandex robot supports the robot exclusion standard, with the extensions described below.
The Yandex robot works in sessions: for each session, it puts together a pool of webpages that it plans to visit.
A session begins when the robots.txt file is loaded. If the file is missing, is not a text file, or the robot's request returns an HTTP status other than 200 OK, the robot assumes that it has unrestricted access to the site's documents.
In the robots.txt file, the robot checks for records beginning with User-agent: and looks for either the substring Yandex (case doesn't matter) or *. If it finds the line User-agent: Yandex, the directives for User-agent: * are disregarded. If both the User-agent: Yandex and User-agent: * lines are absent, robot access is assumed to be unrestricted.
Separate directives can be entered for the following Yandex robots:
'YandexBot': the main indexing robot
'YandexDirect': downloads information about the content on Yandex Advertising Network partner sites for selecting relevant ads; interprets robots.txt in a special way
'YandexDirectDyn': generates dynamic banners; interprets robots.txt in a special way
'YandexMedia': the robot that indexes multimedia data
'YandexImages': the indexing robot for Yandex.Images
'YaDirectFetcher': the Yandex.Direct robot; interprets robots.txt in a special way
'YandexBlogs': the blog search robot that indexes posts and comments
'YandexPagechecker': the micromarkup validator
'YandexMetrika': the Yandex.Metrica robot
'YandexCalendar': the Yandex.Calendar robot
If directives are found for a specific robot, User-agent: Yandex and User-agent: * are not used.
User-agent: YandexBot # used only by the main indexing robot
Disallow: /*id=

User-agent: Yandex # used by all Yandex robots
Disallow: /*sid= # except the main indexing robot

User-agent: * # not used by Yandex robots
Disallow: /cgi-bin
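The selection order described above can be sketched in Python. This is only an illustration, not the robot's actual code: the function name select_group is hypothetical, and it uses exact, case-insensitive comparison of User-agent values, which is a simplifying assumption.

```python
def select_group(user_agents, robot="YandexBot"):
    """Pick the User-agent group a given Yandex robot would follow.

    user_agents: the User-agent values found in robots.txt.
    Returns the matching value as written in the file, or None when no
    group applies (access is then treated as unrestricted).
    """
    by_lower = {ua.lower(): ua for ua in user_agents}
    # Most specific first: the robot's own name, then Yandex, then *.
    for candidate in (robot.lower(), "yandex", "*"):
        if candidate in by_lower:
            return by_lower[candidate]
    return None


print(select_group(["YandexBot", "Yandex", "*"]))
print(select_group(["Yandex", "*"], robot="YandexImages"))
```

With the example file above, the main indexing robot would follow only the YandexBot group, while any other Yandex robot would fall back to the Yandex group.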
Disallow and Allow directives
If you don't want to allow robots to access your site or certain sections of it, use the Disallow directive.
User-agent: Yandex
Disallow: / # blocks access to the whole site

User-agent: Yandex
Disallow: /cgi-bin # blocks access to pages
                   # starting with '/cgi-bin'
In accordance with the standard, we recommend that you insert a blank line before every User-agent directive.
The # character designates a comment. Everything following this character, up to the first line break, is disregarded.
Use the Allow directive to allow the robot access to specific parts of the site or to the entire site.
User-agent: Yandex
Allow: /cgi-bin
Disallow: /
# forbids downloads of anything except for pages
# starting with '/cgi-bin'
Using directives jointly
The Allow and Disallow directives from the corresponding User-agent block are sorted according to URL prefix length (from shortest to longest) and applied in order. If several directives match a particular site page, the robot selects the last one in the sorted list. This way the order of directives in the robots.txt file doesn't affect how they are used by the robot. Examples:
# Source robots.txt:
User-agent: Yandex
Allow: /catalog
Disallow: /
# Sorted robots.txt:
User-agent: Yandex
Disallow: /
Allow: /catalog
# only allows downloading pages
# starting with '/catalog'

# Source robots.txt:
User-agent: Yandex
Allow: /
Allow: /catalog/auto
Disallow: /catalog
# Sorted robots.txt:
User-agent: Yandex
Allow: /
Disallow: /catalog
Allow: /catalog/auto
# disallows downloading pages starting with '/catalog',
# but allows downloading pages starting with '/catalog/auto'.
If two conflicting directives have prefixes of the same length, the Allow directive takes precedence.
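The sorting rule can be illustrated with a short Python sketch. The function is_allowed and the (directive, prefix) tuple format are hypothetical. The sketch handles plain prefixes only (no * or $) and simply skips directives with an empty value.

```python
def is_allowed(path, rules):
    """Resolve Allow/Disallow rules for one path.

    rules: list of (directive, prefix) pairs, e.g. ("Disallow", "/catalog").
    """
    # Sort by prefix length; on equal length, place Allow last so it wins.
    ordered = sorted(rules, key=lambda r: (len(r[1]), r[0].lower() == "allow"))
    verdict = True  # no matching rule means the path may be downloaded
    for directive, prefix in ordered:
        if prefix and path.startswith(prefix):
            verdict = directive.lower() == "allow"  # the last match wins
    return verdict


rules = [("Allow", "/"), ("Allow", "/catalog/auto"), ("Disallow", "/catalog")]
print(is_allowed("/catalog/auto/1", rules))  # True
print(is_allowed("/catalog/moto", rules))    # False
```

Because the rules are sorted before they are applied, shuffling the input list does not change the result.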
Allow and Disallow directives without parameters
If the directives don't contain parameters, the robot handles data in the following manner:
User-agent: Yandex
Disallow: # the same as Allow: /

User-agent: Yandex
Allow: # ignored by the robot
Using the special characters * and $
You can use the special characters * and $ when specifying paths for the Allow and Disallow directives, which lets you define simple patterns. The * character indicates any sequence of characters (including an empty one). Examples:
User-agent: Yandex
Disallow: /cgi-bin/*.aspx # disallows '/cgi-bin/example.aspx'
                          # and '/cgi-bin/private/test.aspx'
Disallow: /*private # disallows both '/private'
                    # and '/cgi-bin/private'
The $ character
By default, the * character is appended to the end of every rule described in the robots.txt file. For example:
User-agent: Yandex
Disallow: /cgi-bin* # blocks access to pages
                    # starting with '/cgi-bin'
Disallow: /cgi-bin # the same
To cancel the * at the end of the rule, you can use the $ character, for example:
User-agent: Yandex
Disallow: /example$ # disallows '/example',
                    # but allows '/example.html'

User-agent: Yandex
Disallow: /example # disallows both '/example',
                   # and '/example.html'
The $ character doesn't cancel a * that is explicitly specified at the end of the rule, in other words:
User-agent: Yandex
Disallow: /example$ # prohibits only '/example'
Disallow: /example*$ # exactly the same as 'Disallow: /example'
                     # prohibits both /example.html and /example
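One way to picture how * and $ behave is to translate a rule into a regular expression. The following Python sketch does that. The helper names rule_to_regex and rule_matches are hypothetical, and the translation is an assumption made for illustration, not the robot's actual matching code.

```python
import re


def rule_to_regex(rule):
    """Translate a robots.txt path rule containing * and $ into a regex."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape every literal piece, then let each * stand for "any sequence".
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    if not anchored:
        pattern += ".*"  # the implicit * at the end of every rule
    return re.compile(pattern, re.DOTALL)


def rule_matches(rule, path):
    """Return True when the rule applies to the given URL path."""
    return rule_to_regex(rule).fullmatch(path) is not None


print(rule_matches("/example$", "/example.html"))   # False
print(rule_matches("/example*$", "/example.html"))  # True
```

The two calls reproduce the last example above: '/example$' covers only '/example', while '/example*$' behaves like a plain '/example'.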
If you use a Sitemap file to describe your site's structure, indicate the path to the file as a parameter of the Sitemap directive (if you have multiple files, indicate all paths). Example:
User-agent: Yandex
Allow: /
Sitemap: http://example.com/site_structure/my_sitemaps1.xml
Sitemap: http://example.com/site_structure/my_sitemaps2.xml
The robot will remember the path to your file, process your data, and use the results during the next visit to your site.
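As a rough illustration of how a crawler might collect these paths, the Python sketch below extracts every Sitemap value from the text of a robots.txt file. The function name sitemap_urls is hypothetical, and this is not a description of how the Yandex robot parses the file.

```python
def sitemap_urls(robots_txt):
    """Return all Sitemap values found in the robots.txt text, in order."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


EXAMPLE = """\
User-agent: Yandex
Allow: /
Sitemap: http://example.com/site_structure/my_sitemaps1.xml
Sitemap: http://example.com/site_structure/my_sitemaps2.xml
"""
print(sitemap_urls(EXAMPLE))
```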
If your site has mirrors, special mirror bots (Mozilla/5.0 (compatible; YandexBot/3.0; MirrorDetector; +http://yandex.com/bots)) detect them and form a mirror group for your site. Only the main mirror will participate in search. You can indicate which site is the main one in the robots.txt file. The name of the main mirror should be listed as the value of the Host directive.
The Host directive does not guarantee that the specified main mirror will be selected. However, the decision-making algorithm will assign it a high priority. For example:
#If www.main-mirror.com is your site's main mirror, then
#robots.txt for all your sites from the mirror group will look like this:
User-Agent: *
Disallow: /forum
Disallow: /cgi-bin
Host: www.main-mirror.com
To maintain compatibility with robots that may deviate from the standard when processing robots.txt, the Host directive needs to be added to the group that starts with the User-Agent record, right after the Disallow and Allow directives. The Host directive argument is the domain name, optionally followed by a colon and the port number (80 by default).
#Example of a well-formed robots.txt file, where
#the Host directive will be taken into account during processing
User-Agent: *
Disallow:
Host: www.myhost.com
The Host directive is cross-sectional, so the robot uses it regardless of its location in robots.txt.
Only one Host directive is processed per robots.txt file. If several directives are indicated in the file, the robot will use the first one.
Host: myhost.ru # is used
User-agent: *
Disallow: /cgi-bin

User-agent: Yandex
Disallow: /cgi-bin
Host: www.myhost.ru # is not used
The Host directive should contain:
- The protocol, set to HTTPS, if the mirror is only available via a secure channel.
- One valid domain name that conforms to RFC 952 and is not an IP address.
- The port number, if necessary.
An incorrectly formed Host directive will be ignored.
# Examples of Host directives that will be ignored
Host: www.myhost-.com
Host: www.-myhost.com
Host: www.myhost.com:100000
Host: www.my_host.com
Host: .my-host.com:8000
Host: my-host.com.
Host: my..host.com
Host: www.myhost.com:8080/
Host: 22.214.171.124
Host: www.firsthost.ru,www.secondhost.com
Host: www.firsthost.ru www.secondhost.com
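The requirements above can be approximated with a small Python check. This is a sketch under stated assumptions: the function name is_valid_host_value is hypothetical, the "https://" prefix form is assumed, the label pattern is a relaxed approximation of the RFC 952 hostname rules, and the port range 1-65535 is assumed. The robot's actual validation may differ.

```python
import ipaddress
import re

# One hostname label: letters, digits and inner hyphens only (approximation).
LABEL = re.compile(r"[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?")


def is_valid_host_value(value):
    """Return True if a Host value looks well formed under the assumptions above."""
    if value.startswith("https://"):  # assumed syntax for the HTTPS protocol
        value = value[len("https://"):]
    host, sep, port = value.partition(":")
    if sep:
        # The port must be digits only and within the assumed 1-65535 range.
        if not re.fullmatch(r"[0-9]{1,5}", port) or not 1 <= int(port) <= 65535:
            return False
    try:
        ipaddress.ip_address(host)
        return False  # IP addresses are not accepted
    except ValueError:
        pass
    return all(LABEL.fullmatch(label) for label in host.split("."))


for candidate in ("www.myhost.com", "www.my_host.com", "my..host.com"):
    print(candidate, is_valid_host_value(candidate))
```

All of the ignored examples listed above are rejected by this check, while a plain domain name with an optional port passes.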
Example of Host directive use:
# domain.myhost.ru is the main mirror for
# www.domain.myhost.com, so the correct use of
# the Host directive is:
User-Agent: *
Disallow:
Host: domain.myhost.ru
If the server is overloaded and it isn't possible to process downloading requests, use the Crawl-delay directive. You can specify the minimum interval (in seconds) for a search robot to wait after loading one page, before starting to load another.
To maintain compatibility with robots that may deviate from the standard when processing robots.txt, the Crawl-delay directive needs to be added to the group that starts with the User-Agent entry, right after the Disallow and Allow directives.
The Yandex search robot supports fractional values for Crawl-delay, such as 0.5. This does not guarantee that the search robot will access your site every half a second, but it may speed up site processing.
User-agent: Yandex
Crawl-delay: 2 # sets a 2 second time-out

User-agent: *
Disallow: /search
Crawl-delay: 4.5 # sets a 4.5 second time-out
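For illustration, the Python sketch below reads Crawl-delay values per User-agent group and keeps fractional values as floats. The function name crawl_delays and the grouping logic are assumptions made for this example, not the robot's actual parser.

```python
def crawl_delays(robots_txt):
    """Map each User-agent value to its Crawl-delay in seconds."""
    delays = {}
    agents = []        # User-agent values of the group being read
    in_rules = False   # True once the current group has a non-User-agent line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        name, sep, value = line.partition(":")
        if not sep:
            continue
        name, value = name.strip().lower(), value.strip()
        if name == "user-agent":
            if in_rules:  # a new group starts here
                agents, in_rules = [], False
            agents.append(value)
            continue
        in_rules = True
        if name == "crawl-delay":
            try:
                delay = float(value)  # fractional values such as 0.5 are kept
            except ValueError:
                continue  # malformed value: skip it
            for agent in agents:
                delays[agent] = delay
    return delays


EXAMPLE = """\
User-agent: Yandex
Crawl-delay: 2 # sets a 2 second time-out

User-agent: *
Disallow: /search
Crawl-delay: 4.5 # sets a 4.5 second time-out
"""
print(crawl_delays(EXAMPLE))
```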
If your site page addresses contain dynamic parameters that do not affect the content (e.g. identifiers of sessions, users, referrers etc.), you can describe them using the Clean-param directive.
Using this information, the Yandex robot will not repeatedly reload duplicate information. This will improve how efficiently the robot processes your site and reduce the server load.
For example, your site contains the following pages:
www.example.com/some_dir/get_book.pl?ref=site_1&book_id=123
www.example.com/some_dir/get_book.pl?ref=site_2&book_id=123
www.example.com/some_dir/get_book.pl?ref=site_3&book_id=123
The ref parameter is only used to track which resource the request was sent from, and does not change the content. All three addresses will display the same page with book_id=123. Then, if you indicate the directive in the following way:
User-agent: Yandex
Disallow:
Clean-param: ref /some_dir/get_book.pl
the Yandex robot will converge all the page addresses into one. If a page without the parameter is available on the site, all of these addresses will be reduced to that page after the robot indexes it. Other pages of your site will be processed more often, because the robot no longer needs to refresh the duplicate addresses.
The directive has the following syntax:
Clean-param: p0[&p1&p2&..&pn] [path]
In the first field, list the parameters that must be disregarded, separated by the & symbol. In the second field, indicate the path prefix for the pages the rule should apply to.
The prefix can contain a regular expression in a format similar to the one used in the robots.txt file, but with a few restrictions: only the characters A-Za-z0-9.-/*_ can be used. The * character is interpreted in the same way as in robots.txt, and a * is always implicitly appended to the end of the prefix. For example:
Clean-param: s /forum/showthread.php
means that the s parameter will be disregarded for all URLs that begin with /forum/showthread.php. The second field is optional; if it is omitted, the rule applies to all pages on the site. The rule is case sensitive. The maximum length of a rule is 500 characters. For example:
Clean-param: abc /forum/showthread.php
Clean-param: sid&sort /forumt/*.php
Clean-param: someTrash&otherTrash
#for these types of addresses:
www.example1.com/forum/showthread.php?s=681498b9648949605&t=8243
www.example1.com/forum/showthread.php?s=1e71c4427317a117a&t=8243
#robots.txt will contain:
User-agent: Yandex
Disallow:
Clean-param: s /forum/showthread.php

#for these types of addresses:
www.example2.com/index.php?page=1&sort=3a&sid=2564126ebdec301c607e5df
www.example2.com/index.php?page=1&sort=3a&sid=974017dcd170d6c4a5d76ae
#robots.txt will contain:
User-agent: Yandex
Disallow:
Clean-param: sid /index.php

#if there are several of these parameters:
www.example1.com/forum_old/showthread.php?s=681498605&t=8243&ref=1311
www.example1.com/forum_new/showthread.php?s=1e71c417a&t=8243&ref=9896
#robots.txt will contain:
User-agent: Yandex
Disallow:
Clean-param: s&ref /forum*/showthread.php

#if the parameter is used in multiple scripts:
www.example1.com/forum/showthread.php?s=681498b9648949605&t=8243
www.example1.com/forum/index.php?s=1e71c4427317a117a&t=8243
#robots.txt will contain:
User-agent: Yandex
Disallow:
Clean-param: s /forum/index.php
Clean-param: s /forum/showthread.php
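The effect of a Clean-param rule can be pictured as building a comparison key for each address, with the listed parameters removed. The Python sketch below does this. The function name apply_clean_param is hypothetical, and the example URLs carry an "http://" scheme so that the standard library can split them. Which address the robot finally keeps in its index is not modeled here.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def apply_clean_param(url, params, path_prefix=""):
    """Drop the listed query parameters when the URL path matches the prefix.

    params: parameter names joined with '&', as in the directive.
    path_prefix: optional prefix; * matches any sequence and a trailing *
    is implied. An empty prefix applies the rule to every page.
    """
    parts = urlsplit(url)
    pattern = ".*".join(re.escape(p) for p in path_prefix.split("*")) + ".*"
    if not re.fullmatch(pattern, parts.path, re.DOTALL):
        return url  # the rule does not apply to this address
    ignored = set(params.split("&"))  # names are compared case-sensitively
    kept = [pair for pair in parts.query.split("&")
            if pair and pair.split("=", 1)[0] not in ignored]
    return urlunsplit(parts._replace(query="&".join(kept)))


urls = [
    "http://www.example.com/some_dir/get_book.pl?ref=site_1&book_id=123",
    "http://www.example.com/some_dir/get_book.pl?ref=site_2&book_id=123",
    "http://www.example.com/some_dir/get_book.pl?ref=site_3&book_id=123",
]
print({apply_clean_param(u, "ref", "/some_dir/get_book.pl") for u in urls})
```

All three addresses collapse to the same key, which is why the robot can treat them as one page.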
The Yandex robot doesn't support robots.txt directives that aren't shown on this page. The file processing rules described above represent an extension of the basic standard. Other robots may interpret robots.txt contents in different ways.
The results when using the extended robots.txt format may differ from results that use the basic standard, in particular:
User-agent: Yandex
Allow: /
Disallow: /
# without extensions everything is disallowed since 'Allow: /' is ignored,
# with extension support everything is allowed

User-agent: Yandex
Disallow: /private*html
# without extensions '/private*html' is disallowed,
# but with extensions it disallows '/private*html',
# and '/private/test.html', and '/private/html/test.aspx' etc.

User-agent: Yandex
Disallow: /private$
# without extensions, '/private$' and '/private$test' etc. are disallowed,
# but with extensions, only '/private' is disallowed

User-agent: *
Disallow: /
User-agent: Yandex
Allow: /
# without extensions due to no empty line break,
# 'User-agent: Yandex' would be ignored and
# the result would be 'Disallow: /', but the Yandex robot
# selects entries that have 'User-agent:' in the line,
# so the result for the Yandex robot in this case is 'Allow: /'

User-agent: *
Disallow: /
# commentary1...
# commentary2...
# commentary3...
User-agent: Yandex
Allow: /
# same as in the previous example (see above)
Examples of extended robots.txt format use:
User-agent: Yandex
Allow: /archive
Disallow: /
# allows everything that starts with '/archive'; everything else is disallowed

User-agent: Yandex
Allow: /obsolete/private/*.html$ # allows html files
                                 # at the path '/obsolete/private/...'
Disallow: /*.php$ # disallows all '*.php' on site
Disallow: /*/private/ # disallows all subpaths containing
                      # '/private/', but the Allow above negates
                      # part of the disallow
Disallow: /*/old/*.zip$ # disallows all '*.zip' files containing
                        # '/old/' in the path

User-agent: Yandex
Disallow: /add.php?*user=
# disallows all 'add.php?' scripts with the 'user' parameter
When forming the robots.txt file, you should keep in mind that the robot places a reasonable limit on its size. If the file size exceeds 32 KB, the robot assumes it allows everything, meaning it is interpreted the same way as:
User-agent: Yandex
Disallow:
Similarly, robots.txt is assumed to allow everything if it couldn't be accessed (for example, if the HTTP headers are not set properly or a 404 Not found HTTP status message is returned).
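The fallback rules above can be summarized in a short Python sketch. The function name robots_policy is hypothetical. Reading "32 KB" as 32 * 1024 bytes and using a UTF-8 decode as the "is it a text file" check are both assumptions made for illustration.

```python
MAX_ROBOTS_SIZE = 32 * 1024  # assumed byte value of the 32 KB limit


def robots_policy(status, body):
    """Decide how a fetched robots.txt is treated.

    status: HTTP status code of the response.
    body: response body as bytes, or None if nothing was received.
    Returns "allow-all" when the file is treated as permitting everything,
    or "parse" when its directives should be applied.
    """
    if status != 200 or body is None:
        return "allow-all"  # missing or inaccessible file
    if len(body) > MAX_ROBOTS_SIZE:
        return "allow-all"  # over the size limit
    try:
        body.decode("utf-8")  # stand-in check for "is a text file"
    except UnicodeDecodeError:
        return "allow-all"
    return "parse"


print(robots_policy(404, b""))
print(robots_policy(200, b"User-agent: *\nDisallow: /\n"))
```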
A number of Yandex robots download web documents for purposes other than indexing. To avoid being unintentionally blocked by site owners, they may ignore the robots.txt restrictions intended for arbitrary robots.
It's also possible to partially ignore robots.txt restrictions for certain sites if there is an agreement between “Yandex” and the owners of those sites.
Here is a list of Yandex robots that don't follow general limiting rules in robots.txt:
YaDirectFetcher downloads ad landing pages to check their availability and content. This is compulsory for placing ads in Yandex search results and on YAN partner sites.
YandexCalendar regularly downloads calendar files requested by users, even if they are located in directories that are blocked from indexing.
YandexDirect downloads information about YAN partner site content in order to clarify what their topics are so that relevant ads can be selected.
YandexDirectDyn is the robot that generates dynamic banners.
YandexMobileBot downloads documents for analysis in order to determine if their page layouts are suitable for mobile devices.
YandexAccessibilityBot downloads pages to check how accessible they are for users.
YandexScreenshotBot takes a screenshot of a page.
YandexMetrika is the Yandex.Metrica robot.
YandexVideoParser is the Yandex.Video indexer.
To prevent this behavior, you can restrict access for these robots to some or all of your site by addressing Disallow directives to them in robots.txt, for example:
User-agent: YaDirectFetcher
Disallow: /

User-agent: YandexMobileBot
Disallow: /private/*.txt$