The robots meta tag is placed in the pages of your site and has the purpose of giving the spider instructions, indicating which pages to index and which ones to skip.
Here is its syntax:
- INDEX tells the spider to store the page in the database
- NOINDEX tells the spider not to store the page in the database
- FOLLOW tells the spider to follow the links on the page
- NOFOLLOW tells the spider not to follow the links on the page
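These values are combined in the content attribute of a single tag. A minimal sketch of the tag itself (index,follow is only one possible combination; substitute the values you need):

<meta name="robots" content="index,follow">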
The robots tag must be inserted in the head section of the page, between the <head> and </head> tags, as in this example:

<html>
<head>
<title>The title of the page</title>
<meta name="robots" content="index,follow">
</head>
<body>
The content of your web page
</body>
</html>
As Google navigates through pages, it adds the content of the indexed pages to its database. The saved content is called the cached version, and can be viewed by clicking on the "Cached" link in the search results. If you don't want the content to be saved in the Google database, use this tag:

<meta name="robots" content="noarchive">
This will not stop Google from indexing your page; it just avoids saving a copy whose outdated versions you may find inappropriate to present. If you don't want the page indexed at all, you will still need to use the "noindex" value.
Another alternative to the above is to address the directive to the Google spider specifically, rather than to every agent. This allows the other engines to store the data, but not Google.
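A sketch of the targeted tag, which names googlebot (Google's spider) in place of "robots" so that only Google skips the cache:

<meta name="googlebot" content="noarchive">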
The robots.txt file must be placed in the root of the site and is composed of records with two fields:
User-agent:
Disallow:
In the User-agent field you put the name of the spider; the symbol * refers to all spiders.
In the Disallow field you state what you do not want the spider to fetch. For example, to tell Google not to fetch the soultricks.htm file:
User-agent: googlebot
Disallow: /soultricks.htm
Example of a record:
User-agent: googlebot
Disallow: /testi.html
Disallow: /poesie/
The record above tells Google ("googlebot" is the name of the Google spider) that it is not allowed to download the testi.html file or to access the "poesie" directory and its contents, including subdirectories. Notice how the file name is preceded by a "/" character (which indicates the site's root directory) and how the directory name also ends with a "/" character.
The User-agent field may contain an asterisk "*", a synonym for "any spider". So the following example tells all spiders not to fetch the temporaneo.html file:
User-agent: *
Disallow: /temporaneo.html
The Disallow field can contain a "/" character to indicate "any file and directory". The following example prevents scooter (Altavista's spider) from fetching anything:
User-agent: scooter
Disallow: /
Finally, the Disallow field can be left blank, indicating that there are no files or directories you want to keep from being fetched. The following example shows how to tell all search engines to fetch every file on the site:
User-agent: *
Disallow:
Example of a robots.txt file
The robots.txt file is made up of one or more records, each addressing different spiders. Here is a complete example of a robots.txt file that blocks Altavista completely, prevents Google from accessing some files and directories, and leaves free access to all other search engines.
User-agent: scooter
Disallow: /
User-agent: googlebot
Disallow: /intestazione.html
Disallow: /links.html
Disallow: /temporaneo/
Disallow: /cgi-bin/
User-agent: *
Disallow:
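To check what a record actually blocks, you can test the file programmatically. A minimal sketch in Python, using the standard library's urllib.robotparser against the example above (the host example.com and the file altro.html are hypothetical, used only for the test URLs):

from urllib.robotparser import RobotFileParser

# The complete example above, as a list of lines.
rules = """\
User-agent: scooter
Disallow: /

User-agent: googlebot
Disallow: /intestazione.html
Disallow: /links.html
Disallow: /temporaneo/
Disallow: /cgi-bin/

User-agent: *
Disallow:
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# scooter (Altavista) is blocked from everything.
print(rp.can_fetch("scooter", "http://example.com/index.html"))    # False
# googlebot may not fetch the listed files and directories...
print(rp.can_fetch("googlebot", "http://example.com/links.html"))  # False
# ...but may fetch anything else.
print(rp.can_fetch("googlebot", "http://example.com/altro.html"))  # True
# Any other spider may fetch everything.
print(rp.can_fetch("slurp", "http://example.com/links.html"))      # True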
Here is a list of some spiders:
Spider        Search engine
========================
googlebot     Google
fast          Fast – Alltheweb
slurp         Inktomi – Yahoo!
scooter       Altavista
mercator      Altavista
Ask Jeeves    Ask Jeeves
teoma_agent   Teoma
ia_archiver   Alexa – Internet Archive