Legal ===== Copyright (c) 2014 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. This document may not be modified, and derivative works of it may not be created, except to format it for publication as an RFC or to translate it into languages other than English. This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/1id-abstracts.html The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html Abstract ======== The Gopher protocol is over twenty years old. Changing practices and unofficial extensions have caused Gopher as currently used to differ, but remain largely compatible with, the standard established in its official governing document, *The Internet Gopher Protocol (a distributed document search and retrieval protocol)*, known within the Gopher community as *RFC 1436*. Therefore, this document attempts to establish a contemporary specification of the Gopher communications protocol, departing as little as possible from current practice. 
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119. Furthermore, backticks (`) around a string mean that it is to be interpreted literally. Goals ===== The over-riding aim of this document is to author a contemporary specification of the Gopher world-wide information system, without falling short of reflecting actual practice, without attempting to re-invent the wheel, and without breaking compliance with RFC 1436. In particular, this document shall attempt to describe, and, where necessary, update current practice as regards the means of handling errors, line and file terminators, policy files, TITLE selectors, the URL: re-direction scheme, and new selector types not compliant with RFC 1436. Most importantly, this document is not to be construed as a replacement for RFC 1436; it merely brings this original document into the modern age. What is Gopher? =============== Gopher is a lightweight, client/server-oriented query/answer protocol, functioning as a world-wide information system (WWIS). It accomplishes its purpose by facilitating access to other servers around the world, be they running Gopher or not. The protocol and software permit users of a heterogeneous variety of desktop systems to browse, search, and retrieve documents residing on multiple distributed server machines. Faster and more hierarchical than HTTP, Gopher provides the ideal method for transmitting information from and to mobile devices. Basic Gopher Transactions ========================= There are four broad forms of basic transactions in Gopher: * Menu Transaction; * Index Transaction; * Simple Text Transaction; and * Binary Transaction. The precise composition of these transactions is elucidated below. Menu Transaction ---------------- Client : [Open Connexion] Client : Send [selector] Server : Send Server : Send . 
Server : [Close Connexion] Index Transaction ----------------- Client : [Open Connexion] Client : Send [selectorquery parameters] Server : Send [] Server : Send . Server : [Close Connexion] Simple Text Transaction ----------------------- Client : [Open Connexion] Client : Send [selector] Server : Send [] Server : Send . Server : [Close Connexion] Binary Transaction ------------------ Client : [Open Connexion] Client : Send [selector] Server : Send [] Server : DO NOT send . Server : [Close Connexion] The fourth step of each transaction, with the exception of the binary type, is OPTIONAL. Servers MAY send a full-stop character after sending a menu, index, or text; if they do, clients MUST accept it. Further information may be found in the appropriate sub-section. Gopher servers are normally found on TCP port 70. Clients MUST assume this port if no other port is specified. When a client opens a connection to a server, the server MUST accept the connection but say nothing, waiting for a CR/LF-terminated selector string from the client. The client MAY then send the selector string followed by CR/LF (or nothing to retrieve the root menu from the server, which MUST always be type 1). The server MUST then send the requested content and close the connection. Line Terminators ================ ASCII, the international standard that governs the interchange of plain-text information between computer systems, is nothing more or less than a table mapping each character (letter, number, space, or symbol) to a numerical code, which is then converted to binary and written to disc. Its necessity was seen long before the advent of the electronic monitor, so some of its more unique quirks must be understood in view of the time period of which it was a product. Historically, input and output was through a specially-adapted typewriter, and the ASCII convention reflects this in the codes it uses to terminate lines of text. 
In ASCII, there are two codes, both having physical equivalents in the real world, that signal the end of the line: the Carriage Return (abbreviated C/R, CR, or c/r) and the Line Feed (abbreviated L/F, LF, or l/f). Originally, the term *carriage return* was used for a command that caused the assembly holding the paper (the carriage) to return to the right so the machine was ready to type again on the left side of the paper (assuming a left-to-right language). On the other hand, the *line feed* moved the paper upwards, allowing the carriage to type on the following line. Different operating systems traditionally signal the end of a line in different ways. UNIX and its descendants (including Mac OS X), the operating systems most likely to run on a server, use the line feed alone. CP/M, DOS, and Microsoft Windows use the sequence of carriage return and line feed (CR/LF). Obsolete versions of Mac OS (up to, and including, System 9) use the carriage return alone. All programmes using Gopher MUST always use the Microsoft standard of CR/LF, irrespective of the operating system they run on. Both internal Gopher commands and policy files MUST comply with this standard. Other text files SHOULD use standard Gopher format, but this is not strictly required as a matter of technical form; the client MUST be capable of converting to and from all variants of line terminators. The recommendation stands for the benefit of non-compliant clients only. 
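The conversion requirement above can be illustrated with a minimal sketch. It assumes the text has already been decoded into a string; the function names `to_gopher_lines` and `to_local_lines` are hypothetical and are not defined by this document or by RFC 1436.

```python
import re

# CR/LF is listed first so that a CR/LF pair is matched as one terminator
# rather than as a lone CR followed by a lone LF.
_ANY_TERMINATOR = re.compile(r"\r\n|\r|\n")


def to_gopher_lines(text: str) -> str:
    """Rewrite every line terminator (LF, CR, or CR/LF) as CR/LF."""
    return _ANY_TERMINATOR.sub("\r\n", text)


def to_local_lines(text: str, local: str = "\n") -> str:
    """Rewrite every line terminator as `local` (LF by default).

    A function is used as the replacement so that `local` is inserted
    verbatim, without being interpreted as a regex replacement template.
    """
    return _ANY_TERMINATOR.sub(lambda _match: local, text)
```

A server might apply `to_gopher_lines` before sending a text file, and a client might apply `to_local_lines` before saving one; both functions leave already-converted text unchanged.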
Selector Formats
================

Type Codes
----------

The following item types are defined by RFC 1436:

    Type  Treat As  Meaning
    0     TEXT      Plain text file
    1     MENU      Menu
    2     EXTERNAL  CCSO flat database (formerly used as telephone
                    directories); other databases
    3     ERROR     Error message
    4     TEXT      Macintosh BinHex file
    5     BINARY    Binary archive (zip; rar; 7-Zip; gzip; tar)
    6     TEXT      UUEncoded archive
    7     INDEX     Query a search engine or CGI script
    8     EXTERNAL  Telnet to: VT100 series server
    9     BINARY    Binary file (see also 5)
    +     -         Redundant server
    T     EXTERNAL  Telnet to: tn3270 series server
    g     BINARY    GIF format graphics file (TODO: Why not use I?)
    I     BINARY    Any image file

The `+` item type indicates a mirror of the previous item in the menu, and MUST behave as though it had the same type as that entry. For example:

    5Download the software^I/software.zip^Igopher.example.com^I70
    +example.net mirror^I/example.com/software.zip^Imirror.example.net^I70
    +Another mirror^I/software.zip^Imirror2.example.com^I70

Additionally, the following item types have been in common use and are standardised here. If a client does not have the capability to display a particular item type, it SHOULD treat it as a more generic item type, passing it off to the operating system (itemtype p "implies" itemtype 0, etc.).

    Type  Treat As  Meaning
    c     BINARY    Calendar file (Kim Holviala)
    d     BINARY    Word-processing document (MS Word; OpenOffice.org;
                    WordPerfect); PDF document
    h     TEXT      HTML document
    i     -         Informational text (not selectable)
    p     TEXT      Page layout or markup document (TeX; LaTeX;
                    PostScript; Rich Text Format). These documents are
                    all plain text, but contain ASCII "tags" that make
                    the document prettier when sent through a special
                    program.
    m     BINARY    Electronic mail repository (also known as MBOX)
                    (Kim Holviala)
    s     BINARY    Audio recordings (files that consist of audible,
                    but no visible, data) (Wesley Teal)
    x     TEXT      eXtensible Markup Language document (Wesley Teal)
    ;     BINARY    Video files (files that consist of both audible
                    and visible data) (Wesley Teal)

Item types `4`, `6`, `h`, `p`, and `x` SHOULD be sent as text (itemtype 0). This way, the text appears directly on the user's terminal without being downloaded (unless the appropriate command, such as `CTRL/S`, is given to the client).

It is vital to note that text information can be sent via binary (with the minor inconvenience noted above), as binary files contain a greater range of information than ASCII. However, binary files, if sent via text, will be irreparably ruined, as this effectively passes raw eight-bit data through an ASCII filter. In the case of confusion, the owner/operator of the server should simply mark the file as binary to ensure that it transfers safely.

Gopher Menus
------------

Menu (type 1) content has the following format:

    T<text>^I<selector>^I<server>^I<port>

Where:

* `^I` is the ASCII character corresponding to the `Tab` key
* `T` is the type code, which MUST be run together with the item text `<text>`
* `<selector>` is the selector string to send to the specified server
* `<server>` is the server to send the selector to
* `<port>` is the port on the server to connect to

In addition, the server MAY place additional information after the port, preceded by a tab. (Gopher+, for instance, adds `^I+`.) The client SHOULD ignore this unless it has a good reason not to.

Note on `i` item type: For the `i` item type, Selector, Server, and Port are mostly ignored, but MUST be there anyway. In that case, the host SHOULD be set to placeholder value `example.com`, and the port SHOULD be set to placeholder value `0` (zero). One exception to their being ignored is TITLE entries. These have TITLE as the selector value; host and port SHOULD again be set to the aforementioned placeholder values.
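The menu line layout described above can be sketched as a small parser. This is an illustration under stated assumptions rather than a reference implementation: the names `MenuItem` and `parse_menu_line` are hypothetical, the line is assumed to be already decoded to a string, and malformed lines are simply skipped instead of being reported.

```python
from typing import NamedTuple, Optional


class MenuItem(NamedTuple):
    item_type: str  # single-character type code
    text: str       # item text shown to the user
    selector: str   # selector string to send to the server
    host: str       # server to send the selector to
    port: int       # port on that server


def parse_menu_line(line: str) -> Optional[MenuItem]:
    """Split one menu line into its fields, or return None if it is not an item."""
    line = line.rstrip("\r\n")
    # An empty line or a lone full stop (the RFC 1436 terminator) carries no item.
    if line in ("", "."):
        return None
    fields = line.split("\t")
    if len(fields) < 4:
        return None
    # Anything after the port (for example a Gopher+ "+") is ignored.
    head, selector, host, port = fields[:4]
    if not head or not (port.isascii() and port.isdigit()):
        return None
    # The type code is run together with the item text.
    return MenuItem(head[0], head[1:], selector, host, int(port))
```

For example, `parse_menu_line("1Software\t/software\tgopher.example.com\t70\r\n")` yields an item of type `1` pointing at port 70 of `gopher.example.com`. Returning `None` for anything unparseable keeps the caller's loop simple, at the cost of hiding malformed entries from the user.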
### Note on the terminating full stop ### Per RFC 1436, a terminating full stop (.) character followed by CR/LF should be sent on a line by itself after the end of the content, with exceptions for binary data. This terminating full stop has caused no end of trouble ever since. Many, if not most, modern Gopher servers omit this terminating full stop. Therefore, the practice suggested in RFC 1436 is DEPRECATED and the following practice is RECOMMENDED. * Servers MAY send the full stop; clients MUST accept it * Servers SHOULD send the full stop after menus and may OPTIONALLY send it after other files * Clients SHOULD display the full stop at the end of menus, if sent, to notify the user that this is the end of the menu * Clients SHOULD NOT include the full stop in other output, in case that output has some significance which the full stop may disrupt. * Clients SHOULD NOT consider a full stop significant, unless it occurs immediately before the connection is terminated. Errors ------ Although undesirable in communication, errors do occur in Gopher, and their handling is crucial for a user-friendly, and standards-compliant, Gopher experience. When an error is encountered, the server MUST return a menu whose first item bears itemtype `3`. All other ways of signalling an error, such as redirecting to a Gopher error menu, an image, or (worst of all) an HTML page, are PROHIBITED. The selector string for itemtype `3` is the text of the error. It is the responsibility of the server application to have understandable and accurate strings for error handling. As they are well-understood and common, HTTP-style error codes are acceptable and RECOMMENDED; however, they SHOULD also be followed by a clear, legible description of the error in both English and the local language. ### Error Codes ### This is a listing of HTTP-style error codes used in Gopher; due to Gopher's simplicity, it lacks most of the errors possible in HTTP. 
Codes beginning with 4 can generally be traced to the client; codes beginning with 5 are usually due to the server. 400 Bad Request The request could not be understood by the server due to malformed syntax. 401 Unauthorised The request requires authentication. For example, the received query value (as password) does not match the expected value. 403 Forbidden The request was received, but not filled. 404 Not Found The server could not find anything matching the requested URL. If the condition is known to be permanent, use error code 410 (Gone). 408 Request Time-out The client did not produce a request within the time that the server was prepared to wait. 410 Gone The requested resource is no longer available at the server and no forwarding address is known. This condition is expected to be considered permanent. If this is unknown, use error code 404 (Not Found). 500 Internal Server Error The server encountered an unexpected condition which prevented it from fulfilling the request. 501 Not Implemented The server does not support the functionality required to fulfill the request. 503 Service Unavailable The server is currently unable to handle the request due to temporary overload/ maintenance. Titles in Gopher ---------------- No mention of menus with titles exists per RFC 1436. When one simply browses about Gopherspace, this does not matter; for bookmarking and Gopher crawlers, such as Veronica-2, however, this presents a large problem. A Gopher TITLE resource has the following format: i^ITITLE^Iexample.com^I0 It is identical to a standard informational resource (itemtype `i`); the selector string, however, is set to the specific value, `TITLE`. 
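As an illustration only, a `TITLE` resource in the format above could be produced and recognised along the following lines. The helper names `make_title_line` and `title_text` are hypothetical, and the sketch assumes that the title text directly follows the `i` type code, as the item text does in any other menu entry.

```python
from typing import Optional

PLACEHOLDER_HOST = "example.com"  # dummy server value for TITLE entries
PLACEHOLDER_PORT = 0              # dummy port value for TITLE entries


def make_title_line(text: str) -> str:
    """Build a TITLE menu line for `text`, terminated by CR/LF."""
    if any(ch in text for ch in "\t\r\n"):
        raise ValueError("title text must not contain tab, CR, or LF")
    return f"i{text}\tTITLE\t{PLACEHOLDER_HOST}\t{PLACEHOLDER_PORT}\r\n"


def title_text(line: str) -> Optional[str]:
    """Return the title text if `line` is a TITLE resource, else None."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) >= 4 and fields[0][:1] == "i" and fields[1] == "TITLE":
        return fields[0][1:]
    return None
```

A client that does not call anything like `title_text` still sees an ordinary informational line, which is why the format degrades gracefully.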
The composition of the above format is as follows: * `^I` is the ASCII character corresponding to a press of the `Tab` key * The type code MUST be `i` (information) * The selector string MUST be `TITLE` * There is no server to connect to; the dummy text used in place of the server SHOULD be `example.com` * There is no port to connect to; the placeholder number SHOULD therefore be `0` (zero). A Gopher client that conforms to the above `TITLE` standard SHALL render it in one of two ways, depending on the placement of the resource. If the `TITLE` is the *first* resource in the document, it SHALL be considered its principal `TITLE` and used *wherever a principal title is needed* (window headings, bookmarks, etc.); furthermore, it SHOULD be rendered in a different size, font, and/or colour to the remainder of the document. In *all other* cases, it SHALL be considered a subordinate `TITLE` and SHOULD be rendered in a different size, font, and/or colour to the remainder of the document, but smaller and/or with less emphasis than the main title. If a non-standards compliant Gopher client receives a `TITLE` resource as per above, it will render it as plain informational text. As the main `TITLE` must be on the first line of a menu, it will appear visually similar to a title in any case, although not rendered as such. Linking to Web Addresses ------------------------ It is now possible, and standard, to link to documents, preferably in HTML, on the World Wide Web, Gopher's younger, more widespread cousin, from Gopher itself, using a two-part system: a `URL:` selector on the Gopher (local) end, and a *redirect page* (following rules as set out below) on the HTTP (remote) end. There are no compliance requirements for Gopher servers, with one exception: servers MUST follow the bulleted list located immediately after the example redirect page. A Gopher client SHALL, when it sees a selector with a path starting with `URL:`, interpret the path as a URL. 
It SHALL ignore the host and port components of the Gopher selector, using those components from the URL instead, if applicable.

`URL:` selectors SHOULD NOT be used if it is possible to link to the required content and protocol by any other means. In particular, the following protocols SHALL NOT be used with the `URL:` selector:

* gopher
* telnet (VT100-compatible)
* tn3270

Authors SHOULD NOT link to any document not of HTML type unless absolutely necessary; linking to non-HTML documents will break compatibility with non-compliant Gopher browsers.

A Gopher `URL:` selector MUST take the following format:

    h<text>^IURL:<address>^I<host>^I<port>

`URL:` selectors are, for the most part, identical to standard HTML selectors, but composed of particular data:

* The item type corresponds to the type of document on the remote end. Most typically, this is a Web page authored in HTML; therefore, the item type is most commonly `h`.
* `<text>` is the text of the link; this can be almost anything.
* `<address>` is the full URL, preceded by the string `URL:`. For example, this could be `URL:http://www.example.com`
* `<host>` is the server that the link *originated* from; this MUST be ignored by a compliant client, but MUST also be sent by a compliant server
* `<port>` is the port that the link *originated* from; this MUST be ignored by a compliant client, but MUST also be sent by a compliant server

It is possible for a non-compliant Gopher client to follow a link to an HTML page, as long as the server is compliant, by the following means: when the client receives a command to follow a `URL:` selector, it will contact the server that provided the menu, as the originating host and port are *mandatory* per this standard.

When a Gopher server receives a request from a client beginning with the string `URL:`, it SHALL write out an HTML document that redirects the browser to the appropriate place. A conforming example of such a document is as follows:

    <HTML>
    <HEAD>
    <META HTTP-EQUIV="refresh" CONTENT="2;URL=http://www.example.com/">
    </HEAD>
    <BODY>
    You are following an external link to a Web site. You will be
    automatically taken to the site shortly. If you do not get sent
    there, please click
    <A HREF="http://www.example.com/">here</A> to go to the web site.
    <P>
    The URL linked is:
    <P>
    <A HREF="http://www.example.com/">http://www.example.com/</A>
    <P>
    Thanks for using Gopher!
    </BODY>
    </HTML>

The document may take any form the server authors desire, but MUST adhere to the following requirements.

* It SHALL provide a refresh of a duration of 10 seconds or less.
* It SHALL NOT use `IMG` tags, frames, or have any reference whatsoever to content outside that particular file, with the sole exception of the link to the real destination.
* It SHALL NOT use JavaScript.
* It SHALL adhere to the W3C HTML 3.2 standard.

When a non-compliant Gopher client finds a reference to an HTML file (type `h`), it will open up the file via Gopher, receiving the redirect document using a Web browser. The Web browser will then be redirected to the actual link destination. Compliant Gopher clients will simply render the target directly.

Algorithm to use with selectors
-------------------------------

Here is a description of a hypothetical algorithm for parsing item types, splitting them into levels of interaction.

    PROTOCOL
    --------
    Type             Description    What to do
    0                Brief text     Render directly line by line.
    1                Menu           Request and analyse menu. If it
                                    contains '3' error node, print
                                    error. Else, render menu in new
                                    window.
    7                Index/Search
                     Server

    DATA NODES
    ----------
    Type             Description    What to do
    4, 9, g, I, c,   Binary file    Request and analyse file. If it
    d, m, s, ;                      contains '3' error node, print
                                    error. Else, does plug-in exist?
                                    If yes, display. If no, save to
                                    disc.
    6, p, x          Text file      Request and analyse file. If it
                                    contains '3' error node, print
                                    error. Else, print on screen.
    h, 2, 8, T       Link           Treat as URL.
    5                Archive File   Request and analyse file. If it
                                    contains '3' error node, print
                                    error. Else, does plug-in exist?
                                    If yes, display. If no, save to
                                    disc.

For instance, if the client is incapable of handling images as it is text-only, the algorithm above would have it save to disc.

Representation of Gopher Addresses
==================================

This section is greatly indebted to RFC 4266.
A Gopher address, or uniform resource locator, takes the form:

    gopher://<host>:<port>/<gopher-path>

where `<gopher-path>` is one of:

* `<gophertype><selector>`
* `<gophertype><selector>%09<search>`
* `<gophertype><selector>%09<search>%09<gopher+_string>`

If `:<port>` is omitted, the port defaults to 70. `<gophertype>` is a single-character field to denote the Gopher type of the resource to which the URL refers. The entire `<gopher-path>` may also be empty, in which case the delimiting `/` is also optional and the `<gophertype>` defaults to `1`.

`<selector>` is the Gopher selector string. Selector strings are arbitrary sequences of characters; they MUST NOT, however, contain the characters corresponding to horizontal tab, line feed, or carriage return. Gopher clients specify which item to retrieve by sending the Gopher selector string to a Gopher server. It is important to know that within the `<selector>` itself, there are no reserved characters, so one may be arbitrarily creative when creating selector names.

Note that some Gopher `<selector>` strings begin with a copy of the `<gophertype>` character, in which case that character will occur twice consecutively. The Gopher selector string may be an empty string; this is how Gopher clients refer to the top-level directory on a Gopher server.

If the URL refers to a search to be submitted to a Gopher search engine, the selector is followed by an encoded tab `%09` and the search string. To submit a search to a Gopher search engine, the Gopher client sends the `<selector>` string (after decoding), a tab, and the search string to the Gopher server.

Gopher Policy Files
===================

It is often useful to provide information to Gopher clients that MAY, but need not, be read by a human being. It is for this reason that policy files exist. This document enumerates two types of policy files, formally known as the Capability Policy and the Robot Access Restriction Policy, but also informally known under their filenames: `caps.txt` and `robots.txt`, respectively.

Capability Policy
-----------------

It is RECOMMENDED, when hosting a public-access Gopher server, to include a capability policy.
Although it is, ultimately, the choice of the owner or operator of the server, a capability policy (or caps file) can be useful for clients querying the server for certain information without using extensions such as Gopher+. The purpose of a capability policy is so that a server can instruct a client on how properly to parse selectors in its filesystem; it ensures that the client can understand how files on the server are organised. The scheme used in the current implementation of caps can handle POSIX (UNIX and related operating systems), FAT/NTFS (used by Microsoft Windows), and HFS (used by all versions of Apple Mac OS, including OS X, which is otherwise POSIX-compatible). For technical reasons, capability policies cannot handle VMS or Files-11 paths; however, owing to their open interface, the specification can be arbitrarily extended. A capability policy is quite simple in its composition: it is a plain text file with no more than seventy characters per line in the root directory of a Gopher server with the name caps.txt and beginning with the six characters CAPS[CR][LF] Because of the constrained name and location of the policy, it is a trivial matter to verify if one exists or not; the address is always of the form , with the real name of the server substituting for `example`. The server should accept both `caps.txt` and `/caps.txt` as selectors, and return the same content for both. A caps file contains *keys*, *values*, and *comments*. Keys can be compared to labelled containers for data; for instance, the key `ServerSoftware` is a container for the name of the Gopher software running on the server. Keys in capability policies are always alphanumeric (i.e., composed of letters and numbers only) and generally are in CamelCase (each individual word within the key capitalised). The data in these containers is called a value; values can use letters, numbers, and symbols. Keys and values are connected by the equals (=) sign. 
Any amount of whitespace (spaces and tabs) around the equals sign is acceptable. Anything not conforming to the syntax SomeKey = Value is ignored (treated as a comment). To be standards-compliant, comments must begin with a hash (#) sign. More importantly, they must be on a line to their own. Below is an example caps file. CAPS # These four characters must be at the beginning to identify the file # as successfully fetched. # This is a caps file. This contains a list of key=value pairs that are # useful to clients wishing to query the server for special information # without using extensions such as Gopher+. Not all clients support caps # queries, so your site should be navigable without it. This is an # optional feature and is not currently a Gopher protocol standard. # Blank lines and lines starting with # are ignored. # Clients should cache the information where possible. # Some servers may automatically generate caps files for you as a pseudo # selector. In that case, this file may serve only to supersede the # machine-generated keys. You should read your server documentation. # # To use this file, customize it and place it in your server's root mountpoint # such that a fetch for selector "caps.txt" will retrieve it. # # All keys are optional. Not all keys listed here need be specified, and # in fact many sites won't specify all of them. The client should be # prepared to deal with that too. ### CAPS META PROPERTIES ### # # Spec version of this caps file. This should be the first key specified # so that an incompatible later format might be detected by the client. CapsVersion=1 # This tells the client the recommended caps cache expiry time, in seconds. # This particular property tells the client to refetch the caps file after # an hour has passed, preferentially. This is optional for the client to # implement. ExpireCapsAfter=3600 ### PATH SECTION. 
USE THESE DEFAULT VALUES IF YOU ARE ON A POSIX FILESYSTEM ### # This tells the client how to cut up a selector into a breadcrumb menu. # This is a simple ASCII string. If it is not specified, the selector is # treated as if it were opaque. The client may collapse consecutive # delimiters (e.g., x//y is treated as x/y) except if PathParentDouble is # true (for Mac). PathDelimeter=/ # This tells the client what the "identity" path is, i.e., it can treat # this as a no-op, turning x/./y into x/y. If this is not specified, the # literal path . is used. PathIdentity=. # This tells the client what the parent path is, i.e., it can treat this # as a path instruction to delete previous path, turning x/y/../z into x/z # If this is not specified, the literal path .. is used. PathParent=.. # This tells the client that consecutive path delimiters are treated as # parent (mostly for Mac HFS prior to Mac OS X), e.g., turning # MacHD:x:y:::z into MacHD:z. If this is not specified, it is default FALSE. PathParentDouble=FALSE # This tells the client the escape character for quoting the above # metacharacters. Most of the time this is \\. If this is not specified, # no escape characters are used. PathEscapeCharacter=\\ # This tells the client not to cut everything up to the first path delimiter. # Normally caps makes gopher://x/11/xyz and gopher://x/1/xyz both into /xyz # assuming your server is happy with the latter URL (almost all will be). # If this is not specified, it is by default FALSE. This should be TRUE # *only* if your server requires URLs like gopher://x/0xyz (i.e., the # selector should NOT start with the path delimiter). PathKeepPreDelimeter=FALSE ### OTHER PROPERTIES ### # # Some clients will or may make use of these; some won't. # Freetext description of the server software and server hardware. 
ServerSoftware=Bucktooth ServerSoftwareVersion=0.2.9 ServerArchitecture=AIX ServerDescription=IBM Power 520 Express, dual 4.2GHz POWER6 CPU, 8GB RAM ServerGeolocationString=Southern California, USA # Special server features. ServerSupportsStdinScripts=TRUE # An E-mail contact for the server. ServerAdmin=gopher@floodgap.com Robot Access Restrictions Policy -------------------------------- WWIS robots, also known as spiders, crawlers, or wanderers, are computer programmes that, without human intervention, recursively travel throughout linked pages or directories on an information system (that is, by repeatedly travelling up and down a tree) and store the copies of these files at an independent location. The process of programmatically gathering information in this manner is called crawling or spidering. Many sites, in particular search engines (such as Google on the World Wide Web, or Veronica on Gopher), use spidering as a means of providing up-to-date data. Robots are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Robots can also be used for redundancy; data can be preserved by a third party in case the original server becomes inaccessible. In 1993 and 1994, however, there were occasions where robots had visited locations on the Web at which they were not welcome. Inexperienced or heavy-handed use of robots caused situations where servers were swamped with requests at a high rate of speed; or, the same files were retrieved repeatedly. Both could cause denial of service. In other situations, robots traversed parts of servers that were unsuitable, such as temporary information or server-side scripts, especially those with side-effects (such as polls). Abuse of robots was also an issue, and continues to be one now; for instance, electronic mail addresses have been harvested with knowing intent to distribute unsolicited mail ('spam'). 
These incidents indicated the need for established mechanisms for Gopher servers to indicate to robots which parts of their server should not be accessed. This standard addresses this requirement with an operational solution, adapted from the identical method used on sites using the Hypertext Transfer and File Transfer Protocols.

The method used to exclude robots from a Gopher server is formally known as the Robot Access Restrictions Policy (RARP) and consists of placing a plain-text file specifying, in simple and user-friendly syntax, which robots may access which directory. The policy file, if it exists, MUST be accessible via Gopher on the local address

    /robots.txt

A possible drawback of this single-file approach is that only a server administrator can maintain such a list, not the individual document maintainers on the server. This can be resolved by a local process to construct the single file from a number of others, but if, or how, this is done is outside of the scope of this document.

Furthermore, Gopher administrators should bear in mind that the Robot Access Restrictions Policy works largely on the honour system. Many crawlers can be set to ignore the policy, and it is trivial to write this capability into a new crawler.

The policy file consists of one or more records, separated by one or more blank lines, terminated by the Gopher-standard CR/LF. Each record contains two or more lines of the form

    <field>: <value>

The field name is not case-sensitive. Comments (lines to be ignored by robots themselves, but useful to robot operators and others) start with the hash (#) character and end with the line terminator (CR/LF). A value can share a line with a comment.

A record starts with at least one `User-agent` field, followed by at least one `Disallow` field. There are two further, optional fields: `Crawl-delay` and `Allow`.

The value of the `User-agent` field is the name of the robot whose access policy is being described.
If more than one `User-agent` field is present, the record describes an identical access policy for each robot named. This field is to be interpreted broadly. The recommended implementation of access policies in the robot's code is a case-insensitive sub-string match, without version information. Since one is describing an access policy for at least one robot, at least one `User-agent` field is required. The value `*` describes the access policy for any robot not matching any previous record; therefore, if listed, it SHOULD be listed last of all. If it is not listed last of all, anything below it will be ignored.

The value of the `Disallow` field specifies a partial URL that is not to be visited. This can be a full path, or a partial path. Any address that begins with this value will not be retrieved; for instance, the line

    Disallow: /help

would disallow `/help/index.html`, `/help/faq.html`, and `/help.html`. Conversely, the line

    Disallow: /help/

would allow `/help.html`, but nothing in the directory `/help/`. An empty `Disallow` field indicates that all addresses can be retrieved. As one is defining policy and not simply listing the names of robots, at least one `Disallow` field is required per record.

One can also add specific exceptions to the locations disallowed by using the `Allow` field. The `Crawl-delay` field is also supported; this field indicates the number of seconds to wait between successive requests to the same server. Its value must be an integer with no units.

The following is an example of a well-built policy file:

    # Robot Exclusion File for gopher://gopher.scdp.com
    # If you wish to crawl gopher.scdp.com, please contact
    # lane.pryce@scdp.com to apply for an exemption. Our terms of
    # service are available at gopher://gopher.scdp.com/0/tos.txt.
    User-agent: baiduspider
    User-agent: googlebot
    User-agent: msnbot
    User-agent: bingbot
    User-agent: naverbot
    User-agent: seznambot
    User-agent: slurp
    User-agent: teoma
    User-agent: yandex
    Disallow: /cgi-bin/ # Dynamically generated scripts
    Disallow: /images/ # This consumes bandwidth!
    Disallow: /tmp/ # Temporary files---blink and they're gone!
    Disallow: /private/ # No peeking!
    Allow: /images/logo.jpg # Main logo. Mirror this if possible.
    Crawl-delay: 10

    User-agent: *
    Disallow: /

    # If you have received authorisation to crawl this site, and are
    # getting denied, please contact support@scdp.com, or dial
    # (212) 555 0169. This site is copyright Sterling, Cooper, Draper,
    # and Pryce, 2012.

In plain terms, this server allows the major search engines Baidu, Google, Bing, Naver, Seznam, Teoma, Yahoo, and Yandex to mirror the site freely, except for everything in the directories `/cgi-bin/`, `/tmp/`, and `/private/`, and everything in the directory `/images/` other than the single file `logo.jpg`. So as not to slow the server down unduly, the policy file requests that search engines wait ten seconds between requests. All other robots are prohibited from accessing the site.

Examples such as the following SHOULD NOT be used except in very rare situations. Robots generally cause more good than harm, and excluding them entirely, as this anti-social user would, does not make Gopher a healthy place.

    # Piss off!
    User-agent: *
    Disallow: /

Administrator Contact File
--------------------------

It is worth remembering that computers, like anything else, are fallible and prone to error. When failure occurs in Gopherspace, the person in the best position to rectify it is the system administrator. Furthermore, users may have questions or comments, which are also best directed to the system administrator.
For this reason, each Gopher server MUST have a file in its top-level directory with the name *about.txt* and a RECOMMENDED selector string of *About* or *About this server* (equivalents in the local language are permissible, but an English translation is similarly RECOMMENDED). It is the Gopher equivalent of a Unix user's finger output.

Since this file is intended to be readable by humans and not computers, it does not have a defined file format. However, it should have a short description of the server's contents, as well as the contact details of the server administrator and any other key employees, such as the legal department. A well-structured contact file looks as follows:

    Sterling|Cooper|Draper|Pryce
    ============================

    Welcome to SCDP! We are a full-service advertising and marketing
    agency staffed by a team of diverse, senior professionals with a
    flair for solid strategy and compelling creative output. Our team
    produces unique television, radio, print, and Web advertisements for
    a range of industries. Our ability to identify and communicate your
    greatest benefit to your customers is our greatest benefit to you.
    We find out what makes you truly unique.

    We have built an excellent team: each member is an advertising
    specialist in their own right. Photography, programming, writing,
    design, strategy---you name it, we have a creative for that.

    System Administrator: Margaret Olson
    Telephone: (212) 555 0169 x808
    Address: 13, Madison Avenue, New York, N.Y., U.S.A.
    eMail: peggy.olson@scdp.com
    Skype: peggyXolson

    All prospective clients: Please contact Creative Director Donald
    Draper at extension 069.

    Legal issues: For all legal and financial issues, please contact
    Lane Pryce at extension 777.

References
==========

* Anklesaria, Farhad; Lindner, Paul; McCahill, Mark P.; Torrey, Daniel; Johnson, David; Alberti, Bob (1993).
  *Gopher+: upward compatible enhancements to the Internet Gopher protocol.* Retrieved 23 May, 2012, from
* Anklesaria, Farhad; McCahill, Mark P.; Lindner, Paul; Johnson, David; Torrey, Daniel; Alberti, Bob (1993). *The Internet Gopher Protocol (a distributed document search and retrieval protocol).* Retrieved 23 May, 2012, from
* Goerzen, John (2002). *Links to URL.* Retrieved 23 May, 2012, from
* Hoffman, P. (2005). *The gopher URI Scheme.* Retrieved 23 May, 2012, from
* Kaiser, Cameron (2010). *Welcome to caps!* Retrieved 23 May, 2012, from
* The “Updated Gopher RFC” thread (started May 8 2012) on the gopher-project mailing list
* - A number of documents relating to gopher, including the RFCs
* -- A paper on the history of Gopher
* - Floodgap's caps file

Authors
-------

Wolfgang Faust

Brain dump
----------

This is a brain dump of some other things to include at some point.

* Mime-multipart encoded files should use type 'M' and base64 encoding
* StdIn CGI scripts (WONTFIX --- this is not a CGI spec). I mention this because I notice Floodgap (whence the original caps.txt) has `ServerSupportsStdinScripts=TRUE` in the caps.txt file, and I was wondering if this was a common feature.
* Gopher+ (in particular the extra ^I+ on the end of menu entries) (thanks!)
* Periods in the middle of content
* Do we need to describe the full format of robots.txt? Isn't there a spec we can just point to?
* For 'i' types - why does the host/port/selector matter at all? Most sites use error.host, and I see no reason why the contents should matter at all.