Perspectives on the current web
Or what the web would be like, with a little more information and a little less glitter
This document is currently in a draft state...
It is interesting to note that over the years various attempts have been made to equate the Internet with something else. Until the mid-1980s lots of people tried to say the Internet was the ARPANET. In the late 1980s many tried to say the Internet was NSFNET. In the early 1990s many tried to say the Internet was USENET. Now many are trying to say the Internet is anything that can exchange mail. We say the Internet is the Internet, not the same as anything else.
—[RFC-1935], section DNS and Mail Addresses.
The first is that there is no unique information protocol that will provide the flexibility, scale, responsiveness, world view, and mix of services that every information consumer wants. A protocol designed to give quick and meaningful access to a collection of stock prices might look functionally very different from one which will search digitized music for a particular musical phrase and deliver it to your workstation. So, rather than design the information protocol to end all information protocols, we will always need to integrate new search engines, new clients, and new delivery paradigms into our grand information service.
—[RFC-1727], section 3, Axioms of information services.
The second is that distributed systems are a better solution to large-scale information systems than centralized systems. If one million people are publishing electronic papers to the net, should they all have to log on to a single machine to modify the central archives? What kind of bandwidth would be required to that central machine to serve a billion papers a day? If we replicate the central archives, what sort of maintenance problems would be encountered? These questions and a host of others make it seem more profitable at the moment to investigate distributed systems.
—[RFC-1727], section 3, Axioms of information services.
When starting to work on an experimental proposal for transporting DNS operations over an HTTP dialog with a JSON encoding, I started reading the RFC (Request for Comments) articles governing the various protocols involved. As expected, these documents are precise technical specifications, full of syntax definitions, pseudo-code, implementation notes, security considerations, and so forth; however, especially in the introductory parts, there is plenty of generic wisdom and insight into the Internet's architecture and inner workings, with pointers to other RFCs.
Thus my curiosity drove me to look at the references sections, and to identify those RFCs discussing the Internet in general. After a quick depth-first search, stopped at a small depth, through the graph of collected RFCs, I started to form another impression of the RFC publications. Are the RFC articles only technical specifications? Or do they serve a much broader purpose?
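To make the starting point concrete, here is a minimal sketch of what such a JSON encoding of a DNS question and answer might look like. The field names (`question`, `answers`, `name`, `type`, `data`) are my own illustrative choices, not taken from any actual proposal:

```python
import json

def encode_dns_query(name, rtype="A"):
    # Encode a single DNS question as a JSON payload, ready to be sent
    # as the body of an HTTP request.  (Hypothetical field names.)
    return json.dumps({"question": {"name": name, "type": rtype}})

def decode_dns_response(payload):
    # Decode a JSON-encoded DNS answer into (name, type, data) tuples.
    body = json.loads(payload)
    return [(a["name"], a["type"], a["data"]) for a in body.get("answers", [])]

query = encode_dns_query("example.net")
answers = decode_dns_response(
    '{"answers": [{"name": "example.net", "type": "A", "data": "192.0.2.1"}]}')
```

The appeal of such an encoding is exactly what drew me to the underlying RFCs: the DNS semantics stay untouched, only the transport and framing change.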
It seems to me that the RFC series resembles a high-quality scientific journal more than a collection of specifications. It also has all the necessary qualities:
- each RFC discusses either a specific pin-point technical aspect, or provides an overview of the Internet's ecosystem;
- each RFC is peer-reviewed before publication by experts in the subject's field --- a process perhaps better than the one in academic publishing;
- the citations of non-RFC documents are countless;
Indeed, before discussing anything related to the Internet, we must first understand the big picture in its entirety, thus getting a grip on the various gears that make the whole thing tick. Such a nice overview is provided by [RFC-1935], written in 1996.
It starts with a drill-down through the various components that make up the whole:
- "WWW [a.n. HTTP + HTML], Gopher, FTP, TELNET, mail, lists, an news [a.n. NNTP]: that's pretty characteristic set of major Internet services."
- consumer, supplier, and producer computers
- I.e. the active entities from end-to-end in the access chain.
- consumer-capable, and supplier-capable computers
- The extension of the previous set to those computers that, given the right software, could play these roles. But the interesting categorization factor is the firewall: a computer on the "internal" side of the firewall is only consumer-capable, while one on the "outer" side is also supplier-capable.
Then it continues with some categorizations and considerations:
- core Internet, and consumer Internet
- Interestingly, at that time the "core Internet" (or "Backbone Internet") referred to the firewall-free computers, while its extension with the firewalled computers was called the "consumer Internet" (or "Internet Web").
- communication services, resource sharing and resource discovery services
Starting again from the implications of firewalls, services are split into a few categories:
- "communication services", like "mail, lists, and news" [a.n. NNTP];
- "resource sharing (TELNET, FTP)"
- "resource discovery (Gopher, WWW [a.n. HTTP + HTML])"
- interactive vs non-interactive services
- Again, interestingly, services that have a dose of interactivity are part of the Internet, while those lacking direct user interaction are left outside. In fact the cited document goes deep into a debate about what is, or isn't, "on the Internet".
- IP is mandatory
- Only computers accessible over IP are "on the Internet". "There seems to be something about IP and the Internet that is especially conducive to the development of new protocols."
- text-only or graphical interactive access
- It doesn't matter. "However, we agree that the distinction of graphical access is becoming more important with the spread of WWW and Mosaic."
- "Russian dolls"
- "Let's not talk about that many concentric layers, though, rather just three: the Matrix on the outside, the consumer Internet inside, and the core Internet inside that."
- "Outside the Matrix"
- "LANs, mainframes, and BBSes that don't exchange any services with other networks or computers; not even mail."
It is very interesting to see how the Internet was defined in 1996, and what the few orthogonal service categories were. It is also worth stopping to ask whether the definition has evolved. Have we added new service categories? Have we added new computer categories?
In my view the definition still applies today in certain aspects:
- TCP/IP is still the de-facto network and transport layer protocol stack. Moreover, TCP and UDP are the only transport protocols usable across the entire Internet, as most other transport protocols are dropped at network interconnection points.
- Probably even more than in 1996, interactivity is the main driving factor for any Internet application; perhaps more than necessary, to the point that interactivity has taken the place of informational usefulness.
However, in other respects it should be updated:
- Some protocols have almost died out, like Gopher or NNTP; others have been replaced, like FTP by HTTP, or perhaps by BitTorrent; meanwhile one of them, HTTP, subsumes its older brothers.
- Text-only interactive access is almost unusable, graphical interaction having become mandatory; again, to the point where presentation has taken over from usefulness.
- Firewalls no longer serve the role of "technical" security appliances; instead they have become the "virtual border patrols" at the various network "border crossings", splitting the Internet into "nations" [Schneier-2013-a], and sometimes even partitioning the Internet.
The informational [RFC-1727] and [RFC-1728] --- both from 1994, thus only a couple of years after the "world wide web" was invented --- present a consolidated view, held by some of the important technical leaders of the time, of the architectural and interoperability aspects of future "information services".
After a few enlightening axioms --- already exhibited in the quotes above --- it presents a layered access architecture, at the heart of which lies the "document" or "resource". The other central aspect is the proper usage of URNs to designate documents, and then, after mapping and selection, of URLs to retrieve them.
The envisaged layers, bottom-up are:
- resource layer
(section [4.1]) This represents the atomic unit of publishing, tagged with a unique URN, and entrusted to a "transponder".
The atomic aspect refers to the fact that a document is self-contained, including all the multi-media artifacts it might need, which contrasts with today's web content organization. (Although this atomicity is not explicitly expressed, it can be inferred from the "Bogus Journey" to the Antarctic example in section [4.3].)
Then the "transponder" for each document --- whose functions could be mapped on today's CMS's --- has the explicit role of continuously updating the URN to URL mapping, as the resource "moves around". This concept is further detailed in [RFC-1728], as composed of the document's meta-information, and an active part, the "agent", that should keep the meta-information up-to-date, pushing it through the rest of the information system as needed to notify updates, and mediating with the user agent, access to the resource itself.
- resource locator
(section [4.2]) This layer is entrusted with the simple but essential task of mapping between a resource's URN (its unique universal name) and its various URLs (the access methods).
- resource discovery
(section [4.3]) The "giant repository of all information about every publication", which serves the purpose of today's search engines: it would allow the user to identify the URNs of the documents most likely to contain the information he is searching for.
Besides the indexing role, the RFC also explicitly mentions a notification role: users could express interest in particular topics, and receive notifications when a new document on such a topic is published.
- information delivery tools
(section [4.4]) These are the equivalents of today's browsers, which from a protocol-interoperability point of view are split into two large categories (the category names are not given in the RFC, but coined by the author of this document):
- (section [4.5]) gateways, which translate from one protocol or format to another, and which reside on the server side of the architecture, not without some information loss;
- (section [4.6]) multi-protocol clients, which solve interoperability on the client side, not without practical disadvantages.
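As a concrete illustration of the resource locator layer above, here is a minimal sketch of the URN-to-URL mapping it maintains; the class and method names are my own invention, and the "transponder" is reduced to whoever calls `register` when a resource moves:

```python
# A sketch of the "resource locator" layer: it maps a stable URN (the
# document's identity) to its current URLs (the access methods).
# Names and structure are illustrative, not taken from the RFC.

class ResourceLocator:
    def __init__(self):
        self._mapping = {}  # URN -> list of known URLs

    def register(self, urn, url):
        # Called by a document's "transponder" whenever the resource
        # gains a new location or "moves around".
        self._mapping.setdefault(urn, []).append(url)

    def resolve(self, urn):
        # Return all known access URLs for a URN (possibly empty).
        return list(self._mapping.get(urn, []))

locator = ResourceLocator()
locator.register("urn:example:bogus-journey", "ftp://archive.example/bogus.txt")
locator.register("urn:example:bogus-journey", "http://mirror.example/bogus.html")
```

Note how the URN stays fixed while the URLs accumulate and change --- exactly the divorce of identity from access that the RFC argues for.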
Moreover, it also covers the publishing aspects (section [5.1]), which informally amount to "a creative thought, a couple of hours of typing, and a few cups of coffee, and we have a new resource", and then the intermediation of a "publishing agent" that takes care of assigning the URN and pushing the document to distribution systems (and attaching the "transponders").
It also underlines the need for proper classification and filtering, done by the "librarians" of the net (section [5.2]).
The highlights that the reader must remember from this historical document are the following:
- the atomic, self-contained nature of documents or resources --- contrasting with today's HTML mash-ups, made from a myriad of stylesheets, images, and AJAX calls; (although, driven by performance reasons, we are moving back in that direction, through the embedding of assets into HTML, see [PageSpeed]_;)
- the divorce of a document's identity (its URN) from the document's access method (its URL) --- which is almost nonexistent today, except maybe in the form of PURLs [PURL]_;
- the publishing process --- which should involve more "creative thought" and a "few hours of typing", contrasting strongly with today's 140-character-tweet, push-a-button approach;
- the announcement of new documents --- which would remove the need for the continuous crawl-the-net approach;
- the need of "librarians" --- that are sorely missed in todays high noise-to-information ratio of the web;
The only aspect with which I would strongly disagree is the centralized publishing method; it could, however, easily be dissolved into a distributed approach, and the document itself implies something similar by stating "personal publishing agent or some big organization".
The reader is kindly asked to take this section as a simple caution, or as a single facet of the multi-faceted Wikipedia. In fact I still consider Wikipedia to be one of the best knowledge indexes, and most of its articles are better written and more accurate than most scientific publications.
Currently Wikipedia has become the de-facto primary source of knowledge for a large part of the population, including, unfortunately, higher-education students and even some researchers. Therefore most people attribute an authoritative power to Wikipedia, and conclude that "if it is written so, then it must be so", ceasing any further research; reminiscent of the "believe and don't doubt" dictum. In a few words, as the title implies, Wikipedia has become an Orwellian "ministry of truth".
However, this issue is not the most alarming one, as in the past other sources of knowledge shared a similar status, like the Encyclopaedia Britannica, or other publications edited by cultural, educational, or even political institutions, each with its own priorities and agenda. Moreover, Wikipedia is a collaborative endeavor, which implies that anyone spotting an error could easily submit a correction; and Wikipedia is also an on-line compendium, which implies that any accepted correction becomes visible almost instantaneously, without further delays.
Unfortunately, Wikipedia's threat lies in its unique version of the truth. That is, for any topic there is a single article presenting it; although, to be fair, many articles are balanced, and those that aren't are clearly tagged as pending editorial attention. But back to the issue at hand: this contrasts with the past de-facto repositories of truth in that, although few in number, they did offer alternative perspectives, especially when they came from diverse geographical sources, thus with different social, economic, and even political views.
In conclusion, to tie this in with the current document, Wikipedia's main fault is its contradiction of the second [RFC-1727] axiom of a truly distributed information system, by concentrating all editorial and publishing efforts under a single entity.
But all this is not beyond repair, because it could easily be fixed: first from a philosophical standpoint, by allowing and presenting, for each topic, a set of alternative articles, which probably and hopefully will have diverging views on that particular topic; this would be the first step towards providing a more realistic, even if unbalanced, mirror of the world.
Still, just providing alternative points of view doesn't solve the fact that Wikipedia is still a centralized entity, which in the end can, although it doesn't, act in a dictatorial manner --- in the end even democracy is the dictatorship of the majority. Therefore the second step, technical in nature, is to decentralize its content into a number of interlinked information repositories, in the true nature of the web.