"Computer science is no more about computers than astronomy is about telescopes."
Introduction
What is software? As shapeless as an idea and as concrete as a machine, software shares a lot of characteristics with arts; if a painter mixes colors and stroke a canvas with them, then a software developer first relates code and data, then elegantly shapes them onto an execution machine. Surprisingly, it is a vague yet accurate definition. If you only take data and code, well, that is boring. When you take data, code, and their relationship, software suddenly becomes full of mathematical functions, algebras and topologies. When you take data, code, and their representation in a real world, software changes into linguistics, logistics and economics. For me software is an expression of form, a very unique, timely and timeless one. Like making soap bubbles, it is easy or difficult, magical, and always last for a mere second.
Studying biological cell mechanics helps to understand life but fails short to explain intelligence. In some way, additions and multiplications are the building blocks of any software. The art, however, lies within the study of dynamic information systems.
Such systems revolve around the concept of identifiers and computation models. We will first try to define what an identifier is and why it gets manipulated and how it is represented. Then we will delve into modeling time, causality and their relation to computers.
Identifiers in computer systems
Identifiers are pervasive through out the information systems. The need for names arises very early on in the history of humans. Whether it involves talking about families, friends, or enemies, you need to be able to name it.
The notion of street addresses became also quickly necessary to deliver messages and goods.
Until the 20th century, something like "to John, third house on the way out of town" would most likely arrive to the right destination. Humans can achieve very accurate results on very little information; computers cannot. The boom in mechanical processing of information forced the definition of what an identifier actually is out of philosophical circles into a precise, formal and practical world. As it turns out, today software developers spend a huge amount of time describing identifiers, domains and associated rules. On the other hand, computers spend a huge amount of time manipulating identifiers.
As more and more information lives in digital form, the need for formalized identifiers is ever increasing. Here are some examples of structured identifiers that are so ingrained in today's life and that did not exist a century ago:
- Social Security Number (1936)
- Internet Domain Name System (1984)
Usage of identifiers
Before we move into how identifiers are structured, we first should understand how their are used in an information system. Let's start with a real life example of a familiar information system: the library. A library is in many ways the most influential knowledge system in human history. It achieves its mission to diffuse knowledge by moving books around places while distributing rights to people to access books and places. A typical scenario involves you showing up at the library, looking for information. You would search through an index that associates book titles (information identifier) to library shelves (storage identifier) for a while. Then you would walk up to the counter, have your library card (access identifier) scanned and walk out with some books under your arm.
There are components in this system that needs to be identified: books, places and people - or in generic terms: information, storage and accessors.
Information identifierInformation identifiers describe and characterize a piece of information using less storage than the information itself. In our library analogy, an example of information identifier could be the title of a book or for more recent publication the ISBN number of such book. There are no reasons why information identifiers have to be unique. In fact, most times, you will be looking for some information based on criteria such as publication date, category, etc. Information identifiers are thus better described as information keys, specially crafted for the purpose of running efficient search queries.
Storage identifiers describe where information is stored. They have to uniquely describe a physical location such that information can efficiently be replicated from a location to another.
Information and storage are static. They do not get into motion without an access operation. Therefore access identifier control usage of space and time.
Limits of the library analogy
For example, the library can have a section it deems not suitable for children. Rights and restrictions are many times merged within the notion of access identifiers. Although we think it is important to separate access identifiers and rights identifiers because access identifiers are necessary to describe information systems while rights are not. The computer literature about capabilities focuses very much on these notion of access identifier and rights identifiers for anyone interested.
Sometimes librarians might become very proud of having a first edition of a book. Keeping historical artifacts is a very important activity. We will put it here in the mission of museums, for in our analogy with digital information systems, what matters is the content (information) in the books, not the physical books themselves.
A library is not typically associated with a printing manufactory so it cannot easily replicate books. As a result a library tracks physical books and usually keep multiple copies of popular titles. Replication, in a variety of presentation formats, has virtually the same cost as storage in digital systems such that both the library and printing manufactory can always be bundled together. Thus, contrary to a library, there are no need to return physical books in a digital system because there are no risk of depleting the library through checkouts. Nonetheless, cheap is not free and access identifiers remain in use to control the stress on the digital system in terms of time, storage and bandwidth - much the same way the number of books you can check out at any given time is limited.
We can thus rephrase our analogy in more familiar terms. The major mission of a digital information system is to diffuse information by associating search patterns to storages and replicating data between storages.
Structure of identifiers
Requests to an information system should be expressed in the most natural way and computed as fast and as accurately as possible. Since there is no "one size fits all", the use of identifiers in those requests will heavily influence the structure of such identifiers.
We should note here that the distinction between identifier and information is many times blurry. Often both can only be defined relative to the computation being executed. When data needs to be retrieved out of storage as part of a computation, we will define that data as information and the index to locate the associated storage as identifier.
Uniqueness
In a system where identifiers are ambiguous, you cannot guarantee whom you are connecting to. Therefore uniqueness of identifiers is required for predictability and reproducibility. Philosophical and scientific implications that ensue out of the uniqueness of identifiers are not simple. For example it might be at the heart of the difference between mechanical (predictable) systems and living (evolving) ones. If this were to be true, it might prove to never be possible to build a computer that imitate a human brain through current hardware and software methodologies.
Engineers are charted by definition to build predictable machines so computer systems are very much centered around unique identifiers. Nonetheless, when you look at an information system in terms of classification, the requirement of unique identifiers per information is not so strict. Classification puts information into different buckets based on a specific request. It is actually easier to sort out information if the ones that go in the same bucket have already the same identifier. With a wide variety of requests, there are often multiple sets of identifiers for the same information. It is common to used keys or tags instead of identifiers in that context.
Meaningful and random bits
When new information, agents and storage are added into a digital information system, they need to be identified, many times uniquely. How do you construct such identifier?
It is often useful to associate bucket of bits as an identifier for a specific purpose. For example, social security numbers could be constructed to start with one for males and two for females. That makes it straightforward to find out how many men and women a doctor has in his clientele with no other information than the patients social security identifier. The performance advantages of such identifier structure, especially when the identifier is also used as an authentication key to receive different benefits, has to be mitigated against the association between security and privacy risk.
An identifier composed of random bits with no associated meaning is deemed to be a good idea when privacy and security concerns tops the list because it does not reveal any more information than: it is an identifier.
When information is inherently hierarchical, identifiers can be composed through prefixes and suffixes. A hierarchical identifier structure underpins the entire Internet domain name server infrastructure.
Fixed vs. variable length representations
Since most computations are expected to terminate and deliver a result as soon as possible, shorter identifiers are better. Fixed-size identifiers are easier to store and process. Many digital systems manipulate and reason in terms of finite sets in a mathematical sense.
Variable-length identifiers appear in three contexts. First, humans tend to prefer variable-length alphabetical strings while computers deal more efficiently with fixed-size integers; it is often the case a system will translate (implement a bijection) between both types of identifiers. A typical example is network IP addresses and UTF domain names. Variable length identifiers are artifacts of the interface. Second, variable-length identifiers can be used to optimize storage usage. The field of data compression is dedicated to this problem. Third, variable-length identifiers can implement what are mathematically defined as infinite sets. Lastly, variable length identifiers are rarely use today with most usage falling under one of the two previous categories.
Pattern and metric representations
As we have seen in the example above, many times we are searching through information systems for relevant pieces, defining and refining criteria to separate one set of information from another. This is an activity often referred to as data mining, categorization or classification in general. As complex and refined the separation criteria are they are based on an analogy (Mars has very much earth-like clouds) or distance (London is 340km from Paris).
When searching by analogy, we usually represent a search pattern as a finite state machine. Finite state machines used for text pattern matching are through standard convention often written as regular expressions. For example, to find all books whose title contain the word digital, we could write the following regular expression ".*digital.*". By encoding images and sounds as sequences of bits, it is possible to extend the reasoning beyond text and use regular expressions to search other media. This way, finite state machines can be constructed to represent bit patterns to classify information. For pattern-matching classification it is thus required that similar identifiers have similar bit sequences.
A metric is a function, which defines a distance between elements of a set. In that case, we associate a number (the distance) to each information in the system and then use the natural order on numbers to classify information as matching or not. We will thus choose identifier structures that insures the distance function can be implemented efficiently then.
Resources: space and time
Most people have the capabilities to think and reason about three-dimensional space and mathematicians as well as computers seem to easily manipulate spaces with many more dimensions. Everyone and everything experiences time; it is so ingrained into us that it is almost impossible to define time, less imagine time differently.
In computer systems it is possible to create space by adding disks and somehow create time by adding processors. Software engineers are often seen furiously trading space and time. Outside defining the structure of identifiers, it is one of their main activities.
Physical storage
Processors can only manipulate information that is local or, to be precise information that resides on the same silicon die. Information stored elsewhere needs to be brought locally before anything can be done. Here the rule of thumb usually goes, the closer the faster, the closer the most expensive (in dollars). As a result storage implementations are usually seen as hierarchy of caches of increasing size and decreasing cost. Each level is connected to the next through a bus with specific bandwidth and latency characteristics. Bandwidth measures the number of bits per second going over the bus while latency measures the delay in seconds between the time a request is issued and the time its result is available. Both are influenced by the physical nature of the underlying storage and bus as well as the workload requested.
Mechanical devices such as spinning hard-drive and tapes need to physically move read and write heads. Though solid-state drives or random access memory do not have any moving parts, they nonetheless deal with information in blocks of bytes for efficiency. Locality, information that is often used together resides physically close in space, is an every day optimization problem.
Time models
There seem only two ways to model time: causality and intrinsically. Lifetime is also a central concept of any digital information system.
The first way to model time is causality, or actions lead to consequences. In this model, time is described as a succession of logic steps. Time exists because things change. More precisely the mapping between information and space has been altered.
If you define a live information set, or reciprocally the amount of space that is not allocated to meaningful information, a computer can mechanically reverse any consequences by duplicating the live set into empty space before applying an action therefore generating an inverse function. Interestingly, it can be seen as a way to trade space for time in the causality model.
Since it is trivial to build inverse functions in this model, it is generally a useful convention to define two inverse functions, do and undo, as respectively the time-forward and time-backward function even if it does not have any real meaning.
The second model defines time has an intrinsic metric. Time always passes measured as uniform units of time. It is common for computers to use transistor switch time, known as clock cycle, or the familiar definition of seconds as a unit for measuring time.
Information lifetime
Because storage is limited, it often has to be reallocated to new information. Lifetime of information plays an important role to that purpose because at that time an information no longer need to be stored in the system; its space can be reclaimed.
Lifetime can be expressed explicitly. First, agents can generate delete statements directly in code whenever information space can be recycled. This is a very usual model.
Lifetime can be expressed explicitly, relative to time or other information. An explicit expiration timestamp can be built into the information at the time it is entered in the system. The expiration date can also be specified relative to a scope. In a program, this is usually the case for local function variables. Their space is reclaimed as the function exits. Generalizing in any information system, a generic ownership relationship (vs. uses) will define that space for a dependent can be recycled when the owner is recycled.
The expiration date can also be specified implicitly relative to last reference. The implicit assumption here is that the lifetime of information is over when no one has a way to access it anymore.
The implicit model is very attractive because it is possible to substitute reachability and graph connectedness for explicit lifetime specification. The algorithms for finding connected subgraphs and reachable components are straightforward. A multitude of garbage collection implementations have been devised to mechanically and efficiently recycle space associated with unreachable subgraphs.
Mechanical garbage collection, a wrong debate
Garbage collection fans typically point out the advantages of implicit lifetimes and "old school" programmers point out the runtime cost of garbage collection implementations.
Both arguments miss the point. Understanding and specifying lifetime semantics is and will remain an important aspect of programming no matter the underlying runtime. Unfortunately, very few languages so far have explored and provide alternatives to express a rich variety of lifetime semantics.