API design: Choosing between names and identifiers in URLs
If you're involved in the design of web APIs, you know there's disagreement over the style of URL to use in your APIs, and that the style you choose has profound implications for an API’s usability and longevity. The Apigee team here at Google Cloud has given a lot of thought to API design, working both internally and with customers, and I want to share with you the URL design patterns we're using in our most recent designs, and why.
When you look at prominent web APIs, you'll see a number of different URL patterns.
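Here are two API URLs that exemplify two divergent schools of thought on URL style:
https://ebank.com/accounts/a49a9762-3790-4b4f-adbf-4577a35b1df7
https://library.com/shelves/american-literature/books/moby-dick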
The first is an anonymized and simplified version of a real URL from a U.S. bank where I have a checking account. The second is adapted from a pedagogic example in the Google Cloud Platform API Design Guide.
The first URL is rather opaque. You can probably guess that it’s the URL of a bank account, but not much more. Unless you're unusually skilled at memorizing hexadecimal strings, you can’t easily type this URL—most people will rely on copy and paste or clicking on links to use this URL. If your hexadecimal skills are as limited as mine, you can’t tell at a glance whether two URLs like these are the same or different, or easily locate multiple occurrences of the same URL in a log file.
The second URL is much more transparent. It’s easy to memorize, type and compare with other URLs. It tells a little story: there's a book that has a name that's located on a shelf that also has a name. This URL can be easily translated into a natural-language sentence.
Which should you use? At first glance, it may seem obvious that URL #2 is preferable, but the truth is more nuanced.
The case for identifiers
There is a long tradition, one that predates computers, of allocating numeric or alphanumeric identifiers to entities. Banks and insurance companies allocate identifiers for accounts and policies. Manufacturers, wholesalers and retailers identify products with product codes. Editions of books are identified by their ISBNs. Governments issue social security numbers, driver's license numbers, criminal case numbers, land parcel numbers and so on. Our first example is simply an expression of this idea in the URL format of the World Wide Web.
If identifiers like these have the disadvantages described above—hard to read, compare, remember and type, and devoid of useful information about the entity they identify—why do we use them?
The primary reason is that they remain valid and unambiguous even when things change, and stability and certainty are critically important qualities (Tim Berners-Lee wrote an often-quoted article on this topic). If we don't allocate an identifier for a bank account, how can we reliably reference it in the future? Identifying the account using information that we know about it is unreliable because that information is subject to change and may not uniquely identify the account. Details about its owner are all subject to change (e.g., name, address, marital status), or subject to ambiguity (date and place of birth), or both. Even if we have a reliable identifier for the owner, ownership of the account can change, and identifying the account by where and when it was created doesn’t guarantee uniqueness.
The case for names
Hierarchical naming is a very powerful technique that humans have used for centuries to organize information and make sense of the world. The taxonomy of nature, developed in the 1700s by Carolus Linnaeus, is one very famous example.
URLs in the style of the second example—formed from hierarchies of simple names—are based on this idea. These URLs have the inverse qualities of simple numeric or alphanumeric identifiers: they're easier for humans to use, construct and get information from, but they're not stable in the face of change.
If you know anything about Linnaeus’ taxonomy, you know that its elements have been renamed and the hierarchy restructured extensively over time; in fact, the rate of change has increased with the adoption of modern technologies like DNA analysis. The ability to change is very important for most naming schemes, and you should be suspicious of designs that assume that names will not change. In our experience, renaming and reorganizing the name hierarchy turn out to be important or desirable in most user scenarios, even if they weren’t anticipated by the original API designers.
The downside of the second example URL is that if a book or shelf changes its name, references to it based on hierarchical names, like the one in the example URL, will break. Changing the name is probably not plausible for a book that is a copy of a mass-printed work of literature, but it might apply to other documents you might find in a library, and renaming a shelf seems entirely plausible. Similarly, if a book moves between shelves, which also seems plausible, then references based on this URL will also break.
There is a general rule here. URLs based on opaque identifiers (sometimes called permalinks) are inherently stable and reliable, but they aren’t very human-friendly. The way to make URLs human-friendly is to build them from information that's meaningful to humans, like names and hierarchies, in which case one of two unfortunate things will happen: either you have to prohibit renaming entities and reorganizing hierarchies, or you have to be prepared to deal with the consequences when links based on these URLs break.
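To make the rule concrete, here is a minimal sketch of a toy in-memory library. The class, method names and the "sea-stories" shelf are hypothetical and exist only for illustration. A reference held by identifier survives a move, while a reference built from names does not.

```python
import uuid


class Library:
    """Toy in-memory store; books are keyed by an opaque identifier."""

    def __init__(self):
        self.books = {}  # book_id -> {"name": ..., "shelf": ...}

    def add_book(self, name, shelf):
        book_id = str(uuid.uuid4())  # opaque identifier, never changes
        self.books[book_id] = {"name": name, "shelf": shelf}
        return book_id

    def get_by_id(self, book_id):
        """Look-up by identifier: unaffected by renames or moves."""
        return self.books.get(book_id)

    def get_by_name(self, shelf, name):
        """Look-up by hierarchical name: depends on current names."""
        for book in self.books.values():
            if book["shelf"] == shelf and book["name"] == name:
                return book
        return None


library = Library()
book_id = library.add_book("moby-dick", "american-literature")

# The book is moved to a different (hypothetical) shelf.
library.books[book_id]["shelf"] = "sea-stories"

by_id_after_move = library.get_by_id(book_id)
by_name_after_move = library.get_by_name("american-literature", "moby-dick")
```

After the move, `by_id_after_move` still holds the book, while `by_name_after_move` is `None`: the name-based reference has broken.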
Up until this point I have talked about this identity dilemma in terms of its impact on URLs exposed by APIs, but the problem also affects identities stored in databases and exchanged between implementation components. URLs exposed by an API are generally based on the identities that the API implementation stores in databases, so design decisions that affect URLs usually also affect database and API implementation design, and vice versa. If you use hierarchical names to identify entities in the implementation as well as the API, the consequences of broken references are compounded, as is the difficulty of supporting renaming and reparenting. This means that the topic is a very important one for total system design, not just API design.
The best of both worlds
Faced with these tradeoffs, which style of URL should you choose? The best response is not to choose: you need both to support a full range of function. Providing both styles of URL gives your API a stable identifier as well as the ease of use of hierarchical names.
The Google Cloud Platform (GCP) API itself supports both types of URL for entities where renaming or reparenting makes sense. For example, GCP projects have both an immutable identity embedded in stable permalink URLs, and a separate mutable name that you can use in searches. The identity of one of my GCP projects is "bionic-bison-166600" (which shows that identifiers don't have to be as inscrutable as RFC-compliant UUIDs; they just need to be stable and unique) and its name is currently "My First Project Renamed".
Identifiers are for look-up. Names are for search.
We know from the principles of the World Wide Web that every URL identifies a specific entity. It's fairly apparent that "https://ebank.com/accounts/a49a9762-3790-4b4f-adbf-4577a35b1df7" is the URL of a specific bank account. Whenever I use this URL, now or in the future, it will always refer to the same bank account. You might be tempted to think that "https://library.com/shelves/american-literature/books/moby-dick" is the URL of a specific book. If you think renaming and relocating books could never make sense in a library API, even hypothetically, then you can perhaps defend that point of view, but otherwise you have to think of this URL differently. When I use this URL today, it refers to a specific, dog-eared copy of Moby Dick that is currently on the American Literature shelf. Tomorrow, if the book or shelf is moved or renamed, it may refer to a shiny new copy that replaced the old one, or to no book at all. This shows that the second URL isn’t the URL of a specific book, so it must be the URL of something else. You should think of it as the URL of a search result. Specifically, the result of this search:
find the book that is [currently] named "moby-dick", and is [currently] on the shelf that is [currently] named "american-literature"
Here’s another URL for the same search result, where the difference is entirely one of URL style, not meaning:
Understanding that URLs based on hierarchical names are actually the URLs of search results rather than the URLs of the entities in those search results is a key idea that helps explain the difference between naming and identity.
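One way to see that a hierarchical URL is really a search is to translate it into search criteria. The sketch below does that for the library URL and for a hypothetical query-string form; the `/search` path and the `shelf` and `name` parameter names are assumptions made for illustration, not part of any real API. Both forms reduce to the same criteria.

```python
from urllib.parse import urlparse, parse_qs


def path_to_criteria(url):
    """Translate /shelves/<shelf>/books/<book> into search criteria."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 4 or parts[0] != "shelves" or parts[2] != "books":
        raise ValueError("not a shelf/book search path")
    return {"shelf": parts[1], "name": parts[3]}


def query_string_to_criteria(url):
    """Translate ?shelf=<shelf>&name=<book> into the same criteria."""
    params = parse_qs(urlparse(url).query)
    return {"shelf": params["shelf"][0], "name": params["name"][0]}


hierarchical = "https://library.com/shelves/american-literature/books/moby-dick"
# Hypothetical query-string style; path and parameter names are assumptions.
flat = "https://library.com/search?shelf=american-literature&name=moby-dick"

same_search = path_to_criteria(hierarchical) == query_string_to_criteria(flat)
```

Here `same_search` is `True`: the two URLs differ in style but ask the same question about the current state of the library.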
Using names and identifiers together
To use permalink and search URLs together, you start by allocating a permalink for each entity. For example, to create a new bank account, I might expect to POST a representation of the new account details to https://ebank.com/accounts. The successful response contains a status code of 201 along with an HTTP "Location" header whose value is the URL of the new account: "https://ebank.com/accounts/a49a9762-3790-4b4f-adbf-4577a35b1df7".
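As a rough sketch of the server side of this exchange, the function below allocates an identifier and returns the status code and "Location" header described above. It is framework-free on purpose; the function name, the in-memory store and the example account fields are all hypothetical.

```python
import uuid

BASE_URL = "https://ebank.com/accounts"
ACCOUNTS = {}  # account_id -> stored representation (in-memory stand-in)


def create_account(representation):
    """Handle a POST to BASE_URL.

    Allocates an opaque identifier and returns (status, headers, body).
    """
    account_id = str(uuid.uuid4())
    ACCOUNTS[account_id] = dict(representation)
    permalink = f"{BASE_URL}/{account_id}"
    # 201 Created, with the new entity's permalink in the Location header.
    return 201, {"Location": permalink}, ACCOUNTS[account_id]


# Example field names are assumptions, not a real bank's schema.
status, headers, body = create_account({"owner": "Jane Doe", "type": "checking"})
```

The client never chooses the identifier. It simply records the value of `headers["Location"]` and uses that URL for all later references to the account.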
If I were designing an API for the library, I would follow the same pattern. I might start with the creation of a shelf by POSTing the following body to https://library.com/locations:
This results in the allocation of the following URL for the shelf:
Then, to create the entry for the book, I might post the following body to https://library.com/inventory:
resulting in the allocation of this URL for the book:
This stable URL will always refer to this particular copy of Moby Dick, regardless of what I call it or where in the library I put it. Even if the book is lost or destroyed, this URL will continue to identify it.
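The library flow might be simulated as follows. This is only a sketch under stated assumptions: the body field names (`name`, `shelf`), the `/shelf/` permalink prefix and the replacement shelf name are hypothetical, and the in-memory dictionary stands in for a real store. The point is that the book refers to its shelf by permalink, so renaming the shelf changes a property and no URL.

```python
import uuid

# Assumed mapping from collection URL to the permalink prefix it allocates.
PERMALINK_PREFIX = {
    "https://library.com/locations": "https://library.com/shelf",
    "https://library.com/inventory": "https://library.com/book",
}
ENTITIES = {}  # permalink URL -> stored representation


def post(collection_url, body):
    """Simulate a POST: store the body and return the allocated permalink."""
    permalink = f"{PERMALINK_PREFIX[collection_url]}/{uuid.uuid4()}"
    ENTITIES[permalink] = dict(body)
    return permalink


shelf_url = post("https://library.com/locations",
                 {"name": "American Literature"})
# The book points at its shelf by permalink, not by the shelf's name.
book_url = post("https://library.com/inventory",
                {"name": "Moby Dick", "shelf": shelf_url})

# Renaming the shelf updates one property; both permalinks are unchanged.
ENTITIES[shelf_url]["name"] = "Classic American Fiction"
current_shelf_name = ENTITIES[ENTITIES[book_url]["shelf"]]["name"]
```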
Based on these entities, I also expect the following search URLs to be valid:
You can implement both of these search URL styles in the same API if you have the time and energy; otherwise, pick the style you prefer and stick with it.
Whenever a client performs a GET on one of these search URLs, the identity URL (i.e., its permalink, in this case https://library.com/book/745ba01d-51a1-4615-9571-ee14d15bb4af) of the found entity should be included in the response, either in a header (the HTTP Content-Location header exists for this purpose), in the body, or, ideally, in both. This enables clients to move freely between the permalink URLs and the search URLs for the same entities.
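A search handler that follows this advice could look roughly like the sketch below. The handler name, the in-memory data and the `self` field used to carry the permalink in the body are assumptions for illustration; only the Content-Location header and the example permalink come from the text above.

```python
BOOKS = {
    "https://library.com/book/745ba01d-51a1-4615-9571-ee14d15bb4af": {
        "name": "moby-dick",
        "shelf": "american-literature",
    },
}


def get_search(shelf, name):
    """Handle a GET on a search URL; returns (status, headers, body)."""
    for permalink, book in BOOKS.items():
        if book["shelf"] == shelf and book["name"] == name:
            # Include the permalink in the body ("self" is an assumed name)...
            body = {**book, "self": permalink}
            # ...and in the Content-Location header.
            return 200, {"Content-Location": permalink}, body
    return 404, {}, None


status, headers, body = get_search("american-literature", "moby-dick")
missing_status, _, _ = get_search("american-literature", "no-such-book")
```

A client that found the book through the search URL can store `headers["Content-Location"]` and use it from then on, regardless of later renames.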
The downside of two sets of URLs
Every design has its drawbacks. Obviously, it takes a little more effort to implement both permalink entity URLs and search URLs in the same API.
A more serious challenge is that you have to educate your users on which URL to use in which circumstance. When they store URLs in a database, or even just create bookmarks, they’ll probably want to use the identity URLs (permalinks), even though they may use search URLs for other purposes.
You also need to be careful about how you store your identifiers—the identifiers that should be stored persistently by the API implementation are almost always the identifiers that were used to form the permalinks. Using names to represent references or identity in a database is rarely the right thing to do—if you see names in a database used this way, you should examine that usage carefully.
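In database terms, this means references between rows should hold identifiers, with names stored as ordinary, mutable columns. Here is a minimal sketch using Python's built-in sqlite3 module; the table layout, the short ids and the "sea-stories" name are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE shelves (id TEXT PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE books (
        id TEXT PRIMARY KEY,
        name TEXT NOT NULL,
        shelf_id TEXT NOT NULL REFERENCES shelves(id)  -- identifier, not name
    );
""")
db.execute("INSERT INTO shelves VALUES ('s1', 'american-literature')")
db.execute("INSERT INTO books VALUES ('b1', 'moby-dick', 's1')")

# Renaming the shelf touches one row; the book's reference is unaffected.
db.execute("UPDATE shelves SET name = 'sea-stories' WHERE id = 's1'")

row = db.execute("""
    SELECT shelves.name FROM books
    JOIN shelves ON shelves.id = books.shelf_id
    WHERE books.id = 'b1'
""").fetchone()
```

Had `books` stored the shelf's name instead of `shelf_id`, the rename would have required updating every book on the shelf, or left dangling references behind.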
Users who write scripts to access the API can choose between search and permalink URLs. Writing scripts with search URLs is often easier and faster, because you can construct search URLs easily from names or numbers you already know, whereas it usually takes a little more effort in a script to parse permalink URLs out of API response headers and bodies.
The downside of using search URLs in scripts is that they break if an API entity gets renamed or moved in the hierarchy, in the same way that scripts tend to break when files are renamed or moved. Since you are accustomed to fixing scripts when file names change, you may decide to go ahead and use the search URLs and simply fix the scripts when they break. However, if reliability and stability of scripts is very important to you, write your scripts with permalinks.
Permalinks and search URLs: better together
Unless you're very restrictive about the changes you allow to your data, you really can’t achieve stability, reliability and ease-of-use in an API with a single set of URLs. The best APIs implement both permalink URLs based on identifiers for stable identification and search URLs based on names (and perhaps other values) for ease-of-use. For more on API design, read the eBook, “Web API Design: The Missing Link” or check out more API design posts on the Apigee blog.