Creating better JSON: An EPUB odyssey


The dos and don'ts of keeping it clean

I'm going to begin this post with a list of dos and don'ts:
  • don't nest content more than necessary
  • use a limited number of keywords
  • don't mix types unnecessarily
  • don't write JSON that will take a lot of deciphering and decoding
  • keep things as flat and simple as possible
  • make the intention of the data as clear as possible
  • think about how the JSON will most likely be interpreted without being restrictive or prescriptive
  • use common concepts to indicate elements that are not straightforward values
  • remember that arrays indicate order and dictionaries (or objects) are for unordered key/value access of information

Example structure

Now for a JSON structure I created to represent the internal structure of an EPUB following the dos and don'ts listed above:

{
    "platform": "epub",
    "version": 3,
    "text-encoding": "UTF-8",
    "subdirectories": [
        "META-INF",
        "OEBPS",
        "OEBPS/css",
        "OEBPS/images"
    ],
    "files": [
        {
            "mimetype": "application/epub+zip"
        },
        {
            "[images]": "$0"
        },
        {
            "container.xml": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><container version=\"1.0\" xmlns=\"urn:oasis:names:tc:opendocument:xmlns:container\"><rootfiles><rootfile full-path=\"OEBPS/content.opf\" media-type=\"application/oebps-package+xml\" /></rootfiles></container>"
        },
        {
            "cover.xhtml": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?><!DOCTYPE html><html xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:epub=\"http://www.idpf.org/2007/ops\"><head><title>cover</title></head><body><div style=\"text-align:center;\" epub:type=\"cover\"><img src=\"images/cover.jpg\" alt=\"cover image\" style=\"max-width:100%;\" /></div></body></html>"
        },
        {
            "toc.xhtml": "$0"
        },
        {
            "content.opf": "$0"
        },
        {
            "[chapters]": "$0"
        },
        {
            "stylesheet.css": "$0"
        }
    ],
    "file-locations": {
        "mimetype": null,
        "[images]": "OEBPS/images",
        "container.xml": "META-INF",
        "cover.xhtml": "OEBPS",
        "toc.xhtml": "OEBPS",
        "content.opf": "OEBPS",
        "[chapters]": "OEBPS",
        "stylesheet.css": "OEBPS/css"
    },
    "file-templates": {
        "mimetype": null,
        "[images]": "images.json",
        "container.xml": null,
        "cover.xhtml": null,
        "toc.xhtml": "toc.json",
        "content.opf": "content.json",
        "[chapters]": "chapters.json",
        "stylesheet.css": "stylesheet.json"
    }
}

Walkthrough

As part of a larger project I wanted to be able to represent the structure and components of the EPUB format in JSON. The starting point represented here is the file structure. Some top-level keywords are unavoidable: the parser needs to know the terms "platform", "version", "text-encoding", "subdirectories", "files", "file-locations", and "file-templates". It does not need to know what "META-INF", "OEBPS", etc. are, what files they contain, or where to find their templates. Instead it must understand a few rules:
  1. If the value of a file object is any string other than "$0", it is a static file and that string can be saved directly to the file. No additional work is needed for this file.
  2. If the value of a file object is "$0", the content is dynamic and so there will be a template. (In many GREP and regular-expression replace syntaxes, $0 stands for the whole matched string, hence its use here.)
  3. Every file has a location. If the location is null, the file is not placed within any subdirectory and instead sits inside the main (parent) folder.
  4. A key such as "[chapters]" or "[images]" indicates an array of files, not a single file and not an actual filename. The words "chapters" and "images" are keywords, and the square brackets signal an array.
  5. Each dynamically created file has a template that guides its construction. A null template indicates a static file whose unchanging text is not specific to the individual EPUB.
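
As a rough illustration of how a parser might apply these five rules, here is a minimal Python sketch. The function name `describe`, the keys of the summary it returns, and the trimmed-down data are all made up for this example and are not part of the format itself.

```python
import json

# Abridged copy of the structure above, kept short for readability.
SPEC = json.loads("""
{
    "files": [
        {"mimetype": "application/epub+zip"},
        {"[images]": "$0"},
        {"content.opf": "$0"}
    ],
    "file-locations": {
        "mimetype": null,
        "[images]": "OEBPS/images",
        "content.opf": "OEBPS"
    },
    "file-templates": {
        "mimetype": null,
        "[images]": "images.json",
        "content.opf": "content.json"
    }
}
""")


def describe(spec):
    """Apply rules 1 to 5 to each file object and return one summary per file."""
    summaries = []
    for entry in spec["files"]:
        # Each file object holds exactly one key/value pair.
        name, content = next(iter(entry.items()))
        is_array = name.startswith("[") and name.endswith("]")  # rule 4
        is_dynamic = content == "$0"                             # rule 2
        summaries.append({
            "name": name[1:-1] if is_array else name,
            "kind": "array" if is_array else "single",
            # Rule 1: any other string is the literal file content.
            "static_text": None if is_dynamic else content,
            # Rule 3: a null location means the parent folder ("." here).
            "directory": spec["file-locations"][name] or ".",
            # Rule 5: only dynamic files have a template.
            "template": spec["file-templates"][name] if is_dynamic else None,
        })
    return summaries


for summary in describe(SPEC):
    print(summary)
```

Note that the parser never needs to know what "content.opf" or "OEBPS" mean; it only follows the rules.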
One of the major choices made here was not to place file-location and file-template information inside the file objects. There are three reasons for this choice. First, it avoids having to filter the content of the file objects. At the moment each has a single key and value, so the programmer gets the file names immediately, without != tests to skip other keys. Second, JSON is intended to be as human-readable as possible, so having lists of file-locations and file-templates that are immediately visible and readable is a good thing, without being too repetitious. Third, once the filename is known, it can be used to quickly access location and template information.
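
To make the first and third reasons concrete, here is a small Python sketch under some assumptions: the data is a cut-down copy of the example, and the `nested_entry` layout is a hypothetical alternative invented purely for comparison.

```python
# Cut-down copies of the three top-level collections from the example.
files = [
    {"mimetype": "application/epub+zip"},
    {"content.opf": "$0"},
    {"[chapters]": "$0"},
]
file_locations = {"mimetype": None, "content.opf": "OEBPS", "[chapters]": "OEBPS"}
file_templates = {"mimetype": None, "content.opf": "content.json", "[chapters]": "chapters.json"}

# Each file object has exactly one key, so the names need no filtering.
names = [next(iter(entry)) for entry in files]

# Each name is then a direct key into the two side tables.
for name in names:
    print(name, file_locations[name], file_templates[name])

# Hypothetical nested alternative (NOT the format used in this post):
# the file name shares its object with bookkeeping keys.
nested_entry = {"content.opf": "$0", "location": "OEBPS", "template": "content.json"}

# Recovering the file name now means filtering out the known keys.
nested_name = [key for key in nested_entry if key not in ("location", "template")][0]
print(nested_name)
```

The nested form is not much longer here, but every consumer has to repeat that filter and keep its list of reserved keys up to date.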

So while it might seem almost wasteful in terms of space not to put all file information in one place inside the file objects, when it comes to decoding it is likely to take fewer lines of code. Similarly, we might not strictly need to list all subdirectories, but listing them means they can easily be constructed ahead of time and, again, a human can very quickly see which subdirectories are required.
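
For instance, building the folders ahead of time can be a single loop over the "subdirectories" array. This is only a sketch: the helper name `make_subdirectories` is an assumption, and a temporary directory stands in for wherever the EPUB is actually being assembled.

```python
import tempfile
from pathlib import Path

# The "subdirectories" array from the example structure.
subdirectories = ["META-INF", "OEBPS", "OEBPS/css", "OEBPS/images"]


def make_subdirectories(root, subdirectories):
    """Create every listed subdirectory under root and return the created paths."""
    created = []
    for relative in subdirectories:
        path = Path(root) / relative
        # parents=True also covers a nested entry listed before its parent.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# A throwaway folder stands in for the real build location.
with tempfile.TemporaryDirectory() as root:
    for path in make_subdirectories(root, subdirectories):
        print(path.relative_to(root).as_posix())
```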

Conclusion

In general terms, nesting often means more digging, more looping, and more filtering, so it tends to add to the time it takes to decode JSON, both for an app and for a human. Further, while it is often desirable for a JSON file to be complete and not require linked files, in something as large and complex as an EPUB I have chosen to accept linked template files so that the content remains brief enough to be read and edited by a human if and when the evolving EPUB spec changes. However, for the files that are static and brief (mimetype, container.xml and cover.xhtml) I saw no point in dragging out their creation, so their strings are placed inline.

Finally, it should be noted that the purpose of this work, outlining an EPUB, is to avoid hard-coding knowledge of file structures and composition into an app, and instead to keep things fluid, cross-platform, and open to change.

