DOCX to JSON in Aldwych and Swift: The Second Parsing


Planning the ascent

The current aim in working with DOCX in Aldwych is not to round-trip DOCX files and recreate valid Word documents, but to get at the essential content: text, paragraph and character styles, footnotes, and endnotes. (Tables are pushed to one side for now, with the intent of returning to them once the rest is working well.) The first step in working with DOCX is unzipping the file. For this I will use SSZipArchive (which is written in Objective-C, so I'll need to get it working with Swift first).

Once the file is unzipped and saved to a folder on disk, it becomes possible to read the XML files inside. But first we need to note where the required content lives: there's a top-level word folder, and inside it a range of XML files, the most important of which for our current needs are document.xml, endnotes.xml, footnotes.xml and styles.xml.
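As a rough sketch of that layout, the snippet below builds the expected paths for the four files inside an already unzipped folder and checks which ones exist. The DocxPart enum, the partURLs function and the /tmp path are names I've made up for illustration; only the file names come from the description above.

```swift
import Foundation

// Hypothetical helper type: the raw values are the file names mentioned above.
enum DocxPart: String, CaseIterable {
    case document = "document.xml"
    case endnotes = "endnotes.xml"
    case footnotes = "footnotes.xml"
    case styles = "styles.xml"
}

/// Returns the expected location of each part inside an unzipped DOCX folder.
func partURLs(inUnzippedFolder root: URL) -> [DocxPart: URL] {
    let wordFolder = root.appendingPathComponent("word", isDirectory: true)
    var urls: [DocxPart: URL] = [:]
    for part in DocxPart.allCases {
        urls[part] = wordFolder.appendingPathComponent(part.rawValue)
    }
    return urls
}

// Assumed location of the unzipped file; replace with the real folder.
let root = URL(fileURLWithPath: "/tmp/unzipped-docx", isDirectory: true)
let urls = partURLs(inUnzippedFolder: root)

// Not every DOCX is guaranteed to contain every part, so check before reading.
for part in DocxPart.allCases {
    if let url = urls[part] {
        let exists = FileManager.default.fileExists(atPath: url.path)
        print(part.rawValue, exists ? "found" : "missing")
    }
}
```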

We'll worry about variations at a later date. For now I want to make the DOCX file look the same as, or very similar to, the structure of an imported XHTML file. How do we do this? First, there are some fairly simple name changes: w:document becomes html, w:body becomes body, and w:p becomes p.
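The renames could be held in a simple lookup table. This is only a sketch: the elementNames dictionary and the jsonElementName function are illustrative names, and passing unknown elements through unchanged is an assumption rather than a decided behaviour.

```swift
// Mapping taken from the renames listed above.
let elementNames: [String: String] = [
    "w:document": "html",
    "w:body": "body",
    "w:p": "p"
]

/// Returns the XHTML-style name for a DOCX element,
/// or the original name when no rename is known yet (an assumption).
func jsonElementName(for docxName: String) -> String {
    return elementNames[docxName] ?? docxName
}

print(jsonElementName(for: "w:p"))   // prints "p"
print(jsonElementName(for: "w:tbl")) // prints "w:tbl"
```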

Next we get to understanding some of the structure. The first structural observations are that a w:p contains runs (w:r), and inside these runs are text (w:t) and sometimes run properties (w:rPr), which is where local character formatting is held, e.g. w:i or w:b denoting that the text within that run is italic or bold respectively. If there's a character style, it appears inside w:rPr as a w:rStyle (run style) tag with a w:val attribute, whose value corresponds to an entry in the styles.xml file (these identify the character styles).
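To make the run structure concrete, here is a minimal sketch that walks a single w:p with Foundation's XMLParser and collects each run's text, italic and bold flags, and character style. The Run struct and ParagraphParser class are hypothetical names, the sample XML is hand-written, and treating a w:val of "0", "false" or "off" as switching the formatting off is my assumption about how toggles should be read.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML
#endif

// Hypothetical model of a single run.
struct Run {
    var text = ""
    var isItalic = false
    var isBold = false
    var styleName: String? = nil
}

final class ParagraphParser: NSObject, XMLParserDelegate {
    private(set) var runs: [Run] = []
    private var current: Run?
    private var inText = false

    // A toggle such as <w:i/> is on unless w:val explicitly turns it off (assumption).
    private func isOn(_ attributes: [String: String]) -> Bool {
        let value = attributes["w:val"] ?? "true"
        return !["0", "false", "off"].contains(value)
    }

    func parser(_ parser: XMLParser, didStartElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?,
                attributes attributeDict: [String: String] = [:]) {
        switch elementName {
        case "w:r": current = Run()
        case "w:i": current?.isItalic = isOn(attributeDict)
        case "w:b": current?.isBold = isOn(attributeDict)
        case "w:rStyle": current?.styleName = attributeDict["w:val"]
        case "w:t": inText = true
        default: break
        }
    }

    func parser(_ parser: XMLParser, foundCharacters string: String) {
        // Only keep characters that sit inside a w:t element.
        if inText { current?.text += string }
    }

    func parser(_ parser: XMLParser, didEndElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?) {
        if elementName == "w:t" { inText = false }
        if elementName == "w:r", let run = current {
            runs.append(run)
            current = nil
        }
    }
}

// Hand-written sample paragraph, not taken from a real file.
let xml = """
<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r><w:t>Plain </w:t></w:r>
  <w:r><w:rPr><w:i/></w:rPr><w:t>italic</w:t></w:r>
  <w:r><w:rPr><w:rStyle w:val="Emphasis"/></w:rPr><w:t> styled</w:t></w:r>
</w:p>
"""

let delegate = ParagraphParser()
let parser = XMLParser(data: Data(xml.utf8))
parser.delegate = delegate
parser.parse()
for run in delegate.runs { print(run) }
```

Because the flags are only set while a run is open, any w:i or w:b that appears outside a w:r (for example in the paragraph's own properties) is ignored by this sketch.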

Paragraphs themselves can have a w:pPr element, which contains paragraph properties. Again, these might contain local formatting and font names, or a link to a style in the styles.xml file. Note: if the text is italic or bold because of its style (for example a character style), then this information won't also be stored locally using w:i or w:b; it will need to be ascertained from the style definition.

For the purposes of the JSON format, a w:val will be treated simply as a class attribute throughout.
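As an illustration of carrying a w:val over as a class, the sketch below builds a paragraph node and serialises it with JSONSerialization. The key names ("name", "class", "children"), the span wrapper for a styled run and the sample style values are all placeholders of mine, not the settled JSON-book format.

```swift
import Foundation

// Hypothetical JSON shape: "name", "class" and "children" are illustrative keys.
func element(name: String, styleValue: String?, children: [Any]) -> [String: Any] {
    var node: [String: Any] = ["name": name, "children": children]
    if let styleValue = styleValue {
        node["class"] = styleValue   // the w:val is carried over as a class attribute
    }
    return node
}

// A paragraph with a paragraph style, holding one run with a character style.
let paragraph = element(name: "p", styleValue: "Heading1", children: [
    element(name: "span", styleValue: "Emphasis", children: ["Planning the ascent"])
])

// Elements without a style simply have no "class" key.
let plain = element(name: "p", styleValue: nil, children: ["Body text"])

if let data = try? JSONSerialization.data(withJSONObject: [paragraph, plain],
                                          options: [.prettyPrinted, .sortedKeys]),
   let json = String(data: data, encoding: .utf8) {
    print(json)
}
```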

A question of notes

When it comes to the appropriately named elements w:footnoteReference and w:endnoteReference, these will both become an inline note in the JSON format, with their content retrieved from the footnotes.xml and endnotes.xml files. The observant will know that note isn't an HTML element tag, but notes are not straightforward when converting for X(HT)ML output, especially in an EPUB scenario.
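A sketch of that lookup, assuming the note bodies have already been collected into dictionaries keyed by the w:id that each reference carries. The InlineNote type, the "kind" labels and the sample note text are invented for illustration.

```swift
// Hypothetical result type for an inline note in the JSON format.
struct InlineNote: Equatable {
    let kind: String   // "footnote" or "endnote" (illustrative labels)
    let text: String
}

// id -> note text, as it might be gathered from footnotes.xml and endnotes.xml.
let footnotes: [String: String] = ["2": "A sample footnote."]
let endnotes: [String: String] = ["2": "A sample endnote."]

/// Turns a reference element and its w:id into an inline note,
/// or nil when the element is not a note reference or the id is unknown.
func resolveNote(elementName: String, id: String) -> InlineNote? {
    switch elementName {
    case "w:footnoteReference":
        return footnotes[id].map { InlineNote(kind: "footnote", text: $0) }
    case "w:endnoteReference":
        return endnotes[id].map { InlineNote(kind: "endnote", text: $0) }
    default:
        return nil
    }
}

if let note = resolveNote(elementName: "w:footnoteReference", id: "2") {
    print(note.kind, note.text)
}
```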

Explaining EPUB

When prepared for iBooks, a note is wrapped in an aside, but when presented in a regular EPUB or one used for Kindle conversion, the notes are usually contained in a list or even in paragraphs. In building Aldwych, I'm also organically growing the idea of a JSON-book format, a structure that is logical, repeatable and lightweight. One of the current assumptions is that all notes will be stored inline; on export they will either pop up, as in iBooks, or the numbering and notes list will be auto-generated in translation to XHTML.
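The two export scenarios could look something like the sketch below, which takes one paragraph's worth of text and inline notes and produces either asides or an auto-numbered list. The Piece and NoteStyle types, the renderXHTML function and the n1, n2 id scheme are all assumptions; the epub:type values follow the usual EPUB 3 pop-up footnote pattern, and the exact markup Aldwych emits may well differ.

```swift
import Foundation

// Hypothetical model: a paragraph is a sequence of text and inline notes.
enum Piece {
    case text(String)
    case note(String)
}

enum NoteStyle {
    case popUpAside     // iBooks-style pop-up notes
    case numberedList   // regular EPUB or Kindle-style list of notes
}

// Minimal escaping so note text cannot break the markup.
func escape(_ s: String) -> String {
    return s.replacingOccurrences(of: "&", with: "&amp;")
            .replacingOccurrences(of: "<", with: "&lt;")
            .replacingOccurrences(of: ">", with: "&gt;")
}

func renderXHTML(_ pieces: [Piece], style: NoteStyle) -> String {
    var body = ""
    var notes: [String] = []
    for piece in pieces {
        switch piece {
        case .text(let value):
            body += escape(value)
        case .note(let value):
            notes.append(escape(value))
            let n = notes.count   // numbering is generated here, not stored
            switch style {
            case .popUpAside:
                body += "<a epub:type=\"noteref\" href=\"#n\(n)\">\(n)</a>"
            case .numberedList:
                body += "<sup><a href=\"#n\(n)\">\(n)</a></sup>"
            }
        }
    }
    var output = "<p>\(body)</p>\n"
    switch style {
    case .popUpAside:
        for (index, note) in notes.enumerated() {
            output += "<aside epub:type=\"footnote\" id=\"n\(index + 1)\"><p>\(note)</p></aside>\n"
        }
    case .numberedList:
        if !notes.isEmpty {
            output += "<ol>\n"
            for (index, note) in notes.enumerated() {
                output += "<li id=\"n\(index + 1)\">\(note)</li>\n"
            }
            output += "</ol>\n"
        }
    }
    return output
}

let pieces: [Piece] = [.text("Some text"), .note("A note."), .text(" and more.")]
print(renderXHTML(pieces, style: .popUpAside))
print(renderXHTML(pieces, style: .numberedList))
```

Keeping the notes inline in the JSON means the numbering only exists at export time, so reordering content never leaves stale note numbers behind.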

Paragraphs for all

One issue that I've ignored up to this point is that DOCX doesn't have h1, h2, h3 and so on. Every paragraph-level format is presented inside a w:p, which will then have a style w:val to identify Heading 1, etc. But a heading style could be named anything, so even doing a text search might not help.

There is a webSettings.xml file that provides some assistance, for instance in identifying blockquotes in a DOCX, but it doesn't appear to help with elements like headings. There might be some indicators for auto-detection, however, such as short strings on a single line or a larger font size. Otherwise it will be down to the user to fix this (or for the content to be left in paragraph form).
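A possible shape for that auto-detection is sketched below: trust a style whose name starts with "Heading" and ends in a digit, and otherwise fall back to the short, single-line, larger-font indicators. Everything here is guesswork to be tuned: the ParagraphInfo type, the 60-character limit, the 1.2 size ratio, the 12 point body size and the choice of level 2 for guessed headings are arbitrary assumptions.

```swift
// Hypothetical summary of what is known about a paragraph.
struct ParagraphInfo {
    let text: String
    let styleName: String?   // the paragraph style w:val, if any
    let fontSize: Double?    // in points, if known
}

/// Returns a guessed heading level (1...6), or nil to leave the content as a paragraph.
func guessHeadingLevel(_ p: ParagraphInfo, bodyFontSize: Double = 12) -> Int? {
    // 1. A style called "Heading 1", "Heading1", "heading 2" and so on.
    if let style = p.styleName?.lowercased(), style.hasPrefix("heading"),
       let digit = style.last, let level = digit.wholeNumberValue,
       (1...6).contains(level) {
        return level
    }

    // 2. Fallback indicators: a short string on a single line in a larger font.
    let isShort = !p.text.isEmpty && p.text.count <= 60 && !p.text.contains("\n")
    let endsLikeSentence = p.text.hasSuffix(".")
    if isShort, !endsLikeSentence, let size = p.fontSize, size >= bodyFontSize * 1.2 {
        return 2   // arbitrary default level for guessed headings
    }
    return nil
}

let styled = ParagraphInfo(text: "Planning the ascent", styleName: "Heading 1", fontSize: nil)
let guessed = ParagraphInfo(text: "A question of notes", styleName: "Custom", fontSize: 18)
let body = ParagraphInfo(text: "This is an ordinary sentence.", styleName: nil, fontSize: 12)

print(guessHeadingLevel(styled) ?? 0)   // prints 1
print(guessHeadingLevel(guessed) ?? 0)  // prints 2
print(guessHeadingLevel(body) ?? 0)     // prints 0
```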

Leftovers

I'm going to leave preserving fonts and font sizes until the second phase. I'm also not interested in things like page margins and frame positioning yet. I've also not really mentioned the filtering of attributes, the things that will be left out when going from DOCX to XHTML, although some transformations have been mentioned. There are probably a load of other things I'll find as I go as well, but for now this is the plan, and now I need to write the code to incorporate into Aldwych.

Beyond DOCX

I've no current plans to tackle the old DOC format, but that doesn't mean it won't happen at some point. The thing that I'll face alongside DOCX is Markdown. There has been some work on a Swift parser for Markdown called MarkingBird, but again I want a parser that produces content consistent with the JSON created by parsing XHTML and DOCX, so that its export, for EPUB in particular, is universal. For that I'll need to dust off my RegEx.
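As a first taste of the regular expression route, here is a small sketch that recognises an ATX-style Markdown heading with NSRegularExpression and returns its level and text, which could then feed the same h1 to h6 element names used for XHTML. The parseMarkdownHeading function is hypothetical and handles only this one construct.

```swift
import Foundation

/// Matches lines such as "## Title" and returns the heading level and text.
/// Only ATX-style headings are handled in this sketch.
func parseMarkdownHeading(_ line: String) -> (level: Int, text: String)? {
    guard let regex = try? NSRegularExpression(pattern: "^(#{1,6})\\s+(.+)$") else {
        return nil
    }
    let range = NSRange(line.startIndex..., in: line)
    guard let match = regex.firstMatch(in: line, range: range),
          let hashes = Range(match.range(at: 1), in: line),
          let text = Range(match.range(at: 2), in: line) else {
        return nil
    }
    // The number of leading # characters gives the heading level.
    return (line[hashes].count, String(line[text]))
}

if let heading = parseMarkdownHeading("## Planning the ascent") {
    print("h\(heading.level)", heading.text)   // prints "h2 Planning the ascent"
}
```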
