Taming Wildcards in Word (for eBook Text Tagging)

Previous posts on regular expressions have focused on Find and Replace in OpenOffice. But Word has its own flavour of regex/grep, called wildcards. Its advantage is that you can search for styles applied to text (e.g. Heading 1, Heading 2) and for note markers, two jobs that OpenOffice does not handle perfectly.

You can also search for text indents and first-line indents (OpenOffice can handle both too, but I mention them here for convenience). These make it possible to tag tricky things like blockquotes accurately.

So where do you start? Open Find and Replace and tick the 'Use wildcards' check box. Then take a look at this list of codes.

There are a couple of annoyances:

(1) When searching for italics that run across more than one word, as in a book title, it appears you can't instruct Word to find as many words as are italicized. You need to wrap each italicized word in <i>...</i> individually, then search for and delete each instance of </i> <i> afterwards [if anyone has found a way around this please let me know in the comments].

(2) In Word 2004 (Mac), finding a paragraph marker using ^13 (or ^p) doesn't work. It's a bug in the program, so you need to work around it or use a different version (I don't know when or if this has been fixed for Mac, but there have been two versions of Word since 2004).
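For the italics annoyance, the clean-up pass that removes each </i> <i> pair can also be run outside Word once the tagged text has been exported. The following is only a minimal sketch in Python, assuming the text is available as a plain string. The function name merge_adjacent_italics is made up for illustration and is not part of Word or any library.

```python
import re

def merge_adjacent_italics(text: str) -> str:
    """Collapse '</i> <i>' boundaries left by tagging each italic word separately."""
    # Keep the whitespace between the two words, drop the close/open tag pair.
    return re.sub(r"</i>(\s+)<i>", r"\1", text)

sample = "She read <i>The</i> <i>Old</i> <i>Curiosity</i> <i>Shop</i> twice."
print(merge_adjacent_italics(sample))
# prints: She read <i>The Old Curiosity Shop</i> twice.
```

Because only the tag pair is removed and the whitespace is kept, a run of individually tagged words ends up inside a single pair of tags.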

Here's an example of something you might want to do if you are manually marking up a Word doc to be used as XML for import into InDesign, and then repurposed as an ePUB and Kindle chapter (see earlier posts to learn why you might decide to do this).

First let's work around the Word 2004 bug I mentioned (since that is the version I am using).



With wildcards turned off, enter the following text (as shown above):

Find: ^p

Replace: </p>^p
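To make the effect of this replacement concrete, here is a rough Python equivalent. It is a sketch only, assuming the document has been saved as plain text with exactly one newline at the end of each paragraph. The name close_paragraphs is hypothetical.

```python
def close_paragraphs(text: str) -> str:
    """Put '</p>' in front of every paragraph break, like Find: ^p / Replace: </p>^p."""
    # Assumption: each paragraph ends with a single newline character.
    return text.replace("\n", "</p>\n")

doc = "First paragraph.\nSecond paragraph.\n"
print(close_paragraphs(doc))
```

Every paragraph now ends in a literal </p>, which gives the later searches something visible to anchor on instead of the paragraph marker itself.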

Now we are ready to identify styled text, such as headings. For this example I'm going to find blockquotes.



In order to do this I need to open the extended Find and Replace options, tick the checkbox that has 'Use wildcards' next to it, and enter the following text:

Find: (*{1,})([/<][//]p[/>])

Replace: <blockquote>\1</blockquote>

But before you hit 'Find Next' or 'Replace All', there is one important final step. You need to select the 'Format' menu in the Find and Replace dialogue box. From here go to 'Paragraph...' and then under indentation enter 1.27 cm in the 'Left' box.



The measurement of 1.27 cm is the depth of indentation that Word typically uses, but if you have something different then you will have to highlight a blockquote on the page and go to the Format -> Paragraph menu to find out what the indentation is and use this instead.
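The logic of this step (match on indentation, then wrap everything before </p> in blockquote tags) can be sketched in Python. This is an illustration under stated assumptions, not what Word does internally: the Paragraph class, its left_indent_cm field and the function tag_blockquotes are all invented here, and the indent values are assumed to have been read off by hand from Format -> Paragraph.

```python
import re
from dataclasses import dataclass

@dataclass
class Paragraph:
    text: str              # paragraph text, already ending in </p>
    left_indent_cm: float  # left indentation, entered by hand for this sketch

BLOCKQUOTE_INDENT_CM = 1.27  # the value used in this post; change it to match your document

def tag_blockquotes(paragraphs, indent_cm=BLOCKQUOTE_INDENT_CM):
    """Wrap paragraphs with the given left indent in <blockquote> tags."""
    tagged = []
    for para in paragraphs:
        if abs(para.left_indent_cm - indent_cm) < 0.005:
            # Capture everything before the closing </p> and re-emit it
            # between blockquote tags, dropping the </p> itself.
            new_text = re.sub(r"^(.+)</p>$", r"<blockquote>\1</blockquote>",
                              para.text, flags=re.DOTALL)
        else:
            new_text = para.text
        tagged.append(new_text)
    return tagged

paragraphs = [
    Paragraph("Body text.</p>", 0.0),
    Paragraph("A quoted passage.</p>", 1.27),
]
print(tag_blockquotes(paragraphs))
```

Only the indented paragraph is rewritten; ordinary paragraphs keep their </p> untouched.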

In this post I've used </p> to mark the end of a paragraph because, when marking up for XHTML, most paragraphs will end in </p>. Those that don't can be found and replaced with something different, as in the blockquote example here.

Ultimately what grep/regex/wildcards offer is an alternative to cleaning up output from InDesign or Pages by getting your markup right first time and exactly as you require it.

If you combine and save your regular expressions as a macro (or macros), it is even possible to automate most of the markup, saving even more time.
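The idea of chaining saved searches can be illustrated outside Word as an ordered list of find/replace pairs applied one after another. The Python below is a sketch of that idea only, not a Word macro. STEPS and run_steps are hypothetical names, and the input is assumed to be plain text with one newline per paragraph.

```python
import re

# Ordered (pattern, replacement) pairs; each stands in for one saved Find and Replace.
STEPS = [
    (r"\n", "</p>\n"),         # close every paragraph
    (r"</i>(\s+)<i>", r"\1"),  # merge italics that were tagged word by word
]

def run_steps(text, steps=STEPS):
    """Apply each replacement in order and return the final text."""
    for pattern, replacement in steps:
        text = re.sub(pattern, replacement, text)
    return text

print(run_steps("<i>War</i> <i>and</i> <i>Peace</i> is long.\n"))
```

The order matters, since a later step sees the output of the earlier ones, so it is worth keeping the list in the same sequence you would follow by hand.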
