OpenOffice, Regular Expressions and eBooks

This post (and others that will follow) is for people who would rather convert their original Word doc (or similar) into XML themselves than let Pages, InDesign or any other program do the ePUB conversion. The resulting XML can then be imported into InDesign and can also be used as the basis for the XHTML for ePUB and Kindle versions.

More importantly, it is for people who want to save time.

First question: why work on the Word doc in this way instead of the text file in a great program like TextWrangler or BBEdit?

Answer: most programs designed for editing text and code do not read Word files or RTF files, so all of the author's formatting (like italics and bold) is lost on conversion into plain text, as are automated footnotes, etc.

While you'll end up editing XML files in a text editor eventually, first we need to convert all the word-processor-specific formatting into something a text editor will retain and pass on to InDesign or Sigil, for example.
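To make that goal concrete, here is a minimal Python sketch of the idea. It is only a model: the list of (text, style) runs, the `STYLE_TAGS` mapping and the `tag_runs` function are all hypothetical names made up for illustration, and a word processor stores its formatting quite differently. The point is simply that formatting held as an attribute becomes literal tags that survive in plain text.

```python
# Hypothetical mapping from a formatting attribute to the tag that replaces it.
STYLE_TAGS = {"italic": "i", "bold": "b"}


def tag_runs(runs):
    """Join (text, style) runs, wrapping each styled run in its tag."""
    parts = []
    for text, style in runs:
        tag = STYLE_TAGS.get(style)
        # Unstyled runs (style is None) pass through unchanged.
        parts.append(f"<{tag}>{text}</{tag}>" if tag else text)
    return "".join(parts)


# An invented paragraph: one italic run and one bold run among plain text.
paragraph = [
    ("She read ", None),
    ("Emma", "italic"),
    (" and was ", None),
    ("not", "bold"),
    (" amused.", None),
]

print(tag_runs(paragraph))
```

Once the formatting is written out as tags like this, any text editor will keep it.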

So this is the first step. Download a copy of OpenOffice, NeoOffice or LibreOffice if you don't already own one. They are free, but a donation is recommended.

Next, open your document in OpenOffice, then go to the Edit menu and select Find and Replace. Here select 'More Options' and make sure 'Regular expressions' is ticked.

In the find box enter the following: .+

Then press the Format button in the More Options section and select italic.

In the replace box enter: <i>&</i>

Your dialogue box should then look like this:



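For info: the . matches any single character and the + means "one or more", so .+ matches a whole run of characters. Because the search is restricted to italic formatting, each run of italic text is matched. The & in the replace box stands for whatever was matched, i.e. what was there originally. So if we had: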





And clicked 'Replace all', the result would be:




This can then easily be adapted to replace bold with <b>bold</b> - the more complex stuff I'll save for the next post.
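If it helps to see the pattern outside the dialogue box, here is a rough Python sketch of the same match-and-wrap idea. It illustrates the regular expression only, not what OpenOffice does internally: a plain string carries no formatting, so each sample string simply stands in for one run of italic or bold text. The `wrap_match` function is a hypothetical helper, and Python writes "the whole match" as `\g<0>` where OpenOffice uses &.

```python
import re


def wrap_match(text, tag):
    """Wrap every run matched by .+ in <tag>...</tag>."""
    # \g<0> is Python's way of saying "whatever was matched",
    # the role that & plays in the OpenOffice replace box.
    return re.sub(r".+", rf"<{tag}>\g<0></{tag}>", text)


# Each string stands in for one run of formatted text.
italic_run = "The Great Gatsby"
bold_run = "Warning"

print(wrap_match(italic_run, "i"))
print(wrap_match(bold_run, "b"))
```

Switching from italic to bold only changes the tag name; the pattern and the "whole match" reference stay the same.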

It should also be noted that Word offers similar functionality through the 'Use wildcards' option in its Find and Replace dialogue. Search the help indexes in OpenOffice and Word if you want to get stuck in and can't wait for the next post.

Comments

  1. Typo?

    "In the find box enter the following: +."

    shouldn't it be
    "In the find box enter the following: .+"

  2. Thanks for pointing this out Anne-Marie.
