Splitting a Large HTML File, Part 2


My previous post described my first attempt to split a large HTML file into several parts. The file in question is a cookbook-style tutorial for the Beacon online membership management system used by my local U3A organisation. The document was originally created with Apple’s Pages word processor app, exported to PDF and uploaded to the U3A website. I had rewritten it in HTML so that other U3As could tailor it to their requirements without needing Pages or the Adobe PDF editor.

In HTML form, the cookbook consisted of one large HTML file, a couple of small CSS files and about 25 image files (a mix of PNG and JPEG). Once I had an initial draft, I had planned to tweak each ‘recipe’, replacing images with updated screenshots to reflect changes to the Beacon screens since I had written the Pages version. This would have been easier if each ‘recipe’ were in its own file, so that I didn’t have to search through a lot of difficult-to-read HTML text to find the image element I wanted to update.

Alternative Approaches

As noted in the previous post, the official method for including HTML snippets into an HTML file doesn’t solve my particular problem, so I continued to search for answers.

I have now tried these alternative approaches:

  • One large HTML file (this is the problem)
  • W3 Include HTML (see previous post)
  • Multiple HTML files
    • Dynamic HTML elements
    • Browser storage
  • Awk aggregation script
  • Iframe Navigation Bar

Unfortunately, though, none of these offers a completely satisfactory solution, as I will now explain.

Multiple HTML Files

If one large HTML file is awkward to edit and there’s no easy way to include HTML snippets, perhaps we should just accept that HTML is not suited to this task. Perhaps it would be better to structure my cookbook as a loose collection of HTML files, in the same way that a website is a collection of HTML-based pages.

We can, of course, do this. The trouble is, in order to look like a single document, some parts of every cookbook page need to look the same. The navigation bar, for example, should be identical on every page. We could just copy the HTML for the navbar into each of the ‘recipe’ files. But this is bad programming practice, and we’d like to avoid it if we can.

Dynamic HTML Elements

Problems like this come up all the time in software development, and the usual solution is to factor out the common part and somehow ‘include’ it where it is needed in the program. We might define a function in a common file, add it to a library, and call it from many places in the program. Or we might define a class in a header file and include it in our source files. But those mechanisms are not available to HTML files.

We can, however, put JavaScript code into .js files, include them in HTML files and use the functions and classes they contain. If we use a JS script to create the navbar, we can call it from every ‘recipe’ page, guaranteeing that it is the same throughout the document.
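As a rough idea of what such a script might look like, here is a minimal sketch. The recipes list, the file names, the navbar class and the buildNavbar function are all invented for illustration; none of them is taken from the actual cookbook.

```javascript
// Hypothetical list of cookbook pages; titles and file names are placeholders.
const recipes = [
  { title: "Introduction", file: "index.html" },
  { title: "Recipe 1", file: "recipe1.html" },
];

// Build the navbar markup as a string so that every page gets identical HTML.
function buildNavbar(items) {
  const links = items
    .map((item) => `<a href="${item.file}">${item.title}</a>`)
    .join("\n");
  return `<nav class="navbar">\n${links}\n</nav>`;
}

// In a browser, insert the navbar at the top of the page once it has loaded.
// The guard lets the same file be loaded outside a browser without errors.
if (typeof document !== "undefined") {
  document.addEventListener("DOMContentLoaded", () => {
    document.body.insertAdjacentHTML("afterbegin", buildNavbar(recipes));
  });
}
```

Each recipe page would load this file with a script element, so a change to the list would show up on every page.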

This idea does work. But it means the document writer has to do some programming, and very few members of U3As have the necessary skills. Something simpler is needed.

Browser Storage

Instead of creating identical navbar elements in every recipe page, we can create it in the first page using simple HTML syntax, save it in browser storage and load it when the reader visits each recipe page.

For this to work, we need two JavaScript functions: one to save an HTML element and another to load a previously saved element. This, of course, requires some programming, but the functions are not specific to the cookbook, so they can be written by a suitably skilled software developer and supplied in a script file along with the document’s HTML files. A non-programming cookbook editor doesn’t need to understand them.

The HTML5 standard specifies two types of browser storage: local storage and session storage. Local storage keeps information indefinitely; session storage lasts only until the browser tab or window is closed.
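The two functions might look something like the sketch below. The names saveElement and loadElement, the element id and the storage key are my own placeholders, not the code I actually used. The storage and doc parameters default to the browser’s sessionStorage and document; passing localStorage instead would switch to local storage.

```javascript
// Save the HTML of the element with the given id under the given storage key.
// Returns false if the page has no such element.
function saveElement(id, key, storage = sessionStorage, doc = document) {
  const element = doc.getElementById(id);
  if (element === null) return false;
  storage.setItem(key, element.outerHTML);
  return true;
}

// Replace the placeholder element with the given id by the HTML saved earlier.
// Returns false if nothing was saved or the placeholder is missing.
function loadElement(id, key, storage = sessionStorage, doc = document) {
  const html = storage.getItem(key);
  const placeholder = doc.getElementById(id);
  if (html === null || placeholder === null) return false;
  placeholder.outerHTML = html;
  return true;
}
```

The first page would call saveElement("navbar", "cookbook-navbar") once its navbar exists, and each recipe page would contain an empty placeholder element with the same id and call loadElement("navbar", "cookbook-navbar").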

I tried local storage first. For reasons that were not immediately obvious, this did not work in Firefox. So, then I tried session storage, and that works quite nicely. But this is still not what I was looking for.

Both variations of the multiple files approach rely on JavaScript functions. Some particularly nervous or security-conscious web surfers disallow JavaScript in their browsers, and different browsers handle local/session storage in different ways. Our cookbook readers will see different behaviour depending on which browser they use and how it is configured. And that’s something we would like to avoid in a document aimed at a largely computer-averse audience.

Awk Aggregation Script

My next idea was to write a tool that will aggregate a collection of HTML files into a single file. The cookbook author would create one template file and a number of recipe files. The template would contain introductory text and an include directive for each recipe. The tool would replace the include directives with the corresponding recipe file, turning multiple small HTML files into one large HTML document.
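The replacement step itself is simple, and can be sketched as a single function. This sketch is written in JavaScript only to keep it in the same language as the other examples; it is not the tool I ended up using. It assumes each include directive sits on its own line in the form w3-include-html="file.html", and both aggregate and readFile are hypothetical names.

```javascript
// Replace each include line in the template with the text of the file it names.
// readFile is a caller-supplied function that returns the text of a named file.
function aggregate(templateText, readFile) {
  // Assumed directive form: w3-include-html="file.html"
  const includePattern = /w3-include-html="([^"]+)"/;
  return templateText
    .split("\n")
    .map((line) => {
      const found = line.match(includePattern);
      // Include lines are replaced wholesale; all other lines pass through.
      return found ? readFile(found[1]) : line;
    })
    .join("\n");
}
```

Because the file reader is passed in, the function can be tried out with a simple lookup table before it is pointed at real files.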

This is equivalent to the w3-include-html mechanism from my previous post, but instead of using a JavaScript function that sends requests to a web server, it would use local command-line functions.

The Unix family of operating systems provides a tool that does most of the work; it is called ‘awk’. Awk’s actions are governed by a script made up of rules, each consisting of a pattern and an action. Awk reads an input file a line at a time and executes the action of every rule whose pattern matches the line it has just read.

Here is the awk script that I used:

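BEGIN { system("rm -f Cookbook.html") }              # start with a fresh output file
/w3-include-html/ {                                  # include directive found
    s = match($0,("[a-z]*\\.html"))                  # locate the recipe file name
    fn = substr($0,s,RLENGTH)
    close("Cookbook.html")                           # flush pending output to keep the order
    cmd = sprintf("cat %s >> Cookbook.html", fn)
    system(cmd)                                      # append the whole recipe file
}
!/w3-include-html/ { print $0 >> "Cookbook.html" }   # copy every other line unchanged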

The first line tells awk to remove the Cookbook.html file. The last line tells awk to append lines in the template file that do not contain “w3-include-html” to the Cookbook.html file. And the lines in between tell awk to look for lines in the template that do contain “w3-include-html” and append to the Cookbook.html file the whole of the recipe file it references.

This is a very naive awk script, but it is enough for a programmer with a little knowledge of Unix-like operating systems to turn a collection of small HTML files into a single HTML document. And it doesn’t use JavaScript or require a local web server. The downside of this, though, is that the cookbook author/editor has to have access to the ‘awk’ utility, and they have to be familiar enough with the command line to be able to execute the appropriate ‘awk’ command. Most U3As don’t have that.

Iframe Navigation Bar

One more option for putting the common parts of my cookbook into their own HTML file occurred to me: iframes. According to the W3Schools HTML Tutorial, “An HTML iframe is used to display a web page within a web page”. Unfortunately, though, that is exactly what it does.

To put my navbar in a separate HTML file, I would have to create a whole new document for the navbar and use an iframe to refer to it in the main cookbook HTML file. By default, the two documents would have nothing in common. This takes us in the opposite direction from where I want to go. Instead of a single, consistent product with multiple components, we would have multiple products with only ad hoc relationships between them.


What I assumed was a simple problem turned out to be more difficult than I had imagined. It is possible to make an easy-to-edit version of my cookbook by converting it to HTML, but only if the editor has I.T. skills that most U3As lack.

The fundamental problem is that the HTML standard is designed to create websites containing multiple pages; it is not designed to create documents composed of multiple HTML files.

By stoneyfish

Humanist and retired software engineer with a love of music.
