PDF Generation: A long journey with a successful end

07.05.2020

As a survey software vendor, Survalyzer has to create PDF documents so that survey participants can download their answers as a document. The document needs to reflect exactly what the user saw in the survey. Furthermore, the company that issued the survey invitation wants to brand the PDF document with its corporate identity. Developing a suitable solution was a bumpy road with a lot of dead ends. Therefore, we decided to share our experience with PDF generation to help developers and architects make the right decisions the first time.

Starting point:
Our customers require a PDF confirmation containing the answers given in the survey. Fulfilling this requirement is essential, especially in governmental and educational environments. Our first approach was to build an HTML page that showed the given answers as a summary and offered client-side printing only. Once this was in place, we decided to turn this page into a PDF document on the server side. The decision was driven by the idea that we already had a rendered version of the answers and didn’t want to build a second one.

First technical implementation:

Since the summary page was a statically rendered HTML page, we chose to use the browser’s PDF rendering capabilities to generate the PDF document. To ensure that the document looks the same for all clients and does not depend on the user’s browser, the rendering was done on the server side. The following steps were executed:

[Figure: PDF generation process]

Implementation issues:

The implementation was done quickly since we leveraged the browser’s capabilities and used a well-known library called wkhtmltopdf (https://wkhtmltopdf.org/). This library took over the browser handling. As long as usage of the new functionality was low, everything looked pretty good. But when our customers started to use PDF rendering more and more, the problems began…

The main issue was that the whole process is very CPU- and memory-intensive and hard to parallelize. You can probably imagine what happens on your local PC if you open 100 browser windows: even on the best laptop, the operating system will grind to a halt. In such a situation, applications don’t work properly and don’t even behave deterministically. It is the same on servers as on your local laptop, just on a slightly larger scale. We were faced with zombie browser instances that never freed their memory and, even worse, consumed CPU in endless loops without producing any output. The final death blow for the solution was a breakdown of the whole system caused by the huge load during a very large and equally important survey.

We quickly mitigated this by moving the PDF rendering to an isolated server, so that at least the performance of the rest of the web application was no longer affected. At the same time, we advised our customers to avoid the PDF functionality, at least in large surveys.
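
The two failure modes described above, too many renderer processes running in parallel and hanging instances that never exit, can also be bounded in a generic way. The following is a minimal sketch and not the mitigation we applied: it assumes the renderer is started as an external command (the two commands below are hypothetical stand-ins launched through the Python interpreter), and the limits `MAX_PARALLEL_RENDERS` and `RENDER_TIMEOUT_SECONDS` are illustrative values.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_RENDERS = 2      # assumption: tune to the cores/memory of the render host
RENDER_TIMEOUT_SECONDS = 3.0  # assumption: hard upper bound per document


def run_render_job(command: list[str]) -> bool:
    """Run one external render command; return True only on a clean exit.

    subprocess.run kills the child process when the timeout expires,
    so a hanging renderer does not keep consuming CPU and memory.
    """
    try:
        result = subprocess.run(
            command, timeout=RENDER_TIMEOUT_SECONDS, capture_output=True
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def render_all(commands: list[list[str]]) -> list[bool]:
    """Run the jobs with at most MAX_PARALLEL_RENDERS processes at a time."""
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_RENDERS) as pool:
        return list(pool.map(run_render_job, commands))


# Hypothetical stand-ins for a real renderer call: a fast job and a hanging job.
fast_job = [sys.executable, "-c", "print('rendered')"]
hanging_job = [sys.executable, "-c", "import time; time.sleep(30)"]

results = render_all([fast_job, hanging_job, fast_job])
print(results)
```

The hanging job is terminated after the timeout and reported as failed, while the other jobs still complete. A cap like this protects the host, but it does not make the rendering itself any cheaper.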

This day was my personal Waterloo: I had taken over all of this from my predecessor and could only fight fires. But when the smoke cleared, my ambition was awakened to get rid of the current solution and to implement the best possible way to serve our clients with high-quality PDFs while protecting our systems from outages caused by high load.

Technical evaluation – Usage of a rendering library:

The decision to move the frontend from a statically rendered page (ASP.NET MVC) to a single-page application using the newest Angular framework was a game changer. We evaluated several libraries that advertise being able to use HTML as a source for PDF generation. The candidates were:

None of them was able to render our HTML5 + CSS3 page correctly. Mostly the outcome was empty pages or a broken layout.

Evaluation Summary:

After some investigation I understood why all the libraries failed:

The DOM of the HTML doesn’t contain all the information that is necessary to render the page correctly. The browser has a mechanism called shadow DOM which is not part of the HTML document. Modern SPA frameworks like Angular make heavy use of it to optimize the rendering. In addition, an SPA builds most of its DOM at runtime with JavaScript, so a library that only parses the delivered HTML sees little of the final page. This outcome brought me back to the initial solution, and I tried to find a way to deal with the performance issues of the browser-based solution using a modern cloud architecture.
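
One effect behind the empty pages is easy to demonstrate: a single-page application delivers only a shell, and the visible content is created later by JavaScript in the browser. The sketch below is an illustration under assumptions, not one of the evaluated libraries: the `app-root` element and the `main.js` script name are hypothetical, and a small standard-library parser stands in for a converter that reads the markup without executing scripts.

```python
from html.parser import HTMLParser


class VisibleTextCollector(HTMLParser):
    """Collects text outside <script>/<style>, as a static converter would see it."""

    def __init__(self) -> None:
        super().__init__()
        self.text_parts: list[str] = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def visible_text(html: str) -> list[str]:
    """Return the text nodes a parser finds without running any JavaScript."""
    parser = VisibleTextCollector()
    parser.feed(html)
    parser.close()
    return parser.text_parts


# Hypothetical SPA shell: the answers are rendered later by JavaScript.
spa_shell = "<html><body><app-root></app-root><script src='main.js'></script></body></html>"
# Statically rendered page: the answers are already part of the markup.
static_page = "<html><body><h1>Your answers</h1><p>Q1: Amazon</p></body></html>"

print(visible_text(spa_shell))    # []
print(visible_text(static_page))  # ['Your answers', 'Q1: Amazon']
```

For the static page the parser finds the answers; for the SPA shell there is nothing to put on the page, which matches the empty documents we got.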

Technical evaluation – Usage of Azure functions:

The focus now was to build a scalable browser-based solution that could handle any load and scale with demand. The technology of choice was Node.js Azure Functions based on a Windows image. To control the browser I chose the library Puppeteer (https://pptr.dev/). At first glance the approach looked charming and easy.

[Figure: Puppeteer source code]

Locally, this simple solution for PDF generation worked like a charm, so I decided to deploy it on Azure. But the deployed version didn’t render PDFs; instead, it crashed. After a closer look and some googling, I found out that Azure Functions run in an isolated sandbox. The basis for all rendering on Windows is the GDI+ library, and unfortunately access to it is restricted in the sandbox for security reasons. The articles and whitepapers advise developers to use a Linux image instead. I followed the advice and provisioned a Linux Azure Function. Fortunately, the code was based on Node.js and was therefore platform independent. Provisioning and deployment were a matter of minutes, but the next setback followed immediately.

Microsoft reasonably bases Linux Azure Functions on slim images to reduce loading times and increase performance. This makes perfect sense, but for my use case a lot of Linux shared libraries were missing, and even starting a headless browser was impossible. After googling again, I found out that I was neither the only one with this architectural approach nor the only one with the problem of missing libraries. It was clear that the default Linux images were unusable for my purpose.

Technical evaluation – Usage of Docker containers:

The recommended solution was Docker containers. I was familiar with this technology and had consumed published containers before, but I had never built one myself. Google helped again to get me up to speed, and I soon created my first container. I used public templates as a base but was eager to build it completely myself in order to understand the technology and the steps involved in depth. Finally, I found all the missing packages and built my Docker container:

[Figure: Docker image]

This was the first approach that worked. It was pretty slow, at 45 seconds for an average document, but at least it worked.

[Figure: PDF document, first approach]

Compared to the initial solution, Azure Functions scale with demand, and with pay-per-use pricing we can handle the unpredictable load. In the meantime, the following new requirements had become more pressing than before:

  • PDF generation with a corporate design template (Cover page, header, footer)
  • Inserting a TOC
  • Accessibility of the PDF document

I tried to cope with these additional requirements by embedding the already created PDF document as an inner document in an outer template document. In the end, the document looked terrible, was not accessible at all, and the rendering time added up to 90 seconds. I am keeping the description of this attempt short because it was the dead end of the whole solution. I went back to step 1 and questioned everything I knew so far about PDF generation.

 

Conclusion and final solution:

There is a saying: The night is darkest before the dawn!
Left with an unusable solution, I was frustrated and asked myself where along the road I had made a wrong turn. I realized that I was further away than ever from delivering an outstanding PDF solution to our customers.

Reflecting on what had led to this desperate situation pointed me to the solution. The whole effort had been undertaken because we didn’t want to handcraft a PDF representation of our survey elements. All the pain was based on the assumption that we needed to take the HTML as a base.

With that in mind, I looked in a different direction. We already had a working PDF generation solution for exporting surveys. The approach is based on a Word template that every customer can design themselves according to their corporate identity requirements. The document only needs to contain a number of fields with placeholder names. These fields are replaced with the real content during the rendering phase. Using this approach, cover pages, a TOC, headers and footers, and even a disclaimer on the last page can be created with ease. This approach puts our customers in control, and the resulting document fulfilled all the requirements that the answers PDF also had to meet.
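
The placeholder idea can be sketched in a few lines. This is only an illustration on plain text and not the Word-based implementation: the `{{FieldName}}` syntax, the field names and the `fill_template` helper are assumptions made for the example.

```python
import re

# Assumption: fields are written as {{FieldName}}; a real Word template
# uses its own field mechanism instead of this plain-text syntax.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def fill_template(template: str, values: dict[str, str]) -> str:
    """Replace every {{FieldName}} with its value; unknown fields stay untouched."""

    def substitute(match: re.Match) -> str:
        field_name = match.group(1)
        return values.get(field_name, match.group(0))

    return PLACEHOLDER.sub(substitute, template)


# Hypothetical template with a header line, the answers and a disclaimer field.
template = "{{CompanyName}} | Survey: {{SurveyTitle}}\n{{Answers}}\n{{Disclaimer}}"
filled = fill_template(
    template,
    {
        "CompanyName": "Example Corp",
        "SurveyTitle": "Customer Feedback",
        "Answers": "Q1: Amazon",
    },
)
print(filled)
```

Leaving unknown fields untouched is a design choice for the sketch: a missing value stays visible in the output instead of silently disappearing.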

Also, the mechanism for traversing the survey document and rendering all the nodes was already there. The only missing piece was to add the answers to the document and, in some areas, format it a bit differently. The resulting code was of course more than what would have been necessary with the HTML-based approach, but with around 600 lines I was able to create a PDF that looks nice and can be rendered, depending on the survey size, in about 5 seconds. This is 18 times faster than the previous 90-second approach, which did not even fulfill all requirements. Also, CPU and memory consumption decreased dramatically since no browser is involved any longer.
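
The traversal can be pictured as a simple recursive walk over the survey tree. The node structure below (`SurveyNode` with `title`, `answer` and `children`) and the sample survey are hypothetical and only stand in for the real survey document; the sketch emits plain text lines rather than Word content.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SurveyNode:
    """Hypothetical survey element: a section or a question with an optional answer."""

    title: str
    answer: Optional[str] = None
    children: list["SurveyNode"] = field(default_factory=list)


def render_node(node: SurveyNode, depth: int = 0) -> list[str]:
    """Depth-first traversal: emit the node itself, then all of its children."""
    indent = "  " * depth
    lines = [f"{indent}{node.title}"]
    if node.answer is not None:
        # The only addition compared to a plain survey export: the given answer.
        lines.append(f"{indent}  Answer: {node.answer}")
    for child in node.children:
        lines.extend(render_node(child, depth + 1))
    return lines


survey = SurveyNode(
    "Customer Feedback",
    children=[
        SurveyNode(
            "Shopping",
            children=[SurveyNode("Where do you shop online?", answer="Amazon")],
        ),
        SurveyNode("Comments", answer="No further comments"),
    ],
)
rendered = render_node(survey)
print("\n".join(rendered))
```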

To work with the Word template we use the Aspose.Words library. This library is very stable and fast, and it provides all the features required to create highly sophisticated documents, such as custom-defined styles and much more.

This is what the document looks like now:
[Figure: final PDF document]

This library can also save a document as PDF. In addition, Aspose can produce PDF documents with a very high degree of accessibility, provided that the customer-supplied template follows the PDF/UA requirements. For the given example, this is what a single-choice question looks like to a screen reader:

[Figure: PDF screen reader view]

(Source: PDF Accessibility Checker 3)

As you can see, the first option, Amazon, was checked. The example shown has a PDF/UA compliance rate of 96.5% (20316 of 21050 elements). All non-compliant elements are decorative elements, such as table borders, with no relevant content.

To be able to scale with any demand, we kept the Linux-based Azure Functions. Finally, after this long journey, we ended up with a reliable, scalable and simple PDF generation solution that renders in a reasonable time and fulfills all requirements. The conclusion is once more: Simplicity is the result of maturity.