An Open Source ASP.NET Solution of Html to PDF conversion

One solution for HTML to PDF conversion in .NET is iTextSharp, and thanks to this article, which provides an HTML to PDF builder class, using iTextSharp is a bit easier. But I found that iTextSharp (v5040) has very basic CSS support. If your CSS structure is not complex, or your HTML document is styled mainly through inline markup, I do not see any problem with this solution. But if you want to print heavily CSS-styled HTML content, which unfortunately is the most likely scenario, this solution is unsatisfactory.
That does not mean the end of the pursuit of an open source solution for this task. Here I am happy to share with you a high quality open source HTML to PDF converter: wkhtmltopdf.

My solution was inspired by this tutorial for Linux.

‘By leveraging the power of the webkit engine through QtWebKit module, this thing is converting HTML with full CSS support to PDF the same way you “Save as PDF” from your browser’

This is why wkhtmltopdf is superior to iTextSharp: a converter must speak both languages, and what can understand HTML better than a browser engine? This is also where iTextSharp is found wanting.

Do not get too excited yet. For Windows we can only find an exe version of wkhtmltopdf, and the exe does file-based conversion, so there are quite a few technical issues to sort out before we can really see the benefits of this tool.

I will not present the whole solution here, but will show you how to deal with the few most challenging issues of using wkhtmltopdf in asp.net.

The first and foremost issue with wkhtmltopdf is figuring out how to run an exe from an ASP.NET project. If you already know how, you can skip the following code.

Here is a code example that executes wkhtmltopdf.exe. It assumes that the HTML has already been saved in a local file; after the conversion, the PDF file will have been created and be ready to be sent back.

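// requires: using System; using System.Diagnostics; using System.IO; using System.Text;
private bool ConvertHtmlToPDF()
{
    bool result = false;
    string message = null;
    // input html file and output pdf file, both under _TempDir
    string htmlPath = Path.Combine(_TempDir, _TempFileName + ".html");
    string pdfPath = Path.Combine(_TempDir, _TempFileName + ".pdf");
    // quote both paths so that directories containing spaces still work
    string argument = "\"" + htmlPath + "\" \"" + pdfPath + "\"";

    try
    {
        // to call the exe to convert
        using (Process p = new Process())
        {
            p.StartInfo.UseShellExecute = false;   // needed to redirect the output
            p.StartInfo.CreateNoWindow = true;     // no command shell window
            p.StartInfo.FileName = Server.MapPath("bin/wkhtmltopdf.exe");
            p.StartInfo.Arguments = argument;
            p.StartInfo.RedirectStandardOutput = true;
            p.StartInfo.RedirectStandardError = true;

            // collect the error stream asynchronously
            StringBuilder error = new StringBuilder();
            p.ErrorDataReceived += (sender, e) =>
            {
                if (e.Data != null) error.AppendLine(e.Data);
            };

            p.Start();
            // read the redirected streams BEFORE waiting for exit;
            // waiting first can deadlock once a pipe buffer fills up
            p.BeginErrorReadLine();
            string output = p.StandardOutput.ReadToEnd();
            p.WaitForExit();

            message = error.Length > 0 ? error.ToString() : output;

            // sometimes there is an error message even though the
            // conversion succeeded, so judge by the output file instead
            result = File.Exists(pdfPath);
        }
    }
    catch (Exception ee)
    {
        // log ee here
    }
    return result;
}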

A few points:

1. It is not complete code, but part of a helper class that does the conversion.

2. The key to the above code is using System.Diagnostics.Process to execute an exe from within your program. It demonstrates how to pass in the executable file and the arguments, how to suppress the black command shell window, and how to capture the output of the execution. It goes without saying that the worker process must have execute permission on the exe; if it is under the bin directory, there is no need to worry about this.

3. A PDF file will be created under _TempDir, so the worker process must have write permission on this directory. It does not have to be under the published directory; it could be anywhere on the server. Actually this is recommended, as it is more difficult for someone to access this folder if it is not published.

4. I am sure you can easily figure out in C# how to open this new PDF file and send it to the response as a download. After the file has been downloaded, delete the PDF and HTML files.

5. It assumes an HTML file to be converted is present under _TempDir. If you only need to convert an HTML string, you still need to create an HTML file with that string as the main HTML content. You can add as many CSS files or CSS entries as you like in the head part of this HTML file. Full CSS support comes from the HTML parsing engine: WebKit.

6. For an ASP.NET application, the HTML to be converted could come from the browser or from the server side. It is much easier if the HTML is available on the server side; in that case you just need to pass it to your conversion function. But most of the time the HTML content is in the browser, so you need to pass that HTML to the server. This is very similar to implementing an asynchronous file download. What is special in this case is that you have to post a big HTML string back to the download page, and ASP.NET does not accept this because it breaches its security rules (presumably request validation, which rejects posted markup). So what I have done is URL-encode the HTML content using the JavaScript function encodeURIComponent, then decode it on the server side with Server.UrlDecode before passing it to the conversion function.
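
Points 4 to 6 can be sketched roughly as follows. This is only a minimal illustration, not the helper class from this post: the class name TempHtmlFile and its methods are hypothetical, and Uri.UnescapeDataString stands in for Server.UrlDecode so that the sketch also runs outside ASP.NET.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the original solution.
public static class TempHtmlFile
{
    // Decode a body that the browser encoded with encodeURIComponent and
    // wrap it in a full HTML document with the CSS entries in the head.
    public static string BuildDocument(string encodedBody, string cssEntries)
    {
        // Assumption: Uri.UnescapeDataString is used in place of Server.UrlDecode.
        string body = Uri.UnescapeDataString(encodedBody);
        var html = new StringBuilder();
        html.AppendLine("<html><head>");
        html.AppendLine("<meta charset=\"utf-8\" />");
        html.AppendLine("<style type=\"text/css\">" + cssEntries + "</style>");
        html.AppendLine("</head><body>");
        html.AppendLine(body);
        html.AppendLine("</body></html>");
        return html.ToString();
    }

    // Write the document where the converter expects it and return the path.
    public static string Write(string tempDir, string tempFileName, string html)
    {
        string path = Path.Combine(tempDir, tempFileName + ".html");
        File.WriteAllText(path, html, Encoding.UTF8);
        return path;
    }

    // Read the generated PDF into memory, then delete both temporary files.
    public static byte[] ReadPdfAndCleanUp(string tempDir, string tempFileName)
    {
        string htmlPath = Path.Combine(tempDir, tempFileName + ".html");
        string pdfPath = Path.Combine(tempDir, tempFileName + ".pdf");
        byte[] pdf = File.ReadAllBytes(pdfPath);
        File.Delete(pdfPath);
        File.Delete(htmlPath);
        return pdf;
    }
}
```

In a page, the returned bytes could then be sent to the browser by setting Response.ContentType to "application/pdf", adding a Content-Disposition header and calling Response.BinaryWrite. Reading the PDF into memory first means both temporary files can be deleted before the download starts.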

Added on 30/09/2010:

1. wkhtmltopdf.exe always runs under the IIS worker process identity, whether impersonation is on or off, so make sure that the IIS worker process account (Network Service by default) has write permission on that folder.

2. In the HTML file that is about to be converted to PDF, make sure the CSS files referenced in the head are accessible; otherwise the conversion will fail because access to the CSS files is denied. The authentication of your host application is not passed on to wkhtmltopdf, which acts like a browser viewing your HTML. This matters especially when the application uses Windows authentication or is protected by a login.
To avoid that, I changed my code a bit: instead of putting the CSS URL in the generated HTML file, I put the CSS entries themselves in it.
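
Feeding CSS entries rather than a CSS URL could look something like the sketch below. It is an assumption about the approach, not the code used on this site: the class name CssInliner is hypothetical, and the CSS files are read straight from disk on the server so that no authenticated HTTP request is needed.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the original solution.
public static class CssInliner
{
    // Build one <style> block from the contents of the given CSS files,
    // to be placed in the head of the generated HTML file.
    public static string ToStyleBlock(IEnumerable<string> cssFilePaths)
    {
        var style = new StringBuilder();
        style.AppendLine("<style type=\"text/css\">");
        foreach (string path in cssFilePaths)
        {
            // Read from the file system, not through the web server.
            style.AppendLine(File.ReadAllText(path));
        }
        style.AppendLine("</style>");
        return style.ToString();
    }
}
```

The physical paths would typically come from Server.MapPath applied to the same CSS URLs the page normally links to.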

Added on 21/09/2011:

1. Relative paths of resources in HTML not picked up by the converter
The good news is that relative resources like CSS files and images CAN be used as long as they are relative to the HTML file, NOT relative to wkhtmltopdf.exe.
2. For resources that need authentication, you can pass the ASP.NET_SessionId cookie to the converter to resume the session of the main thread.
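
As a rough sketch of passing the session cookie, the argument string might be built like this. The class name ConverterArguments is hypothetical, and the --cookie <name> <value> option is taken from the wkhtmltopdf command-line help, so check that your version supports it. The help text says the value should be URL encoded, hence Uri.EscapeDataString.

```csharp
using System;

// Hypothetical helper, not part of the original solution.
public static class ConverterArguments
{
    // Build the wkhtmltopdf argument string with the session cookie in front
    // of the input and output paths.
    public static string Build(string sessionId, string input, string output)
    {
        return "--cookie ASP.NET_SessionId " + Uri.EscapeDataString(sessionId)
             + " \"" + input + "\" \"" + output + "\"";
    }
}
```

In a page, the session id could be read from Session.SessionID, and the result assigned to p.StartInfo.Arguments.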

Added on 26/01/2012:

Images used in css
Because image URLs in CSS are relative to where the CSS file is, and the CSS file is most likely somewhere different from the content HTML file, images defined in CSS do not show up in the PDF. You can change the image URLs in the CSS to absolute ones, or, as I did, keep another copy of all the referenced image files in folders laid out relative to the HTML content file the same way they are relative to the CSS file.
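
The first option, making the image URLs absolute, could be automated along these lines. This is only a sketch under assumptions: the class name CssUrlRewriter is hypothetical, the regular expression handles plain url(...) references but not every CSS edge case, and cssLocation is the absolute location the CSS file was originally served from.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original solution.
public static class CssUrlRewriter
{
    // Matches url(path), url('path') and url("path").
    private static readonly Regex UrlPattern = new Regex(
        @"url\(\s*(['""]?)(?<path>[^'"")]+)\1\s*\)",
        RegexOptions.IgnoreCase);

    // Resolve every url(...) against the location of the CSS file, so the
    // references still work once the CSS is embedded in the HTML file.
    public static string MakeAbsolute(string css, Uri cssLocation)
    {
        return UrlPattern.Replace(css, match =>
        {
            var absolute = new Uri(cssLocation, match.Groups["path"].Value.Trim());
            return "url(\"" + absolute.AbsoluteUri + "\")";
        });
    }
}
```

For example, with the CSS file at http://example.com/css/site.css (a made-up address), a reference to ../img/bg.png becomes http://example.com/img/bg.png.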

This entry was posted on Wednesday, September 22nd, 2010 at 1:51 am and is filed under ASP.NET.

7 Responses to “An Open Source ASP.NET Solution of Html to PDF conversion”

  1. LoganWolfer says:

    Have you ever tried to use wkhtmltopdf on an ASP intranet website, using Windows Authentication (read here : NTLM) on your website, for both database connection and web page access ?

    Haven’t made it work yet; I tried with a hard-coded user/pass (present on the domain, on the database, and the server machine) with no luck.

    Using wkhtmltopdf directly from a local console on the server, with the same directories as the server uses, does exactly the same thing: it writes a PDF file with a 403 forbidden access “Yellow Page of Death” in it.

    BTW, the hard-coded user currently has full access to the website, and is able to query the database both from SQL Management Studio and from the ASP.NET webpage.

    Any ninja tip on this ??

    • CleanCodeNZ says:

      I have actually, there are three main issues, and I managed to solve one.
      1. I have never figured out how to run wkhtmltopdf with user login.
      2. One issue for a Windows authentication site is that the CSS files in your HTML file are not accessible to wkhtmltopdf. I managed to solve this one by embedding the CSS entries in the HTML file, rather than using a file reference.
      3. Another issue is image files in css, this proves to be insurmountable for me.

      I did manage to make the conversion work on an intranet by making those CSS and image files accessible through public URLs. Say you have http://www.testsite.com; you then publish those CSS and image files to that site, and in this way bypass Windows authentication.

  2. LoganWolfer says:

    I found another workaround. I did this to make wkhtmltopdf work:

    1. Create a folder on the server where I_USR has read-write access.

    2. Drop in CSS content and images to this location, in a way to reproduce your relative links to content in your html page.

    3. Write an html file to this folder (in my case, using TextWriter class and calling a StreamReader from the url of my webpage to print to PDF)

    4. Open a process on the server to run wkhtmltopdf against this location

    5. Delete your html file (optional), and move the PDF file to wherever you want, as long as I_USR has read-write access to that folder.

    Have fun !

  3. bishnu biswal says:

    StringBuilder sb = new StringBuilder();
    sb.Append(@"" + total +
    "");
    Response.AppendHeader("Content-Type", "application/html");
    Response.AppendHeader("Content-disposition", "inline; filename=report.html");
    Response.Write(sb);

    StringBuilder sb = new StringBuilder();

  4. seo says:

    seo…

    […]An Open Source ASP.NET Solution of Html to PDF conversion | CleanCode NZ[…]…

  5. Amit says:

    Is it possible to carry the controls over from HTML to PDF as well? Let's say I design a form in an html/aspx page and I need the same form as a PDF form, for example tax return documents or W8/W9 docs. Can we carry over all the controls, like input boxes, check boxes, rich text boxes, etc.?

    • CleanCodeNZ says:

      I am not too sure. You can obviously try feeding HTML with controls on it to the converter to see what comes out the other end.
