Compressing and Optimizing PDFs in Pure Golang using UniPDF

Jan 30, 2020

With the release of UniPDF v3, the library gained support for optimizing PDFs, composite fonts (Unicode characters), digital signatures, and a powerful text and image extraction feature. Support for Unicode characters allows the library to process and create more complex PDF documents that contain Unicode text and symbols. Smaller additions in v3 include styled paragraphs, invoice generation, tables of contents and more, all of which you can read about in the v3 press release.

The ability to optimize (compress) PDF output was a fundamental update and also a difficult one. It involves a multi-step procedure, which consists of a few mostly independent optimization steps:

  1. Combining duplicate objects and streams (lossless)
  2. Combining indirect objects into compressed object streams (lossless)
  3. Reducing resolution of images (near lossless for specified display resolutions)
  4. Higher compression of images and objects (lossy)
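
To make the first step more concrete, here is a minimal sketch of how duplicate streams can be detected by hashing their contents. It uses only the Go standard library and is not UniPDF's actual implementation; the function name dedupe and the sample data are assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// dedupe (hypothetical helper) returns the distinct streams in first-seen
// order and, for every input stream, the index of its representative in
// that distinct list. Identical streams are detected by their SHA-256 hash.
func dedupe(streams [][]byte) (unique [][]byte, refs []int) {
	seen := make(map[[sha256.Size]byte]int)
	refs = make([]int, len(streams))
	for i, s := range streams {
		sum := sha256.Sum256(s)
		idx, ok := seen[sum]
		if !ok {
			// First time we see this content: keep it.
			idx = len(unique)
			seen[sum] = idx
			unique = append(unique, s)
		}
		// Every occurrence points at the single stored copy.
		refs[i] = idx
	}
	return unique, refs
}

func main() {
	// Sample data only: the same "logo" stream appears twice.
	streams := [][]byte{[]byte("logo"), []byte("font"), []byte("logo")}
	unique, refs := dedupe(streams)
	fmt.Println(len(unique), refs) // prints: 2 [0 1 0]
}
```

Because every reference to a duplicate is redirected to one stored copy, nothing is lost, which is why this kind of step is lossless.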

The compression feature allows you to select the optimization level that best suits your needs. You can opt for lossy optimization, which compresses the PDF considerably but can degrade text and images. That is why we generally recommend that you select lossless optimizations and customize the settings to best fit your application.

How to Optimize PDFs using UniPDF

Let's see how we can optimize a PDF by using the library. First of all, go to the unipdf-examples GitHub repository and download the compression example code.

The pdf_optimize.go file contains the optimization code that we will be using in this tutorial. Optimization of PDF output is implemented in the PDF writer of the UniPDF library and is configured through the following options (optimize.Options):


// Options describes PDF optimization parameters.
type Options struct {
        CombineDuplicateStreams         bool // lossless
        CombineDuplicateDirectObjects   bool // lossless
        ImageUpperPPI                   float64 // Choose PPI (pixels per inch) for intended application.
        ImageQuality                    int // lossy, typically 90
        UseObjectStreams                bool // lossless
        CombineIdenticalIndirectObjects bool // lossless
        CompressStreams                 bool // lossless
}

You can select the options according to the level of optimization needed. We allow you to select the quality of images, which ranges from 1 (lowest) to 100 (highest). You can be even more selective and set an upper limit on the pixels per inch (PPI) of images in the PDF. This gives you fine-grained control over the quality of your PDFs. Other options enable the compression of streams and objects and the combining of duplicate streams and objects.
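
To get a feel for what a PPI cap means in practice, the following sketch computes the pixel size an image would be reduced to for a given display size. It is plain arithmetic based on the PDF default of 72 points per inch, not UniPDF's actual resampling code; the function name targetPixels and the sample numbers are assumptions.

```go
package main

import (
	"fmt"
	"math"
)

// pointsPerInch is the default PDF user-space unit: 72 points per inch.
const pointsPerInch = 72.0

// targetPixels (hypothetical helper) returns the pixel dimension an image
// should have so that its effective resolution does not exceed upperPPI
// when it is displayed across displayPoints points on the page.
// Images already at or below the cap are left unchanged.
func targetPixels(pixels int, displayPoints, upperPPI float64) int {
	if pixels <= 0 || displayPoints <= 0 || upperPPI <= 0 {
		return pixels
	}
	inches := displayPoints / pointsPerInch
	effectivePPI := float64(pixels) / inches
	if effectivePPI <= upperPPI {
		return pixels
	}
	return int(math.Round(upperPPI * inches))
}

func main() {
	// 3000 px shown across 720 pt (10 in) is 300 PPI; a cap of 100 gives 1000 px.
	fmt.Println(targetPixels(3000, 720, 100))
	// 500 px across the same width is 50 PPI, already below the cap.
	fmt.Println(targetPixels(500, 720, 100))
}
```

A lower cap shrinks large images more aggressively, so choose it based on where the PDF will be viewed (screen or print).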

If you simply want to run the default example, just download the pdf_optimize.go file from the GitHub link and run it with the following command, passing the input and output PDF paths. Just make sure that Go is installed on your system.


go run pdf_optimize.go INPUT_PDF_PATH OUTPUT_PDF_PATH

Code Breakdown

Let's break the code down into chunks so that it is easier to follow. We'll use the example found in the repo and explain how you can customize it to your requirements.


package main

import (
    "fmt"
    "log"
    "os"
    "time"

    "github.com/unidoc/unipdf/v3/model"
    "github.com/unidoc/unipdf/v3/model/optimize"
)

const usage = "Usage: %s INPUT_PDF_PATH OUTPUT_PDF_PATH\n"

At the start, we import the relevant packages, including the UniPDF model package, which contains the PDF reader and writer. The model package lets you work with PDFs easily, provided you have an understanding of the PDF format and structure. You can read more about the model package in the v2 release. The usage constant describes the command-line arguments the program expects.

Reading and Preparing Writer

The main function starts by reading the information of the input file, which we use at the end to provide statistics of compression. The reader is then used to read the PDF file and get the number of pages in the input PDF.


func main() {
    args := os.Args
    if len(args) < 3 {
        fmt.Printf(usage, os.Args[0])
        return
    }
    inputPath := args[1]
    outputPath := args[2]

    // Initialize starting time.
    start := time.Now()

    // Get input file stat.
    inputFileInfo, err := os.Stat(inputPath)
    if err != nil {
        log.Fatalf("Fail: %v\n", err)
    }

    // Create reader.
    inputFile, err := os.Open(inputPath)
    if err != nil {
        log.Fatalf("Fail: %v\n", err)
    }
    defer inputFile.Close()

    reader, err := model.NewPdfReader(inputFile)
    if err != nil {
        log.Fatalf("Fail: %v\n", err)
    }

    // Get number of pages in the input file.
    pages, err := reader.GetNumPages()
    if err != nil {
        log.Fatalf("Fail: %v\n", err)
    }

After we've stored the number of pages in the pages variable, we create a new PDF writer, writer, and add every page of the input PDF to it. Once the loop completes, the writer holds all of the pages.


    // Add input file pages to the writer.
    writer := model.NewPdfWriter()
    for i := 1; i <= pages; i++ {
        page, err := reader.GetPage(i)
        if err != nil {
            log.Fatalf("Fail: %v\n", err)
        }

        if err = writer.AddPage(page); err != nil {
            log.Fatalf("Fail: %v\n", err)
        }
    }

    // Add reader AcroForm to the writer.
    if reader.AcroForm != nil {
        writer.SetForms(reader.AcroForm)
    }

If the input PDF has AcroForms, then the writer will transfer the AcroForms to the output PDF.

Set Optimizer

Now comes the code that does the magic of optimizing the PDF. It's as simple as calling SetOptimizer(optimize.New(optimize.Options{...})) on the writer.


    // Set optimizer.
    writer.SetOptimizer(optimize.New(optimize.Options{
        CombineDuplicateDirectObjects:   true,
        CombineIdenticalIndirectObjects: true,
        CombineDuplicateStreams:         true,
        CompressStreams:                 true,
        UseObjectStreams:                true,
        ImageQuality:                    80,
        ImageUpperPPI:                   100,
    }))

In the code, we've created an optimizer with our chosen parameters and attached it to the writer. The optimizer is applied when the writer produces the output, so the result follows the parameters set here. You can adjust them according to your requirements.

After the optimizer has been set, we simply use the os package to create the output file. The file is based on the output path provided in the command line.


    // Create output file.
    outputFile, err := os.Create(outputPath)
    if err != nil {
        log.Fatalf("Fail: %v\n", err)
    }
    defer outputFile.Close()

    // Write output file.
    err = writer.Write(outputFile)
    if err != nil {
        log.Fatalf("Fail: %v\n", err)
    }

If the file has been created successfully, the writer will write to the output file.

Optimization Statistics

The last few lines of the code highlight the result of optimization. The code displays the compression ratio, the time it took to complete the optimization and a few other details. It does so by getting the output PDF info and comparing it with the input PDF info, which was extracted at the start.


    // Get output file stat.
    outputFileInfo, err := os.Stat(outputPath)
    if err != nil {
        log.Fatalf("Fail: %v\n", err)
    }

    // Print basic optimization statistics.
    inputSize := inputFileInfo.Size()
    outputSize := outputFileInfo.Size()
    ratio := 100.0 - (float64(outputSize) / float64(inputSize) * 100.0)
    duration := float64(time.Since(start)) / float64(time.Millisecond)

    fmt.Printf("Original file: %s\n", inputPath)
    fmt.Printf("Original size: %d bytes\n", inputSize)
    fmt.Printf("Optimized file: %s\n", outputPath)
    fmt.Printf("Optimized size: %d bytes\n", outputSize)
    fmt.Printf("Compression ratio: %.2f%%\n", ratio)
    fmt.Printf("Processing time: %.2f ms\n", duration)
}

Compression Example

Now let's test the code on a real-life example. We downloaded the United Nations Secretary-General's report on the Climate Action Summit 2019 and passed it through the pdf_optimize.go code.

These were the results:


$ go run pdf_optimize.go un_climate.pdf un_climate_opt.pdf
Unlicensed copy of unidoc
To get rid of the watermark - Please get a license on https://unidoc.io
Original file: un_climate.pdf
Original size: 8582092 bytes
Optimized file: un_climate_opt.pdf
Optimized size: 1075480 bytes
Compression ratio: 87.47%
Processing time: 4479.00 ms

The example ran in about 4.5 seconds for a 38-page report that includes colorful graphics on every page. The UniPDF library compressed the report by 87.47%, from about 8.6 MB to approximately 1.1 MB. Note that this run uses the parameters set in the example; you can adjust the optimization parameters to see their effect on output quality and processing time.
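
If you want to double-check these figures, the arithmetic is easy to reproduce. The sketch below mirrors the ratio formula from the example and applies it to the byte counts printed above; the helper names savedPercent and megabytes are our own, and the guard against an empty input file is an addition that the example itself does not have.

```go
package main

import "fmt"

// savedPercent (hypothetical helper) returns how much smaller the output is
// than the input, as a percentage. It returns 0 for a non-positive input
// size to avoid dividing by zero.
func savedPercent(inputSize, outputSize int64) float64 {
	if inputSize <= 0 {
		return 0
	}
	return 100.0 - float64(outputSize)/float64(inputSize)*100.0
}

// megabytes converts a byte count to decimal megabytes (1 MB = 1,000,000 bytes).
func megabytes(n int64) float64 {
	return float64(n) / 1e6
}

func main() {
	// Byte counts taken from the run shown above.
	const in, out = 8582092, 1075480
	fmt.Printf("%.2f MB -> %.2f MB, %.2f%% smaller\n",
		megabytes(in), megabytes(out), savedPercent(in, out))
	// prints: 8.58 MB -> 1.08 MB, 87.47% smaller
}
```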

Optimization while Creating or Modifying PDFs

If you're using UniPDF to create or modify PDF documents, you can optimize the newly created or modified document by using the same SetOptimizer(...) function. The current code examples for creating documents with UniPDF do not include the optimization step, but it can be added quite easily.

In the example that creates a new document, the creator builds the document, and we can simply set the optimizer on the creator. This is clearest from the following code, which shows a portion of the example pdf_report.go.


// …
        strPage := fmt.Sprintf("Page %d of %d", args.PageNum, args.TotalPages)
        p = c.NewParagraph(strPage)
        p.SetFont(robotoFontRegular)
        p.SetFontSize(8)
        p.SetPos(300, 20)
        p.SetColor(creator.ColorRGBFrom8bit(63, 68, 76))
        block.Draw(p)
    })

    // Set optimizer.
    c.SetOptimizer(optimize.New(optimize.Options{
        CombineDuplicateDirectObjects:   true,
        CombineIdenticalIndirectObjects: true,
        CombineDuplicateStreams:         true,
        CompressStreams:                 true,
        UseObjectStreams:                true,
        ImageQuality:                    80,
        ImageUpperPPI:                   100,
    }))
    // Optimizer finished.

    err = c.WriteToFile(outputPath)
    if err != nil {
        return err
    }

    return nil

This is near the end of the example code, where we set the footer. We can simply use the creator c, which was created earlier, to set the optimizer for the file that is written in the next step. The parameters can be adjusted to get the desired level of compression. This feature might become the default way of operating in the future.

Use UniCLI to Try Without Writing Any Code

You can use UniCLI if you want to avoid writing code. UniCLI is a command-line tool built on the UniPDF library that lets you use the library's functions without having to write much code yourself.

To start using UniCLI, simply clone the relevant repo and build it with Go. Having Go installed is a requirement for using any of the UniPDF libraries. You can read more about how to install and use the CLI by visiting its repository page.

Optimization using UniCLI

To get started quickly, you can use UniCLI, which also allows you to optimize PDFs in a batch by selecting a directory as input. The CLI will then handle the rest and optimize all of the PDFs found in the directory. If you want the CLI to process files in subdirectories as well, simply pass the recursive flag -r in the command. The CLI is a handy tool intended mostly for prototyping.
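
If you would rather do batch processing in your own program, the directory scan is straightforward with the Go standard library. The following is only a sketch of how PDFs could be collected with or without recursion; it is not UniCLI's implementation, and the helper names findPDFs and isPDF are hypothetical.

```go
package main

import (
	"fmt"
	"io/fs"
	"log"
	"path/filepath"
	"strings"
)

// isPDF (hypothetical helper) reports whether a file name has a .pdf
// extension, ignoring case.
func isPDF(name string) bool {
	return strings.EqualFold(filepath.Ext(name), ".pdf")
}

// findPDFs (hypothetical helper) collects the paths of PDF files under root.
// When recursive is false, subdirectories of root are skipped.
func findPDFs(root string, recursive bool) ([]string, error) {
	var found []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			// Always enter root itself; enter other directories only
			// when recursion was requested.
			if path != root && !recursive {
				return fs.SkipDir
			}
			return nil
		}
		if isPDF(d.Name()) {
			found = append(found, path)
		}
		return nil
	})
	return found, err
}

func main() {
	// Scan the current directory without descending into subdirectories.
	files, err := findPDFs(".", false)
	if err != nil {
		log.Fatalf("Fail: %v\n", err)
	}
	for _, f := range files {
		fmt.Println(f)
	}
}
```

Each path returned could then be passed through the same reader, writer and optimizer steps shown earlier.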

You can run the optimization by simply running the following command in the CLI:


unipdf optimize file_1.pdf file_n.pdf

The command will optimize the files using the default parameters.

What's Next?

We're adding more optimization options in the near future and are particularly focused on scanned documents. We have already added support for CCITT encoding, which has improved our ability to implement lossless compression of image files. We are also currently implementing JBIG2 encoding, which will further improve the compression ratio of PDF documents without loss of quality and is particularly good for scanned files and image masks. Further optimization options will follow to take advantage of these encodings.

You can check out the example scripts on the UniPDF GitHub page. The examples will help you get started with using UniPDF. If you feel more examples are needed or you find a bug, open a new issue in the examples repository or contact us.
