PCLinuxOS Magazine

PDF Part One: Creating The Universal Document


by Paul Arnote (parnote)

Way, way back in ancient computer times ... in this case, 30 years ago (1992) ... Adobe created the PDF file. The letters of the file extension stand for "Portable Document Format." Up until that time, sharing documents was more difficult. There were various word processing programs around, and there was no guarantee that one word processor's files could be read by a different word processor. Plus, even then, not having the proper fonts installed on the "guest" system meant that the document might not display as it was intended.


Background

For all of its faults as a software company, Adobe did manage to create a truly portable document system that displayed properly on all systems. All you had to have was a reader capable of reading PDF files. Adobe also gave away the reader software, and for quite a while, it was just about the only reader available. The PDF format is based on the PostScript language (also developed by Adobe between 1982 and 1984), and each PDF file encapsulates a complete description of a fixed-layout flat document, including the text, fonts, vector graphics, raster images and other information needed to display it.



In 1993, Adobe made the specifications for the PDF file available free of charge. For the first 16 years, though, the PDF format remained a proprietary format, controlled by Adobe. And, I can remember Adobe vigorously and enthusiastically controlling the PDF format, to the point of discouraging other software vendors from producing PDFs or PDF readers. Then, in 2008, Adobe released the PDF format as an open standard, which the International Organization for Standardization (ISO) published as ISO 32000-1. That same year, Adobe published a Public Patent License to ISO 32000-1 granting royalty-free rights for all patents owned by Adobe that are necessary to make, use, sell, and distribute PDF-compliant implementations, according to Wikipedia.

Fast forward to today, and the PDF document format has become the de facto standard for documents. It can be read and used on just about every modern platform imaginable. It has been adopted in every corner of the computing world, from the distribution of official government forms to even this very magazine. Office suites, from Microsoft Office products to LibreOffice to Google Docs, offer document output as PDF files. There are simply too many PDF readers to list, with even the major web browsers (Firefox, Chrome, etc.) capable of displaying PDF files natively within the web browser.

PCLinuxOS users typically have several options for reading/viewing PDF files. KDE users have Okular, which is an excellent and very adept reader/viewer for not only PDF files, but other document formats as well. Mate and Gnome users have Evince, a popular Gtk-based PDF reader that does an admirable job of faithfully displaying PDF files. Qpdfview is another PDF reader/viewer (and one that I use quite regularly) that stands out for being very lightweight. There are many, many more PDF readers/viewers available, and each has its devoted fans.


Creating PDF Files

When the PDF format was first released, just about the only people who could create PDF files were those who paid Adobe insane and obscene amounts of money for the PDF creation software. However, since 2008, when the PDF format became an open standard, nearly anyone can create PDF documents.



LibreOffice Writer ... 4th icon from left.


Scribus ... 7th icon from left.

Under LibreOffice, you can find the PDF Export icon on the toolbar (and in the File menu) of ALL of the programs that make up the LibreOffice suite. I just showed the toolbar from LibreOffice Writer above as an example. Even the copies of Microsoft Word and Excel that I sometimes use at work (I plead for mercy every time) can export their documents as PDF files.

Scribus (which is used to create The PCLinuxOS Magazine PDF every month) is a multiplatform desktop publishing tool, and its primary output is files in the PDF format.



Google Docs ... download your document as a PDF file

Even with the online office tools, like Google Docs, you can download your documents in the PDF format.



CUPS-PDF print dialog

Creating PDF files is extremely easy once you install CUPS-PDF from the PCLinuxOS repository. CUPS-PDF installs as a print driver, so you can create a PDF file from virtually ANY program that you can print from. Just select the CUPS-PDF print driver, and select the "Print" button. (Pssst! Here's a tip ... create a directory named PDF in your home directory, and CUPS-PDF will save all of the PDF files it creates to that directory. Otherwise, they will just pile up in your home directory. This makes it much easier to find your PDF files later on.)
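The tip above can be done from a terminal in one step. This is only a minimal sketch: it assumes, as described above, that CUPS-PDF on your system saves into a directory named PDF in your home directory. The variable name pdf_dir is just for illustration.

```shell
#!/bin/sh
# Create the PDF directory in your home directory if it isn't there yet.
# Assumption: CUPS-PDF saves its output to ~/PDF, as the tip above describes.
pdf_dir="$HOME/PDF"
mkdir -p "$pdf_dir"

# Show what is currently in the directory (empty on a fresh setup).
echo "PDF files should land in: $pdf_dir"
ls -l "$pdf_dir"
```

The -p option makes mkdir harmless to run more than once: if the directory already exists, nothing happens.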


Other Ways To Create PDFs

Of course, it's a relatively trivial task these days to create a PDF file from the Linux command line. In some cases, it's easier and faster to do from the command line than from a program with a full-blown GUI.

This is where pandoc comes in. This command line program, while not installed by default on a PCLinuxOS installation, is easily installed from the PCLinuxOS repository.

Pandoc has a modest "help" text, which you can display by typing pandoc --help at a command line prompt.



Here's a description of pandoc from the pandoc man page:

Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library.

Pandoc can convert between numerous markup and word processing formats, including, but not limited to, various flavors of Markdown, HTML, LaTeX and Word docx. For the full lists of input and output formats, see the --from and --to options below. Pandoc can also produce PDF output: see creating a PDF, below.

Pandoc's enhanced version of Markdown includes syntax for tables, definition lists, metadata blocks, footnotes, citations, math, and much more. See below under Pandoc's Markdown.

Pandoc has a modular design: it consists of a set of readers, which parse text in a given format and produce a native representation of the document (an abstract syntax tree or AST), and a set of writers, which convert this native representation into a target format. Thus, adding an input or output format requires only adding a reader or writer. Users can also run custom pandoc filters to modify the intermediate AST.

Because pandoc's intermediate representation of a document is less expressive than many of the formats it converts between, one should not expect perfect conversions between every format and every other. Pandoc attempts to preserve the structural elements of a document, but not formatting details such as margin size. And some document elements, such as complex tables, may not fit into pandoc's simple document model. While conversions from pandoc's Markdown to all formats aspire to be perfect, conversions from formats more expressive than pandoc's Markdown can be expected to be lossy.

In its simplest implementation, pandoc can create a PDF from a simple text file on your computer (with the *.txt file extension). As part of the release notices sent out every month for the magazine's release, we create a simple text file with all of the information that makes up the magazine's release notice.

So, as an example, I grabbed the release notice for the July 2022 issue of The PCLinuxOS Magazine. I went to the directory where I had it stored, and opened a terminal window for that particular directory.

I then entered the pandoc command as follows:

pandoc -o July-2022-Release-Notice.pdf July-2022-Release-Notice.txt

This tells pandoc that the output file (the "-o" command line option) will have the same name as the input text file, but with the ".pdf" file extension. That is, we'll create the July-2022-Release-Notice.pdf file from the July-2022-Release-Notice.txt input file. For what it's worth, you can list multiple input files, separated by spaces, to create one PDF output file with the contents of all of the input files. Be aware that pandoc simply concatenates the input files in the order listed: the second file is appended to the end of the first, the third to the end of the second, and so on. By default, pandoc inserts a blank line between files before appending the next one.
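Here is a small sketch of the multiple input file case described above. The file names (part1.txt, part2.txt, combined.pdf) and their contents are made up purely for illustration. The "-f markdown" option simply states explicitly what pandoc would otherwise assume for a *.txt file, and "-t plain" is used first so you can see the concatenated result in the terminal without needing LaTeX.

```shell
#!/bin/sh
# Two small sample input files (hypothetical names and contents).
printf 'First file: release notice text.\n' > part1.txt
printf 'Second file: more release notice text.\n' > part2.txt

if command -v pandoc >/dev/null 2>&1; then
    # Preview the concatenated result as plain text in the terminal.
    pandoc -f markdown -t plain part1.txt part2.txt

    # Same inputs, in the same order, written to one PDF file.
    # This step needs LaTeX installed, so report a failure instead of hiding it.
    pandoc -f markdown -o combined.pdf part1.txt part2.txt \
        || echo "PDF creation failed (is LaTeX installed?)"
else
    echo "pandoc is not installed; install it from the repository first."
fi
```

The order of the input files on the command line is the order in which they appear in the output, so swapping part1.txt and part2.txt swaps their contents in the PDF.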



Screenshot of pandoc created PDF in qpdfview PDF reader.

Pandoc is a mighty little program. I've just illustrated its use in the simplest way possible. But pandoc is more. Much more. It is capable of outputting files in many different formats. Some of the more common output formats include Oo/LO ODT files, PDF files, HTML, Microsoft Word DOCX files, DocBook, DokuWiki, EPUB, FB2, JSON, LaTeX, MediaWiki, and man, to name a few. It can also convert between the various formats that it supports. In my testing, I was unable to trip up pandoc in its conversions, and it worked quickly and flawlessly. But for this article, we're going to restrict ourselves just to PDF files. If you want to know more about some of the other uses for pandoc, I recommend reading the pandoc man pages as a good starting point, by typing man pandoc at a command line prompt.
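If you are curious about those other output formats, the sketch below shows the same idea with a few of them. It assumes a reasonably recent pandoc (the --list-output-formats option is not present in very old versions), and sample.txt is a made-up file used only for illustration.

```shell
#!/bin/sh
# A tiny sample input file (hypothetical name and content).
printf 'A short sample document.\n' > sample.txt

if command -v pandoc >/dev/null 2>&1; then
    # pandoc works out the output format from the output file's extension.
    pandoc -f markdown -o sample.html sample.txt
    pandoc -f markdown -o sample.docx sample.txt
    pandoc -f markdown -o sample.odt  sample.txt

    # List every output format that this pandoc build supports.
    pandoc --list-output-formats
else
    echo "pandoc is not installed."
fi
```

None of these three conversions need LaTeX, so they are a quick way to confirm that pandoc itself is working before you try PDF output.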

Pandoc uses a special markup language, called Markdown, for instructions on how to create the output file. So, using Markdown, you can specify which parts of a document you want displayed in bold print, italic print, and so forth. To complicate matters somewhat, there are several different "formats" or "dialects" of Markdown. Fortunately, pandoc is also able to convert between many of these different Markdown formats. Here is a good discussion of Markdown to get you started, should your curiosity demand that you learn more about it. Without any Markdown code, pandoc will still create your new PDF document, but your output will be a bit plain-jane, so that's one thing to keep in mind. Even then, it looks a LOT better than the even more plain-jane looking text file.
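To give a feel for what Markdown looks like, here is a minimal sketch that writes a tiny Markdown file and hands it to pandoc. The file name styled-notice.md and its contents are invented for illustration; only basic Markdown (a heading, bold, italic, and a bullet list) is used.

```shell
#!/bin/sh
# Write a small Markdown file (hypothetical name and content).
cat > styled-notice.md <<'EOF'
# Release Notice

The **July 2022** issue is *now available*.

- First item
- Second item
EOF

if command -v pandoc >/dev/null 2>&1; then
    # Convert the Markdown file to PDF. This step needs LaTeX installed.
    pandoc -o styled-notice.pdf styled-notice.md \
        || echo "PDF creation failed (is LaTeX installed?)"
else
    echo "pandoc is not installed."
fi
```

In the resulting PDF, the line starting with # becomes a heading, the text between double asterisks is bold, the text between single asterisks is italic, and the lines starting with a hyphen become a bulleted list.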

Finally, pandoc uses LaTeX to create PDFs. Thus, they tend to be fairly decent quality PDF files. Because it uses LaTeX, you will need to have a version of LaTeX installed on your computer. If you go to the PCLinuxOS repository to install pandoc, but don't have the requisite LaTeX packages installed (I didn't when I installed pandoc), they will be installed with pandoc as dependencies.
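If a PDF conversion fails, a missing LaTeX program is a likely suspect. The sketch below just checks whether some common LaTeX engines are on your PATH. It assumes pdflatex is the engine pandoc uses by default, and that you have pandoc 2.0 or later if you want to try the --pdf-engine option shown in the closing comment.

```shell
#!/bin/sh
# Report which common LaTeX engines are available on the PATH.
found=0
for engine in pdflatex xelatex lualatex; do
    if command -v "$engine" >/dev/null 2>&1; then
        echo "found:   $engine"
        found=$((found + 1))
    else
        echo "missing: $engine"
    fi
done
echo "LaTeX engines found: $found"

# With pandoc 2.0 or later, a specific engine can be requested, e.g.:
#   pandoc --pdf-engine=xelatex -o output.pdf input.txt
```

If the count is zero, installing pandoc from the PCLinuxOS repository should pull in the LaTeX packages it needs, as noted above.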


Summary

We often take PDF files for granted. I think that's because they've become so prevalent in our computing world. But, where would we be without them?

The PDF file format has a LOT of good things going for it. The first, of course, is its near universal portability between just about every computing platform out there (yes, there is even a PDF reader for DOS).

Second, PDF files use a tokenized PostScript language to retain the proper formatting of the document, while also embedding the fonts into the document. This ensures that no matter which platform a PDF file is viewed on, it will appear the same on all of them.

Third, PDF files are a bit more secure than just sending along a word processor document file, which can be altered with the greatest of ease. Can PDF files be edited? Yep, they sure can. But, it's generally not a trivial task to do, so many people won't even bother.

In upcoming articles, we'll take a deeper look at how to manipulate and edit PDF files. Some of these tasks utilize command line tools, while others use GUI programs. Along the way, we'll learn some new tips, tricks and tools for our Linux toolbox.


