Some years ago, around 2010, I decided to start archiving my documents (invoices, letters, certificates, manuals, and much more) in electronic form. At first this was mostly for convenience, but by now the electronic archive also serves as the main backup of the paper versions.
In this post I describe my approach and the thoughts behind some of the decisions.
Before going into the details, here are the goals or requirements related to archiving my documents.
Use open-source tools and open standards for reading, writing, handling, managing and processing the documents. This applies both to each single document file and to the document archive itself.
I think this helps to make sure that the archive stays readable for a long time and that I am not locked in to proprietary tools or file formats that might no longer be supported in the future.
Documents should be stored as simple files.
If the data structure of the archiving software breaks down, the individual files shall still be accessible as plain files.
I want to prevent modification and deletion of documents.
It's the purpose of an archive to preserve the state of a document. It should mostly be write-once and then read-only.
It should be easy to back up the whole archive by copying / synchronizing it to another computer.
I keep a backup of the archive on at least one other computer.
Regularly copying the whole archive takes too much time, so there must be a way to easily synchronize changes to the backup system.
I would like to be able to trace any change or modification to archive files, i.e. get the log of modifications.
Almost all documents are stored as PDF files. If I receive PDF files, e.g. from a company by email, I directly archive the file as is. If it is a paper document, I scan it and then convert it to PDF, trying to comply with PDF/A, with one embedded image per page, usually in JPEG format.
Currently I do not use OCR yet to make the PDF contents searchable. This is something on my todo list, but not high priority yet.
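As a sketch of the conversion step, a small wrapper around the img2pdf tool could look like the following. The tool choice and the helper name are my assumptions for illustration; img2pdf embeds the scanned JPEGs into the PDF without re-encoding them, one image per page.

```shell
# Hypothetical helper: wrap scanned JPEG pages into a single PDF,
# one embedded image per page, without re-encoding the JPEG data.
scan_to_pdf() {
    # usage: scan_to_pdf output.pdf page1.jpg [page2.jpg ...]
    out=$1
    shift
    img2pdf --output "$out" "$@"
}
```

Keeping the image data unchanged fits the archive goal: the PDF is only a container around the original scan.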
I try to restrict filenames to ASCII characters, as far as possible the subset a-z A-Z 0-9 - _, so that I do not run into any trouble with character set encoding issues or tools failing to handle special characters.
Almost all filenames start with a date in ISO format, i.e. yyyy-mm-dd, so they can easily be sorted chronologically.
The next part of the filename is usually the creator, e.g. the company that created or sent the document. After that follows a short description of the content.
Here's a simple example:
2019-07-03 Amazon AWS Invoice.pdf
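This convention can be checked mechanically. Here is a small sketch; the function name is hypothetical, and the character class extends the safe subset by space and dot, which the example above also uses.

```shell
# Does a filename follow the convention: ISO date prefix, then
# creator and description from a safe ASCII subset, then extension?
# (Space and dot are allowed here in addition to a-z A-Z 0-9 - _.)
is_archive_name() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2} [A-Za-z0-9 _.-]+\.[a-z]+$'
}

is_archive_name "2019-07-03 Amazon AWS Invoice.pdf" && echo "ok"
```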
The documents are sorted into a directory tree. The top-level directories are the main categories of documents like education, finances, insurance, invoices, identdata (scans of passports, ID cards, driver's license, etc.), work, and so on.
Below each of these top-level folders I have further, more or less complex directory structures. But over time I realized that the more complex the structure gets, the more difficult it is to find the right place to store a document.
Anyway, I usually search for documents using grep on the command line. So now I try to keep the directory structure as simple as possible.
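The search itself is then just a pipe. Here is a self-contained sketch, with a throwaway directory standing in for the real archive:

```shell
# A stand-in archive tree for demonstration purposes.
demo=$(mktemp -d)
mkdir -p "$demo/invoices"
touch "$demo/invoices/2019-07-03 Amazon AWS Invoice.pdf"

# List all files and filter the paths case-insensitively with grep.
find "$demo" -type f | grep -i 'aws.*invoice'
```

Because the filenames carry date, creator and description, filtering the paths alone is usually enough to find a document.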
The scanners are not specifically related to archiving documents. It's just the hardware I currently have and use.
The hardware for scanning is quite basic. I use an old Canon Lide 110 flatbed scanner and a Brother ADW-1500 document scanner. Both are supported well under Linux (currently Xubuntu 19.10).
The Brother scanner has integrated Wi-Fi and a small touchscreen display for configuration. But it seems the humid climate in Shanghai damaged the touchscreen, so now I can only use it by connecting it via USB.
Software for Scanning
For a long time, I experimented with using the
sane command-line tools for
scanning documents. From time to time I re-created or reworked some shell
scripts to improve the scanning process, the image processing with
ImageMagick, and the conversion to PDF. But I never reached a state where I had
a really well-working script. At some point I was just annoyed that for almost
every document to scan I first had to do some shell scripting and several
retries until the document was successfully scanned.
So I had a look at other available tools and found a really simple one:
simple-scan is a very basic GUI-based scanning application. It scans pages,
adjusts brightness and contrast, uses either color or grayscale mode, and saves
the result as a PDF or a set of JPEG files.
Over the years, the system to archive documents has changed a bit. The first
approach, used for many years, was based on encfs. Some time ago I changed to
git.
encfs is a FUSE (filesystem in userspace) filesystem driver, which encrypts a
directory. This means your data is stored in a directory structure with
encrypted files. Even file and directory names are encrypted. When mounting the
filesystem, you enter a password and the decrypted data becomes accessible in
another directory. Any modification of this data is immediately encrypted and
reflected in the encrypted directory tree.
During normal operation of my computer, the data is not mounted, so I cannot
access any files in the archive. I wrote two simple scripts called encm and
encu to mount and unmount the archive data.
So during normal operation, the archive data is not easily accessible. When I
need access, I mount the filesystem with encm, access or modify the files, and
later run encu to unmount it again.
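The two helpers are essentially one-liners. A sketch, shown here as shell functions, assuming the encrypted tree lives in ~/.archive.enc and the decrypted view in ~/archive (both paths are assumptions):

```shell
# encm: mount the archive, i.e. decrypt ~/.archive.enc into ~/archive.
encm() {
    encfs "$HOME/.archive.enc" "$HOME/archive"
}

# encu: unmount the decrypted view again.
encu() {
    fusermount -u "$HOME/archive"
}
```

encfs prompts for the password on mounting; fusermount -u detaches the FUSE filesystem so the plain files disappear again.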
I backed up the encrypted files to another computer by running rsync.
This setup worked fine for me for many years. Still roughly one year ago I stopped using it, because it has a few disadvantages.
It seems that encfs is no longer maintained and has a few known security issues.
Backup is difficult. I do not want to use the --delete option of rsync, which means whenever I rename or remove a file, the old file stays in the backup. So over time the backup became out of sync with the main data and I could not easily resolve this anymore.
I'm not protected against modification of files and I cannot trace changes at all.
After I extracted all files from the encrypted archive, I kept them as a normal
directory structure on my computer for some time and started searching for
another solution. And along comes git.
I had already used
git for many years for software development, but never
considered it for handling my document archive. But it seems a quite good fit,
as almost all my requirements seem to be fulfilled.
I have a working copy of the archive on my computer, so all the files are accessible as plain files. If something goes wrong with the git data structures, at least the files are all still accessible.
When I commit files, the committed version cannot be changed anymore and I can
even go back to any older version of every file. I can also trace all
modifications of the archive in the git log.
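As a self-contained sketch of what that tracing looks like, with a throwaway repository standing in for the archive:

```shell
# A throwaway repository standing in for the real archive.
repo=$(mktemp -d)
cd "$repo"
git init -q
mkdir invoices
touch "invoices/2019-07-03 Amazon AWS Invoice.pdf"
git add .
git -c user.name=demo -c user.email=demo@example.org commit -qm "add AWS invoice"

# Trace every change to a single document, following renames.
git log --follow --oneline -- "invoices/2019-07-03 Amazon AWS Invoice.pdf"
```

With --follow, even renames within the directory tree stay traceable for a single document.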
Creating backups is fun and easy, as I now just have to push the repository to another computer. All changes are automatically synchronized and I no longer have any hassle with duplicate files.
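The backup then boils down to a push to a bare repository on the other machine. A sketch, with a local path standing in for the remote computer (in reality the remote would be something like user@backuphost:archive.git, which is a hypothetical address):

```shell
# A bare repository; in reality this would live on the backup machine.
backup=$(mktemp -d)
git init -q --bare "$backup/archive.git"

# The working copy of the archive.
work=$(mktemp -d)
cd "$work"
git init -q
echo "dummy document" > doc.txt
git add .
git -c user.name=demo -c user.email=demo@example.org commit -qm "initial archive state"

# Register the backup as a remote and push the current branch.
git remote add backup "$backup/archive.git"
git push -q backup HEAD
```

A bare repository on the backup side avoids a checked-out working tree there; the push transfers only the changes since the last backup.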
As I run a [gitea] server, I even have a nice web interface for browsing the archive.
So far, I'm really happy with this solution.
Currently the archive has ca. 2500 files. Its size (including the .git
folder) is 4.1 GB.