
Document integrity in the digital environment: from depositing the media to the hash function: Exploring the Digital Frontier III

In the previous post in this series, I tried to explain how digitization is a unique way of converting information to a binary digital code (a “linguistic” question of how data is coded) so that management of that information can be mechanized electronically (a technological question).

The use of language to convey information

Going one step further, I would now like to introduce the idea that, as jurists, we have a very special relationship with information or, what amounts to the same thing, with language. For us, language is not just a way of communicating or transmitting information about the world or of expressing and sharing our feelings or emotions. Rather, with language, we do things: we enter into contracts and commitments, we waive rights, we organize what is to be done with our assets when we die, we get married, we enact laws, issue judgments and impose penalties, etc. All these specific actions form part of what philosophers of language call performative language: actions that are performed only by uttering or writing certain words and that have an effect in that particular sphere of reality which is the sphere of legal validity or effectiveness.

Precisely because of this, for us, “documentation” – establishing the certainty of who said exactly what and when, so that, if we forgot it, it can be recalled and evidenced at any time – is not a trivial matter. Indeed, it lies at the very core of our field of activity and concerns.

Also because of this, how we document things in the digital, paperless environment toward which we are inexorably moving has become a matter of great legal importance.

A documentation system must meet three basic requirements: integrity, authorship and time stamping (as I said above: what was said, by whom and when).

In the world of paper, integrity is guaranteed through the inextricable physical link between the ink used to draw certain alphanumeric characters and the fibers with which a specific sheet of paper is made. Because of this, a document, in the classical sense of the word, is always an object belonging to the material world. It is identified as that individual specimen that comprises certain specific written pages. It is something we can destroy, but not something that can easily be adulterated, at least not in a way that is hard to detect (inserting, correcting or eliminating a word or figure to change the original written meaning).

The authorship of documents and its falsification

As far as authorship is concerned, that is, the possibility of attributing responsibility for what has been said to a specific person or persons, the quintessential tool in the paper universe has always been the handwritten signature. It is a sign that is recognizably or verifiably linked to a certain person, who must physically intervene in creating it. Its inclusion at the end of a written text is given the legal meaning of voluntary assent to, and assumption of ownership of, the statements contained in the text (something which can itself be independent of both the intellectual and material authorship of the document).

As for the time stamp, the document itself and its signature can reveal how old it is. However, recording of the precise moment that a legally relevant document was created and signed has usually been entrusted to a reputable official or public authority (a notary or a public registrar).

How do we satisfy these same needs for certainty of a document—which are vital from a legal standpoint—once we abandon paper and replace it with something as evanescent as an electronic computer file? How do we identify the exact contents of a specific file? How do we attribute authorship to a specific person? How do we evidence its date beyond all shadow of a doubt?

How to know whether a document is authentic

In this post, I will look at some answers to the first of these questions, the problem of integrity.

I must start by saying that there is no one answer or solution to the question, but rather many possible ones, some of which are more rudimentary than others.

The first solution (the least sophisticated) is to identify the file (that is, the fragment of digitized information) by its storage medium. The file whose content we wish to be certain about is copied onto a specific physical storage medium, and that medium is placed in the hands of a trustworthy agent: par excellence, a notary. For several years now, notaries have been taking deposits of floppy disks, CD-ROMs, DVDs, pen drives and even hard drives and entire computers, so as to be able to subsequently attest to the content of a specific file or files recorded on those media.

The procedure is very rudimentary because, ultimately, it is based on safeguarding a particular storage medium, placing it with a trusted third party, and maintaining a chain of custody that is broken the very moment the medium is returned by the notary with whom it was deposited or is handed off to any other person. The only exception is if that same notary, before doing so, accesses the content of the medium and prints and witnesses a paper copy of it (which means, in short, having to return once again to the world of paper and to the authentication tools that inhabit it).

A second solution entails establishing a reference to information recorded on a trustworthy website. This system is used to verify the authenticity of certificates, permits, licenses and other types of electronic documents issued by certain public authorities and agencies, such as certifications of reserved company names issued electronically by the Spanish Central Commercial Registry. The text of the file includes a secure verification code, which is an alphanumeric code identifying the document and making it searchable in the online repository of the issuing authority or entity.

This concept is referred to in article 18.1.b of Law 11/2007 on Electronic Access by Citizens to Public Services, as follows: “Secure verification code linked to the public administration, body or entity and, as the case may be, to the person signing the document, allowing in all cases the integrity of the document to be verified through access to the corresponding online site”. Article 30.5 of the same law states that “Printed copies of public administrative documents issued through electronic means and signed electronically will be considered authentic copies provided they bear an electronically-generated printed code or the mark of other verification systems through which authenticity can be verified by accessing electronic files of the issuing public administrations, body or entity”.

In a system such as this, the integrity of the document is verified by contrasting the electronic (or printed) copy presented as authentic against the file accessible on the corresponding website, therefore relying on the reliability of the website and its online access system. If the copy stored on that official website and used for the verification disappears, there is no way of knowing whether or not any other purported electronic or paper copy of the same file is authentic and complete.

Unlike the hash values we will look at next, these secure verification codes are randomly generated alphanumeric strings that bear no relation to the content of the document itself. Rather, the code is a way of safeguarding the confidentiality of the document repository that must necessarily be made accessible to the public on the corresponding website. To retrieve a particular document from the website, one has to enter a specific piece of “metadata”, namely the secure verification code, which only a person who is already looking at a purported copy of the corresponding file would know (because it is transcribed on it).
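The distinction can be made concrete with a short Python sketch (the certificate text is invented for illustration): a verification code is drawn at random and carries no information about the document, while a hash is computed from the document's content itself.

```python
import hashlib
import secrets

# Hypothetical document content, for illustration only.
document = b"Certification of reserved company name: ACME S.L."

# A secure verification code is random: it is unrelated to the document's
# content and merely serves as a hard-to-guess lookup key in a repository.
verification_code = secrets.token_hex(8)

# A hash, by contrast, is computed *from* the content itself.
content_hash = hashlib.sha256(document).hexdigest()

print(verification_code)  # different on every run: independent of the document
print(content_hash)       # identical on every run: determined by the document
```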

The third solution devised to ensure the integrity of computerized files is much more technical: the hash function.

The term “hash” offers a natural analogy with its non-technical meaning (to “chop” or “scramble” something). A hash function is a mathematical algorithm that, applied to a file or any other digital item, yields a fixed-size string of alphanumeric characters. In reality, this string is a number, usually expressed in hexadecimal rather than decimal notation (that is, using 16 digits: the numbers 0 through 9 and the first six letters of the Latin alphabet, a through f). An example of a hash value (produced by the SHA-1 algorithm, which yields 160 bits, i.e., forty hexadecimal characters) is the following: 8b9248a4e0b64bbccf82e7723a3734279bf9bbc4.
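Computing a hash value requires no special infrastructure; for instance, Python's standard hashlib library can do it in one line (the sample text here is invented for illustration):

```python
import hashlib

# Hash a short piece of text with SHA-1.
data = "This is the exact content of my document.".encode("utf-8")
digest = hashlib.sha1(data).hexdigest()

print(digest)       # forty hexadecimal characters (160 bits)
print(len(digest))  # 40
```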

The hash function has the amazing property that whenever it is applied to the same file, the resulting hash value will always be the same, yet changing a single bit of the file yields a completely different hash value. Moreover, the likelihood that two different files would yield the same hash (called a collision) is very remote. Bear in mind that while there are infinitely many possible inputs to the algorithm (any string of bits, however long), there is only a finite number of possible outputs: since the hash is fixed in length, there is an enormous but finite number of possible combinations of 0s and 1s (2^160 of them for SHA-1). Accordingly, by definition, two different inputs could potentially yield the same hash.
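Both properties, determinism and the drastic change produced by a tiny alteration, can be observed directly in a short Python sketch (the contract text is invented for illustration):

```python
import hashlib

original = b"I hereby sell my house for 100,000 euros."
altered  = b"I hereby sell my house for 900,000 euros."  # one character changed

h_original = hashlib.sha256(original).hexdigest()
h_altered  = hashlib.sha256(altered).hexdigest()

# Determinism: the same input always yields the same hash.
assert h_original == hashlib.sha256(original).hexdigest()

# A one-character change produces a completely different hash.
print(h_original)
print(h_altered)
assert h_original != h_altered
```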

However, another extremely important property of the hash algorithm is that it only goes in one direction; that is, it cannot be reversed. In other words, you cannot use a hash value to reconstruct the original file.

Consequences of this property

Firstly, a hash value on its own does not mean or symbolize anything; it does not have semantics, and it does not transmit or store any information because, as I already said, you cannot reconstruct the original from a hash value. The hash value can only be used to ensure that a specific file has not been altered. To be clear, the hash value does not prevent a file from being altered, but it does allow us to detect such alteration; therefore, it can be used to evidence that no changes have been made.

If at a given time the hash value of a certain file is generated and recorded in a reliable, trustworthy manner, we can determine whether any purported new image or copy of the same file presented at a later date corresponds exactly to the original file. To do so, we simply need to generate the hash value of the new file being presented: if the new hash is the same as the hash obtained previously from the original file, then the file has not been altered and has the exact same content.

The hash function cannot be used to save and store information (if the original file is lost or destroyed, having its hash value does not help us at all) or to evidence where information came from (the hash function is anonymous, anyone can apply it to a file). But it does ensure the integrity of the file and evidence that its content has not been altered, provided, of course, that we are certain we have the hash value that corresponded to the original file. Therefore, for the purely technological guarantee the hash function provides to be truly effective, someone must certify which hash value was initially obtained from a given file. Without this legally reliable certification, any comparison of hash values could be very secure from a mathematical standpoint, but pointless from a legal perspective.
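The verification workflow just described can be sketched in Python (the contract text is invented; in practice, as the text explains, the recorded hash would have to be certified by a trusted party for the comparison to carry legal weight):

```python
import hashlib

def file_fingerprint(content: bytes) -> str:
    """Return the SHA-256 hash of a file's content, in hexadecimal."""
    return hashlib.sha256(content).hexdigest()

# 1. At signing time, a trusted party records the hash of the original file.
original = b"Contract: A sells B the property at 12 Main St for 250,000 euros."
recorded_hash = file_fingerprint(original)

# 2. Years later, someone presents a purported copy: re-hash and compare.
presented_intact = b"Contract: A sells B the property at 12 Main St for 250,000 euros."
presented_forged = b"Contract: A sells B the property at 12 Main St for 150,000 euros."

print(file_fingerprint(presented_intact) == recorded_hash)  # True: content unchanged
print(file_fingerprint(presented_forged) == recorded_hash)  # False: alteration detected
```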

The second consequence of the one-way nature, or computational asymmetry, of the hash function is the security it offers against deliberate attempts to create a hash collision. Given a particular file, a computer can obtain its hash value in an instant. However, there is no formula or algorithm for reconstructing the original file from the hash value. This can only be attempted through “computational brute force”, that is, trying candidate inputs one by one until one of them happens to generate the given hash. This feature is absolutely essential for the security of this tool, because it is what precludes, or extraordinarily hinders, the intentional generation of a hash collision. If somebody could deliberately generate a file with the exact same hash as a different file (but sufficiently similar to it so as to allow one file to be mistaken for the other), the uniqueness of the hash metadata would be jeopardized and, with it, all the security the hash function has to offer. However – and this is the most important point, where the computational asymmetry comes into play – it is one thing for a collision to be theoretically possible, and quite another to be able to intentionally create a collision for a given file, which is what would allow someone to maliciously manipulate information stored or certified using the hashing tool. The computational cost of such a maneuver would be astronomical: its time complexity is not polynomial but exponential.
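A toy Python sketch can give a feel for this asymmetry. Here the brute-force search is deliberately limited to matching only the first four hexadecimal characters (16 bits) of the hash so that it finishes quickly; each additional bit to be matched doubles the expected effort, so a full 256-bit match would require on the order of 2^256 attempts.

```python
import hashlib
import itertools

# Target: only the first 4 hex characters (16 bits) of a real hash,
# truncated on purpose to make the toy search feasible.
target_prefix = hashlib.sha256(b"original document").hexdigest()[:4]

attempts = 0
for n in itertools.count():
    attempts += 1
    candidate = f"forged document {n}".encode()
    if hashlib.sha256(candidate).hexdigest()[:4] == target_prefix:
        break

print(f"Matched a 16-bit prefix after {attempts} attempts")
# Expected effort: about 2**16 = 65,536 tries for 16 bits. Matching the
# full 256-bit hash this way would take on the order of 2**256 tries.
```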
