Encryption: Part 2

Yes, I promised stream ciphers but we'll look into substitution ciphers first!

Substitution ciphers

Substitution ciphers are much more secure than Caesar's cipher (although technically Caesar's cipher can be thought of as a substitution cipher as well). In substitution ciphers we replace every letter (or byte) with another letter (or byte) for example we replace 'A' with 'Z', 'B' with 'F', C with 'A', D with 'Q' and so on. To decrypt we just reverse the mapping of the letters (mapping 'Z' to 'A', 'F' to 'B' etc.). Why is this much more secure? In Caesar's cipher the key is essentially an offset which there can only be 26 (for letters) or 256 (for bytes) (although offset 0 makes little sense) possible offsets. In substitution ciphers on the other hand the key is a permutation of the alphabet which means there are 26! (for letters) (where ! indicates factorial) or 256! (for bytes). How huge are 26! and 256!? HUGE:

Prelude> fac 26
Prelude> fac 256

You can forget about brute-forcing that! Still, this is not a secure encryption scheme by modern standards. Why? Let's look at some text encrypted with a substitution cipher:

dzhr hr dzk isvhodkpd cgotuvdqsvdhgor ha mgq zvwk 
jvovtkn dg nkcumid dzk chizkudkpd dg rkk dzhr rkcukd 
jkrrvtk h yhrz mgq v zviim nvm

The first weakness of ciphers like this is that the same letter in the plaintext ALWAYS maps to the same letter in the ciphertext meaning that for example this means that we know that 'zviim' contains a repeated letter. This can allow us to guess words if we know the language of the cipher text. Also, we can calculate the frequence of letters and then based on the language start to guess letters and words. In english the most common letters are 'e', 't', 'a' and 'o' (in that order). In our ciphertext the most common letters are 'd', 'k', 'v' and 'h'. We also know that 'dzk' appears twice in the ciphertext which means we can attempt to guess that this is 'the' in plaintext. We now have 'd -> t', 'k -> e' and 'z -> h'. Let's fill this in:

THhr hr THE isvhoTEpT cgotuvTqsvThgor ha mgq HvwE 
jvovtEn Tg nEcumid THE chiHEudEpd Tg rEE THhr rEcuET 
jErrvtE h yhrH mgq v Hviim nvm

We guess that 'Tg' is probably 'TO'. 'THhr' can't be 'THAT'. Based on the letter frequencies and the appearances of 'h' in the ciphertext could mean that 'THhr' might be 'THIS'. Since we then now that 'h' is 'I' we can also guess that the single 'v' is an 'A'. Let's fill this in:

THIS IS THE isAIoTEpT cOotuATqsvThOoS Ia mOq HAwE 
jAoAtEn TO nEcumid THE cIiHEudEpd TO SEE THIS SEcuET 
jESSAtE I yhSH mOq A HAiim nAm

One can continue guessing new letters and new words. The longer the ciphertext or the more ciphertext available the easier it is to guess (and thus decrypt) words and letters.

Substitution ciphers are also very vulnerable to known plaintext attacks. If we know that a certain ciphertext always starts with "dear sir or madam" we already get a lot of letters for free. In general the way to break substitution ciphers is to run frequency analysis of letters and words on the ciphertext to guess the language (if not already known) and then to guess letters and words abusing facts that for example there are only three possible three letter words in one language or that only two words end with a certain combination of letters or that certain letters in a language are always followed by a specific other letter. Of course, there are alreday lots of tools available to automate this so we don't even have to do this manually.


What next?

As promised... we'll DEFINITELY look into stream ciphers in the next part!