Bits: They don't mean shit.

I've had this discussion plenty of times with people online and in real life so I thought I'd write a blog post about it. The TL;DR is: Bits don't mean anything without knowing how to interpret them.

Let's look at these 32 bits here:

01000001010000100100001101000100


Can you figure out what these bits mean? Spoiler: You can't. Nobody can. Now let's say I tell you that this is an array of four 8-bit unsigned integers. What do they mean? Well, they mean [65 66 67 68]. Now I tell you that they are ASCII text. What do they mean? Well, they mean ABCD. Now I tell you that this is a 32-bit vector. What do they mean? Well, they mean 01000001010000100100001101000100. Now I tell you that this is an array of two 16-bit unsigned integers in big-endian byte order. What do they mean? Well, they mean [16706 17220]. You see what's going on here? Nothing has changed: the data stayed exactly the same, but it means different things. You've learned a valuable lesson now: Bits only mean something if you also know how to interpret them.

And this has serious consequences. See, the truth is that if you host a website that allows people to write comments, they can also upload images to it, and videos. They can upload illegal content. But it gets even worse: You might not even know it. It's all about encoding. What stops me from translating a red pixel into the English word "the"? Nothing. Of course, that's pretty dumb, because if somebody writes the comment "the the the the the" it'll look suspicious, but nothing stops you from thinking of a more clever way to translate images into English sentences such that it's practically impossible to spot.

This is something that you really need to wrap your head around. There's data, and there are things that mean something to us humans. In Windows there are file extensions which hint at how to interpret the data in the file. If it's a 24-bit .BMP then 000000000000000011111111 would be a red pixel (BMP stores the channels in the order blue, green, red), but if it's not, then it's something else, and the bits still stay the same. We just tell our computer to interpret the data differently, but again: Given only the data, there's no real way of figuring out how to interpret it. Of course, common file formats can be recognized because they contain some magic bits (a fixed signature, usually at the start of the file) that allow us to make a pretty good guess as to what it could be, but that's the exception and not the rule.

This is why it's impossible to prevent people from uploading illegal content to websites and why it's impossible to tell apart illegal content from legal content. You can do it in a few circumstances, but generally speaking it is impossible.

Look for example at this image of a mandelbrot (warning: it's slow). There's an image on that webpage... or is there? You see an image, but it's not an image. As far as your browser is concerned, it's not an image. As far as Google is concerned, it's not an image. There's no image on that webpage, but you see an image. This perfectly demonstrates my point. Try right-clicking on the image to use "Save image..". It won't work because there's no image on the page. Try inspecting the source code of the webpage. There's no image there. What we think is an image and what the computer thinks is an image are two entirely different things.

In this text I've hidden a secret message... or have I? The only way for you to verify either of these claims is if I told you what method I used, or would have used, to hide it.

It gets even worse when you encrypt something. When you look at well-encrypted data, it looks completely random. You can't tell whether it contains illegal data or not unless you know how to decrypt it, but who's to say it's actually encrypted data and not just random data? The only way to know for sure whether some random-looking data is encrypted data is to know how it was encrypted. But what happens if I encrypt random data? Then how will you know whether you've successfully decrypted it? But let's go further: What if I have a method to convert random data into what looks like normal English text? How will you even know that this normal-looking English text is in fact an encrypted image? You don't. You just don't.
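
To show how little it takes to turn arbitrary data into words, here's a minimal sketch in Python. Everything in it is made up for illustration: the 16-word vocabulary, the function names `encode` and `decode`, and the three example bytes. It maps every 4 bits to one word, so the output is still obviously gibberish. A serious scheme would generate grammatical sentences, but the principle is the same.

```python
# Hypothetical 16-word vocabulary: one word per 4-bit value (nibble).
WORDS = ["the", "cat", "sat", "on", "a", "mat", "and", "saw",
         "my", "dog", "run", "to", "his", "red", "bed", "now"]
INDEX = {word: i for i, word in enumerate(WORDS)}


def encode(data: bytes) -> str:
    """Turn arbitrary bytes into a space-separated string of words."""
    words = []
    for byte in data:
        words.append(WORDS[byte >> 4])    # high 4 bits
        words.append(WORDS[byte & 0x0F])  # low 4 bits
    return " ".join(words)


def decode(text: str) -> bytes:
    """Recover the original bytes from a string produced by encode()."""
    nibbles = [INDEX[word] for word in text.split()]
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))


pixel = bytes([0xFF, 0x00, 0x00])   # three bytes of made-up pixel data
comment = encode(pixel)
print(comment)                      # now now the the the the
print(decode(comment) == pixel)     # True
```

To the website, `comment` is just a text comment. Only somebody who knows the vocabulary and the rule can turn it back into pixel data.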
Again, if a website allows you to upload content in even one format (be it comments, images, Word documents or videos), then you can upload ANY content, and in general it's impossible to tell illegal content apart from legal content.
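
If you want to check the 32-bit example from the beginning yourself, here's a small Python sketch using the standard `struct` module. The variable names are my own, and the little-endian and float readings are extra interpretations that aren't discussed above. I've added them only to show that the list of possible meanings doesn't stop at four.

```python
import struct

bits = "01000001010000100100001101000100"

# Pack the 32 bits into 4 raw bytes. These bytes never change below.
raw = int(bits, 2).to_bytes(4, byteorder="big")

as_uint8 = list(raw)                             # four 8-bit unsigned integers
as_ascii = raw.decode("ascii")                   # ASCII text
as_uint16_be = list(struct.unpack(">2H", raw))   # two 16-bit integers, big endian
as_uint16_le = list(struct.unpack("<2H", raw))   # same bytes, little endian
as_float_be = struct.unpack(">f", raw)[0]        # one 32-bit IEEE 754 float

print(as_uint8)       # [65, 66, 67, 68]
print(as_ascii)       # ABCD
print(as_uint16_be)   # [16706, 17220]
print(as_uint16_le)   # [16961, 17475]
print(as_float_be)    # some float a bit above 12
```

Every line reads the exact same `raw` bytes. The only thing that changes is the format string, which is the "how to interpret them" part.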