Obfuscation, fear and loathing.
October 3, 2011
Warning: legal and moral claptrap ahead.
First, a bit of history. You might be surprised to learn that programming pre-dates computers by about a century and a half. The two are so synonymous these days that it’s hard to imagine a computer without programs or a program without a computer. But machines have been controlled using honest-to-goodness code since the early 1800s (think of automated looms and pianolas). Most commonly the program was stored on punch cards that were fed into a machine which executed the instructions. But even a hand-cranked music box is an example of programming.
With the advent of the modern (read: electronic) computer in the 1940s, programming as we understand it today made its first appearance. Basically a program is a series of instructions. Some instructions pertain to reading or writing bits in memory, others are concerned with computational steps such as the addition of two numbers. The main reason computers became so popular is that they are unwaveringly consistent in the application of their programs. You can always trust a computer to do as it’s told*.
Originally computer programs were stored on punch cards using machine code, which was quickly superseded by assembly-style languages. The last two or three decades have witnessed an amazing evolution in terms of programming languages and programming platforms. Languages have become both more human-centric and more versatile. And programming platforms like MFC, Cocoa, .NET or Google Apps provide a plethora of useful functions that allow programmers to create new applications in a matter of minutes.
Programs these days are often distributed, meaning the instructions are given to the end-user, whose computer will then execute them. Of course there are exceptions; a lot of programs run server-side, meaning they are executed elsewhere and only the output of the program is pumped to the end-user machine, but in my business the bulk of programs run client-side. There has been an ongoing debate about ownership and patents and rights and how they pertain to software and while that discussion is both interesting and frustrating, it is not the focus of this post.
When one writes a piece of software one typically owns copyright on the source code**, just as one would own copyright on any other piece of text. When someone else uses your source code to make a different application they are guilty of copyright infringement, essentially theft, though I find ‘plagiarism’ a better parallel. This is illegal, but depending on the code stolen it can be very difficult to detect, let alone prove. This is the first legal problem I want to highlight.
It is also usually up to you whether or not you want to sell your software***. If you choose to sell your software, only people who have paid you money are allowed to use it. If someone is using your software without payment or explicit permission, they are breaking the law and are liable to prosecution. This is also often referred to as theft and it is the second legal problem pertinent to the topic at hand.
Both issues outlined above have the potential to reduce income for the original developer, which is why sentiments run so high over this. A good case can be made that illegal copies have positive as well as negative consequences for the author, but that, too, is not the focus of this post. It is not difficult to see that access to source code is the single most valuable asset to anyone looking to infringe copyright or circumvent license agreements. When one can read source code, algorithms can be extracted and license protection can be punctured.
Up until recently practically all large scale applications were distributed to users as machine code instructions. Source code was written by the developer on her computer, then turned into a collection of assembly-level calls by her compiler. The result is a set of instructions that can be executed very efficiently by the client machine since it is stripped of all human-centric content. Basically a compiler is a translator that converts code written by humans into instructions readable by machines. Compilers often use tricks to increase performance or reduce memory usage, so instead of just removing all superfluous content they also modify the logic of the source code. Compiling is quite difficult and those who write compilers are often lionized by their fellow programmers. The opposite of a compiler is a decompiler, which takes machine instructions and translates them back into a human-readable language. This process is even more difficult as a lot of the ‘squishy’ stuff needs to be put back in.
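Python’s own compiler makes for a handy illustration of those “tricks” (an analogy of my choosing; the post is about .NET, but the principle is the same): the expression you write is not necessarily the expression that survives compilation. Here the constant arithmetic is folded away at compile time:

```python
import dis

def seconds_per_day():
    # Written as three factors for readability; the compiler folds the
    # constant expression 60 * 60 * 24 into a single pre-computed value.
    return 60 * 60 * 24

# The compiled code object contains the folded constant 86400, not the
# multiplications from the source -- the logic has been quietly rewritten.
print(seconds_per_day.__code__.co_consts)
dis.dis(seconds_per_day)
```

A decompiler looking at this bytecode can only recover `return 86400`; the original three-factor expression is gone for good, which is part of what makes decompiling machine code so hard.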
I said “up until recently” because when you write .NET code this is not what happens. .NET programs are no longer fully compiled by the developer and then distributed to users. Rather, the compilation process has been broken up into two parts, one to be performed by the developer machine and the other by the client machine. Text-based, human-readable source code is converted to a language called IL (Intermediate Language). This process validates the code (which is a time-consuming process) and turns it into a form which is more easily parsable by computers. You can think of it as binary source instead of textual source if you want. It is this binary code that is given to users and only when they run the program does it get fully compiled to machine code. There are a lot of really big benefits to this approach, but there’s also a downside; it makes it much easier to decompile the program into high-quality source. This worries a fair few developers for reasons that are almost always, in my opinion, misguided.
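To see why an intermediate form decompiles so cleanly, Python’s bytecode offers a rough stand-in (again an analogy, not .NET IL itself): the compiled form keeps the parameter names, the docstring and the constant layout, which is precisely the raw material a decompiler needs.

```python
import dis

def add_tax(price, rate):
    """Return the price including tax."""
    return price + price * rate

# The human-centric information survives compilation to bytecode:
print(add_tax.__code__.co_varnames)           # ('price', 'rate')
print(add_tax.__doc__)                        # Return the price including tax.
print([i.opname for i in dis.Bytecode(add_tax)])
```

With names, docstrings and structure intact in the distributed artifact, reconstructing something very close to the original source is almost mechanical, which is exactly the situation .NET developers find themselves in with IL.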
There is a process called obfuscation which changes compiled (or semi-compiled in the case of .NET) code and in so doing makes it harder for a decompiler to make sense of the instructions. This is a very effective defense against the casual ‘code thief’ as it can fool most commercial decompilers. Although I agree that there do exist valid reasons in favour of obfuscating code, these reasons are few and far between.
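One of the simplest obfuscation techniques is identifier renaming: every meaningful name is replaced with a meaningless one before distribution. Here is a toy sketch of that idea in Python (my own illustration; commercial obfuscators also rewrite control flow, encrypt strings and more):

```python
import ast

class NameObfuscator(ast.NodeTransformer):
    """Toy obfuscator: rename functions, parameters and local variables
    to meaningless identifiers. Ignores globals/builtins, which a real
    obfuscator would have to handle with much more care."""

    def __init__(self):
        self.mapping = {}

    def _rename(self, name):
        if name not in self.mapping:
            self.mapping[name] = f"_{len(self.mapping):x}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._rename(node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._rename(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._rename(node.id)
        return node

source = """
def total_price(price, tax_rate):
    tax = price * tax_rate
    return price + tax
"""

tree = NameObfuscator().visit(ast.parse(source))
print(ast.unparse(tree))
# Behaviour is unchanged, but the decompiled result reads roughly as:
#   def _0(_1, _2):
#       _3 = _1 * _2
#       return _1 + _3
```

The program still computes exactly the same thing; only the human-centric names are gone, so a decompiler produces code that is technically valid but much harder to learn from, or to steal.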
My main objection is an ideological one. It seems to me obfuscation is often used because “the competition is catching up”. Competition is always catching up, that’s its job. You won’t defeat the competition by flailing around and yelling “Mine!”. The only way to get on top is to write better code than the competition and the only way to stay on top is to keep writing better code than the competition (yes, I know there are marketing and legal departments involved as well, I concede the world is not quite as neat as portrayed here). Time and effort spent securing code can no longer be spent improving code. This applies to obfuscation, but also to elaborate licensing mechanisms such as online validation and (shiver) dongles.
Another objection is that obfuscation removes any possibility of useful exploration. “Useful” to legit users and authors that is. It can allow power-users to track down bugs and come up with workarounds without the need for the official developer to fix, re-compile and re-release. It helps plug-in developers to see what’s going on when they rely on your code. It helps other programmers to see what’s going on when you run into problems and need help. These are real benefits and you might be sacrificing them in favour of imaginary ones when you decide to obfuscate.
* With the exception of hardware failures such as corrupt storage of course.
** I am not a lawyer and this is not legal advice.
*** Again, I am not a lawyer. If you make legal decisions based on this post I will sue you.