Playing with CompiledMethod

Today's stop on this journey through the VM is about CompiledMethods. In the previous post I explained the different class formats and, especially, the unique format of CompiledMethod. Today we go deeper into them and see why they are even more special 😉

Summary of the previous post: CompiledMethod instances are internally represented in the VM as bytes. However, CompiledMethod is the only class in the system that mixes pointers (for the literals) with bytes (for the bytecodes). So those bytes encode both things.

Inspecting a CompiledMethod

What is the normal way to learn something in Smalltalk? Open your image, check senders and references, or find someone who does more or less what you need and try to understand it. In the previous post, I showed you how inspecting or exploring a CompiledMethod gives us a lot of useful information, like the header, the literals, the bytecodes and the trailer. Example:

So this means that at least the Inspector and the Explorer can access the CompiledMethod and understand its internals. Let’s take the Inspector (we could also have taken the Explorer, in which case take a look at CompiledMethod >> #explorerContents). When we inspect a CompiledMethod, the inspector class that is used is CompiledMethodInspector. So, first point: there is a special inspector class for CompiledMethod. If we inspect it with a normal inspector instead, for example by doing “BasicInspector openOn: (MyClass >> #testSomething)”, we get something like this:

CompiledMethodInspector has two important methods:

CompiledMethodInspector >> fieldList

    | keys |
    keys := OrderedCollection new.
    keys add: 'self'.
    keys add: 'all bytecodes'.
    keys add: 'header'.
    1 to: object numLiterals do: [ :i |
        keys add: 'literal', i printString ].
    object initialPC to: object size do: [ :i |
        keys add: i printString ].
    ^ keys asArray

CompiledMethodInspector >> selection

    | bytecodeIndex |
    selectionIndex = 0 ifTrue: [^ ''].
    selectionIndex = 1 ifTrue: [^ object ].
    selectionIndex = 2 ifTrue: [^ object symbolic].
    selectionIndex = 3 ifTrue: [^ object headerDescription].
    selectionIndex <= (object numLiterals + 3)
        ifTrue: [ ^ object objectAt: selectionIndex - 2 ].
    bytecodeIndex := selectionIndex - object numLiterals - 3.
    ^ object at: object initialPC + bytecodeIndex - 1

So…as you can see in the code, “keys add: 'all bytecodes'.” maps to “selectionIndex = 2 ifTrue: [^ object symbolic].”, and “keys add: 'header'.” maps to “selectionIndex = 3 ifTrue: [^ object headerDescription].”. What we should learn from this is that CompiledMethod >> #symbolic answers a string which nicely shows the bytecodes. So for example, if we have the method:

MyClass >> testSomething
TestCase new.
self name.
Transcript show: 'The answer is:', 42.

Then, “(MyClass >> #testSomething) symbolic” answers the following:

41 <40> pushLit: TestCase
42  send: new
43 <87> pop
44 <70> self
45  send: name
46 <87> pop
47 <43> pushLit: Transcript
48 <25> pushConstant: ''The answer is:''
49 <26> pushConstant: 42
50  send: ,
51  send: show:
52 <87> pop
53 <78> returnSelf

Don’t worry for the moment about the first number on each line (for the curious, it is the PC -> program counter) or the hexadecimal value between <> (it is the bytecode number in hexadecimal). I will explain that in a future post.

This #symbolic method could be the same one used by the SystemBrowser when you select “View” -> “Bytecodes”. From the previous example, we can also learn that CompiledMethod implements methods like #numLiterals, #objectAt:, #initialPC, etc. Imagine the CompiledMethod as an array of bytes…how can you determine which part is literals and which part is bytecodes? How can #numLiterals be implemented in CompiledMethod if it is just an array of bytes?
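These accessors can be tried directly from a workspace. The following is only a small sketch: it uses Boolean >> #& (a method that exists in the image) instead of the MyClass example, and sticks to the messages already seen in the inspector code above. The exact numbers printed depend on your image.

```smalltalk
"Workspace sketch: ask a CompiledMethod about its own layout."
| method |
method := Boolean >> #&.
Transcript
    show: 'number of literals: ', method numLiterals printString; cr;
    show: 'index of first bytecode: ', method initialPC printString; cr;
    show: 'total size in bytes: ', method size printString; cr;
    show: method headerDescription; cr;
    show: method symbolic; cr.
```

Everything between the header and initialPC is literals, and everything from initialPC up to the trailer is bytecodes, which is exactly the split the inspector relies on.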

CompiledMethod header

It may already be obvious that CompiledMethods have a header. But be careful: a CompiledMethod has both the normal object header every object has and a special header, which is just the first word (32 bits -> 4 bytes) of the byte array. So this header sits just before the literals and the bytecodes. As we can read in the class comment of CompiledMethod:

“The header is a 30-bit integer with the following format:

(index 0)    9 bits:    main part of primitive number   (#primitive)
(index 9)    8 bits:    number of literals (#numLiterals)
(index 17)    1 bit:    whether a large frame size is needed (#frameSize)
(index 18)    6 bits:    number of temporary variables (#numTemps)
(index 24)    4 bits:    number of arguments to the method (#numArgs)
(index 28)    1 bit:    high-bit of primitive number (#primitive)
(index 29)    1 bit:    flag bit, ignored by the VM  (#flag)”

OK, with this comment you may notice the limits imposed on methods. For example, 9 bits for the main part of the primitive number means a maximum of (2^9) - 1 = 511, or (2^10) - 1 = 1023 once the high bit at index 28 is counted. BTW, I think this class comment is outdated and now there are 11 bits for the primitive index, so it is (2^11) - 1 = 2047. But you get the idea... anyway, it is not likely that you have ever reached any of these limits.
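To make the layout concrete, here is a minimal workspace sketch that builds a header integer by hand and then extracts the fields again. It assumes that “index N” in the class comment means a bit offset counted from the least significant bit, and the header value itself is made up for illustration; it does not come from a real method.

```smalltalk
"Sketch: pack and unpack the header fields described in the class comment.
 Assumption: index N = N bits above the least significant bit."
| header primLow numLiterals largeFrame numTemps numArgs primHigh |

"Made-up header: primitive 60, 5 literals, 3 temps, 2 arguments."
header := ((2 bitShift: 24) bitOr: (3 bitShift: 18))
    bitOr: ((5 bitShift: 9) bitOr: 60).

primLow := header bitAnd: 16r1FF.                      "9 bits at index 0"
numLiterals := (header bitShift: -9) bitAnd: 16rFF.    "8 bits at index 9"
largeFrame := ((header bitShift: -17) bitAnd: 1) = 1.  "1 bit at index 17"
numTemps := (header bitShift: -18) bitAnd: 16r3F.      "6 bits at index 18"
numArgs := (header bitShift: -24) bitAnd: 16rF.        "4 bits at index 24"
primHigh := (header bitShift: -28) bitAnd: 1.          "1 bit at index 28"

Transcript
    show: 'primitive (low part): ', primLow printString; cr;
    show: 'numLiterals: ', numLiterals printString; cr;
    show: 'numTemps: ', numTemps printString; cr;
    show: 'numArgs: ', numArgs printString; cr.
```

The field widths also give the limits directly: 8 bits for literals means at most 255, 6 bits for temporaries at most 63, and 4 bits for arguments at most 15.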

Who is responsible for generating this header in the CompiledMethod? In an earlier post, I told you that the usual input for the Compiler is a string representing the source code and the result is a CompiledMethod instance. Hence, the Compiler takes care of creating the CompiledMethod header. Notice that this header is used not only from the image side but also from the VM. Check (in the VMMaker) implementors and senders of #argumentCountOf:, #literal:ofMethod:, #primitiveIndexOf:, #tempCountOf:, etc.

CompiledMethod trailer

Something you should be asking yourself is: where is the source code stored? I mean, when you open a browser and see the source code of a method, where does it come from? Because in the CompiledMethod we saw that only literals and bytecodes are stored, not source code. So??? OK… the source code is stored in two files: .sources and .changes. The source code of “old” methods is in the .sources file and the source code of “new” methods is in the .changes file. You can browse #condenseChanges and #condenseSources for details. So far so good. But…. how is a CompiledMethod instance mapped to its source code in the file? Excellent question Mariano 🙂

Just as there is a special header for CompiledMethod, there is also a trailer. So far the trailer has been used only for getting the source code of the method. Some time ago, this trailer was one word in size (4 bytes) and it encoded a number pointing into the .sources/.changes file. That number represented two things: the offset in the file, and a flag saying which file it refers to (.changes or .sources). Check for example the method #filePositionFromSourcePointer:. In addition, the logic for encoding and decoding the trailer was implemented in the CompiledMethod class itself.

In today’s Pharo images (and Squeak), this is not true anymore. There are two big differences with the “old” approach:

  1. The trailer was reified with the class CompiledMethodTrailer.
  2. There are different kinds of trailers implemented, with room for up to 64 kinds, since the kind is encoded in the 6 high bits of the last byte. The implemented kinds are: normal source pointer, temp names (the decompiler can use these temp names to generate source code closer to the original), variable length (for example when .changes is bigger than 32MB), etc. For more details, check #trailerKinds. Of course, the most common type is “SourcePointer”.

CompiledMethod is a chunk of bytes (this is why it is a subclass of ByteArray), and its format is “bytes”, which means it cannot define normal instance variables. So how can it have a CompiledMethodTrailer? OK, it works this way: when a CompiledMethod is being created (usually by the Compiler), a CompiledMethodTrailer instance is also created. That instance has to be created with a specific type (source pointer, temp names, etc). Once the CompiledMethod is almost ready, the trailer instance is encoded as bytes at the end of the CompiledMethod instance, and the trailer object itself is then garbage collected. Later on, when someone asks the CompiledMethod for its source code (using the method #getSource), it delegates to a trailer instance. But there is no trailer instance at this moment. So….every time the source code is needed, the CompiledMethod creates a new trailer instance. Notice that it is up to the CompiledMethodTrailer to know how many bytes the trailer takes, how to decode them and what the bytes represent (a source pointer, an array of temp names, etc). Finally, the trailer answers the source code of the method. So, the CompiledMethod just has:

CompiledMethod >> trailer
"Answer the receiver's trailer"
^ CompiledMethodTrailer new method: self

The CompiledMethodTrailer just reads the last byte, looks it up in an internal table to see which kind of trailer it is, and then performs the corresponding method to decode the information. The number of bytes used by the trailer, and what they represent, depend on the kind of trailer.

CompiledMethodTrailer >> method: aMethod

| flagByte |

data := size := nil.
method := aMethod.
flagByte := method at: (method size).

"trailer kind encoded in 6 high bits of last byte"
kind := self class trailerKinds at: 1+(flagByte>>2).

"decode the trailer bytes"
self perform: ('decode' , kind) asSymbol.

"after decoding the trailer, size must be set"
[size notNil] assert.

Depending on the type of trailer, CompiledMethodTrailer will finally execute one of the encode* methods when the CompiledMethod is being created, and one of the decode* methods when it is asked for its source code.
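The dispatch done in #method: can be replayed by hand in a workspace. This is only a sketch: the kinds table below is a made-up three-entry stand-in (the real table is whatever #trailerKinds answers in your image, and its names and order may differ), and the flag byte is an arbitrary example value.

```smalltalk
"Sketch: how the last byte of a method selects a decode* method.
 The kinds table here is hypothetical, for illustration only."
| kinds flagByte kindIndex kind selector |
kinds := #('NoTrailer' 'SourcePointer' 'TempNames').

flagByte := 2r00000101.             "high 6 bits = 000001, low 2 bits = 01"
kindIndex := 1 + (flagByte >> 2).   "drop the 2 low bits; +1 because arrays are 1-based"
kind := kinds at: kindIndex.
selector := ('decode', kind) asSymbol.

Transcript
    show: 'kind index: ', kindIndex printString; cr;
    show: 'selector to perform: ', selector printString; cr.
```

With this table and flag byte the selector comes out as #decodeSourcePointer. The 2 low bits of the last byte are not part of the kind, so each kind is free to use them for its own data.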

A question to all of you….wouldn’t it make sense to rename CompiledMethodTrailer to MethodSource? Because trailers have always been used only for that….

Decompiling CompiledMethods

Why is source code not stored in the CompiledMethod? From my point of view, there are 2 main reasons:

  1. Because, as the class name suggests, they reify COMPILED methods, not source methods or whatever name you want to use.
  2. Memory and security reasons. When you deploy an application written in C, do you include source code? No. And in Java? No. So why would we do it in Smalltalk? Remember that the way to “deploy” a Smalltalk application is to provide an .image.

The ideal approach would be to have the sources in development and to be able to remove them when deploying. Smalltalk allows us to do that. Just remove the .sources file and that’s all 🙂 Your image continues to work as if nothing had happened. But sometimes we have a bug in our application and we want to be able to browse the code. Guess what? Smalltalk provides that too 😉 Let’s try it (don’t try it in Pharo1.3 because there is a bug; use any version before 1.3). Create a method anywhere, for example:

testSomething: aaa with: bbb and: ccc
| name |
TestCase new.
(4 = 3)
ifTrue: ["I am a nice comment, don't remove meeee pleaeee!"]
ifFalse: ["I like this way of formatting my code"].

name := self name.
Transcript show: 'The answer is:', 42.

Now, close your image. Rename the .changes file so that it is not found (creating a new method ensures that the source pointer points to the .changes and not to the .sources). Open your image again, and you may get a popup saying that the .changes file couldn’t be found. No problem. Accept it.

Now, if you open the system browser, you can still browse any method of the image. Only those whose source was in the .changes file will look similar to this:

testSomething: t1 with: t2 and: t3
| t4 |
TestCase new.
4 = 3.
t4 := self name.
Transcript show: 'The answer is:' , 42

What you are seeing is not the source code of the method but the decompiled one. The compiler is able to decompile a CompiledMethod (using the bytecodes and literals) and produce a plausible “source code”. However, the decompiled source is not exactly the same as the original source. Note that the only thing the decompiler has for a method is the bytecodes and the literals. Hence, the decompiled code does not have:

  • Temporary variable and parameter names. Since they are not stored in the CompiledMethod, they are lost. Both temps and parameters are replaced in the decompiled code with “t1”, “t2”, etc.
  • Comments are lost (they are not stored in the CompiledMethod).
  • Code formatting (tabs and spaces) is lost.

The cool thing is that even with the decompiled code we can get an idea of the code, debug it, and probably find the bug we were looking for.

Notice that the source code of “old” methods is stored not in the .changes but in the .sources. So, even after removing .changes, there are methods which still get their source from the .sources file. Therefore, we can also remove the .sources, and that way every method in the image will be decompiled when you try to browse it.

Depending on what and where you are deploying, getting rid of .sources and .changes could be worth it.

CompiledMethod equality

What do you expect the following expression to answer:

(Boolean>>#&) = (Boolean>>#|)

Ok…you need to see the source code?

& aBoolean
"Evaluating conjunction. Evaluate the argument. Then answer true if
both the receiver and the argument are true."

self subclassResponsibility

| aBoolean
"Evaluating disjunction (OR). Evaluate the argument. Then answer true
if either the receiver or the argument is true."

self subclassResponsibility

So? true or false? TRUEEEE!!! That is true. And why? They have different comments, they have different selectors! So? Who cares about that? We are talking about COMPILED methods. Are the bytecodes the same? Yes. Are the literals the same? Yes. So they are the same compiled method. Period. So….you should be really careful when putting CompiledMethods in Sets, Dictionaries or things like that. Example:

InstructionClient methods size -> 27
InstructionClient methods asSet size -> 21

Conclusion: use an IdentitySet or an IdentityDictionary if you want to avoid problems.
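A quick workspace sketch of the difference, using the two Boolean methods from above. The results for = and for the plain Set are what this post describes and may differ in images where CompiledMethod equality is defined differently; the identity-based results do not depend on that.

```smalltalk
"Sketch: equality-based vs identity-based collections of CompiledMethods."
| andMethod orMethod plainSet identitySet |
andMethod := Boolean >> #&.
orMethod := Boolean >> #|.

plainSet := Set new.
plainSet add: andMethod; add: orMethod.

identitySet := IdentitySet new.
identitySet add: andMethod; add: orMethod.

Transcript
    show: 'equal (=): ', (andMethod = orMethod) printString; cr;
    show: 'identical (==): ', (andMethod == orMethod) printString; cr;
    show: 'Set size: ', plainSet size printString; cr;
    show: 'IdentitySet size: ', identitySet size printString; cr.
```

The IdentitySet always keeps both methods because they are two distinct objects, whereas the plain Set collapses them whenever = answers true.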

Sorry for the long post, but there is a lot to say about CompiledMethods. In the next post we will talk a little more about bytecodes.

11 responses to “Playing with CompiledMethod”

  • Chris Cunnington

    That was a cool post. I had no concept why there was a CompiledMethodTrailer at all. Its initial use was to point to the location of the code in the .sources file.

    Not every CompiledMethod has a trailer. I guess there is only going to be a trailer if the compiler references the .sources file.

    If the .sources file is removed and all the source is produced from decompilation, then there can’t be a trailer (unless it’s for one of the newer uses).

    • marianopeck

      Hi Chris, and thanks for your comment. You said “Its initial use was to point to the location of the code in the .sources file.” Yes, that’s correct. To the .sources or .changes file. And now there are more options, like encoding some temp names there so that the decompiler can use them. Or you can implement a kind of trailer that directly encodes the full source code as a String.
      “Not every CompiledMethod has a trailer” No. Every method has some bytes at the end that represent the “trailer”. Always. “If the .sources file is removed and all the source is produced from decompilation, then there can’t be a trailer (unless it’s for one of the newer uses).” That’s a good question. Always, with .sources/.changes present or not, when the CompiledMethod is asked for its source code, the method finally executed is #getSourceFor: selector in: class. Check that method and you will see it checks which kind of trailer the method has. If the CompiledMethod has a “Source Pointer” trailer AND the .sources/.changes is not present, then it will answer the decompiled version. So in summary: the trailer is still there; it is not removed when removing the .sources. The only difference is that when the source is asked for, since the .sources is not present, it will answer the decompiled version. But the trailer remains the same.

      Cheers

      • marianopeck

        Sorry, one more….. yes, if the .sources/.changes is removed, then the source pointer encoded in the trailer is completely useless. The only thing is that if you then put the file back, the pointers may still work correctly. Note that now with CompiledMethodTrailer we can define custom kinds and hence use as many bytes as we want (not only 4). Notice also that there is a method called #abandonSource (maybe it is not working with all the new MethodTrailer stuff) which is from the time when the trailer was always 4 bytes, and the idea was to try to encode as many temp names as possible in those 4 bytes. This of course replaced the source pointer (it was lost).

  • Chris Cunnington

    Thanks, Mariano. Looking forward to your post on byte codes and literals.

  • Igor Stasenko

    Answering your question about trailers:

    “A question to all of you….wouldn’t it make sense to rename CompiledMethodTrailer to MethodSource ? because trailers has been always use only for that….”

    Not only for that. In NativeBoost I am using a special kind of trailer which, in addition to the source pointer, holds machine code that is invoked by a special primitive.
    Yes, machine code stored directly inside the body of a compiled method. And the trailer manages that pretty well.

    • marianopeck

      Thanks Igor. It is nice to see some other usage to trailers than managing source code 🙂
      BTW, and thanks for reviewing my posts. Please, always control what I say heheheh.

      • Igor Stasenko

        The trailer’s purpose is to attach additional meta-information to a compiled method.
        It could serve for anything, not just pointing to source code.
        In the future, I think at some point we can get rid of trailers and instead reserve an extra slot for the method’s meta-information.
        The reason it was not done like that from the beginning is to conserve space: if you need to allocate one extra object per compiled method instance, it costs you much more than a couple of bytes in a method’s body.

      • marianopeck

        Thanks igor for the context 🙂
