Class formats and CompiledMethod uniqueness

Before going deeper with CompiledMethods I would like to talk a little bit about class formats. Unfortunately, I didn’t find class formats documented anywhere other than in code and method comments. If you know a source of documentation on this topic, please let me know.

Class format

From my point of view, the class format is a really internal implementation detail of the VM. The class format defines the structure (layout) of the instances of a class in the VM. In the previous post, I said: “In the internal representation of the Virtual Machine, objects are a chunk of memory. They have an object header which (there will be a whole post about it) can be between one and three words, and following the object header, there are slots (normally of 32 or 64 bits) that are memory addresses which usually (we will see why I didn’t say always) represent the instance variables.”

So in the usual, “normal” structure, an object has a fixed number of instance variables which are just pointers to other objects. In this case, those “slots” (which are one word in size, that is 32 or 64 bits) contain the memory address (pointer) of the header of the object they point to. But that’s not the only possibility: other objects (like a Collection instance) do not have a fixed number of slots but a variable one. And the representation is not always pointers (one word each), it can also be bytes. In summary, what changes is how the chunk of memory of an object is represented.

Different class formats

  • Normal: there is a fixed number of instance variables and each of them is just a pointer to another object. Notice that not only is the number of pointers fixed by the number of instance variables, but also each pointer always has the same size, one word (32 or 64 bits). Examples are any normal class like TestCase, Browser, True, Integer, etc.
  • Bytes: it means that the chunk of memory of an object is represented as a variable sequence of individual bytes. Examples: ByteArray, ByteString, ByteSymbol, LargePositiveInteger, LargeNegativeInteger, etc.
  • Words: it is similar to “Bytes” in that it is variable, but it is represented by a sequence of words instead. Notice that “Normal” also encodes the pointers in words, but in that case, first, the number of those words is fixed and, second, they represent pointers. In this case, the number of words is variable and they do not represent pointers to objects. Examples are Bitmap, WideString, WideSymbol, WordArray, FloatArray, etc.
  • Weak: when an object has weak references it means that its pointers to other objects don’t count for the Garbage Collector. So the GC removes an object when nothing points to it except weak references. The weak format can be applied to both variable and fixed formats. For example, WeakFinalizerItem has a normal format, but weak. On the contrary, WeakArray has a variable format and weak.
  • Variable: this is like “Normal” but the number of pointers is not fixed, it is variable. It can also be seen as “Words” where each word does represent a pointer. Examples: BlockClosure, MethodDictionary, etc.
  • CompiledMethod: Chan! Chan! Chan! Yes, CompiledMethod class has its own format. Do you understand already why I wanted to talk about this before CompiledMethods? But we will leave the explanation to the end of the post…

Now…if you want to check by yourself, check the method Behavior >> #typeOfClass, it answers a symbol uniquely describing the format of the receiver class:

Behavior >> typeOfClass
"Answer a symbol uniquely describing the type of the receiver"
self instSpec = CompiledMethod instSpec ifTrue:[^#compiledMethod]. "Very special!"
self isBytes ifTrue:[^#bytes].
(self isWords and:[self isPointers not]) ifTrue:[^#words].
self isWeak ifTrue:[^#weak].
self isVariable ifTrue:[^#variable].
^#normal.

So you can do for example:

TestCase typeOfClass -> #normal
ByteArray typeOfClass -> #bytes
Bitmap typeOfClass -> #words
WeakArray typeOfClass -> #weak
BlockClosure typeOfClass -> #variable

Or you can inspect all classes of a certain type:

(Smalltalk allClasses select: [:each | each typeOfClass = #weak ]) inspect

Now, if you take a look at the method #typeOfClass you can see that the class asks itself whether it is bytes, or words, or pointers, etc… In addition, notice the word “uniquely” in the comment of the method #typeOfClass. It is there because the same class can be several “things” at the same time, and #typeOfClass picks just one symbol among them. For example:

Bitmap isVariable -> true
Bitmap isWords -> true
Bitmap isPointers -> false

BlockClosure isVariable -> true
BlockClosure isWords -> true
BlockClosure isPointers -> true

That example shows that all those classes that are “Words” or “Bytes” are also “variable”. Ahhh and btw…those variable classes support Behavior >> #new: sizeOfVariables. Most classes in the Collection hierarchy are variable.
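A minimal sketch of what #new: means for variable classes, assuming a Pharo or Squeak image of the same era as the one used in this post. The sizes passed to #new: are arbitrary values chosen for illustration.

```smalltalk
"Variable classes take the number of indexable slots as the argument of new:"
(ByteArray new: 4) size.    "4 byte slots"
(WordArray new: 2) size.    "2 word slots"
(Array new: 3) size.        "3 pointer slots"

"A 'normal' class such as Object has no indexable part, so it is created with plain new.
 Sending new: to it is expected to signal an error."
Object new.
```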

Prepare the image

In my post about compiling the VM I told you to use a PharoCore image since it was the “recommended” way. However, in the second post about building the VM, I provided you with a PharoDev 1.2.1 image ready to load Cog and its VMMaker branch. So, even if you are not going to compile the VM, I recommend that you load Cog and VMMaker so that you can follow some of my comments. In addition, since we are not going to build the VM for a couple of posts, but instead try to understand it, you can save this image and you will be able to use it in the next posts. Just take the image and evaluate:

Deprecation raiseWarning: false.
Gofer new
squeaksource: 'MetacelloRepository';
package: 'ConfigurationOfCog';
load.
((Smalltalk at: #ConfigurationOfCog) project version: '2.0') load.

Class format encoding in classes and instances

If you see all those methods like #isBytes, #isVariable, #isPointers, etc (all those methods in the category ‘testing’ in Behavior class) you will notice that they all send #instSpec (instance specification I guess) at the end. And this method looks like this:

Behavior >> instSpec
^ (format bitShift: -7) bitAnd: 16rF

And a couple of examples:

TestCase instSpec -> 1
ByteArray instSpec -> 8
CompiledMethod instSpec -> 12

“format” is an instVar of Behavior and, as its getter method says, “Answer an Integer that encodes the kinds and numbers of variables of  instances of the receiver.”. So the number alone is not really useful, but taking some bits from it is, as #instSpec, #instSize, #indexIfCompact, etc. do. So…the class encodes this information in an integer which is the “format” instVar.
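To make the bit extraction concrete, here is a small sketch that repeats by hand what #instSpec does. It assumes an image where #instSpec is defined exactly as shown above (shift by 7 bits, mask with 16rF); other image versions may lay out the bits differently.

```smalltalk
| format |
"The raw integer stored in the 'format' instVar of the class"
format := ByteArray format.

"Drop the 7 low bits, then keep the next 4 bits: this is the instance specification"
(format bitShift: -7) bitAnd: 16rF.

"Under the assumption above, this comparison should answer true"
((format bitShift: -7) bitAnd: 16rF) = ByteArray instSpec.
```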

But what happens to their instances? Imagine that the VM, for different tasks, needs to know the format of a particular object. Fetching its class every time may be expensive. So where is such information stored? To answer, we will take our image and browse the “core” of the VM. Let’s see the method ObjectMemory >> formatOf:

formatOf: oop
"       0      no fields
1      fixed fields only (all containing pointers)
2      indexable fields only (all containing pointers)
3      both fixed and indexable fields (all containing pointers)
4      both fixed and indexable weak fields (all containing pointers).
5      unused
6      indexable word fields only (no pointers)
7      indexable long (64-bit) fields (only in 64-bit images)
8-11      indexable byte fields only (no pointers) (low 2 bits are low 2 bits of size)
12-15     compiled methods:
# of literal oops specified in method header,
followed by indexable bytes (same interpretation of low 2 bits as above)
"
<inline: true>
^((self baseHeader: oop) >> 8 ) bitAnd: 16rF

As you can see, there are 16 possible formats, encoded from 0 to 15 in 4 bits of the Object Header. The line “^((self baseHeader: oop) >> 8 ) bitAnd: 16rF” is the one that takes those 4 bits from the Object Header of the OOP (object pointer) received by parameter.
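The same arithmetic can be tried in a workspace. This is only a sketch: the header value below is made up for illustration (it is not read from a real object), with nothing but bits 8 to 11 set.

```smalltalk
| header format |
"Hypothetical base header word: bits 8-11 hold 2r1100, every other bit is zero"
header := 16r00000C00.

"Same steps as formatOf: shift right by 8 bits, then keep the low 4 bits"
format := (header bitShift: -8) bitAnd: 16rF.

format.    "12, which falls in the 12-15 range listed above for compiled methods"
```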

If you now browse the class comment of ObjectMemory, you will read that there are 4 bits for the object format. As you can guess, that number that represents the format is what we get on the image side with the method #instSpec. Notice that at the beginning of the post I described all the different types of format and there were 6, but here we have 16 possibilities. Ok, some are for optimizations (for example the number zero means that the object has no instVars, hence the GC can stop there while doing the mark and trace instead of trying to follow non-existent pointers), some are not used (like the number 5), some are only for 64 bits (number 7), the format for “bytes” uses 4 numbers, and CompiledMethod also uses 4 numbers.

Don’t get confused: on the image side, we have an instVar called “format” in Behavior that keeps an integer with both what WE call format plus the number of variables. What we call format is what the method #instSpec answers in the image (which in fact gets it from the “format” instVar). Finally, the VM agrees with us: its method is #formatOf: and it refers to what we call format. All in all, the name of the instVar “format” of Behavior is misleading. Don’t get confused.

Finally, if you are curious you can check senders of #formatOf: and you will see all the places where the VM needs to know the format of an object.

Creating classes with a special format

We saw all the details of the class formats but we didn’t see how to create a class with a special one. In the previous post, I told you the way to create a subclass in Smalltalk was, of course, by sending a message. In this case, a message to the desired superclass. The method was Class >> #subclass:instanceVariableNames:classVariableNames:poolDictionaries:category: .  Now, if you check the category of that method, that is, ‘subclass creation’, you will see many more methods like:

  • #variableSubclass: t instanceVariableNames: f classVariableNames: d poolDictionaries: s category: cat
  • #variableByteSubclass: t instanceVariableNames: f classVariableNames: d poolDictionaries: s category: cat
  • #variableWordSubclass: t instanceVariableNames: f  classVariableNames: d poolDictionaries: s category: cat
  • #weakSubclass: t instanceVariableNames: f  classVariableNames: d poolDictionaries: s category: cat

So…you can imagine what each of those methods does, can’t you? If we want to confirm our suspicion, we can take a look at the definition of the classes. For example, we saw that Bitmap was “words” and ByteArray was “bytes”, hence:

ArrayedCollection variableWordSubclass: #Bitmap
instanceVariableNames: ''
classVariableNames: ''
poolDictionaries: ''
category: 'Graphics-Primitives'

And:

ArrayedCollection variableByteSubclass: #ByteArray
instanceVariableNames: ''
classVariableNames: ''
poolDictionaries: ''
category: 'Collections-Arrayed'

Do you notice the difference?  🙂    It is important to note also that there must be some validation. For example, if I define a class as variable with bytes, I shouldn’t be able to declare instance variables in that class, because I cannot mix both (only CompiledMethod does that!!!). So for example, if you try to do:

TestCase variableByteSubclass: #MarianoArray
instanceVariableNames: ' size '
classVariableNames: ''
poolDictionaries: ''
category: 'Collections-Arrayed'

You will get an error that says ‘cannot make a byte subclass of a class with named fields’. These validations are done by ClassBuilder.
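For contrast, here is a sketch of a definition that should pass the validation, because neither the new class nor its superclass has named instance variables. MyByteBuffer and the category 'Playground' are hypothetical names used only for this example.

```smalltalk
"Object has no named instance variables, so a byte subclass is allowed"
Object variableByteSubclass: #MyByteBuffer
	instanceVariableNames: ''
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Playground'.

"Looked up through Smalltalk at: so the snippet can be evaluated in one go"
(Smalltalk at: #MyByteBuffer) typeOfClass.    "expected to answer #bytes"

"Instances are created with new: and hold raw bytes, not pointers"
((Smalltalk at: #MyByteBuffer) new: 4) at: 1 put: 255; yourself.
```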

CompiledMethod format

This whole post was just to explain the following 😉  As I said, CompiledMethod has a very special class format, and we can read it in its own class comment: “My instances are methods suitable for interpretation by the virtual machine.  This is the only class in the system whose instances intermix both indexable pointer fields and indexable integer fields.”  This means that CompiledMethod was created with the message #variableByteSubclass:instanceVariableNames:classVariableNames:poolDictionaries:category:    and in addition:

CompiledMethod isBytes -> true
CompiledMethod isWords -> false  "lying!! he also includes words for pointers "
CompiledMethod isPointers -> false  "lying!! he also includes words for pointers"

So…the system thinks CompiledMethod is just “Bytes” but it is not, it is a mix between bytes and pointers (words). The pointers are used to point to the literals and this part of the CompiledMethod is known as the “Literal Frame”. In fact, you will notice that the literals usually include a few types of objects: Symbols (for selectors), Associations (for classes and globals), SmallIntegers, ByteStrings for string constants, etc. The “bytes” part is the part used to encode the bytecodes (so it means we have only 256 possible bytecodes???  stay tuned…) . Example:

MyClass >> testSomething
TestCase new.
self name.
Transcript show: 'The answer is:', 42.

If you now inspect the literals, you can see something like this:

(MyClass >>#testSomething) literals ---->>>{(#TestCase->TestCase). #name. #show:. (#Transcript->Transcript). #,. 'The answer is:'. 42. #testSomething. (#MyClass->MyClass)}

So…those are regular objects: (#TestCase->TestCase)  is an Association, #name a Symbol, ‘The answer is:’ a ByteString, 42 a SmallInteger, etc. Think about this:  if you explore any of those objects and check for the pointers to them, will you see the CompiledMethod of #testSomething as one of the pointers to them??? We will see the answer in the next post, but basically it depends on whether or not the tool takes into consideration this special magic of CompiledMethod.

Mmmm now I wonder which are the possible classes for literals? …if my Smalltalk doesn’t fail me:

(CompiledMethod allInstances
inject: OrderedCollection new
into: [:allTypesOfLiterals :aCompiledMethod | allTypesOfLiterals addAll: ((aCompiledMethod literals collect: [:aLiteral | aLiteral class]) asSet ); yourself  ]) asSet.

Prints: ” a Set(Float Association ByteArray WideString LargeNegativeInteger AdditionalMethodState Character ByteSymbol Fraction ByteString SmallInteger Array ScaledDecimal LargePositiveInteger)”

If the format is “Bytes” and there are supposed to be no pointers, how is it possible that we can ask for an object (a literal for example)?  Ok…if you look at CompiledMethod >> objectAt:   you will see that it delegates to a primitive. And since you now know how to download VMMaker, you can go to StackInterpreter class >> initializePrimitiveTable, see that the primitive method is in fact called #primitiveObjectAt, and read the code to find out what it does (hint: CompiledMethod has a header which contains the number of literals, among other stuff).

To conclude, let’s say that the CompiledMethod format is “Bytes” but in fact it is the only class in the system that mixes pointers (for the literals) with bytes (for the bytecodes). Because of this, and a couple of other reasons, CompiledMethod is a unique, quite special class.
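You can also poke at both halves of a method from a workspace. The following is a sketch, assuming the image provides the usual CompiledMethod accessors (#numLiterals, #literalAt:, #objectAt:, #initialPC) and that, as in the images I have looked at, pointer slot 1 is the header and the literals follow it. The method Object >> #printString is chosen arbitrarily.

```smalltalk
| method |
method := Object >> #printString.

"Pointer part: the header plus the literal frame"
method numLiterals.      "number of literal slots, as recorded in the method header"
method objectAt: 1.      "slot 1 is the method header"
method literalAt: 1.     "the first literal"
method objectAt: 2.      "expected to be the same object as literalAt: 1"

"Byte part: the bytecodes start right after the pointer slots"
method initialPC.               "byte index of the first bytecode"
method at: method initialPC.    "the first bytecode, as a plain integer"
```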

Finally, I leave you some homework 😉 If we inspect/explore a ByteArray, we get something like this:

However, if we explore a CompiledMethod we get an explorer that shows us the literals and the bytecodes in a nice way. Like this one:

How do you think the Explorer can do such thing?  and the Inspector ?

See you


14 thoughts on “Class formats and CompiledMethod uniqueness

  1. I’m playing with MicroSqueak. It uses the image as a development environment to spit out a tiny image of around 60K. Tiny.

    This would be a great way to make a small deployment image and which left all the tests and the IDE out.

    The example image for MicroSqueak can be activated and it will spit out a tiny txt file with some strings in it. That’s all it does. (Of course it could do more.)

    I took all the classes and put them in Squeak 4.2. I tried the proper command:

    MicroSqueakImageBuilder new buildImageNamed: ‘foo.image’

    I got an interesting error:

    Error: Bad VM class layout: String

    MicroSqueak has an #checkLayoutOfVMClasses method that executes before #buildImageNamed: as a check.

    I should point out that MicroSqueak creates a special hierarchy of M classes. i.e. MString, MObject, MSystem, etc.

    It iterates through an Array of standard classes and compares them to the M-versions. It does that by testing for equivalence with instSpec.

    Having read your post, the same goal is achieved by sending #typeOfClass. I ran a few tests:

    CompiledMethod instSpec 12.
    CompiledMethod typeOfClass #compiledMethod.
    String instSpec 0.
    String typeOfClass #normal.
    MicroSqueak instSpec. 0
    MicroSqueak typeOfClass #normal.
    MString instSpec. 8
    MString typeOfClass #bytes

    It is clear that when the String instSpec = MString instSpec was reached it threw the error. It produced:

    #normal = #bytes

    and an error resulted. And I suppose that’s because the VM MicroSqueak ran on (which is old) is different from Cog.

    I figure that MicroSqueak can run on Cog. I imagine that things in the image are plastic and can be changed. I’ve got a program that expects one kind of VM, but I’ve got another.

    A useful and timely post. Thanks.

    1. Hi Chris. What a great comment 🙂 Yes, I am a little familiar with MicroSqueak because some guys at my lab are working with PharoSeed images and bootstrap and they have analyzed MicroSqueak as well. The problem is pretty clear: MString instSpec is 8 (bytes) and String is 0 (normal). I think it is not a problem of the VM but of the image. If you look, in Squeak 4.1 String is normal with instSpec 0, and you can notice it is created with the normal method #subclass: ….. And we have ByteString, which has instSpec 8 and is bytes. I think MicroSqueak is based on an old Squeak image. I have a Squeak 2.6 handy and, as I suspected, ByteString doesn’t exist there and instead String instSpec is 8 🙂 So what I am saying is that MicroSqueak was developed when String was “bytes”. Do you understand what I mean, or have I confused you even more? Maybe someone can confirm or refute this.

      1. I understand. And as I’d hoped, you’ve made the answer very lucid. In my MicroSqueak image String instSpec is #8. It’s bytes, as you say. And there’s no ByteString in that image either. Cleanly solved. Thanks!