Hi. After a couple of months talking about other stuff like Fuel, and giving presentations at conferences such as ESUG and Smalltalks, I would now like to continue with the “Journey through the Virtual Machine” for beginners. So far I have written the first and second parts. Consider this post the first one of the third part.
Direct pointers vs object tables
Let’s say we have this code:
| aPoint |
aPoint := Point x: 10 y: 20.5.
In this case, aPoint has two instance variables, x and y, that refer to an integer (10) and a float (20.5) respectively. How are these references implemented in the VM?
Most virtual machines have an important part whose responsibility is managing memory: allocating objects, releasing them, etc. In the Squeak/Pharo VM, this part is called the Object Memory. In addition, the Object Memory defines the internal representation of objects: their references, their location, their object header, etc. Regarding how references are implemented, the two most common possibilities are object tables and direct pointers.
With the first, there is a large table whose entries pair an index with the memory address of an object. When the object aPoint refers to the float 20.5, the instance variable “y” of aPoint holds an index into that table, and the entry at that index holds the memory address of the float 20.5. With direct pointers, when aPoint refers to 20.5, the instance variable “y” of aPoint directly holds the memory address of 20.5.
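To make the difference concrete, here is a toy model you can paste into a workspace. It is only a sketch: the dictionary standing in for memory, the addresses and the table index are all made up for illustration and have nothing to do with the real VM data structures.

```smalltalk
"Toy model only: memory maps made-up addresses to objects and
 objectTable maps indexes to addresses. All names and numbers are hypothetical."
| memory objectTable ySlotWithTable ySlotWithDirectPointer viaTable viaDirectPointer |
memory := Dictionary new.
memory at: 16r1000 put: 20.5.          "pretend the float lives at address 16r1000"

objectTable := Array new: 8.
objectTable at: 3 put: 16r1000.        "table entry 3 holds the address of the float"

"Object table: the slot y stores an index, so reading it takes two lookups."
ySlotWithTable := 3.
viaTable := memory at: (objectTable at: ySlotWithTable).

"Direct pointer: the slot y stores the address itself, so one lookup is enough."
ySlotWithDirectPointer := 16r1000.
viaDirectPointer := memory at: ySlotWithDirectPointer.

viaTable = viaDirectPointer            "true: both paths reach the same float"
```

The only difference between the two strategies in this model is what is stored in the slot: a table index or the address itself.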
There are pros and cons to each strategy, but that discussion is out of scope for this post. One of the nice things about object tables is that the primitive #become: is really fast, since it only has to update the table rather than every reference. With direct pointers, #become: needs to scan all of memory to detect every object that points to a particular one. On the other hand, with object tables we pay the cost of an extra indirection on every access, and (I guess) this may impact the overall performance of the system. With direct pointers, we do not have that problem. Finally, an object table uses more memory, since the table itself needs memory. A few months ago there was a nice discussion on the mailing list about the pros and cons.
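The #become: argument can be seen in the same kind of toy model. Again, this is a sketch with hypothetical names and addresses, not how any real VM implements the primitive: with a table, swapping two objects means swapping two table entries, and the slots that refer to them are never touched.

```smalltalk
| memory objectTable slotsHoldingA tmp |
memory := Dictionary new.
memory at: 16r1000 put: 'object A'.
memory at: 16r2000 put: 'object B'.
objectTable := Array with: 16r1000 with: 16r2000.   "index 1 -> A, index 2 -> B"

"Three slots somewhere in the image all refer to A through table index 1."
slotsHoldingA := Array with: 1 with: 1 with: 1.

"become: with a table: swap the two entries, no slot is modified."
tmp := objectTable at: 1.
objectTable at: 1 put: (objectTable at: 2).
objectTable at: 2 put: tmp.

slotsHoldingA collect: [ :index | memory at: (objectTable at: index) ]
"every slot now reaches 'object B'"
```

With direct pointers there is no such single place to patch, which is why the VM has to walk memory looking for every slot that holds the old address.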
The first Smalltalk VMs used to have an object table, but most current VMs (including the Squeak/Pharo VM) use direct pointers. The only current VM I am aware of that uses object tables is GemStone. But… they actually have one (virtual) Object Table (OT) per committed transaction!! How can they do those optimizations and not blow up into terabytes of memory used by OTs? Well, that’s one of GemStone’s keys 😉 If you are interested in this topic, you can read this thread.
In the previous paragraphs you learned that each memory address in the Squeak/Pharo VM represents a direct pointer to another object. Well, that’s almost correct. We are missing what are usually known as “immediate objects”. Immediate objects are those that are directly encoded in the memory address and require neither an object header nor slots, so they consume less memory. In the CogVM there is only one type of immediate object, and it is SmallInteger. What does that mean?
In our example, the instance variable “x” of aPoint does not hold a pointer to an instance of SmallInteger with the content 10. Instead, the word stored in “x” directly encodes the value 10. So there is no instance of SmallInteger. But now, how can the VM know whether an instance variable holds a pointer to another object or a SmallInteger? We need to tag the word to say “this is an object pointer” or “this is a SmallInteger”. To do that, the VM uses the last (least significant) bit of the 32-bit word. If that bit is 1, the word is a signed 31-bit SmallInteger. If it is 0, it is a regular object pointer (oop).
Since SmallIntegers are signed and encoded in 31 bits, we have 30 bits for the magnitude (one bit is for the sign). Hence, SmallInteger maxVal should be (2 raisedTo: 30) - 1, that is, 1073741823. Analogously, SmallInteger minVal answers -1073741824. Numbers are encoded using two’s complement. If you want to know more about this, read the excellent chapter that Stéphane Ducasse wrote about it.
Now, regarding object pointers, they always point to the memory address where the object header is. In our example, the instance variable “y” of aPoint holds the memory address of 20.5’s object header.
As you can imagine, the VM needs to check all the time whether an OOP is really an OOP or an integer:
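The tagging scheme is easy to play with in a workspace. The three blocks below are hypothetical helpers of mine, not VM methods, and they work on ordinary integers rather than on real 32-bit words; I am also assuming the usual layout where the value sits in the upper 31 bits and the tag in the lowest one.

```smalltalk
| tag untag isSmallInteger word |
"Hypothetical helpers, modelled with ordinary integers rather than real VM words."
tag := [ :value | (value bitShift: 1) bitOr: 1 ].   "shift left, then set the tag bit"
untag := [ :w | w bitShift: -1 ].                   "shifting right drops the tag bit"
isSmallInteger := [ :w | (w bitAnd: 1) = 1 ].

word := tag value: 10.             "21, that is 2r10101"
isSmallInteger value: word.        "true"
untag value: word.                 "10"
untag value: (tag value: -3).      "-3: the round trip also works for negative numbers"
isSmallInteger value: 16r1000.     "false: an aligned address ends in 0, so it is an oop"
```

Evaluate each of the last five lines with "print it" to see the answers. Note that the value 10 in “x” would be stored as the word 21 under this assumption.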
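The arithmetic behind those two limits can be checked directly. This is plain integer arithmetic, so it gives the same answers in any image; what SmallInteger maxVal itself answers depends on the VM you run, and a 64-bit image will report a larger value than the 32-bit one discussed here.

```smalltalk
| valueBits max min |
valueBits := 31.                                "32-bit word minus 1 tag bit"
max := (2 raisedTo: valueBits - 1) - 1.         "1073741823"
min := (2 raisedTo: valueBits - 1) negated.     "-1073741824"
max - min + 1 = (2 raisedTo: valueBits).        "true: every one of the 2^31 bit patterns is a number"
```

The range is asymmetric (one more negative number than positive) precisely because of two’s complement.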
ObjectMemory >> isIntegerObject: objectPointer
    ^ (objectPointer bitAnd: 1) > 0
If you have an image with Cog loaded (as I explained in all my posts about building the VM), you can check for its senders…and you will find quite a lot 😉
Previously, I explained why SmallInteger instances do not have object headers and do not really exist as “objects”. That’s exactly why “SmallInteger instanceCount” answers zero. Each SmallInteger is encoded directly in the instance variables of the objects that refer to it.
Another fun fact is why identity is always true for equal SmallIntegers. Say you evaluate '1' asNumber == (4 - 3): it answers true, because in the end the VM performs a regular C equality comparison (==) on the two words, which for two equal tagged numbers is of course always true. And if those words are actually OOPs (that is, addresses) and they are equal, then they both point to the same object:
StackInterpreter >> bytecodePrimEquivalent
    | rcvr arg |
    rcvr := self internalStackValue: 1.
    arg := self internalStackValue: 0.
    self booleanCheat: rcvr = arg.
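A small workspace sketch of why that single comparison is enough for SmallIntegers. The tag block is a hypothetical helper that mimics the 1-bit tagging described earlier, using ordinary integers instead of real VM words: however the number 1 was computed, its tagged word is the same bit pattern.

```smalltalk
| tag a b |
tag := [ :value | (value bitShift: 1) bitOr: 1 ].   "hypothetical tagging helper"
a := tag value: '1' asNumber.      "word for 1, obtained by parsing a string"
b := tag value: 4 - 3.             "word for 1, obtained by arithmetic"
a = b                              "true: both words are 3, so comparing words is enough"
```

For real objects the same comparison works for a different reason: two equal words are two equal addresses, hence the same object.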
There are more places where you can notice that SmallInteger is special. In fact, you can browse the class and see some methods it overrides, like #nextInstance (throwing an error), #shallowCopy, #sizeInMemory, etc. And of course, there are more problems, like trying to do a become. For example, (42 become: Date new) throws an error saying it cannot become SmallIntegers.
More immediate objects?
As said, in a 32-bit word we only use 1 bit for tagging immediate objects (SmallInteger in the case of the Squeak VM). We could use more than 1 bit… but then we would have fewer bits for the OOP, and the number of bits in the OOP limits the maximum amount of memory we can address.
But… what happens in a 64-bit VM? I think 63 bits are more than enough for memory addresses. So what about using fewer bits for the OOP and more for immediate objects? Say we use 58 bits for the OOP and 6 for tagging immediate objects. In that example, we have (2 raisedTo: 6) - 1, that is, 63 different possibilities!!! So we could encode not only SmallIntegers but also small floats, true, false, nil, characters, etc… Is that all? No! There are even more ideas. We can not only encode instances of certain classes, but also give other semantics to tagged memory addresses. For example, we could use one of the combinations of tag bits to say that the memory address is in fact a proxy. It doesn’t need to be an instance of Proxy; we just define that when a memory address ends with that tag, the 58 bits for the OOP are not an OOP but the proxy’s contents. Such contents can be a number representing an offset in a table, an address in secondary memory, etc… The VM could then do something different if the object is a proxy!
Well… nothing I have mentioned is new at all. In fact, GemStone does something very similar. They use 61 bits for the address + 3 for tags. Here is a nice set of videos about GemStone’s internals. And in this video you can see what we are talking about here.
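Here is a sketch of that 58 + 6 split, just to show the bit manipulation. Everything in it is an assumption of mine for illustration: the tag living in the lowest 6 bits, tag 0 meaning “plain oop”, and the value 5 chosen for the proxy tag. No existing VM is claimed to use this layout.

```smalltalk
| tagBits tagMask tagOf payloadOf proxyTag word |
tagBits := 6.
tagMask := (1 bitShift: tagBits) - 1.               "2r111111"
tagOf := [ :w | w bitAnd: tagMask ].
payloadOf := [ :w | w bitShift: tagBits negated ].

(1 bitShift: tagBits) - 1.                          "63 tags left once tag 0 means plain oop"

proxyTag := 5.                                      "arbitrary choice for this sketch"
word := (1234 bitShift: tagBits) bitOr: proxyTag.   "payload 1234 could be an offset in a table"
tagOf value: word.                                  "5: the VM would treat this word as a proxy"
payloadOf value: word.                              "1234"
```

A VM using such a scheme would look at the tag first and only then decide how to interpret the remaining 58 bits.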
Documentation and future posts
I always try to put together some links related to the topic of each post:
- Object Table explanation in the blue book
- Slides and video of “Journey In The VM”
- A Tour of the Squeak Object Engine
In the next post, I will give details about the current Object Header.