A look inside blocks: Episode 2

This is a follow-on post to A look inside blocks: Episode 1, in which I looked into the innards of blocks and how the compiler sees them. In this article I take a look at blocks that are not constant and how they are formed on the stack.

Block types

In the first article we saw a block with a class of _NSConcreteGlobalBlock. The block structure and descriptor were both fully initialised at compile time since all variables were known. There are a few different types of block, each with its own associated class. However, for simplicity’s sake, we just need to consider 3 of them:

_NSConcreteGlobalBlock is a block defined globally, fully complete at compile time. These are blocks that don’t capture any scope, such as an empty block.
_NSConcreteStackBlock is a block located on the stack. This is where blocks that capture scope start out, before they are copied onto the heap (if they ever are).
_NSConcreteMallocBlock is a block located on the heap. After a block is copied, this is where it ends up. Once here, blocks are reference counted and freed when the reference count drops to zero.

A block that captures scope

This time we’re going to look at the following bit of code:

#import <dispatch/dispatch.h>

typedef void(^BlockA)(void);
void foo(int);

__attribute__((noinline))
void runBlockA(BlockA block) {
    block();
}

void doBlockA() {
    int a = 128;
    BlockA block = ^{
        foo(a);
    };
    runBlockA(block);
}

The function called foo is just there so that the block captures something, by having a function to call with a captured variable. Once again, we look at the armv7 assembly produced, relevant bits only:

    .globl  _runBlockA
    .align  2
    .code   16                      @ @runBlockA
    .thumb_func     _runBlockA
_runBlockA:
    ldr     r1, [r0, #12]
    bx      r1

First of all the runBlockA function is the same as before. It’s calling the invoke function of the block. Then onto doBlockA:

    .globl  _doBlockA
    .align  2
    .code   16                      @ @doBlockA
    .thumb_func     _doBlockA
_doBlockA:
    push    {r7, lr}
    mov     r7, sp
    sub     sp, #24
    movw    r2, :lower16:(L__NSConcreteStackBlock$non_lazy_ptr-(LPC1_0+4))
    movt    r2, :upper16:(L__NSConcreteStackBlock$non_lazy_ptr-(LPC1_0+4))
    movw    r1, :lower16:(___doBlockA_block_invoke_0-(LPC1_1+4))
LPC1_0:
    add     r2, pc
    movt    r1, :upper16:(___doBlockA_block_invoke_0-(LPC1_1+4))
    movw    r0, :lower16:(___block_descriptor_tmp-(LPC1_2+4))
LPC1_1:
    add     r1, pc
    ldr     r2, [r2]
    movt    r0, :upper16:(___block_descriptor_tmp-(LPC1_2+4))
    str     r2, [sp]
    mov.w   r2, #1073741824
    str     r2, [sp, #4]
    movs    r2, #0
LPC1_2:
    add     r0, pc
    str     r2, [sp, #8]
    str     r1, [sp, #12]
    str     r0, [sp, #16]
    movs    r0, #128
    str     r0, [sp, #20]
    mov     r0, sp
    bl      _runBlockA
    add     sp, #24
    pop     {r7, pc}

Well, this is very different to before. Instead of seeing a block get loaded from a global symbol, it looks like a lot more work is being done. It might look daunting, but it’s pretty easy to see what’s going on. It’s probably best to consider the function rearranged; believe me that this doesn’t alter anything functionally. The compiler has emitted the instructions in the order it has as an optimisation, to reduce pipeline bubbles and so on. So, rearranged, the function looks like this:

_doBlockA:
        // 1
        push    {r7, lr}
        mov     r7, sp

        // 2
        sub     sp, #24

        // 3
        movw    r2, :lower16:(L__NSConcreteStackBlock$non_lazy_ptr-(LPC1_0+4))
        movt    r2, :upper16:(L__NSConcreteStackBlock$non_lazy_ptr-(LPC1_0+4))
LPC1_0:
        add     r2, pc
        ldr     r2, [r2]
        str     r2, [sp]

        // 4
        mov.w   r2, #1073741824
        str     r2, [sp, #4]

        // 5
        movs    r2, #0
        str     r2, [sp, #8]

        // 6
        movw    r1, :lower16:(___doBlockA_block_invoke_0-(LPC1_1+4))
        movt    r1, :upper16:(___doBlockA_block_invoke_0-(LPC1_1+4))
LPC1_1:
        add     r1, pc
        str     r1, [sp, #12]

        // 7
        movw    r0, :lower16:(___block_descriptor_tmp-(LPC1_2+4))
        movt    r0, :upper16:(___block_descriptor_tmp-(LPC1_2+4))
LPC1_2:
        add     r0, pc
        str     r0, [sp, #16]

        // 8
        movs    r0, #128
        str     r0, [sp, #20]

        // 9
        mov     r0, sp
        bl      _runBlockA

        // 10
        add     sp, #24
        pop     {r7, pc}

This is what that is doing:

1. Function prologue. r7 is pushed onto the stack because it’s going to get overwritten and is a register which must be preserved across function calls. lr is the link register and contains the address of the next instruction to execute when this function returns. See the function epilogue for more on that. Also, the stack pointer is saved into r7.
2. Subtract 24 from the stack pointer. This makes room for 24 bytes of data on the stack.
3. This little block of code looks up the L__NSConcreteStackBlock$non_lazy_ptr symbol relative to the program counter, so that it works wherever the code ends up in the binary when finally linked. The value loaded through that pointer is then stored at the address held in the stack pointer.
4. The value 1073741824 (0x40000000, i.e. 1 << 30) is stored at the stack pointer + 4.
5. The value 0 is stored at the stack pointer + 8. By now it may be becoming clear what’s going on. A Block_layout structure is being created on the stack! So far the isa pointer, the flags and the reserved value have been set.
6. The address of ___doBlockA_block_invoke_0 is stored at the stack pointer + 12. This is the invoke member of the block structure.
7. The address of ___block_descriptor_tmp is stored at the stack pointer + 16. This is the descriptor member of the block structure.
8. The value 128 is stored at the stack pointer + 20. Ah. If you look back at the Block_layout struct you’ll see that there are only 5 values in it. So what is this being stored after the end of the struct then? Well, you’ll notice that the value is 128, which is the value of the variable captured in the block. So this must be where blocks store values that they use: after the end of the Block_layout struct.
9. The stack pointer, which now points to a fully initialised block structure, is put into r0 and runBlockA is called. (Remember that r0 contains the first argument to a function in the ARM EABI.)
10. Finally the stack pointer has 24 added back to it to balance out the subtraction at the start of the function. Then 2 values are popped off the stack into r7 and pc respectively. The r7 balances the push from the prologue and the pc will now get the value that was in lr when the function began. This effectively performs the return of the function as it sets the CPU to continue executing (the pc, program counter) from where the function was told to return to (lr, the link register).
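To make the stores in steps 3 to 8 a little more concrete, here is a rough C model of the 24 bytes that end up on the stack. It is only a sketch: the struct name, the ptr32_t typedef and the make_doBlockA_literal helper are made up for illustration, and pointers are modelled as 32-bit integers so that the offsets match the armv7 listing whatever machine you compile this on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit pointer, so the layout matches armv7
   even when this is compiled on a 64-bit host. */
typedef uint32_t ptr32_t;

/* Illustrative model of the block literal built by _doBlockA. The first five
   members mirror Block_layout; the sixth is the captured variable. */
struct DoBlockALiteral32 {
    ptr32_t isa;        /* [sp]      address of _NSConcreteStackBlock      */
    int32_t flags;      /* [sp, #4]  1073741824                            */
    int32_t reserved;   /* [sp, #8]  0                                     */
    ptr32_t invoke;     /* [sp, #12] address of ___doBlockA_block_invoke_0 */
    ptr32_t descriptor; /* [sp, #16] address of ___block_descriptor_tmp    */
    int32_t a;          /* [sp, #20] captured value, 128                   */
};

/* Fills in the literal in the same order as steps 3 to 8. The three address
   arguments are placeholders for values only known at link time. */
static struct DoBlockALiteral32 make_doBlockA_literal(ptr32_t isa,
                                                      ptr32_t invoke,
                                                      ptr32_t descriptor)
{
    struct DoBlockALiteral32 b;
    b.isa = isa;
    b.flags = 1073741824;
    b.reserved = 0;
    b.invoke = invoke;
    b.descriptor = descriptor;
    b.a = 128;
    return b;
}
```

Checking sizeof and offsetof on this struct gives 24 bytes in total, with invoke at offset 12 and the captured a at offset 20, which lines up with the sub sp, #24 and the store offsets in the listing.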

Wow! You still with me? Brilliant!

The final bit of this little section is to check what the invoke function and the descriptor look like. We would expect them to be not much different to the global block from episode 1. Here they are:

    .align  2
    .code   16                      @ @__doBlockA_block_invoke_0
    .thumb_func     ___doBlockA_block_invoke_0
___doBlockA_block_invoke_0:
    ldr     r0, [r0, #20]
    b.w     _foo

    .section        __TEXT,__cstring,cstring_literals
L_.str:                                 @ @.str
    .asciz   "v4@?0"

    .section        __TEXT,__objc_classname,cstring_literals
L_OBJC_CLASS_NAME_:                     @ @"\01L_OBJC_CLASS_NAME_"
    .asciz   "\001P"

    .section        __DATA,__const
    .align  2                       @ @__block_descriptor_tmp
___block_descriptor_tmp:
    .long   0                       @ 0x0
    .long   24                      @ 0x18
    .long   L_.str
    .long   L_OBJC_CLASS_NAME_

And yep, there’s not much difference really. The only difference is the size field of the block descriptor, which is now 24 rather than 20. This is because the block captures an integer value, so the block structure is 24 bytes rather than the standard 20. We saw the extra bytes being added to the end of the structure when it was created.
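As a rough illustration of that descriptor, the four .long values can be modelled in C like this. The struct and function names are my own, pointers are again modelled as 32-bit integers to match armv7, and the size helper ignores any alignment padding, which does not matter for a single captured int.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit pointer. */
typedef uint32_t ptr32_t;

/* Illustrative model of ___block_descriptor_tmp as emitted above. */
struct BlockDescriptor32 {
    uint32_t reserved;  /* .long 0                                        */
    uint32_t size;      /* .long 24, the size of the whole block literal  */
    ptr32_t  signature; /* .long L_.str, the "v4@?0" string               */
    ptr32_t  extra;     /* .long L_OBJC_CLASS_NAME_, not discussed here   */
};

/* Size of the five Block_layout members on a 32-bit target. */
enum { BLOCK_HEADER_SIZE_32 = 20 };

/* Size recorded in the descriptor: the fixed header plus whatever the block
   captured. Sketch only; padding between captured values is ignored. */
static uint32_t block_literal_size_32(uint32_t captured_bytes)
{
    return BLOCK_HEADER_SIZE_32 + captured_bytes;
}

/* Builds the descriptor for the block in doBlockA. The two address arguments
   are placeholders for link-time addresses. */
static struct BlockDescriptor32 make_doBlockA_descriptor(ptr32_t signature,
                                                         ptr32_t extra)
{
    struct BlockDescriptor32 d;
    d.reserved = 0;
    d.size = block_literal_size_32((uint32_t)sizeof(int32_t));
    d.signature = signature;
    d.extra = extra;
    return d;
}
```

With nothing captured the helper gives 20, as for the global block in episode 1, and with one 4-byte int it gives the 24 seen here.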

Also in the actual block function, i.e. __doBlockA_block_invoke_0, you can see the value being read out of the end of the block structure, i.e. r0 + 20. This is the variable captured in the block.
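Putting the pieces together, here is roughly what the compiler has generated, written out by hand in plain C with no block syntax. Treat it as a sketch under assumptions: the struct and its member names are illustrative, foo just records its argument so there is something to observe, and because this uses native pointers the offsets only come out as 12 and 20 on a 32-bit target.

```c
#include <assert.h>
#include <stddef.h>

struct BlockLiteral;
typedef void (*InvokeFn)(struct BlockLiteral *);

/* Illustrative hand-written equivalent of the block literal. */
struct BlockLiteral {
    void       *isa;
    int         flags;
    int         reserved;
    InvokeFn    invoke;
    const void *descriptor;
    int         a;          /* captured variable, stored after the header */
};

/* Stand-in for foo: remembers the value it was called with. */
static int last_foo_arg = -1;
static void foo(int value) { last_foo_arg = value; }

/* Mirrors ___doBlockA_block_invoke_0: the block itself arrives as the first
   argument, and the captured value is read out of the end of it. */
static void doBlockA_block_invoke(struct BlockLiteral *block)
{
    foo(block->a);
}

/* Mirrors _runBlockA: load the invoke pointer and call it with the block. */
static void runBlockA(struct BlockLiteral *block)
{
    block->invoke(block);
}

/* Mirrors _doBlockA: build the literal on the stack, then pass its address. */
static void doBlockA(void)
{
    struct BlockLiteral block = {
        NULL,                  /* isa, would be &_NSConcreteStackBlock  */
        1 << 30,               /* flags                                  */
        0,                     /* reserved                               */
        doBlockA_block_invoke, /* invoke                                 */
        NULL,                  /* descriptor, omitted in this sketch     */
        128                    /* captured a                             */
    };
    runBlockA(&block);
}
```

Calling doBlockA() ends up in foo with 128, by way of the invoke pointer, just as the assembly does with ldr r0, [r0, #20] followed by a branch to _foo.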

What about capturing object types?

The next thing to consider is what happens if, instead of an integer, the block captures an object type such as an NSString. To see what happens there, consider the following code:

#import <Foundation/Foundation.h>

typedef void(^BlockA)(void);
void foo(NSString*);

__attribute__((noinline))
void runBlockA(BlockA block) {
    block();
}

void doBlockA() {
    NSString *a = @"A";
    BlockA block = ^{
        foo(a);
    };
    runBlockA(block);
}

I won’t go into the details of doBlockA because that doesn’t change much. What is interesting is the block descriptor structure that’s created:

    .section        __DATA,__const
    .align  4                       @ @__block_descriptor_tmp
___block_descriptor_tmp:
    .long   0                       @ 0x0
    .long   24                      @ 0x18
    .long   ___copy_helper_block_
    .long   ___destroy_helper_block_
    .long   L_.str1
    .long   L_OBJC_CLASS_NAME_

Notice there are pointers to functions called ___copy_helper_block_ and ___destroy_helper_block_. Here are the definitions of those functions:

    .align  2
    .code   16                      @ @__copy_helper_block_
    .thumb_func     ___copy_helper_block_
___copy_helper_block_:
    ldr     r1, [r1, #20]
    adds    r0, #20
    movs    r2, #3
    b.w     __Block_object_assign

    .align  2
    .code   16                      @ @__destroy_helper_block_
    .thumb_func     ___destroy_helper_block_
___destroy_helper_block_:
    ldr     r0, [r0, #20]
    movs    r1, #3
    b.w     __Block_object_dispose

I assume these functions are what get run when blocks are copied and destroyed. They must be retaining and releasing the object that was captured by the block. It looks like the copy function takes 2 parameters, as both r0 and r1 are addressed as if they contain valid data. The destroy function looks like it takes just 1. All of the hard work looks like it’s done by _Block_object_assign and _Block_object_dispose. The code for those is within the blocks runtime, part of the compiler-rt project within LLVM.
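Reading the two helpers back into C gives something like the sketch below. The fake_ functions are stand-ins I have written for _Block_object_assign and _Block_object_dispose so that the example runs without the blocks runtime; the retain-and-store behaviour they show is my assumption about what the real ones do for an object capture, and FakeObject with its retain_count is purely illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* The constant 3 that both helpers pass along (in r2 and r1 respectively). */
enum { FIELD_IS_OBJECT = 3 };

/* Illustrative object with a visible reference count. */
struct FakeObject { int retain_count; };

/* Illustrative block literal capturing one object pointer. */
struct ObjBlockLiteral {
    void       *isa;
    int         flags;
    int         reserved;
    void      (*invoke)(struct ObjBlockLiteral *);
    const void *descriptor;
    struct FakeObject *a;   /* captured object, at offset 20 on armv7 */
};

/* Stand-in for _Block_object_assign: assumed to retain the object and store
   it at the destination address. */
static void fake_block_object_assign(void *dest_addr, const void *object, int flags)
{
    if (flags == FIELD_IS_OBJECT && object != NULL) {
        struct FakeObject *obj = (struct FakeObject *)object;
        obj->retain_count++;
        *(struct FakeObject **)dest_addr = obj;
    }
}

/* Stand-in for _Block_object_dispose: assumed to release the object. */
static void fake_block_object_dispose(const void *object, int flags)
{
    if (flags == FIELD_IS_OBJECT && object != NULL) {
        ((struct FakeObject *)object)->retain_count--;
    }
}

/* Mirrors ___copy_helper_block_: r0 + 20 is &dst->a, [r1, #20] is src->a. */
static void copy_helper_block(struct ObjBlockLiteral *dst, struct ObjBlockLiteral *src)
{
    fake_block_object_assign(&dst->a, src->a, FIELD_IS_OBJECT);
}

/* Mirrors ___destroy_helper_block_: [r0, #20] is src->a. */
static void destroy_helper_block(struct ObjBlockLiteral *src)
{
    fake_block_object_dispose(src->a, FIELD_IS_OBJECT);
}
```

Running copy_helper_block on a source and destination block leaves both pointing at the same object with its count bumped by one, and destroy_helper_block on the copy brings the count back down, which is the balance you would hope for between copying a block and destroying it.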

If you want to go away and have a read of the code for the blocks runtime then take a look at the source which can be downloaded from http://compiler-rt.llvm.org. In particular, runtime.c is the file to look at.

What next?

In the next episode I shall take a look into the blocks runtime by investigating the code for Block_copy and see just how that does its business. This will give an insight into the copy and destroy helper functions we’ve just seen get created for blocks that capture objects.

Matt Galloway
