next up previous
Next: Stack Reduction Results Up: Experiences with modularizing the Previous: Comparison of core sizes


Reduction of Stack Usage

The second task was to reduce the load placed on the C stack by the interpreter to enable the interpreter to run in environments with limited stack space. In the case of our customer this would be a router.

To this end three different techniques were used:

  1. Identification of variable sized data types with a large default size, instances of which are placed on the stack.

    For such types we reduced their default size either through an existing preprocessor macro determining their size, or by introducing such a macro and proceeding as before.

    An example of such a type is Tcl_DString. Its base size is $3*sizeof(long)$ on typical 32bit systems, i.e. 12 bytes. However it also allocates a static character buffer as part of the structure to handle small strings without having to allocate space on the heap. The macro controlling the size of this buffer is
    TCL_DSTRING_STATIC_SIZE. Its default value is 200, driving the size of the whole structure to 212 bytes.

    Table 3 shows the macros introduced in this way.


    Table 3: Adaptable default sizes
    Macro Default size Meaning, Location of usage
    TCL_DSTRING_STATIC_SIZE 200 Default static buffer in Tcl_DString
    TCL_FMT_STATIC_FLOATBUFFER_SZ 320 generic/tclCmdAH.c, line 1978
    TCL_FMT_STATIC_VALIDATE_LIST 16 generic/tclScan.c, line 266
    TCL_FOREACH_STATIC_ARGS 9 generic/tclCmdAH.c, line 1735
    TCL_FOREACH_STATIC_LIST_SZ 4 generic/tclCmdAH.c, line 1739
    TCL_FOREACH_STATIC_VARLIST_SZ 5 generic/tclCompCmds.c, line 621
    TCL_RESULT_APPEND_STATIC_LIST_SZ 16 generic/tclResult.c, line 458
        generic/tclStringObj.c, line 1199
    TCL_MERGE_STATIC_LIST_SZ 20 generic/tclListObj.c, line 1009
        generic/tclUtil.c, line 831
    TCL_PROC_STATIC_CLOCALS 20 generic/tclExecute.c, line 6272
    TCL_PROC_STATIC_ARGS 20 generic/tclProc.c, line 751
    TCL_INVOKE_STATIC_ARGS 20 generic/tclBasic.c, line 1782
        generic/tclBasic.c, line 1857
        generic/tclBasic.c, line 2951
        generic/tclParse.c, line 1332
    TCL_EVAL_STATIC_VARCHARS 30 generic/tclParse.c, line 1166
    TCL_STATS_COUNTERS 10 generic/tclHash.c, line 306
        generic/tclLiteral.c, line 898
    TCL_LSORT_STATIC_MERGE_BUCKETS 30 generic/tclCmdIL.c, line 2680

  2. Identification of large variables on the stack which cannot be shrunk. For these we devised a set of macros for the declaration and use of such variables whose implementation can be switched between the allocation of such variables on the stack and their allocation on the heap.

    An example of the usage of these macros can be found in the file generic/tclCompCmds.c, see the function TclCompileIncrCmd:

    int
    TclCompileIncrCmd(interp, parsePtr, envPtr)
        Tcl_Interp *interp;
        Tcl_Parse *parsePtr;
        CompileEnv *envPtr;
    {
        Tcl_Token *varTokenPtr, *incrTokenPtr;
        TEMP (Tcl_Parse) elemParse;
        ...
        STRING (160, buffer);
    
        NEWTEMP (Tcl_Parse,elemParse);
        NEWSTR (160, buffer);
    
        envPtr->maxStackDepth = 0;
        if ((parsePtr->numWords != 2) &&
    	(parsePtr->numWords != 3)) {
            ...
            RELTEMP(elemParse);
            return TCL_ERROR;
        }
        ...
    }
    

    Figure 4 shows side by side the two implementations for allocating temporary variables on either the stack or the heap. If the macro TCL_STRUCT_ON_HEAP is defined during compilation temporary variables will be allocated on the heap.


    Table 4: Switchable temporary variables
    Macro Heap Stack
    TEMP(t) t * t
    ITEM(var,item) var $\rightarrow$ item var . item
    REF(var) (var) &(var)
    NEWTEMP(t,var) (var) = (t *) ckalloc(sizeof(t))  
    RELTEMP(var) ckfree((void*)(var))  
    STRING(n,var) char* var char var [n]
    NEWSTR(n,var) (var) = (char *) ckalloc(n)  

  3. Miguel Sofer[*] provided us with a modified engine for the execution of bytecode. This engine does not invoke itself recursively when calling a byte-compiled procedure from other byte-compiled code. It rather jumps directly back into its own main loop, reusing the state variables on the C stack. The actual state of the engine is saved to and restored from the Tcl evaluation stack, which is managed on the heap. This means that calling Tcl procedures inside of Tcl procedures, and especially recursion of Tcl procedures does not consume any C stack anymore.

    This new executor was named NRE, for ``non-recursive engine''.

To identify the hot spots of stack usage, and from there the variables and data structures causing them, we semi-automatically instrumented a copy of the base sources with code monitoring the entrance to and returns from all functions in the interpreter and also recording the the location of the stack pointer for each call.

This instrumentation was only semi-automatic because the script used to insert the code was very simple and did not recognize all possible C syntax for return statements and function calls. About 25 % of the instrumented code had to be reworked manually to get the instrumentation code to work properly in them. There were two big offenders in this regard.

The inserted code manages a logical duplicate of the C call stack on the heap and uses that to record both the names of the active functions and the locations of the cpu stack pointer for them. By comparing the location of the stack pointer before the call of a function F and its location inside, it was possible for us to deduce the amount of C stack space consumed by F.

This method also records the amount of stack required for function linkage, i.e. stack management done by the C compiler. This is no true disadvantage as this amount is usually small, and more important, constant. Because of the latter this does not disturb the ranking of functions when sorting them by the amount of stack they require.

As the execution of the inserted code causes a noticeable slowdown of the interpreter additional processing beyond the comparison of the stack pointer is not done. Instead we write the information about the association between functions and required stack directly to a log file. After each run a number of external scripts is used to postprocess the logged data. The result is a sorted list where the functions consuming the most stack are shown at the top. Thus identifying the hot spots to look at.

Instead of trying to devise a Tcl application which exercises all parts of the interpreter we simply executed the testsuite that comes with the Tcl core to generate the stack traces.


next up previous
Next: Stack Reduction Results Up: Experiences with modularizing the Previous: Comparison of core sizes
andreas_kupries@users.sourceforge.net