Thread Bundles

[VERIFIED] Last updated by Joe Schaefer on Fri, 19 Sep 2025
 

AI generated Thread Bundle

POSIX threads

This year I was tasked with porting our maximum likelihood function from a single-threaded version to a multithreaded one. The architectural premise was that the slowest portion of that code is where the actual optimization logic loops over an index of modeling group IDs, and that, by design, each iteration through the loop should be independent of the rest.

That becomes a fun problem when the loop

  1. fails to reinitialize dependent automatic variables,
  2. alters several global pointers indexed by modeling group id (jG),
  3. is in dire need of thread_local variables to separate global memory segments dedicated to the calculation (on a per id basis).

Items 1 and 3 were straightforward. For item 2, I discovered a clever algorithm built on the concept of a THREAD_BUNDLE, which is a per-thread block of consecutive modeling group IDs.

By ensuring THREAD_BUNDLE is at least as large as the memory page size divided by the size of the individual pointers in item 2, each bundle's slice of those pointer arrays covers at least one full page. Different threads then operate on independent pages in virtual memory, so we can eschew mutex locks on those memory writes (assuming, of course, that the threads process their THREAD_BUNDLEs at roughly the same velocity).
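The sizing rule above can be sketched as a small calculation. This is a minimal illustration rather than the code from the port: the helper names `min_thread_bundle` and `min_thread_bundle_for_pointers` are hypothetical, and it assumes a POSIX system where `sysconf(_SC_PAGESIZE)` reports the page size.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: smallest bundle size (counted in group IDs) such
 * that one bundle's worth of per-ID slots covers at least one full page. */
static size_t min_thread_bundle(size_t page_size, size_t slot_size)
{
    assert(slot_size > 0);
    /* Round up so that bundle * slot_size >= page_size. */
    return (page_size + slot_size - 1) / slot_size;
}

/* Hypothetical wrapper: query the page size at run time and assume the
 * per-ID slots are plain data pointers. */
static size_t min_thread_bundle_for_pointers(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    return min_thread_bundle((size_t)page, sizeof(void *));
}
```

With 4096-byte pages and 8-byte pointers, for example, this gives a lower bound of 512 IDs per bundle; any THREAD_BUNDLE at or above that bound satisfies the rule.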

Moreover, I eventually removed the “same velocity” dependency by splitting the loop into even/odd pairings, which guarantees at least one memory page separates the active THREAD_BUNDLE loops.
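The even/odd pairing can be checked with a little index arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and it assumes each bundle covers exactly THREAD_BUNDLE consecutive IDs, clipped at numG. Within one phase, consecutive bundles are separated by a whole idle bundle, and if a bundle's slots fill at least one page, that idle stretch is at least one page long.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: first group ID of the k-th bundle launched in a
 * phase, where phase 0 is the EVEN pass and phase 1 is the ODD pass. */
static long bundle_start(long minG, long bundle, int phase, long k)
{
    return minG + (2 * k + phase) * bundle;
}

/* Last group ID of a bundle beginning at `start`, clipped to numG.
 * Assumption: a bundle holds `bundle` consecutive IDs. */
static long bundle_end(long start, long bundle, long numG)
{
    long end = start + bundle - 1;
    return end < numG ? end : numG;
}

/* Number of IDs that no thread touches between the k-th and (k+1)-th
 * bundles of the same phase. */
static long phase_gap(long minG, long bundle, int phase, long k, long numG)
{
    long this_end = bundle_end(bundle_start(minG, bundle, phase, k), bundle, numG);
    long next_start = bundle_start(minG, bundle, phase, k + 1);
    return next_start - this_end - 1;
}
```

For minG = 1 and a bundle of 512 IDs, the even pass starts bundles at 1, 1025, 2049 and so on, the odd pass at 513, 1537, 2561 and so on, and `phase_gap` reports 512 idle IDs between neighbours in either pass.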

Core Loops

  /* EVEN */
  for (jG = minG; jG <= numG; jG += 2*THREAD_BUNDLE)
  {
    pthread_t ptid;
    arg_t* arg = mal (sizeof(*arg), "arg_t");

#define SET_ARG(foo) arg->foo = (foo)

    SET_ARG(jG);
    ...

#undef SET_ARG

    // rsim_mlf_thread loops over the index bundle
    // from jG to MIN(numG, jG + THREAD_BUNDLE - minG), doing
    // stuff with global data structures also indexed by jG

    pthread_create (&ptid, NULL, rsim_mlf_thread, arg);
    PUSH_STACK_ELT(thread_stack, ptid, pthread_t);
  }

  {
    pthread_t * ptid;
    while (POP_STACK_PTR(thread_stack, ptid, pthread_t) != NULL)
    {
      arg_t* arg;
      pthread_join(*ptid, (void**)&arg);
      ...
    }
  }

  /* ODD */
  for (jG = THREAD_BUNDLE + minG; jG <= numG; jG += 2*THREAD_BUNDLE)
  {
    pthread_t ptid;
    arg_t* arg = mal (sizeof(*arg), "arg_t");
    ...
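For readers who want to experiment with the pattern, here is a self-contained sketch of the same two-pass spawn/join structure. It is not the production code: `slots`, `values`, `bundle_arg_t`, `bundle_worker`, `run_phase` and `run_all` are hypothetical stand-ins, plain `malloc` and a fixed array replace the `mal` and stack macros used above, and THREAD_BUNDLE is set to 512 on the assumption of 4096-byte pages and 8-byte pointers.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define THREAD_BUNDLE 512   /* assumed: 4096-byte pages, 8-byte pointers */
#define MIN_G 1
#define NUM_G 3000
#define MAX_THREADS (NUM_G / (2 * THREAD_BUNDLE) + 1)

/* Stand-ins for the global structures indexed by jG. */
static void *slots[NUM_G + 1];
static int values[NUM_G + 1];

typedef struct { long first, last; } bundle_arg_t;

/* Each thread walks its own bundle and writes the global table unlocked. */
static void *bundle_worker(void *p)
{
    bundle_arg_t *arg = p;
    for (long jG = arg->first; jG <= arg->last; jG++)
        slots[jG] = &values[jG];
    return arg;   /* handed back to the joiner, which frees it */
}

/* Launch one thread per bundle, skipping every other bundle, then join
 * them all. Returns the number of threads that ran. */
static int run_phase(long first_start)
{
    pthread_t tids[MAX_THREADS];
    int n = 0;

    for (long jG = first_start; jG <= NUM_G && n < MAX_THREADS;
         jG += 2 * THREAD_BUNDLE) {
        bundle_arg_t *arg = malloc(sizeof *arg);
        if (arg == NULL)
            break;
        arg->first = jG;
        arg->last = jG + THREAD_BUNDLE - 1 < NUM_G ? jG + THREAD_BUNDLE - 1 : NUM_G;
        if (pthread_create(&tids[n], NULL, bundle_worker, arg) != 0) {
            free(arg);
            break;
        }
        n++;
    }
    for (int i = 0; i < n; i++) {
        void *ret = NULL;
        pthread_join(tids[i], &ret);
        free(ret);
    }
    return n;
}

/* EVEN pass first, then ODD pass; the passes never overlap in time. */
static int run_all(void)
{
    return run_phase(MIN_G) + run_phase(MIN_G + THREAD_BUNDLE);
}
```

Calling `run_all()` from `main` fills every entry of `slots` from MIN_G through NUM_G. The join at the end of each pass is what keeps even and odd bundles from ever being active at the same time.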

Memory Layout