Solved: `Thread::unpark` is extremely slow on windows

I am investigating a particular instance of slowness in rust-analyzer, and, debugging through a threadpool and crossbeam-channel, I made the following observation:

  • If I pin my program to one core with
    unsafe {
        let process = winapi::um::processthreadsapi::GetCurrentProcess();
        let mask = 1;
        winapi::um::winbase::SetProcessAffinityMask(process, mask);
    }
    
  • then I repeatedly observe std::thread::Thread::unpark taking 30-40 milliseconds.

Specifically, sticking Instant::now/::elapsed around this routinely leads to printouts with unpark = 39.9091ms or some such (example). This seems very surprising to me, I would expect unpark to be sys-call lenght (hundreds of nanos), but this waaaay longer.

Now, this is the bottom of the rabbit hole (cc @retep998 ) for me for now, as I can't [patch.crates-io] libstd easily. Does anyone perchance know what might be happening here?

Note that I don't have a minimal reproducible example yet, as it requires patching rust-analyzer, threadpool and crossbeam.

EDIT: @jstarks figured it out, it were priority boosts: Solved: `Thread::unpark` is extremely slow on windows

3 Likes

I don't know why this is happening, but this whole "optimization" seems suspicious:

I suspect the overhead of this extra lock would outweigh the cost of just always calling notify_one, which on windows translates to a call to WakeConditionVariable, which AFAIK doesn't always result in a syscall. Still it doesn't really explain why it's so slow.

Is there any workflow to use a patched version of stdlib? I’d love to add more printfs...

It's a bit of a hack, but you can modify your rust-src in ~/.rustup and then use xargo to build a custom libstd.

1 Like

I am stumped:

    #[inline]
    pub unsafe fn notify_one(&self) {
        let s = crate::time::Instant::now();
        c::WakeConditionVariable(self.inner.get());
        eprintln!("sys/notify_one {:?}", s.elapsed());
    }

The above prints sys/notify_one 35.3309ms. WakeConditionVariable is "libc" function (there was a bit of logic to load fallback for win XP, but I've removed it in favor of just unconditionally calling the function) .

Which version of Windows are you testing against?

Windows 10 Pro

Oh, I think I know why this is happening. You are limiting the process to a single logical processor, and when you call thread::unpark, it is context switching away from the current thread (giving up its remaining time-slice). If other threads in the process are CPU-bound, they will use up their entire time-slices (15-20ms each by default) so what you are seeing corresponds to about two CPU-bound threads in the same process.

But why would it context switch away? This is unpark, not park. I definitely don't want to context switch away. The thing I am doing, on the high-level, is threadpool.execute(move || some_task()). That execute uses a channel to send the task to the worker thread, and that channel uses unpark to actually wake the worker. I want the main thread to continue doing the work (which it definitely can do).

If the two threads are the same priority why would it not switch to the thread that's being woken up? The thread being unparked also has work it can do.

The good reason not to switch is fairness. In my case, the main thread spawns ten background tasks in a row, and then actually cancels them. If unpark also yields, my main thread doesn't use a full quant, while all background threads do. But I see how OS API could be implemented that way.

I don't believe the windows scheduler is completely fair, but you should be able to lower the priority of the thread pool to prevent it from conflicting (assuming the CPU bounds tasks are running on the thread pool).

Internally in Windows, WakeConditionVariable dynamically (temporarily) boosts the waiting thread's priority by one, which likely makes it higher priority than the waker thread. Therefore it runs first.

There's a little info on priority boosts at https://docs.microsoft.com/en-us/windows/win32/procthread/priority-boosts.

5 Likes

You could try disabling the priority boost to see if this help: SetProcessPriorityBoost(GetCurrentProcess(), FALSE)

4 Likes

It should be TRUE, because

BOOL SetProcessPriorityBoost(
  HANDLE hProcess,
  BOOL   bDisablePriorityBoost
);

But yes, I confirm that this does in fact fix the problem I am seeing, priority boosts are to blame!

9 Likes