I am investigating a particular instance of slowness in rust-analyzer, and, debugging through a threadpool and crossbeam-channel, I made the following observation:
If I pin my program to one core with
unsafe {
    let process = winapi::um::processthreadsapi::GetCurrentProcess();
    let mask = 1;
    winapi::um::winbase::SetProcessAffinityMask(process, mask);
}
then I repeatedly observe std::thread::Thread::unpark taking 30-40 milliseconds.
Specifically, sticking Instant::now/::elapsed around this routinely leads to printouts with unpark = 39.9091ms or some such (example). This seems very surprising to me: I would expect unpark to be syscall-length (hundreds of nanos), but this is waaaay longer.
Now, this is the bottom of the rabbit hole for me for now (cc @retep998), as I can't [patch.crates-io] libstd easily. Does anyone perchance know what might be happening here?
Note that I don't have a minimal reproducible example yet, as it requires patching rust-analyzer, threadpool and crossbeam.
I don't know why this is happening, but this whole "optimization" seems suspicious:
I suspect the overhead of this extra lock would outweigh the cost of just always calling notify_one, which on Windows translates to a call to WakeConditionVariable, which AFAIK doesn't always result in a syscall. Still, that doesn't really explain why it's so slow.
#[inline]
pub unsafe fn notify_one(&self) {
    let s = crate::time::Instant::now();
    c::WakeConditionVariable(self.inner.get());
    eprintln!("sys/notify_one {:?}", s.elapsed());
}
The above prints sys/notify_one 35.3309ms. WakeConditionVariable is a "libc" function (there was a bit of logic to load a fallback for Windows XP, but I've removed it in favor of just unconditionally calling the function).
Oh, I think I know why this is happening. You are limiting the process to a single logical processor, and when you call thread::unpark, the scheduler context-switches away from the current thread (which gives up its remaining time-slice). If other threads in the process are CPU-bound, they will each use up their entire time-slice (15-20ms by default), so what you are seeing corresponds to about two CPU-bound threads in the same process.
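If that theory is right, it should show up without rust-analyzer in the picture at all: pin the process to one core, keep a couple of CPU-bound threads spinning, and time an unpark from the main thread. An untested sketch, reusing your affinity snippet:

```rust
use std::thread;
use std::time::{Duration, Instant};

fn main() {
    // Pin the whole process to one logical CPU (as in the snippet above).
    unsafe {
        let process = winapi::um::processthreadsapi::GetCurrentProcess();
        winapi::um::winbase::SetProcessAffinityMask(process, 1);
    }

    // Two CPU-bound threads, each of which will burn a full time-slice
    // (~15-20ms) once the scheduler switches to it.
    for _ in 0..2 {
        thread::spawn(|| loop {
            std::hint::spin_loop();
        });
    }

    // A parked thread to be woken.
    let parked = thread::spawn(|| thread::park());
    thread::sleep(Duration::from_millis(50)); // give it time to actually park

    let t = Instant::now();
    parked.thread().unpark();
    // If unpark switches away and each busy thread then gets a full slice
    // before the main thread runs again, this should print ~30-40ms.
    println!("unpark took {:?}", t.elapsed());
}
```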
But why would it context-switch away? This is unpark, not park; I definitely don't want to context-switch away. The thing I am doing, at a high level, is threadpool.execute(move || some_task()). That execute uses a channel to send the task to the worker thread, and the channel uses unpark to actually wake the worker. I want the main thread to continue doing its work (which it definitely can do).
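For context, the shape of that path is roughly this (a simplified sketch of the pattern, not the actual threadpool/crossbeam code):

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // The "channel + unpark" shape: execute() enqueues a task, then wakes
    // the worker.
    let (tx, rx) = mpsc::channel::<Box<dyn FnOnce() + Send>>();

    let worker = thread::spawn(move || loop {
        match rx.try_recv() {
            Ok(task) => task(),
            // Queue empty: park until a sender unparks us.
            Err(mpsc::TryRecvError::Empty) => thread::park(),
            Err(mpsc::TryRecvError::Disconnected) => return,
        }
    });

    // "execute": push the task, then wake the worker. The expectation is
    // that unpark returns almost immediately and the *main* thread keeps
    // running; the observed behavior is that it switches to the worker.
    tx.send(Box::new(|| println!("task ran"))).unwrap();
    worker.thread().unpark();

    drop(tx); // close the channel so the worker loop exits
    worker.thread().unpark(); // wake it once more so it sees the disconnect
    worker.join().unwrap();
}
```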
If the two threads are the same priority why would it not switch to the thread that's being woken up? The thread being unparked also has work it can do.
The good reason not to switch is fairness. In my case, the main thread spawns ten background tasks in a row and then actually cancels them. If unpark also yields, my main thread never gets to use a full quantum, while all the background threads do. But I can see how the OS API could be implemented that way.
I don't believe the Windows scheduler is completely fair, but you should be able to lower the priority of the thread pool to prevent it from conflicting (assuming the CPU-bound tasks are running on the thread pool).
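For example, each pool worker could drop its own priority on startup; a sketch using the winapi crate (the choice of THREAD_PRIORITY_BELOW_NORMAL is just an assumption about what level is appropriate here):

```rust
use winapi::um::processthreadsapi::{GetCurrentThread, SetThreadPriority};
use winapi::um::winbase::THREAD_PRIORITY_BELOW_NORMAL;

// Call at the top of each thread-pool worker: a below-normal worker should
// not preempt the normal-priority main thread, even after a wake-up boost.
unsafe {
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_BELOW_NORMAL as i32);
}
```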
Internally in Windows, WakeConditionVariable dynamically (temporarily) boosts the waiting thread's priority by one, which likely makes it higher priority than the waker thread. Therefore it runs first.
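That theory is also testable: Windows lets you turn the dynamic boost off, per thread (SetThreadPriorityBoost) or for the whole process. A sketch, again with winapi (I haven't verified that this actually makes the 30-40ms go away):

```rust
use winapi::shared::minwindef::TRUE;
use winapi::um::processthreadsapi::GetCurrentProcess;
use winapi::um::winbase::SetProcessPriorityBoost;

// Disable dynamic priority boosts for all threads in the process; a thread
// woken by WakeConditionVariable then stays at its base priority.
unsafe {
    SetProcessPriorityBoost(GetCurrentProcess(), TRUE);
}
```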