I am investigating a particular instance of slowness in rust-analyzer, and, debugging through a threadpool and crossbeam-channel, I made the following observation:
If I pin my program to one core with
unsafe {
    let process = winapi::um::processthreadsapi::GetCurrentProcess();
    let mask = 1;
    winapi::um::winbase::SetProcessAffinityMask(process, mask);
}
then I repeatedly observe std::thread::Thread::unpark taking 30-40 milliseconds.
Specifically, sticking Instant::now/::elapsed around this routinely leads to printouts with unpark = 39.9091ms or some such (example). This seems very surprising to me: I would expect unpark to be syscall-length (hundreds of nanos), but this is waaaay longer.
Now, this is the bottom of the rabbit hole for me for now (cc @retep998), as I can't [patch.crates-io] libstd easily. Does anyone perchance know what might be happening here?
Note that I don't have a minimal reproducible example yet, as it requires patching rust-analyzer, threadpool and crossbeam.
I don't know why this is happening, but this whole "optimization" seems suspicious:
I suspect the overhead of this extra lock would outweigh the cost of just always calling notify_one, which on Windows translates to a call to WakeConditionVariable, which AFAIK doesn't always result in a syscall. Still, that doesn't really explain why it's so slow.
#[inline]
pub unsafe fn notify_one(&self) {
    let s = crate::time::Instant::now();
    c::WakeConditionVariable(self.inner.get());
    eprintln!("sys/notify_one {:?}", s.elapsed());
}
The above prints sys/notify_one 35.3309ms. WakeConditionVariable is a "libc" function (there was a bit of logic to load a fallback for Windows XP, but I've removed it in favor of just unconditionally calling the function).
Oh, I think I know why this is happening. You are limiting the process to a single logical processor, and when you call thread::unpark, the scheduler context-switches away from the current thread (which gives up its remaining time-slice). If other threads in the process are CPU-bound, they will each use up their entire time-slice (15-20ms by default), so what you are seeing corresponds to about two CPU-bound threads in the same process.
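If that theory is right, it should show up without rust-analyzer in the picture at all: pin the process to one core, keep a couple of CPU-bound threads spinning, and time an unpark from the main thread. An untested sketch, reusing your affinity snippet:

```rust
use std::thread;
use std::time::{Duration, Instant};

fn main() {
    // Pin the whole process to one logical CPU (as in the snippet above).
    unsafe {
        let process = winapi::um::processthreadsapi::GetCurrentProcess();
        winapi::um::winbase::SetProcessAffinityMask(process, 1);
    }

    // Two CPU-bound threads, each of which will burn a full time-slice
    // (~15-20ms) once the scheduler switches to it.
    for _ in 0..2 {
        thread::spawn(|| loop {
            std::hint::spin_loop();
        });
    }

    // A parked thread to be woken.
    let parked = thread::spawn(|| thread::park());
    thread::sleep(Duration::from_millis(50)); // give it time to actually park

    let t = Instant::now();
    parked.thread().unpark();
    // If unpark switches away and each busy thread then gets a full slice
    // before the main thread runs again, this should print ~30-40ms.
    println!("unpark took {:?}", t.elapsed());
}
```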
But why would it context-switch away? This is unpark, not park; I definitely don't want to context-switch away. The thing I am doing, at a high level, is threadpool.execute(move || some_task()). That execute uses a channel to send the task to the worker thread, and the channel uses unpark to actually wake the worker. I want the main thread to continue doing its work (which it definitely can do).
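For context, the shape of that path is roughly this (a simplified sketch of the pattern, not the actual threadpool/crossbeam code):

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // The "channel + unpark" shape: execute() enqueues a task, then wakes
    // the worker.
    let (tx, rx) = mpsc::channel::<Box<dyn FnOnce() + Send>>();

    let worker = thread::spawn(move || loop {
        match rx.try_recv() {
            Ok(task) => task(),
            // Queue empty: park until a sender unparks us.
            Err(mpsc::TryRecvError::Empty) => thread::park(),
            Err(mpsc::TryRecvError::Disconnected) => return,
        }
    });

    // "execute": push the task, then wake the worker. The expectation is
    // that unpark returns almost immediately and the *main* thread keeps
    // running; the observed behavior is that it switches to the worker.
    tx.send(Box::new(|| println!("task ran"))).unwrap();
    worker.thread().unpark();

    drop(tx); // close the channel so the worker loop exits
    worker.thread().unpark(); // wake it once more so it sees the disconnect
    worker.join().unwrap();
}
```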
If the two threads are the same priority why would it not switch to the thread that's being woken up? The thread being unparked also has work it can do.
The good reason not to switch is fairness. In my case, the main thread spawns ten background tasks in a row and then actually cancels them. If unpark also yields, my main thread never gets to use a full quantum, while all the background threads do. But I can see how the OS API could be implemented that way.
I don't believe the Windows scheduler is completely fair, but you should be able to lower the priority of the thread pool to prevent it from conflicting (assuming the CPU-bound tasks are running on the thread pool).
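For example, each pool worker could drop its own priority on startup; a sketch using the winapi crate (the choice of THREAD_PRIORITY_BELOW_NORMAL is just an assumption about what level is appropriate here):

```rust
use winapi::um::processthreadsapi::{GetCurrentThread, SetThreadPriority};
use winapi::um::winbase::THREAD_PRIORITY_BELOW_NORMAL;

// Call at the top of each thread-pool worker: a below-normal worker should
// not preempt the normal-priority main thread, even after a wake-up boost.
unsafe {
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_BELOW_NORMAL as i32);
}
```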
Internally in Windows, WakeConditionVariable dynamically (temporarily) boosts the waiting thread's priority by one, which likely makes it higher priority than the waker thread. Therefore it runs first.
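That theory is also testable: Windows lets you turn the dynamic boost off, per thread (SetThreadPriorityBoost) or for the whole process. A sketch, again with winapi (I haven't verified that this actually makes the 30-40ms go away):

```rust
use winapi::shared::minwindef::TRUE;
use winapi::um::processthreadsapi::GetCurrentProcess;
use winapi::um::winbase::SetProcessPriorityBoost;

// Disable dynamic priority boosts for all threads in the process; a thread
// woken by WakeConditionVariable then stays at its base priority.
unsafe {
    SetProcessPriorityBoost(GetCurrentProcess(), TRUE);
}
```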