Forum | Documentation | Website | Blog

Skip to content
Snippets Groups Projects
Commit 15a77c6f authored by Andrea Arcangeli's avatar Andrea Arcangeli Committed by Linus Torvalds
Browse files

userfaultfd: fix SIGBUS resulting from false rwsem wakeups

With >=32 CPUs the userfaultfd selftest triggered a graceful but
unexpected SIGBUS because VM_FAULT_RETRY was returned by
handle_userfault() despite the UFFDIO_COPY wasn't completed.

This seems caused by rwsem waking the thread blocked in
handle_userfault() and we can't run up_read() before the wait_event
sequence is complete.

Keeping the wait_even sequence identical to the first one, would require
running userfaultfd_must_wait() again to know if the loop should be
repeated, and it would also require retaking the rwsem and revalidating
the whole vma status.

It seems simpler to wait the targeted wakeup so that if false wakeups
materialize we still wait for our specific wakeup event, unless of
course there are signals or the uffd was released.

Debug code collecting the stack trace of the wakeup showed this:

  $ ./userfaultfd 100 99999
  nr_pages: 25600, nr_pages_per_cpu: 800
  bounces: 99998, mode: racing ver poll, userfaults: 32 35 90 232 30 1...
parent de182cc8
No related merge requests found
......@@ -63,6 +63,7 @@ struct userfaultfd_wait_queue {
struct uffd_msg msg;
wait_queue_t wq;
struct userfaultfd_ctx *ctx;
bool waken;
};
struct userfaultfd_wake_range {
......@@ -86,6 +87,12 @@ static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
if (len && (start > uwq->msg.arg.pagefault.address ||
start + len <= uwq->msg.arg.pagefault.address))
goto out;
WRITE_ONCE(uwq->waken, true);
/*
* The implicit smp_mb__before_spinlock in try_to_wake_up()
* renders uwq->waken visible to other CPUs before the task is
* waken.
*/
ret = wake_up_state(wq->private, mode);
if (ret)
/*
......@@ -264,6 +271,7 @@ int handle_userfault(struct vm_fault *vmf, unsigned long reason)
struct userfaultfd_wait_queue uwq;
int ret;
bool must_wait, return_to_userland;
long blocking_state;
BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
......@@ -334,10 +342,13 @@ int handle_userfault(struct vm_fault *vmf, unsigned long reason)
uwq.wq.private = current;
uwq.msg = userfault_msg(vmf->address, vmf->flags, reason);
uwq.ctx = ctx;
uwq.waken = false;
return_to_userland =
(vmf->flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
(FAULT_FLAG_USER|FAULT_FLAG_KILLABLE);
blocking_state = return_to_userland ? TASK_INTERRUPTIBLE :
TASK_KILLABLE;
spin_lock(&ctx->fault_pending_wqh.lock);
/*
......@@ -350,8 +361,7 @@ int handle_userfault(struct vm_fault *vmf, unsigned long reason)
* following the spin_unlock to happen before the list_add in
* __add_wait_queue.
*/
set_current_state(return_to_userland ? TASK_INTERRUPTIBLE :
TASK_KILLABLE);
set_current_state(blocking_state);
spin_unlock(&ctx->fault_pending_wqh.lock);
must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
......@@ -364,6 +374,29 @@ int handle_userfault(struct vm_fault *vmf, unsigned long reason)
wake_up_poll(&ctx->fd_wqh, POLLIN);
schedule();
ret |= VM_FAULT_MAJOR;
/*
* False wakeups can orginate even from rwsem before
* up_read() however userfaults will wait either for a
* targeted wakeup on the specific uwq waitqueue from
* wake_userfault() or for signals or for uffd
* release.
*/
while (!READ_ONCE(uwq.waken)) {
/*
* This needs the full smp_store_mb()
* guarantee as the state write must be
* visible to other CPUs before reading
* uwq.waken from other CPUs.
*/
set_current_state(blocking_state);
if (READ_ONCE(uwq.waken) ||
READ_ONCE(ctx->released) ||
(return_to_userland ? signal_pending(current) :
fatal_signal_pending(current)))
break;
schedule();
}
}
__set_current_state(TASK_RUNNING);
......
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment