  1. Mar 28, 2022
  2. Mar 18, 2022
  3. Mar 09, 2022
  4. Mar 08, 2022
  5. Feb 22, 2022
  6. Feb 17, 2022
  7. Feb 16, 2022
  8. Feb 02, 2022
  9. Dec 21, 2021
  10. Dec 03, 2021
  11. Nov 29, 2021
  12. Nov 15, 2021
    • blk-mq: cancel blk-mq dispatch work in both blk_cleanup_queue and disk_release() · 2a19b28f
      Ming Lei authored
      
      To avoid slowing down queue teardown, we don't call
      blk_mq_quiesce_queue() in blk_cleanup_queue(); instead, canceling the
      dispatch work is delayed until blk_release_queue().
      
      However, this approach caused a kernel oops [1], reported by Changhui. The
      log shows that the scsi_device can be freed before blk_release_queue()
      runs, which is expected, since the scsi_device is released after the SCSI
      disk is closed and the scsi_device is removed.
      
      Fix the issue by canceling the blk-mq dispatch work in both
      blk_cleanup_queue() and disk_release():
      
      1) when disk_release() runs, the disk has been closed and any synchronous
      dispatch activity has finished, so canceling the dispatch work is enough
      to quiesce filesystem I/O dispatch activity.
      
      2) blk_cleanup_queue() only has to care about passthrough requests, and a
      passthrough request is always explicitly allocated and freed by its
      caller, so once the queue is frozen, all synchronous dispatch activity
      for passthrough requests has finished; canceling the dispatch work is
      then enough to prevent any further dispatch activity (see the sketch
      below).
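
      Both call sites can share one helper that cancels the dispatch work of
      every hardware queue. The following is a minimal sketch of that idea in
      kernel-style C, not the verbatim patch; the helper name
      blk_mq_cancel_work_sync() and its exact placement are assumptions.

      #include <linux/blkdev.h>
      #include <linux/blk-mq.h>
      #include <linux/workqueue.h>

      /*
       * Cancel any pending or running dispatch work for every hardware queue.
       * Safe to call from both blk_cleanup_queue() and disk_release() once no
       * new dispatch activity can be started.
       */
      void blk_mq_cancel_work_sync(struct request_queue *q)
      {
              struct blk_mq_hw_ctx *hctx;
              int i;

              if (!queue_is_mq(q))
                      return;

              /* stop any deferred requeue processing first */
              cancel_delayed_work_sync(&q->requeue_work);

              /* then cancel the per-hctx dispatch (run) work */
              queue_for_each_hw_ctx(q, hctx, i)
                      cancel_delayed_work_sync(&hctx->run_work);
      }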
      
      [1] kernel panic log
      [12622.769416] BUG: kernel NULL pointer dereference, address: 0000000000000300
      [12622.777186] #PF: supervisor read access in kernel mode
      [12622.782918] #PF: error_code(0x0000) - not-present page
      [12622.788649] PGD 0 P4D 0
      [12622.791474] Oops: 0000 [#1] PREEMPT SMP PTI
      [12622.796138] CPU: 10 PID: 744 Comm: kworker/10:1H Kdump: loaded Not tainted 5.15.0+ #1
      [12622.804877] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015
      [12622.813321] Workqueue: kblockd blk_mq_run_work_fn
      [12622.818572] RIP: 0010:sbitmap_get+0x75/0x190
      [12622.823336] Code: 85 80 00 00 00 41 8b 57 08 85 d2 0f 84 b1 00 00 00 45 31 e4 48 63 cd 48 8d 1c 49 48 c1 e3 06 49 03 5f 10 4c 8d 6b 40 83 f0 01 <48> 8b 33 44 89 f2 4c 89 ef 0f b6 c8 e8 fa f3 ff ff 83 f8 ff 75 58
      [12622.844290] RSP: 0018:ffffb00a446dbd40 EFLAGS: 00010202
      [12622.850120] RAX: 0000000000000001 RBX: 0000000000000300 RCX: 0000000000000004
      [12622.858082] RDX: 0000000000000006 RSI: 0000000000000082 RDI: ffffa0b7a2dfe030
      [12622.866042] RBP: 0000000000000004 R08: 0000000000000001 R09: ffffa0b742721334
      [12622.874003] R10: 0000000000000008 R11: 0000000000000008 R12: 0000000000000000
      [12622.881964] R13: 0000000000000340 R14: 0000000000000000 R15: ffffa0b7a2dfe030
      [12622.889926] FS:  0000000000000000(0000) GS:ffffa0baafb40000(0000) knlGS:0000000000000000
      [12622.898956] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [12622.905367] CR2: 0000000000000300 CR3: 0000000641210001 CR4: 00000000001706e0
      [12622.913328] Call Trace:
      [12622.916055]  <TASK>
      [12622.918394]  scsi_mq_get_budget+0x1a/0x110
      [12622.922969]  __blk_mq_do_dispatch_sched+0x1d4/0x320
      [12622.928404]  ? pick_next_task_fair+0x39/0x390
      [12622.933268]  __blk_mq_sched_dispatch_requests+0xf4/0x140
      [12622.939194]  blk_mq_sched_dispatch_requests+0x30/0x60
      [12622.944829]  __blk_mq_run_hw_queue+0x30/0xa0
      [12622.949593]  process_one_work+0x1e8/0x3c0
      [12622.954059]  worker_thread+0x50/0x3b0
      [12622.958144]  ? rescuer_thread+0x370/0x370
      [12622.962616]  kthread+0x158/0x180
      [12622.966218]  ? set_kthread_struct+0x40/0x40
      [12622.970884]  ret_from_fork+0x22/0x30
      [12622.974875]  </TASK>
      [12622.977309] Modules linked in: scsi_debug rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs sunrpc dm_multipath intel_rapl_msr intel_rapl_common dell_wmi_descriptor sb_edac rfkill video x86_pkg_temp_thermal intel_powerclamp dcdbas coretemp kvm_intel kvm mgag200 irqbypass i2c_algo_bit rapl drm_kms_helper ipmi_ssif intel_cstate intel_uncore syscopyarea sysfillrect sysimgblt fb_sys_fops pcspkr cec mei_me lpc_ich mei ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sr_mod cdrom sd_mod t10_pi sg ixgbe ahci libahci crct10dif_pclmul crc32_pclmul crc32c_intel libata megaraid_sas ghash_clmulni_intel tg3 wdat_wdt mdio dca wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_debug]
      
      Reported-by: ChanghuiZhong <czhong@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: linux-scsi@vger.kernel.org
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20211116014343.610501-1-ming.lei@redhat.com
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  13. Nov 09, 2021
  14. Nov 05, 2021
  15. Nov 04, 2021
    • block: update __register_blkdev() probe documentation · 26e06f5b
      Luis Chamberlain authored
      
      __register_blkdev() is used to register a probe callback, and that
      callback is typically used to call add_disk(). Now that we are able to
      capture errors from add_disk(), we need to fix up those probe calls so
      that they clean up resources when add_disk() fails.
      
      We don't extend the probe call to return the error, because:
      
      1) we'd always have to special-case the situation where the disk
         was already present, as otherwise concurrent requests to
         open an existing block device would fail, and this would be
         a userspace-visible change
      2) the error from ilookup() on blkdev_get_no_open() is sufficient
      3) the only thing the probe call is used for is to support
         pre-devtmpfs, pre-udev semantics that want to create disks when
         their pre-created device node is accessed, and so we don't care
         about probe failures there.
      
      Expand the documentation for the probe callback to ensure that users
      clean up resources when add_disk() is used, and to clarify that this
      interface may be removed in the future.
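
      For illustration, a probe callback that follows this guidance might look
      like the sketch below. The mydrv_* names, MYDRV_MAJOR and
      mydrv_alloc_disk() are hypothetical; only the add_disk() error check and
      the cleanup on failure reflect what this change asks of callers.

      #include <linux/blkdev.h>

      /* Hypothetical probe callback; mydrv_alloc_disk() stands in for a
       * driver's own disk allocation. Probe errors are not propagated to the
       * opener, so the callback must clean up on its own if add_disk() fails. */
      static void mydrv_probe(dev_t devt)
      {
              struct gendisk *disk = mydrv_alloc_disk(MINOR(devt));

              if (!disk)
                      return;

              if (add_disk(disk)) {
                      /* undo everything that was set up for this disk */
                      put_disk(disk);
              }
      }

      /* Registration (error handling omitted):
       *
       *     __register_blkdev(MYDRV_MAJOR, "mydrv", mydrv_probe);
       */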
      
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      Link: https://lore.kernel.org/r/20211103230437.1639990-12-mcgrof@kernel.org
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  16. Oct 26, 2021
  17. Oct 21, 2021
  18. Oct 19, 2021
  19. Oct 18, 2021
  20. Oct 17, 2021
  21. Oct 15, 2021
  22. Oct 02, 2021
  23. Sep 07, 2021
    • block: genhd: don't call blkdev_show() with major_names_lock held · dfbb3409
      Tetsuo Handa authored
      If CONFIG_BLK_DEV_LOOP && CONFIG_MTD (at least; there might be other
      combinations), lockdep complains about a circular locking dependency at
      __loop_clr_fd(), because major_names_lock serves as a hub that aggregates
      locking dependencies across multiple block modules.
      
       ======================================================
       WARNING: possible circular locking dependency detected
       5.14.0+ #757 Tainted: G            E
       ------------------------------------------------------
       systemd-udevd/7568 is trying to acquire lock:
       ffff88800f334d48 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x70/0x560
      
       but task is already holding lock:
       ffff888014a7d4a0 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x4d/0x400 [loop]
      
       which lock already depends on the new lock.
      
       the existing dependency chain (in reverse order) is:
      
       -> #6 (&lo->lo_mutex){+.+.}-{3:3}:
              lock_acquire+0xbe/0x1f0
              __mutex_lock_common+0xb6/0xe10
              mutex_lock_killable_nested+0x17/0x20
              lo_open+0x23/0x50 [loop]
              blkdev_get_by_dev+0x199/0x540
              blkdev_open+0x58/0x90
              do_dentry_open+0x144/0x3a0
              path_openat+0xa57/0xda0
              do_filp_open+0x9f/0x140
              do_sys_openat2+0x71/0x150
              __x64_sys_openat+0x78/0xa0
              do_syscall_64+0x3d/0xb0
              entry_SYSCALL_64_after_hwframe+0x44/0xae
      
       -> #5 (&disk->open_mutex){+.+.}-{3:3}:
              lock_acquire+0xbe/0x1f0
              __mutex_lock_common+0xb6/0xe10
              mutex_lock_nested+0x17/0x20
              bd_register_pending_holders+0x20/0x100
              device_add_disk+0x1ae/0x390
              loop_add+0x29c/0x2d0 [loop]
              blk_request_module+0x5a/0xb0
              blkdev_get_no_open+0x27/0xa0
              blkdev_get_by_dev+0x5f/0x540
              blkdev_open+0x58/0x90
              do_dentry_open+0x144/0x3a0
              path_openat+0xa57/0xda0
              do_filp_open+0x9f/0x140
              do_sys_openat2+0x71/0x150
              __x64_sys_openat+0x78/0xa0
              do_syscall_64+0x3d/0xb0
              entry_SYSCALL_64_after_hwframe+0x44/0xae
      
       -> #4 (major_names_lock){+.+.}-{3:3}:
              lock_acquire+0xbe/0x1f0
              __mutex_lock_common+0xb6/0xe10
              mutex_lock_nested+0x17/0x20
              blkdev_show+0x19/0x80
              devinfo_show+0x52/0x60
              seq_read_iter+0x2d5/0x3e0
              proc_reg_read_iter+0x41/0x80
              vfs_read+0x2ac/0x330
              ksys_read+0x6b/0xd0
              do_syscall_64+0x3d/0xb0
              entry_SYSCALL_64_after_hwframe+0x44/0xae
      
       -> #3 (&p->lock){+.+.}-{3:3}:
              lock_acquire+0xbe/0x1f0
              __mutex_lock_common+0xb6/0xe10
              mutex_lock_nested+0x17/0x20
              seq_read_iter+0x37/0x3e0
              generic_file_splice_read+0xf3/0x170
              splice_direct_to_actor+0x14e/0x350
              do_splice_direct+0x84/0xd0
              do_sendfile+0x263/0x430
              __se_sys_sendfile64+0x96/0xc0
              do_syscall_64+0x3d/0xb0
              entry_SYSCALL_64_after_hwframe+0x44/0xae
      
       -> #2 (sb_writers#3){.+.+}-{0:0}:
              lock_acquire+0xbe/0x1f0
              lo_write_bvec+0x96/0x280 [loop]
              loop_process_work+0xa68/0xc10 [loop]
              process_one_work+0x293/0x480
              worker_thread+0x23d/0x4b0
              kthread+0x163/0x180
              ret_from_fork+0x1f/0x30
      
       -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
              lock_acquire+0xbe/0x1f0
              process_one_work+0x280/0x480
              worker_thread+0x23d/0x4b0
              kthread+0x163/0x180
              ret_from_fork+0x1f/0x30
      
       -> #0 ((wq_completion)loop0){+.+.}-{0:0}:
              validate_chain+0x1f0d/0x33e0
              __lock_acquire+0x92d/0x1030
              lock_acquire+0xbe/0x1f0
              flush_workqueue+0x8c/0x560
              drain_workqueue+0x80/0x140
              destroy_workqueue+0x47/0x4f0
              __loop_clr_fd+0xb4/0x400 [loop]
              blkdev_put+0x14a/0x1d0
              blkdev_close+0x1c/0x20
              __fput+0xfd/0x220
              task_work_run+0x69/0xc0
              exit_to_user_mode_prepare+0x1ce/0x1f0
              syscall_exit_to_user_mode+0x26/0x60
              do_syscall_64+0x4c/0xb0
              entry_SYSCALL_64_after_hwframe+0x44/0xae
      
       other info that might help us debug this:
      
       Chain exists of:
         (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(&lo->lo_mutex);
                                      lock(&disk->open_mutex);
                                      lock(&lo->lo_mutex);
         lock((wq_completion)loop0);
      
        *** DEADLOCK ***
      
       2 locks held by systemd-udevd/7568:
        #0: ffff888012554128 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_put+0x4c/0x1d0
        #1: ffff888014a7d4a0 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x4d/0x400 [loop]
      
       stack backtrace:
       CPU: 0 PID: 7568 Comm: systemd-udevd Tainted: G            E     5.14.0+ #757
       Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
       Call Trace:
        dump_stack_lvl+0x79/0xbf
        print_circular_bug+0x5d6/0x5e0
        ? stack_trace_save+0x42/0x60
        ? save_trace+0x3d/0x2d0
        check_noncircular+0x10b/0x120
        validate_chain+0x1f0d/0x33e0
        ? __lock_acquire+0x953/0x1030
        ? __lock_acquire+0x953/0x1030
        __lock_acquire+0x92d/0x1030
        ? flush_workqueue+0x70/0x560
        lock_acquire+0xbe/0x1f0
        ? flush_workqueue+0x70/0x560
        flush_workqueue+0x8c/0x560
        ? flush_workqueue+0x70/0x560
        ? sched_clock_cpu+0xe/0x1a0
        ? drain_workqueue+0x41/0x140
        drain_workqueue+0x80/0x140
        destroy_workqueue+0x47/0x4f0
        ? blk_mq_freeze_queue_wait+0xac/0xd0
        __loop_clr_fd+0xb4/0x400 [loop]
        ? __mutex_unlock_slowpath+0x35/0x230
        blkdev_put+0x14a/0x1d0
        blkdev_close+0x1c/0x20
        __fput+0xfd/0x220
        task_work_run+0x69/0xc0
        exit_to_user_mode_prepare+0x1ce/0x1f0
        syscall_exit_to_user_mode+0x26/0x60
        do_syscall_64+0x4c/0xb0
        entry_SYSCALL_64_after_hwframe+0x44/0xae
       RIP: 0033:0x7f0fd4c661f7
       Code: 00 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 13 fc ff ff
       RSP: 002b:00007ffd1c9e9fd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
       RAX: 0000000000000000 RBX: 00007f0fd46be6c8 RCX: 00007f0fd4c661f7
       RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000006
       RBP: 0000000000000006 R08: 000055fff1eaf400 R09: 0000000000000000
       R10: 00007f0fd46be6c8 R11: 0000000000000246 R12: 0000000000000000
       R13: 0000000000000000 R14: 0000000000002f08 R15: 00007ffd1c9ea050
      
      Commit 1c500ad7 ("loop: reduce the loop_ctl_mutex scope") breaks the
      "loop_ctl_mutex => &lo->lo_mutex" dependency chain, but enabling a
      different block module still forms a circular locking dependency because
      of the shared major_names_lock mutex.
      
      The simplest fix would be to call the probe function without holding
      major_names_lock [1], but Christoph Hellwig does not like that idea.
      Therefore, instead of holding major_names_lock in blkdev_show(),
      introduce a separate lock for blkdev_show() in order to break the
      "sb_writers#$N => &p->lock => major_names_lock" dependency chain (see
      the sketch below).
      
      Link: https://lkml.kernel.org/r/b2af8a5b-3c1b-204e-7f56-bea0b15848d6@i-love.sakura.ne.jp [1]
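
      The shape of the change, roughly: registration keeps using
      major_names_lock, while blkdev_show() walks major_names[] under its own
      lock, so the seq_file read path never takes major_names_lock. A minimal
      sketch follows; the new lock's name and the surrounding details are
      assumptions, and major_names[], major_to_index() and struct
      blk_major_name refer to the existing block/genhd.c structures.

      #include <linux/spinlock.h>
      #include <linux/seq_file.h>

      /* Separate lock protecting the major_names[] table for readers such as
       * blkdev_show(), so the seq_file path no longer needs major_names_lock. */
      static DEFINE_SPINLOCK(major_names_spinlock);

      void blkdev_show(struct seq_file *seqf, off_t offset)
      {
              struct blk_major_name *dp;

              spin_lock(&major_names_spinlock);
              for (dp = major_names[major_to_index(offset)]; dp; dp = dp->next)
                      if (dp->major == offset)
                              seq_printf(seqf, "%3d %s\n", dp->major, dp->name);
              spin_unlock(&major_names_spinlock);
      }
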
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Link: https://lore.kernel.org/r/18a02da2-0bf3-550e-b071-2b4ab13c49f0@i-love.sakura.ne.jp
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  24. Aug 24, 2021