  1. Feb 06, 2023
  2. Feb 01, 2023
  3. Jan 14, 2023
    • blk-mq: fix possible memleak when register 'hctx' failed · e8022da1
      Ye Bin authored
      [ Upstream commit 4b7a21c5 ]
      
      The following issue shows up when running a fault injection test:
      unreferenced object 0xffff888132a9f400 (size 512):
        comm "insmod", pid 308021, jiffies 4324277909 (age 509.733s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 08 f4 a9 32 81 88 ff ff  ...........2....
          08 f4 a9 32 81 88 ff ff 00 00 00 00 00 00 00 00  ...2............
        backtrace:
          [<00000000e8952bb4>] kmalloc_node_trace+0x22/0xa0
          [<00000000f9980e0f>] blk_mq_alloc_and_init_hctx+0x3f1/0x7e0
          [<000000002e719efa>] blk_mq_realloc_hw_ctxs+0x1e6/0x230
          [<000000004f1fda40>] blk_mq_init_allocated_queue+0x27e/0x910
          [<00000000287123ec>] __blk_mq_alloc_disk+0x67/0xf0
          [<00000000a2a34657>] 0xffffffffa2ad310f
          [<00000000b173f718>] 0xffffffffa2af824a
          [<0000000095a1dabb>] do_one_initcall+0x87/0x2a0
          [<00000000f32fdf93>] do_init_module+0xdf/0x320
          [<00000000cbe8541e>] load_module+0x3006/0x3390
          [<0000000069ed1bdb>] __do_sys_finit_module+0x113/0x1b0
          [<00000000a1a29ae8>] do_syscall_64+0x35/0x80
          [<000000009cd878b0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      The fault injection context is as follows:
       kobject_add
       blk_mq_register_hctx
       blk_mq_sysfs_register
       blk_register_queue
       device_add_disk
       null_add_dev.part.0 [null_blk]
      
      Since 'blk_mq_register_hctx' may have already added some objects before
      failing partway through, and no rollback is performed, the caller cannot
      know which objects were added successfully. To solve this, roll back the
      already-added objects when a failure occurs partway through
      'blk_mq_register_hctx'.
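
      A hedged sketch of the fix's shape (essentially the rollback pattern;
      helper names as in block/blk-mq-sysfs.c): undo the kobject_add() calls
      that already succeeded before returning the error, so the caller sees
      an all-or-nothing registration:

          static int blk_mq_register_hctx(struct blk_mq_hw_ctx *hctx)
          {
                  struct request_queue *q = hctx->queue;
                  struct blk_mq_ctx *ctx;
                  int i, j, ret;

                  if (!hctx->nr_ctx)
                          return 0;

                  ret = kobject_add(&hctx->kobj, q->mq_kobj, "%u", hctx->queue_num);
                  if (ret)
                          return ret;

                  hctx_for_each_ctx(hctx, ctx, i) {
                          ret = kobject_add(&ctx->kobj, &hctx->kobj, "cpu%u", ctx->cpu);
                          if (ret)
                                  goto out_unregister;
                  }
                  return 0;

          out_unregister:
                  /* roll back the ctx kobjects added before the failure */
                  hctx_for_each_ctx(hctx, ctx, j) {
                          if (j < i)
                                  kobject_del(&ctx->kobj);
                  }
                  kobject_del(&hctx->kobj);
                  return ret;
          }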
      
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20221117022940.873959-1-yebin@huaweicloud.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  4. Dec 08, 2022
    • block: unhash blkdev part inode when the part is deleted · 5f2f7756
      Ming Lei authored
      v5.11 changed the blkdev lookup mechanism completely with commit
      22ae8ce8 ("block: simplify bdev/disk lookup in blkdev_get"),
      and a small part of that change is to unhash the part's bdev inode
      when deleting the partition. It turns out this change fixes one
      nasty issue in case of BLOCK_EXT_MAJOR:
      
      1) when one partition is deleted & closed, disk_put_part() is always
      called before bdput(bdev), see blkdev_put(); so the part's devt can
      be freed & re-used before the inode is dropped

      2) then a new partition with the same devt can be created just before
      the inode in 1) is dropped; the old inode/bdev structure from 1) is
      then re-used for the new partition, which causes a use-after-free and
      a kernel panic.
      
      It isn't possible to backport the whole big patchset of "merge struct
      block_device and struct hd_struct v4" to address this issue:

      https://lore.kernel.org/linux-block/20201128161510.347752-1-hch@lst.de/

      So fix it by unhashing the part's bdev inode in delete_partition();
      this is actually aligned with v5.11+'s behavior.
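
      A hedged sketch of the idea using the pre-5.11 helpers (not the literal
      backport): drop the cached bdev inode as soon as the partition goes
      away, so a re-used devt cannot resolve to the stale inode:

          static void delete_partition(struct gendisk *disk, struct hd_struct *part)
          {
                  /* existing removal steps elided */

                  /*
                   * Unhash the bdev inode now: the part's devt may be freed
                   * and re-used before the old inode is finally dropped, and
                   * a new partition with the same devt must not find it.
                   */
                  bdev_unhash_inode(part_devt(part));
          }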
      
      Reported-by: Shiwei Cui <cuishw@inspur.com>
      Tested-by: Shiwei Cui <cuishw@inspur.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Acked-by: Jens Axboe <axboe@kernel.dk>
  5. Dec 02, 2022
    • block, bfq: fix null pointer dereference in bfq_bio_bfqg() · fa5f2c72
      Yu Kuai authored
      [ Upstream commit f02be900 ]
      
      Our test found the following problem in kernel 5.10, and the same
      problem should exist in mainline:
      
      BUG: kernel NULL pointer dereference, address: 0000000000000094
      PGD 0 P4D 0
      Oops: 0000 [#1] SMP
      CPU: 7 PID: 155 Comm: kworker/7:1 Not tainted 5.10.0-01932-g19e0ace2ca1d-dirty 4
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-b4
      Workqueue: kthrotld blk_throtl_dispatch_work_fn
      RIP: 0010:bfq_bio_bfqg+0x52/0xc0
      Code: 94 00 00 00 00 75 2e 48 8b 40 30 48 83 05 35 06 c8 0b 01 48 85 c0 74 3d 4b
      RSP: 0018:ffffc90001a1fba0 EFLAGS: 00010002
      RAX: ffff888100d60400 RBX: ffff8881132e7000 RCX: 0000000000000000
      RDX: 0000000000000017 RSI: ffff888103580a18 RDI: ffff888103580a18
      RBP: ffff8881132e7000 R08: 0000000000000000 R09: ffffc90001a1fe10
      R10: 0000000000000a20 R11: 0000000000034320 R12: 0000000000000000
      R13: ffff888103580a18 R14: ffff888114447000 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff88881fdc0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000094 CR3: 0000000100cdb000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       bfq_bic_update_cgroup+0x3c/0x350
       ? ioc_create_icq+0x42/0x270
       bfq_init_rq+0xfd/0x1060
       bfq_insert_requests+0x20f/0x1cc0
       ? ioc_create_icq+0x122/0x270
       blk_mq_sched_insert_requests+0x86/0x1d0
       blk_mq_flush_plug_list+0x193/0x2a0
       blk_flush_plug_list+0x127/0x170
       blk_finish_plug+0x31/0x50
       blk_throtl_dispatch_work_fn+0x151/0x190
       process_one_work+0x27c/0x5f0
       worker_thread+0x28b/0x6b0
       ? rescuer_thread+0x590/0x590
       kthread+0x153/0x1b0
       ? kthread_flush_work+0x170/0x170
       ret_from_fork+0x1f/0x30
      Modules linked in:
      CR2: 0000000000000094
      ---[ end trace e2e59ac014314547 ]---
      RIP: 0010:bfq_bio_bfqg+0x52/0xc0
      Code: 94 00 00 00 00 75 2e 48 8b 40 30 48 83 05 35 06 c8 0b 01 48 85 c0 74 3d 4b
      RSP: 0018:ffffc90001a1fba0 EFLAGS: 00010002
      RAX: ffff888100d60400 RBX: ffff8881132e7000 RCX: 0000000000000000
      RDX: 0000000000000017 RSI: ffff888103580a18 RDI: ffff888103580a18
      RBP: ffff8881132e7000 R08: 0000000000000000 R09: ffffc90001a1fe10
      R10: 0000000000000a20 R11: 0000000000034320 R12: 0000000000000000
      R13: ffff888103580a18 R14: ffff888114447000 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff88881fdc0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000094 CR3: 0000000100cdb000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      The root cause is quite complex:
      
      1) use bfq elevator for the test device.
      2) create a cgroup CG
      3) config blk throtl in CG
      
         blkg_conf_prep
          blkg_create
      
      4) create a thread T1 and issue async io in CG:
      
         bio_init
          bio_associate_blkg
         ...
         submit_bio
          submit_bio_noacct
           blk_throtl_bio -> io is throttled
           // io submit is done
      
      5) switch elevator:
      
         bfq_exit_queue
          blkcg_deactivate_policy
           list_for_each_entry(blkg, &q->blkg_list, q_node)
            blkg->pd[] = NULL
            // bfq policy is removed
      
      6) thread T1 exits, then remove the cgroup CG:
      
         blkcg_unpin_online
          blkcg_destroy_blkgs
           blkg_destroy
            list_del_init(&blkg->q_node)
            // blkg is removed from queue list
      
      7) switch elevator back to bfq
      
       bfq_init_queue
        bfq_create_group_hierarchy
         blkcg_activate_policy
          list_for_each_entry_reverse(blkg, &q->blkg_list)
           // blkg is removed from list, hence bfq policy is still NULL
      
      8) throttled io is dispatched to bfq:
      
       bfq_insert_requests
        bfq_init_rq
         bfq_bic_update_cgroup
          bfq_bio_bfqg
           bfqg = blkg_to_bfqg(blkg)
           // bfqg is NULL because bfq policy is NULL
      
      The problem is only possible in bfq, because only bfq can be deactivated
      and activated while the queue is online; other policies can only be
      deactivated when the device is removed.
      
      Fix the problem in bfq by checking if blkg is online before calling
      blkg_to_bfqg().
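
      A hedged sketch of the check, mirroring the upstream helper: skip blkgs
      that have gone offline (their pd[] may already be NULL) and fall back to
      the parent:

          struct bfq_group *bfq_bio_bfqg(struct bfq_data *bfqd, struct bio *bio)
          {
                  struct blkcg_gq *blkg = bio->bi_blkg;
                  struct bfq_group *bfqg;

                  while (blkg) {
                          if (!blkg->online) {
                                  /* removed from q->blkg_list: its policy
                                   * data may be NULL, don't touch it */
                                  blkg = blkg->parent;
                                  continue;
                          }
                          bfqg = blkg_to_bfqg(blkg);
                          if (bfqg->online) {
                                  bio_associate_blkg_from_css(bio, &blkg->blkcg->css);
                                  return bfqg;
                          }
                          blkg = blkg->parent;
                  }
                  return bfqd->root_group;
          }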
      
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20221108103434.2853269-1-yukuai1@huaweicloud.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  6. Nov 25, 2022
    • block: sed-opal: kmalloc the cmd/resp buffers · 58636b5f
      Serge Semin authored
      [ Upstream commit f829230d ]
      
      In accordance with [1], DMA-able memory buffers must be
      cacheline-aligned, otherwise the cache write-back and invalidation
      performed during the mapping may cause the adjacent data to be lost.
      This is specifically required on DMA-noncoherent platforms [2]. Seeing
      that the opal_dev.{cmd,resp} buffers are implicitly used for DMA in the
      NVMe and SCSI/SD drivers via the nvme_sec_submit() and sd_sec_submit()
      methods respectively, they must be cacheline-aligned to prevent the
      denoted problem. One of the options to guarantee that is to kmalloc the
      buffers [2]. Let's explicitly allocate them, then, instead of embedding
      them in the opal_dev structure instance.
      
      Note this fix was inspired by the commit c94b7f9b ("nvme-hwmon:
      kmalloc the NVME SMART log buffer").
      
      [1] Documentation/core-api/dma-api.rst
      [2] Documentation/core-api/dma-api-howto.rst
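
      A hedged sketch of the allocation change (IO_BUFFER_LENGTH is the buffer
      size macro in block/sed-opal.c):

          struct opal_dev {
                  /* was: u8 cmd[IO_BUFFER_LENGTH]; u8 resp[IO_BUFFER_LENGTH]; */
                  u8 *cmd;        /* kmalloc'ed: cacheline-aligned, DMA-safe */
                  u8 *resp;
          };

          dev->cmd = kmalloc(IO_BUFFER_LENGTH, GFP_KERNEL);
          dev->resp = kmalloc(IO_BUFFER_LENGTH, GFP_KERNEL);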
      
      Fixes: 455a7b23 ("block: Add Sed-opal library")
      Si...
  7. Nov 10, 2022
  8. Oct 30, 2022
  9. Oct 26, 2022
  10. Aug 31, 2022
    • blk-mq: fix io hung due to missing commit_rqs · 3ddbd090
      Yu Kuai authored
      commit 65fac0d5 upstream.
      
      Currently, in virtio_scsi, if 'bd->last' is not set to true while
      dispatching a request, that io will stay in the driver's queue, and the
      driver will wait for the block layer to dispatch more rqs. However, if
      the block layer fails to dispatch more rqs, it should trigger commit_rqs
      to inform the driver.
      
      There is a problem in blk_mq_try_issue_list_directly() that commit_rqs
      won't be called:
      
      // assume that queue_depth is set to 1, list contains two rq
      blk_mq_try_issue_list_directly
       blk_mq_request_issue_directly
       // dispatch first rq
       // last is false
        __blk_mq_try_issue_directly
         blk_mq_get_dispatch_budget
         // succeed to get first budget
         __blk_mq_issue_directly
          scsi_queue_rq
           cmd->flags |= SCMD_LAST
            virtscsi_queuecommand
             kick = (sc->flags & SCMD_LAST) != 0
             // kick is false, first rq won't issue to disk
       queued++
      
       blk_mq_request_issue_directly
       // dispatch second rq
        __blk_mq_try_issue_directly
         blk_mq_get_dispatch_budget
         // failed to get second budget
       ret == BLK_STS_RESOURCE
        blk_mq_request_bypass_insert
       // errors is still 0
      
       if (!list_empty(list) || errors && ...)
        // won't pass, commit_rqs won't be called
      
      In this situation, the first rq relies on the second rq to be
      dispatched, while the second rq relies on the first rq to complete, so
      they both hang.

      Fix the problem by also treating 'BLK_STS_*RESOURCE' as 'errors', since
      it means the request was not queued successfully.

      The same problem exists in blk_mq_dispatch_rq_list(), where
      'BLK_STS_*RESOURCE' can't be treated as 'errors'; fix it there by
      calling commit_rqs if queue_rq returns 'BLK_STS_*RESOURCE'.
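
      A hedged sketch of the first half of the fix (helper names as in
      blk-mq.c; loop context elided): count *RESOURCE outcomes as errors so
      the trailing commit_rqs() check fires:

          /* inside the dispatch loop of blk_mq_try_issue_list_directly() */
          ret = blk_mq_request_issue_directly(rq, list_empty(list));
          switch (ret) {
          case BLK_STS_OK:
                  queued++;
                  break;
          case BLK_STS_RESOURCE:
          case BLK_STS_DEV_RESOURCE:
                  errors++;       /* previously not counted here */
                  blk_mq_request_bypass_insert(rq, false, list_empty(list));
                  break;          /* stop issuing; out of budget */
          default:
                  errors++;
                  blk_mq_end_request(rq, ret);
                  break;
          }

          /* after the loop: with errors counted, the check now fires and the
           * driver gets its commit_rqs() kick */
          if ((!list_empty(list) || errors) &&
              hctx->queue->mq_ops->commit_rqs && queued)
                  hctx->queue->mq_ops->commit_rqs(hctx);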
      
      Fixes: d666ba98 ("blk-mq: add mq_ops->commit_rqs()")
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20220726122224.1790882-1-yukuai1@huaweicloud.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  11. Aug 21, 2022
  12. Jun 22, 2022
    • block: Fix handling of offline queues in blk_mq_alloc_request_hctx() · 7fa28a7c
      Bart Van Assche authored
      [ Upstream commit 14dc7a18 ]
      
      This patch prevents test nvme/004 from triggering the following:
      
      UBSAN: array-index-out-of-bounds in block/blk-mq.h:135:9
      index 512 is out of range for type 'long unsigned int [512]'
      Call Trace:
       show_stack+0x52/0x58
       dump_stack_lvl+0x49/0x5e
       dump_stack+0x10/0x12
       ubsan_epilogue+0x9/0x3b
       __ubsan_handle_out_of_bounds.cold+0x44/0x49
       blk_mq_alloc_request_hctx+0x304/0x310
       __nvme_submit_sync_cmd+0x70/0x200 [nvme_core]
       nvmf_connect_io_queue+0x23e/0x2a0 [nvme_fabrics]
       nvme_loop_connect_io_queues+0x8d/0xb0 [nvme_loop]
       nvme_loop_create_ctrl+0x58e/0x7d0 [nvme_loop]
       nvmf_create_ctrl+0x1d7/0x4d0 [nvme_fabrics]
       nvmf_dev_write+0xae/0x111 [nvme_fabrics]
       vfs_write+0x144/0x560
       ksys_write+0xb7/0x140
       __x64_sys_write+0x42/0x50
       do_syscall_64+0x35/0x80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
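
      The trace points at indexing the per-CPU ctx array with an out-of-range
      CPU: cpumask_first_and() returns nr_cpu_ids when no online CPU remains
      in the hctx's mask. A hedged sketch of the added checks in
      blk_mq_alloc_request_hctx() (helper names as in blk-mq):

          if (!blk_mq_hw_queue_mapped(data.hctx))
                  goto out_queue_exit;
          cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
          if (cpu >= nr_cpu_ids)
                  goto out_queue_exit;    /* offline queue: don't index ctx[] */
          data.ctx = __blk_mq_get_ctx(q, cpu);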
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ming Lei <ming.lei@redhat.com>
      Fixes: 20e4d813 ("b...
  13. Jun 09, 2022
    • block: fix bio_clone_blkg_association() to associate with proper blkcg_gq · 6b03dc67
      Jan Kara authored
      commit 22b106e5 upstream.
      
      Commit d92c370a ("block: really clone the block cgroup in
      bio_clone_blkg_association") changed bio_clone_blkg_association() to
      just clone bio->bi_blkg reference from source to destination bio. This
      is however wrong if the source and destination bios are against
      different block devices because struct blkcg_gq is different for each
      bdev-blkcg pair. This will result in IOs being accounted (and throttled
      as a result) multiple times against the same device (src bdev) while
      throttling of the other device (dst bdev) is ignored. In case of BFQ the
      inconsistency can even result in crashes in bfq_bic_update_cgroup().
      Fix the problem by looking up correct blkcg_gq for the cloned bio.
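
      A hedged sketch of the fix (mainline helper names; older stable branches
      spell the css lookup differently): associate the clone through a fresh
      lookup keyed by the destination's device rather than copying the
      source's blkcg_gq reference:

          void bio_clone_blkg_association(struct bio *dst, struct bio *src)
          {
                  /* looks up the blkcg_gq for dst's own bdev */
                  if (src->bi_blkg)
                          bio_associate_blkg_from_css(dst, bio_blkcg_css(src));
          }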
      
      Reported-by: Logan Gunthorpe <logang@deltatee.com>
      Reported-and-tested-by: Donald Buczek <buczek@molgen.mpg.de>
      Fixes: d92c370a ("block: really clone the block cgroup in bio_clone_blkg_association...
    • bfq: Make sure bfqg for which we are queueing requests is online · 51f724bf
      Jan Kara authored
      commit 075a53b7 upstream.
      
      Bios queued into the BFQ IO scheduler can be associated with a cgroup
      that was already offlined. This may then cause insertion of this
      bfq_group into a service tree. But this bfq_group will get freed as soon
      as the last bio associated with it is completed, leading to
      use-after-free issues for service tree users. Fix the problem by making
      sure we always operate on an online bfq_group. If the bfq_group
      associated with the bio is not online, we pick the first online parent.
      
      CC: stable@vger.kernel.org
      Fixes: e21b7a0b ("block, bfq: add full hierarchical scheduling and cgroups support")
      Tested-by: "yukuai (C)" <yukuai3@huawei.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20220401102752.8599-9-jack@suse.cz
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bfq: Get rid of __bio_blkcg() usage · 0285718e
      Jan Kara authored
      commit 4e54a249 upstream.
      
      BFQ's usage of __bio_blkcg() is a relic from the past. Furthermore, if a
      bio were not associated with any blkcg, the usage of __bio_blkcg() in
      BFQ would be prone to races with the task being migrated between
      cgroups, as __bio_blkcg() calls at different places could return
      different blkcgs.

      Convert BFQ to the new situation where bio->bi_blkg is initialized in
      bio_set_dev() and thus practically always valid. This allows us to save
      a blkcg_gq lookup and noticeably simplify the code.
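
      A hedged sketch of the conversion (together with the bfqg->online flag
      tracked by the patch below): derive the group from the bio's own blkg
      instead of calling __bio_blkcg():

          static struct bfq_group *bfq_bio_bfqg(struct bfq_data *bfqd,
                                                struct bio *bio)
          {
                  struct blkcg_gq *blkg = bio->bi_blkg;   /* no lookup needed */
                  struct bfq_group *bfqg;

                  while (blkg) {
                          bfqg = blkg_to_bfqg(blkg);
                          if (bfqg->online)
                                  return bfqg;
                          blkg = blkg->parent;
                  }
                  return bfqd->root_group;
          }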
      
      CC: stable@vger.kernel.org
      Fixes: 0fe061b9 ("blkcg: fix ref count issue with bio_blkcg() using task_css")
      Tested-by: "yukuai (C)" <yukuai3@huawei.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20220401102752.8599-8-jack@suse.cz
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bfq: Remove pointless bfq_init_rq() calls · 80b0a2b3
      Jan Kara authored
      commit 5f550ede upstream.
      
      We call bfq_init_rq() from request merging functions where the requests
      we get should have already gone through bfq_init_rq() during insert, and
      anyway we want to do anything only if the request is already tracked by
      BFQ. So replace the calls to bfq_init_rq() with RQ_BFQQ() to simply skip
      requests untracked by BFQ. We also move the bfq_init_rq() call in
      bfq_insert_request() a bit earlier to cover request merging and thus can
      transfer the FIFO position in case of a merge.
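
      A hedged sketch of the merge-path change: look up the queue with
      RQ_BFQQ() and bail out for requests BFQ isn't tracking:

          static void bfq_request_merged(struct request_queue *q, struct request *req,
                                         enum elv_merge type)
          {
                  struct bfq_queue *bfqq = RQ_BFQQ(req);  /* was: bfq_init_rq(req) */

                  if (!bfqq)
                          return;         /* request not tracked by BFQ */

                  /* existing repositioning of req in bfqq elided */
          }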
      
      CC: stable@vger.kernel.org
      Tested-by: "yukuai (C)" <yukuai3@huawei.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20220401102752.8599-6-jack@suse.cz
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bfq: Drop pointless unlock-lock pair · 13599aac
      Jan Kara authored
      commit fc84e1f9 upstream.
      
      In bfq_insert_request() we unlock bfqd->lock only to call
      trace_block_rq_insert() and then lock bfqd->lock again. This is really
      pointless, since tracing is disabled if we really care about
      performance, and even when the tracepoint is enabled, it is a quick
      call.
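
      The change, sketched (tracepoint signature as in recent kernels):

          spin_lock_irq(&bfqd->lock);
          /* was: spin_unlock_irq(&bfqd->lock);
           *      trace_block_rq_insert(rq);
           *      spin_lock_irq(&bfqd->lock);
           */
          trace_block_rq_insert(rq);      /* cheap even when enabled */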
      
      CC: stable@vger.kernel.org
      Tested-by: "yukuai (C)" <yukuai3@huawei.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20220401102752.8599-5-jack@suse.cz
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bfq: Avoid merging queues with different parents · 7d172b9d
      Jan Kara authored
      commit c1cee4ab upstream.
      
      It can happen that the parent of a bfqq changes between the moment we
      decide two queues are worth merging (and set bic->stable_merge_bfqq) and
      the moment bfq_setup_merge() is called. This can happen e.g. because the
      process submitted IO for a different cgroup and thus the bfqq got
      reparented. It can even happen that the bfqq we are merging with has a
      parent cgroup that is already offline and going to be destroyed, in
      which case the merge can lead to use-after-free issues such as:
      
      BUG: KASAN: use-after-free in __bfq_deactivate_entity+0x9cb/0xa50
      Read of size 8 at addr ffff88800693c0c0 by task runc:[2:INIT]/10544
      
      CPU: 0 PID: 10544 Comm: runc:[2:INIT] Tainted: G            E     5.15.2-0.g5fb85fd-default #1 openSUSE Tumbleweed (unreleased) f1f3b891c72369aebecd2e43e4641a6358867c70
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
      Call Trace:
       <IRQ>
       dump_stack_lvl+0x46/0x5a
       print_address_description.constprop.0+0x1f/0x140
       ? __bfq_deactivate_entity+0x9cb/0xa50
       kasan_report.cold+0x7f/0x11b
       ? __bfq_deactivate_entity+0x9cb/0xa50
       __bfq_deactivate_entity+0x9cb/0xa50
       ? update_curr+0x32f/0x5d0
       bfq_deactivate_entity+0xa0/0x1d0
       bfq_del_bfqq_busy+0x28a/0x420
       ? resched_curr+0x116/0x1d0
       ? bfq_requeue_bfqq+0x70/0x70
       ? check_preempt_wakeup+0x52b/0xbc0
       __bfq_bfqq_expire+0x1a2/0x270
       bfq_bfqq_expire+0xd16/0x2160
       ? try_to_wake_up+0x4ee/0x1260
       ? bfq_end_wr_async_queues+0xe0/0xe0
       ? _raw_write_unlock_bh+0x60/0x60
       ? _raw_spin_lock_irq+0x81/0xe0
       bfq_idle_slice_timer+0x109/0x280
       ? bfq_dispatch_request+0x4870/0x4870
       __hrtimer_run_queues+0x37d/0x700
       ? enqueue_hrtimer+0x1b0/0x1b0
       ? kvm_clock_get_cycles+0xd/0x10
       ? ktime_get_update_offsets_now+0x6f/0x280
       hrtimer_interrupt+0x2c8/0x740
      
      Fix the problem by checking that the parent of the two bfqqs we are
      merging in bfq_setup_merge() is the same.
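
      A hedged sketch of the added guard in bfq_setup_merge():

          /* the queues may have been reparented since the merge was decided */
          if (new_bfqq->entity.parent != bfqq->entity.parent)
                  return NULL;    /* refuse to merge across parents */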
      
      Link: https://lore.kernel.org/linux-block/20211125172809.GC19572@quack2.suse.cz/
      CC: stable@vger.kernel.org
      Fixes: 430a67f9 ("block, bfq: merge bursts of newly-created queues")
      Tested-by: "yukuai (C)" <yukuai3@huawei.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20220401102752.8599-2-jack@suse.cz
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • blk-iolatency: Fix inflight count imbalances and IO hangs on offline · 77692c02
      Tejun Heo authored
      commit 8a177a36 upstream.
      
      iolatency needs to track the number of inflight IOs per cgroup. As this
      tracking can be expensive, it is disabled when no cgroup has iolatency
      configured for the device. To ensure that the inflight counters stay
      balanced, iolatency_set_limit() freezes the request_queue while manipulating
      the enabled counter, which ensures that no IO is in flight and thus all
      counters are zero.
      
      Unfortunately, iolatency_set_limit() isn't the only place where the
      enabled counter is manipulated. iolatency_pd_offline() can also
      decrement the counter and trigger disabling. As this disabling happens
      without freezing the q, it can easily happen while some IOs are in
      flight and thus leak the counts.
      
      This can be easily demonstrated by turning on iolatency on one empty
      cgroup while IOs are in flight in other cgroups and then removing the
      cgroup. Note that iolatency shouldn't be enabled elsewhere in the
      system, to ensure that removing the cgroup disables iolatency for the
      whole device.
      
      The following keeps flipping on and off iolatency on sda:
      
        echo +io > /sys/fs/cgroup/cgroup.subtree_control
        while true; do
            mkdir -p /sys/fs/cgroup/test
            echo '8:0 target=100000' > /sys/fs/cgroup/test/io.latency
            sleep 1
            rmdir /sys/fs/cgroup/test
            sleep 1
        done
      
      and there's concurrent fio generating direct rand reads:
      
        fio --name test --filename=/dev/sda --direct=1 --rw=randread \
            --runtime=600 --time_based --iodepth=256 --numjobs=4 --bs=4k
      
      while monitoring with the following drgn script:
      
        while True:
          for css in css_for_each_descendant_pre(prog['blkcg_root'].css.address_of_()):
              for pos in hlist_for_each(container_of(css, 'struct blkcg', 'css').blkg_list):
                  blkg = container_of(pos, 'struct blkcg_gq', 'blkcg_node')
                  pd = blkg.pd[prog['blkcg_policy_iolatency'].plid]
                  if pd.value_() == 0:
                      continue
                  iolat = container_of(pd, 'struct iolatency_grp', 'pd')
                  inflight = iolat.rq_wait.inflight.counter.value_()
                  if inflight:
                      print(f'inflight={inflight} {disk_name(blkg.q.disk).decode("utf-8")} '
                            f'{cgroup_path(css.cgroup).decode("utf-8")}')
          time.sleep(1)
      
      The monitoring output looks like the following:
      
        inflight=1 sda /user.slice
        inflight=1 sda /user.slice
        ...
        inflight=14 sda /user.slice
        inflight=13 sda /user.slice
        inflight=17 sda /user.slice
        inflight=15 sda /user.slice
        inflight=18 sda /user.slice
        inflight=17 sda /user.slice
        inflight=20 sda /user.slice
        inflight=19 sda /user.slice <- fio stopped, inflight stuck at 19
        inflight=19 sda /user.slice
        inflight=19 sda /user.slice
      
      If a cgroup with stuck inflight ends up getting throttled, the throttled IOs
      will never get issued as there's no completion event to wake it up leading
      to an indefinite hang.
      
      This patch fixes the bug by unifying enable handling into a work item which
      is automatically kicked off from iolatency_set_min_lat_nsec() which is
      called from both iolatency_set_limit() and iolatency_pd_offline() paths.
      Punting to a work item is necessary as iolatency_pd_offline() is called
      under spinlocks while freezing a request_queue requires a sleepable context.
      
      This also simplifies the code, reducing LOC sans the comments, and
      avoids the unnecessary freezes which were happening whenever a cgroup's
      latency target was newly set or cleared.
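
      A hedged sketch of the work-item shape (field and function names follow
      the upstream patch; struct rq_qos layout per that era):

          static void blkiolatency_enable_work_fn(struct work_struct *work)
          {
                  struct blk_iolatency *blkiolat = container_of(work,
                                  struct blk_iolatency, enable_work);
                  bool enabled;

                  /*
                   * Freezing needs a sleepable context, which the pd_offline
                   * path (under spinlocks) cannot provide - hence the work.
                   */
                  enabled = atomic_read(&blkiolat->enable_cnt);
                  if (enabled != blkiolat->enabled) {
                          blk_mq_freeze_queue(blkiolat->rqos.q);
                          blkiolat->enabled = enabled;
                          blk_mq_unfreeze_queue(blkiolat->rqos.q);
                  }
          }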
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Liu Bo <bo.liu@linux.alibaba.com>
      Fixes: 8c772a9b ("blk-iolatency: fix IO hang due to negative inflight counter")
      Cc: stable@vger.kernel.org # v5.0+
      Link: https://lore.kernel.org/r/Yn9ScX6Nx2qIiQQi@slm.duckdns.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bfq: Track whether bfq_group is still online · 70a7dea8
      Jan Kara authored
      commit 09f87186 upstream.
      
      Track whether a bfq_group is still online. We cannot rely on
      blkcg_gq->online because that gets cleared only after all policies are
      offlined, and we need something that gets updated already under
      bfqd->lock when we are cleaning up our bfq_group, to be able to
      guarantee that when we see an online bfq_group, it will stay online
      while we are holding bfqd->lock.
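
      A hedged sketch: a flag flipped under bfqd->lock in the pd_init/offline
      hooks, which readers also holding bfqd->lock can trust:

          static void bfq_pd_offline(struct blkg_policy_data *pd)
          {
                  struct bfq_group *bfqg = pd_to_bfqg(pd);
                  struct bfq_data *bfqd = bfqg->bfqd;
                  unsigned long flags;

                  spin_lock_irqsave(&bfqd->lock, flags);
                  bfqg->online = false;   /* stable while bfqd->lock is held */
                  /* existing cleanup elided */
                  spin_unlock_irqrestore(&bfqd->lock, flags);
          }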
      
      CC: stable@vger.kernel.org
      Tested-by: "yukuai (C)" <yukuai3@huawei.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20220401102752.8599-7-jack@suse.cz
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bfq: Update cgroup information before merging bio · b06691af
      Jan Kara authored
      commit ea591cd4 upstream.
      
      When the process is migrated to a different cgroup (or in case of
      writeback just starts submitting bios associated with a different
      cgroup), bfq_merge_bio() can operate with stale cgroup information in
      the bic. Thus the bio can be merged to a request from a different
      cgroup, or it can result in merging of bfqqs for different cgroups or
      bfqqs of already dead cgroups, causing possible use-after-free issues.
      Fix the problem by updating the cgroup information in bfq_merge_bio().
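
      A hedged sketch of the hook change (fragment; the bic lookup varies by
      kernel version):

          /* in BFQ's bio-merge hook, before any merge decision */
          if (bic)
                  bfq_bic_update_cgroup(bic, bio);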
      
      CC: stable@vger.kernel.org
      Fixes: e21b7a0b ("block, bfq: add full hierarchical scheduling and cgroups support")
      Tested-by: "yukuai (C)" <yukuai3@huawei.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20220401102752.8599-4-jack@suse.cz
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bfq: Split shared queues on move between cgroups · 4dfc12f8
      Jan Kara authored
      commit 3bc5e683 upstream.
      
      When a bfqq is shared by multiple processes, it can happen that one of
      the processes gets moved to a different cgroup (or just starts
      submitting IO for a different cgroup). In that case we need to split the
      merged bfqq, as otherwise we will have IO for multiple cgroups in one
      bfqq and will just account IO time to the wrong entities, etc. See the
      sketch below.

      Similarly, if the bfqq is scheduled to merge with another bfqq but the
      merge didn't happen yet, cancel the merge, as it need not be valid
      anymore.
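
      A hedged sketch of the split performed in bfq_bic_update_cgroup()
      (identifier names from the bfq code; not the full diff):

          struct bfq_queue *sync_bfqq = bic_to_bfqq(bic, true);

          if (sync_bfqq && (sync_bfqq->new_bfqq || bfq_bfqq_coop(sync_bfqq))) {
                  /* the queue was merged (or is scheduled to merge): detach
                   * the bic so this process gets a fresh queue in its new
                   * group instead of the shared one */
                  bic_set_bfqq(bic, NULL, true);
                  bfq_release_process_ref(bfqd, sync_bfqq);
          }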
      
      CC: stable@vger.kernel.org
      Fixes: e21b7a0b ("block, bfq: add full hierarchical scheduling and cgroups support")
      Tested-by: "yukuai (C)" <yukuai3@huawei.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20220401102752.8599-3-jack@suse.cz
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  14. May 12, 2022
  15. May 09, 2022
    • iocost: don't reset the inuse weight of under-weighted debtors · 0967830e
      Tejun Heo authored
      commit 8c936f9e upstream.
      
      When an iocg is in debt, its inuse weight is owned by debt handling and
      should stay at 1. This invariant was broken when determining the amount
      of surpluses at the beginning of the donation calculation: when an
      iocg's hierarchical weight is too low, the iocg is excluded from the
      donation calculation and its inuse is reset to its active weight
      regardless of its indebtedness, triggering warnings like the following:
      
       WARNING: CPU: 5 PID: 0 at block/blk-iocost.c:1416 iocg_kick_waitq+0x392/0x3a0
       ...
       RIP: 0010:iocg_kick_waitq+0x392/0x3a0
       Code: 00 00 be ff ff ff ff 48 89 4d a8 e8 98 b2 70 00 48 8b 4d a8 85 c0 0f 85 4a fe ff ff 0f 0b e9 43 fe ff ff 0f 0b e9 4d fe ff ff <0f> 0b e9 50 fe ff ff e8 a2 ae 70 00 66 90 0f 1f 44 00 00 55 48 89
       RSP: 0018:ffffc90000200d08 EFLAGS: 00010016
       ...
        <IRQ>
        ioc_timer_fn+0x2e0/0x1470
        call_timer_fn+0xa1/0x2c0
       ...
      
      As this happens only when an iocg's hierarchical weight is negligible,
      its impact is likely limited to triggering the warnings. Fix it by
      skipping the inuse reset for under-weighted debtors.
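
      A hedged sketch of the guard (the threshold condition is hypothetical;
      abs_vdebt and propagate_weights() are from blk-iocost.c):

          /* iocg excluded from donation: hierarchical weight is negligible */
          if (hwa < hwa_min) {                    /* hypothetical condition */
                  /*
                   * An indebted iocg's inuse is owned by debt handling and
                   * must stay at 1 - skip the reset to active for debtors.
                   */
                  if (!iocg->abs_vdebt)
                          propagate_weights(iocg, iocg->active, iocg->active,
                                            true, &now);
                  continue;
          }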
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Rik van Riel <riel@surriel.com>
      Fixes: c421a3eb ("blk-iocost: revamp debt handling")
      Cc: stable@vger.kernel.org # v5.10+
      Link: https://lore.kernel.org/r/YmjODd4aif9BzFuO@slm.duckdns.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  16. Apr 27, 2022
  17. Apr 08, 2022
    • Revert "Revert "block, bfq: honor already-setup queue merges"" · cc051f49
      Paolo Valente authored
      [ Upstream commit 15729ff8 ]
      
      A crash [1] happened to be triggered in conjunction with commit
      2d52c58b ("block, bfq: honor already-setup queue merges"). The
      latter was then reverted by commit ebc69e89 ("Revert "block, bfq:
      honor already-setup queue merges""). Yet, the reverted commit was not
      the one introducing the bug. In fact, it actually triggered a UAF
      introduced by a different commit, and now fixed by commit d29bd414
      ("block, bfq: reset last_bfqq_created on group change").
      
      So, there is no point in keeping commit 2d52c58b ("block, bfq:
      honor already-setup queue merges") out. This commit restores it.
      
      [1] https://bugzilla.kernel.org/show_bug.cgi?id=214503

      Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Link: https://lore.kernel.org/r/20211125181510.15004-1-paolo.valente@linaro.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • bfq: fix use-after-free in bfq_dispatch_request · df6e00b1
      Zhang Wensheng authored
      [ Upstream commit ab552fcb ]
      
      KASAN reports a use-after-free when doing a normal scsi-mq test:
      
      [69832.239032] ==================================================================
      [69832.241810] BUG: KASAN: use-after-free in bfq_dispatch_request+0x1045/0x44b0
      [69832.243267] Read of size 8 at addr ffff88802622ba88 by task kworker/3:1H/155
      [69832.244656]
      [69832.245007] CPU: 3 PID: 155 Comm: kworker/3:1H Not tainted 5.10.0-10295-g576c6382529e #8
      [69832.246626] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [69832.249069] Workqueue: kblockd blk_mq_run_work_fn
      [69832.250022] Call Trace:
      [69832.250541]  dump_stack+0x9b/0xce
      [69832.251232]  ? bfq_dispatch_request+0x1045/0x44b0
      [69832.252243]  print_address_description.constprop.6+0x3e/0x60
      [69832.253381]  ? __cpuidle_text_end+0x5/0x5
      [69832.254211]  ? vprintk_func+0x6b/0x120
      [69832.254994]  ? bfq_dispatch_request+0x1045/0x44b0
      [69832.255952]  ? bfq_dispatch_request+0x1045/0x44b0
      [69832.256914]  kasan_report.cold.9+0x22/0x3a
      [69832.257753]  ? bfq_dispatch_request+0x1045/0x44b0
      [69832.258755]  check_memory_region+0x1c1/0x1e0
      [69832.260248]  bfq_dispatch_request+0x1045/0x44b0
      [69832.261181]  ? bfq_bfqq_expire+0x2440/0x2440
      [69832.262032]  ? blk_mq_delay_run_hw_queues+0xf9/0x170
      [69832.263022]  __blk_mq_do_dispatch_sched+0x52f/0x830
      [69832.264011]  ? blk_mq_sched_request_inserted+0x100/0x100
      [69832.265101]  __blk_mq_sched_dispatch_requests+0x398/0x4f0
      [69832.266206]  ? blk_mq_do_dispatch_ctx+0x570/0x570
      [69832.267147]  ? __switch_to+0x5f4/0xee0
      [69832.267898]  blk_mq_sched_dispatch_requests+0xdf/0x140
      [69832.268946]  __blk_mq_run_hw_queue+0xc0/0x270
      [69832.269840]  blk_mq_run_work_fn+0x51/0x60
      [69832.278170]  process_one_work+0x6d4/0xfe0
      [69832.278984]  worker_thread+0x91/0xc80
      [69832.279726]  ? __kthread_parkme+0xb0/0x110
      [69832.280554]  ? process_one_work+0xfe0/0xfe0
      [69832.281414]  kthread+0x32d/0x3f0
      [69832.282082]  ? kthread_park+0x170/0x170
      [69832.282849]  ret_from_fork+0x1f/0x30
      [69832.283573]
      [69832.283886] Allocated by task 7725:
      [69832.284599]  kasan_save_stack+0x19/0x40
      [69832.285385]  __kasan_kmalloc.constprop.2+0xc1/0xd0
      [69832.286350]  kmem_cache_alloc_node+0x13f/0x460
      [69832.287237]  bfq_get_queue+0x3d4/0x1140
      [69832.287993]  bfq_get_bfqq_handle_split+0x103/0x510
      [69832.289015]  bfq_init_rq+0x337/0x2d50
      [69832.289749]  bfq_insert_requests+0x304/0x4e10
      [69832.290634]  blk_mq_sched_insert_requests+0x13e/0x390
      [69832.291629]  blk_mq_flush_plug_list+0x4b4/0x760
      [69832.292538]  blk_flush_plug_list+0x2c5/0x480
      [69832.293392]  io_schedule_prepare+0xb2/0xd0
      [69832.294209]  io_schedule_timeout+0x13/0x80
      [69832.295014]  wait_for_common_io.constprop.1+0x13c/0x270
      [69832.296137]  submit_bio_wait+0x103/0x1a0
      [69832.296932]  blkdev_issue_discard+0xe6/0x160
      [69832.297794]  blk_ioctl_discard+0x219/0x290
      [69832.298614]  blkdev_common_ioctl+0x50a/0x1750
      [69832.304715]  blkdev_ioctl+0x470/0x600
      [69832.305474]  block_ioctl+0xde/0x120
      [69832.306232]  vfs_ioctl+0x6c/0xc0
      [69832.306877]  __se_sys_ioctl+0x90/0xa0
      [69832.307629]  do_syscall_64+0x2d/0x40
      [69832.308362]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [69832.309382]
      [69832.309701] Freed by task 155:
      [69832.310328]  kasan_save_stack+0x19/0x40
      [69832.311121]  kasan_set_track+0x1c/0x30
      [69832.311868]  kasan_set_free_info+0x1b/0x30
      [69832.312699]  __kasan_slab_free+0x111/0x160
      [69832.313524]  kmem_cache_free+0x94/0x460
      [69832.314367]  bfq_put_queue+0x582/0x940
      [69832.315112]  __bfq_bfqd_reset_in_service+0x166/0x1d0
      [69832.317275]  bfq_bfqq_expire+0xb27/0x2440
      [69832.318084]  bfq_dispatch_request+0x697/0x44b0
      [69832.318991]  __blk_mq_do_dispatch_sched+0x52f/0x830
      [69832.319984]  __blk_mq_sched_dispatch_requests+0x398/0x4f0
      [69832.321087]  blk_mq_sched_dispatch_requests+0xdf/0x140
      [69832.322225]  __blk_mq_run_hw_queue+0xc0/0x270
      [69832.323114]  blk_mq_run_work_fn+0x51/0x60
      [69832.323942]  process_one_work+0x6d4/0xfe0
      [69832.324772]  worker_thread+0x91/0xc80
      [69832.325518]  kthread+0x32d/0x3f0
      [69832.326205]  ret_from_fork+0x1f/0x30
      [69832.326932]
      [69832.338297] The buggy address belongs to the object at ffff88802622b968
      [69832.338297]  which belongs to the cache bfq_queue of size 512
      [69832.340766] The buggy address is located 288 bytes inside of
      [69832.340766]  512-byte region [ffff88802622b968, ffff88802622bb68)
      [69832.343091] The buggy address belongs to the page:
      [69832.344097] page:ffffea0000988a00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88802622a528 pfn:0x26228
      [69832.346214] head:ffffea0000988a00 order:2 compound_mapcount:0 compound_pincount:0
      [69832.347719] flags: 0x1fffff80010200(slab|head)
      [69832.348625] raw: 001fffff80010200 ffffea0000dbac08 ffff888017a57650 ffff8880179fe840
      [69832.354972] raw: ffff88802622a528 0000000000120008 00000001ffffffff 0000000000000000
      [69832.356547] page dumped because: kasan: bad access detected
      [69832.357652]
      [69832.357970] Memory state around the buggy address:
      [69832.358926]  ffff88802622b980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [69832.360358]  ffff88802622ba00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [69832.361810] >ffff88802622ba80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [69832.363273]                       ^
      [69832.363975]  ffff88802622bb00: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc
      [69832.375960]  ffff88802622bb80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [69832.377405] ==================================================================
      
      In the bfq_dispatch_request function, the following call chain may occur:
      
      bfq_dispatch_request
      	__bfq_dispatch_request
      		bfq_select_queue
      			bfq_bfqq_expire
      				__bfq_bfqd_reset_in_service
      					bfq_put_queue
      						kmem_cache_free
      In this call chain, in_serv_queue has been expired and meets the
      conditions to be freed, so the memory that in_serv_queue points to has
      been released by the time bfq_dispatch_request reads it. Computing
      idle_timer_disabled reads the flags from that stale address, which is a
      use-after-free.

      Fix the problem by checking in_serv_queue == bfqd->in_service_queue and
      reading idle_timer_disabled only if they are still equal. If the memory
      in_serv_queue points to has been released, this check avoids the
      use-after-free. And if in_serv_queue has been expired or finished,
      idle_timer_disabled will be false, which does not affect
      bfq_update_dispatch_stats.
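
      The fix, sketched:

          struct bfq_queue *in_serv_queue;
          bool waiting_rq, idle_timer_disabled = false;

          spin_lock_irq(&bfqd->lock);
          in_serv_queue = bfqd->in_service_queue;
          waiting_rq = in_serv_queue && bfq_bfqq_wait_request(in_serv_queue);

          rq = __bfq_dispatch_request(hctx);
          /* only touch in_serv_queue if it is still the in-service queue,
           * i.e. it was not expired and freed during the dispatch */
          if (in_serv_queue == bfqd->in_service_queue)
                  idle_timer_disabled =
                          waiting_rq && !bfq_bfqq_wait_request(in_serv_queue);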
      
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Zhang Wensheng <zhangwensheng5@huawei.com>
      Link: https://lore.kernel.org/r/20220303070334.3020168-1-zhangwensheng5@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • block, bfq: don't move oom_bfqq · 7507ead1
      Yu Kuai authored
      [ Upstream commit 8410f709 ]
      
      Our test reported a UAF:
      
      [ 2073.019181] ==================================================================
      [ 2073.019188] BUG: KASAN: use-after-free in __bfq_put_async_bfqq+0xa0/0x168
      [ 2073.019191] Write of size 8 at addr ffff8000ccf64128 by task rmmod/72584
      [ 2073.019192]
      [ 2073.019196] CPU: 0 PID: 72584 Comm: rmmod Kdump: loaded Not tainted 4.19.90-yk #5
      [ 2073.019198] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
      [ 2073.019200] Call trace:
      [ 2073.019203]  dump_backtrace+0x0/0x310
      [ 2073.019206]  show_stack+0x28/0x38
      [ 2073.019210]  dump_stack+0xec/0x15c
      [ 2073.019216]  print_address_description+0x68/0x2d0
      [ 2073.019220]  kasan_report+0x238/0x2f0
      [ 2073.019224]  __asan_store8+0x88/0xb0
      [ 2073.019229]  __bfq_put_async_bfqq+0xa0/0x168
      [ 2073.019233]  bfq_put_async_queues+0xbc/0x208
      [ 2073.019236]  bfq_pd_offline+0x178/0x238
      [ 2073.019240]  blkcg_deactivate_policy+0x1f0/0x420
      [ 2073.019244]  bfq_exit_queue+0x128/0x178
      [ 2073.019249]  blk_mq_exit_sched+0x12c/0x160
      [ 2073.019252]  elevator_exit+0xc8/0xd0
      [ 2073.019256]  blk_exit_queue+0x50/0x88
      [ 2073.019259]  blk_cleanup_queue+0x228/0x3d8
      [ 2073.019267]  null_del_dev+0xfc/0x1e0 [null_blk]
      [ 2073.019274]  null_exit+0x90/0x114 [null_blk]
      [ 2073.019278]  __arm64_sys_delete_module+0x358/0x5a0
      [ 2073.019282]  el0_svc_common+0xc8/0x320
      [ 2073.019287]  el0_svc_handler+0xf8/0x160
      [ 2073.019290]  el0_svc+0x10/0x218
      [ 2073.019291]
      [ 2073.019294] Allocated by task 14163:
      [ 2073.019301]  kasan_kmalloc+0xe0/0x190
      [ 2073.019305]  kmem_cache_alloc_node_trace+0x1cc/0x418
      [ 2073.019308]  bfq_pd_alloc+0x54/0x118
      [ 2073.019313]  blkcg_activate_policy+0x250/0x460
      [ 2073.019317]  bfq_create_group_hierarchy+0x38/0x110
      [ 2073.019321]  bfq_init_queue+0x6d0/0x948
      [ 2073.019325]  blk_mq_init_sched+0x1d8/0x390
      [ 2073.019330]  elevator_switch_mq+0x88/0x170
      [ 2073.019334]  elevator_switch+0x140/0x270
      [ 2073.019338]  elv_iosched_store+0x1a4/0x2a0
      [ 2073.019342]  queue_attr_store+0x90/0xe0
      [ 2073.019348]  sysfs_kf_write+0xa8/0xe8
      [ 2073.019351]  kernfs_fop_write+0x1f8/0x378
      [ 2073.019359]  __vfs_write+0xe0/0x360
      [ 2073.019363]  vfs_write+0xf0/0x270
      [ 2073.019367]  ksys_write+0xdc/0x1b8
      [ 2073.019371]  __arm64_sys_write+0x50/0x60
      [ 2073.019375]  el0_svc_common+0xc8/0x320
      [ 2073.019380]  el0_svc_handler+0xf8/0x160
      [ 2073.019383]  el0_svc+0x10/0x218
      [ 2073.019385]
      [ 2073.019387] Freed by task 72584:
      [ 2073.019391]  __kasan_slab_free+0x120/0x228
      [ 2073.019394]  kasan_slab_free+0x10/0x18
      [ 2073.019397]  kfree+0x94/0x368
      [ 2073.019400]  bfqg_put+0x64/0xb0
      [ 2073.019404]  bfqg_and_blkg_put+0x90/0xb0
      [ 2073.019408]  bfq_put_queue+0x220/0x228
      [ 2073.019413]  __bfq_put_async_bfqq+0x98/0x168
      [ 2073.019416]  bfq_put_async_queues+0xbc/0x208
      [ 2073.019420]  bfq_pd_offline+0x178/0x238
      [ 2073.019424]  blkcg_deactivate_policy+0x1f0/0x420
      [ 2073.019429]  bfq_exit_queue+0x128/0x178
      [ 2073.019433]  blk_mq_exit_sched+0x12c/0x160
      [ 2073.019437]  elevator_exit+0xc8/0xd0
      [ 2073.019440]  blk_exit_queue+0x50/0x88
      [ 2073.019443]  blk_cleanup_queue+0x228/0x3d8
      [ 2073.019451]  null_del_dev+0xfc/0x1e0 [null_blk]
      [ 2073.019459]  null_exit+0x90/0x114 [null_blk]
      [ 2073.019462]  __arm64_sys_delete_module+0x358/0x5a0
      [ 2073.019467]  el0_svc_common+0xc8/0x320
      [ 2073.019471]  el0_svc_handler+0xf8/0x160
      [ 2073.019474]  el0_svc+0x10/0x218
      [ 2073.019475]
      [ 2073.019479] The buggy address belongs to the object at ffff8000ccf63f00
       which belongs to the cache kmalloc-1024 of size 1024
      [ 2073.019484] The buggy address is located 552 bytes inside of
       1024-byte region [ffff8000ccf63f00, ffff8000ccf64300)
      [ 2073.019486] The buggy address belongs to the page:
      [ 2073.019492] page:ffff7e000333d800 count:1 mapcount:0 mapping:ffff8000c0003a00 index:0x0 compound_mapcount: 0
      [ 2073.020123] flags: 0x7ffff0000008100(slab|head)
      [ 2073.020403] raw: 07ffff0000008100 ffff7e0003334c08 ffff7e00001f5a08 ffff8000c0003a00
      [ 2073.020409] raw: 0000000000000000 00000000001c001c 00000001ffffffff 0000000000000000
      [ 2073.020411] page dumped because: kasan: bad access detected
      [ 2073.020412]
      [ 2073.020414] Memory state around the buggy address:
      [ 2073.020420]  ffff8000ccf64000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [ 2073.020424]  ffff8000ccf64080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [ 2073.020428] >ffff8000ccf64100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [ 2073.020430]                                   ^
      [ 2073.020434]  ffff8000ccf64180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [ 2073.020438]  ffff8000ccf64200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [ 2073.020439] ==================================================================
      
      The same problem exists in mainline as well.

      This happens because oom_bfqq is moved to a non-root group, so
      root_group is freed earlier.

      Fix the problem by not moving oom_bfqq.
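
      The guard, sketched:

          void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
                             struct bfq_group *bfqg)
          {
                  /* oom_bfqq must keep its reference to root_group */
                  if (bfqq == &bfqd->oom_bfqq)
                          return;

                  /* existing move logic elided */
          }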
      
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Paolo Valente <paolo.valente@linaro.org>
      Link: https://lore.kernel.org/r/20220129015924.3958918-4-yukuai3@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • block: don't delete queue kobject before its children · 0b5924a1
      Eric Biggers authored
      [ Upstream commit 0f692882 ]
      
      kobjects aren't supposed to be deleted before their child kobjects are
      deleted.  Apparently this is usually benign; however, a WARN will be
      triggered if one of the child kobjects has a named attribute group:
      
          sysfs group 'modes' not found for kobject 'crypto'
          WARNING: CPU: 0 PID: 1 at fs/sysfs/group.c:278 sysfs_remove_group+0x72/0x80
          ...
          Call Trace:
            sysfs_remove_groups+0x29/0x40 fs/sysfs/group.c:312
            __kobject_del+0x20/0x80 lib/kobject.c:611
            kobject_cleanup+0xa4/0x140 lib/kobject.c:696
            kobject_release lib/kobject.c:736 [inline]
            kref_put include/linux/kref.h:65 [inline]
            kobject_put+0x53/0x70 lib/kobject.c:753
            blk_crypto_sysfs_unregister+0x10/0x20 block/blk-crypto-sysfs.c:159
            blk_unregister_queue+0xb0/0x110 block/blk-sysfs.c:962
            del_gendisk+0x117/0x250 block/genhd.c:610
      
      Fix this by moving the kobject_del() and the corresponding
      kobject_uevent() to the correct place.
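
      A hedged sketch of the corrected ordering in blk_unregister_queue()
      (children named for illustration): delete the child kobjects first, and
      only then the queue kobject itself:

          blk_crypto_sysfs_unregister(q);         /* child: queue/crypto */
          blk_trace_shutdown(q);                  /* other teardown */

          /* the parent kobject goes last */
          kobject_uevent(&q->kobj, KOBJ_REMOVE);
          kobject_del(&q->kobj);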
      
      Fixes: 2c2086af ("block: Protect less code with sysfs_lock in blk_{un,}register_queue()")
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20220124215938.2769-3-ebiggers@kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • block: don't merge across cgroup boundaries if blkcg is enabled · ce1927b8
      Tejun Heo authored
      commit 6b2b0459 upstream.
      
      blk-iocost and iolatency are cgroup-aware rq-qos policies, but they
      didn't disable merges across different cgroups. This obviously can lead
      to accounting and control errors, but more importantly to priority
      inversions: e.g. an IO which belongs to a higher priority cgroup or IO
      class may end up getting throttled incorrectly because it gets merged
      with an IO issued from a low priority cgroup.
      
      Fix it by adding blk_cgroup_mergeable() which is called from merge paths and
      rejects cross-cgroup and cross-issue_as_root merges.
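
      The helper, roughly as added to the blk-cgroup header:

          static inline bool blk_cgroup_mergeable(struct request *rq, struct bio *bio)
          {
                  /* same blkcg_gq and same issue_as_root resolution */
                  return rq->bio->bi_blkg == bio->bi_blkg &&
                          bio_issue_as_root_blkg(rq->bio) ==
                          bio_issue_as_root_blkg(bio);
          }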
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: d7067512 ("block: introduce blk-iolatency io controller")
      Cc: stable@vger.kernel.org # v4.19+
      Cc: Josef Bacik <jbacik@fb.com>
      Link: https://lore.kernel.org/r/Yi/eE/6zFNyWJ+qd@slm.duckdns.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • block: limit request dispatch loop duration · 6e0d2459
      Shin'ichiro Kawasaki authored
      commit 572299f0 upstream.
      
      When IO requests are made continuously and the target block device
      handles requests faster than they arrive, the request dispatch loop
      keeps repeating to dispatch the arriving requests for a very long time,
      more than a minute. Since the loop runs as a workqueue worker task, the
      very long loop duration triggers the workqueue watchdog timeout and a
      BUG [1].

      To avoid the very long loop duration, break the loop periodically. When
      the opportunity to dispatch requests still exists, check need_resched().
      If need_resched() returns true, the dispatch loop has already consumed
      its time slice, so reschedule the dispatch work and break the loop. With
      heavy IO load, need_resched() does not return true for 20-30 seconds. To
      cover such cases, also check the time spent in the dispatch loop with
      jiffies: if more than 1 second is spent, reschedule the dispatch work
      and break the loop.
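
      A hedged sketch of the loop break (helper names as in blk-mq; dispatch
      body elided):

          unsigned long end = jiffies + HZ;       /* budget: ~1s per pass */

          do {
                  /* pick and dispatch one request (elided) */

                  if (need_resched() || time_is_before_jiffies(end)) {
                          /* time slice consumed or 1s elapsed: hand the
                           * remaining work back to kblockd and stop */
                          blk_mq_delay_run_hw_queue(hctx, 0);
                          break;
                  }
          } while (1);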
      
      [1]
      
      [  609.691437] BUG: workqueue lockup - pool cpus=10 node=1 flags=0x0 nice=-20 stuck for 35s!
      [  609.701820] Showing busy workqueues and worker pools:
      [  609.707915] workqueue events: flags=0x0
      [  609.712615]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
      [  609.712626]     pending: drm_fb_helper_damage_work [drm_kms_helper]
      [  609.712687] workqueue events_freezable: flags=0x4
      [  609.732943]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
      [  609.732952]     pending: pci_pme_list_scan
      [  609.732968] workqueue events_power_efficient: flags=0x80
      [  609.751947]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
      [  609.751955]     pending: neigh_managed_work
      [  609.752018] workqueue kblockd: flags=0x18
      [  609.769480]   pwq 21: cpus=10 node=1 flags=0x0 nice=-20 active=3/256 refcnt=4
      [  609.769488]     in-flight: 1020:blk_mq_run_work_fn
      [  609.769498]     pending: blk_mq_timeout_work, blk_mq_run_work_fn
      [  609.769744] pool 21: cpus=10 node=1 flags=0x0 nice=-20 hung=35s workers=2 idle: 67
      [  639.899730] BUG: workqueue lockup - pool cpus=10 node=1 flags=0x0 nice=-20 stuck for 66s!
      [  639.909513] Showing busy workqueues and worker pools:
      [  639.915404] workqueue events: flags=0x0
      [  639.920197]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
      [  639.920215]     pending: drm_fb_helper_damage_work [drm_kms_helper]
      [  639.920365] workqueue kblockd: flags=0x18
      [  639.939932]   pwq 21: cpus=10 node=1 flags=0x0 nice=-20 active=3/256 refcnt=4
      [  639.939942]     in-flight: 1020:blk_mq_run_work_fn
      [  639.939955]     pending: blk_mq_timeout_work, blk_mq_run_work_fn
      [  639.940212] pool 21: cpus=10 node=1 flags=0x0 nice=-20 hung=66s workers=2 idle: 67
      
      Fixes: 6e6fcbc2 ("blk-mq: support batching dispatch in case of io")
      Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Cc: stable@vger.kernel.org # v5.10+
      Link: https://lore.kernel.org/linux-block/20220310091649.zypaem5lkyfadymg@shindev/
      Link: https://lore.kernel.org/r/20220318022641.133484-1-shinichiro.kawasaki@wdc.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  18. Feb 23, 2022
    • block/wbt: fix negative inflight counter when remove scsi device · 598dbaf7
      Laibin Qiu authored
      commit e92bc4cd upstream.
      
      We now disable wbt by setting WBT_STATE_OFF_DEFAULT in
      wbt_disable_default() when switching the elevator to bfq, and when we
      remove the scsi device, wbt gets enabled again by wbt_enable_default().
      This can produce a false positive between wbt_wait() and wbt_track()
      when submitting a write request.
      
      The following is the scenario that triggered the problem.
      
      T1                          T2                           T3
                                  elevator_switch_mq
                                  bfq_init_queue
                                  wbt_disable_default <= Set
                                  rwb->enable_state (OFF)
      Submit_bio
      blk_mq_make_request
      rq_qos_throttle
      <= rwb->enable_state (OFF)
                                                               scsi_remove_device
                                                               sd_remove
                                                               del_gendisk
                           ...
  19. Feb 08, 2022
  20. Feb 01, 2022
  21. Jan 27, 2022
    • block: Fix fsync always failed if once failed · 2bcab471
      Ye Bin authored
      commit 8a751893 upstream.
      
      We ran tests with fault injection based on v4.19; after testing for some
      time we found that sync /dev/sda always failed.
      [root@localhost] sync /dev/sda
      sync: error syncing '/dev/sda': Input/output error
      
      scsi log as follows:
      [19069.812296] sd 0:0:0:0: [sda] tag#64 Send: scmd 0x00000000d03a0b6b
      [19069.812302] sd 0:0:0:0: [sda] tag#64 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
      [19069.812533] sd 0:0:0:0: [sda] tag#64 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
      [19069.812536] sd 0:0:0:0: [sda] tag#64 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
      [19069.812539] sd 0:0:0:0: [sda] tag#64 scsi host busy 1 failed 0
      [19069.812542] sd 0:0:0:0: Notifying upper driver of completion (result 0)
      [19069.812546] sd 0:0:0:0: [sda] tag#64 sd_done: completed 0 of 0 bytes
      [19069.812549] sd 0:0:0:0: [sda] tag#64 0 sectors total, 0 bytes done.
      [19069.812564] print_req_error: I/O error, dev sda, sector 0
      
      ftrace log as follows:
       rep-306069 [007] .... 19654.923315: block_bio_queue: 8,0 FWS 0 + 0 [rep]
       rep-306069 [007] .... 19654.923333: block_getrq: 8,0 FWS 0 + 0 [rep]
       kworker/7:1H-250   [007] .... 19654.923352: block_rq_issue: 8,0 FF 0 () 0 + 0 [kworker/7:1H]
       <idle>-0     [007] ..s. 19654.923562: block_rq_complete: 8,0 FF () 18446744073709551615 + 0 [0]
       <idle>-0     [007] d.s. 19654.923576: block_rq_complete: 8,0 WS () 0 + 0 [-5]
      
      Since 8d699663 introduced 'fq->rq_status', this field is only updated
      when the 'flush_rq' reference count isn't zero. If a flush request once
      failed and recorded an error code in 'fq->rq_status', and there is no
      later chance to update 'fq->rq_status', every subsequent fsync will
      fail. To address this issue, reset 'fq->rq_status' after returning the
      error code to the upper layer.
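
      The reset, sketched (in the flush completion path of block/blk-flush.c):

          if (fq->rq_status != BLK_STS_OK) {
                  error = fq->rq_status;
                  fq->rq_status = BLK_STS_OK;     /* consume the error once */
          }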
      
      Fixes: 8d699663 ("block: fix null pointer dereference in blk_mq_rq_timed_out()")
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20211129012659.1553733-1-yebin10@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>