  1. Apr 21, 2022
    • mm, hugetlb: allow for "high" userspace addresses · 5f24d5a5
      Christophe Leroy authored
      This is a fix for commit f6795053 ("mm: mmap: Allow for "high"
      userspace addresses") for hugetlb.
      
      This patch adds support for "high" userspace addresses that are
      optionally supported on the system and have to be requested via a hint
      mechanism ("high" addr parameter to mmap).
      
      Architectures such as powerpc and x86 achieve this by making changes to
      their architectural versions of hugetlb_get_unmapped_area() function.
      However, arm64 uses the generic version of that function.
      
      So take into account arch_get_mmap_base() and arch_get_mmap_end() in
      hugetlb_get_unmapped_area().  To allow that, move those two macros out
      of mm/mmap.c into include/linux/sched/mm.h
      
      If these macros are not defined in architectural code then they default
      to (TASK_SIZE) and (base) so should not introduce any behavioural
      changes to architectures that do not define them.
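
      The fallback behaviour described above can be sketched in userspace C.
      This is only an illustration: the demo_* names, the 48-bit and 52-bit
      limits and the opt-in rule are assumptions made for the example, not
      the kernel's actual macros.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical limits for illustration; real values are per-architecture. */
#define DEMO_TASK_SIZE          (UINT64_C(1) << 52) /* whole user address space */
#define DEMO_DEFAULT_MAP_WINDOW (UINT64_C(1) << 48) /* range handed out by default */

/* Generic fallbacks as described in the commit message: the search ends at
 * TASK_SIZE and the base is returned unchanged, whatever the hint is. */
static uint64_t demo_generic_mmap_end(uint64_t hint)
{
    (void)hint;
    return DEMO_TASK_SIZE;
}

static uint64_t demo_generic_mmap_base(uint64_t hint, uint64_t base)
{
    (void)hint;
    return base;
}

/* Assumed opt-in override: only a hint above the default window extends
 * the search to the full address space. */
static uint64_t demo_optin_mmap_end(uint64_t hint)
{
    return hint > DEMO_DEFAULT_MAP_WINDOW ? DEMO_TASK_SIZE
                                          : DEMO_DEFAULT_MAP_WINDOW;
}

/* Highest start address at which a mapping of len bytes still fits. */
static uint64_t demo_highest_start(uint64_t hint, uint64_t len)
{
    uint64_t end = demo_optin_mmap_end(hint);

    assert(len <= end);
    return end - len;
}
```

      With a zero hint, demo_highest_start() stays below the default window;
      only a hint above the window unlocks the larger range, which is the
      consistency the commit wants between mmap() and hugetlb mappings.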
      
      For the time being, only ARM64 is affected by this change.
      
      Catalin (ARM64) said
       "We should have fixed hugetlb_get_unmapped_area() as well when we added
        support for 52-bit VA. The reason for commit f6795053 was to
        prevent normal mmap() from returning addresses above 48-bit by default
        as some user-space had hard assumptions about this.
      
        It's a slight ABI change if you do this for hugetlb_get_unmapped_area()
        but I doubt anyone would notice. It's more likely that the current
        behaviour would cause issues, so I'd rather have them consistent.
      
        Basically when arm64 gained support for 52-bit addresses we did not
        want user-space calling mmap() to suddenly get such high addresses,
        otherwise we could have inadvertently broken some programs (similar
        behaviour to x86 here). Hence we added commit f6795053. But we
        missed hugetlbfs which could still get such high mmap() addresses. So
        in theory that's a potential regression that should have been addressed
        at the same time as commit f6795053 (and before arm64 enabled
        52-bit addresses)"
      
      Link: https://lkml.kernel.org/r/ab847b6edb197bffdfe189e70fb4ac76bfe79e0d.1650033747.git.christophe.leroy@csgroup.eu
      Fixes: f6795053 ("mm: mmap: Allow for "high" userspace addresses")
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: <stable@vger.kernel.org>	[5.0.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • jbd2: fix a potential race while discarding reserved buffers after an abort · 23e3d7f7
      Ye Bin authored
      We hit the following issue:
      [   72.796117] EXT4-fs error (device sda): ext4_journal_check_start:83: comm fallocate: Detected aborted journal
      [   72.826847] EXT4-fs (sda): Remounting filesystem read-only
      fallocate: fallocate failed: Read-only file system
      [   74.791830] jbd2_journal_commit_transaction: jh=0xffff9cfefe725d90 bh=0x0000000000000000 end delay
      [   74.793597] ------------[ cut here ]------------
      [   74.794203] kernel BUG at fs/jbd2/transaction.c:2063!
      [   74.794886] invalid opcode: 0000 [#1] PREEMPT SMP PTI
      [   74.795533] CPU: 4 PID: 2260 Comm: jbd2/sda-8 Not tainted 5.17.0-rc8-next-20220315-dirty #150
      [   74.798327] RIP: 0010:__jbd2_journal_unfile_buffer+0x3e/0x60
      [   74.801971] RSP: 0018:ffffa828c24a3cb8 EFLAGS: 00010202
      [   74.802694] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      [   74.803601] RDX: 0000000000000001 RSI: ffff9cfefe725d90 RDI: ffff9cfefe725d90
      [   74.804554] RBP: ffff9cfefe725d90 R08: 0000000000000000 R09: ffffa828c24a3b20
      [   74.805471] R10: 0000000000000001 R11: 0000000000000001 R12: ffff9cfefe725d90
      [   74.806385] R13: ffff9cfefe725d98 R14: 0000000000000000 R15: ffff9cfe833a4d00
      [   74.807301] FS:  0000000000000000(0000) GS:ffff9d01afb00000(0000) knlGS:0000000000000000
      [   74.808338] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   74.809084] CR2: 00007f2b81bf4000 CR3: 0000000100056000 CR4: 00000000000006e0
      [   74.810047] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   74.810981] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   74.811897] Call Trace:
      [   74.812241]  <TASK>
      [   74.812566]  __jbd2_journal_refile_buffer+0x12f/0x180
      [   74.813246]  jbd2_journal_refile_buffer+0x4c/0xa0
      [   74.813869]  jbd2_journal_commit_transaction.cold+0xa1/0x148
      [   74.817550]  kjournald2+0xf8/0x3e0
      [   74.819056]  kthread+0x153/0x1c0
      [   74.819963]  ret_from_fork+0x22/0x30
      
      The above issue may happen as follows:
              write                   truncate                   kjournald2
      generic_perform_write
       ext4_write_begin
        ext4_walk_page_buffers
         do_journal_get_write_access ->add BJ_Reserved list
       ext4_journalled_write_end
        ext4_walk_page_buffers
         write_end_fn
          ext4_handle_dirty_metadata
                      ***************JBD2 ABORT**************
           jbd2_journal_dirty_metadata
       -> return -EROFS, jh in reserved_list
                                                         jbd2_journal_commit_transaction
                                                          while (commit_transaction->t_reserved_list)
                                                            jh = commit_transaction->t_reserved_list;
                              truncate_pagecache_range
                               do_invalidatepage
      			  ext4_journalled_invalidatepage
      			   jbd2_journal_invalidatepage
      			    journal_unmap_buffer
      			     __dispose_buffer
      			      __jbd2_journal_unfile_buffer
      			       jbd2_journal_put_journal_head ->put last ref_count
      			        __journal_remove_journal_head
      				 bh->b_private = NULL;
      				 jh->b_bh = NULL;
      				                      jbd2_journal_refile_buffer(journal, jh);
      							bh = jh2bh(jh);
      							->bh is NULL, later will trigger null-ptr-deref
      				 journal_free_journal_head(jh);
      
      After commit 96f1e097, we no longer hold the j_state_lock while
      iterating over the list of reserved handles in
      jbd2_journal_commit_transaction().  This potentially allows the
      journal_head to be freed by journal_unmap_buffer while the commit
      codepath is also trying to free the BJ_Reserved buffers.  Keeping
      j_state_lock held while doing so extends the hold time of the lock
      only minimally, and solves this issue.
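
      The idea of keeping the lock held across the whole list walk can be
      sketched in userspace C. This is a simplified model, not the jbd2
      code: the demo_* types are hypothetical, and a small atomic_flag
      spinlock stands in for j_state_lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Tiny spinlock standing in for j_state_lock (hypothetical, demo only). */
struct demo_lock { atomic_flag held; };

static void demo_lock_acquire(struct demo_lock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ; /* spin until the holder releases */
}

static void demo_lock_release(struct demo_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

struct demo_jh { struct demo_jh *next; int id; };

struct demo_txn {
    struct demo_lock state_lock;
    struct demo_jh *reserved; /* stands in for t_reserved_list */
};

static void demo_txn_init(struct demo_txn *t)
{
    atomic_flag_clear(&t->state_lock.held);
    t->reserved = NULL;
}

/* Put a new entry on the reserved list; returns 0, or -1 if malloc fails. */
static int demo_reserve(struct demo_txn *t, int id)
{
    struct demo_jh *jh = malloc(sizeof(*jh));

    if (!jh)
        return -1;
    jh->id = id;
    demo_lock_acquire(&t->state_lock);
    jh->next = t->reserved;
    t->reserved = jh;
    demo_lock_release(&t->state_lock);
    return 0;
}

/* "Truncate" side: unlink one entry under the lock. Returns 1 if found. */
static int demo_unmap(struct demo_txn *t, int id)
{
    struct demo_jh **pp, *jh = NULL;
    int found;

    demo_lock_acquire(&t->state_lock);
    for (pp = &t->reserved; *pp; pp = &(*pp)->next) {
        if ((*pp)->id == id) {
            jh = *pp;
            *pp = jh->next;
            break;
        }
    }
    demo_lock_release(&t->state_lock);
    found = jh != NULL;
    free(jh);
    return found;
}

/* "Commit" side: the lock is held across both the peek at the list head
 * and the unlink, so no entry can be freed in between by demo_unmap(). */
static int demo_commit_drain(struct demo_txn *t)
{
    int n = 0;

    demo_lock_acquire(&t->state_lock);
    while (t->reserved) {
        struct demo_jh *jh = t->reserved;

        t->reserved = jh->next;
        free(jh);
        n++;
    }
    demo_lock_release(&t->state_lock);
    return n;
}
```

      Dropping the lock between reading t->reserved and using jh would
      reopen the window shown in the race diagram above.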
      
      Fixes: 96f1e097 ("jbd2: avoid long hold times of j_state_lock while committing a transaction")
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20220317142137.1821590-1-yebin10@huawei.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • fs: unset MNT_WRITE_HOLD on failure · 0014edae
      Christian Brauner authored
      After mnt_hold_writers() has been called we will always have set MNT_WRITE_HOLD
      and consequently we always need to pair mnt_hold_writers() with
      mnt_unhold_writers(). After the recent cleanup in [1] where Al switched from a
      do-while to a for loop the cleanup currently fails to unset MNT_WRITE_HOLD for
      the first mount that was changed. Fix this and make sure that the first mount
      will be cleaned up and add some comments to make it more obvious.
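
      The pairing rule can be sketched in userspace C. This is a model
      under stated assumptions, not the mount_setattr() code: the demo_*
      names are hypothetical, and a failing hold is simulated with a
      nonzero writers count.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_WRITE_HOLD 0x1u /* stands in for MNT_WRITE_HOLD */

struct demo_mount {
    unsigned int flags;
    int writers; /* active writers; nonzero makes the hold fail */
};

/* Always sets the flag, even when it then reports failure. */
static int demo_hold_writers(struct demo_mount *m)
{
    m->flags |= DEMO_WRITE_HOLD;
    return m->writers > 0 ? -EBUSY : 0;
}

static void demo_unhold_writers(struct demo_mount *m)
{
    m->flags &= ~DEMO_WRITE_HOLD;
}

/* Hold every mount; on failure undo all mounts touched so far. */
static int demo_hold_all(struct demo_mount *mnts, size_t n)
{
    size_t i, j;
    int err = 0;

    for (i = 0; i < n; i++) {
        err = demo_hold_writers(&mnts[i]);
        if (err)
            break;
    }
    if (err) {
        /* i is the failing index; walk 0..i inclusive so that neither the
         * first mount nor the failing one keeps the flag set. */
        for (j = 0; j <= i; j++)
            demo_unhold_writers(&mnts[j]);
    }
    return err;
}
```

      An off-by-one in the cleanup bounds is enough to leave one mount with
      the flag set, which is the class of bug the commit describes.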
      
      Link: https://lore.kernel.org/lkml/0000000000007cc21d05dd0432b8@google.com
      Link: https://lore.kernel.org/lkml/00000000000080e10e05dd043247@google.com
      Link: https://lore.kernel.org/r/20220420131925.2464685-1-brauner@kernel.org
      Fixes: e257039f ("mount_setattr(): clean the control flow and calling conventions") [1]
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Reported-by: <syzbot+10a16d1c43580983f6a2@syzkaller.appspotmail.com>
      Reported-by: <syzbot+306090cfa3294f0bbfb3@syzkaller.appspotmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
  2. Apr 20, 2022
    • cifs: destage any unwritten data to the server before calling copychunk_write · f5d0f921
      Ronnie Sahlberg authored
      because the copychunk_write might cover a region of the file that has not yet
      been sent to the server and thus fail.
      
      A simple way to reproduce this is:
      truncate -s 0 /mnt/testfile; strace -f -o x -ttT xfs_io -i -f -c 'pwrite 0k 128k' -c 'fcollapse 16k 24k' /mnt/testfile
      
      the issue is that the 'pwrite 0k 128k' becomes rearranged on the wire with
      the 'fcollapse 16k 24k' due to write-back caching.
      
      fcollapse is implemented in cifs.ko as a SMB2 IOCTL(COPYCHUNK_WRITE) call
      and it will fail serverside since the file is still 0b in size serverside
      until the writes have been destaged.
      To avoid this we must ensure that we destage any unwritten data to the
      server before calling COPYCHUNK_WRITE.
      
      Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1997373
      Reported-by: Xiaoli Feng <xifeng@redhat.com>
      Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
    • cifs: use correct lock type in cifs_reconnect() · cd70a3e8
      Paulo Alcantara authored
      
      TCP_Server_Info::origin_fullpath and TCP_Server_Info::leaf_fullpath
      are protected by refpath_lock mutex and not cifs_tcp_ses_lock
      spinlock.
      
      Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
      Cc: stable@vger.kernel.org
      Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
    • cifs: fix NULL ptr dereference in refresh_mounts() · 41f10081
      Paulo Alcantara authored
      
      Either mount(2) or automount might not have server->origin_fullpath
      set yet while refresh_cache_worker() is attempting to refresh DFS
      referrals.  Add missing NULL check and locking around it.
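
      The pattern of checking for NULL under the protecting lock before
      comparing can be sketched in userspace C. This is a minimal model
      with hypothetical demo_* names; a small atomic_flag spinlock stands
      in for the refpath_lock mutex, and the path strings are made up.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <strings.h>

/* Tiny spinlock standing in for refpath_lock (hypothetical, demo only). */
struct demo_lock { atomic_flag held; };

static void demo_lock_acquire(struct demo_lock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ; /* spin until the holder releases */
}

static void demo_lock_release(struct demo_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

struct demo_server {
    struct demo_lock refpath_lock;
    const char *origin_fullpath; /* may still be NULL while mounting */
};

static void demo_server_init(struct demo_server *s, const char *path)
{
    atomic_flag_clear(&s->refpath_lock.held);
    s->origin_fullpath = path;
}

/* Compare only if the path has been set; both the NULL check and the
 * comparison happen while the lock is held. */
static bool demo_origin_matches(struct demo_server *s, const char *path)
{
    bool match;

    assert(path != NULL);
    demo_lock_acquire(&s->refpath_lock);
    match = s->origin_fullpath != NULL &&
            strcasecmp(s->origin_fullpath, path) == 0;
    demo_lock_release(&s->refpath_lock);
    return match;
}
```

      Without the NULL check, strcasecmp() would be handed a NULL pointer,
      which matches the faulting function in the trace below.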
      
      This fixes the crash below:
      
      [ 1070.276835] general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN NOPTI
      [ 1070.277676] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
      [ 1070.278219] CPU: 1 PID: 8506 Comm: kworker/u8:1 Not tainted 5.18.0-rc3 #10
      [ 1070.278701] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014
      [ 1070.279495] Workqueue: cifs-dfscache refresh_cache_worker [cifs]
      [ 1070.280044] RIP: 0010:strcasecmp+0x34/0x150
      [ 1070.280359] Code: 00 00 00 fc ff df 41 54 55 48 89 fd 53 48 83 ec 10 eb 03 4c 89 fe 48 89 ef 48 83 c5 01 48 89 f8 48 89 fa 48 c1 e8 03 83 e2 07 <42> 0f b6 04 28 38 d0 7f 08 84 c0 0f 85 bc 00 00 00 0f b6 45 ff 44
      [ 1070.281729] RSP: 0018:ffffc90008367958 EFLAGS: 00010246
      [ 1070.282114] RAX: 0000000000000000 RBX: dffffc0000000000 RCX: 0000000000000000
      [ 1070.282691] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
      [ 1070.283273] RBP: 0000000000000001 R08: 0000000000000000 R09: ffffffff873eda27
      [ 1070.283857] R10: ffffc900083679a0 R11: 0000000000000001 R12: ffff88812624c000
      [ 1070.284436] R13: dffffc0000000000 R14: ffff88810e6e9a88 R15: ffff888119bb9000
      [ 1070.284990] FS:  0000000000000000(0000) GS:ffff888151200000(0000) knlGS:0000000000000000
      [ 1070.285625] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1070.286100] CR2: 0000561a4d922418 CR3: 000000010aecc000 CR4: 0000000000350ee0
      [ 1070.286683] Call Trace:
      [ 1070.286890]  <TASK>
      [ 1070.287070]  refresh_cache_worker+0x895/0xd20 [cifs]
      [ 1070.287475]  ? __refresh_tcon.isra.0+0xfb0/0xfb0 [cifs]
      [ 1070.287905]  ? __lock_acquire+0xcd1/0x6960
      [ 1070.288247]  ? is_dynamic_key+0x1a0/0x1a0
      [ 1070.288591]  ? lockdep_hardirqs_on_prepare+0x410/0x410
      [ 1070.289012]  ? lock_downgrade+0x6f0/0x6f0
      [ 1070.289318]  process_one_work+0x7bd/0x12d0
      [ 1070.289637]  ? worker_thread+0x160/0xec0
      [ 1070.289970]  ? pwq_dec_nr_in_flight+0x230/0x230
      [ 1070.290318]  ? _raw_spin_lock_irq+0x5e/0x90
      [ 1070.290619]  worker_thread+0x5ac/0xec0
      [ 1070.290891]  ? process_one_work+0x12d0/0x12d0
      [ 1070.291199]  kthread+0x2a5/0x350
      [ 1070.291430]  ? kthread_complete_and_exit+0x20/0x20
      [ 1070.291770]  ret_from_fork+0x22/0x30
      [ 1070.292050]  </TASK>
      [ 1070.292223] Modules linked in: bpfilter cifs cifs_arc4 cifs_md4
      [ 1070.292765] ---[ end trace 0000000000000000 ]---
      [ 1070.293108] RIP: 0010:strcasecmp+0x34/0x150
      [ 1070.293471] Code: 00 00 00 fc ff df 41 54 55 48 89 fd 53 48 83 ec 10 eb 03 4c 89 fe 48 89 ef 48 83 c5 01 48 89 f8 48 89 fa 48 c1 e8 03 83 e2 07 <42> 0f b6 04 28 38 d0 7f 08 84 c0 0f 85 bc 00 00 00 0f b6 45 ff 44
      [ 1070.297718] RSP: 0018:ffffc90008367958 EFLAGS: 00010246
      [ 1070.298622] RAX: 0000000000000000 RBX: dffffc0000000000 RCX: 0000000000000000
      [ 1070.299428] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
      [ 1070.300296] RBP: 0000000000000001 R08: 0000000000000000 R09: ffffffff873eda27
      [ 1070.301204] R10: ffffc900083679a0 R11: 0000000000000001 R12: ffff88812624c000
      [ 1070.301932] R13: dffffc0000000000 R14: ffff88810e6e9a88 R15: ffff888119bb9000
      [ 1070.302645] FS:  0000000000000000(0000) GS:ffff888151200000(0000) knlGS:0000000000000000
      [ 1070.303462] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1070.304131] CR2: 0000561a4d922418 CR3: 000000010aecc000 CR4: 0000000000350ee0
      [ 1070.305004] Kernel panic - not syncing: Fatal exception
      [ 1070.305711] Kernel Offset: disabled
      [ 1070.305971] ---[ end Kernel panic - not syncing: Fatal exception ]---
      
      Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
      Cc: stable@vger.kernel.org
      Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
    • Revert "fs/pipe: use kvcalloc to allocate a pipe_buffer array" · 906f9040
      Linus Torvalds authored
      This reverts commit 5a519c8f.
      
      It turns out that making the pipe almost arbitrarily large has some
      rather unexpected downsides.  The kernel test robot reports a kernel
      warning that is due to pipe->max_usage now growing to the point where
      the iter_file_splice_write() buffer allocation can no longer be
      satisfied as a slab allocation, and the
      
              int nbufs = pipe->max_usage;
              struct bio_vec *array = kcalloc(nbufs, sizeof(struct bio_vec),
                                              GFP_KERNEL);
      
      code sequence there will now always fail as a result.
      
      That code could be modified to use kvcalloc() too, but I feel very
      uncomfortable making those kinds of changes for a very niche use case
      that really should have other options than make these kinds of
      fundamental changes to pipe behavior.
      
      Maybe the CRIU process dumping should be multi-threaded, and use
      multiple pipes and multiple cores, rather than try to use one larger
      pipe to minimize splice() calls.
      
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Link: https://lore.kernel.org/all/20220420073717.GD16310@xsang-OptiPlex-9020/
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. Apr 19, 2022
    • fs: fix acl translation · 705191b0
      Christian Brauner authored
      Last cycle we extended the idmapped mounts infrastructure to support
      idmapped mounts of idmapped filesystems (no such filesystem exists yet).
      Since then, the meaning of an idmapped mount is a mount whose idmapping
      is different from the filesystem's idmapping.

      While doing that work we failed to adapt the acl translation helpers.
      They still assume that checking for the identity mapping is enough.  But
      they need to use the no_idmapping() helper instead.
      
      Note, POSIX ACLs are always translated right at the userspace-kernel
      boundary using the caller's current idmapping and the initial idmapping.
      The order depends on whether we're coming from or going to userspace.
      The filesystem's idmapping doesn't matter at the border.
      
      Consequently, if a non-idmapped mount is passed we need to make sure to
      always pass the initial idmapping as the mount's idmapping and not the
      filesystem idmapping.  Since it's irrelevant here it would yield invalid
      ids and prevent setting acls for filesystems that are mountable in a
      userns and support posix acls (tmpfs and fuse).
      
      I verified the regression reported in [1] and verified that this patch
      fixes it.  A regression test will be added to xfstests in parallel.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=215849 [1]
      Fixes: bd303368 ("fs: support mapped mounts of mapped filesystems")
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: <stable@vger.kernel.org> # 5.17
      Cc: <regressions@lists.linux.dev>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. Apr 18, 2022
  5. Apr 17, 2022
  6. Apr 16, 2022
  7. Apr 15, 2022
  8. Apr 14, 2022
  9. Apr 13, 2022
  10. Apr 12, 2022
    • ext4: limit length to bitmap_maxbytes - blocksize in punch_hole · 2da37622
      Tadeusz Struk authored
      Syzbot found an issue [1] in ext4_fallocate().
      The C reproducer [2] calls fallocate(), passing size 0xffeffeff000ul,
      and offset 0x1000000ul, which, when added together exceed the
      bitmap_maxbytes for the inode. This triggers a BUG in
      ext4_ind_remove_space(). According to the comments in this function
      the 'end' parameter needs to be one block after the last block to be
      removed. In the case when the BUG is triggered it points to the last
      block. Modify the ext4_punch_hole() function and add a constraint that
      caps the length so that this requirement on 'end' is satisfied.
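
      The cap named in the subject line (bitmap_maxbytes - blocksize) can
      be sketched as plain arithmetic in userspace C. The function name
      and the values used with it are hypothetical; the sketch only shows
      the clamping step, not the rest of ext4_punch_hole().

```c
#include <assert.h>
#include <stdint.h>

/* Clamp length so that offset + length never exceeds
 * bitmap_maxbytes - blocksize. */
static uint64_t demo_cap_punch_length(uint64_t offset, uint64_t length,
                                      uint64_t bitmap_maxbytes,
                                      uint64_t blocksize)
{
    uint64_t max_length;

    assert(blocksize <= bitmap_maxbytes);
    max_length = bitmap_maxbytes - blocksize;
    assert(offset <= max_length);

    /* Written as a subtraction so that offset + length cannot overflow. */
    if (length > max_length - offset)
        length = max_length - offset;
    return length;
}
```

      For example, with a limit of 1000 bytes and a block size of 100, a
      request for 900 bytes at offset 200 is reduced to 700 bytes, while a
      request for 100 bytes is left untouched.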
      
      LINK: [1] https://syzkaller.appspot.com/bug?id=b80bd9cf348aac724a4f4dff251800106d721331
      LINK: [2] https://syzkaller.appspot.com/text?tag=ReproC&x=14ba0238700000
      
      Fixes: a4bb6b64 ("ext4: enable "punch hole" functionality")
      Reported-by: <syzbot+7a806094edd5d07ba029@syzkaller.appspotmail.com>
      Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org>
      Link: https://lore.kernel.org/r/20220331200515.153214-1-tadeusz.struk@linaro.org
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
    • ext4: fix use-after-free in ext4_search_dir · c186f088
      Ye Bin authored
      
      We hit the following issue:
      EXT4-fs (loop0): mounted filesystem without journal. Opts: ,errors=continue
      ==================================================================
      BUG: KASAN: use-after-free in ext4_search_dir fs/ext4/namei.c:1394 [inline]
      BUG: KASAN: use-after-free in search_dirblock fs/ext4/namei.c:1199 [inline]
      BUG: KASAN: use-after-free in __ext4_find_entry+0xdca/0x1210 fs/ext4/namei.c:1553
      Read of size 1 at addr ffff8881317c3005 by task syz-executor117/2331
      
      CPU: 1 PID: 2331 Comm: syz-executor117 Not tainted 5.10.0+ #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      Call Trace:
       __dump_stack lib/dump_stack.c:83 [inline]
       dump_stack+0x144/0x187 lib/dump_stack.c:124
       print_address_description+0x7d/0x630 mm/kasan/report.c:387
       __kasan_report+0x132/0x190 mm/kasan/report.c:547
       kasan_report+0x47/0x60 mm/kasan/report.c:564
       ext4_search_dir fs/ext4/namei.c:1394 [inline]
       search_dirblock fs/ext4/namei.c:1199 [inline]
       __ext4_find_entry+0xdca/0x1210 fs/ext4/namei.c:1553
       ext4_lookup_entry fs/ext4/namei.c:1622 [inline]
       ext4_lookup+0xb8/0x3a0 fs/ext4/namei.c:1690
       __lookup_hash+0xc5/0x190 fs/namei.c:1451
       do_rmdir+0x19e/0x310 fs/namei.c:3760
       do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x445e59
      Code: 4d c7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 1b c7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fff2277fac8 EFLAGS: 00000246 ORIG_RAX: 0000000000000054
      RAX: ffffffffffffffda RBX: 0000000000400280 RCX: 0000000000445e59
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000200000c0
      RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000002
      R10: 00007fff2277f990 R11: 0000000000000246 R12: 0000000000000000
      R13: 431bde82d7b634db R14: 0000000000000000 R15: 0000000000000000
      
      The buggy address belongs to the page:
      page:0000000048cd3304 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x1317c3
      flags: 0x200000000000000()
      raw: 0200000000000000 ffffea0004526588 ffffea0004528088 0000000000000000
      raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff8881317c2f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       ffff8881317c2f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      >ffff8881317c3000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                         ^
       ffff8881317c3080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
       ffff8881317c3100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      ==================================================================
      
      ext4_search_dir:
        ...
        de = (struct ext4_dir_entry_2 *)search_buf;
        dlimit = search_buf + buf_size;
        while ((char *) de < dlimit) {
        ...
          if ((char *) de + de->name_len <= dlimit &&
      	 ext4_match(dir, fname, de)) {
      	    ...
          }
        ...
          de_len = ext4_rec_len_from_disk(de->rec_len, dir->i_sb->s_blocksize);
          if (de_len <= 0)
            return -1;
          offset += de_len;
          de = (struct ext4_dir_entry_2 *) ((char *) de + de_len);
        }
      
      Assume:
      de=0xffff8881317c2fff
      dlimit=0xffff8881317c3000

      Reading 'de->name_len', which is at address 0xffff8881317c3005, is then
      obviously out of range and triggers the use-after-free.
      To solve this issue, 'dlimit' must reserve 8 bytes, because
      'de->name_len' is read to judge whether '(char *) de + de->name_len' is
      out of range.
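
      The need to reserve the 8-byte header before reading any field can
      be sketched in userspace C. This is a simplified model rather than
      the ext4 code: the record layout (4-byte inode, 2-byte little-endian
      rec_len, 1-byte name_len, 1-byte file_type, then the name) is an
      assumption made for the example, and all demo_* names are
      hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_HDR_LEN 8u /* inode(4) + rec_len(2) + name_len(1) + file_type(1) */

/* Write one record at off: header fields, then the name bytes. */
static void demo_put_entry(uint8_t *buf, size_t off, uint16_t rec_len,
                           const char *name)
{
    size_t n = strlen(name);

    assert(n <= 255 && DEMO_HDR_LEN + n <= rec_len);
    memset(buf + off, 0, rec_len);
    buf[off] = 1;                             /* nonzero inode number */
    buf[off + 4] = (uint8_t)(rec_len & 0xff); /* rec_len, little-endian */
    buf[off + 5] = (uint8_t)(rec_len >> 8);
    buf[off + 6] = (uint8_t)n;                /* name_len */
    memcpy(buf + off + DEMO_HDR_LEN, name, n);
}

/* Return the offset of the entry called name, or -1 if there is none. */
static long demo_search(const uint8_t *buf, size_t size, const char *name)
{
    size_t want = strlen(name);
    size_t off = 0;

    /* The whole fixed header must lie inside the buffer before any of its
     * fields is read; checking only off < size would allow a read past
     * the end when fewer than DEMO_HDR_LEN bytes remain. */
    while (size >= DEMO_HDR_LEN && off <= size - DEMO_HDR_LEN) {
        size_t rec_len = (size_t)buf[off + 4] | ((size_t)buf[off + 5] << 8);
        size_t name_len = buf[off + 6];

        if (rec_len < DEMO_HDR_LEN || rec_len > size - off)
            return -1; /* corrupt record length */
        if (name_len == want && DEMO_HDR_LEN + name_len <= rec_len &&
            memcmp(buf + off + DEMO_HDR_LEN, name, want) == 0)
            return (long)off;
        off += rec_len;
    }
    return -1;
}
```

      In the scenario above, only one byte remains before the limit, so
      the loop condition fails and name_len is never read.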
      
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20220324064816.1209985-1-yebin10@huawei.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
    • ext4: fix bug_on in start_this_handle during umount filesystem · b98535d0
      Ye Bin authored
      
      We hit the following issue:
      ------------[ cut here ]------------
      kernel BUG at fs/jbd2/transaction.c:389!
      invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
      CPU: 9 PID: 131 Comm: kworker/9:1 Not tainted 5.17.0-862.14.0.6.x86_64-00001-g23f87daf7d74-dirty #197
      Workqueue: events flush_stashed_error_work
      RIP: 0010:start_this_handle+0x41c/0x1160
      RSP: 0018:ffff888106b47c20 EFLAGS: 00010202
      RAX: ffffed10251b8400 RBX: ffff888128dc204c RCX: ffffffffb52972ac
      RDX: 0000000000000200 RSI: 0000000000000004 RDI: ffff888128dc2050
      RBP: 0000000000000039 R08: 0000000000000001 R09: ffffed10251b840a
      R10: ffff888128dc204f R11: ffffed10251b8409 R12: ffff888116d78000
      R13: 0000000000000000 R14: dffffc0000000000 R15: ffff888128dc2000
      FS:  0000000000000000(0000) GS:ffff88839d680000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000001620068 CR3: 0000000376c0e000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       jbd2__journal_start+0x38a/0x790
       jbd2_journal_start+0x19/0x20
       flush_stashed_error_work+0x110/0x2b3
       process_one_work+0x688/0x1080
       worker_thread+0x8b/0xc50
       kthread+0x26f/0x310
       ret_from_fork+0x22/0x30
       </TASK>
      Modules linked in:
      ---[ end trace 0000000000000000 ]---
      
      The above issue may happen as follows:
            umount            read procfs            error_work
      ext4_put_super
        flush_work(&sbi->s_error_work);
      
                            ext4_mb_seq_groups_show
      	                ext4_mb_load_buddy_gfp
      			  ext4_mb_init_group
      			    ext4_mb_init_cache
      	                      ext4_read_block_bitmap_nowait
      			        ext4_validate_block_bitmap
      				  ext4_error
      			            ext4_handle_error
      			              schedule_work(&EXT4_SB(sb)->s_error_work);
      
        ext4_unregister_sysfs(sb);
        jbd2_journal_destroy(sbi->s_journal);
          journal_kill_thread
            journal->j_flags |= JBD2_UNMOUNT;
      
                                                flush_stashed_error_work
      				            jbd2_journal_start
      					      start_this_handle
      					        BUG_ON(journal->j_flags & JBD2_UNMOUNT);
      
      To solve this issue, we call ext4_unregister_sysfs() before flushing
      s_error_work in ext4_put_super().
      
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
      Link: https://lore.kernel.org/r/20220322012419.725457-1-yebin10@huawei.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: fix symlink file size not match to file content · a2b0b205
      Ye Bin authored
      
      We hit the following issue:
      [home]# fsck.ext4  -fn  ram0yb
      e2fsck 1.45.6 (20-Mar-2020)
      Pass 1: Checking inodes, blocks, and sizes
      Pass 2: Checking directory structure
      Symlink /p3/d14/d1a/l3d (inode #3494) is invalid.
      Clear? no
      Entry 'l3d' in /p3/d14/d1a (3383) has an incorrect filetype (was 7, should be 0).
      Fix? no
      
      The symlink file size does not match the file content. If the writeback
      of the symlink data block failed, ext4_finish_bio() handles the end of IO.
      However this function fails to mark the buffer with BH_write_io_error and
      so when unmount does journal checkpoint it cannot detect the writeback
      error and will cleanup the journal. Thus we've lost the correct data in the
      journal area. To solve this issue, mark the buffer as BH_write_io_error in
      ext4_finish_bio().
      
      Cc: stable@kernel.org
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20220321144438.201685-1-yebin10@huawei.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: fix fallocate to use file_modified to update permissions consistently · ad5cd4f4
      Darrick J. Wong authored
      
      Since the initial introduction of (posix) fallocate back at the turn of
      the century, it has been possible to use this syscall to change the
      user-visible contents of files.  This can happen by extending the file
      size during a preallocation, or through any of the newer modes (punch,
      zero, collapse, insert range).  Because the call can be used to change
      file contents, we should treat it like we do any other modification to a
      file -- update the mtime, and drop set[ug]id privileges/capabilities.
      
      The VFS function file_modified() does all this for us if we pass it a
      locked inode, so let's make fallocate drop permissions correctly.
      
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Link: https://lore.kernel.org/r/20220308185043.GA117678@magnolia
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
    • stat: fix inconsistency between struct stat and struct compat_stat · 932aba1e
      Mikulas Patocka authored
      struct stat (defined in arch/x86/include/uapi/asm/stat.h) has 32-bit
      st_dev and st_rdev; struct compat_stat (defined in
      arch/x86/include/asm/compat.h) has 16-bit st_dev and st_rdev followed by
      a 16-bit padding.
      
      This patch fixes struct compat_stat to match struct stat.
      
      [ Historical note: the old x86 'struct stat' did have that 16-bit field
        that the compat layer had kept around, but it was changed back in 2003
        by "struct stat - support larger dev_t":

          https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=e95b2065677fe32512a597a79db94b77b90c968d

        and back in those days, the x86_64 port was still new, and separate
        from the i386 code, and had already picked up the old version with a
        16-bit st_dev field ]
      
      Note that we can't change compat_dev_t because it is used by
      compat_loop_info.
      
      Also, if the st_dev and st_rdev values are 32-bit, we don't have to use
      old_valid_dev to test if the value fits into them.  This fixes
      -EOVERFLOW on filesystems that are on NVMe because NVMe uses the major
      number 259.
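
      Why major number 259 overflows a 16-bit device field can be sketched
      in userspace C. The helpers below are modelled on the kernel's
      old_valid_dev() and new_encode_dev(), but the demo_* names and the
      split into separate major and minor arguments are assumptions made
      for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A 16-bit device number has room for an 8-bit major and an 8-bit minor
 * only, so both values must be below 256 to fit. */
static bool demo_old_valid_dev(unsigned int major, unsigned int minor)
{
    return major < 256 && minor < 256;
}

/* 32-bit encoding: low 8 bits of the minor, then the major, then the
 * remaining minor bits above bit 20. */
static uint32_t demo_new_encode_dev(unsigned int major, unsigned int minor)
{
    return (minor & 0xffu) | (major << 8) | ((minor & ~0xffu) << 12);
}
```

      demo_old_valid_dev(259, 0) is false, which corresponds to the
      -EOVERFLOW case; with 32-bit fields the same device encodes without
      loss, so the validity test is no longer needed.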
      
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Andreas Schwab <schwab@linux-m68k.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      932aba1e
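The -EOVERFLOW case described in the commit above can be illustrated with a small userspace sketch. It assumes the old 16-bit device encoding gives 8 bits each to major and minor, and that the 32-bit encoding allows a 12-bit major and a 20-bit minor. The helper names are hypothetical stand-ins, not the kernel's own functions.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a check like old_valid_dev(): a 16-bit
 * device field leaves 8 bits for the major and 8 bits for the minor. */
static bool fits_old_16bit_dev(unsigned int major, unsigned int minor)
{
    return major < 256 && minor < 256;
}

/* Assumed 32-bit layout: 12-bit major, 20-bit minor. */
static bool fits_new_32bit_dev(unsigned int major, unsigned int minor)
{
    return major < (1u << 12) && minor < (1u << 20);
}
```

With major number 259, the 16-bit check fails while the 32-bit check passes, which is why widening st_dev and st_rdev removes the overflow error for such devices.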
    • io_uring: verify pad field is 0 in io_get_ext_arg · d2347b96
      Dylan Yudaken authored
      Ensure that only 0 is passed for pad here.
      
      Fixes: c73ebb68 ("io_uring: add timeout support for io_uring_enter()")
      Signed-off-by: Dylan Yudaken <dylany@fb.com>
      Link: https://lore.kernel.org/r/20220412163042.2788062-5-dylany@fb.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d2347b96
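The kind of check this commit describes, rejecting any non-zero value in a reserved field, might look roughly like the sketch below. The struct layout and function name are hypothetical and only loosely modelled on the argument mentioned in the commit title. The real kernel code first copies the structure from user space, which is omitted here.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical argument layout with a reserved padding field. */
struct ext_arg_sketch {
    uint64_t sigmask;
    uint32_t sigmask_sz;
    uint32_t pad;   /* reserved: callers must pass 0 */
    uint64_t ts;
};

/* Reject the argument if the reserved field carries anything but 0,
 * so the field can be given a meaning later without breaking callers. */
static int check_ext_arg_sketch(const struct ext_arg_sketch *arg)
{
    if (arg->pad != 0)
        return -EINVAL;
    return 0;
}
```

The same pattern applies to the resv and resv2 checks in the commits that follow.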
    • io_uring: verify resv is 0 in ringfd register/unregister · 6fb53cf8
      Dylan Yudaken authored
      Only allow resv field to be 0 in struct io_uring_rsrc_update user
      arguments.
      
      Fixes: e7a6c00d ("io_uring: add support for registering ring file descriptors")
      Signed-off-by: Dylan Yudaken <dylany@fb.com>
      Link: https://lore.kernel.org/r/20220412163042.2788062-4-dylany@fb.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6fb53cf8
    • io_uring: verify that resv2 is 0 in io_uring_rsrc_update2 · d8a3ba9c
      Dylan Yudaken authored
      Verify that the user does not pass in anything but 0 for this field.
      
      Fixes: 992da01a ("io_uring: change registration/upd/rsrc tagging ABI")
      Signed-off-by: Dylan Yudaken <dylany@fb.com>
      Link: https://lore.kernel.org/r/20220412163042.2788062-3-dylany@fb.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d8a3ba9c
    • io_uring: move io_uring_rsrc_update2 validation · 565c5e61
      Dylan Yudaken authored
      
      Move validation to be more consistently straight after
      copy_from_user. This is already done in io_register_rsrc_update and so
      this removes that redundant check.
      
      Signed-off-by: Dylan Yudaken <dylany@fb.com>
      Link: https://lore.kernel.org/r/20220412163042.2788062-2-dylany@fb.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      565c5e61
    • io_uring: fix assign file locking issue · 0f8da75b
      Pavel Begunkov authored
      The io-wq work cancellation path can't take uring_lock the way file
      assignment does, so we have to handle IO_WQ_WORK_CANCEL first. This
      fixes the hangs that were encountered.
      
      Fixes: 6bf9c47a ("io_uring: defer file assignment")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/0d9b9f37841645518503f6a207e509d14a286aba.1649773463.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0f8da75b
  11. Apr 11, 2022
    • io_uring: stop using io_wq_work as an fd placeholder · 82733d16
      Jens Axboe authored
      There are two reasons why this isn't the best idea:
      
      - It's an odd area to grab a bit of storage space, hence it's an odd area
        to grab storage from.
      - It puts the 3rd io_kiocb cacheline into the hot path, where normal hot
        path just needs the first two.
      
      Use 'cflags' for joint fd/cflags storage. We only need fd until we
      successfully issue, and we only need cflags once a request is done and is
      completed.
      
      Fixes: 6bf9c47a ("io_uring: defer file assignment")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      82733d16
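The joint fd/cflags storage described in the commit above relies on the two values never being needed at the same time. A minimal sketch of that idea, using a hypothetical and heavily reduced request structure rather than the real io_kiocb, could look like this:

```c
#include <stdint.h>

/* Hypothetical request structure: fd and cflags share one slot. */
struct req_sketch {
    union {
        int fd;          /* needed only until the request is issued */
        uint32_t cflags; /* needed only once the request has completed */
    };
    int issued;
};

/* Before issue: the shared slot holds the file descriptor. */
static void req_prep(struct req_sketch *req, int fd)
{
    req->fd = fd;
    req->issued = 0;
}

/* At issue time the fd is read out; the slot is free afterwards. */
static int req_issue(struct req_sketch *req)
{
    int fd = req->fd;

    req->issued = 1;
    return fd;
}

/* At completion the same slot is reused for the completion flags. */
static void req_complete(struct req_sketch *req, uint32_t cflags)
{
    req->cflags = cflags;
}
```

The design choice is a lifetime-based union: because the two fields are live in disjoint phases of the request, sharing the slot costs nothing and keeps the data in an already hot part of the structure.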
    • io_uring: move apoll->events cache · 2804ecd8
      Jens Axboe authored
      In preparation for fixing a regression with pulling in an extra cacheline
      for IO that doesn't usually touch the last cacheline of the io_kiocb,
      move the cached location of apoll->events to space shared with some other
      completion data. Like cflags, this isn't used until after the request
      has been completed, so we can piggy back on top of comp_list.
      
      Fixes: 81459350 ("io_uring: cache req->apoll->events in req->cflags")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2804ecd8
    • io_uring: io_kiocb_update_pos() should not touch file for non -1 offset · 6f83ab22
      Jens Axboe authored
      -1 tells us to use the current position, but we check if the file is
      a stream regardless of that. Fix up io_kiocb_update_pos() to only
      dip into file if we need to. This is both more efficient and also drops
      12 bytes of text on aarch64 and 64 bytes on x86-64.
      
      Fixes: b4aec400 ("io_uring: do not recalculate ppos unnecessarily")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6f83ab22
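The reordering described in the commit above, checking the offset before looking at the file, can be sketched in userspace as follows. All types and names here are hypothetical simplifications. The lookups counter exists only to make visible whether the file was touched.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical file object; lookups counts how often it is inspected. */
struct file_sketch {
    bool is_stream;
    int64_t f_pos;
    int lookups;
};

struct rw_sketch {
    struct file_sketch *file;
    int64_t ki_pos;
};

static bool file_is_stream(struct file_sketch *f)
{
    f->lookups++;
    return f->is_stream;
}

/* Returns the position to use, or NULL for a stream without one. */
static int64_t *update_pos_sketch(struct rw_sketch *rw)
{
    if (rw->ki_pos != -1)
        return &rw->ki_pos;           /* explicit offset: file untouched */

    if (!file_is_stream(rw->file)) {
        rw->ki_pos = rw->file->f_pos; /* -1: use the current position */
        return &rw->ki_pos;
    }

    rw->ki_pos = 0;
    return NULL;
}
```

With an explicit offset the function returns before the file is dereferenced, which is the saving the commit message refers to.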
  12. Apr 10, 2022
    • io_uring: flag the fact that linked file assignment is sane · c4212f3e
      Jens Axboe authored
      Give applications a way to tell if the kernel supports sane linked files,
      as in files being assigned at the right time to be able to reliably
      do <open file direct into slot X><read file from slot X> while using
      IOSQE_IO_LINK to order them.
      
      Not really a bug fix, but flag it as such so that it gets pulled in with
      backports of the deferred file assignment.
      
      Fixes: 6bf9c47a ("io_uring: defer file assignment")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c4212f3e
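An application could test for such a capability by checking a bit in the feature mask that the kernel reports at ring setup. The sketch below shows only the bit test. The macro name and bit position are placeholders, since the commit text above names neither. The real definition should be taken from the io_uring UAPI header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder flag: name and bit position are assumptions for this
 * sketch, not taken from the commit text. */
#define SKETCH_FEAT_LINKED_FILE (1U << 12)

/* features is the feature mask reported by the kernel at ring setup. */
static bool supports_linked_file(uint32_t features)
{
    return (features & SKETCH_FEAT_LINKED_FILE) != 0;
}
```

If the bit is absent, an application would need to avoid relying on an open-into-slot request followed by a linked read from the same slot.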
  13. Apr 08, 2022