  1. Jun 27, 2022
  2. Jun 17, 2022
    • block: remove per-disk debugfs files in blk_unregister_queue · 99d055b4
      Christoph Hellwig authored
      
      The block debugfs files are created in blk_register_queue, which is
      called by add_disk, and use a naming scheme based on the disk_name.
      After del_gendisk returns, that name can be reused, so we must not
      leave these debugfs files around; otherwise the kernel is unhappy
      and spews messages like:
      
      	Directory XXXXX with parent 'block' already present!
      
      and the newly created devices will not have working debugfs files.
      
      Move the unregistration to blk_unregister_queue instead (which matches
      the sysfs unregistration) to make sure the debugfs lifetime rules match
      those of the disk name.
      
      As part of the move also make sure the whole debugfs unregistration is
      inside a single debugfs_mutex critical section.
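
      As a rough illustration of the lifetime rule described here, the
      following is a minimal userspace sketch in Python, not kernel code.
      The DebugfsRoot class and its methods are hypothetical names made up
      for this example: a dict stands in for the 'block' debugfs directory
      and a threading.Lock plays the role of debugfs_mutex.

```python
import threading


class DebugfsRoot:
    """Toy stand-in for the 'block' debugfs directory (not kernel code)."""

    def __init__(self):
        self._dirs = {}                 # disk_name -> set of file names
        self._mutex = threading.Lock()  # plays the role of debugfs_mutex

    def register(self, disk_name, files):
        """Create the per-disk directory; fail if the name is still present."""
        with self._mutex:
            if disk_name in self._dirs:
                raise FileExistsError(
                    f"Directory {disk_name} with parent 'block' already present!"
                )
            self._dirs[disk_name] = set(files)

    def unregister(self, disk_name):
        """Remove the directory and all its files in one critical section."""
        with self._mutex:
            self._dirs.pop(disk_name, None)

    def exists(self, disk_name):
        with self._mutex:
            return disk_name in self._dirs
```

      In this model, a disk name can only be registered again once
      unregister() has removed the old directory, which mirrors why the
      removal has to happen before the name becomes reusable.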
      
      Note that this breaks blktests block/002, which checks that the debugfs
      directory has not been removed while blktests is running, but that
      particular check should simply be removed from the test case.
      
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20220614074827.458955-4-hch@lst.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: serialize all debugfs operations using q->debugfs_mutex · 5cf9c91b
      Christoph Hellwig authored
      
      Various places like I/O schedulers or the QOS infrastructure try to
      register debugfs files on demand, which can race with creating and
      removing the main queue debugfs directory.  Use the existing
      debugfs_mutex to serialize all debugfs operations that rely on
      q->debugfs_dir or the directories hanging off it.
      
      To make the teardown code a little simpler, declare all debugfs dentry
      pointers, and not just the main one, unconditionally in blkdev.h.
      
      Move debugfs_mutex next to the dentries that it protects and document
      what it is used for.
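
      The serialization idea can be sketched in userspace Python as follows.
      This is an analogy under stated assumptions, not the kernel
      implementation: the RequestQueue class, its method names, and the use
      of plain dicts for directories are all hypothetical. Only the field
      names debugfs_mutex and debugfs_dir come from the commit message.

```python
import threading


class RequestQueue:
    """Toy model of a request queue's debugfs state (illustrative only)."""

    def __init__(self):
        self.debugfs_mutex = threading.Lock()
        self.debugfs_dir = None        # main per-queue directory
        self.sched_debugfs_dir = None  # directory hanging off debugfs_dir

    def register_queue(self):
        with self.debugfs_mutex:
            self.debugfs_dir = {}

    def register_sched(self):
        """On-demand registration; returns False if the main dir is gone."""
        with self.debugfs_mutex:
            if self.debugfs_dir is None:
                return False
            self.sched_debugfs_dir = {}
            self.debugfs_dir["sched"] = self.sched_debugfs_dir
            return True

    def unregister_queue(self):
        """Tear down every directory inside one critical section."""
        with self.debugfs_mutex:
            self.sched_debugfs_dir = None
            self.debugfs_dir = None
```

      Because every operation takes the same lock, an on-demand
      registration either completes before the teardown (and is then
      removed by it) or observes that the main directory is gone and does
      nothing.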
      
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20220614074827.458955-3-hch@lst.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. Mar 09, 2022
  4. Feb 28, 2022
    • blk-crypto: show crypto capabilities in sysfs · 20f01f16
      Eric Biggers authored
      
      Add sysfs files that expose the inline encryption capabilities of
      request queues:
      
      	/sys/block/$disk/queue/crypto/max_dun_bits
      	/sys/block/$disk/queue/crypto/modes/$mode
      	/sys/block/$disk/queue/crypto/num_keyslots
      
      Userspace can use these new files to decide what encryption settings to
      use, or whether to use inline encryption at all.  This also brings the
      crypto capabilities in line with the other queue properties, which are
      already discoverable via the queue directory in sysfs.
      
      Design notes:
      
        - Place the new files in a new subdirectory "crypto" to group them
          together and to avoid complicating the main "queue" directory.  This
          also makes it possible to replace "crypto" with a symlink later if
          we ever make the blk_crypto_profiles into real kobjects (see below).
      
        - It was necessary to define a new kobject that corresponds to the
          crypto subdirectory.  For now, this kobject just contains a pointer
          to the blk_crypto_profile.  Note that multiple queues (and hence
          multiple such kobjects) may refer to the same blk_crypto_profile.
      
          An alternative design would more closely match the current kernel
          data structures: the blk_crypto_profile could be a kobject itself,
          located directly under the host controller device's kobject, while
          /sys/block/$disk/queue/crypto would be a symlink to it.
      
          I decided not to do that for now because it would require a lot more
          changes, such as no longer embedding blk_crypto_profile in other
          structures, and also because I'm not sure we can rule out moving the
          crypto capabilities into 'struct queue_limits' in the future.  (Even
          if multiple queues share the same crypto engine, maybe the supported
          data unit sizes could differ due to other queue properties.)  It
          would also still be possible to switch to that design later without
          breaking userspace, by replacing the directory with a symlink.
      
        - Use "max_dun_bits" instead of "max_dun_bytes".  Currently, the
          kernel internally stores this value in bytes, but that's an
          implementation detail.  It probably makes more sense to talk about
          this value in bits, and choosing bits is more future-proof.
      
        - "modes" is a sub-subdirectory, since there may be multiple supported
          crypto modes, sysfs is supposed to have one value per file, and it
          makes sense to group all the mode files together.
      
        - Each mode had to be named.  The crypto API names like "xts(aes)" are
          not appropriate because they don't specify the key size.  Therefore,
          I assigned new names.  The exact names chosen are arbitrary, but
          they happen to match the names used in log messages in fs/crypto/.
      
        - The "num_keyslots" file is a bit different from the others in that
          it is only useful to know for performance reasons.  However, it's
          included as it can still be useful.  For example, a user might not
          want to use inline encryption if there aren't very many keyslots.
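
      A userspace reader for these files might look like the sketch below.
      The function name read_crypto_caps and the sysfs_root parameter (added
      so the code can be pointed at a test directory) are assumptions of
      this example. The contents of each mode file are returned as raw
      strings, since this commit message does not describe their format.

```python
from pathlib import Path


def read_crypto_caps(disk, sysfs_root="/sys"):
    """Return the inline encryption capabilities of `disk`, or None.

    None means the queue has no 'crypto' subdirectory.
    """
    crypto = Path(sysfs_root) / "block" / disk / "queue" / "crypto"
    if not crypto.is_dir():
        return None

    # One file per supported mode; the file name is the mode name.
    modes = {}
    modes_dir = crypto / "modes"
    if modes_dir.is_dir():
        for mode_file in sorted(modes_dir.iterdir()):
            modes[mode_file.name] = mode_file.read_text().strip()

    return {
        "max_dun_bits": int((crypto / "max_dun_bits").read_text()),
        "num_keyslots": int((crypto / "num_keyslots").read_text()),
        "modes": modes,
    }
```

      A caller could, for example, skip inline encryption when the result
      is None or when num_keyslots is below a threshold of its choosing.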
      
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Link: https://lore.kernel.org/r/20220124215938.2769-4-ebiggers@kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: don't delete queue kobject before its children · 0f692882
      Eric Biggers authored
      
      kobjects aren't supposed to be deleted before their child kobjects are
      deleted.  Apparently this is usually benign; however, a WARN will be
      triggered if one of the child kobjects has a named attribute group:
      
          sysfs group 'modes' not found for kobject 'crypto'
          WARNING: CPU: 0 PID: 1 at fs/sysfs/group.c:278 sysfs_remove_group+0x72/0x80
          ...
          Call Trace:
            sysfs_remove_groups+0x29/0x40 fs/sysfs/group.c:312
            __kobject_del+0x20/0x80 lib/kobject.c:611
            kobject_cleanup+0xa4/0x140 lib/kobject.c:696
            kobject_release lib/kobject.c:736 [inline]
            kref_put include/linux/kref.h:65 [inline]
            kobject_put+0x53/0x70 lib/kobject.c:753
            blk_crypto_sysfs_unregister+0x10/0x20 block/blk-crypto-sysfs.c:159
            blk_unregister_queue+0xb0/0x110 block/blk-sysfs.c:962
            del_gendisk+0x117/0x250 block/genhd.c:610
      
      Fix this by moving the kobject_del() and the corresponding
      kobject_uevent() to the correct place.
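
      The ordering rule (children are deleted before their parent) can be
      shown with a small tree model. This is a sketch only: Node and
      delete_tree are hypothetical names and do not correspond to the
      kobject API.

```python
class Node:
    """Toy stand-in for a kobject with children (not the kernel API)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.deleted = False
        if parent is not None:
            parent.children.append(self)


def delete_tree(node, order=None):
    """Delete children first, then the node itself (post-order)."""
    if order is None:
        order = []
    for child in node.children:
        delete_tree(child, order)
    # At this point no live child remains under this node.
    assert all(c.deleted for c in node.children)
    node.deleted = True
    order.append(node.name)
    return order
```

      For a 'queue' node with a 'crypto' child, delete_tree() visits
      'crypto' before 'queue', which is the order the fix restores.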
      
      Fixes: 2c2086af ("block: Protect less code with sysfs_lock in blk_{un,}register_queue()")
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20220124215938.2769-3-ebiggers@kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: simplify calling convention of elv_unregister_queue() · f5ec592d
      Eric Biggers authored
      
      Make elv_unregister_queue() a no-op if q->elevator is NULL or is not
      registered.
      
      This simplifies the existing callers, as well as the future caller in
      the error path of blk_register_queue().
      
      Also don't bother checking whether q is NULL, since it never is.
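
      The calling convention described here is the usual "safe to call in
      any state" pattern. A minimal Python sketch, with hypothetical
      Elevator, Queue and unregister_elevator names standing in for the
      kernel structures and function:

```python
class Elevator:
    """Toy elevator that only tracks whether it is registered."""

    def __init__(self, registered=False):
        self.registered = registered


class Queue:
    """Toy queue; `elevator` may be None, as q->elevator may be NULL."""

    def __init__(self, elevator=None):
        self.elevator = elevator


def unregister_elevator(q):
    """No-op unless the queue has an elevator and it is registered.

    Returns True if something was actually unregistered.
    """
    e = q.elevator
    if e is None or not e.registered:
        return False
    e.registered = False
    return True
```

      Callers, including error paths, can then call the function
      unconditionally instead of repeating the checks themselves.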
      
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20220124215938.2769-2-ebiggers@kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. Feb 23, 2022
  6. Feb 11, 2022
  7. Jan 17, 2022
  8. Dec 21, 2021
  9. Dec 03, 2021
  10. Nov 29, 2021
  11. Nov 16, 2021
    • blk-mq: cancel blk-mq dispatch work in both blk_cleanup_queue and disk_release() · 2a19b28f
      Ming Lei authored
      
      To avoid slowing down queue destruction, we don't call
      blk_mq_quiesce_queue() in blk_cleanup_queue(); instead, canceling the
      dispatch work is delayed until blk_release_queue().
      
      However, this approach has caused a kernel oops [1], reported by
      Changhui. The log shows that the scsi_device can be freed before
      blk_release_queue() runs, which is expected, since the scsi_device is
      released after the scsi disk is closed and the scsi_device is removed.
      
      Fix the issue by canceling the blk-mq dispatch work in both
      blk_cleanup_queue() and disk_release():
      
      1) When disk_release() runs, the disk has been closed and all sync
      dispatch activity has finished, so canceling the dispatch work is enough
      to quiesce filesystem I/O dispatch activity.
      
      2) In blk_cleanup_queue(), we only need to care about passthrough
      requests. A passthrough request is always explicitly allocated and freed
      by its caller, so once the queue is frozen, all sync dispatch activity
      for passthrough requests has finished, and canceling the dispatch work is
      enough to prevent any further dispatch activity.
      
      [1] kernel panic log
      [12622.769416] BUG: kernel NULL pointer dereference, address: 0000000000000300
      [12622.777186] #PF: supervisor read access in kernel mode
      [12622.782918] #PF: error_code(0x0000) - not-present page
      [12622.788649] PGD 0 P4D 0
      [12622.791474] Oops: 0000 [#1] PREEMPT SMP PTI
      [12622.796138] CPU: 10 PID: 744 Comm: kworker/10:1H Kdump: loaded Not tainted 5.15.0+ #1
      [12622.804877] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015
      [12622.813321] Workqueue: kblockd blk_mq_run_work_fn
      [12622.818572] RIP: 0010:sbitmap_get+0x75/0x190
      [12622.823336] Code: 85 80 00 00 00 41 8b 57 08 85 d2 0f 84 b1 00 00 00 45 31 e4 48 63 cd 48 8d 1c 49 48 c1 e3 06 49 03 5f 10 4c 8d 6b 40 83 f0 01 <48> 8b 33 44 89 f2 4c 89 ef 0f b6 c8 e8 fa f3 ff ff 83 f8 ff 75 58
      [12622.844290] RSP: 0018:ffffb00a446dbd40 EFLAGS: 00010202
      [12622.850120] RAX: 0000000000000001 RBX: 0000000000000300 RCX: 0000000000000004
      [12622.858082] RDX: 0000000000000006 RSI: 0000000000000082 RDI: ffffa0b7a2dfe030
      [12622.866042] RBP: 0000000000000004 R08: 0000000000000001 R09: ffffa0b742721334
      [12622.874003] R10: 0000000000000008 R11: 0000000000000008 R12: 0000000000000000
      [12622.881964] R13: 0000000000000340 R14: 0000000000000000 R15: ffffa0b7a2dfe030
      [12622.889926] FS:  0000000000000000(0000) GS:ffffa0baafb40000(0000) knlGS:0000000000000000
      [12622.898956] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [12622.905367] CR2: 0000000000000300 CR3: 0000000641210001 CR4: 00000000001706e0
      [12622.913328] Call Trace:
      [12622.916055]  <TASK>
      [12622.918394]  scsi_mq_get_budget+0x1a/0x110
      [12622.922969]  __blk_mq_do_dispatch_sched+0x1d4/0x320
      [12622.928404]  ? pick_next_task_fair+0x39/0x390
      [12622.933268]  __blk_mq_sched_dispatch_requests+0xf4/0x140
      [12622.939194]  blk_mq_sched_dispatch_requests+0x30/0x60
      [12622.944829]  __blk_mq_run_hw_queue+0x30/0xa0
      [12622.949593]  process_one_work+0x1e8/0x3c0
      [12622.954059]  worker_thread+0x50/0x3b0
      [12622.958144]  ? rescuer_thread+0x370/0x370
      [12622.962616]  kthread+0x158/0x180
      [12622.966218]  ? set_kthread_struct+0x40/0x40
      [12622.970884]  ret_from_fork+0x22/0x30
      [12622.974875]  </TASK>
      [12622.977309] Modules linked in: scsi_debug rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs sunrpc dm_multipath intel_rapl_msr intel_rapl_common dell_wmi_descriptor sb_edac rfkill video x86_pkg_temp_thermal intel_powerclamp dcdbas coretemp kvm_intel kvm mgag200 irqbypass i2c_algo_bit rapl drm_kms_helper ipmi_ssif intel_cstate intel_uncore syscopyarea sysfillrect sysimgblt fb_sys_fops pcspkr cec mei_me lpc_ich mei ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sr_mod cdrom sd_mod t10_pi sg ixgbe ahci libahci crct10dif_pclmul crc32_pclmul crc32c_intel libata megaraid_sas ghash_clmulni_intel tg3 wdat_wdt mdio dca wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_debug]
      
      Reported-by: ChanghuiZhong <czhong@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: linux-scsi@vger.kernel.org
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20211116014343.610501-1-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. Oct 27, 2021
    • block: Add independent access ranges support · a2247f19
      Damien Le Moal authored
      
      The Concurrent Positioning Ranges VPD page (for SCSI) and data log page
      (for ATA) contain parameters describing the set of contiguous LBAs that
      can be served independently by a single LUN multi-actuator hard-disk.
      Similarly, a logically defined block device composed of multiple disks
      can in some cases execute requests directed at different sector ranges
      in parallel. A dm-linear device aggregating 2 block devices together is
      an example.
      
      This patch implements support for exposing a block device's independent
      access ranges to the user through sysfs to allow optimizing device
      accesses to increase performance.
      
      To describe the set of independent sector ranges of a device (actuators
      of a multi-actuator HDD or table entries of a dm-linear device),
      the type struct blk_independent_access_ranges is introduced. This
      structure describes the sector ranges using an array of
      struct blk_independent_access_range structures. This range structure
      defines the start sector and number of sectors of the access range.
      The ranges in the array cannot overlap and must contain all sectors
      within the device capacity.
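
      The constraint on the range array can be expressed as a short check.
      The sketch below takes a strict reading of the rule, assuming the
      ranges must tile the device exactly from sector 0 up to its capacity;
      the function name and the (sector, nr_sectors) tuple layout are
      choices made for this example, not the kernel's data structures.

```python
def ranges_are_valid(ranges, capacity):
    """Check that ranges do not overlap and cover sectors [0, capacity).

    ranges:   iterable of (start_sector, nr_sectors) tuples
    capacity: device capacity in sectors
    """
    expected_start = 0
    for sector, nr_sectors in sorted(ranges):
        # A start below expected_start is an overlap, above it is a gap.
        if nr_sectors <= 0 or sector != expected_start:
            return False
        expected_start = sector + nr_sectors
    return expected_start == capacity
```

      The ranges are sorted first, so the order in which they are supplied
      does not matter.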
      
      The function disk_set_independent_access_ranges() allows a device
      driver to signal to the block layer that a device has multiple
      independent access ranges.  In this case, a struct
      blk_independent_access_ranges is attached to the device request queue
      by the function disk_set_independent_access_ranges(). The function
      disk_alloc_independent_access_ranges() is provided for drivers to
      allocate this structure.
      
      struct blk_independent_access_ranges contains kobjects (struct kobject)
      to expose to the user through sysfs the set of independent access ranges
      supported by a device. When the device is initialized, sysfs
      registration of the ranges information is done from blk_register_queue()
      using the block layer internal function
      disk_register_independent_access_ranges(). If a driver calls
      disk_set_independent_access_ranges() for a registered queue, e.g. when a
      device is revalidated, disk_set_independent_access_ranges() will execute
      disk_register_independent_access_ranges() to update the sysfs attribute
      files.  The sysfs file structure created starts from the
      independent_access_ranges sub-directory and contains the start sector
      and number of sectors of each range, with the information for each range
      grouped in numbered sub-directories.
      
      E.g. for a dual actuator HDD, the user sees:
      
      $ tree /sys/block/sdk/queue/independent_access_ranges/
      /sys/block/sdk/queue/independent_access_ranges/
      |-- 0
      |   |-- nr_sectors
      |   `-- sector
      `-- 1
          |-- nr_sectors
          `-- sector
      
      For a regular device with a single access range, the
      independent_access_ranges sysfs directory does not exist.
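
      From userspace, this layout can be read with a few lines of Python.
      This is a sketch: read_access_ranges and its sysfs_root parameter
      (added so the code can be run against a test directory) are
      assumptions of the example, while the directory and file names are
      the ones shown in the tree output.

```python
from pathlib import Path


def read_access_ranges(disk, sysfs_root="/sys"):
    """Return a list of (sector, nr_sectors) tuples for `disk`.

    An empty list means the independent_access_ranges directory does not
    exist, i.e. the device has a single access range.
    """
    base = (Path(sysfs_root) / "block" / disk / "queue"
            / "independent_access_ranges")
    if not base.is_dir():
        return []

    # Each range lives in a numbered sub-directory: 0, 1, ...
    subdirs = sorted(
        (p for p in base.iterdir() if p.name.isdigit()),
        key=lambda p: int(p.name),
    )
    return [
        (int((d / "sector").read_text()), int((d / "nr_sectors").read_text()))
        for d in subdirs
    ]
```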
      
      Device revalidation may lead to changes to this structure and to the
      attribute values. When manipulated, the queue sysfs_lock and
      sysfs_dir_lock mutexes are held for atomicity, similarly to how the
      blk-mq and elevator sysfs queue sub-directories are protected.
      
      The code related to the management of independent access ranges is
      added in the new file block/blk-ia-ranges.c.
      
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Link: https://lore.kernel.org/r/20211027022223.183838-2-damien.lemoal@wdc.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  13. Oct 18, 2021
  14. Aug 23, 2021
  15. Aug 16, 2021
  16. Aug 09, 2021
  17. Jun 16, 2021
  18. May 24, 2021
  19. Apr 06, 2021
  20. Feb 22, 2021
    • block: fix potential IO hang when turning off io_poll · 6b09b4d3
      Jeffle Xu authored
      
      The QUEUE_FLAG_POLL flag is cleared when 'io_poll' is turned off, but at
      that moment there may still be uncompleted IOs in the hw queue. The
      subsequent polling routine won't reap these IOs, since blk_poll()
      returns immediately when QUEUE_FLAG_POLL is cleared. Thus these IOs
      hang until they finally time out. The hang can be observed by running
      'fio --ioengine=io_uring --iodepth=1' while turning off 'io_poll' at
      the same time.
      
      To fix this, freeze and flush the request queue first when turning off
      'io_poll'.
      
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  21. Feb 10, 2021
    • block: introduce zone_write_granularity limit · a805a4fa
      Damien Le Moal authored
      
      Per ZBC and ZAC specifications, host-managed SMR hard-disks mandate that
      all writes into sequential write required zones be aligned to the device
      physical block size. However, NVMe ZNS does not have this constraint and
      allows write operations into sequential zones to be aligned to the
      device logical block size. This inconsistency does not help with
      software portability across device types.
      
      To solve this, introduce the zone_write_granularity queue limit to
      indicate the alignment constraint, in bytes, of write operations into
      zones of a zoned block device. This new limit is exported as a
      read-only sysfs queue attribute and the helper
      blk_queue_zone_write_granularity() is introduced for drivers to set
      this limit.
      
      The function blk_queue_set_zoned() is modified to set this new limit to
      the device logical block size by default. NVMe ZNS devices as well as
      zoned nullb devices use this default value as is. The scsi disk driver
      is modified to execute the blk_queue_zone_write_granularity() helper to
      set the zone write granularity of host-managed SMR disks to the disk
      physical block size.
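
      The effect of the new limit on a write can be sketched as follows.
      Both function names are hypothetical, and the check assumes that the
      start offset and the length of a write must each be a multiple of the
      granularity, which is one plausible reading of "alignment constraint"
      here. The choice of default follows the paragraph above.

```python
def default_zone_write_granularity(logical_block_size, physical_block_size,
                                   scsi_host_managed_smr):
    """Pick the granularity in bytes following the defaults described above.

    Host-managed SMR disks behind the scsi disk driver use the physical
    block size; everything else keeps the logical block size.
    """
    if scsi_host_managed_smr:
        return physical_block_size
    return logical_block_size


def is_zone_write_aligned(offset, length, granularity):
    """Return True if a write of `length` bytes at `offset` is aligned."""
    if granularity <= 0:
        raise ValueError("granularity must be a positive number of bytes")
    return offset % granularity == 0 and length % granularity == 0
```

      With 512-byte logical and 4096-byte physical blocks, a 512-byte write
      passes the check for a device using the logical block size but fails
      it for a host-managed SMR disk.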
      
      The accessor functions queue_zone_write_granularity() and
      bdev_zone_write_granularity() are also introduced.
      
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  22. Oct 09, 2020
  23. Sep 24, 2020
  24. Sep 08, 2020