  1. Mar 08, 2022
    • sched/topology: Fix sched_domain_topology_level alloc in sched_init_numa() · 1312ef5a
      Dietmar Eggemann authored
      
      commit 71e5f664 upstream.
      
      Commit "sched/topology: Make sched_init_numa() use a set for the
      deduplicating sort" allocates 'i + nr_levels (level)' instead of
      'i + nr_levels + 1' sched_domain_topology_level.
      
      This led to an Oops (on Arm64 juno with CONFIG_SCHED_DEBUG):
      
      sched_init_domains
        build_sched_domains()
          __free_domain_allocs()
            __sdt_free() {
              ...
              for_each_sd_topology(tl)
                ...
                sd = *per_cpu_ptr(sdd->sd, j); <--
                ...
            }
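      
      A minimal user-space sketch of why the extra entry matters, assuming the
      topology walk stops at a zeroed terminator entry. struct level,
      alloc_levels() and count_levels() are hypothetical names used only for
      illustration, not kernel API:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct sched_domain_topology_level. */
struct level {
	const char *name;	/* NULL marks the end of the array */
};

/*
 * Allocate room for nr_base + nr_numa levels plus one zeroed
 * terminator entry, mirroring the 'i + nr_levels + 1' of the fix.
 */
static struct level *alloc_levels(size_t nr_base, size_t nr_numa)
{
	return calloc(nr_base + nr_numa + 1, sizeof(struct level));
}

/* Walk the array until the terminator is reached. */
static size_t count_levels(const struct level *tl)
{
	size_t n = 0;

	for (; tl->name; tl++)
		n++;
	return n;
}
```

      Without the "+ 1", filling every allocated slot leaves no terminator, so
      a walk like count_levels() would read past the end of the allocation.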
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
      Tested-by: Barry Song <song.bao.hua@hisilicon.com>
      Link: https://lkml.kernel.org/r/6000e39e-7d28-c360-9cd6-8798fd22a9bf@arm.com
      Signed-off-by: dann frazier <dann.frazier@canonical.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • sched/topology: Make sched_init_numa() use a set for the deduplicating sort · d753aecb
      Valentin Schneider authored
      
      commit 620a6dc4 upstream.
      
      The deduplicating sort in sched_init_numa() assumes that the first line in
      the distance table contains all unique values in the entire table. I've
      been trying to pen what this exactly means for the topology, but it's not
      straightforward. For instance, topology.c uses this example:
      
        node   0   1   2   3
          0:  10  20  20  30
          1:  20  10  20  20
          2:  20  20  10  20
          3:  30  20  20  10
      
        0 ----- 1
        |     / |
        |   /   |
        | /     |
        2 ----- 3
      
      Which works out just fine. However, if we swap nodes 0 and 1:
      
        1 ----- 0
        |     / |
        |   /   |
        | /     |
        2 ----- 3
      
      we get this distance table:
      
        node   0  1  2  3
          0:  10 20 20 20
          1:  20 10 20 30
          2:  20 20 10 20
          3:  20 30 20 10
      
      Which breaks the deduplicating sort (non-representative first line). In
      this case this would just be a renumbering exercise, but it so happens that
      we can have a deduplicating sort that goes through the whole table in O(n²)
      at the extra cost of a temporary memory allocation (i.e. any form of set).
      
      The ACPI spec (SLIT) mentions distances are encoded on 8 bits. Following
      this, implement the set as a 256-bit bitmap. Should this not be
      satisfactory (i.e. we want to support 32-bit values), then we'll have to go
      for some other sparse set implementation.
      
      This has the added benefit of letting us allocate just the right amount of
      memory for sched_domains_numa_distance[], rather than an arbitrary
      (nr_node_ids + 1).
      
      Note: DT binding equivalent (distance-map) decodes distances as 32-bit
      values.
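      
      A user-space sketch of the bitmap-as-set idea, assuming 8-bit distances
      stored in a flat row-major array. unique_distances() is a hypothetical
      helper for illustration and is not the kernel implementation:

```c
#include <stddef.h>
#include <stdint.h>

#define NR_DISTANCES 256	/* 8-bit distances give 256 possible values */

/*
 * Collect the unique values of an n x n distance table into out[],
 * in ascending order, and return how many were found. out[] must
 * have room for NR_DISTANCES entries.
 */
static size_t unique_distances(const uint8_t *table, size_t n, uint8_t *out)
{
	uint64_t seen[NR_DISTANCES / 64] = { 0 };
	size_t count = 0;

	/* O(n^2) pass: mark every distance that appears anywhere. */
	for (size_t i = 0; i < n * n; i++)
		seen[table[i] / 64] |= UINT64_C(1) << (table[i] % 64);

	/* Walking the bitmap in order yields a sorted, duplicate-free list. */
	for (unsigned int d = 0; d < NR_DISTANCES; d++)
		if (seen[d / 64] & (UINT64_C(1) << (d % 64)))
			out[count++] = (uint8_t)d;

	return count;
}
```

      Applied to the swapped table above, this finds 10, 20 and 30 even though
      the first row only contains 10 and 20, and the returned count is the
      exact number of entries needed for the distance array.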
      
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20210122123943.1217-2-valentin.schneider@arm.com
      Signed-off-by: dann frazier <dann.frazier@canonical.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ice: fix concurrent reset and removal of VFs · 05ae1f0f
      Jacob Keller authored
      
      commit fadead80fe4c033b5e514fcbadd20b55c4494112 upstream.
      
      Commit c503e632 ("ice: Stop processing VF messages during teardown")
      introduced a driver state flag, ICE_VF_DEINIT_IN_PROGRESS, which is
      intended to prevent some issues with concurrently handling messages from
      VFs while tearing down the VFs.
      
      This change was motivated by crashes caused while tearing down and
      bringing up VFs in rapid succession.
      
      It turns out that the fix actually introduces issues with the VF driver,
      because the PF no longer responds to any messages sent by the VF
      during its .remove routine. This results in the VF potentially removing
      its DMA memory before the PF has shut down the device queues.
      
      Additionally, the fix doesn't actually resolve concurrency issues within
      the ice driver. It is possible for a VF to initiate a reset just prior
      to the ice driver removing VFs. This can result in the remove task
      concurrently operating while the VF is being reset. This results in
      similar memory corruption and panics purportedly fixed by that commit.
      
      Fix this concurrency at its root by protecting both the reset and
      removal flows using the existing VF cfg_lock. This ensures that we
      cannot remove the VF while any outstanding critical tasks such as a
      virtchnl message or a reset are occurring.
      
      This locking change also fixes the root cause originally fixed by commit
      c503e632 ("ice: Stop processing VF messages during teardown"), so we
      can simply revert it.
      
      Note that I kept these two changes together because simply reverting the
      original commit alone would leave the driver vulnerable to worse race
      conditions.
      
      Fixes: c503e632 ("ice: Stop processing VF messages during teardown")
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ice: Fix race conditions between virtchnl handling and VF ndo ops · 41edeeaa
      Brett Creeley authored
      
      commit e6ba5273 upstream.
      
      The VF can be configured via the PF's ndo ops at the same time the PF is
      receiving/handling virtchnl messages. This causes many issues, one of
      which is that an ndo op could be actively resetting a VF (i.e.
      resetting it to the default state and deleting/re-adding the VF's VSI)
      while a virtchnl message is being handled. The following error was seen
      because a VF ndo op was used to change a VF's trust setting while a
      VIRTCHNL_OP_CONFIG_VSI_QUEUES operation was ongoing:
      
      [35274.192484] ice 0000:88:00.0: Failed to set LAN Tx queue context, error: ICE_ERR_PARAM
      [35274.193074] ice 0000:88:00.0: VF 0 failed opcode 6, retval: -5
      [35274.193640] iavf 0000:88:01.0: PF returned error -5 (IAVF_ERR_PARAM) to our request 6
      
      Fix this by making sure the virtchnl handling and VF ndo ops that
      trigger VF resets cannot run concurrently. This is done by adding a
      struct mutex cfg_lock to each VF structure. For VF ndo ops, the mutex
      will be locked around the critical operations and VFR. Since the ndo ops
      will trigger a VFR, the virtchnl thread will use mutex_trylock(). This
      is done because if any other thread (i.e. VF ndo op) has the mutex, then
      that means the current VF message being handled is no longer valid, so
      just ignore it.
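      
      The trylock behaviour of the message path can be sketched in user space
      as follows. This is only an illustration: a C11 atomic_flag stands in
      for the kernel mutex, and struct vf, vf_trylock(), vf_unlock() and
      handle_vf_msg() are hypothetical names, not driver code:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, simplified VF state; atomic_flag stands in for cfg_lock. */
struct vf {
	atomic_flag cfg_lock;
	int msgs_handled;
};

static bool vf_trylock(struct vf *vf)
{
	/* test_and_set returns the old value: false means we took the lock. */
	return !atomic_flag_test_and_set(&vf->cfg_lock);
}

static void vf_unlock(struct vf *vf)
{
	atomic_flag_clear(&vf->cfg_lock);
}

/* Message path: never waits; a held lock means a reset is in flight. */
static bool handle_vf_msg(struct vf *vf)
{
	if (!vf_trylock(vf))
		return false;	/* stale message, ignore it */
	vf->msgs_handled++;
	vf_unlock(vf);
	return true;
}
```

      The ndo op side would take the lock unconditionally around its critical
      operations and the VFR; that blocking side is not modelled here.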
      
      This issue can be seen using the following commands:
      
      for i in {0..50}; do
              rmmod ice
              modprobe ice
      
              sleep 1
      
              echo 1 > /sys/class/net/ens785f0/device/sriov_numvfs
              echo 1 > /sys/class/net/ens785f1/device/sriov_numvfs
      
              ip link set ens785f1 vf 0 trust on
              ip link set ens785f0 vf 0 trust on
      
              sleep 2
      
              echo 0 > /sys/class/net/ens785f0/device/sriov_numvfs
              echo 0 > /sys/class/net/ens785f1/device/sriov_numvfs
              sleep 1
              echo 1 > /sys/class/net/ens785f0/device/sriov_numvfs
              echo 1 > /sys/class/net/ens785f1/device/sriov_numvfs
      
              ip link set ens785f1 vf 0 trust on
              ip link set ens785f0 vf 0 trust on
      done
      
      Fixes: 7c710869 ("ice: Add handlers for VF netdevice operations")
      Signed-off-by: Brett Creeley <brett.creeley@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • rcu/nocb: Fix missed nocb_timer requeue · 0c145262
      Frederic Weisbecker authored
      
      commit b2fcf210 upstream.
      
      This sequence of events can lead to a failure to requeue a CPU's
      ->nocb_timer:
      
      1.	There are no callbacks queued for any CPU covered by CPU 0-2's
      	->nocb_gp_kthread.  Note that ->nocb_gp_kthread is associated
      	with CPU 0.
      
      2.	CPU 1 enqueues its first callback with interrupts disabled, and
      	thus must defer awakening its ->nocb_gp_kthread.  It therefore
      	queues its rcu_data structure's ->nocb_timer.  At this point,
      	CPU 1's rdp->nocb_defer_wakeup is RCU_NOCB_WAKE.
      
      3.	CPU 2, which shares the same ->nocb_gp_kthread, also enqueues a
      	callback, but with interrupts enabled, allowing it to directly
      	awaken the ->nocb_gp_kthread.
      
      4.	The newly awakened ->nocb_gp_kthread associates both CPU 1's
      	and CPU 2's callbacks with a future grace period and arranges
      	for that grace period to be started.
      
      5.	This ->nocb_gp_kthread goes to sleep waiting for the end of this
      	future grace period.
      
      6.	This grace period elapses before CPU 1's timer fires.
      	This is normally improbable given that the timer is set for only
      	one jiffy, but timers can be delayed.  Besides, it is possible
      	that the kernel was built with CONFIG_RCU_STRICT_GRACE_PERIOD=y.
      
      7.	The grace period ends, so rcu_gp_kthread awakens the
      	->nocb_gp_kthread, which in turn awakens both CPU 1's and
      	CPU 2's ->nocb_cb_kthread.  Then ->nocb_gp_kthread sleeps
      	waiting for more newly queued callbacks.
      
      8.	CPU 1's ->nocb_cb_kthread invokes its callback, then sleeps
      	waiting for more invocable callbacks.
      
      9.	Note that neither kthread updated any ->nocb_timer state,
      	so CPU 1's ->nocb_defer_wakeup is still set to RCU_NOCB_WAKE.
      
      10.	CPU 1 enqueues its second callback, this time with interrupts
      	enabled so it can directly wake ->nocb_gp_kthread.
      	It does so by calling wake_nocb_gp(), which also cancels the
      	pending timer that got queued in step 2. But that doesn't reset
      	CPU 1's ->nocb_defer_wakeup which is still set to RCU_NOCB_WAKE.
      	So CPU 1's ->nocb_defer_wakeup and its ->nocb_timer are now
      	desynchronized.
      
      11.	->nocb_gp_kthread associates the callback queued in 10 with a new
      	grace period, arranges for that grace period to start and sleeps
      	waiting for it to complete.
      
      12.	The grace period ends, rcu_gp_kthread awakens ->nocb_gp_kthread,
      	which in turn wakes up CPU 1's ->nocb_cb_kthread which then
      	invokes the callback queued in 10.
      
      13.	CPU 1 enqueues its third callback, this time with interrupts
      	disabled so it must queue a timer for a deferred wakeup. However
      	the value of its ->nocb_defer_wakeup is RCU_NOCB_WAKE which
      	incorrectly indicates that a timer is already queued.  Instead,
      	CPU 1's ->nocb_timer was cancelled in 10.  CPU 1 therefore fails
      	to queue the ->nocb_timer.
      
      14.	CPU 1 now has a pending callback that may go unnoticed until
      	some other CPU wakes up ->nocb_gp_kthread or until CPU 1 makes
      	an explicit deferred wakeup, for example, during idle entry.
      
      This commit fixes this bug by resetting rdp->nocb_defer_wakeup every time
      we delete the ->nocb_timer.
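      
      The desynchronization in steps 10 and 13 can be modelled with a tiny
      user-space state machine. This is a simplified sketch: struct rdp,
      defer_wake() and direct_wake() are hypothetical stand-ins, and the
      with_fix parameter exists only to contrast the two behaviours:

```c
#include <stdbool.h>

enum { NOCB_WAKE_NOT = 0, NOCB_WAKE = 1 };	/* simplified wake states */

/* Hypothetical, stripped-down stand-in for the per-CPU rcu_data. */
struct rdp {
	int defer_wakeup;	/* what the CPU believes about the timer */
	bool timer_pending;	/* whether the timer is really queued */
};

/* Deferred wakeup: queue the timer only if none is believed pending. */
static void defer_wake(struct rdp *rdp)
{
	if (rdp->defer_wakeup == NOCB_WAKE_NOT) {
		rdp->defer_wakeup = NOCB_WAKE;
		rdp->timer_pending = true;
	}
}

/* Direct wakeup: cancel the timer and, with the fix, reset the flag too. */
static void direct_wake(struct rdp *rdp, bool with_fix)
{
	rdp->timer_pending = false;
	if (with_fix)
		rdp->defer_wakeup = NOCB_WAKE_NOT;
}
```

      Running defer_wake(), direct_wake() and defer_wake() in sequence leaves
      no timer queued when with_fix is false, which corresponds to step 13;
      with the reset in place the third call queues the timer again.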
      
      It is quite possible that there is a similar scenario involving
      ->nocb_bypass_timer and ->nocb_defer_wakeup.  However, despite some
      effort from several people, a failure scenario has not yet been located.
      However, that by no means guarantees that no such scenario exists.
      Finding a failure scenario is left as an exercise for the reader, and the
      "Fixes:" tag below relates to ->nocb_bypass_timer instead of ->nocb_timer.
      
      Fixes: d1b222c6 (rcu/nocb: Add bypass callback queueing)
      Cc: <stable@vger.kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Reviewed-by: Neeraj Upadhyay <neeraju@codeaurora.org>
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB error cause by server · 9bb7237c
      D. Wythe authored
      
      commit 4940a1fdf31c39f0806ac831cde333134862030b upstream.
      
      The cause of SMC_CLC_DECL_ERR_REGRMB on the server is clear: whether a
      new SMC connection can be accepted depends not only on the limit on
      the number of connections, but also on the available rtoken entries.
      Since rtoken release is triggered by the peer, while the connection
      count is decreased locally, many things can happen in the window
      between the two.
      
      The only thing that needs to be mentioned is that, now that all
      connection creations are completely protected by the
      smc_server_lgr_pending lock, it is enough to check only the available
      entries in rtokens_used_mask.
      
      Fixes: cd6851f3 ("smc: remote memory buffers (RMBs)")
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB error generated by client · d7eb6626
      D. Wythe authored
      
      commit 0537f0a2151375dcf90c1bbfda6a0aaf57164e89 upstream.
      
      The main reason for this unexpected SMC_CLC_DECL_ERR_REGRMB on the client
      is the following execution sequence:
      
      Server Conn A:           Server Conn B:			Client Conn B:
      
      smc_lgr_unregister_conn
                              smc_lgr_register_conn
                              smc_clc_send_accept     ->
                                                              smc_rtoken_add
      smcr_buf_unuse
      		->		Client Conn A:
      				smc_rtoken_delete
      
      smc_lgr_unregister_conn() makes the current link available to be assigned
      to a new incoming connection while smcr_buf_unuse() has not executed yet,
      which means that smc_rtoken_add may fail because of insufficient
      rtoken_entry. Reversing their execution order avoids this problem.
      
      Fixes: 3e034725 ("net/smc: common functions for RMBs and send buffers")
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/smc: fix connection leak · 2e8d465b
      D. Wythe authored
      
      commit 9f1c50cf39167ff71dc5953a3234f3f6eeb8fcb5 upstream.
      
      There's a potential leak issue under the following execution sequence:
      
      smc_release  				smc_connect_work
      if (sk->sk_state == SMC_INIT)
      					send_clc_confirim
      	tcp_abort();
      					...
      					sk.sk_state = SMC_ACTIVE
      smc_close_active
      switch(sk->sk_state) {
      ...
      case SMC_ACTIVE:
      	smc_close_final()
      	// then wait peer closed
      
      Unfortunately, tcp_abort() may discard CLC CONFIRM messages that are
      still in the tcp send buffer, in which case our connection token cannot
      be delivered to the server side, which means that we cannot get a
      passive close message at all. Therefore, it is impossible for the
      connection to be disconnected at all.
      
      This patch takes a very simple approach to avoid this issue: once the
      state has changed to SMC_ACTIVE after tcp_abort(), we can actively abort
      the smc connection. Considering that the state was SMC_INIT before
      tcp_abort(), abandoning the complete disconnection process should not
      cause much of a problem.
      
      In fact, this problem may exist as long as the CLC CONFIRM message is
      not received by the server. Whether a timer should be added after
      smc_close_final() needs to be discussed in the future. But even so, this
      patch provides a faster release of the connection in the above case, so
      it should also be valuable.
      
      Fixes: 39f41f36 ("net/smc: common release code for non-accepted sockets")
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Acked-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: dcb: flush lingering app table entries for unregistered devices · 6a8a4dc2
      Vladimir Oltean authored
      
      commit 91b0383fef06f20b847fa9e4f0e3054ead0b1a1b upstream.
      
      If I'm not mistaken (and I don't think I am), the way in which the
      dcbnl_ops work is that drivers call dcb_ieee_setapp() and this populates
      the application table with dynamically allocated struct dcb_app_type
      entries that are kept in the module-global dcb_app_list.
      
      However, nobody keeps exact track of these entries, and although
      dcb_ieee_delapp() is supposed to remove them, nobody does so when the
      interface goes away (example: driver unbinds from device). So the
      dcb_app_list will contain lingering entries with an ifindex that no
      longer matches any device in dcb_app_lookup().
      
      Reclaim the lost memory by listening for the NETDEV_UNREGISTER event and
      flushing the app table entries of interfaces that are now gone.
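      
      A user-space sketch of the flush itself, assuming a plain singly linked
      list keyed by ifindex. struct app_entry, app_add() and app_flush() are
      hypothetical names; the kernel list type, locking and notifier wiring
      are deliberately left out:

```c
#include <stdlib.h>

/* Hypothetical stand-in for an entry in the global app list. */
struct app_entry {
	int ifindex;
	struct app_entry *next;
};

/* Push a new entry for ifindex at the head; returns the new head. */
static struct app_entry *app_add(struct app_entry *head, int ifindex)
{
	struct app_entry *e = malloc(sizeof(*e));

	if (!e)
		return head;
	e->ifindex = ifindex;
	e->next = head;
	return e;
}

/* Free every entry owned by ifindex; returns how many were dropped. */
static int app_flush(struct app_entry **head, int ifindex)
{
	int dropped = 0;

	while (*head) {
		struct app_entry *e = *head;

		if (e->ifindex == ifindex) {
			*head = e->next;	/* unlink, then release */
			free(e);
			dropped++;
		} else {
			head = &e->next;
		}
	}
	return dropped;
}
```

      In this model, the unregister event handler would call app_flush() with
      the ifindex of the device that is going away.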
      
      In fact something like this used to be done as part of the initial
      commit (blamed below), but it was done in dcbnl_exit() -> dcb_flushapp(),
      essentially at module_exit time. That became dead code after commit
      7a6b6f51 ("DCB: fix kconfig option") which essentially merged
      "tristate config DCB" and "bool config DCBNL" into a single "bool config
      DCB", so net/dcb/dcbnl.c could not be built as a module anymore.
      
      Commit 36b9ad80 ("net/dcb: make dcbnl.c explicitly non-modular")
      recognized this and deleted dcbnl_exit() and dcb_flushapp() altogether,
      leaving us with the version we have today.
      
      Since flushing application table entries can and should be done as soon
      as the netdevice disappears, fundamentally the commit that is to blame
      is the one that introduced the design of this API.
      
      Fixes: 9ab933ab ("dcbnl: add appliction tlv handlers")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: ipv6: ensure we call ipv6_mc_down() at most once · f4c63b24
      j.nixdorf@avm.de authored
      
      commit 9995b408f17ff8c7f11bc725c8aa225ba3a63b1c upstream.
      
      There are two reasons for addrconf_notify() to be called with NETDEV_DOWN:
      either the network device is actually going down, or IPv6 was disabled
      on the interface.
      
      If either of them stays down while the other is toggled, we repeatedly
      call the code for NETDEV_DOWN, including ipv6_mc_down(), while never
      calling the corresponding ipv6_mc_up() in between. This will cause a
      new entry in idev->mc_tomb to be allocated for each multicast group
      the interface is subscribed to, which in turn leaks one struct ifmcaddr6
      per nontrivial multicast group the interface is subscribed to.
      
      The following reproducer will leak at least $n objects:
      
      ip addr add ff2e::4242/32 dev eth0 autojoin
      sysctl -w net.ipv6.conf.eth0.disable_ipv6=1
      for i in $(seq 1 $n); do
      	ip link set up eth0; ip link set down eth0
      done
      
      Joining groups with IPV6_ADD_MEMBERSHIP (unprivileged) or setting the
      sysctl net.ipv6.conf.eth0.forwarding to 1 (=> subscribing to ff02::2)
      can also be used to create a nontrivial idev->mc_list, which will then
      leak objects with the right up-down-sequence.
      
      Based on both sources for NETDEV_DOWN events the interface IPv6 state
      should be considered:
      
       - not ready if the network interface is not ready OR IPv6 is disabled
         for it
       - ready if the network interface is ready AND IPv6 is enabled for it
      
      The functions ipv6_mc_up() and ipv6_mc_down() should only be run when this
      state changes.
      
      Implement this by remembering when the IPv6 state is ready, and only
      run ipv6_mc_down() if it actually changed from ready to not ready.
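      
      The "remember the ready state" idea can be sketched in user space as
      below. struct idev and idev_update() are hypothetical, and a counter
      stands in for the real ipv6_mc_down() call:

```c
#include <stdbool.h>

/* Hypothetical per-interface state; mc_down_calls counts teardown runs. */
struct idev {
	bool link_ready;
	bool ipv6_enabled;
	bool ready;		/* remembered combined state */
	int mc_down_calls;
};

/* Re-evaluate the combined state; tear down only on ready -> not ready. */
static void idev_update(struct idev *idev)
{
	bool now_ready = idev->link_ready && idev->ipv6_enabled;

	if (idev->ready && !now_ready)
		idev->mc_down_calls++;	/* stands in for ipv6_mc_down() */
	idev->ready = now_ready;
}
```

      With IPv6 disabled, toggling link_ready any number of times leaves
      mc_down_calls unchanged in this model, because the combined state never
      returns to ready in between.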
      
      The other direction (not ready -> ready) already works correctly, as:
      
       - the interface notification triggered codepath for NETDEV_UP /
         NETDEV_CHANGE returns early if ipv6 is disabled, and
       - the disable_ipv6=0 triggered codepath skips fully initializing the
         interface as long as addrconf_link_ready(dev) returns false
       - calling ipv6_mc_up() repeatedly does not leak anything
      
      Fixes: 3ce62a84 ("ipv6: exit early in addrconf_notify() if IPv6 is disabled")
      Signed-off-by: Johannes Nixdorf <j.nixdorf@avm.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • batman-adv: Don't expect inter-netns unique iflink indices · a9c4a74a
      Sven Eckelmann authored
      
      commit 6c1f41afc1dbe59d9d3c8bb0d80b749c119aa334 upstream.
      
      The ifindex doesn't have to be unique for multiple network namespaces on
      the same machine.
      
        $ ip netns add test1
        $ ip -net test1 link add dummy1 type dummy
        $ ip netns add test2
        $ ip -net test2 link add dummy2 type dummy
      
        $ ip -net test1 link show dev dummy1
        6: dummy1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
            link/ether 96:81:55:1e:dd:85 brd ff:ff:ff:ff:ff:ff
        $ ip -net test2 link show dev dummy2
        6: dummy2: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
            link/ether 5a:3c:af:35:07:c3 brd ff:ff:ff:ff:ff:ff
      
      But the batman-adv code to walk through the various layers of virtual
      interfaces uses this assumption because dev_get_iflink handles it
      internally and doesn't return the actual netns of the iflink. And
      dev_get_iflink only documents the situation where ifindex == iflink for
      physical devices.
      
      But only checking for dev->netdev_ops->ndo_get_iflink is also not an option
      because ipoib_get_iflink implements it even when it sometimes returns an
      iflink != ifindex and sometimes iflink == ifindex. The caller must
      therefore make sure itself to check both netns and iflink + ifindex for
      equality. Only when they are equal, a "physical" interface was detected
      which should stop the traversal. On the other hand, vxcan_get_iflink can
      also return 0 in case there was currently no valid peer. In this case, it
      is still necessary to stop.
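      
      The stop condition described above can be sketched as a small predicate.
      struct dev_view and is_final_iflink() are hypothetical names for this
      illustration, and namespaces are compared simply as opaque pointers:

```c
#include <stdbool.h>

/* Hypothetical, reduced view of a net_device for this check. */
struct dev_view {
	int ifindex;
	int iflink;		/* 0 models "no valid peer right now" */
	const void *netns;	/* namespace the device lives in */
	const void *link_netns;	/* namespace the iflink refers to */
};

/* True when traversal of stacked virtual devices should stop here. */
static bool is_final_iflink(const struct dev_view *d)
{
	/* No valid peer at the moment: nothing further to follow. */
	if (d->iflink == 0)
		return true;

	/* "Physical" only if both the index and the namespace match. */
	return d->iflink == d->ifindex && d->netns == d->link_netns;
}
```

      A device whose iflink equals its ifindex but points into a different
      namespace is therefore not treated as the end of the chain.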
      
      Fixes: b7eddd0b ("batman-adv: prevent using any virtual device created on batman-adv as hard-interface")
      Fixes: 5ed4a460 ("batman-adv: additional checks for virtual interfaces on top of WiFi")
      Reported-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: Sven Eckelmann <sven@narfation.org>
      Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • batman-adv: Request iflink once in batadv_get_real_netdevice · 3dae11d2
      Sven Eckelmann authored
      
      commit 6116ba09423f7d140f0460be6a1644dceaad00da upstream.
      
      There is no need to call dev_get_iflink multiple times for the same
      net_device in batadv_get_real_netdevice. And since some of the
      ndo_get_iflink callbacks are dynamic (for example via RCUs like in
      vxcan_get_iflink), it could easily happen that the returned values are not
      stable. The pre-checks before __dev_get_by_index are then of course bogus.
      
      Fixes: 5ed4a460 ("batman-adv: additional checks for virtual interfaces on top of WiFi")
      Signed-off-by: Sven Eckelmann <sven@narfation.org>
      Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • batman-adv: Request iflink once in batadv-on-batadv check · dcf10d78
      Sven Eckelmann authored
      
      commit 690bb6fb64f5dc7437317153902573ecad67593d upstream.
      
      There is no need to call dev_get_iflink multiple times for the same
      net_device in batadv_is_on_batman_iface. And since some of the
      .ndo_get_iflink callbacks are dynamic (for example via RCUs like in
      vxcan_get_iflink), it could easily happen that the returned values are not
      stable. The pre-checks before __dev_get_by_index are then of course bogus.
      
      Fixes: b7eddd0b ("batman-adv: prevent using any virtual device created on batman-adv as hard-interface")
      Signed-off-by: Sven Eckelmann <sven@narfation.org>
      Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • netfilter: nf_queue: handle socket prefetch · 81f817f3
      Florian Westphal authored
      
      commit 3b836da4081fa585cf6c392f62557496f2cb0efe upstream.
      
      In case someone combines bpf socket assign and nf_queue, then we will
      queue an skb that references a struct sock that did not have its
      reference count incremented.
      
      As we leave rcu protection, there is no guarantee that skb->sk is still
      valid.
      
      For refcount-less skb->sk case, try to increment the reference count
      and then override the destructor.
      
      In case of failure we have two choices: orphan the skb and 'delete'
      preselect or let nf_queue() drop the packet.
      
      Do the latter, it should not happen during normal operation.
      
      Fixes: cf7fbe66 ("bpf: Add socket assign support")
      Acked-by: Joe Stringer <joe@cilium.io>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • netfilter: nf_queue: fix possible use-after-free · 4d052392
      Florian Westphal authored
      
      commit c3873070247d9e3c7a6b0cf9bf9b45e8018427b1 upstream.
      
      Eric Dumazet says:
        The sock_hold() side seems suspect, because there is no guarantee
        that sk_refcnt is not already 0.
      
      On failure, we cannot queue the packet and need to indicate an
      error.  The packet will be dropped by the caller.
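      
      One common way to address this kind of problem is an
      "increment unless zero" operation that can fail. The following is a
      user-space sketch with C11 atomics; ref_get_unless_zero() is a
      hypothetical helper, and it is an assumption here that the fix follows
      this pattern:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only if the count is still non-zero. */
static bool ref_get_unless_zero(atomic_int *refcnt)
{
	int old = atomic_load(refcnt);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(refcnt, &old, old + 1))
			return true;
	}
	return false;	/* object is already on its way to being freed */
}
```

      A caller that gets false back must not touch the object and has to
      report an error instead, which matches the "cannot queue the packet"
      outcome described above.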
      
      v2: split skb prefetch hunk into separate change
      
      Fixes: 271b72c7 ("udp: RCU handling for Unicast packets.")
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • netfilter: nf_queue: don't assume sk is full socket · 3b9ba964
      Florian Westphal authored
      
      commit 747670fd9a2d1b7774030dba65ca022ba442ce71 upstream.
      
      There is no guarantee that state->sk refers to a full socket.
      
      If refcount transitions to 0, sock_put calls sk_free which then ends up
      with garbage fields.
      
      I'd like to thank Oleksandr Natalenko and Jiri Benc for considerable
      debug work and pointing out state->sk oddities.
      
      Fixes: ca6fb065 ("tcp: attach SYNACK messages to request sockets instead of listener")
      Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: fix up skbs delta_truesize in UDP GRO frag_list · 4e178ed1
      lena wang authored
      
      commit 224102de2ff105a2c05695e66a08f4b5b6b2d19c upstream.
      
      The truesize of a UDP GRO packet is the sum of the truesize of the main
      skb and of the skbs in the main skb's frag_list:
      skb_gro_receive_list
              p->truesize += skb->truesize;
      
      The commit 53475c5d ("net: fix use-after-free when UDP GRO with
      shared fraglist") introduced a truesize increase for frag_list skbs.
      When uncloning an skb, it will call pskb_expand_head and the truesize of
      frag_list skbs may increase. This can occur when allocators use
      __netdev_alloc_skb and do not jump into __alloc_skb. This flow does not
      use ksize(len) to calculate truesize, while pskb_expand_head does.
      skb_segment_list
      err = skb_unclone(nskb, GFP_ATOMIC);
      pskb_expand_head
              if (!skb->sk || skb->destructor == sock_edemux)
                      skb->truesize += size - osize;
      
      If we use the increased truesize as delta_truesize, it will be
      larger than before and even larger than the previous total truesize value
      if skbs in frag_list are abundant. The main skb truesize will become
      smaller, and may even go negative, which is a huge value for an unsigned
      int. Then the following memory check will drop this abnormal skb.
      
      To avoid this error we should use the original truesize to segment the
      main skb.
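      
      The unsigned wrap-around can be reproduced with plain arithmetic. In this
      sketch split_head_truesize() is a hypothetical helper and the sizes are
      made-up illustrative numbers, not measurements:

```c
#include <stddef.h>

/*
 * Subtract each fragment's delta from the head's accumulated truesize.
 * Everything is unsigned, so subtracting too much wraps around instead
 * of going negative.
 */
static unsigned int split_head_truesize(unsigned int head_truesize,
					const unsigned int *delta, size_t n)
{
	for (size_t i = 0; i < n; i++)
		head_truesize -= delta[i];
	return head_truesize;
}
```

      For example, a head that accumulated 3072 from its own 768 plus three
      768-byte fragments ends at 768 when the original sizes are subtracted,
      but wraps to a huge value if each fragment has meanwhile grown to 1280.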
      
      Fixes: 53475c5d ("net: fix use-after-free when UDP GRO with shared fraglist")
      Signed-off-by: lena wang <lena.wang@mediatek.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/1646133431-8948-1-git-send-email-lena.wang@mediatek.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Sasha Neftin's avatar
      e1000e: Correct NVM checksum verification flow · eb5e444f
      Sasha Neftin authored
      commit ffd24fa2fcc76ecb2e61e7a4ef8588177bcb42a6 upstream.
      
      Update the MAC type check to e1000_pch_tgp, because for e1000_pch_cnp
      an NVM checksum update is still possible.
      Emit a more detailed warning message.
      
      Bugzilla: https://bugzilla.opensuse.org/show_bug.cgi?id=1191663
      
      
      Fixes: 4051f683 ("e1000e: Do not take care about recovery NVM checksum")
      Reported-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>
      Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
      Tested-by: Naama Meir <naamax.meir@linux.intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      eb5e444f
    • Leon Romanovsky's avatar
      xfrm: enforce validity of offload input flags · b53d4bfd
      Leon Romanovsky authored
      
      commit 7c76ecd9c99b6e9a771d813ab1aa7fa428b3ade1 upstream.
      
      struct xfrm_user_offload has a flags field that receives user input,
      but the kernel didn't check that only valid bits were set. As a result,
      unsanitized input was forwarded directly to the drivers.
      
      For example, the XFRM_OFFLOAD_IPV6 define was exposed and used by
      strongswan, but was never implemented in the kernel.
      
      As a solution, check and sanitize input flags to forward
      XFRM_OFFLOAD_INBOUND to the drivers.
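
      A minimal userspace sketch of this kind of check follows. The
      function sanitize_offload_flags() is hypothetical, the flag values
      are assumptions defined locally for the example, and tolerating
      XFRM_OFFLOAD_IPV6 from userspace (so that existing callers keep
      working) is likewise an assumption rather than something stated
      above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed flag values, defined locally so the sketch is self-contained. */
#define XFRM_OFFLOAD_IPV6    1
#define XFRM_OFFLOAD_INBOUND 2

/*
 * Hypothetical helper: reject unknown bits, then forward only the bit
 * that drivers are expected to act on.
 */
static int sanitize_offload_flags(uint8_t user_flags, uint8_t *driver_flags)
{
	assert(driver_flags != NULL);

	/* Any bit outside the known set is invalid user input. */
	if (user_flags & ~(XFRM_OFFLOAD_IPV6 | XFRM_OFFLOAD_INBOUND))
		return -EINVAL;

	/* Strip everything except XFRM_OFFLOAD_INBOUND before forwarding. */
	*driver_flags = user_flags & XFRM_OFFLOAD_INBOUND;
	return 0;
}
```

      On rejection the output is left untouched, so a caller can tell
      "invalid input" apart from "valid input with no driver-visible
      flags".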
      
      Fixes: d77e38e6 ("xfrm: Add an IPsec hardware offloading API")
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b53d4bfd
    • Antony Antony's avatar
      xfrm: fix the if_id check in changelink · 2f0e6d80
      Antony Antony authored
      
      commit 6d0d95a1c2b07270870e7be16575c513c29af3f1 upstream.
      
      if_id will always be 0, because it was not yet initialized.
      
      Fixes: 8dce43919566 ("xfrm: interface with if_id 0 should return error")
      Reported-by: Pavel Machek <pavel@denx.de>
      Signed-off-by: Antony Antony <antony.antony@secunet.com>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2f0e6d80
    • Eric Dumazet's avatar
      bpf, sockmap: Do not ignore orig_len parameter · 24efaae0
      Eric Dumazet authored
      
      commit 60ce37b03917e593d8e5d8bcc7ec820773daf81d upstream.
      
      Currently, sk_psock_verdict_recv() returns skb->len.
      
      This is problematic because tcp_read_sock() might have
      passed orig_len < skb->len, due to the presence of TCP urgent data.
      
      This causes an infinite loop from tcp_read_sock().
      
      A follow-up patch will make tcp_read_sock() more robust against bad actors.
      
      Fixes: ef565928 ("bpf, sockmap: Allow skipping sk_skb parser program")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
      Tested-by: Jakub Sitnicki <jakub@cloudflare.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/r/20220302161723.3910001-1-eric.dumazet@gmail.com
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      24efaae0
    • Eric Dumazet's avatar
      netfilter: fix use-after-free in __nf_register_net_hook() · 8b0142c4
      Eric Dumazet authored
      
      commit 56763f12b0f02706576a088e85ef856deacc98a0 upstream.
      
      We must not dereference @new_hooks after nf_hook_mutex has been released,
      because other threads might have freed our allocated hooks already.
      
      BUG: KASAN: use-after-free in nf_hook_entries_get_hook_ops include/linux/netfilter.h:130 [inline]
      BUG: KASAN: use-after-free in hooks_validate net/netfilter/core.c:171 [inline]
      BUG: KASAN: use-after-free in __nf_register_net_hook+0x77a/0x820 net/netfilter/core.c:438
      Read of size 2 at addr ffff88801c1a8000 by task syz-executor237/4430
      
      CPU: 1 PID: 4430 Comm: syz-executor237 Not tainted 5.17.0-rc5-syzkaller-00306-g2293be58d6a1 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       print_address_description.constprop.0.cold+0x8d/0x336 mm/kasan/report.c:255
       __kasan_report mm/kasan/report.c:442 [inline]
       kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
       nf_hook_entries_get_hook_ops include/linux/netfilter.h:130 [inline]
       hooks_validate net/netfilter/core.c:171 [inline]
       __nf_register_net_hook+0x77a/0x820 net/netfilter/core.c:438
       nf_register_net_hook+0x114/0x170 net/netfilter/core.c:571
       nf_register_net_hooks+0x59/0xc0 net/netfilter/core.c:587
       nf_synproxy_ipv6_init+0x85/0xe0 net/netfilter/nf_synproxy_core.c:1218
       synproxy_tg6_check+0x30d/0x560 net/ipv6/netfilter/ip6t_SYNPROXY.c:81
       xt_check_target+0x26c/0x9e0 net/netfilter/x_tables.c:1038
       check_target net/ipv6/netfilter/ip6_tables.c:530 [inline]
       find_check_entry.constprop.0+0x7f1/0x9e0 net/ipv6/netfilter/ip6_tables.c:573
       translate_table+0xc8b/0x1750 net/ipv6/netfilter/ip6_tables.c:735
       do_replace net/ipv6/netfilter/ip6_tables.c:1153 [inline]
       do_ip6t_set_ctl+0x56e/0xb90 net/ipv6/netfilter/ip6_tables.c:1639
       nf_setsockopt+0x83/0xe0 net/netfilter/nf_sockopt.c:101
       ipv6_setsockopt+0x122/0x180 net/ipv6/ipv6_sockglue.c:1024
       rawv6_setsockopt+0xd3/0x6a0 net/ipv6/raw.c:1084
       __sys_setsockopt+0x2db/0x610 net/socket.c:2180
       __do_sys_setsockopt net/socket.c:2191 [inline]
       __se_sys_setsockopt net/socket.c:2188 [inline]
       __x64_sys_setsockopt+0xba/0x150 net/socket.c:2188
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7f65a1ace7d9
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 71 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f65a1a7f308 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
      RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f65a1ace7d9
      RDX: 0000000000000040 RSI: 0000000000000029 RDI: 0000000000000003
      RBP: 00007f65a1b574c8 R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000020000000 R11: 0000000000000246 R12: 00007f65a1b55130
      R13: 00007f65a1b574c0 R14: 00007f65a1b24090 R15: 0000000000022000
       </TASK>
      
      The buggy address belongs to the page:
      page:ffffea0000706a00 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1c1a8
      flags: 0xfff00000000000(node=0|zone=1|lastcpupid=0x7ff)
      raw: 00fff00000000000 ffffea0001c1b108 ffffea000046dd08 0000000000000000
      raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      page_owner tracks the page as freed
      page last allocated via order 2, migratetype Unmovable, gfp_mask 0x52dc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_ZERO), pid 4430, ts 1061781545818, free_ts 1061791488993
       prep_new_page mm/page_alloc.c:2434 [inline]
       get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4165
       __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5389
       __alloc_pages_node include/linux/gfp.h:572 [inline]
       alloc_pages_node include/linux/gfp.h:595 [inline]
       kmalloc_large_node+0x62/0x130 mm/slub.c:4438
       __kmalloc_node+0x35a/0x4a0 mm/slub.c:4454
       kmalloc_node include/linux/slab.h:604 [inline]
       kvmalloc_node+0x97/0x100 mm/util.c:580
       kvmalloc include/linux/slab.h:731 [inline]
       kvzalloc include/linux/slab.h:739 [inline]
       allocate_hook_entries_size net/netfilter/core.c:61 [inline]
       nf_hook_entries_grow+0x140/0x780 net/netfilter/core.c:128
       __nf_register_net_hook+0x144/0x820 net/netfilter/core.c:429
       nf_register_net_hook+0x114/0x170 net/netfilter/core.c:571
       nf_register_net_hooks+0x59/0xc0 net/netfilter/core.c:587
       nf_synproxy_ipv6_init+0x85/0xe0 net/netfilter/nf_synproxy_core.c:1218
       synproxy_tg6_check+0x30d/0x560 net/ipv6/netfilter/ip6t_SYNPROXY.c:81
       xt_check_target+0x26c/0x9e0 net/netfilter/x_tables.c:1038
       check_target net/ipv6/netfilter/ip6_tables.c:530 [inline]
       find_check_entry.constprop.0+0x7f1/0x9e0 net/ipv6/netfilter/ip6_tables.c:573
       translate_table+0xc8b/0x1750 net/ipv6/netfilter/ip6_tables.c:735
       do_replace net/ipv6/netfilter/ip6_tables.c:1153 [inline]
       do_ip6t_set_ctl+0x56e/0xb90 net/ipv6/netfilter/ip6_tables.c:1639
       nf_setsockopt+0x83/0xe0 net/netfilter/nf_sockopt.c:101
      page last free stack trace:
       reset_page_owner include/linux/page_owner.h:24 [inline]
       free_pages_prepare mm/page_alloc.c:1352 [inline]
       free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1404
       free_unref_page_prepare mm/page_alloc.c:3325 [inline]
       free_unref_page+0x19/0x690 mm/page_alloc.c:3404
       kvfree+0x42/0x50 mm/util.c:613
       rcu_do_batch kernel/rcu/tree.c:2527 [inline]
       rcu_core+0x7b1/0x1820 kernel/rcu/tree.c:2778
       __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
      
      Memory state around the buggy address:
       ffff88801c1a7f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
       ffff88801c1a7f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      >ffff88801c1a8000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                         ^
       ffff88801c1a8080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
       ffff88801c1a8100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      
      Fixes: 2420b79f ("netfilter: debug: check for sorted array")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Acked-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8b0142c4
    • Jiri Bohac's avatar
      xfrm: fix MTU regression · 4952faa7
      Jiri Bohac authored
      
      commit 6596a0229541270fb8d38d989f91b78838e5e9da upstream.
      
      Commit 749439bf ("ipv6: fix udpv6
      sendmsg crash caused by too small MTU") breaks PMTU for xfrm.
      
      A Packet Too Big ICMPv6 message received in response to an ESP
      packet will prevent all further communication through the tunnel
      if the reported MTU minus the ESP overhead is smaller than 1280.
      
      E.g. for tunnel-mode ESP with sha256/aes the overhead is 92 bytes.
      Receiving a PTB with an MTU of 1371 or less results in all further
      packets in the tunnel being dropped. A ping through the tunnel fails
      with "ping: sendmsg: Invalid argument".
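
      The arithmetic behind the 1371-byte threshold can be checked with a
      small userspace sketch. The helper names inner_mtu() and
      passes_min_mtu_check() are hypothetical; only the 92-byte overhead
      and the 1280-byte IPv6 minimum come from the description above.

```c
#include <assert.h>
#include <stdbool.h>

#define IPV6_MIN_MTU 1280	/* minimum link MTU required by IPv6 */

/* Hypothetical helper: MTU left for the inner packet after ESP overhead. */
static int inner_mtu(int reported_mtu, int esp_overhead)
{
	assert(esp_overhead >= 0);
	return reported_mtu - esp_overhead;
}

/* Models an "MTU must be at least 1280" style check on the xfrm route. */
static bool passes_min_mtu_check(int reported_mtu, int esp_overhead)
{
	return inner_mtu(reported_mtu, esp_overhead) >= IPV6_MIN_MTU;
}
```

      With a reported MTU of 1371 the inner MTU is 1371 - 92 = 1279, one
      byte below the minimum, so the check fails; 1372 is the smallest
      reported MTU that still passes.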
      
      Apparently the MTU on the xfrm route is smaller than 1280 and
      fails the check inside ip6_setup_cork() added by 749439bf.
      
      We found this by debugging USGv6/ipv6ready failures. Failing
      tests are: "Phase-2 Interoperability Test Scenario IPsec" /
      5.3.11 and 5.4.11 (Tunnel Mode: Fragmentation).
      
      Commit b515d263 ("xfrm:
      xfrm_state_mtu should return at least 1280 for ipv6") attempted
      to fix this but caused another regression in TCP MSS calculations
      and had to be reverted.
      
      The patch below fixes the situation by dropping the MTU
      check and instead checking for the underflows described in the
      749439bf commit message.
      
      Signed-off-by: Jiri Bohac <jbohac@suse.cz>
      Fixes: 749439bf ("ipv6: fix udpv6 sendmsg crash caused by too small MTU")
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4952faa7
    • Daniel Borkmann's avatar
      mm: Consider __GFP_NOWARN flag for oversized kvmalloc() calls · e93f2be3
      Daniel Borkmann authored
      
      commit 0708a0afe291bdfe1386d74d5ec1f0c27e8b9168 upstream.
      
      syzkaller was recently triggering an oversized kvmalloc() warning via
      xdp_umem_create().
      
      The triggered warning was added back in 7661809d ("mm: don't allow
      oversized kvmalloc() calls"). The rationale for the warning for huge
      kvmalloc sizes was as a reaction to a security bug where the size was
      more than UINT_MAX but not everything was prepared to handle unsigned
      long sizes.
      
      Anyway, the AF_XDP related call trace from this syzkaller report was:
      
        kvmalloc include/linux/mm.h:806 [inline]
        kvmalloc_array include/linux/mm.h:824 [inline]
        kvcalloc include/linux/mm.h:829 [inline]
        xdp_umem_pin_pages net/xdp/xdp_umem.c:102 [inline]
        xdp_umem_reg net/xdp/xdp_umem.c:219 [inline]
        xdp_umem_create+0x6a5/0xf00 net/xdp/xdp_umem.c:252
        xsk_setsockopt+0x604/0x790 net/xdp/xsk.c:1068
        __sys_setsockopt+0x1fd/0x4e0 net/socket.c:2176
        __do_sys_setsockopt net/socket.c:2187 [inline]
        __se_sys_setsockopt net/socket.c:2184 [inline]
        __x64_sys_setsockopt+0xb5/0x150 net/socket.c:2184
        do_syscall_x64 arch/x86/entry/common.c:50 [inline]
        do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Björn mentioned that requests for >2GB allocation can still be valid:
      
        The structure that is being allocated is the page-pinning accounting.
        AF_XDP has an internal limit of U32_MAX pages, which is *a lot*, but
        still fewer than what memcg allows (PAGE_COUNTER_MAX is a LONG_MAX/
        PAGE_SIZE on 64 bit systems). [...]
      
        I could just change from U32_MAX to INT_MAX, but as I stated earlier
        that has a hacky feeling to it. [...] From my perspective, the code
        isn't broken, with the memcg limits in consideration. [...]
      
      Linus says:
      
        [...] Pretty much every time this has come up, the kernel warning has
        shown that yes, the code was broken and there really wasn't a reason
        for doing allocations that big.
      
        Of course, some people would be perfectly fine with the allocation
        failing, they just don't want the warning. I didn't want __GFP_NOWARN
        to shut it up originally because I wanted people to see all those
        cases, but these days I think we can just say "yeah, people can shut
        it up explicitly by saying 'go ahead and fail this allocation, don't
        warn about it'".
      
        So enough time has passed that by now I'd certainly be ok with [it].
      
      Thus allow call-sites to silence such userspace triggered splats if the
      allocation requests have __GFP_NOWARN. For xdp_umem_pin_pages()'s call
      to kvcalloc() this is already the case, so nothing else needed there.
      
      Fixes: 7661809d ("mm: don't allow oversized kvmalloc() calls")
      Reported-by: <syzbot+11421fbbff99b989670e@syzkaller.appspotmail.com>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: <syzbot+11421fbbff99b989670e@syzkaller.appspotmail.com>
      Cc: Björn Töpel <bjorn@kernel.org>
      Cc: Magnus Karlsson <magnus.karlsson@intel.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: David S. Miller <davem@davemloft.net>
      Link: https://lore.kernel.org/bpf/CAJ+HfNhyfsT5cS_U9EC213ducHs9k9zNxX9+abqC0kTrPbQ0gg@mail.gmail.com
      Link: https://lore.kernel.org/bpf/20211201202905.b9892171e3f5b9a60f9da251@linux-foundation.org
      
      
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e93f2be3
    • Dave Jiang's avatar
      ntb: intel: fix port config status offset for SPR · 912186db
      Dave Jiang authored
      
      commit d5081bf5dcfb1cb83fb538708b0ac07a10a79cc4 upstream.
      
      The field offset for port configuration status on SPR has changed to
      bit 14, from bit 12 on ICX. By chance, link status detection
      continued to work on SPR, because bit 12 is a configuration bit
      that is in sync with the status bit. Fix this by checking for an SPR
      device and checking the correct status bit.
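
      A sketch of a per-generation bit selection is shown below. The macro
      and function names are hypothetical, and the 32-bit register width is
      an assumption; only the bit positions (12 on ICX, 14 on SPR) come
      from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; bit positions taken from the commit description. */
#define PORT_CFG_STATUS_ICX (1u << 12)
#define PORT_CFG_STATUS_SPR (1u << 14)

/* Test the port configuration status bit that matches the device type. */
static bool port_cfg_status_set(uint32_t reg, bool is_spr)
{
	uint32_t mask = is_spr ? PORT_CFG_STATUS_SPR : PORT_CFG_STATUS_ICX;

	return (reg & mask) != 0;
}
```

      Selecting the mask from the device type keeps the ICX path unchanged
      while making the SPR path independent of whatever bit 12 happens to
      hold.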
      
      Fixes: 26bfe3d0 ("ntb: intel: Add Icelake (gen4) support for Intel NTB")
      Tested-by: Jerry Dai <jerry.dai@intel.com>
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Jon Mason <jdmason@kudzu.us>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      912186db
    • Nicolas Cavallari's avatar
      thermal: core: Fix TZ_GET_TRIP NULL pointer dereference · 1c0b51e6
      Nicolas Cavallari authored
      
      commit 5838a14832d447990827d85e90afe17e6fb9c175 upstream.
      
      Do not call get_trip_hyst() from thermal_genl_cmd_tz_get_trip() if
      the thermal zone does not define one.
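
      The pattern is the usual guard around an optional callback. The
      sketch below is a userspace model: struct zone_ops, read_trip_hyst()
      and fixed_hyst() are hypothetical stand-ins, and reporting a
      hysteresis of 0 when the callback is missing is an assumption made
      for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a thermal zone's operations table. */
struct zone_ops {
	int (*get_trip_hyst)(int trip, int *hyst);	/* optional, may be NULL */
};

/* Example callback used to exercise the non-NULL path. */
static int fixed_hyst(int trip, int *hyst)
{
	(void)trip;
	*hyst = 2000;
	return 0;
}

/* Call the optional callback only if the zone provides it. */
static int read_trip_hyst(const struct zone_ops *ops, int trip, int *hyst)
{
	assert(ops != NULL && hyst != NULL);

	if (!ops->get_trip_hyst) {
		*hyst = 0;	/* assumption: report "no hysteresis" */
		return 0;
	}
	return ops->get_trip_hyst(trip, hyst);
}
```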
      
      Fixes: 1ce50e7d ("thermal: core: genetlink support for events/cmd/sampling")
      Signed-off-by: Nicolas Cavallari <nicolas.cavallari@green-communications.fr>
      Cc: 5.10+ <stable@vger.kernel.org> # 5.10+
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1c0b51e6
    • Marek Marczykowski-Górecki's avatar
      xen/netfront: destroy queues before real_num_tx_queues is zeroed · a1753d5c
      Marek Marczykowski-Górecki authored
      commit dcf4ff7a48e7598e6b10126cc02177abb8ae4f3f upstream.
      
      xennet_destroy_queues() relies on info->netdev->real_num_tx_queues to
      delete queues. Since d7dac083414eb5bb99a6d2ed53dc2c1b405224e5
      ("net-sysfs: update the queue counts in the unregistration path"),
      unregister_netdev() indirectly sets real_num_tx_queues to 0. Together,
      these two facts mean that xennet_destroy_queues(), when called from
      xennet_remove(), cannot do its job, because it runs after
      unregister_netdev(). This results in kfree-ing queues that are still
      linked in napi, which ultimately crashes:
      
          BUG: kernel NULL pointer dereference, address: 0000000000000000
          #PF: supervisor read access in kernel mode
          #PF: error_code(0x0000) - not-present page
          PGD 0 P4D 0
          Oops: 0000 [#1] PREEMPT SMP PTI
          CPU: 1 PID: 52 Comm: xenwatch Tainted: G        W         5.16.10-1.32.fc32.qubes.x86_64+ #226
          RIP: 0010:free_netdev+0xa3/0x1a0
          Code: ff 48 89 df e8 2e e9 00 00 48 8b 43 50 48 8b 08 48 8d b8 a0 fe ff ff 48 8d a9 a0 fe ff ff 49 39 c4 75 26 eb 47 e8 ed c1 66 ff <48> 8b 85 60 01 00 00 48 8d 95 60 01 00 00 48 89 ef 48 2d 60 01 00
          RSP: 0000:ffffc90000bcfd00 EFLAGS: 00010286
          RAX: 0000000000000000 RBX: ffff88800edad000 RCX: 0000000000000000
          RDX: 0000000000000001 RSI: ffffc90000bcfc30 RDI: 00000000ffffffff
          RBP: fffffffffffffea0 R08: 0000000000000000 R09: 0000000000000000
          R10: 0000000000000000 R11: 0000000000000001 R12: ffff88800edad050
          R13: ffff8880065f8f88 R14: 0000000000000000 R15: ffff8880066c6680
          FS:  0000000000000000(0000) GS:ffff8880f3300000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 0000000000000000 CR3: 00000000e998c006 CR4: 00000000003706e0
          Call Trace:
           <TASK>
           xennet_remove+0x13d/0x300 [xen_netfront]
           xenbus_dev_remove+0x6d/0xf0
           __device_release_driver+0x17a/0x240
           device_release_driver+0x24/0x30
           bus_remove_device+0xd8/0x140
           device_del+0x18b/0x410
           ? _raw_spin_unlock+0x16/0x30
           ? klist_iter_exit+0x14/0x20
           ? xenbus_dev_request_and_reply+0x80/0x80
           device_unregister+0x13/0x60
           xenbus_dev_changed+0x18e/0x1f0
           xenwatch_thread+0xc0/0x1a0
           ? do_wait_intr_irq+0xa0/0xa0
           kthread+0x16b/0x190
           ? set_kthread_struct+0x40/0x40
           ret_from_fork+0x22/0x30
           </TASK>
      
      Fix this by calling xennet_destroy_queues() from xennet_uninit(),
      when real_num_tx_queues is still available. This ensures that queues are
      destroyed before real_num_tx_queues is set to 0, regardless of how
      unregister_netdev() was called.
      
      Originally reported at
      https://github.com/QubesOS/qubes-issues/issues/7257
      
      
      
      Fixes: d7dac083414eb5bb9 ("net-sysfs: update the queue counts in the unregistration path")
      Cc: stable@vger.kernel.org
      Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a1753d5c
    • Ville Syrjälä's avatar
      drm/i915: s/JSP2/ICP2/ PCH · ce41d803
      Ville Syrjälä authored
      commit 08783aa7693f55619859f4f63f384abf17cb58c5 upstream.
      
      This JSP2 PCH actually seems to be some special Apple
      specific ICP variant rather than a JSP. Make it so. Or at
      least all the references to it seem to be some Apple ICL
      machines. Didn't manage to find these PCI IDs in any
      public chipset docs unfortunately.
      
      The only thing we're losing here with this JSP->ICP change
      is Wa_14011294188, but based on the HSD that isn't actually
      needed on any ICP based design (including JSP), only TGP
      based stuff (including MCC) really need it. The documented
      w/a just never made that distinction because Windows didn't
      want to differentiate between JSP and MCC (not sure how
      they handle hpd/ddc/etc. then though...).
      
      Cc: stable@vger.kernel.org
      Cc: Matt Roper <matthew.d.roper@intel.com>
      Cc: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/4226
      
      
      Fixes: 943682e3 ("drm/i915: Introduce Jasper Lake PCH")
      Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20220224132142.12927-1-ville.syrjala@linux.intel.com
      
      Acked-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Tested-by: Tomas Bzatek <bugs@bzatek.net>
      (cherry picked from commit 53581504a8e216d435f114a4f2596ad0dfd902fc)
      Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ce41d803
    • Lennert Buytenhek's avatar
      iommu/amd: Recover from event log overflow · 61a895da
      Lennert Buytenhek authored
      
      commit 5ce97f4e upstream.
      
      The AMD IOMMU logs I/O page faults and such to a ring buffer in
      system memory, and this ring buffer can overflow.  The AMD IOMMU
      spec has the following to say about the interrupt status bit that
      signals this overflow condition:
      
      	EventOverflow: Event log overflow. RW1C. Reset 0b. 1 = IOMMU
      	event log overflow has occurred. This bit is set when a new
      	event is to be written to the event log and there is no usable
      	entry in the event log, causing the new event information to
      	be discarded. An interrupt is generated when EventOverflow = 1b
      	and MMIO Offset 0018h[EventIntEn] = 1b. No new event log
      	entries are written while this bit is set. Software Note: To
      	resume logging, clear EventOverflow (W1C), and write a 1 to
      	MMIO Offset 0018h[EventLogEn].
      
      The AMD IOMMU driver doesn't currently implement this recovery
      sequence, meaning that if a ring buffer overflow occurs, logging
      of EVT/PPR/GA events will cease entirely.
      
      This patch implements the spec-mandated reset sequence, with the
      minor tweak that the hardware seems to want a 0 written to
      MMIO Offset 0018h[EventLogEn] first, before a 1 is written to this
      field, or the IOMMU won't actually resume logging events.
      
      Signed-off-by: Lennert Buytenhek <buytenh@arista.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/YVrSXEdW2rzEfOvk@wantstofly.org
      
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      61a895da
    • Marek Vasut's avatar
      ASoC: ops: Shift tested values in snd_soc_put_volsw() by +min · 6951a588
      Marek Vasut authored
      
      commit 9bdd10d5 upstream.
      
      While the $val/$val2 values passed in from userspace are always >= 0
      integers, the limits of the control can be signed integers and the $min
      can be non-zero and less than zero. To correctly validate $val/$val2
      against platform_max, add the $min offset to val first.
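
      A userspace sketch of the shifted comparison follows.
      check_against_platform_max() is a hypothetical helper and the limits
      used with it are invented; treating a platform_max of 0 as "no
      limit" is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper: val is the non-negative value from userspace,
 * min is the control's (possibly negative) lower limit. The value is
 * shifted by min before it is compared against platform_max.
 */
static int check_against_platform_max(long val, int min, int platform_max)
{
	assert(val >= 0);	/* userspace values are never negative */

	if (platform_max && (val + min) > platform_max)
		return -EINVAL;
	return 0;
}
```

      For a control with min = -10 and platform_max = 5, a userspace value
      of 15 maps to 15 + (-10) = 5 and is accepted, while 16 maps to 6 and
      is rejected. Comparing the raw 15 against 5 would have rejected a
      valid setting.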
      
      Fixes: 817f7c93 ("ASoC: ops: Reject out of bounds values in snd_soc_put_volsw()")
      Signed-off-by: Marek Vasut <marex@denx.de>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20220215130645.164025-1-marex@denx.de
      
      Signed-off-by: Mark Brown <broonie@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6951a588
    • Alexandre Ghiti's avatar
      riscv: Fix config KASAN && DEBUG_VIRTUAL · dd9dd24f
      Alexandre Ghiti authored
      
      commit c648c4bb upstream.
      
      The __virt_to_phys() function is called very early in the boot process
      (i.e. from kasan_early_init()), so it must not be instrumented by KASAN;
      otherwise it fails.
      
      Fix this by declaring phys_addr.c as non-kasan instrumentable.
      
      Signed-off-by: Alexandre Ghiti <alexandre.ghiti@canonical.com>
      Fixes: 8ad8b727 ("riscv: Add KASAN support")
      Cc: stable@vger.kernel.org
      Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      dd9dd24f
    • Alexandre Ghiti's avatar
      riscv: Fix config KASAN && SPARSEMEM && !SPARSE_VMEMMAP · 7211aab2
      Alexandre Ghiti authored
      
      commit a3d328037846d013bb4c7f3777241e190e4c75e1 upstream.
      
      In order to get the pfn of a struct page* when sparsemem is enabled
      without vmemmap, the mem_section structures need to be initialized which
      happens in sparse_init.
      
      But kasan_early_init calls pfn_to_page way before sparse_init is called,
      which then tries to dereference a null mem_section pointer.
      
      Fix this by removing the usage of this function in kasan_early_init.
      
      Fixes: 8ad8b727 ("riscv: Add KASAN support")
      Signed-off-by: Alexandre Ghiti <alexandre.ghiti@canonical.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7211aab2
    • Sunil V L's avatar
      riscv/efi_stub: Fix get_boot_hartid_from_fdt() return value · 00fb385f
      Sunil V L authored
      
      commit dcf0c838854c86e1f41fb1934aea906845d69782 upstream.
      
      The get_boot_hartid_from_fdt() function currently returns U32_MAX
      on failure, which is incorrect because U32_MAX is a valid
      hartid value. This patch fixes the issue by returning an error code.
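
      The general shape of such a change is to separate the status from
      the value: return an error code and pass the result through an
      out-parameter. The sketch below is a userspace model;
      read_boot_hartid() and its arguments are hypothetical and do not
      reflect the real function's signature.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: prop points at the property holding the hart ID,
 * or is NULL if the property is missing. The ID is returned through
 * *hartid so that every 32-bit value, including UINT32_MAX, stays valid.
 */
static int read_boot_hartid(const uint32_t *prop, uint32_t *hartid)
{
	assert(hartid != NULL);

	if (prop == NULL)
		return -EINVAL;

	*hartid = *prop;
	return 0;
}
```

      With this shape a hart ID equal to UINT32_MAX is reported as success,
      which an in-band sentinel return value cannot express.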
      
      Cc: <stable@vger.kernel.org>
      Fixes: d7071743 ("RISC-V: Add EFI stub support.")
      Signed-off-by: Sunil V L <sunilvl@ventanamicro.com>
      Reviewed-by: Heinrich Schuchardt <heinrich.schuchardt@canonical.com>
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      00fb385f
    • Zhen Ni's avatar
      ALSA: intel_hdmi: Fix reference to PCM buffer address · 33687260
      Zhen Ni authored
      
      commit 0aa6b294b312d9710804679abd2c0c8ca52cc2bc upstream.
      
      PCM buffers might be allocated dynamically when the buffer
      preallocation failed or a larger buffer is requested, and it's not
      guaranteed that substream->dma_buffer points to the actually used
      buffer.  The driver needs to refer to substream->runtime->dma_addr
      instead for the buffer address.
      
      Signed-off-by: Zhen Ni <nizhen@uniontech.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lore.kernel.org/r/20220302074241.30469-1-nizhen@uniontech.com
      
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      33687260
    • Steven Rostedt's avatar
      tracing: Add ustring operation to filtering string pointers · e57dfaf6
      Steven Rostedt authored
      [ Upstream commit f37c3bbc635994eda203a6da4ba0f9d05165a8d6 ]
      
      Since referencing user space pointers is special, if the user wants to
      filter on a field that is a pointer to user space, then they need to
      specify it.
      
      Add a ".ustring" attribute to the field name for filters to state that the
      field is pointing to user space such that the kernel can take the
      appropriate action to read that pointer.
      
      Link: https://lore.kernel.org/all/yt9d8rvmt2jq.fsf@linux.ibm.com/
      
      
      
      Fixes: 77360f9bbc7e ("tracing: Add test for user space strings when filtering on string pointers")
      Tested-by: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      e57dfaf6
    • Qiang Yu's avatar
      drm/amdgpu: check vm ready by amdgpu_vm->evicting flag · 4a9d2390
      Qiang Yu authored
      
      [ Upstream commit c1a66c3bc425ff93774fb2f6eefa67b83170dd7e ]
      
      The workstation application ANSA/META v21.1.4 produces this error in
      dmesg when running the CI test suite provided by ANSA/META:
      [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-16)
      
      This is caused by the following sequence:
      1. A 256MB buffer is created in invisible VRAM.
      2. The CPU maps the buffer and accesses it, which causes a vm_fault
         that tries to move the buffer to visible VRAM.
      3. Space is forced in visible VRAM, and all VRAM BOs are traversed
         to check whether evicting each one is worthwhile.
      4. When a VM BO (in invisible VRAM) is checked, amdgpu_vm_evictable()
         sets amdgpu_vm->evicting, but because the BO is not in visible
         VRAM it is not actually evicted later, so it is not added to
         amdgpu_vm->evicted.
      5. Before the next CS clears amdgpu_vm->evicting, a user VM ops
         ioctl passes amdgpu_vm_ready() (which checks amdgpu_vm->evicted)
         but fails in amdgpu_vm_bo_update_mapping() (which checks
         amdgpu_vm->evicting), producing this error log.
      
      This error won't affect functionality as next CS will finish the
      waiting VM ops. But we'd better clear the error log by checking
      the amdgpu_vm->evicting flag in amdgpu_vm_ready() to stop calling
      amdgpu_vm_bo_update_mapping() later.
      
      Another reason is that the amdgpu_vm->evicted list holds all BOs
      (both user buffers and page tables), but only the eviction of page
      table BOs prevents VM ops. The amdgpu_vm->evicting flag is set only
      for page table BOs, so we should use the evicting flag instead of
      the evicted list in amdgpu_vm_ready().
      
      The side effect of this change is: previously blocked VM op (user
      buffer in "evicted" list but no page table in it) gets done
      immediately.
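
      The difference between the two readiness checks can be sketched as
      follows. The struct mock_vm and both helper functions are
      hypothetical simplifications: the evicted list is reduced to a
      counter, and any locking around the flag is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct amdgpu_vm. */
struct mock_vm {
    size_t nr_evicted;  /* stand-in for the length of the evicted list */
    bool evicting;      /* stand-in for the evicting flag */
};

/* Check as described for the old behaviour: ready if nothing is evicted. */
bool vm_ready_old(const struct mock_vm *vm)
{
    assert(vm != NULL);
    return vm->nr_evicted == 0;
}

/* Check as described for the new behaviour: ready unless evicting is set. */
bool vm_ready_new(const struct mock_vm *vm)
{
    assert(vm != NULL);
    return !vm->evicting;
}
```

      In the state reached after step 4 (evicting set, evicted list
      empty), the old check passes and the new one does not. With only a
      user buffer on the evicted list and the flag clear, the old check
      blocks while the new one lets the VM op proceed, which is the side
      effect noted above.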
      
      v2: update commit comments.
      
      Acked-by: Paul Menzel <pmenzel@molgen.mpg.de>
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Qiang Yu <qiang.yu@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4a9d2390
    • Sergey Shtylyov's avatar
      ata: pata_hpt37x: fix PCI clock detection · 67e25eb1
      Sergey Shtylyov authored
      
      [ Upstream commit 5f6b0f2d ]
      
      The f_CNT register (at the PCI config. address 0x78) is 16-bit, not
      8-bit! The bug was there from the very start... :-(
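
      To see why the access width matters, here is a small sketch with a
      mocked config space. The helpers cfg_read8() and cfg_read16() and
      the array layout are hypothetical; only the 0x78 offset comes from
      the message above. A byte read silently drops every bit above bit 7
      of the counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_CFG_SIZE 256
#define F_CNT_OFFSET  0x78  /* offset named in the commit message */

/* Read one byte from a mocked PCI config space. */
uint8_t cfg_read8(const uint8_t *cfg, size_t off)
{
    assert(cfg != NULL && off < MOCK_CFG_SIZE);
    return cfg[off];
}

/* Read a 16-bit little-endian value from a mocked PCI config space. */
uint16_t cfg_read16(const uint8_t *cfg, size_t off)
{
    assert(cfg != NULL && off + 1 < MOCK_CFG_SIZE);
    return (uint16_t)(cfg[off] | ((uint16_t)cfg[off + 1] << 8));
}
```

      With the bytes 0x2C and 0x01 stored at offsets 0x78 and 0x79, the
      16-bit read yields 0x012C while the 8-bit read yields only 0x2C.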
      
      Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru>
      Fixes: 669a5db4 ("[libata] Add a bunch of PATA drivers.")
      Cc: stable@vger.kernel.org
      Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      67e25eb1
    • Valentin Caron's avatar
      serial: stm32: prevent TDR register overwrite when sending x_char · 335f11ff
      Valentin Caron authored
      
      [ Upstream commit d3d079bde07e1b7deaeb57506dc0b86010121d17 ]
      
      When sending x_char in stm32_usart_transmit_chars(), the driver can
      overwrite the value of the TDR register with the value of x_char. If
      this happens, the previous value that was present in the TDR register
      will not be sent over the UART.

      This change checks that the previous value in the TDR register has
      been sent before writing the x_char value into the register.
      
      Fixes: 48a6092f ("serial: stm32-usart: Add STM32 USART Driver")
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: Valentin Caron <valentin.caron@foss.st.com>
      Link: https://lore.kernel.org/r/20220111164441.6178-2-valentin.caron@foss.st.com
      
      
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      335f11ff
    • Steven Rostedt's avatar
      tracing: Add test for user space strings when filtering on string pointers · c999c592
      Steven Rostedt authored
      [ Upstream commit 77360f9bbc7e5e2ab7a2c8b4c0244fbbfcfc6f62 ]
      
      Pingfan reported that the following causes a fault:
      
        echo "filename ~ \"cpu\"" > events/syscalls/sys_enter_openat/filter
        echo 1 > events/syscalls/sys_enter_openat/enable
      
      The reason is that trace event filter treats the user space pointer
      defined by "filename" as a normal pointer to compare against the "cpu"
      string. The following bug happened:
      
       kvm-03-guest16 login: [72198.026181] BUG: unable to handle page fault for address: 00007fffaae8ef60
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0001) - permissions violation
       PGD 80000001008b7067 P4D 80000001008b7067 PUD 2393f1067 PMD 2393ec067 PTE 8000000108f47867
       Oops: 0001 [#1] PREEMPT SMP PTI
       CPU: 1 PID: 1 Comm: systemd Kdump: loaded Not tainted 5.14.0-32.el9.x86_64 #1
       Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
       RIP: 0010:strlen+0x0/0x20
       Code: 48 89 f9 74 09 48 83 c1 01 80 39 00 75 f7 31 d2 44 0f b6 04 16 44 88 04 11
             48 83 c2 01 45 84 c0 75 ee c3 0f 1f 80 00 00 00 00 <80> 3f 00 74 10 48 89 f8
             48 83 c0 01 80 38 00 75 f7 48 29 f8 c3 31
       RSP: 0018:ffffb5b900013e48 EFLAGS: 00010246
       RAX: 0000000000000018 RBX: ffff8fc1c49ede00 RCX: 0000000000000000
       RDX: 0000000000000020 RSI: ffff8fc1c02d601c RDI: 00007fffaae8ef60
       RBP: 00007fffaae8ef60 R08: 0005034f4ddb8ea4 R09: 0000000000000000
       R10: ffff8fc1c02d601c R11: 0000000000000000 R12: ffff8fc1c8a6e380
       R13: 0000000000000000 R14: ffff8fc1c02d6010 R15: ffff8fc1c00453c0
       FS:  00007fa86123db40(0000) GS:ffff8fc2ffd00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007fffaae8ef60 CR3: 0000000102880001 CR4: 00000000007706e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       PKRU: 55555554
       Call Trace:
        filter_pred_pchar+0x18/0x40
        filter_match_preds+0x31/0x70
        ftrace_syscall_enter+0x27a/0x2c0
        syscall_trace_enter.constprop.0+0x1aa/0x1d0
        do_syscall_64+0x16/0x90
        entry_SYSCALL_64_after_hwframe+0x44/0xae
       RIP: 0033:0x7fa861d88664
      
      The above happened because the kernel tried to access user space directly
      and triggered a "supervisor read access in kernel mode" fault. Worse yet,
      the memory might not even be loaded yet, in which case a SEGFAULT could
      happen as well. The same can be true for kernel space accesses.
      
      To be even more robust, test both kernel and user space strings. If the
      string fails to read, then simply have the filter fail.
      
      Note, TASK_SIZE is used to determine if the pointer is user or kernel space
      and the appropriate strncpy_from_kernel/user_nofault() function is used to
      copy the memory. For some architectures, the comparison to TASK_SIZE may
      always pick user space or kernel space. If it gets it wrong, the only
      consequence is that the filter will fail to match. In the future, this needs
      to be fixed to have the event denote which should be used. But failing a
      filter is much better than panicking the machine, and that can be solved
      later.
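
      The dispatch described above can be sketched in user space with
      mocked accessors. MOCK_TASK_SIZE, the mock_copy_* routines, and
      filter_match_pchar() are hypothetical names for illustration; the
      mock copy routines stand in for the fault-tolerant kernel helpers
      and report a fault with a negative return value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MOCK_TASK_SIZE 0x0000800000000000ULL  /* hypothetical user/kernel split */
#define MOCK_STR_MAX   64

/* A copy routine returns the copied length, or a negative value on a fault. */
typedef long (*mock_copy_fn)(char *dst, uint64_t src_addr, size_t count);

struct mock_accessors {
    mock_copy_fn from_user;
    mock_copy_fn from_kernel;
};

/* Mock of a successful fault-tolerant copy: always yields the same string. */
long mock_copy_ok(char *dst, uint64_t src_addr, size_t count)
{
    (void)src_addr;
    strncpy(dst, "/sys/devices/system/cpu", count - 1);
    dst[count - 1] = '\0';
    return (long)strlen(dst);
}

/* Mock of a copy that hits an unreadable address. */
long mock_copy_fault(char *dst, uint64_t src_addr, size_t count)
{
    (void)dst; (void)src_addr; (void)count;
    return -1;
}

/* Pick the accessor by address range; an unreadable string never matches. */
bool filter_match_pchar(const struct mock_accessors *acc, uint64_t addr,
                        const char *pattern)
{
    char buf[MOCK_STR_MAX];
    mock_copy_fn copy;

    assert(acc != NULL && pattern != NULL);
    copy = addr < MOCK_TASK_SIZE ? acc->from_user : acc->from_kernel;
    if (copy(buf, addr, sizeof(buf)) < 0)
        return false;  /* the filter fails instead of faulting */
    return strstr(buf, pattern) != NULL;
}
```

      A misclassified pointer ends up in the wrong accessor, which fails
      the copy and therefore the match, but never dereferences the address
      directly.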
      
      Link: https://lore.kernel.org/all/20220107044951.22080-1-kernelfans@gmail.com/
      Link: https://lkml.kernel.org/r/20220110115532.536088fd@gandalf.local.home
      
      
      
      Cc: stable@vger.kernel.org
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Tom Zanussi <zanussi@kernel.org>
      Reported-by: Pingfan Liu <kernelfans@gmail.com>
      Tested-by: Pingfan Liu <kernelfans@gmail.com>
      Fixes: 87a342f5 ("tracing/filters: Support filtering for char * strings")
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c999c592
    • Christophe Vu-Brugier's avatar
      exfat: fix i_blocks for files truncated over 4 GiB · db36a94e
      Christophe Vu-Brugier authored
      
      [ Upstream commit 92fba084b79e6bc7b12fc118209f1922c1a2df56 ]
      
      In exfat_truncate(), the computation of inode->i_blocks is wrong if
      the file is larger than 4 GiB because a 32-bit variable is used as a
      mask. This is fixed and simplified by using round_up().
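
      A minimal sketch of this class of bug, assuming the mask is derived
      from a 32-bit cluster size (the function names and parameters below
      are hypothetical, not the exfat code): the complement is computed in
      32 bits and then zero-extended, so the AND clears the upper 32 bits
      of the 64-bit size.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy form: ~(cluster_size - 1) is a 32-bit value, zero-extended on use. */
uint64_t blocks_buggy(uint64_t size, uint32_t cluster_size, unsigned blkbits)
{
    return ((size + (cluster_size - 1)) & ~(cluster_size - 1)) >> blkbits;
}

/* Round x up to a power-of-two alignment, entirely in 64-bit arithmetic. */
uint64_t round_up_u64(uint64_t x, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (x + align - 1) & ~(align - 1);
}

/* Fixed form: the mask is 64 bits wide, so no high bits are lost. */
uint64_t blocks_fixed(uint64_t size, uint32_t cluster_size, unsigned blkbits)
{
    return round_up_u64(size, cluster_size) >> blkbits;
}
```

      For a 5 GiB file with 32 KiB clusters and a block shift of 9, the
      fixed form gives 10485760 blocks while the buggy form gives 2097152,
      the count for 1 GiB. Below 4 GiB the two forms agree, which is why
      the problem only appears for large files.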
      
      Also fix the same buggy computation in exfat_read_root() and another
      (correct) one in exfat_fill_inode(). The latter was fixed another way
      last month but can be simplified by using round_up() as well. See:
      
        commit 0c336d6e ("exfat: fix incorrect loading of i_blocks for
                              large files")
      
      Fixes: 98d91704 ("exfat: add file operations")
      Cc: stable@vger.kernel.org # v5.7+
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
      Signed-off-by: Christophe Vu-Brugier <christophe.vu-brugier@seagate.com>
      Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      db36a94e