  1. Jan 29, 2016
    • wext: fix message delay/ordering · 8bf86273
      Johannes Berg authored
      
      Beniamino reported that he was getting an RTM_NEWLINK message for a
      given interface, after the RTM_DELLINK for it. It turns out that the
      message is a wireless extensions message, which was sent because the
      interface had been connected and disconnection while it was deleted
      caused a wext message.
      
      For its netlink messages, wext uses RTM_NEWLINK, but the message is
      without all the regular rtnetlink attributes, so "ip monitor link"
      prints just rudimentary information:
      
      5: wlan1: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default
          link/ether 02:00:00:00:01:00 brd ff:ff:ff:ff:ff:ff
      Deleted 5: wlan1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default
          link/ether 02:00:00:00:01:00 brd ff:ff:ff:ff:ff:ff
      5: wlan1: <BROADCAST,MULTICAST,UP>
          link/ether
      (from my hwsim reproduction)
      
      This can cause userspace to get confused since it doesn't expect an
      RTM_NEWLINK message after RTM_DELLINK.
      
      The reason for this is that wext schedules a worker to send out the
      messages, and the scheduling delay can cause the messages to get out
      to userspace in different order.
      
      To fix this, have wext register a netdevice notifier and flush out
      any pending messages when netdevice state changes. This fixes any
      ordering whenever the original message wasn't sent by a notifier
      itself.
      
      Cc: stable@vger.kernel.org
      Reported-by: Beniamino Galvani <bgalvani@redhat.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
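The ordering problem and the fix described in this commit can be modelled in a few lines. This is a userspace sketch in Python, not kernel code: `DeferredSender`, `queue`, `flush`, `send_now` and `delete_interface` are hypothetical names, and the delayed worker is modelled as a list that is only drained when `flush()` runs.

```python
class DeferredSender:
    """Toy model of a sender that queues messages for a worker to deliver later."""

    def __init__(self):
        self.pending = []    # messages waiting for the (delayed) worker
        self.delivered = []  # order in which userspace sees messages

    def queue(self, msg):
        self.pending.append(msg)

    def flush(self):
        # Deliver everything queued so far, preserving queue order.
        self.delivered.extend(self.pending)
        self.pending.clear()

    def send_now(self, msg):
        # Messages sent directly bypass the queue.
        self.delivered.append(msg)


def delete_interface(sender, flush_on_notifier):
    # Disconnecting during deletion queues a wext event (sent as RTM_NEWLINK).
    sender.queue("RTM_NEWLINK (wext)")
    if flush_on_notifier:
        # Modelled fix: the netdevice notifier flushes pending messages first.
        sender.flush()
    sender.send_now("RTM_DELLINK")
    # The delayed worker eventually runs and drains whatever is left.
    sender.flush()
    return sender.delivered


before = delete_interface(DeferredSender(), flush_on_notifier=False)
after = delete_interface(DeferredSender(), flush_on_notifier=True)
print(before)
print(after)
```

In this model, only the run with the flush delivers the wext message ahead of RTM_DELLINK.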
  2. Jan 26, 2016
  3. Jan 14, 2016
    • mac80211: Don't buffer non-bufferable MMPDUs · da629cf1
      Helmut Schaa authored
      
      Non-bufferable MMPDUs are sent out to STAs even while in PS mode
      (for example probe responses). Applying filtered frame handling for
      these doesn't seem to make much sense and will only create more
      air utilization when the STA wakes up. Hence, apply filtered frame
      handling only for bufferable MMPDUs.
      
      Discovered while testing an old VOIP phone that started probing
      for APs while in PS mode. The mac80211/ath9k AP where the STA is
      associated would reply with a probe response but the phone sometimes
      moved to a new channel already and couldn't ack the probe response
      anymore. In that case mac80211 applied filtered frame handling
      for the un-acked probe response.
      
      Signed-off-by: Helmut Schaa <helmut.schaa@googlemail.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    • mac80211: handle sched_scan_stopped vs. hw restart race · 2bc533bd
      Eliad Peller authored
      
      On hw restart, mac80211 might try to reconfigure already
      stopped sched scan, if ieee80211_sched_scan_stopped_work()
      wasn't scheduled yet.
      
      This in turn will keep the device driver with scheduled scan
      configured, while both mac80211 and cfg80211 will clear
      their sched scan state once the work is scheduled.
      
      Fix it by ignoring ieee80211_sched_scan_stopped() calls
      while in hw restart, and flush the work before starting
      the reconfiguration.
      
      Signed-off-by: Eliad Peller <eliadx.peller@intel.com>
      Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    • mac80211: fix PS-Poll handling · 1a57081a
      Emmanuel Grumbach authored
      
      My commit below broke PS-Poll handling. In case the driver
      has no frames buffered, driver_release_tids will be 0, but
      calling find_highest_prio_tid() with 0 as a parameter is
      not a good idea:
      fls(0) - 1 = -1.
      This bug caused mac80211 to think that frames were buffered
      in the driver which in turn was confused because mac80211
      was asking to release frames that were not reported to
      exist.
      On iwlwifi, this led to the WARNING below:
      
      WARNING: CPU: 0 PID: 11230 at drivers/net/wireless/intel/iwlwifi/mvm/sta.c:1733 iwl_mvm_sta_modify_sleep_tx_count+0x2af/0x320 [iwlmvm]()
      ffffffffc0627c60 ffff8800069b7648 ffffffff81888913 0000000000000000
      0000000000000000 ffff8800069b7688 ffffffff81089d6a ffff8800069b7678
      0000000000000001 ffff88003b35abf0 ffff88000698b128 ffff8800069b76d4
      Call Trace:
      [<ffffffff81888913>] dump_stack+0x4c/0x65
      [<ffffffff81089d6a>] warn_slowpath_common+0x8a/0xc0
      [<ffffffff81089e5a>] warn_slowpath_null+0x1a/0x20
      [<ffffffffc05f36bf>] iwl_mvm_sta_modify_sleep_tx_count+0x2af/0x320 [iwlmvm]
      [<ffffffffc05dae41>] iwl_mvm_mac_release_buffered_frames+0x31/0x40 [iwlmvm]
      [<ffffffffc045d8b6>] ieee80211_sta_ps_deliver_response+0x6e6/0xd80 [mac80211]
      [<ffffffffc0461296>] ieee80211_sta_ps_deliver_poll_response+0x26/0x30 [mac80211]
      [<ffffffffc048f743>] ieee80211_rx_handlers+0xa83/0x2900 [mac80211]
      [<ffffffffc04917ad>] ieee80211_prepare_and_rx_handle+0x1ed/0xa70 [mac80211]
      [<ffffffffc045e3d5>] ? sta_info_get_bss+0x5/0x4a0 [mac80211]
      [<ffffffffc04925b6>] ieee80211_rx_napi+0x586/0xcd0 [mac80211]
      [<ffffffffc05eaa3e>] iwl_mvm_rx_rx_mpdu+0x59e/0xc60 [iwlmvm]
      
      Fixes: 0ead2510 ("mac80211: allow the driver to send EOSP when needed")
      Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
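The `fls(0) - 1 = -1` arithmetic from this commit can be checked with a small Python model. It is a simplified sketch, not the mac80211 code: `highest_tid` only models the "index of the highest set bit" step, and `pick_tid` is a hypothetical guarded wrapper showing one way to avoid the zero case.

```python
def fls(x: int) -> int:
    """Find last set bit, 1-based; returns 0 when no bit is set."""
    return x.bit_length()


def highest_tid(tids: int) -> int:
    """Index of the highest set bit in a TID bitmap: fls(tids) - 1."""
    return fls(tids) - 1


def pick_tid(driver_release_tids: int):
    """Guarded variant: only look for a TID when the bitmap is non-zero."""
    if driver_release_tids == 0:
        return None  # nothing buffered in the driver
    return highest_tid(driver_release_tids)
```

With an empty bitmap, `highest_tid(0)` yields -1, which is not a valid TID; the guarded variant reports "nothing to release" instead.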
    • mac80211: clear local->sched_scan_req properly on reconfig · b9f628fc
      Eliad Peller authored
      
      On reconfig, in case of sched_scan_req->n_scan_plans > 1,
      local->sched_scan_req was never cleared, although
      cfg80211_sched_scan_stopped_rtnl() was called, resulting
      in local->sched_scan_req holding a stale request and
      preventing further scheduled scan requests.
      
      Clear it explicitly in this case.
      
      Fixes: 42a7e82c6792 ("mac80211: Do not restart scheduled scan if multiple scan plans are set")
      Signed-off-by: Eliad Peller <eliadx.peller@intel.com>
      Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    • mac80211: avoid ROC during hw restart · 470f4d61
      Eliad Peller authored
      
      Defer ROC requests during hw restart, as the driver
      might not be fully configured at this stage (e.g.
      channel contexts were not added yet).
      
      Signed-off-by: Eliad Peller <eliadx.peller@intel.com>
      Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    • regulatory: fix world regulatory domain data · c3826807
      Johannes Berg authored
      
      The rule definitions here aren't really valid; they would
      be rejected if they came from userspace because the bandwidth
      specified is bigger than the rule's width.
      
      This is fairly inconsequential since the other rules
      around them do enable the bandwidth, but express that better
      using the NL80211_RRF_AUTO_BW flag.
      
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    • wireless: change cfg80211 regulatory domain info as debug messages · 94c4fd64
      Dave Young authored
      
      The cfg80211 module prints a lot of messages like the ones below.
      Printing them once is acceptable, but sometimes they are printed again
      and again, which is annoying. It is better to make these detailed
      messages debug-only.
      
      cfg80211: World regulatory domain updated:
      cfg80211:  DFS Master region: unset
      cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp), (dfs_cac_time)
      cfg80211:   (2402000 KHz - 2472000 KHz @ 40000 KHz), (N/A, 2000 mBm), (N/A)
      cfg80211:   (2457000 KHz - 2482000 KHz @ 40000 KHz), (N/A, 2000 mBm), (N/A)
      cfg80211:   (2474000 KHz - 2494000 KHz @ 20000 KHz), (N/A, 2000 mBm), (N/A)
      cfg80211:   (5170000 KHz - 5250000 KHz @ 80000 KHz, 160000 KHz AUTO), (N/A, 2000 mBm), (N/A)
      cfg80211:   (5250000 KHz - 5330000 KHz @ 80000 KHz, 160000 KHz AUTO), (N/A, 2000 mBm), (0 s)
      cfg80211:   (5490000 KHz - 5730000 KHz @ 160000 KHz), (N/A, 2000 mBm), (0 s)
      cfg80211:   (5735000 KHz - 5835000 KHz @ 80000 KHz), (N/A, 2000 mBm), (N/A)
      cfg80211:   (57240000 KHz - 63720000 KHz @ 2160000 KHz), (N/A, 0 mBm), (N/A)
      
      This patch replaces pr_info with pr_debug in the functions
      print_rd_rules and print_regdomain_info.
      
      Signed-off-by: Dave Young <dyoung@redhat.com>
      [change some pr_err() statements to at least keep the alpha2]
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    • mac80211: fix remain-on-channel cancellation · e6a8a3aa
      Johannes Berg authored
      
      Ilan's previous commit 1b894521 ("mac80211: handle HW
      ROC expired properly") neglected to take into account that
      hw_begun was now always set in the software implementation
      as well as the offloaded case.
      
      Fix hw_begun to only apply to the offloaded case to make
      the check in Ilan's commit safe and correct.
      
      Reported-by: Jouni Malinen <j@w1.fi>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    • mac80211: recalculate SW ROC only when needed · e9db4557
      Johannes Berg authored
      
      The current (new) code recalculates the new work timeout
      for software remain-on-channel whenever any item started.
      In two of the callers of ieee80211_handle_roc_started(),
      this is completely pointless since they're for hardware
      and will skip the recalculation entirely; it's necessary
      only in the case of having just added a new item to the
      list, as in the last remaining case the recalculation had
      just been done.
      
      This last case, however, is also problematic - if one of
      the items on the list actually expires during the recalc
      the list iteration outside becomes corrupted and crashes.
      
      Fix this by moving the recalculation to the only place
      where it's required.
      
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
  4. Jan 13, 2016
  5. Jan 12, 2016
  6. Jan 11, 2016
    • udp: disallow UFO for sockets with SO_NO_CHECK option · 40ba3302
      Michal Kubeček authored
      
      Commit acf8dd0a ("udp: only allow UFO for packets from SOCK_DGRAM
      sockets") disallows UFO for packets sent from raw sockets. We need to do
      the same for SOCK_DGRAM sockets with the SO_NO_CHECK option, albeit
      for a slightly different reason: while such a socket would override the
      CHECKSUM_PARTIAL set by ip_ufo_append_data(), gso_size is still set and
      the bad offloading flags warning is triggered in __skb_gso_segment().
      
      In the IPv6 case, SO_NO_CHECK option is ignored but we need to disallow
      UFO for packets sent by sockets with UDP_NO_CHECK6_TX option.
      
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Tested-by: Shannon Nelson <shannon.nelson@intel.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: pktgen: fix null ptr deref in skb allocation · 3de03596
      John Fastabend authored
      
      Fix possible null pointer dereference that may occur when calling
      skb_reserve() on a null skb.
      
      Fixes: 879c7220 ("net: pktgen: Observe needed_headroom of the device")
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: support ipv6 for bpf_skb_{set,get}_tunnel_key · c6c33454
      Daniel Borkmann authored
      
      After IPv6 support has recently been added to metadata dst and related
      encaps, add support for populating/reading it from an eBPF program.
      
      Commit d3aa45ce ("bpf: add helpers to access tunnel metadata") started
      with initial IPv4-only support back then (due to IPv6 metadata support
      not being available yet).
      
      To stay compatible with older programs, we need to test for the passed
      structure size. Also TOS and TTL support from the ip_tunnel_info key has
      been added. Tested with vxlan devs in collect meta data mode with IPv4,
      IPv6 and in compat mode over different network namespaces.
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
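The "test for the passed structure size" idea from this commit can be sketched as follows. The two layouts are invented for illustration and are not the kernel's `struct bpf_tunnel_key`; `OLD_FMT`, `NEW_FMT` and `parse_key` are hypothetical names. The point is only that the callee selects the layout from the size the caller passes, so older callers keep working.

```python
import struct

OLD_FMT = "<II"      # hypothetical old layout: tunnel id, IPv4 address
NEW_FMT = "<I16sBB"  # hypothetical new layout: tunnel id, 16-byte address, TOS, TTL
OLD_SIZE = struct.calcsize(OLD_FMT)
NEW_SIZE = struct.calcsize(NEW_FMT)


def parse_key(buf: bytes) -> dict:
    """Pick the layout from the size the caller passed; reject anything else."""
    if len(buf) == NEW_SIZE:
        tunnel_id, addr, tos, ttl = struct.unpack(NEW_FMT, buf)
        return {"id": tunnel_id, "addr": addr, "tos": tos, "ttl": ttl}
    if len(buf) == OLD_SIZE:
        # Older callers only know the short layout; fill in defaults.
        tunnel_id, ipv4 = struct.unpack(OLD_FMT, buf)
        return {"id": tunnel_id, "addr": ipv4, "tos": 0, "ttl": 0}
    raise ValueError("unsupported key size")
```

A caller built against the short layout passes `OLD_SIZE` bytes and gets default TOS/TTL values; any other size is rejected rather than guessed at.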
    • bpf: export helper function flags and reject invalid ones · 781c53bc
      Daniel Borkmann authored
      
      Export flags used by eBPF helper functions through UAPI, so they can be
      used by programs (instead of them redefining all flags each time or just
      using the hard-coded values). It also gives a better overview of which flags
      are used where, and we can further get rid of the extra macros defined in
      filter.c. Moreover, reject invalid flags.
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
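Rejecting invalid flags usually comes down to masking against the set of known bits. The sketch below assumes an imaginary helper with two flags; `F_RECOMPUTE`, `F_INGRESS` and `helper` are hypothetical, and the real flag names and values live in the kernel's UAPI header rather than here.

```python
import errno

# Hypothetical flag bits for an imaginary helper (not the kernel's values).
F_RECOMPUTE = 1 << 0
F_INGRESS = 1 << 1
ALLOWED_FLAGS = F_RECOMPUTE | F_INGRESS


def helper(flags: int) -> int:
    """Return 0 on success or -EINVAL if any unknown flag bit is set."""
    if flags & ~ALLOWED_FLAGS:
        return -errno.EINVAL
    return 0
```

Rejecting unknown bits up front keeps them free to be given a meaning later without silently changing behaviour for existing callers.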
    • sched,cls_flower: set key address type when present · 66530bdf
      Jamal Hadi Salim authored
      
      Only when user space passes the addresses should we consider their
      presence.
      
      Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Acked-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp_yeah: don't set ssthresh below 2 · 83d15e70
      Neal Cardwell authored
      
      For tcp_yeah, use an ssthresh floor of 2, the same floor used by Reno
      and CUBIC, per RFC 5681 (equation 4).
      
      tcp_yeah_ssthresh() was sometimes returning a 0 or negative ssthresh
      value if the intended reduction was as big as or bigger than the current
      cwnd. Congestion control modules should never return a zero or
      negative ssthresh. A zero ssthresh generally results in a zero cwnd,
      causing the connection to stall. A negative ssthresh value will be
      interpreted as a u32 and will set a target cwnd for PRR near 4
      billion.
      
      Oleksandr Natalenko reported that a system using tcp_yeah with ECN
      could see a warning about a prior_cwnd of 0 in
      tcp_cwnd_reduction(). Testing verified that this was due to
      tcp_yeah_ssthresh() misbehaving in this way.
      
      Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
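The two failure modes named in this commit (a zero ssthresh, and a negative one read back as a u32) are easy to reproduce in a small model. This is a sketch under simplifying assumptions: `cwnd` and `reduction` are plain integers, the function names are hypothetical, and the way tcp_yeah actually computes the reduction is not modelled.

```python
U32_MASK = 0xFFFFFFFF


def ssthresh_unguarded(cwnd: int, reduction: int) -> int:
    """cwnd - reduction with no floor, as it would read back from a u32."""
    return (cwnd - reduction) & U32_MASK


def ssthresh_with_floor(cwnd: int, reduction: int) -> int:
    """Same subtraction, clamped to a floor of 2 segments."""
    return max(cwnd - reduction, 2)
```

With `cwnd=1` and `reduction=2` the unguarded version wraps to a value near 4 billion, while the floored version returns 2; for ordinary inputs such as `cwnd=10, reduction=3` both agree on 7.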
    • sctp: fix use-after-free in pr_debug statement · 649621e3
      Marcelo Ricardo Leitner authored
      
      Dmitry Vyukov reported a use-after-free in the code expanded by the
      macro debug_post_sfx, which is caused by the use of the asoc pointer
      after it was freed within sctp_side_effect() scope.
      
      This patch fixes it by allowing sctp_side_effect to clear that asoc
      pointer when the TCB is freed.
      
      As Vlad explained, we also have to cover the SCTP_DISPOSITION_ABORT case
      because it will trigger DELETE_TCB too on that same loop.
      
      Also, there were places issuing SCTP_CMD_INIT_FAILED and ASSOC_FAILED
      but returning SCTP_DISPOSITION_CONSUME, which would fool the scheme
      above. Fix it by returning SCTP_DISPOSITION_ABORT instead.
      
      The macro is already prepared to handle such NULL pointer.
      
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/rtnetlink: remove unused sz_idx variable · 617cfc75
      Alexander Kuleshov authored
      
      The sz_idx variable is defined in the rtnetlink_rcv_msg(), but
      not used anywhere. Let's remove it.
      
      Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • unix: properly account for FDs passed over unix sockets · 712f4aad
      willy tarreau authored
      
      It is possible for a process to allocate and accumulate far more FDs than
      the process' limit by sending them over a unix socket then closing them
      to keep the process' fd count low.
      
      This change addresses this problem by keeping track of the number of FDs
      in flight per user and preventing non-privileged processes from having
      more FDs in flight than their configured FD limit.
      
      Reported-by: <socketpair@gmail.com>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Mitigates: CVE-2013-4312 (Linux 2.0+)
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
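The per-user accounting described in this commit can be pictured with a toy tracker. This is a sketch, not the kernel implementation: `InflightTracker`, `try_send` and `received` are hypothetical names, and the exact point at which the count is compared against the limit is an assumption made for the example.

```python
from collections import defaultdict


class InflightTracker:
    """Toy per-user count of file descriptors currently in flight over sockets."""

    def __init__(self):
        self.inflight = defaultdict(int)

    def try_send(self, user, nfds, fd_limit, privileged=False):
        # Refuse if an unprivileged user would exceed the FD limit in flight.
        if not privileged and self.inflight[user] + nfds > fd_limit:
            return False
        self.inflight[user] += nfds
        return True

    def received(self, user, nfds):
        # FDs leave the in-flight state when the receiver picks them up.
        self.inflight[user] -= nfds
```

Because the count is kept per user rather than per process, closing the local copies of the descriptors after sending them no longer hides them from the limit.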
    • openvswitch: update kernel doc for struct vport · c5420eb1
      Jean Sacren authored
      
      commit be4ace6e ("openvswitch: Move dev pointer into vport itself")
      
      The commit above added @dev and moved @rcu to the bottom of struct
      vport, but the change was not reflected in the kernel doc. So let's
      update the kernel doc as well.
      
      Signed-off-by: Jean Sacren <sakiwit@gmail.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • openvswitch: fix struct geneve_port member name · 2f7066ad
      Jean Sacren authored
      
      commit 6b001e68 ("openvswitch: Use Geneve device.")
      
      The commit above introduced 'port_no' as the name for the member of
      struct geneve_port. The correct name should be 'dst_port' as described
      in the kernel doc. Let's fix that member name and all the pertinent
      instances so that both doc and code would be consistent.
      
      Signed-off-by: Jean Sacren <sakiwit@gmail.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • openvswitch: clean up unused function · 5ea03042
      Jean Sacren authored
      
      commit 6b001e68 ("openvswitch: Use Geneve device.")
      
      The commit above deleted the only call site of ovs_tunnel_route_lookup()
      and now that function is not used any more. So let's delete the function
      definition as well.
      
      Signed-off-by: Jean Sacren <sakiwit@gmail.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: tcp: add rcu locking in tcp_v6_send_synack() · 3e4006f0
      Eric Dumazet authored
      
      When the first SYNACK is sent, we already hold rcu_read_lock(), but this
      is not true if a SYNACK is retransmitted, as a timer (soft) interrupt
      does not hold rcu_read_lock().
      
      Fixes: 45f6fad8 ("ipv6: add complete rcu protection around np->opt")
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add scheduling point in recvmmsg/sendmmsg · a78cb84c
      Eric Dumazet authored
      
      Applications often have to reduce the number of datagrams
      they receive or send per system call to avoid starvation problems.
      
      Really the kernel should take care of this by using cond_resched(),
      so that applications can experiment with bigger batch sizes.
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: always add flag an address that failed DAD with DADFAILED · 3d171f39
      Lubomir Rintel authored
      
      Userspace needs to know why the address is being removed so that it can
      perhaps obtain a new address.
      
      Without the DADFAILED flag it's impossible to distinguish removal of a
      temporary and tentative address due to DAD failure from other reasons (device
      removed, manual address removal).
      
      Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net, sched: add clsact qdisc · 1f211a1b
      Daniel Borkmann authored
      
      This work adds a generalization of the ingress qdisc as a qdisc holding
      only classifiers. The clsact qdisc works on ingress, but also on egress.
      In both cases, its execution happens without taking the qdisc lock, and
      the main difference for the egress part compared to prior version of [1]
      is that this can be applied with _any_ underlying real egress qdisc (also
      classless ones).
      
      Besides solving the use-case of [1], that is, allowing for more programmability
      on assigning skb->priority for the mqprio case that is supported by most
      popular 10G+ NICs, it also opens up a lot more flexibility for other tc
      applications. The main work on classification can already be done at clsact
      egress time if the use-case allows and state stored for later retrieval
      f.e. again in skb->priority with major/minors (which is checked by most
      classful qdiscs before consulting tc_classify()) and/or in other skb fields
      like skb->tc_index for some light-weight post-processing to get to the
      eventual classid in case of a classful qdisc. Another use case is that
      the clsact egress part allows to have a central egress counterpart to
      the ingress classifiers, so that classifiers can easily share state (e.g.
      in cls_bpf via eBPF maps) for ingress and egress.
      
      Currently, default setups like mq + pfifo_fast would require for this to
      use, for example, prio qdisc instead (to get a tc_classify() run) and to
      duplicate the egress classifier for each queue. With clsact, it allows
      for leaving the setup as is, it can additionally assign skb->priority to
      put the skb in one of pfifo_fast's bands and it can share state with maps.
      Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
      w/o the need to perform a skb_dst_force() to hold on to it any longer. In
      lwt case, we can also use this facility to setup dst metadata via cls_bpf
      (bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
      that (case of IFF_NO_QUEUE devices, for example).
      
      The realization can be done without any changes to the scheduler core
      framework. All it takes is that we have two a-priori defined minors/child
      classes, where we can mux between ingress and egress classifier list
      (dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
      dev->_tx to avoid extra cacheline miss for moderate loads). The egress
      part is modelled somewhat similarly to handle_ing() and patched to a noop in
      case the functionality is not used. Both handlers are now called
      sch_handle_ingress() and sch_handle_egress(), code sharing among the two
      doesn't seem practical as there are various minor differences in both
      paths, so that making them conditional in a single handler would rather
      slow things down.
      
      Full compatibility to ingress qdisc is provided as well. Since both
      piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
      per netdevice, and thus ingress qdisc specific behaviour can be retained
      for user space. This means, either a user does 'tc qdisc add dev foo ingress'
      and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
      alternative, where both, ingress and egress classifier can be configured
      as in the below example. ingress qdisc supports attaching classifier to any
      minor number whereas clsact has two fixed minors for muxing between the
      lists, therefore to not break user space setups, they are better done as
      two separate qdiscs.
      
      I decided to extend the sch_ingress module with clsact functionality so
      that commonly used code can be reused, the module is being aliased with
      sch_clsact so that it can be auto-loaded properly. Alternative would have been
      to add a flag when initializing ingress to alter its behaviour plus aliasing
      to a different name (as it's more than just ingress). However, the first would
      end up, based on the flag, choosing the new/old behaviour by calling different
      function implementations to handle each anyway, the latter would require to
      register ingress qdisc once again under different alias. So, this really begs
      to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops
      by its own that share callbacks used by both.
      
      Example, adding qdisc:
      
         # tc qdisc add dev foo clsact
         # tc qdisc show dev foo
         qdisc mq 0: root
         qdisc pfifo_fast 0: parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc pfifo_fast 0: parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc pfifo_fast 0: parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc pfifo_fast 0: parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc clsact ffff: parent ffff:fff1
      
      Adding filters (deleting etc. works analogously by specifying ingress/egress):
      
         # tc filter add dev foo ingress bpf da obj bar.o sec ingress
         # tc filter add dev foo egress  bpf da obj bar.o sec egress
         # tc filter show dev foo ingress
         filter protocol all pref 49152 bpf
         filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
         # tc filter show dev foo egress
         filter protocol all pref 49152 bpf
         filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action
      
      A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
      show an empty list for clsact. Either using the parent names (ingress/egress)
      or specifying the full major/minor will then show the related filter lists.
      
      Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.
      
        [1] http://patchwork.ozlabs.org/patch/512949/
      
      
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
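The "two fixed minors for muxing between the lists" can be pictured with a small handle decoder. This is an illustrative Python sketch: the constant values are assumptions (believed to match the kernel's pkt_sched.h, but check the header before relying on them), and `make_parent` and `select_filter_list` are hypothetical helpers, not kernel functions.

```python
# Assumed handle values; verify against the kernel header before use.
TC_H_CLSACT = 0xFFFFFFF1
TC_H_MIN_INGRESS = 0xFFF2
TC_H_MIN_EGRESS = 0xFFF3


def tc_h_min(handle: int) -> int:
    """Minor part of a 32-bit tc handle (low 16 bits)."""
    return handle & 0xFFFF


def make_parent(minor: int) -> int:
    """Combine the clsact major (ffff:) with one of the fixed minors."""
    return (TC_H_CLSACT & 0xFFFF0000) | minor


def select_filter_list(parent: int, ingress_list: list, egress_list: list):
    """Pick the classifier list addressed by the parent handle's minor number."""
    minor = tc_h_min(parent)
    if minor == TC_H_MIN_INGRESS:
        return ingress_list
    if minor == TC_H_MIN_EGRESS:
        return egress_list
    return None  # any other minor addresses neither list
```

In this model a parent of `ffff:` (minor 0) selects neither list, which mirrors the note above that `tc filter show dev foo parent ffff:` shows an empty list for clsact.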
  7. Jan 10, 2016
    • net: sctp: prevent writes to cookie_hmac_alg from accessing invalid memory · 320f1a4a
      Sasha Levin authored
      
      proc_dostring() needs an initialized destination string, while the one
      provided in proc_sctp_do_hmac_alg() contains stack garbage.
      
      Thus, writing to cookie_hmac_alg would strlen() that garbage and end up
      accessing invalid memory.
      
      Fixes: 3c68198e ("sctp: Make hmac algorithm selection for cookie generation dynamic")
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add skb_postpush_rcsum and fix dev_forward_skb occasions · f8ffad69
      Daniel Borkmann authored
      
      Add a small helper skb_postpush_rcsum() and fix up redirect locations
      that need CHECKSUM_COMPLETE fixups on ingress. dev_forward_skb() expects
      a proper csum that covers also Ethernet header, f.e. since 2c26d34b
      ("net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding"), we
      also do skb_postpull_rcsum() after pulling Ethernet header off via
      eth_type_trans().
      
      When using eBPF in a netns setup f.e. with vxlan in collect metadata mode,
      I can trigger the following csum issue with an IPv6 setup:
      
        [  505.144065] dummy1: hw csum failure
        [...]
        [  505.144108] Call Trace:
        [  505.144112]  <IRQ>  [<ffffffff81372f08>] dump_stack+0x44/0x5c
        [  505.144134]  [<ffffffff81607cea>] netdev_rx_csum_fault+0x3a/0x40
        [  505.144142]  [<ffffffff815fee3f>] __skb_checksum_complete+0xcf/0xe0
        [  505.144149]  [<ffffffff816f0902>] nf_ip6_checksum+0xb2/0x120
        [  505.144161]  [<ffffffffa08c0e0e>] icmpv6_error+0x17e/0x328 [nf_conntrack_ipv6]
        [  505.144170]  [<ffffffffa0898eca>] ? ip6t_do_table+0x2fa/0x645 [ip6_tables]
        [  505.144177]  [<ffffffffa08c0725>] ? ipv6_get_l4proto+0x65/0xd0 [nf_conntrack_ipv6]
        [  505.144189]  [<ffffffffa06c9a12>] nf_conntrack_in+0xc2/0x5a0 [nf_conntrack]
        [  505.144196]  [<ffffffffa08c039c>] ipv6_conntrack_in+0x1c/0x20 [nf_conntrack_ipv6]
        [  505.144204]  [<ffffffff8164385d>] nf_iterate+0x5d/0x70
        [  505.144210]  [<ffffffff816438d6>] nf_hook_slow+0x66/0xc0
        [  505.144218]  [<ffffffff816bd302>] ipv6_rcv+0x3f2/0x4f0
        [  505.144225]  [<ffffffff816bca40>] ? ip6_make_skb+0x1b0/0x1b0
        [  505.144232]  [<ffffffff8160b77b>] __netif_receive_skb_core+0x36b/0x9a0
        [  505.144239]  [<ffffffff8160bdc8>] ? __netif_receive_skb+0x18/0x60
        [  505.144245]  [<ffffffff8160bdc8>] __netif_receive_skb+0x18/0x60
        [  505.144252]  [<ffffffff8160ccff>] process_backlog+0x9f/0x140
        [  505.144259]  [<ffffffff8160c4a5>] net_rx_action+0x145/0x320
        [...]
      
      What happens is that on ingress, we push Ethernet header back in, either
      from cls_bpf or right before skb_do_redirect(), but without updating csum.
      The "hw csum failure" can be fixed by using the new skb_postpush_rcsum()
      helper for the dev_forward_skb() case to correct the csum diff again.
      
      Thanks to Hannes Frederic Sowa for the csum_partial() idea!
      
      Fixes: 3896d655 ("bpf: introduce bpf_clone_redirect() helper")
      Fixes: 27b29f63 ("bpf: add bpf_redirect() helper")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
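Why the checksum needs correcting after the Ethernet header is pushed back can be shown with a ones' complement sum in Python. This is a userspace model, not the kernel helper: `csum_partial` here is a simplified stand-in that works on 16-bit big-endian words, and the header and payload bytes are made up for the example.

```python
def csum_partial(data: bytes, csum: int = 0) -> int:
    """16-bit ones' complement sum of data, added to csum and folded."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = csum
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        # Fold carries back into the low 16 bits.
        total = (total & 0xFFFF) + (total >> 16)
    return total


header = bytes(range(14))        # stand-in for a 14-byte Ethernet header
payload = b"example payload!"    # stand-in for the rest of the packet

csum_after_pull = csum_partial(payload)                # header pulled off
csum_stale = csum_after_pull                           # header pushed, csum untouched
csum_fixed = csum_partial(header, csum_after_pull)     # header pushed, csum updated
csum_expected = csum_partial(header + payload)         # sum over the whole frame
```

In this model the stale value no longer matches the sum over the whole frame, while adding the header's partial sum restores it, which is the kind of correction the commit describes for the dev_forward_skb() case.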
    • net, sched: add skb_at_tc_ingress helper · fdc5432a
      Daniel Borkmann authored
      
      Add a skb_at_tc_ingress() as this will be needed elsewhere as well and
      can hide the ugly ifdef.
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Namespecify the tcp_keepalive_intvl sysctl knob · b840d15d
      Nikolay Borisov authored
      
      This is the final part required to namespaceify the tcp
      keep alive mechanism.
      
      Signed-off-by: Nikolay Borisov <kernel@kyup.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Namespecify tcp_keepalive_probes sysctl knob · 9bd6861b
      Nikolay Borisov authored
      
      This is required to have full tcp keepalive mechanism namespace
      support.
      
      Signed-off-by: Nikolay Borisov <kernel@kyup.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>