Meraki Auto RF Explained

Meraki loves to chalk up the secret sauce in their products to “Meraki Magic” and boasts “anyone can do it”. Yet our inner engineering geek wants to open the curtain and see the real show. An example of that is Auto RF, which is a form of Radio Resource Management (RRM) that allows Meraki Wi-Fi access points to dynamically plan WLAN channels and radio transmit (TX) power. The following sections will break down what Auto RF is and how it works.

Auto RF is made up of two major components: Auto Channel and Auto TX Power. The goal is to provide an initial channel plan, and then adjust dynamically over time based on the environment. Both features are enabled by default, reducing the number of steps required to deploy Meraki access points effectively.

All currently shipping Meraki access points are built with a dedicated 2.4GHz/5GHz scanning radio, which constantly scans the entire usable spectrum. This radio, among other things such as location analytics and WIPS, is used to detect neighboring BSS’s and make off-channel scans without consuming airtime on client-serving radios. The scanning radio dwells on every channel to monitor duty cycle and detect levels of non-802.11 interference. It also sends probes on non-DFS channels to detect neighboring BSS’s and listens for beacons on all channels.

AUTO CHANNEL

The current iteration of Auto Channel comes from an algorithm called TurboCA. Auto Channel is designed to react to degrading conditions while balancing client performance against the disruptiveness of changing channels. Fortunately, with the 802.11-2012 standard we have better adoption of 802.11h, which defines standardized Channel Switch Announcements (CSAs) that reduce the impact of moving to a new channel by notifying clients when they will change channels and what channel they are moving to, so that clients can follow. Auto Channel relies on this heavily where possible, but also takes into account that many clients do not support CSAs, so it tries not to change channels frequently unless necessary.

Channel Switch Announcements

How does it work?

You can refer to the above linked TurboCA article for the full mathematical detail, but this section will summarize the process.

The goal of Auto Channel is to build a channel plan that minimizes channel overlap, optimizes cell sizes for better roaming, and maximizes channel efficiency by picking the best channel available for each AP. Then, it regularly rebuilds the plan in search of an optimization. The computation for the channel plan happens in the Meraki Cloud, where all Meraki access points report their logging data.

The key metrics for the algorithm are:

  • Node (AP) Performance
  • Network (AP Set) Performance
  • Channel Quality (noise floor, non-802.11 interference, neighboring BSS’s, etc.)
  • Channel Width
  • AP Load (number of associated clients)
  • Channel Switch Penalty
  • Hop Limit

Node performance is a calculation of how well an access point should perform on a given channel and channel width. Network performance is the product of the performance of all nodes in a Meraki Network, which is important as an individual node score close to zero will bring down the Network performance score, ensuring that a channel plan will not create issues for one area while optimizing another.

One bad node performance can rule out a channel plan
(arbitrary numbers used for examples)

Channel quality measures non-802.11 interference, duty cycle, and channel width. Channel switch penalty is a metric designed to reduce the number of channel changes for negligible benefit to reduce negative impact impact, and is weighted heavier on 2.4GHz where fewer clients support CSAs.

Hop Limit is used to determine how many neighboring APs we will consider when planning an AP’s channel. This basically determines the “aggressiveness” of the calculation. Meraki runs this calculation at 3 different intervals with different hop limits:

  • Every 15 minutes with a hop limit of “0”.
  • Every 3 hours with a hop limit of “1”, then “0”.
  • Every 24 hours with a hop limit of “2”, then “1”, then “0”

With a hop limit of “0” an AP only considers itself and directly neighboring APs when planning its channel. By running this more frequently, an AP can react to significant events quickly (such as a jammed channel) but won’t change channels too frequently. The more aggressive plans are run less frequently to balance creating a more globally optimal plan vs. changing channels too frequently.

The Auto RF Process

The process starts by inputting the current channel plan (if one exists), and collecting the scanning results and load information from each AP. It then picks a pseudo-random AP and identifies the channel that will render the best “node performance” for that AP. This selection favors picking heavier loaded APs first, as more clients actively connected to an AP signals that it is more important. By doing this, more actively used APs will have a better chance at picking the best channel available rather than running last and taking whatever channels are left.

After all APs have been assigned a channel and the cloud has calculated the predicted node performance for each, the network performance of the plan is determined. If the network performance of the new plan is better than the current plan, the new plan becomes the proposed plan. The algorithm is run ten times to compare multiple possible configurations. Once all iterations are run, the proposed plan becomes the current plan, and updated APs will change their channels accordingly.

Note that previous iterations of Auto Channel (before 802.11h) would not switch channels if a client was currently associated. Because Meraki now changes channels while clients are associated, it can lead to disruptions with clients using real-time applications that don’t support CSAs. If this is causing a negative impact, Meraki Support can revert this behavior upon request.

Exceptions

As with all things RF-related, an automatic algorithm will not fit every environment. In challenging RF environments, or high density deployments, Auto Channel can fall short. In these scenarios, manual channel assignment may be a better option, but Auto Channel can still be used as a starting point to reduce the amount of manual configuration required.

Static channel assignment

APs with a static channel assignment will be used in the plan to identify a used channel, but will be used in the algorithm to generate a plan or calculate network performance.

DFS events will always override a static or auto channel plan and trigger an immediate channel change, as required by the FCC.

If a jammed channel is detected, meaning that levels of non-802.11 interference exceed 65% for longer than one minute, a channel change will occur without waiting for the next run of Auto Channel.

If an AP is being used for wireless mesh, it will not change channels as this will have a significant impact on all APs and clients using that mesh route.

AUTO TX POWER

As with Auto Channel, Auto TX Power calculations are done in the Cloud, and the process is run every twenty minutes. A neighbor report is collected from each AP in the network, which contains the Signal-to-Noise Ratio (SNR) for all neighboring APs in the same Meraki network. The AP also reports its currently connected clients along with their SNR.

Using these lists, the Cloud compiles a list of “direct neighbors” for each AP (defined as any AP in the Meraki network with an SNR of 8dB or greater), and calculates what the ideal TX power should be. For each AP, the Cloud attempts to keep the SNR for its strongest direct neighbor at 30dB and always higher than 17dB for every direct neighbor.

An AP will never reduce its transmit power if a client is connected with SNR <10dB. Generally, if a client is connected with SNR <10dB it is looking for a better AP to roam to. If it hasn’t roamed, it can be assumed that a better AP is not available, so reducing the transmit power will only worsen that client’s performance.

To prevent dramatic changes in TX power which could have unintended results, at each twenty minute run an AP can increase its transmit power by 1dB or lower by 1-3dB. When a new Meraki AP is deployed, it starts at the highest transmit power supported by AP within the regulatory domain of which it is a member, unless overridden by an RF Profile or otherwise statically configured. This means that it could take several iterations before an AP reaches its optimal transmit power level.

RF Profiles can be used to define operating parameters for Auto RF

EXCEPTIONS

Auto TX Power will never set the transmit power lower than 5dBm on the 2.4GHz radios or 8dBm on the 5GHz radios to avoid setting a value which is unusably low where there are a high density of APs. There are valid use cases, such as when using directional antennas or in challenging RF environments, where such a low value is warranted. These environments usually require manual tuning anyway, in which case static values can be set in the Dashboard.

Static transmit power assignment

If an AP has an active mesh neighbor, it will not increase or decrease its transmit power. When using mesh, if an AP has no client serving SSIDs enabled it will always use its maximum available transmit power.

Active mesh prevents transmit power changes

If an AP only has one direct neighbor, it’s considered risky to reduce transmit power so it’s not done as often.

Monitoring

The Meraki Dashboard allows for several tools to monitor the current channel plan and any changes that have been made by Auto RF.

The Wireless > Radio Settings page allows you to identify the current channel and transmit power being used by each AP, as well as the target power range that Auto TX Power is using:

Radio Settings

Clicking on any AP takes you to the Status Page, where the RF tab displays a lot of information about client count, channel utilization, and any changes made to the access point by Auto RF. In the below screenshot, we can see that Auto TX power has adjusted the transmit power, and clicking Details will show exactly what was changed:

RF Tab in the Status Page
Transmit power was increased from 8dBm to 9dBm on the 5GHz radio

Summary

As you can see, there’s a lot more to Auto RF than is evident at first glance. Meraki leverages the analytics of the Dashboard and the metadata from millions of access points to create and refine these algorithms so that less time and effort needs to be spent tuning and tweaking configuration during deployment.

Designing Wi-Fi for High Density

In technical interviews, I often ask (and am often asked):
How would you design a Wi-Fi network to support a large room with 1000 devices?

The question is purposely vague to identify how someone thinks through a problem that doesn’t have a single answer, and to observe how thoroughly they respond. Below I’ll take my own stab at a response.

Step 1: Requirements Gathering

Starting off by talking about antenna types or software tuning is the wrong first step, every time. As much information as is provided in the question, it’s never enough. Wi-Fi is a fickle beast, and collecting requirements is certainly the most important step. I would start by asking qualifying questions such as:

  • What types of devices will be associating?
  • What types of applications are we expecting to support, and/or how much bandwidth is needed per client?
  • What is the construction and layout of the room?
  • What are the restrictions on AP location, such as cabling, mounting, or aesthetic requirements, etc.?

Step 2: Hardware

The correct hardware choice is usually determined by the answers to the questions in the previous section. Some environments, such as stadiums, allow for access points to be mounted under seats, where integrated omnidirectional antennas are adequate. In other areas, such as conference centers where chairs and tables may be moved, access points need to be mounting on walls or high ceilings.

Above 25ft, omnidirectional antennas lose a lot of their performance, as most of the attenuation is into space where there are no clients. In these cases, downtilt omni-directional antennas can provide a similar horizontal range, but better propagated toward the floor. In cases where limiting the propagation is desired, semi-directional or directional antennas will limit the horizontal propagation while also improving the vertical reach.

Step 3: Software Tuning

While every environment is different and requires unique exact configurations, a high density environment almost certainly requires a high density of APs, and with that there are a certain set of options that are best practice for almost all such deployments

Data Rates

In a well designed Wi-Fi environment, it’s a best practice to increase the minimum data rate above the default. 12-18Mbps is a common setting, as it prevents 802.11b devices from joining the BSSID and bringing other clients down, and it reduces the airtime required for management frames, leaving more space for meaningful traffic. It can also reduce effective cell sizes by not supporting clients that have too weak an RSSI to transmit at the increased minimum rate. However, caution is needed as setting the minimum data rate too high can lead to high amounts of corruption

Channel Planning

More APs means more chance for co-channel contention, which negatively impacts all clients on that channel. Where possible, enabling the use of 5GHz UNII-2 extended channels allows for more non-overlapping channels, as long as clients support them. On the 2.4GHz spectrum, with only 3 non-overlapping channels available in the US, disabling the 2.4GHz radio on select APs will reduce the number of APs in an area fighting for the same frequency.

In addition to enabling more 5GHz channels, it’s important to reduce the channel width to allow for more channels to be used concurrently. A high density environment configured for 80MHz-wide channels may only have six non-overlapping channels available, while the same environment configured for 20MHz-wide channels will have 25 non-overlapping channels. There is a tradeoff in throughput by reducing channel width, but that’s usually less important than having more channels available.

With a reduced number of 2.4GHz radios compared to 5GHz, band steering can also be effective at encouraging dual-band clients to connect on the 5GHz channels where there is less congestion.

Power Levels

With high client density, access points are generally placed to cover a chosen number of client devices. Because those clients are in a smaller area than lower density deployments, the AP doesn’t need to cover as large a physical area. Lowering the transmit (TX) power of the APs will reduce the cell size, and thus reduce the amount of co-channel contention.

Advanced Options

Some vendor-specific options, such as Cisco’s RX-SOP, can also impact client connectivity and roaming. While RX-SOP is marketed as helping to “reduce cell size ensuring clients are connected using the highest possible data rate”, this is not what it’s designed for, and improperly configuring these options can negatively impact connectivity. RX-SOP is used to lower the possible contention between APs on the same and adjacent channels by reducing the APs “sensitivity” to packets in determining transmit opportunity. When tuned correctly, it can increase the overall available airtime available.

Most vendors offer some type of Radio Resources Management (RRM) capabilities to automatically tune the settings above, to provide features such as: coverage hole detection and correction, dynamic channel assignment, dynamic transmit power control, and client balancing. However, many RRM solutions don’t do a great job of tuning for high-density environments out of the box, and almost always need tweaking and tuning.


As with any Wi-Fi deployment, there is no “one-size fits all” answer. Site surveys, both pre- and post-installation, are vital in ensuring success.

Multicast over Wireless

Multicast has brought a lot of efficiencies to IP networks. But multicast wasn’t designed for wireless, and especially isn’t well suited for high-bandwidth multicast applications like video. I’ll cover the challenges of multicast over wireless and design considerations.

But first, an overview of multicast:

To level set, I’ll briefly cover IP multicast. For the purposes of this article, I’ll focus specifically on Layer 2, If you’re already familiar with multicast over ethernet, feel free to skip this section.

What is multicast?

In short, multicast is a means of sending the same data to multiple recipients at the same time without the source having to generate copies for each recipient. Whereas broadcast traffic is sent to every device whether they want it or not, multicast allows recipients to subscribe to the traffic they want. As a result, efficiency is improved (traffic is only sent once) and overhead is reduced (unintended recipients don’t receive the traffic).

How does it work?

With multicast, the sender doesn’t know who the recipients are, or even how many there are. In order for a given set of traffic to reach its intended recipients, we send traffic to multicast groups. IANA has reserved 224.0.0.0 – 239.255.255.255 for multicast groups, with 239.0.0.0/8 commonly used within private organizations. Traffic is sent with the unicast source IP of the sender, and a destination IP of the chosen multicast group.

On the receiving side, recipients subscribe to a multicast group using Internet Group Management Protocol (IGMP). A station that wishes to join a multicast group sends an IGMP Membership Report / Join message for that given group. Most enterprise switches, WLCs, or APs use IGMP snooping to inspect IGMP packets and populate their multicast table, which matches ports/devices to multicast groups. Then, when a multicast packet is received, the network device can forward that packet to the intended receivers. Network devices that don’t support IGMP snooping will forward the packet the same as it would a broadcast, to every port except the port the packet came in on. Here’s an example of an IGMP Join request:

The problems with multicast over WiFi vs wired

In a switched wired network, all traffic is sent at the same data rate (generally 1Gbps today) and with each port being its own collision domain, collisions are rare. In addition, wired traffic uses a bounded medium, so interference and frame corruption is also rare. Because of this, there is no network impact to sending large amounts of wired traffic as multicast. WiFi does not share either of these characteristics, which makes multicast more complicated. Below are some of the issues with multicast to multicast over WiFi:

  1. Multicast traffic is sent at a mandatory data rate. As mentioned, WiFi clients share a collision domain. Because multicast is a single transmission that must be received by all transmitted receivers, access points are forced to send that frame at the lowest-common-denominator settings, to give the receivers the best chance of hearing the transmission uncorrupted. While this is fine for small broadcast traffic like beacons, it’s unsustainable for high-bandwidth applications.
  2. Low data-rate traffic consumes more air time. Because multicast traffic is sent at a low data rate, it takes longer for each of those transmissions to complete. A 1MB file sent at a data rate of 1 Mbps will take significantly longer than the same file at a data rate of 54Mbps. This means that all other stations must spend more time waiting for their turn to transmit.
  3. Battery-powered clients have reduced battery life. Multicast and broadcast traffic are sent at the DTIM interval, which all stations keep track of. When a multicast frame is sent, all stations must wake up to listen to the frame, and discard it if they don’t need it. This results in battery-powered devices staying awake for a lot longer than needed. If the DTIM interval is too high, the increased latency can impact real-time applications like video. But the lower the DTIM interval, the more often stations need to wake up.
  4. Multicast senders will not resend corrupt frames. Frame corruption and retransmissions are a standard part of any WiFi transaction. Every unicast frame, even if unacknowledged at upper OSI layers such as when using UDP, are acknowledged at Layer 2, and retransmitted by the sending station if necessary. This may not seem like a big deal at first, as unacknowledged traffic on a wired network works fine most of the time. But in an area of interference or poor RSSI level, it’s not unusual to see 10% of wireless frames retransmitted. 10% loss would be considered extremely high on a wired network, and most applications are unable to handle this level of loss.

So how do we fix it?

There’s no silver bullet to “fixing” multicast over wireless, but there are a few ways to design around the shortcomings.

  1. Increasing the minimum data rate. An increase to the minimum data rate means that broadcast and multicast frames must be sent at the higher rate. Unicast traffic is acknowledged at Layer 2, reducing loss experienced by the upper layers. As mentioned earlier, higher data rates reduce the time spent transmitting, and increase throughput for the multicast traffic. It also reduces the amount of time a battery powered device must spend listening to the frames. However, other design and configuration considerations must be made to ensure the wireless network can support this, as changing the minimum data rate can impact roaming, as well as connectivity for low-powered devices.
  2. Multicast-to-Unicast Conversion (M-to-U). Many vendors of wireless APs support multicast-to-unicast conversion, which sends a unicast copy the frame to each intended receiver, using IGMP snooping to determine those stations. This means that the frame can be sent at the receiving station’s best data rate, which should almost always be above the minimum. Several unicast transmissions at 54Mbps would still use less channel time than the same multicast transmission at 1Mbps. In addition, stations which aren’t the intended receivers don’t need to wake up to listen to the frame, reducing their battery consumption.

The pudding

Let’s take a look at the same multicast frame sent with and without Multicast-to-Unicast Conversion. Using iperf2 (since iperf3 doesn’t support multicast), we’ll generate multicast traffic at a rate of 20Mbps from a wired client and send it to a wireless client, using multicast address 239.255.1.2.

Parameters for this test:
Receiver: MacBook Pro (2015 edition). 3 spatial stream 802.11ac airport card.
Access Point: Cisco Meraki MR42E (802.11ac Wave 2, 3×3:3) with omni-directional dipole antennas.

Wired Multicast Source (10.1.1.216):

Mcast Source.png

Wireless Multicast Recipient (M-to-U enabled): 

Mcast MtoU.jpg

Wireless Multicast Recipient (M-to-U disabled):

Mcast No MtoU.jpg

The first thing to notice is the loss rate. With M-to-U enabled, my 20Mbps stream was successfully being transmitted with almost no loss. With M-to-U disabled, throughput was reduce by roughly 95%, with an average of 1Mbps throughput. There are two reasons for this: first, the mandatory data rate used for the multicast transmission was 6Mbps, of which ~40% is attributed to protocol overhead. In addition, with a unicast transmission the AP can buffer frames to a receiver, whereas a multicast transmission is best effort: it has no layer 2 acknowledgement or communication from the receivers. This can be improved with application-level handling, such as the application deciding to transmit at a lower quality, but there are no guarantees that the application is set up to handle that. iperf has no such throttling/accommodation.

To dive in further, let’s take a look at the differences in the frames transmitted:

Frame Capture (M-to-U enabled):

Unicast Multicast.png

Frame Capture (M-to-U disabled):

Demulticast Multicast.png

We can verify that the second frame is using multicast by the MAC address in the Destination Address field, since all multicast MACs begin with 01-00-5E. Notice also that the source address of the unicast frame is set to the MAC address of the access point as the AP had to generate that frame, whereas the multicast frame’s source is that of the sending station since there was no frame modification needed.

Next, we’ll look at the data rate. Multicast is always sent at the basic rate, which was 6 Mbps for this BSSID, and a transmission time of 2072μs. Compared to M-to-U with a data rate of 540Mbps and a transmission time of 46μs. That means that the multicast transmission held the channel 45 times longer than the unicast, and still only sent half as much data.

Also, since multicast must use the lowest-common-denominator parameters, it cannot take advantage of efficiency improvements such as A-MPDU and multiple spatial streams offered by this AP.

So wouldn’t M-to-U be the silver bullet solution?

As is often the case, the answer is “it depends”. In a lab where my 3 spatial stream MacBook Pro can connect at MCS 8, it may appear so. But, if the majority of clients are connected at a low data rate, and the content only consumes a small amount of bandwidth, the overhead caused by retransmitting small frames for a large number of receivers could add delay and consume more aggregate airtime than simply transmitting once at a low data rate.

Deploying Wi-Fi for Location Analytics

Many Wi-Fi vendors on the market now include the capability to leverage access points for location analytics in addition to serving clients. However, deploying location analytics has its own set of requirements, and attempting to simply leverage the same APs for location analytics may have suboptimal results if not planned out correctly. The following sections will detail some of these design considerations to optimize location accuracy and performance.

How do APs determine a device’s location?

Wi-Fi geolocation is done primarily by collecting the RSSI of frames sent from a client seen by multiple access points in an area, then applying trilateration algorithms to that data to approximate the location of a device. This requires careful placement of access points, as well as accurate placement on a floor plan or other location system within the access point controller.

AP Placement Considerations

First and foremost, for trilateration to work properly a client needs to be heard by at least three APs at any given time, and four would be ideal. On the flip side, more than five or six APs could limit the effectiveness by adding unnecessary noise and interference in the environment. A client that is only seen by two APs will be accurate in one dimension (the distance between the APs), but won’t be able to accurately detect the location in the second dimension.

Contrary to designing for coverage, location detection works best when the service area is completely encapsulated by the access points, meaning that APs are placed on the outer edge of the zone where devices will be located.

Because trilateration happens in the latitude and longitude planes and signal strength is used to determine the distance of a client between APs, placing APs in a perfect grid or line actually inhibits the APs from detecting the offset from each other. It’s recommended to place APs in an imperfect shape, which is especially important in long narrow spaces such as corridors or alleys.

Finally, minimize any major line-of-sight obstructions between APs, especially in areas of heavy traffic. Shelves and walls between APs will impact the RSSI received by the AP, which will place the client further away from the AP than it actually is.

Factors Impacting Accuracy

Traffic Frequency

It’s important to note that the accuracy will be limited by how often the access points see frames from a client. For a mobile phone with the screen turned off, such as in someone’s pocket, APs will rely on the periodic probes that a device will send out, which may be as few as a couple of times per minute, meaning our location detection will only be current to the last probe.

Trilateration Frequency

Because it can take a lot of processing power to constantly detect and triangulate a large number of clients, many Wi-Fi vendors will aggregate the received data and perform the trilateration at regular intervals, such as once per minute. It’s important to review the vendor’s documentation and set expectations accordingly.

MAC Randomization

Both iOS and Android support MAC randomization, which masks the device’s true MAC address in many management frames. This can make triangulating a device, or keeping track of subsequent visits, significantly more difficult. iOS has this feature enabled by default, whereas most Android phones default to disabled. There are ways to de-anonymize these devices, but it’s usually more hassle than it’s worth. The easiest way to overcome MAC randomization is to encourage devices to join the Wi-Fi network, as the real MAC address is used for association.

Beyond Wi-Fi

Because most Wi-Fi clients are mobile and probe frequency is sparse, sub-meter accuracy will be difficult-to-impossible to achieve. Other technologies, such as BLE, RFID, and RTLS may be used in place of, or in addition to, relying on Wi-Fi for location analytics. Some vendors, such as Meraki, include BLE scanning radios in their access points. While BLE can be more accurate than Wi-Fi, a larger percentage of devices are either not BLE-enabled, or users are disabling the BLE radio in their client device.

Wifi and Meraki Widgets for Mac and Windows

I recently decided to try to learn how to write python a little bit. I’m still not very good at it, however I did create something recently that I feel should be shared! Meraki local status pages can provide some very useful information for troubleshooting, however having to browse to ap.meraki.com/switch.meraki.com/wired.meraki.com is not always desirable, nor does it update quickly if you are walking around troubleshooting and connecting to different devices. So I figured hey, lets create a widget or skin for some common overlay tools out there (Ubersicht for Mac, and Rainmeter for Windows) and try and populate some useful information. So without further droning on, I want to introduce the tools I created! 

Meraki Skin for Rainmeter (Windows)

This skin requires the use of Rainmeter for Windows. For those not familiar, Rainmeter is a free tool that allows you to do anything from display useful data about your computer, to writing entire user interfaces to perform just about any function.
My rainmeter skin combines a number of bits of data, from the wifi stats out of netsh, IP info from netsh, and a bunch of data points from the first MR, MS, and/or MX that you are connected behind. There is also a hard requirement for Python3 to be installed in your PATH. This is so rainmeter can execute the script associated without having to derive the correct path based on installation. 
For more information please check out the github repo at: Meraki Rainmeter

Meraki Widget for Ubersicht (Mac)

Colorized based on connection quality

This widget requires the use of Ubersicht as a widget overlay tool. For those not familiar, Ubersicht is extremely lightweight and has a number of really cool widgets you can install. 

This widget has a hard requirement for python3 to be installed as well. One of the neat features of Uebersicht is I was able to color code some of the values for RSSI, Noise Floor, and if connected to an MR, the SNR from an AP perspective. These will change colors based on connectivity from green to yellow, to red. (Special thanks to Nathan Wiens @nwiens for helping me with the HTML)

To install please either browse to the Ubersicht widgets repo 
Or to my GitHub Repo

As always, thanks for reading, and if you have any feedback please leave it in the comments section below. Thanks!

Deconstructing the RADIUS CoA process

If you need to brush up on the RADIUS process, please read my previous post:
Following the 802.1X AAA process with Packet Captures

Everyone talks about it, yet I rarely meet folks that really understand what CoA (Change of Authorization) means for RADIUS authentication and client access. I recently spent a few hours troubleshooting RADIUS CoA and figure since it is fresh in my mind maybe I can share and hopefully help others out in the field.

So….

In Summary: RADIUS Change of Authorization (RFC 3576 & RFC 5176) Allows a RADIUS server to send unsolicited messages to the Network Access Server (aka Network Access Device/Authenticator in Cisco terminology e.g. AP/WLC/Switch/Firewall) to change the connected client’s authorized state. This could mean anything from disconnecting the client, to sending different attribute value pairs to the Authenticator to change the device’s VLAN/ACL and more. It is fairly robust in what it can do so I may not go too deep as I want this to be consumable.

What RADIUS CoA is NOT: Magic!

I will be walking through CoA Use Cases, what CoA looks like from a PCAP perspective,  and how to gather data for troubleshooting.


RADIUS CoA Typical Use Cases:

Central captive portal (Open SSID with MAC filtering) – Especially with Cisco ISE, RADIUS CoA is the core feature set required for the captive portal. In the example below, we are redirecting a client to a splash page for either Authentication or Acceptable Use Policy review. As you can see below we have a pretty simple process.

  1. The client connects to the network (wired/wireless)
  2. Client MAC address is sent to RADIUS server as a username and password (Access-Request)
  3. RADIUS server responds with an Access-Accept and a URL redirect. (could also include a VLAN assignment)
  4. The client is redirected to the splash portal
  5. User logs in using the credentials required
  6. RADIUS server then sends a CoA with a request to reauthenticate
  7. Authenticator (AP/Switch/WLC) sends a CoA-ACK
  8. Authenticator sends an Access_Request with existing Session-Id and authentication data.
  9. RADIUS server then responds back with Access-Accept and any extra functions e.g. a Filter-ID for group policy assignment in Meraki Wireless.
  10. YAY INTERNET!

Wireless and Wired CoA-Reauthenticate Process

Screen Shot 2018-01-16 at 2.17.07 PM

The above process is also used for secure device registration and URL redirects for blacklisting etc. but would involve a complete client authentication/reauthentication via EAP instead of MAC authentication. For an example check the shared captures labeled 1-of-2 and 2-of-2. These contain the EAPoL side and the RADIUS side.

Client Posturing – In some cases you may want to perform posturing on the end client. This, more often than not, requires a client on the end machine, whether it is a dissolvable agent with java, or a thick client like Cisco AnyConnect. The whole goal of posturing is making sure the clients that have access to your internal resources are properly secured from threats. A common scenario is a user removing or disabling Anti-Virus. When this event occurs it may be desired to limit that client’s access to the network until AV is reinstalled or enabled. This could be done through an ACL or VLAN change.

One of the difficult situations that arises when changing VLANs is the client may not release their IP address. In 802.11 this is easily handled by sending a disconnect-request instead of reauthentication. In wired authentication scenarios this is not typically recommended as it requires a port bounce and can take some tweaking to make work well, if at all. Instead of a VLAN change it is recommended to perform ACL changes to wired clients. On a catalyst switch this could be a dACL (Downloadable ACL) for instance.

Dynamic Network Restrictions – Closely following the use case above, a client’s access may need to be dynamically changed if they are not adhering to the network policy. Using products such as Cisco’s Stealthwatch in tandem with Cisco ISE, we could monitor a client for data dumping thresholds and change the VLAN/ACL applied to them or shut down the port to minimize the impact. This is just one example of the many possibilities.

Wireless Disconnect-Request Flow

Screen Shot 2018-01-17 at 11.28.18 AM


Now on to the Fun stuff….

To capture CoA packets:

The CoA packets are only seen between the authenticator and the authentication server. Therefore we need to capture between the authenticator and the authentication server as depicted below.

Screen Shot 2018-01-16 at 11.15.39 AM

In most environments this consists of using a SPAN/RSPAN port to capture traffic. Some vendors do provide the ability to perform tcpdumps/pcaps which can be a little easier, especially if you are offsite. For capture applications I tend to lean towards using wireshark as it is free and powerful. To download please go to Wireshark.org.

CoA Messages are sent on two different udp ports depending on the platform. Cisco standardizes on UDP port 1700, while the actual RFC calls out using UDP port 3799. These messages are all included in the “radius” wireshark filter.

Just in case you don’t have a test network please feel free to use the pcaps in this share:

CoA PCAP Examples


RADIUS CoA Packet Types

There are two different RADIUS CoA packets that are sent from the RADIUS Server (Authentication Server):

  • Disconnect-Request – Requests to terminate the session of the client.
  • CoA-Request – Requests to do a number of things from reauthenticate to port-bounce, shutdown, and more.

And there are four that are sent from the NAS/NAD/Authenticator:

  • Disconnect-ACK – Acknowledgment of successful disconnect
  • Disconnect-NAK – Failed session disconnect
  • CoA-ACK – Acknowledgment of successful CoA action
  • CoA-NAK – Failed CoA action

RADIUS Server Sourced Packets

In this section we will review the two CoA messages that are sent from the RADIUS server and the useful material in the packet.

Disconnect-Request Message

Wireshark Filter: radius.code == 40

This packet is sent from the RADIUS server and is used to simply disconnect the client from the current session. This also typically involves an immediate re-authentication by the client. Disconnect-Requests can/should be used in 802.11 situations where a VLAN change needs to occur. If we simply used a CoA-Request (as we’ll see later), the client may be changed to a new VLAN while keeping the IP address it obtained from the former VLAN, clearly causing problems.

Screen Shot 2018-01-12 at 10.29.29 AM

A few useful attributes in this message are:

Account-Terminate-Cause 

The account terminate cause will let you know the reason for the request. This can vary but typically is classified as an Admin-Reset from a Cisco ISE Perspective.

Screen Shot 2018-01-12 at 10.31.56 AM

Audit-Session-ID & Calling-Station-ID

These fields can be used to filter information from your RADIUS server regarding the client MAC address (Calling-Station-Id) and session ID. So when you need to hunt down a particular failure in a log, you can correlate the logs via these two attributes.

Screen Shot 2018-01-12 at 10.33.09 AM

NAS Response Link

Wireshark helpfully gives a link to the frame that is the NAS response to the RADIUS server. This Disconnect-Ack packet will be reviewed in the authenticator sourced packets section later in this post.

Screen Shot 2018-01-12 at 11.02.31 AM

 

CoA Request

Wireshark Filter: radius.code == 43

Unlike the Disconnect-Request above, a CoA-Request can contain a number of actions. This can include anything from reauthentication to bouncing or shutting down a port. A lot of these can be vendor specific responses. So in this instance I am going to use the Cisco ISE CoA Request info. One thing to note is useful attributes are also still the Audit-Session-Id, Calling-Station-ID, and the Response Link as well as the attributes below.

Screen Shot 2018-01-12 at 4.49.13 PM

Useful Info:

Cisco-AVPair: subscriber:command = XXXXXX

This is where we are able to request that the authenticator perform a function. With Cisco ISE it is rolled into a Cisco-AVPair: subscriber:command.

For instance:

  • subscriber:command=reauthenticate

This request will cause a reauthentication either for the client via EAP, or the authenticator may send the MAC address and session ID again in the event that it is a MAC authenticated session.

Screen Shot 2018-01-17 at 2.29.52 PM

 

  • subscriber:command=bounce-host-port

This is a wired only CoA Request. A request to bounce the host port will end up with a link-down link-up event on the switchport. This can be useful for trying to move a client to a new VLAN if possible. This is not something I recommend defaulting to for guest portals however as it can take some tweaking to the core CoA configurations. In ISE this would involve rewriting the Network Device Profile and CoA ReAuth requests to include a port-bounce, which I do not believe is a recommended practice.

Screen Shot 2018-01-17 at 2.37.52 PM

 

  • subscriber:command=disable-host-port

This is another wired only CoA request. This will disable the switchport if the switch supports it. I have seen cases where the end switch may not support a port shutdown and will bounce the port instead. This is not a recommended CoA request for most situations as it takes manual intervention to resolve. Instead a VLAN or ACL change is far more effective, even if the VLAN doesn’t exist (blackhole).

Screen Shot 2018-01-17 at 3.41.15 PM


Authenticator Sourced Packets

Now we will review the packets that are sent in response to the CoA or disconnect request from the server. These are fairly simple and usually only include an ACK for pass or NAK for failure.

Disconnect-ACK

Wireshark Filter: radius.code == 41

This is an acknowledgment of a successful disconnect-request instruction from the authenticator to the RADIUS server. This packet can contain attributes such as the session that was disconnected, calling-station-id, or just simply the Message-Authenticator.

Screen Shot 2018-01-17 at 4.18.15 PM

Disconnect-NAK

Wireshark Filter: radius.code == 42

This is an acknowledgement of a failed disconnect-request. This might happen if the client is already disconnected, or if the session has ended prior to the disconnect request. In the example screenshot we can see a bit of useful information in the error cause attribute.

Screen Shot 2018-01-17 at 4.15.00 PM

CoA-ACK

Wireshark Filter: radius.code == 44

As with the Disconnect-ACK, the CoA-ACK is just an acknowledgement of the success of the CoA requested action. This packet can contain attributes such as the session that was disconnected, calling-station-id, or just simply the Message-Authenticator.

Screen Shot 2018-01-17 at 4.17.05 PM

CoA-NAK

Wireshark Filter: radius.code == 45

Once again just like the Disconnect-NAK, the CoA-NAK is an acknowledgement of a failed CoA action. This could be due to lack of support or the session has ended prior to the CoA-Request. Just like the Disconnect-NAK we get a nice Error-Cause for further troubleshooting.

Screen Shot 2018-01-17 at 4.15.38 PM

 

In closing

One thing to remember is CoA can be used to create some very complex if-this-then-that type scenarios. In the end however it is not a complex feature and definitely not magic! I hope this post was informative for you. If you find anything incorrect please let me know. Thanks and good luck!

Using editcap to prune a packet capture

This is just a quick one. I recently needed to filter out a couple packet captures of unneeded frames/packets for some training material. I had unfortunately captured a crap ton of data though and really didn’t want to post the whole 20M pcap file. I ran across the wireshark function called “editcap”. There is a lot that you can do with editcap as you can see in the following link:

Editing trace files with Editcap

However all I needed to do was remove the frames before and after the area of interest. To do this on a macbook:

editcap -r {source file} {destination file} {packets/frames you want to keep} 

e.g.

editcap -r wirelesseapol.pcap wirelesseapolfiltered.pcap 790-900

Screen Shot 2017-07-08 at 12.26.22 PM

This will take the wirelesseapol.pcap, remove frames 1-789 and 901 – end, leaving me 790-900 in a new packet captures called wirelesseapolfiltered.pcap.

You can also do this in wireshark using the export specified packets feature. However CLI can be fun.

Following the 802.1X AAA process with Packet Captures

EDIT: After chatting with David Westcott (@davidwestcott) I have made a few additions to this post. He has graciously asked that I add a little more details including the packet captures so everyone can follow along. This was a great idea, so please enjoy!

802.1X is typically the first step in one of the more advanced security implementations you will have to dip your toes into when moving your network to a secure state. That being said a lot of times we as engineers get stuck in a state of understanding enough to be dangerous and not enough to be highly successful and more importantly, capable. I personally learned a lot about dot1x via trial and error through implementations in the past as well as lab time at home and during my CCIE studies.

So the question is, how do we become masters of a protocol that is literally quite capable of being the success or demise of a network’s security and operation? 

…By taking advantage of the resources that are out there as well as practical labs. One thing to keep in mind is that while every vendor of a dot1x solution (Cisco, Aruba, FreeRADIUS, Microsoft) has a certain way of going about authentication, they all fall back to the same protocol: 802.1X/RADIUS. As long as you can understand that protocol to a fairly deep level you can troubleshoot any RADIUS environment.

A few great resources I have used and/or reviewed are:

Lab Minutes

Cisco ISE Communities Aggregate Post

Meraki Common RADIUS Error Codes with Microsoft NPS

Packet6 FreeRADIUS Install Guide

Katherine McNamara’s Network Node Blog

Now that being said I also recommend learning about not only how to debug the RADIUS servers via logs and troubleshooting tools, but being able to understand what it looks like from a packet capture perspective both wired and wireless and all the information you can glean from a pcap.

Overview of RADIUS/Dot1x Process

EAP &amp; Radius Process

The above diagram is utilizing EAP-PEAP, this may look a little different depending on the Auth method.

Getting Started with Captures:

There are technically 2 points that you can grab packet captures, both of which will provide relevant information for troubleshooting. If you capture at Point 1 in the diagram below, the frame exchanges will be using EAPoL as a protocol. If you capture at Point 2 we will be seeing the traffic as RADIUS messages. That being said lets dive in!

dot1x packet Captures

Example Environment Details:

In this write up I used a Nexus 6P, Cisco Meraki MR33, and a Cisco ISE server. Included below are links to download the .pcap files.

Wireless EAPoL and Wired RADIUS packet capture files

Please download the files in the link above if you would like to follow along and review.

Wireless EAPoL capture clients:

STA (Supplicant) = 00:9a:cd:b7:c9:f0

AP (Authenticator) = e2:55:2d:f2:d1:54

Wired Radius Capture Clients:

AP (Authenticator) = 10.10.0.102

RADIUS Server (Authentication Server) = 10.10.1.37

Point 1: Supplicant to Authenticator (EAPoL)

You can capture between the client device (supplicant) and the access device (authenticator). At this point in the network you are looking at EAPoL messages. This is the communication method utilized that provides the Authenticator and the Client a line of communication prior to network access.

This is what the capture will look like:

Screen Shot 2017-07-08 at 11.22.32 AM

To perform this capture from a wireless perspective you will need either an access point capable of dumping/monitoring traffic, or a client that is capable of turning its adapter into monitor mode. Most of the time for wireless captures I use a Macbook Pro and Adrian’s Airtool App. This way is fairly flexible and quick.

If you are capturing for a wired authentication, you will want to perform a port SPAN or mirror to receive the needed frames.

Once you have performed the frame capture, you will want to open it in Wireshark and apply a few filter/s to get only the most useful info.

Example:

Screen Shot 2017-07-08 at 11.24.08 AM

In this case we combined 3 filters by the use of the &&. First we want to make sure we only see the eapol messages. We then want to filter on the Access point or switch as well as the client. This can be done in the case of wireless with wlan.addr == {mac address}. Quite often you will see wlan.ta/ra/dst etc. however in this case we want to simply isolate the mac addresses, not which position they are in the frame.

For wired we would still use the filter “eapol” and instead of wlan.addr we would use eth.addr.

The EAPoL portion of communication will vary depending on the authentication type. In my examples, we are using EAP-PEAP w/EAP-MsCHAPv2. This is a fairly standard form of authentication, and from a .pcap would resemble closely to EAP-TTLS w/PAP.

The useful portions that can usually be derived from a pcap are:

EAP-Identity Response:

Screen Shot 2017-07-08 at 11.18.59 AM

In this frame (frame #29 in the .pcap) you can see the Client’s (Supplicant) Identity being used of “employeealex“. This can be extremely useful when trying to determine if the supplicant is going to authenticate as the user or the machine account as well as what the user could be typing into the username prompt.

EAP Auth Method Negotiation and Credential Exchange:

Screen Shot 2017-07-08 at 11.52.33 AM.png

The first message in this clip is the server’s proposal of EAP-TLS (frame #32 in the .pcap), then the client’s response with, “Hey what about EAP-PEAP?” In some situations, depending on the RADIUS server configuration, the client may try to propose a method that is not permitted or supported by the server. This is where you would see that negotiation fail, and ultimately an Access-Reject/EAP-Failure.

However in this capture you can see the client and server negotiate EAP-PEAP. Once that is completed, the Server will present the client with its certificate. If the client does not trust the certificate from the server, and the user does not accept the certificate, the exchange will fail after the first frame or two of the handshake.

In this situation however the client trusts the server certificate and the two endpoints secure the medium with a TLS tunnel. Once secured you should notice that the protocol becomes purely TLS and since the traffic is encrypted, we can only see that the frames are “Application Data”. This is the point at which the client and server are exchanging inner authentication data such as EAP-MsCHAPv2 or EAP-TLS.

EAP Success(wired and wireless) and 4 Way Handshake (when the client is wireless):

Screen Shot 2017-07-08 at 11.20.25 AM

Once the client has been successfully authenticated and authorized, there is an EAP Success message sent back (frame #78) to signify the end of the process. If this is a wired client the process is over and the client is able to start transmitting and receiving data frames. If this is a wireless client, the station will utilize a few EAP attributes and the AP will utilize two MPPE key attributes in the Radius Access-Accept response to perform the 4 way handshake and create the encryption keys for secure communication (Frames 80, 82, 85, and 87).

Point 2: Authenticator to Authentication Server (RADIUS)

In the previous section we were only able to see the authentication process and not the authorization or accounting as that communication is only between the authenticator and authentication server. Luckily you are also able to capture traffic from the Access/Authenticator Device (i.e. switch/WLC/AP) directly to the Authentication Server (RADIUS server). I HIGHLY recommend capturing this traffic at the same time as your client EAPoL capture so you can reference/correlate the same data. At this point in the process you will be looking at RADIUS messages. These messages are typically not encrypted unless using AES-KeyWrap or RADSEC which is not common today unless in a highly secured environment or public environment where traffic is inherently insecure.

This is what the capture will look like:

Screen Shot 2017-07-08 at 11.32.10 AM

To perform this capture, we would want to span the traffic from the Authenticator or from the RADIUS server. The choice is yours. Once we have spanned the port we would perform a capture as usual and open in wireshark.

There are a few very useful filters in this capture as well that we will want to use to weed out the undesirable packets.

Example:

Screen Shot 2017-07-08 at 11.33.12 AM

As you can see we are not getting too complicated with our filters. We just want to combine filters once again with the && delimiter. Then we also want to filter on radius as the protocol, and the ip addresses of the Authenticator and Authentication server with ip.addr == {ip address}.

As you can see from the capture example above, the communication between the access device and RADIUS server are comprised of a ton of Access-Requests with Access-Challenges as the response from the RADIUS server. One thing to note is that all of the communication between the client and server during authentication is encapsulated in a RADIUS request. This can be extremely helpful in following how the Authenticator translates the information to the RADIUS server, as well as all the extra information that is appended. Now on that note….

The useful portions of the RADIUS exchanges are:

The initial Access-Request packet:

Screen Shot 2017-07-08 at 11.36.56 AM

In this packet (in the capture #31) we can see a ton of useful information in the form of Attribute Value Pairs. Here we can see for instance the username that is being sent as well as the connectivity information. When troubleshooting RADIUS environments it is a good idea to verify the NAS-Port-TypeService-Type, and Called-Station-ID. If you have watched any of my videos that I have created on how to configure Cisco ISE you will know we use these three attributes quite frequently for granular authorization policies.

One other piece of information that is extremely useful for troubleshooting is the NAS-IP-Address. This information is typically what is used to differentiate RADIUS clients in the RADIUS server. In my video on Network Devices and Groups you can see where in Cisco ISE we configure the Authenticator devices via their IP address. This information does need to match the NAS-IP-Address. If it does not the radius server will most likely not respond as the RADIUS client cannot be found.

The Final Access-Accept Packet:

Screen Shot 2017-07-08 at 11.38.05 AM

This packet (#55) could also be the Access-Reject packet if the authentication failed. In this instance however we passed authentication. As part of the authorization policies I even returned a Filter-ID that specifies a Group Policy in the Meraki Dashboard only giving me BYOD-Access. There can be many AVPs included depending on the platform including Airespace ACLs, AVC Rulesets, QoS parameters, VLAN assignment, User Roles, and much much more.

For those that are not familiar, in a Cisco Meraki wireless environment we can use Group Policies to specify different access through L3-L7 firewall rules as well as perform traffic shaping, policing, and QoS remarking on specific application traffic. 

RADIUS Accounting Messages:

Screen Shot 2017-07-08 at 11.39.20 AM

These messages (#56 & 57) provide the RADIUS server with connectivity information for authenticated and authorized clients. Notice the UDP ports change from 1812 to 1813 when the communication changes to Accounting. These are not mandatory for 802.1X however are useful in providing current connectivity stats to the RADIUS server. These will also show when a client disconnects with an Accounting-Stop message. As you can see in the above screenshot the message also include Event-Timestamps, and Session-IDs on top of the other connectivity information. You can also find the client IP address information under the Framed-IP-Address field.

In closing…

I hope this breakdown of the process was informative. It took me a while to honestly get to the point of being able to grab captures and decipher what they mean. If all of this is above your head just remember practice makes perfect. I recommend if you would like to get deeper into RADIUS to install something like FreeRADIUS and play around. Hands on experience beats any other form of study.

Thank you for reading!

Single SSID BYOD Onboarding

**This video builds on top of the previous video of BYOD with Device Registration and Native Supplicant Provisioning. So please be sure to watch it for configuring the certificate templates and some of the SSID configuration. **

In this video we configure ISE and wireless with a single SSID for WPA2-Enterprise to perform device registration and EAP-TLS provisioning.