
Introducing Multi-User Organizations: Share An Account Without Sharing A Login


An enterprise needs security and controls around access.

Your web developer needs to update your website’s logo and make sure it’s live immediately, but doesn’t need access to your SSL keys. Your sysadmin manages your DNS, but doesn’t need to see your visitor traffic. Your marketing team needs to see traffic, but shouldn’t have access to your WAF.

Today CloudFlare is introducing new Multi-User functionality so that many members of a team can work together to manage one CloudFlare account, each with different levels of access.

The Super Admin, and Role-Based Permissions

CloudFlare Multi-User accounts are hierarchical, with the root privileges given to the account’s Super Administrator. The Super Administrator can add or delete users in the organization, change the permissions given to each user, and see and edit all CloudFlare settings. If there is more than one Super Administrator, the Super Administrators can remove each other, which is good practice when an employee leaves the company or switches jobs.

When a user joins a multi-user organization on CloudFlare, they can only see and access the settings that a Super Admin has delegated to them. For example, a user added to the organization as a DNS Administrator would only be able to access the DNS app:

A user added to the organization as an Analytics Administrator would only be able to access analytics, but not see or change other settings:

Roles can be mixed and matched so that a team can customize the permissions for their needs.

Here is what a user with both the DNS and Analytics roles sees:

Multi-User and Two-Factor Authentication

Passwords are known to be weak (try yours here: how quickly does it break?), but stronger methods of authentication, such as two-factor authentication (2FA), typically don’t work when multiple people need to share access to a single account.

Over the past year, the world has witnessed many prominent corporate and government accounts compromised as a result of not using two-factor authentication. We’ve supported two-factor authentication for years, and we encourage everyone to turn it on. With Multi-User, any size organization can use 2FA for account security.

Beyond 2FA, Multi-User also provides each user with their own API key so that independent keys can easily be revoked and reissued.
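
For example, a script run by one team member can authenticate to the CloudFlare API with that member’s own credentials. Here is a minimal, hypothetical sketch in Go; the email address, key, and zone-listing call are placeholders meant only to illustrate a per-user authenticated request:

package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "net/http"
)

func main() {
    // Each member of the organization authenticates with their own email and
    // API key, so a single key can be revoked without affecting anyone else.
    req, err := http.NewRequest("GET", "https://api.cloudflare.com/client/v4/zones", nil)
    if err != nil {
        log.Fatal(err)
    }
    req.Header.Set("X-Auth-Email", "dns-admin@example.com") // placeholder user
    req.Header.Set("X-Auth-Key", "that-users-own-api-key")  // placeholder key

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    body, _ := ioutil.ReadAll(resp.Body)
    fmt.Println(string(body))
}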

Multi-User Rollout

Multi-User is an Enterprise-only feature and is already in use by large multinational organizations and governments. Starting today, Multi-User is available for all CloudFlare Enterprise customers. If you are an Enterprise customer who would like to have Multi-User enabled, contact your account manager. Not yet a customer? Contact our sales team.

This new functionality is only possible on the new CloudFlare dashboard that we are in the process of rolling out to all users this week.


An introduction to JavaScript-based DDoS


CloudFlare protects millions of websites from online threats. One of the oldest and most pervasive attacks launched against websites is the Distributed Denial of Service (DDoS) attack. In a typical DDoS attack, an attacker causes a large number of computers to send data to a server, overwhelming its capacity and preventing legitimate users from accessing it.

In recent years, DDoS techniques have become more diversified: attackers are tricking unsuspecting computers into participating in attacks in new and interesting ways. Last year, we saw what was likely the largest attack in history (>400Gbps) performed using NTP reflection. In this attack, the unsuspecting participants were misconfigured NTP servers worldwide. This year, we’re seeing a disturbing new trend: attackers are using malicious JavaScript to trick unsuspecting web users into participating in DDoS attacks.

The total damage that can be caused by a NTP or DNS reflection attack is limited by the number of vulnerable servers. Over time, this number decreases as networks patch their servers, and the maximum size of the attack is capped at the outbound capacity of all the vulnerable servers. For JavaScript-based DDoS, any computer with a browser can be enrolled in the attack, making the potential attack volume nearly unlimited.

In this blog post, we’ll go over how attackers have been using malicious sites, server hijacking, and man-in-the-middle attacks to launch DDoS attacks. We’ll also describe how to protect your site from being used in these attacks by using HTTPS and an upcoming web technology called Subresource Integrity (SRI).

How JavaScript DDoS Works

Most of the interactivity in modern websites comes from JavaScript. Sites include interactive elements by adding JavaScript directly into HTML, or by loading JavaScript from a remote location via the <script src=""> HTML element. Browsers fetch the code pointed to by src and run it in the context of the website.

The fundamental concept that fueled the Web 2.0 boom of the mid-2000s was the ability for sites to load content asynchronously from JavaScript. Web pages became more interactive once new content could be loaded without having to follow links or load new pages. While the ability to make HTTP(S) requests from JavaScript can be used to make websites more fun to use, it can also be used to turn the browser into a weapon.

For example, the following (slightly modified) script was found to be sending floods of requests to a victim website:

function imgflood() {
  var TARGET = 'victim-website.com'
  var URI = '/index.php?'
  var pic = new Image()
  var rand = Math.floor(Math.random() * 1000)
  pic.src = 'http://'+TARGET+URI+rand+'=val'
}
setInterval(imgflood, 10)

This script creates an image tag on the page 100 times per second. That image points to “victim-website.com” with randomized query parameters. Every visitor to a site that contains this script becomes an unwitting participant in a DDoS attack against “victim-website.com”. The messages sent by the browser are valid HTTP requests, making this a Layer 7 attack. Such attacks can be more dangerous than network-based attacks like NTP and DNS reflection. Rather than just “clogging up the pipes” with a lot of data, Layer 7 attacks cause the web server and backend to do work, overloading the website’s resources and causing it to be unresponsive.


If an attacker sets up a site with this JavaScript embedded in the page, site visitors become DDoS participants. The higher-traffic the site, the bigger the DDoS. Since purpose-built attack sites typically don’t have many visitors, the attack volume is typically low. Performing a truly massive DDoS attack with this technique requires some more creativity.

Shared JavaScript Compromise

Many websites are built using a common set of JavaScript libraries. In order to save bandwidth and improve performance, many sites end up using JavaScript libraries hosted by a third party. The most popular JavaScript library on the Web is jQuery, with around 30% of all websites using some version of it as of 2014. Other scripts commonly included on websites are the Facebook SDK and Google Analytics.

If a website has a script tag that points to a third-party hosted JavaScript file, all visitors to that site will download the JavaScript and execute it. If an attacker is able to compromise a server that is hosting a popular JavaScript file and add DDoS code to it, the visitors of all the sites that reference that script become part of the DDoS.


In September 2014, RiskIQ reported that jQuery.com, which hosts a very popular JavaScript library, had its website compromised; the hosted library could easily have been replaced with a malicious one. The threat of attackers injecting malicious JavaScript into millions of sites is no longer theoretical.

An Aside: Introducing Subresource Integrity

The problem of third party assets being compromised is an old one. There are no mechanisms in HTTP to allow a website to block a script from running if it has been tampered with. To solve this problem, the W3C has proposed a new feature called Subresource Integrity (SRI). This feature allows a website to tell the browser to only run a script if it matches what the site expects.

Take the following script tag:
<script src="https://code.jquery.com/jquery-1.10.2.min.js">

The browser will download the .js file and run it no matter what the content of the file is. If someone were to compromise the hosting site’s servers and replace the file with a malicious script, the browser would run it without question.

With SRI, you can tell the browser to not run the script if it does not match what you expect. This is done using a cryptographic hash. A cryptographic hash is a way to uniquely identify a piece of data. It’s like a fingerprint: in practice, no two different files have the same hash. With SRI, you can include a hash of the authentic version of the script using an attribute called “integrity”. After downloading the script, the browser will compute its hash and compare it with the expected hash from the script tag. If they don’t match, then the script has been tampered with and the browser will not use it.

<script src="https://code.jquery.com/jquery-1.10.2.min.js"
        integrity="sha256-C6CB9UYIS9UJeqinPHWTHVqh/E1uhG5Twh+Y5qFQmYg="
        crossorigin="anonymous">

Adding this tag protects your site visitors from a compromise in the third party JavaScript host. Computing the tag is a simple process that only has to be done once. There’s even a service that can compute the hash for you.
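
If you would rather compute the value yourself, here is a rough sketch in Go. It reads a local, known-good copy of the script (the path passed on the command line) and prints the string to place in the integrity attribute:

package main

import (
    "crypto/sha256"
    "encoding/base64"
    "fmt"
    "io/ioutil"
    "os"
)

func main() {
    // Read the local, known-good copy of the script you want to pin.
    data, err := ioutil.ReadFile(os.Args[1])
    if err != nil {
        panic(err)
    }
    // The integrity attribute is "<hash algorithm>-<base64-encoded digest>".
    sum := sha256.Sum256(data)
    fmt.Printf("sha256-%s\n", base64.StdEncoding.EncodeToString(sum[:]))
}

Run it against your copy of the script and paste the output straight into the integrity attribute of the script tag.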

UPDATE: The crossorigin attribute and Cross-Origin Resource Sharing (CORS) headers are required for scripts with SRI in order to ensure the proper enforcement of the browser same-origin policy and prevent Cross-Site Scripting (XSS) attacks.

This feature is not currently supported by many browsers, but it is under development for Chrome and Firefox.

Server compromises are often detected and fixed quickly, so attackers have turned to other ways to insert malicious JavaScript into websites. The newest way they are doing this is a technique we’ve discussed previously: a man-in-the-middle attack.

Man-in-the-middle

Websites get to your browser from a web server by traversing the networks of the Internet and hopping from machine to machine. Any one of the machines that sits between your browser and the server is able to modify the data in any manner, including changing the content of HTML or JavaScript. If the computer sitting in the middle of a communication does something malicious, like adding bad JavaScript to a webpage, this is called a man-in-the-middle attack.

Modifying websites in transit is a common monetization technique for ISPs and WiFi providers. This is how some hotel networks, cellular networks, and others inject ads and tracking cookies into websites. Self-respecting businesses typically don’t inject attack code into websites, but as part of the Internet they have the capability of doing so.

If an attacker gains a privileged network position similar to an ISP —like a network interconnect or peering exchange— they can also inject JavaScript into websites that pass through their network. If the injected JavaScript contains a DDoS script, all the visitors to the website become DDoS participants. This can happen to any website or web asset that passes through the rogue network.

To make things worse, if the path to a popular JavaScript file happens to go through the attacker’s network, the number of browsers participating in the attack can increase dramatically.


The technique that fully stops this sort of code injection is encryption. With HTTPS, all communication between a browser and a web server is encrypted and authenticated, preventing intermediate parties from modifying it. Making your site HTTPS-only prevents your site from being modified in transit. This not only prevents ISPs and WiFi providers from injecting ads and tracking cookies, but it prevents your site from being used in a JavaScript DDoS.
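
If you operate your own origin, going HTTPS-only can be as simple as refusing to serve anything over plain HTTP. Here is a minimal sketch in Go; cert.pem, key.pem, and ./public are placeholders for your own certificate, private key, and content:

package main

import (
    "log"
    "net/http"
)

func main() {
    // Answer every plain-HTTP request with a redirect to the HTTPS version.
    go func() {
        redirect := func(w http.ResponseWriter, r *http.Request) {
            http.Redirect(w, r, "https://"+r.Host+r.RequestURI, http.StatusMovedPermanently)
        }
        log.Fatal(http.ListenAndServe(":80", http.HandlerFunc(redirect)))
    }()

    // Serve the real site over TLS only, so scripts can't be modified in transit.
    http.Handle("/", http.FileServer(http.Dir("./public")))
    log.Fatal(http.ListenAndServeTLS(":443", "cert.pem", "key.pem", nil))
}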


Conclusion

JavaScript-based DDoS attacks are a growing problem on the Internet. CloudFlare sees, and routinely blocks, these attacks for the millions of websites that use our service, and we learn from every attack we see. JavaScript-based DDoS can be launched at any time, so prevent your site from being part of the problem by going HTTPS-only. CloudFlare provides HTTPS and the ability to go HTTPS-only to all customers including those on the free plan.

Redesigning CloudFlare


CloudFlare’s original interface grew at an amazing speed. Visually, it hadn't changed much since CloudFlare’s launch in 2010. After several years of new features, settings, and ancillary UIs buried beneath clicks, it became clear that the user experience was lacking and would only get worse as we continued to add features. The question became: How could we make a UI that was versatile, scalable, and consistent?

If you haven’t yet, make sure you read Matthew’s post about the philosophy behind our new interface. This post will go into the details and the thought process behind designing our new dashboard.

Why a redesign?

We needed versatility for a growing variety of users and devices

As CloudFlare has grown, we now have a large variety of customers spanning four very different plan levels. We needed an interface that would work well for everyone, from the casual owner of a single blog, to an agency managing many client sites, to enterprise customers that demand ultimate control. Also, the rise of responsive design was something we wanted to take seriously: the dashboard should be versatile enough to work just as well on every device.

We needed a platform that we could build upon


We couldn’t shove new features into a ‘gear menu’ forever. We needed a navigation metaphor that would be scalable, as well as a consistent visual framework to make the development of new features easier, faster, and more easily understood by the user. We had big plans for new features when we started out; I’ll go into some of them below.

We needed consistency with a more mature brand

The CloudFlare brand has grown and evolved in the past several years. During the redesign process, we updated our visual identity system with new fonts, color palettes, and imagery for our business cards, printed items, booth displays, and videos. Our new dashboard is another step toward a newer, cleaner look.

The process

A new concept

Starting with a clean slate, we had the ability to completely rethink the information architecture and visual hierarchy of the interface. Keeping years of user feedback in mind, it was time to sketch out some ideas. Below is a small sampling of many sketches that were made early in the process:

[A sampling of early interface sketches]

Apps


With a layout framework started, the concept of categorizing our settings and features started to take shape. These categories evolved into ‘apps’ as we noticed the visual similarity of the navigation row to today’s smartphones. It also fit with the concept of adding third party apps (something we’re going to be implementing soon) alongside the apps provided by CloudFlare. Using apps as the beginning of our framework, we had a hierarchy: a user selects a website, sees a clear list of CloudFlare and third party apps, and, within whichever app is selected, is shown that app’s specific settings and features.


Modules

As our navigation solidified around the idea of apps, we got down to the smaller pieces. Within apps we put ‘modules’ that encapsulate each setting. As the name implies, these modules are modular — they’re made of several pieces that lock together creating a consistent look across the UI, even when used in a variety of ways. We decided that each module would contain a name and description, as well as a collapsible panel at the bottom. Some modules would feature a “control area” on the right side. Other more complicated modules could contain data tables, or even be a combination of both data tables and controls. Take a look at the dashboard the next time you log in and you’ll see them in their many variations.


One of the biggest improvements we wanted to integrate was inline help content. No more going off to other tabs to learn more about a setting. Help content is contained in a collapsible panel that can be expanded for further reading. We also plan on using this collapsible panel for other module-specific content, like API information, in the future.

Responsive design


We started this design process with a goal of making every CloudFlare feature adjustable from your phone. The modules mentioned above were designed to resize and reflow to your device size, allowing you to easily control CloudFlare settings for your websites. Checking the performance of your web site should be possible from a desktop machine and while running between meetings. And while it might be a bit crazy to set up an SSL certificate via your phone, we definitely don’t want to stop you!

The new look

We established a new style guide for the dashboard with new fonts, colors, and interface elements like buttons, toggles, and select menus. When trying to make everything fully responsive, high-density display ready, and overall more performant, we decided to throw out all images with lots of drop shadows, gradients, and noise that made up our old buttons and controls and focused on making everything out of CSS. The results are a crisper, cleaner look, and a faster load.


We also created a custom icon set to use for our app icons and buttons, as well as in some new illustrations that we use to introduce a little fun here and there.


Just the beginning

This new dashboard gives us a platform to deliver more features to users while continually evolving the UI itself with greater ease. What you see today is only a fraction of what we hope to do going forward, and this isn’t the only thing getting a refresh. As Matthew hinted, our old marketing site will be getting a facelift, and a new support site is in the works. Stay tuned!

If you think our design process is something you'd like to take part in, we’re hiring.

Go crypto: bridging the performance gap


It is no secret that we at CloudFlare love Go. We use it, and we use it a LOT. There are many things to love about Go, but what I personally find appealing is the ability to write assembly code!

CC BY 2.0 image by Jon Curnow

That is probably not the first thing that pops to your mind when you think of Go, but yes, it does allow you to write code "close to the metal" if you need the performance!

Another thing we do a lot of at CloudFlare is... cryptography. To keep your data safe we encrypt everything. And everything at CloudFlare is a LOT.

Unfortunately the built-in cryptography libraries in Go do not perform nearly as well as state-of-the-art implementations such as OpenSSL. That is not acceptable at CloudFlare's scale, so we created assembly implementations of Elliptic Curves and AES-GCM for Go on the amd64 architecture, taking advantage of the AES-NI and CLMUL instructions, to bring performance up to par with the OpenSSL implementation we use for Universal SSL.

We have been using those improved implementations for a while, and attempting to make them part of the official Go build for the good of the community. For now you can use our special fork of Go to enjoy the improved performance.

Both implementations are constant-time and side-channel protected. In addition the fork includes small improvements to Go's RSA implementation.

The performance benefits of this fork over the standard Go 1.4.2 library on the tested Haswell CPU are:

                         CloudFlare          Go 1.4.2        Speedup
AES-128-GCM           2,138.4 MB/sec          91.4 MB/sec     23.4X

P256 operations:
Base Multiply           26,249 ns/op        737,363 ns/op     28.1X
Multiply/ECDH comp     110,003 ns/op      1,995,365 ns/op     18.1X
Generate Key/ECDH gen   31,824 ns/op        753,174 ns/op     23.7X
ECDSA Sign              48,741 ns/op      1,015,006 ns/op     20.8X
ECDSA Verify           146,991 ns/op      3,086,282 ns/op     21.0X

RSA2048:
Sign                 3,733,747 ns/op      7,979,705 ns/op      2.1X
Sign 3-prime         1,973,009 ns/op      5,035,561 ns/op      2.6X

AES-GCM in a brief

So what is AES-GCM and why do we care? Well, it is an AEAD - Authenticated Encryption with Associated Data. Specifically AEAD is a special combination of a cipher and a MAC (Message Authentication Code) algorithm into a single robust algorithm, using a single key. This is different from the other method of performing authenticated encryption "encrypt-then-MAC" (or as TLS does it with CBC-SHAx, "MAC-then-encrypt"), where you can use any combination of cipher and MAC.

Using a dedicated AEAD reduces the dangers of bad combinations of ciphers and MACs, and other mistakes, such as using related keys for encryption and authentication.

Given the many vulnerabilities related to the use of AES-CBC with HMAC, and the weakness of RC4, AES-GCM is the de-facto secure standard on the web right now, as the only IETF-approved AEAD to use with TLS at the moment.

Another AEAD you may have heard of is ChaCha20-Poly1305, which CloudFlare also supports, but it is not a standard just yet.

That is why we use AES-GCM as the preferred cipher for customer HTTPS, only prioritizing ChaCha20-Poly1305 for mobile browsers that support it. You can see it in our configuration. As a result, more than 60% of our client-facing traffic today is encrypted with AES-GCM, and about 10% is encrypted with ChaCha20-Poly1305. This percentage grows every day as browser support improves. We also use AES-GCM to encrypt all the traffic between our data centers.

CC BY 2.0 image by 3:19

AES-GCM as an AEAD

As I mentioned AEAD is a special combination of a cipher and a MAC. In the case of AES-GCM the cipher is the AES block cipher in Counter Mode (AES-CTR). For the MAC it uses a universal hash called GHASH, encrypted with AES-CTR.

The inputs to the AES-GCM AEAD encryption are as follows:

  • The secret key (K), that may be 128, 192 or 256 bit long. In TLS, the key is usually valid for the entire connection.
  • A special unique value called the IV (initialization vector) - in TLS it is 96 bits. The IV is not secret, but the same IV may not be used for more than one message with the same key under any circumstance! To achieve that, usually part of the IV is generated as a nonce value, and the rest of it is incremented as a counter. In TLS the IV counter is also the record sequence number. The IV of GCM is unlike the IV in CBC mode, which must also be unpredictable. The disadvantage of using this type of IV is that, in order to avoid collisions, one must change the encryption key before the IV counter overflows.
  • The additional data (A) - this data is not secret, and therefore not encrypted, but it is being authenticated by the GHASH. In TLS the additional data is 13 bytes, and includes data such as the record sequence number, type, length and the protocol version.
  • The plaintext (P) - this is the secret data, it is both encrypted and authenticated.

The operation outputs the ciphertext (C) and the authentication tag (T). The length of C is identical to that of P, and the length of T is 128 bits (although some applications allow for shorter tags). The tag T is computed over A and C, so if even a single bit of either of them is changed, the decryption process will detect the tampering attempt and refuse to use the data. In TLS, the tag T is attached at the end of the ciphertext C.

When decrypting the data, the function will receive A, C and T and compute the authentication tag of the received A and C. It will compare the resulting value to T, and only if both are equal it will output the plaintext P.
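
Go exposes AES-GCM through the crypto/cipher AEAD interface, so the Seal/Open flow described above fits in a few lines. Here is a rough sketch; the key, nonce, plaintext, and additional data are placeholders:

package main

import (
    "crypto/aes"
    "crypto/cipher"
    "crypto/rand"
    "fmt"
)

func main() {
    key := make([]byte, 16)   // K: a 128-bit AES key
    nonce := make([]byte, 12) // IV: 96 bits, must never repeat under the same key
    rand.Read(key)
    rand.Read(nonce)

    block, err := aes.NewCipher(key)
    if err != nil {
        panic(err)
    }
    aead, err := cipher.NewGCM(block)
    if err != nil {
        panic(err)
    }

    plaintext := []byte("secret record payload") // P: encrypted and authenticated
    additional := []byte("record header")        // A: authenticated but not encrypted

    // Seal outputs C with the 128-bit tag T appended, just like a TLS record.
    sealed := aead.Seal(nil, nonce, plaintext, additional)

    // Open recomputes the tag over A and C; if anything was tampered with it
    // returns an error instead of the plaintext.
    opened, err := aead.Open(nil, nonce, sealed, additional)
    if err != nil {
        panic("authentication failed")
    }
    fmt.Printf("%s\n", opened)
}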

By supporting the two state-of-the-art AEADs - AES-GCM and ChaCha20-Poly1305 - together with the ECDSA and ECDH algorithms, CloudFlare is able to provide the fastest, most flexible, and most secure TLS experience possible to date on all platforms, be it a PC or a mobile phone.

Bottom line

Go is very easy to learn and fun to use, yet it is one of the most powerful languages for systems programming. It allows us to deliver robust web-scale software in short time frames. With the performance improvements CloudFlare brings to its crypto stack, Go can now be used for high performance TLS servers as well!

Google PageSpeed Service customers: migrate to CloudFlare for acceleration


This week, Google announced that its hosted PageSpeed Service will be shut down. Everyone using the hosted service needs to move their site elsewhere before August 3, 2015 to avoid breaking their website.

We're inviting these hosted customers: don't wait...migrate your site to CloudFlare for global acceleration (and more) right now.

CC BY 2.0 image by Roger
As TechCrunch wrote: "In many ways, PageSpeed Service is similar to what CloudFlare does but without the focus on security."

What is PageSpeed?

PageSpeed started as, and continues as, a Google-created software module for the Apache web server that rewrites webpages to reduce latency and bandwidth, helping make the web faster.

Google introduced their hosted PageSpeed Service in July 2011, to save webmasters the hassle of installing the software module.

It's the hosted service that is being discontinued.

CloudFlare performance

CloudFlare provides similar capabilities to PageSpeed, such as minification, image compression, and asynchronous loading.

Additionally, CloudFlare offers more performance gains through a global network footprint, Railgun for dynamic content acceleration, built-in SPDY support, and more.

Not just speed

PageSpeed Service customers care about performance, and CloudFlare delivers. CloudFlare also includes security, SSL, DNS, and more, at all plans, including Free.

Setting up CloudFlare

There's no need to install any software or change your host. CloudFlare works at the network level, with a five-minute sign-up process.

Here are simple instructions for migrating from the hosted PageSpeed Service to CloudFlare.

CloudFlare "Interview Questions"


For quite some time we've been grilling our candidates about the dirty corners of the TCP/IP stack. Every engineer here must prove his/her comprehensive understanding of the full network stack. For example: what are the differences in checksumming algorithms between the IPv4 and IPv6 stacks?

I'm joking of course, but in the spirit of the old TCP/IP pub game I want to share some of the amusing TCP/IP quirks I've bumped into over the last few months while working on CloudFlare's automatic attack mitigation systems.

CC BY-SA 2.0 image by Daan Berg

Don't worry if you don't know the correct answer: you may always come up with a funny one!

Some of the questions are fairly obvious, some don't have a direct answer and are supposed to provoke a longer discussion. The goal is to encourage our readers to review the dusty RFCs, get interested in the inner workings of the network stack and generally spread the knowledge about the protocols we rely on so much.

Don't forget to add a comment below if you want to share a response!

You think you know all about TCP/IP? Let's find out.

Archaeology

1) What is the lowest TCP port number?

2) The TCP header has a URG pointer field. When is it used?

3) Can the RST packet have a payload?

4) When is the "flow" field in IPv6 used?

5) What does the IP_FREEBIND socket option do?

Forgotten Quirks

6) What does the PSH flag actually do?

7) The TCP timestamp is implicated in SYN cookies. How?

8) Can a "UDP" packet have a checksum field set to zero?

9) How does TCP simultaneous open work? Does it actually work?

Fragmentation and Congestion

10) What is the silly window syndrome?

11) What are the CWR and ECE flags in the TCP header?

12) What is the IP ID field and what does it have to do with DF bit? Why do some packets have a non-zero IP ID and a DF set?

Fresh Ideas

13) Can a SYN packet have a payload? (hint: new RFC proposals)

14) Can a SYN+ACK packet have a payload?

ICMP Path MTU

15) ICMP packet-too-big messages are returned by routers and contain a part of the original packet in the payload. What is the minimal length of this payload that is accepted by Linux?

16) When an ICMP packet-too-big message is returned by an intermediate router it will have the source IP of that router. In practice though, we often see a source IP of the ICMP message to be identical to the destination IP of the original packet. Why could that happen?

Linux Configuration

17) Linux has a "tcp_no_metrics_save" sysctl setting. What does it save and for how long?

18) Linux uses two queues to handle incoming TCP connections: the SYN queue and the accept queue. What is the length of the SYN queue?

19) What happens if the SYN queue grows too large and overflows?

Touching the router

20) What are BGP bogons, and why are they less of a problem now?

21) TCP has an extension which adds MD5 checksums to packets. When is it useful?

And finally:

22) What are the differences in checksumming algorithms in IPv4 and IPv6?

At CloudFlare we touch low level things like that every day. Is this something that interests you? Consider applying!

CloudFlare Supports the Passage of the USA Freedom Act


Earlier today, the lower house in the U.S. Congress (the House of Representatives) passed the USA FREEDOM Act. The Act, if passed by the Senate and signed by the President, would seek to sunset the National Security Agency’s bulk collection and mass surveillance programs, which may or may not be authorized by Section 215 of the PATRIOT Act. Under this authority the U.S. government has established its broad surveillance programs to indiscriminately collect information. Other governments have followed this lead to create additional surveillance capabilities—most recently, the French Parliament has moved a bill that would allow broad surveillance powers with little judicial oversight.

Restricting routine bulk collection is important: it’s not the government’s job to collect everything that passes over the Internet. The new version of the USA FREEDOM Act keeps useful authorities but ends bulk collection of private data under the PATRIOT Act. It also increases the transparency of the secret FISA court, which reviews surveillance programs—a key start to understanding and fixing broken policies around surveillance. The Act would also allow companies to be more transparent in their reporting related to FISA orders.

To be clear, we continue to be supportive of law enforcement and work productively with them on their investigations, under the due process of law. However, numerous assessments in the U.S. and elsewhere have shown that bulk collection programs are ineffectual.

CloudFlare wholeheartedly supports the USA FREEDOM Act’s passage and adoption as law. We believe that the NSA’s mass surveillance programs are illegal and unconstitutional. CloudFlare has a long track record of opposing these programs since they deprive Internet users of the due process of law that the Constitution requires. We recently joined with the Center for Democracy and Technology, the Open Technology Institute, Access, and other leading tech policy organizations and tech companies to support the Act’s passage. The passage of the USA FREEDOM Act is just one step towards ensuring national security under the rule of law. There is more work to be done to ensure that the agencies charged with keeping us safe do so responsibly, in order to maintain trust in the web.

In the meantime, we urge the Senate and President Obama to complete the work of the House and pass the USA FREEDOM Act into law.

Logjam: the latest TLS vulnerability explained


Image: "Logjam" as interpreted by @0xabad1dea.

Yesterday, a group from INRIA, Microsoft Research, Johns Hopkins, the University of Michigan, and the University of Pennsylvania published a deep analysis of the Diffie-Hellman algorithm as used in TLS and other protocols. This analysis included a novel downgrade attack against the TLS protocol itself called Logjam, which exploits EXPORT cryptography (just like FREAK).

First, let me start by saying that CloudFlare customers are not and were never affected. We don’t support non-EC Diffie-Hellman ciphersuites on either the client or origin side. We also won't touch EXPORT-grade cryptography with a 20ft stick.

But why are CloudFlare customers safe, and how does Logjam work anyway?

Diffie-Hellman and TLS

This is a detailed technical introduction to how DH works and how it’s used in TLS—if you already know this and want to read about the attack, skip to “Enter export crypto, enter Logjam” below. If, instead, you are not interested in the nuts and bolts and want to know who’s at risk, skip to “So, what’s affected?”

To start a TLS connection, the two sides—client (the browser) and server (CloudFlare)—need to agree securely on a secret key. This process is called Key Exchange and it happens during the TLS Handshake: the exchange of messages that take place before encrypted data can be transmitted.

There is a detailed description of the TLS handshake in the first part of this previous blog post by Nick Sullivan. In the following, I’ll only discuss the ideas you’ll need to understand the attack at hand.

There are many types of Key Exchanges: static RSA, Diffie-Hellman (DHE cipher suites), Elliptic Curve Diffie-Hellman (ECDHE cipher suites), and some less used methods.

An important property of DHE and ECDHE key exchanges is that they provide Forward Secrecy. That is, even if the server key is compromised at some point, it can’t be used to decrypt past connections. It’s important to protect the information exchanged from future breakthroughs, and we’re proud to say that 94% of CloudFlare connections provide it.

This research—and this attack—applies to the normal non-EC Diffie-Hellman key exchange (DHE). This is how it works at a high level (don’t worry, I’ll explain each part in more detail below):

  1. The client advertises support for DHE cipher suites when opening a connection (in what is called a Client Hello message)
  2. The server picks the parameters and performs its half of the DH computation using those parameters
  3. The server signs parameters and its DH share with its long-term certificate and sends the whole thing to the client
  4. The client checks the signature, uses the parameters to perform its half of the computation and sends the result to the server
  5. Both parts put everything together and derive a shared secret, which they will use as the key to secure the connection

(For a more in depth analysis of each step see the link to Nick Sullivan’s blog post above, “Ephemeral Diffie-Hellman Handshake” section.)

Diffie-Hellman Handshake diagram

Let’s explain some of the terms that just passed by your screen. “The client” is the browser and “the server” is the website (or CloudFlare’s edge serving the website).

“The parameters” (or group) are some big numbers that are used as base for the DH computations. They can be, and often are, fixed. The security of the final secret depends on the size of these parameters. This research deemed 512 and 768 bits to be weak, 1024 bits to be breakable by really powerful attackers like governments, and 2048 bits to be a safe size.

The certificate contains a public key and is what you (or CloudFlare for you) get issued from a CA for your website. The client makes sure it’s issued by a CA it trusts and that it’s valid for the visited website. The server uses the corresponding private key to cryptographically sign its share of the DH key exchange so that the client can be sure it’s agreeing on a connection key with the real owner of the website, not a MitM.

Finally, the DH computation: there’s a beautiful explanation of this on Wikipedia which uses paint. The tl;dr is:

  1. The server picks a secret ‘a’
  2. Then it computes—using some parameters as a base—a value ‘A’ from it and sends that to the client (not ‘a’!)
  3. The client picks a secret ‘b’, takes the parameters from the server and likewise it computes a value ‘B’ that sends to the server
  4. Both parts put together ‘a’ + ‘B’ or ‘b’ + ‘A’ to derive a shared, identical secret - which is impossible to compute from ‘A’ + ‘B’ which are the only things that travelled on the wire

The security of all this depends on the strength/size of the parameters.
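
To make the arithmetic concrete, here is a toy sketch in Go using math/big. The 23 and 5 below are deliberately tiny stand-ins for the real parameters, which should be a large (ideally 2048-bit) prime and its generator:

package main

import (
    "crypto/rand"
    "fmt"
    "math/big"
)

func main() {
    // Toy parameters for illustration only; real DHE uses a large prime p
    // and a generator g agreed on as "the parameters".
    p := big.NewInt(23)
    g := big.NewInt(5)

    // Each side picks its own private secret in [1, p-2].
    a, _ := rand.Int(rand.Reader, new(big.Int).Sub(p, big.NewInt(2))) // server's 'a'
    b, _ := rand.Int(rand.Reader, new(big.Int).Sub(p, big.NewInt(2))) // client's 'b'
    a.Add(a, big.NewInt(1))
    b.Add(b, big.NewInt(1))

    A := new(big.Int).Exp(g, a, p) // the server sends A = g^a mod p
    B := new(big.Int).Exp(g, b, p) // the client sends B = g^b mod p

    // Both sides arrive at the same secret without 'a' or 'b' ever leaving them.
    serverShared := new(big.Int).Exp(B, a, p) // (g^b)^a mod p
    clientShared := new(big.Int).Exp(A, b, p) // (g^a)^b mod p

    fmt.Println(serverShared, clientShared) // always identical
}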

Enter export crypto, enter Logjam

So far, so good. Diffie-Hellman is nice, it provides Forward Secrecy, it’s secure if the parameters are big enough, and the parameters are picked and signed by the server. So what’s the problem?

Enter “export cryptography”! Export cryptography is a relic of the 90’s US restrictions on cryptography export. In order to support SSL in countries to where the U.S. had disallowed exporting "strong cryptography", many implementations support weakened modes called EXPORT modes.

We’ve already seen an attack that succeeded because connections could be forced to use these modes even if they wouldn’t want to, this is what happened with the FREAK vulnerability. It’s telling that 20 years after these modes became useless we are still dealing with the outcome of the added complexity.

How it works with Diffie-Hellman is that the client requests a DHE_EXPORT ciphersuite instead of the corresponding DHE one. Seeing that, the server (if it supports DHE_EXPORT) picks small, breakable 512-bit parameters for the exchange, and carries on with a regular DHE key exchange. The server doesn’t signal back securely to the client that it picked such small DH parameters because of the EXPORT ciphersuite.

This is the protocol flaw at the heart of the Logjam “downgrade attack”:

  • A MitM attacker intercepts a client connection and replaces all the accepted ciphersuites with only the DHE_EXPORT ones
  • The server picks weak 512-bit parameters, does its half of the computation, and signs the parameters with the certificate’s private key. Neither the Client Hello, the client ciphersuites, nor the chosen ciphersuite are signed by the server!
  • The client is led to believe that the server picked a DHE Key Exchange and just willingly decided on small parameters. From its point of view, it has no way to know that the server was tricked by the MitM into doing so!
  • The attacker would then break one of the two weak DH shares, recover the connection key, and proceed with the TLS connection with the client

Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice - Figure 2


The client has no other way to protect itself besides drawing a line in the sand about how weak the DHE parameters can be (e.g. at least 1024 bits) and refusing to connect to servers that want to pick smaller ones. This is what all modern browsers are now doing, but it wasn’t done before because it causes breakage, and it was believed that there was no way to trick a server into choosing such weak parameters if it wouldn’t normally.

The servers can protect themselves by refusing EXPORT ciphersuites and never signing small parameters.
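
As an aside, a server written in Go sidesteps the problem entirely: crypto/tls does not implement finite-field DHE at all, and you can pin the configuration to ECDHE suites explicitly. A minimal sketch, with cert.pem and key.pem as placeholders for your own certificate and key:

package main

import (
    "crypto/tls"
    "log"
    "net/http"
)

func main() {
    cfg := &tls.Config{
        PreferServerCipherSuites: true,
        // Offer only ECDHE suites: forward secrecy without any finite-field
        // DH parameters that could be downgraded.
        CipherSuites: []uint16{
            tls.TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,
            tls.TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,
        },
    }
    srv := &http.Server{Addr: ":443", TLSConfig: cfg, Handler: http.DefaultServeMux}
    log.Fatal(srv.ListenAndServeTLS("cert.pem", "key.pem"))
}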

About parameters size and reuse

But how small is too small for DH parameters? The Logjam paper analyzes this in depth also. The first thing to understand is that parameters can be and often are reused. 17.9% of the Top 1 Million Alexa Domains used the same 1024-bit parameters.

An attacker can perform the bulk of the computation having only the parameters, and then break any DH exchange that uses them in minutes. So when many sites (or VPN servers, etc.) share the same parameters, the investment of time needed to “break” the parameters makes much more sense since it would then allow the attacker to break many connections with little extra effort.

The research team performed the precomputation on the most common 512-bit (EXPORT) parameters to demonstrate the impact of Logjam, but they express concerns that real, more powerful attackers might do the same with the common normal-DHE 1024-bit parameters.

Finally, in their Internet-wide scan they discovered that many servers will provide vulnerable 512-bit parameters even for non-EXPORT DHE, in order to support older TLS implementations (for example, old Java versions).

So, what’s affected?

A client/browser is affected if it accepts small DHE parameters as part of any connection, since it has no way to know that it’s being tricked into a weak EXPORT-level connection. Most major browsers at the time of this writing are vulnerable but are moving to restrict the size of DH parameters to 1024 bit. You can check yours visiting weakdh.org.

A server/website is vulnerable if it supports the DHE_EXPORT ciphersuites or if it uses small parameters for DHE. You can find a test and instructions on how to fix this at https://weakdh.org/sysadmin.html. 8.4% of Alexa Top Million HTTPS websites were initially vulnerable (with 82% and 10% of them using the same two parameter sets, making precomputation more viable). CloudFlare servers don’t accept either DHE_EXPORT or DHE. We offer ECDHE instead.

Some interesting related statistics: 94% of the TLS connections to CloudFlare customer sites use ECDHE (more precisely, 90% of them are ECDHE-RSA-AES of some sort and 10% ECDHE-RSA-CHACHA20-POLY1305) and provide Forward Secrecy. The rest use static RSA (5.5% with AES, 0.6% with 3DES).

Both the client and the server need to be vulnerable in order for the attack to succeed because the server must accept to sign small DHE_EXPORT parameters, and the client must accept them as valid DHE parameters.

A closing note: events like this are ultimately a good thing for the security industry and the web at large since they mean that skilled people are looking at what we rely on to secure our connections and fix its flaws. They also put a spotlight on how the added complexity of supporting reduced-strength crypto and older devices endangers and adds difficulty to all of our security efforts.

If you’ve read to here, found it interesting, and would like to work on things like this, remember that we’re hiring in London and San Francisco!


Welcome Acquia!


We’ve had the good fortune to share many great experiences with the Acquia team over the last few years. From breaking bread with founder and CTO Dries Buytaert at SXSW, to skiing the slopes of Park City with the company’s CEO Tom Erickson, to staying up late with their incredible team onboarding a joint customer under a DDoS attack. It’s always a pleasure to spend time with the Acquia team.

Today we are thrilled to welcome Acquia as a CloudFlare Partner. Together we developed Acquia Cloud Edge powered by CloudFlare, making it easier for any of their customers to access CloudFlare’s web performance and security solutions. The Acquia Cloud Edge is a family of products that protects websites against security threats, ensures only clean traffic gets served, and speeds up site performance no matter where visitors are located.

Acquia Cloud Edge powered by CloudFlare comes as Edge Protect and Edge CDN. Edge Protect defends against DDoS and other network-level attacks. CloudFlare sits on the network edge in front of Acquia web servers, allowing early identification of attack patterns and questionable visitors, and mitigating attacks before they reach a user’s site. Edge CDN accelerates the delivery of digital experiences through CloudFlare’s global network with the fastest, most full-featured CDN on the market. It cuts page load times by up to 50 percent and reduces bandwidth. Customers see up to a 95 percent cache hit rate, reducing the number of requests that have to be served from the Acquia infrastructure.

CloudFlare has been a fan of Drupal from the very beginning, and with nearly 2 million sites built on the open content management framework, we are excited to partner with Acquia to provide the best web performance and security for the Drupalverse.

Get started or learn more about CloudFlare and Acquia.

Four years later and CloudFlare is still doing IPv6 automatically


Over the past four years CloudFlare has helped well over two million websites join the modern web, making us one of the fastest growing providers of IPv6 web connectivity on the Internet. CloudFlare's Automatic IPv6 Gateway allows IPv4-only websites to support IPv6-only clients with zero clicks. No hardware. No software. No code changes. And no need to change your hosting provider.

Image by Andrew D. Ferguson

A Four Year Story

The story of IPv6 support for customers of CloudFlare is about as long as the story of CloudFlare itself. June 6th, 2011 (four years ago) was the original World IPv6 Day, and CloudFlare participated. Each year since, the global Internet community has pushed forward with additional IPv6 deployment. Now, four years later, CloudFlare is celebrating June 6th knowing that our customers are being provided with a solid IPv6 offering that requires zero configuration to enable. CloudFlare is the only global CDN that provides IPv4/IPv6 delivery of content by default and at scale.

IPv6 has been featured in our blog various times over the last four years. We have provided support for legacy logging systems to handle IPv6 addresses, provided DDoS protection on IPv6 alongside classic IPv4 address space, and provided the open source community with software to handle ECMP load balancing correctly.

It's all about the numbers

Today CloudFlare is celebrating four years of World IPv6 Day with a few new graphs:

First, let's measure IPv6 addresses (broken down by /64 blocks).

[Graph: unique IPv6 addresses seen, broken down by /64 block]

While CloudFlare operates the CDN portion of web traffic delivery to end-users, we don’t control the end-user's deployment of IPv6. We do, however, see IP addresses when they are used, and we clearly see an uptick in deployed IPv6 at the end-user sites. Measuring unique IP addresses is a good indicator of IPv6 deployment.

Next we can look at how end users, using IPv6, access CloudFlare customers' websites.

[Graph: IPv6 traffic by device type]

With nearly 25% of our IPv6 traffic being delivered to mobile devices as of today, CloudFlare is happy to help demonstrate that mobile operators have embraced IPv6 with gusto.

The third graph looks at traffic based on destination country and is a measure of end-users (eyeballs as they are called in the industry) that are enabled for IPv6. The graph shows the top countries that CloudFlare delivers IPv6 web traffic to.

[Graph: share of traffic delivered over IPv6, by destination country]

On the Y Axis, we have the percentage of traffic delivered via IPv6. On the X Axis we have each of the last eight months. Let's just have a shoutout to Latin America. In Latin America CloudFlare operates five data centers: Argentina, Brazil, Chile, Colombia, and Peru. CloudFlare would like to highlight our Peru data center because it has the highest percentage of traffic being delivered over IPv6 in Latin America. Others agree.


What's next for IPv6?

The real question is: What's next for users stuck with only IPv4? Because the RIRs have depleted their IPv4 address space pools, ISPs simply have no option left but to embrace IPv6. Basically, there are no more IPv4 addresses available.

Supporting IPv6 is even more important today than it was four years ago and CloudFlare is happy that it provides this automatic IPv6 solution for all its customers. Come try it out and don’t let other web properties languish in the non-IPv6 world by telling a friend about our automatic IPv6 support.

iOS Developers — Migrate to iOS 9 with CloudFlare


Thousands of developers use CloudFlare to accelerate and secure the backend of their mobile applications and websites. This week is Apple’s Worldwide Developers Conference (WWDC), where thousands of Apple developers come to San Francisco to talk, learn and share best practices for developing software for Apple platforms. New announcements from Apple this week make CloudFlare an even more obvious choice for application developers.

New operating systems, new application requirements

The flagship announcement of WWDC 2015 was a new version of Apple’s mobile operating system, iOS 9, to be released in September with a developer preview available now. They also announced a new Mac operating system, OS X El Capitan, launching in the fall. Apple has a track record of developing and supporting technologies that enhance user privacy and security, such as iMessage and FaceTime, and the trend is continuing with these new operating systems. In both cases, Apple is requiring application developers to make use of two network technologies that CloudFlare is a big fan of: HTTPS and IPv6.

For iOS 9 and El Capitan, all applications submitted to the iOS and Mac App Stores must work over IPv6. In previous versions, applications were allowed that only worked with IPv4.

From Sebastien Marineau, Apple’s VP of Core OS: "Because IPv6 support is so critical to ensuring your applications work across the world for every customer, we are making it an AppStore submission requirement, starting with iOS 9."

By default, all network connections in third party applications compiled for iOS 9 and El Capitan use a new feature called App Transport Security. This feature forces the application to connect to backend APIs and the web via HTTPS. Plain unencrypted HTTP requests are disallowed unless the developer specifically modifies a configuration file to allow it.

From the iOS 9 developer documentation: "If you're developing a new app, you should use HTTPS exclusively. If you have an existing app, you should use HTTPS as much as you can right now, and create a plan for migrating the rest of your app as soon as possible."

What does this mean for application developers? If your application has a web backend component, you will need to update the backend to support these protocols. This can be difficult to do since not all hosting providers support IPv6, and HTTPS certificates can be tricky to obtain and difficult to configure correctly, not to mention maintain.

Free and automatic IPv6 and SSL

One of the things CloudFlare does best is take modern network protocols and make them affordable (or free) and accessible to everyone. Every CloudFlare-backed website and API backend has support for both IPv6 and HTTPS automatically, and with no configuration necessary.

With Universal SSL, this is now true even for customers on CloudFlare’s free plan. We also make sure your HTTPS configuration is the latest and greatest with TLS 1.2 support and Forward secrecy.

CloudFlare provides its Automatic IPv6 Gateway for free to all CloudFlare users. This makes any content or API required by an Apple iOS app instantly available over both IPv4 and IPv6, even if your hosting provider can't supply IPv6 support. More information about CloudFlare’s IPv6 support can be found here.

Not only does CloudFlare help keep application service components up to date with the latest Apple requirements, it provides the performance benefits of a globally distributed network and protection against malicious attacks.

If you’re an iOS developer looking to upgrade your application backend to meet Apple’s new requirements, you can sign up for CloudFlare here.

To verify that IPv6 is enabled, open your DNS settings and ensure that the IPv6 toggle is on.

CloudFlare offers HTTPS to the backend, so if you already have HTTPS, you can keep full end-to-end encryption with CloudFlare’s Strict SSL mode. If your backend doesn’t support HTTPS, you can select the Flexible mode to encrypt the communication between your App and CloudFlare. These settings are available in your Crypto settings.

CloudFlare and iOS 9 were made for each other.

P.S.: We provide the same service to Android apps as well.

How to receive a million packets per second


Last week during a casual conversation I overheard a colleague saying: "The Linux network stack is slow! You can't expect it to do more than 50 thousand packets per second per core!"

That got me thinking. While I agree that 50kpps per core is probably the limit for any practical application, what is the Linux networking stack capable of? Let's rephrase that to make it more fun:

On Linux, how hard is it to write a program that receives 1 million UDP packets per second?

Hopefully, answering this question will be a good lesson about the design of a modern networking stack.

CC BY-SA 2.0 image by Bob McCaffrey

First, let us assume:

  • Measuring packets per second (pps) is much more interesting than measuring bytes per second (Bps). You can achieve high Bps by better pipelining and sending longer packets. Improving pps is much harder.

  • Since we're interested in pps, our experiments will use short UDP messages. To be precise: 32 bytes of UDP payload. That means 74 bytes on the Ethernet layer.

  • For the experiments we will use two physical servers: "receiver" and "sender".

  • They both have two six-core 2GHz Xeon processors. With hyperthreading (HT) enabled that comes to 24 logical processors on each box. The boxes have a multi-queue 10G network card by Solarflare, with 11 receive queues configured. More on that later.

  • The source code of the test programs is available here: udpsender, udpreceiver.

Prerequisites

Let's use port 4321 for our UDP packets. Before we start we must ensure the traffic won't be interfered with by iptables:

receiver$ iptables -I INPUT 1 -p udp --dport 4321 -j ACCEPT
receiver$ iptables -t raw -I PREROUTING 1 -p udp --dport 4321 -j NOTRACK

A couple of explicitly defined IP addresses will come in handy later:

receiver$ for i in `seq 1 20`; do \
              ip addr add 192.168.254.$i/24 dev eth2; \
          done
sender$ ip addr add 192.168.254.30/24 dev eth3

1. The naive approach

To start let's do the simplest experiment. How many packets will be delivered for a naive send and receive?

The sender pseudo code:

fd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
fd.bind(("0.0.0.0", 65400)) # select source port to reduce nondeterminism
fd.connect(("192.168.254.1", 4321))
while True:
    fd.sendmmsg(["\x00" * 32] * 1024)

While we could have used the usual send syscall, it wouldn't be efficient. Context switches to the kernel have a cost and it is better to avoid them. Fortunately a handy syscall was recently added to Linux: sendmmsg. It allows us to send many packets in one go. Let's do 1,024 packets at once.

The receiver pseudo code:

fd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
fd.bind(("0.0.0.0", 4321))
while True:
    packets = [None] * 1024
    fd.recvmmsg(packets, MSG_WAITFORONE)

Similarly, recvmmsg is a more efficient version of the common recv syscall.

Let's try it out:

sender$ ./udpsender 192.168.254.1:4321
receiver$ ./udpreceiver1 0.0.0.0:4321
  0.352M pps  10.730MiB /  90.010Mb
  0.284M pps   8.655MiB /  72.603Mb
  0.262M pps   7.991MiB /  67.033Mb
  0.199M pps   6.081MiB /  51.013Mb
  0.195M pps   5.956MiB /  49.966Mb
  0.199M pps   6.060MiB /  50.836Mb
  0.200M pps   6.097MiB /  51.147Mb
  0.197M pps   6.021MiB /  50.509Mb

With the naive approach we can do between 197k and 350k pps. Not too bad. Unfortunately there is quite a bit of variability. It is caused by the kernel shuffling our programs between cores. Pinning the processes to CPUs will help:

sender$ taskset -c 1 ./udpsender 192.168.254.1:4321
receiver$ taskset -c 1 ./udpreceiver1 0.0.0.0:4321
  0.362M pps  11.058MiB /  92.760Mb
  0.374M pps  11.411MiB /  95.723Mb
  0.369M pps  11.252MiB /  94.389Mb
  0.370M pps  11.289MiB /  94.696Mb
  0.365M pps  11.152MiB /  93.552Mb
  0.360M pps  10.971MiB /  92.033Mb

Now, the kernel scheduler keeps the processes on the defined CPUs. This improves processor cache locality and makes the numbers more consistent, just what we wanted.

2. Send more packets

While 370k pps is not bad for a naive program, it's still quite far from the goal of 1Mpps. To receive more, first we must send more packets. How about sending independently from two threads:

sender$ taskset -c 1,2 ./udpsender \
            192.168.254.1:4321 192.168.254.1:4321
receiver$ taskset -c 1 ./udpreceiver1 0.0.0.0:4321
  0.349M pps  10.651MiB /  89.343Mb
  0.354M pps  10.815MiB /  90.724Mb
  0.354M pps  10.806MiB /  90.646Mb
  0.354M pps  10.811MiB /  90.690Mb

The numbers on the receiving side didn't increase. ethtool -S will reveal where the packets actually went:

receiver$ watch 'sudo ethtool -S eth2 |grep rx'
     rx_nodesc_drop_cnt:    451.3k/s
     rx-0.rx_packets:     8.0/s
     rx-1.rx_packets:     0.0/s
     rx-2.rx_packets:     0.0/s
     rx-3.rx_packets:     0.5/s
     rx-4.rx_packets:  355.2k/s
     rx-5.rx_packets:     0.0/s
     rx-6.rx_packets:     0.0/s
     rx-7.rx_packets:     0.5/s
     rx-8.rx_packets:     0.0/s
     rx-9.rx_packets:     0.0/s
     rx-10.rx_packets:    0.0/s

Through these stats, the NIC reports that it successfully delivered around 350kpps to RX queue #4. The rx_nodesc_drop_cnt is a Solarflare-specific counter indicating that the NIC failed to deliver a further 450kpps to the kernel.

Sometimes it's not obvious why the packets weren't delivered. In our case though, it's very clear: the RX queue #4 delivers packets to CPU #4. And CPU #4 can't do any more work - it's totally busy just reading the 350kpps. Here's how that looks in htop:

[Figure: htop showing a single CPU fully busy]

Crash course to multi-queue NICs

Historically, network cards had a single RX queue that was used to pass packets between hardware and kernel. This design had an obvious limitation - it was impossible to deliver more packets than a single CPU could handle.

To utilize multicore systems, NICs began to support multiple RX queues. The design is simple: each RX queue is pinned to a separate CPU, therefore, by delivering packets to all the RX queues a NIC can utilize all CPUs. But it raises a question: given a packet, how does the NIC decide to which RX queue to push it?

[Figure: a multi-queue NIC dispatching packets to several RX queues]

Round-robin balancing is not acceptable, as it might introduce reordering of packets within a single connection. An alternative is to use a hash computed from the packet to decide the RX queue number. The hash is usually computed over the tuple (src IP, dst IP, src port, dst port). This guarantees that packets for a single flow will always end up on exactly the same RX queue, and reordering of packets within a single flow can't happen.

In our case, the hash could have been used like this:

RX_queue_number = hash('192.168.254.30', '192.168.254.1', 65400, 4321) % number_of_queues

Multi-queue hashing algorithms

The hash algorithm is configurable with ethtool. On our setup it is:

receiver$ ethtool -n eth2 rx-flow-hash udp4
UDP over IPV4 flows use these fields for computing Hash flow key:
IP SA
IP DA

This reads as: for IPv4 UDP packets, the NIC will hash (src IP, dst IP) addresses. i.e.:

RX_queue_number = hash('192.168.254.30', '192.168.254.1') % number_of_queues

This is pretty limited, as it ignores the port numbers. Many NICs allow customization of the hash. Again, using ethtool we can select the tuple (src IP, dst IP, src port, dst port) for hashing:

receiver$ ethtool -N eth2 rx-flow-hash udp4 sdfn
Cannot change RX network flow hashing options: Operation not supported

Unfortunately our NIC doesn't support it - we are constrained to (src IP, dst IP) hashing.

A note on NUMA performance

So far all our packets flow to only one RX queue and hit only one CPU. Let's use this as an opportunity to benchmark the performance of different CPUs. In our setup the receiver host has two separate processor banks, each is a different NUMA node.

We can pin the single-threaded receiver to one of four interesting CPUs in our setup. The four options are:

  1. Run the receiver on another CPU, but on the same NUMA node as the RX queue. The performance, as we saw above, is around 360kpps.

  2. With the receiver on exactly the same CPU as the RX queue we can get up to ~430kpps, but the variability is high. The performance drops down to zero if the NIC is overwhelmed with packets.

  3. When the receiver runs on the HT counterpart of the CPU handling the RX queue, the performance is half the usual number, at around 200kpps.

  4. With the receiver on a CPU on a different NUMA node than the RX queue we get ~330k pps. The numbers aren't too consistent though.

While a 10% penalty for running on a different NUMA node may not sound too bad, the problem only gets worse with scale. On some tests I was able to squeeze out only 250kpps per core, and on all the cross-NUMA tests the variability was bad. The performance penalty across NUMA nodes is even more visible at higher throughput: in one of the tests I got a 4x penalty when running the receiver on a bad NUMA node.

3. Multiple receive IPs

Since the hashing algorithm on our NIC is pretty limited, the only way to distribute the packets across RX queues is to use many IP addresses. Here's how to send packets to different destination IPs:

sender$ taskset -c 1,2 ./udpsender 192.168.254.1:4321 192.168.254.2:4321

ethtool confirms the packets go to distinct RX queues:

receiver$ watch 'sudo ethtool -S eth2 |grep rx'
     rx-0.rx_packets:     8.0/s
     rx-1.rx_packets:     0.0/s
     rx-2.rx_packets:     0.0/s
     rx-3.rx_packets:  355.2k/s
     rx-4.rx_packets:     0.5/s
     rx-5.rx_packets:  297.0k/s
     rx-6.rx_packets:     0.0/s
     rx-7.rx_packets:     0.5/s
     rx-8.rx_packets:     0.0/s
     rx-9.rx_packets:     0.0/s
     rx-10.rx_packets:    0.0/s

The receiving part:

receiver$ taskset -c 1 ./udpreceiver1 0.0.0.0:4321
  0.609M pps  18.599MiB / 156.019Mb
  0.657M pps  20.039MiB / 168.102Mb
  0.649M pps  19.803MiB / 166.120Mb

Hurray! With two cores busy handling RX queues, and a third running the application, it's possible to get ~650k pps!

We can increase this number further by sending traffic to three or four RX queues, but soon the application will hit another limit. This time the rx_nodesc_drop_cnt is not growing, but the netstat "receiver errors" are:

receiver$ watch 'netstat -s --udp'
Udp:
      437.0k/s packets received
        0.0/s packets to unknown port received.
      386.9k/s packet receive errors
        0.0/s packets sent
    RcvbufErrors:  123.8k/s
    SndbufErrors: 0
    InCsumErrors: 0

This means that while the NIC is able to deliver the packets to the kernel, the kernel is not able to deliver the packets to the application. In our case it delivers only 440kpps; the remaining 390kpps + 123kpps are dropped because the application isn't picking them up fast enough.

4. Receive from many threads

We need to scale out the receiver application. The naive approach, to receive from many threads, won't work well:

sender$ taskset -c 1,2 ./udpsender 192.168.254.1:4321 192.168.254.2:4321
receiver$ taskset -c 1,2 ./udpreceiver1 0.0.0.0:4321 2
  0.495M pps  15.108MiB / 126.733Mb
  0.480M pps  14.636MiB / 122.775Mb
  0.461M pps  14.071MiB / 118.038Mb
  0.486M pps  14.820MiB / 124.322Mb

The receiving performance is down compared to a single-threaded program. That's caused by lock contention on the UDP receive buffer. Since both threads are using the same socket descriptor, they spend a disproportionate amount of time fighting for a lock around the UDP receive buffer. This paper describes the problem in more detail.

Using many threads to receive from a single descriptor is not optimal.

5. SO_REUSEPORT

Fortunately, there is a workaround recently added to Linux: the SO_REUSEPORT flag. When this flag is set on a socket descriptor, Linux will allow many processes to bind to the same port. In fact, any number of processes will be allowed to bind and the load will be spread across them.

With SO_REUSEPORT each of the processes will have a separate socket descriptor. Therefore each will own a dedicated UDP receive buffer. This avoids the contention issues previously encountered:

receiver$ taskset -c 1,2,3,4 ./udpreceiver1 0.0.0.0:4321 4 1
  1.114M pps  34.007MiB / 285.271Mb
  1.147M pps  34.990MiB / 293.518Mb
  1.126M pps  34.374MiB / 288.354Mb

This is more like it! The throughput is decent now!
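
For reference, the important part on the code side is tiny: each receiving process sets SO_REUSEPORT before bind, and every socket bound this way gets its own receive buffer. A minimal C sketch (requires Linux 3.9+; error handling omitted):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Each worker calls this independently; the kernel spreads incoming
 * UDP flows across all the sockets bound to the same address and port. */
static int bind_reuseport(const char *ip, int port) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    return fd;
}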

More investigation will reveal further room for improvement. Even though we started four receiving threads, the load is not being spread evenly across them:

[Figure: uneven packet distribution across the SO_REUSEPORT sockets]

Two threads received all the work and the other two got no packets at all. This is caused by a hashing collision, but this time it is at the SO_REUSEPORT layer.

Final words

I've done some further tests, and with perfectly aligned RX queues and receiver threads on a single NUMA node it was possible to get 1.4Mpps. Running the receiver on a different NUMA node caused the numbers to drop to at best 1Mpps.

To sum up, if you want perfect performance you need to:

  • Ensure traffic is distributed evenly across many RX queues and SO_REUSEPORT processes. In practice, the load is usually well distributed as long as there are a large number of connections (or flows).

  • Have enough spare CPU capacity to actually pick up the packets from the kernel.

  • To make things harder, keep both the RX queues and the receiver processes on a single NUMA node.

While we have shown that it is technically possible to receive 1Mpps on a Linux machine, the application was not doing any actual processing of the received packets - it didn't even look at the content of the traffic. Don't expect performance like that for any practical application without a lot more work.

Interested in this sort of low-level, high-performance packet wrangling? CloudFlare is hiring in London, San Francisco and Singapore.

Go has a debugger—and it's awesome!


Something that often, uh... bugs[1] Go developers is the lack of a proper debugger. Sure, builds are ridiculously fast and easy, and println(hex.Dump(b)) is your friend, but sometimes it would be nice to just set a breakpoint and step through that endless if chain or print a bunch of values without recompiling ten times.

CC BY 2.0 image by Carl Milner

You could try to use some dirty gdb hacks that will work if you built your binary with a certain linker and ran it on some architectures when the moon was in a waxing crescent phase, but let's be honest, it isn't an enjoyable experience.

Well, worry no more! godebug is here!

godebug is an awesome cross-platform debugger created by the Mailgun team. You can read their introduction for some under-the-hood details, but here's the cool bit: instead of wrestling with half a dozen different ptrace interfaces that would not be portable, godebug rewrites your source code and injects function calls like godebug.Line on every line, godebug.Declare at every variable declaration, and godebug.SetTrace for breakpoints (i.e. wherever you type _ = "breakpoint").

I find this solution brilliant. What you get out of it is a (possibly cross-compiled) debug-enabled binary that you can drop on a staging server just like you would with a regular binary. When a breakpoint is reached, the program will stop inline and wait for you on stdin. It's the single-binary, zero-dependencies philosophy of Go that we love applied to debugging. Builds everywhere, runs everywhere, with no need for tools or permissions on the server. It even compiles to JavaScript with gopherjs (check out the Mailgun post above—show-offs ;) ).

You might ask, "But does it get a decent runtime speed or work with big applications?" Well, the other day I was seeing RRDNS—our in-house Go DNS server—hit a weird branch, so I placed a breakpoint a couple lines above the if in question, recompiled the whole of RRDNS with godebug instrumentation, dropped the binary on a staging server, and replayed some DNS traffic.

filippo@staging:~$ ./rrdns -config config.json
-> _ = "breakpoint"
(godebug) l

    q := r.Query.Question[0]

--> _ = "breakpoint"

    if !isQtypeSupported(q.Qtype) {
        return
(godebug) n
-> if !isQtypeSupported(q.Qtype) {
(godebug) q
dns.Question{Name:"filippo.io.", Qtype:0x1, Qclass:0x1}
(godebug) c

Boom. The request and the debug log paused (make sure to kill any timeout you have in your tools), waiting for me to step through the code.

Sold yet? Here's how you use it: simply run godebug {build|run|test} instead of go {build|run|test}. We adapted godebug to resemble the go tool as much as possible. Remember to use -instrument if you want to be able to step into packages that are not main.

For example, here is part of the RRDNS Makefile:

bin/rrdns:
ifdef GODEBUG
    GOPATH="${PWD}" go install github.com/mailgun/godebug
    GOPATH="${PWD}" ./bin/godebug build -instrument "${GODEBUG}" -o bin/rrdns rrdns
else
    GOPATH="${PWD}" go install rrdns
endif

test:
ifdef GODEBUG
    GOPATH="${PWD}" go install github.com/mailgun/godebug
    GOPATH="${PWD}" ./bin/godebug test -instrument "${GODEBUG}" rrdns/...
else
    GOPATH="${PWD}" go test rrdns/...
endif

Debugging is just a make bin/rrdns GODEBUG=rrdns/... away.

This tool is still young, but in my experience, perfectly functional. The UX could use some love if you can spare some time (as you can see above it's pretty spartan), but it should be easy to build on what's there already.

About source rewriting

Before closing, I'd like to say a few words about the technique of source rewriting in general. It powers many different Go tools, like test coverage, fuzzing and, indeed, debugging. It's made possible primarily by Go’s blazing-fast compiles, and it enables amazing cross-platform tools to be built easily.

However, since it's such a handy and powerful pattern, I feel like there should be a standard way to apply it in the context of the build process. After all, all the source rewriting tools need to implement a subset of the following features:

  • Wrap the main function
  • Conditionally rewrite source files
  • Keep global state

Why should every tool have to reinvent all the boilerplate to copy the source files, rewrite the source, make sure stale objects are not used, build the right packages, run the right tests, and interpret the CLI..? Basically, all of godebug/cmd.go. And what about gb, for example?

I think we need a framework for Go source code rewriting tools. (Spoiler, spoiler, ...)

If you’re interested in working on Go servers at scale and developing tools to do it better, remember we’re hiring in London, San Francisco, and Singapore!


  1. I'm sorry.

EFF, CloudFlare Ask Federal Court Not To Force Internet Companies To Enforce Music Labels’ Trademarks


This blog was originally posted by the Electronic Frontier Foundation, which represents CloudFlare in this case.


JUNE 18, 2015 | BY MITCH STOLTZ

This month, CloudFlare and EFF pushed back against major music labels’ latest strategy to force Internet infrastructure companies like CloudFlare to become trademark and copyright enforcers, by challenging a broad court order that the labels obtained in secret. Unfortunately, the court denied CloudFlare’s challenge and ruled that the secretly-obtained order applied to CloudFlare. This decision, and the strategy that led to it, present a serious problem for Internet infrastructure companies of all sorts, and for Internet users, because they lay out a blueprint for quick, easy, potentially long-lasting censorship of expressive websites with little or no court review. The fight’s not over for CloudFlare, though. Yesterday, CloudFlare filed a motion with the federal court in Manhattan, asking Judge Alison J. Nathan to modify the order and put the responsibility of identifying infringing domain names back on the music labels.

We’ve reported recently about major entertainment companies’ quest to make websites disappear from the Internet at their say-so. The Internet blacklist bills SOPA and PIPA were part of that strategy, along with the Department of Homeland Security’s project of seizing websites based on unverified accusations of copyright infringement by entertainment companies. Entertainment distributors are also lobbying ICANN, the nonprofit organization that oversees the Internet’s domain name system, to gain the power to censor and de-anonymize Internet users without court review. (Fortunately, it looks like ICANN is pushing back)

The order that CloudFlare received in May is an example of another facet of the site-blocking strategy. It works like this: entertainment companies file a lawsuit in federal court against the (possibly anonymous) owners of a website they want to make disappear. The website owners typically don’t show up in court to defend themselves, so the court issues an order to the website to stop infringing some copyright or trademark. (Often this order is initially drafted by the entertainment companies.) The entertainment companies then send a copy of the order to service providers like domain name registrars, Web hosting providers, ISPs, and content delivery networks like CloudFlare, and demand that the service providers block the targeted website or domain name, as well as other websites and domains that the entertainment companies want gone.

This month’s case involved a website that called itself Grooveshark, and appeared to be a clone of the site by that name that shut down in April after settling a copyright lawsuit brought by the record labels. That settlement left the labels in control of Grooveshark’s trademarks, which they proceeded to use as a weapon against the copycat site. The labels applied to the U.S. District Court for the Southern District of New York for a secret order to shut down the site, which was then located at grooveshark.io. Judge Deborah A. Batts granted the order in secret. The order covered the site’s anonymous owners, and anyone “in active concert or participation” with them. The order also listed “domain name registrars . . . and Internet service providers” among those who have to comply with it. The labels sent a copy of the order to several companies, including CloudFlare.

When a federal court issues an order (called an injunction), court rules say it can apply to a party in the case or to anyone in “active concert or participation” with them. But the courts haven’t clarified what “active concert or participation” means in the Internet context. Communication over the Internet can involve dozens of service and infrastructure providers, from hosts to domain name registrars to ISPs, backbone providers, network exchanges, and CDN services. Under a broad reading, even an electric utility or a landlord that leases space for equipment could conceivably be in “active concert or participation” with a website.

CloudFlare decided to take a stand against the overbroad order, asking the court to clarify that the order did not apply to CloudFlare. As a CDN and reverse proxy service, CloudFlare makes websites faster and more secure, but can’t suspend a site’s domain name or render it unreachable. So even if making Internet intermediaries responsible for enforcing copyright and trademark laws was a good idea (it’s not), CloudFlare is not the right one to do it. And even if the mysterious owners of the “new Grooveshark” site are bad actors, CloudFlare wanted to protect its law-abiding customers by insisting on a correct and thorough court process before cutting off any customer.

Unfortunately, the court concluded that the initial order applied to CloudFlare. And even worse, the court said that CloudFlare has to block every user with a domain name that contains “grooveshark,” no matter who owns the site. That means that CloudFlare, or any Internet infrastructure company that gets served with a copy of the court order, would have to filter and ban sites called “groovesharknews,” “grooveshark-commentary,” or “grooveshark-sucks,” no matter who runs them or what they contain.

That’s a big deal. Laws like Section 512 of the Digital Millennium Copyright Act, Section 230 of the Communications Decency Act, and court decisions on trademark law, protect Internet intermediaries from legal responsibility for the actions of their users, including the responsibility to proactively block or filter users. That protection has been vital to the growth of the Internet as a medium for communication, innovation, and learning. Those laws help keep Internet companies and entertainment conglomerates from becoming the gatekeepers of speech with the power to decide what we can and can’t communicate. The record labels didn’t accuse CloudFlare of any copyright or trademark violation—nor could they, in part because of laws like the DMCA. Yet the order against CloudFlare might force CloudFlare, and other service providers, to filter its service for terms like “grooveshark”—and other words that might appear in future court orders. Service providers like CloudFlare could find themselves in the uncomfortable position of having to figure out who’s allowed to use “grooveshark” and who isn’t—or of having to block them all. Turning Internet companies into enforcers of who can say what on the Internet is exactly what laws like the DMCA were meant to avoid.

And CloudFlare is far from the only Internet company to be hit with an order like this. Many, including some domain name registrars, simply comply with overbroad court orders, asking no questions, instead of sticking up for their users.

Yesterday, CloudFlare took a new step. Represented by EFF and Goodwin Procter, CloudFlare asked the court to change the order so that in the future, CloudFlare will only be responsible for taking down user accounts that use variations on “grooveshark” if the music labels notify CloudFlare that a site is infringing. That change will put the job of enforcing trademarks back on the trademark holders, and preserve the balance created by laws like the DMCA. CloudFlare supports smart, effective, and careful trademark enforcement and wants to see it done right – not through broad orders that can impact free speech.

Other Internet companies that care about their users, and would rather not become unwilling trademark and copyright enforcers and arbiters of speech, should follow CloudFlare’s lead and push back against orders like this one.

Osaka, Japan: CloudFlare's 35th data center


Move over Jurassic World, the long awaited sequel to our Tokyo deployment is here. Our Osaka data center is our 2nd in Japan, 5th in Asia (following deployments in Hong Kong, Tokyo, Singapore, and Seoul), and 35th globally. This latest deployment serves not only Osaka, Japan's second largest city, but also Nagoya, the third largest, and the entire Keihanshin metropolitan area including Kyoto and Kobe. This means faster application delivery to the area's 30 million inhabitants, and full redundancy of traffic with our Tokyo facility. CloudFlare is now mere milliseconds away from all 110 million Internet users in Japan.

The Internet in Asia

Even though Asia is home to many of the most technologically advanced nations in the world, the delivery of Internet traffic across the region is hardly seamless. Many of the incumbent telecommunications providers in the region (e.g. NTT, Tata, PCCW, Hinet, Singtel, among many others) do not interconnect with one another locally. This is another way of saying that traffic is routed poorly in the region. Traffic sent from one network in Japan, for example, may have to pass through an entirely different country or, in some instances, even the United States, prior to being received by another network in Japan. What's worse is that the choke points where traffic is ultimately exchanged are often candidates for congestion (too much traffic, and too little capacity).

A metaphor (Shibuya crossing in Tokyo)

This isn't a new story, of course. European networks once found themselves in the same (sad) predicament with intra-European traffic exchanged through the United States. Fortunately, with the emergence of European based Internet exchange points (IXPs), nearly all Internet traffic in Europe now remains within Europe. Moreover, many of these exchange points such as AMS-IX in Amsterdam, DE-CIX in Frankfurt and LINX in London, among others, have become some of the biggest in the world (note: CloudFlare is a member of each of these exchanges, as well as 20 others in Europe, and over 80 globally).

As Europe goes, so goes Asia?

Not quite, but things are changing. Unlike Europe, where economies are tens of miles apart, Asian countries are often thousands of miles apart with an ocean in the middle. In addition, where many European countries share the same or similar languages, Internet content in Asia is highly regional due to higher language barriers.

Commercial interests also play a role. The aforementioned incumbent carriers hold significant control over the exchange of Internet traffic in their respective markets. As a means to preserve market power and inhibit competition, many refuse to interconnect with other providers locally. For example, of the two Tier 1 Internet service providers in Asia, NTT Communications (Japan) and Tata (India), neither interconnect with one another in their home markets. If you were to purchase Internet connectivity from Tata in Japan, you would have to send your traffic through Hong Kong to reach NTT's Japanese customers.

Despite this, the growth in IXPs in the region is encouraging. Osaka alone has multiple exchange points—JPIX, JPNAP, BBIX and Equinix—providing local interconnection to many Japanese networks. CloudFlare is proud to be a member of each of these exchanges. Because of our Anycast network architecture, all networks connected to us over these exchanges prefer the now-shorter path to our network in Osaka (more below).

Solving the challenges of the Internet

If you regularly follow our blog you will have noticed that the delivery of applications over the Internet is both remarkably challenging and complex. It takes a global network, direct interconnection with global and regional carriers, significant investment in hardware and infrastructure, and the teams of engineers necessary to build, operate and manage it all, in order to deploy high availability, low latency, and secure applications.

Fortunately we've done just this so that you don't have to. In Asia alone CloudFlare maintains direct connectivity to all Tier 1 operators (i.e. NTT and Tata), to major regional incumbents (e.g. Korea Telecom, PCCW, etc.), and nearly 20 Internet exchange points throughout the region.

A view from a network in Nagoya, Japan (now hitting Osaka vs. Tokyo)

The example above shows the latency improvement for a user in Nagoya (a 50 minute Shinkansen bullet train away from Osaka vs. 1 hour and 40 minutes to Tokyo). The decrease from 11ms to 7ms may not be much (although much faster than a Shinkansen train!), but it matters when you're as performance-obsessed as we are.

Whether your applications are hosted in your own data center, in the cloud (e.g. Azure, AWS, Rackspace, etc.) or with one of our many hosting partners, your applications are a 5 minute sign-up away from a faster, safer and better Internet.

Arigatō (ありがとう) - the CloudFlare team


How to build your own public key infrastructure


A major part of securing a network as geographically diverse as CloudFlare’s is protecting data as it travels between datacenters. Customer data and logs are important to protect but so is all the control data that our applications use to communicate with each other. For example, our application servers need to securely communicate with our new datacenter in Osaka, Japan.

CC BY-SA 2.0 image by kris krüg

Great security architecture requires a defense system with multiple layers of protection. As CloudFlare’s services have grown, the need to secure application-to-application communication has grown with it. As a result, we needed a simple and maintainable way to ensure that all communication between CloudFlare’s internal services stay protected, so we built one based on known and reliable protocols.

Our system of trust is based on a Public Key Infrastructure (PKI) using internally-hosted Certificate Authorities (CAs). In this post we will describe how we built our PKI, how we use it internally, and how to run your own with our open source software. This is a long post with lots of information, grab a coffee!

Protection at the application layer

Most reasonably complex modern web services are not made up of one monolithic application. In order to handle complex operations, an application is often split up into multiple “services” that handle different portions of the business logic or data storage. Each of these services may live on different machines or even in different datacenters.

The software stacks of large service providers are made up of many components. For CloudFlare, this includes a web application to handle user actions, a master database to maintain DNS records and user rules, a data pipeline to distribute these rules to the edge network, services for caching, a log transport pipeline, data analysis services and much, much more.

Some service-to-service communication can happen within a machine, some within a datacenter and some across a broader network like the Internet. Managing which communications should use which type of network in our evolving services is not a simple task. A single accidental configuration change could result in messages that are supposed to never leave a machine going through an untrusted connection on the Internet. The system should be designed so that these messages are secure even if they go over the wrong network.

Enter TLS

One approach to mitigate the risks posed by attackers is to encrypt and authenticate data in transit. Our approach is to require that all new services use an encrypted protocol, Transport Layer Security (TLS), to keep inter-service communication protected. It was a natural choice: TLS is the “S” in HTTPS and is the foundation of the encrypted web. Furthermore, modern web services and APIs have embraced TLS as the de-facto standard for application layer encryption. It works seamlessly with RESTful services, is supported in Kyoto Tycoon, PostgreSQL, and the Go standard library.

As we have described in previous blog posts, unauthenticated encryption can be foiled by man-in-the-middle attackers. Encryption without authentication does not protect data in transit. For connections to be safe, each party needs to prove their identity to the other. Public key cryptography provides many mechanisms for trust, including PGP’s “web of trust” and HTTPS’s public key infrastructure (PKI) model. We chose the PKI model because of ease of use and deployment. TLS with PKI provides trusted communication.

Be picky with your PKI

Trust is the bedrock of secure communication. For two parties to securely exchange information, they need to know that the other party is legitimate. PKI provides just that: a mechanism for trusting identities online.

The tools that enable this are digital certificates and public key cryptography. A certificate lets a website or service prove its identity. Practically speaking, a certificate is a file with some identity information about the owner, a public key, and a signature from a certificate authority (CA). Each public key has an associated private key, which is kept securely under the certificate owner’s control. The private key can be used to create digital signatures that can be verified by the associated public key.

A certificate typically contains:

  • Information about the organization that the certificate is issued to
  • A public key
  • Information about the organization that issued the certificate
  • The rights granted by the issuer
    • The validity period for the certificate
    • Which hostnames the certificate is valid for
    • The allowed uses (client authentication, server authentication)
  • A digital signature by the issuer certificate’s private key

A certificate is a powerful tool for proving your identity online. The owner of a certificate can digitally sign data, and a verifier can use the public key from the certificate to verify it. The fact that the certificate is itself digitally signed by a third party CA means that if the verifier trusts the third party, they have assurances that the certificate is legitimate. The CA can give a certificate certain rights, such as a period of time in which the identity of the certificate should be trusted.

Sometimes certificates are signed by what’s called an intermediate CA, which is itself signed by a different CA. In this case, a certificate verifier can follow the chain until they find a certificate that they trust — the root.

This chain of trust model can be very useful for the CA. It allows the root certificate’s private key to be kept offline and only used for signing intermediate certificates. Intermediate CA certificates can be shorter lived and be used to sign endpoint certificates on demand. Shorter-lived online intermediates are easier to manage and revoke if compromised.

This is the same system used for HTTPS on the web. For example, cloudflare.com has a certificate signed by Comodo’s Intermediate certificate, which is in turn signed by the Comodo offline root. Browsers trust the Comodo root, and therefore also trust the intermediate and web site certificate.

This model works for the web because browsers only trust a small set of certificate authorities, who each have stringent requirements to only issue certificates after validating the ownership of a website. For internal services that are not accessed via browsers, there is no need to go through a third party certificate authority. Trusted certificates do not have to be from GlobalSign, Comodo, Verisign or any other CA — they can be from a CA you operate yourself.

Building your own CA

The most painful part of getting a certificate for a website is going through the process of obtaining it. For websites, we eliminated this pain by launching Universal SSL. The most painful part of running a CA is the administration. When we decided to build our internal CA, we sought to make both obtaining certificates and operating the CA painless and even fun.

The software we are using is CFSSL, CloudFlare’s open source PKI toolkit. This tool was open sourced last year and has all the capabilities needed to run a certificate authority. Although CFSSL was built for an internal CA, it’s robust enough to be used by a publicly trusted CA; in fact, the Let’s Encrypt project is using CFSSL as a core part of their CA infrastructure.

Key protection

To run a CA, you need the CA certificate and corresponding private key. This private key is extremely sensitive. Any person who knows the value of the key can act as the CA and issue certificates. Browser-trusted certificate authorities are required to keep their private keys inside of specialized hardware known as Hardware Security Modules (HSMs). The requirements for protecting private keys for corporate infrastructures are not necessarily as stringent, so we provided several mechanisms to protect keys.

CFSSL supports three different modes of protection for private keys:

  1. Hardware Security Module (HSM)
    CFSSL allows the CA server to use an HSM to compute digital signatures. Most HSMs use an interface called PKCS#11 to interact with them, and CFSSL natively supports this interface. Using an HSM ensures that private keys do not live in memory and it provides tamper protection against physical adversaries.

  2. Red October
    Private Keys can be encrypted using Red October (another open source CloudFlare project). A key protected with Red October can only be decrypted with the permission of multiple key owners. In order to use CFSSL with a Red October key, the key owners need to authorize the use of the private key. This ensures that the CA key is never unencrypted on disk, in source control, or in configuration management. Note: Red October support in CFSSL is experimental and subject to change.

  3. Plaintext
    CFSSL accepts plain unencrypted private keys. This works well when the private key is generated on the machine running CFSSL or by another program. If the machine that is running the CA is highly secure, this mode is a compromise between security, cost, and usability. This is also useful in development mode, allowing users to test changes to their infrastructure designs.

Next I’ll show you how to quickly configure an internal CA using plaintext private keys. Note: The following expects CFSSL to be installed.

Generating a CA key and certificate

To start, you need some information about what metadata you want to include in your certificate. Start by creating a file called csr_ca.json containing this basic information (feel free to fill in your own organization's details):

{
  "CN": "My Awesome CA",
  "key": {
    "algo": "rsa",
    "size": 2048
  },
    "names": [
       {
         "C": "US",
         "L": "San Francisco",
         "O": "My Awesome Company",
         "OU": "CA Services",
         "ST": "California"
       }
    ]
}

With this you can create your CA with a single call:
$ cfssl gencert -initca csr_ca.json | cfssljson -bare ca

This generates three files:

  • ca-key.pem: your private key
  • ca.pem: your certificate
  • ca.csr: your certificate signing request (needed to get your CA cross-signed by another CA)

The key and certificate are the bare minimum you need to start running a CA.

Policy

Once the CA certificate and key are created, the CA software needs to know what kind of certificates it will issue. This is determined in the CFSSL configuration file’s signing policy section.

Here’s an example of a simple policy for a CA that can issue certificates that are valid for a year and can be used for server authentication.

config_ca.json
{
  "signing": {
    "default": {
      "auth_key": "key1",
      "expiry": "8760h",
      "usages": [
         "signing",
         "key encipherment",
         "server auth"
       ]
     }
  },
  "auth_keys": {
    "key1": {
      "key": <16 byte hex API key here>,
      "type": "standard"
    }
  }
}

We also added an authentication key to this signing policy. This authentication key should be randomly generated and kept private. The API key is a basic authentication mechanism that prevents unauthorized parties from requesting certificates. There are several other features you can use for the CA (subject name whitelisting, etc.); see the CFSSL documentation for more information.

To run the service, call
$ cfssl serve -ca-key ca-key.pem -ca ca.pem -config config_ca.json

This opens up a CA service listening on port 8888.

Issuing certificates

Certificate authorities do not just create certificates out of a private key and thin air; they need a public key and metadata to populate the certificate’s data fields. This information is typically communicated to a CA via a certificate signing request (CSR).

A CSR is very similar in structure to a certificate. The CSR contains:

  • Information about the organization that is requesting the certificate
  • A public key
  • A digital signature by the requestor’s private key

Given a CSR, a certificate authority can create a certificate. First, it verifies that the requestor has control over the associated private key. It does this by checking the CSR’s signature. Then the CA will check to see if the requesting party should be given a certificate and which domains/IPs it should be valid for. This can be done with a database lookup or through a registration authority. If everything checks out, the CA uses its private key to create and sign the certificate to send back to the requestor.

Requesting Certificates

Let’s say you have CFSSL set up as CA as described above and it’s running on a server called “ca1.mysite.com” with an authentication API key. How do you get this CA to issue a certificate? CFSSL provides two commands to help with that: gencert and sign. They are available as JSON API endpoints or command line options.

The gencert command will automatically handle the whole certificate generation process. It will create your private key, generate a CSR, send the CSR to the CA to be signed and return your signed certificate.

Two configuration files are needed for this: one to tell the local CFSSL client where the CA is and how to authenticate the request, and a CSR configuration used to populate the CSR.

Here’s an example for creating a certificate for a generic database server, db1.mysite.com.

csr_client.json

{
  "hosts": [
        "db1.mysite.com"
  ],
  "key": {
    "algo": "rsa",
    "size": 2048
  },
  "names": [
    {
      "C": "US",
      "L": "San Francisco",
      "O": “My Awesome Company",
      "OU": “Data Services",
      "ST": "California"
    }
  ]
}

config_client.json

{
  "signing": {
    "default": {
      "auth_key": "key1",
      "remote": "caserver"
    }
  },
  "auth_keys": {
    "key1": {
    "key": <16 byte hex API key here>,
    "type": "standard"
    }
  },
  "remotes": {
    "caserver": “ca1.mysite.com:8888"
  }
}

With these two configuration files set, you can create your certificate.
$ cfssl gencert -config config_client.json csr_client.json | cfssljson -bare db

This results in three files:

  • db-key.pem: your private key
  • db.pem: your certificate
  • db.csr: your CSR

The CSR can be resubmitted to the CA to be signed again at any point with the sign command:

$ cfssl sign -config config_client.json db.csr | cfssljson -bare db-new

resulting in:

  • db-new.pem: your re-signed certificate

These two commands let you easily and conveniently set up a private PKI. As a startup or a growing business moving to a service-oriented or even a microservice architecture, having a PKI can be very convenient. Next we’ll describe how CloudFlare set up its own internal PKI to help make its services encrypted by default.

Using a PKI for services

So now you have a complex set of services that can all speak TLS, and a central certificate authority server that can issue certificates. What’s next? Getting certificates and keys for the applications. There are several ways to do this including a centrally managed way and a distributed way.

Centralized distribution vs on demand

One way to create certificates and keys for your applications is to create them all on a central provisioning server and then send them out to each of the servers. In this model, a central server creates certificates and keys and sends them over a secure tunnel to the application servers.

This model creates a sort of chicken and egg problem. How do you transport private keys if you need those private keys to encrypt your transport?

A distributed key management model fits better with the way modern services are typically deployed and run. The trick is creating the private keys directly in the application server. At install or run time, a service creates its own private key and sends a request to a certificate authority to issue a certificate. This can be repeated on demand if the current certificate is close to expiring.

For example, many companies are starting to use Docker, or other lightweight container technologies, to encapsulate and run individual services. Under load, services can be scaled up by automatically running new containers. In a centralized distribution model, certificates and keys for each container need to be created before the containers are deployed.

In the centralized distribution model, the provisioning service needs to create and manage a key and certificate for each service. Keeping this sort of central database in a complex and dynamic topology seems like the wrong approach. It would be preferable if the CA itself was stateless and generated a set of logs.

The idea that keys need to be transported to their destination instead of generated locally is also troubling. Transporting private keys introduces an unnecessary risk to the system. When a new service comes into being, it should generate its key locally and request a certificate for use.

Trust across services

Internal services need to trust each other. Browsers validate website certificates by checking the signature on the certificate and checking the hostname against the list of Subject Alternative Names (SANs) in the certificate. This type of explicit check is useful, but it has a record of not working as expected. Another way for services to trust each other is an implicit check based on per-service CAs.

The idea is simple: use a different CA for each set of services. Issue all database certificates from a database CA and all API servers from an API server CA.

When setting these services up to talk to each other with mutual TLS authentication, configure the trust stores as follows:

  • API server only trusts DB CA
  • DB only trusts API CA

This approach is less fine-grained than an explicit check against a SAN, but it is more robust and easier to manage on the CA policy side. With an implicit trust system in place, you can guarantee that services of type A can only communicate with services of type B.

The following diagram describes how two applications can trust each other with mutually authenticated TLS.

In this diagram, the API server trusts the DB CA (in red). It will therefore only accept certificates that are signed by the DB CA (i.e. with a red ribbon). Conversely, the database server only accepts certificates with a signature from the API CA (orange ribbon). To establish a trusted connection, each party sends a key share to the other, signed with their certificate’s private key. The key shares are combined to create a session key, with which both parties can encrypt their data. The chain of verification from key share to certificate to CA assure that the other party is authentic.

Establishing a mutually authenticated trusted tunnel between services prevents attackers from accessing or modifying data in transit and causing havoc on your services. With a strong PKI in place, every service can communicate securely and confidentially over any network, even the Internet.
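
To make the implicit trust model more concrete, here is a rough sketch of how the API-server side of such a connection could be configured using OpenSSL's C API (an illustration under assumed file names, not how CloudFlare's services are actually built; OpenSSL 1.1+ is assumed and error handling is omitted):

#include <openssl/ssl.h>

/* TLS client context for an API server talking to the database tier:
 * it presents a certificate issued by the API CA, and it only trusts
 * peers whose certificates chain up to the DB CA. File names are
 * hypothetical. */
static SSL_CTX *api_to_db_ctx(void) {
    SSL_CTX *ctx = SSL_CTX_new(TLS_client_method());

    /* Our own identity, signed by the API CA. */
    SSL_CTX_use_certificate_file(ctx, "api.pem", SSL_FILETYPE_PEM);
    SSL_CTX_use_PrivateKey_file(ctx, "api-key.pem", SSL_FILETYPE_PEM);

    /* Trust store: only the DB CA, so only database servers are accepted. */
    SSL_CTX_load_verify_locations(ctx, "db-ca.pem", NULL);
    SSL_CTX_set_verify(ctx, SSL_VERIFY_PEER, NULL);

    return ctx;
}

The same idea maps onto most TLS stacks: the essential point is that the verification roots are the per-service CAs rather than the public browser root store.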

Using a PKI for remote services

Internal PKIs are very flexible and can be used to issue certificates to third parties who are integrating with your network. For example, CloudFlare has a service called Railgun that can be used to optimize connections between CloudFlare and an origin server. Communication between Railgun and CloudFlare is done over an encrypted and authenticated channel using certificates from a CloudFlare certificate authority.

This ensures that data is secure in transit. When a new Railgun instance is set up on the origin server, it creates a private key and sends a CSR to CloudFlare, which then issues the appropriate certificate. The Railgun server keeps the private key in memory and deletes it when it shuts down, preventing other services from getting access to it.

This model works great for not only Railgun, but several other initiatives at CloudFlare such as the Origin CA and Keyless SSL.

Conclusion

Securing data at the application level is an important step for securing a distributed systems architecture, but is only truly effective with a strong PKI in place.

Check out these brand new videos on how to optimize CloudFlare



Someone once said that the best things in life are free and I can’t agree more. I want to draw the attention of the CloudFlare community to a great resource that helps maximize the value of our product. Troy Hunt, an experienced trainer and blogger, has produced a video course on using CloudFlare. The video series is available through Pluralsight, an online training site for developers.

Because the folks at Pluralsight think that this is a great resource, the video tutorials are being offered to everyone for a week, absolutely free.

So what can you expect to learn? The course kicks off by explaining what CloudFlare brings to the table, and then sets up a site on CloudFlare, including configuring the name server records with your DNS provider. All of this helps get things up and running quickly. Then it gets deeper.

One module of the course is devoted to understanding more about SSL and further strengthening the implementation. For example, CloudFlare’s SSL rates high on the Qualys SSL Labs Test and scores an “A” right out of the box. But you can make it better – an “A+” – just by enabling HSTS. However, you really want to understand what this means before turning it on as it can have undesired consequences as well. The course goes through this as well as explaining and configuring other aspects of the SSL implementation which further strengthens the security profile of the website behind CloudFlare.

Another module looks at managing the firewall application, an important aspect of how CloudFlare protects your website. Understanding the role of things like challenge pages and interstitial pages is important as is understanding what events cause them to be triggered and how you can customize the thresholds to suit your specific needs. Again, it’s about helping people really understand what’s going on inside CloudFlare’s free offering so that they can make the most of the service.

So that’s it – an hour and 38 minutes of everything you need to start really maximizing your CloudFlare experience and it’s free for the next week on Pluralsight. Enjoy!

P.S. CloudFlare is not affiliated with the course, nor will it profit from it. We help spread the word because it benefits our customers.

How to achieve low latency with 10Gbps Ethernet


Good morning!

In a recent blog post we explained how to tweak a simple UDP application to maximize throughput. This time we are going to optimize our UDP application for latency. Fighting with latency is a great excuse to discuss modern features of multiqueue NICs. Some of the techniques covered here are also discussed in the scaling.txt kernel document.

CC BY-SA 2.0 image by Xiaojun Deng

Our experiment will be set up as follows:

  • We will have two physical Linux hosts: the 'client' and the 'server'. They communicate with a simple UDP echo protocol.
  • Client sends a small UDP frame (32 bytes of payload) and waits for the reply, measuring the round trip time (RTT). Server echoes back the packets immediately after they are received.
  • Both hosts have 2GHz Xeon CPUs, with two sockets of 6 cores each and Hyper Threading (HT) enabled - so 24 CPUs per host.
  • The client has a Solarflare 10Gb NIC, the server has an Intel 82599 10Gb NIC. Both cards have fiber connected to a 10Gb switch.
  • We're going to measure the round trip time. Since the numbers are pretty small, there is a lot of jitter when counting the averages. Instead, it makes more sense to take a stable value - the lowest RTT from many runs done over one second.
  • As usual, code used here is available on GitHub: udpclient.c, udpserver.c.

Prerequisites

First, let's explicitly assign the IP addresses:

client$ ip addr add 192.168.254.1/24 dev eth2
server$ ip addr add 192.168.254.30/24 dev eth3

Make sure iptables and conntrack don't interfere with our traffic:

client$ iptables -I INPUT 1 --src 192.168.254.0/24 -j ACCEPT
client$ iptables -t raw -I PREROUTING 1 --src 192.168.254.0/24 -j NOTRACK
server$ iptables -I INPUT 1 --src 192.168.254.0/24 -j ACCEPT
server$ iptables -t raw -I PREROUTING 1 --src 192.168.254.0/24 -j NOTRACK

Finally, ensure the interrupts of the multi-queue network cards are evenly distributed between CPUs. The irqbalance service is stopped and the interrupts are manually assigned. For simplicity, let's pin RX queue #0 to CPU #0, RX queue #1 to CPU #1 and so on.

client$ (let CPU=0; cd /sys/class/net/eth2/device/msi_irqs/;
         for IRQ in *; do
            echo $CPU > /proc/irq/$IRQ/smp_affinity_list
            let CPU+=1
         done)
server$ (let CPU=0; cd /sys/class/net/eth3/device/msi_irqs/;
         for IRQ in *; do
            echo $CPU > /proc/irq/$IRQ/smp_affinity_list
            let CPU+=1
         done)

These scripts assign the interrupts fired by each RX queue to a selected CPU. Finally, some network cards have Ethernet flow control enabled by default. In our experiment we won't push too many packets, so it shouldn't matter. In any case - it's unlikely the average user actually wants flow control - it can introduce unpredictable latency spikes during higher load:

client$ sudo ethtool -A eth2 autoneg off rx off tx off
server$ sudo ethtool -A eth3 autoneg off rx off tx off

Naive round trip

Here's the sketch of our client code. Nothing too fancy: send a packet and measure the time until we hear the response.

fd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
fd.bind(("0.0.0.0", 65400)) # pin source port to reduce nondeterminism
fd.connect(("192.168.254.30", 4321))
while True:
    t1 = time.time()
    fd.sendmsg("\x00" * 32)
    fd.readmsg()
    t2 = time.time()
    print "rtt=%.3fus" % ((t2-t1) * 1000000)

The server is similarly uncomplicated. It waits for packets and echoes them back to the source:

fd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
fd.bind(("0.0.0.0", 4321))
while True:
    data, client_addr = fd.recvmsg()
    fd.sendmsg(data, client_addr)

Let's run them:

server$ ./udpserver
client$ ./udpclient 192.168.254.30:4321
[*] Sending to 192.168.254.30:4321, src_port=65500
pps= 16815 avg= 57.224us dev= 24.970us min=45.594us
pps= 15910 avg= 60.170us dev= 28.810us min=45.679us
pps= 15463 avg= 61.892us dev= 33.332us min=44.881us

In the naive run we're able to do around 15k round trips a second, with an average RTT of 60 microseconds (us) and a minimum of 44us. The standard deviation is pretty scary and suggests high jitter.

Kernel polling - SO_BUSY_POLL

Linux 3.11 added support for the SO_BUSY_POLL socket option. The idea is to ask the kernel to poll for incoming packets for a given amount of time. This, of course, increases the CPU usage on the machine, but will reduce the latency. The benefit comes from avoiding the major context switch when a packet is received. In our experiment enabling SO_BUSY_POLL brings the min latency down by 7us:

server$ sudo ./udpserver  --busy-poll=50
client$ sudo ./udpclient 192.168.254.30:4321 --busy-poll=50
pps= 19440 avg= 49.886us dev= 16.405us min=36.588us
pps= 19316 avg= 50.224us dev= 15.764us min=37.283us
pps= 19024 avg= 50.960us dev= 18.570us min=37.116us

While I'm not thrilled by these numbers, there may be some valid use cases for SO_BUSY_POLL. To my understanding, as opposed to other approaches, it works well with interrupt coalescing (rx-usecs).
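
On the code side, the option is a single setsockopt call; a minimal sketch (the value is in microseconds; depending on kernel version and sysctl settings, setting it may require CAP_NET_ADMIN; error handling omitted):

#include <sys/socket.h>

/* Ask the kernel to busy-poll the device queue for up to `usecs`
 * microseconds when this socket has no data ready, trading CPU time
 * for lower latency. Requires Linux 3.11+. */
static int enable_busy_poll(int fd, int usecs) {
    return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs));
}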

Userspace busy polling

Instead of doing polling in the kernel with SO_BUSY_POLL, we can just do it in the application. Let's avoid blocking on recvmsg and run a non-blocking variant in a busy loop. The server pseudo code will look like:

while True:
    while True:
        data, client_addr = fd.recvmsg(MSG_DONTWAIT)
        if data:
            break
    fd.sendmsg(data, client_addr)

This approach is surprisingly effective:

server$ ./udpserver  --polling
client$ ./udpclient 192.168.254.30:4321 --polling
pps= 25812 avg= 37.426us dev= 11.865us min=31.837us
pps= 23877 avg= 40.399us dev= 14.665us min=31.832us
pps= 24746 avg= 39.086us dev= 14.041us min=32.540us

Not only has the min time dropped by a further 4us, but the average and deviation also look healthier.

Pin processes

So far we have allowed the Linux scheduler to allocate CPUs for our busy-polling applications. Some of the jitter came from the processes being moved around between cores. Let's try pinning them to specific cores:

server$ taskset -c 3 ./udpserver --polling
client$ taskset -c 3 ./udpclient 192.168.254.30:4321 --polling
pps= 26824 avg= 35.879us dev= 11.450us min=30.060us
pps= 26424 avg= 36.464us dev= 12.463us min=30.090us
pps= 26604 avg= 36.149us dev= 11.321us min=30.421us

This shaved off a further 1us. Unfortunately, running our applications on a "bad" CPU might actually degrade the numbers. To understand why, we need to revisit how the packets are being dispatched across RX queues.

RSS - Receive Side Scaling

In the previous article we mentioned that the NIC hashes packets in order to spread the load across many RX queues. This technique is called RSS - Receive Side Scaling. We can see it in action by observing the /proc/net/softnet_stat file with the softnet.sh script:

This makes sense - since we send only one flow (connection) from the udpclient, all the packets hit the same RX queue. In this case it is RX queue #1, which is bound to CPU #1. Indeed, when we run the client on that very CPU the latency goes up by around 2us:

client$ taskset -c 1 ./udpclient 192.168.254.30:4321  --polling
pps= 25517 avg= 37.615us dev= 12.551us min=31.709us
pps= 25425 avg= 37.787us dev= 12.090us min=32.119us
pps= 25279 avg= 38.041us dev= 12.565us min=32.235us

It turns out that keeping the process on the same core as the arriving interrupts slightly degrades the latency. But how can the process know if it's running on the offending CPU?

One way is for the application to query the kernel. Kernel 3.19 introduced the SO_INCOMING_CPU socket option. With that, the process can figure out which CPU the packet was originally delivered to.
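
A sketch of how an application could check this (Linux 3.19+, error handling omitted):

#include <sys/socket.h>

/* Returns the CPU on which the kernel processed the packets most recently
 * delivered to this socket, or -1 on error. The application can compare it
 * with the CPU it is pinned to and react accordingly. */
static int incoming_cpu(int fd) {
    int cpu;
    socklen_t len = sizeof(cpu);
    if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) != 0)
        return -1;
    return cpu;
}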

An alternative approach is to ensure the packets will be dispatched only to the CPUs we want. We can do that by adjusting NIC settings with ethtool.

Indirection Table

RSS is intended to distribute the load across RX queues. In our case of a single UDP flow, the RX queue is chosen by this formula:

RX_queue = INDIR[hash(src_ip, dst_ip) % 128]

As discussed previously, the hash function for UDP flows is not configurable on Solarflare NICs. Fortunately the INDIR table is!

But what is INDIR anyway? Well, it's an indirection table that maps the least significant bits of a flow hash to an RX queue number. To view the table run ethtool -x, which prints all 128 entries.

The output reads as: packets with a hash of, say, 72 go to RX queue #6, a hash of 126 goes to RX queue #5, and so on.
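In code form the lookup the hardware performs boils down to something like this (illustration only; the indir array stands in for the 128-entry table printed by ethtool -x):

#include <stdint.h>

/* RX_queue = INDIR[hash(src_ip, dst_ip) % 128] */
int rss_rx_queue(uint32_t flow_hash, const uint8_t indir[128])
{
    return indir[flow_hash % 128];   /* only the 7 least significant bits of the hash matter */
}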

This table can be configured. For example, to make sure all traffic goes only to CPUs #0-#5 (the first NUMA node in our setup), we run:

client$ sudo ethtool -X eth2 weight 1 1 1 1 1 1 0 0 0 0 0
server$ sudo ethtool -X eth3 weight 1 1 1 1 1 1 0 0 0 0 0

After this change the indirection table maps every entry to RX queues #0-#5 only.

With that set up, our latency numbers don't change much, but at least we are sure the packets will always hit only the CPUs on the first NUMA node. The lack of NUMA locality usually costs around 2us.

I should mention that Intel 82599 supports adjusting the RSS hash function for UDP flows, but the driver does not support indirection table manipulation. To get it running I used this patch.

Flow steering

There is also another way to ensure packets hit a specific CPU: flow steering rules. Flow steering rules are exceptions applied on top of the indirection table - they tell the NIC to send a specific flow to a specific RX queue. For example, in our case, this is how we could pin our flows to RX queue #1 on both the server and the client:

client$ sudo ethtool -N eth2 flow-type udp4 dst-ip 192.168.254.1 dst-port 65500 action 1
Added rule with ID 12401
server$ sudo ethtool -N eth3 flow-type udp4 dst-port 4321 action 1
Added rule with ID 2045

Flow steering can be used for more magical things. For example we can ask the NIC to drop selected traffic by specifying action -1. This is very useful during DDoS packet floods and is often a feasible alternative to dropping packets on a router firewall.

Some administrators also use flow steering to make sure things like SSH or BGP keep working even under high server load. This can be achieved by manipulating the indirection table to move production traffic off, say, RX queue #0 and CPU #0, and then adding flow steering rules that explicitly direct packets destined to port 22 or 179 to RX queue #0 and therefore CPU #0. With that setup, no matter how heavy the network load on the server, SSH and BGP keep working, since they hit a dedicated CPU and RX queue.

XPS - Transmit Packet Steering

Going back to our experiment, let's take a look at /proc/interrupts. We can see around 27k interrupts corresponding to received packets on RX queue #1 - but also another 27k interrupts on RX queue #7, a queue we had supposedly disabled with our RSS / indirection table setup.

It turns out these interrupts are caused by packet transmissions. As weird as it sounds, Linux has no way of figuring out which transmit queue is the "correct" one to send packets on. To avoid packet reordering and to play it safe by default, transmissions are distributed based on another flow hash. This pretty much guarantees that the transmissions will happen on a different CPU than the one running our application, thereby increasing the latency.

To ensure the transmissions are done on a local TX queue, Linux has a mechanism called XPS - Transmit Packet Steering. To configure it we need to set a CPU mask for every TX queue. It's a bit more complex in our case, since we have 24 CPUs but only 11 TX queues:

XPS=("0 12" "1 13" "2 14" "3 15" "4 16" "5 17" "6 18" "7 19" "8 20" "9 21" "10 22 11 23");
(let TX=0; for CPUS in "${XPS[@]}"; do
     let mask=0
     for CPU in $CPUS; do let mask=$((mask | 1 << $CPU)); done
     printf %X $mask > /sys/class/net/eth2/queues/tx-$TX/xps_cpus
     let TX+=1
done)

With XPS enabled and the CPU affinity carefully managed (keeping the process on a different core than the RX interrupts) I was able to squeeze the minimum down to about 28us:

pps= 27613 avg= 34.724us dev= 12.430us min=28.536us
pps= 27845 avg= 34.435us dev= 12.050us min=28.379us
pps= 27046 avg= 35.365us dev= 12.577us min=29.234us

RFS - Receive Flow Steering

While RSS solves the problem of balancing the load across many CPUs, it doesn't solve the problem of locality: the NIC has no idea on which core the relevant application is waiting. This is where RFS kicks in.

RFS is a kernel technology that keeps the map of (flow_hash, CPU) for all ongoing flows. When a packet is received the kernel uses this map to quickly figure out to which CPU to direct the packet. Of course, if the application is rescheduled to another CPU the RFS will miss. If you use RFS, consider pinning the applications to specific cores.

To enable RFS on the client we run:

client$ echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
client$ for RX in `seq 0 10`; do
            echo 2048 > /sys/class/net/eth2/queues/rx-$RX/rps_flow_cnt
        done

To debug it, look at the sixth column printed by the softnet.sh script.

I don't fully understand why, but enabling software RFS increases the latency by around 2us. On the other hand, RFS is said to be effective at increasing throughput.

RFS on Solarflare NICs

The software implementation of RFS is not perfect, since the kernel still needs to pass packets between CPUs. Fortunately, this can be improved with hardware support - it's called Accelerated RFS (ARFS). With a network card driver that supports ARFS, the kernel shares its (flow_hash, CPU) mapping with the NIC, allowing it to steer packets correctly in hardware.

On Solarflare NICs ARFS is automatically enabled whenever RFS is enabled, and the effect is visible in /proc/interrupts. Here I pinned the client process to CPU #2: both the receive and the transmit interrupts land on that same CPU. If the process were rescheduled to another CPU, the interrupts would "follow" it.

RFS works on any "connected" socket, whether TCP or UDP. In our case it works for our udpclient, since it calls connect(), but not for our udpserver, which sends replies directly on the bound, unconnected socket.
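For completeness, this is the kind of connect() call that makes a UDP socket "connected" - a sketch reusing the server address from the examples above, with error handling trimmed:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Create a UDP socket with a fixed peer; connect() sends no packets,
   it only fixes the flow so the kernel (and ARFS) can track it. */
int connected_udp_socket(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in srv = {0};
    srv.sin_family = AF_INET;
    srv.sin_port = htons(4321);
    inet_pton(AF_INET, "192.168.254.30", &srv.sin_addr);
    connect(fd, (struct sockaddr *)&srv, sizeof(srv));
    return fd;
}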

RFS on Intel 82599

Intel drivers don't support ARFS, but they ship their own variant of this technology branded "Flow Director". It's enabled by default, but the documentation is misleading and it won't work while ntuple routing is enabled. To make it work run:

$ sudo ethtool -K eth3 ntuple off

To debug its effectiveness look for fdir_miss and fdir_match in ethtool statistics:

$ sudo ethtool -S eth3|grep fdir
     fdir_match: 7169438
     fdir_miss: 128191066

One more thing - Flow Director is implemented by the NIC driver and is not integrated with the kernel, so it has no way of truly knowing on which core the application actually lives. Instead, it takes a guess by sampling the transmitted packets every now and then: it inspects every 20th packet (the ATR sample-rate setting), or every packet with the SYN flag set. Furthermore, it may introduce packet reordering and doesn't support UDP.

Low level tweaks

Finally, our last resort in fighting the latency is to adjust a series of low level network card settings. Interrupt coalescing is often used to reduce the number of interrupts fired by a network card, but as a result it adds some latency to the system. To disable interrupt coalescing:

client$ sudo ethtool -C eth2 rx-usecs 0
server$ sudo ethtool -C eth3 rx-usecs 0

This reduces the latency by another couple of microseconds:

pps= 31096 avg= 30.703us dev=  8.635us min=25.681us
pps= 30809 avg= 30.991us dev=  9.065us min=25.833us
pps= 30179 avg= 31.659us dev=  9.978us min=25.735us

Further tuning gets more complex. People recommend turning off advanced features like GRO or LRO, but I don't think they make much difference for our UDP application. Disabling them should improve the latency for TCP streams, but will harm the throughput.

An even more extreme option is to disable the processor's C-states (sleep states) in the BIOS.

Kernel timestamps with SO_TIMESTAMPNS

On Linux we can retrieve the timestamp the kernel took when it first handled a packet with the SO_TIMESTAMPNS socket option. Comparing that value with the wall clock at the moment recvmsg() returns lets us measure the delay added by the kernel network stack:

client$ taskset -c 1 ./udpclient 192.168.254.30:4321 --polling --timestamp
pps=27564 avg=34.722us dev=14.836us min=26.828us packet=4.796us/1.622
pps=29385 avg=32.504us dev=10.670us min=26.897us packet=4.274us/1.415
pps=28679 avg=33.282us dev=12.249us min=26.106us packet=4.491us/1.440

It takes the kernel between roughly 4.3 and 4.8us to deliver a packet to our busy-polling application.
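For reference, a minimal sketch of how the timestamp can be read; the function names are mine, and the control-message handling follows the standard SCM_TIMESTAMPNS pattern:

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef SCM_TIMESTAMPNS
#define SCM_TIMESTAMPNS SO_TIMESTAMPNS   /* same value; some headers only define SO_ */
#endif

/* Enable kernel receive timestamps on the socket. */
int enable_rx_timestamps(int fd)
{
    int one = 1;
    return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPNS, &one, sizeof(one));
}

/* Receive one datagram; if a timestamp is attached, copy it to *stamp. */
ssize_t recv_with_timestamp(int fd, char *buf, size_t len, struct timespec *stamp)
{
    char ctrl[CMSG_SPACE(sizeof(struct timespec))];
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
    };
    ssize_t n = recvmsg(fd, &msg, 0);
    if (n < 0)
        return n;
    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_TIMESTAMPNS)
            memcpy(stamp, CMSG_DATA(c), sizeof(*stamp));
    }
    return n;
}

The timestamp is taken against CLOCK_REALTIME when the kernel handles the packet, so subtracting it from clock_gettime(CLOCK_REALTIME, ...) at the moment recvmsg() returns gives the in-stack delay reported above.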

OpenOnload for the extreme

Since we have a Solarflare network card handy, we can use the OpenOnload kernel bypass technology to skip the kernel network stack altogether:

client$ onload ./udpclient 192.168.254.30:4321 --polling --timestamp
pps=48881 avg=19.187us dev=1.401us min=17.660us packet=0.470us/0.457
pps=49733 avg=18.804us dev=1.306us min=17.702us packet=0.487us/0.390
pps=49735 avg=18.788us dev=1.220us min=17.654us packet=0.491us/0.466

The latency reduction is one thing, but look at the deviation! OpenOnload reduces the time spent in packet delivery 10x, from 4.3us to 0.47us.

Final notes

In this article we discussed what network latency to expect between two Linux hosts connected with 10Gb Ethernet. Without major tweaks it is possible to get a 40us round trip time; with software tweaks like busy polling or CPU affinity this can be trimmed to 30us. Shaving off a further 5us is harder, but can be done with a few ethtool toggles.

Generally speaking it takes the kernel around 8us to deliver a packet to an idling application, and only ~4us to a busy-polling process. In our configuration it took around 4us to physically pass the packet between two network cards via a switch.

Low latency settings like low rx-usecs or disabled LRO may reduce throughput and increase the number of interrupts. This means tweaking the system for low latency makes it more vulnerable to denial of service problems.

While we haven't discussed TCP specifically, we've covered ARFS and XPS, techniques which increase data locality. Enabling them reduces cross-CPU chatter and is said to increase the throughput of TCP connections. In practice, for TCP, throughput is often far more important than latency.

For a general setup I would suggest having one TX queue for each CPU, and enabling XPS and RFS/ARFS. To dispatch incoming packets effectively I recommend setting the RSS indirection table to skip CPU #0 and the hyper-threaded sibling cores. To increase locality for RFS I recommend pinning applications to dedicated CPUs.

Interested in this sort of low-level, high-performance packet wrangling? CloudFlare is hiring in London, San Francisco and Singapore.

Setting Go variables from the outside

CloudFlare's DNS server, RRDNS, is written in Go and the DNS team used to generate a file called version.go in our Makefile. version.go looked something like this:

// THIS FILE IS AUTOGENERATED BY THE MAKEFILE. DO NOT EDIT.

// +build    make

package version

var (
    Version   = "2015.6.2-6-gfd7e2d1-dev"
    BuildTime = "2015-06-16-0431 UTC"
)

and was used to embed version information in RRDNS. It was generated by the Makefile from a template file using sed and git describe. It worked, but it was pretty ugly.

Today we noticed that another Go team at CloudFlare, the Data team, had a much smarter way to bake version numbers into binaries using the -X linker option.

The -X Go linker option, which you can set with -ldflags, sets the value of a string variable in the Go program being linked. You use it like this: -X main.version 1.0.0.

A simple example: let's say you have this source file saved as hello.go.

package main

import "fmt"

var who = "World"

func main() {
    fmt.Printf("Hello, %s.\n", who)
}

Then you can use go run (or other build commands like go build or go install) with the -ldflags option to modify the value of the who variable:

$ go run hello.go
Hello, World.
$ go run -ldflags="-X main.who CloudFlare" hello.go
Hello, CloudFlare.

The format is importpath.name string, so it's possible to set the value of any string variable anywhere in the Go program, not just in main. Note that from Go 1.5 the syntax has changed to importpath.name=string (for example, -X main.who=CloudFlare). The old style is still supported, but the linker will complain.

I was worried this would not work with external linking (for example, when using cgo), but as we can see by passing -ldflags="-linkmode=external -v", the Go linker runs first anyway and takes care of our -X:

$ go build -x -ldflags="-X main.who CloudFlare -linkmode=external -v" hello.go
WORK=/var/folders/v8/xdj2snz51sg2m2bnpmwl_91c0000gn/T/go-build149644699
mkdir -p $WORK/command-line-arguments/_obj/
cd /Users/filippo/tmp/X
/usr/local/Cellar/go/1.4.2/libexec/pkg/tool/darwin_amd64/6g -o $WORK/command-line-arguments.a -trimpath $WORK -p command-line-arguments -complete -D _/Users/filippo/tmp/X -I $WORK -pack ./hello.go
cd .
/usr/local/Cellar/go/1.4.2/libexec/pkg/tool/darwin_amd64/6l -o hello -L $WORK -X main.hi hi -linkmode=external -v -extld=clang $WORK/command-line-arguments.a
# command-line-arguments
HEADER = -H1 -T0x2000 -D0x0 -R0x1000
searching for runtime.a in $WORK/runtime.a
searching for runtime.a in /usr/local/Cellar/go/1.4.2/libexec/pkg/darwin_amd64/runtime.a
 0.06 deadcode
 0.07 pclntab=284969 bytes, funcdata total 49800 bytes
 0.07 dodata
 0.08 symsize = 0
 0.08 symsize = 0
 0.08 reloc
 0.09 reloc
 0.09 asmb
 0.09 codeblk
 0.09 datblk
 0.09 dwarf
 0.09 sym
 0.09 headr
host link: clang -m64 -gdwarf-2 -Wl,-no_pie,-pagezero_size,4000000 -o hello -Qunused-arguments /var/folders/v8/xdj2snz51sg2m2bnpmwl_91c0000gn/T//go-link-mFNNCD/000000.o /var/folders/v8/xdj2snz51sg2m2bnpmwl_91c0000gn/T//go-link-mFNNCD/000001.o /var/folders/v8/xdj2snz51sg2m2bnpmwl_91c0000gn/T//go-link-mFNNCD/go.o -g -O2 -g -O2 -lpthread
 0.17 cpu time
33619 symbols
64 sizeof adr
216 sizeof prog
23412 liveness data

Do you want to work next to Go developers who can always teach you a new trick? We are hiring in London, San Francisco and Singapore.

Welcome UK2 Group!


Today we are thrilled to welcome UK2 Group as a CloudFlare partner. Customers of UK2 Group (including its brands UK2.net, Midphase, and Westhost) are now able to access CloudFlare’s web performance and security solutions with a single click. Backed by CloudFlare, UK2 Group’s customers can now protect their websites against security threats, ensure only clean traffic gets served, and speed up site performance no matter where visitors are located. Customers in need of advanced features, and even more performance and security, can sign up for CloudFlare Plus—a plan only offered through our reseller partners.

UK2 Group is one of the innovators of the hosting industry and operates globally. While its name points to its roots (located just down the road from the CloudFlare office in London), it also has an extensive presence in the US. We’re excited to partner with UK2 Group to provide the best web performance and security to its numerous customers.

Click here to learn more.
