1M HTTP rps on 1 CPU core: DPDK instead of nginx + the Linux kernel TCP/IP stack

I want to talk about DPDK, a framework for working with the network that bypasses the kernel: you read from and write to the NIC queues directly from userland, without any system calls. This saves a lot of overhead on copying and the like. As an example, I will write an application that serves an HTTP test page and compare its speed with nginx.

DPDK can be downloaded here. Don't take the stable release: it did not work for me on EC2; take 18.05, which started up fine. Before you begin, you need to reserve hugepages in the system for the framework to operate normally (for example, via the hugepages kernel parameter in the GRUB config). In theory, the test applications can be run with a flag that disables hugepages, but I always kept them enabled. *Do not forget to run update-grub after editing the GRUB config.* Once the hugepages are in place, go straight to ./usertools/dpdk-setup.py - this tool builds and configures everything else. You can find instructions online recommending to build and configure things around dpdk-setup.py - do not do that. Well, you can, but nothing worked for me until I went through dpdk-setup.py. Briefly, the sequence of actions inside dpdk-setup.py: build DPDK for your target (x86_64-native-linuxapp-gcc), insert the igb_uio kernel module, set up hugepage mappings, and bind the chosen network interface to the igb_uio driver.


After that, you can build the example by running make in its directory. You only need to set the RTE_SDK environment variable to point at the DPDK directory first.

The complete code of the example lies here. It consists of the initialization, a primitive implementation of TCP/IP, and a primitive HTTP parser. Let's start with the initialization.

int main(int argc, char** argv)
{
    int ret;

    // initialize the DPDK Environment Abstraction Layer
    ret = rte_eal_init(argc, argv);
    if (ret < 0) {
        rte_panic("Cannot init EAL\n");
    }

    // create the memory pools for packets and per-connection state
    g_packet_mbuf_pool = rte_pktmbuf_pool_create("mbuf_pool", 131071, 32, 0,
            RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (g_packet_mbuf_pool == NULL) {
        rte_exit(EXIT_FAILURE, "Cannot init mbuf pool\n");
    }
    g_tcp_state_pool = rte_mempool_create("tcp_state_pool", 65535, sizeof(struct tcp_state),
            0, 0, NULL, NULL, NULL, NULL, rte_socket_id(), 0);
    if (g_tcp_state_pool == NULL) {
        rte_exit(EXIT_FAILURE, "Cannot init tcp_state pool\n");
    }

    // create the hash table that holds TCP sessions
    struct rte_hash_parameters hash_params = {
        .entries = 64536,
        .key_len = sizeof(struct tcp_key),
        .socket_id = rte_socket_id(),
        .hash_func_init_val = 0,
        .name = "tcp clients table"
    };
    g_clients = rte_hash_create(&hash_params);
    if (g_clients == NULL) {
        rte_exit(EXIT_FAILURE, "No hash table created\n");
    }

    // expect exactly 1 Ethernet port bound to DPDK
    uint8_t nb_ports = rte_eth_dev_count();
    if (nb_ports == 0) {
        rte_exit(EXIT_FAILURE, "No Ethernet ports - bye\n");
    }
    if (nb_ports > 1) {
        rte_exit(EXIT_FAILURE, "Not implemented. Too many ports\n");
    }

    // configure the port
    struct rte_eth_conf port_conf = {
        .rxmode = {
            .split_hdr_size = 0,
            .header_split = 0,   /**< Header Split disabled */
            .hw_ip_checksum = 0, /**< IP checksum offload disabled */
            .hw_vlan_filter = 0, /**< VLAN filtering disabled */
            .jumbo_frame = 0,    /**< Jumbo Frame Support disabled */
            .hw_strip_crc = 0,   /**< CRC stripping by hardware disabled */
        },
        .txmode = {
            .mq_mode = ETH_MQ_TX_NONE,
        },
    };

    // offload IP and TCP checksum calculation to the network card
    port_conf.txmode.offloads |= DEV_TX_OFFLOAD_IPV4_CKSUM;
    port_conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_CKSUM;

    ret = rte_eth_dev_configure(0, RX_QUEUE_COUNT, TX_QUEUE_COUNT, &port_conf);
    if (ret < 0) {
        rte_exit(EXIT_FAILURE, "Cannot configure device: err=%d\n", ret);
    }

    // set up the RX queues
    for (uint16_t j = 0; j < RX_QUEUE_COUNT; ++j) {
        ret = rte_eth_rx_queue_setup(0, j, 1024, rte_eth_dev_socket_id(0), NULL, g_packet_mbuf_pool);
        if (ret < 0) {
            rte_exit(EXIT_FAILURE, "rte_eth_rx_queue_setup:err=%d\n", ret);
        }
    }

    // set up the TX queues
    struct rte_eth_txconf txconf = {
        .offloads = port_conf.txmode.offloads,
    };
    for (uint16_t j = 0; j < TX_QUEUE_COUNT; ++j) {
        ret = rte_eth_tx_queue_setup(0, j, 1024, rte_eth_dev_socket_id(0), &txconf);
        if (ret < 0) {
            rte_exit(EXIT_FAILURE, "rte_eth_tx_queue_setup:err=%d\n", ret);
        }
    }

    // start the NIC
    ret = rte_eth_dev_start(0);
    if (ret < 0) {
        rte_exit(EXIT_FAILURE, "rte_eth_dev_start:err=%d\n", ret);
    }

    // enter the packet processing loop
    lcore_hello(NULL);
    return 0;
}

The moment we bind the selected network interface to the DPDK driver via dpdk-setup.py, that interface stops being visible to the kernel. From then on, every packet arriving on it is written by DMA into the queues we provided.

And here is the packet processing loop.

struct rte_mbuf* packets[MAX_PACKETS];
uint16_t rx_current_queue = 0;
while (1)
{
    // read a burst of packets from the next RX queue
    unsigned packet_count = rte_eth_rx_burst(0, (++rx_current_queue) % RX_QUEUE_COUNT, packets, MAX_PACKETS);
    for (unsigned j = 0; j < packet_count; ++j)
    {
        struct rte_mbuf* m = packets[j];

        // parse the Ethernet header
        struct ether_hdr* eth_header = rte_pktmbuf_mtod(m, struct ether_hdr*);

        // IPv4 packet?
        if (RTE_ETH_IS_IPV4_HDR(m->packet_type))
        {
            do
            {
                // validate sizes and header fields
                if (rte_pktmbuf_data_len(m) < sizeof(struct ether_hdr) + sizeof(struct ipv4_hdr) + sizeof(struct tcp_hdr)) { TRACE; break; }
                struct ipv4_hdr* ip_header = (struct ipv4_hdr*)((char*)eth_header + sizeof(struct ether_hdr));
                if ((ip_header->next_proto_id != 0x6) || (ip_header->version_ihl != 0x45)) { TRACE; break; }
                if (ip_header->dst_addr != MY_IP_ADDRESS) { TRACE; break; }
                if (rte_pktmbuf_data_len(m) < htons(ip_header->total_length) + sizeof(struct ether_hdr)) { TRACE; break; }
                if (htons(ip_header->total_length) < sizeof(struct ipv4_hdr) + sizeof(struct tcp_hdr)) { TRACE; break; }
                struct tcp_hdr* tcp_header = (struct tcp_hdr*)((char*)ip_header + sizeof(struct ipv4_hdr));
                size_t tcp_header_size = (tcp_header->data_off >> 4) * 4;
                if (rte_pktmbuf_data_len(m) < sizeof(struct ether_hdr) + sizeof(struct ipv4_hdr) + tcp_header_size) { TRACE; break; }
                if (tcp_header->dst_port != 0x5000) { TRACE; break; } // 0x5000 is port 80 in network byte order
                size_t data_size = htons(ip_header->total_length) - sizeof(struct ipv4_hdr) - tcp_header_size;
                void* data = (char*)tcp_header + tcp_header_size;

                // the key used to look the client up in the hash table
                struct tcp_key key = {
                    .ip = ip_header->src_addr,
                    .port = tcp_header->src_port
                };

                // hand the segment over to the TCP state machine
                process_tcp(m, tcp_header, &key, data, data_size);
            } while (0);
        }
        else if (eth_header->ether_type == 0x0608) // ARP (0x0806 in network byte order)
        {
            // nobody answers ARP requests for us any more, so we have to do it ourselves
            do
            {
                if (rte_pktmbuf_data_len(m) < sizeof(struct arp) + sizeof(struct ether_hdr)) { TRACE; break; }
                struct arp* arp_packet = (struct arp*)((char*)eth_header + sizeof(struct ether_hdr));
                if (arp_packet->opcode != 0x100) { TRACE; break; } // ARP request (opcode 1, byte-swapped)
                if (arp_packet->dst_pr_add != MY_IP_ADDRESS) { TRACE; break; }
                send_arp_response(arp_packet);
            } while (0);
        }
        else
        {
            TRACE;
        }
        rte_pktmbuf_free(m);
    }
}

The rte_eth_rx_burst function is used to read packets from a queue: if there is something in the queue, it reads the packets and puts them into an array; if the queue is empty, it returns 0, and you should simply call it again right away. Yes, this approach burns CPU time for nothing when there is no traffic on the network, but if we have gone for DPDK in the first place, we assume that is not our case. *Important: the function is not thread-safe; you cannot read from the same queue from different threads.* After a packet has been processed, rte_pktmbuf_free must be called on it. To send a packet, use rte_eth_tx_burst, which places an rte_mbuf obtained via rte_pktmbuf_alloc into the NIC queue.
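For illustration, a minimal send path could look like the sketch below. This is not code from the example: it assumes the g_packet_mbuf_pool created during initialization, sends via port 0, queue 0, and the send_frame name is mine.

#include <string.h>
#include <rte_mbuf.h>   // assumed DPDK 18.05 headers
#include <rte_ethdev.h>

static void send_frame(const void* frame, uint16_t frame_len)
{
    struct rte_mbuf* m = rte_pktmbuf_alloc(g_packet_mbuf_pool);
    if (m == NULL) {
        return; // pool exhausted, drop the packet
    }
    char* payload = rte_pktmbuf_append(m, frame_len);
    if (payload == NULL) {
        rte_pktmbuf_free(m);
        return;
    }
    memcpy(payload, frame, frame_len);
    // enqueue into TX queue 0 of port 0; the return value is how many packets were queued
    if (rte_eth_tx_burst(0, 0, &m, 1) != 1) {
        rte_pktmbuf_free(m); // TX ring full: we still own the mbuf and must free it
    }
}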

After the packet headers have been parsed, a TCP session has to be set up. The TCP protocol is full of special cases, corner situations, and denial-of-service hazards; implementing a more or less complete TCP is an excellent exercise for an experienced developer, but it is out of scope here. In the example, TCP is implemented just well enough for the test: a session table based on the hash table shipped with DPDK, setup and teardown of a TCP connection, and sending and receiving data, with no handling of packet loss or reordering. The DPDK hash table has an important limitation: it can be read from several threads, but written from only one. The example is single-threaded, so the problem does not arise here; to process traffic on several cores, you could use RSS, keep the session table local to each core, and do without locks, as sketched below.
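The RSS part of that multi-core variant could look roughly like this (not implemented in the example; the names are from the DPDK 18.05 API, and port_conf_rss is mine):

// hash incoming flows across RX queues so that all segments of one TCP
// connection always land on the same queue, and thus on the same core
struct rte_eth_conf port_conf_rss = {
    .rxmode = {
        .mq_mode = ETH_MQ_RX_RSS,
    },
    .rx_adv_conf = {
        .rss_conf = {
            .rss_key = NULL,                    // let the driver use its default key
            .rss_hf = ETH_RSS_IP | ETH_RSS_TCP, // hash over addresses and ports
        },
    },
};
// with one RX queue and one private session table per core, no locks are needed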

TCP implementation
static void process_tcp(struct rte_mbuf* m, struct tcp_hdr* tcp_header, struct tcp_key* key, void* data, size_t data_size)
{
    TRACE;
    struct tcp_state* state;
    if (rte_hash_lookup_data(g_clients, key, (void**)&state) < 0) // Documentation lies!!!
    {
        TRACE;
        if ((tcp_header->tcp_flags & 0x2) != 0) // SYN
        {
            TRACE;
            struct ether_hdr* eth_header = rte_pktmbuf_mtod(m, struct ether_hdr*);
            if (rte_mempool_get(g_tcp_state_pool, (void**)&state) < 0)
            {
                ERROR("tcp state alloc fail");
                return;
            }
            memcpy(&state->tcp_template, &g_tcp_packet_template, sizeof(g_tcp_packet_template));
            memcpy(&state->tcp_template.eth.d_addr, &eth_header->s_addr, 6);
            state->tcp_template.ip.dst_addr = key->ip;
            state->tcp_template.tcp.dst_port = key->port;
            state->remote_seq = htonl(tcp_header->sent_seq);
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wpointer-to-int-cast"
            state->my_seq_start = (uint32_t)state; // not very secure.
#pragma GCC diagnostic pop
            state->fin_sent = 0;
            state->http.state = HTTP_START;
            state->http.request_url_size = 0;
            // not thread safe! only one core used
            if (rte_hash_add_key_data(g_clients, key, state) == 0)
            {
                struct tcp_hdr* new_tcp_header;
                struct rte_mbuf* packet = build_packet(state, 12, &new_tcp_header);
                if (packet != NULL)
                {
                    new_tcp_header->rx_win = TX_WINDOW_SIZE;
                    new_tcp_header->sent_seq = htonl(state->my_seq_start);
                    state->my_seq_sent = state->my_seq_start + 1;
                    ++state->remote_seq;
                    new_tcp_header->recv_ack = htonl(state->remote_seq);
                    new_tcp_header->tcp_flags = 0x12; // SYN+ACK
                    // mss = 1380, no window scaling
                    uint8_t options[12] = {0x02, 0x04, 0x05, 0x64, 0x03, 0x03, 0x00, 0x01, 0x01, 0x01, 0x01, 0x01};
                    memcpy((uint8_t*)new_tcp_header + sizeof(struct tcp_hdr), options, 12);
                    new_tcp_header->data_off = 0x80; // 8 * 4 = 32 bytes: header plus options
                    send_packet(new_tcp_header, packet);
                }
                else
                {
                    ERROR("rte_pktmbuf_alloc, tcp synack");
                }
            }
            else
            {
                ERROR("can't add connection to table");
                rte_mempool_put(g_tcp_state_pool, state);
            }
        }
        else
        {
            ERROR("lost connection");
        }
        return;
    }

    if ((tcp_header->tcp_flags & 0x2) != 0) // SYN retransmit
    {
        // not thread safe! only one core used
        if (rte_hash_del_key(g_clients, key) < 0)
        {
            ERROR("can't delete key");
        }
        else
        {
            rte_mempool_put(g_tcp_state_pool, state);
            return process_tcp(m, tcp_header, key, data, data_size);
        }
    }
    if ((tcp_header->tcp_flags & 0x10) != 0) // ACK
    {
        TRACE;
        uint32_t ack_delta = htonl(tcp_header->recv_ack) - state->my_seq_start;
        uint32_t my_max_ack_delta = state->my_seq_sent - state->my_seq_start;
        if (ack_delta == 0)
        {
            if ((data_size == 0) && (tcp_header->tcp_flags == 0x10))
            {
                ERROR("need to retransmit. not supported");
            }
        }
        else if (ack_delta <= my_max_ack_delta)
        {
            state->my_seq_start += ack_delta;
        }
        else
        {
            ERROR("ack on unsent seq");
        }
    }
    if (data_size > 0)
    {
        TRACE;
        uint32_t packet_seq = htonl(tcp_header->sent_seq);
        if (state->remote_seq == packet_seq)
        {
            feed_http(data, data_size, state);
            state->remote_seq += data_size;
        }
        else if (state->remote_seq - 1 == packet_seq) // keepalive
        {
            struct tcp_hdr* new_tcp_header;
            struct rte_mbuf* packet = build_packet(state, 0, &new_tcp_header);
            if (packet != NULL)
            {
                new_tcp_header->rx_win = TX_WINDOW_SIZE;
                new_tcp_header->sent_seq = htonl(state->my_seq_sent);
                new_tcp_header->recv_ack = htonl(state->remote_seq);
                send_packet(new_tcp_header, packet);
            }
            else
            {
                ERROR("rte_pktmbuf_alloc, tcp ack keepalive");
            }
        }
        else
        {
            // out-of-order segment: resend the last response
            struct tcp_hdr* new_tcp_header;
            struct rte_mbuf* packet = build_packet(state, state->http.last_message_size, &new_tcp_header);
            TRACE;
            if (packet != NULL)
            {
                new_tcp_header->rx_win = TX_WINDOW_SIZE;
                new_tcp_header->sent_seq = htonl(state->my_seq_sent - state->http.last_message_size);
                new_tcp_header->recv_ack = htonl(state->remote_seq);
                memcpy((char*)new_tcp_header + sizeof(struct tcp_hdr), &state->http.last_message, state->http.last_message_size);
                send_packet(new_tcp_header, packet);
            }
            else
            {
                ERROR("rte_pktmbuf_alloc, tcp fin ack");
            }
            //ERROR("my bad tcp stack implementation(((");
        }
    }
    if ((tcp_header->tcp_flags & 0x04) != 0) // RST
    {
        TRACE;
        // not thread safe! only one core used
        if (rte_hash_del_key(g_clients, key) < 0)
        {
            ERROR("can't delete key");
        }
        else
        {
            rte_mempool_put(g_tcp_state_pool, state);
        }
    }
    else if ((tcp_header->tcp_flags & 0x01) != 0) // FIN
    {
        struct tcp_hdr* new_tcp_header;
        struct rte_mbuf* packet = build_packet(state, 0, &new_tcp_header);
        TRACE;
        if (packet != NULL)
        {
            new_tcp_header->rx_win = TX_WINDOW_SIZE;
            new_tcp_header->sent_seq = htonl(state->my_seq_sent);
            new_tcp_header->recv_ack = htonl(state->remote_seq + 1);
            if (!state->fin_sent)
            {
                TRACE;
                new_tcp_header->tcp_flags = 0x11; // !@#$ the last ack
            }
            send_packet(new_tcp_header, packet);
        }
        else
        {
            ERROR("rte_pktmbuf_alloc, tcp fin ack");
        }
        // not thread safe! only one core used
        if (rte_hash_del_key(g_clients, key) < 0)
        {
            ERROR("can't delete key");
        }
        else
        {
            rte_mempool_put(g_tcp_state_pool, state);
        }
    }
}
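build_packet and send_packet are used throughout but not quoted above. Judging purely by the call sites, build_packet might look roughly like the following; the details are my guesses, not the author's actual code:

static struct rte_mbuf* build_packet(struct tcp_state* state, size_t data_size, struct tcp_hdr** tcp_header)
{
    struct rte_mbuf* m = rte_pktmbuf_alloc(g_packet_mbuf_pool);
    if (m == NULL) {
        return NULL;
    }
    // the per-client template prepared on SYN holds prebuilt eth + ip + tcp headers
    size_t headers_size = sizeof(state->tcp_template);
    char* buf = rte_pktmbuf_append(m, headers_size + data_size);
    if (buf == NULL) {
        rte_pktmbuf_free(m);
        return NULL;
    }
    memcpy(buf, &state->tcp_template, headers_size);
    // patch the IP total length for this packet's payload
    struct ipv4_hdr* ip = (struct ipv4_hdr*)(buf + sizeof(struct ether_hdr));
    ip->total_length = htons(sizeof(struct ipv4_hdr) + sizeof(struct tcp_hdr) + data_size);
    *tcp_header = (struct tcp_hdr*)((char*)ip + sizeof(struct ipv4_hdr));
    return m;
}

send_packet then presumably fills in whatever header fields remain (or sets the checksum-offload flags) and hands the mbuf to rte_eth_tx_burst.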


The HTTP parser supports only GET: it extracts the URL from the request and returns an HTML page containing that URL.

HTTP parser
static void feed_http(void* data, size_t data_size, struct tcp_state* state)
{
    TRACE;
    size_t remaining_data = data_size;
    char* current = (char*)data;
    struct http_state* http = &state->http;
    if (http->state == HTTP_BAD_STATE)
    {
        TRACE;
        return;
    }

    while (remaining_data > 0)
    {
        switch (http->state)
        {
            case HTTP_START:
            {
                if (*current == 'G') { http->state = HTTP_READ_G; } else { http->state = HTTP_BAD_STATE; }
                break;
            }
            case HTTP_READ_G:
            {
                if (*current == 'E') { http->state = HTTP_READ_E; } else { http->state = HTTP_BAD_STATE; }
                break;
            }
            case HTTP_READ_E:
            {
                if (*current == 'T') { http->state = HTTP_READ_T; } else { http->state = HTTP_BAD_STATE; }
                break;
            }
            case HTTP_READ_T:
            {
                if (*current == ' ') { http->state = HTTP_READ_SPACE; } else { http->state = HTTP_BAD_STATE; }
                break;
            }
            case HTTP_READ_SPACE:
            {
                if (*current != ' ')
                {
                    http->request_url[http->request_url_size] = *current;
                    ++http->request_url_size;
                    if (http->request_url_size > MAX_URL_SIZE)
                    {
                        http->state = HTTP_BAD_STATE;
                    }
                }
                else
                {
                    http->state = HTTP_READ_URL;
                    http->request_url[http->request_url_size] = '\0';
                }
                break;
            }
            case HTTP_READ_URL:
            {
                if (*current == '\r') { http->state = HTTP_READ_R1; }
                break;
            }
            case HTTP_READ_R1:
            {
                if (*current == '\n') { http->state = HTTP_READ_N1; }
                else if (*current == '\r') { http->state = HTTP_READ_R1; }
                else { http->state = HTTP_READ_URL; }
                break;
            }
            case HTTP_READ_N1:
            {
                if (*current == '\r') { http->state = HTTP_READ_R2; }
                else { http->state = HTTP_READ_URL; }
                break;
            }
            case HTTP_READ_R2:
            {
                if (*current == '\n')
                {
                    TRACE;
                    char content_length[32];
                    sprintf(content_length, "%lu", g_http_part2_size - 4 + http->request_url_size + g_http_part3_size);
                    size_t content_length_size = strlen(content_length);
                    size_t total_data_size = g_http_part1_size + g_http_part2_size + g_http_part3_size + http->request_url_size + content_length_size;

                    struct tcp_hdr* tcp_header;
                    struct rte_mbuf* packet = build_packet(state, total_data_size, &tcp_header);
                    if (packet != NULL)
                    {
                        tcp_header->rx_win = TX_WINDOW_SIZE;
                        tcp_header->sent_seq = htonl(state->my_seq_sent);
                        tcp_header->recv_ack = htonl(state->remote_seq + data_size);
#ifdef KEEPALIVE
                        state->my_seq_sent += total_data_size;
#else
                        state->my_seq_sent += total_data_size + 1; // +1 for FIN
                        tcp_header->tcp_flags = 0x11;
                        state->fin_sent = 1;
#endif
                        char* new_data = (char*)tcp_header + sizeof(struct tcp_hdr);
                        memcpy(new_data, g_http_part1, g_http_part1_size);
                        new_data += g_http_part1_size;
                        memcpy(new_data, content_length, content_length_size);
                        new_data += content_length_size;
                        memcpy(new_data, g_http_part2, g_http_part2_size);
                        new_data += g_http_part2_size;
                        memcpy(new_data, http->request_url, http->request_url_size);
                        new_data += http->request_url_size;
                        memcpy(new_data, g_http_part3, g_http_part3_size);

                        memcpy(&http->last_message, (char*)tcp_header + sizeof(struct tcp_hdr), total_data_size);
                        http->last_message_size = total_data_size;
                        send_packet(tcp_header, packet);
                    }
                    else
                    {
                        ERROR("rte_pktmbuf_alloc, tcp data");
                    }
                    http->state = HTTP_START;
                    http->request_url_size = 0;
                }
                else if (*current == '\r') { http->state = HTTP_READ_R1; }
                else { http->state = HTTP_READ_URL; }
                break;
            }
            default:
            {
                ERROR("bad http state");
                return;
            }
        }
        if (http->state == HTTP_BAD_STATE)
        {
            return;
        }
        --remaining_data;
        ++current;
    }
}
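The g_http_part1/2/3 buffers are the static pieces of the response, with the Content-Length value and the URL spliced in between them. Their exact contents are not quoted here, but the arithmetic above (g_http_part2_size - 4 accounts for the \r\n\r\n that terminates the headers) implies a layout roughly like this; it is my reconstruction, not the author's exact strings:

// response layout: part1 + <content-length digits> + part2 + <url> + part3
static const char g_http_part1[] =
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: ";   // the numeric value is spliced in right after this
static const char g_http_part2[] =
    "\r\n\r\n"            // these 4 bytes end the headers, hence the "- 4" above
    "<html><body>";
static const char g_http_part3[] = "</body></html>";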


Once the example was ready, it was time to compare its performance with nginx. Since I cannot assemble a real test bench at home, I used Amazon EC2. EC2 made its own adjustments to the testing: I had to abandon Connection: close requests, because at around 300k rps SYN packets started being dropped a few seconds into the test. Apparently there is some kind of SYN-flood protection, so the requests were made keep-alive. DPDK does not work on every EC2 instance type; for example, it did not work on m1.medium. The bench consisted of 1 r4.8xlarge instance running the application and 2 r4.8xlarge instances generating load, communicating over separate network interfaces in a private VPC subnet. I tried loading it with different tools: ab, wrk, h2load, siege. wrk turned out to be the most convenient, because ab is single-threaded and produces corrupted statistics when there are network errors.

Under heavy traffic in EC2 you can observe a certain number of packet drops. For a normal application this would be imperceptible, but with ab any retransmit inflates the total run time, so its average requests-per-second figures become unusable. The reason for the drops is a separate mystery still to be investigated; however, the fact that the problems show up not only with DPDK but also with nginx suggests that the example itself is not at fault.

I ran the test in two stages: first wrk from 1 instance, then from 2. If the total throughput from 2 instances equals that from 1, it means I have not run into the performance limits of wrk itself.

Test results of the DPDK example on r4.8xlarge
wrk started from 1 instance

root@ip-172-30-0-127:~# wrk -t64 -d10s -c4000 --latency http://172.30.4.10/testu
Running 10s test @ http://172.30.4.10/testu
64 threads and 4000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 32.19ms 63.43ms 817.26ms 85.01%
Req/Sec 15.97k 4.04k 113.97k 93.47%
Latency Distribution
50% 2.58ms
75% 17.57ms
90% 134.94ms
99% 206.03ms
10278064 requests in 10.10s, 1.70GB read
Socket errors: connect 0, read 17, write 0, timeout 0
Requests/sec: 1017645.11
Transfer/sec: 172.75MB

Run wrk from 2 instances simultaneously
root@ip-172-30-0-127:~# wrk -t64 -d10s -c4000 --latency http://172.30.4.10/testu
Running 10s test @ http://172.30.4.10/testu
64 threads and 4000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 67.28ms 119.20ms 1.64s 88.90%
Req/Sec 7.99k 4.58k 132.62k 96.67%
Latency Distribution
50% 2.31ms
75% 103.45ms
90% 191.51ms
99% 563.56ms
5160076 requests in 10.10s, 0.86GB read
Socket errors: connect 0, read 2364, write 0, timeout 1
Requests/sec: 510894.92
Transfer/sec: 86.73MB

root@ip-172-30-0-225:~# wrk -t64 -d10s -c4000 --latency http://172.30.4.10/testu
Running 10s test @ http://172.30.4.10/testu
64 threads and 4000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 74.87ms 148.64ms 1.64s 93.45%
Req/Sec 8.22k 2.59k 42.51k 81.21%
Latency Distribution
50% 2.41ms
75% 110.42ms
90% 190.66ms
99% 739.67ms
5298083 requests in 10.10s, 0.88GB read
Socket errors: connect 0, read 0, write 0, timeout 148
Requests/sec: 524543.67
Transfer/sec: 89.04MB


Nginx gave the following results
wrk started from 1 instance

root@ip-172-30-0-127:~# wrk -t64 -d10s -c4000 --latency http://172.30.4.10/testu
Running 10s test @ http://172.30.4.10/testu
64 threads and 4000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 14.36ms 56.41ms 1.92s 95.26%
Req/Sec 15.27k 3.30k 72.23k 83.53%
Latency Distribution
50% 3.38ms
75% 6.82ms
90% 10.95ms
99% 234.99ms
9813464 requests in 10.10s, 2.12GB read
Socket errors: connect 0, read 1, write 0, timeout 3
Requests/sec: 971665.79
Transfer/sec: 214.94MB


Run wrk from 2 instances simultaneously

root@ip-172-30-0-127:~# wrk -t64 -d10s -c4000 --latency http://172.30.4.10/testu
Running 10s test @ http://172.30.4.10/testu
64 threads and 4000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 52.91ms 82.19ms 1.04s 82.93%
Req/Sec 8.05k 3.09k 55.62k 89.11%
Latency Distribution
50% 3.66ms
75% 94.87ms
90% 171.83ms
99% 354.26ms
5179253 requests in 10.10s, 1.12GB read
Socket errors: connect 0, read 134, write 0, timeout 0
Requests/sec: 512799.10
Transfer/sec: 113.43MB

root@ip-172-30-0-225:~# wrk -t64 -d10s -c4000 --latency http://172.30.4.10/testu
Running 10s test @ http://172.30.4.10/testu
64 threads and 4000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 64.38ms 121.56ms 1.67s 90.32%
Req/Sec 7.30k 2.54k 34.94k 82.10%
Latency Distribution
50% 3.68ms
75% 103.32ms
90% 184.05ms
99% 561.31ms
4692290 requests in 10.10s, 1.01GB read
Socket errors: connect 0, read 2, write 0, timeout 21
Requests/sec: 464566.93
Transfer/sec: 102.77MB


nginx config
user www-data;
worker_processes auto;
pid /run/nginx.pid;
worker_rlimit_nofile 50000;

events {
    worker_connections 10000;
}

http {
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    types_hash_max_size 2048;

    include /etc/nginx/mime.types;
    default_type text/plain;

    error_log /var/log/nginx/error.log;
    access_log off;

    server {
        listen 80 default_server backlog=10000 reuseport;
        location / {
            return 200 "answer.padding _____________________________________________________________";
        }
    }
}

So, we see that both setups reach about 1M requests per second, only nginx used all 32 CPU cores for it, while DPDK used just one. Perhaps EC2 plays a dirty trick again and 1M rps is a network limit, but even if so, the results are not distorted much, because adding an artificial delay like for (int x = 0; x < 100; ++x) http->request_url[0] = 'a' + (http->request_url[0] % 10); before sending the packet already reduced the rps, which means the core is almost fully loaded with useful work.

In the course of the experiments, one mystery came up that I cannot solve for the time being. If you enable checksum offloading, that is, let the network card itself compute the checksums of the IP and TCP headers, the overall performance drops, but the latency improves.
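Per-packet checksum offload with the DPDK 18.05 API amounts to roughly the following sketch; the helper name is mine, and the example's send path is assumed to do something equivalent when the DEV_TX_OFFLOAD_* flags are set on the port:

static void request_checksum_offload(struct rte_mbuf* m, struct ipv4_hdr* ip, struct tcp_hdr* tcp)
{
    m->l2_len = sizeof(struct ether_hdr); // tell the NIC where each header starts
    m->l3_len = sizeof(struct ipv4_hdr);
    m->ol_flags |= PKT_TX_IPV4 | PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM;
    ip->hdr_checksum = 0;                              // the NIC fills this in
    tcp->cksum = rte_ipv4_phdr_cksum(ip, m->ol_flags); // NIC expects the pseudo-header sum here
}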
Here is a run with offloading enabled

root@ip-172-30-0-127:~# wrk -t64 -d10s -c4000 --latency http://172.30.4.10/testu
Running 10s test @ http://172.30.4.10/testu
64 threads and 4000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 5.91ms 614.33us 28.35ms 96.17%
Req/Sec 10.48k 1.51k 69.89k 98.78%
Latency Distribution
50% 5.91ms
75% 6.01ms
90% 6.19ms
99% 6.99ms
6738296 requests in 10.10s, 1.12GB read
Requests/sec: 667140.71
Transfer/sec: 113.25MB


And here is one with the checksums computed on the CPU

root@ip-172-30-0-127:~# wrk -t64 -d10s -c4000 --latency http://172.30.4.10/testu
Running 10s test @ http://172.30.4.10/testu
64 threads and 4000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 32.19ms 63.43ms 817.26ms 85.01%
Req/Sec 15.97k 4.04k 113.97k 93.47%
Latency Distribution
50% 2.58ms
75% 17.57ms
90% 134.94ms
99% 206.03ms
10278064 requests in 10.10s, 1.70GB read
Socket errors: connect 0, read 17, write 0, timeout 0
Requests/sec: 1017645.11
Transfer/sec: 172.75MB


OK, I can explain the drop in performance by the network card being slow at computing checksums, although it is strange: if anything, it should speed things up. But why is the latency almost constant at 6 ms when the checksums are computed on the card, while with checksums on the CPU it floats from 2.5 ms to 817 ms? A non-virtualized bench with a direct connection would simplify the investigation a lot, but unfortunately I don't have one. Also note that DPDK does not work on every network card; check the list of supported NICs before using it.


Source: https://habr.com/ru/post/416651/

