NGINX in Linux Systems, Performance and Internal Mechanisms

Abolfazl Abbasi
Teknasyon Engineering

--

Introduction

For some time, I had been searching for a project idea that would allow me to delve deeply into Linux Networking and C++ concurrency programming. After evaluating several popular project concepts, I chose to develop a load-balancing web server. This application is designed to efficiently manage incoming requests, directing traffic to the appropriate origin backend servers. It then collects the responses from these servers and delivers them back to the clients. This system optimizes resource utilization by distributing the workload across multiple servers and ensures high availability and reliability of the service by managing traffic flows intelligently.

During my research, I discovered NGINX, renowned for its load-balancing capabilities. Intrigued to understand its workings, I cloned the source code and embarked on compiling it on my machine. After facing some challenges with OpenSSL, I eventually succeeded in building it.

My primary objective was to analyze NGINX’s behavior as it manages traffic routing from clients to the origin backend servers when it is configured as a load balancer. This post outlines the findings of my exploration, which began last month.

NGINX traffic diagram by DALL.E

Performance and Speed: 🚀

NGINX is renowned as one of the fastest web servers, capable of serving as a high-performance load balancer. But what exactly do we mean by “high performance”? This question opens up a broad discussion on the essence of performance in web servers.

When talking about performance, my focus is on how a web server application utilizes system resources to accept requests and return responses in the shortest possible time. It’s important to clarify that my discussion centers exclusively on the performance of the web server itself, not the backend services it interacts with. This distinction is crucial because the backend’s performance, although significant, falls outside the scope of this post and my current expertise. For now, I am going to focus on NGINX’s load-balancing performance.

Traffic routing:

NGINX facilitates traffic routing at two distinct layers of the OSI network model: it can direct traffic to the upstream server (the origin backend) at Layer 4, or it can perform the routing at Layer 7.

Upon receiving a new request, NGINX functions as a Reverse Proxy. It accepts the connection and employs a specific algorithm, defined within the configuration file, to identify an upstream server. To expedite the process and reduce waiting times associated with establishing new connections, NGINX may reuse existing connections to forward the traffic to the upstream server.

As I mentioned, this process occurs in two primary ways:

  1. At Layer 4, NGINX can forward traffic to the upstream server utilizing the client’s IP address or pre-specified data.
  2. At Layer 7, NGINX analyzes the request at the HTTP protocol level to determine the appropriate upstream server for that particular request. This decision may be based on user sessions, specific header data, or the content of the request body itself, enabling a more nuanced selection process for routing traffic.

The first way is faster because NGINX does not need to parse the payload to extract additional information before selecting an upstream backend server.
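
To make the two routing modes concrete, below is a minimal, hypothetical configuration sketch (the upstream names, ports, and addresses are invented, and a real nginx.conf also needs an events block and further tuning). The stream block forwards raw TCP at Layer 4, while the http block routes at Layer 7 and reuses upstream connections via keepalive, as mentioned earlier.

```nginx
# Layer 4: forward TCP connections without parsing HTTP at all.
stream {
    upstream backend_l4 {
        server 10.0.0.11:8080;   # hypothetical backends
        server 10.0.0.12:8080;
    }
    server {
        listen 12345;
        proxy_pass backend_l4;
    }
}

# Layer 7: parse the HTTP request, then pick an upstream.
http {
    upstream backend_l7 {
        ip_hash;                  # keep a client on the same upstream
        server 10.0.0.21:8080;
        server 10.0.0.22:8080;
        keepalive 16;             # reuse connections to the upstream
    }
    server {
        listen 80;
        location / {
            proxy_http_version 1.1;           # needed for upstream keepalive
            proxy_set_header Connection "";
            proxy_pass http://backend_l7;
        }
    }
}
```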

If I had to explain in a single sentence why NGINX is fast, I would say:
NGINX uses a non-blocking, event-driven, and asynchronous architecture.

Blocking I/O vs Non-Blocking I/O:

Blocking I/O: In blocking I/O, when a process attempts an operation such as reading from or writing to a file, it must go through several steps before the operation can complete. These operations are called “blocking” because the process must wait at certain points until specific conditions are met or the work is finished. Let’s break down these steps (a short C sketch follows the list):

  1. System Call Initiation: The process begins with an open() system call to start the I/O operation. This step is generally straightforward and does not cause blocking.
  2. Kernel Mode Switch: For the request to be processed, the CPU switches from user mode to kernel mode. This switch is necessary because operations like opening a file require accessing hardware resources, which only the kernel has the privilege to do. The mode switch is a common step for all system calls and does not typically result in blocking.
  3. Address Validation: The kernel checks the process’s permissions and validates if it has the required access to the file. While permission checking does not inherently block, if the file is located on a mechanical hard disk or if the data bus is busy, this step has the potential to block the process.
  4. Obtaining a File Descriptor: If the process has the necessary permissions and no other process has locked the file, the kernel proceeds to create a lock for the process, allocate a file descriptor for the file, and return this descriptor to the process. This step can lead to blocking if:
    - The kernel has opened many files and needs to allocate more space for new ones.
    - Another process has locked the file, requiring the current process to wait until the file is unlocked.
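
As a concrete illustration of these steps, here is a minimal C sketch of plain blocking file I/O (the file path is hypothetical); every call below can put the process to sleep until the kernel finishes its part of the work.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* open() switches to kernel mode; the call may sleep if, for example,
       the kernel's file table must grow or another process holds a lock. */
    int fd = open("/var/log/example.log", O_RDONLY);  /* hypothetical path */
    if (fd == -1) {
        perror("open");
        return 1;
    }

    char buf[4096];
    ssize_t n;
    /* Each read() blocks until the kernel has copied data from the
       page cache (or the disk) into our buffer. */
    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        write(STDOUT_FILENO, buf, (size_t)n);  /* also a blocking call */
    }

    close(fd);
    return 0;
}
```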

Any task that needs I/O will likely block the process. Usually, when the system isn’t busy, everything works smoothly. But when the system gets busy, like roads during rush hour, even simple things like opening a file can make the process wait and block, which isn’t good for the system. We’re fine with doing tasks and spending resources on them, but waiting for things like a free data path or a lock to be released is a waste and slows the system down. For example, if I need to read a very long file, I must read it line by line; that is work I genuinely have to do. But I can use a technique called non-blocking I/O to avoid making the process wait just to open the file.

The Linux Programming Interface book

In the provided figure, we see that the kernel manages communication between an application and the network through the creation of two buffers associated with each socket endpoint. When a process initiates a write operation to a socket, it communicates this through the socket’s file descriptor. The kernel then facilitates this operation by transferring the data to a send buffer, but during this time, the process is blocked; it cannot proceed until the kernel has successfully copied the data. This also applies to the receive buffer for incoming data.

To illustrate, let’s consider a process that writes 1024 bytes to a socket. As the kernel is copying this data to the send buffer and then onto the network, the process attempts to write an additional 512 bytes. This subsequent write operation must wait; it is blocked until the kernel has processed the initial 1024 bytes and has freed space in the send buffer for the next 512 bytes of data. This mechanism ensures that data is handled in an orderly fashion, preserving the integrity and sequence of the information being transmitted.
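
A small C sketch of that scenario, assuming sockfd is a connected TCP socket left in its default blocking mode and buf holds at least 1536 bytes:

```c
#include <unistd.h>

/* sockfd: a connected, blocking TCP socket (assumption).
   buf:    a buffer with at least 1536 bytes ready to send (assumption). */
void blocking_writes(int sockfd, const char *buf) {
    /* The kernel copies 1024 bytes into the socket's send buffer;
       if the buffer is full, the process sleeps right here. */
    ssize_t first = write(sockfd, buf, 1024);

    /* This call cannot proceed until the kernel has made room in the
       send buffer for the next 512 bytes. */
    ssize_t second = write(sockfd, buf + 1024, 512);

    (void)first;
    (void)second;
}
```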

NGINX leverages the "sendfile" system call for a faster and more efficient way to deliver files directly to the network socket from the file descriptor, avoiding the slower process of copying data through user space. This technique shines particularly when serving large files, as it significantly lowers CPU usage and disk I/O operations, making your web server more efficient.
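
Here is a minimal sketch of the idea behind sendfile on Linux (not NGINX’s actual code): the kernel moves the file’s bytes straight from the page cache into the socket, so the process never copies them through user space.

```c
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Send an entire file over a connected socket without copying it
   through user space. client_fd is assumed to be a connected socket. */
ssize_t send_whole_file(int client_fd, const char *path) {
    int file_fd = open(path, O_RDONLY);
    if (file_fd == -1)
        return -1;

    struct stat st;
    if (fstat(file_fd, &st) == -1) {
        close(file_fd);
        return -1;
    }

    off_t offset = 0;
    ssize_t total = 0;
    /* The kernel transfers the data directly from the file to the
       socket's send buffer; our buffers are never involved. */
    while (offset < st.st_size) {
        ssize_t n = sendfile(client_fd, file_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n <= 0)
            break;
        total += n;
    }

    close(file_fd);
    return total;
}
```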

Non-blocking I/O: The concept is straightforward, but the implementation can be complex. When a process requests an I/O operation and the kernel is unable to perform the task immediately, it returns an error to the caller. Instead of waiting for the I/O resource to become available, the calling process continues with other tasks. To complete the previously deferred I/O task, the process must either repeatedly check whether the I/O can now be performed (polling) or set up a notification mechanism to be alerted when the I/O resource is available (event-driven). This allows the process to use its time and system resources efficiently, moving on instead of waiting idly.

Assume our app has just one thread and it has set the socket to non-blocking mode while attempting to connect to a server. If the connection cannot be established immediately and takes time, the kernel signals this by returning an error and setting errno to EINPROGRESS. This means the connection attempt is underway, but our thread does not wait or block; instead, it continues its work and will start sending data to the server once the connection is established.
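
A minimal C sketch of a non-blocking connect, assuming a hypothetical upstream IP and port: the thread gets its file descriptor back immediately and can register it with an event loop instead of waiting for the handshake to finish.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

/* Start connecting to a (hypothetical) upstream without blocking the thread. */
int start_connect(const char *ip, int port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1)
        return -1;

    /* Put the socket into non-blocking mode. */
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);  /* ip is a hypothetical address */

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
        if (errno == EINPROGRESS) {
            /* The TCP handshake continues in the background; register fd
               with an event loop and keep doing other work meanwhile. */
            return fd;
        }
        close(fd);   /* a real error */
        return -1;
    }
    return fd;       /* connected immediately (e.g., to localhost) */
}
```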

Traditionally, implementing this would involve a busy loop: an infinite loop (a “while true” loop) that repeatedly retries the write on the socket. However, a busy loop consumes system resources inefficiently. What’s the alternative? An event-driven architecture.

Note: In the Linux model, there is little functional difference between a thread and a process. The main advantage of a thread is its ability to access the memory of the other threads in its process, i.e. shared memory. Therefore, whenever I mention ‘thread’ in this series of posts, it can also refer to a process, because a process can be viewed as a thread whose thread ID is equal to its PID.

Event-Driven architecture:

NGINX diverges from the traditional model of handling concurrent connections, which relies on using separate processes or threads for each connection. This older approach, exemplified by web servers like Apache, is not only resource-intensive, requiring the allocation of new memory and CPU cycles for each connection, but it also suffers from inefficiencies related to blocking on network or input/output operations, excessive context switching, and thread thrashing. In this architecture, we can say a thread spends most of its time in the blocked state.

Thread per connection model

Threads can be resource-intensive for operating systems. When a process requests to create a new thread, the OS must allocate memory for the thread’s stack and set up the data structures needed to manage it. Additionally, the OS must maintain context information for the thread to manage context switching between multiple threads. This overhead can be significant, especially when there are many threads, leading to increased resource consumption and potentially decreased performance due to the overhead of managing these threads. In computing, context switching is the process where the CPU switches from executing one process or thread to another. While crucial for multitasking, context switching is not without its costs.
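
For contrast, here is a minimal sketch of the thread-per-connection pattern in C (not Apache’s actual code): every accepted connection costs a new thread, and that thread’s blocking read() keeps it parked most of the time.

```c
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

/* One thread per connection: every client gets its own stack and its own
   share of the context-switching overhead described above. */
static void *handle_client(void *arg) {
    int client_fd = (int)(long)arg;
    char buf[4096];
    ssize_t n;
    /* This read() blocks the whole thread while it waits for data. */
    while ((n = read(client_fd, buf, sizeof(buf))) > 0) {
        write(client_fd, buf, (size_t)n);   /* echo back, standing in for real work */
    }
    close(client_fd);
    return NULL;
}

void serve_thread_per_connection(int listen_fd) {
    for (;;) {
        int client_fd = accept(listen_fd, NULL, NULL);   /* blocks until a client arrives */
        if (client_fd == -1)
            continue;
        pthread_t tid;
        /* A new kernel thread for every accepted connection. */
        pthread_create(&tid, NULL, handle_client, (void *)(long)client_fd);
        pthread_detach(tid);
    }
}
```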

What Makes Context Switching Expensive?

  1. State Management: When switching contexts, the CPU needs to save the current state of a process (like where it is in its execution) before it can load and run another. This saving and restoring are resource-intensive, taking up valuable CPU time.
  2. Cache Efficiency: CPUs use a cache to quickly access data. A context switch can lead to a “cold cache” for the new process, meaning the data it needs isn’t readily available in the cache and must be fetched from slower memory, leading to a performance hit.
  3. Pipeline Processing: Modern CPUs process instructions in stages. A context switch may clear these stages, forcing the CPU to start over with the new process, which can temporarily slow down execution.
  4. Overhead from the Operating System: The OS itself must do some work to manage context switches, deciding which process runs next, which adds its own computational overhead.

NGINX Process Architecture:

Instead of the thread-per-connection model I mentioned earlier, NGINX employs a Master-Worker architecture for its process management. The NGINX master typically spawns one worker process per CPU core, and each worker process handles many thousands of requests on just one thread. There are two primary types of processes in this architecture:

  1. Master Process: This is the central controller of the NGINX architecture. It is responsible for initializing the server, loading configuration files, handling various non-worker tasks such as re-reading configuration files, managing software updates without service interruption, and starting or stopping worker processes.
  2. Worker Processes: These processes handle the bulk of the workload in NGINX. Each worker process is capable of accepting new connections, processing requests, and performing the necessary decision-making for selecting the appropriate upstream server for each request. They also route the requests accordingly. Worker processes operate asynchronously and are designed to handle thousands of connections in a non-blocking manner. It’s worth noting that the worker processes don’t share or compete for tasks; they are each allocated their own set of connections to manage by the master process.

Diagram of nginx’s architecture

It’s important to know that all of the tasks that involve handling incoming traffic are performed by the worker processes. The master process’s role is primarily to manage these worker processes, handling configuration loading and reloading, starting and stopping workers, and other administrative tasks, rather than directly handling client connections or making load-balancing decisions.
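
A rough sketch of that split in C, assuming the listening socket is created before forking so the workers inherit it; worker_event_loop stands for the worker’s epoll loop, which is sketched a bit further below.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

void worker_event_loop(int listen_fd);   /* the worker's epoll loop, sketched below */

/* Master/worker split: the master forks one worker per CPU core and then
   only supervises them; the workers inherit listen_fd and serve traffic. */
void run_master(int listen_fd) {
    long workers = sysconf(_SC_NPROCESSORS_ONLN);   /* one worker per core */

    for (long i = 0; i < workers; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            worker_event_loop(listen_fd);   /* child: never returns in this sketch */
            _exit(0);
        }
        /* parent: keep forking the remaining workers */
    }

    /* The master handles no client traffic; it simply waits on its workers.
       A real server would also restart dead workers and handle signals
       for configuration reloads. */
    while (wait(NULL) != -1) {
        /* a worker exited */
    }
}
```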

Each process consumes additional memory, and each switch between them consumes CPU cycles (source)

NGINX’s worker processes are designed to be highly efficient and capable of handling many connections simultaneously, thanks to the event-driven and asynchronous architecture. This allows a relatively small number of worker processes to serve a large number of concurrent connections. Each NGINX worker can handle many thousands of concurrent connections and requests per second. On Linux systems, the workers rely on the epoll system call. The use of epoll with non-blocking I/O is highly efficient for scalable network programming, as it allows a single thread to manage many connections without blocking on any single I/O operation. This model minimizes context switching and CPU idle time, leading to better throughput and scalability.
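
To illustrate the pattern, here is a minimal single-threaded epoll loop in C, assuming listen_fd is a non-blocking listening socket; this is a sketch of the general technique, not NGINX’s source code.

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 64

/* One thread, many connections: sleep until some socket is ready,
   then handle only the sockets that actually have work to do. */
void worker_event_loop(int listen_fd) {
    int epfd = epoll_create1(0);

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        /* Blocks here only when there is truly nothing to do. */
        int ready = epoll_wait(epfd, events, MAX_EVENTS, -1);

        for (int i = 0; i < ready; i++) {
            int fd = events[i].data.fd;

            if (fd == listen_fd) {
                /* New connection: accept it and watch it for readability. */
                int client = accept(listen_fd, NULL, NULL);
                if (client != -1) {
                    struct epoll_event cev = { .events = EPOLLIN, .data.fd = client };
                    epoll_ctl(epfd, EPOLL_CTL_ADD, client, &cev);
                }
            } else {
                char buf[4096];
                ssize_t n = read(fd, buf, sizeof(buf));
                if (n <= 0) {
                    close(fd);   /* closing also removes fd from the epoll set */
                } else {
                    /* parse the request, pick an upstream, forward it, ... */
                }
            }
        }
    }
}
```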

Conclusion:

NGINX’s speed and efficiency as a web server and load balancer can be attributed to its smart architecture and the use of advanced system calls like epoll. Its event-driven, asynchronous, and non-blocking I/O model allows for handling thousands of concurrent connections per worker process. Unlike traditional thread-per-connection models that consume significant system resources and suffer from context-switching overhead, NGINX minimizes these costs. The Master-Worker setup efficiently divides tasks, where the master process handles configuration and process management while worker processes deal directly with network traffic, maintaining high performance without being bogged down by overhead. The careful use of resources and intelligent management of connections means that NGINX can offer high throughput, low latency, and impressive scalability, solidifying its reputation as a fast and reliable solution in the web server space.

Keep in mind, however, that every architecture has its advantages and drawbacks. Interestingly, Cloudflare recently published a blog post detailing how they developed their proxy server, which, according to their specific use cases, performs even better than NGINX. This highlights the importance of thoroughly understanding your own operational requirements and project goals before settling on a development approach. It serves as a reminder that while NGINX offers excellent speed, efficiency, and scalability, the optimal solution varies depending on the unique challenges and objectives of each project.

As a software engineer exploring the complexities of large-scale software, I’m on a quest to understand how these sophisticated systems work. This journey is a new challenge for me, so if you notice any mistakes or if you have any ideas on how to make things better, please feel free to leave a comment or email me. I’m open to discussions and eager to delve deeper into the world of large software projects.
