I/O Multiplexing

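These are my reading notes on 《Linux 高性能服务器编程》 (Linux High-Performance Server Programming), kept for future reference. Reading the original book is recommended.

The header files each function requires are not listed here; use the man command to look them up.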

1. select

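The select system call monitors the file descriptors the user is interested in for readable, writable, and exceptional events, for up to a specified period of time.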

1.1 select API

The select function prototype:

/* Check the first NFDS descriptors each in READFDS (if not NULL) for read
   readiness, in WRITEFDS (if not NULL) for write readiness, and in EXCEPTFDS
   (if not NULL) for exceptional conditions.  If TIMEOUT is not NULL, time out
   after waiting the interval specified therein.  Returns the number of ready
   descriptors, or -1 for errors.

   This function is a cancellation point and therefore not marked with
   __THROW.  */
extern int select (int __nfds, fd_set *__restrict __readfds,
		   fd_set *__restrict __writefds,
		   fd_set *__restrict __exceptfds,
		   struct timeval *__restrict __timeout);

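nfds: specifies the total number of file descriptors being monitored. It is usually set to the largest file descriptor that select monitors plus 1, because file descriptors are numbered from 0.
readfds: points to the set of file descriptors watched for readable events;
writefds: the same for writable events;
exceptfds: the same for exceptional events.

When select returns, the kernel has modified these sets to tell the application which file descriptors are ready. The fd_set structure type is defined as follows: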

/* The fd_set member is required to be an array of longs.  */
typedef long int __fd_mask;

/* Some versions of <linux/posix_types.h> define this macros.  */
#undef	__NFDBITS
/* It's easier to assume 8-bit bytes than to get CHAR_BIT.  */
#define __NFDBITS	(8 * (int) sizeof (__fd_mask))
#define	__FD_ELT(d)	((d) / __NFDBITS)
#define	__FD_MASK(d)	((__fd_mask) (1UL << ((d) % __NFDBITS)))

/* fd_set for select and pselect.  */
typedef struct
  {
    /* XPG4.2 requires this member name.  Otherwise avoid the name
       from the global namespace.  */
#ifdef __USE_XOPEN
    __fd_mask fds_bits[__FD_SETSIZE / __NFDBITS];
# define __FDS_BITS(set) ((set)->fds_bits)
#else
    __fd_mask __fds_bits[__FD_SETSIZE / __NFDBITS];
# define __FDS_BITS(set) ((set)->__fds_bits)
#endif
  } fd_set;

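In other words, the only member of the fd_set structure is an integer array, and each bit of each array element marks one file descriptor. The number of file descriptors an fd_set can hold is given by FD_SETSIZE, which limits how many file descriptors select can handle at once.

The system provides the following macros for reading and writing the bits in an fd_set: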

#include <sys/select.h>

FD_ZERO(fd_set *fdset);				// clear all bits of fd_set
FD_SET(int fd, fd_set *fdset);		// set bit fd of fd_set
FD_CLR(int fd, fd_set *fdset);		// clear bit fd of fd_set
int FD_ISSET(int fd, fd_set *fdset);	// test whether bit fd of fd_set is set

timeout: sets the timeout of the select call. Its type is the timeval structure:

/* A time value that is accurate to the nearest
   microsecond but also has a range of years.  */
struct timeval
{
  long tv_sec;		/* Seconds.  */
  long tv_usec;	/* Microseconds.  */
};

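If a NULL pointer is passed for timeout, select blocks until some file descriptor becomes ready.

Return value: on success, the total number of ready (readable, writable, and exceptional) file descriptors. If the timeout expires with no file descriptor ready, it returns 0. On failure it returns -1 and sets errno. If the program receives a signal while select is waiting, select returns -1 immediately and sets errno to EINTR.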
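The parameters and return values above can be put together in a small sketch. The helper names wait_readable and select_demo are made up for this illustration, and a pipe stands in for a socket. Treat it as one possible way to call select, not the only one.

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <unistd.h>

/* Wait up to timeout_ms milliseconds for fd to become readable.
   Returns 1 if readable, 0 on timeout, -1 on error (errno is set).
   wait_readable is a hypothetical helper name used only in this sketch. */
int wait_readable(int fd, int timeout_ms) {
  /* An fd_set cannot represent descriptors >= FD_SETSIZE. */
  assert(fd >= 0 && fd < FD_SETSIZE);

  for (;;) {
    fd_set readfds;
    FD_ZERO(&readfds); /* select overwrites the set, so rebuild it every call */
    FD_SET(fd, &readfds);

    struct timeval tv; /* Linux may also modify the timeout, so reset it too */
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    int ret = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (ret < 0 && errno == EINTR)
      continue; /* interrupted by a signal: retry with a fresh timeout */
    if (ret <= 0)
      return ret; /* 0 = timed out, -1 = error */
    return FD_ISSET(fd, &readfds) ? 1 : 0;
  }
}

/* Usage sketch: a pipe's read end becomes readable only after a write.
   Returns 0 if select behaved as described, -1 otherwise. */
int select_demo(void) {
  int p[2];
  if (pipe(p) < 0)
    return -1;
  int before = wait_readable(p[0], 10); /* nothing written yet: timeout */
  ssize_t n = write(p[1], "x", 1);
  int after = (n == 1) ? wait_readable(p[0], 10) : -1;
  close(p[0]);
  close(p[1]);
  return (before == 0 && after == 1) ? 0 : -1;
}
```

Note that nfds is passed as fd + 1, and that both the descriptor set and the timeout are rebuilt before every call because select may overwrite them.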

1.2 File descriptor readiness conditions

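A socket is readable when:
- The number of bytes in the socket's kernel receive buffer is greater than or equal to its low-water mark SO_RCVLOWAT. The socket can then be read without blocking, and the read returns more than 0 bytes.
- The peer has closed the connection. A read on the socket then returns 0.
- A listening socket has a new connection request.
- The socket has a pending error. The error can be read and cleared with getsockopt.
A socket is writable when:
- The free space in the socket's kernel send buffer is greater than or equal to its low-water mark SO_SNDLOWAT. The socket can then be written without blocking, and the write returns more than 0 bytes.
- The write side of the socket has been shut down. Writing to such a socket raises a SIGPIPE signal.
- A non-blocking connect on the socket has succeeded or failed (timed out).
- The socket has a pending error. The error can be read and cleared with getsockopt.
A socket has an exceptional condition when:
- select handles only one kind of exceptional condition: the socket has received out-of-band data.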
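Reading and clearing a pending socket error, as mentioned in the lists above, is usually done with the SO_ERROR option. The following is a minimal sketch; take_socket_error is a hypothetical helper name, not a library function.

```c
#include <assert.h>
#include <sys/socket.h>

/* Read and clear the pending error on a socket.
   Returns the pending error code (0 when there is none),
   or -1 if getsockopt itself fails (errno is set).
   take_socket_error is a hypothetical name used only in this sketch. */
int take_socket_error(int sockfd) {
  int err = 0;
  socklen_t len = sizeof(err);
  if (getsockopt(sockfd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
    return -1;
  assert(len == sizeof(err)); /* SO_ERROR is reported as an int */
  return err;
}
```

A typical use is after a non-blocking connect: once select reports the socket as writable, a result of 0 from this helper means the connection succeeded, and any other value is the errno-style reason it failed.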

2. poll

The poll function prototype is as follows:

/* Poll the file descriptors described by the NFDS structures starting at
   FDS.  If TIMEOUT is nonzero and not -1, allow TIMEOUT milliseconds for
   an event to occur; if TIMEOUT is -1, block until an event occurs.
   Returns the number of file descriptors with events, zero if timed out,
   or -1 for errors.

   This function is a cancellation point and therefore not marked with
   __THROW.  */
extern int poll (struct pollfd *__fds, nfds_t __nfds, int __timeout);

fds: an array of pollfd structures, defined as follows:

/* Data structure describing a polling request.  */
struct pollfd
  {
    int fd;			/* File descriptor to poll.  */
    short int events;		/* Types of events poller cares about.  */
    short int revents;		/* Types of events that actually occurred.  */
  };

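Here, the fd member specifies the file descriptor; the events member tells the system which events to monitor on fd, as a bitwise OR of event flags; the revents member is modified by the kernel to tell the application which events actually occurred on fd. poll supports the following event types: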

/* Event types that can be polled for.  These bits may be set in `events'
   to indicate the interesting event types; they will appear in `revents'
   to indicate the status of the file descriptor.  */
#define POLLIN		0x001		/* There is data to read.  */
#define POLLPRI		0x002		/* There is urgent data to read.  */
#define POLLOUT		0x004		/* Writing now will not block.  */

#if defined __USE_XOPEN || defined __USE_XOPEN2K8
/* These values are defined in XPG4.2.  */
# define POLLRDNORM	0x040		/* Normal data may be read.  */
# define POLLRDBAND	0x080		/* Priority data may be read.  */
# define POLLWRNORM	0x100		/* Writing now will not block.  */
# define POLLWRBAND	0x200		/* Priority data may be written.  */
#endif

#ifdef __USE_GNU
/* These are extensions for Linux.  */
# define POLLMSG	0x400
# define POLLREMOVE	0x1000
# define POLLRDHUP	0x2000
#endif

/* Event types always implicitly polled for.  These bits need not be set in
   `events', but they will appear in `revents' to indicate the status of
   the file descriptor.  */
#define POLLERR		0x008		/* Error condition.  */
#define POLLHUP		0x010		/* Hung up.  */
#define POLLNVAL	0x020		/* Invalid polling request.  */

nfds: specifies the size of the monitored set fds. Its type is defined as follows:

/* Type used for the number of file descriptors.  */
typedef unsigned long int nfds_t;

timeout: the timeout value, an int in milliseconds. -1 makes poll block until some event occurs; 0 makes the call return immediately.
Return value: same meaning as for select.
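As a rough illustration of how events, revents, and the return value fit together, here is a small sketch. The helper names collect_ready and poll_demo are invented for this example, and pipes stand in for sockets.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Poll nfds descriptors and copy the ready ones into ready[] (capacity nfds).
   Returns the number of ready descriptors, 0 on timeout, -1 on error.
   collect_ready is a hypothetical helper used only in this sketch. */
int collect_ready(struct pollfd *fds, nfds_t nfds, int timeout_ms, int *ready) {
  assert(fds != NULL && ready != NULL);

  int ret = poll(fds, nfds, timeout_ms);
  if (ret <= 0)
    return ret;

  int count = 0;
  /* ret is the number of entries with nonzero revents, so stop once found. */
  for (nfds_t i = 0; i < nfds && count < ret; ++i) {
    /* POLLERR, POLLHUP and POLLNVAL can appear even if not requested. */
    if (fds[i].revents != 0)
      ready[count++] = fds[i].fd;
  }
  return count;
}

/* Usage sketch: of two pipes, only the one that was written to is reported.
   Returns 0 if poll behaved as described, -1 otherwise. */
int poll_demo(void) {
  int a[2], b[2], ready[2];
  if (pipe(a) < 0 || pipe(b) < 0)
    return -1;
  struct pollfd fds[2] = {
    { .fd = a[0], .events = POLLIN },
    { .fd = b[0], .events = POLLIN },
  };
  if (write(b[1], "x", 1) != 1)
    return -1;
  int n = collect_ready(fds, 2, 100, ready);
  int ok = (n == 1 && ready[0] == b[0]);
  close(a[0]); close(a[1]);
  close(b[0]); close(b[1]);
  return ok ? 0 : -1;
}
```

Unlike select, the events field is left untouched by the kernel, so the same pollfd array can be passed to poll again without being rebuilt.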

3. epoll

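3.1 Kernel event table

epoll is a Linux-specific I/O multiplexing facility. It differs considerably from select and poll in both implementation and usage. First, epoll does its work through a group of functions rather than a single one. Second, epoll keeps the events the user cares about in an event table inside the kernel, so the file descriptor sets or event sets need not be passed in again on every call as they are with select and poll. However, epoll needs one extra file descriptor to uniquely identify this event table in the kernel. That file descriptor is created with the epoll_create function: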

/* Creates an epoll instance.  Returns an fd for the new instance.
   The "size" parameter is a hint specifying the number of file
   descriptors to be associated with the new instance.  The fd
   returned by epoll_create() should be closed with close().  */
extern int epoll_create (int __size) __THROW;

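The returned file descriptor is passed as the first argument to the other epoll functions to specify which kernel event table to access.

The epoll_ctl function operates on the epoll kernel event table: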

/* Manipulate an epoll instance "epfd". Returns 0 in case of success,
   -1 in case of error ( the "errno" variable will contain the
   specific error code ) The "op" parameter is one of the EPOLL_CTL_*
   constants defined above. The "fd" parameter is the target of the
   operation. The "event" parameter describes which events the caller
   is interested in and any associated user data.  */
extern int epoll_ctl (int __epfd, int __op, int __fd,
		      struct epoll_event *__event) __THROW;

op: specifies the operation type, one of the following three:

/* Valid opcodes ( "op" parameter ) to issue to epoll_ctl().  */
#define EPOLL_CTL_ADD 1	/* Add a file descriptor to the interface.  */
#define EPOLL_CTL_DEL 2	/* Remove a file descriptor from the interface.  */
#define EPOLL_CTL_MOD 3	/* Change file descriptor epoll_event structure.  */

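These mean, respectively: register events for fd in the event table; delete the events registered for fd; modify the events registered for fd.

fd: the file descriptor to operate on.

event: specifies the events, as a pointer to an epoll_event structure, defined as follows: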

typedef union epoll_data
{
  void *ptr;
  int fd;
  uint32_t u32;
  uint64_t u64;
} epoll_data_t;

struct epoll_event
{
  uint32_t events;	/* Epoll events */
  epoll_data_t data;	/* User data variable */
} __EPOLL_PACKED;

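Here, the events member describes the event types. The event types epoll supports are largely the same as those of poll. The macro for an epoll event type is the corresponding poll macro prefixed with "E", as follows: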

enum EPOLL_EVENTS
  {
    EPOLLIN = 0x001,
#define EPOLLIN EPOLLIN
    EPOLLPRI = 0x002,
#define EPOLLPRI EPOLLPRI
    EPOLLOUT = 0x004,
#define EPOLLOUT EPOLLOUT
    EPOLLRDNORM = 0x040,
#define EPOLLRDNORM EPOLLRDNORM
    EPOLLRDBAND = 0x080,
#define EPOLLRDBAND EPOLLRDBAND
    EPOLLWRNORM = 0x100,
#define EPOLLWRNORM EPOLLWRNORM
    EPOLLWRBAND = 0x200,
#define EPOLLWRBAND EPOLLWRBAND
    EPOLLMSG = 0x400,
#define EPOLLMSG EPOLLMSG
    EPOLLERR = 0x008,
#define EPOLLERR EPOLLERR
    EPOLLHUP = 0x010,
#define EPOLLHUP EPOLLHUP
    EPOLLRDHUP = 0x2000,
#define EPOLLRDHUP EPOLLRDHUP
    EPOLLEXCLUSIVE = 1u << 28,
#define EPOLLEXCLUSIVE EPOLLEXCLUSIVE
    EPOLLWAKEUP = 1u << 29,
#define EPOLLWAKEUP EPOLLWAKEUP
    EPOLLONESHOT = 1u << 30,	// epoll-specific
#define EPOLLONESHOT EPOLLONESHOT
    EPOLLET = 1u << 31	// epoll-specific
#define EPOLLET EPOLLET
  };

Return value: 0 on success; -1 on failure, with errno set.
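Registering and removing a descriptor with epoll_ctl might look like the sketch below. The helpers add_fd, remove_fd, and epoll_ctl_demo are hypothetical names introduced for this illustration, and a pipe stands in for a socket.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Register fd in the kernel event table identified by epfd.
   data.fd stores the descriptor so epoll_wait can hand it back later.
   Returns 0 on success, -1 on error (errno is set). */
int add_fd(int epfd, int fd, uint32_t events) {
  assert(epfd >= 0 && fd >= 0);
  struct epoll_event ev = {0};
  ev.events = events;
  ev.data.fd = fd;
  return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Remove fd from the kernel event table. The event argument is ignored for
   EPOLL_CTL_DEL, but a non-NULL pointer keeps very old kernels happy. */
int remove_fd(int epfd, int fd) {
  struct epoll_event ev = {0};
  return epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &ev);
}

/* Usage sketch: returns 0 if epoll_ctl behaved as described, -1 otherwise. */
int epoll_ctl_demo(void) {
  int p[2];
  if (pipe(p) < 0)
    return -1;
  int epfd = epoll_create(1); /* size is only a hint but must be > 0 */
  if (epfd < 0)
    return -1;

  int ok = add_fd(epfd, p[0], EPOLLIN) == 0;
  /* Adding the same descriptor twice fails with EEXIST. */
  ok = ok && add_fd(epfd, p[0], EPOLLIN) == -1 && errno == EEXIST;
  ok = ok && remove_fd(epfd, p[0]) == 0;
  /* Removing a descriptor that is not registered fails with ENOENT. */
  ok = ok && remove_fd(epfd, p[0]) == -1 && errno == ENOENT;

  close(epfd);
  close(p[0]);
  close(p[1]);
  return ok ? 0 : -1;
}
```

Storing the descriptor in data.fd is a common convention rather than a requirement; data is a union, so an application can store a pointer to its own per-connection state in data.ptr instead.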

3.2 epoll_wait

The epoll_wait function waits, up to a timeout, for events on a group of file descriptors. Its prototype is as follows:

/* Wait for events on an epoll instance "epfd". Returns the number of
   triggered events returned in "events" buffer. Or -1 in case of
   error with the "errno" variable set to the specific error code. The
   "events" parameter is a buffer that will contain triggered
   events. The "maxevents" is the maximum number of events to be
   returned ( usually size of "events" ). The "timeout" parameter
   specifies the maximum wait time in milliseconds (-1 == infinite).

   This function is a cancellation point and therefore not marked with
   __THROW.  */
extern int epoll_wait (int __epfd, struct epoll_event *__events,
		       int __maxevents, int __timeout);

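Return value: the number of ready file descriptors on success; -1 on failure, with errno set.
timeout: same meaning as in poll.
maxevents: the maximum number of events to return, which must be greater than 0.
events: when epoll_wait detects events, it copies all ready events from the kernel event table (specified by epfd) into the array pointed to by its second argument, events. This array is used only to output the ready events that epoll_wait detected, unlike the array arguments of select and poll, which both pass in the events the user registered and carry out the ready events the kernel detected. This makes it much more efficient for the application to locate ready file descriptors.

Sample code for locating ready file descriptors with poll and with epoll: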

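#include <poll.h>
#include <sys/epoll.h>

#define MAX_EVENT_NUMBER 1024  // illustrative capacity; the original does not give a value

// Locate the ready file descriptors returned by poll.
// fds must hold MAX_EVENT_NUMBER registered entries.
void handle_poll(struct pollfd *fds) {
  int ret = poll(fds, MAX_EVENT_NUMBER, -1);
  if (ret <= 0) return;  // error; nothing to scan
  // Every registered file descriptor must be scanned to find the ready ones
  // (ret can of course be used for a small optimization)
  for (int i = 0; i < MAX_EVENT_NUMBER; ++i) {
    if (fds[i].revents & POLLIN) {  // is the i-th file descriptor ready?
      int sockfd = fds[i].fd;
      // handle sockfd
      (void)sockfd;
    }
  }
}

// Locate the ready file descriptors returned by epoll.
// events must have room for MAX_EVENT_NUMBER entries.
void handle_epoll(int epollfd, struct epoll_event *events) {
  int ret = epoll_wait(epollfd, events, MAX_EVENT_NUMBER, -1);
  // Only the ret ready file descriptors are scanned (the loop is skipped on error)
  for (int i = 0; i < ret; i++) {
    int sockfd = events[i].data.fd;
    // sockfd is guaranteed to be ready; handle it directly
    (void)sockfd;
  }
}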

This shows where epoll's performance gain over poll comes from: fewer file descriptors need to be scanned.

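3.3 LT and ET modes

epoll operates on file descriptors in one of two modes: LT (Level Trigger) mode and ET (Edge Trigger) mode. LT is the default mode, in which epoll behaves like a more efficient poll. When a file descriptor is registered in the epoll kernel event table with the EPOLLET event, epoll operates on that file descriptor in ET mode. ET is epoll's high-efficiency mode.

For a file descriptor in LT mode, when epoll_wait detects an event on it and notifies the application, the application does not have to handle the event right away. The next time the application calls epoll_wait, the event is reported again, and this continues until the event is handled. For a file descriptor in ET mode, once epoll_wait has detected an event and notified the application, the application must handle the event immediately, because later epoll_wait calls will not report it again. ET mode greatly reduces the number of times the same epoll event is triggered, so it is more efficient than LT mode.

Note: every file descriptor used in ET mode should be non-blocking. In ET mode the application has to keep reading or writing until the call would block; on a blocking file descriptor that last call never returns, because no further event arrives to wake it (starvation).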
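The ET behaviour described above can be sketched as follows: the descriptor is made non-blocking, and after a notification it is read until read reports EAGAIN. The names set_nonblocking, drain_fd, and et_demo are invented for this example, and a pipe stands in for a socket.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Put fd into non-blocking mode; returns 0 on success, -1 on error. */
int set_nonblocking(int fd) {
  int flags = fcntl(fd, F_GETFL);
  if (flags < 0)
    return -1;
  return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Read until the kernel buffer is empty, as ET mode requires.
   Returns the total number of bytes read, or -1 on a real error. */
ssize_t drain_fd(int fd) {
  assert(fd >= 0);
  char buf[4096];
  ssize_t total = 0;
  for (;;) {
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n > 0) { total += n; continue; }
    if (n == 0) return total;                 /* end of file / peer closed */
    if (errno == EINTR) continue;             /* interrupted: retry */
    if (errno == EAGAIN || errno == EWOULDBLOCK)
      return total;                           /* nothing left to read */
    return -1;
  }
}

/* Usage sketch: returns 0 if ET mode behaved as described, -1 otherwise. */
int et_demo(void) {
  int p[2];
  if (pipe(p) < 0 || set_nonblocking(p[0]) < 0)
    return -1;
  int epfd = epoll_create(1);
  struct epoll_event ev = {0}, out;
  ev.events = EPOLLIN | EPOLLET;
  ev.data.fd = p[0];
  if (epfd < 0 || epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) < 0)
    return -1;
  if (write(p[1], "hello", 5) != 5)
    return -1;

  int first = epoll_wait(epfd, &out, 1, 100); /* the new data is reported */
  int second = epoll_wait(epfd, &out, 1, 0);  /* not reported again, although
                                                 the data is still unread */
  ssize_t got = drain_fd(p[0]);               /* reads all 5 bytes */

  close(epfd);
  close(p[0]);
  close(p[1]);
  return (first == 1 && second == 0 && got == 5) ? 0 : -1;
}
```

Without EPOLLET, the second epoll_wait call in et_demo would report the descriptor again, which is the LT behaviour.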

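3.4 The EPOLLONESHOT event

To prevent multiple threads from reading the same socket concurrently, use epoll's EPOLLONESHOT event.

For a file descriptor registered with EPOLLONESHOT, the operating system triggers at most one of its registered readable, writable, or exceptional events, and triggers it only once, unless epoll_ctl is used to reset the EPOLLONESHOT event registered on that file descriptor. So while one thread is handling a socket, no other thread gets a chance to operate on it. Conversely, once a thread has finished handling a socket registered with EPOLLONESHOT, it should immediately reset the EPOLLONESHOT event on that socket, so that the socket's EPOLLIN event can fire the next time it becomes readable and other worker threads get a chance to handle it.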
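Resetting the EPOLLONESHOT registration could be sketched like this. The names reset_oneshot and oneshot_demo are hypothetical, a pipe stands in for a socket, and the event mask (EPOLLIN | EPOLLONESHOT, without EPOLLET) is an assumption chosen to keep the example small.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Re-arm a descriptor that was registered with EPOLLONESHOT so that its
   next readable event is delivered again. Returns 0 on success, -1 on error. */
int reset_oneshot(int epfd, int fd) {
  assert(epfd >= 0 && fd >= 0);
  struct epoll_event ev = {0};
  ev.events = EPOLLIN | EPOLLONESHOT;
  ev.data.fd = fd;
  return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}

/* Usage sketch: returns 0 if EPOLLONESHOT behaved as described, -1 otherwise. */
int oneshot_demo(void) {
  int p[2];
  if (pipe(p) < 0)
    return -1;
  int epfd = epoll_create(1);
  if (epfd < 0)
    return -1;

  struct epoll_event ev = {0}, out;
  ev.events = EPOLLIN | EPOLLONESHOT;
  ev.data.fd = p[0];
  if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) < 0)
    return -1;
  if (write(p[1], "x", 1) != 1)
    return -1;

  int first = epoll_wait(epfd, &out, 1, 100);  /* event delivered once */
  int second = epoll_wait(epfd, &out, 1, 0);   /* disabled: nothing reported,
                                                  though data is still unread */
  if (reset_oneshot(epfd, p[0]) < 0)
    return -1;
  int third = epoll_wait(epfd, &out, 1, 100);  /* re-armed: reported again */

  close(epfd);
  close(p[0]);
  close(p[1]);
  return (first == 1 && second == 0 && third == 1) ? 0 : -1;
}
```

In a multithreaded server, the worker thread would call reset_oneshot only after it has finished with the socket, which is what keeps two threads from handling the same socket at once.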

4. Comparison of the three groups of I/O multiplexing functions


https://arcsin2.cloud/2024/05/04/I-O-复用/

Author: arcsin2

Published: May 4, 2024
