php · bjaspan · Apr 30, 2014 · May 6, 2014 · Sep 5, 2014 · Oct 20, 2014
diff --git a/sapi/fpm/DESIGN.md b/sapi/fpm/DESIGN.md
@@ -0,0 +1,186 @@
+# Introduction
+
+This document describes the design of PHP-FPM. It is woefully incomplete, but
+at least we are now starting.
+
+## Request stages
+
+FPM defines a set of request stages for tracking a child process' current
+status:
+
+* FPM_REQUEST_ACCEPTING: About to call or blocked in accept() on the pool's
+  listening socket. The parent considers a child in this stage to be idle.
+* FPM_REQUEST_ACCEPTED: Just returned from accept() but not yet in
+  READING_HEADERS. This stage exists so that ondemand workers can enter it
+  before signaling the parent that they returned from accept(). Without it, a
+  race condition exists in which the parent could decide the child was idle
+  just *before* it began processing a request, and thus not start listening on
+  the pool's socket again.
+* FPM_REQUEST_READING_HEADERS: 
+* FPM_REQUEST_INFO: 
+* FPM_REQUEST_EXECUTING: 
+* FPM_REQUEST_END: 
+* FPM_REQUEST_FINISHED:
+
+
+## Scoreboard
+
+### Allocation
+
+FPM maintains a scoreboard data structure that tracks information about the
+current state and accumulated activities of workers. Each fpm_worker_pool_s
+contains one fpm_scoreboard_s structure, and wp->scoreboard->procs is an array
+of wp->config->mp_max_children pointers to fpm_scoreboard_proc_s
+structures. All of these structures are stored in shared memory allocated with
+mmap(MAP_ANONYMOUS|MAP_SHARED).
+
+The scoreboard proc slots are allocated as needed to children by
+fpm_scoreboard_proc_alloc() and _free(), which are non-locked functions that
+can only be called by the parent. Each scoreboard proc has a boolean used
+flag. The scoreboard maintains an index, free_proc, which indicates that a proc
+that might be unused. A scoreboard proc is allocated to a specific child
+process by fpm_scoreboard_proc_alloc(). This function first checks whether
+scoreboard->procs[scoreboard->free_proc] is unused, and if it is scans through
+the array looking for one that is unused. Whichever way it finds one, it marks
+that proc used and provides the now-allocated index to its caller. The
+free_proc hint is then incremented one (circularly) to suggest a new free
+proc. fpm_scoreboard_proc_free() releases a proc slot by memset()ing it to
+zero, and setting scoreboard->free_proc to the just-released index.
+
+Just before the parent forks a new child, it allocated a scoreboard proc for
+the impending child in fpm_resources_prepare(). Once the child is forked,
+fpm_child_resources_use() first frees the scoreboard for all other worker pools
+(to prevent accidentally updating them, presumably), and calls
+fpm_scoreboard_child_use() which sets the global fpm_scoreboard pointer to the
+worker's scoreboard and fpm_scoreboard_i to the child's allocated index in the
+worker's scoreboard->procs array. Finally, the child's scoreboard proc is
+updated with the child's pid and start time.
+
+A scoreboard proc is atomically locked and unlocked with
+fpm_scoreboard_proc_acquire() and _release(). The parent specifies the worker
+pool scoreboard and proc index to lock. A child specifies sentinel values that
+instruct the functions to use the global fpm_scoreboard and fpm_scoreboard_i
+variables.
+
+### Tracking request stages
+
+Each scoreboard proc structure has a field, enum fpm_request_stage_e
+request_stage, to track the current stage. Functions such as
+fpm_request_accepting() are called from a child to lock the child's scoreboard
+proc, update the request stage, set whatever other proc values are appropriate
+for the stage, and release the proc.
+
+The parent can inspect a child's current stage via functions like
+fpm_request_is_idle(), which accepts a specific fpm_child_s structure,
+retrieves its scoreboard proc, and checks the proc's request_stage.
+
+Note that fpm_request_is_idle() explicitly DOES NOT lock the child's proc
+structure before reading the request_stage. In Java, this would mean that the
+parent had no guarantee of reading the current value. In C, with no specific
+memory model, ... all bets are off, I suspect.
+
+## Process management styles
+
+There are three supported process management styles, but they all share a
+common flow. In main(), the parent calls fpm_run(). fpm_run() forks initial
+worker processes for each pool, if any, with fpm_children_create_initial(), and
+calls fpm_event_loop() which runs forever.
+
+The forked children return from fpm_run(), inheriting their listening socket
+from the parent. They do a little more initialization, and enter their main loop
+calling fcgi_accept_request(). Each call to that function calls accept() on
+the socket and then reads and processes a single FastCGI request.
+
+### Static
+
+### Dynamic
+
+### Ondemand
+
+No initial worker processes are started. Instead, fpm_children_create_initial()
+installs an FPM_EV_READ event on the pool's listening socket. When a connection
+arrives, the event calls fpm_pctl_on_socket_accept(). If there is an idle child
+available to run the request, the function does nothing; an idle child is
+already waiting in accept() and will respond. Otherwise, if the current system
+state allows it, the function forks a new child; if not, the request function
+does nothing, and the request stays queued until the next child is
+available. Either way, the parent trusts that a newly launched or existing
+child will eventually handle the request.
+
+The strategy for deciding when to fork is subtle. The quick summary:
+
+* fpm_pctl_on_socket_accept() always removes the pool's listening socket,
+* a child always restores its pool's listening socket after calling accept(),
+* a child's first call to accept() is non-blocking,
+* a pool's listening socket is always restore when a child exits, and
+* a timer restores listening sockets removed due to external system state.
+
+In detail:
+
+Suppose FPM uses a normal level-triggered select()/poll() on the listening
+socket. When a connection arrives, fpm_pctl_on_socket_accept() is triggered by
+the event system. It finds no idle children, so it forks one, and returns. If
+the parent polls again before the child makes it to accept(), the connection is
+still pending so fpm_pctl_on_socket_accept() is called again. The child is
+idle, so the function returns, but until the child calls accept(), the parent
+will spin in this loop. Also, there are a race conditions in which loop occurs
+before the child is marked idle (which happens immediately prior to its calling
+accept()), in which case the parent will rapidly fork extra processes.
+
+To avoid this, the initial version of ondemand used edge-triggering on the
+listening socket (and thus only worked on systems that support epoll() or
+kqueue()). This prevents the spin loop because each pending connection is only
+returned once, and thus only triggers one call to fpm_pctl_on_socket_accept()
+no matter how long the child takes to accept() it. Unfortunately, it causes
+different problems. Edge triggering semantics requires that the polling process
+read the edge-triggered file until it is empty; in this case, that requires
+calling accept() until it returns EAGAIN/EWOULDBLOCK. However, with ondemand,
+the parent process polls the listening socket, while the child processes call
+accept(), so there is no way for the parent to comply with the edge-triggering
+contract. As a result, multiple requests can arrive, generate only a single
+edge-triggered event, and launch insufficient processes to handle the requests.
+
+The solution is to use level-triggered polling in a way that avoids spinning
+the CPU. The trick is for the parent to remove a pool's listening socket from
+its polled set during necessary periods, and to restore it when appropriate. To
+facilitate restoring the listening socket, during
+fpm_children_create_initial(), the parent creates a per-pool child_accept pipe
+and registers an event to call fpm_pctl_on_child_accept() when the pipe is
+readable. When invoked, this function reads sizeof(pid_t) from one of the
+children and adds the listening socket event back to the polling set.
+
+For removing the listening socket event, there are three cases to
+consider. Each time a request arrives and fpm_pctl_on_socket_accept() is
+invoked:
+
+* An idle child is available. The parent removes the listening socket. When the
+idle child returns from accept(), it writes to the child_accept pipe, restoring
+the listening socket.
+
+* No idle child is available, and the parent forks a new child. The parent
+removes the listening socket. The child starts and, after calling accept(), it
+writes to the child_accept pipe. For this to work, the first call to accept()
+must be non-blocking---a sibling process may grab the request first, and the
+new child must inform the parent it called accept() so new requests can be
+processed whether accept() suceeded or failed.
+
+* No idle child is available, but the parent cannot fork a new child due to
+system state (such as already having pm_max_procs children). If there are zero
+children and the parent cannot fork one, the parent exits with an error
+status. Otherwise, the parent again removes the listening socket, because there
+is no point in acting on a pending request when there is no child process to
+run it (since level triggering is in place, any pending requests will again
+generate a poll event when the socket is restored). The listening socket event
+is restored when:
+
+	* A child exits for any reason, since now a new one can possiby be
+      started. The SIGCHLD triggers a call to fpm_children_bury(), which looks
+      up the child's fpm_child_s record, and writes to the child's child_accept
+      pipe regardless of why it exited.
+
+	* A child becomes idle, since it may empty the pending queue. An idle child
+      immediately calls accept(), by definition. If there is still a pending
+      request, accept() returns, and as per above, the child writes to its
+      child_accept pipe. If there is no pending request, it is okay that the
+      parent continues ignoring listening socket; the next request to arrive
+      will go to an idle child which will write to its child_accept pipe.
diff --git a/sapi/fpm/fpm/fastcgi.c b/sapi/fpm/fpm/fastcgi.c
@@ -472,7 +472,7 @@ static int fcgi_get_params(fcgi_request *req, unsigned char *p, unsigned char *e
 			break;
 		}
 		if (eff_name_len >= buf_size-1) {
-			if (eff_name_len > ((uint)-1)-64) { 
+			if (eff_name_len > ((uint)-1)-64) {
 				ret = 0;
 				break;
 			}
@@ -803,7 +803,35 @@ static int fcgi_is_allowed() {
 	return 0;
 }
 
-int fcgi_accept_request(fcgi_request *req)
+/**
+ * Set a file descriptor to be blocking or non-blocking.
+ *
+ * @param fd: the file descriptor
+ * @param nonblocking: zero to set blocking, non-zero to set nonblocking
+ * @returns 0 on success, <0 on error
+ */
+int fcgi_set_nonblocking(int fd, int nonblocking)
+{
+  int flags = fcntl(fd, F_GETFL, 0);
+  if (flags < 0) {
+    return flags;
+  }
+  flags = nonblocking ? (flags | O_NONBLOCK) : (flags & ~O_NONBLOCK);
+  return fcntl(fd, F_SETFL, flags);
+}
+
+/**
+* Accept a connect and parse a FastCGI request.
+*
+* @param req: the request structure to fill in
+* @param nonblock_once: if non-zero, the listening socket is set nonblocking
+*   for one call to accept(); if accept() fails with EAGAIN/EWOULDBLOCK, it is
+*   called a second time in blocking mode.
+* @param accept_pipe: if non-negative, write the pid to this fd every time
+*   accept() returns, regardless of its return value.
+* @returns the accepted fd, or -1 on error.
+*/
+int fcgi_accept_request(fcgi_request *req, int nonblock_once, int accept_pipe)
 {
 #ifdef _WIN32
 	HANDLE pipe;
@@ -843,22 +871,47 @@ int fcgi_accept_request(fcgi_request *req)
 				{
 					int listen_socket = req->listen_socket;
 #endif
-					sa_t sa;
-					socklen_t len = sizeof(sa);
-
-					fpm_request_accepting();
-
-					FCGI_LOCK(req->listen_socket);
-					req->fd = accept(listen_socket, (struct sockaddr *)&sa, &len);
-					FCGI_UNLOCK(req->listen_socket);
-
-					client_sa = sa;
-					if (req->fd >= 0 && !fcgi_is_allowed()) {
-						closesocket(req->fd);
-						req->fd = -1;
-						continue;
-					}
-				}
+          sa_t sa;
+          socklen_t len = sizeof(sa);
+          int accept_errno;
+          do {
+            fpm_request_accepting();
+
+            FCGI_LOCK(req->listen_socket);
+            if (nonblock_once) {
+              fcgi_set_nonblocking(req->listen_socket, 1);
+            }
+            req->fd = accept(listen_socket, (struct sockaddr *)&sa, &len);
+            accept_errno = errno;
+            if (nonblock_once) {
+              fcgi_set_nonblocking(req->listen_socket, 0);
+              nonblock_once = 0;
+            }
+            // Mark ourselves as no longer ACCEPTING before telling
+            // the parent to restore our listening
+            // socket. Otherwise, the parent can decide we are
+            // still idle and about to grab a new request when we
+            // are not.
+            fpm_request_accepted();
+            if (accept_pipe >= 0) {
+              pid_t pid = getpid();
+              if (write(accept_pipe, &pid, sizeof(pid)) != sizeof(pid)) {
+                zlog(ZLOG_ERROR, "Failed writing to accept_pipe; something bad is going to happen.");
+              }
+            }
+            if (req->fd < 0 && (accept_errno == EAGAIN || accept_errno == EWOULDBLOCK)) {
+              zlog(ZLOG_DEBUG, "Child pid %d lost the accept() race, retrying", getpid());
+            }
+            FCGI_UNLOCK(req->listen_socket);
+          } while (req->fd < 0 && (accept_errno == EAGAIN || accept_errno == EWOULDBLOCK));
+
+          client_sa = sa;
+          if (req->fd >= 0 && !fcgi_is_allowed()) {
+            closesocket(req->fd);
+            req->fd = -1;
+            continue;
+          }
+        }
 
 #ifdef _WIN32
 				if (req->fd < 0 && (in_shutdown || errno != EINTR)) {
@@ -1041,8 +1094,8 @@ ssize_t fcgi_write(fcgi_request *req, fcgi_request_type type, const char *str, i
 				return -1;
 			}
 			pos += 0xfff8;
-		}		
-		
+		}
+
 		pad = (((len - pos) + 7) & ~7) - (len - pos);
 		rest = pad ? 8 - pad : 0;
 

diff --git a/sapi/fpm/fpm/fastcgi.h b/sapi/fpm/fpm/fastcgi.h
@@ -115,7 +115,7 @@ typedef struct _fcgi_request {
 int fcgi_init(void);
 void fcgi_shutdown(void);
 void fcgi_init_request(fcgi_request *req, int listen_socket);
-int fcgi_accept_request(fcgi_request *req);
+int fcgi_accept_request(fcgi_request *req, int nonblock_once, int accept_pipe);
 int fcgi_finish_request(fcgi_request *req, int force_close);
 
 void fcgi_set_in_shutdown(int);

diff --git a/sapi/fpm/fpm/fpm_children.c b/sapi/fpm/fpm/fpm_children.c
@@ -146,8 +146,9 @@ static struct fpm_child_s *fpm_child_find(pid_t pid) /* {{{ */
 static void fpm_child_init(struct fpm_worker_pool_s *wp) /* {{{ */
 {
 	fpm_globals.max_requests = wp->config->pm_max_requests;
-
-	if (0 > fpm_stdio_init_child(wp)  ||
+
+	if (0 > fpm_pctl_init_child(wp)   ||
+	    0 > fpm_stdio_init_child(wp)  ||
 	    0 > fpm_log_init_child(wp)    ||
 	    0 > fpm_status_init_child(wp) ||
 	    0 > fpm_unix_init_child(wp)   ||
@@ -187,6 +188,13 @@ void fpm_children_bury() /* {{{ */
 
 		child = fpm_child_find(pid);
 
+		if (child->wp->config->pm == PM_STYLE_ONDEMAND) {
+		  pid_t pid = getpid();
+		  if (write(child->wp->child_accept_pipe[1], &pid, sizeof(pid)) != sizeof(pid)) {
+		    zlog(ZLOG_ERROR, "Failed writing to accept_pipe; something bad is going to happen.");
+		  }
+		}
+
 		if (WIFEXITED(status)) {
 
 			snprintf(buf, sizeof(buf), "with code %d", WEXITSTATUS(status));
@@ -444,10 +452,26 @@ int fpm_children_create_initial(struct fpm_worker_pool_s *wp) /* {{{ */
 		}
 
 		memset(wp->ondemand_event, 0, sizeof(struct fpm_event_s));
-		fpm_event_set(wp->ondemand_event, wp->listening_socket, FPM_EV_READ | FPM_EV_EDGE, fpm_pctl_on_socket_accept, wp);
+		fpm_event_set(wp->ondemand_event, wp->listening_socket, FPM_EV_READ, fpm_pctl_on_socket_accept, wp);
 		wp->socket_event_set = 1;
 		fpm_event_add(wp->ondemand_event, 0);
 
+		if (pipe(wp->child_accept_pipe) < 0) {
+		  zlog(ZLOG_ERROR, "[pool %s] unable to create child_accept_pipe", wp->config->name);
+		  // FIXME handle crash
+		  return 1;
+		}
+
+		wp->child_accept_event = (struct fpm_event_s *)malloc(sizeof(struct fpm_event_s));
+		if (!wp->child_accept_event) {
+			zlog(ZLOG_ERROR, "[pool %s] unable to malloc the child_accept socket event", wp->config->name);
+			// FIXME handle crash
+			return 1;
+		}
+		memset(wp->child_accept_event, 0, sizeof(struct fpm_event_s));
+		fpm_event_set(wp->child_accept_event, wp->child_accept_pipe[0], FPM_EV_READ, fpm_pctl_on_child_accept, wp);
+		fpm_event_add(wp->child_accept_event, 0);
+
 		return 1;
 	}
 	return fpm_children_make(wp, 0 /* not in event loop yet */, 0, 1);

diff --git a/sapi/fpm/fpm/fpm_conf.c b/sapi/fpm/fpm/fpm_conf.c
@@ -822,11 +822,6 @@ static int fpm_conf_process_all_pools() /* {{{ */
 		} else if (wp->config->pm == PM_STYLE_ONDEMAND) {
 			struct fpm_worker_pool_config_s *config = wp->config;
 
-			if (!fpm_event_support_edge_trigger()) {
-				zlog(ZLOG_ALERT, "[pool %s] ondemand process manager can ONLY be used when events.mechanisme is either epoll (Linux) or kqueue (*BSD).", wp->config->name);
-				return -1;
-			}
-
 			if (config->pm_process_idle_timeout < 1) {
 				zlog(ZLOG_ALERT, "[pool %s] pm.process_idle_timeout(%ds) must be greater than 0s", wp->config->name, config->pm_process_idle_timeout);
 				return -1;