The Null Value Header Bug: Why Clients Didn't Receive Packets in Spring Gateway

206

Dependency Version

  • Spring Boot: 2.2.6.RELEASE

  • Tomcat Embedded Core: 9.0.79

Background

This problem surfaced in our API gateway service, which we built using Spring Gateway. The service handles exceptions by implementing the ErrorWebExceptionHandler interface, returning processed exception details to the client and adding some header information. Everything was functioning well until one day, after a code merge, issues began to sporadically appear. Here's what we observed:

Whenever a request processing exception arose, it would be addressed by the GatewayErrorWebExceptionHandler. However, oddly enough, clients weren't receiving any responses. Our server-side logs looked fine, but it appeared as if the connection was somehow stalled, preventing the client-side from getting any feedback. Further analysis, specifically packet captures, confirmed our suspicion: no data packets were being transmitted. A deeper dive into the GatewayErrorWebExceptionHandler revealed an inconsistency with the traceID. In certain cases, this traceID might be absent. Notably, with the gateway rooted in WebFlux and Netty as its foundational network library, it's important to highlight that Netty performs null checks when headers are added. Encountering a null value triggers an exception, which seemed counterintuitive to us.

public class GatewayErrorWebExceptionHandler implements ErrorWebExceptionHandler {
  public Mono<Void> handle(ServerWebExchange exchange, Throwable ex) {
    ...
    String traceID = exchange.getRequest().getHeaders().getFirst("traceID");
    exchange.getResponse().getHeaders().add("traceID", traceID);
    ...
  }
}

Problem Identification

Using the arthas tool, the call stack was traced:

[arthas@3423878]$ stack *.GatewayErrorWebExceptionHandler *

It was observed that the underlying servlet had changed to Tomcat instead of the expected Netty:

[arthas@3423878]$ stack *.GatewayErrorWebExceptionHandler *
Press Q or Ctrl+C to abort.
Affect(class count: 2 , method count: 11) cost in 212 ms, listenerId: 1
ts=2023-07-29 19:01:14;thread_name=http-nio-8080-exec-1;id=20;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@6cc8c13c
    @com.xxx.server.gateway.filter.GatewayErrorWebExceptionHandler.handle()
        at org.springframework.web.server.handler.ExceptionHandlingWebHandler.lambda$handle$0(ExceptionHandlingWebHandler.java:77)
        ... 
        at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.drain(MonoIgnoreThen.java:172)
        at reactor.core.publisher.MonoIgnoreThen.subscribe(MonoIgnoreThen.java:56)
        at reactor.core.publisher.Mono.subscribe(Mono.java:4210)
        at org.springframework.http.server.reactive.ServletHttpHandlerAdapter.service(ServletHttpHandlerAdapter.java:192)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:202)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96)
        at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:541)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:139)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343)
        at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:373)
        at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65)
        at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:868)
        at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1594)
        at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
        at java.lang.Thread.run(Thread.java:829)

Upon further investigation, it was found that the tomcat-embed-core package had been added to the project's dependencies, which changed the underlying Servlet container of Spring Gateway

<dependency>
    <groupId>org.apache.tomcat.embed</groupId>
    <artifactId>tomcat-embed-core</artifactId>
</dependency>

Subsequently, I speculated whether Tomcat might have swallowed the null pointer exception during its processing. Time to delve into the Tomcat source code: Http11Processor.java

/**
  * When committing the response, we have to validate the set of headers, as
  * well as setup the response filters.
  */
@Override
protected final void prepareResponse() throws IOException {
  ...
  // Build the response header
  try {
    outputBuffer.sendStatus();
	int size = headers.size();
	for (int i = 0; i < size; i++) {
		outputBuffer.sendHeader(headers.getName(i), headers.getValue(i));
	}
	outputBuffer.endHeaders();
  } catch (Throwable t) {
    ...
	outputBuffer.resetHeaderBuffer();
	throw t;
  }
  outputBuffer.commit();
}

As expected, an exception could potentially arise at headers.getValue(i). Moreover, when this exception is thrown outward, only I/O exceptions are handled externally

public final void action(ActionCode actionCode, Object param) {
  switch (actionCode) {
  // 'Normal' servlet support
  case COMMIT: {
    if (!response.isCommitted()) {
      try {
        // Validate and write response headers
        prepareResponse();
        } catch (IOException e) {
          handleIOException(e);
        }
      }
    break;
  }
  ...
}

private void handleIOException (IOException ioe) {
  if (ioe instanceof CloseNowException) {
    // Close the channel but keep the connection open
	setErrorState(ErrorState.CLOSE_NOW, ioe);
  } else {
	// Close the connection and all channels within that connection
	setErrorState(ErrorState.CLOSE_CONNECTION_NOW, ioe);
  }
}

public static void handleThrowable(Throwable t) {
   if (t instanceof ThreadDeath) {
      throw (ThreadDeath) t;
   }
   ...
   // All other instances of Throwable will be silently swallowed
}

No logs are printed or exceptions thrown outward. Only when an IOException occurs is the connection closed. In this case, it's a null pointer exception, which explains why the client remains unresponsive. When the OutputBuffer flushes the cache, it merely sets doFlush to false, effectively swallowing the exception.

@Override
public void flush() throws IOException {
  doFlush(true);
}

Furthermore, Spring Reactor's implementation class for Tomcat's AbstractListenerWriteFlushProcessor doesn't handle the event when it receives a send signal. Neither of the two libraries does anything or prints any logs, leading the client to continuously wait for the request response. This part of the code can be found in: AbstractListenerWriteFlushProcessor.java

try {
  processor.flush();
}
catch (Throwable ex) {
  processor.flushingFailed(ex); 
  return;
}
public abstract class AbstractListenerWriteFlushProcessor<T> implements Processor<Publisher<? extends T>, Void> {
  /**
  * Invoked when an error happens while flushing. Sub-classes may choose
  * to ignore this if they know the underlying API will provide an error
  * notification in a container thread.
  * <p>Defaults to no-op.
  */
  protected void flushingFailed(Throwable t) {
  }
}

Compared to WebMvc, when using Tomcat as the servlet container, the strategy for handling headers is: headers with null values are directly filtered out during the filter process: Response.java

private void addHeader(String name, String value, Charset charset) {
  if (name == null || name.length() == 0 || value == null) {
    return;
  }
  ...
}

However, this filtering operation only exists in the Spring MVC processing chain. In this case, the gateway project relies on WebFlux. So, when the traceID header with a null value is ignored, Tomcat encounters a null pointer exception while preparing to write buffered data to the socket. This exception is internally swallowed, causing the connection to remain open indefinitely without writing any data.

Solution

  1. Remove the tomcat-embed-core dependency from the common module and only introduce the java-servlet-api, allowing Spring Gateway to revert to using Netty as the underlying server.

  2. In GatewayErrorWebExceptionHandler, perform a null check on traceID to prevent adding empty headers.

public class GatewayErrorWebExceptionHandler implements ErrorWebExceptionHandler {
  ...
  public Mono<Void> handle(ServerWebExchange exchange, Throwable ex) {
    ...
    String traceID = exchange.getRequest().getHeaders().getFirst("traceID");
    if(traceID != null){
      exchange.getResponse().getHeaders().add("traceID", traceID);
    }
  }
}

This case highlights the importance of understanding the technical stack and being clear about the project's dependencies. Only by doing so can we quickly locate and solve problems when they arise. Additionally, testing confirmed that undertow doesn't exhibit this issue. This is because the version of Spring in question has overloaded flushingFailed for undertow's AbstractListenerWriteFlushProcessor implementation class. Only Tomcat handles internal null pointers in this manner.

However, two years ago, someone had already fixed this issue. But the version in our project was outdated.

Conclusion

This actually reflects a problem:

  1. There are significant differences between versions of open-source libraries, especially frameworks like Spring. The interactions between them and the underlying servlet containers can change with version updates, which developers need to be aware of during the development process.