P2 · Network Infrastructure

HTTP 504 Gateway Timeout — Reverse Proxy or API Gateway Cannot Reach Upstream Server

An HTTP 504 Gateway Timeout is returned when a reverse proxy, API gateway, or load balancer (e.g., Nginx, HAProxy, AWS ALB, IIS ARR) fails to receive a timely response from the upstream application server or backend service. The upstream may be down, overloaded, firewalled off, or suffering a network path failure between the proxy and the backend. Resolution requires isolating whether the failure is at the application layer (crashed service, resource exhaustion), the network layer (routing loss, firewall block), or the configuration layer (proxy timeout set too short for legitimate upstream response times), then applying the corresponding fix — service restart, capacity scaling, timeout tuning, or firewall rule correction.

Indicators

HTTP 504 status code returned to the client by the proxy or gateway layer
Browser or API client displays '504 Gateway Timeout' error page
Proxy or gateway access logs record upstream connection timeout events with the specific upstream host, port, and elapsed time
Upstream application server logs show no corresponding inbound request for the failing time window, suggesting a network-layer failure before the request reaches the backend
Monitoring alerts on elevated upstream response time, failed health checks, or rising 5xx error rate on the gateway

Likely causes

Upstream application server process has crashed, stopped, or is not accepting new TCP connections
Network path between the proxy/gateway and upstream server is degraded or broken (packet loss, routing failure, MTU mismatch)
Upstream server is overloaded (CPU, memory, connection pool, file descriptor exhaustion) and cannot process requests within the proxy timeout period
Firewall or cloud security group rules are silently dropping TCP traffic between the proxy IP range and the upstream port
DNS resolution failure preventing the proxy from resolving the upstream hostname to an IP
Upstream service is performing a long-running operation (e.g., slow database query, external API call) that exceeds the gateway timeout threshold
Proxy timeout values are misconfigured too short relative to the upstream's legitimate 99th-percentile response time

Diagnostic steps

Review proxy or gateway access and error logs for the exact upstream address, port, and timeout duration at the time of 504 errors. On Nginx: `tail -100 /var/log/nginx/error.log | grep upstream`. On Apache: `tail -100 /var/log/apache2/error.log | grep proxy`. On AWS ALB: filter access logs in S3 for records with elb_status_code=504 and note the target_status_code and target_processing_time fields.

Identifies which upstream host and port is failing to respond, and whether the timeout is consistently hitting a specific threshold — distinguishes a single-backend failure from a systemic issue.
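
When the Nginx error log is large, the review above can be scripted. The following is a minimal Python sketch, assuming the stock Nginx error-log wording for upstream timeouts; the function name `count_upstream_timeouts` and the sample log lines are illustrative and not taken from a real incident.

```python
import re
from collections import Counter

# Assumes the standard Nginx error-log wording, for example:
#   ... upstream timed out (110: Connection timed out) while reading
#   response header from upstream, ..., upstream: "http://10.0.0.5:8080/x", ...
TIMEOUT_RE = re.compile(
    r'upstream timed out \(\d+: [^)]*\) while (?P<phase>[^,]+),'
    r'.*?upstream: "[a-z]+://(?P<addr>[^/"]+)'
)


def count_upstream_timeouts(lines):
    """Count timeout events per (upstream host:port, phase)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("addr"), match.group("phase"))] += 1
    return counts


# Hypothetical sample lines, for illustration only.
SAMPLE_LINES = [
    '2024/05/01 10:00:01 [error] 1234#1234: *101 upstream timed out '
    '(110: Connection timed out) while reading response header from upstream, '
    'client: 203.0.113.7, server: example.com, request: "GET /api/orders HTTP/1.1", '
    'upstream: "http://10.0.0.5:8080/api/orders", host: "example.com"',
    '2024/05/01 10:00:09 [error] 1234#1234: *102 upstream timed out '
    '(110: Connection timed out) while reading response header from upstream, '
    'client: 203.0.113.8, server: example.com, request: "GET /api/orders HTTP/1.1", '
    'upstream: "http://10.0.0.5:8080/api/orders", host: "example.com"',
    '2024/05/01 10:00:12 [error] 1234#1234: *103 upstream timed out '
    '(110: Connection timed out) while connecting to upstream, '
    'client: 203.0.113.9, server: example.com, request: "GET /api/users HTTP/1.1", '
    'upstream: "http://10.0.0.6:8080/api/users", host: "example.com"',
    # Not a timeout: a refused connection is ignored by the counter.
    '2024/05/01 10:00:15 [error] 1234#1234: *104 connect() failed '
    '(111: Connection refused) while connecting to upstream, '
    'client: 203.0.113.9, server: example.com, request: "GET /api/users HTTP/1.1", '
    'upstream: "http://10.0.0.5:8080/api/users", host: "example.com"',
]

if __name__ == "__main__":
    for (addr, phase), n in count_upstream_timeouts(SAMPLE_LINES).most_common():
        print(f"{n:4d}  {addr}  ({phase})")
```

Timeouts in the "reading response header from upstream" phase usually point at a slow backend or a short `proxy_read_timeout`, while timeouts in the "connecting to upstream" phase point at the network path or a firewall.
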
From the proxy/gateway server, test direct TCP connectivity to the upstream host and port: `curl -v --max-time 10 http://<upstream-host>:<port>/health` and `telnet <upstream-host> <port>` (or `nc -zv <upstream-host> <port>`). If the upstream uses HTTPS internally: `curl -vk --max-time 10 https://<upstream-host>:<port>/health`.

Determines whether the upstream is reachable at the TCP layer from the proxy's perspective — isolates network-layer failure (connection refused, timeout) from application-layer failure (connection succeeds but response is slow or malformed).
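
The distinction between a refused connection and a silent timeout can also be captured in a small script, which is useful when many upstream hosts have to be checked. This is a sketch using only the Python standard library; the function name `probe_tcp` and the result labels are illustrative choices.

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Classify a TCP connection attempt from this host to host:port.

    Returns one of:
      "open"     the three-way handshake completed
      "refused"  the peer answered with a reset (nothing listening,
                 or a firewall rule that rejects rather than drops)
      "timeout"  no answer within `timeout` seconds (host down,
                 packets silently dropped, or broken routing)
      "error: <detail>"  any other failure, e.g. DNS resolution
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"
```

A call such as `probe_tcp("<upstream-host>", 8080)` (port chosen here only as an example) should be run from the proxy server itself, because security groups and routing are evaluated from the proxy's source address.
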
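Check upstream application server health: `systemctl status <service-name>` to confirm the process is running. Check resource utilisation: `top` or `vmstat 1 5` for CPU/memory pressure. Check open file descriptors against the limit of the running process: `ls /proc/<pid>/fd | wc -l` and `grep 'open files' /proc/<pid>/limits` (`ulimit -n` reports the limit of the current shell, which can differ from the limit applied to the service). Verify the upstream is listening on the expected port: `ss -tlnp | grep <port>`.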

Identifies whether the upstream is running but overwhelmed (resource exhaustion, connection queue full) or has crashed and is not listening — drives the choice between restart vs. scale-out.
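
File descriptor exhaustion is easiest to judge as a ratio of open descriptors to the soft limit of the running process. The sketch below assumes a Linux host with `/proc` mounted and enough privileges to read the target process's entries (the same user or root); the helper names `parse_soft_nofile` and `fd_usage` are illustrative.

```python
import os


def parse_soft_nofile(limits_text):
    """Return the soft 'Max open files' limit from /proc/<pid>/limits text.

    Returns None if the line is missing or the limit is 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max open files", soft limit, hard limit, units.
            soft = line.split()[3]
            return int(soft) if soft.isdigit() else None
    return None


def fd_usage(pid):
    """Return (open_fds, soft_limit) for a process on Linux."""
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    with open(f"/proc/{pid}/limits") as fh:
        soft_limit = parse_soft_nofile(fh.read())
    return open_fds, soft_limit
```

If the open count is close to the soft limit, new connections fail at `accept()` even though the process still appears healthy in `systemctl status`.
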
Run a traceroute or MTR from the proxy server to the upstream server to identify network path degradation: `traceroute <upstream-host>` or `mtr --report --report-cycles 20 <upstream-host>`. Look for hops with >5% packet loss or latency spikes inconsistent with expected network topology.

Pinpoints any intermediate network hop introducing latency or packet loss between the proxy and upstream — required before escalating to the network team or cloud provider.
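
The 5% loss threshold mentioned above can be applied to a saved `mtr --report` output automatically. This is a sketch that assumes the default report columns (Loss%, Snt, Last, Avg, Best, Wrst, StDev); the function name `lossy_hops` and the sample report are hypothetical.

```python
import re

# Matches hop rows of `mtr --report`, for example:
#   "  2.|-- 10.0.1.1     15.0%    20    5.1  48.7   4.9 210.3  60.2"
HOP_RE = re.compile(
    r'^\s*(?P<hop>\d+)\.\|--\s+(?P<host>\S+)\s+(?P<loss>\d+(?:\.\d+)?)%'
    r'\s+\d+\s+\S+\s+(?P<avg>\d+(?:\.\d+)?)'
)


def lossy_hops(report_text, loss_threshold=5.0):
    """Return (hop, host, loss_percent, avg_ms) for hops above the threshold."""
    hops = []
    for line in report_text.splitlines():
        match = HOP_RE.match(line)
        if match and float(match.group("loss")) > loss_threshold:
            hops.append((
                int(match.group("hop")),
                match.group("host"),
                float(match.group("loss")),
                float(match.group("avg")),
            ))
    return hops


# Hypothetical report, for illustration only.
SAMPLE_REPORT = """\
HOST: proxy01                     Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.0.0.1                   0.0%    20    0.3   0.4   0.3   0.9   0.1
  2.|-- 10.0.1.1                  15.0%    20    5.1  48.7   4.9 210.3  60.2
  3.|-- 10.0.2.15                 15.0%    20    5.4  51.2   5.0 215.8  61.0
"""
```

Loss that appears at an intermediate hop but not at the final hop often reflects ICMP rate limiting on that router rather than real packet loss, so loss that persists through to the destination (as in the sample) is the stronger signal.
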
Review the proxy timeout configuration and compare against upstream response time metrics. On Nginx: `grep -E 'proxy_(read|connect|send)_timeout' /etc/nginx/nginx.conf /etc/nginx/conf.d/*.conf`. On Apache: `grep -r ProxyTimeout /etc/apache2/`. On AWS ALB: the idle timeout is an attribute of the load balancer, not of the target group; check `idle_timeout.timeout_seconds` in the AWS Console or via CLI: `aws elbv2 describe-load-balancer-attributes --load-balancer-arn <arn>`. Compare timeout values against the upstream's actual 99th-percentile response time from APM or application logs.

Determines whether the 504s are caused by a legitimate slow upstream being cut off by an under-configured timeout, rather than a true upstream failure — prevents misdiagnosis and unnecessary service restarts.
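
The comparison between the configured timeout and the 99th-percentile response time can be sketched as follows. The nearest-rank percentile method, the 1.5x safety margin, and the function names are assumptions made for illustration; an APM tool may compute percentiles slightly differently.

```python
def percentile(samples, pct):
    """Nearest-rank percentile; `pct` is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer form of ceil(pct / 100 * n), avoiding float rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def assess_timeout(response_times_s, timeout_s, margin=1.5):
    """Compare a proxy timeout with measured upstream response times.

    `margin` is an assumed safety factor applied on top of the p99.
    """
    p99 = percentile(response_times_s, 99)
    recommended_min = p99 * margin
    return {
        "p99": p99,
        "recommended_min": recommended_min,
        "timeout_too_short": timeout_s < recommended_min,
    }
```

For example, with 100 samples whose 99th-ranked value is 40 seconds, a 30-second proxy timeout is flagged as too short, which points to timeout tuning or upstream optimisation rather than a restart.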

Resolution path

If the upstream service process has crashed or stopped: restart it with `systemctl restart <service-name>` and confirm it is listening: `ss -tlnp | grep <port>`. Issue a test request from the proxy server with `curl -v --max-time 10 http://<upstream-host>:<port>/health` before re-enabling traffic.
If the upstream server is overloaded (high CPU/memory, connection queue full): scale horizontally by adding upstream instances behind the load balancer, and apply temporary rate limiting at the gateway to shed load while the upstream recovers. On Nginx, add `limit_req_zone` and `limit_req` directives to cap inbound request rate.
If the proxy timeout is too short for legitimate upstream processing time: increase timeout values in the proxy configuration. On Nginx, set `proxy_read_timeout 120s;` and `proxy_connect_timeout 60s;` in the relevant server or location block, then reload: `nginx -t && nginx -s reload`. On Apache, set `ProxyTimeout 120` in the VirtualHost config and reload: `apachectl graceful`. On AWS ALB, raise the load balancer's idle timeout (a load balancer attribute, not a target group attribute) via the Console or CLI: `aws elbv2 modify-load-balancer-attributes --load-balancer-arn <arn> --attributes Key=idle_timeout.timeout_seconds,Value=120`.
If a firewall or security group is blocking traffic between the proxy and upstream: add a rule permitting TCP from the proxy's IP or CIDR to the upstream port. On Linux iptables: `iptables -I INPUT -s <proxy-cidr> -p tcp --dport <upstream-port> -j ACCEPT` (`-I` inserts the rule at the top of the chain; a rule appended with `-A` has no effect if an earlier rule already drops the traffic). On AWS: update the upstream EC2 security group inbound rules to allow the proxy's security group or IP range on the upstream port. Re-test with `nc -zv <upstream-host> <port>` from the proxy.
If a network path failure is identified via MTR/traceroute: escalate to the network or cloud infrastructure team with the MTR report showing the failing hop. As a temporary measure, if an alternate upstream endpoint or region is available, update the proxy upstream block or target group to route to the healthy endpoint.
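
After any of the fixes above, a single successful test request can be misleading if the upstream is still flapping. The sketch below requires several consecutive successful health checks before traffic is re-enabled; the `/health` path comes from the steps above, while the function names, the streak of three, and the retry counts are illustrative assumptions.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=10.0):
    """Return True if `url` answers with a 2xx status within `timeout`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections, and timeouts.
        return False


def wait_until_healthy(check, required=3, attempts=20, delay=1.0,
                       sleep=time.sleep):
    """Return True once `check()` succeeds `required` times in a row."""
    streak = 0
    for _ in range(attempts):
        streak = streak + 1 if check() else 0
        if streak >= required:
            return True
        sleep(delay)
    return False
```

A typical call from the proxy server would be `wait_until_healthy(lambda: http_ok("http://<upstream-host>:<port>/health"))`, with the placeholders replaced by the real upstream address.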

Prevention

Implement active health checks on all upstream targets at the load balancer or gateway layer so that unhealthy instances are automatically removed from rotation before they generate 504s for end users. On Nginx Plus, enable active checks with the `health_check` directive in the proxied `location` block; on open-source Nginx, use passive failure detection with the `max_fails` and `fail_timeout` parameters on the `server` entries of the `upstream` block.
Set and tune proxy timeout values based on measured upstream response-time SLOs, specifically the 99th-percentile response time plus a safety margin, and alert when upstream response times approach the timeout threshold so that the issue is caught before end users see 504s.
Deploy upstream autoscaling policies triggered by CPU utilisation, memory pressure, or request-queue depth metrics to proactively add capacity before the upstream becomes overloaded and begins timing out.
Implement circuit-breaker patterns at the API gateway or application layer to fast-fail requests to a known-degraded upstream rather than holding connections open until the gateway timeout fires — reducing cascade failure risk and improving overall system resilience.
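
The circuit-breaker item above can be illustrated with a minimal, single-threaded sketch. It is not a production implementation: the class and parameter names are hypothetical, the thresholds are arbitrary defaults, and a real deployment would normally use the gateway's or a library's built-in breaker and add locking for concurrent use.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without contacting the upstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: fail fast instead of waiting for the gateway timeout.
                raise CircuitOpenError("upstream marked degraded")
            # Cool-down elapsed: half-open, allow one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            half_open_failed = self.opened_at is not None
            if half_open_failed or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open, the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping upstream calls as `breaker.call(fetch_orders)` (where `fetch_orders` stands for any function that calls the upstream) means that, once the threshold is reached, callers receive an immediate error for the cool-down period instead of each holding a connection until the timeout fires.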

Tools

curl (HTTP-level connectivity, response code, and response time testing)
telnet / nc (TCP port reachability testing from proxy to upstream)
traceroute / mtr (network path latency and packet loss diagnostics)
ss / netstat (verify upstream process is listening on expected port)
top / vmstat / htop (upstream server CPU, memory, and I/O utilisation)
systemctl (service status check and restart on Linux systemd systems)
nginx -t / nginx -s reload (Nginx configuration syntax test and graceful reload)
AWS Console / CLI (ALB target group health checks and access log analysis)

Tags

http-504, gateway-timeout, proxy, reverse-proxy, load-balancer, upstream, nginx, haproxy, aws-alb, networking, timeout, availability, incident-response, api-gateway, connectivity