---
title: Why Reading Parquet from S3 Is Slow (Even In-Network) and What I Did Instead
date: 2026-04-01
description: S3 in-network reads still have a ~300ms latency floor. Downloading parquet files to ephemeral disk at startup gave me local-disk speeds (~3ms) for free on Railway.
tags: ["railway", "polars", "s3", "prediction markets"]
author: Rian Dolphin
---

# Why Reading Parquet from S3 Is Slow (Even In-Network) and What I Did Instead

I'm building a small analytics dashboard for my systematic [Polymarket](https://polymarket.com/) trading activity. The stack is simple: Python, [Polars](https://pola.rs/), [FastHTML](https://fastht.ml/), deployed on [Railway](https://railway.com?referralCode=3pXBGQ). The data lives in a few parquet files: market metadata, event metadata, and a hive-partitioned activity log. Nothing too large. Maybe 40k rows across the lot.

My initial plan was straightforward. Store the parquet files in Railway's S3-compatible object storage, then read them directly with `pl.scan_parquet()` on each request. Polars supports S3 natively, parquet supports column projection and predicate pushdown, and the bucket would sit right next to the application server on Railway's network. Should be fast, right?

## The numbers told a different story

I wrote a benchmark script that runs each query five times and reports the min, median, and max. Here's what I found reading from S3 over the public internet (my laptop to Railway's bucket):

| Query | Median |
|---|---|
| Full scan, markets (138 cols, 37k rows) | 2.52s |
| Full scan, events (90 cols, 4.7k rows) | 3.12s |
| Select 4 columns from markets | 0.59s |
| Join markets + events, filter | 0.69s |

Column projection helped a lot. Reading 4 columns instead of 138 brought it from 2.5s down to 0.6s. But 600ms per query is still too slow for a dashboard.
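
The timing loop behind these numbers can be sketched roughly as follows. This is a minimal reconstruction of the approach described above (five runs, then min, median, and max), not the exact script. The `benchmark` helper and the placeholder workload are my own illustrative names.

```python
import statistics
import time
from typing import Callable


def benchmark(query: Callable[[], object], runs: int = 5) -> dict[str, float]:
    """Run `query` several times and summarise wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        query()
        durations.append(time.perf_counter() - start)
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


# Placeholder workload; in the real benchmark this would be a Polars query,
# for example a projected scan_parquet(...).collect().
result = benchmark(lambda: sum(range(100_000)))
print(f"min={result['min']:.4f}s median={result['median']:.4f}s max={result['max']:.4f}s")
```

Reporting the median rather than the mean keeps a single slow run (a cold connection, say) from skewing the headline figure.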

I then deployed the benchmark to Railway itself, so the application was reading from S3 within the same network:

| Query | Median |
|---|---|
| Full scan, markets | 1.13s |
| Full scan, events | 1.08s |
| Select 4 columns from markets | 0.39s |
| Join markets + events, filter | 0.37s |

Faster, by somewhere between 1.5x and 3x depending on the query. But even the cheapest queries still took 0.37-0.39s, which points to a floor of roughly 300ms that no amount of query optimisation could get past.

Then I ran the same queries against local parquet files:

| Query | Median |
|---|---|
| Full scan, markets | 0.154s |
| Full scan, events | 0.165s |
| Select 4 columns from markets | 0.003s |
| Join markets + events, filter | 0.006s |

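3 milliseconds for the four-column query, against 0.39s from S3 in-network. That's two orders of magnitude faster. Even the full scans, which have to decode every column, came out roughly 7x faster.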

## Why the difference?

I'd assumed that "in-network S3" would behave roughly like local disk. It doesn't, and the reason is simple once you know a little about how S3 works.

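S3 is an object store accessed over HTTP. A query needs an HTTPS connection (with a TLS handshake whenever a new one is opened), pays HTTP request/response overhead, and issues byte-range requests to read the parquet footer and then the specific column chunks it needs. Even when the network hop is short, you're still paying for an HTTP round-trip on every read operation, and some of those round-trips have to happen one after another. As far as I can tell, that ~300ms floor is mostly the overhead of the S3 protocol itself, not the network distance.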
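
To make the round-trips concrete, here is a minimal sketch of what a naive reader has to do before it can touch any column data. It relies on one fact about the parquet format: a file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. Everything else is an assumption for illustration. `CountingStore` is a hypothetical in-memory stand-in for an object store, the footer is placeholder bytes rather than real Thrift metadata, and real readers (Polars included) may prefetch or coalesce requests.

```python
import struct

MAGIC = b"PAR1"


class CountingStore:
    """Hypothetical object store: each method call models one HTTP request."""

    def __init__(self, blob: bytes) -> None:
        self.blob = blob
        self.requests = 0

    def size(self) -> int:
        self.requests += 1  # models a HEAD request for the object size
        return len(self.blob)

    def read_range(self, start: int, end: int) -> bytes:
        self.requests += 1  # models a GET with a Range header
        return self.blob[start:end]


def read_footer(store: CountingStore) -> bytes:
    """Locate and fetch the footer the way a naive reader would."""
    size = store.size()
    tail = store.read_range(size - 8, size)  # footer length + magic
    if tail[4:] != MAGIC:
        raise ValueError("not a parquet file")
    footer_len = struct.unpack("<I", tail[:4])[0]
    footer_start = size - 8 - footer_len
    return store.read_range(footer_start, size - 8)


# Fake file layout: magic, column data, footer, footer length, magic.
footer = b"fake-thrift-metadata"
blob = MAGIC + b"\x00" * 1000 + footer + struct.pack("<I", len(footer)) + MAGIC

store = CountingStore(blob)
assert read_footer(store) == footer
print(store.requests)  # 3 sequential requests before any column chunk is read
```

Each of those requests depends on the result of the previous one, so they can't be issued in parallel. On local disk the same three reads are effectively free.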

Local parquet reads are fundamentally different. Polars can memory-map the file, so the OS reads bytes directly from disk, and after the first access the data sits in the kernel's page cache. The column data still has to be decompressed and decoded, but there's no protocol overhead and no network round-trip. Fetching a byte range is little more than pointer arithmetic.
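
For a feel of what memory-mapping means in practice, here is a small standard-library sketch. It is not what Polars does internally (that is Rust code I haven't traced). It only shows the general mechanism, using a stand-in file with made-up contents.

```python
import mmap
import os
import tempfile

# Write a small stand-in file (hypothetical contents, not real parquet).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"header" + b"\x00" * 4096 + b"column-chunk")

with open(path, "rb") as f:
    # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        # Slicing reads straight from the mapping. The OS loads the pages
        # on first touch and can serve later reads from its page cache.
        start = len(mapped) - 12
        chunk = mapped[start : start + 12]

print(chunk)  # b'column-chunk'
os.remove(path)
```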

## The solution: ephemeral disk

The obvious approach would be to read the files into memory once at startup and serve from a cached DataFrame. But Railway charges $10/GB/month for RAM. My data is small now, but that pricing model doesn't scale well, and it felt wasteful to keep data in RAM that could just as easily sit on disk.

Then I looked more carefully at Railway's pricing. Each container gets an ephemeral filesystem. It's the container's own disk, wiped on every deploy or restart, but perfectly fine for data that can be re-downloaded. On the Hobby plan, you get 100GB of ephemeral storage included. No additional cost.

So the solution was simple: download the parquet files from S3 to `/tmp` once at startup, then read locally from there for the lifetime of the container.

```python
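import polars as pl

# CACHE_DIR (a Path under /tmp), S3_FILES (object names) and
# _get_storage_options() are defined elsewhere in the app.


def download_s3_data() -> None:
    """Copy each parquet file from S3 to ephemeral disk, once per container."""
    config = _get_storage_options()
    if config is None:
        return  # Local dev, no-op

    bucket, storage_options = config
    CACHE_DIR.mkdir(parents=True, exist_ok=True)

    for file_name in S3_FILES:
        dest = CACHE_DIR / file_name
        if dest.exists():
            continue  # Already downloaded during this container's lifetime
        # file_name may contain subdirectories (e.g. hive partitions)
        dest.parent.mkdir(parents=True, exist_ok=True)
        s3_path = f"s3://{bucket}/data/{file_name}"
        # Read the whole object once, then write a local copy for later scans
        pl.read_parquet(s3_path, storage_options=storage_options).write_parquet(dest)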
```

This runs before the server starts accepting requests. I added a `/health` endpoint so Railway knows when the app is ready. The old container keeps serving traffic until the new one finishes downloading and passes the health check. Zero downtime on deploy.
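
The ordering matters more than the framework, so here is a sketch of it using Python's built-in `http.server` rather than FastHTML. The names `HealthHandler`, `start_app`, and `prepare_data` are hypothetical. `prepare_data` stands in for the download function above. The key point is that the port isn't opened until the data is on disk, so the health check can't pass early.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Callable


class HealthHandler(BaseHTTPRequestHandler):
    """Answers /health with 200 and everything else with 404."""

    def do_GET(self) -> None:
        if self.path == "/health":
            status, body = 200, b"ok"
        else:
            status, body = 404, b"not found"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args: object) -> None:
        pass  # Keep the example quiet


def start_app(prepare_data: Callable[[], None], port: int = 0) -> ThreadingHTTPServer:
    """Fetch data first, then start listening in a background thread."""
    prepare_data()  # Blocks until the files are on local disk
    server = ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start_app(download_s3_data, port=8000)` would then only begin answering `/health` once the download has finished.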

The cost breakdown:

| Resource | Price |
|---|---|
| RAM | $10/GB/month |
| S3 storage | $0.15/GB/month |
| Ephemeral disk | Included (100GB on Hobby) |

I get local-disk query speeds (~3-7ms) using storage that's effectively free. The only trade-off is a few seconds of startup time to download the files, which is invisible to users because of the health check gating.
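
To put numbers on that, here is the arithmetic as a small sketch using the prices from the table. The 5GB dataset size is hypothetical, and I'm assuming both options keep the source copy in S3 and ignoring the RAM that queries use transiently.

```python
RAM_PRICE_PER_GB = 10.00      # USD per GB per month
S3_PRICE_PER_GB = 0.15        # USD per GB per month
EPHEMERAL_INCLUDED_GB = 100   # Included on the Hobby plan


def monthly_cost(data_gb: float) -> dict[str, float]:
    """Compare caching the data in RAM with caching it on ephemeral disk."""
    if data_gb > EPHEMERAL_INCLUDED_GB:
        raise ValueError("data no longer fits in the included ephemeral disk")
    s3 = data_gb * S3_PRICE_PER_GB  # Paid either way: S3 holds the source copy
    return {
        "in_ram": s3 + data_gb * RAM_PRICE_PER_GB,
        "on_ephemeral_disk": s3,
    }


costs = monthly_cost(5)  # 5GB is a hypothetical dataset size
print({name: round(cost, 2) for name, cost in costs.items()})
```

For that hypothetical 5GB, the RAM-cached version comes to $50.75 a month against $0.75 for the ephemeral-disk version.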

## What I took away from this

The mental model I had wrong was treating S3 as "remote disk". It's not. It's an HTTP API that happens to store files. Every read is a network request, and network requests have a latency floor that no amount of clever query planning can eliminate.

If your data is small enough to fit on disk (and on [Railway's](https://railway.com?referralCode=3pXBGQ) Hobby plan, that means under 100GB), downloading it at startup and reading locally is both faster and cheaper than reading from S3 on every request. You get the durability of object storage (the data lives in S3 and survives deploys) with the read performance of local disk.
