
Why Reading Parquet from S3 Is Slow (Even In-Network) and What I Did Instead

I'm building a small analytics dashboard for my systematic Polymarket trading activity. The stack is simple: Python, Polars, FastHTML, deployed on Railway. The data lives in a few parquet files: market metadata, event metadata, and a hive-partitioned activity log. Nothing too large. Maybe 40k rows across the lot.

My initial plan was straightforward. Store the parquet files in Railway's S3-compatible object storage, then read them directly with pl.scan_parquet() on each request. Polars supports S3 natively, parquet supports column projection and predicate pushdown, and the bucket would sit right next to the application server on Railway's network. Should be fast, right?

The numbers told a different story

I wrote a benchmark script that runs each query five times and reports the min, median, and max. Here's what I found reading from S3 over the public internet (my laptop to Railway's bucket):
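The harness is nothing fancy; something along these lines (a sketch of the idea, not the exact script, with a stand-in workload where the real script timed Polars queries):

```python
import statistics
import time


def bench(query_fn, runs: int = 5):
    """Run query_fn several times and return (min, median, max) in seconds."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        query_fn()
        times.append(time.perf_counter() - start)
    return min(times), statistics.median(times), max(times)


# Stand-in workload; the real script timed queries like
# pl.scan_parquet(...).select(...).collect().
lo, med, hi = bench(lambda: sum(range(100_000)))
print(f"min={lo:.4f}s  median={med:.4f}s  max={hi:.4f}s")
```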

Query                                        Median
Full scan, markets (138 cols, 37k rows)      2.52s
Full scan, events (90 cols, 4.7k rows)       3.12s
Select 4 columns from markets                0.59s
Join markets + events, filter                0.69s

Column projection helped a lot. Reading 4 columns instead of 138 brought it from 2.5s down to 0.6s. But 600ms per query is still too slow for a dashboard.

I then deployed the benchmark to Railway itself, so the application was reading from S3 within the same network:

Query                                        Median
Full scan, markets                           1.13s
Full scan, events                            1.08s
Select 4 columns from markets                0.39s
Join markets + events, filter                0.37s

Faster, roughly 2x across the board. But there's a floor around 300ms that no amount of query optimisation could get past.

Then I ran the same queries against local parquet files:

Query                                        Median
Full scan, markets                           0.154s
Full scan, events                            0.165s
Select 4 columns from markets                0.003s
Join markets + events, filter                0.006s

3 milliseconds. Two orders of magnitude faster than S3 in-network.

Why the difference?

I'd assumed that "in-network S3" would behave roughly like local disk. It doesn't, and the reason is simple once you know a little about how S3 works.

S3 is an object store accessed over HTTP. Every query involves a TLS handshake, HTTP request/response overhead, and byte-range requests to read specific column chunks. Even when the network hop is short, you're still paying the cost of an HTTP round-trip for every read operation. That ~300ms floor is the latency of the S3 protocol itself, not the network.

Local parquet reads are fundamentally different. Polars memory-maps the file. The OS reads bytes directly from disk, and after the first access, the data sits in the kernel's page cache. There's no protocol overhead, no serialisation, no network. It's just pointer arithmetic.

The solution: ephemeral disk

The obvious approach would be to read the files into memory once at startup and serve from a cached DataFrame. But Railway charges $10/GB/month for RAM. My data is small now, but that pricing model doesn't scale well, and it felt wasteful to keep data in RAM that could just as easily sit on disk.

Then I looked more carefully at Railway's pricing. Each container gets an ephemeral filesystem. It's the container's own disk, wiped on every deploy or restart, but perfectly fine for data that can be re-downloaded. On the Hobby plan, you get 100GB of ephemeral storage included. No additional cost.

So the solution was simple: download the parquet files from S3 to /tmp once at startup, then read locally from there for the lifetime of the container.

import polars as pl

# CACHE_DIR, S3_FILES and _get_storage_options() are defined elsewhere in the
# app: CACHE_DIR is the local download target under /tmp, S3_FILES is the list
# of parquet file names, and _get_storage_options() returns the bucket name
# and credentials, or None when running locally.


def download_s3_data() -> None:
    config = _get_storage_options()
    if config is None:
        return  # Local dev: the files are already on disk, nothing to do

    bucket, storage_options = config
    CACHE_DIR.mkdir(parents=True, exist_ok=True)

    for file_name in S3_FILES:
        dest = CACHE_DIR / file_name
        if dest.exists():
            continue  # Already downloaded by an earlier call
        s3_path = f"s3://{bucket}/data/{file_name}"
        pl.read_parquet(s3_path, storage_options=storage_options).write_parquet(dest)

This runs before the server starts accepting requests. I added a /health endpoint so Railway knows when the app is ready. The old container keeps serving traffic until the new one finishes downloading and passes the health check. Zero downtime on deploy.
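The gating pattern can be sketched with the stdlib rather than FastHTML (only the `/health` route name comes from the post; everything else is illustrative): the endpoint answers 503 until the download finishes, then 200, and the platform only switches traffic over once it sees 200.

```python
import http.server
import threading
import urllib.request

ready = threading.Event()  # set once the parquet download completes


class Health(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # 503 while still downloading; the old container keeps serving.
            self.send_response(200 if ready.is_set() else 503)
        else:
            self.send_response(404)
        self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass


server = http.server.HTTPServer(("127.0.0.1", 0), Health)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# download_s3_data() would run here; flip the flag when it finishes.
ready.set()
status = urllib.request.urlopen(f"http://127.0.0.1:{port}/health").status
print(status)  # 200
server.shutdown()
```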

The cost breakdown:

Resource          Price
RAM               $10/GB/month
S3 storage        $0.15/GB/month
Ephemeral disk    Included (100GB on Hobby)

I get local-disk query speeds (~3-7ms) using storage that's effectively free. The only trade-off is a few seconds of startup time to download the files, which is invisible to users because of the health check gating.

What I took away from this

The mental model I had wrong was treating S3 as "remote disk". It's not. It's an HTTP API that happens to store files. Every read is a network request, and network requests have a latency floor that no amount of clever query planning can eliminate.

If your data is small enough to fit on disk (and on Railway's Hobby plan, that means under 100GB), downloading it at startup and reading locally is both faster and cheaper than reading from S3 on every request. You get the durability of object storage (the data lives in S3 and survives deploys) with the read performance of local disk.