Digging Into Factor Modelling via Alternative Data

I've been working through Paleologo's Advanced Portfolio Management, and I found the chapter on factor modelling really interesting. The algebra is straightforward enough, mostly just linear regression, but the intuition never fully went in. I could follow the math without it ever quite clicking.

To make it click, I'm implementing a factor model on some real data. At Massive, where I work, we recently picked up a panel of European consumer card-spending numbers. It's classic alt data, and a perfect opportunity to play around and learn more about factor modelling.

This post is as much an explanation to my future self as anything else, but hopefully it might be of interest to others too! It roughly follows the path of a detailed back-and-forth I had with Claude (using the learning style) on the topic.

Why factors exist at all

Before jumping into the modelling, a bit of motivation.

Suppose I tell you one fact about today: the S&P 500 closed up 2%. Nothing else. Now I ask you to guess how three stocks did: (1) ExxonMobil, (2) a small US biotech you've never heard of, and (3) a Japanese electric utility. Even with no other information you can form a sensible guess for each. Exxon was probably up roughly with the index. The biotech is less obvious, but probably up. The utility, who knows, but maybe not far from flat (as a best guess).

In fact, on a +2% S&P day, somewhere in the region of 70 to 80% of US stocks close up. Not just the big index names. The tiny biotech, the regional bank, the railroad, the software company. They move the same direction on the same day, even though their actual businesses have nothing in common with each other. The biotech's drug trial does not care what Exxon's oil production is. And yet day to day they move together.

Why? Because something common pushed them. A Fed announcement, a tariff headline, an inflation print, a tax bill, some macro event that isn't about any single firm but about the environment they're all operating in.

So a stock's return on any given day is a mix of two things. Stuff happening to the world that affects everyone, and stuff happening to that specific firm. The first is systematic. The second is idiosyncratic. You can write any single day's return as the sum of the two and the decomposition is true by construction.

That is the seed of the whole modelling framework. The craft is in cutting up the systematic part properly.

Same shock, different dials

There's a tempting wrong reading of "systematic", which is to think it means "affects every stock equally". The whole texture of factor modelling lives in why that is wrong. Systematic just means the cause is external to any one firm. It doesn't mean the shock hits every firm the same way, and it may not touch some firms at all.

A Fed cut is systematic because no firm's CEO produced it. But the impact of that cut travels through different channels and lands with different force on different businesses. A regional bank's whole P&L is wired to interest rates. It literally makes its money on the spread between what it lends at and what it pays depositors. A profitless biotech is valued mostly on cash flows ten or more years out, so the discount rate matters a lot even though the firm has no current revenue. A utility carries a lot of debt, so its financing cost moves directly. Exxon mostly cares about oil prices, with rates a second-order concern.

Same shock, four very different responses. The analogy Claude made, which I found useful, was that each stock has a dial for every kind of systematic shock, and each dial setting is a fact about the firm. The bank's Fed dial is at 7, its oil dial near 0. Exxon's oil dial is at 8, its Fed dial maybe at 2. When a shock hits, every firm's return picks up the shock multiplied by that firm's dial, plus whatever firm-specific stuff was going on that day.

That is what the factor identity is saying. For a stock \(i\) on day \(t\),

\[r_{i,t} = \beta_{i,1} f_{1,t} + \beta_{i,2} f_{2,t} + \cdots + \beta_{i,m} f_{m,t} + \varepsilon_{i,t}.\]

The \(f_{k,t}\) are the shocks. The \(\beta_{i,k}\) are the dials. The \(\varepsilon_{i,t}\) is the firm-specific leftover.

Read the subscripts carefully

Look at the subscripts on the right-hand side. The dial \(\beta_{i,k}\) is indexed by the stock \(i\) (and by which factor \(k\)) but not by the day \(t\). The shock \(f_{k,t}\) is indexed by the day \(t\) (and by \(k\)) but not by the stock \(i\). That asymmetry is something I overlooked the first time I read the chapter, and noticing it made things click a bit.

The dial doesn't carry a \(t\) because it is a property of the firm. It is about how the business is wired, which doesn't change much day to day. A bank is rate-sensitive on Tuesday for the same reason it is rate-sensitive on Wednesday. (Dials drift slowly over months and years, when a firm changes its business mix, but on any single day treating the dial as fixed is fine.)

The shock doesn't carry an \(i\) because it is a property of the day. Either the Fed surprised the market or it didn't. That fact doesn't have a per-stock version.

Returns get both subscripts because a return is a fact about a stock and a day at once.
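
To make the subscript bookkeeping concrete, here is a minimal NumPy sketch of the identity as array shapes. All the numbers are random placeholders, and the sizes (4 stocks, 2 factors, 3 days) are arbitrary assumptions chosen just to show which object carries which index.

```python
import numpy as np

rng = np.random.default_rng(0)
n_stocks, n_factors, n_days = 4, 2, 3

# Dials: one row per stock, one column per factor. No day index.
B = rng.normal(size=(n_stocks, n_factors))

# Shocks: one row per day, one column per factor. No stock index.
F = rng.normal(scale=0.01, size=(n_days, n_factors))

# Firm-specific leftovers: one number per (day, stock) pair.
E = rng.normal(scale=0.005, size=(n_days, n_stocks))

# r[t, i] = sum_k B[i, k] * F[t, k] + E[t, i]
R = F @ B.T + E

print(B.shape, F.shape, R.shape)  # (4, 2) (3, 2) (3, 4)
```

The return array is the only one with both a day axis and a stock axis, which is the asymmetry described above.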

The dataset

The signal I'll use is one of the simplest you could build on top of card data, year-over-year log growth of a trailing 90-day sum:

\[g_{i,t} \;=\; \log\left( \sum_{u = t-89}^{t} s_{i,u} \right) \;-\; \log\left( \sum_{u = t-89-365}^{t-365} s_{i,u} \right).\]

For each firm \(i\) on each day \(t\), that is one number. A snapshot on some Tuesday might look like:

Firm	\(g_{i,t}\)	Interpretation
Amazon	+0.15	Strong growth
Starbucks	+0.06	Mildly positive
Nike	+0.01	Flat
Target	-0.04	Softening
Gap	-0.22	Sharp decline

These are log growths, not percentages. For small values the two are nearly identical, but the gap widens further out. \(g = 0.15\) is about a 16% increase (\(e^{0.15} - 1\)), and \(g = -0.22\) is about a 20% decline.
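
A rough sketch of how the signal could be computed for a single firm, assuming the input is a plain array of daily spend with no gaps. The function name `yoy_log_growth` and the synthetic spend series are made up for illustration; a real panel would need calendar alignment and handling of missing days.

```python
import numpy as np

def yoy_log_growth(spend, window=90, lag=365):
    """log(trailing `window`-day sum) minus log(same sum `lag` days earlier).

    `spend` is a 1-D array of daily spend for one firm. Days without
    enough history come back as NaN.
    """
    spend = np.asarray(spend, dtype=float)
    csum = np.concatenate(([0.0], np.cumsum(spend)))
    g = np.full(spend.shape, np.nan)
    first = window - 1 + lag  # first day with a full window one lag back
    for t in range(first, len(spend)):
        now = csum[t + 1] - csum[t + 1 - window]                # sum over t-89..t
        then = csum[t + 1 - lag] - csum[t + 1 - lag - window]   # same window, a year back
        g[t] = np.log(now) - np.log(then)
    return g

# Synthetic check: flat spend for a year, then a level shift of exp(0.15).
spend = np.concatenate([np.full(365, 100.0), np.full(365, 100.0 * np.exp(0.15))])
g = yoy_log_growth(spend)

print(g[-1])            # log growth on the last day
print(np.expm1(g[-1]))  # the same growth as a simple percentage change
```

`np.expm1` does the log-to-percent conversion from the paragraph above, so a log growth of 0.15 maps to roughly a 16% increase.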

So now I have one number per firm per day, computed from a real-world quantity, that varies meaningfully across the panel. From here there are really two questions. How to turn this into a portfolio, and how to show that the portfolio isn't just a repackaging of stuff we already own or know about. Factor models are mostly machinery for answering the second.

Cleaning the signal

Before running any regression, \(g\) needs three pieces of pre-processing.

The first is to centre \(g_{i,t}\) across the cross-section. Subtract the cross-sectional mean each day so the \(g\)'s sum to zero. The motivation only fully lands once we get to the portfolio reading a couple of sections from now, but the short version is that we want the long-short structure to be balanced around zero, rather than the whole panel quietly long on a day when everyone is growing.

The second is to standardise: divide by the cross-sectional standard deviation. This fixes the scale so the gross size of the resulting book is steady through time, and has the nice side-effect that the factor return ends up in interpretable units (percent return per one-sigma move in \(g\)).

The third is to winsorise the tails, clipping at perhaps three standard deviations. This is the robustness step. Raw alt-data signals have fat tails, and a single firm with 200% YoY growth would otherwise dominate the regression. Its \(g^2\) swamps the denominator \(\sum g^2\), leaving the factor return for the day as roughly that one firm's return. Winsorising stops that.

A useful sanity check on the construction is to ask what happens on a hypothetical day where every firm has exactly the same \(g\). Suppose every firm prints \(g = 0.05\). After centring, every \(g\) becomes zero and the regression has nothing to fit on. Strictly, the slope formula degenerates to \(0/0\) (and the standardising step would divide by zero), so by convention the factor return is set to zero. That is the right answer. On a day with no cross-sectional spread in the characteristic, the characteristic cannot explain any of the variation in returns. Every firm looks the same on this dimension, so if Amazon goes up and Gap goes down, the difference can't be attributed to card-spending growth, since they had identical \(g\).

A factor is built entirely from cross-sectional dispersion in the characteristic. On a day with no dispersion, there is nothing to fit and the factor returns nothing.
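
The three cleaning steps and the no-dispersion check can be sketched for one day's cross-section as follows. The function name `clean_signal`, the use of the population standard deviation (`ddof=0`) and the tolerance for "no dispersion" are my assumptions, not something the book prescribes.

```python
import numpy as np

def clean_signal(g_raw, clip=3.0, tol=1e-12):
    """Centre, standardise, then winsorise one day's cross-section."""
    g = np.asarray(g_raw, dtype=float)
    g = g - g.mean()                 # centre: cross-sectional mean becomes zero
    sd = g.std()                     # population std (ddof=0), an assumption
    if sd < tol:
        # No dispersion: every firm looks identical, nothing to fit on.
        return np.zeros_like(g)
    z = g / sd                       # standardise to z-scores
    return np.clip(z, -clip, clip)   # winsorise the tails at +/- clip sigma

# The five-firm snapshot from the table (raw log growths).
z = clean_signal([0.15, 0.06, 0.01, -0.04, -0.22])
print(np.round(z, 2))

# The sanity check: identical g everywhere gives an all-zero signal.
flat = clean_signal([0.05] * 5)
print(flat)

# One extreme firm among twenty otherwise identical ones gets clipped.
outlier = clean_signal([2.0] + [0.0] * 19)
print(outlier.max())
```

One caveat on the ordering: clipping after standardising means the clipped signal is no longer exactly mean-zero and unit-variance on days where the clip binds, so some pipelines re-centre afterwards.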

From here on, \(g_{i,t}\) refers to the cleaned signal.

Manufacturing a factor

Going back to the identity:

\[r_{i,t} = \beta_{i,1} f_{1,t} + \cdots + \varepsilon_{i,t}.\]

In the CAPM-style mental model that most people have inherited, the \(f_{k,t}\) are obvious things. The S&P 500's daily return. The yield change on a Treasury bond. Quantities you can pull off a Bloomberg terminal. Once you have those, you regress each stock's return on them through time and the dials \(\beta_{i,k}\) fall out. Many days, one stock at a time.

The question for our card-spending dataset is whether "European consumer spending strength" is a factor in the same sense as the market. The dial side is easy. Different firms obviously have different exposure to European consumers. Amazon is more exposed than a domestic US utility. We can even measure it directly. \(g_{i,t}\) is, in effect, a proxy for today's reading of each firm's exposure to whatever is happening with the European consumer.

The shock side is harder. There is no daily index published called European consumer health. The shock that we want exists in the world somewhere, but not as a number sitting on a server I can read.

The fundamental approach inverts the CAPM logic. Instead of taking the factor as given and inferring the dials, take the dials as given and infer the factor.

Concretely, on a single day \(t\), write the cross-sectional regression (i.e. \(t\) stays fixed and we look across the firms)

\[r_{i,t} = f_t \cdot g_{i,t} + \varepsilon_{i,t}, \qquad i = 1, \dots, N.\]

You know the \(N\) returns today. You know the \(N\) values of \(g_{i,t}\). The only unknown is the scalar \(f_t\), the factor return for the day. Run the OLS and you get

\[\widehat{f}_t \;=\; \frac{\sum_i g_{i,t}\, r_{i,t}}{\sum_i g_{i,t}^2}.\]

The derivation is no-intercept, single-regressor OLS. Minimise

\[S(f_t) \;=\; \sum_{i=1}^{N} \left( r_{i,t} - f_t\, g_{i,t} \right)^2\]

over \(f_t\). Differentiating with respect to \(f_t\) gives

\[\frac{\partial S}{\partial f_t} \;=\; -2 \sum_{i=1}^{N} g_{i,t} \left( r_{i,t} - f_t\, g_{i,t} \right),\]

setting that to zero and dropping the \(-2\) leaves

\[\sum_i g_{i,t}\, r_{i,t} \;-\; f_t \sum_i g_{i,t}^2 \;=\; 0,\]

which rearranges to the formula above. It is the standard \(\widehat{\beta} = (X^\top X)^{-1} X^\top y\) collapsed to the scalar case, with \(X^\top X = \sum_i g_{i,t}^2\) and \(X^\top y = \sum_i g_{i,t}\, r_{i,t}\). Each firm's return is weighted by its exposure \(g_{i,t}\) and normalised by total squared exposure, so firms with larger exposures contribute more to the estimate.
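
As a numerical check on the derivation, this sketch compares the closed-form slope with NumPy's general least-squares solver on simulated data. The "true" slope of 0.5 and the noise level are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
g = rng.normal(size=n)                        # cleaned characteristic on one day
r = 0.5 * g + rng.normal(scale=2.0, size=n)   # returns in %, made-up slope of 0.5

# Closed form: sum(g * r) / sum(g^2)
f_closed = (g @ r) / (g @ g)

# Same regression via the general solver, with X as an N x 1 matrix
f_lstsq = np.linalg.lstsq(g[:, None], r, rcond=None)[0][0]

# First-order condition: the residuals are orthogonal to g
resid = r - f_closed * g
print(f_closed, f_lstsq, g @ resid)
```

The last printed number is the derivative condition from above evaluated at the solution, so it should be zero up to floating-point error.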

Worth doing once by hand. Take the five firms above. After cleaning, their \(g\)'s come out as \(z\)-scores rather than raw log growths. Pair them with a plausible set of returns:

Firm	\(g_{i,t}\)	\(r_{i,t}\) (%)	\(g_{i,t} \cdot r_{i,t}\)
Amazon	+1.28	+1.2	+1.54
Starbucks	+0.55	+0.3	+0.17
Nike	+0.15	-0.1	-0.02
Target	-0.26	-0.4	+0.10
Gap	-1.72	-1.5	+2.58
		Sum	+4.37

With \(\sum g^2 \approx 4.99\), today's factor return is \(4.37 / 4.99 \approx 0.88\%\). Because \(g\) was standardised first, the units of the slope are clean: a firm one cross-sectional standard deviation higher in \(g\) today outperformed by about \(0.88\%\). With raw \(g\) the same arithmetic would have given a much larger number in awkward units. Repeat tomorrow, and the day after, and so on, and you have a factor return time series, manufactured out of a pile of firm-level numbers that had no obvious price interpretation a moment ago.
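
The hand calculation can be reproduced in a few lines. The numbers are the ones in the table, and `factor_returns` is a hypothetical helper showing how the same arithmetic would extend to a days-by-firms panel.

```python
import numpy as np

g = np.array([1.28, 0.55, 0.15, -0.26, -1.72])   # cleaned z-scores from the table
r = np.array([1.2, 0.3, -0.1, -0.4, -1.5])       # returns in %

num = np.sum(g * r)    # numerator, sum of g * r
den = np.sum(g ** 2)   # denominator, sum of g squared
f_hat = num / den      # factor return in % per one sigma of g

print(round(num, 2), round(den, 2), round(f_hat, 2))  # 4.37 4.99 0.88

def factor_returns(G, R):
    """One slope per day; row t of G and R is day t's cross-section."""
    return (G * R).sum(axis=1) / (G ** 2).sum(axis=1)
```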

So in CAPM you take the factor as given, regress each stock through time, and infer the dial. In the fundamental approach you take the dial (the characteristic) as given, regress across stocks on a single day, and infer the factor. Same identity, same OLS, but slicing the data the other direction.

A factor is a portfolio

The denominator of \(\widehat{f}_t\), \(\sum_j g_{j,t}^2\), is just a number that doesn't depend on \(i\). Pull it inside the sum:

\[\widehat{f}_t = \frac{\sum_i g_{i,t}\, r_{i,t}}{\sum_j g_{j,t}^2} = \sum_i w_{i,t}\, r_{i,t}, \qquad w_{i,t} = \frac{g_{i,t}}{\sum_j g_{j,t}^2}.\]

The right-hand side is a weighted sum of the day's stock returns, with weights that depend on each firm's characteristic. That is the definition of a portfolio's daily return. So the slope of the cross-sectional regression and the P&L of a particular portfolio are literally the same number.

Which portfolio? The weights are proportional to \(g_{i,t}\). Firms with high positive \(g\) get large positive weights, firms with sharply negative \(g\) get large negative weights, firms near zero barely show up. So it's a long/short book, long the firms whose European consumer business is growing, short the ones whose business is shrinking, with weights set by exactly how much each firm is growing or shrinking.

This also retroactively explains why we centred \(g\) upstream. The weights \(w_{i,t}\) sum to zero exactly when the centred \(g_{i,t}\)'s do, so centring buys us a book that is automatically equal-cash long and short, rather than a closet long-only bet on a day when the panel is broadly growing.

The "factor return" you compute by running OLS isn't an abstract statistical quantity. It is the daily P&L of a specific, buildable book of positions.
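
A quick check of the equivalence, using the same five made-up firms as in the worked example: build the weights, confirm the portfolio return equals the OLS slope, and confirm the book nets to zero.

```python
import numpy as np

g = np.array([1.28, 0.55, 0.15, -0.26, -1.72])   # centred, standardised signal
r = np.array([1.2, 0.3, -0.1, -0.4, -1.5])       # returns in %

w = g / np.sum(g ** 2)            # regression-implied portfolio weights
f_as_portfolio = w @ r            # weighted sum of the day's returns
f_as_slope = (g @ r) / (g @ g)    # cross-sectional OLS slope

net = w.sum()                     # zero because g was centred
gross = np.abs(w).sum()           # total long plus short size

print(f_as_portfolio, f_as_slope, net, gross)
```

The weights here are in whatever units the formula produces. In practice you would rescale them to a target gross exposure, which rescales the P&L by the same constant.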

A note on the decile cousin

Quants often build a simpler version of this. Long the top decile of the signal, short the bottom decile, equal-weighted, ignore the middle. It's a coarser cousin of the regression-weighted portfolio, throwing away both the magnitudes and the middle of the distribution, but more robust to outliers, since no single name can take an outsized weight. Both describe the same long-short structure.
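
A minimal sketch of the decile version, assuming a single day's signal in a NumPy array with no ties or missing values. The function name `decile_weights` and the scaling of each leg to one unit are my choices.

```python
import numpy as np

def decile_weights(g, q=0.10):
    """Equal-weighted long the top q of the signal, short the bottom q."""
    g = np.asarray(g, dtype=float)
    n = len(g)
    k = max(1, int(round(q * n)))   # number of names in each leg
    order = np.argsort(g)           # indices sorted by ascending signal
    w = np.zeros(n)
    w[order[-k:]] = 1.0 / k         # long leg sums to +1
    w[order[:k]] = -1.0 / k         # short leg sums to -1
    return w

rng = np.random.default_rng(6)
g = rng.normal(size=100)
w = decile_weights(g)

# Only ranks matter: blowing up the largest signal leaves the weights unchanged.
g_extreme = g.copy()
g_extreme[np.argmax(g)] *= 100.0
w_extreme = decile_weights(g_extreme)

print(w.sum(), np.abs(w).sum(), np.count_nonzero(w))
```

Because the weights depend only on which names land in each leg, an extreme value of the signal cannot pull more capital into one position, which is the sense in which this construction is coarser but sturdier.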

The risk-manager's view of the same machinery

A short detour, because I was confused for a while about why the same factor model gets sold as both a risk tool and an alpha tool. They are the same machinery, pointed at different questions.

On the risk side, the headache is the covariance matrix. A portfolio of \(N = 5{,}000\) stocks has an \(N \times N\) covariance matrix, around 12.5 million unique numbers once you account for symmetry. With a few years of daily data you have far fewer data points than parameters, so you can't estimate the matrix directly without producing something noisy and ill-conditioned.

The factor model imposes structure. Under the identity, two stocks co-move for two reasons. Either they share factor exposures, or their idiosyncratic shocks are correlated. The model assumes the latter is zero (i.e. idio shocks should be firm specific by definition), which gives

\[\Sigma = B \Sigma_f B^\top + D,\]

with \(B\) the \(N \times m\) matrix of loadings, \(\Sigma_f\) the \(m \times m\) factor covariance, and \(D\) a diagonal matrix of idiosyncratic variances. With \(m \approx 50\) factors that brings the parameter count from 12.5 million down to around a quarter of a million, dominated by the loadings.
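
The parameter counts are simple enough to verify directly. The function names below are just labels for the arithmetic.

```python
def full_cov_params(n):
    """Unique entries of a symmetric n x n covariance matrix."""
    return n * (n + 1) // 2

def factor_cov_params(n, m):
    """Parameters in B (n x m), Sigma_f (symmetric m x m) and diagonal D."""
    loadings = n * m
    factor_cov = m * (m + 1) // 2
    idio = n
    return loadings + factor_cov + idio

n, m = 5_000, 50
print(full_cov_params(n))        # 12502500
print(factor_cov_params(n, m))   # 256275
```

Of the 256,275 parameters in the structured version, 250,000 are loadings, which is why the count is described as dominated by them.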

The piece \(B \Sigma_f B^\top\) is "low rank": an \(N \times N\) matrix whose rank is at most \(m\), so it can describe co-movement only along the \(m\) factor channels we have named. The piece \(D\) is diagonal, capturing each firm's idiosyncratic variance, with zeros off the diagonal because residuals are assumed uncorrelated across firms.
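
A small sketch that builds the structured covariance from random ingredients and checks the rank claim. The sizes (30 stocks, 3 factors) and every number are placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 30, 3

B = rng.normal(size=(n, m))                   # loadings, N x m
A = rng.normal(size=(m, m))
Sigma_f = A @ A.T + 0.1 * np.eye(m)           # some positive-definite factor covariance
D = np.diag(rng.uniform(0.5, 1.5, size=n))    # idiosyncratic variances on the diagonal

systematic = B @ Sigma_f @ B.T                # N x N, but only m channels of co-movement
Sigma = systematic + D

print(np.linalg.matrix_rank(systematic))      # 3, i.e. m
print(np.linalg.matrix_rank(Sigma))           # 30, i.e. N

# Portfolio variance splits into a factor part and an idiosyncratic part.
w = rng.normal(size=n)
b = B.T @ w                                   # the portfolio's factor exposures
total_var = w @ Sigma @ w
split_var = b @ Sigma_f @ b + w @ D @ w
print(total_var, split_var)
```

Adding the diagonal piece is what makes the full matrix invertible even though the systematic piece on its own is heavily rank-deficient.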

That residual independence is the key assumption the model relies on. When it fails, it almost always fails for the same reason: there is a real systematic driver you forgot to name (i.e. your list of factors doesn't cover everything, unsurprisingly).

Suppose the true world has an oil factor and you didn't include it. Exxon and Chevron both have large positive oil dials. Their co-movement on oil days still happens, but the model has nowhere to put it apart from the residuals, so the correlated piece ends up there. The model treats them as roughly independent stocks, each with high idio variance, when in reality they are tightly coupled through oil.

The off-diagonal of \(\Sigma\) between Exxon and Chevron gets underestimated (because you don't have an oil-based factor in your model). A portfolio long both will look like two independent positions to the risk model, when in reality they share a large hidden exposure. On a bad oil day the realised loss can be much worse than the model claimed it could be. That is what is meant by "our risk model underpredicted yesterday's drawdown".

The same logic flips sign for opposite-sign exposures. An airline (negative oil dial) and Exxon (positive oil dial) have a real negative oil-driven covariance. If you forget oil, you miss that negative contribution and overestimate their joint risk. So a missing factor distorts the off-diagonal in whichever direction the missed contribution was pushing, and quietly inflates the residuals along the way.

The diagnostic is to look at the residual correlation matrix and check for groups of stocks whose residuals visibly co-move. Risk modellers spend a lot of time doing exactly that because it helps to identify what factors your model might be missing.
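
Here is a toy simulation of the missing-oil story, using a time-series fit for simplicity rather than the cross-sectional setup used elsewhere in this post. The three stylised stocks, their dials and the noise levels are all invented.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 2_000

mkt = rng.normal(size=T)   # market shock, included in the model
oil = rng.normal(size=T)   # oil shock, the factor we "forget"

# Columns: oil major A, oil major B, airline (stylised stand-ins).
beta_mkt = np.array([1.0, 1.0, 1.0])
beta_oil = np.array([0.8, 0.8, -0.6])
idio = rng.normal(scale=0.5, size=(T, 3))

R = np.outer(mkt, beta_mkt) + np.outer(oil, beta_oil) + idio

# Fit a market-only model stock by stock (no-intercept OLS through time).
b_hat = (mkt @ R) / (mkt @ mkt)
resid = R - np.outer(mkt, b_hat)

# The omitted oil factor shows up as correlated residuals.
corr = np.corrcoef(resid, rowvar=False)
print(np.round(corr, 2))
```

The two oil majors should show a clearly positive residual correlation and each should be negatively correlated with the airline, which is the pattern a risk modeller would read as a hint that a factor is missing.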

The reason this detour is worth taking is that the same matrix \(B\) appears on both sides. On the alpha side, each column of \(B\) defines a factor portfolio of the kind we built earlier. On the risk side, \(B\) is the structure that lets the covariance matrix be tractable. The factor model is one piece of machinery serving both questions.

Neutralisation

Back to the alpha side. Suppose I run my card-spending portfolio over the backtest and it has a Sharpe of 1.2.

Say we do some digging and find that firms with the highest card-spending growth in the panel tend to be smaller, faster-growing, more retail-heavy companies. Firms with the lowest growth tend to be bigger and more mature.

So the portfolio is, by accident, long small-and-growthy, short big-and-mature. Small-cap is itself a known factor with a documented historical premium. A reasonable question to ask at this point is whether the Sharpe is just small-cap exposure in disguise.

Perhaps some unknown chunk of the Sharpe is not card-spending insight but accidental size exposure that the literature already named. Perhaps all of the Sharpe is just size.

This is the contamination problem. The signal is correlated with size, so any portfolio built from the raw signal inherits that correlation. The portfolio's returns are a mix of genuine alt-data insight (hopefully) and repackaged size factor.

The fix is the same cross-sectional regression move, with a richer right-hand side. On each day, regress

\[g_{i,t} = \gamma_0 + \gamma_{\text{size}}\, \text{size}_{i,t} + \gamma_{\text{val}}\, \text{val}_{i,t} + \sum_k \gamma_k\, \text{industry}_{k,i} + \tilde{g}_{i,t}.\]

You are decomposing \(g\) into the part explainable by things you don't want exposure to (size, value, industry) and the part that is genuinely surprising given those things (\(\tilde{g}\)). Throw the raw \(g\) away and build the portfolio from \(\tilde{g}\).

The reason this neutralises the portfolio comes from a property of OLS. The residuals from a fitted regression are exactly orthogonal to the regressors in the sample you fit on. So

\[\sum_i \tilde{g}_{i,t} \cdot \text{size}_{i,t} = 0\]

on day \(t\) by construction. The portfolio's dollar exposure to size is

\[\sum_i w_{i,t} \cdot \text{size}_{i,t} \;=\; \sum_i \frac{\tilde{g}_{i,t}}{\sum_j \tilde{g}_{j,t}^2} \cdot \text{size}_{i,t} \;=\; 0,\]

which is the same sum as the orthogonality condition, scaled by a constant. So the portfolio has, mechanically, zero size exposure on day \(t\). Same logic for value and each industry dummy. The OLS orthogonality property and the portfolio's neutrality to the regressors are really the same fact written in two notations.

Once you have this, the whole pre-processing recipe is the same operation applied at different richness. Regress the characteristic on everything you don't want the portfolio to be secretly long, and use the residuals. Centring is the trivial case where the only thing on the right is a constant, so the resulting portfolio has zero exposure to "the constant", which is just dollar-neutrality. Full neutralisation is the same move with size, value, industry, and whatever else you already own elsewhere.
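
A sketch of the neutralisation step on one simulated cross-section. The controls, the five industry buckets and the contamination coefficients are invented. I also drop the separate intercept column, because a full set of industry dummies already sums to a constant. That is a choice on my part to avoid collinear regressors, and it leaves the residuals unchanged.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300

size = rng.normal(size=n)                 # e.g. standardised log market cap
value = rng.normal(size=n)                # e.g. standardised book-to-price
industry = rng.integers(0, 5, size=n)     # 5 made-up industry buckets
ind_dummies = np.eye(5)[industry]         # one-hot encoding, n x 5

# Raw signal contaminated by size: smaller firms tend to have higher g.
g = -0.6 * size + 0.2 * value + rng.normal(size=n)

# Controls. The dummies sum to one in every row, so they span the constant.
X = np.column_stack([size, value, ind_dummies])

gamma, *_ = np.linalg.lstsq(X, g, rcond=None)
g_tilde = g - X @ gamma                   # the neutralised characteristic

w = g_tilde / np.sum(g_tilde ** 2)        # portfolio built from the residuals
exposures = X.T @ w                       # exposure to each control

# For comparison: the portfolio built from the centred raw signal.
g_c = g - g.mean()
w_raw = g_c / np.sum(g_c ** 2)

print(np.round(exposures, 12))
print(w.sum(), w_raw @ size)
```

The neutralised book has zero exposure to every control and nets to zero, while the raw book carries a clearly negative size exposure, which is the contamination described above.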

Daily neutrality is not cumulative neutrality

There is another subtlety worth flagging. The orthogonality just described holds exactly on the day you fit the regression. Tomorrow you fit a new cross-section, get new residuals, rebuild the portfolio. Each day's portfolio is locally neutral.

But you don't hold one day's portfolio forever. You rebuild every day. Over a year, the daily portfolios drift, and drift can leak time-averaged exposure even when each day is locally clean.

Suppose for the first half of the year small-caps happen to coincide with high \(g\), so neutralisation rotates the book one way. For the second half the relationship reverses, and the book rotates the other way. Each day is fine. But the time-averaged size exposure across the whole year doesn't have to be zero. Daily orthogonality is a within-cross-section guarantee, not a through-time one.

This is one reason production quant shops do an ex post factor attribution at the end of the period. You take the realised P&L and regress it on the realised factor returns over the same window. The slope on size in that regression is the actual time-averaged size exposure of the strategy, not the one the daily snapshots claim. So we need to do the ex post alpha test in the following section.

Gappy makes a particular point of the ex post vs ex ante processes either in his book or a podcast, I can't remember which (probably both), but I do remember being initially confused by the Latin.

The alpha test

An excerpt from my Claude chat that I found funny, and that serves as a perfect intro to this section:

You've done everything right. You took card-spending growth, centered and standardized it, neutralised against size, value, and industry. You built the daily long-short portfolio from the residuals. You ran it over your backtest period. You got a Sharpe of, say, 1.0.

Your PM looks at it and asks: "Is it real?"

What does that even mean? You have positive returns over a real time series. They didn't make themselves up. In what sense could they not be "real"?

The returns themselves are real in the sense that they happened. The sharper version of the question is whether anything is left after stripping out the contributions of factors that were already known.

The standard way to ask that is another regression, a time-series one this time:

\[\widehat{f}_t^{\text{spend}} = \alpha + \beta_{\text{mkt}} f_t^{\text{mkt}} + \beta_{\text{val}} f_t^{\text{val}} + \beta_{\text{mom}} f_t^{\text{mom}} + \cdots + \varepsilon_t.\]

Same identity again, with the portfolio's daily P&L taking the role usually played by a stock's daily return. The decomposition splits the new factor's return into a part explained by known factors and a part that isn't.

The intercept \(\alpha\) is the part of interest. It is the average daily return that survives after stripping out everything already on the desk. A positive \(\alpha\) with a decent t-statistic means the signal carries information no existing factor can claim credit for. A near-zero \(\alpha\) means the Sharpe was real but boring, a noisy combination of things the other factors already capture. A negative \(\alpha\) can also happen, and means the signal is paying less than a passive combination of existing factors.
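
A bare-bones version of the alpha regression with plain OLS standard errors, run on simulated data. The helper name `alpha_test`, the three stand-in factors and the "true" alpha of 0.05% per day are assumptions for illustration. The t-statistics here assume independent, homoskedastic residuals, which daily returns often violate, so treat them as a first look.

```python
import numpy as np

def alpha_test(y, F):
    """Time-series OLS of y on [1, F]. Returns (coef, t_stats); index 0 is alpha."""
    y = np.asarray(y, dtype=float)
    F = np.asarray(F, dtype=float)
    X = np.column_stack([np.ones(len(y)), F])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    dof = len(y) - X.shape[1]
    sigma2 = resid @ resid / dof               # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)      # classical OLS covariance of coef
    t_stats = coef / np.sqrt(np.diag(cov))
    return coef, t_stats

rng = np.random.default_rng(5)
T = 1_000
known = rng.normal(size=(T, 3))                # stand-ins for mkt, val, mom returns (%)
true_alpha = 0.05
true_betas = np.array([0.3, -0.2, 0.1])
f_spend = true_alpha + known @ true_betas + rng.normal(scale=0.5, size=T)

coef, t_stats = alpha_test(f_spend, known)
print("alpha:", coef[0], "t-stat:", t_stats[0])
print("betas:", coef[1:])
```

In a real test `f_spend` would be the daily factor return series from the cross-sectional regressions and `known` would hold the daily returns of whichever factor list you are testing against.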

This is the ex post test that complements the ex ante neutralisation from earlier. The neutralisation step controls what goes into the day, and the alpha test checks what came out the other side. Both are needed for two separate reasons. The first is that you can't shove every conceivable factor into the daily neutralisation regression. The literature has dozens, and stuffing them all in eats degrees of freedom and requires data that isn't always clean. The second is that daily neutrality leaks over time, as the previous section described.

One more thing about \(\alpha\). It is always alpha against a specific factor list. A signal with a great alpha against the Fama-French three-factor model in 1993 might have a smaller alpha against the Carhart four-factor model in 1997 (once momentum gets added) and smaller still against the Fama-French five-factor model in 2015 (once profitability and investment get added). Same signal, same returns. The alpha shrank because the language for explaining it got richer. A serious shop tests against several factor lists and looks at whether the alpha survives all of them.

The whole arc, in three regressions

The whole framework above is really three applications of the same regression identity, run in different directions.

The first runs across stocks on each day, regressing returns on a characteristic. Output: a daily factor return time series. This is how you manufacture a factor from raw firm-level data.

The second also runs across stocks on each day, but regresses the characteristic on controls. Output: residuals that you use as the cleaned characteristic, so the resulting portfolio has mechanical neutrality to the controls. This is how you clean the factor.

The third runs through time, regressing the factor's daily returns on the daily returns of factors you already had. Output: an alpha and some betas. This is how you evaluate the factor.

So everything we did sits on top of three regressions on the same underlying identity, applied at different scales.

What "a factor is a portfolio" actually means

Paleologo's chapter ends with the line that a factor is not a number, it is a portfolio. The first time I read it I didn't really sit with what it meant, and most of the work in this post is the version of it I now find useful.

The literal claim is straightforward. The number you compute on day \(t\) by running the cross-sectional OLS, the slope \(\widehat{f}_t\), equals the return on day \(t\) of a specific portfolio. They are the same number, because they are the same expression rearranged:

\[\widehat{f}_t \;=\; \frac{\sum_i g_{i,t}\, r_{i,t}}{\sum_i g_{i,t}^2} \;=\; \sum_i w_{i,t}\, r_{i,t}.\]

The right-hand side is a portfolio return by definition. The left-hand side is a regression coefficient. So the "factor return" you compute statistically and the daily P&L of a particular long-short book are not analogous things, they are the same arithmetic.

The portfolio is also a real thing you can build. You could compute its P&L, look up its drawdowns, hand the position file to an execution desk. In the snapshot from earlier it would be long Amazon and Starbucks (high \(g\)), short Gap and Target (low \(g\)), with weights proportional to each firm's centred and neutralised characteristic.

What makes the reframing useful, beyond being neat, is that the rest of the framework hangs off this one portfolio. The loadings are the recipe that defines its weights. The factor return is its daily P&L. The cross-sectional residuals \(\varepsilon_{i,t}\) are what is left of each stock's return on a day after subtracting the portfolio's contribution. The \(R^2\) of the cross-section measures how much of today's return variation is explained by holding this portfolio. The alpha from the previous section is what is left of the portfolio's own return after subtracting the contributions of the portfolios you already had on the desk.

So the closing line of Paleologo's chapter isn't really adding a new concept. It's reorganising the existing ones around a single concrete object, which is the part of it I find most useful on a second pass.