Rolling Window Feature Engineering for Time Series ML Models in Python

Raw price data tells a machine learning model almost nothing. A closing price of $64,250 has no inherent predictive value. What matters is context: how that price compares to recent history, whether volatility is expanding or contracting, whether volume is confirming the move. Rolling window features encode this context into numbers a model can learn from.

Feature engineering is where most of the alpha in quantitative trading comes from. Better features consistently outperform better models. A simple LightGBM with well-crafted rolling features will beat a complex neural network trained on raw OHLCV data nearly every time. This article walks through the most effective rolling window features for time series prediction, how to implement them correctly in Python, and the subtle pitfalls that can silently destroy your model's real-world performance.

The Mechanics of Rolling Windows

A rolling window computes a statistic over the last N periods at each point in time. The window "rolls" forward, always looking backward. This is critical: when creating features for prediction, the window may include the current observation (pandas does this by default) but must never include future ones.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "close": [100, 102, 101, 105, 103, 107, 106, 110, 108, 112],
    "volume": [1000, 1200, 900, 1500, 1100, 1400, 1300, 1600, 1000, 1700],
})

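# Rolling statistics over the last 5 rows (the current row is included)
df["sma_5"] = df["close"].rolling(window=5).mean()
df["close_std_5"] = df["close"].rolling(window=5).std()   # dispersion of price, not volume
df["vol_mean_5"] = df["volume"].rolling(window=5).mean()  # average traded volume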

The first four rows of sma_5 will be NaN because there are not yet five periods of history. This is correct behavior. Never fill these NaN values with forward-looking data or arbitrary constants. Either drop the rows or use a smaller minimum period with min_periods.
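As a minimal sketch of the min_periods option mentioned above (the series and the choice of 3 are illustrative assumptions, not recommendations), the following compares a strict 5-period window with one that starts emitting values once three observations exist:

```python
import pandas as pd

close = pd.Series([100, 102, 101, 105, 103, 107], name="close")

# Default behavior: the window must be full, so the first four values are NaN.
sma_strict = close.rolling(window=5).mean()

# With min_periods=3 a value is emitted as soon as 3 observations exist.
# Those early values average fewer points, so they are noisier than later ones.
sma_relaxed = close.rolling(window=5, min_periods=3).mean()

print(pd.DataFrame({"strict": sma_strict, "relaxed": sma_relaxed}))
```

The trade-off is fewer dropped rows in exchange for early features that are computed on less history than the rest.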

Essential Rolling Features for Trading Models

Through extensive experimentation across thousands of trading models, certain rolling features consistently prove their value. Here are the ones that carry the most predictive signal.

Returns Over Multiple Horizons

Percentage returns over different lookback periods capture momentum at multiple timescales:

def add_return_features(df: pd.DataFrame, periods: list[int]) -> pd.DataFrame:
    for p in periods:
        df[f"return_{p}"] = df["close"].pct_change(p)
    return df

df = add_return_features(df, periods=[1, 3, 5, 10, 20])

A model that sees 1-period, 5-period, and 20-period returns can distinguish between a short-term pullback within a larger uptrend (negative return_1, positive return_20) and a genuine reversal (negative across all horizons).

Volatility Features

Volatility measures tell the model about the current market regime. High volatility periods behave fundamentally differently from low volatility periods:

def add_volatility_features(df: pd.DataFrame, windows: list[int]) -> pd.DataFrame:
    # Log returns are additive over time, which suits rolling standard deviations
    log_returns = np.log(df["close"] / df["close"].shift(1))

    for w in windows:
        short_vol = log_returns.rolling(window=w).std()
        # Baseline volatility over a window four times as long
        long_vol = log_returns.rolling(window=w * 4).std()
        df[f"volatility_{w}"] = short_vol
        df[f"volatility_ratio_{w}"] = short_vol / long_vol

    return df

df = add_volatility_features(df, windows=[5, 10, 20])

The volatility_ratio is particularly useful. When short-term volatility exceeds long-term volatility (ratio > 1), the market is in an expanding volatility regime. When it is below 1, volatility is contracting. These transitions frequently precede large directional moves.

Z-Score Normalization

Raw rolling statistics are not directly comparable across different assets or time periods. A 20-period SMA of $64,000 for BTC and $3,200 for ETH mean entirely different things. Z-scores solve this by expressing each value as the number of standard deviations from the rolling mean:

def add_zscore_features(df: pd.DataFrame, columns: list[str], window: int) -> pd.DataFrame:
    for col in columns:
        rolling_mean = df[col].rolling(window=window).mean()
        rolling_std = df[col].rolling(window=window).std()
        df[f"{col}_zscore_{window}"] = (df[col] - rolling_mean) / rolling_std

    return df

df = add_zscore_features(df, columns=["close", "volume"], window=20)

A close price with a z-score of +2.5 means the price is 2.5 standard deviations above its 20-period mean, which is statistically extreme regardless of whether the asset trades at $100 or $100,000. This normalization makes features transferable across assets and stable across time.
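The scale-independence claim can be checked directly. The sketch below uses a hypothetical helper, rolling_zscore, and two made-up price series that differ only by a factor of 1,000; under those assumptions the z-scores should coincide up to floating-point error:

```python
import numpy as np
import pandas as pd

def rolling_zscore(s: pd.Series, window: int) -> pd.Series:
    """Distance of each value from its trailing mean, in trailing std units."""
    mean = s.rolling(window=window).mean()
    std = s.rolling(window=window).std()
    return (s - mean) / std

# Two hypothetical assets with identical percentage moves but different price levels.
cheap = pd.Series([100, 102, 101, 105, 103, 107, 106, 110, 108, 112], dtype=float)
expensive = cheap * 1000

z_cheap = rolling_zscore(cheap, window=5)
z_expensive = rolling_zscore(expensive, window=5)

# Multiplying a series by a constant scales the mean and std equally,
# so the ratio is unchanged.
same = np.allclose(z_cheap.dropna(), z_expensive.dropna())
print(same)
```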

Volume Profile Features

Volume confirms or contradicts price moves. A price breakout on declining volume is suspect. A breakout on surging volume is more likely to follow through:

def add_volume_features(df: pd.DataFrame, windows: list[int]) -> pd.DataFrame:
    for w in windows:
        df[f"volume_sma_{w}"] = df["volume"].rolling(window=w).mean()
        df[f"volume_ratio_{w}"] = df["volume"] / df["volume"].rolling(window=w).mean()
        df[f"volume_trend_{w}"] = (
            df["volume"].rolling(window=w // 2).mean() /
            df["volume"].rolling(window=w).mean()
        )

    return df

df = add_volume_features(df, windows=[10, 20])

The volume_ratio feature is one of the most consistently predictive features across all asset classes. A ratio of 2.0 means current volume is twice the recent average, which is a meaningful event regardless of the absolute volume numbers.
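One detail worth seeing in numbers: the rolling mean in add_volume_features includes the current bar, so a volume spike raises its own baseline. The sketch below uses invented volumes and a hypothetical variant that builds the baseline from the previous bars only; both versions are causal, and which one to use is a design choice rather than a correctness issue.

```python
import pandas as pd

volume = pd.Series([1000, 1000, 1000, 1000, 1000, 2000], dtype=float)
window = 5

# Baseline includes the current bar, as in add_volume_features.
ratio_inclusive = volume / volume.rolling(window=window).mean()

# Hypothetical variant: baseline from the previous `window` bars only.
ratio_exclusive = volume / volume.shift(1).rolling(window=window).mean()

print(ratio_inclusive.iloc[-1])  # spike divided by a baseline that contains the spike
print(ratio_exclusive.iloc[-1])  # spike divided by the five bars before it
```

With the inclusive baseline the final bar is compared against an average of 1,200, so a doubling of volume shows up as a ratio of about 1.67 rather than 2.0.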

Relative Strength Features

Where an asset's price sits within its own recent range provides directional context:

def add_relative_strength(df: pd.DataFrame, window: int) -> pd.DataFrame:
    # Rolling extremes of the recent range, computed once and reused
    high_max = df["high"].rolling(window=window).max()
    low_min = df["low"].rolling(window=window).min()

    df[f"distance_from_high_{window}"] = df["close"] / high_max - 1
    df[f"distance_from_low_{window}"] = df["close"] / low_min - 1
    df[f"range_position_{window}"] = (df["close"] - low_min) / (high_max - low_min)

    return df

# Requires "high" and "low" columns, i.e. full OHLCV data rather than the close/volume-only frame above
df = add_relative_strength(df, window=20)

The range_position feature places the current price on a 0-1 scale within its recent range. Values near 0 mean the price is near the bottom of its range (potential reversal or continuation of weakness). Values near 1 mean it is near the top. This single feature often carries more signal than complex technical indicators.
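A small worked example makes the 0-1 scale concrete. The candles below are invented and the 3-period window is chosen only to keep the arithmetic easy to follow:

```python
import pandas as pd

ohlc = pd.DataFrame({
    "high":  [105, 108, 110, 109, 107],
    "low":   [99, 101, 104, 103, 100],
    "close": [102, 107, 105, 104, 106],
})
window = 3

high_max = ohlc["high"].rolling(window=window).max()
low_min = ohlc["low"].rolling(window=window).min()
ohlc["range_position_3"] = (ohlc["close"] - low_min) / (high_max - low_min)

# Last row: the 3-bar range runs from 100 to 110 and the close is 106,
# so the price sits 60% of the way up its recent range.
print(ohlc)
```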

Building a Complete Feature Pipeline

Individual features are useful, but the real power comes from combining them into a systematic pipeline. Here is a production-grade feature engineering function:

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    return_periods = [1, 2, 3, 5, 10, 20]
    for p in return_periods:
        df[f"return_{p}"] = df["close"].pct_change(p)

    log_ret = np.log(df["close"] / df["close"].shift(1))
    for w in [5, 10, 20, 50]:
        df[f"volatility_{w}"] = log_ret.rolling(window=w).std()

    df["vol_ratio_short_long"] = df["volatility_5"] / df["volatility_20"]
    df["vol_ratio_med_long"] = df["volatility_10"] / df["volatility_50"]

    for w in [10, 20, 50]:
        df[f"volume_ratio_{w}"] = df["volume"] / df["volume"].rolling(window=w).mean()

    for w in [20, 50]:
        rolling_mean = df["close"].rolling(window=w).mean()
        rolling_std = df["close"].rolling(window=w).std()
        df[f"price_zscore_{w}"] = (df["close"] - rolling_mean) / rolling_std

    for w in [10, 20, 50]:
        high_max = df["high"].rolling(window=w).max()
        low_min = df["low"].rolling(window=w).min()
        df[f"range_position_{w}"] = (df["close"] - low_min) / (high_max - low_min)
        df[f"distance_from_high_{w}"] = df["close"] / high_max - 1

    df["body_ratio"] = (df["close"] - df["open"]) / (df["high"] - df["low"])
    df["upper_shadow"] = (df["high"] - df[["close", "open"]].max(axis=1)) / (df["high"] - df["low"])
    df["lower_shadow"] = (df[["close", "open"]].min(axis=1) - df["low"]) / (df["high"] - df["low"])

    max_window = 50
    df = df.iloc[max_window:].reset_index(drop=True)

    return df

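Notice that the function drops the first max_window rows at the end. These are the rows where at least one feature is NaN due to insufficient history: volatility_50 needs 50 log returns, and the first log return is itself NaN, so the first fully populated row sits at position 50. Dropping these rows explicitly is safer than filling or interpolating them.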
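Rather than hard-coding the warm-up length, it can be measured. The helper below, first_complete_row, is a hypothetical utility (not part of the pipeline above) applied to a synthetic price series with three representative features:

```python
import numpy as np
import pandas as pd

def first_complete_row(features: pd.DataFrame) -> int:
    """Return the position of the first row in which no feature is NaN."""
    complete = features.notna().all(axis=1).to_numpy()
    if not complete.any():
        raise ValueError("no row has every feature populated")
    return int(np.argmax(complete))

# Synthetic, steadily rising prices: 61 rows.
close = pd.Series(np.linspace(100.0, 160.0, 61))
features = pd.DataFrame({
    "return_1": close.pct_change(1),
    "sma_20": close.rolling(window=20).mean(),
    "volatility_50": np.log(close / close.shift(1)).rolling(window=50).std(),
})

warmup = first_complete_row(features)
print(warmup)
```

A check like this guards against an off-by-one when a new, longer window is added to the pipeline but max_window is not updated.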

Avoiding Lookahead Bias

Lookahead bias is the silent killer of backtesting accuracy. It occurs when your features accidentally include information from the future. With rolling windows, the most common sources are:

Centered windows: Never use center=True in pandas rolling operations for prediction features. A centered window looks both forward and backward, which is fine for analysis but fatal for prediction.

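# Bad: a centered window averages rows on both sides of the current one, including future rows
df["bad_feature"] = df["close"].rolling(window=20, center=True).mean()

# Good: a trailing window (center=False, the default) uses only the current and past rows
df["good_feature"] = df["close"].rolling(window=20, center=False).mean()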
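One way to confirm whether a feature leaks is to change a future observation and see whether the feature at an earlier row moves. The sketch below does this on a synthetic series; the row indices and the window of 11 are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

close = pd.Series(np.arange(100.0, 130.0))  # 30 synthetic rows
tampered = close.copy()
tampered.iloc[20] += 50.0  # alter an observation that lies in the future of row 15

t = 15  # the row whose feature value we inspect

centered_before = close.rolling(window=11, center=True).mean().iloc[t]
centered_after = tampered.rolling(window=11, center=True).mean().iloc[t]

trailing_before = close.rolling(window=11).mean().iloc[t]
trailing_after = tampered.rolling(window=11).mean().iloc[t]

# The centered window at row 15 spans rows 10..20, so it sees the altered value.
print("centered changed:", centered_before != centered_after)
# The trailing window at row 15 spans rows 5..15, so it cannot see row 20.
print("trailing changed:", trailing_before != trailing_after)
```

The same perturbation test works for any feature function and is a cheap addition to a test suite.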

Normalization with full-dataset statistics: If you normalize features using the mean and standard deviation of the entire dataset, you are leaking future information into past observations.

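# Bad: the mean and std are computed over the whole dataset, future rows included
df["bad_zscore"] = (df["close"] - df["close"].mean()) / df["close"].std()

# Good: each row is normalized with statistics from its own trailing 50-row window
rolling_mean = df["close"].rolling(window=50).mean()
rolling_std = df["close"].rolling(window=50).std()
df["good_zscore"] = (df["close"] - rolling_mean) / rolling_std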
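If a fixed-length window is too short for the statistic you want, an expanding window is another causal option. This is a sketch under that assumption, not something the pipeline above uses; expanding_zscore is a hypothetical helper and the random-walk prices are synthetic:

```python
import numpy as np
import pandas as pd

def expanding_zscore(s: pd.Series, min_periods: int = 20) -> pd.Series:
    """Z-score each value against all observations up to and including it."""
    mean = s.expanding(min_periods=min_periods).mean()
    std = s.expanding(min_periods=min_periods).std()
    return (s - mean) / std

rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 1, 200).cumsum())
z = expanding_zscore(close)

# Causality check: shifting every price from row 150 onward
# must leave the z-scores of rows 0..149 untouched.
tampered = close.copy()
tampered.iloc[150:] += 25.0
z_tampered = expanding_zscore(tampered)

unchanged = np.allclose(z.iloc[:150], z_tampered.iloc[:150], equal_nan=True)
print(unchanged)
```

Unlike a rolling window, the expanding statistics adapt more slowly as history accumulates, so this variant suits slowly drifting quantities better than regime-dependent ones.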

Overlapping feature and target windows: When computing the target variable (e.g., forward return), make sure the feature window does not overlap with the target window. If your target is the return over the next 5 periods, your features should use data up to and including the current period, but never any of the next 5 periods.
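The separation between feature and target windows can be sketched as follows. The prices, the 2-period horizon, and the column names are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({"close": [100.0, 102.0, 101.0, 105.0, 103.0, 107.0, 106.0, 110.0]})
horizon = 2

# Feature: backward-looking return, fully known at the close of row t.
df["return_2"] = df["close"].pct_change(horizon)

# Target: return from row t to row t + horizon.
# shift(-horizon) aligns the future close with the current row.
df["target_fwd_2"] = df["close"].shift(-horizon) / df["close"] - 1

# The first `horizon` rows lack the feature and the last `horizon` rows lack
# the target, so both ends are dropped before training.
dataset = df.dropna().reset_index(drop=True)
print(dataset)
```

Note that the negative shift appears only in the target column. Any feature built with a negative shift would be a lookahead bug.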

Multi-Timeframe Features

Some of the strongest features combine information from multiple timeframes. You can either resample your data or compute rolling features at different scales:

def add_multi_timeframe_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    df["trend_alignment"] = np.sign(df["return_5"]) + np.sign(df["return_10"]) + np.sign(df["return_20"])

    df["momentum_acceleration"] = df["return_5"] - df["return_5"].shift(5)

    df["vol_regime_change"] = df["volatility_5"].pct_change(5)

    return df

The trend_alignment feature ranges from -3 to +3. A value of +3 means short, medium, and long-term returns are all positive, which is a strong bullish alignment. Because it sums three signs, a value of +1 or -1 typically means one timeframe disagrees with the other two, and 0 appears only when at least one return is exactly zero. Models learn to treat these regimes very differently.
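To see how the score is formed, here is a small worked example with four hypothetical rows of returns (the numbers are invented for illustration):

```python
import numpy as np
import pandas as pd

returns = pd.DataFrame({
    "return_5":  [0.02, -0.01, -0.03, 0.00],
    "return_10": [0.04, 0.03, -0.02, 0.01],
    "return_20": [0.09, 0.06, -0.05, -0.02],
})

cols = ["return_5", "return_10", "return_20"]
# np.sign maps each return to -1, 0, or +1; the row sum is the alignment score.
returns["trend_alignment"] = np.sign(returns[cols]).sum(axis=1)

# Row 0: all three horizons positive.
# Row 1: short-term return negative, the other two positive.
# Row 2: all three horizons negative.
# Row 3: one flat, one positive, one negative.
print(returns)
```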

Performance Optimization

Rolling window calculations on large datasets can be slow. Here are practical optimizations:

Use NumPy for custom rolling functions: Pandas rolling with apply calls a Python function once per window, which is extremely slow on large datasets. Use NumPy's stride tricks or scipy.ndimage for custom rolling operations:

from numpy.lib.stride_tricks import sliding_window_view

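def fast_rolling_zscore(arr: np.ndarray, window: int) -> np.ndarray:
    # One row per trailing window: shape (len(arr) - window + 1, window)
    windows = sliding_window_view(arr, window)
    means = windows.mean(axis=1)
    # ddof=1 matches the sample standard deviation used by pandas rolling().std()
    stds = windows.std(axis=1, ddof=1)
    result = np.full(len(arr), np.nan)
    result[window - 1:] = (arr[window - 1:] - means) / stds
    return result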

df["fast_zscore"] = fast_rolling_zscore(df["close"].values, window=20)
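The stride-tricks approach pays off most for statistics pandas does not provide natively. As one sketch, the hypothetical helper rolling_slope below computes the least-squares slope of each trailing window against the time index 0..window-1 in a single matrix product:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_slope(arr: np.ndarray, window: int) -> np.ndarray:
    """OLS slope of each trailing window regressed on 0..window-1."""
    x = np.arange(window, dtype=float)
    x_centered = x - x.mean()
    # One row per trailing window: shape (len(arr) - window + 1, window)
    windows = sliding_window_view(np.asarray(arr, dtype=float), window)
    # slope = sum((x - mean(x)) * y) / sum((x - mean(x))^2); centering y is
    # unnecessary because the centered x values sum to zero.
    slopes = windows @ x_centered / (x_centered @ x_centered)
    result = np.full(len(arr), np.nan)
    result[window - 1:] = slopes
    return result

# A perfectly linear series rising 2 units per step has slope 2 in every window.
line = 2.0 * np.arange(10) + 5.0
slopes = rolling_slope(line, window=5)
print(slopes)
```

As with the z-score version, the first window - 1 entries are NaN, so the output aligns with the input row for row.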

Compute features in parallel: When building features for hundreds of assets, use multiprocessing:

from concurrent.futures import ProcessPoolExecutor

def compute_features_for_symbol(symbol_df: pd.DataFrame) -> pd.DataFrame:
    return build_features(symbol_df)

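def parallel_feature_engineering(dfs: dict[str, pd.DataFrame], max_workers: int = 8) -> dict[str, pd.DataFrame]:
    results = {}
    # Each DataFrame is pickled and sent to a worker process. On platforms that
    # spawn workers (Windows, macOS), call this from under `if __name__ == "__main__":`.
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        futures = {
            executor.submit(compute_features_for_symbol, df): symbol
            for symbol, df in dfs.items()
        }
        # result() blocks until the task finishes and re-raises any worker exception
        for future in futures:
            symbol = futures[future]
            results[symbol] = future.result()
    return results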

For a universe of 400+ crypto pairs with 10,000 candles each, parallel feature engineering can reduce computation time from several minutes to under 30 seconds.

Practical Takeaways

Start with returns, volatility, and volume ratios. These three feature families carry the most signal for price prediction across all asset classes and timeframes.
Always use z-scores or ratios instead of raw values. This makes features comparable across assets and stable over time.
Never use centered windows or full-dataset statistics for features. Lookahead bias is the most common and most damaging mistake in quantitative research.
Drop NaN rows explicitly rather than filling them. Filling with zeros or forward-fills introduces subtle biases that degrade model performance.
Test feature importance after training. If a feature consistently has near-zero importance, remove it. Fewer, stronger features produce better models than many weak ones.
Multi-timeframe features capture market regime context. Trend alignment and volatility regime changes are among the strongest predictive signals available.

Feature engineering is not glamorous work. It does not make for impressive architecture diagrams or exciting model announcements. But it is where the performance of your trading system is won or lost. Time spent crafting thoughtful rolling window features will repay itself many times over in model accuracy and real-world profitability.