Author: Emmanouil Kritharakis

Training Together, Sharing Nothing: The Promise of Federated Learning
Why Federated Learning Now?

In marketing, data is a competitive edge. The more audience signals, campaign performance data, and consumer behavior a Machine Learning (ML) model can learn from, the sharper its predictions and the greater its business impact. Across the marketing ecosystem — spanning WPP’s agencies, their clients, and external partners — the collective wealth of consumer and brand data is enormous. The potential to train models across these combined assets could unlock transformative capabilities: better audience targeting, smarter media spend, and faster creative optimization at a global scale.

But here’s the challenge. While WPP maintains centralized access to its own internal data assets, much of the most valuable complementary data resides with clients and partners and, for good reasons, it can never leave their walls. Client contracts, privacy regulations like GDPR, and the sheer sensitivity of consumer-level data make cross-organizational data pooling a non-starter. The result? Models are trained without the full picture, and transformative insights that could emerge from combining datasets across organizational boundaries remain permanently out of reach.

The traditional solution, centralized ML, pools raw data from multiple sources into a single cloud to train a global model. But uploading terabytes of sensitive data to a central server creates severe network latency and exposes collaborators to data breaches and potential violations of privacy regulations.

Distributed ML methods attempted to address this by splitting training across local worker nodes. While this reduces latency and avoids centralizing raw data, these architectures were designed for internal computing clusters, not secure collaboration between independent companies. Without cross-organization coordination, models remain isolated, difficult to align, and impossible to improve as a unified whole.

Problem: The Collaboration vs. Privacy Bottleneck

Organizations are effectively trapped between two undesirable choices: compromise sensitive data through centralization, or settle for underperforming, isolated models through distribution. Neither architecture allows multiple companies to collaboratively train a shared, high-performing model while keeping private data strictly localized.

Federated Learning (FL) offers a way out of this dilemma by bringing the model to the data, rather than the other way around. To understand why this shift matters, let’s look at how FL actually works under the hood.

How Federated Learning Works

Figure 1: Overview of the Federated Learning communication cycle between a central node and distributed client nodes.

Federated Learning (FL) enables multiple organizations to collaboratively train a shared model without ever centralizing raw data. Instead of moving data to the model, FL brings the model to the data. Training proceeds through iterative rounds:
1. A central server sends the current global model to all participating clients (Blue arrow).
2. Each client trains the model locally on its own private data (Green arrow).
3. Clients send back only their model updates, never the underlying data (Pink arrow).
4. The server aggregates these updates into an improved global model and starts the next round.
Throughout this process, raw data never leaves its source. Only learned model representations are exchanged across the network.
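
The four steps above can be sketched in a few lines of Python. This is a toy illustration, not the setup used in the experiments: the one-parameter model y = w * x, the three in-memory clients, the learning rate, and all function names are assumptions made for the example.

```python
import random

def local_train(w, data, lr=0.05, epochs=5):
    """Step 2: a client refines the received weight on its own private data."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x  # gradient of the squared error (w*x - y)^2
            w -= lr * grad
    return w

def run_round(global_w, client_datasets):
    """One communication round; only weights and sample counts leave a client."""
    updates = []
    for data in client_datasets:
        local_w = local_train(global_w, data)   # steps 1 and 2: send model, train locally
        updates.append((local_w, len(data)))    # step 3: return the update, not the data
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total  # step 4: weighted aggregation

rng = random.Random(0)
true_w = 3.0  # hypothetical ground-truth weight used to generate the toy data

def make_client(n_samples):
    """Create one client's private (x, y) pairs with a little noise."""
    data = []
    for _ in range(n_samples):
        x = rng.uniform(-1, 1)
        data.append((x, true_w * x + rng.gauss(0, 0.1)))
    return data

clients = [make_client(20) for _ in range(3)]

w = 0.0
for round_idx in range(4):
    w = run_round(w, clients)
    print(f"round {round_idx + 1}: global w = {w:.3f}")
```

In this sketch the server never touches the `(x, y)` pairs; it only sees each client's weight and sample count, which mirrors the data flow described above.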

The Multimodal Challenge

While the above privacy-preserving framework is valuable in its own right, modern marketing data adds another layer of complexity. Organizations do not just work with spreadsheets and numbers. They work with images, video, text, audio, and structured data, often all at once. A single campaign might involve visual brand assets, ad copy, audience segments, and performance metrics across channels. Training models that can reason across these different data types, known as multimodal learning, is already one of the most demanding challenges in ML.

Now combine that with the constraints of federated learning. Each client may hold different combinations of modalities, in different formats and volumes. One partner might contribute rich visual data, another mostly text and tabular records. Coordinating a single global model that learns effectively from this fragmented, heterogeneous landscape, without ever seeing the raw data, pushes the problem to a new level of complexity.

This is precisely what makes the intersection of FL and multimodal learning so important, and so hard. If it can be made to work, it unlocks collaborative intelligence across organizations at a scale that neither approach could achieve alone.

Our Objective: Can Federated Learning Deliver?

The promise of FL is compelling, but before investing in real-world deployment, we need to answer a fundamental question:

Does federated learning actually work well enough on multimodal marketing data to justify the tradeoff?

Centralized training will always have an inherent advantage: it sees all the data at once. The question is not whether FL can beat centralized performance, but whether it can get close enough to make the privacy and collaboration benefits worthwhile. And beyond raw performance, we need to understand how FL behaves under realistic stress conditions: more partners joining, noisy data, and complex cross-modal relationships.

To answer this, we designed a series of experiments around four key questions:

Experiment 1 — Centralized vs. Federated Performance
- How close can FL get to centralized performance? In a centralized setup, the model sees all the data at once, the ideal scenario for learning. FL, by design, fragments this data across clients. The first question is whether this tradeoff costs us meaningful accuracy, or whether FL can match centralized results despite never accessing the full dataset.
- What happens as more clients join? In practice, a federated network might involve a handful of partners or dozens. As the number of participants grows, each client holds a smaller, potentially less representative slice of the overall data. We tested how model performance scales as we increase the number of clients.
Experiment 2 — Resilience to Noisy Data
- How robust are centralized and federated models to noisy data? Real-world datasets are messy: labels can be assigned incorrectly, and data quality varies across partners. We deliberately introduced noise into the multimodal dataset to simulate these imperfections and measure how much degradation the model can tolerate before performance breaks down.
Experiment 3 — Cross-Modal Relationships
- How sensitive are centralized and federated models to underlying cross-modal patterns? Multimodal models learn from connections between different types of data. For example, a luxury brand might target a high-income audience through a premium creative tone on a specific platform. Some of these connections appear frequently in the data, while others are rare. We tested whether emphasizing the most frequent cross-modal patterns in our synthetic data improves performance compared to emphasizing the least frequent ones, helping us understand how much the model benefits from common, naturally occurring relationships versus rare, atypical ones.
The Data

For our experiments, we used a multimodal synthetic dataset generated by our own well-tested synthetic data generator, designed to mirror real-world marketing dynamics. The generator allows us to customize various elements of the data and design targeted datasets that stress-test our model architecture under controlled conditions, giving us full visibility into the factors that drive campaign performance.

Each campaign in the dataset is described using five key modalities:
- Audience – the consumer segment being targeted
- Brand – the positioning and perception of the brand
- Creative – the tone and message of the campaign
- Platform – where the campaign runs
- Geography – the markets being targeted
Each sample in the dataset is assigned a target label: Positive (overperforming), Negative (underperforming), or Average (average performance). The label indicates whether that particular combination of modalities would lead to a successful, underperforming, or average campaign outcome.

Experimental Results

All federated experiments are implemented using Flower, a widely adopted open-source framework for federated learning research and deployment. Flower allows us to simulate multi-client federated setups in a controlled environment, making it possible to rigorously test different configurations before moving to a fully distributed architecture.

To ensure a fair comparison between centralized and federated setups, we kept the playing field level. Both setups use the exact same model architecture, so any performance differences come from how the model is trained, not what is being trained. In the federated setup, data is split equally across clients so that each partner sees a representative sample. This way, when we increase the number of clients, any change in performance can be attributed to the scaling itself, not to differences in what each client’s data looks like.

Experiment 1

Question: How does FL performance compare to centralized training? What happens as more clients join?

Figure 2: Impact of increasing client fragmentation on Federated Learning performance. Performance clearly degrades as the number of clients increases from 5 to 15, compared to the centralized baseline model.

The centralized model sets the performance ceiling at 79.67%. This is expected: when a single model has direct access to all the data at once, it has the best possible conditions to learn. No information is lost to partitioning, and no coordination overhead is introduced. It’s the ideal scenario, and the benchmark everything else is measured against.

The federated results tell a clear story: as we add more clients, performance gradually declines. With 5 clients, the model reaches 76.23%, a modest drop from the centralized baseline. But as we scale to 10 and then 15 clients, scores fall to 70.29% and 67.65% respectively. The same pattern holds across all metrics, with the sharpest drops in the model’s ability to correctly identify both positive and negative cases.
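
To make the size of these drops concrete, the short calculation below derives the absolute and relative gaps to the centralized baseline. The scores are the ones quoted in this section; the variable names are illustrative only.

```python
centralized = 79.67
federated = {5: 76.23, 10: 70.29, 15: 67.65}  # number of clients -> score (%)

gaps = {}
for n_clients, score in federated.items():
    absolute = centralized - score            # percentage points lost
    relative = 100 * absolute / centralized   # loss as a share of the baseline
    gaps[n_clients] = (round(absolute, 2), round(relative, 1))
    print(f"{n_clients:>2} clients: -{absolute:.2f} points ({relative:.1f}% of baseline)")
```

The gap grows from 3.44 points with 5 clients (about 4.3% of the baseline) to 12.02 points with 15 clients (about 15.1%).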

Why does this happen? As more clients join, the total dataset gets divided into smaller slices. Each client sees less data, which means each client’s local training produces a less reliable picture of the overall patterns. When the server combines these local updates, the differences between them make it harder to converge on a strong global model, an effect we call the “aggregation penalty.”

Lesson Learned: FL with 5 clients comes close to centralized performance (76.23% versus 79.67%), showing that federated collaboration is viable with a modest accuracy loss. However, as the number of clients grows, each client holds a smaller slice of the data, making it progressively harder for the global model to match centralized results.

Experiment 2

Question: How robust are centralized and federated models to noisy data?

In practice, marketing data is never perfectly clean. Campaign outcomes don’t fall neatly into “this worked” or “this didn’t.” Was a campaign that slightly exceeded expectations truly a success, or just average? Was a modest underperformance a failure, or noise in the measurement? Different teams may label the same outcome differently, tracking systems introduce inconsistencies, and the line between a “positive” and “average” campaign is often blurry.

To simulate this reality, we deliberately introduced noise into our synthetic dataset by blurring the boundaries between performance classes. With no noise, the labels are clean: positive, negative, and average outcomes are clearly separated. As we increase the noise level from low, to medium, and then to high, the boundaries between these classes increasingly overlap, making it harder for the model to tell them apart. Think of it like gradually turning up the fog: the underlying patterns are still there, but they become harder to see. The federated learning simulation for this experiment was configured with 5 participating clients, consistent with the best-performing federated setup identified in Experiment 1.

Figure 3: Performance comparison of centralized (left) and federated learning (right) configurations across increasing noise levels. Both paradigms degrade gradually, with Positive F1 and Negative F1 most affected, while the performance gap between the two remains approximately constant across all conditions.

As expected, both models perform best on clean data and gradually decline as noise increases. At high noise:
- The centralized model’s score drops from 82.05% to 78.11%
- The FL model’s score drops from 80.74% to 76.09%
The good news: neither model collapses. Even at the highest noise level, both models still perform reasonably well. The overall accuracy dips, and the models struggle most with distinguishing clearly positive or clearly negative campaigns, which makes sense, since those are exactly the boundaries we blurred. However, their ability to capture general patterns across the dataset remains stable throughout.

As in Experiment 1, the centralized model maintains a consistent edge over the federated setup at every noise level, but the gap between them stays roughly the same. This means that FL doesn’t become more fragile in noisy conditions; it handles data messiness about as well as its centralized counterpart.

Lesson Learned: Real-world data is inherently noisy, and any viable model must be able to handle that. Both centralized and FL models show strong resilience — performance declines gradually rather than breaking down, even when the data is heavily corrupted. Importantly, FL’s relative performance holds steady across noise levels, suggesting it is no more vulnerable to messy data than centralized training.

Experiment 3

Question: How sensitive are centralized and federated models to underlying cross-modal patterns?

Our synthetic data generator creates campaign data based on a graph of relationships between five key factors: Audience, Brand, Creative, Platform, and Geography. Each relationship captures whether a particular combination of these factors tends to drive strong or weak campaign performance. Some of these relationships are common and obvious — they show up frequently and reflect well-known marketing dynamics. Others are rare and subtle — unusual combinations that don’t appear often but may carry uniquely valuable signal about what makes a campaign succeed or fail.

Understanding how these different types of patterns affect learning is important for both training paradigms. If the nature of the underlying data patterns matters, we need to know whether centralized and federated models respond to them in the same way — or whether one setup handles certain patterns better than the other. To investigate this, we generated three versions of our dataset, keeping everything else the same:
- Common-first: The generator focuses on the most frequently occurring combinations and downplays the rarest ones. This gives us a dataset dominated by typical, familiar marketing patterns.
- Rare-first: The opposite — the generator prioritizes the rarest combinations and downplays the most common. This fills the dataset with unusual, less obvious patterns.
- Middle-ground: The generator focuses on combinations that fall in the middle of the frequency spectrum, neither the most common nor the rarest.
As in Experiment 2, the federated learning simulation was run with 5 participating clients, and performance was compared against the centralized baseline across all three dataset versions.

Figure 4: Impact of cross-modal relationships on model performance. Prioritizing rare feature combinations (Rare-first) substantially improves accuracy compared with focusing on common patterns, showing that atypical relationships provide a stronger learning signal for both centralized and federated learning paradigms.

The results were striking. The Rare-first configuration dramatically outperformed the other two, achieving peak scores of 94.41% (Centralized) and 93.43% (FL), compared to scores in the 86–88% range for the Common-first and Middle-ground setups.

This tells us something counterintuitive: the model learns far more from unusual feature combinations than from common ones. The typical, frequently seen patterns are in some sense “easy”: they don’t give the model much new information. But rare combinations force the model to learn more nuanced and distinctive boundaries between what makes a campaign succeed or fail.

As in previous experiments, the centralized model maintains a small edge over FL, but the ranking between dataset strategies stays the same in both setups. Whether training is centralized or federated, prioritizing rare patterns is the winning strategy.

Lesson Learned: Not all data is equally valuable. Prioritizing rare, atypical feature combinations produces significantly better models than focusing mostly on common patterns. This has direct implications for how we design synthetic datasets: rather than mimicking the most typical marketing dynamics, we should deliberately include uncommon combinations to give the model a richer and more discriminative learning signal.

The Impact and Looking Ahead

This work is just the initial spark for our federated learning efforts. Verifying that the performance of our centralized ML model degrades only slightly when training is federated across a reasonable number of clients opens the discussion about delivering ML solutions for challenges shared among clients who are reluctant to share data, even to tackle a common industry problem. The FL approach allows companies to train a shared global model on their own datasets without exposing raw data during the training process.

Although Federated Learning has been an established collaborative learning method since 2017, it remains a highly active research domain in academia and a strategic priority for industrial implementation. The findings from the WPP AI Lab’s initial FL research establish the foundation for further exploration, specifically focusing on the following directions:

1. Privacy Constraints in Malicious FL Environments

While FL provides a relatively secure framework for multiple organizations to collaboratively train a global model, a critical question remains: how safe is it to exchange local and global model updates during each communication round? Extensive literature shows that FL networks are vulnerable to attacks from malicious clients or a compromised central server. Consequently, there is a pressing need for robust defense mechanisms that enable honest participants to verify the integrity and security of the collaborative learning process.

2. Evaluation of Advanced and Realistic FL Scenarios

While simulating collaborative training with evenly distributed data provides a valuable baseline for foundational FL research, it does not fully capture the complexities of real-world implementations. Our next objective is to build on our preliminary investigations into data heterogeneity, drawing on the noise-injection experiments previously conducted on synthetic datasets. We will also assess the efficacy of using a shared synthetic dataset on the central server as a benchmark to evaluate the integrity of incoming model updates and detect potential malicious activity. Finally, we plan to move from the simulated FL environment currently facilitated by the Flower framework to a fully distributed architecture. By deploying distinct nodes to represent separate corporate entities, we will empirically investigate and address the communication bottlenecks inherent in practical FL deployments.

Disclaimer: This content was created with AI assistance. All research and conclusions are the work of the WPP AI Lab team.
March 26, 2026

Multimodal Federated Learning Pod

Multimodal Federated Learning for Marketing Outcome Prediction: A Deep Dive Flower-Based Simulation Analysis

Training high-capacity models for marketing prediction is often constrained not by algorithmic complexity but by data locality. In cross-portfolio settings, informative signals are dispersed across organizational silos, and raw data is typically non-transferable due to privacy and governance restrictions. Conventional centralized training, which pools all data to train a single model, is therefore often infeasible, while standard distributed training within a single trust domain does not address the fundamental challenge of learning across independent data owners.

The Multimodal Challenge

The problem is further complicated by the inherently multimodal nature of modern marketing data. A single campaign may span visual brand assets, textual ad copy, structured audience segments, and cross-channel performance metrics. Multimodal learning, training models to reason jointly across such heterogeneous inputs, is already among the most demanding areas in machine learning. Under collaborative constraints, the challenge intensifies: clients may hold different modality combinations in varying formats and volumes, and a global model must learn from this fragmented landscape without ever accessing raw data.

Why use Federated Learning?

Federated Learning (FL) offers a principled alternative by keeping data at the edge and exchanging only model updates. In the standard cross-silo formulation, each client trains locally on its private partition, a central coordinator aggregates the resulting parameters, and the process repeats over communication rounds until convergence. This report studies that pipeline in a multimodal classification setting, where each sample is a structured composition of contextual modalities and the task is ternary outcome prediction.

The Objective

A fundamental question must be answered before committing to real-world deployment:

Does federated learning perform well enough on multimodal marketing data to justify its tradeoffs?

Centralized training retains an intrinsic advantage: full visibility into the data distribution at every optimization step. The goal of this work is not to show that FL surpasses centralized performance, but to determine whether it approximates it closely enough for the privacy and collaboration benefits to be practically worthwhile. We further examine FL robustness under realistic stress conditions: scaling the number of participating clients, introducing data noise, and investigating the complexity of relationships among different modalities. If this intersection of federated and multimodal learning can be made viable, it enables collaborative intelligence across organizations at a scale neither approach could achieve alone.

To ensure a controlled, reproducible analysis, we generate multimodal datasets with a configurable synthetic generator and assign them to virtual clients using homogeneous partitioning that approximates an independent and identically distributed (IID) split, in order to simulate collaborative training. We implement the full FL loop in simulation mode using Flower, a state-of-the-art federated learning framework. The experiments that follow quantify how federation affects predictive quality relative to a centralized baseline, and how performance varies with (i) the number of clients, (ii) dataset noise levels, and (iii) the prioritization of cross-modal relationships in the synthetic data generation based on their frequencies.

Handling the Data

Synthetic Data Generator

For our experiments, we used proprietary multimodal synthetic datasets generated with an internally developed framework. The synthetic data was designed to replicate real-world marketing dynamics, where campaign outcomes are shaped by the interplay of multiple contextual factors (modalities). The framework provides precise control over the composition of each data instance, enabling rigorous evaluation of the proposed model under known conditions and full transparency into the factors driving campaign performance. The specific hyperparameters governing each dataset’s generation are detailed in the experimental sections below, where they are adjusted according to each experiment’s objectives.

Each campaign instance within the dataset is characterized by five distinct modalities:

- Audience – the target consumer segment
- Brand – the strategic positioning and perceptual attributes of the brand
- Creative – the tonal and messaging characteristics of the campaign
- Platform – the distribution channel on which the campaign is deployed
- Geography – the market region(s) in which the campaign is activated

Each sample is assigned a categorical performance label — Positive (Overperforming), Negative (Underperforming), or Average (Performing within expected bounds) — indicating the projected campaign outcome for a given multimodal configuration. This ternary classification scheme enables the model to learn discriminative representations across the full spectrum of campaign effectiveness.

Data Partitioning

After generating the training and test datasets with the synthetic data generator, the training data must be partitioned across federated clients to simulate a real-world scenario in which multiple companies collaboratively train the model, each holding its own proprietary data. In practice, this partitioning step would not be necessary, as each company would naturally possess its own local dataset. However, in our simulated environment, the codebase provides the partition-dataset script, which splits the training dataset across a user-defined number of clients using either a homogeneous or heterogeneous partitioning strategy.

Under homogeneous partitioning, the script groups training samples by label and divides each label’s indices into equal-sized segments across the specified number of clients. This ensures each client receives a statistically representative subset of the data with approximately uniform class proportions. Under heterogeneous partitioning, the script samples from a Dirichlet distribution with concentration parameter alpha to determine the proportion of each class assigned to each client, producing non-IID label skew of configurable severity. Lower alpha values yield more extreme label imbalance across clients, while higher values produce distributions that progressively approach the homogeneous case. For our proof-of-concept research project, we focus our experiments on a homogeneous data split.
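
The two strategies can be sketched as follows. This is a minimal standard-library illustration, not the project's partition-dataset script: the function names, the round-robin dealing used for the homogeneous split, and the gamma-based Dirichlet sampler are all assumptions made for the example.

```python
import random
from collections import defaultdict

def _group_by_label(labels):
    """Map each label to the list of sample indices carrying it."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    return by_label

def homogeneous_partition(labels, num_clients, seed=42):
    """Split each label's indices evenly so clients share similar class proportions."""
    rng = random.Random(seed)
    clients = [[] for _ in range(num_clients)]
    for indices in _group_by_label(labels).values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            clients[pos % num_clients].append(idx)  # deal indices round-robin
    return clients

def dirichlet_partition(labels, num_clients, alpha, seed=42):
    """Assign each label's indices to clients in Dirichlet(alpha) proportions."""
    rng = random.Random(seed)
    clients = [[] for _ in range(num_clients)]
    for indices in _group_by_label(labels).values():
        rng.shuffle(indices)
        # Normalised Gamma(alpha, 1) draws form one sample from Dirichlet(alpha, ..., alpha).
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(draws) or 1.0
        start, cumulative = 0, 0.0
        for client_id, draw in enumerate(draws):
            cumulative += draw / total
            if client_id == num_clients - 1:
                end = len(indices)  # last client takes whatever remains
            else:
                end = min(round(cumulative * len(indices)), len(indices))
            clients[client_id].extend(indices[start:end])
            start = end
    return clients

# Hypothetical label column with the three outcome classes.
labels = ["Positive"] * 300 + ["Negative"] * 300 + ["Average"] * 400
iid = homogeneous_partition(labels, num_clients=5)
skewed = dirichlet_partition(labels, num_clients=5, alpha=0.1)
print("homogeneous sizes:", [len(c) for c in iid])
print("dirichlet sizes:  ", [len(c) for c in skewed])
```

With these toy labels the homogeneous split gives every client 200 samples (60 Positive, 60 Negative, 80 Average), whereas a small alpha such as 0.1 typically concentrates each class on one or two clients.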

Upon completion, the script outputs individual training files per client, a shared server test set, and a heat map visualizing the label distribution across all partitions. An example of a homogeneous partition heat map across 5 clients is shown below:

Figure 1: Heat map showing homogeneous data partitioning across 5 clients in the federated learning scenario

With the dataset now generated and partitioned into per-client training files, we can move from data preparation to the federated training setup. In the next section, we introduce Flower, the framework we use to orchestrate client selection, parameter exchange, and aggregation over these partitions.

Flower: The Federated Learning Framework

What is Flower?

Flower is an open-source, framework-agnostic federated learning framework that enables collaborative model training across decentralized data holders without requiring raw data to leave its source. It supports both real-world distributed deployment over gRPC and local simulation through a Virtual Client Engine (VCE) backed by the Ray distributed runtime. In simulation mode, clients are virtualized as ephemeral objects instantiated on demand, allowing researchers to simulate federations of arbitrary size on a single machine with fine-grained resource control.

Simulation Cycle

The simulation follows a centralized client-server architecture executed over iterative communication rounds. Each round proceeds through five sequential stages:

Figure 2: Breakdown of a federated learning communication round into steps under the Flower framework.

Each virtual client is instantiated at the beginning of its task, executes training or evaluation, returns results, and is immediately destroyed. This allows all five clients in this configuration to run concurrently without persistent memory allocation.

The above diagram illustrates a complete FL communication round orchestrated by the Flower framework. The process begins with Client Selection, where the Flower Server selects a subset of participants from a broader pool. The server then distributes the current global model parameters to all selected clients simultaneously. Each client performs Local Training in parallel using Ray workers, training on its own private dataset to produce a locally updated model. These Local Updates are sent back to the server, which performs Federated Averaging (FedAvg). This is a weighted average, so clients with more data have proportionally more influence on the updated global model. A readable way to write the FedAvg update is:

$$\theta_{\text{global}} = \sum_{k \in \mathcal{K}} \frac{n_k}{N} \, \theta_k$$

Where:

- $\theta_{\text{global}}$ denotes the updated global model parameters after aggregation.
- $\mathcal{K}$ is the set of clients selected to participate in this round.
- $\theta_k$ denotes the model parameters after client $k$ finishes local training.
- $n_k$ is the number of training samples held by client $k$.
- $N = \sum_{k \in \mathcal{K}} n_k$ is the total number of samples across the participating clients.

Finally, a Federated Evaluation step assesses the updated global model across all clients and reports per-client accuracy metrics. This pipeline repeats across successive FL rounds until convergence.
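
As a concrete reading of the FedAvg formula above, the sketch below applies the weighted average to plain Python lists. It illustrates the arithmetic only and does not reproduce Flower's internal implementation; the function name `fedavg` and the representation of a model as a list of flat layers are assumptions.

```python
def fedavg(client_updates):
    """Weighted average of client parameters.

    client_updates: list of (theta_k, n_k) pairs, where theta_k is a list of
    layers and each layer is a flat list of floats.
    """
    total = sum(n_k for _, n_k in client_updates)  # N
    first_theta, _ = client_updates[0]
    aggregated = [[0.0] * len(layer) for layer in first_theta]
    for theta_k, n_k in client_updates:
        weight = n_k / total                       # n_k / N
        for layer_idx, layer in enumerate(theta_k):
            for i, value in enumerate(layer):
                aggregated[layer_idx][i] += weight * value
    return aggregated

updates = [
    ([[1.0, 2.0], [0.0]], 100),  # hypothetical client A with 100 samples
    ([[3.0, 4.0], [1.0]], 300),  # hypothetical client B with 300 samples
]
theta_global = fedavg(updates)
print(theta_global)  # [[2.5, 3.5], [0.75]]
```

Client B holds three times as much data as client A, so its weight is 0.75 and the aggregated parameters sit closer to its values.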

A key architectural detail of Flower’s VirtualClientEngine underpins this workflow: each virtual client is instantiated at the start of its task, runs training or evaluation, returns results, and is immediately destroyed. This enables all clients in the configuration to run concurrently without persistent memory allocation, making the system highly scalable even on resource-constrained hardware. By keeping raw data decentralized at the edge and exchanging only model parameters, this design preserves data privacy by default while still enabling collaborative model improvement across heterogeneous environments.
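
The lifecycle described above can be mimicked without any federated learning library. The sketch below is not Flower's API: `ToyClient`, `client_fn`, and `run_task` are hypothetical names, and a thread pool stands in for the Ray workers. It only illustrates the pattern of creating a client on demand, running one task, and discarding it.

```python
from concurrent.futures import ThreadPoolExecutor

class ToyClient:
    """Hypothetical stand-in for a virtual client holding one data partition."""

    def __init__(self, client_id, partition):
        self.client_id = client_id
        self.partition = partition

    def fit(self, global_value):
        # Placeholder "training": move the value halfway toward the local mean.
        local_mean = sum(self.partition) / len(self.partition)
        return 0.5 * (global_value + local_mean), len(self.partition)

def client_fn(client_id, partitions):
    """Factory called once per task; no client state is kept between tasks."""
    return ToyClient(client_id, partitions[client_id])

def run_task(client_id, partitions, global_value):
    client = client_fn(client_id, partitions)  # instantiate on demand
    result = client.fit(global_value)          # execute the task
    del client                                 # discard the object; only the result survives
    return result

partitions = {0: [1.0, 2.0], 1: [3.0], 2: [4.0, 5.0, 6.0]}
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(run_task, cid, partitions, 0.0) for cid in sorted(partitions)]
    results = [future.result() for future in futures]
print(results)  # [(0.75, 2), (1.5, 1), (2.5, 3)]
```

Because each task rebuilds its client from the partition it needs, memory use is bounded by the number of concurrent workers rather than by the size of the federation.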

Configuration Reference

The simulation is governed by a YAML configuration organized into five sections that correspond to Flower’s core components. Here, we present the YAML file with its default values.

server:
  strategy: Mean
  fraction_fit: 1.0
  fraction_eval: 1.0
  num_rounds: 4
  server_dataset_included: false
client:
  num_clients: 5
model:
  name: GeneralisedJointModel_3cls_feat_drop
  hidden_embed_dim: 256
  embed_dim: 256
  hidden_dim: 256
  dropout_prob: 0.3
  modality_dropout_prob: 0
  local_epochs: 5
  batch_size: 64
  lr: 1e-3
  weight_decay: 1e-4
  optuna_n_trials: 3
general:
  use_wandb: true
  random_seed: 42
  text_embedding_task: classification
  data_version: V28_noise_0_percent
  data_split: homogeneous
  alpha: 0.1
  one_hot_modalities:
backend:
  client_resources:
    num_cpus: 2.0
    num_gpus: 0.0

To run an experiment, the user only needs to adjust the desired parameters in this YAML file and execute the simulation with:

poetry run simulation

The framework reads the configuration, initializes all components accordingly, and executes the full federated learning cycle without any additional setup. Below, we explain what each section’s default values represent.

Server Configuration

Defines the central orchestrator responsible for client coordination, parameter distribution, and aggregation.

Parameter	Default Value	Description
`strategy`	`Mean`	Aggregation strategy implementing Federated Averaging (FedAvg), where client model parameters are averaged weighted by each client’s local dataset size.
`fraction_fit`	`1.0`	Fraction of clients selected for training each round. At `1.0`, all clients participate in every round.
`fraction_eval`	`1.0`	Fraction of clients selected for evaluation each round. At `1.0`, all clients evaluate after each aggregation.
`num_rounds`	`4`	Total number of federated communication rounds. Combined with 5 local epochs per round (default value), each data sample is exposed to 20 effective training epochs.
`server_dataset_included`	`false`	The server holds no data partition and acts purely as a parameter aggregator.
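To make the `Mean` strategy and `fraction_fit` more concrete, the following is a minimal sketch of dataset-size-weighted averaging in plain Python. It is not Flower's implementation: the helper names `sample_clients` and `weighted_average`, and the representation of model parameters as flat lists of floats, are assumptions made purely for illustration.

```python
import random


def sample_clients(client_ids, fraction, rng):
    """Pick a fraction of the client pool for one round (at least one)."""
    k = max(1, int(len(client_ids) * fraction))
    return rng.sample(client_ids, k)


def weighted_average(updates):
    """FedAvg-style aggregation.

    `updates` is a list of (parameters, num_examples) pairs, where
    `parameters` is a flat list of floats (a simplification).
    """
    total = sum(n for _, n in updates)
    size = len(updates[0][0])
    return [
        sum(params[i] * n for params, n in updates) / total
        for i in range(size)
    ]


rng = random.Random(42)
# With fraction_fit = 1.0, every one of the 5 clients is selected.
selected = sample_clients(list(range(5)), fraction=1.0, rng=rng)

# Two hypothetical clients; the second holds three times as much data,
# so its parameters carry three times the weight.
aggregated = weighted_average([([1.0, 2.0], 100), ([3.0, 6.0], 300)])
print(aggregated)  # [2.5, 5.0]
```

Weighting by dataset size means that a client with more local data pulls the global model further toward its own update, which matters less under the homogeneous split used here than it would under a skewed one.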

Client Configuration

Defines the federated client pool. Each client represents an independent data silo with its own private partition.

Parameter	Default Value	Description
`num_clients`	`5`	Total number of virtual clients in the federation. With full participation, this constitutes a cross-silo setting with a small number of reliable, always-available participants.

Model Configuration

Defines the neural network architecture and the local training hyperparameters applied on each client.

Architecture Parameters

Parameter	Default Value	Description
`name`	`GeneralisedJointModel_3cls_feat_drop`	A custom multimodal fusion model with a modality-agnostic joint embedding space, three classification heads for multi-task prediction, and feature-level dropout regularization.
`hidden_embed_dim`	`256`	Dimensionality of hidden layers within each modality-specific encoder, applied before the fusion stage.
`embed_dim`	`256`	Dimensionality of the joint fused embedding space shared across all classification heads.
`hidden_dim`	`256`	Dimensionality of hidden layers within each of the three classification heads.
`dropout_prob`	`0.3`	Dropout probability applied in the classification heads and hidden layers.
`modality_dropout_prob`	`0`	Probability of dropping entire modality branches during training. At `0`, all modalities are always present (disabled).
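As an illustration of what dropping entire modality branches could look like, here is a small sketch. The model's actual implementation is not shown in this article, so the function `apply_modality_dropout`, the dictionary layout, and the choice to replace a dropped modality with zeros are all assumptions.

```python
import random


def apply_modality_dropout(embeddings, p, rng):
    """Zero out whole modality embeddings, each with probability `p`.

    `embeddings` maps a modality name to its embedding vector.
    With p == 0 every modality is kept, matching the default above.
    """
    out = {}
    for name, vector in embeddings.items():
        if p > 0 and rng.random() < p:
            out[name] = [0.0] * len(vector)  # drop the entire branch
        else:
            out[name] = list(vector)
    return out


# Hypothetical two-dimensional embeddings for three modalities.
sample = {"audience": [0.2, 0.7], "brand": [0.5, 0.1], "creative": [0.9, 0.4]}
unchanged = apply_modality_dropout(sample, p=0.0, rng=random.Random(42))
all_dropped = apply_modality_dropout(sample, p=1.0, rng=random.Random(42))
```

This differs from the element-wise dropout controlled by `dropout_prob`: a whole input branch is removed at once, which would push the model to cope with a missing modality.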

Training Parameters

Parameter	Default Value	Description
`local_epochs`	`5`	Number of complete passes over a client’s local dataset per communication round.
`batch_size`	`64`	Mini-batch size for local stochastic gradient descent.
`lr`	`1e-3`	Learning rate for the local optimizer.
`weight_decay`	`1e-4`	L2 regularization coefficient penalizing large weight magnitudes.
`optuna_n_trials`	`3`	Number of Optuna Bayesian hyperparameter optimization trials.

Data Configuration

Controls dataset versioning, partitioning strategy, and modality-specific preprocessing.

Parameter	Default Value	Description
`data_version`	`V28_noise_0_percent`	Dataset version identifier pointing to a specific preprocessed dataset.
`data_split`	`homogeneous`	Partitioning strategy across clients. Points to a homogeneous data distribution folder where data is split in an IID manner and each client receives a statistically representative partition with similar class distributions. The alternative `heterogeneous` points to a non-IID distribution folder where splits are governed by `alpha`.
`alpha`	`0.1`	Dirichlet concentration parameter controlling non-IID severity when `data_split` points to a `heterogeneous` distribution folder. Currently inactive as the configuration uses the `homogeneous` split. When active, lower values produce more extreme label skew across clients.
`text_embedding_task`	`classification`	Configures the text modality encoder for a classification objective, affecting pooling strategy and embedding optimization.
`one_hot_modalities`	`null`	Specifies modalities requiring one-hot encoding. Currently none.

The data_split and alpha parameters in the simulation YAML configuration must match the partitioning strategy used when running the partition-dataset script, because the simulation reads client data from the output directory, whose folder name encodes both the split type and the number of clients.
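To give a feel for how `alpha` shapes a heterogeneous split, the sketch below partitions sample indices across clients using per-class Dirichlet proportions. This is a common recipe for simulating label skew and not necessarily what the partition-dataset script does; `dirichlet` and `partition_by_label` are hypothetical helpers, and the toy labels are invented.

```python
import random


def dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k entries."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def partition_by_label(labels, num_clients, alpha, seed=42):
    """Assign sample indices to clients with Dirichlet label skew."""
    rng = random.Random(seed)
    clients = [[] for _ in range(num_clients)]
    for label in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == label]
        rng.shuffle(indices)
        shares = dirichlet(alpha, num_clients, rng)
        start, cumulative = 0, 0.0
        for client_id, share in enumerate(shares):
            cumulative += share
            if client_id == num_clients - 1:
                end = len(indices)  # last client takes the remainder
            else:
                end = round(cumulative * len(indices))
            clients[client_id].extend(indices[start:end])
            start = end
    return clients


# Toy dataset: 300 samples spread evenly over three classes.
labels = [i % 3 for i in range(300)]
parts = partition_by_label(labels, num_clients=5, alpha=0.1)
print([len(p) for p in parts])
```

With a small `alpha` such as 0.1, most of each class tends to land on one or two clients; larger values move the split toward the balanced shares of the homogeneous setting.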

Simulation Backend & Experiment Management

Controls reproducibility, logging, and resource allocation for parallel execution of virtual clients through the Ray runtime.

Parameter	Default Value	Description
`use_wandb`	`true`	Enables Weights & Biases experiment tracking for real-time metric logging and cross-experiment comparison.
`random_seed`	`42`	Global seed ensuring reproducibility in our experiments.
`num_cpus`	`2.0`	CPU cores reserved per virtual client task. Ray schedules a client only when the required cores are available.
`num_gpus`	`0.0`	GPU allocation per client. At `0.0`, all computation runs on CPU and concurrency is bounded solely by CPU availability.
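The relationship between `client_resources` and concurrency can be approximated with simple arithmetic. The sketch below is a rough estimate rather than Ray's actual scheduler; `max_concurrent_clients` is a hypothetical helper and the 8-core machine is an assumed example.

```python
def max_concurrent_clients(total_cpus, cpus_per_client,
                           total_gpus=0.0, gpus_per_client=0.0):
    """Rough upper bound on virtual clients that can run at once."""
    by_cpu = int(total_cpus // cpus_per_client)
    if gpus_per_client > 0:
        by_gpu = int(total_gpus // gpus_per_client)
        return min(by_cpu, by_gpu)
    return by_cpu  # with num_gpus = 0.0, only CPUs limit concurrency


# Hypothetical 8-core machine with the default client_resources.
concurrent = max_concurrent_clients(total_cpus=8, cpus_per_client=2.0)
print(concurrent)  # 4
```

Under these assumptions, four of the five default clients could train simultaneously and the fifth would start as soon as a slot frees up.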

Experiments

The following section outlines three experiments that compare federated learning with centralized training under different conditions.

Experiment 1 establishes a baseline comparison and tests how FL performance changes as the number of clients increases.
Experiment 2 examines how robust both approaches are when controlled noise is added to the training data.
Experiment 3 evaluates whether emphasizing common vs. rare cross-modal relationships during synthetic data generation affects model performance.

Overall, these experiments measure both the absolute performance of each approach and whether the performance gap between centralized and federated training remains consistent as the setting becomes more challenging.

To ensure a fair comparative evaluation, all experimental variables were held constant across centralized and federated configurations. Both setups employ an identical model architecture, ensuring that any observed performance differences are attributable solely to the training paradigm rather than structural variations in the model itself.

The total computational training budget was also standardized: the centralized model undergoes 20 sequential training epochs, while the federated configuration distributes this across 4 global communication rounds with each client performing 5 local epochs per round, yielding an equivalent total of 20 epochs. Furthermore, the training data is partitioned uniformly across all participating clients under a homogeneous assumption, ensuring that each client’s local dataset is a representative subset of the global distribution. This controlled partitioning ensures that any performance degradation observed with increasing client counts can be attributed to the effects of federation and aggregation at scale, rather than to statistical heterogeneity across client data.
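The budget equivalence and the effect of client count on local dataset size can be checked with a few lines of arithmetic. The sketch assumes that the "84K" training set reported for the experiments is exactly 84,000 samples and that a homogeneous split gives each client an equal share.

```python
NUM_ROUNDS = 4
LOCAL_EPOCHS = 5
CENTRALIZED_EPOCHS = 20
TRAIN_SIZE = 84_000  # "84K" taken as exactly 84,000 for illustration

# Each sample is seen once per local epoch, in every round.
federated_epochs = NUM_ROUNDS * LOCAL_EPOCHS
print(federated_epochs == CENTRALIZED_EPOCHS)  # True

# Equal shares under a homogeneous split (an assumption).
shard_sizes = {n: TRAIN_SIZE // n for n in (5, 10, 15)}
print(shard_sizes)  # {5: 16800, 10: 8400, 15: 5600}
```

The epoch budget stays fixed while each client's share shrinks, which is why any degradation at higher client counts can be attributed to federation rather than to less training.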

Experiment 1: Centralized vs. Federated Performance

The Data

We employed the V16 synthetic dataset, generated using the synthetic data generator, comprising 84K training samples and 36K test samples. For both centralized and federated training configurations, raw input features — specifically Audience, Brand, Creative, Platform, and Geography — were encoded into dense representations using Vertex AI embeddings as a preprocessing step.

The Results

Training Configuration	Score	Negative F1	Positive F1	Average F1
Centralized (Baseline)	0.7967	0.7074	0.7305	0.9523
FL with 5 clients	0.7623	0.6841	0.6571	0.9458
FL with 10 clients	0.7029	0.5781	0.5902	0.9405
FL with 15 clients	0.6765	0.5345	0.5581	0.9368

The centralized training configuration establishes the upper performance bound at a score of 0.7967. This outcome is theoretically expected, as the model benefits from unrestricted access to the complete dataset, without information loss due to partitioning or coordination overhead inherent in distributed paradigms. It therefore serves as the reference benchmark for all federated configurations.

The federated learning results reveal a consistent and monotonic degradation in performance as the number of participating clients increases. With 5 clients, the model achieves a score of 0.7623, a modest decline of approximately 3.4 points from the centralized baseline. However, scaling to 10 and 15 clients yields more substantial reductions to 0.7029 and 0.6765, respectively. This pattern is reflected across all evaluation metrics, but the decline is most pronounced in the Negative F1 and Positive F1 scores, which degrade at a markedly steeper rate than Average F1. This suggests that class-specific discriminative performance is more sensitive to data partitioning than overall classification ability.

This observed degradation is primarily attributable to the aggregation penalty. As the number of clients grows, the training corpus is divided into progressively smaller subsets, resulting in local model updates that are less representative of the global data distribution. The increased variance among these updates introduces noise during server-side aggregation, impeding convergence toward a robust global model.

Lesson Learned: FL with 5 clients comes remarkably close to centralized performance, showing that federated collaboration is viable with minimal accuracy loss. However, as the number of clients grows, it becomes progressively harder for the global model to match centralized results.

Experiment 2: Resilience to Noisy Data

The Data

To conduct this experiment, it was necessary to generate synthetic data with controlled levels of noise. To understand what noise means in this context, it is important to first describe how the synthetic data is generated.

The data generation process is grounded in a predefined graph structure. In this graph, nodes represent distinct values for each modality (namely Audience, Brand, Creative, Platform, and Geography), while edges encode the pairwise relationships between these values. Each edge carries a label of either Positive (indicating an overperforming campaign) or Negative (indicating an underperforming campaign).

The generator samples from this graph to produce a user-defined number of data points, subject to a set of hard constraints governed by configurable hyperparameters. Specifically, the user defines the desired number of samples for each target label: Positive, Negative, and Average. The generation of a single data sample proceeds as follows:

Value Selection: One or more unique values are selected for each modality.
Pairwise Evaluation: All pairwise combinations among the selected values are evaluated against the graph. Each combination is classified as positive, negative, or missing — the latter indicating that no edge exists between the two values in the graph.
Proportion Calculation: The proportions of positive, negative, and missing combinations are computed relative to the total number of pairwise combinations.
Label Assignment: These proportions are then compared against predefined acceptable ranges specified in the hyperparameters for each target label. If the proportions fall within the range defined for Positive, Negative, or Average, the sample is assigned the corresponding label. If the proportions do not satisfy any of the defined ranges, the sample is discarded and the generation process is repeated.
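The pairwise evaluation, proportion calculation, and label assignment steps above can be sketched as follows. This is an illustrative reconstruction, not the generator's code: the edge dictionary, the example values, the function names, and every numeric range are hypothetical.

```python
from itertools import combinations


def pair_proportions(values, edges):
    """Return the share of positive, negative, and missing pairs."""
    counts = {"positive": 0, "negative": 0, "missing": 0}
    pairs = list(combinations(values, 2))
    for a, b in pairs:
        counts[edges.get(frozenset((a, b)), "missing")] += 1
    return {kind: n / len(pairs) for kind, n in counts.items()}


def assign_label(proportions, ranges):
    """Return the first label whose ranges contain the proportions."""
    for label, bounds in ranges.items():
        if all(lo <= proportions[kind] <= hi
               for kind, (lo, hi) in bounds.items()):
            return label
    return None  # the caller discards the sample and tries again


# Hypothetical graph edges and selected values (one per modality here).
edges = {
    frozenset(("audience:gen_z", "brand:acme")): "positive",
    frozenset(("audience:gen_z", "creative:video")): "positive",
    frozenset(("brand:acme", "creative:video")): "positive",
    frozenset(("brand:acme", "platform:social")): "negative",
}
values = ["audience:gen_z", "brand:acme", "creative:video", "platform:social"]

# Hypothetical acceptable ranges; the real ones come from hyperparameters.
ranges = {
    "Positive": {"positive": (0.4, 1.0), "negative": (0.0, 0.2)},
    "Negative": {"positive": (0.0, 0.2), "negative": (0.4, 1.0)},
    "Average": {"positive": (0.1, 0.3), "negative": (0.1, 0.3)},
}

proportions = pair_proportions(values, edges)
label = assign_label(proportions, ranges)
print(label)  # Positive
```

Four selected values yield six pairs, of which three are positive, one is negative, and two are missing, so the sample satisfies the assumed Positive range.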

A key question that arises from this process is: How are the predefined acceptable ranges for each target label determined?

To address this, we conducted the following preliminary experiment. We randomly sampled 10,000 subgraphs, each comprising 1,000 edges, from the initial graph. For each subgraph, we computed the proportions of positive, negative, and missing pairwise combinations. Across these 10,000 samples, we derived the mean and standard deviation of each of the three proportions. These statistics were then used to define the acceptable range for the Average target label, representing the typical composition of a randomly sampled subgraph.

The acceptable ranges for the Positive and Negative target labels were subsequently defined by shifting the boundaries of the Average range along the respective axes. Specifically, the Positive range requires the proportion of positive combinations to exceed the Average upper bound by at least 5 standard deviations, and similarly, the Negative range requires the proportion of negative combinations to exceed the Average upper bound by the same margin. This ensures a clear statistical separation between the three label categories, such that samples assigned to the Positive or Negative class exhibit meaningfully distinct distributional characteristics from those labeled as Average.
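A small sketch of this range construction is given below. The article does not state how wide the Average range is around the mean, so the `width` of one standard deviation is an assumption, as are the helper names and the five toy proportions standing in for the 10,000 sampled subgraphs.

```python
from statistics import mean, stdev


def average_range(samples, width=1.0):
    """Return (lower, upper, sigma) for mean ± width * std.

    `width` is an assumed value; the source does not specify it.
    """
    mu, sigma = mean(samples), stdev(samples)
    return mu - width * sigma, mu + width * sigma, sigma


def shifted_lower_bound(samples, margin=5.0, width=1.0):
    """Minimum proportion required for the Positive (or Negative) label."""
    _, upper, sigma = average_range(samples, width)
    return upper + margin * sigma


# Toy positive-combination proportions from a handful of subgraphs.
positive_props = [0.10, 0.12, 0.11, 0.09, 0.13]

lower, upper, sigma = average_range(positive_props)
positive_min = shifted_lower_bound(positive_props)
print(round(upper, 4), round(positive_min, 4))
```

The same calculation applied to the negative proportions would give the threshold for the Negative label; the 5-standard-deviation margin is what keeps the three classes statistically separated.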

Based on the above methodology, we established the appropriate acceptable ranges for each target label. This, however, raises a subsequent question: What constitutes noise in this context?

Figure 3: Impact of additive noise on the acceptable ranges for each target label in the synthetic data generator. As noise increases from zero to high, the opposing acceptable range for each sample’s target label progressively widens. This increases the acceptable proportion of negative combinations for the Positive label, positive combinations for the Negative label, and both equally for the Average label, thereby reducing the distributional separation between label categories.

In our framework, noise is defined as the relaxation of the opposing acceptable range for a given target label. Specifically, introducing noise to the Positive target label corresponds to increasing its acceptable proportion of negative combinations — effectively reducing the degree of “positiveness” required for a sample to be classified as Positive. Conversely, adding noise to the Negative target label increases its acceptable proportion of positive combinations. For the Average target label, the additive noise is distributed equally across both the Positive and Negative acceptable ranges.

This noise mechanism is applied at three levels of intervention — low, medium, and high — each progressively widening the acceptable range of the opposing value for a given target label. The figure above illustrates how the acceptable ranges for each target label are impacted under each level of intervention.
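The relaxation mechanism can be expressed as a widening of the upper bound on the opposing proportion. In the sketch below, the noise magnitudes, the base ranges, and the function `relax` are all invented for illustration; the generator's real values are not given in this article.

```python
# Assumed noise magnitudes; the actual levels are not stated in the text.
NOISE_LEVELS = {"none": 0.0, "low": 0.05, "medium": 0.10, "high": 0.15}

# Hypothetical clean ranges as (lower, upper) bounds per proportion.
RANGES = {
    "Positive": {"positive": (0.4, 1.0), "negative": (0.0, 0.2)},
    "Negative": {"positive": (0.0, 0.2), "negative": (0.4, 1.0)},
    "Average": {"positive": (0.1, 0.3), "negative": (0.1, 0.3)},
}


def relax(ranges, label, level):
    """Return a copy of `ranges[label]` with the opposing bound widened."""
    noise = NOISE_LEVELS[level]
    bounds = dict(ranges[label])
    if label == "Positive":
        lo, hi = bounds["negative"]
        bounds["negative"] = (lo, hi + noise)
    elif label == "Negative":
        lo, hi = bounds["positive"]
        bounds["positive"] = (lo, hi + noise)
    else:  # Average: split the noise equally across both sides
        for kind in ("positive", "negative"):
            lo, hi = bounds[kind]
            bounds[kind] = (lo, hi + noise / 2)
    return bounds


noisy_positive = relax(RANGES, "Positive", "high")
noisy_average = relax(RANGES, "Average", "high")
```

Under these assumed numbers, a Positive sample may contain up to 35% negative combinations at the high level instead of 20%, which is what blurs the boundary with the other classes.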

To support this experiment, four synthetic datasets were generated, each comprising 84K training samples and 36K test samples:

Clean: No noise intervention applied.
Low Noise: Low-level relaxation of the opposing acceptable ranges.
Medium Noise: Medium-level relaxation of the opposing acceptable ranges.
High Noise: High-level relaxation of the opposing acceptable ranges.

The federated learning simulation was configured with five participating clients, and performance was evaluated against the centralized baseline across all four dataset conditions.

The Results

Training Configuration	Score	Negative F1	Average F1	Positive F1
FL with no noise	0.8074	0.8029	0.8633	0.7559
Centralized with no noise	0.8205	0.8165	0.8692	0.7760
FL with low noise	0.7923	0.7902	0.8555	0.7313
Centralized with low noise	0.8159	0.8039	0.8750	0.7686
FL with medium noise	0.7826	0.7718	0.8570	0.7189
Centralized with medium noise	0.8017	0.7876	0.8643	0.7533
FL with high noise	0.7609	0.7344	0.8625	0.6859
Centralized with high noise	0.7811	0.7551	0.8659	0.7221

As anticipated, both the centralized and federated models achieve their highest performance on clean data and exhibit a gradual decline as noise levels increase. At the highest noise intervention, the centralized model’s score decreases from 0.8205 to 0.7811, while the federated model’s score declines from 0.8074 to 0.7609 — representing drops of approximately 3.9 and 4.7 percentage points, respectively.

Notably, neither model exhibits catastrophic degradation under any noise condition. Even at the highest level of intervention, both configurations maintain reasonable performance. The most pronounced declines are observed in the Positive F1 and Negative F1 scores, which is consistent with the noise injection methodology described above: since noise is introduced by relaxing the opposing acceptable range for each target label, the boundaries between Positive and Negative classes become increasingly blurred, making these the most challenging distinctions for the model. In contrast, the Average F1 remains remarkably stable across all noise levels for both configurations, indicating that the models’ capacity to capture general distributional patterns is largely unaffected by the introduced noise.

Consistent with the findings from Experiment 1, the centralized model maintains a performance advantage over the federated configuration at every noise level. However, the magnitude of this gap stays within a narrow band (roughly 1.3 to 2.4 points in overall score) and does not widen systematically as noise increases. This observation is significant: it indicates that the federated setup does not exhibit increased sensitivity to noisy data relative to its centralized counterpart. The performance differential between the two paradigms is attributable to the aggregation penalty discussed in Experiment 1, rather than to any compounding effect of noise on the federated training process.
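The gap between the two paradigms can be recomputed directly from the Score column of the results table. The snippet below only restates those reported numbers; the dictionary layout is our own.

```python
# (centralized score, FL score) per noise level, from the table above.
SCORES = {
    "none": (0.8205, 0.8074),
    "low": (0.8159, 0.7923),
    "medium": (0.8017, 0.7826),
    "high": (0.7811, 0.7609),
}

# Gap in percentage points, rounded to two decimals.
gaps = {
    level: round((central - fl) * 100, 2)
    for level, (central, fl) in SCORES.items()
}
print(gaps)  # {'none': 1.31, 'low': 2.36, 'medium': 1.91, 'high': 2.02}
```

The gap is smallest on clean data and then fluctuates around two points without growing from low to high noise, which supports reading the difference as a fixed cost of federation rather than an interaction with noise.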

Lesson Learned: Real-world data is inherently noisy, and any viable model must be able to handle that. Both centralized and FL models show strong resilience: performance declines gradually rather than breaking down, even when the data is heavily corrupted. Importantly, FL’s relative performance holds steady across noise levels, suggesting it is no more vulnerable to messy data than centralized training.

Experiment 3: Impact of Cross-Modal Relationships under Synthetic Data Generation

The Data

Leveraging the synthetic data generator enables the investigation of additional structural characteristics of the initial marketing graph — specifically, the edge types representing pairwise relationships between modalities. Understanding which modality pairs (e.g., Audience–Brand or Creative–Geography) are most influential on model performance is of particular interest.

To this end, we conducted a preliminary analysis: 10,000 subgraphs, each comprising 1,000 edges, were sampled from the initial multimodal graph, and the mean and standard deviation of the observed edge-type frequencies were computed. The table below presents the modality relationships ranked by frequency, from the most common to the most rare.

Modality Relationship	Mean Frequency
Brand to Content	8.48
Audience to Content	7.29
Content to Geography	6.57
Audience to Brand	5.90
Audience to Geography	5.84
Brand to Geography	5.55
Content to Content	3.24
Brand to Brand	3.18
Content to Platform	1.88
Brand to Platform	1.52
Audience to Audience	1.18
Audience to Platform	1.16

This frequency distribution informed the design of a subsequent performance-based experiment. The synthetic data generator exposes two relevant hyperparameters: a high pair preference, which increases the likelihood of sampling edges from the specified modality relationships, and a low pair preference, which suppresses them. Using these controls, three synthetic datasets were generated under distinct configurations:

Common First: The two highest-frequency modality relationships are assigned as the high pair, and the two lowest-frequency relationships as the low pair.
Rare First: The inverse configuration, where the two lowest-frequency relationships are assigned as the high pair and the two highest as the low pair.
Middle Ground: The four middle-ranked relationships from the frequency table are assigned to the high and low pairs accordingly.
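The three configurations above can be derived mechanically from the frequency table. In the sketch below, the key names `high_pair` and `low_pair` are hypothetical stand-ins for the generator's hyperparameters, and the Middle Ground assignment (upper two of the middle four as high, lower two as low) is one possible reading of the description.

```python
# Mean frequencies copied from the table above.
MEAN_FREQUENCY = {
    "Brand to Content": 8.48,
    "Audience to Content": 7.29,
    "Content to Geography": 6.57,
    "Audience to Brand": 5.90,
    "Audience to Geography": 5.84,
    "Brand to Geography": 5.55,
    "Content to Content": 3.24,
    "Brand to Brand": 3.18,
    "Content to Platform": 1.88,
    "Brand to Platform": 1.52,
    "Audience to Audience": 1.18,
    "Audience to Platform": 1.16,
}

# Rank relationships from most to least common.
ranked = sorted(MEAN_FREQUENCY, key=MEAN_FREQUENCY.get, reverse=True)
middle = ranked[4:8]  # the four middle-ranked relationships

configs = {
    "Common First": {"high_pair": ranked[:2], "low_pair": ranked[-2:]},
    "Rare First": {"high_pair": ranked[-2:], "low_pair": ranked[:2]},
    # Assumed split of the middle four into high and low pairs.
    "Middle Ground": {"high_pair": middle[:2], "low_pair": middle[2:]},
}

for name, cfg in configs.items():
    print(name, cfg)
```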

All remaining hyperparameters were held constant across the three configurations: noise levels were set to zero, and the federated learning simulation was conducted with five participating clients, consistent with the setup described in prior experiments. As before, each dataset comprises 84K training samples and 36K test samples.

The Results

Training Configuration	Score	Negative F1	Average F1	Positive F1
Centralized Common First	0.8803	0.8856	0.9245	0.8308
FL Common First	0.8748	0.8775	0.9150	0.8320
Centralized Rare First	0.9441	0.9503	0.9579	0.9242
FL Rare First	0.9343	0.9470	0.9505	0.9054
Centralized Middle Ground	0.8813	0.8744	0.9143	0.8551
FL Middle Ground	0.8625	0.8677	0.9009	0.8188

The results reveal a notable disparity in performance across the three dataset configurations. The Rare First configuration substantially outperforms the other two, achieving scores of 0.9441 (Centralized) and 0.9343 (FL), a margin of approximately 6–7 percentage points over the Common First and Middle Ground configurations, which yield scores in the 0.86–0.88 range. This performance advantage is consistently reflected across all evaluation metrics, with particularly pronounced gains in Positive F1, where the Rare First configuration achieves 0.9242 (Centralized) and 0.9054 (FL), compared to values in the 0.82–0.86 range for the alternative configurations.

This finding is counterintuitive yet theoretically interpretable. Frequently occurring modality combinations, by virtue of their prevalence, contribute comparatively less discriminative information to the learning process — the decision boundaries they define are, in effect, already well-represented and easily separable. In contrast, rare combinations compel the model to learn more nuanced and distinctive feature interactions, resulting in richer decision boundaries between Positive and Negative campaign outcomes. The learning signal provided by atypical patterns is therefore disproportionately more informative per sample.

Consistent with findings from prior experiments, the centralized model maintains a modest advantage in overall score over the federated configuration across all three dataset strategies. Crucially, however, the best-performing dataset configuration is the same under both training paradigms: Rare First consistently outperforms Common First and Middle Ground, regardless of whether training is conducted centrally or in a federated manner. Only the order of the remaining two differs: Middle Ground narrowly leads Common First under centralized training (0.8813 vs. 0.8803), whereas Common First leads under FL (0.8748 vs. 0.8625).

Lesson Learned: Not all data is equally valuable. Prioritizing rare, atypical feature combinations produces significantly better models than focusing mostly on common patterns. This has direct implications for how we design synthetic datasets: rather than mimicking the most typical marketing dynamics, we should deliberately include uncommon combinations to give the model a richer and more discriminative learning signal.

Impact and Future Directions

This work represents an initial investigation into the viability of federated learning within our operational context. The finding that federated model performance falls only marginally below the centralized baseline with a reasonable number of participating clients opens a promising avenue for delivering machine learning solutions that address shared industry challenges among organizations reluctant to pool their data. The federated learning paradigm enables multiple entities to collaboratively train a shared global model on their respective proprietary datasets, without exposing raw data at any stage of the training process, thereby mitigating the risk of data leakage.

Although Federated Learning has been an established collaborative learning paradigm since its introduction in 2017, it remains a highly active area of research in academia and a strategic priority for industrial adoption. The findings from WPP Lab’s initial FL research establish the foundation for continued exploration, with future work organized around the following directions:

1. Privacy Guarantees in Adversarial Federated Environments

While FL provides a relatively secure framework for multi-organizational model training, a fundamental concern persists: to what extent can the exchange of local and global model updates during each communication round be considered safe? A substantial body of literature has demonstrated that FL networks are susceptible to a range of adversarial attacks, originating from either malicious clients or a compromised central server. Addressing this vulnerability necessitates the development of robust defense mechanisms that enable honest participants to verify the integrity and trustworthiness of the collaborative learning process.

2. Evaluation Under Advanced and Realistic Federated Scenarios

While simulating collaborative training with uniformly distributed data provides a valuable baseline for foundational FL research, it does not fully capture the complexities inherent in real-world deployments. Future work will extend our preliminary investigations into data heterogeneity, building upon the noise-injection experiments conducted on synthetic datasets in this study. Additionally, we intend to evaluate the efficacy of maintaining a shared synthetic dataset on the central server as a reference benchmark for assessing the integrity of incoming model updates and detecting potentially malicious contributions. Finally, we plan to transition from the simulated FL environment currently facilitated by the Flower framework to a fully distributed architecture. By deploying distinct computational nodes to represent separate organizational entities, we aim to empirically investigate and address the communication bottlenecks inherent in practical federated deployments.

Disclaimer: This content was created with AI assistance. All research and conclusions are the work of the WPP AI Lab team.

March 26, 2026
