Blog

  • Training Together, Sharing Nothing: The Promise of Federated Learning

    Why Federated Learning Now?

    In marketing, data is a competitive edge. The more audience signals, campaign performance data, and consumer behavior a Machine Learning (ML) model can learn from, the sharper its predictions and the greater its business impact. Across the marketing ecosystem — spanning WPP’s agencies, their clients, and external partners — the collective wealth of consumer and brand data is enormous. The potential to train models across these combined assets could unlock transformative capabilities: better audience targeting, smarter media spend, and faster creative optimization at a global scale.

    But here’s the challenge. While WPP maintains centralized access to its own internal data assets, much of the most valuable complementary data resides with clients and partners and, for good reasons, it can never leave their walls. Client contracts, privacy regulations like GDPR, and the sheer sensitivity of consumer-level data make cross-organizational data pooling a non-starter. The result? Models are trained without the full picture, and transformative insights that could emerge from combining datasets across organizational boundaries remain permanently out of reach.

    The traditional solution, centralized ML, pools raw data from multiple sources into a single cloud to train a global model. But uploading terabytes of sensitive data to a central server creates severe network latency and exposes collaborators to data breaches and potential violations of privacy regulations.

    Distributed ML methods attempted to address this by splitting training across local worker nodes. While this reduces latency and avoids centralizing raw data, these architectures were designed for internal computing clusters, not secure collaboration between independent companies. Without cross-organization coordination, models remain isolated, difficult to align, and impossible to improve as a unified whole.

    Problem: The Collaboration vs. Privacy Bottleneck

    Organizations are effectively trapped between two undesirable choices: compromise sensitive data through centralization, or settle for underperforming, isolated models through distribution. Neither architecture allows multiple companies to collaboratively train a shared, high-performing model while keeping private data strictly localized.

    Federated Learning (FL) offers a way out of this dilemma by bringing the model to the data, rather than the other way around. To understand why this shift matters, let’s look at how FL actually works under the hood.


    How Federated Learning Works

    Figure 1: Overview of the Federated Learning communication cycle between a central node and distributed client nodes.

    Federated Learning (FL) enables multiple organizations to collaboratively train a shared model without ever centralizing raw data. Instead of moving data to the model, FL brings the model to the data. Training proceeds through iterative rounds:

    1. A central server sends the current global model to all participating clients (Blue arrow).
    2. Each client trains the model locally on its own private data (Green arrow).
    3. Clients send back only their model updates, never the underlying data (Pink arrow).
    4. The server aggregates these updates into an improved global model and starts the next round.

    Throughout this process, raw data never leaves its source. Only learned model representations are exchanged across the network.
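    The aggregation in step 4 is most commonly Federated Averaging (FedAvg), a sample-weighted mean of the client updates. The sketch below is a minimal illustration in plain NumPy, not our production aggregation logic; the `train_local` helper in the closing comment is hypothetical:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted average of client model weights (FedAvg-style aggregation).

    client_weights: one list of layer arrays per client
    client_sizes:   number of local training samples per client
    """
    total = sum(client_sizes)
    num_layers = len(client_weights[0])
    return [
        sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
        for layer in range(num_layers)
    ]

# One communication round, assuming a hypothetical train_local(...) on each client:
# global_w = fedavg([train_local(global_w, shard) for shard in shards], sizes)
```

    Clients holding more data pull the global model proportionally harder, which is why shard size matters in the scaling experiments below.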

    The Multimodal Challenge

    While the above privacy-preserving framework is valuable in its own right, modern marketing data adds another layer of complexity. Organizations do not just work with spreadsheets and numbers. They work with images, video, text, audio, and structured data, often all at once. A single campaign might involve visual brand assets, ad copy, audience segments, and performance metrics across channels. Training models that can reason across these different data types, known as multimodal learning, is already one of the most demanding challenges in ML.

    Now combine that with the constraints of federated learning. Each client may hold different combinations of modalities, in different formats and volumes. One partner might contribute rich visual data, another mostly text and tabular records. Coordinating a single global model that learns effectively from this fragmented, heterogeneous landscape, without ever seeing the raw data, pushes the problem to a new level of complexity.

    This is precisely what makes the intersection of FL and multimodal learning so important, and so hard. If it can be made to work, it unlocks collaborative intelligence across organizations at a scale that neither approach could achieve alone.


    Our Objective: Can Federated Learning Deliver?

    The promise of FL is compelling, but before investing in real-world deployment, we need to answer a fundamental question:

    Does federated learning actually work well enough on multimodal marketing data to justify the tradeoff?

    Centralized training will always have an inherent advantage: it sees all the data at once. The question is not whether FL can beat centralized performance, but whether it can get close enough to make the privacy and collaboration benefits worthwhile. And beyond raw performance, we need to understand how FL behaves under realistic stress conditions: more partners joining, noisy data, and complex cross-modal relationships.

    To answer this, we designed a series of experiments around four key questions:

    Experiment 1 — Centralized vs. Federated Performance

    • How close can FL get to centralized performance? In a centralized setup, the model sees all the data at once, the ideal scenario for learning. FL, by design, fragments this data across clients. The first question is whether this tradeoff costs us meaningful accuracy, or whether FL can match centralized results despite never accessing the full dataset.
    • What happens as more clients join? In practice, a federated network might involve a handful of partners or dozens. As the number of participants grows, each client holds a smaller, potentially less representative slice of the overall data. We tested how model performance scales as we increase the number of clients.

    Experiment 2 — Resilience to Noisy Data

    • How robust are centralized and federated models to noisy data? Real-world datasets are messy: labels can be wrongly assigned, and data quality varies across partners. We deliberately introduced noise into the multimodal dataset to simulate these imperfections and measure how much degradation the model can tolerate before performance breaks down.

    Experiment 3 — Cross-Modal Relationships

    • How sensitive are centralized and federated models to underlying cross-modal patterns? Multimodal models learn by discovering connections between different types of data. For example, a luxury brand might target a high-income audience through a premium creative tone on a specific platform. Some of these connections appear frequently in the data, while others are rare. We tested whether emphasizing the most frequent cross-modal patterns in our synthetic data improves performance compared to emphasizing the least frequent ones, helping us understand how much the model benefits from common, naturally occurring relationships versus rare, atypical ones.

    The Data

    For our experiments, we used a multimodal synthetic dataset generated by our own well-tested synthetic data generator, designed to mirror real-world marketing dynamics. The generator allows us to customize various elements of the data and design targeted datasets that stress-test our model architecture under controlled conditions, giving us full visibility into the factors that drive campaign performance.

    Each campaign in the dataset is described using five key modalities:

    • Audience – the consumer segment being targeted
    • Brand – the positioning and perception of the brand
    • Creative – the tone and message of the campaign
    • Platform – where the campaign runs
    • Geography – the markets being targeted

    Each sample in the dataset is assigned a target label – Positive (over-performing), Negative (under-performing), or Average – indicating whether that particular combination of modalities would lead to a successful, underperforming, or average campaign outcome.

    Experimental Results

    All federated experiments are implemented using Flower, a widely adopted open-source framework for federated learning research and deployment. Flower allows us to simulate multi-client federated setups in a controlled environment, making it possible to rigorously test different configurations before moving to a fully distributed architecture.

    To ensure a fair comparison between centralized and federated setups, we kept the playing field level. Both setups use the exact same model architecture, so any performance differences come from how the model is trained, not what is being trained. In the federated setup, data is split equally across clients so that each partner sees a representative sample. This way, when we increase the number of clients, any change in performance can be attributed to the scaling itself, not to differences in what each client’s data looks like.
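    For illustration, an equal split of the kind described above can be sketched as follows. This is a hypothetical helper, not the exact partitioning code used in the experiments; shuffling before splitting keeps each shard representative of the overall distribution in expectation:

```python
import numpy as np

def split_across_clients(X, y, num_clients, seed=0):
    """Shuffle a dataset, then split it into equal-size client shards."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))                    # shuffle once, globally
    shards = np.array_split(idx, num_clients)        # near-equal index slices
    return [(X[s], y[s]) for s in shards]
```
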

    Experiment 1

    Question: How does FL performance compare to centralized training? What happens as more clients join?

    Figure 2: Impact of increasing client fragmentation on Federated Learning performance. Performance clearly degrades as the number of clients increases from 5 to 15, compared to the centralized baseline model.

    The centralized model sets the performance ceiling at 79.67%. This is expected: when a single model has direct access to all the data at once, it has the best possible conditions to learn. No information is lost to partitioning, and no coordination overhead is introduced. It’s the ideal scenario, and the benchmark against which everything else is measured.

    The federated results tell a clear story: as we add more clients, performance gradually declines. With 5 clients, the model reaches 76.23%, a modest drop from the centralized baseline. But as we scale to 10 and then 15 clients, scores fall to 70.29% and 67.65% respectively. The same pattern holds across all metrics, with the sharpest drops in the model’s ability to correctly identify both positive and negative cases.

    Why does this happen? As more clients join, the total dataset gets divided into smaller slices. Each client sees less data, which means each client’s local training produces a less reliable picture of the overall patterns. When the server combines these local updates, the differences between them make it harder to converge on a strong global model, an effect we call the “aggregation penalty.”

    Lesson Learned: FL with 5 clients comes remarkably close to centralized performance, showing that federated collaboration is viable with minimal accuracy loss. However, as the number of clients grows, it becomes progressively harder for the global model to match centralized results.

    Experiment 2

    Question: How robust are centralized and federated models to noisy data?

    In practice, marketing data is never perfectly clean. Campaign outcomes don’t fall neatly into “this worked” or “this didn’t.” Was a campaign that slightly exceeded expectations truly a success, or just average? Was a modest underperformance a failure, or noise in the measurement? Different teams may label the same outcome differently, tracking systems introduce inconsistencies, and the line between a “positive” and “average” campaign is often blurry.

    To simulate this reality, we deliberately introduced noise into our synthetic dataset by blurring the boundaries between performance classes. With no noise, the labels are clean — positive, negative, and neutral outcomes are clearly separated. As we increase the noise level from low, to medium, and then to high, the boundaries between these classes increasingly overlap, making it harder for the model to tell them apart. Think of it like gradually turning up the fog: the underlying patterns are still there, but they become harder to see. The federated learning simulation for this experiment was configured with 5 participating clients, consistent with the best-performing federated setup identified in Experiment 1.
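    The generator’s exact noise mechanism isn’t reproduced here, but a common way to simulate blurred class boundaries is to flip a fraction of labels to another class. The helper below is an illustrative sketch, with `noise_level` playing the role of the low/medium/high settings:

```python
import numpy as np

def blur_labels(y, noise_level, rng=None):
    """Flip a fraction of labels to a random *other* class, simulating
    blurred class boundaries. noise_level is the per-sample flip probability."""
    rng = rng or np.random.default_rng(0)
    y = np.asarray(y).copy()
    classes = np.unique(y)
    flip = rng.random(len(y)) < noise_level
    for i in np.where(flip)[0]:
        others = classes[classes != y[i]]   # never "flip" to the same label
        y[i] = rng.choice(others)
    return y
```
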

    Figure 3: Performance comparison of centralized (left) and federated learning (right) configurations across increasing noise levels. Both paradigms degrade gradually, with Positive F1 and Negative F1 most affected, while the performance gap between the two remains approximately constant across all conditions.

    As expected, both models perform best on clean data and gradually decline as noise increases. At high noise:

    • The centralized model’s score drops from 82.05% to 78.11%
    • The FL model’s score drops from 80.74% to 76.09%

    The good news: neither model collapses. Even at the highest noise level both models still perform reasonably well. The overall accuracy dips, and the models struggle most with distinguishing clearly positive or clearly negative campaigns, which makes sense, since those are exactly the boundaries we blurred. However, their ability to capture general patterns across the dataset remains stable throughout.

    As in Experiment 1, the centralized model maintains a consistent edge over the federated setup at every noise level, but the gap between them stays roughly the same. This means that FL doesn’t become more fragile in noisy conditions; it handles data messiness about as well as its centralized counterpart.

    Lesson Learned: Real-world data is inherently noisy, and any viable model must be able to handle that. Both centralized and FL models show strong resilience — performance declines gradually rather than breaking down, even when the data is heavily corrupted. Importantly, FL’s relative performance holds steady across noise levels, suggesting it is no more vulnerable to messy data than centralized training.

    Experiment 3

    Question: How sensitive are centralized and federated models to underlying cross-modal patterns?

    Our synthetic data generator creates campaign data based on a graph of relationships between five key factors: Audience, Brand, Creative, Platform, and Geography. Each relationship captures whether a particular combination of these factors tends to drive strong or weak campaign performance. Some of these relationships are common and obvious — they show up frequently and reflect well-known marketing dynamics. Others are rare and subtle — unusual combinations that don’t appear often but may carry uniquely valuable signal about what makes a campaign succeed or fail.

    Understanding how these different types of patterns affect learning is important for both training paradigms. If the nature of the underlying data patterns matters, we need to know whether centralized and federated models respond to them in the same way — or whether one setup handles certain patterns better than the other. To investigate this, we generated three versions of our dataset, keeping everything else the same:

    • Common-first: The generator focuses on the most frequently occurring combinations and downplays the rarest ones. This gives us a dataset dominated by typical, familiar marketing patterns.
    • Rare-first: The opposite — the generator prioritizes the rarest combinations and downplays the most common. This fills the dataset with unusual, less obvious patterns.
    • Middle-ground: The generator focuses on combinations that fall in the middle of the frequency spectrum, neither the most common nor the rarest.
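    One simple way to realize these three regimes, assuming each candidate cross-modal combination has a known empirical frequency, is to derive sampling weights from those frequencies. This is an illustrative sketch, not the generator’s actual implementation:

```python
import numpy as np

def sampling_weights(freqs, mode):
    """Normalized sampling weights over candidate cross-modal combinations.

    freqs: empirical frequency of each combination
    mode:  'common_first' | 'rare_first' | 'middle_ground'
    """
    f = np.asarray(freqs, dtype=float)
    if mode == "common_first":
        w = f                                            # emphasize frequent combos
    elif mode == "rare_first":
        w = 1.0 / (f + 1e-9)                             # emphasize rare combos
    elif mode == "middle_ground":
        w = 1.0 / (np.abs(f - np.median(f)) + 1e-9)      # emphasize mid-frequency
    else:
        raise ValueError(mode)
    return w / w.sum()
```
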

    As in Experiment 2, the federated learning simulation was run with 5 participating clients, and performance was compared against the centralized baseline across all three dataset versions.

    Figure 4: Impact of cross-modal relationships on model performance. Prioritizing rare feature combinations (Rare-first) substantially improves accuracy compared with focusing on common patterns, showing that atypical relationships provide a stronger learning signal for both centralized and federated learning paradigms.

    The results were striking. The Rare-first configuration dramatically outperformed the other two, achieving peak scores of 94.41% (Centralized) and 93.43% (FL), compared to scores in the 86–88% range for the Common-first and Middle-ground setups.

    This tells us something counterintuitive: the model learns far more from unusual feature combinations than from common ones. The typical, frequently seen patterns are in some sense “easy”: they don’t give the model much new information. But rare combinations force the model to learn more nuanced and distinctive boundaries between what makes a campaign succeed or fail.

    As in previous experiments, the centralized model maintains a small edge over FL, but the ranking between dataset strategies stays the same in both setups. Whether training is centralized or federated, prioritizing rare patterns is the winning strategy.

    Lesson Learned: Not all data is equally valuable. Prioritizing rare, atypical feature combinations produces significantly better models than focusing mostly on common patterns. This has direct implications for how we design synthetic datasets: rather than mimicking the most typical marketing dynamics, we should deliberately include uncommon combinations to give the model a richer and more discriminative learning signal.

    The Impact and Looking ahead

    This work is just the initial spark for our federated learning efforts. Verifying that federated training gives up only a little of the centralized model performance our company provides, at least for a reasonable number of participants, opens the discussion about delivering ML solutions to clients who face a shared industry challenge but are reluctant to pool their data. The FL approach allows companies to securely train a shared global model on their own datasets without the risk of data leakage throughout the training process.


    Although Federated Learning has been an established collaborative learning method since 2017, it remains a highly active research domain in academia and a strategic priority for industrial implementation. The findings from WPP Lab’s initial FL research establish the foundation for further exploration, specifically focusing on the following directions:

    1. Privacy Constraints in Malicious FL Environments

    While FL provides a relatively secure framework for multiple organizations to collaboratively train a global model, a critical question remains: how safe is it to exchange local and global model updates during each communication round? Extensive literature shows that FL networks are vulnerable to attacks from malicious clients or a compromised central server. Consequently, there is a pressing need for robust defense mechanisms that enable honest participants to verify the integrity and security of the collaborative learning process.

    2. Evaluation of Advanced and Realistic FL Scenarios

    While simulating collaborative training with evenly distributed data provides a valuable baseline for foundational FL research, it does not fully capture the complexities of real-world implementations. Our next objective is to build on our preliminary investigations into data heterogeneity, drawing on the noise-injection experiments previously conducted on synthetic datasets. We will also assess the efficacy of using a shared synthetic dataset on the central server as a benchmark to evaluate the integrity of incoming model updates and detect potential malicious activity. Finally, we plan to move from the simulated FL environment currently facilitated by the Flower framework to a fully distributed architecture. By deploying distinct nodes to represent separate corporate entities, we will empirically investigate and address the communication bottlenecks inherent in practical FL deployments.

    Disclaimer: This content was created with AI assistance. All research and conclusions are the work of the WPP AI Lab team.

  • From core interests to rabbit holes – an AI experiment


    Social Agents Pod builds autonomous AI personas that browse social platforms like real users, each driven by demographic and psychological profiles. Early experiments show recommendation algorithms quickly lock onto core interests, then probe adjacent topics – which agents often adopt. The goal is a scalable, auditable framework for studying how algorithmic exposure shapes beliefs.


    Team Members: Andreas Stavrou, Nikos Gkikizas

    Introduction

    What if an AI agent could experience the internet the way a person does – scroll through feeds, react to content, develop tastes, get influenced, and evolve?

    That is the question behind the SocialAgents research pod. We are building an autonomous agent that browses social media platforms the way a real human would: it sees content, forms opinions based on its personality and background, decides whether to engage, and over time develops new interests shaped by what the algorithm chooses to show it. This blog post documents the first phase of that effort, focused on social platforms as the initial source of information.

    The work tries to answer the following question:

    How do online platforms shape what different users see, engage with, and eventually come to think or believe?

    Content-recommendation algorithms do not just match interests – they introduce new content, test engagement, and then reinforce it. By running controlled agents with known starting profiles, we can track how exposure differs across user types, how it changes over time, and what drives those shifts. This gives us a precise way to study algorithmic influence that is impossible with real users.

    Each agent is defined by a rich profile (including age, occupation, cultural background, content affinities, aversions etc.) and interacts with platform content through the same actions available to any user: scrolling, liking, saving, commenting, following and sharing. Engagement decisions are made by a multimodal AI model that reasons over the agent’s personality and the content it encounters. Every session is designed to remain undetected, with timing patterns and warm-up progressions that mirror how a genuine new user explores a platform.

    The sections that follow detail the methodology, early experimental results and the infrastructure required to run these simulations at scale. The early findings show that within a single session, the algorithm accurately identified each agent’s interests and then began expanding them into adjacent territories.


    The mechanics of human navigation of social media

    When you scroll through a feed, your brain runs a rapid filtering process, forming relevance judgments in under 50 milliseconds, and pauses when it detects something novel, emotionally charged, or personally relevant. Surprise, humor, curiosity and outrage are the strongest scroll-stoppers, because they trigger emotional circuits faster than conscious thought.

    What keeps you scrolling isn’t satisfaction but anticipation: the infinite scroll removes natural stopping points, feeding a dopamine loop where the next post might be the rewarding one, a variable-ratio reinforcement pattern, the most compulsive reward schedule in behavioral psychology.

    Figure 1 – The TikTok feed – each creative takes up the whole vertical space in the “For You” section

    TikTok and Instagram exploit this differently. On TikTok, attention is measured in watch time and rewatches. The algorithm auto-serves content and hooks you within the first second of a video, so seconds of hovering are captured passively and automatically. Instagram earns attention more deliberately: you lean in, judge a visual aesthetically and decide to tap. Key signals are saves and swipes rather than raw watch time. This is why TikTok’s average engagement rate (~4.64%) is much larger than Instagram’s (~0.43%) [TikTok vs. Instagram: A Deep Dive into Engagement Rates and Content Performance].

    Figure 2 – The Instagram “For You” home feed – creatives are being fed in a vertical feed

    The interest of people browsing social media is incredibly short-lived. The average attention span for a video is 1.7 seconds on mobile and 2.5 on desktop [Capturing Attention in Feed: The Science Behind Effective Video Creative].

    A Fors Marsh Group study found that as little as 0.25 seconds of exposure is enough for people to recall mobile feed content at a statistically significant level, meaning the brain is processing and encoding content far faster than conscious attention suggests [Facebook video ad viewability rates are as low as 20%]. This suggests that simulating human content browsing on social media using generative AI can be particularly tricky: the response time of multimodal transformer-based APIs ranges from roughly 4 to 8 seconds for 200 tokens [LLM Latency Benchmark by Use Cases in 2026], far above the average attention span, so an agent that waits for the model would erroneously signal interest to the platform for every creative it merely considers.


    Simulating human behavior on social media

    Our framework decomposes human browsing into three layers – persona construction, perception and judgment, and behavioral execution – each calibrated against real-world engagement distributions. But the framework serves a deeper purpose than creative testing: it is how we test a foundational question – can AI personas reliably stand in for real humans in the eyes of a recommendation algorithm?

    Every simulation begins with a synthetic persona – not a shallow archetype but a deeply specified psychological and demographic profile. Each persona encodes age, gender, location, occupation, education, income bracket, cultural background, daily routines, content affinities and content aversions. These are the digital equivalents of the implicit biases and taste structures that real users carry into every scroll session. A 34-year-old veterinary nurse in Manchester with a dry sense of humor and a distaste for influencer culture will engage with content in measurably different ways from a 22-year-old design student in Brooklyn who follows streetwear accounts.

    For every social post, our agent estimates probabilities for each possible action – scroll away, like, save, comment, follow – accompanied by a reasoning trace explaining why this persona would or would not engage with this specific piece of content. That trace is essential for auditing how the agent is genuinely responding to the persona’s specific traits.

    Raw model outputs are not behaviors. A 16% “Like” probability and an 8% “Comment” probability mean nothing without calibration against platform-specific base rates. We apply a smoothing layer that adjusts per-post probabilities to known engagement benchmarks. The calibrated probabilities are then sampled to produce a single action.
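    As an illustration, the smoothing step can be sketched as a blend between the model’s per-post probabilities and platform base rates, followed by sampling. The base-rate numbers and the `alpha` blending weight below are placeholders, not our calibrated benchmarks:

```python
import numpy as np

# Illustrative platform base rates (placeholders, not real benchmarks)
BASE_RATES = {"scroll": 0.90, "like": 0.06, "save": 0.02, "comment": 0.01, "follow": 0.01}

def calibrate_and_sample(model_probs, alpha=0.5, rng=None):
    """Blend per-post model probabilities with platform base rates,
    renormalize, and sample a single action."""
    rng = rng or np.random.default_rng(0)
    actions = list(BASE_RATES)
    p = np.array([
        alpha * model_probs.get(a, 0.0) + (1 - alpha) * BASE_RATES[a]
        for a in actions
    ])
    p /= p.sum()
    return rng.choice(actions, p=p), dict(zip(actions, p))
```

    Pulling probabilities toward base rates keeps the agent’s long-run behavior consistent with how often real users actually like, save, or comment.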


    What each simulation produces

    Each simulation produces two outputs:

    • An interaction log: a record of every post the agent saw, what it did (scrolled past, liked, saved, commented), the probability behind that decision, and the reasoning.
    • A feed report: a snapshot of the content the platform served at different points in the session, showing how the feed changed over time.

    Imagine an agent built to mirror a 28-year-old personal finance enthusiast. Over a one-hour social media session it encounters 500 posts. The interaction log records that it liked 12, saved 3, commented on 1, and scrolled past the rest – along with why (e.g., “liked because the budgeting tip matched the agent’s stated interest in saving strategies”).

    The feed report then shows that by minute 40, the social media platform had started mixing in mental-health and self-improvement clips alongside the finance content – a shift the agent didn’t ask for, but that the algorithm introduced on its own.

    Running multiple distinct agents through the same platform for hours doesn’t just produce engagement metrics – it produces a controlled experiment on the algorithm itself. We observe what content the algorithm pushes to each agent, how that mix shifts over time, and what happens when the algorithm starts exposing the agent to novel or trending types of content.

    By logging the agent’s reasoning at every step, we can identify exactly which creative attributes – visual tone, emotional register, narrative hook – made that unexpected content compelling enough to earn a like or a save.


    Analysis of interactions based on persona characteristics

    We ran two agents through extended sessions on a social media platform. Before diving into results, here’s who they are.



    George is a 36-year-old senior finance analyst based in Athens. He follows investment strategies, personal finance, fitness, and business leadership content. He values data-driven advice, skips past crypto hype and hustle culture, and engages most with content that offers practical, actionable takeaways. He scrolls deliberately – slowing down for charts and analysis, skipping memes in under two seconds.


    Sofia is a 25-year-old social media coordinator, also in Athens, who creates content around fashion, travel, and fitness. She engages with styling tips, travel itineraries, workout routines, and creator growth strategies. She scrolls fast past ads but lingers on vibrant visuals and aesthetic content. Her feed time is high – she checks social media five times a day.

    Within the first session, the platform identified each agent’s core interests accurately. George’s feed was dominated by stock analysis, personal finance tips, and fitness content. Sofia’s feed filled with recipe tutorials, fitness routines, and travel vlogs. Roughly 60–80% of the content served matched their declared interests – measured by whether the content category aligned with the agent’s stated affinities.
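    That match rate is straightforward to compute from the feed report; the helper below is a minimal sketch with hypothetical field names:

```python
def interest_match_rate(served_categories, declared_interests):
    """Fraction of served posts whose category matches a declared interest."""
    interests = {c.lower() for c in declared_interests}
    if not served_categories:
        return 0.0
    hits = sum(c.lower() in interests for c in served_categories)
    return hits / len(served_categories)
```
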

    But the remaining 20–40% is where the story gets interesting.

    The off-topic content was not random. George was shown mental health clips, motivational content, and street food showcases – adjacent emotional territories that share the aspirational tone of self-improvement media. Sofia received tech gadget unboxings, entrepreneurship stories, and macro-economic forecasts – probing whether her preference for short-form, personality-driven content would transfer to informational topics. The algorithm wasn’t guessing. It was testing the edges of each agent’s taste profile.

    And the agents followed. George developed sustained engagement with psychology content and food showcases, reaching interaction rates comparable to his core finance interests. Sofia adopted tech gadgets and entrepreneurship narratives — topics that traditional demographic targeting would never have surfaced to a 25-year-old fashion content creator. By session five, these weren’t exploratory recommendations anymore. They were part of each agent’s regular content diet.

    Figures 3 and 4 below visualize this shift. Each chart tracks the proportion of content categories served to the agent over time, showing how the feed gradually expanded beyond the original interest profile.

    Figure 3 – George’s Content Ecosystem Evolution


    Figure 4 – Sofia’s Content Ecosystem Evolution

    What these results suggest is that the algorithm doesn’t just confirm existing tastes – it actively expands them. It found the edges of each agent’s interest profile and pushed content into those gaps, widening what each agent consumed over time.


    Persona adaptation to trends and suggestions

    The previous section showed that the algorithm quickly identifies what each agent cares about – and then starts pushing content beyond those boundaries. The natural follow-up question is: what happens if the agent actually adopts those new interests?

    To test this, we took the content categories that the algorithm surfaced and that each agent consistently engaged with during the first round of experiments, and folded them into the agent’s profile as declared interests. In other words, we let the first round of browsing reshape who the agent claims to be.
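The enrichment rule can be sketched as a simple threshold over observed engagement. Everything below (the function name, the engagement rates, the 0.5 adoption cutoff) is a hypothetical stand-in for the bookkeeping the simulation does internally:

```python
# Hypothetical sketch of the observe-enrich step. The engagement log, the
# 0.5 adoption threshold, and the function name are assumptions made for
# illustration; the real simulation tracks engagement inside the platform.
def enrich_profile(declared_interests, engagement_log, min_rate=0.5):
    """Fold consistently engaged categories into the declared interests."""
    adopted = [category for category, rate in sorted(engagement_log.items())
               if rate >= min_rate and category not in declared_interests]
    return declared_interests + adopted

george = ["stock analysis", "personal finance", "fitness"]
round_one_engagement = {
    "quick recipe tutorials": 0.62,    # adopted: engaged in most sessions
    "travel destination vlogs": 0.55,  # adopted
    "celebrity gossip": 0.10,          # ignored: stays out of the profile
}
print(enrich_profile(george, round_one_engagement))
```

The output profile keeps the original declared interests and appends only the categories the agent actually kept engaging with.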

    For George, the enrichment added five categories that emerged from his initial sessions: player performance clips, quick recipe tutorials, media bias and propaganda breakdowns, music performances and concerts, and travel destination vlogs. None of these were part of his original finance-and-fitness profile – they were interests the algorithm introduced and George chose to engage with.

    For Sofia, the enrichment was broader – nine new categories: motivational speeches and quotes, day-in-the-life vlogs, mental health and psychology clips, personal finance hacks, home and furniture, music performances and concerts, tech gadget unboxings, workout tutorials, and geopolitical conflict updates. Some of these, like tech gadgets and personal finance, were far outside the fashion-travel-fitness profile she started with.

    We then re-ran the full simulation with these enriched agents. Same platform, same session structure, same interaction approach – but with agents whose declared interests now reflected the expanded taste profiles earned in the first round.

    The results confirmed that the cycle continues. With a richer interest profile to work from, the algorithm pushed even further. George, who originally cared about finance and fitness and had since adopted recipe content and travel vlogs, was now being served bodybuilding content, tech gadget reviews, and podcast highlight reels – and engaging with them. Sofia’s feed expanded in similar ways. Each round of enrichment gave the algorithm more surface area to explore, and it used that surface area aggressively.

    Figures 5 and 6 below show the content mix evolution for George and Sofia’s enriched profiles, following the same format as Figures 3 and 4. The key difference is the starting point: the agents entered this round with a wider interest profile, and the algorithm expanded it further still.

    Figure 5 – George’s Content Ecosystem Evolution (Enriched Profile)


    Figure 6 – Sofia’s Content Ecosystem Evolution (Enriched Profile)

    This observe-enrich-rerun approach turns a single experiment into an iterative process. Each cycle produces agents whose interests more closely resemble how real users evolve on a platform over time – not just what they start with, but what they become after sustained exposure to algorithmic recommendations.

    Conclusion

    AI Agents give us a controlled way to observe something we couldn’t observe before: how algorithms reshape what people care about. George started as a finance-and-fitness person. After two rounds of interaction, he was engaging with bodybuilding content, recipe tutorials, and podcast highlight reels – none of which he would have sought out on his own. Sofia went from fashion and travel to tech gadgets and geopolitical updates. These shifts weren’t random. They followed a clear pattern: the algorithm identified adjacent emotional territories, tested them, and when the agent responded, it pushed further.

    The next step is to give our agents access to more sources of information beyond social media – news, trends, search – making their online experience even closer to that of a real person browsing the web. The closer the agent gets to a full human browsing experience, the more we learn about how the digital world shapes what people see, think, and ultimately believe.

    Future Work

    Topics that deserve more focus over the next months are:

    1. Expansion to other sources of dynamic information (news, trends, etc.) – Social media platforms are interesting, but specific content types might never surface on them, or only after a delay. An interesting question to answer: how do other sources of dynamic information affect the way personas perceive content and interact with it?
    2. Impact of trends on personas – Determine how social media trends (e.g., viral videos, trending brands) influence the interests of different personas. This is particularly important for understanding which demographics are more susceptible to targeting for specific brands and products.
    3. Marginal contribution of specific interests on the variability of content – We have already seen that engaging with specific content types might be more influential than others on what the algorithm serves. More work is needed to understand which personas are more sensitive to adding or removing interests in terms of how their feed evolves.
    4. Understanding of the content adaptation velocity between slow and fast-paced platforms – Not all algorithms are created equal. Further research is required to measure how quickly content evolves on different social media platforms.
  • Data Quality Assurance Agent Blog Post

    Meet Your New Data Guardian

    1. The High Cost of “Dirty” Data

    In the modern enterprise, data is the raw ingredient behind every strategic decision. Think of it like a premier restaurant: the Data Engineer is the sous-chef, meticulously sourcing and preparing ingredients, while the Data Scientist is the executive chef, transforming them into the predictive models and insights that drive the business forward. If the ingredients are spoiled or mislabelled, the final dish fails, no matter how talented the chef.

    Across several of our AI initiatives at WPP, we uncovered a pattern that was quietly draining velocity from our most ambitious projects. Our “sous-chefs”, skilled data engineers responsible for pipeline integrity, were spending up to one full day per week on tedious, largely manual Quality Assurance (QA) of data flowing into BigQuery. Row by row, column by column, they checked for missing values, logical contradictions, and phantom duplicates, work that was essential but deeply repetitive.

    This wasn’t just an inconvenience. It was a strategic bottleneck: it slowed the delivery of every downstream AI application, consumed senior engineering talent on janitorial tasks, and, most dangerously, created risk. When a human eye is the only safeguard between raw data and a production model, errors don’t just slip through occasionally. They slip through systematically, at exactly the moments when the data is most complex and the engineer is most fatigued.

    We asked ourselves a different question: What if, instead of building another dashboard or writing another validation script, we built an intelligent agent, one that could reason about data quality the way an experienced engineer does, learn from every audit it performs, and get better over time?

    This article describes how we built that agent, what makes it fundamentally different from traditional automation, and what happened when we put it to the test.


    2. The Problem: Why Data Quality Demands More Than Scripts

    The Data & The Modeling Ecosystem

    The agent operates on digital marketing campaign performance data hosted in BigQuery, massive tables that track how advertising campaigns perform on a daily basis across major ad networks like Meta (Facebook and Instagram). Each row represents a highly granular intersection of a specific campaign, audience segment, platform, device, and creative asset. This data captures everything from broad identifiers—like the parent brand and geographical targeting—down to precise performance metrics, including impressions, clicks, daily spend, conversions, leads, and app installs.

    This foundational data is the lifeblood of two critical machine learning systems:

    1. The Prediction Model: A classification system designed to predict whether a planned campaign will yield a negative, neutral, or positive outcome.
    2. The Recommendation System: A highly flexible advisory engine capable of handling any combination of “missing modalities.” For example, if a media planner inputs a specific Brand, Target Audience, and Location, the system dynamically recommends the optimal missing parameters, such as the best platform to use and the most effective creative asset to deploy.

    For more background on the broader modeling context, see From Guesswork to Glimpse: How AI is Predicting the Future of Marketing Campaigns.

    The Silent Threat of Data Corruption

    Because these models directly inform real-world media spend and strategic campaign planning, their accuracy is paramount. The underlying data is regularly refreshed directly from the advertising platforms to keep the models up to date. However, this automated refresh process frequently introduces subtle corruption and systemic inconsistencies.

    For instance, while metrics like engagement and clicks generally remain stable, downstream pipeline issues frequently render conversions and awareness metrics unreliable (“not high quality”). At the individual row level, these anomalies are often entirely invisible. But at scale, they are devastating. If left unchecked, these untrustworthy data points bleed into the training sets, silently degrading the prediction model’s accuracy and causing the recommendation engine to suggest sub-optimal, expensive campaign configurations. This makes rigorous, automated data quality validation not just a nice-to-have, but an absolute necessity for the ecosystem to function.

    The Failure Modes

    The scale and velocity of data flowing into BigQuery mean that errors don’t announce themselves. They hide. Through our manual QA process, we catalogued six recurring failure modes, each one capable of silently degrading every model built on top of the data:

    | Failure Mode | What Happens | Why It Matters |
    | --- | --- | --- |
    | Missing Values | Fields arrive empty — sometimes 5% of a column, sometimes 40% | Models trained on incomplete data learn incomplete patterns. Forecasts drift silently. |
    | Outliers | A metric reads 200,000 clicks when the true value is 500 | A single extreme value can skew an entire model’s calibration, distorting spend recommendations. |
    | Duplicate Rows | Identical records appear multiple times | Inflated counts cascade into inflated budgets. Campaigns appear to outperform reality. |
    | Categorical Corruption | A brand name like "Nike" is replaced with "zX9pQ" | Segmentation breaks. Reports attribute performance to entities that don’t exist. |
    | Logical Inconsistencies | More clicks than impressions. Spend recorded against zero impressions. | These are the most insidious — each value looks valid in isolation, but the relationships between them violate business reality. |
    | Missing Columns | An entire field disappears from a refresh | Downstream pipelines fail or, worse, silently fall back to defaults. |

    A static validation script can catch some of these — the easy ones, the ones you’ve already seen. But scripts are brittle: they encode yesterday’s assumptions and break on tomorrow’s edge case. They cannot reason about why a pattern looks wrong, weigh it against historical context, or decide whether a recurring anomaly is a genuine error or a known artifact of a data source.
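To make the contrast concrete, here is a minimal pandas sketch of such a static script, with illustrative column names rather than the production schema. It catches the mechanical failure modes, and nothing more:

```python
import pandas as pd

# A minimal sketch of a static validation script; column names are
# illustrative, not the production schema. It counts mechanical failures
# but encodes no judgment about context or history.
def static_checks(df: pd.DataFrame) -> dict:
    return {
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "clicks_gt_impressions": int((df["clicks"] > df["impressions"]).sum()),
        "spend_with_zero_impressions": int(
            ((df["spend"] > 0) & (df["impressions"] == 0)).sum()
        ),
    }

sample = pd.DataFrame({
    "impressions": [1000, 0, 500],
    "clicks": [50, 10, 600],  # row 2: clicks with no delivery; row 3: clicks > impressions
    "spend": [12.5, 3.0, 8.0],
})
print(static_checks(sample))
```

Every check is a hard-coded yesterday's assumption: nothing here can decide whether a flagged row is a genuine error or a known artifact.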

    That requires judgment. And judgment is what we built the agent to provide.


    3. Our Approach: An Agent That Reasons, Remembers, and Improves

    We designed the Data Quality Assurance Agent — not as a script, not as a dashboard, but as a reasoning entity capable of planning an audit strategy, querying data, forming hypotheses about its health, testing those hypotheses, and learning from the results. The distinction matters. A script checks what you tell it to check. An agent decides what to check based on what it knows, and it has the tools to act on that decision end-to-end.

    Architecture: One Agent, Specialized Tools

    The agent is powered by a single reasoning core that plans, decides, and acts. What gives it breadth is its toolkit, a set of specialized capabilities it can invoke as needed, selecting the right tool for each step of the audit:

    Data Agent Architecture Diagram
    • Database Tool: enables the agent to query BigQuery directly, fetching schemas, row counts, column statistics, and raw data samples.
    • Auditing Tool: the agent’s analytical engine. It formulates hypotheses about potential quality issues, runs targeted checks, and compiles structured findings. This tool reads from and writes to the Memory Bank.
    • Analytics Tool: generates visualizations using Python — charts, distributions, and plots that make audit findings immediately legible to stakeholders.
    • Artifact Tool: packages the final audit report, charts, and evidence into downloadable artifacts stored in Google Cloud.

    The agent orchestrates these tools autonomously. When a user asks it to audit a table, the agent formulates a plan, queries the data, runs its checks, generates visualizations where useful, and compiles a structured report, all without the user needing to specify which tool to use or in what order.

    The Key Innovation: Long-Term Memory

    Most AI tools are stateless. When the session ends, everything the system learned disappears. The next audit starts from zero. This is the fundamental limitation we set out to break. The agent maintains a persistent Memory Bank, a long-term knowledge store that survives across sessions and accumulates institutional intelligence over time. This memory captures three categories of knowledge:

    1. Historical Explanations: When a data engineer confirms that a recurring anomaly is caused by a known tracking limitation or data source quirk, the agent records that explanation. The next time it encounters the same pattern, it doesn’t waste time flagging it as a new issue; it references the known cause, notes it in the report, and moves on to genuinely novel problems.
    2. Business Context: Over successive audits, the agent absorbs the specific rhythms and patterns of our marketing data: seasonal spikes, platform-specific reporting delays, expected variance ranges for different campaign types. This contextual awareness allows it to distinguish between a real anomaly and normal business variation.
    3. Evolutionary Learning: With every audit, the agent’s knowledge base deepens. Instead of repeating the same blind checks, it refines its hypotheses based on what it has seen before — which columns tend to have issues, which tables are most prone to duplication, which logical inconsistencies recur. The agent doesn’t just run. It compounds.

    This is what separates an agent from a script. A script executes the same logic every time, regardless of history. The agent carries forward everything it has learned and every audit it performs makes the next one sharper.
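A minimal sketch of how the first category of memory might be consulted during triage; the dictionary structure, keys, and stored text are assumptions for illustration, not the production Memory Bank:

```python
# Illustrative sketch of the Memory Bank triage step: before flagging an
# anomaly as new, check whether a prior audit recorded a confirmed
# explanation. Keys and values are assumptions, not the real schema.
memory_bank = {
    ("campaign_performance_table_2", "duplicate_rows"):
        "Known upstream re-ingestion quirk, confirmed by data engineering.",
}

def triage(table: str, anomaly: str) -> str:
    explanation = memory_bank.get((table, anomaly))
    if explanation:
        # reference the known cause and move on to genuinely novel problems
        return f"known issue: {explanation}"
    return "new issue: escalate for investigation"

print(triage("campaign_performance_table_2", "duplicate_rows"))
print(triage("campaign_performance_table_3", "missing_values"))
```

In the real system this lookup is mediated by the Auditing Tool reading from Vertex AI memory, but the effect is the same: known patterns are annotated rather than re-investigated.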

    The Tech Stack

    To ensure the agent was enterprise-grade, we built on the full Google Cloud AI ecosystem:

    | Component | Role |
    | --- | --- |
    | Vertex AI Agent Engine | Manages the agent’s long-term memory persistence and saves chat sessions |
    | BigQuery | The single source of truth — the agent performs direct, in-place auditing against production tables |
    | Agent Development Kit (ADK) | The framework used to define the agent’s tools, constraints, and interaction boundaries |
    | Google Cloud Storage | Persistent storage for audit trails, PDF reports, and visual evidence |
    | Cloud Run | Deploys the A2A Agent API and the ADK Web UI for demo purposes |
    | A2A | The protocol that exposes the agent as a headless API |

    4. Proving It Works: Synthetic Error Injection

    We didn’t hope the agent worked. We proved it using a controlled methodology we call Synthetic Error Injection. The premise is straightforward: take a perfectly clean dataset, intentionally corrupt it in specific, measurable ways, and then challenge the agent to find every error we planted. If the agent can detect artificially injected errors, whose exact type, location, and severity we control, we can be confident it will handle real-world data corruption, which is typically far less extreme.

    Step 1: Preparing the Test Data

    Before injecting errors, we prepare the data for safe, controlled experimentation:

    • Anonymization — Real brand and advertiser names are replaced with generic identifiers ("Brand 1", "Company A"). Sensitive business information never enters the test environment.
    • Corruption — Copies of the dataset then receive different severity levels of corruption. This allows us to map the agent’s detection accuracy as a function of error density, from subtle (5%) to extreme (40%).

    Step 2: Injecting Controlled Errors

    Using purpose-built scripts, we introduce precisely calibrated corruptions into a table: 4 types of structural and 7 types of logical errors:

    | Category | Error | Description |
    | --- | --- | --- |
    | Structural | Missing Values (Nulls) | X% of cells set to NULL |
    | Structural | Duplicate Rows | X% exact row copies |
    | Structural | Dropped Columns | X% of columns removed |
    | Structural | Categorical Errors | Random alphanumeric strings in category fields |
    | Logical | Clicks > Impressions | Can’t click what wasn’t shown |
    | Logical | Conversions > Clicks | Can’t convert without clicking |
    | Logical | Spend with 0 Impressions | Paying for zero ad delivery |
    | Logical | Video Completions > Plays | Can’t finish without starting |
    | Logical | Purchases without Add-to-Cart | Funnel step skipped |
    | Logical | Landing Page Views > Clicks | More landings than clicks |
    | Logical | Negative Metric Values | Performance metrics can’t be negative |
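As an illustration, a null-injection step might look like the following sketch (the helper name and the toy frame are hypothetical); the point is that the count and location of every planted error are known before the agent ever sees the table:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical null-injection helper: corrupt a clean frame at a controlled
# rate so the exact count and location of every planted error is known.
def inject_nulls(df: pd.DataFrame, column: str, rate: float) -> pd.DataFrame:
    corrupted = df.copy()
    n_errors = int(len(df) * rate)
    idx = rng.choice(df.index, size=n_errors, replace=False)
    corrupted.loc[idx, column] = np.nan
    return corrupted

clean = pd.DataFrame({
    "impressions": np.arange(100.0),
    "clicks": np.arange(100.0),
})
dirty = inject_nulls(clean, "clicks", rate=0.05)
print(int(dirty["clicks"].isna().sum()))  # 5 planted nulls, recorded as ground truth
```

Duplicate rows, dropped columns, and the logical violations in the table above follow the same pattern: a scripted, measurable corruption applied to an otherwise clean copy.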

    Step 3: Synthetic Ground Truth Dataset

    We keep track of every error we introduce and produce a ground-truth dataset that looks like this:

    | Table_name | number_of_injected_logical_error | type_of_logical_error | number_of_injected_structural_errors | type_of_structural_error |
    | --- | --- | --- | --- | --- |
    | table_01 | 0 | | 1 | categorical errors |
    | table_02 | 0 | | 1 | dropped columns |
    | table_03 | 1 | clicks_exceed_impressions | 0 | |
    | table_04 | 1 | spend_with_zero_impressions | 0 | |

    5. Evaluation Pipeline, Experiments and Results

    To evaluate our agent, we follow the pipeline below:

    Evaluation pipeline flow diagram

    The 4 Experiments and Results

    Each experiment isolates a single variable to understand what affects the auditor agent’s detection quality.

    Experiment 1: Prompt Comparison

    Question: Does giving the agent a more detailed prompt improve error detection?

    We run the agent three times on the same table, each time with a different user query style:

    | Prompt Level | What the user asks | Agent’s system instruction |
    | --- | --- | --- |
    | Simple | “Check if there are any errors for table X” | Stays constant (forensic mode) |
    | Medium | “Perform a structured assessment checking physical integrity, numerical sanity, categorical validity” | Stays constant |
    | Complex | “Conduct a forensic audit checking for 11 specific error types with detailed cross-column logical checks” | Stays constant |

    Key insight from results:  Only the complex prompt successfully detected the injected spend_with_zero_impressions error (139 rows, 1.82%), while both the simple and medium prompts missed it entirely — confirming that more detailed, forensic-style instructions are critical for the agent to test nuanced logical relationships rather than just surface-level checks.


    Experiment 2: Table Sweep

    Question: How well does the agent detect different types of errors?

    Experiment 2 stress-tests the Auditor agent (using the complex prompt) on 11 synthetic BigQuery tables with progressively stacked error combinations, ranging from a single logical violation to the full set of 7 logical plus 4 structural error types (11 total). The agent achieved perfect detection on 8 of 11 tables (72.7% with F1 = 1.0) and delivered strong overall performance (precision 0.905, recall 0.864, F1 0.883).
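Given the ground truth, per-table precision, recall, and F1 can be computed by comparing the set of injected error types against the set the agent reported. A sketch, with illustrative error-type names:

```python
# Sketch of scoring one table's audit against the synthetic ground truth:
# injected error types vs. error types the agent reported. The specific
# sets below are illustrative, not results from the experiments.
def detection_scores(injected: set, detected: set):
    true_positives = len(injected & detected)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(injected) if injected else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

injected = {"clicks_exceed_impressions", "duplicate_rows", "missing_values"}
detected = {"clicks_exceed_impressions", "missing_values", "negative_metrics"}
precision, recall, f1 = detection_scores(injected, detected)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.667 0.667
```

Averaging these per-table scores across the 11 synthetic tables yields the aggregate figures reported above.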

    Experiment 3: Consistency Check

    Question: If we run the exact same audit N times, do we get the same results?

    We run the identical configuration multiple times (default: three runs).

    The Consistency Experiment demonstrated that the auditor agent achieves 100% reproducibility, consistently detecting both injected errors and real data quality issues with the exact same metrics and explanations across all three runs. This deterministic performance confirms that the agent’s complex prompt configuration is highly reliable, effectively eliminating the need for redundant audits.


    Experiment 4: Memory Ablation

    Question: Does loading past audit memories help the agent perform better?

    Runs the agent twice on the same table with the complex prompt:

    | Condition | What happens |
    | --- | --- |
    | No Memory | Cold start — agent has no context from prior audits |
    | With Memory | Agent loads past findings from Vertex AI memory before auditing |

    Key insight from results: When tables contain simple structural errors, or when the user prompt already demands a very detailed audit, memory doesn’t unlock new capabilities; instead it acts as a critical consistency enforcer, stabilizing the audit scope to guarantee a 100% success rate. But when a table contains many different logical errors and the prompt is simple, loading past memories into the agent’s context is what pushes performance to optimal.


    6. The Agent in Action

    Below is an actual interaction with the deployed agent, demonstrating its reasoning in practice. The agent discovers available tables, loads its accumulated memory from prior audits, and then, guided by that memory, conducts a targeted audit.

    Discovery

    Human: “What tables do you have access to?”

    Agent: “I have access to the following tables in the dataset:

    • campaign_performance_table_1
    • campaign_performance_table_2
    • campaign_performance_table_3
    • campaign_performance_table_4
    • campaign_performance_table_5”

    Memory Recall

    Human: “Load your memories.”

    Agent: “I have loaded the following memories:

    1. A general instruction to always check for null values in tables.
    2. A specific data quality issue in campaign_performance_table_2 where 188 rows (18.8%) were duplicates based on event_date, campaign_id, and ad_creative_id.
    3. A preference for reports to be formatted in Markdown and include an executive summary.”

    The agent doesn’t start from scratch. It arrives with institutional knowledge — prior findings, user preferences, and standing instructions — all retrieved from its persistent Memory Bank.

    Hypothesis-Driven Audit

    Human: “Based on these memories, audit campaign_performance_table_3.”

    The agent formulates a plan, queries the table using its database tool, and — informed by its memories — produces a structured audit report:

    Executive Summary: This audit revealed a consistent pattern of missing values across key columns, affecting exactly 5% of total rows. Specifically, event_date, campaign_id, impressions, clicks, and conversions each had 50 null entries out of 1,000 rows. No exact duplicate rows were found. This suggests a systemic issue with data completeness rather than random omissions.

    The agent then provides detailed findings per column — quantifying impact, assessing severity, and recommending next steps — before offering to proceed with outlier detection, categorical validation, and logical consistency checks.
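For illustration, a query of the kind the database tool might issue to quantify that missing-value pattern could look like the following sketch; the table path is a placeholder, and the column list mirrors the example above:

```python
# Illustrative BigQuery-style query of the kind the database tool might
# generate to quantify the missing-value pattern; the table path is a
# placeholder, not a real dataset.
columns = ["event_date", "campaign_id", "impressions", "clicks", "conversions"]
null_counts = ",\n  ".join(
    f"COUNTIF({col} IS NULL) AS {col}_nulls" for col in columns
)
query = (
    "SELECT\n"
    "  COUNT(*) AS total_rows,\n"
    f"  {null_counts}\n"
    "FROM `project.dataset.campaign_performance_table_3`"
)
print(query)
```

A uniform 50-null count across every listed column in a 1,000-row table is exactly the kind of signature the agent reads as systemic incompleteness rather than random omission.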

    Persistent Learning

    Human: “Save the results for this table.”

    Agent: “The audit results have been saved to memory.”

    The next time the agent audits any table, it will carry forward what it learned here — the 5% missing-value pattern, the columns affected, the systemic nature of the issue. Every audit makes the next one faster, more targeted, and more informed.


    7. Impact and What Comes Next

    Operational Impact

    The deployment of the Data Quality Assurance Agent has delivered measurable results:

    • Reclaimed engineering capacity. By automating the bulk of manual QA, data engineers have shifted from repetitive inspection work to higher-value activities.
    • Earlier error detection. Quality issues are now caught at the point of ingestion, before they propagate into downstream models. This “shift-left” approach reduces the blast radius of bad data from hours to minutes.
    • Higher model reliability. Marketing agents, analytics pipelines, and machine learning models now operate on data that has been systematically validated, reducing the risk of predictions and recommendations built on flawed foundations.

    The Bigger Picture

    This agent is more than a tool. It is a blueprint for autonomous data governance, a pattern that can be replicated across any data pipeline where quality, scale, and velocity collide.

    We are currently extending the agent along three axes:

    • Cross-table auditing: enabling the agent to detect inconsistencies across related datasets, not just within a single table. Many of the most damaging data quality issues manifest as contradictions between tables that individually look clean.
    • Event-driven execution: triggering the agent automatically whenever a BigQuery table is updated, transforming data quality monitoring from a scheduled chore into a continuous, always-on safeguard.
    • Adversarial stress-testing: today, our synthetic error injection is script-based and manually configured. We are building a dedicated adversarial agent whose sole purpose is to generate increasingly complex, realistic data corruptions (subtle logical contradictions, plausible-looking outliers, correlated missing-value patterns) specifically designed to challenge the QA agent’s detection capabilities. By pitting one agent against the other in a continuous red-team / blue-team loop, both improve: the adversarial agent learns to craft harder-to-detect errors, and the QA agent learns to catch them, driving each other toward sharper, more robust performance over time.

    Together, these extensions move us toward a future where data quality monitoring is not a task that consumes an engineer’s day. It is a capability the agent handles continuously and intelligently, surfacing only the issues that require human judgment and decision-making.

  • A Self-Improving AI Agent for Optimizing and Explaining Media Performance

    TL;DR Most tools that claim to predict post performance tell you what’s in a post — they can’t tell you why it would resonate. The Prediction Optimization Agent closes that gap by converting the image, caption, and influencer metadata into a single natural-language description that captures tone, credibility, and visual cohesion the way a strategist would. It then forecasts engagement, measures its errors, and autonomously rewrites its own instructions to produce better descriptions next time — with no human prompt engineering required. The result is a system that doesn’t just predict what will perform, but allows creative teams to draw insights on what makes a post tick — and what doesn’t.

    Introduction

    Influencer marketing is a multi-billion-dollar industry, yet many spending decisions still rely heavily on intuition. Brands pick creators by follower count, approve content by gut feel, and learn what worked only after the budget is gone. The cost of a wrong bet is not just wasted spend. It is missed revenue, lost momentum, and a creative team back at square one.

    Everyone in the industry wants the same thing: a way to know, before publishing, whether a post will perform. But the tools built to answer that question often disappoint. They extract surface features like hashtag counts, color palettes, and face detection, then output a score no one can interpret or trust. They tell you what is in a post but nothing about why it would resonate.

    The gap is not technical. It is perceptual. What makes a post work is something a skilled marketer can often articulate after the fact: the influencer’s tone felt effortless, the product placement didn’t interrupt the narrative, the caption hit a cultural nerve. These judgments require understanding context, intent, and audience, not just pixels and metadata.

    This is the problem the Prediction Optimization Agent was designed to close. Rather than extracting features and producing an opaque score, the agent reads a post the way a strategist would. It examines the image, the caption, and the influencer’s history, then writes a structured natural-language description of everything that contributed to the post’s performance. A creative director can then read this description, challenge it, and act on it without ever needing to trust a black box.

    The intuition is simple: the description that best predicts performance is, by definition, the description that best explains it. The agent iteratively refines this description by diagnosing its own errors, identifying what the previous descriptions failed to capture, and rewriting its own instructions to self-improve. Over successive iterations it converges on the specific qualities that actually drive engagement, not because anyone told it what to look for, but because it is guided by the minimization of its own predictive error.

    The Anatomy of a Viral Post

    Predicting the performance of an ad or a social media post before publishing — and certainly before spending money to boost it — remains a primary objective for marketers and influencers alike. But how do you distill something as complex and subjective as a social media post into a single prediction?

    Consider a typical Instagram post. It is never just a picture. It’s a complex combination of different data types working together simultaneously. Take the influencer post shown in Figure 1. To truly understand why this post succeeds or fails, you need to consider:

    • The image itself — composition, lighting, color palette, subjects, products, and setting.
    • The caption — where the influencer might share a discount code, crack a joke, or strike an emotional chord.
    • The influencer’s identity — their bio, follower count, niche credibility, and historical performance.
    • The metadata — the time of day, geographic location, hashtags, and platform-specific context.

    Each of these dimensions carries signal. None of them tells the full story alone. The magic — and the difficulty — lies in how they interact.

    Figure 1: A typical influencer post. Traditional analytics struggle to measure the combined impact of the visual aesthetic, the caption’s tone, and the underlying metadata. To accurately predict engagement for a post like this, our system analyzes the image, caption and influencer statistics together as a single cohesive unit.

    Why Traditional Approaches Fall Short

    Historically, the industry has tried to solve this problem by breaking a post into its component parts and analyzing each one in isolation. Engineering teams would string together separate, disjointed machine-learning models — each responsible for a single modality — and attempt to fuse their outputs at the end:

    • Standard Computer Vision Models: Isolated image-recognition algorithms scan the visual to detect objects, people, or products. Separate models handle face detection and emotion recognition. The output is a list of labels: “person detected,” “beverage detected,” “outdoor setting.”
    • Text Analyzers & OCR: NLP tools parse the caption — counting hashtags, flagging emojis, scoring sentiment. Meanwhile, optical character recognition (OCR) software reads any text visible within the image itself.
    • Tabular Metadata Algorithms: A statistical algorithm ingests structured fields — follower count, posting time, engagement history — and produces its own independent prediction.

    Engineers then attempt to fuse these isolated outputs into a single forecast. But this approach has a fundamental, structural flaw: it cannot capture meaning that emerges from the interaction between modalities.

    Consider a concrete example. Imagine a fitness influencer posts a photo of herself laughing mid-sip from an energy drink, with the caption: “My face when someone says they don’t need pre-workout 😂.”

    A traditional computer vision model would tag this as: “person detected,” “beverage detected,” “outdoor setting,” “positive facial expression.” A separate text analyzer would count the hashtags and flag the emoji. But neither system — nor any crude fusion of their outputs — would understand the joke: that the humor comes from the caption reframing the laugh as a reaction, turning a standard product shot into a relatable meme.

    It would also completely miss that this influencer is a certified nutritionist — meaning her credentials paired with an energy drink carry implicit credibility that a fashion influencer holding the same product would not.

    Our Approach: Unifying Multimodal Data Through Semantic Translation

    To address this limitation, we developed the Prediction Optimization Agent, a self-improving AI agent that moves beyond traditional approaches by unifying all available data into a single format it can reason about: natural language.

    The agent’s core mechanism is deceptively simple but profoundly effective. It takes complex, disparate multimodal data — numerical metrics, images, video, and text captions — and converts everything into a single natural-language paragraph that holistically describes the post’s content, aesthetic, tone, and context. Projecting these distinct formats into readable text normalizes heterogeneous data into a structure that a language model can process as a unified whole.

    Instead of treating image and text as separate inputs, the agent uses a single prompt to digest all available information at once. Multimodal LLMs serve as one of the agent’s tools, acting as universal feature extractors that capture the abstract, human-centric concepts that traditional pipelines structurally cannot.

    But the agent does not simply produce any description and hope it is useful. It is driven by a feedback loop grounded in predictive error: the descriptions it generates are used to forecast engagement, those forecasts are compared against real outcomes, and the resulting errors tell the agent exactly how much predictive value its current descriptions are capturing — and how much they are missing. Through successive rounds of this loop, the agent autonomously rewrites the instructions that govern how descriptions are composed, converging on the paragraph structure that maximizes predictive accuracy.

    This error-driven process has a profound consequence for explainability. The description the agent converges on is not a generic summary — it is the description that the agent has discovered, through empirical optimization, to be the most predictive of real engagement outcomes. In other words, the features highlighted in the final description are there because they matter — because including them reduced prediction error. When the optimized description of a high-performing post calls out “candid humor,” “golden-hour lighting,” and “influencer credibility,” those aren’t arbitrary observations. They are the factors the agent learned to pay attention to because they measurably improved its ability to predict what performs well.

    How the Prediction Optimization Agent Works

    The Prediction Optimization Agent orchestrates three internal stages in a continuous feedback loop: it observes a post, describes it, predicts its performance, measures how far off it was, and then rewrites its own instructions to produce better descriptions next time — closing the loop and getting measurably better with every iteration, without any human intervention.

    Figure 2: The Prediction Optimization Agent architecture. Raw media, metadata, and an initial prompt are fed into Stage 1 (Semantic Translation), which produces a natural-language description of the post. Stage 2 (Engagement Predictor) reads that description and predicts engagement. Prediction errors are then passed to Stage 3 (Self-Optimizer), which autonomously analyzes what went wrong and rewrites the Stage 1 prompt — closing the feedback loop and improving the system’s accuracy with every iteration.
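    As a rough sketch (not the production implementation), the loop can be expressed in a few lines of Python. The `translate`, `predict`, and `rewrite_prompt` callables are hypothetical stand-ins for the LLM-backed tools described in the three stages below:

```python
def optimize_prompt(posts, actuals, prompt, rounds=3,
                    translate=None, predict=None, rewrite_prompt=None):
    """Run the describe -> predict -> evaluate -> rewrite loop.

    `translate(post, prompt)` turns a raw post into a natural-language
    description, `predict(description)` forecasts engagement from that
    text, and `rewrite_prompt(prompt, errors, history)` produces an
    improved prompt. All three are illustrative stand-ins for the
    agent's LLM-backed tools.
    """
    history = []  # (prompt, mean absolute error) per round
    for _ in range(rounds):
        descriptions = [translate(p, prompt) for p in posts]
        predictions = [predict(d) for d in descriptions]
        errors = [pred - act for pred, act in zip(predictions, actuals)]
        mae = sum(abs(e) for e in errors) / len(errors)
        history.append((prompt, mae))
        # Stage 3: the agent reasons over its errors and full history
        prompt = rewrite_prompt(prompt, errors, history)
    # keeping the full history means a regression in a later round
    # never loses the best prompt found so far
    return min(history, key=lambda h: h[1])
```

    Because the loop logs every (prompt, error) pair, the best-performing prompt can always be recovered even if a later rewrite happens to make things worse.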

    Stage 1: Semantic Translation

    The agent begins by ingesting the raw post — the image or video file, the caption text, and all available metadata (follower count, posting time, influencer bio, etc.). Using a multimodal LLM as its translation tool, and guided by a detailed set of internal instructions (its prompt), it produces a single, rich natural-language paragraph that captures not just what is in the post, but what the post means: the visual mood, the emotional tone, the relationship between caption and image, and the brand alignment.

    The quality and focus of this description is entirely governed by the prompt — and as we will see, it is the prompt that the agent learns to optimize.

    Stage 2: The Engagement Predictor

    The agent passes the semantic paragraph to its prediction tool — a model that evaluates the post’s potential performance based entirely on the natural-language description from Stage 1.

    The predictor can be any machine learning model that can consume text paragraphs: tree-based models, deep learning architectures, or even a fine-tuned LLM specialized for predictions in a specific domain. Our agent is compatible with all of these options.

    In all cases, the quality of the prediction is only as good as the quality of the semantic description. A richer, more targeted paragraph gives the predictor more signal to work with — which is precisely why improving the description is the single most important action the agent can take.
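    To make the predictor's contract concrete, here is a deliberately minimal stand-in: a pure-Python bag-of-words linear regressor. It is an illustrative assumption, not the production model, but it shows the interface any compatible predictor must satisfy: text paragraph in, engagement estimate out.

```python
import re
from collections import defaultdict

class TextEngagementPredictor:
    """Minimal bag-of-words linear regressor trained by SGD: a toy
    stand-in for any text-capable predictor (trees, deep nets, or a
    fine-tuned LLM)."""

    def __init__(self, lr=0.01, epochs=200):
        self.lr, self.epochs = lr, epochs
        self.weights = defaultdict(float)  # one weight per token
        self.bias = 0.0

    @staticmethod
    def _tokens(text):
        return re.findall(r"[a-z']+", text.lower())

    def fit(self, descriptions, engagements):
        # stochastic gradient descent on squared error
        for _ in range(self.epochs):
            for text, target in zip(descriptions, engagements):
                err = self.predict(text) - target
                self.bias -= self.lr * err
                for tok in self._tokens(text):
                    self.weights[tok] -= self.lr * err
        return self

    def predict(self, text):
        return self.bias + sum(self.weights[t] for t in self._tokens(text))
```

    Richer descriptions expose more informative tokens to the model, which is exactly why the agent invests all of its optimization effort in the description prompt.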

    Stage 3: The Self-Optimizer

    This is the core of the Prediction Optimization Agent’s intelligence — and what makes it a true agent rather than a static model.

    After predictions are generated for a batch of posts and compared against actual engagement outcomes, the agent enters a self-evaluation cycle in which it is presented with:

    1. The prompt it used to generate descriptions in the current round.
    2. The per-post prediction errors — for each post, the predicted engagement versus the actual engagement, and the magnitude and direction of the miss.
    3. The aggregate error metrics — overall accuracy statistics such as mean absolute error, broken down by content category, audience size, or post type, revealing systematic blind spots.
    4. The complete history of prior prompts and their corresponding error metrics — so the agent can see not just how it performed this round, but how each previous prompt version performed, which changes helped, and which didn’t.

    Armed with this structured feedback, the agent doesn’t guess at improvements — it reasons about them. It asks: Which types of posts am I consistently getting wrong? What do those posts have in common? What dimensions of the content is my current prompt failing to capture?

    Based on this analysis, the agent autonomously rewrites the prompt instructions used in Stage 1. The rewritten prompt is not a minor tweak — it can add entirely new analytical dimensions, remove instructions that proved unhelpful, or restructure the description format to foreground the features most correlated with engagement.

    The critical insight is that the predictive error is the signal that drives the entire system. Every improvement the agent makes to its prompt is motivated by a concrete, quantifiable gap between what it predicted and what actually happened. The agent isn’t optimizing descriptions to be “better” in some abstract sense — it is optimizing them to be more predictive, and the error metrics tell it exactly where it is falling short.
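    One plausible way to package this structured feedback is as a single text block the self-optimizer can reason over. The function name and formatting below are illustrative assumptions, not the production template:

```python
def build_optimizer_input(current_prompt, per_post_errors, history):
    """Assemble the self-optimizer's feedback package as one text block.

    `per_post_errors` is a list of (post_id, predicted, actual) tuples;
    `history` is a list of (prompt, mean_absolute_error) pairs from
    prior rounds. The exact layout is a sketch for illustration.
    """
    lines = ["## Current description prompt", current_prompt, "",
             "## Per-post errors (predicted vs actual)"]
    for post_id, pred, act in per_post_errors:
        direction = "over" if pred > act else "under"
        lines.append(f"- {post_id}: predicted {pred:.0f}, actual {act:.0f} "
                     f"({direction}-estimated by {abs(pred - act):.0f})")
    mae = sum(abs(p - a) for _, p, a in per_post_errors) / len(per_post_errors)
    lines += ["", f"## Aggregate error: MAE = {mae:.1f}",
              "", "## Prompt history (prompt -> MAE)"]
    for past_prompt, past_mae in history:
        lines.append(f"- MAE {past_mae:.1f}: {past_prompt[:80]}")
    lines += ["", "Rewrite the prompt to reduce these errors. Add analytical "
              "dimensions the current prompt is missing; remove unhelpful ones."]
    return "\n".join(lines)
```

    Feeding the optimizer the full prompt history alongside the current round's errors is what lets it see which past changes helped and which did not, rather than guessing blindly.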

    The Agent in Action: Watching the Prompt Evolve

    To make this concrete, let’s follow the agent through three optimization rounds on a single post — the Hype Energy ad featuring two young models in blue athletic wear, photographed from a dramatic low angle against a vivid blue sky lined with palm trees.

    Figure 3: Example Instagram post used for semantic translation.

    Round 1: The Baseline

    The agent starts with a deliberately simple, generic prompt:

    Prompt (Round 1): “Analyze this Instagram post and the provided metadata. Describe the visual elements of the image, the style, and the mood. Then, explain how the image relates to the brand or influencer described in the metadata.”

    The agent’s description:

    “This Instagram post is a high-energy lifestyle advertisement for Hype Energy, utilizing a vibrant, ‘streetwear-meets-summer’ aesthetic to target a young, active demographic. The image uses a low-angle shot, which makes the subjects appear dominant and ‘larger than life.’ Two young models are featured in athletic/leisure wear. The female subject in the foreground holds the product directly toward the lens. The male subject in the background wears reflective ‘matrix-style’ sunglasses and a blue Jordan hoodie. The post aligns perfectly with the brand’s identity as described in the metadata. The visual style is polished and fashionable, positioning the drink as a lifestyle accessory rather than just a functional caffeine source. The caption emphasizes ‘Zero Sugar’ and staying ‘sharp’ and ‘unstoppable.’”

    This is already more nuanced than what a traditional CV model would produce — but the predictor, working from this description, significantly underestimates the post’s engagement. The error gets logged. Across the full batch, the agent notices a pattern: it is systematically underperforming on highly produced brand-partnership posts.

    Round 2: The Agent Adapts

    The self-optimizer reviews the Round 1 errors and identifies the gap: the prompt never asked the agent to evaluate visual cohesion between talent styling and brand identity, or the psychological impact of compositional techniques like camera angle. It rewrites the prompt:

    Prompt (Round 2): “Analyze this Instagram post and the provided metadata. Describe the visual elements, style, mood, and composition. Assess the visual cohesion between the talent’s styling (wardrobe, accessories, posture) and the brand’s visual identity. Evaluate how compositional choices — such as camera angle, framing, and subject placement — contribute to the aspirational or emotional tone of the post. Explain how the image, caption, and influencer profile work together to create a unified message.”

    The agent’s description now reads:

    “This Instagram post is a high-energy lifestyle advertisement for Hype Energy that demonstrates strong visual-brand cohesion. The coordinated blue wardrobe across both models — the female subject’s athletic top and the male subject’s Jordan hoodie — creates a seamless extension of the brand’s blue-and-white visual identity, embedding the product within the scene rather than presenting it as a foreign element. The dramatic low-angle shot elevates the subjects into an aspirational, dominant position, making the viewer look up at them — a compositional choice that reinforces the caption’s message of being ‘unstoppable.’ The female subject holds the product at the camera’s focal point with deliberate confidence, while the male subject’s reflective sunglasses and relaxed posture add an effortless, streetwear credibility. Together, the image, caption, and influencer positioning frame the energy drink not as a beverage but as a lifestyle symbol for a young, fashion-conscious audience.”

    The prediction for this post improves noticeably. Across the batch, the error on brand-partnership posts drops. But the agent notices it is still missing on posts where influencer credibility plays a key role — posts where who the influencer is matters as much as what’s in the frame.

    Round 3: The Agent Sharpens Further

    The self-optimizer reviews Rounds 1 and 2 side by side — seeing that the Round 2 prompt improved brand-cohesion analysis but still lacks attention to influencer authority and audience trust signals. It rewrites again:

    Prompt (Round 3): “Analyze this Instagram post and the provided metadata. Describe the visual elements, style, mood, and composition. Assess the visual cohesion between the talent’s styling and the brand’s visual identity. Evaluate how compositional choices contribute to the aspirational or emotional tone. Analyze the influencer’s credibility and authority relative to the product category — consider their professional background, niche expertise, audience demographics, and how these factors amplify or undermine the authenticity of the endorsement. Assess the emotional resonance of the caption: does it use humor, aspiration, relatability, or urgency, and how does this tone interact with the visual content? Explain how all elements — image, caption, influencer identity, and audience context — work together to create a unified, persuasive message.”

    The agent’s description:

    “This Instagram post is a high-energy lifestyle advertisement for Hype Energy that achieves strong visual-brand cohesion and leverages influencer-audience alignment to maximize credibility. The coordinated blue wardrobe across both models creates a seamless extension of the brand’s visual identity. The dramatic low-angle shot places the subjects in an aspirational, dominant position, reinforcing the caption’s message of being ‘unstoppable.’ The female subject’s direct-to-camera product hold at the image’s focal point conveys deliberate confidence, while the male subject’s reflective sunglasses and relaxed posture add streetwear authenticity. Critically, the influencer’s established presence in the fitness and lifestyle niche — with a predominantly young, health-conscious following — makes the ‘Zero Sugar’ energy drink endorsement feel credible rather than transactional. The caption’s tone (‘Stay sharp. Stay unstoppable.’) uses aspirational brevity that mirrors the visual intensity, creating a tight loop between what the viewer sees and what they read. The overall effect positions the product as an identity marker for an active, fashion-forward audience, rather than a commodity — a framing that this influencer’s audience is primed to respond to.”

    This description now captures visual strategy, brand cohesion, compositional psychology, influencer credibility, audience alignment, and caption-image interaction — all in a single readable paragraph. The prediction error drops significantly. And a marketing team reading this description can immediately see why the post works: the credibility is authentic, the visual identity is seamless, and the caption reinforces the image rather than fighting it.

    The agent arrived at this level of descriptive richness not because someone hand-engineered the prompt, but because the predictive errors from Rounds 1 and 2 told it exactly what it was missing.

    Real-World Application: Lessons and Implications

    To validate the Prediction Optimization Agent in a real-world scenario, we tested it on an Instagram Influencer Dataset containing approximately 10.18 million posts spanning a diverse range of content categories and audience sizes. The results revealed key insights about both the agent’s learning dynamics and the practical implications for marketing teams.

    The Agent Learns What Matters — Autonomously

    By processing its own historical error rates, the Prediction Optimization Agent autonomously learned to rewrite its prompts — producing richer, more targeted post descriptions with every iteration, which in turn drove increasingly accurate predictions.

    Figure 4 — Autonomous Learning: The chart tracks the agent’s predictive performance (y-axis) across successive optimization rounds (x-axis). Each point represents a full cycle of the agent’s loop: describe → predict → evaluate → rewrite. The trend demonstrates that as the agent iteratively refined its own prompt — guided by quantitative error metrics from prior rounds — forecast accuracy improved consistently and autonomously, without any human prompt engineering.

    The agent’s optimization works by feeding it the complete history of prior prompts alongside rigorous, quantitative error breakdowns from every previous round. Armed with this granular self-knowledge, the agent identifies precisely which content dimensions it has been under-analyzing — production quality, humor style, credibility signals, visual-brand cohesion — and surgically updates its prompt to capture those features in subsequent iterations. Each round’s prompt is built on the empirical lessons of every round before it, producing a compounding improvement curve where the descriptions become progressively more predictive and, as a direct consequence, more explanatory.

    Conclusions

    The Prediction Optimization Agent demonstrates something that extends well beyond social media: natural-language prompts can be treated as tunable parameters, optimized autonomously by the AI itself. By allowing the agent to refine its own instructions through predictive error, the system progressively discovers what drives human engagement and expresses that knowledge in plain language.

    For marketing teams, this is a significant departure from opaque AI tools that output a score and leave everyone guessing. When a team wants to understand why one campaign outperforms another, they don’t need to interpret a dashboard of abstract model coefficients. They can compare the text profiles of a high-performing post and a low-performing one, side by side, and immediately see the differences the AI picked up on: one might highlight “authentic, candid composition with humor-driven caption and strong influencer-niche alignment,” while the other notes “generic studio shot with formulaic promotional language and weak audience-brand fit.” The patterns reveal themselves in plain English, and they are the right patterns, because the agent discovered them by optimizing for predictive accuracy.

    In practice, this means teams can run draft campaign concepts through the system before committing production and media budgets, getting a readable assessment of how the AI interprets the creative. Designers and copywriters can test variations of a post and compare descriptions side by side to see, in their own language, which direction resonates more strongly. And by normalizing visual and written media into a unified, readable format, brands can pair creative intuition with precise forecasting, treating creative assets as predictable drivers of revenue rather than expensive gambles.

    The same architectural pattern (semantic translation, error-driven prediction, and autonomous self-optimization) is not limited to social media. Any domain where success depends on understanding the interplay of qualitative and quantitative signals, from political messaging to product design to entertainment, stands to benefit from systems that can read, reason, reflect, and improve on their own. The question is no longer whether AI can predict what resonates with people. It is whether we are ready to let it teach itself how.

  • From Hype to Impact: Predicting Campaign Performance with Fine-Tuned LLMs

    Artificial intelligence is experiencing a moment of unprecedented attention. New models appear almost weekly, and headlines regularly promise that AI will transform every industry. Large Language Models (LLMs)—the technology behind conversational systems like ChatGPT—have demonstrated remarkable abilities to write, summarize, reason, and interpret complex information.

    But beneath the excitement lies a practical business question:

    Can these new GenAI systems actually outperform the specialized machine learning models that companies already rely on?

    At WPP AI Lab, we decided to answer that question using the gold standard of applied engineering and science: empirical experimentation.

    Rather than relying on claims or assumptions, we conducted an in-house investigation designed to evaluate whether modern AI models could improve one of the most important functions in marketing: predicting campaign success.

    The results were revealing.

    We discovered that large language models can rival highly specialized systems—but only when applied in the right way. The key was not simply choosing the most powerful model, but designing the correct architecture and training strategy around it.

    This article tells the story of that journey: the problem we set out to solve, the challenges we encountered, the models we tested, and the lessons we learned about how AI can deliver real business value.


    The Motivation: Turning Marketing Insight Into Prediction

    Marketing has always combined creativity with data. Campaigns are built on ideas—stories that connect brands with audiences—but their success ultimately depends on how well those ideas resonate in the real world.

    Marketers must make decisions about many interconnected factors:

    • Who the campaign targets
    • How the brand is positioned
    • What message the creative communicates
    • Where the campaign is distributed
    • Which markets it reaches

    Each of these factors influences how audiences respond.

    Traditionally, organizations attempt to manage this complexity through specialized machine learning models. These models analyse historical data and identify patterns associated with successful campaigns.

    Over time, they can become highly effective prediction engines.

    However, they also have limitations. Most traditional systems require structured inputs, carefully engineered features, and extensive retraining whenever new signals are introduced.

    Meanwhile, a new generation of AI models has emerged with a very different capability. Large language models are trained on enormous amounts of text and information, giving them an ability to interpret complex descriptions and contextual relationships.

    This raised an intriguing possibility.

    Instead of forcing marketing data into rigid modelling structures, could AI models directly interpret the narrative descriptions behind campaigns and predict their outcomes?

    If so, the implications would be significant. Marketers could evaluate campaign ideas before launch, test creative variations more rapidly, and gain deeper insight into why certain strategies succeed.

    But before adopting such technology, we needed to determine whether it could truly match the performance of the systems already in use.


    The Challenge: Predicting Campaign Performance

    To explore this question, we designed a controlled experiment around an actual marketing prediction problem.

    To ensure robust and repeatable results, we utilized proprietary synthetic datasets designed to mirror real-world marketing dynamics, allowing full control and visibility over the factors driving campaign performance.

    Each campaign in our dataset was described using five key components:

    Audience – the consumer segment being targeted

    Brand – the positioning and perception of the brand

    Creative – the tone and message of the campaign

    Platform – where the campaign runs

    Geography – the markets being targeted

    Based on these characteristics, campaigns were classified into one of three outcomes:

    • Overperforming
    • Average performance
    • Underperforming

    Our task was to train AI systems capable of predicting the likely performance category of a campaign.

    To make the challenge realistic, we introduced an important complication.

    In real marketing environments, most campaigns perform roughly as expected. Only a small proportion dramatically outperform or significantly underperform.

    We recreated this condition by generating datasets where “average performance” campaigns were far more common than the other two categories.

    This imbalance creates a difficult prediction problem. Many models simply learn to predict the most common outcome rather than identifying the subtle patterns that signal exceptional success or failure.

    Any model that succeeded in this environment would therefore need to detect nuanced relationships between campaign characteristics.
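    A standard remedy for this kind of imbalance (an illustrative technique, since the article does not detail the exact method used) is to weight rare classes more heavily in the training loss, so that "overperforming" and "underperforming" examples count for more than their raw frequency suggests. A minimal sketch:

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights using the common 'balanced'
    heuristic: weight_c = total / (num_classes * count_c). Rare classes
    get larger weights, so a model cannot minimize its loss simply by
    always predicting the dominant 'average' class."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}
```

    With 80% of campaigns labeled average, the two rare outcome classes receive several times the weight of the majority class, pushing the model to learn the subtle patterns that distinguish exceptional campaigns.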


    Our Approach: Testing Three AI Strategies

    Rather than evaluating a single system, we designed our experiment to test three different strategic approaches to applying modern AI.

    Each strategy represented a different balance between speed, flexibility, and customization.


    Strategy 1: The Managed Model

    The first strategy relied on a fully managed AI platform using Gemini 2.5 Flash-Lite, developed by Google.

    This approach offered the fastest path to experimentation. The model could be fine-tuned quickly using cloud infrastructure without needing to build custom training systems.

    From a business perspective, this kind of solution is attractive because it minimizes engineering overhead and enables rapid prototyping.

    However, it also limits how much control teams have over how the model learns and makes decisions.

    We used this workflow to establish a baseline: how well could a modern AI model perform with minimal customization?


    Strategy 2: The Customizable Model

    Our second strategy used Gemma, another model developed by Google, but released with open weights.

    Open-weight models provide much greater flexibility. Instead of relying on a predefined training pipeline, we could directly control how the model was further trained on our datasets and how it generated predictions.

    This allowed us to reshape the system so that it behaved more like a structured decision engine rather than a conversational AI.

    The additional control proved crucial.

    By carefully designing how the model made predictions, we were able to prevent it from defaulting to the most common outcome and instead encourage it to learn the deeper relationships within the data.


    Strategy 3: The Fully Hands-On Modeling Approach

    Our final strategy pushed customization even further.

    Using Qwen3, developed by Alibaba Cloud, we built a completely custom training environment and directly modified parts of the model’s architecture.

    This approach allowed us to reshape the model into a dedicated classifier that could only produce the specific categories required by our task.

    While this workflow required significantly more engineering effort, it provided the highest degree of flexibility and experimentation.
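    The kind of architectural change described above can be illustrated with a toy classification head: a small linear layer mapping the transformer's pooled hidden state to exactly three output classes, replacing the full-vocabulary language-model head. The code below is a simplified pure-Python sketch, not the actual Qwen3 modification:

```python
import math
import random

class ClassificationHead:
    """Linear head mapping a pooled hidden state to exactly three
    campaign-outcome classes, standing in for the language-model head
    that normally scores the entire vocabulary."""

    def __init__(self, hidden_size, num_classes=3, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(hidden_size)
        # random initialization; in practice these weights are trained
        self.w = [[rng.uniform(-scale, scale) for _ in range(hidden_size)]
                  for _ in range(num_classes)]
        self.b = [0.0] * num_classes

    def forward(self, hidden_state):
        logits = [sum(wi * h for wi, h in zip(row, hidden_state)) + bi
                  for row, bi in zip(self.w, self.b)]
        m = max(logits)  # subtract max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        return [e / z for e in exps]  # class probabilities
```

    Because the head can only ever emit three probabilities, the model is structurally incapable of producing anything other than the required outcome categories.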


    The Results: A Clear Winner Emerges

    When we compared the results across all three systems, the differences were striking.

    The managed Gemini workflow proved easy to deploy but struggled with the imbalance in our dataset. The model tended to predict “average performance” too frequently, resulting in relatively low predictive accuracy overall.

    This highlighted a key limitation of treating a language model purely as a text generator: it may not naturally behave like a structured prediction system.

    The Qwen experiment demonstrated promising improvements through architectural customization. However, because this phase of the project focused on proving the concept rather than exhaustive optimization, the model did not yet reach the highest performance levels.

    The standout performer was the fine-tuned Gemma model.

    With the right training strategy and architectural adjustments, Gemma achieved prediction accuracy very close to our specialized internal machine learning system, performing strongly across all performance categories.

    In other words, a modern language model—when properly configured—was able to rival a system specifically designed for the task, reaching a ~90% F1 score against our specialized internal model’s ~93%.


    Why Gemma Won

    The key lesson from this experiment is that success in AI depends less on the specific model and more on how it is implemented.

    The managed Gemini workflow prioritized convenience and rapid deployment, but its limited customization prevented us from adapting the model’s behaviour to the nuances of our dataset.

    The Qwen workflow provided maximum flexibility but required significantly more tuning and engineering to unlock its full potential.

    Gemma offered the ideal balance.

    It allowed enough architectural control to reshape the model into a structured prediction engine while still benefiting from robust training infrastructure.

    This combination enabled the model to learn the subtle relationships that determine campaign success.


    Insights Beyond Prediction

    Beyond pure performance, the experiment produced valuable insights about marketing dynamics themselves.

    By analysing how the best-performing model made predictions, we could identify which campaign factors had the greatest influence on outcomes.

    Three signals stood out consistently:

    • Brand positioning
    • Platform selection
    • Creative messaging

    These factors had a stronger influence on campaign performance than geographic targeting or audience description.

    While this aligns with marketing intuition, the ability to quantify these relationships provides a powerful foundation for future decision-making.

    Instead of relying solely on experience or instinct, marketers can combine creative expertise with data-driven insight.


    What This Means for Businesses

    For organizations exploring AI in marketing and advertising, this research carries several important implications.

    First, modern language models can indeed rival specialized machine learning systems when applied correctly.

    Second, the architecture surrounding the model is just as important as the model itself. Systems designed for open-ended conversation must often be adapted before they can function effectively as prediction engines.

    Third, the ability to interpret narrative context opens new possibilities for marketing intelligence.

    Instead of relying solely on structured datasets, organizations can analyse campaign descriptions, creative briefs, and strategic messaging to generate predictive insight.

    This represents a significant shift in how marketing data can be used.


    The Impact for the WPP AI Lab

    For the WPP AI Lab, this experiment served several important purposes.

    It demonstrated that advanced AI systems can be adapted to solve complex marketing prediction problems with performance comparable to specialized models.

    It strengthened our internal capabilities in designing and deploying custom AI architectures.

    And it provided a clear blueprint for future innovation.

    By combining deep expertise in machine learning with a practical understanding of marketing challenges, we can build systems that help organizations make smarter decisions faster.


    Looking Ahead

    Artificial intelligence is evolving at an extraordinary pace.

    However, the most important innovations will not come from models alone. They will emerge from how organizations integrate these technologies into real business workflows.

    Our experiment showed that when applied thoughtfully, AI can move beyond hype and deliver measurable value.

    It can help marketers understand their audiences more deeply, design campaigns more effectively, and allocate resources more intelligently.

    Most importantly, it demonstrates that the future of marketing intelligence lies not in replacing human creativity but in augmenting it with powerful analytical tools.

    The question was never whether AI would transform marketing. The real challenge is learning how to apply it in ways that truly work. And through this journey, we have taken an important step toward that future.

    Disclaimer: This content was created with AI assistance. All research and conclusions are the work of the WPP AI Lab team.

  • The Brand Perception Atlas – Mapping the Modern Brand, From Social Signal to Core Equity

    Introduction

    In the digital age, a brand is no longer defined solely by what a company communicates about itself. Instead, it is shaped by what millions of people collectively believe it to be. These beliefs are expressed continuously through social media posts, product reviews, news coverage, and countless other digital interactions.

    For organizations attempting to understand brand perception, this environment presents a fundamental challenge. The volume of available data is unprecedented, yet the signals it produces are often inconsistent and contradictory. Traditional research tools such as surveys and focus groups remain essential for measuring brand equity, but they capture perception only at specific points in time and cannot fully reflect the fast-moving nature of digital conversation.

    At the same time, social media offers a continuous stream of public commentary, revealing how brands are discussed, interpreted, and compared in everyday discourse. However, these signals are noisy and difficult to interpret in isolation.

    This tension motivated the development of a Brand Perception Atlas, designed to integrate diverse perception signals into a unified analytical framework. By combining social media data, public knowledge sources, Large Language Model (LLM) summaries, and established brand equity research, the atlas aims to provide a more comprehensive understanding of how brands are perceived across the digital ecosystem.

    The first iteration of this project analyzed perception signals for more than 200 brands and over 4,000 individual data points, enabling the construction of a visual representation of brand perception that reveals relationships, consistencies, and divergences across multiple sources.

    The Brand Perception Atlas

    The Brand Perception Atlas functions as a navigational system for brand strategy, much as a satellite navigation map helps a pilot understand the terrain below. A pilot does not see individual roads or buildings from the cockpit. Instead, they rely on a map that reveals patterns, landmarks, and relationships across the entire landscape.

    Brand leaders face a similar challenge. Individual data points (social posts, reviews, survey responses) are like isolated coordinates. On their own, they reveal very little. But when thousands of signals are plotted together, patterns begin to emerge.

    The Brand Perception Atlas converts these signals into a shared perceptual map, allowing leaders to see how brands cluster, where they compete, and which meanings they occupy in the public imagination.

    To achieve this, the atlas synthesizes perception signals from several sources:

    • Content from official brand accounts on platforms such as TikTok and Instagram
    • Public narratives reflected in sources like Wikipedia
    • AI-generated summaries, using Gemini, describing how brands are perceived in digital discourse
    • Survey-based brand equity data from the WPP Brand Asset Valuator® (BAV), anchoring the analysis in long-term brand perception

    Figure 1 illustrates the Brand Perception Atlas and the different data sources used. By integrating these sources, the atlas creates a more holistic representation of brand perception than any individual data source can provide alone.

    Figure 1: Data sources flowing into the Brand Perception Atlas

    The actual data from our sensors for two of the world’s most recognizable brands, Target and Amazon, are outlined in Table 1. By presenting the actual reporting generated by each source, we can see how the Brand Perception Atlas bridges the gap between decades of structured consumer research and the lived experience of these brands in digital spaces.

    Sample Brand Perception Reports
    BAV Survey (long-term equity)
    Target: “Consumers perceive this brand as a highly accessible and dependable choice, offering excellent value for money. It consistently earns praise for its reliable, high-quality, and original offerings… Seen as a dynamic leader that stays up-to-date and trendy.”
    Amazon: “Broadly perceived as an innovative and reliable leader, offering high quality and exceptional value. Consumers praise its distinctiveness and original thinking… It resonates with intelligent, forward-thinking individuals.”

    Wikipedia (public narrative)
    Target: “A major American retail chain… positioned as offering upscale products at below-average costs, appealing to a younger, more educated, and higher-income demographic… noted for its significant market presence.”
    Amazon: “Perceived as a global technology behemoth recognized for innovation and market leadership… However, its perception is tempered by persistent discussions regarding labor practices and its influence on traditional retail.”

    Gemini LLM (digital discourse)
    Target: “Target cultivates an ‘upscale-for-less’ reputation… It’s perceived as a clean, organized, and pleasant one-stop-shop that blends everyday necessities with trendy, affordable finds.”
    Amazon: “A study in contrasts, embodying both indispensable convenience and controversial corporate power. The vibe is one of a modern utility, deeply integrated into daily life, often viewed as a ‘necessary evil’.”

    Instagram (brand official account)
    Target: “An accessible and enjoyable retail destination… The playful and organized presentation of shopping reinforces a positive, discovery-driven customer experience, emphasizing Target’s ability to host unique product lines.”
    Amazon: “Viewed as an essential, incredibly convenient, and reliable platform offering immense selection with unparalleled speed. Users celebrate its efficiency while simultaneously expressing a desire for greater focus on ethical concerns.”

    TikTok (brand official account)
    Target: “Widely seen as a trendy, accessible retailer offering stylish, curated products and collaborations that enable a joyful lifestyle… captures a highly visual and energetic side of the brand.”
    Amazon: “Projects a vibe of convenience, accessibility, and an expanding reach into curated wellness and lifestyle products… making niche, high-quality products feel easily attainable.”

    Table 1: Sample brand perception reports, for Target and Amazon, from the Brand Perception Atlas dataset

    Developed by WPP, BAV is one of the largest and longest-running brand equity research programs in the world, with data collected over more than three decades across thousands of brands and markets. Unlike social media and digital sources, which provide indirect signals by analyzing what people post or what algorithms curate, BAV is survey-based and taps directly into human sentiment. By asking consumers directly, it captures deep-seated beliefs through structured imagery attributes such as innovative, reliable, and trustworthy.

    Within the Brand Perception Atlas, BAV serves as the foundational benchmark. It represents the long-term equity and established identity of the brand, providing a stable baseline against which the more volatile, “everyday” signals of social media can be compared. This distinction is crucial: while social media shows what people are talking about today, BAV reveals what they fundamentally believe about the brand.

    To provide a granular view of brand personality, the BAV system evaluates each brand against a standardized set of 48 Imagery Attributes. These attributes are the “building blocks” of perception, ranging from functional traits like Reliable, High Quality, and Simple to emotional and aspirational markers such as Charming, Daring, and Sensuous. By measuring a brand against this consistent list, the system can pinpoint exactly where its personality “peaks”, whether it is anchored in the “sturdiness” of a traditional leader or the “energy” of a modern disruptor.

    In the Brand Perception Atlas, these 48 attributes act as a high-resolution lens. Rather than simply stating a brand is “liked,” the data reveals the specific nuances of why it is liked. For example, a brand may score highly on Intelligent and Visionary while scoring low on Friendly or Down to Earth. This multidimensional data allows the Atlas to plot brands with extreme precision, identifying “perceptual neighbors” who share the same underlying DNA across these 48 distinct dimensions of human sentiment.
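    The "perceptual neighbors" idea above can be sketched in code. In this toy Python snippet the attribute names and scores are invented for illustration (the real system scores brands against all 48 BAV attributes), but it shows how a neighbor falls out of a simple distance calculation over attribute vectors:

```python
import math

# Illustrative subset of BAV-style imagery attributes; scores are invented.
ATTRS = ["Reliable", "Intelligent", "Visionary", "Friendly", "Daring"]

brands = {
    "BrandA": [0.90, 0.80, 0.70, 0.30, 0.20],
    "BrandB": [0.85, 0.82, 0.75, 0.25, 0.15],  # similar "DNA" to BrandA
    "BrandC": [0.30, 0.40, 0.20, 0.90, 0.80],  # very different profile
}

def euclidean(a, b):
    """Straight-line distance between two attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor(name):
    """The closest other brand in attribute space: its 'perceptual neighbor'."""
    others = {k: v for k, v in brands.items() if k != name}
    return min(others, key=lambda k: euclidean(brands[name], others[k]))

print(nearest_neighbor("BrandA"))  # BrandB, despite any industry gap
```

    The same logic scales unchanged to 48 dimensions; only the vectors get longer.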

    Mapping Brand Perception

    At the center of the Brand Perception Atlas lies the Perception Map, a visual representation of how brands relate to one another in terms of public perception.

    Each perception signal is converted into a numerical representation of semantic meaning, known as an embedding. It is helpful to think of semantic meaning as the “vibe” or “essence” of a word, rather than just the word itself. In a traditional search, if you looked for “Luxury,” a computer might ignore a result for “High-end” because the letters don’t match. However, the Brand Perception Atlas uses semantic embeddings to understand that “Luxury,” “Premium,” and “Prestigious” all mean roughly the same thing. Because their meanings are similar, the system places brands described with these words in the same neighborhood on the map. It maps the ideas behind the words, not just the text.
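    To make the geometry concrete, here is a minimal sketch of cosine similarity over embeddings. The four-dimensional vectors are hand-invented stand-ins (real embedding models produce vectors with hundreds of dimensions), but the principle is the same: words with similar meanings point in similar directions, so "luxury" lands nearer "premium" than "budget".

```python
import math

# Toy "embeddings" for illustration only; values are invented so that
# semantically similar words point in similar directions.
EMBEDDINGS = {
    "luxury":  [0.90, 0.80, 0.10, 0.00],
    "premium": [0.85, 0.75, 0.15, 0.05],
    "budget":  [0.10, 0.20, 0.90, 0.80],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(EMBEDDINGS["luxury"], EMBEDDINGS["premium"]))  # high
print(cosine_similarity(EMBEDDINGS["luxury"], EMBEDDINGS["budget"]))   # low
```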

    This semantic translation approach is shared across the Satalia AI Research Lab. The Ad Performance Pod applies a complementary technique, using multimodal LLMs to convert raw ad creatives (images, video, and metadata) into structured text descriptions, and then predicting post-level engagement from those descriptions. Where the Atlas asks “what does this brand mean?”, the Ad Performance Pod asks “what makes this content perform?” — both relying on the principle that rich semantic embeddings can unlock patterns invisible to traditional feature engineering.

    These embeddings are then plotted within a shared space. In this space, the distance between points reflects the similarity of the associations surrounding different brands. Brands with similar perceptions appear close together, while those associated with different ideas occupy distinct regions.

    The resulting map resembles a landscape of brand meaning, allowing brand leaders to observe clusters of brands that share common associations and identify outliers that occupy unique perceptual positions.
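    A hedged sketch of how such a map might be built: project high-dimensional embeddings down to two plotting coordinates. The post does not specify the Atlas's actual dimensionality-reduction method; PCA (via SVD) is one common choice, shown here with random stand-in vectors.

```python
import numpy as np

# Stand-in data: 200 brand-signal embeddings, 768 dimensions each.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 768))

# Centre the data, then keep the top two principal components.
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ vt[:2].T  # shape (200, 2): x/y positions on the map

print(coords_2d.shape)  # (200, 2)
```

    In practice, nonlinear methods such as t-SNE or UMAP are also popular for this kind of "landscape" view, since they preserve local neighborhoods more faithfully than PCA.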

    Figure 2: Perception map for Nike across all platforms (US, 2025-26)

    When a brand is selected within the Perception Map interface, as shown in Figure 2, the system reveals a rich, multi-layered profile that visualizes how that brand lives in the public imagination. The interface acts as a command center, surfacing the distinct signals captured by each sensor to provide a 360-degree view of the brand’s identity.

    Once a brand like Nike is selected, the map highlights its specific coordinates across all platforms, while the sidebar populates with synthesized insights including the brand’s industry, country, and a “Cluster Label”, for example “Hope Innovation Compassion”, which summarizes the overarching sentiment. Users can navigate the Perception Signal Summaries to see the unique reports for each platform, from the foundational BAV sentiment to real-time narratives on TikTok and Instagram, allowing for an immediate comparison of how the brand “sounds” in social discourse versus how it is “valued” in long-term surveys.

    Insights from the Unified Perception Map

    Once the perception spaces were aligned, several patterns became visible:

    1. Omnichannel Consistency vs. Heterogeneous Perception: Unified brands have the most “clustered” perception signals, looking the same regardless of which sensor you use, while diverse brands are “dispersed”, presenting different facets of their identity depending on where you look.
    2. Shared Equity, Different Vibe (Close on BAV, Far on Socials): These brands share deep-seated consumer beliefs but look different on TikTok/Instagram.
    3. Different Equity, Shared Vibe (Far on BAV, Close on Socials): These brands come from different worlds but converge into a single “vibe” in the public imagination.
    4. Invisible Infrastructure Brands: These foundational entities occupy a shared emotional territory of essentiality and high-stakes reliability, despite having radically different levels of public visibility.

    These insights demonstrate the value of examining brand perception across multiple data sources rather than relying on any single indicator.

    Omnichannel Consistency (Unified Brands) vs Heterogeneous Perception (Diverse Brands)

    One of the most useful insights derived from the perception map is a metric we refer to as Omnichannel Consistency. This measure evaluates how closely aligned a brand’s perception is across different information sources. If the signals derived from social media, surveys, and public narratives cluster tightly together, the brand is communicating a coherent identity. Conversely, if these signals are widely dispersed, the brand’s presence is more diverse.
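    One plausible way to compute such a metric is the mean pairwise cosine similarity across a brand's per-source embeddings; note this formula is an illustrative assumption, not WPP's published definition of Omnichannel Consistency.

```python
import numpy as np

def omnichannel_consistency(signal_vectors):
    """Mean pairwise cosine similarity across a brand's source embeddings.

    Values near 1.0 -> signals cluster tightly (a unified brand);
    lower values -> signals are dispersed (a diverse brand).
    Illustrative assumption, not WPP's published definition."""
    v = np.asarray(signal_vectors, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)  # unit-normalise rows
    sims = v @ v.T                                    # all pairwise cosines
    n = len(v)
    return (sims.sum() - n) / (n * (n - 1))           # exclude self-similarity

# A tightly clustered brand vs a dispersed one (toy vectors).
unified = [[1.0, 0.10, 0.00], [0.95, 0.15, 0.05], [1.0, 0.05, 0.10]]
diverse = [[1.0, 0.00, 0.00], [0.00, 1.0, 0.00], [0.00, 0.00, 1.0]]
print(omnichannel_consistency(unified) > omnichannel_consistency(diverse))  # True
```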

    Analysis of the dataset revealed several brands with particularly strong consistency across channels, including John Deere, Caterpillar, and Oscar Health, each with over 99% omnichannel consistency.

    John Deere serves as the definitive benchmark for Omnichannel Consistency, maintaining one of the most stable identities in the Brand Perception Atlas. While many brands act like chameleons, shifting their personality to mirror the fleeting trends of each platform, John Deere operates as a perceptual anchor, where its core essence of “rugged reliability” remains unchanged whether viewed through a 30-year longitudinal BAV survey or a viral TikTok demonstration. Figure 3 demonstrates this pattern, showing all of the John Deere data points close together on the map.

    Figure 3: Perception map showing high omnichannel consistency for John Deere on all platforms (US, 2025-26)

    This “Nothing Runs Like a Deere” factor is so powerful that even complex public debates, such as the “Right to Repair” controversy, appear with remarkable alignment across all five sensors, reinforcing the brand’s status as an essential, high-stakes infrastructure provider for its community. For leadership, this proves that a sufficiently clear perceptual identity pulls every digital channel toward a single, unified truth.

    As shown in Figure 4, the brand’s actual digital content reinforces this stability by bridging the gap between historical heritage and modern utility. The Brand Perception Report for this specific Instagram post highlights a reputation for “practical innovation,” where powerful, reliable machinery transforms arduous tasks like snow removal into efficient, even enjoyable endeavors. The data reveals a narrative centered on problem-solving and the ultimate reward of reclaimed time, positioning the brand as a premium, aspirational solution. While the content emphasizes the ease of machine efficiency over manual labor, it also subtly reflects the “significant investment” required for such equipment, a sentiment that aligns perfectly with the “rugged yet refined” and “worth more” attributes captured by the BAV survey sensors.

    Figure 4: Screenshot of a John Deere Instagram video showing snow removal equipment

    In contrast, several brands demonstrated much greater variation in perception across sources. These included Marriott, Southern Living, and American Airlines, whose signals appeared more widely distributed across the perception map.

    Marriott provides a contrasting example of a multi-faceted brand that successfully occupies distinct perceptual regions depending on the sensor used. The BAV sensor positions the brand within a territory of reliability, superior quality, and prestigious appeal, where consumers fundamentally view Marriott as an intelligent, high-performing, and traditional leader. However, the social sensors (Instagram and TikTok) pull the brand into a separate region defined by aspirational luxury and curated escapism. In these digital spaces, Marriott’s intended portrayal is human-centric and inclusive, with official reports describing an organization that “champions individuals” and demonstrates “deep cultural understanding.” This is brought to life through actual content featuring stories of people, such as the one shown in Figure 5, an Instagram post highlighting Chef Kashif Khan using sign language to communicate at the St. Regis Mumbai. This mapping is diagnostic for leadership, revealing that Marriott is not a monolith; it is a reliable corporate giant that strategically uses social media to humanize its identity through stories of diversity, sustainability, and global connection.

    Figure 5: Screenshot of the Instagram video of Chef Kashif Khan using sign language

    Interestingly, multiple travel-related brands appeared in this lower-consistency group. The high volume of online discussion surrounding travel experiences, ranging from positive stories to customer complaints, may contribute to a more fragmented perception environment for brands in this sector.

    Perhaps the most critical lesson our dashboard taught us is that online buzz doesn’t automatically change what people fundamentally believe about a brand. Social media is a powerful megaphone, but it’s not always a mind-changer. It is easy to be distracted by conversation volume and the sheer amount of noise, visibility, and viral activity a brand generates on platforms like TikTok or Instagram. However, the Atlas looks past this “digital footprint” to uncover a brand’s equity: the deep-seated associations that define how a brand is understood by the public. When we look at these deeper signals, brands that seem worlds apart on social media often turn out to be close neighbors in the mind of the consumer.

    The perfect case study for this is Palo Alto Networks, a titan in the cybersecurity industry.

    If you were to judge them solely by their Instagram presence, you’d see a flawless picture of a market champion. The conversation is filled with words like “Reliable, Innovative, Leader.” It’s a polished, professional, and highly positive image, the brand broadcasting its strengths and amplifying its message of leadership.

    But our BAV data, the “ground truth” of long-term brand equity, built from years of customer experiences and market reputation, tells a more complicated story. The BAV data agrees on their high status, describing them as “Intelligent” and “Prestigious.” However, it adds a critical, and very human, third dimension: “Arrogant.” This single word is the key. The perception of “arrogance” isn’t something a few Instagram posts can erase. It’s a feeling built from countless interactions: sales calls, pricing negotiations, customer support tickets, and industry chatter. It’s a core part of their brand’s DNA, for better or worse.

    What the map shows is that the high volume of positive conversation on Instagram is successfully amplifying the “Leader” and “Innovative” message. However, it hasn’t fundamentally altered the deep-seated perception of arrogance. The social media buzz is happening on the surface, while the core brand equity, the good and the bad, remains firmly in place. It’s a powerful reminder that you can’t just tweet your way to a new reputation.

    In other words, social media can amplify a reputation, but it rarely rewrites it.

    For leadership, this mapping is diagnostic, offering a clear view of whether this dual identity represents a successful multi-segment strategy or an unintended drift from the brand’s core promise.

    It is important to note that omnichannel consistency is not inherently positive or negative. Some brands benefit from a tightly aligned identity across all platforms, while others thrive precisely because they present different facets of themselves in different contexts. For example, entertainment, fashion, and travel brands often generate highly diverse narratives online, reflecting the variety of experiences consumers have with them. In these cases, perceptual fragmentation may actually be a sign of adaptability and cultural relevance.

    The value of the metric is therefore diagnostic rather than prescriptive. It reveals where a brand sits on the spectrum between a unified or multi-faceted perception ecosystem, allowing leaders to determine whether that specific pattern aligns with their strategic intent. For some, a tightly clustered, unified identity is the goal; for others, a more distributed, multi-faceted presence across different platforms is a deliberate strategy to reach diverse audiences.

    Shared Equity, Different Vibe (Close on BAV, Far on Socials)

    The Atlas also reveals unexpected relationships between brands that operate in entirely different industries. The most striking examples of the “Perception Paradox” occur when two brands share an identical equity backbone but present entirely different faces to the digital world.

    Consider 3M (originally the Minnesota Mining and Manufacturing Company), a global diversified technology conglomerate that applies material science to develop over 60,000 products for industrial, worker safety, healthcare, and consumer markets. At its core, 3M functions as a science-driven “innovation engine.” The company is famous for its “15% Rule,” which allows employees to spend a portion of their time on passion projects, a culture that famously led to the invention of the Post-it Note.

    Then consider Marriott, a premier global hospitality company that operates, franchises, and licenses a vast and diverse portfolio of hotels and lodging facilities.

    On social media, these brands live in completely different worlds. 3M’s presence is defined by science and industrial innovation, while Marriott trends through a lens of “aspirational luxury” and global travel. If you judge only by social footprint, you would assume these brands occupy different cultural spaces; indeed, on social sensors, they move to opposite ends of the map, as 3M focuses on the mechanics of research and Marriott on the spirit of leisure.

    Figure 6 illustrates this social-layer distinction for 3M. This video is just one example of the content driving an overall Instagram vibe characterized as Innovative, Responsible, and Sustainable. The brand perception report for this post captures an aspirational narrative of “innovation, education, and community.” By highlighting initiatives like the “3M Young Scientist Challenge,” the brand is perceived as a champion of scientific exploration, fostering a “fun” and “inspiring” platform for young people to solve real-world problems. While this specific content positions 3M as a driver of STEM education, the Atlas reminds us that beneath this vibrant digital layer lies the same sturdy, reliable identity that makes it a perceptual neighbor to a hospitality giant like Marriott.

    However, when we look at the BAV sensor, they appear as surprisingly close neighbors. The reason becomes clear when you look at the deeper associations: both brands are anchored by Reliability and Leadership. In the foundational mind of the consumer, 3M is an “innovation backbone,” while Marriott is a “service backbone.” While their social streams look nothing alike, their underlying brand equity is rooted in being an essential, dependable institution.

    Figure 6: Screenshot of 3M Instagram video highlighting the Young Scientist Challenge

    Now, consider Costco, a warehouse retailer known for bulk purchasing, and JetBlue, a major passenger airline. At first glance, these brands share little in common; however, when we look at the BAV sensor, they emerge as close neighbors within a shared “Consumer Champion” territory.

    On BAV imagery attributes, both brands are anchored by a core identity of being Friendly and Reliable. Costco is fundamentally perceived through a lens of Simplicity, while JetBlue is defined by its Value. In the foundational mind of the consumer, both brands occupy a similar emotional space: they are the “trusted alternatives” that provide high-quality essential services without the exploitative friction often associated with their respective industries.

    As shown in Figure 7, the interface allows us to see how these signals diverge as we move into the “Experience Layer” of social media. On Instagram, Costco remains tightly aligned with its BAV profile, with its digital presence reinforcing a narrative of affordability and value for money. JetBlue, however, shifts into a more aspirational territory on social platforms, emphasizing convenience and the lifestyle of travel.

    Figure 7: Screenshot of the Brand Perception Atlas showing Costco and JetBlue, with Costco selected as the focus brand

    This distinction is diagnostic for brand leaders. It shows that while two brands may share an identical “equity backbone” as consumer champions, their social expressions can vary significantly based on platform context: Costco doubles down on its “no-frills” value while JetBlue leans into the aspirational allure of the journey.

    The result is that two brands from entirely different industries converge within the same emotional territory: brands that make consumers feel smart, respected, and well served.

    Different Equity, Shared Vibe (Far on BAV, Close on Socials)

    The Brand Perception Atlas also reveals an inverse pattern: brands that come from entirely different industries and share almost no long-term equity, yet converge into a single, unified “vibe” on social media. In these cases, the social layer acts as a cultural blender, pulling disparate corporate giants into the same neighborhood.

    Fox is a major American media conglomerate primarily known for its television broadcasting, news, and sports programming. It operates the Fox Broadcasting Company, Fox News, and Fox Sports. Its identity is built on high-stakes live events, character-driven drama, and real-time news reporting—a business that naturally scores high on “Dynamic” and “Charming” attributes but often generates polarizing sentiment.

    Kimberly-Clark is a global giant in personal care and hygiene products. It manufactures essential household brands such as Huggies diapers, Kleenex tissues, Cottonelle toilet paper, and Kotex feminine hygiene products. Its business is rooted in high-volume manufacturing and daily utility, leading to a long-term reputation for “Reliability” and “Practicality” that is very different from the world of mass media.

    When we look at the BAV sensor, Fox and Kimberly-Clark are worlds apart. Fox is perceived through a lens of high-stakes entertainment and sophistication, scoring highly on attributes like Charming, Sensuous, and Dynamic. Kimberly-Clark, meanwhile, is anchored in the “Practical Utility” of household essentials, defined by Reliability, Quality, and Trustworthiness.

    Figure 8: Screenshot of the Brand Perception Atlas showing Fox and Kimberly-Clark, with Fox selected as the focus brand

    To the average consumer, a polarizing media giant and a dependable tissue manufacturer should share no common ground. However, as shown in Figure 8, the Social Sensors reveal a convergence. On Instagram and TikTok, the distinction between a news broadcast and a household utility fades as both brands pivot toward lifestyle-centric storytelling.

    According to the social reports, Fox is perceived through a lens of “intense, character-driven drama” and high-stakes emotional depth, appealing to those who value mature storytelling. Simultaneously, Kimberly-Clark’s social presence is described as “authentic and engaging,” characterized by a “vibrant” and “charismatic” brand presence that uses humor and social commentary to create a “deeper connection beyond entertainment.” In this digital layer, Fox’s dramatic clips and Kimberly-Clark’s community-focused narratives occupy the same neighborhood. They are no longer judged as “The News” vs. “The Diaper Company”; instead, they are unified as Established Lifestyle Entities that use high-quality content to build emotional resonance with their audiences.

    Invisible Infrastructure Brands

    The Atlas also reveals a unique category of brands that operate as “Invisible Infrastructure.” These are entities that sit at the center of the perception map not because they are famous, but because they are foundational. One striking example is the proximity of Pfizer and Kyndryl.

    Kyndryl is one of the world’s largest IT infrastructure services companies, spun off from IBM in 2021. It manages the mission-critical technology backbone for hospitals, banks, airlines, and governments around the world. Most consumers have never heard its name. Its revenue is enormous; its consumer presence is almost zero. And yet, in our perception map, it sits directly next to Pfizer.

    Figure 9: Screenshot of the Brand Perception Atlas showing Kyndryl and Pfizer, with Kyndryl selected as the focus brand

    On the BAV survey, both brands score highly on imagery such as intelligent, reliable, progressive, and original, placing them far from lifestyle or aspirational brands. The depth of the BAV dataset further reveals that this “Expertise” profile is universally recognized across demographics. For example, both male and female audiences consistently identify these brands as Visionary and High-Performance leaders, regardless of their personal familiarity with the company’s day-to-day operations. Despite radically different visibility levels, they share the same underlying promise: critical systems do not fail because of us. Kyndryl has no social media presence, and post-Covid, Pfizer is perceived as controversial in the Wikipedia and Gemini sources.

    Kyndryl, a brand name most consumers don’t recognize, sits directly next to one of the world’s most recognized pharmaceutical giants in perception space. Not because they sell similar products, but because they occupy identical emotional territory: the invisible infrastructure of modern life that you only notice when it fails. The data from our sensors provides strong evidence for this “Invisible Infrastructure” connection between Pfizer and Kyndryl. Despite their radically different levels of public fame, they share a distinct perceptual fingerprint. When we look at the BAV data, the reason for this proximity becomes clear. Both brands are anchored by attributes associated with competence and essentiality rather than lifestyle appeal.

    The Gemini sensor perceives Kyndryl as a “Powerful and Reliable” entity that provides the backbone for critical systems. The report emphasizes its “indispensable” nature to global commerce. Pfizer, while highly recognized, is described in terms of “polarization”, with the brand viewed as a “scientific savior” clashing with a “profit-driven corporate giant.” The Wikipedia and Gemini reports specifically mention it is a “Big Pharma” entity whose “trust is a highly contested issue.”

    Even though one is a household name and the other is a ghost to the average consumer, the Atlas recognizes them as the same thing: the high-stakes machinery that keeps the modern world running. One is noticed because of its global news cycle; the other is only noticed when the “mission-critical backbone” fails.

    Conclusion

    Understanding brand perception has always been one of the central challenges of marketing and brand strategy. In the digital era, the difficulty lies not in the absence of information but in its overwhelming abundance. Organizations now face a paradox: more data than ever before, yet less clarity about what it means.

    The Brand Perception Atlas represents an effort to bring structure to this complexity. By integrating diverse perception signals, ranging from social media discourse to public narratives and LLM summaries, and anchoring them against the long-term ground truth of BAV brand equity data, the atlas makes it possible to observe how brand meaning evolves across different layers of public perception.

    This unified view reveals something important: the signals surrounding a brand are not random noise. They form patterns. Brands cluster around shared emotional territories, compete in perceptual spaces that often cut across industry boundaries, and reveal whether social conversation reinforces or diverges from long-standing brand equity.

    Rather than replacing traditional research methods, this approach extends them. It allows brand leaders to see not only what people say about brands today, but also how those conversations relate to the deeper beliefs that define brand meaning over time.

    In a world where social media amplifies attention but does not always reshape perception, the real strategic advantage lies in understanding the relationship between visibility and meaning. The Brand Perception Atlas offers a way to navigate that relationship. Not as a replacement for human judgment, but as a new map, one that helps brand leaders see where their brand truly sits in the landscape of consumer perception, and how that landscape is changing.

    Because in the end, the challenge is not simply measuring brand performance, it is understanding the territory of meaning that a brand occupies in people’s minds and learning how to navigate it.

    Learn more about WPP Brand Asset Valuator® (BAV) and how you can access this data for your own analysis.