Author: Andreas Stergioulas

  • A Self-Improving AI Agent for Optimizing and Explaining Media Performance

    TL;DR Most tools that claim to predict post performance tell you what’s in a post — they can’t tell you why it would resonate. The Prediction Optimization Agent closes that gap by converting the image, caption, and influencer metadata into a single natural-language description that captures tone, credibility, and visual cohesion the way a strategist would. It then forecasts engagement, measures its errors, and autonomously rewrites its own instructions to produce better descriptions next time — with no human prompt engineering required. The result is a system that doesn’t just predict what will perform, but allows creative teams to draw insights on what makes a post tick — and what doesn’t.

    Introduction

    Influencer marketing is a multi-billion-dollar industry, yet many spending decisions still rely heavily on intuition. Brands pick creators by follower count, approve content by gut feel, and learn what worked only after the budget is gone. The cost of a wrong bet is not just wasted spend. It is missed revenue, lost momentum, and a creative team back at square one.

    Everyone in the industry wants the same thing: a way to know, before publishing, whether a post will perform. But the tools built to answer that question often disappoint. They extract surface features like hashtag counts, color palettes, and face detection, then output a score no one can interpret or trust. They tell you what is in a post but nothing about why it would resonate.

    The gap is not technical. It is perceptual. What makes a post work is something a skilled marketer can often articulate after the fact: the influencer’s tone felt effortless, the product placement didn’t interrupt the narrative, the caption hit a cultural nerve. These judgments require understanding context, intent, and audience, not just pixels and metadata.

    This is the gap the Prediction Optimization Agent was designed to close. Rather than extracting features and producing an opaque score, the agent reads a post the way a strategist would. It examines the image, the caption, and the influencer’s history, then writes a structured natural-language description of everything that contributed to the post’s performance. A creative director can then read this description, challenge it, and act on it without ever needing to trust a black box.

    The intuition is simple: the description that best predicts performance is, by definition, the description that best explains it. The agent iteratively refines this description by diagnosing its own errors, identifying what the previous descriptions failed to capture, and rewriting its own instructions to self-improve. Over successive iterations it converges on the specific qualities that actually drive engagement, not because anyone told it what to look for, but because it is guided by the minimization of its own predictive error.

    The Anatomy of a Viral Post

    Predicting the performance of an ad or a social media post before publishing — and certainly before spending money to boost it — remains a primary objective for marketers and influencers alike. But how do you distill something as complex and subjective as a social media post into a single prediction?

    Consider a typical Instagram post. It is never just a picture. It’s a complex combination of different data types working together simultaneously. Take the influencer post shown in Figure 1. To truly understand why this post succeeds or fails, you need to consider:

    • The image itself — composition, lighting, color palette, subjects, products, and setting.
    • The caption — where the influencer might share a discount code, crack a joke, or strike an emotional chord.
    • The influencer’s identity — their bio, follower count, niche credibility, and historical performance.
    • The metadata — the time of day, geographic location, hashtags, and platform-specific context.

    Each of these dimensions carries signal. None of them tells the full story alone. The magic — and the difficulty — lies in how they interact.

    Figure 1: A typical influencer post. Traditional analytics struggle to measure the combined impact of the visual aesthetic, the caption’s tone, and the underlying metadata. To accurately predict engagement for a post like this, our system analyzes the image, caption and influencer statistics together as a single cohesive unit.

    Why Traditional Approaches Fall Short

    Historically, the industry has tried to solve this problem by breaking a post into its component parts and analyzing each one in isolation. Engineering teams would string together separate, disjointed machine-learning models — each responsible for a single modality — and attempt to fuse their outputs at the end:

    • Standard Computer Vision Models: Isolated image-recognition algorithms scan the visual to detect objects, people, or products. Separate models handle face detection and emotion recognition. The output is a list of labels: “person detected,” “beverage detected,” “outdoor setting.”
    • Text Analyzers & OCR: NLP tools parse the caption — counting hashtags, flagging emojis, scoring sentiment. Meanwhile, optical character recognition (OCR) software reads any text visible within the image itself.
    • Tabular Metadata Algorithms: A statistical algorithm ingests structured fields — follower count, posting time, engagement history — and produces its own independent prediction.

    Engineers then attempt to fuse these isolated outputs into a single forecast. But this approach has a fundamental, structural flaw: it cannot capture meaning that emerges from the interaction between modalities.

    Consider a concrete example. Imagine a fitness influencer posts a photo of herself laughing mid-sip from an energy drink, with the caption: “My face when someone says they don’t need pre-workout 😂.”

    A traditional computer vision model would tag this as: “person detected,” “beverage detected,” “outdoor setting,” “positive facial expression.” A separate text analyzer would count the hashtags and flag the emoji. But neither system — nor any crude fusion of their outputs — would understand the joke: that the humor comes from the caption reframing the laugh as a reaction, turning a standard product shot into a relatable meme.

    It would also completely miss that this influencer is a certified nutritionist — meaning her credentials paired with an energy drink carry implicit credibility that a fashion influencer holding the same product would not.

    Our Approach: Unifying Multimodal Data Through Semantic Translation

    To address this limitation, we developed the Prediction Optimization Agent, a self-improving AI agent that moves beyond traditional approaches by unifying all available data into a single format it can reason about: natural language.

    The agent’s core mechanism is deceptively simple but profoundly effective. It takes complex, disparate multimodal data — numerical metrics, images, video, and text captions — and converts everything into a single natural-language paragraph that holistically describes the post’s content, aesthetic, tone, and context. By projecting all of these distinct formats into readable text, heterogeneous data is normalized into a structure that a language model can process as a unified whole.

    Instead of treating image and text as separate inputs, the agent uses a single prompt to digest all available information at once. Multimodal LLMs serve as one of the agent’s tools, acting as universal feature extractors that capture the abstract, human-centric concepts that traditional pipelines structurally cannot.
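    To make this normalization concrete, here is a minimal sketch of how the textual inputs might be flattened into one block before being handed to a multimodal LLM. The helper and every field name are hypothetical illustrations, not the production prompt; the image itself would be attached separately as a binary input.

```python
def build_translation_input(caption: str, metadata: dict) -> str:
    """Flatten the caption and structured metadata into one text block.

    This sketch covers only the textual normalization; every field
    name here is a hypothetical illustration.
    """
    lines = [
        f"Caption: {caption}",
        f"Influencer bio: {metadata.get('bio', 'n/a')}",
        f"Followers: {metadata.get('followers', 'n/a')}",
        f"Category: {metadata.get('category', 'n/a')}",
        f"Posted at: {metadata.get('posted_at', 'n/a')}",
    ]
    return "\n".join(lines)

example = build_translation_input(
    "Stay sharp. Stay unstoppable.",
    {"bio": "Certified nutritionist", "followers": 120_000,
     "category": "fitness", "posted_at": "2024-06-01T17:30"},
)
```

    The point is simply that once everything is text, a single prompt can reason over all modalities at once instead of fusing separate model outputs after the fact.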

    But the agent does not simply produce any description and hope it is useful. It is driven by a feedback loop grounded in predictive error: the descriptions it generates are used to forecast engagement, those forecasts are compared against real outcomes, and the resulting errors tell the agent exactly how much predictive value its current descriptions are capturing — and how much they are missing. Through successive rounds of this loop, the agent autonomously rewrites the instructions that govern how descriptions are composed, converging on the paragraph structure that maximizes predictive accuracy.

    This error-driven process has a profound consequence for explainability. The description the agent converges on is not a generic summary — it is the description that the agent has discovered, through empirical optimization, to be the most predictive of real engagement outcomes. In other words, the features highlighted in the final description are there because they matter — because including them reduced prediction error. When the optimized description of a high-performing post calls out “candid humor,” “golden-hour lighting,” and “influencer credibility,” those aren’t arbitrary observations. They are the factors the agent learned to pay attention to because they measurably improved its ability to predict what performs well.

    How the Prediction Optimization Agent Works

    The Prediction Optimization Agent orchestrates three internal stages in a continuous feedback loop: it observes a post, describes it, predicts its performance, measures how far off it was, and then rewrites its own instructions to produce better descriptions next time — closing the loop and getting measurably better with every iteration, without any human intervention.

    Figure 2: The Prediction Optimization Agent architecture. Raw media, metadata, and an initial prompt are fed into Stage 1 (Semantic Translation), which produces a natural-language description of the post. Stage 2 (Engagement Predictor) reads that description and predicts engagement. Prediction errors are then passed to Stage 3 (Self-Optimizer), which autonomously analyzes what went wrong and rewrites the Stage 1 prompt — closing the feedback loop and improving the system’s accuracy with every iteration.

    Stage 1: Semantic Translation

    The agent begins by ingesting the raw post — the image or video file, the caption text, and all available metadata (follower count, posting time, influencer bio, etc.). Using a multimodal LLM as its translation tool, and guided by a detailed set of internal instructions (its prompt), it produces a single, rich natural-language paragraph that captures not just what is in the post, but what the post means: the visual mood, the emotional tone, the relationship between caption and image, and the brand alignment.

    The quality and focus of this description is entirely governed by the prompt — and as we will see, it is the prompt that the agent learns to optimize.

    Stage 2: The Engagement Predictor

    The agent passes the semantic paragraph to its prediction tool — a model that evaluates the post’s potential performance based entirely on the natural-language description from Stage 1.

    The predictor can be any machine-learning model capable of interpreting text paragraphs. It can be based on trees, deep learning, or any other compatible architecture. It can even be a fine-tuned LLM, specialized for predictions in a specific domain. The agent is compatible with all of these options.

    In all cases, the quality of the prediction is only as good as the quality of the semantic description. A richer, more targeted paragraph gives the predictor more signal to work with — which is precisely why improving the description is the single most important action the agent can take.

    Stage 3: The Self-Optimizer

    This is the core of the Prediction Optimization Agent’s intelligence — and what makes it a true agent rather than a static model.

    After predictions are generated for a batch of posts and compared against actual engagement outcomes, the agent enters a self-evaluation cycle. The agent is presented with:

    1. The prompt it used to generate descriptions in the current round.
    2. The per-post prediction errors — for each post, the predicted engagement versus the actual engagement, and the magnitude and direction of the miss.
    3. The aggregate error metrics — overall accuracy statistics such as mean absolute error, broken down by content category, audience size, or post type, revealing systematic blind spots.
    4. The complete history of prior prompts and their corresponding error metrics — so the agent can see not just how it performed this round, but how each previous prompt version performed, which changes helped, and which didn’t.

    Armed with this structured feedback, the agent doesn’t guess at improvements — it reasons about them. It asks: Which types of posts am I consistently getting wrong? What do those posts have in common? What dimensions of the content is my current prompt failing to capture?

    Based on this analysis, the agent autonomously rewrites the prompt instructions used in Stage 1. The rewritten prompt is not a minor tweak — it can add entirely new analytical dimensions, remove instructions that proved unhelpful, or restructure the description format to foreground the features most correlated with engagement.

    The critical insight is that the predictive error is the signal that drives the entire system. Every improvement the agent makes to its prompt is motivated by a concrete, quantifiable gap between what it predicted and what actually happened. The agent isn’t optimizing descriptions to be “better” in some abstract sense — it is optimizing them to be more predictive, and the error metrics tell it exactly where it is falling short.
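    To make the shape of that feedback concrete, the payload handed to the self-optimizer can be sketched as follows. The structure and field names are illustrative assumptions, not the system's actual schema; the comments map each piece to the four inputs listed above.

```python
import statistics

def build_feedback(current_prompt, results, prompt_history):
    """Assemble the structured feedback handed to the self-optimizer.

    `results` is a list of dicts with 'category', 'predicted', and
    'actual' engagement values; all names here are illustrative.
    """
    errors = [abs(r["predicted"] - r["actual"]) for r in results]
    by_category = {}
    for r in results:
        by_category.setdefault(r["category"], []).append(
            abs(r["predicted"] - r["actual"]))
    return {
        # 1. The prompt used in the current round.
        "prompt": current_prompt,
        # 2. Per-post errors, keeping magnitude and direction.
        "per_post_errors": [
            {**r, "error": r["predicted"] - r["actual"]} for r in results
        ],
        # 3. Aggregate metrics, broken down by content category.
        "mae": statistics.mean(errors),
        "mae_by_category": {
            c: statistics.mean(v) for c, v in by_category.items()
        },
        # 4. The full history of prior prompts and their metrics.
        "prompt_history": prompt_history,
    }
```

    A per-category MAE breakdown like this is what lets the optimizer spot systematic blind spots, such as consistently underestimating brand-partnership posts, rather than just seeing one global error number.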

    The Agent in Action: Watching the Prompt Evolve

    To make this concrete, let’s follow the agent through three optimization rounds on a single post — the Hype Energy ad featuring two young models in blue athletic wear, photographed from a dramatic low angle against a vivid blue sky lined with palm trees.

    Figure 3: Example Instagram post used for semantic translation.

    Round 1: The Baseline

    The agent starts with a deliberately simple, generic prompt:

    Prompt (Round 1): “Analyze this Instagram post and the provided metadata. Describe the visual elements of the image, the style, and the mood. Then, explain how the image relates to the brand or influencer described in the metadata.”

    The agent’s description:

    “This Instagram post is a high-energy lifestyle advertisement for Hype Energy, utilizing a vibrant, ‘streetwear-meets-summer’ aesthetic to target a young, active demographic. The image uses a low-angle shot, which makes the subjects appear dominant and ‘larger than life.’ Two young models are featured in athletic/leisure wear. The female subject in the foreground holds the product directly toward the lens. The male subject in the background wears reflective ‘matrix-style’ sunglasses and a blue Jordan hoodie. The post aligns perfectly with the brand’s identity as described in the metadata. The visual style is polished and fashionable, positioning the drink as a lifestyle accessory rather than just a functional caffeine source. The caption emphasizes ‘Zero Sugar’ and staying ‘sharp’ and ‘unstoppable.’”

    This is already more nuanced than what a traditional CV model would produce — but the predictor, working from this description, significantly underestimates the post’s engagement. The error gets logged. Across the full batch, the agent notices a pattern: it is systematically underperforming on highly produced brand-partnership posts.

    Round 2: The Agent Adapts

    The self-optimizer reviews the Round 1 errors and identifies the gap: the prompt never asked the agent to evaluate visual cohesion between talent styling and brand identity, or the psychological impact of compositional techniques like camera angle. It rewrites the prompt:

    Prompt (Round 2): “Analyze this Instagram post and the provided metadata. Describe the visual elements, style, mood, and composition. Assess the visual cohesion between the talent’s styling (wardrobe, accessories, posture) and the brand’s visual identity. Evaluate how compositional choices — such as camera angle, framing, and subject placement — contribute to the aspirational or emotional tone of the post. Explain how the image, caption, and influencer profile work together to create a unified message.”

    The agent’s description now reads:

    “This Instagram post is a high-energy lifestyle advertisement for Hype Energy that demonstrates strong visual-brand cohesion. The coordinated blue wardrobe across both models — the female subject’s athletic top and the male subject’s Jordan hoodie — creates a seamless extension of the brand’s blue-and-white visual identity, embedding the product within the scene rather than presenting it as a foreign element. The dramatic low-angle shot elevates the subjects into an aspirational, dominant position, making the viewer look up at them — a compositional choice that reinforces the caption’s message of being ‘unstoppable.’ The female subject holds the product at the camera’s focal point with deliberate confidence, while the male subject’s reflective sunglasses and relaxed posture add an effortless, streetwear credibility. Together, the image, caption, and influencer positioning frame the energy drink not as a beverage but as a lifestyle symbol for a young, fashion-conscious audience.”

    The prediction for this post improves noticeably. Across the batch, the error on brand-partnership posts drops. But the agent notices it is still missing on posts where influencer credibility plays a key role — posts where who the influencer is matters as much as what’s in the frame.

    Round 3: The Agent Sharpens Further

    The self-optimizer reviews Rounds 1 and 2 side by side — seeing that the Round 2 prompt improved brand-cohesion analysis but still lacks attention to influencer authority and audience trust signals. It rewrites again:

    Prompt (Round 3): “Analyze this Instagram post and the provided metadata. Describe the visual elements, style, mood, and composition. Assess the visual cohesion between the talent’s styling and the brand’s visual identity. Evaluate how compositional choices contribute to the aspirational or emotional tone. Analyze the influencer’s credibility and authority relative to the product category — consider their professional background, niche expertise, audience demographics, and how these factors amplify or undermine the authenticity of the endorsement. Assess the emotional resonance of the caption: does it use humor, aspiration, relatability, or urgency, and how does this tone interact with the visual content? Explain how all elements — image, caption, influencer identity, and audience context — work together to create a unified, persuasive message.”

    The agent’s description:

    “This Instagram post is a high-energy lifestyle advertisement for Hype Energy that achieves strong visual-brand cohesion and leverages influencer-audience alignment to maximize credibility. The coordinated blue wardrobe across both models creates a seamless extension of the brand’s visual identity. The dramatic low-angle shot places the subjects in an aspirational, dominant position, reinforcing the caption’s message of being ‘unstoppable.’ The female subject’s direct-to-camera product hold at the image’s focal point conveys deliberate confidence, while the male subject’s reflective sunglasses and relaxed posture add streetwear authenticity. Critically, the influencer’s established presence in the fitness and lifestyle niche — with a predominantly young, health-conscious following — makes the ‘Zero Sugar’ energy drink endorsement feel credible rather than transactional. The caption’s tone (‘Stay sharp. Stay unstoppable.’) uses aspirational brevity that mirrors the visual intensity, creating a tight loop between what the viewer sees and what they read. The overall effect positions the product as an identity marker for an active, fashion-forward audience, rather than a commodity — a framing that this influencer’s audience is primed to respond to.”

    This description now captures visual strategy, brand cohesion, compositional psychology, influencer credibility, audience alignment, and caption-image interaction — all in a single readable paragraph. The prediction error drops significantly. And a marketing team reading this description can immediately see why the post works: the credibility is authentic, the visual identity is seamless, and the caption reinforces the image rather than fighting it.

    The agent arrived at this level of descriptive richness not because someone hand-engineered the prompt, but because the predictive errors from Rounds 1 and 2 told it exactly what it was missing.

    Real-World Application: Lessons and Implications

    To validate the Prediction Optimization Agent in a real-world scenario, it was tested using an Instagram Influencer Dataset containing approximately 10.18 million posts spanning a diverse range of content categories and audience sizes. The results revealed key insights about both the agent’s learning dynamics and the practical implications for marketing teams.

    The Agent Learns What Matters — Autonomously

    By processing its own historical error rates, the Prediction Optimization Agent autonomously learned to rewrite its prompts — producing richer, more targeted post descriptions with every iteration, which in turn drove increasingly accurate predictions.

    Figure 4 — Autonomous Learning: The chart tracks the agent’s predictive performance (y-axis) across successive optimization rounds (x-axis). Each point represents a full cycle of the agent’s loop: describe → predict → evaluate → rewrite. The trend demonstrates that as the agent iteratively refined its own prompt — guided by quantitative error metrics from prior rounds — forecast accuracy improved consistently and autonomously, without any human prompt engineering.

    The agent’s optimization works by feeding it the complete history of prior prompts alongside rigorous, quantitative error breakdowns from every previous round. Armed with this granular self-knowledge, the agent identifies precisely which content dimensions it has been under-analyzing — production quality, humor style, credibility signals, visual-brand cohesion — and surgically updates its prompt to capture those features in subsequent iterations. Each round’s prompt is built on the empirical lessons of every round before it, producing a compounding improvement curve where the descriptions become progressively more predictive and, as a direct consequence, more explanatory.

    Conclusions

    The Prediction Optimization Agent demonstrates something that extends well beyond social media: natural-language prompts can be treated as tunable parameters, optimized autonomously by the AI itself. By allowing the agent to refine its own instructions through predictive error, the system progressively discovers what drives human engagement and expresses that knowledge in plain language.

    For marketing teams, this is a significant departure from opaque AI tools that output a score and leave everyone guessing. When a team wants to understand why one campaign outperforms another, they don’t need to interpret a dashboard of abstract model coefficients. They can compare the text profiles of a high-performing post and a low-performing one, side by side, and immediately see the differences the AI picked up on: one might highlight “authentic, candid composition with humor-driven caption and strong influencer-niche alignment,” while the other notes “generic studio shot with formulaic promotional language and weak audience-brand fit.” The patterns reveal themselves in plain English, and they are the right patterns, because the agent discovered them by optimizing for predictive accuracy.

    In practice, this means teams can run draft campaign concepts through the system before committing production and media budgets, getting a readable assessment of how the AI interprets the creative. Designers and copywriters can test variations of a post and compare descriptions side by side to see, in their own language, which direction resonates more strongly. And by normalizing visual and written media into a unified, readable format, brands can pair creative intuition with precise forecasting, treating creative assets as predictable drivers of revenue rather than expensive gambles.

    The same architectural pattern of semantic translation, error-driven prediction, and autonomous self-optimization is not limited to social media. Any domain where success depends on understanding the interplay of qualitative and quantitative signals, from political messaging to product design to entertainment, stands to benefit from systems that can read, reason, reflect, and improve on their own. The question is no longer whether AI can predict what resonates with people. It is whether we are ready to let it teach itself how.

  • Self-Improving Performance Agent Pod

    TL;DR Most tools that claim to predict post performance tell you what’s in a post — they can’t tell you why it would resonate. The Prediction Optimization Agent closes that gap by using a reasoning LLM to convert diverse multimodal inputs — images, captions, metadata — into a single natural-language description that captures tone, credibility, and visual cohesion the way a strategist would. Instead of hand-engineering rigid features like brightness or face detection, the agent extracts higher-level concepts like humor, urgency, and brand fit that traditional pipelines structurally cannot. That description becomes the feature set for a downstream predictive model, continuously refined by an autonomous optimizer that treats the prompt as a tunable parameter — iteratively rewriting it based on prediction error until it converges on the description that best explains real engagement outcomes. The result is a system that doesn’t just predict what will perform, but allows creative teams to draw insights on what makes a post tick — and what doesn’t.

    The Performance Optimization Agent: Technical Pipeline and Evaluation

    Introduction

    Predicting the engagement of social media content before publication is a high-value problem across marketing, advertising, and platform analytics. The challenge is inherently multimodal: a single Instagram post combines visual content (composition, lighting, subjects), caption text (tone, humor, calls-to-action), and structured metadata (follower count, posting time, influencer category)—and engagement is driven not by any one signal in isolation, but by the interaction between them. Traditional approaches attempt to handle this by stitching together separate computer vision, NLP, and tabular models, then fusing their outputs. But this siloed architecture is structurally unable to capture cross-modal semantics—humor that arises from the interplay of image and caption, or credibility that depends on the match between an influencer’s niche and the product they endorse.

    This report presents the technical implementation and experimental evaluation of the Prediction Optimization Agent, an agentic system that addresses this limitation through a single unifying mechanism: semantic translation. Rather than training an end-to-end multimodal model, the agent uses a multimodal LLM to convert each post—image, caption, and metadata together—into a structured natural-language description. A lightweight downstream model then predicts engagement from that text alone. Critically, the translation prompt is not static: an LLM-based optimizer iteratively rewrites it using quantitative prediction error as its signal, treating the prompt as a tunable hyperparameter that converges toward descriptions maximally predictive of real engagement outcomes.

    For a fuller discussion of the business motivation, use cases, and strategic implications of this approach, we refer the reader to the accompanying blog post. The remainder of this document focuses on the system architecture, cloud infrastructure, dataset preparation, experimental methodology, and quantitative results.

    Cloud Architecture & Technical Implementation

    The implementation relies on a scalable cloud architecture with three components:

    • Semantic Translation: A Multimodal LLM (e.g., Gemini, GPT, Claude) ingests the media. To process large volumes of posts efficiently during each optimization round, the pipeline utilizes Vertex AI Batch Predictions. The system prompt directs the model to extract specific features to output a rich, structured text document.
    • Engagement Predictor: A lightweight, text-only Language Model is trained on these generated descriptions. It outputs a performance probability score and validation metrics.
    • Self-Optimizer: An LLM agent analyzes the validation results, comparing the current prompt against error analysis data, and rewrites the System Prompt to be more effective.

    Summary of Workflow:

    1. Initialize: Start with a generic prompt (“Describe this ad”).
    2. Translate: Convert media to text using the current prompt.
    3. Train: Train the text classifier/regressor.
    4. Evaluate: Measure accuracy.
    5. Refine: The agent updates the prompt to extract better predictive features.
    6. Repeat: Loop until performance plateaus.
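    The six steps above can be sketched as a single loop. Everything here is a stub-level skeleton under stated assumptions: the three injected callables and the plateau test stand in for the real translation, training, and rewriting stages, none of which are shown.

```python
def optimize(posts, outcomes, translate, fit_and_score, rewrite,
             initial_prompt="Describe this ad", max_rounds=10, tol=1e-3):
    """Run the describe -> train -> evaluate -> refine loop.

    `translate(prompt, post)` returns a text description,
    `fit_and_score(texts, outcomes)` trains the predictor and returns
    a validation error, and `rewrite(prompt, history)` produces the
    next prompt. All three are injected stubs in this sketch.
    """
    prompt, history = initial_prompt, []
    for _ in range(max_rounds):
        texts = [translate(prompt, p) for p in posts]   # 2. Translate
        error = fit_and_score(texts, outcomes)          # 3-4. Train & evaluate
        history.append((prompt, error))
        if len(history) > 1 and history[-2][1] - error < tol:
            break                                       # 6. Plateau: stop
        prompt = rewrite(prompt, history)               # 5. Refine
    return history
```

    With stub callables whose error plateaus, the loop stops early; in production each callable would wrap a batch LLM call, a training run, and an optimizer LLM call, respectively.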

    Dataset

    Dataset overview

    We utilized the Instagram Influencer Dataset to extract text descriptions of posts and predict engagement metrics (such as the number of likes).

    • Type: Category classification and regression.
    • Description: This dataset contains 33,935 Instagram influencers categorized into nine domains: beauty, family, fashion, fitness, food, interior, pet, travel, and other. It features 300 posts per influencer, totaling roughly 10.18 million posts.
    • Structure: Post metadata is stored in JSON format (caption, user tags, hashtags, timestamp, sponsorship status, likes, comments). The image files are in JPEG format. Because a single post can contain multiple images, the dataset provides a JSON-to-Image mapping file to link metadata with its corresponding visual assets.

    Exploratory Data Analysis (EDA)

    To better understand the target variables for our Predictor engine, we conducted a rigorous EDA on the dataset, revealing several key structural behaviors:

    • Visualizing the Distribution of Likes: When visualizing the distribution of likes across the dataset, we observed a massive right-skew. The average (mean) post receives ~4,344 likes, but the median is only 662. Because of this severe, exponential variance, we cannot perform regression directly on the raw number of likes. Instead, the target variable must be transformed using log(likes + 1) to normalize the distribution, stabilize the variance, and ensure our regression model can learn effectively.
    • Likes vs. Followers Correlation: The scatter plot distributions show a strong positive correlation (0.7853) between an influencer’s follower count and the number of likes they receive.
    • Engagement Rate Baseline: We calculated the Engagement Rate (Likes / Followers * 100). The dataset shows a mean engagement rate of 4.23% and a median of 2.96%.

    Figure 1. Histogram of number of likes and log of number of likes.
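    The target transform described above can be reproduced in a few lines. The sample counts below are invented purely to illustrate the right-skew; they are not values from the dataset.

```python
import math
import statistics

# Hypothetical, heavily right-skewed sample of like counts.
likes = [50, 120, 300, 662, 900, 2_000, 4_344, 52_000]

# Right-skew: the mean sits far above the median, as in the dataset.
mean, median = statistics.mean(likes), statistics.median(likes)

# The log(likes + 1) target used for regression.
log_likes = [math.log(x + 1) for x in likes]

def engagement_rate(likes_count: int, followers: int) -> float:
    """Engagement Rate = Likes / Followers * 100."""
    return likes_count / followers * 100
```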

    Results

    Data Preparation

    To ensure the integrity of our predictive modeling, we first applied filters based on our EDA. Roughly 13.8% of the dataset contained sponsored labels. We removed these sponsored posts entirely, as financial backing artificially skews organic engagement rates. We then narrowed our focus to create two high-density subsets: one featuring posts from the top 20 influencers, and a larger subset featuring the top 100 influencers (minimum 100 posts each). We opted to use the top 20 influencers dataset in most of our experiments.

    To handle the computational load of the iterative prompt optimization, we built a scalable cloud pipeline. Raw images and post metadata were staged in GCP Buckets. Gemini 2.5 Flash was deployed as the Semantic Translator to generate the text profiles, capturing both general post context and specific image content. Because the agentic loop required regenerating descriptions for thousands of posts across multiple prompt iterations, we leveraged Google Batch Predictions. This allowed us to asynchronously and cost-effectively generate the text profiles for each optimization round. Finally, the poster’s profile description, bio, and category were appended to the end of each generated description to provide complete semantic context for the downstream classifier.
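
    As a minimal sketch of the filtering step, assuming hypothetical column names (`username`, `sponsored`) rather than the dataset's actual keys:

```python
import pandas as pd

# Hypothetical schema: one row per post. The toy data below stands in for
# the real corpus; column names are assumptions for illustration.
posts = pd.DataFrame({
    "username": ["a"] * 120 + ["b"] * 110 + ["c"] * 50,
    "sponsored": [False] * 260 + [True] * 20,
})

# 1) Remove sponsored posts: paid backing skews organic engagement.
organic = posts[~posts["sponsored"]]

# 2) Keep influencers with at least MIN_POSTS organic posts, then restrict
#    to the TOP_N most prolific of those.
MIN_POSTS = 100
TOP_N = 20
counts = organic["username"].value_counts()
eligible = counts[counts >= MIN_POSTS].index[:TOP_N]
subset = organic[organic["username"].isin(eligible)]
```

    The same two-step filter, applied with the real thresholds, produced the top-20 and top-100 subsets described above.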

    Experiment 1: Comparative Baseline Analysis & Model Selection

    We initially framed the problem as a 3-class classification task (predicting low, average, and high likes) using a custom Deep Neural Network (three Linear layers with ReLU activation, and Cross-Entropy Loss). However, results showed that treating the problem as a regression task on the log(likes + 1) target yielded significantly better, more granular predictive performance.

    For the regression task, we benchmarked three models:

    • XGBoost Regressor: (n_estimators=300, learning_rate=0.05, max_depth=6, subsample=0.8)
    • LightGBM Regressor: (n_estimators=300, learning_rate=0.05, max_depth=6, num_leaves=31)
    • Transformer Model: distilbert-base-uncased, fine-tuned end-to-end.

    For XGBoost and LightGBM, the text embeddings were calculated using the all-mpnet-base-v2 text embedding model.

    | Model | R² | MAE | RMSE |
    | --- | --- | --- | --- |
    | XGBoost | 0.6908 | 0.4184 | 0.5759 |
    | LightGBM | 0.5749 | 0.5084 | 0.6752 |
    | distilbert-base-uncased | 0.7925 | 0.3775 | 0.4804 |

    Table 1: Results on the regression task, using History-Based Optimization

    The fine-tuned DistilBERT model substantially outperforms both tree-based baselines under History-Based Optimization, achieving an R² of 0.7925: a 10-point improvement over XGBoost and a 22-point improvement over LightGBM. This gap is expected: DistilBERT processes the raw text descriptions end-to-end and can learn task-specific token-level interactions, whereas XGBoost and LightGBM operate on pre-computed embedding vectors that compress away some of this nuance. Among the tree-based models, XGBoost’s clear advantage over LightGBM (R² 0.69 vs. 0.57) suggests that it better captures the nonlinear relationships present in the embedding feature space. Notably, even the XGBoost pipeline achieves a reasonably strong R² of 0.69, indicating that the semantic descriptions generated by the translator carry substantial predictive signal regardless of the downstream model.

    Given the trade-off between training cost and accuracy, we carried both XGBoost (as a fast, interpretable baseline) and DistilBERT (as the top performer) forward into the prompt optimization experiments, omitting LightGBM entirely.
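
    For reference, framing DistilBERT as a regressor amounts to giving it a single output head trained with MSE. The sketch below uses a randomly initialized config so it runs without downloading weights; the actual experiment loads and fine-tunes distilbert-base-uncased end-to-end:

```python
import torch
from transformers import DistilBertConfig, DistilBertForSequenceClassification

# num_labels=1 with problem_type="regression" makes the classification head
# a single scalar output trained with MSE loss.
config = DistilBertConfig(num_labels=1, problem_type="regression")
model = DistilBertForSequenceClassification(config)  # random init for the sketch

# Dummy batch standing in for tokenized descriptions and log(likes+1) targets.
input_ids = torch.randint(0, config.vocab_size, (4, 32))
labels = torch.tensor([[6.5], [5.2], [7.1], [4.8]])

out = model(input_ids=input_ids, labels=labels)
out.loss.backward()  # MSE loss, ready for an optimizer step
```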

    Experiment 2: Iterative Prompt Optimization Strategies

    Using the Top-20 influencer subset, we tested two distinct agentic prompt optimization approaches to see which method helped the LLM extract the most predictive features:

    1. History-Based Optimization: The agent was provided with the prompt history alongside the actual regression metrics (R², MAE, RMSE) from previous iterations. The prompt instructed the LLM to deduce how to improve feature extraction based on these hard metrics.
    2. Google Few-Shot Prompt Optimizer: Utilizing Vertex AI’s Few-Shot Optimizer, the agent was provided with 20 “good” and 20 “bad” prediction examples from the prior iteration. The optimization rubric was defined strictly as: [“Acceptable prediction error”, “Absolute prediction error value”].
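
    The History-Based loop can be sketched as follows. `llm_rewrite_prompt` and `evaluate_prompt` are hypothetical stand-ins: in the real pipeline the first calls the optimizer LLM with the full prompt history, and the second regenerates descriptions via batch prediction and scores the downstream regressor:

```python
def llm_rewrite_prompt(history):
    # Placeholder for the LLM call; here it just refines the best prior prompt.
    best = max(history, key=lambda h: h["r2"])
    return best["prompt"] + " (refined)"

def evaluate_prompt(prompt):
    # Placeholder for description generation + downstream regression metrics.
    return {"r2": 0.69, "mae": 0.42, "rmse": 0.58}

history = [{"prompt": "Describe the post's tone and visuals.",
            "r2": 0.62, "mae": 0.47, "rmse": 0.61}]

for round_ in range(3):
    prompt = llm_rewrite_prompt(history)   # agent sees all prior rounds
    metrics = evaluate_prompt(prompt)      # hard metrics, not vibes
    history.append({"prompt": prompt, **metrics})

best = max(history, key=lambda h: h["r2"])
```

    Keeping the full history in the optimizer's context is the key design choice: it lets the agent avoid re-trying prompt formulations that already degraded the metrics.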

    Figure 2: R² value across 20 prompt optimization rounds using the Few-Shot prompt optimization strategy.

    Figure 3: R² value across 20 prompt optimization rounds using the History-Based optimization strategy.

    | Prompt Optimization Strategy | Model | R² | MAE | RMSE |
    | --- | --- | --- | --- | --- |
    | History-Based Optimization | XGBoost | 0.6908 | 0.4184 | 0.5759 |
    | History-Based Optimization | distilbert-base-uncased | 0.7925 | 0.3775 | 0.4804 |
    | Google Few-Shot Prompt Optimizer | XGBoost | 0.6763 | 0.4691 | 0.6113 |
    | Google Few-Shot Prompt Optimizer | distilbert-base-uncased | 0.8068 | 0.3463 | 0.4544 |

    Table 2: Quantitative evaluation of prompt optimization strategies on the regression task

    The convergence plot for the Few-Shot Optimizer illustrates the strategy’s core limitation: without access to the full optimization history, the agent has no memory of what has already been tried. For example, the XGBoost R² oscillates erratically across the 20 rounds—rising above 0.68 in one iteration, then dropping back below 0.60 in the next—because the optimizer can only react to the most recent batch of good and bad examples rather than reason over long-term trends. In contrast, the History-Based strategy converges more steadily under the same XGBoost evaluation setup, as the agent can trace which specific prompt changes improved or degraded each error metric across all prior rounds and avoid regressing to previously failed formulations.

    Across both History-Based Optimization and the Google Few-Shot Prompt Optimizer, DistilBERT consistently outperforms XGBoost on all three evaluation metrics, achieving higher R² as well as lower MAE and RMSE. One plausible explanation is that DistilBERT benefits from end-to-end fine-tuning directly on the textual descriptions, allowing it to learn task-specific semantic patterns from the input. XGBoost, in contrast, depends on fixed upstream representations and therefore has less ability to adapt to nuance in wording, context, and structure. As a result, the Transformer-based approach appears better able to extract predictive signal from the generated descriptions than the tree-based regression pipeline.

    Temperature Benchmarks

    We tested different model temperatures using Gemini-2.5-Flash for text description generation (Semantic Translation) and Gemini-3.0-Pro for prompt optimization (Self-Optimization). The table below reflects temperature adjustments for the text generation phase, with XGBoost as the downstream regressor.

    | Temperature | R² | MAE | RMSE |
    | --- | --- | --- | --- |
    | 0.2 | 0.6577 | 0.4392 | 0.5746 |
    | 0.4 | 0.6908 | 0.4184 | 0.5759 |
    | 0.6 | 0.5765 | 0.5174 | 0.6933 |
    | 0.8 | 0.5911 | 0.5008 | 0.6622 |
    | 1.0 | 0.5852 | 0.5243 | 0.6825 |

    Table 3: Performance degrades noticeably at temperatures ≥ 0.6, suggesting lower/moderate temperature for description generation leads to more consistent, predictive descriptions.

    A temperature of 0.4 yields the strongest results, achieving the highest R² (0.6908) with competitive MAE. Performance degrades noticeably at 0.6 and above, with R² dropping by as much as 0.11 points. This is consistent with expectations: higher temperatures introduce hallucinated or loosely grounded details that add noise rather than predictive signal to the generated descriptions. When the downstream regressor encounters inconsistent or fabricated features across similar posts, its ability to learn stable patterns deteriorates. Conversely, the lowest temperature tested (0.2) underperforms 0.4, likely because overly deterministic outputs produce near-identical phrasing for visually similar but distinct posts, collapsing meaningful variation that the predictor could otherwise exploit. The sweet spot at 0.4 balances descriptive consistency with enough variation to differentiate posts along dimensions that matter for engagement. Based on these results, we fixed the Semantic Translation temperature at 0.4 for all subsequent experiments.
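
    Operationally, the sweep reduces to an argmax over measured scores. `score_at_temperature` below is a hypothetical stand-in for the full generate-embed-regress pipeline, returning the R² values measured in Table 3:

```python
# R² per generation temperature, as measured in Table 3.
MEASURED_R2 = {0.2: 0.6577, 0.4: 0.6908, 0.6: 0.5765, 0.8: 0.5911, 1.0: 0.5852}

def score_at_temperature(temp):
    # Hypothetical stand-in: the real pipeline regenerates every description
    # with Gemini 2.5 Flash at this temperature, re-embeds, retrains XGBoost,
    # and reports held-out R². Here we just look up the measured value.
    return MEASURED_R2[temp]

best_temp = max(MEASURED_R2, key=score_at_temperature)
```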

    Embedding Model Benchmarking

    We additionally conducted an experiment to find the optimal text embedding model. We vectorized the generated descriptions (2,677 in total) using several popular embedding architectures and measured the downstream XGBoost regression performance.

    | Embedding Model | Dimension | R² Score | MAE | RMSE |
    | --- | --- | --- | --- | --- |
    | thenlper/gte-base | 768 | 0.7372 | 0.4044 | 0.5308 |
    | thenlper/gte-large | 1024 | 0.7165 | 0.4100 | 0.5513 |
    | sentence-transformers/gtr-t5-large | 768 | 0.7122 | 0.4179 | 0.5554 |
    | all-mpnet-base-v2 | 768 | 0.6908 | 0.4184 | 0.5759 |
    | all-MiniLM-L12-v2 | 384 | 0.6273 | 0.4775 | 0.6322 |
    | all-MiniLM-L6-v2 | 384 | 0.5594 | 0.4996 | 0.6874 |
    | all-roberta-large-v1 | 1024 | 0.5520 | 0.5198 | 0.6931 |

    Table 4: Evaluation of different embedding models on the regression task, using the XGBoost model.

    Three key findings emerge from the embedding benchmark. First, dimensionality alone is not predictive of downstream quality. The 1024-dimensional all-roberta-large-v1 ranks last (R² 0.5520), while the 768-dimensional gte-base leads the table (R² 0.7372), demonstrating that the training objective and data composition of the embedding model matter far more than raw vector size for this text domain.

    Second, a clear performance tier structure is visible. The GTE family and gtr-t5-large form a top tier (R² > 0.71), while the MiniLM variants and RoBERTa fall noticeably behind (R² < 0.63). The top-tier models share a common trait: they were trained with contrastive objectives on diverse, semantically rich corpora, which aligns well with the structured but descriptive text our Semantic Translator produces.

    Third, the 384-dimensional MiniLM models, while attractive for latency-sensitive deployments, lose a substantial amount of signal—an R² drop of 0.11 to 0.18 compared to gte-base. Their smaller embedding dimensions and shallower architectures lack the capacity to encode the dense, multi-attribute descriptions our Semantic Translator produces, where a single paragraph may simultaneously capture visual composition, emotional tone, brand cohesion, and influencer credibility.

    Conclusion

    This project demonstrates a highly effective, interpretable alternative to traditional opaque multimodal models for predicting media performance. By leveraging Large Language Models as universal feature extractors (the Prediction Optimization Agent), we successfully unified heterogeneous data inputs into a single, human-readable semantic modality.

    Several key insights emerged from our experimental pipeline:

    • Target Transformation is Crucial: Predicting raw engagement metrics directly leads to unstable predictions because of the heavy skew in the label distribution. Transforming the target variable to log(likes + 1) and framing the problem as a continuous regression task yielded superior, more granular results compared to our baseline classification approach.
    • Prompt Optimization Strategy: In our agentic optimization loop, History-Based Optimization was more suitable for the task at hand. Explicitly feeding the LLM agent hard quantitative error metrics (R², MAE, RMSE) from previous iterations allowed it to reason more effectively about feature importance. It successfully “learned” to rewrite prompts that extracted visual and semantic elements highly correlated with user engagement.
    • Embedding Efficiency Over Size: Our benchmarking revealed that bigger isn’t always better. The thenlper/gte-base model (768 dimensions) achieved the highest predictive performance (R²: 0.7372), significantly outperforming heavier models like gte-large and all-roberta-large-v1. This highlights that for this specific text space, highly optimized, mid-sized embeddings provide the most useful feature space for tree-based regressors like XGBoost.

    Ultimately, this agentic feedback loop proves that natural language prompts can be treated as tunable hyperparameters. This architecture not only predicts media success with strong accuracy but, more importantly, provides the crucial “why” behind the prediction—giving marketers and engineers the transparency that traditional approaches lack.