Author: MIchal Koziki

Data Quality Assurance Agent Blog Post

Meet Your New Data Guardian

1. The High Cost of “Dirty” Data

In the modern enterprise, data is the raw ingredient behind every strategic decision. Think of it like a premier restaurant: the Data Engineer is the sous-chef, meticulously sourcing and preparing ingredients, while the Data Scientist is the executive chef, transforming them into the predictive models and insights that drive the business forward. If the ingredients are spoiled or mislabelled, the final dish fails, no matter how talented the chef.

Across several of our AI initiatives at WPP, we uncovered a pattern that was quietly draining velocity from our most ambitious projects. Our “sous-chefs”, skilled data engineers responsible for pipeline integrity, were spending up to one full day per week on tedious, largely manual Quality Assurance (QA) of data flowing into BigQuery. Row by row, column by column, they checked for missing values, logical contradictions, and phantom duplicates, work that was essential but deeply repetitive.

This wasn’t just an inconvenience. It was a strategic bottleneck: it slowed the delivery of every downstream AI application, consumed senior engineering talent on janitorial tasks, and most dangerously created risk. When a human eye is the only safeguard between raw data and a production model, errors don’t just slip through occasionally. They slip through systematically, at exactly the moments when the data is most complex and the engineer is most fatigued.

We asked ourselves a different question: What if, instead of building another dashboard or writing another validation script, we built an intelligent agent, one that could reason about data quality the way an experienced engineer does, learn from every audit it performs, and get better over time?

This article describes how we built that agent, what makes it fundamentally different from traditional automation, and what happened when we put it to the test.

2. The Problem: Why Data Quality Demands More Than Scripts

The Data & The Modeling Ecosystem

The agent operates on digital marketing campaign performance data hosted in BigQuery, massive tables that track how advertising campaigns perform on a daily basis across major ad networks like Meta (Facebook and Instagram). Each row represents a highly granular intersection of a specific campaign, audience segment, platform, device, and creative asset. This data captures everything from broad identifiers—like the parent brand and geographical targeting—down to precise performance metrics, including impressions, clicks, daily spend, conversions, leads, and app installs.

This foundational data is the lifeblood of two critical machine learning systems:

The Prediction Model: A classification system designed to predict whether a planned campaign will yield a negative, neutral, or positive outcome.
The Recommendation System: A highly flexible advisory engine capable of handling any combination of “missing modalities.” For example, if a media planner inputs a specific Brand, Target Audience, and Location, the system dynamically recommends the optimal missing parameters, such as the best platform to use and the most effective creative asset to deploy.

For more background on the broader modeling context, see From Guesswork to Glimpse: How AI is Predicting the Future of Marketing Campaigns.

The Silent Threat of Data Corruption

Because these models directly inform real-world media spend and strategic campaign planning, their accuracy is paramount. The underlying data is regularly refreshed directly from the advertising platforms to keep the models up to date. However, this automated refresh process frequently introduces subtle corruption and systemic inconsistencies.

For instance, while metrics like engagement and clicks generally remain stable, downstream pipeline issues frequently render conversions and awareness metrics unreliable (“not high quality”). At the individual row level, these anomalies are often entirely invisible. But at scale, they are devastating. If left unchecked, these untrustworthy data points bleed into the training sets, silently degrading the prediction model’s accuracy and causing the recommendation engine to suggest sub-optimal, expensive campaign configurations. This makes rigorous, automated data quality validation not just a nice-to-have, but an absolute necessity for the ecosystem to function.

The Failure Modes

The scale and velocity of data flowing into BigQuery mean that errors don’t announce themselves. They hide. Through our manual QA process, we catalogued six recurring failure modes, each one capable of silently degrading every model built on top of the data:

Failure Mode	What Happens	Why It Matters
Missing Values	Fields arrive empty — sometimes 5% of a column, sometimes 40%	Models trained on incomplete data learn incomplete patterns. Forecasts drift silently.
Outliers	A metric reads 200,000 clicks when the true value is 500	A single extreme value can skew an entire model’s calibration, distorting spend recommendations.
Duplicate Rows	Identical records appear multiple times	Inflated counts cascade into inflated budgets. Campaigns appear to outperform reality.
Categorical Corruption	A brand name like `"Nike"` is replaced with `"zX9pQ"`	Segmentation breaks. Reports attribute performance to entities that don’t exist.
Logical Inconsistencies	More clicks than impressions. Spend recorded against zero impressions.	These are the most insidious — each value looks valid in isolation, but the relationships between them violate business reality.
Missing Columns	An entire field disappears from a refresh	Downstream pipelines fail or, worse, silently fall back to defaults.

A static validation script can catch some of these — the easy ones, the ones you’ve already seen. But scripts are brittle: they encode yesterday’s assumptions and break on tomorrow’s edge case. They cannot reason about why a pattern looks wrong, weigh it against historical context, or decide whether a recurring anomaly is a genuine error or a known artifact of a data source.

That requires judgment. And judgment is what we built the agent to provide.

3. Our Approach: An Agent That Reasons, Remembers, and Improves

We designed the Data Quality Assurance Agent — not as a script, not as a dashboard, but as a reasoning entity capable of planning an audit strategy, querying data, forming hypotheses about its health, testing those hypotheses, and learning from the results. The distinction matters. A script checks what you tell it to check. An agent decides what to check, based on what it knows and it has the tools to act on that decision end-to-end.

Architecture: One Agent, Specialized Tools

The agent is powered by a single reasoning core that plans, decides, and acts. What gives it breadth is its toolkit, a set of specialized capabilities it can invoke as needed, selecting the right tool for each step of the audit:

Database Tool: enables the agent to query BigQuery directly, fetching schemas, row counts, column statistics, and raw data samples.
Auditing Tool: the agent’s analytical engine. It formulates hypotheses about potential quality issues, runs targeted checks, and compiles structured findings. This tool reads from and writes to the Memory Bank.
Analytics Tool: generates visualizations using Python — charts, distributions, and plots that make audit findings immediately legible to stakeholders.
Artifact Tool: packages the final audit report, charts, and evidence into downloadable artifacts stored in Google Cloud.

The agent orchestrates these tools autonomously. When a user asks it to audit a table, the agent formulates a plan, queries the data, runs its checks, generates visualizations where useful, and compiles a structured report, all without the user needing to specify which tool to use or in what order.

The Key Innovation: Long-Term Memory

Most AI tools are stateless. When the session ends, everything the system learned disappears. The next audit starts from zero. This is the fundamental limitation we set out to break. The agent maintains a persistent Memory Bank, a long-term knowledge store that survives across sessions and accumulates institutional intelligence over time. This memory captures three categories of knowledge:

Historical Explanations When a data engineer confirms that a recurring anomaly is caused by a known tracking limitation or data source quirk, the agent records that explanation. The next time it encounters the same pattern, it doesn’t waste time flagging it as a new issue, it references the known cause, notes it in the report, and moves on to genuinely novel problems.
Business Context Over successive audits, the agent absorbs the specific rhythms and patterns of our marketing data, seasonal spikes, platform-specific reporting delays, expected variance ranges for different campaign types. This contextual awareness allows it to distinguish between a real anomaly and normal business variation.
Evolutionary Learning With every audit, the agent’s knowledge base deepens. Instead of repeating the same blind checks, it refines its hypotheses based on what it has seen before — which columns tend to have issues, which tables are most prone to duplication, which logical inconsistencies recur. The agent doesn’t just run. It compounds.

This is what separates an agent from a script. A script executes the same logic every time, regardless of history. The agent carries forward everything it has learned and every audit it performs makes the next one sharper.

The Tech Stack

To ensure the agent was enterprise-grade, we built on the full Google Cloud AI ecosystem:

Component	Role
Vertex AI Agent Engine	Manages the agent’s long-term specific memory persistence, and saving of the chat sessions
BigQuery	The single source of truth — the agent performs direct, in-place auditing against production tables
Agent Development Kit (ADK)	The framework used to define the agent’s tools, constraints, and interaction boundaries
Google Cloud Storage	Persistent storage for audit trails, PDF reports, and visual evidence
Cloud Runs	Used to deploy the A2A Agent API, and the ADK Web UI for demo purposes
A2A	The protocol to expose our Agent as a headless API

4. Proving It Works: Synthetic Error Injection

We didn’t hope the agent worked. We proved it using a controlled methodology we call Synthetic Error Injection. The premise is straightforward: take a perfectly clean dataset, intentionally corrupt it in specific, measurable ways, and then challenge the agent to find every error we planted. If the agent can detect artificially injected errors, whose exact type, location, and severity we control, we can be confident it will handle real-world data corruption, which is typically far less extreme.

Step 1: Preparing the Test Data

Before injecting errors, we prepare the data for safe, controlled experimentation:

Anonymization — Real brand and advertiser names are replaced with generic identifiers ("Brand 1", "Company A"). Sensitive business information never enters the test environment.
Corruption— The dataset then receives a different severity level of corruption. This allows us to map the agent’s detection accuracy as a function of error density, from subtle (5%) to extreme (40%).

Step 2: Injecting Controlled Errors

Using purpose-built scripts, we introduce precisely calibrated corruptions into a table, 4 types of Structural and 7 types of Logical errors:

Category	Error	Description
Structural	Missing Values (Nulls)	X% of cells set to NULL
Structural	Duplicate Rows	X% exact row copies
Structural	Dropped Columns	X% of columns removed
Structural	Categorical Errors	Random alphanumeric strings in category fields
Logical	Clicks > Impressions	Can’t click what wasn’t shown
Logical	Conversions > Clicks	Can’t convert without clicking
Logical	Spend with 0 Impressions	Paying for zero ad delivery
Logical	Video Completions > Plays	Can’t finish without starting
Logical	Purchases without Add-to-Cart	Funnel step skipped
Logical	Landing Page Views > Clicks	More landings than clicks
Logical	Negative Metric Values	Performance metrics can’t be negative

Step 3: Synthetic Ground Truth Dataset

We keep track of the errors we introduce in a table and produce a ground truth dataset that looks like:

Table_name	number_of_injected_logical_errors	type_of_logical_error	number_of_injected_structural_errors	type_of_structural_error
table_01	0	–	1	categorical errors
table_02	0	–	1	dropped columns
table_03	1	clicks_exceed_impressions	0	–
table_04	1	spend_with_zero_impressions	0	–

5. Evaluation Pipeline, Experiments and Results

To evaluate our Agent we follow the pipeline below:

The 4 Experiments and Results

Each experiment isolates a single variable to understand what affects the auditor agent’s detection quality.

Experiment 1: Prompt Comparison

Question: Does giving the agent a more detailed prompt improve error detection?

Runs the agent 3 times on the same table, each time with a different user query style:

Prompt Level	What the user asks	Agent’s system instruction
Simple	“Check if there are any errors for table X”	Stays constant (forensic mode)
Medium	“Perform a structured assessment checking physical integrity, numerical sanity, categorical validity”	Stays constant
Complex	“Conduct a forensic audit checking for 11 specific error types with detailed cross-column logical checks”	Stays constant

Key insight from results: Only the complex prompt successfully detected the injected spend_with_zero_impressions error (139 rows, 1.82%), while both the simple and medium prompts missed it entirely — confirming that more detailed, forensic-style instructions are critical for the agent to test nuanced logical relationships rather than just surface-level checks.

Experiment 2: Table Sweep

Question: How well does the agent detect different types of errors?

Experiment 2 stress-tests the Auditor agent (using the complex prompt) on 11 synthetic BigQuery tables with progressively stacked error combinations, ranging from a single logical violation to the full set of 7 logical plus 4 structural error types (11 total). The agent achieved perfect detection on 8 of 11 tables (72.7% with F1 = 1.0) and delivered strong overall performance (precision 0.905, recall 0.864, F1 0.883).

Experiment 3: Consistency Check

Question: If we run the exact same audit N times, do we get the same results?

Runs the identical configuration multiple times (default: 3).

The Consistency Experiment demonstrated that the auditor agent achieves perfect 100% reproducibility, consistently detecting both injected errors and real data quality issues with the exact same metrics and explanations across all three runs. This deterministic performance confirms that the agent’s complex prompt configuration is highly reliable effectively eliminating the need for redundant audits.

Experiment 4: Memory Ablation

Question: Does loading past audit memories help the agent perform better?

Runs the agent twice on the same table with the complex prompt:

Condition	What happens
No Memory	Cold start — agent has no context from prior audits
With Memory	Agent loads past findings from Vertex AI memory before auditing

Key insight from results: Memory doesn’t unlock new capabilities when tables contain simple structural errors or when the user prompt leads to a very detailed auditing but acts as a critical consistency enforcer, stabilizing the audit scope to guarantee a 100% success rate. However, in cases that there are many different logical errors in a table and the prompt is simple, when past memories are loaded into the agent’s present context performance becomes optimal.

6. The Agent in Action

Below is an actual interaction with the deployed agent, demonstrating its reasoning in practice. The agent discovers available tables, loads its accumulated memory from prior audits, and then, guided by that memory, conducts a targeted audit.

Discovery

Human: “What tables do you have access to?”

Agent: “I have access to the following tables in the dataset:

campaign_performance_table_1

campaign_performance_table_2

campaign_performance_table_3

campaign_performance_table_4

campaign_performance_table_5

Memory Recall

Human: “Load your memories.”

Agent: “I have loaded the following memories:

A general instruction to always check for null values in tables.

A specific data quality issue in campaign_performance_table_2 where 188 rows (18.8%) were duplicates based on event_date, campaign_id, and ad_creative_id.

A preference for reports to be formatted in Markdown and include an executive summary.”

The agent doesn’t start from scratch. It arrives with institutional knowledge — prior findings, user preferences, and standing instructions — all retrieved from its persistent Memory Bank.

Hypothesis-Driven Audit

Human: “Based on these memories, audit campaign_performance_table_3.”

The agent formulates a plan, queries the table using its database tool, and — informed by its memories — produces a structured audit report:

Executive Summary: This audit revealed a consistent pattern of missing values across key columns, affecting exactly 5% of total rows. Specifically, event_date, campaign_id, impressions, clicks, and conversions each had 50 null entries out of 1,000 rows. No exact duplicate rows were found. This suggests a systemic issue with data completeness rather than random omissions.

The agent then provides detailed findings per column — quantifying impact, assessing severity, and recommending next steps — before offering to proceed with outlier detection, categorical validation, and logical consistency checks.

Persistent Learning

Human: “Save the results for this table.”

Agent: “The audit results have been saved to memory.”

The next time the agent audits any table, it will carry forward what it learned here — the 5% missing-value pattern, the columns affected, the systemic nature of the issue. Every audit makes the next one faster, more targeted, and more informed.

7. Impact and What Comes Next

Operational Impact

The deployment of the Data Quality Assurance Agent has delivered measurable results:

Reclaimed engineering capacity. By automating the bulk of manual QA, data engineers have shifted from repetitive inspection work to higher-value activities.
Earlier error detection. Quality issues are now caught at the point of ingestion, before they propagate into downstream models. This “shift-left” approach reduces the blast radius of bad data from hours to minutes.
Higher model reliability. Marketing agents, analytics pipelines, and machine learning models now operate on data that has been systematically validated, reducing the risk of predictions and recommendations built on flawed foundations.

The Bigger Picture

This agent is more than a tool. It is a blueprint for autonomous data governance, a pattern that can be replicated across any data pipeline where quality, scale, and velocity collide.

We are currently extending the agent along three axes:

Cross-table auditing: enabling the agent to detect inconsistencies across related datasets, not just within a single table. Many of the most damaging data quality issues manifest as contradictions between tables that individually look clean.
Event-driven execution: triggering the agent automatically whenever a BigQuery table is updated, transforming data quality monitoring from a scheduled chore into a continuous, always-on safeguard.
Adversarial stress-testing: today, our synthetic error injection is script-based and manually configured. We are building a dedicated adversarial agent whose sole purpose is to generate increasingly complex, realistic data corruptions, subtle logical contradictions, plausible-looking outliers, correlated missing-value patterns, specifically designed to challenge the QA agent’s detection capabilities. By putting one agent against the other in a continuous red-team / blue-team loop, both improve: the adversarial agent learns to craft harder-to-detect errors, and the QA agent learns to catch them, driving each other toward sharper, more robust performance over time.

Together, these extensions move us toward a future where data quality monitoring is not a task that consumes an engineer’s day. It is a capability the agent handles continuously and intelligently, surfacing only the issues that require human judgment and decision-making.

March 24, 2026

LLM Finetuning Pod

Beyond the Hype: Can Fine-Tuned LLMs Beat Custom Classifiers in Marketing Campaign Prediction?

Our custom-built machine learning models have served us well, becoming the reliable champions for our classification needs. But a new challenger has emerged. The rapid advance of Large Language Models (LLMs) raises a compelling question that we couldn’t ignore:

Can the vast, pre-trained knowledge of an LLM actually beat a specialized model on its own turf?

To find a definitive answer, we launched an in-house deep dive with three core objectives:

1. The Testbed: A Realistic Business Problem. First, we needed a meaningful challenge. We chose a marketing campaign classification task, using our own synthetically generated datasets to predict success as Underperforming, Average Performing, or Overperforming.

2. The Core Question: A Head-to-Head Showdown. Our primary goal was to definitively determine if a fine-tuned LLM could match or exceed the performance of our best internal classifiers on this specific task.

3. The Strategic Goal: Building In-House Capability. Beyond this single experiment, we aimed to develop our team’s practical expertise in fine-tuning and deploying LLMs by systematically exploring different workflows and their trade-offs.

This blog post details that journey. We will walk you through our methodology, from our multi-pronged fine-tuning strategy to a final showdown between our internal champion and the chosen LLM challengers: Gemini, Gemma, and Qwen.

The Quest: Our Three-Pronged Experimental Strategy

A core objective was to develop a versatile, in-house capability for LLM fine-tuning. To achieve this, we deliberately explored three distinct workflows, each representing a different point on the spectrum of control and complexity:

Workflow	Model	Approach	Key Characteristic
1. The Managed Expressway	Gemini 2.5 Flash-Lite	Fully Managed (Vertex AI)	Low-overhead and serverless, but a “black box” with limited control.
2. The Custom Highway	Gemma3-1b-it	Custom Cloud Training (Vertex AI)	A middle ground offering more control with the scalability of the cloud.
3. The Hands-On Workshop	Qwen3-4b-base	Local Training (Local GPU)	Maximum control and deep architectural customization.

This multi-faceted approach enabled us to build robust internal knowledge for future LLM projects while systematically testing our core hypothesis.

The Proving Ground: Our Datasets

To ensure a controlled and repeatable experiment, our analysis was performed on a suite of synthetically generated datasets, all created by our in-house ground-truth graph generator, a custom-built system designed specifically to simulate realistic marketing and media performance dynamics. By generating data with known underlying relationships, this system allows us to test AI models in a fully transparent, controlled environment while preserving the complexity of real-world campaign dynamics.

Each campaign in our dataset is described by five text-based features, designed to mimic a real-world marketing brief:

Audience: A qualitative description of the target consumer segment, including their values and interests.
Brand: The market positioning, values, and perceived credibility of the brand.
Creative: The tone, messaging style, and persuasive approach of the ad content.
Platform: The advertising platform where the campaign runs (e.g., Amazon, TikTok), which implies user intent and context.
Geography: The geographic region being targeted as a series of zip-codes, providing demographic and market signals.

With this setup, the model’s task was to predict campaign success. All dataset versions contain 42,000 training examples but were designed with two distinct class distributions to test the models under different conditions:

Highly Imbalanced Sets (V14, V15): These datasets present a significant class imbalance challenge where the Average Performing class is overwhelmingly dominant. This setup was specifically designed to see if the models would fall into the “average trap.”
Balanced Sets (V25, V26): In contrast, these datasets are far more balanced, with the three classes distributed much more evenly. These test the model’s ability to distinguish between classes when no single class is an easy default.

The chart below provides a clear visual breakdown of this intentional variation, illustrating the stark difference in class distribution across our four primary training datasets:

Synthetic Dataset Class Distributions’ Summary

Finally, a crucial component of our methodology was the strict separation of respective test datasets containing approximately 18,000 entries each. Unlike the training sets, the true labels for this test data were never exposed to our team. Instead, evaluation was conducted exclusively through a blind API service. This protocol was essential to prevent any form of data leakage or unconscious bias from influencing our model development and hyperparameter tuning, ensuring a truly honest assessment of each model’s performance.

Round 1: The Managed Approach with Gemini

For the first leg of our journey, we turned to Google’s fully managed, serverless pipeline on Vertex AI to fine-tune the Gemini model. This approach represented the “path of least resistance,” allowing us to establish a performance baseline with a state-of-the-art model using a streamlined, integrated platform with minimal operational overhead.

The Setup: LoRA, a “Black Box” with Limited Levers

It’s critical to understand that the supervised fine-tuning for Gemini models on Vertex AI uses a Parameter-Efficient Fine-Tuning (PEFT) method called LoRA, which stands for Low-Rank Adaptation.

Instead of retraining all the billions of parameters in the model (a process known as full fine-tuning), LoRA freezes the vast majority of the base model’s weights.

It then injects small, trainable modules called “adapters” into the model’s architecture.

The fine-tuning process only updates the parameters within these lightweight adapters, which can reduce the number of trainable parameters by up to 10,000 times.

This LoRA-based methodology has several profound implications:

Efficiency and Cost: It dramatically reduces the computational cost, memory requirements, and time needed for fine-tuning.
Portability: The output of the process is not a new multi-billion parameter model, but a small adapter file (typically just a few megabytes) that is “placed on top” of the base model during inference.
Preservation of Knowledge: Because the core model is frozen, this method prevents “catastrophic forgetting,” where a model can lose its powerful, general-purpose knowledge during specialization.

However, for our experiment, the most important implication is that it limits the scope of customization. We are not changing the core model; we are only training the adapters. The primary levers available in the Vertex AI service are dials to control this adapter-based training:

Training Steps: The total number of steps the model trains for (which we map from epochs).
Learning Rate Multiplier: A multiplier applied to a recommended, pre-tested learning rate for the adapter.
Adapter Size (Rank): This directly controls the capacity (the number of trainable parameters) of the LoRA matrices. A larger rank gives the adapter more power to learn the new task but also increases its size. Common options are powers of 2 (e.g., 4, 8, 16).

Experiments and Results: Turning the Dials on the Black Box

We ran a series of experiments across our datasets to see if we could find a combination of these limited hyperparameters that would yield competitive performance. The table below summarizes our key runs and their results.

Table 1: Gemini Flash Lite Performance Across All Datasets

Dataset	Model	Overall F1	Negative F1	Average F1	Positive F1	Epochs	LR Multiplier	Adapter Size
V15	Internal (Best)	0.9280	0.9215	0.9807	0.8818	N/A	N/A	N/A
V15	Gemini (Baseline)	0.2693	0.1249	0.5350	0.1480	0	N/A	N/A
V15	Gemini (Fine-Tuned)	0.3381	0.0998	0.8333	0.0812	5	1	4
V25	Internal (Best)	0.8947	0.9125	0.8595	0.9122	N/A	N/A	N/A
V25	Gemini (Fine-Tuned)	0.3329	0.2928	0.3974	0.3086	10	2	16
V26	Internal (Best)	0.8884	0.9113	0.8526	0.9014	N/A	N/A	N/A
V26	Gemini (Fine-Tuned)	0.3335	0.2912	0.4019	0.3074	10	2	16

Dissecting the Performance: A Look at Controlled Experiments

To understand if this performance ceiling could be breached, we conducted several controlled experiments to isolate the impact of different variables.

Model Scale

We first questioned if a larger model would perform better. Table 2 compares gemini-2.5-flash-lite with the slightly larger gemini-2.5-flash, keeping all other hyperparameters identical.

Table 2: Gemini Model Comparison (Flash Lite vs. Flash)

Model Name	Overall F1	Negative F1	Average F1	Positive F1	Epochs	LR Multiplier	Adapter Size	Sys Prompt
`gemini-2.5-flash-lite`	0.3254	0.0728	0.8037	0.0996	5	1	Four	2
`gemini-2.5-flash`	0.3381	0.0998	0.8333	0.0812	5	1	Four	2

The larger gemini-2.5-flash model provided a marginal improvement in the Overall F1-score. This suggests that while scale can help slightly, it does not fundamentally solve the architectural challenges of this workflow.

Hyperparameter Tuning

Next, we analyzed the impact of training duration and learning rate. Table 3 compares our two primary training configurations.

Table 3: Gemini Flash Lite Epoch and LRM Comparison

Epochs	LR Multiplier	Dataset	Overall F1	Negative F1	Average F1	Positive F1
5	1	V15	0.3293	0.0842	0.8068	0.0969
10	2	V25	0.3329	0.2928	0.3974	0.3086
10	2	V26	0.3335	0.2912	0.4019	0.3074

Here, we see a crucial finding. The runs with 10 Epochs and a Learning Rate Multiplier of 2 (on V25 and V26) yielded significantly better F1-scores on the minority “Negative” and “Positive” classes compared to the 5-epoch run. This indicates that more extensive training helped the model learn more about the rare classes, but it still wasn’t enough to lift the overall performance to a competitive level.

Our experiments with Adapter Size and System Prompts showed that these parameters had a negligible impact on performance. Whether using an adapter size of “Four” or “Sixteen,” or using prompts with text versus integer labels, the results remained consistently low, confirming that the bottleneck was not these specific settings.

System Prompt Strategy

Next we needed to control for was the prompt itself. Was the model’s poor performance simply due to a misunderstanding of the task’s objective? To isolate this, we designed and tested distinct prompt strategies that framed the classification problem in fundamentally different ways:

Persona-Based Reasoning (System Prompt 1): This prompt instructed the model to assume the role of a senior marketing expert and use its vast pre-trained knowledge of consumer behavior and brand strategy to make a qualitative judgment.
Rule-Based Classification (System Prompts 2 & 3): These prompts explicitly described the quantitative, rule-based logic used to generate our synthetic data. They instructed the model to ignore qualitative interpretation and instead focus on tallying the positive and negative relationships between features to arrive at a classification, outputting either a text label (Prompt 2) or an integer label (Prompt 3).

The results of this test were immediate and revealing.

SYSTEM_PROMPT = {
    # System prompt 1: Assuming the role of a marketing professional and relying on the models' historical data
    1: """
### Role
You are a senior marketing campaign manager with deep experience designing and evaluating advertising campaigns across digital platforms.

You understand:
- Brand positioning, perceived value, and product–market fit
- Consumer behavior, lifestyle signals, and purchasing psychology
- Audience–channel alignment and creative effectiveness
- How targeting, creative tone, and platform context influence campaign outcomes

### Task
Your task is to accurately predict the expected performance of an advertising campaign based on the provided features. Using your professional judgment, classify the campaign’s expected performance into exactly ONE of the following categories.

### Input Field Definitions
- **Audience**: A qualitative description of the target consumer segment, including lifestyle habits, behavioral tendencies, interests, and personal values. This input should be interpreted as an indicator of audience–product and audience–platform fit.
- **Brand**: A description of the brand’s positioning, values, credibility, and market perception. This input reflects brand strength, authority, and alignment with the intended audience and campaign objectives.
- **Creative description**: The tone, messaging style, emotional intensity, and persuasive approach of the campaign’s creative content. This input should be evaluated for clarity, relevance, resonance with the audience, and consistency with the brand.
- **Geolocation**: A list of geographic targeting identifiers (e.g., ZIP codes) indicating where the campaign will be shown. This input provides contextual signals such as market characteristics, demographic tendencies, and historical regional performance.
- **Platform**: The advertising or commerce platform where the campaign will run (e.g., Amazon). This input represents platform-specific user intent, audience behavior, format constraints, and historical performance patterns.

### Output Label Definitions
- **Overperforming**: Based on historical data, more than 70% of campaigns with similar input features achieved a positive return on investment (ROI).
- **Average performance**: Based on historical data, between 40% and 70% of campaigns with similar input features achieved a positive return on investment (ROI).
- **Underperforming**: Based on historical data, fewer than 40% of campaigns with similar input features achieved a positive return on investment (ROI).

### Instructions & Output Format
Your output must be a single string, one of:
- Overperforming
- Average performance
- Underperforming

Do not include explanations, reasoning, or any additional text.

### Campaign to Classify
Audience: {{audience}}
Brand: {{brand}}
Creative description: {{creative}}
Platform: {{platform}}
Geography: {{geography}}

### Classification Output
Label: {{label}}
""",
    # System prompt 2: Reverse engineering the methodology used to assign labels via the graph generator
    2: """
### Role
You are a classifier tasked to annotate marketing campaigns of a given setup as Overperforming, Underperforming or Average performance.

### Task
You are provided with examples of synthetic data examples that were generated based on pairwise relationships (positive, negative or neutral) between the different attributes of the campaign settings. If an example has many positive pairwise relationships it is considered overperforming and if it has many negative relationships it is considered underperforming. All attributes are considered of equal importance so it is only the amount of pairwise relationships that determines the overall campaign performance, not their quality.

### Instructions & Output Format
Your output must be a single string, one of:
- Overperforming
- Average performance
- Underperforming

Do not include explanations, reasoning, or any additional text.

### Campaign to Classify
Audience: {{audience}}
Brand: {{brand}}
Creative description: {{creative}}
Platform: {{platform}}
Geography: {{geography}}

### Classification Output
Label: {{label}}
""",
    # System prompt 3: Reverse engineering the methodology used to assign labels via the graph generator using integer output labels
    3: """
### Role
You are a classifier tasked to annotate marketing campaigns of a given setup as Overperforming, Underperforming or Average performance.

### Task
Classify the campaign by tallying the pairwise relationships between all attributes, treating every attribute as mathematically equal in importance. Calculate the net balance of these interactions: a majority of positive relationships defines the campaign as 2 (overperforming), while a majority of negative relationships defines it as 0 (underperforming). If the relationships are predominantly neutral or the positive and negative counts are equal, classify it as 1 (average performance). Your decision must be based strictly on the quantity of relationship types, ignoring the qualitative nature of the descriptions or the perceived importance of any single setting.

### Instructions & Output Format
Your output must be a single integer, one of:
- 0 (for Underperforming)
- 1 (for Average performance)
- 2 (for Overperforming)

Do not include explanations, reasoning, or any additional text.

### Campaign to Classify
Audience: {{audience}}
Brand: {{brand}}
Creative description: {{creative}}
Platform: {{platform}}
Geography: {{geography}}

### Classification Output
Label: {{integer label}}
""",
}

The persona-based approach (System Prompt 1) yielded extremely poor results and was quickly discarded from formal evaluation. Our hypothesis is that instructing the model to rely on its general, pre-trained “marketing knowledge” created a direct conflict with the specific, mathematical classification objective of our synthetic dataset. The model was trying to solve the wrong problem.

Table 4: Gemini System Prompt Comparison

Sys Prompt	Overall F1	Negative F1	Average F1	Positive F1	Epochs	LR Multiplier	Adapter Size
2	0.3381	0.0998	0.8333	0.0812	5	1	Four
3	0.3323	0.0897	0.8247	0.0824	5	1	Four

As shown in Table 4, the two rule-based prompts performed almost identically, confirming that the model understood the task equally well with either a text or integer output format. This confirmed that the prompt strategy was not the bottleneck; the performance ceiling was due to more fundamental architectural limitations of the managed workflow.

Analysis: Hamstrung by the LoRA and Generative Combination

The results table paints a clear and consistent picture across all datasets. Despite our experimentation with hyperparameters—increasing epochs, raising the learning rate multiplier, and using a larger adapter size and even changing the model to the pro and lite versions—we saw no significant breakthrough. Here are the key findings and the perceived reasons behind them.

1. Internal Models Reign Supreme

The most immediate takeaway is the stark performance gap. On our high-performing datasets (V15, V25, and V26), our internal models achieve outstanding Overall F1-scores approaching or exceeding 0.90. The fine-tuned Gemini model, in contrast, struggles to score above 0.34. This highlights the enduring power of a specialized classifier that has been purpose-built and refined for a specific task.

2. The LLM Falls into the “Average” Trap

The class-level F1-scores reveal the core issue. Across all experiments, the Gemini model shows a strong bias toward the majority “Average” class. On dataset V15, for example, it achieves a respectable F1-score of around 0.83 for “Average”, but performs extremely poorly on the minority classes, with scores falling below 0.10.

This behaviour is a classic symptom of learning on imbalanced data. Because the model is trained to generate a text label, it quickly learns that predicting the most common class is the safest way to minimize its training loss. Within the constraints of LoRA-based fine-tuning, the model tends to rely on this shortcut rather than learning the more subtle patterns that distinguish the minority classes.

It’s possible that training for many more epochs could gradually improve these results. However, doing so would significantly increase training time and cost, and given the large performance gap compared to our internal models, it was not a trade-off we considered worthwhile.

3. Specialization Wins at the Edges

In stark contrast, our internal models demonstrate robust and balanced performance. They post high F1-scores not only for the majority class but also for the difficult “Negative” and “Positive” classes. This shows their ability to learn the nuanced features that distinguish these minority groups, a capability that is essential for providing real business value.

4. Architecture, Not Just Parameters, Is Key

The tweaks to LoRA hyperparameters were not enough to bridge the fundamental performance gap. The issue is architectural. The combination of a purely generative training objective (producing a text label) and the inherent limitations of a “black box” PEFT service proved to be an insurmountable obstacle. The approach, while flexible, is less efficient for this highly specific classification task than a traditional discriminative model explicitly designed to optimize a classification objective (like cross-entropy loss).

Conclusion for the Gemini Workflow

Our experiment with the fully managed Gemini pipeline was an invaluable exercise. It provided an accelerated path to establish a performance baseline and was instrumental in building our team’s initial LLM capabilities. However, for our specific, highly imbalanced classification task, it is not a viable replacement for our existing, specialized models.

The key lesson is that while LLMs are incredibly powerful, they are not a universal solution. The managed, “black box” nature of the Gemini fine-tuning workflow, which treats classification as a text-generation task, proved to be a fundamental architectural mismatch for our needs. This approach prevented the model from effectively learning the nuanced features of the minority classes—a problem that could not be solved by simply adjusting standard hyperparameters like learning rate, model size, or prompt strategy.

Round 2: The Custom Highway with Gemma

After hitting a performance ceiling with the managed Gemini pipeline, it was clear we needed more control. The “black box” nature of the service, combined with its generative training objective, was preventing us from tackling the core issue of class imbalance.

For our second round, we moved to the “Custom Highway”: fine-tuning Gemma, Google’s family of open-weights models, using Vertex AI Custom Training. This approach represents the perfect middle ground, granting us the granular control of a custom training script while still leveraging the scalability and convenience of cloud infrastructure.

The Setup: Full Fine-tuning with strategic layer freezing

A critical departure from the Gemini workflow was abandoning the constraints of PEFT/LoRA. Instead, we opted for full fine-tuning, where the model’s core weights are made trainable. This custom approach unlocked a new level of control, giving us direct access to the core mechanics of the training process. The “levers” at our disposal were no longer just dials on a managed service but fundamental architectural and hyperparameter choices:

Epochs: The total number of times the model sees the entire training dataset. We now had direct control over the full training cycle, giving us a clearer understanding of the risk of under-training versus overfitting.
Learning Rate: This determines the size of the adjustments made to the model’s weights. We could now set the absolute learning rate (e.g., 2e-4, 1e-5), giving us far more precision than the simple “multiplier” available in the managed Gemini service.
Layer Freezing Strategy: This became our primary architectural lever for regularization. To manage the power of full fine-tuning and prevent catastrophic forgetting, we employed this strategic technique. By freezing the bottom 14 layers, we anchored the model to its powerful, pre-trained foundation of general language understanding while allowing the upper, more task-specific layers full flexibility to adapt.
Inference-Time Architecture (LogitsProcessor): Our most powerful adaptation was a change to the model’s behavior at the moment of inference. This custom LogitsProcessor is a function that intercepts the model’s output and works as follows:
1. Intercept: After the model processes the input prompt and is ready to generate an output, the LogitsProcessor steps in.
2. Filter: It examines the logits—the raw probability scores for every token in the model’s vocabulary.
3. Constrain: It programmatically sets the probability of all tokens to zero, except for the specific tokens corresponding to our allowed class labels: “0”, “1”, and “2”.
4. Decide: The model is thus forced to choose from this tiny, pre-approved set of tokens.

This final technique is a game-changer. It effectively transforms the generative model into a true classifier at the moment of prediction. Instead of learning to generate the text of a label, it learns to assign the highest probability to the single correct token from a fixed list, directly addressing the architectural flaw we identified in the Gemini workflow.

Experiments and Results: The Journey to Parity

With this powerful new tool in our arsenal, we began a methodical search for the best configuration. Arriving at the optimal result wasn’t a single step but a journey of experimentation with key hyperparameters.

Hyperparameter Exploration: Finding the “Sweet Spot”

The table below details our hyperparameter exploration on the V15 dataset. It illustrates the journey from an unusable baseline, through under-training and over-training, to finding the optimal “sweet spot” for epochs and learning rate that allowed Gemma to rival our best internal model.

Table 5: Gemma Hyperparameter Exploration (Dataset V15)

Model	Overall F1	Negative F1	Average F1	Positive F1	Epochs	LR
Internal (Best)	0.9280	0.9215	0.9807	0.8818	N/A	N/A
Gemma 3.1B (Over-trained)	0.4440	0.3590	0.6035	0.3693	20	1e-5
Gemma 3.1B (Optimal)	0.9015	0.9091	0.9742	0.8213	8	2e-4
Gemma 3.1B (Under-trained)	0.5943	0.5206	0.8070	0.4553	3	1e-5
Gemma 3.1B (Baseline)	0.0523	0.0000	0.0001	0.1569	0	N/A

This exploration reveals several crucial insights:

Fine-Tuning is Essential, But Not Foolproof: The baseline Gemma model was completely ineffective. Just three epochs of fine-tuning yielded a massive performance jump. However, the run with 20 epochs shows a classic case of overfitting, where the model begins to memorize the training data and performance collapses.
The Optimal Configuration: The run with 8 epochs and a learning rate of 2e-4 proved to be the sweet spot. With an Overall F1-score of ~0.90, this fine-tuned Gemma model achieved performance remarkably close to our best internal model (0.9280). Most importantly, the class-level F1-scores are strong and balanced, demonstrating that our LogitsProcessor had successfully forced the model to learn the minority classes.

Architectural Choices Vindicated

This success wasn’t just about hyperparameters. We also validated two other key architectural choices.

Table 6: Layer Freezing & Model Scaling Comparison

Dataset	Model	Note	Overall F1	Negative F1	Average F1	Positive F1
V14	Gemma 3.1B	Default (14 Layers Frozen)	0.3154	0.0039	0.9043	0.0379
V14	Gemma 3.1B	Fewer (8 Layers Frozen)	0.3039	0.0013	0.9103	0.0000
—	—	—	—	—	—	—
V15	Gemma 3.1B (Optimal)		0.9015	0.9091	0.9742	0.8213
V15	Gemma 3.4B FT	Larger Model	0.8663	0.8739	0.9659	0.7592

As shown in Table 6, our default strategy of freezing the bottom 14 layers of the Gemma model proved slightly superior to freezing fewer layers, suggesting it provides a valuable regularizing effect. Furthermore, we found that a larger Gemma 4B model did not outperform our optimal 1.1B model. This is a powerful reminder that for a given task, there is a “sweet spot” for model size, and more parameters are not always the answer.

Table 7: Gemma Performance Across All Datasets

Dataset	Model Type	Overall F1	Negative F1	Average F1	Positive F1	Epochs	LR	Sys Prompt
V14	Internal (Best)	0.4059	0.2225	0.8191	0.1760	N/A	N/A	N/A
V14	Gemma (FT, Default Frozen)	0.3154	0.0039	0.9043	0.0379	8	0.0002	3
V14	Gemma (FT, Fewer Frozen)	0.3039	0.0013	0.9103	0.0000	8	0.0002	3
—	—	—	—	—	—	—	—	—
V15	Internal (Best)	0.9280	0.9215	0.9807	0.8818	N/A	N/A	N/A
V15	Gemma (Baseline)	0.0523	0.0000	0.0001	0.1569	0	N/A	3
V15	Gemma (FT, Optimal)	0.9015	0.9091	0.9742	0.8213	8	0.0002	3
—	—	—	—	—	—	—	—	—
V25	Internal (Best)	0.8947	0.9125	0.8595	0.9122	N/A	N/A	N/A
V25	Gemma (FT, Default Frozen)	0.8527	0.9052	0.8303	0.8226	8	0.0002	3
—	—	—	—	—	—	—	—	—
V26	Internal (Best)	0.8884	0.9113	0.8526	0.9014	N/A	N/A	N/A
V26	Gemma (FT, Default Frozen)	0.8845	0.9059	0.8470	0.9005	8	0.0002	3

Bonus: What Did the LLM Actually Learn?

After achieving parity, we wanted to understand what the fine-tuned LLM had learned. We conducted a feature ablation study, re-training our best Gemma model multiple times while removing one feature at a time. A larger drop in performance indicates a more important feature.

Table 8: Gemma Feature Importance (Dataset V15)

Feature Removed	Overall F1	Negative F1	Average F1	Positive F1	Performance Drop
Brand	0.8449	0.8067	0.9604	0.7677	0.0647
Platform	0.8681	0.8673	0.9627	0.7742	0.0416
Creative	0.8719	0.8617	0.9641	0.7898	0.0378
Audience	0.8798	0.8622	0.9668	0.8105	0.0299
Geography	0.9053	0.9098	0.9745	0.8317	0.0044
—	—	—	—	—	—
None (Full Model)	0.9097	0.9099	0.9753	0.8439	0.0000

The results are remarkably clear and align with marketing intuition: Brand, Platform, and Creative are the most critical predictors of campaign success. This analysis provides invaluable insight, confirming that the LLM is learning logical, interpretable patterns and can be used not just for prediction, but for generating deep, actionable business insights.

Analysis: How Control and Architecture Led to a Breakthrough

The Gemma results stand in stark contrast to our experience with the managed Gemini pipeline. The success story here is not just about the final numbers but about how we achieved them.

1. Parity Achieved: From Unusable to Unbeatable

This is the headline finding. The baseline Gemma model was completely unusable, with an F1-score near zero. However, after methodical fine-tuning, its performance skyrocketed to ~0.90, virtually matching the performance of our highly specialized internal model (~0.93). This is a monumental result, demonstrating that with the right approach, a fine-tuned open-weights LLM can indeed close the gap and compete with a purpose-built classifier.

2. Escaping the “Average” Trap: The LogitsProcessor as a Game-Changer

The catastrophic failure on minority classes we saw with Gemini is gone. Table 5 shows that the optimized Gemma model’s F1-scores are strong and balanced across all classes, including the difficult “Negative” (0.9091) and “Positive” (0.8213) categories. This directly validates our hypothesis: the architectural choice to use a LogitsProcessor was the silver bullet. It prevented the model from defaulting to the majority class and forced it to learn the distinguishing features of the minority classes, a feat the generative-only Gemini workflow could not accomplish.

3. Architectural Choices Are Vindicated

This successful tuning process proves that the right architecture is more important than just turning dials. The control afforded by the custom pipeline was fundamental.

Full Fine-Tuning over PEFT: The decision to use full fine-tuning (with strategic layer freezing) instead of the more constrained LoRA method gave the model the necessary capacity to learn our task.
The Power of the LogitsProcessor: This allowed us to reshape the model’s behavior at its core, turning it into a true classifier.

These architectural choices created a sound foundation where hyperparameter tuning could have a meaningful impact, unlike in the Gemini workflow where tuning was like trying to polish a flawed design.

Conclusion for the Gemma Workflow

The Gemma experiment was a resounding success. It proved that by taking a “middle ground” approach—combining the flexibility of an open-weights model, the power of full fine-tuning with strategic layer freezing, and the architectural rigor of a LogitsProcessor—we can build LLM-based solutions that are truly competitive with established, specialized models. This workflow has become a cornerstone of our internal LLM capability, providing a powerful and versatile template for future projects.

Round 3: The Hands-On Workshop with Qwen

After achieving parity with Gemma, we took the final step in our journey: descending into the “Hands-On Workshop.” The goal of this round was to take ultimate control of the training process by setting up a local environment on a Google Cloud VM to fine-tune Qwen3-4B-Base, an open-weights model.

It is important to note that this workflow was the least explored in terms of exhaustive hyperparameter optimization. Our primary goal here was to establish a proof of concept: could we successfully fine-tune a different open-weights model in a fully custom environment and test our most radical architectural idea yet?

The Setup: A Code Walkthrough of Surgical Precision

The local workshop environment gave us unparalleled control over the entire training stack. Let’s walk through the key implementation details in our Python scripts to see how this was achieved.

Step 1: Efficient PEFT with Unsloth (model.py)

Interestingly, this workflow marked a return to a Parameter-Efficient Fine-Tuning (PEFT) method, specifically LoRA. However, this was fully transparent and controllable, unlike the managed Gemini service. We used the unsloth library, which provides a highly optimized implementation of LoRA.

As seen in model.py, we load the model with 4-bit quantization and prepare it for LoRA training with a single, efficient command:

# From model.py
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_id,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit =True,
)

This embraces PEFT as a deliberate choice for efficiency on a single GPU, not as a service-imposed constraint.

Step 2: Direct Architectural Modification

Our most powerful adaptation was performing “surgery” on the model’s architecture. Before applying the LoRA adapters, we fundamentally reshaped the LLM into a direct classifier. The prepare_model_for_classification_head() function in model.py executes this surgical strike.

# From model.py
def prepare_model_for_classification_head(model, num_classes=3):# Get the lm_head's weights
    lm_head = model.get_output_embeddings()# Trim the vocabulary to the number of classes
    trimmed_lm_head_weight = lm_head.weight[:num_classes, :]# Create a new head with the trimmed weights
    new_lm_head = torch.nn.Linear(
        in_features=trimmed_lm_head_weight.shape[1],
        out_features=trimmed_lm_head_weight.shape[0],
        bias=False
    )
    new_lm_head.weight.data = trimmed_lm_head_weight# Set the model's new head
    model.set_output_embeddings(new_lm_head)

This code trims the model’s final output layer (lm_head) from tens of thousands of potential tokens down to just three outputs. The model is no longer physically capable of generating text beyond our class labels. It is now, by its very architecture, a true classification model.

Step 3: Training with a Custom Data Collator (trainer.py)

With a classification head in place, our trainer.py script uses a custom DataCollatorForLastTokenLM. Its job is to ensure that during training, the loss is calculated only on the model’s prediction for the final token, forcing the model to focus all its learning capacity on the classification task.

Step 4: Post-Inference Logic with Confidence Boosting (evaluator.py)

Finally, to combat the class imbalance we knew would still exist in the model’s learned weights, our evaluator.py script implements a post-processing --boost flag.

# From evaluator.py
if args.boost:# Manually increase logits for minority classes
    logits[:,0] = logits[:,0] *2.5
    logits[:,2] = logits[:,2] *2.5

This code intercepts the raw logits from the model’s output and multiplies the scores for the minority “Underperforming” (0) and “Overperforming” (2) classes. This artificially “boosts” their chances of being selected before the final probabilities are calculated, giving us a direct tool to counteract the model’s learned biases.

Experiments and Results: A Powerful Technique with Mixed Results

As this was a proof of concept, our experimentation was focused on these new architectural changes. Table 9 shows the progression on the V15 dataset as we applied each new technique.

Table 9: Qwen Architectural Progression (Dataset V15)

Model	Note	Overall F1	Negative F1	Average F1	Positive F1
Internal (Best)		0.9280	0.9215	0.9807	0.8818
Qwen3 (Baseline)		0.2470	0.1453	0.4902	0.1056
Qwen3 (Fine-Tuned)	Trimmed LM Head	0.4468	0.1919	0.8824	0.2661
Qwen3 (Fine-Tuned)	+ Boosted	0.4668	0.2412	0.8716	0.2876

To confirm if these results were consistent, we ran our best configuration (fine-tuned with boosting) across our other high-quality datasets.

Table 10: Qwen Performance Across All Datasets

Dataset	Model Type	Note	Overall F1	Negative F1	Average F1	Positive F1
V15	Internal (Best)		0.9280	0.9215	0.9807	0.8818
V15	Qwen3 (FT)	+ Boosted	0.4668	0.2412	0.8716	0.2876
—	—	—	—	—	—	—
V25	Internal (Best)		0.8947	0.9125	0.8595	0.9122
V25	Qwen3 (FT)	+ Boosted	0.3480	0.2380	0.5733	0.2327
—	—	—	—	—	—	—
V26	Internal (Best)		0.8884	0.9113	0.8526	0.9014
V26	Qwen3 (FT)	+ Boosted	0.3731	0.2907	0.5766	0.2519

Analysis: An Insightful Technique That Underperformed

The cross-dataset results tell a clear and consistent story.

1. Architectural Promise, But Performance Lags: The direct classification approach showed significant promise in Table 9, far exceeding the baseline. However, Table 10 shows that even our best configuration consistently underperformed against both our internal models and the optimized Gemma pipeline, often by a margin of 0.50 or more on the Overall F1 score.

2. Boosting Helps, But It’s a Patch, Not a Fix: The confidence boosting technique consistently provided a small but measurable improvement (as seen in Table 9). This confirms its value as a tool for combating imbalance. However, the gains were incremental and insufficient to make the model competitive. It was a useful patch applied after the fact, not a fundamental solution like the Gemma workflow’s LogitsProcessor.

3. Untapped Potential: The Confounding Effect of Limited Tuning

While this workflow was an exciting exploration, a direct comparison of Qwen’s final performance to Gemma’s is not an apples-to-apples comparison, and we must be cautious in our conclusions. Several factors could explain the performance gap:

The Primary Factor: Insufficient Optimization. This is the most likely reason for the performance difference. The Qwen workflow was run as a proof of concept with minimal hyperparameter tuning (e.g., only one training epoch was tested). The Gemma model, in contrast, underwent a methodical search for its optimal configuration. It is entirely possible that with more epochs and a proper learning rate schedule, the Qwen model’s performance could have improved significantly.
A Plausible Secondary Hypothesis: Base Model Alignment. A secondary, though less certain, hypothesis is that the pre-trained knowledge within the Qwen3-4B-Base model may have been less aligned with our specific marketing domain than the knowledge within Gemma 3-1b-it. Fine-tuning can only adapt what’s already there; if a model’s starting point is further away, it may require more effort to reach a high-performance state.
Architectural Differences: Finally, the entire training architecture was different (transparent LoRA + trimmed head vs. full fine-tuning + LogitsProcessor). This difference in methodology is a major variable in itself.

Conclusion for the Qwen Workflow

The local Qwen workflow was a valuable scientific endeavor that successfully achieved its primary goal: to serve as a proof of concept for our most hands-on approach. It gave us maximum control, allowing us to validate a powerful new technique of trimming the lm_head for direct classification and introducing confidence boosting as a viable tool.

While its performance did not match our optimized Gemma pipeline in this specific horse race, this was expected given the limited hyperparameter tuning. Therefore, the results should not be seen as a definitive verdict on the Qwen3-4B-Base model’s potential, but rather as a promising baseline from a novel architecture that would require further optimization to be truly competitive. The knowledge gained has equipped our team with deep customization capabilities, preparing us for future projects where such surgical control is paramount.

Final Showdown

After three rounds of experimentation, from the managed expressway to the hands-on workshop, the final results are in. The table below presents the single top-performing model from each workflow, providing a clear, “best-of-the-best” comparison that directly answers our core question.

Final ‘Best-of-the-Best’ Performance Summary

The Verdict: Lessons from the Race

So, can fine-tuned LLMs beat custom classifiers? Yes—but only if you have the right tools and the control to use them. The table above tells the whole story.

Control is King. This was the single most important factor. The “Managed Expressway” (Gemini) was fast, but its “black box” nature locked us into a failing strategy. The “Custom Highway” (Gemma) gave us the control to diagnose and fix the core problem, leading to our breakthrough. Ultimate control (Qwen) offered deep insights but required more optimization. The clear lesson is that for complex tasks, you must have enough control to shape the architecture to your problem.
Full Fine-Tuning vs. PEFT (LoRA): Power vs. Efficiency. Our experiment provides a clear picture of the trade-offs. The full fine-tuning approach with Gemma provided the raw power needed for the model’s weights to fully adapt to our nuanced task, ultimately delivering the winning performance. In contrast, the LoRA-based approach in the managed Gemini workflow was too constrained to succeed. However, our Qwen experiment showed that a transparent and controllable LoRA implementation (via unsloth) is a highly efficient and viable option for resource-constrained environments, even if it doesn’t reach the absolute performance peak of full tuning without further optimization. The choice is a strategic one: seek maximum performance with full tuning, or opt for efficiency with a well-controlled PEFT approach.
Generative Models Need “Guardrails” for Classification. A standard LLM’s goal is to predict the next word, which causes it to favor the majority class in an imbalanced classification task. Our successful workflows show that you must implement “guardrails” to force a true discriminative choice. The LogitsProcessor in our Gemma workflow and the trimmed classification head in the Qwen workflow are two different, powerful examples of such guardrails.
The “How” is More Important Than the “What.” Ultimately, our success was not about choosing “Gemma” over “Gemini.” It was about choosing a custom, controlled process over a managed one. A well-engineered pipeline on an open-weights model proved far superior to a less-controlled approach on a state-of-the-art model.

_{Disclaimer: This content was created with AI assistance. All research and conclusions are the work of the WPP AI Lab team.}

March 24, 2026

test

March 24, 2026