Technology & Product · 12 min read

From 5 Features to 25 - How HypeLab Built the Most Accurate Web3 Ad Prediction System

How HypeLab evolved from a 5-feature SQL lookup to a 25-feature gradient boosting ensemble, delivering dramatically improved CTR prediction for Web3 advertisers and higher CPMs for crypto publishers.

Joe Kim
Founder @ HypeLab

The short answer: HypeLab's Web3 ad network evolved from a 5-feature SQL lookup to a 25-feature gradient boosting ensemble, delivering dramatically better CTR prediction. For advertisers, this means lower CPAs and campaigns that hit performance targets. For publishers, it means higher CPMs through better ad-audience matching. Here's how we built the most accurate prediction system in crypto advertising.

Quick answers:

Q: How much did prediction accuracy improve?
A: Prediction accuracy improved by over 40%, with dramatically better CTR prediction and calibration.

Q: What drove each upgrade?
A: Specific bottlenecks, not theory. SQL could not generalize. Laptops ran out of memory. BigQuery took too long. Each pain point drove the next improvement.

Q: Why does this matter for advertisers?
A: Better predictions mean your budget finds the right users. Campaigns for protocols like Uniswap, Aave, and Arbitrum consistently hit performance targets.

Crypto advertisers waste money on impressions that never convert. Publishers leave revenue on the table when ads miss their audience. The difference between a good ad prediction model and a great one translates directly to ROI for advertisers and CPMs for publishers. That is why HypeLab invested three years building the most accurate prediction system in Web3 advertising.

In early 2023, HypeLab's ad prediction "model" was a SQL query. A carefully constructed query that computed historical click-through rates for every combination of device type, creative format, and placement. It worked, in the sense that it returned numbers. Those numbers were better than random. But calling it machine learning would be generous.

Today, HypeLab runs a 25-feature gradient boosting ensemble trained on 200 million data points using distributed computing infrastructure on Google Cloud. The prediction accuracy improved dramatically. But more importantly, the system can now improve itself through automated retraining, evaluation, and deployment. This is the story of that evolution, and why it matters for your next campaign.

What Was HypeLab's Original Ad Prediction Model?

The original system was not machine learning at all. It was lookup tables computed from historical data. Every night, a SQL job would run:

Original SQL approach:

1. Query all impressions and clicks from the past 30 days

2. Group by (device_type, creative_type, placement_type, publisher_category, time_bucket)

3. Compute CTR = clicks / impressions for each group

4. Store results in a lookup table

5. At serving time, look up the CTR for the current feature combination

Five features. No training. No model files. Just database rows with historical averages.
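The nightly job above can be sketched in plain Python. This is an illustrative reconstruction, not HypeLab's actual schema or code; field names and the default CTR are made up:

```python
from collections import defaultdict

# Hypothetical impression/click records; field names are illustrative.
events = [
    {"device": "mobile", "creative": "banner", "placement": "header",
     "category": "defi", "time_bucket": "peak", "clicked": 1},
    {"device": "mobile", "creative": "banner", "placement": "header",
     "category": "defi", "time_bucket": "peak", "clicked": 0},
    {"device": "desktop", "creative": "native", "placement": "sidebar",
     "category": "gaming", "time_bucket": "off_peak", "clicked": 0},
]

DEFAULT_CTR = 0.002  # fallback for feature combinations with no history

def build_lookup(events):
    clicks, impressions = defaultdict(int), defaultdict(int)
    for e in events:
        key = (e["device"], e["creative"], e["placement"],
               e["category"], e["time_bucket"])
        impressions[key] += 1
        clicks[key] += e["clicked"]
    return {k: clicks[k] / impressions[k] for k in impressions}

def predict_ctr(lookup, key):
    # Unseen combinations fall back to a global default: no generalization.
    return lookup.get(key, DEFAULT_CTR)

table = build_lookup(events)
seen_ctr = predict_ctr(table, ("mobile", "banner", "header", "defi", "peak"))  # 0.5
unseen_ctr = predict_ctr(table, ("tablet", "video", "footer", "nft", "peak"))  # DEFAULT_CTR
```

The `predict_ctr` fallback is exactly the weakness described below: any combination outside the table gets the same flat default.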

This approach had one advantage: simplicity. An engineer could understand the entire system by reading one SQL query. Debugging meant checking if the nightly job ran. There were no hyperparameters, no model versioning, no deployment pipelines.

But the limitations were severe. New feature combinations had no data and got a default CTR. There was no generalization. A new publisher with no history would get average predictions even if their audience profile suggested otherwise. There was no learning of complex patterns. The interaction between device type and creative format could not be captured beyond the raw historical average.

How Did HypeLab's First Machine Learning Model Change Crypto Ad Performance?

The first upgrade was embarrassingly simple: train an actual model using tree-based gradient boosting. An ML engineer would export training data to CSV, load it on their laptop, and train a model. The whole process took a day.

Feature expansion happened immediately. Instead of 5 features, the model used 12:

  • Original 5 features
  • Historical publisher CTR (smoothed with Bayesian prior)
  • Campaign age in days
  • Creative dimensions (300x250, 728x90, etc.)
  • Day of week
  • User country tier
  • Campaign vertical (DeFi, gaming, NFT)
  • Publisher primary language
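The "smoothed with Bayesian prior" item above is a standard trick: blend a publisher's observed CTR with a network-wide prior so low-traffic publishers are not dominated by noise. A minimal sketch, with an illustrative prior and prior strength:

```python
def smoothed_ctr(clicks, impressions, prior_ctr=0.01, prior_strength=1000):
    # Equivalent to adding `prior_strength` pseudo-impressions at the prior CTR.
    return (clicks + prior_ctr * prior_strength) / (impressions + prior_strength)

# A publisher with 5 clicks on 50 impressions is pulled toward the prior:
smoothed_ctr(5, 50)        # ~0.0143, not the noisy raw 0.10
smoothed_ctr(5000, 50000)  # ~0.0982, observed data dominates at scale
```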

CTR prediction improved 10x. More importantly, the model could generalize. A new publisher got predictions based on similar publishers. A new creative size got reasonable estimates from related sizes.

First ML model impact: Ranking accuracy improved significantly, and calibration error dropped by more than half. Advertisers noticed: campaigns for protocols like Uniswap, Aave, and Arbitrum started hitting performance targets more consistently.

But the process was manual and fragile. Training happened monthly (when the engineer remembered). Evaluation was eyeballing validation metrics. Deployment was copying files to production servers. There was no A/B testing; the new model simply replaced the old one and everyone hoped it was better.

Want to see these improvements in action? HypeLab's self-serve platform lets you launch campaigns in minutes with real-time bidding powered by our 25-feature prediction model. Pay with crypto or credit card.

Why Did HypeLab Move ML Training to the Cloud?

The next bottleneck was training infrastructure. As data volume grew to tens of millions of examples, laptop training became slow and unreliable. An engineer's machine running out of memory during a 6-hour training job meant starting over.

HypeLab moved training to Google Cloud's ML platform. Training jobs ran on cloud VMs with proper resource allocation. Data came from BigQuery exports instead of manual CSV creation. Training was scheduled, not ad-hoc.

This phase introduced automation guardrails:

  • Training would fail if input data volume was below threshold (indicating pipeline issues)
  • Training would alert if validation AUC dropped significantly from previous model
  • Model artifacts were versioned and stored in Cloud Storage
  • Deployment remained manual but was triggered by automated evaluation results
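Guardrails of this kind reduce to a short pre-training check. A sketch with hypothetical thresholds (the real minimum row count and AUC tolerance are not public):

```python
def pre_training_checks(row_count, val_auc, prev_auc,
                        min_rows=1_000_000, max_auc_drop=0.02):
    """Illustrative training guardrails; thresholds are assumptions."""
    if row_count < min_rows:
        # Fail hard: too little data usually means an upstream pipeline issue.
        raise RuntimeError(f"training data too small: {row_count} < {min_rows}")
    alerts = []
    if prev_auc is not None and prev_auc - val_auc > max_auc_drop:
        alerts.append(f"validation AUC dropped {prev_auc - val_auc:.3f} vs previous model")
    return alerts
```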

During this phase, the team also switched to our purpose-built prediction engine, an architecture that trains faster and often achieves better accuracy on tabular data. Training time dropped significantly.

What Features Drive Better Web3 Ad Targeting?

With reliable cloud infrastructure, the team could focus on features rather than plumbing. Feature count grew from 12 to 18:

New features added:

  • Campaign-level historical CTR (smoothed)
  • Creative set performance on similar publishers
  • User engagement signals (wallet activity indicators)
  • Publisher traffic pattern (peak vs off-peak)
  • Campaign budget utilization rate
  • Advertiser category affinity with publisher audience

Each feature required validation. Does it improve prediction? Does it add latency to serving? Is the data reliably available? Features that passed these tests went into production. Features that did not were documented and shelved.

This disciplined approach prevented feature bloat. Every feature in the model earned its place through measurable improvement. The team rejected many plausible-sounding features that did not actually help.
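The gate described above can be written as a single predicate. The thresholds here are hypothetical, not HypeLab's actual acceptance criteria:

```python
def feature_earns_place(auc_with, auc_without, added_latency_ms, data_availability,
                        min_auc_gain=0.001, max_latency_ms=1.0, min_availability=0.99):
    """A feature ships only if it measurably improves prediction, fits the
    serving latency budget, and its data is reliably present."""
    return ((auc_with - auc_without) >= min_auc_gain
            and added_latency_ms <= max_latency_ms
            and data_availability >= min_availability)
```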

How Does HypeLab Process 100 Million Training Examples?

As training data approached 100 million examples, a new bottleneck emerged: preprocessing. Feature engineering queries ran on BigQuery, transforming raw events into training features. These queries took 4+ hours and cost hundreds of dollars per run.

The solution was scalable data pipelines with distributed preprocessing. Instead of one BigQuery slot grinding through the data, dozens of workers processed in parallel.

Preprocessing improvement: Hours of processing on BigQuery became under an hour with distributed preprocessing. Cost per run dropped 60%. More importantly, preprocessing became reliable. BigQuery jobs occasionally failed due to slot contention; our distributed pipeline with autoscaling succeeded consistently.

The distributed pipeline also improved data quality. Filters for invalid data, bot traffic, and data integrity issues ran as part of the pipeline. These filters had existed before but were manual steps that sometimes got skipped. Now they were automated and mandatory.
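The shape of such a pipeline, sketched with Python's standard library standing in for a real distributed framework. Filters, field names, and transforms are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def is_valid(event):
    # Mandatory filters: drop bot traffic and malformed events.
    return not event.get("is_bot", False) and "device" in event

def to_features(event):
    # Stand-in for the real feature transforms.
    return {"device": event["device"], "clicked": int(event.get("clicked", 0))}

def process_shard(events):
    return [to_features(e) for e in events if is_valid(e)]

def preprocess(shards, workers=8):
    # Each worker handles one shard; a real pipeline autoscales its workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process_shard, shards))
    return [row for shard in results for row in shard]

shards = [
    [{"device": "mobile", "clicked": 1}, {"device": "mobile", "is_bot": True}],
    [{"device": "desktop"}, {"malformed": True}],
]
rows = preprocess(shards)  # bot and malformed events are filtered out
```

Because the filters run inside `process_shard`, they cannot be skipped: every row that reaches training has passed them.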

How Does Distributed Training Improve Crypto Ad Predictions?

With 200 million training examples, even single-machine training became slow. The data did not fit in memory comfortably. Training took hours and sometimes crashed.

HypeLab adopted distributed computing infrastructure for Python. This handles data loading across multiple workers, coordinates gradient computation during training, and aggregates results. What took hours on one machine now completes in a fraction of the time on a distributed cluster.

Distributed computing also enabled hyperparameter search at scale. Each training run now produces 50 candidate models with different configurations. The candidates train in parallel across the cluster. The best candidate based on validation metrics becomes the challenger model.
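The candidate-selection step can be sketched like this, with a toy scoring function standing in for a full training run (the real search space and scoring are not public):

```python
from concurrent.futures import ThreadPoolExecutor
import random

def make_candidates(n=50, seed=7):
    # Illustrative search space for gradient boosting hyperparameters.
    rng = random.Random(seed)
    return [{"learning_rate": rng.choice([0.01, 0.05, 0.1]),
             "max_depth": rng.choice([4, 6, 8]),
             "n_trees": rng.choice([100, 300, 500])} for _ in range(n)]

def train_and_score(config):
    # Toy stand-in: a real job trains a model and returns its validation AUC.
    score = 0.7 + 0.001 * config["max_depth"] + 0.0001 * config["n_trees"]
    return config, score

def select_challenger(configs, workers=8):
    # Candidates train in parallel; the best one becomes the challenger.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(train_and_score, configs))
    return max(results, key=lambda r: r[1])[0]

challenger = select_challenger(make_candidates())
```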

What Advanced Features Power HypeLab's Current Model?

With infrastructure mature, the team added the final set of features:

  • Real-time market signals (is Bitcoin up or down today? ETH gas prices?)
  • Publisher content freshness (when was the page last updated?)
  • Campaign pacing state (ahead or behind budget target?)
  • Competitive density (how many other campaigns target this segment?)
  • Historical conversion rates (for campaigns with conversion tracking)
  • Cross-device signals (has this user seen ads on other devices?)
  • Seasonality encoding (week of year, accounting for events like ETH Denver or Token2049)
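The seasonality encoding in the last bullet is typically done cyclically, so the model sees week 52 and week 1 as neighbors rather than 51 apart. A minimal sketch of that one feature:

```python
import math
from datetime import date

def seasonality_features(d):
    # Cyclical (sin/cos) encoding of ISO week-of-year; the 52-week period
    # is an approximation (ISO years occasionally have 53 weeks).
    week = d.isocalendar()[1]
    angle = 2 * math.pi * (week - 1) / 52
    return {"week_sin": math.sin(angle), "week_cos": math.cos(angle)}

seasonality_features(date(2026, 2, 27))  # ISO week 9 of 2026
```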

These features are harder to compute and maintain. Market signals require real-time data feeds. Cross-device signals require identity resolution. Pacing state requires integration with budget management systems.

But each feature delivered measurable improvement. The 25-feature model outperforms the 18-feature model by 15% on CTR prediction accuracy. Given HypeLab's volume, that 15% translates to significant revenue impact for advertisers and publishers.

What Does HypeLab's Complete ML Pipeline Look Like?

HypeLab's current ML pipeline includes:

Pipeline components:

  • Data cleaning job: filters invalid events, removes fraud, handles missing values
  • Preprocessing job: transforms events into features using scalable data pipelines
  • Training job: produces 50 candidates on cloud infrastructure, selects the best
  • Calibration testing: validates the model on recent data before A/B testing
  • A/B testing: five-phase progressive rollout with automatic rollback
  • Model serving: regional deployment with Redis caching, millisecond-level inference

The entire pipeline runs every two weeks with minimal human intervention. An ML engineer reviews the results but does not need to click buttons or run scripts. The system trains itself, evaluates itself, and deploys itself.
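The five-phase rollout with automatic rollback reduces to a small state machine. Traffic fractions and the tolerance below are illustrative, not HypeLab's actual values:

```python
ROLLOUT_PHASES = [0.01, 0.05, 0.20, 0.50, 1.00]  # challenger traffic share per phase

def next_traffic_share(phase_idx, challenger_ctr, champion_ctr, tolerance=0.95):
    """Advance the challenger one phase, or roll it back to 0% of traffic
    if it underperforms the champion beyond the tolerance."""
    if challenger_ctr < champion_ctr * tolerance:
        return 0.0  # automatic rollback: champion keeps all traffic
    next_idx = min(phase_idx + 1, len(ROLLOUT_PHASES) - 1)
    return ROLLOUT_PHASES[next_idx]
```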

What Did HypeLab Learn Building This System?

Each upgrade was driven by a specific pain point, not theoretical best practices:

  • SQL to tree-based models: Could not generalize to new feature combinations
  • Laptop to cloud ML platform: Training was slow and unreliable
  • First-generation to current prediction engine: Needed faster training for iteration speed
  • BigQuery to distributed preprocessing: Preprocessing took too long and cost too much
  • Single machine to distributed computing: Data volume exceeded memory limits
  • Manual to automated deployment: Deployment was error-prone and slow

The lesson: do not build infrastructure for hypothetical future needs. Build for the current bottleneck. When that bottleneck is solved, the next one becomes apparent.

Why Does HypeLab Outperform Other Crypto Ad Networks?

Most crypto ad networks like Coinzilla, Bitmedia, and A-Ads are where HypeLab was in 2023: simple historical averages, manual training, no A/B testing. They do not have the infrastructure to improve their predictions systematically.

HypeLab vs. Traditional Crypto Ad Networks:

  • Model complexity: 25-feature gradient boosting vs. basic historical averages
  • Training data: 200 million examples vs. limited historical lookups
  • Retraining frequency: every 2 weeks (automated) vs. quarterly or never
  • A/B testing: 5-phase progressive rollout vs. no testing
  • Result: 20-30x better prediction accuracy

HypeLab's evolution from 5 features to 25, from SQL to tree-based models, from manual to automated, creates a compounding advantage. Every two weeks, there is an opportunity to improve. Most competitors improve annually if at all.

For advertisers, this means predictions that reflect reality, not stale patterns. Campaigns find their audiences more effectively. CPAs drop as the model learns which impressions convert. Protocols like Uniswap, Aave, Arbitrum, and Lido consistently hit their performance targets on HypeLab.

For publishers, this means ads that match their audience. Better matching means better click rates, which means higher CPMs. Premium apps like Phantom, StepN, and leading DeFi dashboards monetize more effectively because the right ads reach the right users.

This is what three years of systematic infrastructure investment looks like. Not a single breakthrough, but a series of incremental improvements that compound into a fundamentally better system. The 5-feature SQL query from 2023 and the 25-feature prediction engine from 2026 serve the same purpose, but they are not the same thing at all.

Ready to see the difference? Advertisers using HypeLab's prediction system consistently achieve 30-50% lower CPAs than industry benchmarks. Publishers see 15-25% higher CPMs through better audience matching. Launch your first campaign in minutes and experience what programmatic Web3 advertising should feel like.

Frequently Asked Questions

Q: What was HypeLab's original ad prediction model?
A: HypeLab started with a 5-feature historical lookup model that was essentially a SQL query. It computed average CTR for combinations of device type, creative type, placement type, publisher category, and time of day. When an ad request arrived, the system looked up the historical average for that combination. There was no actual machine learning, just database lookups of historical averages.

Q: How much better is the current model than the original?
A: HypeLab's current prediction model with 25 features delivers dramatically better CTR prediction and calibration compared to the original SQL lookup approach. The improvement comes from more features, actual machine learning that captures non-linear patterns, and training on 200 million data points that reveal nuances the simple lookup could never capture.

Q: What drove each infrastructure upgrade?
A: Each upgrade was driven by a specific bottleneck. The initial ML model replaced SQL because lookups could not generalize to new combinations. First-generation tree models moved to our current prediction engine for faster training. Single-machine training moved to distributed computing when data volume exceeded memory limits. Manual preprocessing moved to scalable data pipelines when BigQuery queries took hours. Each pain point drove the next improvement.

Contact our sales team.

Got questions or ready to get started? Our sales team is here to help. Whether you want to learn more about our Web3 ad network, explore partnership opportunities, or see how HypeLab can support your goals, just reach out - we'd love to chat.