Methodology

From search mutation to AI-prescriptive influence.

A rigorous, reproducible methodology to measure and grow your share of voice in AI-generated answers — the way the engines actually work.

How does SkuLift measure and improve AI visibility?

SkuLift runs a five-phase loop — measure, analyze, recommend, execute, re-measure — scoring a position-weighted share of voice across four engines with N=5 sampling, an A/B/C query classification and a four-level SOV pyramid.

AI visibility is only manageable if it is measured rigorously, and most attempts to measure it are not rigorous at all. A single prompt to a single engine on a single day is an anecdote, not a metric. The SkuLift methodology exists to turn that anecdote into a number you can trust, trend and act on.

Rigour starts with reproducibility. Because generative answers vary from one sampling to the next, any honest measurement has to sample repeatedly, define its question set explicitly, and compute the same way every time. The methodology below is built around that constraint: multi-sampling to tame variance, a fixed query taxonomy to keep comparisons fair, and a position-weighted scoring formula drawn from peer-reviewed generative-engine research rather than a convenient heuristic.

Everything downstream depends on the measurement being sound. The analysis that finds your gaps, the recommendations that close them, and the re-measurement that proves the lift are only as credible as the baseline they rest on. This page documents the full method end to end so that a technical reader can verify it, a sceptical reader can challenge it, and an AI engine can cite it — which, fittingly, is exactly the standard we hold our clients' content to.

Parametric and live retrieval

Consider what the engines are doing under the hood. When a buyer asks a question, the model either answers from what it absorbed during training — its parametric knowledge — or it retrieves live sources and grounds its answer in them.

These are two different surfaces, won by two different kinds of work, and a methodology that ignores the distinction will optimize blindly. SkuLift measures both, because a brand can be strong in a model's trained knowledge yet absent from its live retrieval, or the reverse, and only seeing the split tells you which fight you are in.

The query taxonomy

Just as important is the query taxonomy. Not all questions are equal: some name your category directly, some describe a problem your product solves without naming it, and some are adjacent explorations a buyer makes on the way to a decision.

SkuLift classifies the query set so that share of voice is read in context — winning the product-adjacent questions a buyer actually starts with often matters more than winning the obvious branded query everyone already optimizes for. The classification is what stops the method from chasing the easy, low-value wins.
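As a minimal sketch, the taxonomy can be carried as a tag on each query so that share of voice is reported per class rather than as one blended figure. The mapping of the A, B and C labels to the three question types, the `Query` type, the example queries and the scores are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class QueryClass(Enum):
    # Assumed mapping of the A/B/C labels to the three question types.
    A = "names the category directly"
    B = "describes a problem without naming the product"
    C = "adjacent exploration on the way to a decision"


@dataclass(frozen=True)
class Query:
    text: str
    query_class: QueryClass


def sov_by_class(scores: dict[Query, float]) -> dict[QueryClass, float]:
    """Average a per-query score within each class, never across classes."""
    buckets: dict[QueryClass, list[float]] = defaultdict(list)
    for query, score in scores.items():
        buckets[query.query_class].append(score)
    return {cls: sum(vals) / len(vals) for cls, vals in buckets.items()}


# Hypothetical queries and scores.
scores = {
    Query("best tool in our category", QueryClass.A): 0.60,
    Query("top platforms in our category", QueryClass.A): 0.40,
    Query("why is my brand missing from AI answers", QueryClass.B): 0.20,
    Query("how do buyers compare options with a chatbot", QueryClass.C): 0.10,
}
by_class = sov_by_class(scores)
```

Reading the result per class makes it visible when a strong score on category-naming questions is hiding a weak score on the problem-led questions a buyer starts with.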

Throughout, the human stays in the loop. The agents measure, analyze and draft at machine speed, but every recommendation that would change your public presence passes an explicit human gate before anything is produced or published. The methodology is automated where automation is safe and human where judgment is required, which is what makes it both fast enough to keep pace with the engines and accountable enough to trust with brand-critical decisions.

The optimization loop

Measure, analyze, recommend, execute, re-measure.

The methodology is a closed five-phase loop. Each phase has a defined input, a defined output and a defined success criterion, so the whole cycle is auditable rather than aspirational.

1. Measure: Probe each engine with the defined query set under N=5 sampling to establish a position-weighted share-of-voice baseline.
2. Analyze: Decompose every answer: cited sources, mention prominence, sentiment, and the competitor sources that displaced you.
3. Recommend: Translate gaps into a ranked, answer-first backlog of content and authority moves weighted by expected impact.
4. Execute: Produce and publish the approved work under human validation, from on-site answer blocks to off-site authority signals.
5. Re-measure: Re-probe the same engines on cadence to attribute the lift and feed the result into the next cycle.

Five phases, one shared language

Measurement establishes the baseline across engines. Analysis decomposes every answer to find why a competitor was cited instead of you. Recommendation translates the gaps into a prioritized, answer-first backlog.

Execution produces and publishes the approved work under human validation. Re-measurement re-probes the same engines to attribute the lift and feed the next cycle. The five phases are deliberately the same vocabulary your team, our platform and this page all use, because a shared language is what keeps a method honest.

The first phase, measurement, is more than running a few prompts. It means assembling a representative query set that reflects how real buyers actually ask, sampling each query enough times to expose a stable signal rather than a single noisy draw, and recording not just whether your brand appeared but where, how prominently, and alongside which competitors and sources. A weak measurement phase quietly corrupts every phase after it, so the methodology front-loads its rigour here.

Analysis is where measurement becomes understanding. A raw share-of-voice number tells you that you are losing a query; analysis tells you why.

By decomposing the winning responses, the methodology surfaces the exact sources the engine trusted, the structure of the passages it lifted, and the authority signals that backed them. That diagnosis is what makes the recommendation phase targeted rather than speculative — you are not guessing what might help, you are treating the exact reasons a competitor was preferred.

The human gate

Recommendation and execution are deliberately separated by the human gate, and that separation is part of the method rather than an afterthought. The recommendation phase produces a clear, readable proposal: what to change, why, and what it should move.

The execution phase acts only on what a human has approved. Keeping the two distinct means the loop can move at machine speed up to the point of judgment and then slow to human speed exactly where judgment is required — the only responsible way to operate a system that touches a brand's public presence.
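The separation can be sketched as a filter that sits between the ranked backlog and execution. This is an illustration under assumptions, not the platform's implementation: the `Recommendation` type, its fields and the example titles are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    title: str
    expected_impact: float  # hypothetical 0..1 estimate used for ranking
    approved: bool = False  # set only by a human reviewer


def rank_backlog(recs: list[Recommendation]) -> list[Recommendation]:
    """Recommend phase: order the backlog by expected impact, highest first."""
    return sorted(recs, key=lambda r: r.expected_impact, reverse=True)


def human_gate(recs: list[Recommendation]) -> list[Recommendation]:
    """Only recommendations a human has approved pass on to execution."""
    return [r for r in recs if r.approved]


backlog = rank_backlog([
    Recommendation("Add answer-first block to pricing page", 0.4),
    Recommendation("Publish comparison guide", 0.7, approved=True),
    Recommendation("Refresh FAQ markup", 0.2, approved=True),
])
to_execute = human_gate(backlog)
```

Ranking happens before the gate and execution after it, so an unapproved item stays in the backlog, however high its expected impact.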

Re-measurement

Re-measurement, the fifth phase, is where the method earns its credibility. Anyone can claim a change helped; proving it requires re-probing the same engines, with the same query set and the same formula, and showing that the score moved.

Because the baseline and the re-measurement are produced identically, the difference between them is attributable to the work rather than to a change in counting. That discipline — measure, change one thing, re-measure the same way — is the scientific core of the method, and it is why a SkuLift result reads as evidence rather than a marketing claim.
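A minimal sketch of that discipline: the lift is computed only when both runs cover exactly the same queries, and is refused otherwise. The function name and the use of a simple mean of per-query deltas are assumptions for illustration.

```python
def measured_lift(baseline: dict[str, float], remeasured: dict[str, float]) -> float:
    """Mean score change, valid only when both runs used the same query set."""
    if baseline.keys() != remeasured.keys():
        raise ValueError("query sets differ: the delta would not be attributable")
    deltas = [remeasured[q] - baseline[q] for q in baseline]
    return sum(deltas) / len(deltas)


# Hypothetical per-query scores before and after one change.
lift = measured_lift({"q1": 0.2, "q2": 0.4}, {"q1": 0.3, "q2": 0.6})
```

Raising an error on a mismatched query set is the code-level version of refusing to compare numbers that were counted differently.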

Taken as a whole, the five phases form a discipline rather than a tool: a repeatable way to know where you stand, why, what to do about it, and whether it worked, run continuously and under human control. Everything else on this page — the pyramid, the formula, the cadence, the rigour — is in service of making each phase trustworthy enough that the loop can be relied upon to compound a position rather than merely report on one.

The continuous loop

What makes this a loop rather than a checklist is the feedback edge from re-measurement back to measurement. A static optimization assumes the world holds still; a loop assumes it moves, which is the realistic assumption for generative engines that are retrained, re-ranked and contested by competitors continuously. Closing that edge — proving each change and learning from it — is what compounds a first lift into a durable, defended position over successive cycles.

The SOV pyramid

A four-level share-of-voice pyramid.

Not every mention is worth the same. The pyramid orders four levels of AI visibility from mere presence to being the recommended default, so a score reflects quality of citation, not just quantity.

At the base is presence: your brand appears somewhere in an answer for a relevant question. One level up is citation: the engine not only mentions you but attributes a claim to your source, which is materially stronger because it signals the model trusts you enough to quote.

Higher still is prominence: your brand is featured early and centrally rather than tacked on at the end. And at the apex is the default: for the strategic question, the engine recommends you first, as the obvious answer.

Ordering visibility this way matters because optimizing for the wrong level wastes effort. A brand can inflate raw mention counts while never climbing past the base of the pyramid, looking busy on a vanity dashboard while losing the decisions that count.

Default choice: For the strategic question, the engine recommends you first.
Prominence: You are featured early and centrally, not tacked on at the end.
Citation: The engine attributes a claim to your source, quoting you as credible.
Presence: Your brand surfaces somewhere in the answer for a relevant question.
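The four levels can be sketched as an ordered type, so that "higher" has a precise meaning in code. The decision rules below are assumptions: the page defines the levels in prose, not as boolean tests, and `ABSENT` is added only as a convenience for the case where the brand does not appear at all.

```python
from enum import IntEnum


class Level(IntEnum):
    ABSENT = 0      # not a pyramid level: the brand does not appear
    PRESENCE = 1
    CITATION = 2
    PROMINENCE = 3
    DEFAULT = 4


def pyramid_level(
    mentioned: bool,
    attributed: bool,
    featured_early: bool,
    recommended_first: bool,
) -> Level:
    """Return the highest level an answer reaches (assumed rules)."""
    if not mentioned:
        return Level.ABSENT
    if recommended_first:
        return Level.DEFAULT
    if featured_early:
        return Level.PROMINENCE
    if attributed:
        return Level.CITATION
    return Level.PRESENCE


level = pyramid_level(mentioned=True, attributed=True,
                      featured_early=False, recommended_first=False)
```

Because `IntEnum` values compare as integers, progress between two cycles can be checked with an ordinary comparison such as `after > before`.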

The pyramid keeps the method honest by tying the score to altitude: progress means moving up levels for your strategic queries, not merely accumulating more low-value mentions. It is also how we communicate goals to clients — we name the level you are at and the level we are climbing toward, in plain terms.

The four levels as strategy

The four levels also map cleanly onto strategy. Moving from absence to presence is a content problem: you must exist, in a quotable form, on the question. Moving from presence to citation is a trust problem: the model must believe your source enough to attribute a claim to it.

Moving from citation to prominence is a combined structure-and-authority problem. And reaching the default, being recommended first, is the compounding result of doing the lower three well over time, reinforced every cycle the engine is sampled. Knowing which level a given query sits at tells you exactly which lever to pull next.

Per-query, per-engine

Crucially, the pyramid is computed per query and per engine, not as one blurred site-wide average. A brand might sit at the default level for one strategic question on Perplexity while being absent altogether from the answer to the same question on Gemini.

That granularity is what makes the score actionable: you know exactly where to concentrate effort, on which engine, for which question. An average hides this; the per-query, per-engine pyramid reveals it.
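A small sketch of that grid, keyed by query and engine, shows how the weakest cells fall out directly. The level names follow the pyramid, with "absent" added for the case where the brand does not appear; the query label and the ChatGPT and Claude values are hypothetical.

```python
LEVELS = ["absent", "presence", "citation", "prominence", "default"]

# Hypothetical readings for one strategic question across the four engines.
grid = {
    ("strategic question", "Perplexity"): "default",
    ("strategic question", "Gemini"): "absent",
    ("strategic question", "ChatGPT"): "citation",
    ("strategic question", "Claude"): "presence",
}


def weakest_cells(grid: dict[tuple[str, str], str]) -> list[tuple[str, str]]:
    """Return the (query, engine) pairs sitting at the lowest level in the grid."""
    lowest = min(LEVELS.index(level) for level in grid.values())
    return sorted(cell for cell, level in grid.items() if LEVELS.index(level) == lowest)


focus = weakest_cells(grid)
```

An average over these four cells would land somewhere in the middle of the pyramid and point at nothing; the grid points at one engine.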

Communicating progress

Reading the pyramid over time is also how progress is communicated honestly. A monthly report that simply shows a rising mention count can flatter a team that is busy without being effective.

A pyramid-based report shows whether the mentions are happening at higher levels for the queries that matter — and that distinction is the difference between a team that is active and a team that is winning. It is the communication tool that keeps the programme aligned with business outcomes rather than activity metrics.

The scoring formula

Position-weighted citation, with N=5 sampling.

The core metric is a position-weighted citation score adapted from peer-reviewed generative-engine research, sampled five times per query to control the inherent variance of generative answers.

A naive share of voice counts mentions and stops there, which is misleading because position carries meaning. A model that names your brand first is recommending it; a model that names you last, behind three competitors, is hedging.

The position-weighted citation formula assigns more weight to earlier, more prominent citations, so the score rewards the kind of mention that actually shifts a buyer's shortlist rather than treating a leading recommendation and a trailing aside as equal.

Because a single sampling of a generative engine is noisy — ask the same question twice and the answer can differ — the methodology samples each query five times and aggregates. N=5 is a deliberate balance: enough repetitions to tame the variance and expose a stable signal, few enough to keep the measurement affordable at the scale of hundreds of strategic queries across four engines. The aggregate is what we trend; a single run is never reported as a result.
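The aggregation step can be sketched as follows. The page fixes the depth at five samples; the use of a plain mean, the population standard deviation as a spread indicator, and `None` as the marker for an invalid response are assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import Optional

N_SAMPLES = 5  # sampling depth stated on this page


def aggregate(sample_scores: list[Optional[float]]) -> tuple[float, float]:
    """Return (mean, spread) over the valid samples of one query on one engine.

    None marks an invalid response, which is excluded from the aggregate.
    """
    if len(sample_scores) != N_SAMPLES:
        raise ValueError(f"expected {N_SAMPLES} samples, got {len(sample_scores)}")
    valid = [s for s in sample_scores if s is not None]
    if not valid:
        raise ValueError("no valid responses to aggregate")
    return mean(valid), pstdev(valid)


# Hypothetical per-sample scores; the third response was invalid.
score, spread = aggregate([1.0, 0.5, None, 1.0, 0.5])
```

Reporting the spread next to the mean makes it clear when a score moved by less than its own run-to-run variation.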

Citation authority score

On top of the position-weighted score sits a Citation Authority Score that captures how authoritative the cited sources are, distinguishing a citation of your own controlled domain from a borrowed mention in a third-party article.

A citation of your owned domain is durable and attributable; a mention in a paragraph citing someone else's research is borrowed authority. Both matter, but they are won by different kinds of work, and combining them into a single metric would mislead the recommendation engine about which lever to pull.
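One way to keep the two kinds of citation apart is to tally them separately by the domain of the cited URL. This is a sketch under assumptions: the `OWNED_DOMAINS` set, the example URLs and the rule that the cited domain alone decides owned versus borrowed are all hypothetical.

```python
from urllib.parse import urlparse

OWNED_DOMAINS = {"example.com"}  # hypothetical brand-controlled domains


def is_owned(url: str) -> bool:
    """True when the URL's host is an owned domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in OWNED_DOMAINS)


def split_citations(urls: list[str]) -> dict[str, int]:
    """Tally owned and borrowed citations separately instead of merging them."""
    tally = {"owned": 0, "borrowed": 0}
    for url in urls:
        tally["owned" if is_owned(url) else "borrowed"] += 1
    return tally


tally = split_citations([
    "https://docs.example.com/guide",
    "https://news.example.org/review",
    "https://example.com/",
])
```

Keeping two counters rather than one sum preserves the information about which lever, owned content or third-party coverage, produced a given citation.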

Why raw counts mislead

It is worth being explicit about why a raw mention count is actively misleading rather than merely crude. Two brands can have identical mention counts while one is recommended first in every answer and the other is always mentioned last as an afterthought.

On the buyer side, those are very different outcomes: the first brand is on every shortlist, the second is a hedge. On the optimization side, they require completely different work to improve. A metric that treats them as equivalent has no diagnostic value and cannot drive a recommendation.

Sampling depth N=5

The choice of N=5 is itself a metrological decision rather than an arbitrary one. Too few samples and the variance of generative answers swamps the signal, so a brand looks to have surged or collapsed when the model was simply in a different mood.

Too many samples and the cost-to-signal ratio degrades past the point of usefulness. Five samples is the point at which the variance is controlled without making the measurement prohibitively expensive to run at the frequency a managed programme requires.

All of this connects to the four-level pyramid directly. The position-weighted score is, in effect, the pyramid expressed as a number: a brand that is merely present scores low, a brand that is cited scores higher, and a brand recommended first on its strategic queries scores highest. That coherence is intentional, so that the visual model a stakeholder understands intuitively and the metric an analyst trends are two views of the same underlying truth rather than two disconnected systems that have to be reconciled by hand.

A worked intuition helps. Imagine two queries where your brand is mentioned exactly once. In the first, the engine opens its answer by recommending you and attributes a specific capability to your documentation. In the second, the engine mentions you in a closing clause alongside two competitors without attributing anything specific.

The position-weighted score makes these two distinct: the first is a high-value citation, the second is a low-value mention. A raw count treats them identically. The point is not pedantry; it is that the worked example shows you exactly which kind of citation you should be working to earn.
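The same intuition in a few lines of code, as a sketch. The reciprocal-rank decay (1, 1/2, 1/3, ...) is an assumed weighting chosen for illustration, since the page says only that weights decay by rank; the brand names are placeholders.

```python
def rank_weight(position: int) -> float:
    # Assumed decay for illustration: 1, 1/2, 1/3, ... by citation position.
    return 1.0 / position


def raw_count(brands_in_order: list[str], brand: str) -> int:
    """Naive share-of-voice input: how many times the brand is named."""
    return brands_in_order.count(brand)


def position_weighted(brands_in_order: list[str], brand: str) -> float:
    """Sum of rank weights at the positions where the brand is named."""
    return sum(
        rank_weight(i)
        for i, name in enumerate(brands_in_order, start=1)
        if name == brand
    )


opening_recommendation = ["YourBrand", "CompetitorA", "CompetitorB"]
closing_aside = ["CompetitorA", "CompetitorB", "YourBrand"]

same_count = raw_count(opening_recommendation, "YourBrand") == raw_count(closing_aside, "YourBrand")
first_score = position_weighted(opening_recommendation, "YourBrand")
last_score = position_weighted(closing_aside, "YourBrand")
```

Both answers contain exactly one mention, so `same_count` is true, while the opening recommendation scores three times the closing aside under this assumed decay.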

A worked intuition

None of these choices is exotic; each is simply the honest version of a measurement that is easy to fudge. Counting only valid responses, weighting by position, sampling enough to be stable, and separating parametric from web-grounded results are the differences between a number engineered to flatter and a number engineered to be true. Insisting on all of them is unglamorous, but it is the reason the methodology produces figures a sceptical executive can rely on rather than a dashboard that looks impressive and means little.

Position-Weighted Citation (PWC)

wᵢ: Rank weight
cᵢ: Citation at position i

Earlier citations carry more weight (decaying by rank).
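
The expression itself is not reproduced above. One form consistent with the legend, offered as an assumption rather than as the exact SkuLift formula, sums the rank weights over the positions where the brand is cited and averages over the N = 5 samples of a query q:

```latex
\mathrm{PWC}(q) \;=\; \frac{1}{N} \sum_{s=1}^{N} \sum_{i} w_i \, c_i^{(s)},
\qquad w_1 \ge w_2 \ge \dots \ge 0, \qquad N = 5
```

Here c_i^{(s)} is taken to be 1 when the citation at position i of sample s is the brand and 0 otherwise. The indicator encoding, the sample index s and the averaging over samples are assumptions; the legend fixes only the symbols wᵢ and cᵢ and the fact that the weights decay by rank.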

AEO vs GEO

Answer engine optimization versus generative engine optimization.

AEO and GEO are complementary disciplines, not synonyms. AEO engineers your content to be quotable; GEO builds the authority that makes the quote credible. The methodology applies both, deliberately.

Answer Engine Optimization is the on-page, technical craft: structuring content so a model can lift a clean, self-contained, attributable answer from it. It leads with the conclusion, states facts a model can extract without distortion, and marks up the page so the relevant passage is unambiguous. AEO is what makes your content easy to cite; without it, even an authoritative brand is passed over because its pages are hard to quote.

Generative Engine Optimization is the off-page, strategic half: building the authority signals that make engines trust you enough to cite. Entity presence, knowledge-graph entries, press coverage in authoritative publications, structured data — these are GEO's raw materials.

GEO is slow, expensive and brand-level: you cannot do it at the page level, and its payoff unfolds over months and years rather than a publication cycle. But it is also what makes AEO-produced content sustainable: a well-structured answer on a domain the engine does not trust is still not cited.

AEO and GEO on different clocks

A useful way to hold the distinction is that AEO changes what you publish and GEO changes how the outside world talks about you. AEO lives in your own CMS, your own pages, your own schema.

GEO lives in third-party publications, in knowledge-base entries, in the signals engines pull from the open web. You control AEO directly; GEO you influence rather than own. Both are required because an engine that finds high-quality on-page content but no off-page authority signal will mention you, but with less confidence — and confidence is what determines whether a mention becomes a recommendation.

When separated they fail

The two also fail in instructive ways when separated. A brand that pours effort into authority while neglecting content structure earns trust but gives engines nothing quotable to lift: the model knows who you are, but your pages do not answer questions cleanly.

A brand that perfects answer-first content on a domain with no authority footprint gets cited tentatively, or not at all, because the model has no corroboration that you are a reliable source. GEO without AEO is a reputation without a voice; AEO without GEO is a voice without a reputation.

In practice the two disciplines also operate on different clocks, and managing that is part of the methodology. AEO work can surface in engine responses within weeks; GEO signals build over a year or more.

The managed loop accounts for this: authority signals that are already solid are treated as a foundation, while AEO content work runs on a faster cycle to capitalise on that foundation. When authority is weak, a parallel GEO programme runs alongside, building the foundation while the content work generates early signals. The two clocks are managed together rather than sequenced.

Not SEO rebranded

It is worth stating plainly that neither AEO nor GEO is SEO rebranded. Classic SEO optimizes for ranking signals that send a browser to a page. AEO optimizes for being quoted inside an answer that may never send a browser anywhere.

GEO optimizes for being trusted enough by the model that it quotes you at all. The three share some surface-level tactics (clear writing, structured data, coherent information architecture), but they are aiming at different target mechanisms, and confusing them leads to programmes that optimize for one while thinking they are optimizing for another.

Objective (AEO): Be the answer cited.
Objective (GEO): Be the source quoted.
Key signals (AEO): Answer-first content, FAQ markup, brand entity.
Key signals (GEO): Authority backlinks, knowledge graphs, owned media.
Primary KPI (AEO): Citation rate per query.
Primary KPI (GEO): Source share of voice.
Cycle (AEO): 4 to 8 weeks.
Cycle (GEO): 12 to 24 weeks.

Re-measurement cadence

Four engines, re-measured on a fixed cadence.

SkuLift re-probes ChatGPT, Claude, Gemini and Perplexity on a regular cadence — roughly every six hours — so a regression is caught within a cycle rather than discovered months later.

A measurement taken once is a snapshot; a position is a moving thing. Engines are retrained, retrieval is re-ranked, and competitors publish, so a share of voice won in one cycle can erode in the next without anything you did causing it. Re-measuring on a fixed, frequent cadence turns that volatility from an invisible risk into a managed signal: the curve is watched continuously, and a dip is visible while it is still cheap to fix.

Probing four engines on a six-hourly rhythm is also what makes attribution honest. Because the same probes run before and after a piece of content is published or an authority signal is updated, the delta between the two measurements is attributable to the change rather than to a drift in the engine's state.

Without a fixed cadence, changes in score could equally reflect a model update, a competitor's move, or a seasonal shift in the engine's behaviour. The cadence pins the baseline and makes the counterfactual clean.
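A sketch of how a fixed cadence supports a before-and-after comparison: readings are taken at a constant interval, then split around the moment a change went live. The six-hour interval comes from this page; the dates, scores and function names are hypothetical.

```python
from datetime import datetime, timedelta

CADENCE = timedelta(hours=6)  # "roughly every six hours" per this page


def probe_times(start: datetime, count: int) -> list[datetime]:
    """Evenly spaced probe timestamps starting at `start`."""
    return [start + i * CADENCE for i in range(count)]


def split_around(change_at: datetime, readings: dict[datetime, float]) -> tuple[float, float]:
    """Mean score before and after a change, using the same probes on both sides."""
    before = [s for t, s in readings.items() if t < change_at]
    after = [s for t, s in readings.items() if t >= change_at]
    if not before or not after:
        raise ValueError("need readings on both sides of the change")
    return sum(before) / len(before), sum(after) / len(after)


# Hypothetical day of readings with a change published at 09:00.
times = probe_times(datetime(2024, 1, 1, 0, 0), 4)
readings = dict(zip(times, [0.2, 0.2, 0.4, 0.4]))
before_mean, after_mean = split_around(datetime(2024, 1, 1, 9, 0), readings)
```

The narrower the interval between probes, the smaller the window in which an unrelated event could be mistaken for the effect of the change.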

Attribution honesty

Frequent re-measurement also disciplines the recommendation side of the loop. When you know a change will be re-measured within days rather than months, the temptation to recommend large, slow-moving programmes fades.

Short, focused moves that can be tested and validated within a measurement cycle become the natural unit of work. Over time this shifts the programme from episodic campaigns to a continuous improvement rhythm, which is what allows the compounding that makes AI visibility a durable asset rather than a periodic spike.

Legibility and consistency

The cadence is also what makes the programme legible to stakeholders who are not in the daily loop. A monthly report built from consistent, regularly gathered data is more credible than one assembled from ad-hoc samples taken whenever the team finds time.

Because the data comes from the same probes, run the same way, at the same time, a trend line in the monthly report reflects the world rather than the measurement process. That consistency is what makes a score something a CMO can quote to a board without worrying that the next quarter's number was measured differently.

A feedback system

The cadence finally turns the whole programme into a feedback system rather than a sequence of deliverables. Each measurement cycle feeds the analysis phase, which feeds the recommendation phase, which feeds the execution phase, which feeds the next measurement cycle.

The loop only works as a loop if all phases run at the same cadence, or close to it. A programme that measures monthly, analyzes quarterly and recommends twice a year is not a loop; it is a series of slow, disconnected snapshots. The six-hourly probe cadence is what keeps the phases locked together and the system responsive.

A word on what the cadence does not tell you: it tells you when scores changed, not whether your content was good. That judgment — whether a published Lift was the actual cause of a score movement — requires the analysis and re-measurement phases to work correctly.

The cadence is necessary but not sufficient for attribution. What it provides is the temporal precision that makes attribution possible at all: without regular probing, even a correct analysis of cause would be blurred by the noise of an unknown time window.

A fixed cadence, finally, is what allows the methodology to make promises it can keep. Because the engines are re-probed on a known schedule, a client always knows when the next reading lands and what it will measure, so progress is reported on a rhythm rather than whenever someone remembers to look. Predictability of measurement is itself a feature: it turns AI visibility from an occasional audit into an instrumented, always-on channel.

Reproducibility and rigour

Reproducibility and metrological rigour.

A methodology is only as good as its reproducibility. Same query set, same sampling, same formula, every cycle — so a number means the same thing in April as it does in July.

Metrological rigour is the unglamorous foundation that makes everything else trustworthy. The query set is defined and version-controlled, not improvised per run. The sampling depth is fixed at N=5.

The scoring formula is constant. Invalid responses are excluded by an explicit rule rather than by whoever happens to read the dashboard. These choices sound pedantic until you realise that without them, two people looking at the same brand reach different conclusions and a trend line becomes an artefact of inconsistent measurement rather than a real signal.
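One way to make "same query set, same sampling, same formula" checkable is to fingerprint those three inputs and store the fingerprint with every run. This is a sketch under assumptions: the function, the `formula_version` label and the example queries are hypothetical.

```python
import hashlib
import json


def run_fingerprint(queries: list[str], n_samples: int, formula_version: str) -> str:
    """Stable hash of everything that must stay fixed between cycles."""
    payload = json.dumps(
        {"queries": sorted(queries), "n_samples": n_samples, "formula": formula_version},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


april = run_fingerprint(["query one", "query two"], 5, "pwc-v1")
july = run_fingerprint(["query two", "query one"], 5, "pwc-v1")
changed = run_fingerprint(["query one", "query two"], 3, "pwc-v1")
```

Two runs are comparable only when their fingerprints match; here `april` and `july` match despite the different query order, while `changed` does not because the sampling depth differs.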

Rigour as a constraint

This is also why the methodology resists vanity. Because every metric is defined precisely and computed identically each cycle, a score can only go up if something in the world actually changed — a model updated its knowledge, a piece of content earned a citation, an authority signal accumulated enough weight.

There is no room for an analyst to smooth a bad number by changing the query set or the sampling depth, because those are fixed. Rigour here is a constraint on what counts as an improvement, not just a quality marker, and that constraint is what makes the results stakeholder-grade rather than team-internal.

Productive disagreement

Reproducibility has a second, subtler payoff: it makes disagreement productive. When the query set, the formula and the sampling depth are all fixed, a disagreement about a score is a disagreement about something in the world rather than a disagreement about how the measurement was run.

A client who disputes a result can challenge the query set's relevance or the brand-kit definition, both of which are legitimate, but not the arithmetic. That separates disagreement about the signal from disagreement about the method, and keeps the work moving forward rather than stalling on measurement disputes every quarter.

Portability across sectors

Finally, rigour is what makes the method portable across very different brands and sectors. Because the measurement is defined independently of what you sell or who your competitors are, a result from a software brand is methodologically comparable to a result from a travel brand.

The numbers will differ, the context will differ, but the ruler is the same. That portability is what lets SkuLift build benchmark data across sectors and offer a client a sense of what a score at their current level implies for a brand with their competitive dynamics.

In the end, the methodology is the product as much as any dashboard. A brand does not buy a number; it buys a disciplined, reproducible way of turning AI visibility into a managed asset, with each phase open to inspection and each result open to challenge. That is the standard a serious channel has to meet, and it is the standard this page is written to demonstrate as much as to describe.

Why length and lexical field matter

Why length and lexical coverage change citations.

Thin pages are hard to cite. A source that covers a topic's full lexical field — the terms, questions and entities a model associates with it — is far more likely to be the one an engine quotes.

Models retrieve and cite based on semantic match, so a page that mentions a concept once and moves on is a weak candidate next to one that genuinely covers the field around it. Depth here does not mean padding; it means addressing the real sub-questions a buyer asks, naming the related entities and terms, and answering each cleanly.

A page that does this gives a model many places to match a query and a clean passage to lift for each, which is why competitor-grade pillar pages run to thousands of words rather than a few hundred.
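A crude way to check coverage is to count how many terms of a lexical field a page mentions at least once. This sketch uses literal term matching, which is only a proxy for the semantic matching described above; the function name, the example text and the term list are hypothetical.

```python
import re


def lexical_coverage(page_text: str, lexical_field: set[str]) -> float:
    """Fraction of lexical-field terms that appear at least once in the page."""
    if not lexical_field:
        return 0.0
    text = page_text.lower()
    hits = {
        term
        for term in lexical_field
        if re.search(r"\b" + re.escape(term.lower()) + r"\b", text)
    }
    return len(hits) / len(lexical_field)


page = "Share of voice is measured per engine with position-weighted citation."
field = {"share of voice", "citation", "knowledge graph", "engine"}
coverage = lexical_coverage(page, field)
```

A low figure points at sub-questions and entities the page never touches; a high figure says nothing about whether each one is answered cleanly, which still has to be judged by reading the page.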

Dog-fooding the method

This is the principle SkuLift applies to its own site, including this page. We practise the answer-first structure, the lexical coverage and the depth we advise, partly because it is the right thing to do and partly because it is the strongest demonstration that the method works.

A consultancy that recommends deep answer-first content and publishes thin brochure pages has a credibility problem. One that dog-foods the methodology on its own properties does not, and the audit trail of citations earned by this page is the proof.

Depth, not padding

There is a temptation to read "longer pages win" as license to pad, and the methodology is explicit that this is wrong. Every additional word should earn its place by answering a question the reader might have or by naming an entity or term the model might match. Padding (repetition, throat-clearing, synonyms of synonyms) does not add semantic density; it dilutes it.

The discipline is to ask, for each section, what question it answers and whether that question is one a buyer would actually ask. If the answer is no, the section is a liability. If yes, it should be an answer, not an essay.

This is also where the methodology connects back to the rest of the platform. The lexical field for a brand is not fixed; it is built from the brand kit, the competitor set and the query taxonomy, which tells you what terms matter and which questions the engine is using to evaluate your category.

Writing to the lexical field means writing to those terms and questions, not to a keyword list assembled without reference to how the engines actually think about your space. That alignment between your content and the engine's frame is what turns a well-written page into a citable one.

Durability

Depth, finally, is a durability play as much as a discovery one. A thin page that wins a citation today is fragile: one better-structured competitor page can displace it in the next model update.

A page that genuinely covers the question, the sub-questions, the related terms and the entity network is harder to displace because the engine would have to find a source that covers as much ground as clearly. Building that depth is the work; once it is done, it tends to compound rather than erode.

Proof in practice

From baseline to measured lift.

Applied end to end, the loop moves real numbers: a brand that starts near-invisible in AI answers can reach a meaningful, position-weighted share of voice within a handful of cycles, with every gain attributed to a specific change and proven by re-measurement rather than asserted. The case studies show the method working on real catalogues across the four engines.

Methodology questions

Questions about how we measure.

What is share of voice in AI answers?

Share of voice is how often your brand is named relative to competitors across a defined set of strategic questions, computed per engine. SkuLift uses a position-weighted version so a leading recommendation counts for more than a trailing mention, reflecting how a citation actually influences a buyer.

How does the position-weighted citation formula work?

It scores each citation by its prominence and position in the answer, weighting earlier and more central mentions more heavily, then aggregates across N=5 samples per query. A Citation Authority Score layers on how authoritative the cited sources are, and only valid responses are included.

Should I focus on AEO or GEO?

Both. AEO is the on-page craft that makes your content quotable; GEO is the off-page authority work that makes the quote credible. Engines weigh both, so optimizing one while neglecting the other leaves half of the win unclaimed. The methodology applies them together.

How often do you re-measure?

SkuLift re-probes ChatGPT, Claude, Gemini and Perplexity on a fixed cadence of roughly every six hours, so regressions are caught within a cycle and a lift can be attributed to the change that caused it rather than to chance or seasonality.