OKR Program Design: Why One Scoring Rubric Breaks Most Rollouts

TL;DR: OKR program design fails when you score every team against the same rubric. Different teams do different shapes of work (delivery, research, operational, revenue, customer) and each shape needs its own KR logic and scoring standard. This post is the program-level playbook for running a single OKR program across teams whose work has fundamentally different shapes, without the program slowly losing whichever teams the default rubric doesn’t fit.

Your rollout went well for the first two quarters. By the end of Q3 it didn’t. The R&D team has quietly checked out of weekly check-ins. The ops team’s maintenance KRs read like a comedy of gaming. Sales is the only function whose scoring still looks normal, which means the program is now telling you sales is your only real team. You know that isn’t true. But the framework, as you rolled it out, can’t see anything else. The problem isn’t the OKR framework. It’s OKR program design that treated five different shapes of work as one.

This isn’t the foundational guide on implementing OKRs in large organizations. That post covers the org-scale rollout sequence. This is the deep dive on the design decision underneath every rollout: how you write the scoring logic so it survives contact with teams doing different kinds of work.

What Is OKR Program Design?

OKR program design is the set of decisions about how a single OKR program operates across teams. It covers how KRs are structured for different work shapes, how scoring is standardized across teams whose work isn’t comparable, how rollups are aggregated for executive reporting, and what governance keeps the program coherent. Most rollouts handle goal-setting carefully and ignore program design, which is why they fray in the second half of the year.

Why Does OKR Program Design Break at Scale?

The default OKR rubric in most books, frameworks, and SaaS templates was built around a specific assumption: the team knows what they’re trying to ship and can commit to an outcome target. That assumption holds for delivery teams. It breaks for almost everyone else.

When the same rubric gets applied to a research function, the team picks safe targets or retrofits their scoring at quarter end. When it gets applied to an operational function, the team scores zero for the work of preventing failures (because nothing happened, so nothing “moved”). When it gets applied to a sales function, it interacts badly with quota and incentive design, and teams game close dates to make the KR look right. When it gets applied to customer success, it incentivizes ticket throughput over actual retention.

You watch this happen quarter by quarter. The R&D team is usually the first to disengage, because the gap between their work and the rubric is the widest. The ops team disengages next, more quietly. By the time you’re three quarters in, your “program” is functionally a sales scoreboard with everyone else doing performative compliance around the edges. The C-suite starts asking whether OKRs are working.

The framework isn’t broken. The program design is. And design is fixable. Research from the Brightline Initiative on strategy execution traces a consistent pattern: programs fail more often on design decisions than on framework choice.

The Five Work-Type Shapes in OKR Program Design

Every team in your org does work that falls into one of five shapes. Each shape has a native KR structure that fits, and a scoring logic that holds up. Forcing all five through the same rubric is what creates the fraying you’ve been watching.

1. Delivery work

A team ships something with a known outcome target. Engineering teams shipping features. Platform teams running migrations. Compliance teams closing an audit. The outcome can be specified in advance because the team knows what done looks like.

KR shape: outcome target with a number. “Lift on-time check-in rate from 72% to 85%.” “Reduce p95 latency from 800ms to 400ms.”

Scoring logic: did the outcome happen? Default OKR rubric works here without modification. This is the work the standard playbook was built around.
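The delivery rubric above only asks whether the outcome happened. Many programs also give partial credit for distance covered. Here is a minimal sketch of that convention, assuming linear credit between baseline and target. The function name and the end-of-quarter readings are hypothetical; the baselines and targets come from the two example KRs.

```python
def delivery_score(baseline: float, target: float, actual: float) -> float:
    """Fraction of the baseline-to-target distance covered, clamped to 0..1.

    Assumption: linear partial credit. A stricter program could score
    only 0 or 1 depending on whether the target was reached.
    """
    if target == baseline:
        raise ValueError("target must differ from baseline")
    progress = (actual - baseline) / (target - baseline)
    return max(0.0, min(1.0, progress))


# On-time check-in rate KR: 72% -> 85%; hypothetical quarter-end reading of 80%.
checkin = delivery_score(baseline=72, target=85, actual=80)

# p95 latency KR: 800ms -> 400ms (lower is better); hypothetical reading of 500ms.
latency = delivery_score(baseline=800, target=400, actual=500)
```

Because progress is measured as a share of the gap, the same function handles KRs where the number should go up and KRs where it should go down.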

2. Research and exploratory work

A team is figuring out whether something is possible. R&D, innovation labs, new-market exploration, experimental product bets. The outcome can’t be specified in advance because the work IS finding out.

KR shape: hypothesis-anchored, with decision-quality output as the deliverable. “Test whether adaptive deception can reduce mean detection time by 50% in a controlled environment, and produce a kill / proceed recommendation with supporting evidence by Q3 end.”

Scoring logic: did the team run a defensible inquiry and produce a shareable answer? A clear “no, here’s why” scores 1.0 because the org now knows where not to invest. (For the practitioner-level guide to writing these KRs, see OKRs for R&D teams.)

3. Operational and sustaining work

A team is keeping something running. IT operations, finance, security, infrastructure maintenance, compliance monitoring. Success often looks like nothing happening. The work is invisible when it’s done well.

KR shape: threshold-based commitments paired (where useful) with a continuous-improvement KR. “Maintain 99.95% platform uptime through Q3.” “Hold mean time to remediation under 4 hours for P1 incidents.” “Reduce average ticket-to-resolution time from 14 days to 9 days.”

Scoring logic: did we hold the line on the thresholds? Did we improve where we committed to improving? Penalize misses, credit successful holds. The trap to avoid is treating “no incident” as a 0 because nothing “moved.”
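One way to make “credit successful holds” concrete is to score a threshold KR by the share of measurement periods in which the line held. This is a sketch under that assumption, not a prescribed formula. The function name, the weekly and monthly cadences, and the sample readings are all hypothetical.

```python
from typing import Callable, Sequence


def threshold_score(readings: Sequence[float], holds: Callable[[float], bool]) -> float:
    """Share of measurement periods in which the threshold held (0..1)."""
    if not readings:
        raise ValueError("need at least one reading")
    return sum(1 for r in readings if holds(r)) / len(readings)


# "Maintain 99.95% platform uptime through Q3": 13 hypothetical weekly readings.
weekly_uptime_pct = [99.99] * 13
uptime_score = threshold_score(weekly_uptime_pct, lambda pct: pct >= 99.95)

# "Hold mean time to remediation under 4 hours for P1 incidents":
# hypothetical monthly means, with one month over the line.
monthly_p1_mttr_hours = [2.5, 5.5, 3.0]
mttr_score = threshold_score(monthly_p1_mttr_hours, lambda hours: hours < 4)
```

A quiet quarter with no breaches scores full marks here, which is the point: the score rewards the hold instead of reading “nothing moved” as zero.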

4. Revenue and sales work

A team is producing a number through repeatable activity tied to pipeline. Sales, BD, marketing pipeline, partnerships. The outcome is known (revenue, accounts, ARR), but it’s heavily entangled with quota and incentive design that often pre-dates the OKR program.

KR shape: target number paired with quality and forward-pipeline gates. “Close $4.2M new ARR in Q3 with average ACV above $30K.” “Build Q4 pipeline coverage of 3.0x against plan by end of Q3.” “Move logo retention from 87% to 92%.”

Scoring logic: did we hit the number? And was the pipeline that produced it real? A 1.0 close that leaves Q4 pipeline empty is a program-design failure, not a win.
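The gate idea can be sketched as a cap: attainment sets the score, but a failed quality or pipeline gate limits how high it can go. The 0.7 cap, the function name, and the sample inputs are assumptions for illustration. The $30K ACV floor and 3.0x coverage come from the example KRs above.

```python
def revenue_score(
    closed_arr: float,
    target_arr: float,
    avg_acv: float,
    pipeline_coverage: float,
    min_acv: float = 30_000,
    min_coverage: float = 3.0,
    gate_cap: float = 0.7,  # hypothetical ceiling when a gate fails
) -> float:
    """Attainment on the number, capped when quality or pipeline gates fail."""
    attainment = max(0.0, min(1.0, closed_arr / target_arr))
    gates_pass = avg_acv > min_acv and pipeline_coverage >= min_coverage
    return attainment if gates_pass else min(attainment, gate_cap)


# Full close on $4.2M, but next quarter's pipeline coverage is only 1.2x.
hollow_close = revenue_score(4_200_000, 4_200_000, avg_acv=34_000, pipeline_coverage=1.2)

# Same close with 3.1x coverage built for next quarter.
healthy_close = revenue_score(4_200_000, 4_200_000, avg_acv=34_000, pipeline_coverage=3.1)
```

The cap keeps a full close from scoring 1.0 when it was bought by draining next quarter’s pipeline. Where to set the cap is a program decision that belongs in the published rubric.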

5. Customer and support work

A team is responding to inbound demand and managing service quality. Customer success, technical support, account management, post-sale onboarding. Outcomes are mostly continuous (NPS, retention, expansion) rather than event-shaped.

KR shape: experience and retention metrics rather than activity counts. “Lift NPS from 32 to 42.” “Grow net revenue retention from 104% to 112%.” “Move time-to-first-value from 18 days to 10 days for new accounts.”

Scoring logic: did the experience metric move and did the retention number hold? Activity volume (tickets handled, calls made) is almost always the wrong KR shape here. The trap is teams reaching for activity metrics because they’re easy to count.

OKR Program Design at a Glance

Work Shape	Native KR Type	Native Scoring Logic	Most Common Failure Under One Rubric
Delivery	Outcome target with a number	Did the outcome happen?	Default rubric works fine. No friction.
Research / Exploratory	Hypothesis + decision-quality output	Did we produce a defensible, shareable answer?	Safe-bet bias or post-hoc storytelling.
Operational / Sustaining	Threshold commitments + improvement KRs	Did we hold the line and improve where committed?	Team scores 0 for the work of preventing failures.
Revenue / Sales	Target number + pipeline-quality gates	Did we hit it, and is the pipeline real?	Quota gaming, stuffed pipeline, end-of-quarter contortion.
Customer / Support	Experience + retention metrics	Did the experience move and did we retain?	Activity counts crowd out actual outcomes.

How to Score Across Work Types Without Losing Program Coherence

The hardest part of OKR program design isn’t picking the right KR shape per team. It’s keeping the program coherent at the executive level when scoring works differently in different parts of the org. This is where most OKR program design efforts under-invest, and where rollouts fray.

Three program-level decisions hold this together.

1. A common scale, not a common rubric. Every team scores on the same 0-to-1 scale. What changes is the rubric used to assign the score, not the scale itself. A 0.7 in research means “we ran a defensible inquiry and produced a clear answer that wasn’t quite decision-grade.” A 0.7 in sales means “we hit 70% of quota.” Both are 0.7s and both are honest, because the underlying scale is shared.
2. Published scoring rubrics per work shape, locked before the cycle starts. Each team type has a one-page rubric that explains how 0, 0.3, 0.7, and 1.0 are defined in that shape. Locked before kickoff, not negotiated at quarter end. This is the single biggest defense against retrofitted scoring.
3. A standing program-design forum. Once per cycle, the VP Strategy or Ops lead reviews edge cases with team leads. New work types come up. Existing rubrics get refined. The forum is the governance mechanism that keeps program design from drifting into folklore.
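If the rubrics live in a tool and not only on paper, “locked before kickoff” can be enforced in the data structure itself. Here is a minimal sketch, assuming a read-only record per work shape. The class and function names are hypothetical, and the 0 and 0.3 definitions are placeholders, since only the 0.7 and 1.0 research anchors are described in this post.

```python
from dataclasses import dataclass
from types import MappingProxyType

ANCHORS = (0.0, 0.3, 0.7, 1.0)  # shared scale points every rubric must define


@dataclass(frozen=True)
class Rubric:
    work_shape: str
    cycle: str
    anchors: MappingProxyType  # anchor score -> plain-language definition


def lock_rubric(work_shape: str, cycle: str, definitions: dict) -> Rubric:
    """Validate that every anchor is defined, then return a read-only rubric."""
    missing = [a for a in ANCHORS if a not in definitions]
    if missing:
        raise ValueError(f"{work_shape}: no definition for anchors {missing}")
    return Rubric(work_shape, cycle, MappingProxyType(dict(definitions)))


research = lock_rubric("research", "Q3", {
    0.0: "placeholder: to be defined by the program-design forum",
    0.3: "placeholder: to be defined by the program-design forum",
    0.7: "defensible inquiry, clear answer that was not quite decision-grade",
    1.0: "defensible inquiry, shareable answer, including a clear 'no, here is why'",
})
```

Once locked, an assignment such as `research.anchors[0.7] = "..."` raises a `TypeError`, so changing a definition mid-cycle means publishing a new rubric for the next cycle.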

Without these three, a mixed program defaults to whichever rubric the C-suite is most familiar with, which is almost always the delivery rubric, and the cycle starts again.

How to Defend Mixed Scoring to the Executive Team

This is the credibility moment most VP Strategy roles dread. The CEO sees R&D scoring 1.0 on what looks like a failed experiment, while sales is at 0.6 on a quarter that closed seven figures of new business. Without a clean explanation, the program loses its standing.

The clean explanation is short. The 0-to-1 scale is shared. The rubric per work type is published. R&D’s 1.0 means “we delivered a high-quality answer to a question worth asking,” because punishing a team for a well-evidenced “no” is what created the safe-bet bias your last rollout died of. Sales’ 0.6 means “we hit 60% of a stretch quota,” because scoring sales like research would let everyone hit 1.0 by setting trivial targets.

Hand the executive team the one-page rubrics. Most of the credibility issue evaporates once the rubrics are visible. Good OKR program design makes its own logic legible. The pushback you actually need to be ready for is on aggregation, not on the per-team scoring itself.

For company-wide OKR health, don’t average the 0-to-1 scores across teams. Averaging a research 1.0 with a sales 0.6 and reporting a “company score of 0.8” is meaningless and the executive team will notice. Report instead by work-type rollup, with an editorial summary of what each rollup means. Three sentences per shape. The summary is the signal, the scores are the supporting data.
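A rollup by work type can be sketched as a simple group-by that never produces a cross-shape number. The team names and scores below are hypothetical. Averaging within a shape is also an assumption, on the reasoning that teams inside one shape share a rubric and so their scores are comparable.

```python
from collections import defaultdict
from statistics import mean


def rollup_by_shape(team_scores):
    """Group (team, work_shape, score) rows by shape; no company-wide average."""
    by_shape = defaultdict(list)
    for _team, shape, score in team_scores:
        by_shape[shape].append(score)
    return {
        shape: {
            "teams": len(scores),
            "mean": round(mean(scores), 2),
            "low": min(scores),
            "high": max(scores),
        }
        for shape, scores in by_shape.items()
    }


# Hypothetical quarter-end scores.
report = rollup_by_shape([
    ("Labs", "research", 1.0),
    ("New Markets", "research", 0.7),
    ("Enterprise Sales", "revenue", 0.6),
])
```

Each entry in `report` is the supporting data for one of the three-sentence editorial summaries. The function deliberately has no step that blends research and revenue into a single figure.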

How to Sequence OKR Program Design Decisions in a Rollout

If you’re rolling out from scratch or rebuilding after a fraying program, the sequence matters more than people assume. The standard mistake is launching across all teams in the same quarter with one rubric, then trying to fix things in the second cycle. By then trust is already lost in the teams the default rubric didn’t fit.

A sequence that holds up:

1. Map the work shapes first. Before any team writes an OKR, identify which of the five shapes each team’s primary work falls into. This is a 90-minute exercise with team leads, not a research project.
2. Lock the rubric per shape. Write the one-page scoring rubric for each shape your org actually has. Most orgs have 3 or 4 of the five, not all five.
3. Pilot on delivery teams first. They’re the lowest-friction starting point because the default rubric fits their work. The first cycle proves the cadence and the tooling without arguing about scoring philosophy.
4. Bring research and operational teams in second. These are the highest-friction work types for the default rubric. Bringing them in with their own rubric pre-built protects the program from the failure modes that kill rollouts.
5. Layer revenue and customer teams in the third cycle. They often have existing scoring systems (quota, NPS targets) that need to be aligned with the OKR program rather than replaced. This integration takes time and deserves its own cycle.
6. Review the rubrics every two cycles. Work changes. New work types emerge. The rubrics shouldn’t be permanent, just stable.

Most rollouts collapse this whole sequence into one cycle and lose teams in waves. The slower sequence costs you two quarters at launch and saves you the credibility loss that follows when the program fragments. McKinsey’s research on operating-model design consistently finds that staged rollouts outperform big-bang implementations on adoption durability, and OKR programs follow the same pattern.

FAQ: OKR Program Design

How is OKR program design different from writing good OKRs?

Writing good OKRs is a team-level craft. OKR program design is an org-level set of decisions about how scoring, governance, and reporting hold together across teams whose work isn’t directly comparable. You can have teams writing excellent OKRs inside a poorly designed program, and the program-level problems will still pull the cycle apart by quarter three. Program design is upstream of OKR craft.

Should every team in our org be on OKRs?

Not necessarily, and forcing it is one of the most common program design mistakes. Teams whose work is almost entirely operational and sustaining (and where the threshold-based rubric doesn’t add value) can run on operating metrics and skip the OKR cycle. Teams whose work is purely reactive in short cycles (some support functions) may also be better served by SLA targets than quarterly OKRs. The program design question isn’t “is everyone on OKRs?” It’s “is everyone whose work needs strategic alignment using the cycle to get it?” and “are all the areas of the org that need change or improvement using OKRs?”

How do you handle scoring when a team’s KR depends on another team delivering first?

This is a dependency problem, and the cleanest fix is at program design time. Cross-team dependencies should be surfaced at OKR kickoff (not at quarter end) and either rewritten as a shared KR with co-owners, or scoped explicitly with the upstream team’s commitment as a documented input. Scoring should reflect the team’s actual control over the outcome. Penalizing a team for an upstream miss they couldn’t influence is the fastest way to lose trust in the program.

How often should we review and update the per-shape scoring rubrics?

Every two cycles is a workable default. Reviewing every cycle creates instability and teams stop trusting that the rubric they’re scored against won’t shift retroactively. Leaving rubrics untouched for a year lets work shapes evolve past the rubric and the program starts to fray again. Two cycles is the rhythm that lets the program adapt without becoming arbitrary.

OKR Alignment Audit: Free Download

If your program is mid-rollout and you’re trying to figure out which teams are drifting and why, the OKR Alignment Audit walks each team’s OKRs through six alignment anchors and an execution-readiness check, then categorizes each one (Solid OKR, Strategy Play, Support Play, Defensive Play, Orphan OKR, At Risk). It’s the diagnostic layer that pairs with the program-design work in this post.

The teams in your org don’t do the same kind of work, and the OKR program design you built to govern them shouldn’t pretend they do. Map the shapes. Lock the rubrics. Score on a common scale, not a common rubric. The program holds. The teams stay engaged. The framework gets to do the job you brought it in to do.
