51studio
Tech News

Error tracking that doesn't alert-storm

By Sam Hollis · 8 min read

Most teams set up Sentry on the first day of a project. By month three, the Slack channel is hostile. Alerts every five minutes. Stack traces nobody reads. The "production-errors" channel has 4,000 unread messages.

By month six, the team mutes the channel. By month twelve, nobody knows when something is genuinely broken because everything looks broken.

This is alert fatigue, and it's not Sentry's fault. The defaults catch everything; you have to teach them what matters.

This post covers the configuration that turns Sentry (or Bugsnag, or Highlight, or any equivalent) from a noise generator into a signal source: the five categories of errors, what to do with each, and the operational discipline behind it.

Why error tracking goes wrong

Three patterns we see repeatedly:

Every exception is a P1. The team turns on Sentry, every uncaught exception fires an alert, the alerts get muted, and now nothing fires alerts.

Front-end errors and back-end errors share a channel. A 0.1% JS error rate on a site serving a million page views an hour is 1,000 alerts/hour. The signal from the back end (where errors usually matter more) gets buried.

No grouping. Sentry groups by stack trace by default but teams disable it because "I want to see each one." Result: 500 alerts for the same broken database query.

The fix isn't a different tool. It's configuration.

The five categories

Every error your app produces falls into one of these. Treating them all the same is the source of the problem.

1. User input errors

Validation failures, malformed requests, expected 400s. These are not bugs; they're the system working as designed.

Don't alert. Don't even capture them unless you're investigating a specific issue. They're noise.

In Sentry: filter at the SDK level. Most errors with a 400-499 status code should never reach Sentry. The 4xx errors that do matter (401 from an authenticated endpoint, 403 from an admin endpoint, 404 from a deep link) might be worth recording as breadcrumbs, but not as events. The one exception is 429, which belongs with the transient errors below.
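
One way to express that split is a small pure function that decides what happens to an HTTP error before it reaches the SDK. This is a minimal sketch, not Sentry API: `ClientErrorAction`, `classifyClientError`, and `notableStatuses` are hypothetical names, and which 4xx codes count as notable is an assumption to adjust per endpoint.

```typescript
// Hypothetical helper: decide what to do with an HTTP error by status code.
type ClientErrorAction = "drop" | "breadcrumb" | "capture";

// Assumption: these are the 4xx codes worth keeping as context for later events.
const notableStatuses: ReadonlySet<number> = new Set([401, 403, 404]);

function classifyClientError(status: number): ClientErrorAction {
  // Rate limits are handled as transient errors (Category 2), not user input.
  if (status === 429) return "capture";
  if (status >= 400 && status < 500) {
    return notableStatuses.has(status) ? "breadcrumb" : "drop";
  }
  // Everything outside 4xx goes on to the normal capture path.
  return "capture";
}
```

In a request error handler, "breadcrumb" would typically map to a call such as `Sentry.addBreadcrumb(...)`, and "capture" to the normal exception path.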

2. Transient errors

Network timeouts, third-party API failures, rate-limit hits. They happen, you retry, the next attempt succeeds.

Capture them at a sample rate (1-5%) for trend analysis. Don't alert on individual occurrences. Alert only when the rate exceeds a threshold (e.g., "more than 5% of Stripe calls failed in the last 5 minutes").

In Sentry: tag these errors with a transient: true attribute. Set up an issue alert that ignores them. Set up a metric alert on the rate.
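
Sentry's metric alerts evaluate the rate for you, but the logic is easy to picture as a sliding window. The sketch below is an illustration under assumptions: `FailureRateWindow` and its `minCalls` guard are hypothetical, and timestamps are passed in explicitly so the behavior is deterministic.

```typescript
// Hypothetical sliding-window failure-rate tracker for one dependency (e.g. Stripe calls).
interface CallOutcome {
  timestampMs: number;
  failed: boolean;
}

class FailureRateWindow {
  private outcomes: CallOutcome[] = [];
  private readonly windowMs: number;
  private readonly threshold: number; // e.g. 0.05 for 5%
  private readonly minCalls: number;  // assumption: skip alerting on tiny samples

  constructor(windowMs: number, threshold: number, minCalls: number = 20) {
    this.windowMs = windowMs;
    this.threshold = threshold;
    this.minCalls = minCalls;
  }

  // Record one call outcome and discard anything older than the window.
  record(failed: boolean, nowMs: number): void {
    this.outcomes.push({ timestampMs: nowMs, failed });
    this.prune(nowMs);
  }

  // Fraction of calls inside the window that failed (0 when there were none).
  failureRate(nowMs: number): number {
    this.prune(nowMs);
    if (this.outcomes.length === 0) return 0;
    const failures = this.outcomes.filter((o) => o.failed).length;
    return failures / this.outcomes.length;
  }

  // True when enough calls were seen and the failure rate is above the threshold.
  shouldAlert(nowMs: number): boolean {
    const rate = this.failureRate(nowMs);
    return this.outcomes.length >= this.minCalls && rate > this.threshold;
  }

  private prune(nowMs: number): void {
    const cutoff = nowMs - this.windowMs;
    this.outcomes = this.outcomes.filter((o) => o.timestampMs > cutoff);
  }
}
```

A window of `5 * 60 * 1000` ms with a threshold of `0.05` corresponds to the "more than 5% in the last 5 minutes" rule. Individual failures never alert on their own; only the aggregate does.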

3. Code bugs

Genuine bugs. Null reference, type mismatch, off-by-one, the line of code that was wrong since deploy.

These are the ones to alert on. Every new occurrence of a code bug should reach the team.

But "every new occurrence" needs grouping. Sentry's default fingerprinting (group by stack trace) is usually right. Don't disable it. A single bug producing 10,000 errors is one alert, not 10,000.
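
To see why grouping collapses an alert storm, here is a deliberately simplified sketch of fingerprinting by error type and top stack frames. It is not Sentry's actual grouping algorithm, which is considerably more involved; `CapturedError`, `fingerprintOf`, and `groupByFingerprint` are hypothetical names.

```typescript
// Hypothetical, simplified error shape; real events carry far more detail.
interface CapturedError {
  type: string;     // e.g. "TypeError"
  frames: string[]; // "function@file" entries, innermost first
  message: string;  // often contains per-request data, so it is NOT part of the key
}

// Build a grouping key from the error type and the top few stack frames.
function fingerprintOf(err: CapturedError, depth: number = 3): string {
  return [err.type, ...err.frames.slice(0, depth)].join("|");
}

// Count occurrences per fingerprint: one entry per underlying bug.
function groupByFingerprint(errors: CapturedError[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const err of errors) {
    const key = fingerprintOf(err);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```

Ten thousand events from the same broken query differ only in their messages, so they share one key and produce one group. When the default grouping is wrong for a specific error, Sentry lets you override it by setting `event.fingerprint` rather than turning grouping off.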

4. Environment errors

Database connection refused, disk full, container OOM-killed. The code is fine; the environment failed.

Alert immediately. These usually indicate infrastructure problems, not code problems, and they tend to cascade.

Separate channel from code bugs. The triage for "the database is down" is different from "we have a null reference."

5. Performance regressions

Slow page loads, slow API responses, long-running queries. Not errors per se, but related.

Track separately (often outside Sentry, in a tool like Datadog or Grafana). Alert on threshold breaches: "p95 API response time exceeded 1 second for 5 minutes."
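
The "p95 exceeded 1 second for 5 minutes" rule has two parts: a percentile and a sustained-breach check. Monitoring tools compute both for you; the sketch below only illustrates the arithmetic, using the nearest-rank percentile method and assuming latencies are collected in one-minute buckets. `percentile` and `sustainedBreach` are hypothetical helper names.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// True when p95 was above the limit in each of the last `minutes` one-minute buckets.
// Assumption: an empty bucket (no traffic) does not count as a breach.
function sustainedBreach(bucketsMs: number[][], limitMs: number, minutes: number): boolean {
  if (bucketsMs.length < minutes) return false;
  return bucketsMs
    .slice(-minutes)
    .every((bucket) => bucket.length > 0 && percentile(bucket, 95) > limitMs);
}
```

Requiring every bucket in the window to breach is what keeps a single slow minute from paging anyone.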

The configuration that works

Specific Sentry setup we ship on most projects:

ts
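// Assumption: the Node SDK; swap in the package for your platform.
import * as Sentry from '@sentry/node';

// The subset of error fields this filter inspects.
type HttpLikeError = { code?: string; statusCode?: number };

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  release: process.env.GIT_SHA,
  tracesSampleRate: 0.1, // 10% of traces, more is just expensive
  beforeSend(event, hint) {
    // originalException is typed as unknown, so narrow it before reading fields
    const err = hint.originalException as HttpLikeError | null | undefined;
    // Category 2: tag transient errors (alerting rule will ignore individual occurrences).
    // This runs before the 4xx filter so that 429s are tagged instead of dropped.
    if (isTransient(err)) {
      event.tags = { ...event.tags, transient: 'true' };
      // Sample transient errors at 5%
      return Math.random() < 0.05 ? event : null;
    }
    // Category 1: filter user input errors (4xx responses)
    const status = err?.statusCode;
    if (status !== undefined && status >= 400 && status < 500) {
      return null;
    }
    return event;
  },
});

// Errors that usually succeed on retry: dropped connections, timeouts,
// rate limits, and upstream outages.
function isTransient(err: HttpLikeError | null | undefined): boolean {
  return err?.code === 'ECONNRESET'
      || err?.code === 'ETIMEDOUT'
      || err?.statusCode === 429
      || err?.statusCode === 502
      || err?.statusCode === 503;
}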

These are the most impactful 30 or so lines you can add. In our experience they filter out the roughly 80% of "errors" that are noise, tag the roughly 15% that are transient failures, and leave the remaining 5% or so, the real bugs, to land in the alert channel.

Three channels, not one

Most teams have one error channel in Slack. Split it into three:

  • #bugs-prod — code bugs (Category 3). New occurrences of a genuine code error. The team reads this channel.
  • #infra-prod — environment errors (Category 4). DB down, disk full, container restarts. On-call reads this.
  • #metrics-prod — performance and transient-error rate alerts (Categories 2 and 5). Threshold breaches only. Engineering reviews this weekly.

User input errors (Category 1) don't get a channel. They're not events.

The reason for three channels: each has a different urgency and audience. Bugs are for the team to triage and assign. Infra is for whoever is on-call. Metrics is for the engineering manager to spot trends.

One mixed channel forces everyone to filter the same noise, and nobody does the filtering consistently.
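
In Sentry the routing itself lives in alert rules (filtered by tags and environment), but the mapping is simple enough to write down as a table. A minimal sketch, with `ErrorCategory`, `Channel`, and `channelFor` as hypothetical names:

```typescript
// The five categories from this post.
type ErrorCategory = "user-input" | "transient" | "code-bug" | "environment" | "performance";
type Channel = "#bugs-prod" | "#infra-prod" | "#metrics-prod";

// One destination per category; null means "never notify anyone".
const channelByCategory: Record<ErrorCategory, Channel | null> = {
  "user-input": null,            // Category 1: not an event at all
  "transient": "#metrics-prod",  // Category 2: rate alerts only
  "code-bug": "#bugs-prod",      // Category 3: new occurrences
  "environment": "#infra-prod",  // Category 4: on-call
  "performance": "#metrics-prod" // Category 5: threshold breaches
};

function channelFor(category: ErrorCategory): Channel | null {
  return channelByCategory[category];
}
```

Typing the table as a `Record` over every category means adding a sixth category without deciding where it goes is a compile error.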

Operational discipline

Three habits that determine whether the setup stays useful:

1. Triage every alert in #bugs-prod

When a new bug alert fires, someone assigns it to themselves and either fixes it or marks it as not-a-bug. Issues that sit in the channel for days are a signal that the channel isn't being read.

Sentry has an "ignore" button (recent versions of the UI label it "Archive"). Use it. Marking a non-bug as ignored prevents it from firing again. The channel stays clean.

2. Resolve fixed issues

When a bug is fixed in a deploy, mark it resolved in Sentry. Then add an alert rule for regressions: if the same fingerprint reoccurs after being resolved, the issue reopens and an alert fires. This catches regressions automatically.
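
The triage states from habits 1 and 2 form a small state machine. The sketch below models it in memory to show which occurrences should alert. Sentry tracks this for you; `IssueTracker` and its method names are hypothetical.

```typescript
type IssueStatus = "unresolved" | "resolved" | "ignored";
type AlertKind = "new" | "regression" | null;

// Hypothetical in-memory model of issue state, keyed by fingerprint.
class IssueTracker {
  private statuses = new Map<string, IssueStatus>();

  resolve(fingerprint: string): void {
    this.statuses.set(fingerprint, "resolved");
  }

  ignore(fingerprint: string): void {
    this.statuses.set(fingerprint, "ignored");
  }

  // Called for each incoming event; returns the alert to fire, if any.
  observe(fingerprint: string): AlertKind {
    const status = this.statuses.get(fingerprint);
    if (status === undefined) {
      this.statuses.set(fingerprint, "unresolved");
      return "new"; // first time this bug has been seen
    }
    if (status === "resolved") {
      this.statuses.set(fingerprint, "unresolved");
      return "regression"; // a fixed bug came back
    }
    return null; // already known (unresolved) or deliberately ignored
  }
}
```

Repeat occurrences of a known issue and anything ignored stay silent, which is what keeps #bugs-prod readable.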

3. Review the metrics-prod channel weekly

Trend alerts are silent unless someone reviews them. A 15-minute review every Monday catches the slow-burn issues (transient error rates creeping up, p95 latency drifting, a new third-party API getting flaky).

Without the review, slow-burn issues only become alerts when they become outages. The weekly review is cheap insurance.

What to monitor in addition

Sentry catches errors. It doesn't catch everything that matters:

Uptime monitoring. Pingdom, Better Uptime, UptimeRobot. Hits your endpoints from external locations, alerts if they fail. Catches outages where the server is down (a server that isn't running can't report its own errors to Sentry).

Synthetic checks. Scripted user journeys ("can I log in? can I see the dashboard?") run every 5 minutes from a few regions. Checkly or Datadog Synthetics. Catches issues that errors don't (a page that loads but is broken).

Real-user monitoring (RUM). Datadog RUM, Sentry's Replay product, or just the browser performance APIs reported to your own analytics. Catches issues that affect users but don't throw errors (a CTA that doesn't fire its event handler, a form that silently fails to submit).

Each one covers a category of failure the others miss. Pick the two or three that fit your product; don't try to run all of them.

A reasonable setup

Specific stack we recommend for a typical web app:

  • Sentry for error tracking. The configuration above. Three Slack channels.
  • Better Uptime for uptime monitoring. Pings the homepage and /api/health every minute. Alerts on-call if two consecutive checks fail.
  • Sentry Replay or PostHog Recordings for the "user reported a weird issue" cases. Look at what they actually did.
  • A custom dashboard showing error rate, p95 latency, and traffic over the last 7 days. We use Grafana for this; Datadog is also fine.
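
The "two consecutive checks fail" rule in the uptime bullet is worth spelling out, because it is what keeps a single dropped ping from paging on-call. Uptime services implement this themselves; the sketch below only illustrates the logic, and `ConsecutiveFailureAlert` is a hypothetical name.

```typescript
// Hypothetical detector: alert once when `threshold` checks in a row have failed.
class ConsecutiveFailureAlert {
  private failures = 0;
  private alerting = false;
  private readonly threshold: number;

  constructor(threshold: number = 2) {
    this.threshold = threshold;
  }

  // Feed in each check result; returns true only on the check that opens an incident.
  record(ok: boolean): boolean {
    if (ok) {
      // Any success resets the streak and closes the incident.
      this.failures = 0;
      this.alerting = false;
      return false;
    }
    this.failures += 1;
    if (this.failures >= this.threshold && !this.alerting) {
      this.alerting = true;
      return true;
    }
    return false; // below threshold, or already alerted for this outage
  }
}
```

With one-minute checks and a threshold of 2, an outage pages after roughly two minutes, and a long outage pages once rather than every minute.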

Total cost: typically under $200/month for an app with moderate traffic. The signal-to-noise ratio determines whether the team actually reads the alerts, which determines whether the setup is worth paying for at all.

What we stopped doing

Configurations we used to recommend and now skip:

  • Capturing every console.warn. Adds noise without value. Capture errors, not warnings.
  • Capturing every analytics-script failure. Third-party scripts fail constantly. Filter them at the SDK level.
  • Alerting on individual front-end errors. With a million page views, you get a long tail of bots, browser extensions, and outdated browsers throwing weird errors. Set a rate threshold.
  • Routing alerts directly to email. Email is for things you'll read later. Alerts are for now. Use Slack/Discord/PagerDuty.

The simpler the setup, the more likely it stays useful.

A summary

  • Five categories of errors, each with different treatment.
  • Filter user input errors out entirely.
  • Sample transient errors; alert on their rate, not their individual occurrences.
  • Group code bugs by stack trace; alert on new occurrences.
  • Alert immediately on environment errors.
  • Track performance separately.

The goal isn't zero alerts. It's that every alert is genuine signal. A team that gets one alert a week and reads it carefully outperforms a team that gets 50 alerts a day and ignores all of them.

If you want a web app shipped with this monitoring setup baked in, see how we work on web apps.
