Security

Calibration as infrastructure: building a detonation lab

Six stages, four tensions, and the engineering decisions that determine whether a detonation lab earns its keep. How specimens enter, deploy, detonate, are observed, are analysed, and are archived, with notes on the tradeoffs that recur at every stage and the boring parts that compound over years. Architecture and decisions, not code. Engineering register, not academic.

Arthur Dutra · May 24, 2026 · 31 min read

A what lab?

A detonation lab is the piece of security infrastructure that most security work secretly depends on, and the piece that almost no one writes about. The silence is not strategic. The lab is unglamorous to describe, the engineering is closer to data engineering than to anything that reads as security, and the value of a good lab compounds over years in a way that doesn't lend itself to discrete stories. None of this is a secret. It's just hard to make interesting.

What follows is the architecture of a working detonation lab at the level of decisions, organised around the workflow that every sample passes through: ingestion, deployment, detonation, capture, analysis, archival. The treatment is engineering throughout, with the same abstraction commitment the rest of this category has maintained. No specific tooling is named where naming it would constitute a recommendation, and no operational thresholds are provided. The decisions, however, are concrete, because at this level of the work concrete is the only useful register.

The post is structured around the workflow lifecycle. Section one is the design constraint set: what makes a lab useful, and where its tensions live. Sections two through six follow a specimen through its passage. It enters at section two, is deployed and detonated through section three, is observed in section four, is analysed in section five, and is archived in section six. Section seven closes.

Section 1. Why labs exist, and what they're for

A lab earns its keep by answering questions that nothing else can answer cheaply. There are four of those questions, and each lab is shaped by which ones its operators care most about answering.

The first is the empirical question: what does this specimen actually do? You have a sample. You can read it statically, which gives you intent. You can run it in your head, which gives you what you expect. You can run it on a real host, which gives you ground truth, at the cost of contaminating the host. The lab exists to provide ground truth without the contamination cost, on a budget of time and resources small enough that you can do it routinely rather than rarely.

The second is the comparative question: what does this look like through a defender's instrumentation? You can detonate a specimen on a clean host, but a clean host tells you nothing about what a defender's instrumentation would have seen. The lab provides a host with whatever instrumentation you want to evaluate against, and lets you observe both the specimen's behaviour and the instrumentation's response to it in the same execution. This is the question that makes the lab valuable for offensive work: which products catch what techniques, and at what point in the execution they catch them.

The third is the diachronic question: is this thing the same as that other thing? A sample arrives. The lab has, over its lifetime, accumulated tens of thousands of others. Some are obvious variants of known families; some are evolutionary descendants of previous samples; some look new but resemble older work in ways that aren't apparent on first reading. The lab provides the index against which any new sample can be compared, and the comparison improves in quality every year the lab keeps running.

The fourth is the synthetic question: what happens when I change one thing? You isolate a variable (a Windows version, a security product version, a network configuration, a user behaviour) and you run the specimen against both states. The lab provides the controlled environment in which the change is the only difference, and the experiment yields information that no observational approach can produce.

These four questions are not independent. A well-designed lab can answer all of them, although the design choices that optimise one frequently degrade another. There are four tensions that surface repeatedly.

The first tension is fidelity against throughput. The more realistic the detonation environment, the longer it takes to provision and tear down. A fully simulated user environment with browsing history, installed applications, and recent document activity takes minutes to assemble; a bare Windows install takes seconds. Real malware probes the difference. The lab that prioritises throughput sees less. The lab that prioritises fidelity sees more, but processes fewer samples per day.

The second tension is isolation against realism. A lab with no network egress can detonate anything safely, but most modern specimens won't execute meaningfully without phoning home. A lab with controlled egress (sinkholing, traffic interception, simulated C2) gets richer observation, at the cost of building and maintaining the infrastructure that pretends to be the rest of the internet. A lab with live egress observes the most, at the cost of every sample being a tiny act of complicity with whoever's on the other end.

The third tension is depth against breadth. A lab can run a sample for thirty seconds and capture surface behaviour, or for thirty minutes and capture delayed payloads, persistence mechanisms, and second-stage downloads. You can't do both for every sample, because the budget doesn't allow it. Most labs end up with two tiers: a fast pass for triage that classifies samples by what they look like, and a slow pass for the ones the fast pass flagged as worth the investment.

The fourth tension is what to keep. Every detonation produces gigabytes of data: memory snapshots, disk diffs, network captures, instrumentation logs. Keeping all of it forever is expensive. Keeping summaries is cheaper, but you find out later which detail you needed and can't get it back. The lab's archival policy is the question of how much of the lab's history you can afford to keep retrievable, and the answer is always less than you wish.

These four tensions structure everything that follows. The sections that come after this one are, in effect, the local resolutions of these tensions at each stage of the workflow.

Section 2. Ingestion

Every sample enters the lab the same way, by passing through an intake queue, but the metadata it brings with it is different every time. This metadata is most of what makes the difference between a lab that produces useful output and one that produces noise.

The minimum metadata is the sample itself plus an integrity hash. The hash matters because most labs detonate the same sample many times over their lifetimes, and you want to know that you're looking at the same one. SHA-256 is the default. Some labs add a fuzzy hash (ssdeep or TLSH, for example) so that variants of a known sample can be matched against the original even when their bytes differ. The cost of fuzzy hashing is small enough that there's no reason not to do it on intake.
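A minimal sketch of the intake hash, assuming Python's standard `hashlib`. The function name `sha256_of_stream` is illustrative, not something from a particular lab. The fuzzy hash is deliberately left out, because the ssdeep and TLSH bindings are third-party packages and their exact APIs are not assumed here.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a binary stream, read in fixed-size chunks.

    Chunked reading keeps memory use flat even for large samples.
    """
    digest = hashlib.sha256()
    # iter(callable, sentinel) keeps calling read() until it returns b"".
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Usage: hash an in-memory stand-in for a submitted sample.
sample = io.BytesIO(b"abc")
print(sha256_of_stream(sample))
```

Hashing a stream rather than a path means the same function serves the upload handler, the programmatic interface, and the batch channel.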

After the hash, the most consequential field is provenance. Where did this come from? A sample submitted from an incident has a context (the affected host, the user, the timeframe) that constrains how to interpret what the detonation shows. A sample pulled from a public feed has no context; the detonation has to construct its own. A sample submitted from a red team operation has a target profile that informs what conditions the detonation should reproduce. The lab that loses provenance metadata at intake loses the ability to interpret its own observations downstream. Most of the cases where a lab's archive becomes unhelpful trace back to this loss.

Beyond provenance, the lab needs to know what kind of detonation environment the sample needs. Most samples can be classified into one of a small number of execution profiles by static inspection alone. A PE file targeting x86_64 Windows needs a Windows host. A document with macros needs an Office install. A shell script needs the right interpreter, in the right version, on the right base image. The classification can be done by hand for a small lab and has to be automated for any lab that processes more than ten samples a day.

The automation is interesting because it has to be wrong sometimes. Samples lie about their format. A file with a .doc extension is sometimes an HTA. A file with a PE header sometimes contains a payload for a different architecture that the header was crafted to mislead about. The intake classifier has to be confident enough to route most samples correctly, and humble enough to flag the ambiguous ones for review rather than guess.
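One way to sketch "confident but humble" routing is to compare what the leading bytes say against what the file extension claims, and to refuse to guess when they disagree. The profile names, the two lookup tables, and the `classify` function below are all hypothetical; a real classifier would inspect far more than a few magic bytes.

```python
from dataclasses import dataclass
from typing import Optional

# Leading magic bytes mapped to hypothetical execution-profile names.
MAGIC_PROFILES = [
    (b"MZ", "windows-pe"),
    (b"\x7fELF", "linux-elf"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "office-ole"),
    (b"PK\x03\x04", "zip-container"),
    (b"#!", "script"),
]

# What the extension claims. Deliberately incomplete.
EXTENSION_PROFILES = {
    ".exe": "windows-pe",
    ".dll": "windows-pe",
    ".doc": "office-ole",
    ".xls": "office-ole",
    ".docx": "zip-container",
    ".sh": "script",
}


@dataclass(frozen=True)
class Routing:
    profile: Optional[str]  # None when the classifier declines to route
    needs_review: bool
    reason: str


def classify(filename: str, head: bytes) -> Routing:
    """Route on content; flag for review when content and name disagree."""
    by_magic = next((p for magic, p in MAGIC_PROFILES if head.startswith(magic)), None)
    suffix = ("." + filename.rsplit(".", 1)[-1].lower()) if "." in filename else ""
    by_ext = EXTENSION_PROFILES.get(suffix)

    if by_magic is None:
        return Routing(None, True, "unrecognised content")
    if by_ext is not None and by_ext != by_magic:
        return Routing(None, True, f"extension suggests {by_ext}, content suggests {by_magic}")
    return Routing(by_magic, False, "content matches" if by_ext else "content only")


# A .doc whose content is not an OLE document goes to a human.
print(classify("invoice.doc", b"<html><hta:application"))
```

The design choice worth noting is that disagreement produces no profile at all, rather than a best guess with a warning attached.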

The intake API itself is engineering work that's easy to underestimate. You need at least three submission channels: a programmatic interface for tools that produce samples (red team frameworks, retrieval pipelines, monitoring systems), a file upload interface for humans, and a watched directory or bucket for batch ingestion. All three converge on the same queue, which stores the sample, its metadata, the submitter's identifier, and the timestamp of intake. The queue is durable, because samples lost at intake are samples you didn't analyse, and lost analyses are the failures the lab can't recover from.
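The durability requirement can be illustrated with a small queue backed by Python's standard `sqlite3` module. This is a sketch under stated assumptions: the table layout, the `state` values, and the function names are invented for illustration, and `claim_next` as written assumes a single worker rather than concurrent consumers.

```python
import sqlite3
import time
from typing import Optional, Tuple

SCHEMA = """
CREATE TABLE IF NOT EXISTS intake (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    sha256      TEXT NOT NULL,
    submitter   TEXT NOT NULL,
    channel     TEXT NOT NULL,   -- 'api', 'upload', or 'batch'
    received_at REAL NOT NULL,
    state       TEXT NOT NULL DEFAULT 'queued'
)
"""


def open_queue(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn


def enqueue(conn: sqlite3.Connection, sha256: str, submitter: str, channel: str) -> int:
    """Record a submission. The row is committed before the caller is told 'accepted'."""
    with conn:  # commits on success, rolls back if the block raises
        cur = conn.execute(
            "INSERT INTO intake (sha256, submitter, channel, received_at) VALUES (?, ?, ?, ?)",
            (sha256, submitter, channel, time.time()),
        )
    return cur.lastrowid


def claim_next(conn: sqlite3.Connection) -> Optional[Tuple[int, str]]:
    """Hand the oldest queued submission to the next stage (single-worker sketch)."""
    with conn:
        row = conn.execute(
            "SELECT id, sha256 FROM intake WHERE state = 'queued' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE intake SET state = 'claimed' WHERE id = ?", (row[0],))
    return row


# Usage: all three channels write to the same table.
queue_db = open_queue(":memory:")
enqueue(queue_db, "aa" * 32, "redteam-framework", "api")
enqueue(queue_db, "bb" * 32, "analyst-1", "upload")
print(claim_next(queue_db))
```

In practice the path would point at durable storage rather than `:memory:`, which is used here only so the sketch runs anywhere.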

A note on hygiene that I have seen labs get wrong more often than I would expect. The intake system is itself a target. A sample that crashes the classifier, or that exploits the file upload handler, or that triggers a vulnerability in the hash computation, defeats the lab before any detonation has occurred. The intake API runs in its own isolated environment, with no access to the lab's network beyond submission to the next stage. The temptation is to skip this isolation because intake feels like infrastructure rather than detonation; the discipline is to treat every sample as adversarial from the moment it touches the lab.

The last decision at intake is deduplication. If a sample's hash matches one the lab has already processed, do you run it again? The naive answer is no. The correct answer is sometimes. A sample that detonates differently on different Windows versions needs to be run against each version of interest. A sample submitted from a new incident may need to be re-run against the current versions of the security products the lab evaluates against, because those products have updated since the last detonation. The deduplication policy is therefore not "have we seen this hash before" but "have we run this specific sample, against this specific environment configuration, in the relevant timeframe." The bookkeeping is tedious. The labs that get it right are the ones that recognise the question is about the experiment, not the sample.
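The shift from "have we seen this hash" to "have we run this experiment" can be sketched as a change in what the deduplication key contains. The field names, the `RunLedger` class, and the freshness window are assumptions made for illustration; the window in particular is a placeholder, not a recommended value.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict


@dataclass(frozen=True)
class ExperimentKey:
    """Identity of an experiment, not just of a sample."""
    sample_sha256: str
    guest_image: str       # hypothetical base-image identifier
    product_version: str   # version of the security product under evaluation


class RunLedger:
    def __init__(self, freshness: timedelta):
        self._freshness = freshness
        self._last_run: Dict[ExperimentKey, datetime] = {}

    def record(self, key: ExperimentKey, when: datetime) -> None:
        self._last_run[key] = when

    def should_run(self, key: ExperimentKey, now: datetime) -> bool:
        """Run if this exact experiment is new, or if its last result is stale."""
        last = self._last_run.get(key)
        return last is None or now - last > self._freshness


# Usage: same sample, updated product version, so the experiment is new.
ledger = RunLedger(freshness=timedelta(days=30))  # placeholder window
now = datetime(2026, 1, 1, tzinfo=timezone.utc)
old = ExperimentKey("ab" * 32, "win11-office", "product-1.0")
new = ExperimentKey("ab" * 32, "win11-office", "product-1.1")
ledger.record(old, now)
print(ledger.should_run(old, now), ledger.should_run(new, now))
```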

Section 3. Deployment and detonation

This is where the lab's character is determined. Two labs with identical intake and identical analysis can produce wildly different output because they detonate samples in wildly different environments. The detonation environment is the lab's main engineering investment, and the choices it embodies show up in every observation the lab ever produces.

The first decision is the execution substrate. Virtual machines are the default for most malware work, for several reasons that compound. They provide snapshot and restore primitives that let you reset to a known clean state in seconds, which is the operation a detonation lab performs more than any other. They isolate the malware from the host with hardware-assisted boundaries that are significantly harder to escape than container or process isolation. They support arbitrary guest operating systems, which matters because the lab needs to detonate samples that target Windows versions you don't have hardware for. The cost is that VMs have more detectable artefacts than containers do, and modern malware checks for them.

The hypervisor choice is the next decision and it's load-bearing. The major hypervisors each leave a different fingerprint, and the malware authors know what those fingerprints look like. A specimen that detects KVM may behave differently than the same specimen on Hyper-V, because the author had a particular hypervisor's tells in mind when writing the anti-analysis logic. The lab's options are to detect what the specimen is checking for and lie about it (cat-and-mouse, expensive), to use a hypervisor with fewer well-known artefacts (rarer, more expensive operationally), or to accept that some samples will behave differently in the lab than in the wild and characterise the difference rather than try to eliminate it. Most mature labs do the third. The first two are productive for specific high-value samples, but not for routine work.

The guest configuration is where fidelity is won or lost. A bare Windows install with no user activity, no installed applications, and no browsing history is detectably synthetic. Specimens use this. They check for recently opened documents, for cookies in the browser, for installed productivity software, for shortcuts on the desktop, for the contents of the Recent folder. They check for the number of CPUs, the amount of RAM, the size of the disk, the model of the GPU. They check for the presence of analysis tools, including ones that aren't running but might have been recently. The lab's response is a base image that ages: real browsing history, real document activity, real installed software, real entropy in the file system. Building this image is expensive. Maintaining it is more expensive. Most labs underinvest here, and the cost shows up as samples that don't detonate fully in the lab and that the lab consequently underestimates.

I have a recommendation that doesn't quite rise to the level of a rule. Spend more time on the base image than feels reasonable. The base image is the single piece of infrastructure that affects every observation the lab ever produces, and the marginal hour spent making it more realistic returns more than the marginal hour spent on almost anything else. The lab that gets the base image right and everything else mediocre outperforms the lab that gets everything else right and the base image wrong.

After the substrate and the image comes the network. This is the sharpest tradeoff in the lab. Three configurations are common.

Full isolation, with no egress and no DNS resolution, is the safest. It is also the least useful, because most modern specimens fail to execute interesting behaviour without successful network communication. The lab that uses full isolation produces a lot of "specimen failed to do anything" reports that aren't true and aren't useful.

Simulated network, with controlled DNS responses and intercepted traffic, is the middle ground and the one most labs settle on. The DNS layer returns predictable but plausible responses (NXDOMAIN if the specimen queries a domain not in the simulated configuration, a controlled A record if it queries one that is). Outbound TCP and UDP connect to lab-controlled hosts that present whatever the specimen needs to see to continue execution: a fake C2 server, a fake update server, a fake captive portal. The cost is that building and maintaining the simulation is real engineering work, and the simulation is only as good as the most recent specimen behaviour you have calibrated it against.
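The DNS policy described above reduces to a lookup with a default. The sketch below shows only that decision, not a DNS server: `DnsAnswer`, `resolve_simulated`, and the zone contents are hypothetical, and the addresses come from the 192.0.2.0/24 documentation range as stand-ins for lab-controlled hosts.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class DnsAnswer:
    rcode: str               # "NOERROR" or "NXDOMAIN"
    address: Optional[str]   # lab-controlled IPv4 address, if any


def resolve_simulated(qname: str, zone: Dict[str, str]) -> DnsAnswer:
    """Answer from the simulated zone; everything else does not exist."""
    name = qname.rstrip(".").lower()  # DNS names are case-insensitive
    address = zone.get(name)
    if address is None:
        return DnsAnswer("NXDOMAIN", None)
    return DnsAnswer("NOERROR", address)


# Usage: one configured name, one unconfigured name.
zone = {"updates.example.test": "192.0.2.10"}
print(resolve_simulated("Updates.Example.Test.", zone))
print(resolve_simulated("unknown.example.test", zone))
```

Keeping the policy as a pure function makes it straightforward to replay a past detonation's queries against a revised zone and see what would have changed.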

Live network, with real internet egress, is the most informative and the most fraught. The specimen connects to its actual infrastructure, which yields the most realistic observation. The cost is that the specimen also does things you may not want it to do: it may exfiltrate data, contribute to a DDoS, or signal its operators that an analysis is happening. Mature labs use live egress sparingly, through a heavily monitored connection, only for samples where the simulation has already extracted what it can. The legal posture matters here, and varies by jurisdiction.

Detonation itself is the final stage of this section, and it's the least architecturally interesting because most of the engineering is upstream. What matters is how the specimen is triggered, what artefacts the trigger produces, and how long the run is allowed to continue.

Triggering is sample-specific. A PE executable can be launched directly. A document with macros needs to be opened in the right application with the macro security configured to allow execution. A shellcode blob needs a loader. A network exploit needs a target service running. The lab's orchestration layer maintains a library of trigger profiles, indexed by the intake classifier's output. The trigger itself is run by a process that the analysis pipeline knows to filter out of the observed activity, because the trigger is the lab's artefact rather than the specimen's.

The run duration is one of those decisions that feels small until you've made it wrong. Too short and you miss delayed behaviour, sleep-based evasion, and second-stage activity. Too long and your throughput collapses and you accumulate a lot of dead air that the analysis pipeline has to ignore. The pragmatic answer is two phases: a default duration for the fast pass (a few minutes), and a longer duration for samples that the fast pass flagged as worth following further (tens of minutes, sometimes hours). The boundary between the two has to be set deliberately. Most labs set it by accident and live with the consequences.

The teardown is the final detonation operation. Snapshot revert is the standard, because it's fast and complete, but it has a subtle failure mode: any state produced during the run that you need to keep, including network state and captured telemetry still buffered inside the guest, has to be moved out of the detonation environment before the revert. If the telemetry pipeline is asynchronous, the revert can race the pipeline and lose the tail of the observation. The fix is engineering discipline at the boundary between the detonation environment and the capture pipeline, which is where every lab eventually loses data at least once.

Section 4. Capture

Telemetry is the lab's main product, even though it doesn't feel like that from inside the lab. Everything else (the orchestration, the deployment, the network simulation) exists to produce telemetry. The capture pipeline determines what the lab knows about what it observed, and the limits of the capture pipeline are the limits of what the lab can ever tell you.

There are five categories of telemetry to capture from a detonation, and they come from different sources at different costs. The first is process telemetry: process creation and exit events, parent-child relationships, command-line arguments, image paths, signature status, integrity levels. Most of this is available from in-guest agents that subscribe to the operating system's own event channels (ETW on Windows, eBPF on Linux). The cost is that the agent is a process in the same environment as the specimen, which means the specimen can see it.

The second is file system telemetry: created, modified, deleted, and accessed files, with their paths, sizes, and (where the cost is acceptable) their content hashes before and after each change. The default mechanism on Windows is file system minifilter drivers. The default on Linux is fanotify or inotify. The granularity matters: a lab that captures only file creates and deletes misses the in-place modifications that ransomware and tampering produce.

The third is registry telemetry on Windows, which is structurally similar to file system telemetry but exposes a different category of state. The keys created, modified, and accessed during the detonation are most of the persistence story on Windows, and a lab that doesn't capture them at high fidelity is one that can't tell you how a specimen would survive a reboot.

The fourth is network telemetry. This one is the easiest to capture (at the host boundary, where a network tap can record every byte) and the hardest to interpret (because most of the bytes are encrypted, and most of the interpretation requires correlating the network capture with what was happening in the guest at the same time). The capture itself is straightforward: PCAP at the host boundary, with the simulated network's traffic also captured at the simulator's side for protocol-level context. The interpretation is where the engineering investment goes.

The fifth is memory telemetry: executable memory allocations, cross-process memory access, reflectively loaded modules, shellcode injections. This is the category that requires the deepest engineering investment because the operating system's own event channels expose only part of it. The rest requires either hypervisor introspection (looking at the guest's memory from outside the guest, which is invisible to anything running inside) or in-guest instrumentation that the specimen can detect. Hypervisor introspection is more powerful and more expensive; in-guest instrumentation is cheaper and easier to defeat.

A choice the lab has to make explicitly is where to put its probes. Three positions are common, and most labs use some combination of all three.

In-guest agents are the cheapest and the highest-fidelity for behaviour the operating system natively exposes. They run as processes or drivers inside the detonation environment. They see what the operating system sees, which is most of what matters. They are also visible to the specimen, which means the specimen can detect them, can refuse to execute if it sees them, or can target them. The mitigation is to make the agent look like something innocuous (a system process, a vendor service), which is an arms race the lab is going to lose periodically. The lab that depends entirely on in-guest agents has limited fidelity against specimens that take anti-analysis seriously.

Hypervisor introspection is invisible to the guest and therefore robust against anti-analysis. It can read guest memory, observe guest API calls at the syscall boundary, and reconstruct guest activity without any in-guest instrumentation. The cost is steep: the introspection layer is its own engineering project, the operating system semantics have to be reconstructed from raw memory rather than provided by the OS, and the introspection slows the guest down. Mature labs build introspection for the high-value cases and use in-guest agents for routine work.

Host-boundary capture sits between the lab's network and the detonation environment's network. It sees every byte of network traffic, and the capture itself is invisible to the specimen because the specimen can't see the boundary. This is the most reliable capture position and the one with the least adversarial pressure. It is also blind to anything that happens inside the guest, which is most of what's interesting.

The combination most labs settle on is: host-boundary capture for everything network-related (highest reliability, no detection risk), in-guest agents for process and file system telemetry (cheap, mostly reliable), and hypervisor introspection for memory telemetry on the small set of samples where it matters. The selection function is a routine engineering decision that the orchestration layer makes per-case.

A consideration that doesn't get enough attention is time synchronisation across telemetry sources. The capture pipeline aggregates data from in-guest agents, network captures, hypervisor introspection, and the orchestration layer itself, each with its own clock. If the clocks are not synchronised, a chronological view of the detonation is impossible, and a chronological view is what every analyst needs to make sense of the data. The fix is mechanical (NTP, a host-controlled time source, careful clock-drift correction) but it has to be implemented, and labs that didn't think about it at the beginning spend months retrofitting it later.
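Once each source's clock offset against a reference is known, building the chronological view is a correction followed by a merge. The sketch below assumes the offsets have already been measured (how is out of scope here) and that each source's events are already sorted by its own clock. The `Event` tuple layout and function names are illustrative.

```python
import heapq
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Tuple[float, str, str]  # (timestamp_seconds, source, description)


def corrected(events: Iterable[Event], offset: float) -> Iterator[Event]:
    """Shift one source's timestamps onto the reference clock.

    offset is how far the source's clock runs ahead of the reference.
    """
    for ts, source, desc in events:
        yield (ts - offset, source, desc)


def merge_streams(streams: Dict[str, List[Event]], offsets: Dict[str, float]) -> List[Event]:
    """Merge per-source streams into one chronological stream."""
    adjusted = [corrected(evts, offsets.get(name, 0.0)) for name, evts in streams.items()]
    # heapq.merge needs each input sorted; a constant shift preserves order.
    return list(heapq.merge(*adjusted, key=lambda e: e[0]))


# Usage: the in-guest agent's clock runs 2.0 s ahead of the host's.
streams = {
    "guest": [(12.0, "guest", "process start"), (13.5, "guest", "file write")],
    "pcap": [(10.5, "pcap", "dns query"), (11.25, "pcap", "tcp connect")],
}
for event in merge_streams(streams, {"guest": 2.0}):
    print(event)
```

Without the correction, the process start would appear to come after the DNS query it caused, which is the kind of inversion that makes a timeline unreadable.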

The capture pipeline's output is, in the labs I have seen work well, a single structured stream per detonation, indexed by time and source. Every event the lab observed is in the stream. The stream is the input to the analysis layer, and its quality is the upper bound on the analysis layer's quality. Investments in the capture pipeline compound; investments in the analysis layer don't compound past the capture pipeline's ceiling. This is the reason most experienced lab builders advise pouring engineering effort into capture rather than analysis when forced to choose.

Section 5. Analysis

The analysis layer is what the rest of the lab exists to feed. It takes the structured stream that capture produces and turns it into something a human can read, share, and act on. The engineering questions in analysis are different from those in the upstream layers: less about fidelity and more about reduction, less about isolation and more about reproducibility.

The first reduction is the per-event filtering. A detonation produces tens of thousands of events, most of which are background noise the operating system produces routinely. The analysis layer's first job is to subtract the baseline. The baseline is captured by detonating nothing, or by detonating a known-benign specimen, in the same environment, and recording the events that occur. Any event that appears in both the baseline and the specimen's detonation is, with caveats, not the specimen's doing. The caveats include events that are background-like but whose timing or sequencing relative to specimen events matters; events whose content is baseline-like but whose presence is suspicious; and events that the specimen induced in baseline processes by interaction. Subtracting too aggressively loses signal. Subtracting too gently buries the specimen's behaviour in noise. The right tuning is empirical, and the labs that get it right have spent time on it.
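The simplest form of baseline subtraction is a set difference over a normalised event key. The sketch below is the deliberately aggressive version: it ignores timing and sequencing, so it drops exactly the cases the caveats above warn about. The event fields (`image`, `action`, `target`) and function names are assumptions for illustration.

```python
from typing import Iterable, List, Set, Tuple

EventKey = Tuple[str, str, str]  # (process image, action, target)


def event_key(event: dict) -> EventKey:
    """Reduce an event to the fields used for baseline comparison."""
    return (event["image"], event["action"], event["target"])


def subtract_baseline(events: List[dict], baseline: Iterable[dict]) -> List[dict]:
    """Keep only events whose key never appeared in the baseline run."""
    known: Set[EventKey] = {event_key(e) for e in baseline}
    return [e for e in events if event_key(e) not in known]


# Usage
baseline = [
    {"image": "svchost.exe", "action": "file_write", "target": r"C:\Windows\Temp\log.tmp"},
]
detonation = [
    {"image": "svchost.exe", "action": "file_write", "target": r"C:\Windows\Temp\log.tmp"},
    {"image": "sample.exe", "action": "file_write", "target": r"C:\Users\lab\notes.txt"},
]
for event in subtract_baseline(detonation, baseline):
    print(event)
```

Which fields go into the key is the tuning knob: a coarser key subtracts more, a finer key subtracts less.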

The second reduction is the structural one. After filtering, the remaining events still form a flat stream. The analyst needs structure: which process did what, which file came from where, which network connection was made by which thread. The analysis layer reconstructs this structure by walking the event stream and building a graph: processes as nodes, parent-child and access relationships as edges, files and registry keys and network endpoints as leaf nodes, time as the dimension along which the graph evolves. The graph is what the analyst actually looks at, and the quality of the graph is most of what determines whether the analyst can understand what happened.

A complication that catches new labs by surprise: the graph is not always tree-shaped. Cross-process activity (injection, shared memory, IPC) produces edges that don't fit the parent-child model. Persistence mechanisms produce edges across the reboot boundary, where the surviving artefact is a file or registry key rather than a process. Network behaviour produces edges to nodes that aren't on the host at all. The graph data structure has to support this from the start, or it has to be rebuilt later, and rebuilding it later is more expensive than building it right the first time.
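A graph that tolerates non-tree edges from the start can be sketched as typed nodes with typed, timestamped edges, where nothing assumes a single parent or the absence of cycles. The node kinds, relation names, and class names below are illustrative choices, not a fixed vocabulary.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Node:
    kind: str   # "process", "file", "registry", or "endpoint"
    ident: str


@dataclass(frozen=True)
class Edge:
    src: Node
    dst: Node
    relation: str     # "spawned", "injected_into", "wrote", "connected_to", ...
    timestamp: float


class ActivityGraph:
    def __init__(self) -> None:
        self._out: Dict[Node, List[Edge]] = defaultdict(list)

    def add(self, edge: Edge) -> None:
        self._out[edge.src].append(edge)

    def edges_from(self, node: Node) -> List[Edge]:
        """Outgoing edges in time order."""
        return sorted(self._out.get(node, []), key=lambda e: e.timestamp)

    def reachable(self, start: Node) -> Set[Node]:
        """Everything influenced by start, following any relation. Cycle-safe."""
        seen, stack = {start}, [start]
        while stack:
            for edge in self._out.get(stack.pop(), []):
                if edge.dst not in seen:
                    seen.add(edge.dst)
                    stack.append(edge.dst)
        return seen


# Usage: injection links two processes that have no parent-child relationship.
graph = ActivityGraph()
loader = Node("process", "loader.exe:100")
host = Node("process", "explorer.exe:50")
remote = Node("endpoint", "192.0.2.10:443")
graph.add(Edge(loader, host, "injected_into", 1.0))
graph.add(Edge(host, remote, "connected_to", 2.0))
print(remote in graph.reachable(loader))
```

Because the relation is data on the edge rather than structure in the container, adding a new kind of cross-process activity later does not require a rebuild.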

The third reduction is the categorisation. Once the graph exists, the lab can map specimen behaviour against a vocabulary of known techniques. MITRE ATT&CK is the standard vocabulary, and although it has weaknesses (it's coarse-grained, it conflates technique with implementation, it's a moving target), it's the vocabulary the rest of the security community uses. The mapping is part automated (some techniques have telemetry signatures the lab can recognise directly) and part manual (some require analyst judgement). The output is a list of technique identifiers, with the specific telemetry that supports each identification. This is the form in which lab findings are usually shared outside the team.

A separate question is what the lab does with the products that were running in the detonation environment. If the lab is evaluating an EDR by detonating samples against it, the analysis layer has to compare the EDR's view of the detonation against the lab's own view. The lab knows what happened because it captured everything at the host boundary, the hypervisor, and in-guest agents. The EDR knows what it caught, which is a subset. The interesting output is the diff: what the lab observed that the EDR missed, and (more rarely) what the EDR claimed that the lab didn't observe. This diff is the lab's most valuable product for offensive work, because it tells you which techniques the EDR doesn't see. It is also the lab's most valuable product for defensive vendor engineering, because it tells the vendor where their telemetry is incomplete.
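At the level of technique identifiers, the diff is three set operations. The sketch below assumes both views have already been mapped to ATT&CK technique IDs, which is the hard part and is not shown. `CoverageDiff` and `coverage_diff` are hypothetical names, and the IDs in the usage example are only sample inputs.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class CoverageDiff:
    missed_by_product: FrozenSet[str]        # lab saw it, product did not report it
    seen_by_both: FrozenSet[str]
    claimed_only_by_product: FrozenSet[str]  # product reported it, lab did not see it


def coverage_diff(lab_observed: Set[str], product_reported: Set[str]) -> CoverageDiff:
    return CoverageDiff(
        missed_by_product=frozenset(lab_observed - product_reported),
        seen_by_both=frozenset(lab_observed & product_reported),
        claimed_only_by_product=frozenset(product_reported - lab_observed),
    )


# Usage with illustrative technique identifiers.
lab = {"T1055", "T1547.001", "T1071.001"}
product = {"T1055", "T1059.001"}
diff = coverage_diff(lab, product)
print(sorted(diff.missed_by_product))
print(sorted(diff.claimed_only_by_product))
```

The third bucket is worth keeping even though it is usually empty: a non-empty value means either the product raised a false positive or the lab's own capture has a gap.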

A pattern I keep seeing in labs that don't quite work yet: the analysis layer is overbuilt and the comparison-to-historical-samples is underbuilt. New analysts want to write sophisticated analysis logic. The lab's actual leverage comes from being able to look at a new specimen and immediately surface the closest matches in the archive, with their analysis already done. The new analysis becomes an extension of the existing analysis for similar samples, rather than a fresh effort. Building this lookup is unglamorous, depends entirely on the quality of the archive (section six), and is more valuable than any individual piece of analysis logic. Most labs underweight it.

Reporting is the last analytical step. The output of a detonation is, depending on the audience, a structured artefact (JSON, STIX, MISP-format) for downstream tooling, or a narrative artefact (a written report) for human readers. The two forms aren't redundant. The structured form is for the lab's own indexing and for inter-lab exchange. The narrative form is what the customer reads. The labs that automate both well are the ones that have invested in a templating layer that converts the same underlying findings into both forms without losing fidelity. The labs that automate only the structured form produce output that no one outside the team can read. The labs that automate only the narrative form produce output that doesn't index against the archive.
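The templating idea can be sketched as two renderers over one findings structure, so that neither output can drift from the other. The `Finding` fields and both function names are assumptions, and the structured form here is plain JSON rather than STIX or MISP, whose schemas are not reproduced.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class Finding:
    technique_id: str
    summary: str
    evidence_event_ids: List[int]  # pointers back into the event stream


def render_structured(sample_sha256: str, findings: List[Finding]) -> str:
    """Machine-readable form, for indexing and exchange."""
    document = {
        "sample_sha256": sample_sha256,
        "findings": [asdict(f) for f in findings],
    }
    return json.dumps(document, sort_keys=True, indent=2)


def render_narrative(sample_sha256: str, findings: List[Finding]) -> str:
    """Human-readable form, generated from the same findings."""
    lines = [f"Sample {sample_sha256}: {len(findings)} finding(s)."]
    for f in findings:
        refs = ", ".join(str(i) for i in f.evidence_event_ids)
        lines.append(f"- {f.technique_id}: {f.summary} (events {refs})")
    return "\n".join(lines)


# Usage
findings = [Finding("T1547.001", "Adds a Run key for persistence", [41, 42])]
print(render_structured("ab" * 32, findings))
print(render_narrative("ab" * 32, findings))
```

Carrying the evidence event IDs into both forms is what keeps the narrative auditable against the structured record.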

Section 6. Archival

The archive is where the lab's value compounds. Each individual detonation is a measurement; the archive is the measurement series. A specimen analysed today against last year's archive yields more than the same specimen analysed against nothing, and the value of yesterday's analyses increases every time today's analysis adds context to them. The engineering question is how to keep the archive useful as it grows.

What to keep is the first decision and the one most labs get wrong by overcorrection. Keep everything and the storage cost dominates the lab's budget within a year, and the retrieval costs dominate the analysis time within two. Keep only summaries and you find out, three years later, that the one piece of context you needed to understand a new specimen is in the raw memory dump that your retention policy deleted. The right answer is tiered.

The hot tier holds the raw artefacts of recent detonations: PCAPs, memory snapshots, disk diffs, full event streams. Recent here is operational rather than calendar-based. The right retention is the time window across which your analysts are likely to need to go back and re-examine the raw data, which for most labs is somewhere between a month and a quarter. Storage in this tier is fast and expensive.

The warm tier holds the reduced artefacts: the structured event streams after baseline subtraction, the analysis graphs, the ATT&CK mappings, the analyst reports. These survive deletion from the hot tier by being smaller (tens of megabytes per detonation rather than tens of gigabytes) and more durable in their utility (the reduction doesn't lose what an analyst needs to know months later). Retention here is years, and the storage is correspondingly cheaper.

The cold tier holds the indexes: hashes, fuzzy hashes, technique mappings, key telemetry features. This survives essentially forever, because it's small and because it's the layer that makes the archive searchable. A lab's ability to answer "have we seen this before, and if so what was it" depends entirely on the cold tier being complete and current.
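The three tiers can be expressed as a retention limit per artefact class, with the index class never expiring. The durations below are placeholders chosen only so the sketch runs, not recommended values, and the class names (`raw`, `reduced`, `index`) and function name are illustrative.

```python
from datetime import timedelta
from typing import Dict, Iterable, Optional, Set

# Retention per artefact class. None means "keep indefinitely".
# Placeholder durations; the right values are operational, not universal.
RETENTION: Dict[str, Optional[timedelta]] = {
    "raw": timedelta(days=90),           # hot tier: PCAPs, memory snapshots, disk diffs
    "reduced": timedelta(days=5 * 365),  # warm tier: filtered streams, graphs, reports
    "index": None,                       # cold tier: hashes, mappings, features
}


def expired_kinds(
    kinds_present: Iterable[str],
    age: timedelta,
    retention: Dict[str, Optional[timedelta]] = RETENTION,
) -> Set[str]:
    """Return the artefact classes of one detonation that are past retention."""
    expired = set()
    for kind in kinds_present:
        limit = retention[kind]
        if limit is not None and age > limit:
            expired.add(kind)
    return expired


# Usage: a detonation from 120 days ago loses only its raw artefacts.
print(expired_kinds(["raw", "reduced", "index"], timedelta(days=120)))
```

Expressing the policy per class rather than per detonation makes it explicit that a detonation is never deleted as a whole; it only sheds its heavier layers over time.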

Building the tiering is engineering work that's straightforward to describe and easy to underbuild. The cost of underbuilding it shows up as the lab not being able to answer questions that should be cheap, because the data is either gone or unreachable.
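The tiering can be sketched as a function from a detonation's age to the set of artefact kinds the lab still holds for it. This is a minimal sketch under stated assumptions: the window lengths are placeholders rather than recommendations, the names (Tier, tier_for, artefacts_to_delete) are hypothetical, and the sketch reads the tiers as cumulative, so a recent detonation has raw, reduced, and index artefacts at once.

```python
from datetime import datetime, timedelta
from enum import Enum


class Tier(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"


# Placeholder windows for illustration only. The real values come from how far
# back the lab's analysts actually go, not from these numbers.
HOT_WINDOW = timedelta(days=90)
WARM_WINDOW = timedelta(days=5 * 365)

# Artefact kinds that are still retained while a detonation sits in each tier.
RETAINED = {
    Tier.HOT: {"raw", "reduced", "index"},
    Tier.WARM: {"reduced", "index"},
    Tier.COLD: {"index"},
}


def tier_for(detonated_at, now, hot=HOT_WINDOW, warm=WARM_WINDOW):
    """Return the tier a detonation belongs to, based on its age."""
    age = now - detonated_at
    if age <= hot:
        return Tier.HOT
    if age <= warm:
        return Tier.WARM
    return Tier.COLD


def artefacts_to_delete(kinds_present, detonated_at, now):
    """Return the artefact kinds that the current tier no longer retains."""
    return set(kinds_present) - RETAINED[tier_for(detonated_at, now)]
```

A retention job built this way only ever deletes what the table says is out of scope, which keeps the policy in one place where it can be reviewed and changed.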

The second archival decision is the index. The archive is only as useful as its lookup capabilities. A lab that can find samples by hash but not by behavioural signature can answer "do we have this exact specimen" but not "do we have anything like this specimen." The set of indexes the lab maintains is determined by the queries the lab actually wants to support. The common ones are: hash and fuzzy hash (exact and near-exact identity), technique mapping (samples that exhibit this ATT&CK technique), feature vector (samples whose behaviour resembles this one in some embedding space), and metadata (samples from this campaign, this submitter, this time range). Building each index is real work; choosing which to maintain is a question of which queries are worth the cost.
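The relationship between indexes and queries can be made concrete with a toy in-memory version. Everything here is an illustrative assumption: the Record fields, the ArchiveIndex class, and in particular the similarity query, which uses Jaccard overlap of technique sets as a simple stand-in for a real feature-vector index. The technique IDs in the usage example are only sample values.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    sha256: str
    techniques: frozenset  # ATT&CK technique IDs observed during detonation
    campaign: str


class ArchiveIndex:
    """Toy index: each dictionary exists to serve exactly one kind of query."""

    def __init__(self):
        self._by_hash = {}
        self._by_technique = defaultdict(set)
        self._by_campaign = defaultdict(set)

    def add(self, record):
        self._by_hash[record.sha256] = record
        for technique in record.techniques:
            self._by_technique[technique].add(record.sha256)
        self._by_campaign[record.campaign].add(record.sha256)

    def exact(self, sha256):
        """Answer: do we have this exact specimen?"""
        return self._by_hash.get(sha256)

    def with_techniques(self, techniques):
        """Hashes of records that exhibit every one of the given techniques."""
        sets = [self._by_technique.get(t, set()) for t in techniques]
        return set.intersection(*sets) if sets else set()

    def similar(self, techniques, min_jaccard):
        """Answer: do we have anything like this? Returns (hash, score) pairs,
        best match first, scored by Jaccard overlap of technique sets."""
        query = frozenset(techniques)
        scored = []
        for sha256, record in self._by_hash.items():
            union = query | record.techniques
            if not union:
                continue
            score = len(query & record.techniques) / len(union)
            if score >= min_jaccard:
                scored.append((sha256, score))
        return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


idx = ArchiveIndex()
idx.add(Record("aa", frozenset({"T1055", "T1071"}), "campaign-1"))
idx.add(Record("bb", frozenset({"T1055"}), "campaign-2"))
matches = idx.similar({"T1055", "T1071"}, min_jaccard=0.6)
```

Note that the similarity query scans every record, while the other two are dictionary lookups. That difference is the cost the text refers to: each query the lab wants to be cheap needs its own maintained structure.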

A consideration that catches labs by surprise: the archive is a target. An attacker who can read the lab's archive learns which specimens the lab has analysed and what conclusions the lab drew. An attacker who can write to the archive can poison future analyses by inserting false records. The access control on the archive is therefore a meaningful security boundary, and it has to be designed at the beginning rather than retrofitted. Audit logging of archive reads and writes is the minimum; role-based access control with separation between submission, analysis, and administration is the next level; cryptographic signing of archived records is the level above that. Most labs underbuild this until something forces them to invest, which is usually too late.
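The top level of that ladder, tamper evidence on archived records, can be sketched with the standard library. One assumption needs stating loudly: this uses an HMAC, which is a keyed integrity tag rather than a signature in the asymmetric sense, so anyone who can verify can also forge. A production design would more likely use asymmetric signatures with the signing key held outside the archive. The function names and record fields are hypothetical.

```python
import hashlib
import hmac
import json


def canonical(record):
    """Serialise a record deterministically so equal records produce equal bytes."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(record, key):
    """Return a hex HMAC-SHA256 tag over the canonical form of the record."""
    return hmac.new(key, canonical(record), hashlib.sha256).hexdigest()


def verify(record, tag, key):
    """Check a stored tag against the record using a constant-time comparison."""
    return hmac.compare_digest(sign(record, key), tag)


key = b"illustrative-key-only"  # in practice, held outside the archive
record = {"sha256": "aa", "verdict": "malicious"}
tag = sign(record, key)
tampered = dict(record, verdict="benign")
```

With the tag stored alongside the record, verify(record, tag, key) holds for the original and fails for the tampered copy, which is the property that makes a poisoned record detectable at read time.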

The longitudinal value of the archive is what makes the lab worth running at all. A sample analysed against an archive that contains five years of related samples gets the benefit of every previous analyst's work on every previous related specimen. The lab's analytical output for any given sample is, in the limit, more about the archive than about the sample. The labs that recognise this organise their engineering investment to favour archive durability and queryability over almost anything else. The labs that don't recognise this end up running expensive infrastructure that produces analytical output indistinguishable from what a less mature lab would produce.

A practice that has worked well in labs I have seen: rerun the archive periodically. Every six or twelve months, the lab re-detonates a subset of its historical archive against current environments. Specimens that behave differently than they did in the original analysis point to one of three things: changes in the detection products evaluated against them, changes in the operating system, or changes in the lab's own infrastructure that have affected fidelity in ways the analysts hadn't noticed. The cost is real: a meaningful re-analysis run consumes a noticeable fraction of the lab's throughput for the duration. The benefit is calibration: the lab knows whether its analyses from a year ago are still accurate, which is information no other process produces.
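The rerun reduces to two steps: choose which archived specimens to re-detonate, then compare what each one did then against what it does now. The sketch below assumes behaviours are recorded as sets of labels per specimen hash; the function names, the sampling fraction, and the behaviour labels are all hypothetical, and uniform random sampling is only one of several reasonable selection strategies.

```python
import random


def pick_for_rerun(archived_hashes, fraction, seed):
    """Choose a reproducible random subset of the archive to re-detonate."""
    ordered = sorted(archived_hashes)  # fixed order so the seed fully determines the pick
    if not ordered:
        return []
    count = min(len(ordered), max(1, round(len(ordered) * fraction)))
    return random.Random(seed).sample(ordered, count)


def drift(original, rerun):
    """Compare behaviours per specimen; both arguments map hash -> set of labels.
    Only specimens whose behaviour changed appear in the report."""
    report = {}
    for sha256, now in rerun.items():
        before = original.get(sha256, set())
        if before != now:
            report[sha256] = {"lost": before - now, "gained": now - before}
    return report


original = {"aa": {"injects", "beacons"}, "bb": {"persists"}}
rerun = {"aa": {"injects"}, "bb": {"persists"}}
changed = drift(original, rerun)
```

The report does not say why a behaviour disappeared. That attribution (detection product, operating system, or lab infrastructure) is still analyst work; the sketch only narrows where to look.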

The final archival decision is about external sharing. Some labs share their findings publicly (via threat intelligence feeds, public sample repositories, conference publications). Some share with closed communities (intel-sharing organisations, vendor partnerships, government relationships). Some share nothing. The decision is political and operational rather than technical, but it has technical consequences: the archive's structure has to support the sharing model, including the redactions and access controls that any sharing arrangement requires. Retrofitting these is expensive. Designing them in at the start is cheap. Most labs end up retrofitting, because the sharing question doesn't surface until the archive is mature, by which time the structure is hard to change.
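One way to see why designing this in early is cheap: if every record field is assigned to an audience from the start, redaction is a lookup rather than a migration. The audiences and field names below are illustrative assumptions, not a recommended policy.

```python
# Fields each audience may see. None means no redaction is applied.
SHARING_POLICY = {
    "public": {"sha256", "techniques"},
    "closed_community": {"sha256", "techniques", "campaign", "first_seen"},
    "internal": None,
}


def redact(record, audience):
    """Return a copy of the record containing only fields the audience may see.
    An unknown audience raises KeyError, so the function fails closed."""
    allowed = SHARING_POLICY[audience]
    if allowed is None:
        return dict(record)
    return {name: value for name, value in record.items() if name in allowed}


record = {
    "sha256": "aa",
    "techniques": ["T1055"],
    "campaign": "campaign-1",
    "first_seen": "2025-11-02",
    "submitter": "analyst-7",
}
public_view = redact(record, "public")
```

The allow-list direction matters: a field added to the record later is withheld from external audiences until someone explicitly adds it to the policy.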

Conclusion

The lab is the piece of security infrastructure that you build because you have to, that you live with for years, that pays off in ways no individual case can fully demonstrate, and that almost nobody writes about. The reasons for the silence are mostly mundane: the engineering is unglamorous, the value is cumulative rather than discrete, the work is closer to data pipelines than to anything that reads as security. The post above is an attempt to make that work legible, on the proposition that a deep-dive at the architectural level is more useful to the engineer building their first lab than another walkthrough of how a specific specimen behaves under detonation.

The omissions are larger here than in either of the preceding security posts. Each section above could be a post of its own, and at least three of them (the network simulation, the analysis graph construction, the archive index design) probably should be. The choice has been to favour structural coverage over depth, on the basis that the structural coverage is what doesn't currently exist in public.

If a single observation has shaped this work, it is that the lab's value comes from the boring parts. The exciting parts (the new technique, the novel evasion, the spectacular detection) are the parts that get attention, and they are the parts the lab produces incidentally. What the lab actually produces, by the time it has been running for a while, is calibration. Calibration about what the world's specimens look like under controlled conditions, what they do that no one notices, and what changes year over year. Calibration is unglamorous and indispensable. It is the thing I would tell my younger self to build first, and to overinvest in for longer than seems reasonable.