Building reliable concurrent and distributed systems is hard. In Go, you might start with goroutines and channels. As the system grows, you add mutexes to protect shared state. Then you need to coordinate across multiple services, so you introduce message queues or RPC. Before long, you're managing synchronization primitives, handling partial failures, and debugging race conditions that only appear under load.
Ergo Framework offers a different foundation. Think of it as making goroutines addressable and message-passing-only, then extending that model across a cluster. Processes are like goroutines - lightweight, multiplexed onto OS threads - but isolated and communicating only through messages. Each process has an identifier that works whether the process is local or on a remote node. Sending a message looks the same either way.
The actor model isn't new. Erlang proved these patterns work for systems requiring massive concurrency and high reliability. Ergo brings them to Go: no external dependencies, familiar Go idioms, and performance that doesn't sacrifice correctness for speed.
Core Components
The framework consists of a few fundamental pieces that work together.
A node provides the runtime environment. It manages process lifecycles, routes messages, handles network connections, and provides services like logging and scheduled tasks. When you start a node, you get infrastructure. When you spawn a process, the node handles the mechanics.
Processes are lightweight actors. Each has a mailbox where messages queue up, priority-sorted into urgent, system, main, and log queues. The process handles messages one at a time in its own goroutine. When the mailbox empties, the goroutine sleeps. This makes processes efficient - you can have thousands without resource problems. It also makes them safe - sequential message handling means no race conditions within a process.
Supervision trees provide fault tolerance. Supervisors monitor worker processes. When a worker crashes, the supervisor restarts it according to a configured strategy. Supervisors can supervise other supervisors, creating a hierarchy. Failures are isolated to subtrees. The rest of the system continues running while the failed part recovers.
Meta processes solve a specific problem: integrating blocking I/O with the actor model. HTTP servers block waiting for requests. TCP servers block accepting connections. A meta process uses two goroutines - one runs your blocking code (like http.ListenAndServe), the other handles messages from other actors. This bridges synchronous APIs with asynchronous actor communication.
Network Transparency
The framework treats local and remote processes identically. Send a message to a process on the same node or a process on a remote node - the code is the same. The framework handles the difference.
When you send to a remote process, the node extracts the target node from the process identifier, discovers that node's address (through static routes or a registrar), establishes a connection if needed, encodes the message, and sends it. The remote node receives it, decodes it, and delivers it to the target process's mailbox. This happens automatically. Your code just sends a message.
This transparency extends to failure detection. Use the Important delivery flag and you get the same error semantics for remote processes as for local ones. Without it, a message to a missing remote process times out (was it slow or dead?). With it, you get immediate error notification (process doesn't exist), just like local delivery. The network becomes transparent not just for success cases but for failures too.
Nodes discover each other through a registrar. By default, each node runs a minimal registrar. Nodes on the same host find each other through localhost. For remote nodes, the framework queries the registrar on the remote host. For production clusters, configure an external registrar like etcd or Saturn for centralized discovery, cluster configuration, and application deployment tracking.
What This Enables
You write business logic using message passing between processes. The framework handles concurrency (processes run in parallel but each is sequential internally), fault tolerance (supervisors restart failures), and distribution (messages route automatically to remote processes). You're not writing code to manage connections, encode messages, or handle network failures explicitly. Those are solved problems handled by the framework.
Systems built this way have useful properties. They scale by adding nodes and distributing processes across them. The code doesn't change - deployment topology is operational configuration. They handle failures through supervision rather than defensive programming everywhere. They evolve through composition - add new process types, adjust supervision strategies, change message flows - without restructuring the foundation.
The development experience differs from typical microservices. No REST endpoints to define. No service discovery to configure (it's built in). No serialization libraries to manage (the framework handles it). No retry logic scattered throughout (supervision handles recovery). You model your domain as processes exchanging messages, and the framework provides the infrastructure.
Performance
Lock-free queues in process mailboxes avoid contention. Processes sleep when idle, consuming no CPU. Connection pooling uses multiple TCP connections per remote node for parallel delivery. These design choices add up to performance comparable to hand-written concurrent code, but without the complexity.
The real performance benefit is development velocity. You're not debugging race conditions or deadlocks. You're not coordinating distributed transactions. You're not managing connection pools or implementing retry logic. The framework handles those concerns, leaving you to focus on what your system does.
Benchmarks measuring message passing, network communication, and serialization performance are available at .
Zero Dependencies
The framework uses only the Go standard library. No external dependencies means no version conflicts, no supply chain vulnerabilities, no surprise breaking changes from third-party packages. The requirement is just Go 1.20 or higher.
This isn't ideological purity. It's practical stability. The framework's behavior depends only on Go itself. Updates are predictable. Supply chain is simple. The code you write today will compile and run the same way years from now, assuming Go maintains backward compatibility (which it does).
For detailed explanations of these concepts, start with and explore the section. For API documentation, see the godoc comments in the source code.
The Ergo Framework allows nodes to run with various network stacks. You can replace the default network stack or add it as an additional stack. For more information, refer to the Network Stack section.
This library contains implementations of network stacks that are not part of the standard Ergo Framework library.
Loggers
An extra library of logger implementations that are not included in the standard Ergo Framework library. This library contains packages with a narrow specialization. It also includes packages that have external dependencies, as Ergo Framework adheres to a "zero dependency" policy.
Meta-Processes
An extra library of meta-process implementations not included in the standard Ergo Framework library. This library contains packages with a narrow specialization. It also includes packages with external dependencies, as Ergo Framework adheres to a "zero dependency" policy.
Applications
An additional application library for Ergo Framework. It contains packages with a narrow specialization, as well as packages with external dependencies, since Ergo Framework adheres to the "zero dependencies" principle.
An extra library of registrars or client implementations not included in the standard Ergo Framework library. This library contains packages with a narrow specialization. It also includes packages with external dependencies, as Ergo Framework follows a "zero dependency" policy.
Available Registrars
A client library for the central registrar. Provides service discovery, configuration management, and real-time cluster event notifications through a centralized registrar service.
Features:
Centralized service discovery
Real-time event notifications
Configuration management
TLS security support
A client library for , a distributed key-value store. Provides decentralized service discovery, hierarchical configuration management with type conversion, and automatic lease management.
Features:
Distributed service discovery
Hierarchical configuration with type conversion from strings ("int:123", "float:3.14")
Automatic lease management and cleanup
Choose Saturn for centralized management with a dedicated registrar service, or etcd for a distributed approach with built-in consensus and reliability guarantees.
Actors
An extra library of actor implementations not included in the standard Ergo Framework library. This library contains packages with a narrow specialization. It also includes packages with external dependencies, as Ergo Framework adheres to a "zero dependency" policy.
Distributed leader election actor implementing a Raft-inspired consensus algorithm. Provides coordination primitives for building systems that require selecting a single leader across a cluster.
Prometheus metrics exporter actor that automatically collects and exposes Ergo node and network telemetry via an HTTP endpoint. Provides observability for monitoring cluster health and resource usage.
Use cases: Production monitoring, performance analysis, capacity planning, debugging distributed systems.
CertManager
TLS Certificate Management
Network communication in production systems needs encryption. TLS provides this, but managing TLS certificates introduces operational challenges. Certificates expire. Security incidents require rotation. Updating certificates traditionally means restarting services, causing downtime.
The naive approach loads certificates at startup from files. When you need to update a certificate, you replace the file and restart the service. For a single service, this works. For distributed systems with dozens of nodes and services, coordinating restarts for certificate updates becomes an operational burden.
Ergo Framework provides gen.CertManager for live certificate updates. Load a certificate at startup, and you can update it later without restarting. All components using that certificate manager - node acceptors, web servers, TCP servers - automatically use the updated certificate for new connections.
The Observer application provides a convenient web interface to view node status, network activity, and running processes in a node built with Ergo Framework. Additionally, it allows you to inspect the internal state of processes or meta-processes. The application can also be used as a standalone Observer tool. For more details, see the section Inspecting With Observer. You can add the Observer application to your node during startup by including it in the node's startup options:
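A startup sketch is shown below. The import paths and the `Applications` option are assumptions based on this guide; verify them against the framework's examples and godoc.

```go
import (
	"ergo.services/application/observer"
	"ergo.services/ergo"
	"ergo.services/ergo/gen"
)

func startNode() (gen.Node, error) {
	options := gen.NodeOptions{}
	// Include the Observer application in the node's startup options.
	options.Applications = []gen.ApplicationBehavior{
		observer.CreateApp(observer.Options{
			Port: 9911,        // web server port (default: 9911)
			Host: "localhost", // interface to listen on (default: localhost)
		}),
	}
	return ergo.StartNode("demo@localhost", options)
}
```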
The function observer.CreateApp takes observer.Options as an argument, allowing you to configure the Observer application. You can set:
Port: The port number for the web server (default: 9911 if not specified).
Host: The interface name (default: localhost).
LogLevel: The logging level for the Observer application (useful for debugging). The default is gen.LogLevelInfo.
Creating a Certificate Manager
Create a certificate manager with an initial certificate:
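A minimal sketch, assuming a constructor named gen.CreateCertManager (check the godoc for the exact signature) and standard library certificate loading:

```go
// Load an initial certificate from files (hypothetical paths).
cert, err := tls.LoadX509KeyPair("node.crt", "node.key")
if err != nil {
	panic(err)
}

// Wrap it in a certificate manager for live updates later.
certManager := gen.CreateCertManager(cert)
```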
For development or testing, generate a self-signed certificate:
Note: Self-signed certificates require setting InsecureSkipVerify: true in network options to bypass certificate validation. This is acceptable for development but never use it in production.
Pass the certificate manager to node options:
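A sketch of the wiring; the exact field names in gen.NodeOptions are assumptions based on this guide, so verify them against the godoc:

```go
options := gen.NodeOptions{}
options.CertManager = certManager

// Self-signed certificates only; never enable this in production.
options.Network.InsecureSkipVerify = true

node, err := ergo.StartNode("secure@localhost", options)
```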
The node's network stack uses this certificate manager for TLS connections. Acceptors use it for incoming connections. Outgoing connections use it for client certificates if needed.
Updating Certificates
Update the certificate while the node is running:
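A sketch using the Update method this guide describes (file paths are hypothetical):

```go
// Load the renewed certificate from disk.
newCert, err := tls.LoadX509KeyPair("node-renewed.crt", "node.key")
if err != nil {
	panic(err)
}

// New connections pick up the new certificate immediately;
// existing connections keep the old one until they close.
if err := certManager.Update(newCert); err != nil {
	panic(err)
}
```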
The update takes effect immediately for new connections. Existing connections continue using the old certificate until they close. This allows graceful rotation - new connections get the new certificate, old connections finish naturally.
Components using the certificate manager obtain certificates through GetCertificate or GetCertificateFunc. These methods return the current certificate, so updates automatically propagate to all users of the manager.
Certificate Lifecycle
The typical pattern involves periodic certificate renewal. A cron job or external process watches for approaching expiration. When renewal is needed, it obtains a new certificate (from Let's Encrypt, an internal CA, or however your infrastructure manages certificates) and calls Update on the certificate manager.
The certificate manager is passive - it doesn't handle renewal itself. It provides the mechanism for live updates. Your renewal logic is external, allowing integration with whatever certificate provisioning system you use.
This separation is intentional. Certificate renewal policies vary widely. Some organizations use Let's Encrypt with automated renewal. Others use internal CAs with manual processes. Some rotate certificates on a schedule, others only when necessary. The certificate manager doesn't impose policy - it just enables live updates however you choose to implement them.
Mutual TLS
For scenarios requiring client certificate authentication, use gen.CertAuthManager. It extends CertManager with CA pool management for verifying certificates on both sides of the connection. This enables mutual TLS (mTLS) where servers verify client certificates and clients verify server certificates.
All settings support runtime updates, just like certificate rotation.
For detailed configuration and examples, see Mutual TLS.
For complete certificate manager methods and usage, refer to the gen.CertManager interface documentation in the code.
Building reliable systems means accepting an uncomfortable truth: failures will happen. Hardware fails. Networks partition. Bugs exist in code. The question isn't whether your processes will crash, but what happens when they do.
The supervision tree model provides an answer. Instead of trying to prevent all failures, you structure your system so failures are expected, isolated, and automatically recovered from.
The Supervision Principle
The model divides processes into two distinct roles:
Workers do the actual work. They handle requests, process data, manage state, and inevitably, sometimes crash when things go wrong.
Supervisors watch over workers. Their only job is to start child processes and restart them when they fail. Supervisors don't do application work - they manage lifecycle.
This separation is crucial. If workers handled their own restart logic, a bug in that logic would prevent recovery. By moving restart responsibility to a separate supervisor, you ensure that failures in workers can always be recovered from.
How Supervision Works
A supervisor starts its children and monitors them. When a child crashes, the supervisor decides what to do based on its restart strategy. Should it restart just this one child? Restart all children? Restart all children in a specific order?
The strategy depends on the relationships between children. If they're independent, restart just the failed one. If they depend on each other, restart all of them to ensure consistent state. If they have startup dependencies, restart in order.
Supervisors can supervise other supervisors, forming a tree. At the top might be an application supervisor. Below it, supervisors for different subsystems. Below those, the actual workers. When a worker crashes, only its portion of the tree is affected. The rest of the system continues running.
Fault Tolerance Through Isolation
This tree structure creates fault isolation boundaries. A crashed database worker doesn't affect the HTTP handler workers. A failed cache process doesn't take down the authentication processes. Each supervision subtree handles its own failures without cascading them upward.
The Erlang community calls this "let it crash." It sounds reckless, but it's actually disciplined. Instead of defensive programming trying to handle every possible error, you let processes fail and rely on supervisors to restart them in a clean state. Often, a fresh restart clears transient problems that would be difficult to handle explicitly.
Supervision in Ergo Framework
Ergo Framework implements supervision through the act.Supervisor actor. When you create a supervisor, you specify its children and restart strategy. The framework handles the monitoring and restart logic.
Workers are typically act.Actor implementations - regular actors that do application work. Supervisors are act.Supervisor implementations - actors whose behavior is managing children.
Because supervisors are also actors, they can be supervised. This is how you build the tree: supervisors supervising supervisors supervising workers, all the way down.
The tree structure emerges from how you compose supervisors and workers. There's no special tree-building API. You just nest supervisors, and the tree forms naturally.
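A supervisor definition might look like the sketch below. The type and field names follow the act.Supervisor API as this guide describes it; treat the exact names as approximate and check the godoc.

```go
type MySup struct {
	act.Supervisor
}

func (s *MySup) Init(args ...any) (act.SupervisorSpec, error) {
	var spec act.SupervisorSpec
	// Restart only the failed child; siblings keep running.
	spec.Type = act.SupervisorTypeOneForOne
	spec.Children = []act.SupervisorChildSpec{
		{Name: "worker", Factory: factoryWorker}, // factoryWorker is hypothetical
	}
	return spec, nil
}
```

Nesting another supervisor as a child is how the tree forms: a child spec whose factory produces a supervisor rather than a worker.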
Building Reliable Systems
The supervision tree model leads to systems with interesting properties.
Self-healing - Failures trigger automatic recovery. Most transient problems resolve themselves through restart.
Graceful degradation - When a subsystem fails, only that part stops working. The rest continues serving requests.
Operational simplicity - Instead of complex error handling throughout your code, you centralize recovery logic in supervisors.
The trade-off is that you need to design processes that can restart cleanly. State that must survive restarts needs to be externalized - in databases, in other processes, or rebuilt from messages. But this discipline leads to more robust designs anyway.
Where to Go From Here
Understanding supervision requires seeing it in practice. The chapter covers the specifics: restart strategies, child specifications, and practical patterns for structuring your application.
The combination of the actor model (isolated processes, message passing) and supervision trees (automatic recovery) gives you the tools to build systems that handle failures gracefully. It's a different approach than traditional error handling, but one that scales well to distributed systems where failures are inevitable.
Links And Monitors
Linking and Monitoring Mechanisms
Building reliable systems from independent processes requires solving a fundamental coordination problem. When a process terminates - whether from a crash, graceful shutdown, or network failure - other processes that depend on it or supervise it need to know. Without this knowledge, a supervisor can't restart failed workers, dependent processes continue attempting to use unavailable services, and the system degrades silently.
The challenge is detecting termination without breaking isolation. Processes can't share memory or directly observe each other's state. The traditional approach in distributed systems uses heartbeats: processes periodically signal they're alive, and silence implies failure. But heartbeats introduce overhead, timing sensitivity, and the fundamental ambiguity of distinguishing "slow" from "dead."
Ergo Framework provides a different mechanism. Processes explicitly declare relationships - links and monitors - and the framework delivers termination notifications through these channels. When a process terminates, the node automatically notifies all processes that established relationships with it. The notification is immediate, deterministic, and part of the normal message flow.
Links and monitors both deliver termination notifications, but they differ in what happens next. A link couples your lifecycle to the target's - when it terminates, you terminate. A monitor simply informs you of termination, leaving the response up to you. The choice depends on whether you need failure propagation or just failure awareness.
Saturn - Central Registrar
Ergo Service Registry and Discovery
Saturn is a tool designed to simplify the management of clusters of nodes created using the Ergo Framework. It offers the following features:
A unified registry for node registration within a cluster.
The ability to manage multiple clusters simultaneously.
Actor Model
The Actor Model and Its Properties
The actor model is a computational approach to building concurrent systems, first proposed in the 1970s. At its core is a simple yet powerful idea: instead of having program components share memory and coordinate through locks, they communicate by sending messages to each other.
The Fundamental Concept
In the actor model, everything is an actor. An actor is an independent entity that has its own private state and processes incoming messages one at a time. Actors never directly access each other's state. Instead, they send messages and wait for responses if needed.
This might seem like a constraint, but it's actually what makes the model powerful. By eliminating shared state, we eliminate entire classes of concurrency bugs that plague traditional multi-threaded programs.
Links: Coupling Lifecycles
Creating a link to another process declares a dependency. You're stating that your operation depends on the target's continued existence. When the target terminates, you receive an exit signal - a high-priority message that typically causes your termination as well.
Exit signals arrive in the Urgent queue, bypassing normal message ordering. The default behavior is immediate termination when an exit signal arrives. This cascading failure makes sense in many scenarios. If a worker's connection to a critical service is gone, the worker has nothing useful to do and should terminate cleanly.
But sometimes you want to handle exit signals explicitly. Actors can enable exit signal trapping through act.Actor. When trapping is enabled, exit signals are delivered as gen.MessageExit* messages to your HandleMessage callback. You can examine the signal, check the termination reason, and decide whether to terminate or attempt recovery.
Each exit message type carries the termination reason in its Reason field. The reason tells you what happened: normal shutdown (gen.TerminateReasonNormal), abnormal crash, panic (gen.TerminateReasonPanic), forced kill (gen.TerminateReasonKill), or network failure (gen.ErrNoConnection). This context lets you make informed decisions about how to react.
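A trapping actor might look like the sketch below. Method and message type names follow this guide; verify them against the godoc before relying on them.

```go
func (a *MyActor) Init(args ...any) error {
	a.SetTrapExit(true) // deliver exit signals as messages instead of terminating
	return a.LinkPID(a.targetPID) // targetPID is hypothetical state on MyActor
}

func (a *MyActor) HandleMessage(from gen.PID, message any) error {
	switch m := message.(type) {
	case gen.MessageExitPID:
		if m.Reason == gen.TerminateReasonNormal {
			return nil // target finished cleanly; keep running
		}
		return m.Reason // propagate abnormal termination
	}
	return nil
}
```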
The framework provides linking methods for different identification schemes. LinkPID takes a process identifier and links to that specific process instance. When it terminates, you receive gen.MessageExitPID. LinkProcessID links to a registered name rather than a specific instance. If the process terminates or unregisters the name, you receive gen.MessageExitProcessID. LinkAlias works with process aliases - termination or alias deletion triggers gen.MessageExitAlias.
You can also link to node connections with LinkNode. If the connection to the specified node is lost, you receive gen.MessageExitNode. This is useful for processes that can't operate when a particular remote node is unavailable.
The generic Link method accepts any target type and dispatches to the appropriate typed method. Use it when the target type varies, or use the specific methods when you know the type.
The Unidirectional Nature
Links in Ergo are unidirectional, and this deserves emphasis because it differs from Erlang.
When you execute process.LinkPID(target), you establish a relationship where target's termination affects you. The link points from you to the target. If the target terminates, you receive an exit signal. But if you terminate, the target is unaffected. The link doesn't point backward.
Erlang's links are bidirectional. If process A links to process B in Erlang, either terminating causes the other to terminate. This symmetry can be useful, but it also creates unexpected cascading failures. In Ergo, if you want bidirectional coupling, you create two links: A links to B, and B links to A.
The unidirectional design gives you precise control. Consider a shared service with multiple workers. Each worker links to the service (if the service dies, workers should too). But the service doesn't link back to workers (a worker crash shouldn't kill the service). Unidirectional links express this asymmetric dependency naturally.
Monitors: Observation Without Coupling
Monitors provide lifecycle awareness without lifecycle coupling. You track when something terminates, but you don't terminate yourself.
The quintessential monitor use case is supervision. A supervisor monitors worker processes. When a worker terminates, the supervisor receives a down message. The message includes the worker's PID or identifier and the termination reason. The supervisor examines this information, consults its restart strategy, and decides whether to spawn a replacement. The supervisor continues running regardless of how many workers have crashed.
Down messages arrive in the System queue with high priority (but lower than Urgent exit signals). Each gen.MessageDown* type includes a Reason field. For MonitorPID, you receive gen.MessageDownPID with the target's PID and reason. For MonitorProcessID, you receive gen.MessageDownProcessID with the registered name and reason. The reason might indicate normal termination, a crash, or a special case like name unregistration (gen.ErrUnregistered).
Monitoring registered names or aliases handles invalidation gracefully. If you monitor a process by name and that process unregisters its name, you receive a down message with reason gen.ErrUnregistered. The process might still be running, but it's no longer accessible by that name, which is what you were monitoring. Same logic applies to alias deletion - you're notified that the thing you were monitoring is no longer valid.
Node monitoring tracks connection health. MonitorNode sends you gen.MessageDownNode when the connection to a remote node is lost. The reason is gen.ErrNoConnection. This is useful for detecting network partitions or remote node crashes without linking (which would terminate your process).
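A monitoring actor's message handling might look like the sketch below. The names follow this guide; the worker factory is a hypothetical placeholder.

```go
func (s *MyManager) HandleMessage(from gen.PID, message any) error {
	switch m := message.(type) {
	case gen.MessageDownPID:
		if m.Reason == gen.ErrNoConnection {
			// Network failure, not a crash: the remote node went away.
			return nil
		}
		// The worker terminated; spawn a replacement and monitor it.
		pid, err := s.Spawn(factoryWorker, gen.ProcessOptions{})
		if err != nil {
			return err
		}
		return s.MonitorPID(pid)
	}
	return nil
}
```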
Network Transparency in Practice
Links and monitors work across nodes without changing their semantics or your code.
When you link or monitor a remote target, the framework sends a request to the remote node. The remote node records that your process is watching the target. This setup happens during your Link* or Monitor* call and involves a network round-trip. The operation can fail if the remote node is unreachable or the target doesn't exist - check the error return.
Once established, the remote node tracks your subscription. When the target terminates on the remote node, the remote node sends a notification message back to your node. Your node routes it to your process's mailbox. From your perspective, it's just another message - you don't see the network mechanics.
Network failures complicate this. If the connection to the remote node fails while your link or monitor is active, your local node detects the disconnection. It looks up which local processes had links or monitors to targets on that failed node. For links, it sends exit signals with reason gen.ErrNoConnection. For monitors, it sends down messages with the same reason.
This unified handling means you write the same error handling code for local and remote targets. The notification mechanism is consistent. The reason field distinguishes between target termination and network failure, but the notification path is identical.
Removing Links and Monitors
Links and monitors aren't permanent. You can remove them explicitly or they're removed automatically when participants terminate.
To remove a link, use the corresponding Unlink* method with the same target. UnlinkPID, UnlinkProcessID, UnlinkAlias, UnlinkNode each remove the link created by their Link* counterpart. If you never created the link, unlinking returns an error. For monitors, the Demonitor* methods work the same way.
When the target terminates and you receive notification, the link or monitor is automatically removed. You receive one notification per relationship. If the target is later restarted (by a supervisor), you won't receive notification about that new instance unless you create a new link or monitor to it.
When you terminate (the process that created the link or monitor), your relationships are cleaned up automatically. The target doesn't receive notification that you stopped watching. This asymmetry is intentional - the target doesn't track who's watching it, so it doesn't care when watchers go away.
Practical Usage Patterns
Several common patterns emerge from combining links and monitors.
Workers often link to infrastructure processes they depend on. A worker processing HTTP requests might link to a database connection pool process. If the pool terminates (perhaps during a deployment), the worker receives an exit signal and terminates. The worker's supervisor detects the termination, waits a moment (hoping the database pool restarts), and spawns a new worker. The new worker links to the (now running) pool and resumes processing.
Supervisors monitor their children. Each worker termination triggers a down message. The supervisor checks the reason. If it's gen.TerminateReasonNormal, the worker finished its task and doesn't need restart. If it's an error or panic, the supervisor spawns a replacement. The supervisor's continued operation despite worker failures is the whole point of the supervisor pattern.
Load balancers monitor backend processes. Each backend termination updates the balancer's routing table. The balancer continues routing to available backends. When a backend restarts, it might need to register with the balancer, which would then monitor it again.
Parent-child relationships often use LinkChild and LinkParent options in gen.ProcessOptions. These provide a convenient way to create links automatically after process initialization completes. You can also call Link methods directly during initialization if needed. If either participant terminates, the other receives an exit signal.
The Difference That Matters
Links propagate failure. Monitors report failure. Choose based on whether the watcher should terminate when the target terminates.
If continued operation without the target is meaningless, use a link. If you can adapt to the target's absence (by finding a replacement, degrading gracefully, or restarting the target), use a monitor.
The unidirectional nature of links matters more than you might initially think. It lets you express asymmetric dependencies precisely. Workers depend on services, but services don't depend on individual workers. Clients depend on servers, but servers don't depend on individual clients. Links point from the dependent to the dependency, making the relationship clear.
For event-based publish/subscribe patterns using links and monitors, see the Events chapter. For supervision trees built on monitors, see Supervisor.
What Makes an Actor
An actor consists of three things:
Private State - Data that belongs exclusively to this actor. No other actor can read or modify it directly.
Behavior - The logic that determines how the actor responds to messages. This can change over time as the actor processes different messages.
Mailbox - A queue where incoming messages wait to be processed. The actor pulls messages from this queue one at a time.
When an actor receives a message, it can do three things: send messages to other actors, create new actors, or decide how to handle the next message. That's it. Simple, but sufficient to build complex systems.
Why Sequential Processing Matters
Each actor processes messages sequentially, one after another. This is not a limitation but a design choice that provides important guarantees.
Consider what happens in traditional concurrent programming: multiple threads might access the same data simultaneously. To prevent corruption, you need locks. But locks introduce their own problems - deadlocks, race conditions, and complex reasoning about what state the data is in at any given moment.
Actors sidestep this entirely. Since only one message is processed at a time, the actor's state can only be in one of a finite number of well-defined states. There are no race conditions because there's no race - only one thing happens at a time within an actor.
Location Transparency
One of the most powerful aspects of the actor model is location transparency. When you send a message to an actor, you don't need to know whether it's running in the same process, on the same machine, or halfway around the world. The semantics are the same.
This makes distribution almost trivial. Code written for a single machine can scale to a distributed system without fundamental changes. The complexity of network communication is handled by the framework, not by your application logic.
Real-World Implementations
The actor model isn't just theory. It powers real production systems handling massive scale.
Erlang pioneered the practical application of the actor model. The language and its BEAM virtual machine have been running telecommunications systems since the 1980s. Systems that need to handle millions of concurrent connections with high reliability naturally gravitate toward Erlang's actor model implementation.
Akka brought the actor model to the Java ecosystem. It's used in systems that need to process high-volume transactions, manage complex workflows, or handle real-time data streams. Companies building reactive systems often choose Akka for its proven scalability patterns.
Orleans demonstrated that the actor model works well in cloud environments. Its virtual actor pattern, where actors are automatically created and destroyed based on demand, showed how the model adapts to modern distributed computing challenges.
How This Applies to Go
Go has goroutines and channels, which seem similar to actors and message passing. But there's a crucial difference: goroutines are not isolated. They can share memory, which means you still need locks and face the same concurrency challenges as traditional threading.
Ergo Framework brings true actor model semantics to Go. Each process is an isolated actor. The framework enforces the constraint that actors don't share memory and communicate only through messages. This gives you the benefits of the actor model - no race conditions, simpler concurrent logic, natural distribution - while writing Go code.
The single-goroutine-per-actor constraint might seem limiting at first. In practice, it's liberating. You write sequential code within each actor, and concurrency emerges naturally from having many actors processing messages in parallel.
The Actor Mindset
Working with the actor model requires a shift in thinking. Instead of thinking about shared data structures protected by locks, you think about independent entities sending messages to each other.
A typical pattern: instead of having multiple threads access a shared cache, you have a cache actor. Want to read from the cache? Send it a message. Want to write? Send a different message. The cache actor processes these requests sequentially, so there's no possibility of corruption. No locks needed.
This pattern scales beautifully. Need more throughput? Add more cache actors, each handling a portion of the key space. Need fault tolerance? Supervise the cache actors, so they restart if they crash. Need distribution? Put cache actors on different machines. The code structure remains the same.
Moving Forward
The actor model offers a different way to think about concurrent programming. Rather than wrestling with locks and shared memory, you design systems as independent actors exchanging messages. The constraints of the model - sequential processing, message passing only, isolated state - eliminate the complexity that makes traditional concurrent programming difficult.
Ergo Framework brings this programming model to Go. It enforces actor model principles while leveraging Go's strengths: lightweight goroutines, efficient scheduling, and a simple language. The result is a way to build concurrent and distributed systems that's both powerful and approachable.
The following chapters explore how these concepts manifest in Ergo Framework's implementation. Process covers the lifecycle and capabilities of actors. Node explains how actors are managed and how they communicate across networks.
The capability to manage the configuration of the entire cluster without restarting the nodes connected to Saturn (configuration changes are applied on the fly).
Notifications to all cluster participants about changes in the status of applications running on nodes connected to Saturn.
host: Specifies the hostname to use for incoming connections.
port: Port number for incoming connections. The default value is 4499.
path: Path to the configuration file saturn.yaml.
debug: Enables debug mode for outputting detailed information.
version: Displays the current version of Saturn.
Starting Saturn
To start Saturn, a configuration file named saturn.yaml is required. By default, Saturn expects this file to be located in the current directory. You can specify a different location for the configuration file using the -path argument.
You can find an example configuration file in the project's Git repository.
Configuration file structure
The saturn.yaml configuration file contains two root elements:
Saturn: This section includes settings for the Saturn server.
You can configure the Token for access by remote nodes and specify certificate files for TLS connections.
By default, a self-signed certificate is used. For clients to accept this certificate, they must enable the InsecureSkipVerify option when creating the client.
Changes to this section require a restart of the Saturn server.
Clusters: This section includes the configurations for clusters.
Changes in this section are automatically reloaded and sent to the registered nodes as updated configuration messages, without requiring a restart of Saturn.
The settings can target all clusters, a specific cluster, or individual nodes within a cluster.
If the name of a configuration element ends with the suffix .file, the value of that element is treated as a file. The content of this file is then sent to the nodes as a []byte.
To configure settings for all nodes in all clusters, use the Clusters section in the saturn.yaml configuration file. Here, you can define global settings that will apply to every node within every cluster managed by Saturn:
In this example:
Var1, Var2, Var3, and Var4 will be applied to all nodes in all clusters.
However, the value of Var1 for nodes named [email protected] in any cluster will be overridden with the value 456.
If nodes are registered without specifying a Cluster in saturn.Options, they become part of the general cluster. Configuration for the general cluster should be provided in the Cluster@ section.
In the example above:
The variable Var1 is set to 789 for the general cluster (all nodes in the general cluster will receive Var1: 789).
However, for the node [email protected] within the general cluster, Var1 will be overridden to 456.
Thus, all nodes in the general cluster will inherit Var1: 789, except for [email protected], which will specifically have Var1: 456. Other nodes in the general cluster will retain the default values from the Cluster@ section unless they are explicitly overridden in the configuration.
To specify settings for a particular cluster, use the element name Cluster@<cluster name> in the configuration file:
Service Discovery
Saturn can manage multiple clusters simultaneously, but resolve requests from nodes are handled only within their own cluster.
The name of a registered node must be unique within its cluster.
When a node registers, it informs the registrar which cluster it belongs to. Additionally, the node reports the applications running on it. Other nodes in the same cluster receive notifications about the newly connected node and its applications. Any changes in application statuses are also reported to the registrar, which in turn notifies all participants in the cluster.
The actor model excels at point-to-point communication. Process A sends a message to process B. Process C makes a request to process D. Each interaction has a specific sender and receiver.
But some scenarios need one-to-many communication. A price feed updates and dozens of trading strategies need the new price. A user logs in and multiple subsystems need notification. A sensor reading arrives and various monitoring processes need to react. You could send individual messages to each interested process, but then the producer needs to track all consumers. When consumers come and go, the producer's consumer list becomes a maintenance burden.
Events solve this with publish/subscribe semantics. A producer registers an event and publishes values to it. Consumers subscribe to the event without the producer knowing who they are. The framework handles message distribution - when the producer publishes an event, all current subscribers receive it. Subscribers can come and go dynamically, and the producer's code doesn't change.
Registering Events
A process becomes an event producer by calling RegisterEvent with an event name and options. The call returns a token - a unique reference that proves ownership. Only the process holding this token (or a process it delegates to) can publish events under this name.
The Notify option controls whether the producer receives notifications about subscriber changes. When enabled, the producer receives gen.MessageEventStart when the first subscriber appears and gen.MessageEventStop when the last subscriber leaves. This allows the producer to start or stop expensive operations based on demand. If nobody's watching the price feed, why fetch prices?
The Buffer option specifies how many recent events to keep. When a new subscriber joins, it receives the buffered events as a catch-up mechanism. Set this to zero if events are only relevant at the moment they're published. Set it to a reasonable number if new subscribers should see recent history.
Events are identified by name and node. The combination must be unique. Two processes on the same node can't register events with the same name. But processes on different nodes can register events with the same name - they're different events.
Publishing Events
Publishing an event sends it to all current subscribers.
You pass your application data directly. The framework wraps it in gen.MessageEvent automatically, adding the event identifier and timestamp. Subscribers receive the complete gen.MessageEvent structure containing your data.
The producer uses the token obtained during registration. If you try to publish with an incorrect token, the operation fails. This prevents unauthorized processes from publishing events they don't own.
Event publishing is fire-and-forget. The producer doesn't wait for acknowledgment or know how many subscribers received the event. The framework handles distribution asynchronously.
Subscribing to Events
Processes subscribe to events through links or monitors, the same mechanisms used for process lifecycle tracking.
LinkEvent creates a link to an event. You receive event messages as they're published. If the event producer terminates or unregisters the event, you receive an exit signal. The link semantics apply - by default, you'd terminate too.
MonitorEvent creates a monitor on an event. You receive event messages and a down notification if the producer terminates or the event is unregistered, but you don't terminate automatically.
Both methods return buffered events upon successful subscription.
The buffered events let subscribers catch up on what happened before they joined. If the buffer size was 10 and 5 events have been published, new subscribers receive those 5 events immediately.
For local events, you can omit the node name: gen.Event{Name: "price_update"}. The framework fills in the local node name. For remote events, specify the full event identifier including the remote node name.
Event Lifecycle
Events exist from registration until unregistration or producer termination.
When you register an event, it becomes available for subscription. Processes on any node can subscribe if they know the event name and node. The framework tracks all subscribers and distributes published events to them.
When the producer terminates, the event is automatically unregistered. All subscribers receive termination notifications (exit signals for links, down messages for monitors). The event name becomes available for registration again.
The producer can explicitly unregister an event with UnregisterEvent. This triggers the same notifications to subscribers. Use this when you're done publishing events but your process continues running.
If a subscriber terminates or unsubscribes (via UnlinkEvent or DemonitorEvent), the producer doesn't receive notification unless Notify was enabled. With Notify, the producer receives gen.MessageEventStop when the last subscriber leaves.
Network Transparency
Events work across nodes seamlessly. A producer on node A can publish events that subscribers on nodes B, C, and D receive. The framework handles the network distribution.
When you subscribe to a remote event, the framework sends a subscribe request to the remote node. The remote node records your subscription. When the producer publishes an event on the remote node, the remote node sends it to all remote subscribers, including you.
If the network connection fails, subscribers receive termination notifications with reason gen.ErrNoConnection. This is consistent with how links and monitors handle network failures for processes.
The buffered events work across nodes too. When you subscribe to a remote event, the remote node sends you the buffered events as part of the subscription response. This catch-up mechanism works regardless of where the producer and subscribers are located.
Token Delegation
Event tokens can be delegated. The producer can give its token to another process, allowing that process to publish events under the producer's event registration.
This enables patterns where event generation is separated from event registration. A coordinator registers the event and distributes the token to worker processes. Workers publish events as data becomes available. Subscribers don't know or care which process instance published each event - they just receive events on the registered event name.
Token delegation also allows rotating producers. A primary process registers an event and holds the token. A backup process can take over using the same token if the primary fails. Subscribers see a continuous event stream even as the producing process changes.
Event Messages
Event messages have a specific structure:
Each gen.MessageEvent contains:
Event - The event identifier (name and node)
Message - Your application data (any type)
Timestamp - When the event was published (nanoseconds since epoch)
Subscribers receive these wrapped messages and extract the application data. The wrapping provides context: which event this came from, when it was published, allowing subscribers to handle events from multiple sources or correlate timing.
Practical Patterns
Events fit several common scenarios.
Data streaming - A sensor process registers an event and publishes readings. Multiple monitoring processes subscribe. Each reading goes to all monitors. If a monitor crashes and restarts, it subscribes again and receives recent buffered readings to catch up.
State change notification - A user session process registers an event and publishes state changes (login, logout, permission change). Authorization processes subscribe and update their caches. The session process doesn't track who's interested in its state changes.
System telemetry - Processes publish metrics as events. Monitoring processes subscribe and aggregate. If the monitoring process restarts, buffered events provide recent history to rebuild state.
Workflow coordination - An order processing system publishes order state events. Inventory, shipping, and billing processes subscribe. Each subsystem reacts to relevant state changes. The order process doesn't orchestrate the subsystems - they coordinate through events.
For more information on links and monitors as they apply to processes and nodes, see the Links and Monitors chapter.
Rotate
The rotate logger writes log messages to files with automatic rotation based on time intervals. Instead of a single growing log file that eventually fills the disk, the logger creates new files periodically and optionally compresses old ones. This keeps disk usage predictable and makes log files manageable for analysis and archival.
The logger operates asynchronously - log messages enter a queue and a background goroutine writes them to the file. This design prevents blocking your processes when disk I/O is slow. Logging happens in the background while your actors continue processing messages without waiting for disk writes to complete.
File Rotation Mechanics
Rotation happens based on time periods. You configure a duration - one minute, one hour, one day - and the logger creates a new file every period. The active file is always named <Prefix>.log. When the period ends, the logger:
Copies the active file to a timestamped filename: <Prefix>.YYYYMMDDHHmi.log
Optionally compresses it with gzip: <Prefix>.YYYYMMDDHHmi.log.gz
Truncates the active file to start fresh for the new period
This approach ensures the active file always has the same name. You can tail it (tail -f <Prefix>.log) and it works across rotations. The timestamped copies accumulate in the log directory, creating a chronological archive.
The timestamp format is YYYYMMDDHHmi - year, month, day, hour, minute. This format sorts lexicographically, so ls -l shows files in chronological order. It's compact but human-readable.
Asynchronous Writing
The logger uses an internal lock-free queue (MPSC - multi-producer single-consumer). When any process logs a message, it pushes to the queue and returns immediately. A single background goroutine pops messages from the queue and writes them to the file.
This design has several advantages:
Non-blocking - Logging never blocks your process. If the disk is slow or the file system stalls, your actors continue running. The queue absorbs bursts of messages.
Ordering - Messages from a single producer maintain order. The queue preserves submission order, so logs reflect the actual sequence of events within each process.
Batching - The background goroutine processes messages continuously. If multiple messages arrive quickly, it writes them in a tight loop, reducing syscall overhead.
Configuration
The logger requires a rotation period and accepts several optional parameters:
Period - The rotation interval. Minimum is time.Minute. Smaller periods create more files with less data each. Larger periods create fewer files with more data each. Choose based on how you analyze logs - if you search specific time ranges, shorter periods help. If you archive logs by day, use 24 * time.Hour.
Path - Directory for log files. Defaults to ./logs relative to the executable. The logger creates the directory if it doesn't exist. Supports ~ for home directory expansion (~/logs becomes /home/user/logs). Use absolute paths in production to avoid ambiguity.
Prefix - Filename prefix. Defaults to the executable name. The active file is <Prefix>.log, rotated files are <Prefix>.YYYYMMDDHHmi.log[.gz]. Use meaningful prefixes if multiple services log to the same directory.
Compress - Enables gzip compression for rotated files. The active file stays uncompressed for fast writing. When rotating, the logger compresses the copy, reducing disk usage by 5-10x for text logs. Compressed files have .log.gz extension. Use compression if disk space matters more than CPU for compression.
Depth - Limits the number of retained log files. When rotating, if the number of files exceeds Depth, the logger deletes the oldest file. Set to 0 (default) for unlimited retention. Set to a specific number (e.g., 24) to keep the last 24 periods. This prevents unbounded disk usage.
TimeFormat - Timestamp format in log messages. Same as colored logger - any format from time package or custom layout. Empty string uses nanosecond timestamps. Choose based on readability vs. precision.
IncludeName - Includes registered process names in log messages. Helps identify which process logged what.
IncludeBehavior - Includes behavior type names in log messages. Useful during development to understand code flow.
ShortLevelName - Uses abbreviated level names ([TRC], [DBG], etc.) instead of full names. Saves space in log files.
Basic Usage
Configure the rotate logger in node options:
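A hedged sketch of the options, using the field names described above. The constructor name and exact wiring into gen.NodeOptions are assumptions - consult the rotate package for the real API:

```go
// Field names follow the descriptions above; rotate.CreateLogger is
// an assumed constructor name - check the rotate package.
loggerRotate, err := rotate.CreateLogger(rotate.Options{
	Period:   time.Hour,        // rotate every hour
	Path:     "/var/log/myapp", // log directory
	Prefix:   "myapp",          // myapp.log / myapp.YYYYMMDDHHmi.log.gz
	Compress: true,             // gzip rotated files
	Depth:    24,               // keep the last 24 rotated files
})
```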
This configuration:
Rotates every hour
Stores logs in /var/log/myapp/
Names files myapp.log (active) and myapp.202411191200.log.gz (rotated with compression)
For detailed logger configuration options, see the rotate.Options struct in the rotate package. For understanding how loggers integrate with the framework, see the Logging chapter.
Mutual TLS
Mutual TLS authentication between nodes
Standard TLS provides server authentication - the client verifies the server's certificate. Mutual TLS (mTLS) adds client authentication - both sides present and verify certificates. Only clients with certificates signed by a trusted CA can connect.
Configuration
NodeOptions.CertManager is used for:
Default acceptor (created automatically on port 15000)
All outgoing connections
To override per-acceptor, use AcceptorOptions.CertManager.
CertAuthManager
gen.CertAuthManager extends CertManager with CA pool and authentication settings:
Server-side settings cover ClientCAs - the CA pool used to verify client certificates - and ClientAuth - the policy for requesting and verifying them. ClientAuth takes the standard crypto/tls values: NoClientCert, RequestClientCert, RequireAnyClientCert, VerifyClientCertIfGiven, and RequireAndVerifyClientCert, the last being what mutual TLS requires.
Client-side settings cover RootCAs - the CA pool used to verify the server's certificate - and ServerName - the name expected in the server's certificate, used for SNI.
Runtime Certificate Rotation
Certificates can be rotated without restart.
New connections use the updated certificate. Existing connections keep their original certificate.
CA pools and ClientAuth are fixed at startup. Restart the node to change these settings.
To use different certificates for specific destinations, see the chapter on static routes.
Troubleshooting
Connection rejected with certificate error
Verify the client certificate is signed by a CA in the server's ClientCAs pool. Check certificate expiration dates.
Server certificate verification failed
The server's certificate must be signed by a CA in the client's RootCAs pool. For development, disable verification with NetworkOptions.InsecureSkipVerify: true.
SNI mismatch
Set ServerName on the client's CertAuthManager if the certificate's Common Name doesn't match the connection address.
Certificate rotation not taking effect
Updates apply to new connections only. Close existing connections to force reconnection with new certificate.
CA pool changes not taking effect
CA pools are fixed at startup. Restart the node to apply changes.
Node
What is a Node in Ergo Framework?
A node is the runtime environment where your actors live. Think of it as the container that hosts processes, routes messages between them, and handles the complexities of distributed communication.
When you start a node, you're launching a complete system with several subsystems working together: process management, message routing, networking, and logging. Each subsystem has a specific responsibility, and they coordinate to provide the foundation for your application.
What a Node Provides
Process Management - The node tracks every process running on it. When you spawn a process, the node assigns it a unique PID, registers it in the process table, and manages its lifecycle. When a process terminates, the node cleans up its resources and notifies any processes that were linked or monitoring it.
Process
What is a Process in Ergo Framework
A process is an actor - a lightweight entity that handles messages sequentially in its own goroutine. It's the fundamental building block of an Ergo application.
Every process has a mailbox where incoming messages wait to be processed. The mailbox contains four queues with different priorities: Urgent for critical system messages, System for framework control, Main for regular application messages, and Log for logging. When the process wakes up to handle messages, it processes them in priority order, taking from Urgent first, then System, then Main, and finally Log.
The process runs only when it has messages to handle. When the mailbox is empty, the process sleeps, consuming no CPU. When a message arrives, the process wakes, handles the message, and sleeps again if nothing else is waiting. This efficiency is why you can have thousands of processes in a single application.
Generic Types
Data Types and Interfaces Used in Ergo Framework
Ergo Framework uses several specialized types for identifying and addressing processes, nodes, and other entities in the system. Understanding these types is essential for working with the framework.
Identifiers and Names
Colored
The colored logger provides visual clarity for console output by applying color highlighting to log messages. Instead of monochrome text where errors blend with informational messages, each log level gets a distinct color, and framework types are highlighted automatically. This makes it easier to scan logs during development and debugging.
The logger writes directly to standard output with immediate formatting - no buffering, no delays. When a process logs a message, it appears instantly in your terminal with colors applied. This synchronous approach keeps logs simple and predictable during interactive development.
Visual Organization
Color helps your eyes parse logs quickly. Log levels use consistent colors:
Saturn Client
This package implements the gen.Registrar interface and serves as a client library for the central registrar, Saturn. In addition to the primary Service Discovery function, it automatically notifies all connected nodes about cluster configuration changes.
To create a client, use the Create function from the saturn package. The function requires:
The hostname where the central registrar is running (default port: 4499).
Cron
Schedule tasks on a repetitive basis
Applications often need tasks to run periodically. Generate a daily report at midnight. Clean up expired sessions every hour. Send weekly summary emails. Poll an external API every five minutes.
You could implement this yourself - spawn a process that sleeps, wakes up, performs the task, and sleeps again. But then you're managing wake times, handling timezone changes, accounting for daylight saving time transitions, and ensuring the scheduler itself stays alive. The scheduling logic becomes scattered across your application.
Cron provides scheduled task execution as a framework service. You declare what should run and when using the familiar crontab syntax. The framework handles timing, execution, and all the edge cases around time-based scheduling.
Message Routing - When a process sends a message, the node figures out where it needs to go. Local process? Route it directly to the mailbox. Remote process? Establish a network connection if needed and send it there. The sender doesn't need to know these details.
Network Stack - The node handles all network communication. It discovers other nodes, establishes connections, encodes messages, and manages the complexity of distributed communication. This is what makes network transparency possible.
Pub/Sub System - Links, monitors, and events all work through a publisher/subscriber mechanism in the node core. When a process terminates or an event fires, the node knows who's subscribed and delivers the notifications.
Logging - Every log message goes through the node, which fans it out to registered loggers. This centralized logging makes it easy to capture, filter, and route log output.
Starting a Node
A node needs a name. The format is name@hostname, where the hostname determines which network interface to use for incoming connections.
The name must be unique on the host. Two nodes with the same name can't run on the same machine, but nodes with different names can coexist.
The gen.NodeOptions parameter configures the node: which applications to start, environment variables, network settings, logging configuration. If you specify applications in the options, the node loads and starts them automatically. If any application fails to start, the entire node startup fails - this ensures you don't end up in a partially initialized state.
Process Lifecycle
The node manages the complete process lifecycle.
When you spawn a process, the node creates it, registers it in the process table, calls its ProcessInit callback, and transitions it to the sleep state. The process is now live and can receive messages.
When the process terminates (either naturally or through an exit signal), the node calls ProcessTerminate, removes it from the process table, and notifies any processes that were linked or monitoring. Resources are cleaned up, and the gen.PID becomes invalid.
Processes can register names, making them addressable by name rather than PID. This is useful for well-known processes that other parts of the system need to find. The node maintains a name registry, ensuring each name maps to exactly one process.
Message Routing
Message routing is one of the node's core responsibilities.
When a process sends a message locally, the node simply places it in the recipient's mailbox. The recipient's goroutine wakes up (if it was sleeping), processes the message, and goes back to sleep if no more messages are waiting.
When the message goes to a remote process, things are more interesting. The node checks if a connection exists to the remote node. If not, it discovers the remote node's address (through the registrar or static routes) and establishes a connection. The message is encoded into the Ergo Data Format, optionally compressed, and sent over the network. The remote node receives it, decodes it, and delivers it to the recipient's mailbox.
From the sender's perspective, both paths look identical. That's network transparency.
Network Communication
Making remote message delivery work like local delivery requires solving three problems: finding remote nodes, establishing connections, and ensuring compatibility.
The first problem is discovery. When you send to a remote process, the node extracts which node that process belongs to from its identifier. Every node runs a small registrar service by default. For nodes on the same host, you query the local registrar. For nodes on different hosts, you query the registrar on that remote host - the framework derives the hostname from the node name and sends the query there. The registrar responds with connection information.
This default approach works for simple setups but has limitations. You're querying individual hosts, which requires them to be directly reachable. There's no cluster-wide view, no centralized configuration, no way to discover which applications are running where.
That's where etcd or Saturn come in. Instead of each node being its own island with a local registrar, you run a centralized registry service. All nodes register there when they start. All discovery queries go there. The central registrar becomes the source of truth for the cluster, providing not just discovery but configuration management, application tracking, and topology change notifications. It transforms independent nodes into a coordinated cluster.
Once a node is discovered, connections are established. Multiple TCP connections form a pool to that node, enabling parallel message delivery. The connections negotiate protocol details during handshake: which protocol version to use, whether compression is supported, what features are enabled. This negotiation allows nodes with different capabilities to work together.
Environment and Configuration
Nodes have environment variables that all processes inherit. This provides a way to configure behavior without hardcoding values. A process can override inherited variables or add its own, creating a hierarchy: process environment overrides parent, which overrides leader, which overrides node.
Environment variables are case-insensitive. Whether you set "database_url" or "DATABASE_URL", the process sees the same value. This eliminates a common source of configuration bugs.
Shutdown
Stopping a node can be graceful or forced.
Graceful shutdown sends exit signals to all processes and waits for them to clean up. Processes receive gen.TerminateReasonShutdown and can save state, close connections, or send final messages before terminating. Once all processes have stopped, the network stack shuts down, and the node exits.
Forced shutdown kills all processes immediately without waiting for cleanup. This is useful when you need to stop quickly, but processes don't get a chance to clean up properly.
One subtlety: if you call Stop from within a process, you create a deadlock. The process can't terminate because it's waiting for Stop to complete, but Stop is waiting for all processes (including this one) to terminate. The solution is either to call Stop in a separate goroutine or use StopForce, which doesn't wait.
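A minimal sketch of the workaround (the `shutdownRequest` message type is hypothetical; the sketch assumes the act.Actor behavior and its embedded gen.Process API):

```go
package main

import (
	"ergo.services/ergo/act"
	"ergo.services/ergo/gen"
)

// shutdownRequest is a hypothetical application-level message.
type shutdownRequest struct{}

type worker struct {
	act.Actor
}

func (w *worker) HandleMessage(from gen.PID, message any) error {
	switch message.(type) {
	case shutdownRequest:
		// Calling w.Node().Stop() directly here would deadlock: Stop
		// waits for every process - including this one - to terminate.
		go w.Node().Stop() // run Stop in a separate goroutine instead

		// Alternative: w.Node().StopForce(), which doesn't wait at all.
	}
	return nil
}
```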
Shutdown Timeout
Graceful shutdown can hang indefinitely if a process is stuck - perhaps blocked on a channel, waiting for an external resource, or caught in incorrect logic. To prevent this, the node has a shutdown timeout. If processes don't terminate within this period, the node force exits with error code 1.
The default timeout is 3 minutes. You can change it through gen.NodeOptions:
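A sketch of changing it at startup - note that the option field name below is a placeholder; check gen.NodeOptions in the godoc for the exact field that controls the graceful-shutdown timeout:

```go
package main

import (
	"time"

	"ergo.services/ergo"
	"ergo.services/ergo/gen"
)

func main() {
	var options gen.NodeOptions
	// Hypothetical field name - consult gen.NodeOptions for the
	// actual option (the default is 3 minutes).
	options.ShutdownTimeout = 30 * time.Second

	node, err := ergo.StartNode("demo@localhost", options)
	if err != nil {
		panic(err)
	}
	node.Wait()
}
```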
During shutdown, the node logs which processes are still running. Every 5 seconds, it prints a warning with the first 10 pending processes, showing their PID, registered name (if any), behavior type, state, and mailbox queue length. This diagnostic output helps identify what's blocking the shutdown:
The state tells you what the process is doing: running means it's handling a message, sleep means it's idle waiting for messages. The queue count shows how many messages are waiting. A process stuck in running with a growing queue indicates it's blocked in a callback and not processing its mailbox.
Node Incarnation
Every node has a creation timestamp assigned when it starts. This timestamp is embedded in every gen.PID, gen.Ref, and gen.Alias that the node creates.
When two nodes connect, they exchange their creation timestamps during the handshake. Each connection stores the remote node's creation value.
Before sending any message to a remote process, the framework compares the target's Creation field against the stored creation of that remote node. If they differ, the operation returns gen.ErrProcessIncarnation immediately - no network message is sent.
This mechanism handles a common distributed systems problem: what happens when a remote node restarts? After restart, the node gets a new creation timestamp. Any gen.PID or gen.Alias from before the restart now contains the old creation value. When you try to send a message using that stale identifier, the framework detects the mismatch and returns an error instead of delivering the message to a wrong process.
The check applies to all remote operations: Send, Call, Link, Unlink, Monitor, Demonitor, SendExit, and SendResponse.
The Node's Role
The node is infrastructure, not application logic. It provides the mechanisms - process management, message routing, networking - that your actors use to accomplish work.
This separation is important. Your actors focus on application logic: handling requests, processing data, managing state. The node handles the plumbing: routing messages, establishing connections, managing lifecycles. You don't write code to discover remote nodes or encode messages. The node does that.
This is what makes the framework approachable. You write actors that send and receive messages, and the node makes it all work, whether processes are local or distributed across a cluster.
The following chapters dive into specific node capabilities. Process explains the actor lifecycle and operations. Networking covers distributed communication. Links and Monitors explains how processes track each other.
Identifying Processes
A process identifier (gen.PID) uniquely identifies a process across the entire distributed system. It contains three components: the node name where the process runs, a unique sequential number within that node, and a creation timestamp.
The creation timestamp is the node's startup time. If a node restarts, the creation value changes, which means PIDs from before the restart are distinguishable from PIDs after. If you try to send a message to a gen.PID with an old creation value, you get an error. This prevents messages from being delivered to the wrong process after a node restart.
Besides PIDs, processes can be identified by registered names. A process can register one name, making it addressable as gen.ProcessID{Name: "worker", Node: "node@host"}. This is useful for well-known processes that other parts of the system need to find without knowing their gen.PID.
Processes can also create aliases - temporary identifiers that provide additional addressing options. Unlike registered names (one per process), a process can create unlimited aliases using gen.Alias. They're useful when you need multiple ways to address the same process, such as in request-response patterns or when implementing services with multiple endpoints.
Process Lifecycle
A process goes through several states during its lifetime.
It starts in Init, where the ProcessInit callback runs. In this state, the process can spawn children, send messages, register names, create aliases, register events, establish links and monitors, and make synchronous calls.
After initialization succeeds, the process enters Sleep and is ready to receive messages. When a message arrives, the process transitions to Running, handles the message, and returns to Sleep.
If the process makes a synchronous call, it enters WaitResponse while waiting for the reply. Once the response arrives, it returns to Running and continues processing.
Eventually the process terminates. This can happen in several ways: it returns an error from its message handler, it receives an exit signal, the node kills it, or a panic occurs. The ProcessTerminate callback runs, allowing cleanup. Then the process is removed from the node, and its resources are freed.
Starting Processes
You spawn processes through a factory function that creates instances of your actor.
The factory is called each time you spawn - each process gets a fresh instance. This isolation is important for the actor model.
gen.ProcessOptions configures the new process: mailbox size, environment variables, compression settings, message priority, linking behavior, and initialization timeout. Most options have sensible defaults. The main ones you'll configure are MailboxSize (to limit memory) and Env (to pass configuration).
InitTimeout limits how long ProcessInit can take. Zero uses the default (5 seconds). If initialization exceeds this timeout, the process is terminated with gen.ErrTimeout and spawn returns an error. For remote spawn and application processes, the maximum allowed value is 15 seconds - exceeding this limit returns gen.ErrNotAllowed.
Two options deserve explanation: LinkParent and LinkChild. These options provide a convenient way to establish links automatically after initialization completes. If LinkChild is set, the parent links to the child. If LinkParent is set, the child links to the parent. These links only work for process-spawned children, not node-spawned processes. Note that you can also call Link methods directly during initialization if needed.
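Putting this together, a spawn might look like the following sketch (node name, mailbox size, and the environment value are illustrative):

```go
package main

import (
	"ergo.services/ergo"
	"ergo.services/ergo/act"
	"ergo.services/ergo/gen"
)

type worker struct {
	act.Actor
}

// The factory returns a fresh instance on every spawn - no shared state.
func factoryWorker() gen.ProcessBehavior {
	return &worker{}
}

func main() {
	node, err := ergo.StartNode("demo@localhost", gen.NodeOptions{})
	if err != nil {
		panic(err)
	}

	// Spawn with options: a bounded mailbox and an inherited variable.
	pid, err := node.Spawn(factoryWorker, gen.ProcessOptions{
		MailboxSize: 1000,
		Env:         map[gen.Env]any{"DATABASE_URL": "postgres://..."},
	})
	if err != nil {
		panic(err)
	}
	node.Log().Info("spawned worker %s", pid)
	node.Wait()
}
```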
Message Handling
Processes are defined by implementing the gen.ProcessBehavior interface. This is a low-level interface with three callbacks: ProcessInit for initialization, ProcessRun for the message processing loop, and ProcessTerminate for cleanup.
In practice, you rarely implement gen.ProcessBehavior directly. Instead, you use act.Actor, which implements gen.ProcessBehavior and provides a more convenient abstraction. act.Actor gives you HandleMessage and HandleCall callbacks - straightforward methods where you write your message handling logic without worrying about the mailbox mechanics.
The ProcessInit callback runs once during startup. Use it to initialize state, spawn children, configure properties. If it returns an error or exceeds the InitTimeout, the process is cleaned up and removed - it terminates immediately.
The ProcessTerminate callback runs during shutdown. Use it for cleanup: close files, send final messages, log termination. It receives the termination reason, so you can distinguish between normal shutdown and errors.
act.Actor handles the ProcessRun loop for you, calling your HandleMessage and HandleCall methods as messages arrive. This separation between the low-level interface (gen.ProcessBehavior) and the high-level abstraction (act.Actor) keeps the framework flexible while making common cases simple.
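A small sketch of an act.Actor with both callbacks - the message types `inc` and `getCount` are hypothetical, and the state is safe because only this process touches it:

```go
package main

import (
	"ergo.services/ergo/act"
	"ergo.services/ergo/gen"
)

// Hypothetical message and request types.
type inc struct{}
type getCount struct{}

type counter struct {
	act.Actor
	count int // private state: only this process's goroutine touches it
}

// HandleMessage processes asynchronous messages, one at a time.
func (c *counter) HandleMessage(from gen.PID, message any) error {
	switch message.(type) {
	case inc:
		c.count++
	}
	return nil // nil keeps the process running; an error terminates it
}

// HandleCall processes synchronous requests and returns the result.
func (c *counter) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
	switch request.(type) {
	case getCount:
		return c.count, nil
	}
	return nil, nil
}
```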
Environment Variables
Processes inherit environment variables when they spawn. At that moment, variables are copied from multiple sources and merged with a priority order: node variables (lowest priority), then application, then leader, then parent, then variables specified in gen.ProcessOptions (highest priority). If the same variable exists in multiple sources, the higher priority value wins.
Once a process is running, its environment is independent. If the node changes an environment variable, running processes don't see the change. Only newly spawned processes inherit the updated values. This isolation is important - it means a process's configuration is stable for its lifetime.
When a process queries a variable with Env or EnvList, it looks only in its own environment - the merged copy created at spawn time. The hierarchy (Process > Parent > Leader > Application > Node) determines what was copied during spawning, not what's queried during lookup.
Variables are case-insensitive. "database_url", "DATABASE_URL", and "Database_Url" are all the same variable. This eliminates configuration mistakes from case mismatches.
Use SetEnv to modify variables during Init or Running states. Pass nil as the value to delete a variable. Changes affect only this process - they don't propagate to children, parents, or the node.
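A sketch of this in a handler (variable name and value are illustrative; the exact Env return signature is per the gen.Process godoc):

```go
package main

import (
	"ergo.services/ergo/act"
	"ergo.services/ergo/gen"
)

type worker struct {
	act.Actor
}

func (w *worker) HandleMessage(from gen.PID, message any) error {
	// Set or override a variable for this process only.
	w.SetEnv("cache_ttl", 60)

	// Lookup is case-insensitive: querying "CACHE_TTL" finds the
	// value set above as "cache_ttl".
	ttl, _ := w.Env("CACHE_TTL")
	w.Log().Info("cache ttl: %v", ttl)

	// Passing nil as the value deletes the variable.
	w.SetEnv("cache_ttl", nil)
	return nil
}
```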
Termination
Processes typically terminate themselves by returning an error from ProcessRun. In act.Actor, this manifests as returning an error from HandleMessage, HandleCall, or other handler callbacks. Return gen.TerminateReasonNormal for clean shutdown, or any other error to indicate why termination occurred. The process transitions to Terminated, runs its ProcessTerminate callback for cleanup, and is removed from the node.
If a panic occurs during message handling, the framework catches it, logs the stack trace, and terminates the process with gen.TerminateReasonPanic. The ProcessTerminate callback still runs, giving the process a chance to clean up despite the panic.
Processes can also be terminated externally. Sending an exit signal with SendExit delivers a high-priority termination request to the process's Urgent queue. Actors can trap these signals and handle them as regular messages, allowing graceful shutdown. This is how supervision trees restart workers - send an exit signal, wait for clean termination, then spawn a replacement.
The most forceful option is Kill. If the process is idle (Sleep state), it transitions directly to Terminated and ProcessTerminate is called. If the process is actively handling a message (Running or WaitResponse states), it's marked as Zombee. In Zombee state, all operations return gen.ErrNotAllowed. The process finishes its current message, then terminates and calls ProcessTerminate. Use Kill when you need to stop a process that isn't responding to exit signals.
Regardless of how termination happens, the node performs comprehensive cleanup. Events the process registered are unregistered. Its registered name becomes available for reuse. Aliases are deleted. Links and monitors are removed. If the process was acting as a logger, it's removed from the logging system. Meta processes spawned by this process are terminated. This ensures no dangling references remain after a process is gone.
State-Based Access Control
Not all Process interface methods work in all states. This isn't arbitrary - it reflects what's actually possible.
During Init, the process can spawn children, send messages, register names, create aliases, register events, establish links and monitors, and make synchronous calls.
During Running, everything is available. The process is fully operational.
During Terminated, only sending messages works. You can't spawn new children or create new resources - the process is shutting down.
These restrictions are enforced by the framework. If you call a method in the wrong state, you get gen.ErrNotAllowed. This prevents subtle bugs where operations appear to succeed but silently fail because the process isn't in the right state.
The details of which methods work in which states are documented in the gen.Process godoc. In practice, you rarely hit these restrictions unless you're doing unusual things during initialization or shutdown.
For a deeper understanding of process operations and lifecycle management, refer to the gen.Process interface documentation in the code.
gen.Atom
gen.Atom is a specialized string used for names - node names, process names, event names. While technically just a string, treating it as a distinct type allows the framework to optimize how these names are handled in the network stack.
Atoms appear in single quotes when printed:
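Conceptually, the printed form looks like this (a simplified re-implementation for illustration only, not the framework's code):

```go
package main

import "fmt"

// Atom mimics gen.Atom's printed form for illustration:
// a string that renders wrapped in single quotes.
type Atom string

func (a Atom) String() string {
	return "'" + string(a) + "'"
}

func main() {
	node := Atom("demo@localhost")
	fmt.Println(node) // 'demo@localhost'
}
```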
The network stack caches atoms and maps them to numeric IDs to reduce bandwidth when the same names appear repeatedly in messages.
gen.PID
A gen.PID uniquely identifies a process. It contains the node name where the process lives, a unique sequential ID, and a creation timestamp. The creation timestamp changes when a node restarts, allowing you to detect if you're talking to a reincarnation of a node rather than the original.
gen.PID values print with the node name hashed for brevity:
The hash (90A29F11) is a CRC32 of the node name. This keeps the printed form compact while still distinguishing nodes in practice.
gen.ProcessID
A gen.ProcessID identifies a process by its registered name rather than gen.PID. This is useful when you need to address a process but don't know its gen.PID, or when the gen.PID might change across restarts but the name remains constant.
gen.Ref
gen.Ref values are unique identifiers generated by nodes. They're used for correlating requests and responses in synchronous calls, and as tokens when registering events.
A gen.Ref is guaranteed unique within a node for its lifetime. The structure includes the node name, creation time, and a unique ID array.
References can also embed deadlines (stored in ID[2]) for timeout tracking. Recipients can check ref.IsAlive() to see if a request is still valid.
gen.Alias
gen.Alias is like a temporary gen.PID. Processes create aliases for additional addressability without registering names. Meta processes use aliases as their primary identifier.
Aliases use the same structure as references but print with a different prefix:
gen.Event
gen.Event values represent named message streams that processes can subscribe to. A gen.Event identifier consists of a name and the node where it's registered.
gen.Env
Environment variable names in Ergo are case-insensitive. The gen.Env type ensures this by converting to uppercase.
This allows processes to inherit environment variables from parents, leaders, and the node, with consistent naming regardless of how they're specified.
Core Interfaces
The framework defines several interfaces that provide access to different parts of the system.
gen.Node
The gen.Node interface is what you get when you start a node. It provides methods for spawning processes, managing applications, configuring networking, and controlling the node lifecycle.
Node operations can be called from any goroutine. The node manages processes but isn't itself an actor.
gen.Process
The gen.Process interface represents a running actor. It provides methods for sending messages, spawning children, linking to other processes, and managing the actor's lifecycle.
Actors typically embed this interface:
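For example, act.Actor embeds gen.Process, so your actor gets the full process API as methods (a sketch; the "ack" reply is illustrative):

```go
package main

import (
	"ergo.services/ergo/act"
	"ergo.services/ergo/gen"
)

type myActor struct {
	act.Actor // embeds gen.Process: Send, Spawn, Link, Log, ...
}

func (a *myActor) HandleMessage(from gen.PID, message any) error {
	// gen.Process methods are directly available on the actor:
	a.Log().Info("my pid is %s on node %s", a.PID(), a.Node().Name())
	return a.Send(from, "ack")
}
```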
Process methods enforce state-based access control. Some operations are only available when the process is in certain states, ensuring actor model constraints are maintained.
gen.Network
The gen.Network interface manages distributed communication. It handles connections to remote nodes, routing, and service discovery.
Network transparency means sending messages to remote processes uses the same API as local processes. The gen.Network interface is where you configure how that transparency is achieved.
gen.RemoteNode
A gen.RemoteNode represents a connection to another Ergo node. Through this interface, you can spawn processes on the remote node or start applications there.
The remote operations require the target node to have enabled the corresponding permissions.
Type Design Philosophy
These types reflect a few design decisions worth understanding.
Hashing for readability - Node names are hashed in output to keep logs and traces readable while maintaining uniqueness. Full names can be verbose, especially in distributed systems with descriptive naming.
Separate types for concepts - gen.PID, gen.ProcessID, gen.Alias, and gen.Event are distinct types even though they could have been unified. Each represents a different way of addressing or identifying something in the system, and the type system helps keep these concepts clear.
Network-aware design - Many types include the node name. This isn't just for completeness - it's what enables network transparency. A gen.PID tells you not just which process, but which node, allowing the framework to route messages appropriately.
For detailed API documentation of these interfaces and types, refer to the godoc comments in the source code.
Creating the Saturn client requires:
The address of the central registrar (the default port is used, unless specified in saturn.Options)
A token for connecting to Saturn
A set of options saturn.Options
Then, set this client in the gen.NetworkOption.Registrar options
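A sketch of wiring this up - the constructor name, import path, and exact option fields here are assumptions; verify them against the saturn package documentation:

```go
package main

import (
	"time"

	"ergo.services/ergo"
	"ergo.services/ergo/gen"

	"ergo.services/registrar/saturn" // assumed import path
)

func main() {
	// saturn.Create is an assumed constructor name: registrar
	// address, connection token, and options.
	client := saturn.Create("registrar.example.com", "secret-token",
		saturn.Options{
			Cluster:   "my-cluster",
			KeepAlive: 15 * time.Second,
		})

	var options gen.NodeOptions
	options.Network.Registrar = client

	node, err := ergo.StartNode("demo@localhost", options)
	if err != nil {
		panic(err)
	}
	node.Wait()
}
```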
Using saturn.Options, you can specify:
Cluster - The cluster name for your node
Port - The port number for the central Saturn registrar
KeepAlive - The keep-alive parameter for the TCP connection with Saturn
InsecureSkipVerify - Option to ignore TLS certificate verification
When the node starts, it will register with the Saturn central registrar in the specified cluster.
Additionally, this library registers a gen.Event and generates messages based on events received from the central Saturn registrar within the specified cluster, keeping the node informed of changes as they happen:
saturn.EventNodeJoined - Triggered when another node registers in the same cluster
saturn.EventNodeLeft - Triggered when a node disconnects from the central registrar
saturn.EventApplicationLoaded - Triggered when an application is loaded on a remote node. Use ResolveApplication from the gen.Resolver interface to get application details
saturn.EventApplicationStarted - Triggered when an application starts on a remote node
saturn.EventApplicationStopping - Triggered when an application begins stopping on a remote node
saturn.EventApplicationStopped - Triggered when an application stops on a remote node
saturn.EventApplicationUnloaded - Triggered when an application is unloaded on a remote node
saturn.EventConfigUpdate - Triggered when the node's configuration is updated
To receive these messages, subscribe to the Saturn client's events using the LinkEvent or MonitorEvent methods of the gen.Process interface. You can obtain the name of the registered event using the Event method of the gen.Registrar interface. This lets your node react to cluster events such as node joins, application starts, and configuration updates.
Using the saturn.EventApplication* events and the Remote Start Application feature, you can dynamically manage the functionality of your cluster. The saturn.EventConfigUpdate events allow you to adjust the cluster configuration on the fly without restarting nodes, such as updating the cookie value for all nodes or refreshing the TLS certificate. Refer to the Saturn - Central Registrar section for more details.
You can also use the Config and ConfigItem methods from the gen.Registrar interface to retrieve configuration parameters from the registrar.
To get information about available applications in the cluster, use the ResolveApplication method from the gen.Resolver interface, which returns a list of gen.ApplicationRoute structures:
Name - The name of the application
Node - The name of the node where the application is loaded or running
Weight - The weight assigned to the application in gen.ApplicationSpec
Mode - The application's startup mode (gen.ApplicationModeTemporary, gen.ApplicationModePermanent, or gen.ApplicationModeTransient)
State - The current state of the application (gen.ApplicationStateLoaded, gen.ApplicationStateRunning, or gen.ApplicationStateStopping)
You can access the gen.Resolver interface using the Resolver method from the gen.Registrar interface.
Every minute, the cron system wakes up and evaluates all job specifications against the current time. Jobs whose specifications match the current minute are queued for execution. Each queued job then runs in its own goroutine.
This design is stateless - no pre-calculated schedules, no complex data structures to maintain. When you add a job, it participates in the next evaluation. When you remove a job, it stops participating. Timezone and daylight saving time transitions are handled naturally because each evaluation uses current time rules.
The stateless approach has implications. Multiple executions of the same job can run concurrently if the job takes longer than its interval. A job scheduled every minute that takes two minutes to complete will have two instances running simultaneously. If your job can't handle concurrent execution, implement serialization in the action itself - for example, send a message to a named process that processes requests sequentially.
Defining Jobs
A job specification declares what should run and when:
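A sketch of a job definition - the action constructor name is an assumption, so verify it against the gen.Cron godoc; the job name, schedule, and target process are illustrative:

```go
package main

import (
	"time"

	"ergo.services/ergo/gen"
)

func newReportJob() gen.CronJob {
	loc, err := time.LoadLocation("America/New_York")
	if err != nil {
		panic(err)
	}
	return gen.CronJob{
		Name:     "daily-report", // unique within the node
		Spec:     "0 9 * * 1-5",  // 9:00 on weekdays
		Location: loc,            // interpreted in New York time
		// Assumed constructor: sends gen.MessageCron to the process
		// registered as "reporter" when the schedule triggers.
		Action: gen.CreateCronActionMessage("reporter", gen.MessagePriorityNormal),
	}
}
```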
The Name identifies the job uniquely within the node. The Spec uses crontab format to define the schedule. The Location specifies which timezone to use when interpreting the schedule. The Action defines what happens when the schedule triggers.
Optionally, Fallback can specify a process to notify if the action fails, providing centralized error handling for scheduled tasks.
Actions
Actions define what happens when a job runs.
The simplest action sends a message. The job triggers, the cron system sends gen.MessageCron to the specified process, and the process handles it through normal message processing. This integrates cleanly with the actor model - the scheduled work happens inside an actor's message handler.
For work that needs isolation per execution, spawn a process. Each time the job triggers, a fresh process spawns, performs the work, and terminates. If one execution crashes, the next starts clean. The spawned process receives environment variables identifying which job spawned it and when (gen.CronEnvNodeName, gen.CronEnvJobName, gen.CronEnvJobActionTime).
For distributed systems, spawn on a remote node. A job on the coordinator can trigger work on data nodes. The remote node must have enabled spawn permissions for the process name. This pattern centralizes scheduling while distributing execution.
Custom actions implement the gen.CronAction interface. The Do method receives the job name, node reference, and execution time in the job's timezone. Return an error to trigger fallback handling.
Crontab Format
Cron uses standard crontab syntax: five fields specifying minute, hour, day-of-month, month, and day-of-week.
Common patterns:
0 * * * * - Every hour
0 0 * * * - Every day at midnight
*/15 * * * * - Every 15 minutes
0 9-17 * * 1-5 - Every hour from 9:00 to 17:00 on weekdays
0 0 1 * * - First day of each month
0 0 * * 5#2 - Second Friday of each month
0 0 L * * - Last day of each month
Macros provide common schedules: @hourly, @daily, @weekly, @monthly.
Managing Jobs
Jobs can be defined at node startup in gen.NodeOptions.Cron.Jobs, or managed dynamically through the gen.Cron interface.
Add jobs with AddJob. Remove them with RemoveJob. Temporarily disable with DisableJob (useful for maintenance windows), and resume with EnableJob. Query status with Info and JobInfo, which show execution history and errors.
The Schedule and JobSchedule methods preview upcoming executions. Since the implementation evaluates specifications on-demand rather than maintaining pre-calculated schedules, these methods perform the same evaluation logic for a future time range. Use them to verify your crontab specs are correct or to detect scheduling conflicts.
Timezone Handling
Each job has its own timezone. A job with Location: time.UTC scheduled for midnight runs at UTC midnight. A job with a New York timezone runs at New York midnight. The physical location of the node doesn't matter - jobs run in their configured timezone.
This matters for distributed systems where jobs serve different regions. One node can run jobs for multiple timezones. A cleanup job for European users runs at European midnight. A report job for Asian users runs at Asian business hours. Same node, different timezones, correct local timing.
Daylight Saving Time
Timezone transitions are handled carefully.
When clocks spring forward, an hour disappears. A job scheduled for 2:00 AM doesn't run on the spring-forward date because 2:00 AM doesn't exist that day. The cron system detects the time adjustment and skips execution rather than running at the wrong time.
When clocks fall back, an hour repeats. A job scheduled during that hour runs once, not twice. The system tracks actual wall clock progression to avoid duplicate execution.
This behavior ensures jobs run when intended, not at arbitrary times that happen to match the specification after time adjustments.
Error Handling
If a job action returns an error and the job has a configured fallback, the system sends gen.MessageCronFallback to the fallback process. The message includes the job name, execution time, error, and an optional tag for identifying the job source.
This allows centralizing monitoring of failed scheduled tasks. A single fallback process can receive failures from all jobs, log them, send alerts, or take corrective action.
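A sketch of such a fallback process - the message's exact fields are not shown here; check gen.MessageCronFallback in the godoc for its structure:

```go
package main

import (
	"ergo.services/ergo/act"
	"ergo.services/ergo/gen"
)

// A single process that centralizes failures from all scheduled jobs.
type cronErrors struct {
	act.Actor
}

func (c *cronErrors) HandleMessage(from gen.PID, message any) error {
	switch m := message.(type) {
	case gen.MessageCronFallback:
		// Log the failure; a real handler might also send alerts
		// or take corrective action. See the godoc for the fields
		// (job name, execution time, error, optional tag).
		c.Log().Error("cron job failed: %v", m)
	}
	return nil
}
```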
For complete crontab specification syntax and additional examples, refer to the gen.Cron interface documentation in the code.
Trace - Faint white (low importance, background noise)
Debug - Magenta (development information)
Info - White (normal operation)
Warning - Yellow (attention needed)
Error - Red bold (problems occurred)
Panic - White on red background bold (critical failures)
Framework types also get color highlighting:
gen.Atom - Green (names and identifiers)
gen.PID - Blue (process identifiers)
gen.ProcessID - Blue (named processes)
gen.Ref - Cyan (references)
gen.Alias - Cyan (meta-process identifiers)
gen.Event - Cyan (event names)
When you log process.Log().Info("started %s", pid), the PID renders in blue automatically. You don't annotate it - the logger detects the type and applies color. This works for any framework type used as an argument.
Log Format
Each log message follows a consistent structure:
Timestamp appears first. By default, it's the Unix timestamp in nanoseconds. You can configure any format from Go's time package, or define your own. Nanosecond timestamps are sortable and precise, useful when correlating logs with traces or metrics.
Level shows the severity. The bracket format [INFO] or short form [INF] makes levels easy to grep. Color reinforces the level visually - you don't need to read the text to know something is an error.
Source identifies where the message originated:
Node logs - Show the node name in green (CRC32 hash for compactness)
Network logs - Show both local and peer node names
Process logs - Show PID in blue, optionally the registered name in green, optionally the behavior type
Meta-process logs - Show alias in cyan, optionally the behavior type
The optional components (name, behavior) are controlled by configuration. During development, you might want behavior names to understand which actor logged something. In production, you might omit them to reduce output.
Message is your formatted string with arguments. Framework types in arguments get color highlighting automatically.
Configuration
The logger accepts several options during creation:
TimeFormat - Sets timestamp format. Any format from time package works (time.RFC3339, time.Kitchen, custom layouts). Leave empty for nanosecond timestamps. Nanoseconds are precise but hard to read. RFC3339 is human-friendly but verbose. Choose based on your use case.
ShortLevelName - Uses abbreviated level names: [TRC], [DBG], [INF], [WRN], [ERR], [PNC]. Saves horizontal space in the terminal. Full names are clearer for people unfamiliar with the abbreviations.
IncludeName - Adds the registered process name to the source. If a process registers as "worker", logs show the name in green next to the PID. Helpful when you have many processes and want to identify them by role rather than PID.
IncludeBehavior - Adds the behavior type name to the source. Logs show which actor implementation generated the message. Useful during development to understand code flow. In production, this adds noise if you have good message content.
IncludeFields - Includes structured logging fields in the output. Fields appear below the message with faint color. Useful when your log messages use context fields for correlation (request IDs, user IDs, etc.).
DisableBanner - Disables the Ergo logo banner on startup. The banner announces framework version and adds visual flair. Disable it in production or when running tests where the banner clutters output.
Basic Usage
Register the colored logger in node options:
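A sketch of the registration - the constructor name, import path, and the exact shape of the logging options in gen.NodeOptions are assumptions here; the option fields themselves (ShortLevelName, IncludeName) are the ones described above:

```go
package main

import (
	"ergo.services/ergo"
	"ergo.services/ergo/gen"

	"ergo.services/logger/colored" // assumed import path
)

func main() {
	var options gen.NodeOptions

	// Disable the plain default logger so messages aren't printed twice.
	options.Log.DefaultLogger.Disable = true

	// colored.CreateLogger is an assumed constructor name - check
	// the package documentation for the actual API.
	logger, err := colored.CreateLogger(colored.Options{
		ShortLevelName: true,
		IncludeName:    true,
	})
	if err != nil {
		panic(err)
	}
	options.Log.Loggers = append(options.Log.Loggers,
		gen.Logger{Name: "colored", Logger: logger})

	node, err := ergo.StartNode("demo@localhost", options)
	if err != nil {
		panic(err)
	}
	node.Wait()
}
```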
The default logger writes to stdout too, but without colors. If you don't disable it, you get each message twice - once colored, once plain. Disabling the default logger ensures only the colored version appears.
For detailed logger configuration options, see the colored.Options struct in the package. For understanding how loggers integrate with the framework, see Logging.
Application
Grouping and Managing Actors as a Unit
An application groups related actors and manages them as a unit. Instead of starting individual processes and tracking their lifecycles manually, you define an application that specifies which actors to start, in what order, and how the group should behave if individual actors fail.
Think of an application as a recipe. It lists the components (actors and supervisors), describes their startup order, and specifies the rules for what happens when things go wrong. The node follows this recipe when starting the application and monitors the running components according to the specified mode.
The Need for Applications
Starting processes one at a time works for simple systems. But as complexity grows, you face coordination problems. Which processes should start first? What if one fails to start - do you continue or abort? If a critical component terminates, should the service keep running in a degraded state or shut down cleanly?
These aren't implementation details - they're architectural decisions about your service's structure and fault tolerance policy. Applications let you declare these decisions explicitly rather than scattering the logic throughout your code. The specification documents what your service consists of. The mode declares your termination policy. The framework enforces both.
Defining an Application
Applications implement the gen.ApplicationBehavior interface:
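The interface, paraphrased from the gen package (see the godoc for the authoritative signatures):

```go
type ApplicationBehavior interface {
	// Load is called when the application is loaded; it returns the
	// specification: the process group, mode, and metadata.
	Load(node Node, args ...any) (ApplicationSpec, error)

	// Start is called after all processes in the group started successfully.
	Start(mode ApplicationMode)

	// Terminate is called when the application stops.
	Terminate(reason error)
}
```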
The Load callback returns the application specification - what this application consists of and how it should behave. The Start callback runs after all processes start successfully. The Terminate callback runs when the application stops.
A typical application specification:
The Group lists processes to start. Processes start in the order listed. If a process has a Name, it's registered with that name, making it discoverable. Processes without names are anonymous.
Application names and process names exist in separate namespaces. An application named "api" and a process named "api" do not conflict - you can have both registered simultaneously. However, using the same name for both creates confusion when reading code or debugging. Avoid identical names even though the framework allows it.
Application Modes
The mode determines what happens when a process in the application terminates.
Temporary Mode - The application continues running despite individual process terminations. Only when all processes have stopped does the application itself terminate. This mode is for applications where components can fail and restart independently (typically via supervisors) without stopping the whole application.
Transient Mode - The application stops if any process terminates abnormally (crashes, panics, errors). Normal termination doesn't trigger shutdown. When an abnormal termination occurs, all remaining processes receive exit signals and the application shuts down. Use this mode when abnormal failures indicate a systemic problem that requires stopping the entire service.
Permanent Mode - The application stops if any process terminates, regardless of reason. Even normal termination of one process triggers shutdown of all others and the application itself. This mode is for applications where all components must run together - if one stops, the whole application is incomplete.
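The three modes reduce to a small decision rule. The sketch below models that rule in plain Go; it is an illustration of the policy, not ergo's implementation, and all names are made up:

```go
package main

import "fmt"

type AppMode int

const (
	Temporary AppMode = iota
	Transient
	Permanent
)

// shouldStopApp models the termination policy described above: given the
// application mode, how a member process terminated, and how many processes
// remain, decide whether the whole application shuts down.
func shouldStopApp(mode AppMode, abnormal bool, remaining int) bool {
	switch mode {
	case Permanent:
		return true // any termination stops the application
	case Transient:
		return abnormal // only abnormal termination stops the application
	default: // Temporary
		return remaining == 0 // stops only when every process has gone
	}
}

func main() {
	fmt.Println(shouldStopApp(Temporary, true, 3))  // false: others keep running
	fmt.Println(shouldStopApp(Transient, true, 3))  // true: abnormal exit
	fmt.Println(shouldStopApp(Permanent, false, 3)) // true: even normal exit
}
```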
Loading and Starting
Applications go through two phases: loading and starting.
Loading calls your Load callback, validates the specification, and registers the application with the node. The application is loaded but not running. This separation allows you to load multiple applications and resolve dependencies before starting any of them.
Starting launches the processes in the Group according to their order. If dependencies are specified in ApplicationSpec.Depends, the node ensures those applications are running first. If any process fails to start (including initialization timeout), previously started processes are killed and the application fails to start.
Application processes have a maximum InitTimeout of 15 seconds (3x DefaultRequestTimeout). Setting a higher value in gen.ProcessOptions returns gen.ErrNotAllowed and prevents the application from starting.
Once all processes are running, the Start callback is called and the application enters the running state.
Dependencies
Applications can depend on other applications or network services. If application B depends on application A, the node ensures A is running before starting B. Dependencies are declared in ApplicationSpec.Depends.
This allows you to structure complex systems with clear startup ordering. A database connection pool application starts before the API server application. The API server starts before the web frontend application. The framework handles the ordering automatically.
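The ordering the framework guarantees can be sketched as a depth-first walk over the declared dependencies. This is an illustration only; the application names are hypothetical:

```go
package main

import "fmt"

// startOrder models dependency-driven startup: each application lists the
// applications it depends on, and dependencies are started first.
func startOrder(deps map[string][]string, app string, started map[string]bool, order *[]string) {
	if started[app] {
		return
	}
	started[app] = true
	for _, dep := range deps[app] {
		startOrder(deps, dep, started, order)
	}
	*order = append(*order, app)
}

func main() {
	// web depends on api, api depends on dbpool
	deps := map[string][]string{
		"web": {"api"},
		"api": {"dbpool"},
	}
	var order []string
	startOrder(deps, "web", map[string]bool{}, &order)
	fmt.Println(order) // [dbpool api web]
}
```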
Stopping Applications
Applications stop in three ways.
You can call ApplicationStop, which sends exit signals to all processes and waits for them to terminate gracefully (5 second timeout by default). Once all processes have stopped, the Terminate callback runs and the application transitions to the loaded state.
You can call ApplicationStopForce, which kills all processes immediately without waiting. Less graceful, but guaranteed to stop quickly.
The application can stop itself based on its mode. In Transient or Permanent mode, process failures trigger automatic shutdown according to the mode's rules.
Environment and Configuration
Applications have environment variables that all their processes inherit. These override node-level variables but are overridden by process-specific variables. This creates a natural layering: node provides defaults, application provides service-specific values, processes can override for their specific needs.
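The layering can be modeled as an ordered lookup, process first, then application, then node. A simplified sketch; ergo resolves this internally:

```go
package main

import "fmt"

// lookupEnv returns the first value found, checking the most specific
// layer first: process overrides application, application overrides node.
func lookupEnv(name string, process, app, node map[string]string) (string, bool) {
	for _, layer := range []map[string]string{process, app, node} {
		if v, ok := layer[name]; ok {
			return v, true
		}
	}
	return "", false
}

func main() {
	node := map[string]string{"pool_size": "10", "region": "eu"}
	app := map[string]string{"pool_size": "50"}
	proc := map[string]string{}

	v, _ := lookupEnv("pool_size", proc, app, node)
	fmt.Println(v) // 50: application overrides the node default
	v, _ = lookupEnv("region", proc, app, node)
	fmt.Println(v) // eu: falls through to the node default
}
```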
Tags for Instance Selection
Running multiple instances of the same application across a cluster creates a selection problem. Which instance should handle the request? In blue/green deployments, you run two versions and route traffic based on readiness. Canary deployments send a percentage to the new version. Some instances enter maintenance mode while others serve production traffic.
Tags provide metadata for making these decisions. Label each application instance with tags describing its deployment state, version, or role:
Tags are always available through node.ApplicationInfo() or remoteNode.ApplicationInfo(). For clusters using centralized registrars (etcd, Saturn), tags are also published during application route registration. This enables cluster-wide discovery: query the registrar and receive all application instances with their tags.
The embedded in-memory registrar does not support application route registration, so tags in single-node or statically-routed deployments are only accessible via direct ApplicationInfo() calls, not through resolver queries.
Tags separate deployment strategy from application code. Your application doesn't know it's the "blue" deployment - that's configuration. The routing logic queries tags and makes decisions based on current cluster state.
Process Role Mapping
Applications contain multiple processes with specific responsibilities. An API server handles requests. A connection pool manages database connections. A cache manager stores frequently accessed data. These are logical roles, but the actual process names might be versioned, generated, or environment-specific.
The Map field bridges this gap. Define a mapping from logical role (string) to actual process name (Atom):
To communicate with a process by role, get the application info, look up the role in the map, then use the returned name:
This works for both local and remote applications. When querying a remote application, RemoteNode.ApplicationInfo() retrieves the map from the remote node, letting you discover process names without prior knowledge of the remote application's internal structure.
Why use mapping:
Version changes: Update "api_server_v2" to "api_server_v3" without changing client code
Implementation swaps: Map "db" to different pool implementations based on deployment
Remote discovery: Remote nodes query the map to find process names in foreign applications
The map provides a service contract. External code knows the application has an "api" role and a "db" role. The actual implementations can change as long as the roles remain consistent.
The Application Pattern
Applications provide structure to your actor system. Instead of scattered process creation throughout your code, applications centralize the "what runs in this service" question. The specification documents your system's structure. The mode declares your fault tolerance policy. The dependency mechanism ensures correct startup ordering.
This organization becomes especially valuable in distributed systems where services start on different nodes. An application can be started remotely on another node, bringing all its components with the correct configuration and dependencies.
For more details on application lifecycle and options, refer to the gen.ApplicationBehavior and gen.ApplicationSpec documentation in the code.
Boilerplate Code Generation
The ergo tool allows you to generate the structure and source code for a project based on the Ergo Framework. To install it, use the following command:
When using the ergo tool, you need to follow a specific template for providing arguments:
Parent:Actor{param1:value1,param2:value2...}
Parent can be a supervisor (specified earlier with -with-sup) or an application (specified earlier with -with-app).
Actor can be an actor (added earlier with -with-actor) or a supervisor (specified earlier with -with-sup).
This structured approach ensures the proper hierarchy and parameters are defined for your actors and supervisors.
Available Arguments and Parameters:
-init <node_name>: a required argument that sets the name of the node for your service. Available parameters:
tls: enables encryption for network connections (a self-signed certificate will be used).
Example
For clarity, let's use all available arguments for ergo in the following example:
Note that the values of the -with-tcp and -with-web arguments are enclosed in double quotes. If an argument has multiple parameters, they are separated by commas without spaces. Since braces containing commas trigger the shell's brace expansion, we enclose the entire value of the argument in double quotes so the shell passes the parameters through unchanged.
In our example, we specified two loggers: colored and rotate. This allows for colored log messages in the standard output as well as logging to files with log rotation functionality. In this case, the default logger is disabled to prevent duplicate log messages from appearing on the standard output.
Additionally, we included the observer application. By default, this interface is accessible at http://localhost:9911.
As a result of the generation process, we get well-structured project source code:
The generated code is ready for compilation and execution:
Since this example includes the observer application, you can open http://localhost:9911 in your browser to access the web interface for the node and its running processes.
WebWorker
WebWorker is a specialized actor for handling HTTP requests sent as meta.MessageWebRequest messages. It automatically routes requests to HTTP-method-specific callbacks and ensures the request completion signal is called.
Used with meta.WebHandler to convert HTTP requests into actor messages. See the WebHandler documentation for integration approaches.
Unimplemented HTTP methods return 501 Not Implemented automatically.
Error Handling
Return nil to continue processing requests. Return non-nil error to terminate the worker:
Returning error terminates the worker. Use this for fatal errors only (database connection lost, critical resource unavailable). For transient errors (validation, not found, conflict), write error response and return nil.
Using with act.Pool
Single worker processes one request at a time. Use act.Pool for concurrent processing:
Spawn pool instead of single worker:
WebHandler sends requests to pool. Pool distributes across 10 workers. System handles 10 concurrent requests.
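The concurrency gain can be sketched with plain goroutines and a channel standing in for the pool's mailboxes. This shows the shape of the idea, not act.Pool's implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// servePool starts N workers that drain a shared request queue concurrently,
// while each individual worker still handles one request at a time.
func servePool(workers int, requests []string) map[string]bool {
	queue := make(chan string)
	var mu sync.Mutex
	handled := make(map[string]bool)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for req := range queue { // sequential within a worker
				mu.Lock()
				handled[req] = true
				mu.Unlock()
			}
		}()
	}
	for _, r := range requests {
		queue <- r
	}
	close(queue)
	wg.Wait()
	return handled
}

func main() {
	handled := servePool(10, []string{"GET /a", "GET /b", "POST /c"})
	fmt.Println(len(handled)) // 3
}
```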
WebWorker processes meta.MessageWebRequest specially, but also receives regular messages:
This allows workers to receive configuration updates, control messages, or other actor communication while processing HTTP requests.
Implementation Details
WebWorker implements gen.ProcessBehavior at low level. It manages the mailbox loop, detects meta.MessageWebRequest, routes by HTTP method, and calls Done() after processing.
The Done() call is critical. It cancels the context that WebHandler blocks on. Without it, the HTTP request would time out. WebWorker guarantees Done() is called even if your callback panics or returns an error.
Default implementations exist for all callbacks. Unimplemented HTTP methods log a warning and return 501 Not Implemented. This lets you implement only the methods you need, without boilerplate for unsupported ones.
// Query registrar for all instances
routes, err := resolver.ResolveApplication("api_service")
// Returns []ApplicationRoute, each with Node, Tags, Weight, State
// Filter by tag
for _, route := range routes {
hasBlue := false
for _, tag := range route.Tags {
if tag == "blue" {
hasBlue = true
break
}
}
if hasBlue {
remoteNode, _ := network.GetNode(route.Node)
info, _ := remoteNode.ApplicationInfo("api_service")
// Use this instance
}
}
// Query application info (works locally or remotely)
info, err := node.ApplicationInfo("backend")
// or: info, err := remoteNode.ApplicationInfo("backend")
// Find process name by role
apiName, found := info.Map["api"]
if found {
// Use the actual process name to communicate
response, err := node.Call(apiName, APIRequest{})
}
type APIWorker struct {
act.WebWorker
}
func (w *APIWorker) HandleGet(from gen.PID, writer http.ResponseWriter, request *http.Request) error {
// Process GET request
user := w.lookupUser(request.URL.Query().Get("id"))
json.NewEncoder(writer).Encode(user)
return nil
}
func (w *APIWorker) HandlePost(from gen.PID, writer http.ResponseWriter, request *http.Request) error {
// Process POST request
var data CreateRequest
if err := json.NewDecoder(request.Body).Decode(&data); err != nil {
// Transient error: respond with 400 and keep the worker alive
http.Error(writer, "invalid JSON", http.StatusBadRequest)
return nil
}
result := w.createResource(data)
writer.WriteHeader(http.StatusCreated)
json.NewEncoder(writer).Encode(result)
return nil
}
func (w *APIWorker) HandleDelete(from gen.PID, writer http.ResponseWriter, request *http.Request) error {
id := request.URL.Query().Get("id")
w.deleteResource(id)
writer.WriteHeader(http.StatusNoContent)
return nil
}
Remote spawning means starting a process on another node from your code. You call a method, provide a factory name and options, and a process starts on the remote node. From the caller's perspective, it's nearly identical to spawning locally - you get back a gen.PID and can communicate with it immediately.
This capability enables dynamic workload distribution. Your node needs to process a job but doesn't have capacity? Spawn a worker on a remote node with available resources. Your application needs to scale horizontally? Spawn processes across multiple nodes and distribute load. Remote spawning makes the cluster feel like one large computing resource rather than isolated nodes.
But remote spawning isn't automatic. Security matters. You don't want arbitrary nodes spawning arbitrary processes on your infrastructure. The framework requires explicit permission - the remote node must enable each process factory individually and can restrict which nodes are allowed to use it.
Security Model
Remote spawning is disabled by default at the framework level. To enable it, set the EnableRemoteSpawn flag in your node's network configuration:
This flag is a global switch. With it disabled, all remote spawn requests fail immediately with gen.ErrNotAllowed. With it enabled, requests proceed to the next level of security: per-factory permission.
Enabling Process Factories
Even with EnableRemoteSpawn turned on, remote nodes can't spawn anything until you explicitly enable specific process factories:
Now remote nodes can request spawning using the factory name "worker". The factory function createWorker returns a gen.ProcessBehavior, just like local spawning. When a remote spawn request arrives with name "worker", the framework calls createWorker() to instantiate the process.
The factory name is the permission token. Remote nodes must use this exact name when requesting spawns. If they request "worker" and you haven't enabled it, the request fails. If they request "admin_process" without permission, it fails. You control the namespace of what's spawnable.
Access Control Lists
By default, EnableSpawn allows all nodes to use the factory. But you can restrict it to specific nodes:
Now only those two nodes can spawn workers. Requests from other nodes fail with gen.ErrNotAllowed.
You can update the access list dynamically:
Calling EnableSpawn again with the same factory name updates the access list. The factory must be the same (same type) - you can't change which factory is associated with a name after the first EnableSpawn call. Attempting to do so returns an error.
Disabling Access
To remove nodes from the access list:
This removes scheduler@node2 from the allowed list. Other nodes in the list remain allowed.
To completely disable a factory:
Without any node arguments, DisableSpawn removes the factory entirely. All future spawn requests for that name fail.
To re-enable the factory with an open access list (any node can spawn):
This is the explicit "allow all nodes" configuration.
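The resulting policy (never enabled by default, per-factory allow-lists, an empty list meaning any node) can be modeled as a small table. This is a sketch of the semantics, not ergo's implementation:

```go
package main

import "fmt"

// spawnACL maps factory names to allowed node names.
// An empty set means "any node may spawn this factory".
type spawnACL struct {
	factories map[string]map[string]bool
}

func newACL() *spawnACL {
	return &spawnACL{factories: map[string]map[string]bool{}}
}

// Enable registers a factory; calling it again replaces the access list.
func (a *spawnACL) Enable(factory string, nodes ...string) {
	allowed := map[string]bool{}
	for _, n := range nodes {
		allowed[n] = true
	}
	a.factories[factory] = allowed
}

// Disable removes the factory entirely; all future requests fail.
func (a *spawnACL) Disable(factory string) {
	delete(a.factories, factory)
}

func (a *spawnACL) Allowed(factory, node string) bool {
	allowed, ok := a.factories[factory]
	if !ok {
		return false // factory never enabled
	}
	if len(allowed) == 0 {
		return true // open access list
	}
	return allowed[node]
}

func main() {
	acl := newACL()
	acl.Enable("worker", "scheduler@node1", "scheduler@node2")
	fmt.Println(acl.Allowed("worker", "scheduler@node1")) // true
	fmt.Println(acl.Allowed("worker", "rogue@node9"))     // false
	acl.Enable("worker")                                  // re-enable with open list
	fmt.Println(acl.Allowed("worker", "rogue@node9")) // true
}
```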
Spawning on Remote Nodes
To spawn a process on a remote node, first get a gen.RemoteNode interface:
GetNode establishes a connection if needed. If a connection already exists, it returns immediately. If discovery or connection fails, you get an error.
With the remote node handle, spawn a process:
The gen.ProcessOptions are the same as local spawning: mailbox size, compression settings, parent process options. The remote node respects these options when creating the process.
Spawn with Arguments
You can pass initialization arguments to the remote process:
These arguments are passed to the factory's Init callback, just like local spawning. The arguments must be serializable via EDF - primitives, registered structs, framework types. Complex arguments require type registration on both sides.
Spawn with Registration
To spawn and register the process with a name:
The first argument is the registration name. The remote process is registered under that name on the remote node, allowing other processes on that node (or other nodes) to find it via gen.ProcessID{Name: "worker-001", Node: "worker@otherhost"}.
Spawning from Processes
The gen.Process interface provides methods for remote spawning from within a process:
This differs from using RemoteNode.Spawn in a subtle but important way: the spawned process inherits properties from the calling process, not from the node.
Inherited properties:
Application name - if the caller is part of an application, the remote process becomes part of that application too
Logging level - the remote process uses the same log level as the caller
Environment variables - if ExposeEnvRemoteSpawn security flag is enabled, the remote process gets a copy of the caller's environment
This inheritance enables application-level distribution. If your application spawns processes remotely using process.RemoteSpawn, those processes belong to your application's supervision tree (conceptually), inherit your configuration, and operate as extensions of your application rather than independent processes.
Parent Relationship and Inheritance
Remote spawn behavior differs based on whether you spawn from a process or from the node:
From a process (process.RemoteSpawn):
The spawned process inherits attributes from the calling process:
Parent PID: Set to the calling process's PID
Group Leader: Set to the calling process's group leader
Application: Set to the calling process's application name (if caller belongs to an application)
The remote process can send messages to its parent using process.Parent(). If LinkChild: true is set in options, the link is established after spawn. However, the parent is on a different node - if the network connection drops, the remote process receives an exit signal for the lost parent and may terminate if linked.
From the node (RemoteNode.Spawn):
The spawned process receives attributes from the requesting node's core:
Parent PID: Set to the requesting node's core PID
Group Leader: Set to the requesting node's core PID
Application: Not set (empty - process doesn't belong to any application)
This creates independent processes without application affiliation. Use this for standalone remote workers that don't need to be part of an application's logical structure.
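The two cases can be summarized in a small sketch. The field and function names here are illustrative, not ergo's actual types:

```go
package main

import "fmt"

type spawnOrigin struct {
	parentPID   string
	groupLeader string
	application string
}

// originFor models the distinction above: spawning from a process inherits
// the caller's parent PID, group leader, and application; spawning from the
// node attaches the process to the node core with no application.
func originFor(fromProcess bool, callerPID, callerGL, callerApp, corePID string) spawnOrigin {
	if fromProcess {
		return spawnOrigin{parentPID: callerPID, groupLeader: callerGL, application: callerApp}
	}
	return spawnOrigin{parentPID: corePID, groupLeader: corePID, application: ""}
}

func main() {
	fmt.Println(originFor(true, "<1.100.0>", "<1.5.0>", "billing", "<1.0.0>"))
	fmt.Println(originFor(false, "", "", "", "<1.0.0>"))
}
```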
Environment Variable Inheritance
By default, remote processes don't inherit environment variables. This is a security decision - you probably don't want to expose your node's configuration to remote processes.
To enable environment inheritance:
Now when you use process.RemoteSpawn, the remote process receives a copy of the calling process's environment. The remote node reads these values and sets them on the spawned process.
Important: Environment variable values must be EDF-serializable. Strings, numbers, booleans work fine. Custom types require registration via edf.RegisterTypeOf. If an environment variable contains a non-serializable value (e.g., a channel, function, or unregistered struct), the remote spawn fails entirely with an error like "no encoder for type <type>". The framework doesn't skip problematic variables - any non-serializable value causes the entire spawn request to fail.
Environment inheritance only works with process.RemoteSpawn. Using RemoteNode.Spawn doesn't inherit environment because there's no calling process - it's a node-level operation.
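The all-or-nothing rule can be illustrated with the standard library's gob encoder standing in for EDF. This is an analogy only; EDF is ergo's own format with its own rules:

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// encodeEnv models the behavior described above: if any environment value
// fails to encode, the whole set is rejected rather than skipping the
// problematic variable.
func encodeEnv(env map[string]any) error {
	for name, value := range env {
		if err := gob.NewEncoder(&bytes.Buffer{}).Encode(value); err != nil {
			return fmt.Errorf("env %q: %w", name, err)
		}
	}
	return nil
}

func main() {
	ok := map[string]any{"pool_size": 10, "region": "eu"}
	fmt.Println(encodeEnv(ok)) // <nil>: primitives encode fine

	bad := map[string]any{"pool_size": 10, "notify": make(chan int)}
	fmt.Println(encodeEnv(bad) != nil) // true: one channel fails the whole set
}
```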
How It Works
When you call remote.Spawn:
Check capabilities - The local node checks if the remote node's EnableRemoteSpawn flag is true (learned during handshake). If false, fail immediately.
Create spawn message - Package the factory name, process options, and arguments into a MessageSpawn protocol message. Include a reference for tracking the response.
If anything fails (factory not found, access denied, remote node terminating, initialization timeout), the error is returned to the caller. The entire operation is synchronous from the caller's perspective - you call Spawn and block until the process is created or an error occurs.
Practical Considerations
Performance - Remote spawning is slower than local spawning. There's network latency, message encoding, and a synchronous request-response roundtrip. If you're spawning hundreds of processes, doing it remotely will be noticeably slower. Consider spawning a pool locally and distributing work via messages rather than spawning on-demand remotely.
Timeouts - Remote spawn has a maximum InitTimeout of 15 seconds (3x DefaultRequestTimeout). If the remote process's ProcessInit takes longer, spawn fails with gen.ErrTimeout. Setting InitTimeout higher than 15 seconds returns gen.ErrNotAllowed immediately without attempting the spawn.
Failure modes - Remote spawn can fail in ways local spawn can't. The network connection can drop mid-request. The remote node can crash before responding. The factory might exist but lack permission. Handle errors explicitly and have fallback strategies (retry, spawn locally, defer the work).
Resource ownership - A process spawned on a remote node runs on that node's resources (CPU, memory). It's part of that node's process table. If the remote node terminates, the process dies. If you're distributing workload, be aware of which node owns which processes.
Linking - Both LinkChild and LinkParent options work for remote spawn. The link is established after the remote process is created. If the network connection drops, linked processes receive exit signals for the lost peer.
Application membership - Processes spawned via RemoteNode.Spawn don't belong to any application. Processes spawned via process.RemoteSpawn inherit the caller's application. This affects supervision, lifecycle, and monitoring.
Registration names - Use SpawnRegister carefully. The name you provide is registered on the remote node. If that name is already taken, spawn fails. Ensure your naming strategy avoids conflicts, especially if multiple nodes are spawning on the same target.
When to Use Remote Spawn
Dynamic scaling - Your application detects high load and spawns additional workers on remote nodes to handle the burst. When load decreases, workers terminate naturally and resources are freed.
Specialized hardware - Some nodes have GPUs, fast storage, or special network access. Spawn processes on those nodes when you need their capabilities, rather than sending data back and forth.
Fault isolation - Spawn risky operations on remote nodes. If they crash or consume excessive resources, they don't affect your local node's stability.
Data locality - If data lives on a specific node (in memory, on local disk), spawn processing near the data rather than transferring it across the network.
Heterogeneous clusters - Different nodes run different process types. Scheduler nodes spawn job processors on worker nodes. API nodes spawn request handlers on computation nodes. Remote spawning enables this separation.
Remote spawning isn't always the right answer. For static topologies where processes have fixed homes, use supervision trees and let supervisors spawn locally. For message-passing workloads where spawning overhead matters, use process pools and distribute work via messages. Remote spawning shines when you need dynamic, on-demand process creation across a cluster.
For the underlying network mechanics and for controlling connections to remote nodes, see the networking sections of this documentation.
WebSocket
WebSocket provides persistent bidirectional connections between clients and servers. Unlike HTTP request-response, a WebSocket connection remains open for extended periods, allowing both client and server to send messages at any time.
The framework provides WebSocket meta-process implementation that integrates WebSocket connections with the actor model. Each connection becomes an independent actor addressable from anywhere in the cluster.
The Integration Problem
WebSocket connections need two capabilities simultaneously:
Continuous reading: Connection must block reading messages from the client. When a message arrives, forward it to application actors for processing.
Asynchronous writing: Backend actors must be able to push messages to the client at any time - notifications, updates, events from the actor system.
This is exactly what meta-processes solve. External Reader continuously reads from the WebSocket. Actor Handler receives messages from backend actors and writes to the WebSocket. Both operate concurrently on the same connection.
Components
Two meta-processes work together:
WebSocket Handler: Implements http.Handler interface. When HTTP request arrives, upgrades it to WebSocket connection using gorilla/websocket library. Spawns Connection meta-process for each upgrade. Returns immediately - does not block.
WebSocket Connection: Meta-process managing one WebSocket connection. External Reader continuously reads messages from client, sends them to application actors. Actor Handler receives messages from actors, writes them to client. Connection lives until client disconnects or error occurs.
Creating WebSocket Server
Use websocket.CreateHandler to create handler meta-process:
Handler options:
ProcessPool: List of process names that will receive messages from WebSocket connections. When connection is established, handler round-robins across this pool to select which process receives messages from this connection. If empty, connection sends to parent process.
HandshakeTimeout: Maximum time for WebSocket upgrade handshake. Default 15 seconds.
EnableCompression: Enable per-message compression. Reduces bandwidth for text messages.
CheckOrigin: Function to verify request origin. Return true to accept, false to reject. Default rejects cross-origin requests. Use func(r *http.Request) bool { return true } to accept all origins.
Connection Lifecycle
When client connects:
HTTP request arrives, handler upgrades to WebSocket
Handler spawns Connection meta-process
Connection sends MessageConnect to application
During connection lifetime:
Client messages: External Reader reads → sends to application
Server messages: Application sends → Actor Handler writes to client
Both directions operate simultaneously
When client disconnects:
ReadMessage() returns error
External Reader sends MessageDisconnect to application
Connection closes socket
Messages
Three message types flow between connections and actors:
websocket.MessageConnect: Sent when connection established.
Receive this to track new connections:
websocket.MessageDisconnect: Sent when connection closes.
Receive this to clean up connection state:
websocket.Message: Client message received or server message to send.
Receive messages from client:
Send messages to client:
When sending, Type defaults to MessageTypeText if not set. ID field is ignored - target is specified in SendAlias() call.
Network Transparency
Connection meta-processes have gen.Alias identifiers that work across the cluster. Any actor on any node can send messages to any connection:
Network transparency makes every WebSocket connection addressable like any other actor. Backend logic scattered across cluster nodes can push updates to specific clients without intermediaries.
Client Connections
Create client-side WebSocket connections with websocket.CreateConnection:
CreateConnection performs WebSocket dial during creation. If dial fails, error is returned. If successful, connection is established but meta-process is not started yet. Call SpawnMeta() to start the meta-process. If spawn fails, call conn.Terminate(err) to close the connection.
Connection options:
URL: WebSocket server address. Use ws:// or wss:// scheme.
Process: Process name that will receive messages from server. If empty, sends to parent process.
HandshakeTimeout: Maximum time for connection handshake. Default 15 seconds.
EnableCompression: Enable compression. Must match server setting.
Client connections work identically to server connections. External Reader reads from server, Actor Handler sends to server. Messages use the same websocket.Message type.
Process Pool Distribution
Handler accepts ProcessPool - list of process names to receive connection messages. Handler distributes connections across this pool using round-robin:
Connection 1 sends to "handler1", connection 2 to "handler2", connection 3 to "handler3", connection 4 to "handler1", etc. This distributes load across multiple handler processes.
Useful for scaling: spawn multiple handler processes, each managing subset of connections. Prevents single handler from becoming bottleneck.
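Round-robin assignment is simple to model. An illustrative sketch; the handler does this internally:

```go
package main

import "fmt"

// pool assigns each new connection the next process name, wrapping around.
type pool struct {
	names []string
	next  int
}

func (p *pool) pick() string {
	name := p.names[p.next%len(p.names)]
	p.next++
	return name
}

func main() {
	p := &pool{names: []string{"handler1", "handler2", "handler3"}}
	for i := 0; i < 4; i++ {
		fmt.Println(p.pick())
	}
	// handler1, handler2, handler3, then back to handler1
}
```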
Behind the NAT
Running nodes behind NAT or load balancers
When a node starts, it registers its routes with a registrar. A route contains connection parameters: port number, TLS flag, handshake version, protocol version, and optionally a host address. When another node needs to connect, it resolves the target node's routes from the registrar and uses these parameters to establish a connection.
The host address in the route is optional. When empty, the connecting node extracts the host from the target's node name. If you're connecting to node@10.0.1.50, the framework extracts 10.0.1.50 and connects to that address on the resolved port.
This works when node names reflect reachable addresses. But when a node is behind NAT, its node name contains a private IP that external nodes can't reach. The solution is to include a public address in the route itself using RouteHost and RoutePort.
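The resolution rule can be sketched as: prefer the route's published host, otherwise fall back to the host part of the node name. Function and field names here are illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// connectAddress picks the address to dial: the route's host when one is
// published (the NAT case), otherwise the host embedded in the node name.
func connectAddress(nodeName, routeHost string, routePort int) string {
	host := routeHost
	if host == "" {
		if i := strings.IndexByte(nodeName, '@'); i >= 0 {
			host = nodeName[i+1:]
		}
	}
	return fmt.Sprintf("%s:%d", host, routePort)
}

func main() {
	// No route host: fall back to the node-name host.
	fmt.Println(connectAddress("node@10.0.1.50", "", 4370)) // 10.0.1.50:4370
	// Route publishes a public address for a node behind NAT.
	fmt.Println(connectAddress("node@10.0.1.50", "203.0.113.7", 4370)) // 203.0.113.7:4370
}
```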
Remote Start Application
Starting applications on remote nodes
Remote application starting means launching an application on another node from your code. The remote node has the application loaded but not running. You send a start request, and the application starts on that node with the mode and options you specify. The application runs under the remote node's supervision, part of the remote node's application tree.
This capability enables dynamic application deployment and orchestration. You have a cluster of nodes, each with applications loaded but waiting. A coordinator node decides which applications should run where, based on load, topology, or scheduling logic. Remote application starting makes this coordination explicit and controllable.
Like remote spawning, remote application starting isn't automatic. Security matters. You don't want arbitrary nodes starting arbitrary applications. The framework requires explicit permission - the remote node must enable each application individually and can restrict which nodes are allowed to start it.
Inspecting With Observer
Installation and starting
To install the observer tool, you need the Go compiler version 1.20 or higher. Run the following command:
Log Level: Inherits the calling process's log level
Environment: Inherits the calling process's environment (if SecurityOptions.ExposeEnvRemoteSpawn is enabled)
Log Level: Inherits the requesting node's default log level
Environment: Inherits the requesting node's environment (if SecurityOptions.ExposeEnvRemoteSpawn is enabled)
Send request - Encode and send the message to the remote node. Wait for a response (this is synchronous - remote spawning blocks until the remote node replies).
Remote processing - The remote node receives the message, checks if the factory is enabled, checks if the requesting node is allowed, calls the factory function, spawns the process with the given options.
Response - The remote node sends back a MessageResult containing either the spawned PID or an error. The local node receives this, resolves the waiting request, and returns the PID to the caller.
Remote application starting is disabled by default at the framework level. To enable it, set the EnableRemoteApplicationStart flag in your node's network configuration:
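The flag name EnableRemoteApplicationStart comes from this section; its exact placement under the network options (Network.Flags here) is an assumption and may differ in your framework version. A minimal sketch:

```go
var options gen.NodeOptions

// Advertise support for remote application start during the handshake.
// Placement under options.Network.Flags is assumed - verify against
// your framework version.
options.Network.Flags.Enable = true
options.Network.Flags.EnableRemoteApplicationStart = true

node, err := ergo.StartNode("worker@node2", options)
```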
This flag is a global switch. With it disabled, all remote application start requests fail immediately with gen.ErrNotAllowed. With it enabled, requests proceed to per-application permission.
Enabling Applications
Even with EnableRemoteApplicationStart turned on, remote nodes can't start anything until you explicitly enable specific applications:
Now remote nodes can request starting the "workers" application. The application must be loaded on this node (via node.ApplicationLoad). If it's not loaded, remote start requests fail with gen.ErrApplicationUnknown. If it's already running, remote start requests fail because you can't start a running application again.
The application name is the permission token. Remote nodes must use this exact name when requesting starts. If they request "workers" and you haven't enabled it, the request fails. If they request "admin_app" without permission, it fails. You control what's startable remotely.
Access Control Lists
By default, EnableApplicationStart allows all nodes to start the application. But you can restrict it to specific nodes:
Now only those two nodes can start the workers application. Requests from other nodes fail with gen.ErrNotAllowed.
You can update the access list dynamically:
Calling EnableApplicationStart again with the same application name updates the access list.
Disabling Access
To remove nodes from the access list:
This removes scheduler@node2 from the allowed list. Other nodes in the list remain allowed.
To completely disable remote starting for an application:
Without any node arguments, DisableApplicationStart removes the permission entirely. All future start requests for that application fail.
To re-enable with an open access list (any node can start):
This is the explicit "allow all nodes" configuration.
Starting Applications on Remote Nodes
To start an application on a remote node, first get a gen.RemoteNode interface:
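One way to obtain the handle, sketched under the assumption that the node's gen.Network interface exposes GetNode (establishing the connection if one doesn't already exist):

```go
// Hedged sketch: resolve and connect to the remote node by name.
remote, err := node.Network().GetNode("worker@node2")
if err != nil {
	// resolution or connection failed
	return err
}
```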
With the remote node handle, start an application:
The application starts on the remote node. The start is synchronous - the call blocks until the remote node confirms the application started or returns an error.
Application Startup Modes
Applications have three startup modes: Temporary, Transient, and Permanent. These modes control restart behavior when the application terminates. For remote starts, you can specify the mode explicitly:
If you use ApplicationStart without specifying a mode, the application starts with the mode it was loaded with (set during ApplicationLoad).
The mode affects how the remote node's application supervisor handles termination. If the application crashes, does it restart automatically? The mode determines this. Choose based on your operational requirements - critical services should be Permanent, optional services can be Temporary, and services that should restart only on failure can be Transient.
When an application starts remotely, parent tracking is set at multiple levels:
Application Parent: Set to the requesting node name:
Process Parent for Group Members: Processes started directly by the application (listed in Group) receive the requesting node's core PID as their parent:
Process Parent for Descendants: If those processes spawn children, the children receive their spawning process PID as parent (normal process hierarchy):
Only the first-level processes (application group members) have the cross-node parent relationship. Subsequent generations follow standard process parent-child relationships within the local node.
This parent information is for tracking and auditing, not supervision. The application is supervised by the local application supervisor on the remote node. Terminating the requesting node does not affect the running application.
Environment Variable Inheritance
By default, remote applications don't inherit environment variables from the requesting node. To enable environment inheritance:
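The option name ExposeEnvRemoteApplicationStart appears later in this section; its placement under gen.NodeOptions.Security is an assumption. A minimal sketch of the requesting node's configuration:

```go
var options gen.NodeOptions

// Allow this node's core environment to be sent along with
// remote application start requests (assumed field placement).
options.Security.ExposeEnvRemoteApplicationStart = true
```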
Now when you start an application remotely, the application's processes receive a copy of the requesting node's core environment. This enables configuration propagation - your scheduler node has configuration in its environment, and applications started remotely inherit it.
Important: Environment variable values must be EDF-serializable. Strings, numbers, booleans work fine. Custom types require registration via edf.RegisterTypeOf. If an environment variable contains a non-serializable value (e.g., a channel, function, or unregistered struct), the remote application start fails entirely with an error like "no encoder for type <type>". The framework doesn't skip problematic variables - any non-serializable value causes the entire start request to fail.
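If an environment variable carries a custom struct, it must be registered with EDF on both nodes before any remote start. A sketch, where WorkerConfig is a hypothetical type introduced for illustration:

```go
// WorkerConfig is a hypothetical custom type carried in the environment.
type WorkerConfig struct {
	PoolSize int
}

func init() {
	// Register the type with EDF so it can be serialized across the
	// network. This must run on both the requesting and remote nodes.
	if err := edf.RegisterTypeOf(WorkerConfig{}); err != nil {
		panic(err)
	}
}
```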
How It Works
When you call remote.ApplicationStart:
Check capabilities - The local node checks if the remote node's EnableRemoteApplicationStart flag is true (learned during handshake). If false, fail immediately.
Create start message - Package the application name, startup mode, and options into a MessageApplicationStart protocol message. Include a reference for tracking the response.
Send request - Encode and send the message to the remote node. Wait for a response (this is synchronous - remote application start blocks until the remote node replies).
Remote processing - The remote node receives the message, checks if the application is enabled for remote start, checks if the requesting node is allowed, verifies the application exists and isn't already running, calls the application's start logic with the given mode.
Response - The remote node sends back a MessageResult containing either success or an error. The local node receives this, resolves the waiting request, and returns the result to the caller.
If anything fails (application not found, access denied, already running, remote node terminating), the error is returned to the caller. The entire operation is synchronous - you call ApplicationStart and block until the application is running or an error occurs.
Practical Considerations
Idempotency - Starting an already-running application returns an error. If you're unsure of the application's state, query it first using remote.ApplicationInfo to check if it's already running. Or handle the error gracefully and treat "already running" as success.
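A sketch of the "treat already running as success" pattern; the ApplicationInfo state field and the gen.ApplicationStateRunning constant used here are assumptions based on this section, not confirmed API:

```go
// Hedged sketch: tolerate "already running" on remote start.
if err := remote.ApplicationStart("workers", gen.ApplicationOptions{}); err != nil {
	// Check whether the failure was simply "already running".
	info, ierr := remote.ApplicationInfo("workers")
	if ierr != nil || info.State != gen.ApplicationStateRunning {
		return err
	}
	// Already running - treat as success.
}
```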
Startup time - Some applications take time to start - they might load configuration, establish connections, initialize state. The remote start call blocks during this entire startup sequence. If startup is slow, the caller waits. For long-running startup logic, consider using async patterns or monitoring application state separately.
Failure modes - Remote application start can fail in ways local start can't. The network connection can drop mid-request. The remote node can crash before responding. The application might fail to start for reasons specific to that node (missing dependencies, configuration issues). Handle errors explicitly.
Resource contention - An application starting on a remote node consumes that node's resources (CPU, memory, file descriptors). If multiple nodes simultaneously request starting applications on the same remote node, it could become resource-constrained. Coordinate start requests to avoid overwhelming nodes.
Application lifecycle - Once started remotely, the application runs until explicitly stopped or until the remote node terminates. The requesting node has no automatic control over the running application. If you want to stop it later, you need to send another request (the framework doesn't currently support remote application stop, but you can implement custom coordination via messages).
Supervision independence - The application is supervised by the remote node, not by the requesting node. If the requesting node crashes, the application keeps running. If the remote node crashes, the application terminates. This independence is important for operational reasoning - the application's lifecycle is tied to where it runs, not to who started it.
Configuration management - Applications often need configuration. With ExposeEnvRemoteApplicationStart, you can propagate environment variables. But this creates coupling - the application depends on the requesting node's configuration. Consider whether configuration should come from the remote node's local environment, from a centralized configuration service, or from the requesting node. The right answer depends on your architecture.
When to Use Remote Application Start
Dynamic orchestration - A coordinator node decides which applications should run on which nodes based on cluster state, resource availability, or scheduling logic. The coordinator starts applications dynamically as needed.
Staged deployment - Applications are pre-loaded on nodes but not started. A deployment controller starts them in a specific order, waiting for health checks between stages. This enables controlled rollouts.
Capacity management - Some applications run only during high-load periods. A resource manager monitors load and starts applications on additional nodes when needed, then stops them when load decreases.
Geographic distribution - Applications are loaded across multiple regions. A traffic manager starts applications in specific regions based on user distribution, latency requirements, or failover needs.
Testing and validation - Test frameworks load applications on test nodes but don't start them until test execution. Tests start applications with specific configurations, run scenarios, then stop them. This enables repeatable, isolated testing.
Maintenance windows - During maintenance, you stop applications on a node, perform updates, then start them again. Remote start enables coordinated maintenance across a cluster without manually SSHing to each node.
Remote application starting is about control and coordination. If your cluster has static application deployment (applications always run on specific nodes), you don't need this feature - use supervision trees and let supervisors start applications automatically. If your cluster has dynamic application deployment (applications move between nodes based on conditions), remote application starting enables that flexibility.
For understanding the underlying network mechanics, see Network Stack. For controlling connections to remote nodes, see Static Routes. For understanding application lifecycle and modes, see Application.
// Allow only these nodes to spawn workers
network.EnableSpawn("worker", createWorker,
"scheduler@node1",
"scheduler@node2",
)
// Add more nodes to the allowed list
network.EnableSpawn("worker", createWorker,
"scheduler@node1",
"scheduler@node2",
"scheduler@node3", // newly allowed
)
// Remove specific nodes
network.DisableSpawn("worker", "scheduler@node2")
// No nodes can spawn workers anymore
network.DisableSpawn("worker")
// Re-enable for all nodes
network.EnableSpawn("worker", createWorker) // no node arguments
pid, err := remote.Spawn("worker", gen.ProcessOptions{})
if err != nil {
// handle error - not allowed, factory not found, remote node terminated, etc
}
// pid is the process running on the remote node
process.Send(pid, WorkRequest{Job: "process-data"})
// Allow only these nodes to start the workers app
network.EnableApplicationStart("workers",
"scheduler@node1",
"scheduler@node2",
)
// Add more nodes to the allowed list
network.EnableApplicationStart("workers",
"scheduler@node1",
"scheduler@node2",
"scheduler@node3", // newly allowed
)
// Remove specific nodes
network.DisableApplicationStart("workers", "scheduler@node2")
// No nodes can start this application remotely anymore
network.DisableApplicationStart("workers")
// Re-enable for all nodes
network.EnableApplicationStart("workers") // no node arguments
err := remote.ApplicationStart("workers", gen.ApplicationOptions{})
if err != nil {
// handle error - not allowed, app not loaded, already running, etc
}
// Start as temporary (not restarted if it terminates)
err := remote.ApplicationStartTemporary("workers", gen.ApplicationOptions{})
// Start as transient (restarted only if it terminates abnormally)
err := remote.ApplicationStartTransient("workers", gen.ApplicationOptions{})
// Start as permanent (always restarted if it terminates)
err := remote.ApplicationStartPermanent("workers", gen.ApplicationOptions{})
// On the remote node
info, err := node.ApplicationInfo("workers")
// info.Parent == "scheduler@node1" (requesting node name)
Understanding the resolution flow clarifies why NAT causes problems and how RouteHost solves them.
When a node registers with any registrar (embedded, etcd, or Saturn), it sends its routes:
The registrar stores these routes exactly as received. When another node resolves [email protected]:
The connecting node checks if route.Host is set. If empty, it extracts the host from the node name as a fallback.
The NAT Problem
When a node is behind NAT, its node name contains a private IP. The external node resolves routes, gets an empty host, extracts 10.0.1.50 from the node name, and tries to connect to a private IP that's unreachable from the internet.
The Solution: RouteHost and RoutePort
Tell the node what address to advertise by setting RouteHost and RoutePort in AcceptorOptions:
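A sketch of such an acceptor configuration; the four field names come from this chapter, while the placement under options.Network.Acceptors is an assumption:

```go
options.Network.Acceptors = []gen.AcceptorOptions{
	{
		Host:      "0.0.0.0",      // bind the listener on all local interfaces
		Port:      15000,          // internal listening port
		RouteHost: "203.0.113.50", // public address to advertise to the registrar
		RoutePort: 15000,          // public port (NAT forwards it to 15000)
	},
}
```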
Now the route registered with the registrar includes the public address:
When another node resolves the name, it sees a non-empty Host in the route and uses it directly. No fallback to node name extraction. The connection goes to the public address, NAT forwards it, and the connection succeeds.
Field Reference
Each AcceptorOptions field has a distinct purpose:
Host: Network interface to bind the listener socket
Port: TCP port to listen on
RouteHost: Host address to advertise in route registration
RoutePort: Port number to advertise in route registration
Host and RouteHost are independent:
Host: "0.0.0.0" binds to all interfaces but is useless as a connectable address
RouteHost: "203.0.113.50" is what other nodes use to connect
Registrar Behavior
All registrars (embedded, etcd, Saturn) handle routes identically:
Registration: Store routes exactly as provided, including Host field
Resolution: Return routes exactly as stored
Connection: Connecting node uses route.Host if set, otherwise extracts from node name
The embedded registrar sends resolution queries via UDP to the host portion of the node name. For [email protected], it queries 10.0.1.50:4499. This works because the registrar query goes to the private network (where the registrar runs), not to the NAT-ed node directly.
External registrars (etcd, Saturn) use their central server for all queries. The node name's host portion is irrelevant for resolution since queries go to etcd/Saturn, not to the target host.
Common Scenarios
Same Port Forwarding
NAT forwards the same port (15000 external = 15000 internal):
Different Port Forwarding
NAT maps different ports (32000 external -> 15000 internal):
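When the external and internal ports differ, RoutePort advertises the external one while Port stays internal. A sketch under the same assumed AcceptorOptions fields:

```go
gen.AcceptorOptions{
	Port:      15000,          // the node listens here internally
	RouteHost: "203.0.113.50", // the NAT device's public address
	RoutePort: 32000,          // external port the NAT maps to 15000
}
```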
DNS Name Instead of IP
Advertise a DNS name for flexibility:
The DNS name is stored in the route. Connecting nodes resolve DNS at connection time, getting the current IP.
Kubernetes NodePort
Pod behind NodePort service:
Local Network Considerations
Setting RouteHost affects all nodes that resolve your address, including nodes on the same local network. If local nodes should use internal addresses while external nodes use public addresses, you have several options.
Multiple Acceptors
Run acceptors on different ports for internal and external access:
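A sketch of the two-acceptor setup; field placement under options.Network.Acceptors is assumed:

```go
options.Network.Acceptors = []gen.AcceptorOptions{
	// Internal acceptor: no RouteHost, so the route falls back to the
	// host in the node name - reachable only on the local network.
	{Port: 15000},
	// External acceptor: advertises the public address for outside nodes.
	{Port: 15001, RouteHost: "203.0.113.50", RoutePort: 15001},
}
```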
Both routes are registered. Local nodes can connect via either. External nodes can only use the one with RouteHost set.
Static Routes on Local Nodes
Configure local nodes to bypass registrar resolution:
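A sketch of a static route on a local node pointing directly at the internal address. The AddRoute signature (match pattern, route, weight) and the gen.NetworkRoute/gen.Route structure are assumptions here; check the Static Routes chapter and your framework version:

```go
// Hedged sketch: local nodes connect to the NAT-ed node via its
// internal address, skipping registrar resolution.
err := node.Network().AddRoute("[email protected]", gen.NetworkRoute{
	Route: gen.Route{Host: "10.0.1.50", Port: 15000},
}, 100)
```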
Static routes are checked before registrar resolution. Local nodes use the static route (internal IP), external nodes use registrar resolution (public IP from RouteHost).
Hairpin NAT
Hairpin NAT (also called NAT loopback) allows internal nodes to connect using the public IP address.
When you set RouteHost: "203.0.113.50", all nodes - including local ones - receive this public address from the registrar and try to connect to it.
Without hairpin NAT support, those connection attempts from internal nodes fail at the NAT device. With hairpin NAT support, they succeed: the traffic makes a "hairpin turn" at the NAT device - it goes toward the external interface, turns around, and comes back to the internal network.
This is a network infrastructure configuration on your router/firewall, not an application change. Check your NAT device documentation for "hairpin NAT", "NAT loopback", or "NAT reflection" settings.
Relation to Static Routes
RouteHost/RoutePort and static routes solve opposite problems:
If you're behind NAT and others can't reach you, set RouteHost/RoutePort to advertise your public address. If others are behind NAT and you can't reach them, configure static routes with their public addresses.
In complex topologies, you might use both. Your node advertises its public address via RouteHost. It also configures static routes to reach other nodes through specific gateways.
Troubleshooting
External nodes can't connect
Verify NAT/firewall forwards traffic to your node
Check RouteHost and RoutePort match your NAT configuration
Confirm the public address is reachable from outside
Local nodes unnecessarily using public address
Expected when RouteHost is set. Use multiple acceptors or static routes to give local nodes a direct path.
Wrong port advertised
If using PortRange and the first port is unavailable, the node binds to a different port. RoutePort (if set) still advertises your configured value. Ensure NAT forwards to the actual bound port, or ensure your configured port is available.
Embedded registrar resolution fails for cross-network nodes
The embedded registrar sends UDP queries to hostname:4499 extracted from the target node name. If [email protected] is behind NAT, external nodes send UDP to 10.0.1.50:4499, which is unreachable. Use external registrars (etcd, Saturn) for cross-network deployments, or configure static routes.
-help: displays information about the available arguments.
-version: prints the current version of the Observer tool.
-host: specifies the interface name for the Web server to run on (default: "localhost").
-port: defines the port number for the Web server (default: 9911).
-cookie: sets the default cookie value used for connecting to other nodes.
If you are running observer on a server for continuous operation, it is recommended to use the environment variable COOKIE instead of the -cookie argument. Using sensitive data in command-line arguments is insecure.
After starting observer, it initially has no connections to other nodes, so you will be prompted to specify the node you want to connect to.
Once you establish a connection with a remote node, the Observer application main page will open, displaying information about that node.
If you have integrated the Observer application into your node, upon opening the Observer page, you will immediately land on the main page showing information about the node where the Observer application was launched.
Info (main page)
On this tab, you will find general information about the node and the ability to manage its logging level. Changing the logging level only affects the node itself and any newly started processes, but it does not impact processes that are already running.
Graphs provide real-time information over the last 60 seconds, including the total number of processes, the number of processes in the running state, and memory usage data. Memory usage is divided into used, which indicates how much memory was reserved from the operating system, and allocated, which shows how much of that reserved memory is currently being used by the Golang runtime.
In addition to these details, you can view information about the available loggers on the node and their respective logging levels. For more details, refer to the Logging section. Environment variables will also be displayed here, but only if the ExposeEnvInfo option was enabled in the gen.NodeOptions.Security settings when the inspected node was started.
Network (main page)
The Network tab displays information about the node's network stack.
The Mode indicates how the network stack was started (enabled, hidden, or disabled).
The Registrar section shows the properties of the registrar in use, including its capabilities. Embedded Server indicates whether the registrar is running in server mode, while the Server field shows the address and port number of the registrar with which the node is registered.
Additionally, the tab provides information about the default handshake and protocol versions used for outgoing connections.
The Flags section lists the set of flags that define the functionality available to remote nodes.
The Acceptors section lists the node's acceptors, with detailed information available for each. This list will be empty if the network stack is running in hidden mode.
Since the node can work with multiple network stacks simultaneously, some acceptors may have different registrar parameters and handshake/protocol versions. For an example of simultaneous usage of the Erlang and Ergo Framework network stacks, refer to the Erlang section.
The Connected Nodes section displays a list of active connections with remote nodes. For each connection, you can view detailed information, including the version of the handshake used when the connection was established and the protocol currently in use. The Flags section shows which features are available to the node when interacting with the remote node.
Since the ENP protocol supports a pool of TCP connections within a single network connection, you will find information about the Pool Size (the number of TCP connections). The Pool DSN field will be empty if this is an incoming connection for the node or if the protocol does not support TCP connection pooling.
Graphs provide a summary of the number of received/sent messages and network traffic over the last 60 seconds, offering a quick overview of communication activity and data flow.
Process list (main page)
On the Processes List tab, you can view general information about the processes running on the node. The number of processes displayed is controlled by the Start from and Limit parameters.
By default, the list is sorted by the process identifier. However, you can choose different sorting options:
Top Running: displays processes that have spent the most time in the running state.
Top Messaging: sorts processes by the number of sent/received messages in descending order.
Top Mailbox: helps identify processes with the highest number of messages in their mailbox, which can be an indication that the process is struggling to handle the load efficiently.
For each process, you can view brief information:
The Behavior field shows the type of object that the process represents.
The Application field indicates the application to which the process belongs. This property is inherited from the parent, so all processes started within an application, and their child processes, share the same value.
Mailbox Messages displays the total number of messages across all queues in the process's mailbox.
Running Time shows the total time the process has spent in the running state, which occurs when the process is actively handling messages from its queue.
By clicking on the process identifier, you will be directed to a page with more detailed information about that specific process.
Log (main page)
All log messages from the node, processes, network stack, or meta-processes are displayed here. When you connect to the Observer via a browser, the Observer's backend sends a request to the inspector to start a log process with specified logging levels (this log process is visible on the main Info tab).
When you change the set of logging levels, the Observer's backend requests the start of a new log process (the old log process will automatically terminate).
To reduce the load on the browser, the number of displayed log messages is limited, but you can adjust this by setting the desired number in the Last field.
The Play/Pause button allows you to stop or resume the log process, which is useful if you want to halt the flow of log messages and focus on examining the already received logs in more detail.
Process information
This page displays detailed information about the process, including its state, uptime, and other key metrics.
The fallback parameters specify which process will receive redirected messages in case the current process's mailbox becomes full. However, if the Mailbox Size is unlimited, these fallback parameters are ignored.
The Message Priority field shows the priority level used for messages sent by this process.
Keep Network Order is a parameter applied only to messages sent over the network. It ensures that all messages sent by this process to a remote process are delivered in the same order as they were sent. This parameter is enabled by default, but it can be disabled in certain cases to improve performance.
The Important Delivery setting indicates whether the important flag is enabled for messages sent to remote nodes. Enabling this option forces the remote node to send an acknowledgment confirming that the message was successfully delivered to the recipient's mailbox.
The Compression parameters allow you to enable message compression for network transmissions and define the compression settings.
Graphs on this page help you assess the load on the process, displaying data over the last 60 seconds.
Additionally, you can find detailed information about any aliases, links, and monitors created by this process, as well as any registered events and started meta-processes.
The list of environment variables is displayed only if the ExposeEnvInfo option was enabled in the node's gen.NodeOptions.Security settings.
Additionally, on this page, you can send a message to the process, send an exit signal, or even forcibly stop the process using the kill command. These options are available in the context menu.
Inspect (process page)
If the behavior of this process implements the HandleInspect method, the response from the process to the inspect request will be displayed here. The Observer sends these requests once per second while you are on this tab.
In the example screenshot above, you can see the inspection of a process based on act.Pool. Upon receiving the inspect request, it returns information about the pool of processes and metrics such as the number of messages processed.
Log (process page)
The Log tab on the process information page displays a list of log messages generated by the specific process.
Please note that since the Observer uses a single stream for logging, any changes to the logging levels will also affect the content of the Log tab on the main page.
Meta-process information
On this page, you'll find detailed information about the meta-process, along with graphs showing data for the last 60 seconds related to incoming/outgoing messages and the number of messages in its mailbox. The meta-process has only two message queues: main and system.
You can also send a message to the meta-process or issue an exit signal. However, it is not possible to forcibly stop the meta-process using the kill command.
Inspect (meta-process page)
If the meta-process's behavior implements the HandleInspect method, the response from the meta-process to the inspect request will be displayed on this tab. The Observer sends this request once per second while you are on the tab.
Log (meta-process page)
On the Log tab of the meta-process, you will see log messages generated by that specific meta-process. Changing the logging levels will also affect the content of the Log tab on the main page.
Logging
Logging system and logger implementations
Understanding what happens inside a running system requires logging. But logging in distributed actor systems isn't straightforward. Messages pass between dozens of processes. Processes spawn dynamically, handle requests, and terminate. Network connections form and break. Following a single request's path through the system means tracking its journey across multiple processes, possibly across multiple nodes.
Traditional logging compounds the problem. Each component writes to its own log. Process logs go to one file, network logs to another, node events to a third. When something goes wrong, you're piecing together a timeline from scattered sources, correlating by timestamp and hoping you've found all the relevant entries. It's detective work when you need diagnostic clarity.
Ergo Framework centralizes the logging flow while keeping distribution flexible. Every log call - whether from a process, meta process, or the node itself - flows through a single logging system. That system distributes messages to registered loggers based on configurable filters. One logger might write everything to the console. Another might write only errors to a file. A third might send metrics to a monitoring system. The architecture is simple: centralized input, filtered distribution to multiple outputs.
How Messages Flow
When code calls process.Log().Info("message"), the framework creates a gen.MessageLog structure. This contains the timestamp, severity level, source identifier, message format and arguments, and any attached structured fields. The message enters the node's logging subsystem.
The subsystem maintains loggers organized by severity level. Each logger, when registered, declares which levels it handles - perhaps just errors and panics, perhaps everything from debug upward. When a log message arrives, the subsystem looks up which loggers are registered for that message's level and calls their Log methods.
This is fan-out distribution. A single info-level message goes to every logger registered for info level. The default logger writes it to stdout. A file logger appends it to a file. A metrics logger counts it. Each logger receives the same message and processes it independently.
Hidden Loggers (introduced in v3.2.0) - Prefix a logger name with "." to create a hidden logger that's excluded from fan-out. Hidden loggers only receive logs from processes that explicitly call SetLogger(name). This creates truly isolated logging streams - bidirectional isolation. For example, register ".debug" as a hidden logger, then have a specific process use SetLogger(".debug"). That process's logs go only to the hidden logger (not to other loggers), and the hidden logger receives logs only from that process (not from fan-out). This is useful for separating verbose debugging output or creating per-process log files without mixing logs from other processes.
You can also use SetLogger("filename") to send a process's logs to a specific logger. The process's logs go only to that logger, but the logger still receives fan-out logs from other processes. This routes verbose process logs to a dedicated destination but doesn't create isolation - the logger sees both the process's logs and system-wide fan-out.
Severity Levels
The framework provides six severity levels, ordered from most to least verbose:
gen.LogLevelTrace - Framework internals, message routing, network packets. Extremely verbose, intended only for deep debugging of the framework itself.
gen.LogLevelDebug - Application debugging information. Useful during development but typically disabled in production.
gen.LogLevelInfo - Normal informational messages. This is the default level. Startup events, request handling, normal operations.
gen.LogLevelWarning - Conditions that merit attention but don't prevent operation. Deprecated API usage, approaching resource limits, retry scenarios.
gen.LogLevelError - Errors that prevent specific operations but don't crash the system. Failed requests, unavailable resources, validation failures.
gen.LogLevelPanic - Critical errors requiring immediate attention. Despite the name, logging at this level doesn't trigger a panic - it's just the highest severity marker.
Setting a level creates a threshold. Set a process to gen.LogLevelWarning and it logs warnings, errors, and panics, but suppresses info, debug, and trace. Each level implicitly includes all higher severity levels.
Two special levels control behavior rather than representing severity:
gen.LogLevelDefault - Sentinel meaning "inherit." Nodes with this level become gen.LogLevelInfo. Processes with this level inherit from their parent, leader, or node. This default-then-inherit pattern allows hierarchical log level configuration.
gen.LogLevelDisabled - Stops all logging from the source. The framework doesn't even create log messages. Use this to completely silence a source without removing loggers.
Trace deserves special mention. It's so verbose that enabling it accidentally could flood storage. You can't enable it dynamically via SetLevel. It must be set at startup through gen.NodeOptions.Log.Level or gen.ProcessOptions.LogLevel. This restriction prevents operational mistakes.
The node starts at gen.LogLevelInfo. Processes inherit this unless their spawn options specify otherwise. After startup, you can adjust a process's level dynamically with SetLevel, allowing surgical verbosity changes during debugging.
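As a sketch - names follow this chapter, but the exact signatures are assumptions:

```go
// Spawn a process with an explicit level (overrides inheritance).
pid, err := node.Spawn(factory, gen.ProcessOptions{
	LogLevel: gen.LogLevelDebug,
})

// Later, adjust it dynamically from inside the process:
process.Log().SetLevel(gen.LogLevelWarning)

// Trace can only be set at startup, never via SetLevel:
var options gen.NodeOptions
options.Log.Level = gen.LogLevelTrace
```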
Identifying Log Sources
The logging subsystem differentiates between four source types: node, process, meta process, and network. Each carries its source information in a typed structure - gen.MessageLogNode, gen.MessageLogProcess, gen.MessageLogMeta, or gen.MessageLogNetwork. This typing allows custom loggers to handle different sources differently, perhaps routing network logs to one destination and process logs to another.
The default logger formats each source type distinctly in its output:
Node logs show the node name as a CRC32 hash. Process logs show the full PID; with IncludeName enabled, the registered name appears as well, and with both IncludeName and IncludeBehavior enabled, the actor type is also shown. Meta process logs show the alias. Network logs show the local and remote node hashes.
These visual distinctions make scanning logs easier. At a glance, you can distinguish node events from process activity, meta process operations from network communications. The format itself tells you what layer of the system generated each message.
Adding Context with Fields
Beyond the message text, you can attach structured fields - key-value pairs providing context. Fields enable correlation across log entries and make logs machine-parseable.
Consider a request handler. It receives a request with an ID. Every log entry related to that request should include the ID, allowing you to filter logs to just that request's activity:
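A sketch of such a handler - the method names come from this chapter, while the gen.LogField type and AddFields signature shown here are assumptions:

```go
func (h *Handler) HandleMessage(from gen.PID, message any) error {
	req := message.(Request) // hypothetical request type

	// Attach the request ID; every subsequent log entry includes it.
	h.Log().AddFields(gen.LogField{Name: "request_id", Value: req.ID})

	h.Log().Info("handling request")    // carries request_id
	h.Log().Debug("validating payload") // carries request_id

	// Drop the field once the request is done.
	h.Log().DeleteFields("request_id")
	return nil
}
```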
With IncludeFields enabled in the logger configuration, output shows:
Fields appear on a separate line below the message, prefixed with "fields" and aligned with the timestamp. Multiple fields are space-separated, each formatted as key:value. In JSON output, fields become separate JSON properties at the message's top level.
Fields only appear in output if the logger is configured to include them. The default logger requires gen.NodeOptions.Log.DefaultLogger.IncludeFields = true. Without this, fields are tracked internally but not displayed - useful if some loggers need fields while others don't.
Fields accumulate. Call AddFields multiple times and you add more fields rather than replacing existing ones. This supports incremental context building. Add session_id when the session starts. Add transaction_id when beginning a transaction. Add payment_id when processing payment. Each subsequent log includes all accumulated fields.
Remove fields with DeleteFields:
This clears the named fields from subsequent logs.
Field Scoping
Field scoping handles nested contexts where you need temporary fields that shouldn't persist beyond a specific operation.
PushFields saves the current field set and starts a new scope. Add temporary fields, perform the operation (with those fields appearing in logs), then PopFields to restore the previous field set:
Output shows:
The operation field exists only within the push/pop scope. After popping, logs include only session_id.
Scopes can nest. Each PushFields returns the stack depth. Each PopFields returns the new depth. This supports complex nested contexts - a request containing a transaction containing multiple operations, each adding its own contextual fields that disappear when the operation completes.
One restriction protects consistency: you can't delete fields while the field stack has active frames. If you've pushed fields, pop back to the base level before deleting. This prevents deleting a field that a pending pop might restore, which would leave the field state inconsistent.
The Default Logger
Every node starts with a default logger writing to os.Stdout. Configure it through gen.NodeOptions.Log.DefaultLogger:
TimeFormat controls timestamp display. Empty means nanoseconds since epoch. Any Go time format works - time.DateTime, time.RFC3339, or custom formats.
IncludeBehavior adds actor type names to process logs, showing which implementation generated each message.
IncludeName adds registered process names to process logs, making output more readable than PIDs alone.
IncludeFields controls whether structured fields appear in output.
EnableJSON switches to JSON format, with each message as a single-line JSON object.
To disable the default logger entirely, set Disable: true. Do this when using only custom loggers.
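Putting these options together, a configuration sketch (field names as described above; not a complete program):

```go
var options gen.NodeOptions
options.Log.DefaultLogger.TimeFormat = time.DateTime // human-readable timestamps
options.Log.DefaultLogger.IncludeBehavior = true     // show actor type in process logs
options.Log.DefaultLogger.IncludeName = true         // show registered process names
options.Log.DefaultLogger.IncludeFields = true       // show structured fields
options.Log.DefaultLogger.EnableJSON = false         // plain text; true for JSON lines
// options.Log.DefaultLogger.Disable = true          // when using only custom loggers
```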
Adding Custom Loggers
Custom loggers implement gen.LoggerBehavior:
The Log method receives each message. The Terminate method handles cleanup when the logger is removed or the node shuts down.
Register a logger with node.LoggerAdd:
The filter (final arguments) specifies which levels this logger handles. The logger receives only messages at those levels. Omit the filter to use gen.DefaultLogFilter, which includes all levels from Trace through Panic.
Loggers are stored per-level internally. Registering for Error and Panic stores the logger in both level maps. When an error occurs, the framework looks up the Error map and delivers the message to all loggers in that map.
Logger names must be unique. Reusing a name returns gen.ErrTaken. Remove a logger with LoggerDelete before adding a new one with the same name.
The Log method is called synchronously. If it blocks, it delays the logging path. For expensive operations - compressing logs, sending over network, database writes - make Log queue the work and return immediately, processing asynchronously.
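A minimal custom logger sketch - the method set matches this chapter, while the gen.MessageLog field names used here are assumptions:

```go
type fileLogger struct {
	f *os.File
}

// Log is called synchronously for every matching message - keep it fast,
// or queue the work and process it asynchronously.
func (l *fileLogger) Log(message gen.MessageLog) {
	fmt.Fprintf(l.f, "%v %s\n", message.Time, message.Format)
}

// Terminate handles cleanup when the logger is removed or the node stops.
func (l *fileLogger) Terminate() {
	l.f.Close()
}

// Register it for errors and panics only:
err := node.LoggerAdd("file", &fileLogger{f: f}, gen.LogLevelError, gen.LogLevelPanic)
```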
Process-Based Loggers
A process can act as a logger, receiving log messages through its mailbox. This integrates logging with the actor model.
Implement the HandleLog callback in your actor:
Register the process as a logger:
Process-based logging queues messages asynchronously. The Log call places the message in the process's Log mailbox and triggers the process. The process handles log messages through HandleLog, processing them sequentially. The code that generated the log continues immediately without waiting.
This queuing prevents blocking. If the logger process is busy or the logging logic is expensive, messages queue and are processed when ready. The logging path stays fast.
One detail matters: when a logger process terminates, it's automatically removed from the logging system. No need to call LoggerDeletePID explicitly.
Using Multiple Loggers
The fan-out architecture supports multiple loggers operating simultaneously with different purposes.
A typical production configuration disables the default logger and adds specialized loggers:
The colored logger handles debug through panic for console display during development. The rotate logger receives everything and writes to rotating files. Trace messages don't appear anywhere because no logger is registered for trace level.
Loggers can be added and removed dynamically. Start with console logging during development. Add file logging in staging. In production, remove console, keep files, add metrics forwarding. The system adapts without code changes.
Controlling Verbosity
Different processes often need different verbosity. Most processes log at Info. Increase a troublesome process to Debug temporarily. Keep infrastructure processes at Warning to reduce noise.
For processes generating high-volume logs, route them to a dedicated logger using a hidden logger. A trading engine logging every order would overwhelm general logs:
This creates isolation - the trading process logs only to .trading, and .trading receives only trading process logs. Other processes and loggers are unaffected. Without the hidden logger (using a regular logger name), the logger would also receive fan-out logs from all other processes.
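In code, the setup above might look like this (a sketch; LoggerAdd and SetLogger usage as described in this chapter):

```go
// Register a hidden logger - the "." prefix excludes it from fan-out.
err := node.LoggerAdd(".trading", tradingLogger)

// Inside the trading engine process, route its logs there exclusively:
process.Log().SetLogger(".trading")
```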
Process-based loggers enable sophisticated handling. A logger process can aggregate metrics - count errors per minute, track which processes log most frequently. It can detect patterns - the same error repeating indicates a stuck condition. It can forward to external systems - send errors to Slack, metrics to Prometheus. As an actor, it maintains state, can be supervised for reliability, and integrates naturally with the rest of your system.
Logger Implementations
The framework provides two logger implementations in separate packages for common needs:
Colored (ergo.services/logger/colored) - Terminal output with ANSI colors. Highlights Ergo types (PIDs, Atoms, Refs) and colorizes log levels (yellow for warnings, red for errors, etc.). Visual clarity for development, but has performance overhead. Not suitable for high-volume production logging.
Rotate (ergo.services/logger/rotate) - File logging with automatic rotation. Supports size-based and time-based rotation. Compresses old logs with gzip. Configurable retention policies. Production-ready for long-running systems generating substantial logs.
Both integrate with the logging system through node.LoggerAdd. You can combine them - colored for console during development, rotate for persistent storage, both receiving the same filtered log stream.
For implementation details and configuration options, see the extra library documentation.
Network Stack
Understanding the network stack for distributed communication
The network stack makes remote messaging work like local messaging. When you send to a process on another node, the framework discovers where that node is, establishes a connection if needed, encodes the message, sends it over TCP, and delivers it to the recipient's mailbox. From your perspective, it's just Send(pid, message) - whether the PID is local or remote.
This transparency requires three systems working together: service discovery to find nodes, connection management to establish reliable links, and message encoding to serialize data for transmission. Each system handles a specific problem, and together they create the illusion that remote communication is just local communication.
The Big Picture
When you send a message to a remote process:
1. Routing decision - The framework examines the node portion of the PID. Local node? Direct mailbox delivery. Remote node? Continue to step 2.
2. Connection lookup - Check if a connection to that node already exists. If yes, use it. If no, continue to step 3.
3. Discovery - Query the registrar (or check static routes) to find where the remote node is listening: hostname, port, TLS requirements, protocol versions.
This entire pipeline is invisible to your code. You call Send, and the framework does the rest.
Service Discovery
Before connecting to a remote node, the framework needs to know where that node is. Service discovery translates logical node names (name@host) into connection parameters (IP, port, TLS, protocol versions).
The embedded registrar provides basic discovery:
One node per host runs a registrar server (whoever started first)
Other nodes connect as clients
Same-host discovery is direct (no network)
For production clusters, external registrars provide more features:
etcd - General-purpose distributed key-value store; the etcd registrar adds configuration management and cluster events on top of discovery
Saturn - Purpose-built for Ergo, immediate event propagation, efficient at scale
The embedded registrar works for development and small deployments. For larger clusters or dynamic topologies, use etcd or Saturn. The choice is transparent to your code - you specify the registrar at node startup, and everything else works identically.
For details, see the dedicated chapter on service discovery.
Static Routes
Discovery is dynamic - nodes register themselves, and others query to find them. But sometimes you want explicit control. Maybe nodes have fixed addresses. Maybe you're behind a firewall that blocks discovery. Maybe you're connecting to external systems.
Static routes let you hardcode connection parameters:
Now when connecting to a node matched by this route, the framework uses your route directly. No discovery query. No registrar involvement. You've taken control.
Static routes support pattern matching ("prod-.*"), multiple routes with failover weights, and hybrid approaches (use patterns for selection, resolvers for address lookup). You can configure per-route cookies, certificates, network flags, and atom mappings.
The framework checks static routes first, always. If a static route exists, discovery is bypassed. If static routes fail or don't exist, the framework falls back to discovery.
For details, see the dedicated chapter on static routes.
Connection Establishment
Once the framework knows where to connect (from discovery or static routes), it establishes a connection pool.
Handshake
The handshake performs mutual authentication using challenge-response. Node A connects to node B:
A sends hello with random salt and digest (computed from salt + cookie)
B verifies digest - if cookies match, digest is correct
B sends its own challenge
If TLS is enabled, certificate fingerprints are exchanged and verified too.
After authentication, nodes exchange introduction messages:
Node names and version information
Network flags (capabilities: remote spawn? important delivery? fragmentation?)
Caching dictionaries (atoms, types, errors that will be used frequently)
The flags negotiation ensures nodes with different feature sets can work together. Features not supported by both sides are disabled for that connection.
The caching dictionaries enable efficiency. Instead of encoding "mynode@localhost" repeatedly (19 bytes), it gets a cache ID and subsequent uses encode as 2 bytes.
Connection Pool
After handshake, the accepting node tells the dialing node to create a connection pool:
Pool size (default 3 TCP connections)
Acceptor addresses to connect to
The dialing node opens additional TCP connections using a shortened join handshake (skips full authentication since the first connection already authenticated). These connections join the pool, forming a single logical connection with multiple physical TCP links.
Multiple connections enable parallel message delivery. Each message goes to a connection based on the sender's identity (derived from sender PID). Messages from the same sender always use the same connection, preserving order. Messages from different senders use different connections, enabling parallelism.
The receiving side creates 4 receive queues per TCP connection. A 3-connection pool has 12 receive queues processing messages concurrently. This parallel processing improves throughput while preserving per-sender message ordering.
Message Encoding and Transmission
Once a connection exists, messages flow through encoding and framing.
EDF (Ergo Data Format)
EDF is a binary encoding specifically designed for the framework's communication patterns. It's type-aware - each value is prefixed with a type tag (e.g., 0x95 for int64, 0xaa for PID, 0x9d for slice). The decoder reads the tag and knows what follows.
Framework types like gen.PID and gen.Ref have optimized encodings. Structs are encoded field-by-field in declaration order (no field names on the wire). Custom types must be registered on both sides - registration happens during init(), and during handshake nodes exchange their type lists to agree on encoding.
Compression is automatic. If a message exceeds the compression threshold (default 1024 bytes), it's compressed using GZIP, ZLIB, or LZW. The protocol frame indicates compression, so the receiver decompresses before decoding.
For details on EDF - type tags, struct encoding, registration requirements, compression, caching - see the dedicated chapter.
ENP (Ergo Network Protocol)
ENP wraps encoded messages in frames for transmission. Each frame has an 8-byte header with magic byte, protocol version, frame length, order byte, and message type. The frame body contains sender/recipient identifiers and the EDF-encoded payload.
The order byte preserves message ordering per sender. Messages from the same sender have the same order value and route to the same receive queue, guaranteeing sequential processing. Messages from different senders have different order values and route to different queues, enabling parallel processing.
For details on protocol framing, order bytes, receive queue distribution, and the exact byte layout, see the dedicated chapter.
Network Transparency in Practice
Network transparency means remote operations look like local operations. You send to a PID without checking if it's local or remote. You establish links and monitors the same way regardless of location. The framework handles discovery, encoding, and transmission automatically.
But transparency has limits:
Latency - Remote sends take milliseconds vs microseconds for local
Bandwidth - Network links have finite capacity, local operations don't
Failures - Networks fail in ways local memory doesn't (packets lost, connections drop, nodes unreachable)
The framework makes distributed programming feel local, but you still need to design for network realities: use timeouts, handle connection failures, prefer async over sync, batch messages, keep payloads small.
For a deep understanding of how transparency works - EDF encoding, struct serialization, type registration, important delivery, failure semantics - see the dedicated chapter.
Network Configuration
Configure the network stack in gen.NodeOptions.Network:
Mode - NetworkModeEnabled enables full networking with acceptors. NetworkModeHidden allows outgoing connections only (no acceptors). NetworkModeDisabled disables networking entirely.
Cookie - Shared secret for authentication. All nodes must use the same cookie to communicate. Set explicitly for distributed deployments.
MaxMessageSize - Maximum incoming message size. Protects against memory exhaustion. Default unlimited (fine for trusted clusters).
Flags - Control capabilities. Remote nodes learn your flags during handshake and can only use features you've enabled. EnableRemoteSpawn allows spawning (with explicit permission per process). EnableImportantDelivery enables delivery confirmation.
Acceptors - Define listeners for incoming connections. Multiple acceptors on different ports are supported. Each can have its own cookie, TLS, and protocol.
Custom Network Stacks
The framework provides three extension points:
gen.NetworkHandshake - Control connection establishment and authentication. Implement this to change how nodes authenticate or how connection pools are created.
gen.NetworkProto - Control message encoding and transmission. The Erlang distribution protocol is implemented as a custom proto, allowing Ergo nodes to join Erlang clusters.
gen.Connection - The actual connection handling. Implement this for custom framing, routing, or error handling.
You can register multiple handshakes and protos, allowing one node to support multiple protocol stacks simultaneously:
This enables migration scenarios (gradually migrate from Erlang to Ergo) and integration scenarios (connect to systems using different protocols).
Remote Operations
Once connections exist, you can spawn processes and start applications on remote nodes:
Remote spawning requires the remote node to explicitly enable it:
Without explicit permission, remote spawn requests fail. This prevents arbitrary code execution.
The same pattern applies to starting applications: the remote node must explicitly enable the application for remote start.
This security model ensures you control exactly what remote nodes can do on your node.
Where to Go Next
This chapter provided an overview of how the network stack operates. For deeper understanding:
- How nodes find each other, application routing, configuration management, embedded vs external registrars
- How messages are encoded, EDF details, protocol framing, compression, caching, important delivery
Each of these chapters dives deep into its specific topic, giving you the details needed for production deployments.
etcd Client
This package implements the gen.Registrar interface and serves as a client library for etcd, a distributed key-value store that provides a reliable way to store data that needs to be accessed by a distributed system or cluster of machines. In addition to the primary Service Discovery function, it automatically notifies all connected nodes about cluster configuration changes and supports hierarchical configuration management with type conversion.
To create a client, use the Create function from the etcd package. The function requires a set of options etcd.Options to configure the connection and behavior.
Then, set this client in the gen.NetworkOptions.Registrar option:
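A setup sketch (the import path and exact Create signature are assumptions based on this chapter):

```go
// Create the etcd registrar client.
registrar, err := etcd.Create(etcd.Options{
	Cluster:   "default",                  // cluster name
	Endpoints: []string{"localhost:2379"}, // etcd endpoints
})
if err != nil {
	panic(err)
}

// Plug it into the node's network options.
var options gen.NodeOptions
options.Network.Registrar = registrar
node, err := ergo.StartNode("mynode@localhost", options)
```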
Using etcd.Options, you can specify:
Cluster - The cluster name for your node (default: "default")
Endpoints - List of etcd endpoints (default: ["localhost:2379"])
Username - Username for etcd authentication (optional)
When the node starts, it will register with the etcd cluster and maintain a lease to ensure automatic cleanup if the node becomes unavailable.
Configuration Management
The etcd registrar provides hierarchical configuration management with four priority levels:
The etcd registrar supports typed configuration values using string prefixes. Configuration values are stored as strings in etcd and automatically converted to the appropriate Go types when read by the registrar:
"int:123" → int64(123)
"float:3.14" → float64(3.14)
Important: All configuration values must be stored as strings in etcd. The type conversion happens automatically when the registrar reads the configuration.
Example configuration setup using etcdctl:
Access configuration in your application:
Event System
The etcd registrar registers a gen.Event and generates messages based on changes in the etcd cluster within the specified cluster. This allows the node to stay informed of any updates or changes within the cluster, ensuring real-time event-driven communication and responsiveness to cluster configurations:
etcd.EventNodeJoined - Triggered when another node is registered in the same cluster
etcd.EventNodeLeft - Triggered when a node disconnects or its lease expires
etcd.EventApplicationLoaded - An application was loaded on a remote node
To receive such messages, you need to subscribe to etcd client events using the LinkEvent or MonitorEvent methods from the gen.Process interface. You can obtain the name of the registered event using the Event method from the gen.Registrar interface:
Application Discovery
To get information about available applications in the cluster, use the ResolveApplication method from the gen.Resolver interface, which returns a list of gen.ApplicationRoute structures:
Name - The name of the application
Node - The name of the node where the application is loaded or running
Weight - The weight assigned to the application in gen.ApplicationSpec
You can access the gen.Resolver interface using the Resolver method from the gen.Registrar interface:
Node Discovery
Get a list of all nodes in the cluster:
Data Storage Structure
The etcd registrar organizes data in etcd using the following key structure:
Important Architecture Notes:
Routes (nodes/applications) use edf.Encode + base64 encoding and are stored in the routes/ subpath. Don't change anything there.
Configuration uses string encoding with type prefixes and is stored in the config/ subpath
Example
A fully featured example can be found in the docker directory of the repository.
This example demonstrates how to run multiple Ergo nodes using etcd as a registrar for service discovery. It showcases service discovery, actor communication, typed configuration management, and real-time configuration event monitoring across a cluster.
Development and Testing
The etcd registrar includes comprehensive testing infrastructure:
Docker Testing Setup
Use the included Docker Compose setup for testing:
Manual etcd Operations
For debugging and manual operations:
The etcd registrar provides a robust, scalable solution for service discovery and configuration management in distributed Ergo applications, with the reliability and consistency guarantees of etcd.
Pool
A single actor processes messages sequentially. This is fundamental to the actor model - it eliminates race conditions and makes reasoning about state straightforward. But it also means one actor can become a bottleneck. If messages arrive faster than the actor can process them, the mailbox grows, latency increases, and eventually the system stalls.
The standard solution is to run multiple workers. Instead of sending requests to one actor, distribute them across several identical actors processing in parallel. This works, but now you need routing logic: pick a worker, check if it's alive, handle mailbox overflow, restart dead workers. This boilerplate appears in every pool implementation.
act.Pool solves this. It's an actor that manages a pool of worker actors and automatically distributes incoming messages and requests across them. You send to the pool's PID, the pool forwards to an available worker. The pool handles worker lifecycle, automatic restarts, and load balancing. From the sender's perspective, it's just one actor. Under the hood, it's N workers processing in parallel.
Creating a Pool
Like act.Actor provides callbacks for regular actors, act.Pool uses the act.PoolBehavior interface:
The key difference from ActorBehavior: Init returns PoolOptions that define the pool configuration. All callbacks are optional except Init.
Embed act.Pool in your struct and implement Init to configure workers:
The pool spawns workers during initialization. Each worker is linked to the pool (via LinkParent: true). If a worker crashes, the pool receives an exit signal and can restart it.
Workers are created using the WorkerFactory. This is the same factory pattern as regular Spawn - it returns a gen.ProcessBehavior instance. The workers can be act.Actor, act.Pool (nested pools), or custom behaviors.
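A minimal pool definition might look like this (option names follow this chapter; exact types are assumptions):

```go
type requestPool struct {
	act.Pool
}

func factoryRequestPool() gen.ProcessBehavior { return &requestPool{} }

func (p *requestPool) Init(args ...any) (act.PoolOptions, error) {
	return act.PoolOptions{
		PoolSize:          10,            // ten parallel workers
		WorkerFactory:     factoryWorker, // same factory pattern as Spawn
		WorkerMailboxSize: 100,           // bounded mailboxes for backpressure
	}, nil
}
```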
Rate Limiting Through Pool Configuration
The combination of PoolSize and WorkerMailboxSize provides a natural rate limiting mechanism. The pool can buffer at most PoolSize × WorkerMailboxSize messages. If all workers are busy and their mailboxes are full, new messages are rejected:
When a sender tries to send beyond this limit, they receive ErrProcessMailboxFull (if using important delivery) or the message is dropped with a log entry. This backpressure prevents the system from accepting more work than it can handle.
For external APIs (HTTP, gRPC), this translates to returning "503 Service Unavailable" when the pool is saturated. The pool size controls maximum concurrency, and the mailbox size controls burst capacity. Tune both based on your worker processing speed and acceptable latency.
Automatic Message Distribution
When you send a message or make a call to the pool, act.Pool automatically forwards it to an available worker:
Forwarding happens for messages in the Main queue (normal priority). The pool maintains a FIFO queue of worker PIDs. When a message arrives:
1. Pop a worker from the FIFO queue
2. Forward the message using Forward (preserves the original sender and ref)
3. Check the result - if the worker has terminated or its mailbox is full, take the next worker from the queue (a dead worker is replaced on the spot)
If all workers have full mailboxes, the message is dropped and logged. The pool doesn't have its own buffer beyond the workers' mailboxes. This is intentional - backpressure should propagate to senders.
The pool forwards Regular messages, Requests, and Events. Exit signals and Inspect requests are handled by the pool itself (they're not forwarded to workers).
Workers and the Original Sender
Workers receive the original sender's PID, not the pool's PID. When a worker processes a forwarded message, its from argument points to whoever sent to the pool:
The same applies to Call requests. Workers see the original caller's from and ref. When they return a result or call SendResponse, it goes directly to the original caller, bypassing the pool entirely.
This is why forwarding is transparent. The worker doesn't know it's part of a pool. It processes messages as if they were sent directly to it.
Intercepting Pool Messages
Automatic forwarding applies only to the Main queue (normal priority). Urgent and System queues are handled by the pool itself through HandleMessage and HandleCall callbacks:
The same for synchronous requests:
Important: High-priority requests that return (nil, nil) from HandleCall are not forwarded to workers. They're simply ignored, and the caller times out. Forwarding only happens for Main queue messages. If you want a request to be handled, either:
Send it with normal priority (goes to workers)
Handle it explicitly in pool's HandleCall and return a result
Use high priority only for pool management that should be handled by the pool itself, not for work that should go to workers.
Dynamic Pool Management
Adjust the pool size at runtime with AddWorkers and RemoveWorkers:
AddWorkers spawns new workers with the same factory and options used during initialization. They're added to the FIFO queue and immediately available for work.
RemoveWorkers takes workers from the queue and sends them gen.TerminateReasonNormal via SendExit. The workers terminate gracefully, finishing any in-progress work before shutting down.
Both methods return the new pool size after the operation. They fail if called from outside Running state.
Worker Restarts
Workers are linked to the pool with LinkParent: true. When a worker crashes, the pool receives an exit signal. The forward mechanism detects this (ErrProcessUnknown / ErrProcessTerminated), spawns a replacement with the same factory and arguments, and forwards the message to the new worker.
This is automatic restart, not supervision. The pool doesn't track worker history or apply restart strategies. It just replaces dead workers immediately when detected during forwarding. If you need sophisticated restart strategies, use a Supervisor to manage the pool and its workers.
Pool Statistics
Pools expose internal metrics via Inspect:
Use this for monitoring pool health. High messages_unhandled indicates workers are overwhelmed. High worker_restarts suggests worker stability issues.
When to Use Pools
Use a pool when:
One actor is a bottleneck (mailbox growing, latency increasing)
Work items are independent (no ordering dependencies)
Workers are stateless or can reconstruct state cheaply
Don't use a pool when:
Work items depend on previous items (pools don't guarantee ordering)
Workers maintain critical state that can't be lost on restart
Concurrency isn't the bottleneck (single actor is fast enough)
Pools are for horizontal scaling of stateless work. If workers need state coordination, use multiple independent actors with explicit routing instead.
Patterns and Pitfalls
Set WorkerMailboxSize to propagate backpressure. Unbounded mailboxes let workers accumulate huge queues, hiding the overload until memory is exhausted. Bounded mailboxes cause forwarding to try the next worker, eventually pushing backpressure to the sender.
Exit signals are not forwarded - this is intentional. If you need to broadcast shutdown to all workers, iterate over them manually and send to each worker PID.
Monitor forwarding metrics. If messages_unhandled increases, your pool is undersized or workers are too slow. Scale up with AddWorkers or optimize worker processing.
Use priority for pool management. Send management commands with MessagePriorityHigh so they are handled by the pool itself instead of being forwarded to a worker.
Nested pools are possible but rarely useful. A pool of pools adds latency without much benefit. Prefer one pool with more workers over nested layers.
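The mailbox fallback described in these pitfalls can be sketched with plain Go channels (a simplified model using no Ergo APIs; worker mailboxes become buffered channels and the `forward` helper is invented for illustration):

```go
package main

import "fmt"

// forward tries each worker's bounded mailbox in order and reports
// whether the message was accepted. This mirrors the pool's fallback
// behavior: a full mailbox pushes the message to the next worker, and
// if every mailbox is full the message counts as unhandled.
func forward(workers []chan string, msg string) bool {
	for _, mailbox := range workers {
		select {
		case mailbox <- msg: // worker has capacity
			return true
		default: // mailbox full, try the next worker
		}
	}
	return false // all workers full: backpressure reaches the sender
}

func main() {
	// Three workers with a mailbox capacity of 1 each.
	workers := []chan string{
		make(chan string, 1),
		make(chan string, 1),
		make(chan string, 1),
	}

	unhandled := 0
	for i := 0; i < 5; i++ {
		if !forward(workers, fmt.Sprintf("task-%d", i)) {
			unhandled++
		}
	}
	// Three messages fit (one per mailbox); two overflow.
	fmt.Println("unhandled:", unhandled)
}
```

With bounded mailboxes the overflow is visible immediately, which is exactly the signal a sender needs to slow down.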
type MessageDisconnect struct {
ID gen.Alias // Connection meta-process identifier
}
case websocket.MessageDisconnect:
delete(h.connections, m.ID)
h.Log().Info("Client disconnected: %s", m.ID)
type Message struct {
ID gen.Alias // Connection identifier
Type MessageType // Message type (text, binary, ping, pong, close)
Body []byte // Message payload
}
const (
MessageTypeText MessageType = 1
MessageTypeBinary MessageType = 2
MessageTypeClose MessageType = 8
MessageTypePing MessageType = 9
MessageTypePong MessageType = 10
)
case websocket.Message:
h.Log().Info("Received from %s: %s", m.ID, string(m.Body))
// Process message, maybe reply
h.SendAlias(m.ID, websocket.Message{Body: []byte("ack")})
// Send to specific connection
h.SendAlias(connID, websocket.Message{
Type: websocket.MessageTypeText,
Body: []byte("notification"),
})
// Broadcast to all connections
for connID := range h.connections {
h.SendAlias(connID, websocket.Message{
Body: []byte("broadcast message"),
})
}
// Actor on node1 sends to connection on node2
actor.SendAlias(connectionAlias, websocket.Message{
Body: []byte("update from backend"),
})
Connection establishment - Open TCP connections to the remote node, perform mutual authentication via handshake, negotiate capabilities, exchange caching dictionaries, create a connection pool.
Message transmission - Encode the message into bytes (EDF), optionally compress it, wrap it in a protocol frame (ENP), send it over one of the TCP connections in the pool.
Remote delivery - The receiving node reads the frame, decompresses if needed, decodes back to Go values, routes to the recipient's mailbox.
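As a rough sketch of steps 2 and 3 - not the actual EDF encoding or ENP frame layout - a length-prefixed frame can be written and read like this (`wrapFrame`/`readFrame` are invented names):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// wrapFrame prefixes the encoded payload with a 4-byte big-endian
// length, a crude stand-in for a protocol frame. The real ENP frame
// and EDF encoding are more involved; this only shows the layering.
func wrapFrame(payload []byte) []byte {
	frame := make([]byte, 4+len(payload))
	binary.BigEndian.PutUint32(frame[:4], uint32(len(payload)))
	copy(frame[4:], payload)
	return frame
}

// readFrame does the inverse on the receiving side: read the length
// prefix, then slice out exactly that many payload bytes.
func readFrame(stream []byte) ([]byte, error) {
	if len(stream) < 4 {
		return nil, fmt.Errorf("short frame header")
	}
	n := binary.BigEndian.Uint32(stream[:4])
	if uint32(len(stream)-4) < n {
		return nil, fmt.Errorf("truncated payload")
	}
	return stream[4 : 4+n], nil
}

func main() {
	msg := []byte("hello, remote node")
	frame := wrapFrame(msg)
	got, err := readFrame(frame)
	if err != nil {
		panic(err)
	}
	fmt.Println(bytes.Equal(got, msg)) // round-trip succeeds
}
```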
Cross-host discovery uses UDP queries
Automatic failover if the server node dies
A verifies B's response
Both sides authenticated
Partial failures - Some nodes work while others fail (local systems fail entirely or work entirely)
type MyLogger struct {
act.Actor
}
func (ml *MyLogger) HandleLog(message gen.MessageLog) error {
switch message.Source.(type) {
case gen.MessageLogNode:
// Handle node log
case gen.MessageLogProcess:
// Process log - has PID, name, behavior
case gen.MessageLogMeta:
// Meta process log - has Alias
case gen.MessageLogNetwork:
// Network log - has local and remote nodes
}
return nil
}
// Debugging a specific process
node.SetLogLevelProcess(suspiciousPID, gen.LogLevelDebug)
// Later, restore normal level
node.SetLogLevelProcess(suspiciousPID, gen.LogLevelInfo)
// Register hidden logger for trading
tradingFileLogger := rotate.CreateLogger(rotate.Options{Path: "/var/log/trading"})
node.LoggerAdd(".trading", tradingFileLogger)
// Trading process uses only the hidden logger
tradingProcess.Log().SetLogger(".trading")
# Node-specific integer configuration (stored as string, converted to int64)
etcdctl put services/ergo/cluster/production/config/web1/database.port "int:5432"
# Cluster-wide float configuration (stored as string, converted to float64)
etcdctl put services/ergo/cluster/production/config/*/cache.ratio "float:0.75"
# Boolean configuration (stored as string, converted to bool)
etcdctl put services/ergo/cluster/production/config/*/debug.enabled "bool:true"
etcdctl put services/ergo/cluster/production/config/web1/ssl.enabled "bool:false"
# Application-specific configuration (visible to all nodes using wildcard format)
etcdctl put services/ergo/cluster/production/config/*/myapp.cache.size "int:256"
etcdctl put services/ergo/cluster/production/config/*/client.timeout "int:30"
# Global string configuration (stored and returned as string)
etcdctl put services/ergo/config/global/log.level "info"
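The prefixed-value convention shown in these commands can be parsed with a few lines of standard-library Go (a sketch of the convention only, not the registrar's actual code; `parseConfigValue` is an invented helper):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseConfigValue converts the prefixed string representation used in
// the etcdctl examples above ("int:", "float:", "bool:", or a bare
// string) into a typed Go value, mirroring how the registrar returns
// int64, float64, bool, or string.
func parseConfigValue(raw string) (any, error) {
	switch {
	case strings.HasPrefix(raw, "int:"):
		return strconv.ParseInt(strings.TrimPrefix(raw, "int:"), 10, 64)
	case strings.HasPrefix(raw, "float:"):
		return strconv.ParseFloat(strings.TrimPrefix(raw, "float:"), 64)
	case strings.HasPrefix(raw, "bool:"):
		return strconv.ParseBool(strings.TrimPrefix(raw, "bool:"))
	default:
		return raw, nil // plain strings pass through unchanged
	}
}

func main() {
	for _, raw := range []string{"int:5432", "float:0.75", "bool:true", "info"} {
		v, err := parseConfigValue(raw)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%q -> %T(%v)\n", raw, v, v)
	}
}
```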
registrar, err := node.Network().Registrar()
if err != nil {
return err
}
// Get single configuration item
port, err := registrar.ConfigItem("database.port")
if err != nil {
return err
}
// port will be int64(5432)
// Get multiple configuration items
config, err := registrar.Config("database.port", "cache.ratio", "debug.enabled", "log.level")
if err != nil {
return err
}
// config["database.port"] = int64(5432)
// config["cache.ratio"] = float64(0.75)
// config["debug.enabled"] = bool(true)
// config["log.level"] = "info"
type myActor struct {
act.Actor
}
func (m *myActor) HandleMessage(from gen.PID, message any) error {
reg, err := m.Node().Network().Registrar()
if err != nil {
m.Log().Error("unable to get Registrar interface: %s", err)
return nil
}
ev, err := reg.Event()
if err != nil {
m.Log().Error("Registrar has no registered Event: %s", err)
return nil
}
m.MonitorEvent(ev)
return nil
}
func (m *myActor) HandleEvent(event gen.MessageEvent) error {
switch msg := event.Message.(type) {
case etcd.EventNodeJoined:
m.Log().Info("Node %s joined cluster", msg.Name)
case etcd.EventApplicationStarted:
m.Log().Info("Application %s started on node %s", msg.Name, msg.Node)
case etcd.EventConfigUpdate:
m.Log().Info("Configuration %s updated", msg.Item)
// Handle specific configuration changes
if msg.Item == "ssl.enabled" {
if enabled, ok := msg.Value.(bool); ok {
m.Log().Info("SSL %s", map[bool]string{true: "enabled", false: "disabled"}[enabled])
}
}
}
return nil
}
type ApplicationRoute struct {
Node Atom
Name Atom
Weight int
Mode ApplicationMode
State ApplicationState
}
# Start etcd for testing
make start-etcd
# Run tests with coverage
make test-coverage
# Run integration tests only
make test-integration
# Clean up
make clean
# Check cluster health
etcdctl --endpoints=localhost:12379 endpoint health
# List all keys in cluster
etcdctl --endpoints=localhost:12379 get --prefix "services/ergo/"
# Set configuration manually (values must be strings)
etcdctl --endpoints=localhost:12379 put \
"services/ergo/cluster/production/config/web1/database.timeout" "int:30"
etcdctl --endpoints=localhost:12379 put \
"services/ergo/cluster/production/config/web1/debug.enabled" "bool:true"
# Watch for changes
etcdctl --endpoints=localhost:12379 watch --prefix "services/ergo/cluster/production/"
// Send a message to the pool
process.Send(poolPID, WorkRequest{Data: "task1"})
// The pool forwards to a worker transparently
// The worker's HandleMessage receives it
// Sender
process.Send(poolPID, "hello")
// Worker's HandleMessage
func (w *Worker) HandleMessage(from gen.PID, message any) error {
// 'from' is the original sender's PID, not the pool's PID
w.Send(from, "reply") // Reply goes to original sender
return nil
}
// Normal priority - forwarded to worker automatically
process.Send(poolPID, WorkRequest{})
// High priority - handled by pool's HandleMessage
process.SendWithPriority(poolPID, ManagementCommand{}, gen.MessagePriorityHigh)
// Pool's HandleMessage - invoked for Urgent/System messages
func (p *WorkerPool) HandleMessage(from gen.PID, message any) error {
switch msg := message.(type) {
case ManagementCommand:
count, _ := p.AddWorkers(msg.AdditionalWorkers)
p.Log().Info("scaled to %d workers", count)
default:
p.Log().Warning("unhandled message: %T", message)
}
return nil
}
func (p *WorkerPool) HandleMessage(from gen.PID, message any) error {
switch msg := message.(type) {
case ScaleUpCommand:
newSize, err := p.AddWorkers(msg.Count)
if err != nil {
p.Log().Error("failed to add workers: %s", err)
return nil
}
p.Log().Info("scaled up to %d workers", newSize)
case ScaleDownCommand:
newSize, err := p.RemoveWorkers(msg.Count)
if err != nil {
p.Log().Error("failed to remove workers: %s", err)
return nil
}
p.Log().Info("scaled down to %d workers", newSize)
}
return nil
}
stats, err := node.Inspect(poolPID)
// stats contains:
// - "pool_size": configured number of workers
// - "worker_behavior": type name of worker behavior
// - "worker_mailbox_size": mailbox limit per worker
// - "worker_restarts": count of workers restarted
// - "messages_forwarded": total messages forwarded to workers
// - "messages_unhandled": messages dropped (all workers full)
Port to advertise in route registration (0 = use actual listening port)
Erlang
Erlang network stack
This package implements the Erlang network stack, including the DIST protocol, ETF data format, EPMD registrar functionality, and the Handshake mechanism.
It is compatible with OTP-23 to OTP-27. The source code is available on the project's GitHub page at https://github.com/ergo-services/proto in the erlang23 directory.
Note that the source code is distributed under the Business Source License 1.1 and cannot be used for production or commercial purposes without a license, which can be purchased on the project's sponsor page.
EPMD
The epmd package implements the gen.Registrar interface. To create it, use the epmd.Create function with the following options:
Port: Registrar port number (default: 4369).
EnableRouteTLS: Enables TLS for all gen.Route responses on resolve requests. This is necessary if the Erlang cluster uses TLS.
DisableServer: Disables the internal server mode, useful when using the Erlang-provided epmd service.
To use this package, include ergo.services/proto/erlang23/epmd.
Handshake
The handshake package implements the gen.NetworkHandshake interface. To create a handshake instance, use the handshake.Create function with the following options:
Flags: Defines the supported functionality of the Erlang network stack. The default is set by handshake.DefaultFlags().
UseVersion5: Enables handshake version 5 mode (default is version 6).
To use this package, include ergo.services/proto/erlang23/handshake.
DIST protocol
The ergo.services/proto/erlang23/dist package implements the gen.NetworkProto and gen.Connection interfaces. To create it, use the dist.Create function and provide dist.Options as an argument, where you can specify the FragmentationUnit size in bytes. This value is used for fragmenting large messages; the default is 65000 bytes.
To use this package, include ergo.services/proto/erlang23/dist.
ETF data format
Erlang uses the ETF (Erlang Term Format) for encoding messages transmitted over the network. Due to differences in data types between Golang and Erlang, decoding received messages involves converting the data to their corresponding Golang types:
number -> int64
float number -> float64
big number -> big.Int from math/big, or to int64/uint64
map -> map[any]any
binary -> []byte
list -> etf.List ([]any)
tuple -> etf.Tuple ([]any) or a registered struct type
string -> []any; convert to string using etf.TermToString
atom -> gen.Atom
pid -> gen.Pid
ref -> gen.Ref
ref (alias) -> gen.Alias
atom = true/false -> bool
When encoding data in the Erlang ETF format:
map -> map#{}
slice/array -> list[]
struct -> map with field names as keys (considering etf: tags on struct fields)
registered type of struct -> tuple with the first element being the registered struct name, followed by field values in order
You can also use the functions etf.TermIntoStruct and etf.TermProplistIntoStruct for decoding data. These functions take into account etf: tags on struct fields, allowing the values to map correctly to the corresponding struct fields when decoding proplist data.
To automatically decode data into a struct, you can register the struct type using etf.RegisterTypeOf. This function takes the object of the type being registered and decoding options etf.RegisterTypeOption. The options include:
Name - The name of the registered type. By default, the type name is taken using the reflect package in the format #/pkg/path/TypeName
Strict - Determines whether the data must strictly match the struct. If disabled, non-matching data will be decoded into any.
To be automatically decoded, the data sent from Erlang must be a tuple whose first element is an atom matching the type name registered in Golang; the values sent by an Erlang process should follow this format.
Ergo-node in Erlang-cluster
If you want to use the Erlang network stack by default in your node, you need to specify this in gen.NetworkOptions when starting the node:
In this case, all outgoing and incoming connections will be handled by the Erlang network stack. For a complete example, you can refer to the repository at , specifically the erlang project
If you want to maintain the ability to accept connections from Ergo nodes while using the Erlang network stack as a main one, you need to add an acceptor in the gen.NetworkOptions settings:
Please note that if the list of acceptors is empty when starting the node, it will launch an acceptor with the network stack using Registrar, Handshake, and Proto from gen.NetworkOptions.
If you set the options.Network.Acceptor, you must explicitly define the parameters for all necessary acceptors. In the example, acceptorErlang is created with empty gen.AcceptorOptions (the Erlang stack from gen.NetworkOptions will be used), while for acceptorErgo, the Ergo Framework stack (Registrar, Handshake, and Proto) is explicitly defined.
In this example, you can establish incoming and outgoing connections using the Erlang network stack. However, the Ergo Framework network stack can only be used for incoming connections. To create outgoing network connections using the Ergo Framework stack, you need to configure a static route for a group of nodes by defining a match pattern:
For more detailed information, please refer to the section.
Erlang-node in Ergo-cluster
If your cluster primarily uses the Ergo Framework network stack by default and you want to enable interaction with Erlang nodes, you'll need to add an acceptor using the Erlang network stack. Additionally, you must define a static route for Erlang nodes using a match pattern:
Actor GenServer
The erlang23.GenServer actor implements the low-level gen.ProcessBehavior interface, enabling it to handle messages and synchronous requests from processes running on an Erlang node. The following message types are used for communication in Erlang:
regular messages - sent from Erlang using erlang:send or the Pid ! message syntax
cast-messages - sent from Erlang with gen_server:cast
call-requests - sent from Erlang with gen_server:call
erlang23.GenServer uses the erlang23.GenServerBehavior interface to interact with your object. This interface defines a set of callback methods for your object, which allow it to handle incoming messages and requests. All methods in this interface are optional, meaning you can choose to implement only the ones relevant to your specific use case:
The callback method HandleInfo is invoked when an asynchronous message is received from an Erlang process using erlang:send or via the Send* methods of the gen.Process interface. The HandleCast callback method is called when a cast message is sent using gen_server:cast from an Erlang process. Synchronous requests sent with gen_server:call or Call* methods are handled by the HandleCall callback method.
If your actor only needs to handle regular messages from Erlang processes, you can use the standard act.Actor and process asynchronous messages in the HandleMessage callback method.
To start a process based on erlang23.GenServer, create an object embedding erlang23.GenServer and implement a factory function for it.
Example:
To send a cast message, use the Cast method of erlang23.GenServer.
To send regular messages, use the Send* methods of the embedded gen.Process interface. Synchronous requests are made using the Call* methods of the gen.Process interface.
Like act.Actor, an actor based on erlang23.GenServer supports the TrapExit functionality to intercept exit signals. Use the SetTrapExit and TrapExit methods of your object to manage this functionality, allowing your process to handle exit signals rather than terminating immediately when receiving them.
Debugging
Debugging distributed actor systems presents unique challenges. Traditional debugging tools struggle with concurrent message passing, process isolation, and distributed state. This article covers the debugging capabilities built into Ergo Framework and demonstrates practical techniques for troubleshooting common issues.
Build Tags
Ergo Framework uses Go build tags to enable debugging features without affecting production performance. These tags control compile-time behavior, ensuring zero overhead when disabled.
Static Routes
Controlling outgoing connections with static routing
When your code sends a message to a remote process, the framework needs to establish a connection to that node. But how does it know where the node is? By default, it asks the system (the Registrar) to look up the node's address. This works well for dynamic clusters where nodes come and go.
But sometimes you want more control. Maybe you know exactly where certain nodes are. Maybe you're behind a firewall and can't use dynamic discovery. Maybe you want to connect to external systems with fixed addresses. Static routes let you hardcode connection information directly, bypassing the discovery process entirely.
This isn't just about convenience. It's about control. When you define a static route, you're saying "I know better than the discovery system where this node is, and here's exactly how to reach it." The framework respects that - static routes are checked first, before any discovery queries.
SSE
SSE provides unidirectional server-to-client streaming over HTTP. Unlike WebSocket's bidirectional connections, SSE is designed for scenarios where the server pushes updates to clients - live feeds, notifications, real-time dashboards.
The framework provides an SSE meta-process implementation that integrates SSE connections with the actor model. Each connection becomes an independent actor addressable from anywhere in the cluster.
The Integration Problem
SSE connections need two capabilities:
The framework maintains an internal routing table. When you create an outgoing connection to a remote node, the framework:
Checks static routes first - Looks in the routing table for a match
Falls back to discovery - If no static route exists, queries the Registrar
Tries proxy routes - If direct connection fails, attempts proxy routes
This order is important. Static routes always win. If you've defined a route for "prod-.*" that matches [email protected], the framework uses your route and never asks the Registrar. You've taken control.
The routing table uses pattern matching. When the framework needs to connect to [email protected], it checks all static routes against that name using Go's regexp.MatchString. Any routes whose patterns match become candidates. If multiple routes match, they're sorted by weight (higher weights first), and the framework tries them in order until one succeeds.
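The lookup described above - regexp matching followed by a weight-descending sort - can be sketched with the standard library (not the framework's routing table implementation; the staticRoute type, patterns, and hosts are made up for illustration):

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

type staticRoute struct {
	pattern string
	host    string
	weight  int
}

// matchRoutes returns every route whose pattern matches the node name,
// sorted by weight descending - the order in which connection attempts
// would be made.
func matchRoutes(table []staticRoute, node string) []staticRoute {
	var candidates []staticRoute
	for _, r := range table {
		if ok, _ := regexp.MatchString(r.pattern, node); ok {
			candidates = append(candidates, r)
		}
	}
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].weight > candidates[j].weight
	})
	return candidates
}

func main() {
	table := []staticRoute{
		{pattern: "prod-.*", host: "10.0.1.50", weight: 100},
		{pattern: ".*@example.com", host: "10.0.2.50", weight: 200},
	}
	for _, r := range matchRoutes(table, "prod-db1@example.com") {
		fmt.Printf("try %s (weight %d)\n", r.host, r.weight)
	}
	// Higher weight first: 10.0.2.50, then 10.0.1.50.
}
```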
Adding Static Routes
To add a static route, use AddRoute from the network interface:
This tells the framework: "When connecting to [email protected], use host 10.0.1.50 on port 4370 with TLS enabled. This route has weight 100."
The weight determines priority when multiple routes match the same node. Higher numbers mean higher priority. If you have two routes for "prod-.*" - one with weight 100 (the default datacenter) and one with weight 200 (a faster backup datacenter) - the framework tries weight 200 first.
Pattern Matching Examples
When the framework looks up [email protected], it finds all matching routes: the prefix match (prod-.*), the suffix match (.*@example.com), and the complex pattern (^prod-db[0-9]+@example.com$). It sorts them by weight and tries the highest-weight route first.
Route Configuration
The gen.NetworkRoute struct gives you fine-grained control over how connections are established:
Direct Connection
The simplest route specifies connection parameters directly:
When the framework uses this route, it connects to the specified host and port with TLS. The handshake and protocol versions default to the node's configured versions if you don't specify them explicitly.
Route with Resolver
You can combine static patterns with dynamic resolution:
This hybrid approach uses the pattern to select which nodes use this route, then queries the resolver for connection details. The Route fields override any values returned by the resolver. In this example, even if the resolver returns a non-TLS route, the framework forces TLS. If the resolver returns staging-db@internal but you've specified Host: "custom.example.com", the framework connects to your specified host instead.
Why would you do this? Imagine you have a staging environment behind a bastion host. The staging nodes register themselves in the discovery system with their internal addresses, but you need to connect through a specific gateway. The resolver pattern matches staging nodes, the resolver gets you the node's details, but your route configuration redirects the connection through your gateway.
Custom Cookie
Each route can override the node's default authentication cookie:
This is essential when connecting to nodes outside your cluster. Your internal nodes use one cookie (say, "internal-cluster-secret"). An external partner's nodes use a different cookie (say, "shared-secret-with-partner"). Without per-route cookies, you'd have to use the same cookie everywhere or give up on connecting to external systems.
Custom Certificates
For TLS connections, you can specify a custom certificate manager:
Different routes can use different certificates. Your production nodes might use certificates from one CA. A partner's nodes might use certificates from another CA. Each route gets its own certificate manager, allowing you to maintain separate trust chains.
Setting InsecureSkipVerify: true disables certificate validation. Use this only for testing or when connecting to nodes with self-signed certificates you trust but can't properly validate.
Custom Network Flags
You can override network capabilities for specific routes:
This is about defense. When you connect to an external node, you probably don't want them spawning arbitrary processes on your node or starting applications remotely. Custom flags let you expose only the features you're comfortable with for that specific connection.
Atom Mapping
Some advanced scenarios require translating atom values during communication:
When sending to this route, the framework automatically replaces mynode@localhost with legacy_node in all messages. On receiving, it reverses the mapping. This is rarely needed - most systems agree on naming conventions. But when integrating with legacy systems or systems with incompatible naming schemes, atom mapping saves you from rewriting every piece of code that references those atoms.
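Conceptually, the per-route mapping behaves like a bidirectional translation table (a sketch only; the atomMap type and its methods are invented for illustration, not the framework's implementation):

```go
package main

import "fmt"

// atomMap rewrites atoms on send and reverses the mapping on receive.
type atomMap map[string]string

// outgoing translates a local atom to its remote form; unmapped atoms
// pass through unchanged.
func (m atomMap) outgoing(atom string) string {
	if mapped, ok := m[atom]; ok {
		return mapped
	}
	return atom
}

// incoming reverses the mapping for atoms arriving from the remote side.
func (m atomMap) incoming(atom string) string {
	for local, remote := range m {
		if remote == atom {
			return local
		}
	}
	return atom
}

func main() {
	mapping := atomMap{"mynode@localhost": "legacy_node"}
	sent := mapping.outgoing("mynode@localhost")
	fmt.Println(sent)                   // what the remote side sees
	fmt.Println(mapping.incoming(sent)) // restored on receive
}
```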
Per-Route Logging
You can set the logging level for a specific connection:
Normally your network stack runs at INFO or WARNING level. But when debugging a specific connection, you want TRACE logs for that connection without drowning in logs from all other connections. Per-route logging gives you surgical debugging.
Multiple Routes and Failover
The framework tries routes in weight order when multiple patterns match the same node:
When connecting to [email protected], both patterns match. The framework sorts them by weight and tries weight-200 first. If that connection fails (host unreachable, handshake failure, timeout), it tries weight-100. This gives you automatic failover.
Important limitation: You can't add the same pattern twice. AddRoute returns gen.ErrTaken if the pattern already exists - the pattern is the routing table key. To achieve multi-route failover for a single node, you need different patterns that both match:
Both patterns match [email protected], but they're different strings, so both can be added to the routing table.
Alternatively, use a resolver-based route. The resolver can return multiple addresses, and the framework tries them in order, letting the resolver handle failover logic.
Querying Routes
To see if a route exists for a node:
This queries the routing table without establishing a connection. You get back all routes whose patterns match the node name, sorted by weight. The highest-weight route is first - that's the one the framework would try first when actually connecting.
Removing Routes
To remove a static route:
The pattern you pass to RemoveRoute must exactly match the pattern you used in AddRoute. It's not a regex match - it's a literal string key lookup in the routing table. If you added "prod-.*", you must remove "prod-.*" exactly.
Removing a route doesn't affect existing connections. If you have an active connection to [email protected] and you remove its static route, the connection stays alive. Removing a route only affects future connection attempts. The next time the framework needs to connect to that node, it won't find the static route and will fall back to discovery.
Proxy Routes
Sometimes you can't connect directly to a node. Maybe it's behind a firewall. Maybe it's in a private network. Proxy routes let you connect through an intermediate node:
When the framework needs to connect to [email protected], it establishes a connection to [email protected] first, then asks the gateway to proxy the connection to the final destination. The gateway handles forwarding messages between you and the backend node.
Proxy routes have the same pattern matching and weight semantics as direct routes. You can define multiple proxy routes for the same pattern with different weights for failover.
Proxy Configuration
MaxHop limits proxy chaining. If the gateway itself needs to proxy through another node, and that node proxies through yet another node, MaxHop prevents infinite loops. The default is 8. Each proxy hop decrements the counter. When it reaches zero, the framework refuses to proxy further.
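The hop-counting rule can be modeled in a few lines (a toy model; the node names and the proxyHop helper are invented for illustration, and the real protocol carries the counter in the connection request rather than a loop):

```go
package main

import (
	"errors"
	"fmt"
)

// proxyHop models the hop counter: each intermediate node decrements
// it before forwarding, and refuses to proxy once it reaches zero.
func proxyHop(chain []string, maxHop int) (string, error) {
	hops := maxHop
	for _, node := range chain {
		if hops == 0 {
			return "", errors.New("hop limit reached: refusing to proxy further")
		}
		hops--
		_ = node // a real proxy would forward the connection here
	}
	return chain[len(chain)-1], nil
}

func main() {
	chain := []string{"gateway1", "gateway2", "backend"}

	if dst, err := proxyHop(chain, 8); err == nil {
		fmt.Println("reached:", dst) // the default MaxHop of 8 is plenty
	}
	if _, err := proxyHop(chain, 2); err != nil {
		fmt.Println(err) // chain longer than the hop budget
	}
}
```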
The Flags control what operations the proxy allows. Maybe your gateway allows monitoring remote processes but doesn't allow spawning processes through the proxy. This gives you granular security control at the proxy level.
Static Routes vs Discovery
Static routes are checked first, always. When the framework needs to connect to a node:
Check routing table - Pattern match against static routes
Query discovery - If no static route exists or all failed, ask the Registrar
Try discovered routes - Attempt connection using discovered addresses
Try proxy discovery - If direct connection fails, try discovered proxy routes
Fail - Return gen.ErrNoRoute
This priority order means static routes override discovery. If you have a static route for prod-db pointing to 10.0.1.50, the framework never asks the Registrar for prod-db's address. It just uses your route. This is by design - you're explicitly taking control.
But combining them is powerful. You can define static routes with resolvers:
Now all production nodes use the static route for pattern matching, but the resolver for address lookup. You get the control of static routes (selecting which nodes use this configuration) with the dynamism of discovery (nodes can move without updating your code).
When to Use Static Routes
Fixed infrastructure - If your nodes run on specific servers with static IPs, static routes are simpler than running a discovery service. Add routes for your database, cache, and API servers, and you're done.
Firewall restrictions - When discovery protocols can't traverse your firewall, static routes work around it. The internal nodes discover each other normally. External access uses static routes pointing to your gateway.
External integration - Connecting to nodes outside your cluster almost always requires static routes. You don't control their discovery system (if they even have one). You just need to reach specific addresses.
Testing - Hardcoding routes during development lets you point at local test nodes without configuring a full discovery system.
Performance - Static routes eliminate discovery latency. The framework connects immediately without the resolver round-trip. For frequently accessed nodes, this shaves milliseconds off connection establishment.
Security boundaries - Different routes can use different cookies and certificates. When integrating multiple trust domains, static routes let you configure each boundary explicitly.
Static routes aren't a replacement for discovery. They're a tool for cases where discovery doesn't fit. Most production clusters use discovery for internal nodes (dynamic, automatic) and static routes for fixed external connections (explicit, controlled). The framework supports both, and they work together.
For details on how connections are established, see Network Stack. For understanding the discovery system that static routes bypass, see Service Discovery.
// Exact match - only this specific node
network.AddRoute("database@prod", route1, 100)
// Prefix match - all production nodes
network.AddRoute("prod-.*", route2, 100)
// Suffix match - all nodes in a domain
network.AddRoute(".*@example.com", route3, 100)
// Complex pattern - production databases only
network.AddRoute("^prod-db[0-9]+@example.com$", route4, 100)
route := gen.NetworkRoute{
Route: gen.Route{
Host: "192.168.1.100",
Port: 4370,
TLS: true,
HandshakeVersion: handshake.Version(), // optional, uses default if not set
ProtoVersion: proto.Version(), // optional, uses default if not set
},
}
route := gen.NetworkRoute{
Resolver: registrar.Resolver(), // use specific registrar
Route: gen.Route{
Host: "custom.example.com", // override resolved host
TLS: true, // force TLS
},
}
network.AddRoute("staging-.*", route, 100)
customCert := node.CertManager() // or create a new one
route := gen.NetworkRoute{
Route: gen.Route{
Host: "secure.partner.com",
Port: 4370,
TLS: true,
},
Cert: customCert,
InsecureSkipVerify: false, // enforce certificate validation
}
route := gen.NetworkRoute{
Route: gen.Route{
Host: "readonly.external.com",
Port: 4370,
},
Flags: gen.NetworkFlags{
Enable: true,
EnableRemoteSpawn: false, // don't let them spawn on us
EnableRemoteApplicationStart: false, // don't let them start apps on us
EnableImportantDelivery: true, // but do support important delivery
},
}
// These are different patterns that match the same node
network.AddRoute("^[email protected]$", primaryRoute, 200) // exact match with anchors
network.AddRoute("[email protected]", backupRoute, 100) // substring match
routes, err := network.Route("[email protected]")
if err == gen.ErrNoRoute {
// no static route defined
} else {
// routes contains all matching routes, sorted by weight descending
for i, route := range routes {
fmt.Printf("Route %d: %s:%d\n", i+1, route.Route.Host, route.Route.Port)
}
}
err := network.RemoveRoute("[email protected]")
if err == gen.ErrUnknown {
// no such route existed
}
route := gen.NetworkRoute{
Resolver: etcdRegistrar.Resolver(),
Route: gen.Route{
TLS: true, // force TLS even if resolver says otherwise
},
}
network.AddRoute("prod-.*", route, 100)
The pprof Tag
The pprof tag enables the built-in profiler and goroutine labeling:
This activates:
pprof HTTP endpoint at http://localhost:9009/debug/pprof/
PID labels on actor goroutines and Alias labels on meta process goroutines for identification in profiler output
The endpoint address can be customized via environment variables:
PPROF_HOST - host to bind (default: localhost)
PPROF_PORT - port to listen on (default: 9009)
The profiler endpoint exposes standard Go profiling data:
/debug/pprof/goroutine - Stack traces of all goroutines
/debug/pprof/heap - Heap memory allocations
/debug/pprof/profile - CPU profile (30-second sample)
/debug/pprof/block - Blocking profile showing where goroutines wait on synchronization primitives
The norecover Tag
By default, Ergo Framework recovers from panics in actor callbacks to prevent a single misbehaving actor from crashing the entire node. While this improves resilience in production, it can hide bugs during development.
With norecover, panics propagate normally, providing full stack traces and allowing debuggers to catch the exact failure point. This is particularly useful when:
Investigating nil pointer dereferences in message handlers
Tracking down type assertion failures
Understanding the call sequence leading to a panic
The trace Tag
The trace tag enables verbose logging of framework internals:
This produces detailed output about:
Process lifecycle events (spawn, terminate, state changes)
Message routing decisions
Network connection establishment and teardown
Supervision tree operations
To see trace output, also set the node's log level:
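A sketch of the node options (the constant and struct names assume the gen package's logging API; check your framework version):

```go
node, err := ergo.StartNode("demo@localhost", gen.NodeOptions{
	Log: gen.NodeOptionsLog{
		Level: gen.LogLevelTrace, // trace-tag output is logged at this level
	},
})
```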
Combining Tags
Tags can be combined for comprehensive debugging:
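For example:

```shell
go build -tags "pprof,norecover,trace"
```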
This enables all debugging features simultaneously. Use this combination when investigating complex issues that span multiple subsystems.
Profiler Integration
The Go profiler is a powerful tool for understanding runtime behavior. Ergo Framework enhances its usefulness by labeling goroutines with their identifiers.
Identifying Actor and Meta Process Goroutines
When built with the pprof tag, each actor's goroutine carries a label containing its PID, and each meta process goroutine carries a label with its Alias. This creates a direct link between the logical identity and the runtime goroutine.
To find labeled goroutines:
Example output for actors:
Example output for meta processes:
Meta processes have two goroutines with different roles, distinguished by a role label in profiler output:
"role":"handler" - Actor Handler goroutine processing messages (HandleMessage/HandleCall)
The output shows:
The goroutine's stack trace
The identifier label (PID for actors, Alias for meta processes)
The exact location in your code where the goroutine is currently executing
Debugging Stuck Processes
During graceful shutdown, Ergo Framework logs processes that are taking too long to terminate. These logs include PIDs that can be matched against profiler output.
Consider a shutdown scenario where the node reports:
To investigate why <ABC123.0.1005> is stuck:
Capture the goroutine profile:
Search for the specific PID:
Analyze the stack trace to understand what the actor is waiting on.
The debug=2 parameter provides full stack traces with argument values, which is more verbose than debug=1 but contains more diagnostic information.
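For example, against the default endpoint:

```shell
curl -s "http://localhost:9009/debug/pprof/goroutine?debug=2" > goroutines.txt
grep -A 20 "ABC123.0.1005" goroutines.txt
```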
Common Patterns in Stack Traces
Different types of blocking have characteristic stack traces:
Blocked on channel receive:
Blocked on mutex:
Blocked on network I/O:
Blocked on synchronous call (waiting for response):
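Illustrative shapes (not real output; the bracketed wait reason on the goroutine header line is the quickest signal):

```
goroutine 42 [chan receive, 5 minutes]:   blocked on a channel receive
goroutine 42 [sync.Mutex.Lock]:           blocked on a mutex (shown as semacquire on older Go versions)
goroutine 42 [IO wait]:                   blocked on network I/O
goroutine 42 [select]:                    often a synchronous call waiting for its response
```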
Understanding these patterns helps quickly identify the root cause of stuck processes.
Shutdown Diagnostics
Ergo Framework provides built-in diagnostics during graceful shutdown. When ShutdownTimeout is configured (default: 3 minutes), the framework logs pending processes every 5 seconds.
The shutdown log includes:
PID: Process identifier for correlation with profiler
State: Current process state (running, sleep, etc.)
Queue: Number of messages waiting in the mailbox
A process with state=running and queue=0 is actively processing something (likely stuck in a callback). A process with state=running and queue>0 is stuck while new messages continue to arrive. A process with state=sleep and queue=0 is idle - during shutdown this typically means the process is waiting for its children to terminate first (normal supervision tree behavior).
Practical Debugging Scenarios
Scenario: Message Handler Never Returns
Symptoms:
Process stops responding to messages
Other processes waiting on Call timeout
Shutdown hangs on specific process
Investigation:
Note the PID from shutdown logs or observer
Capture goroutine profile with debug=2
Find the goroutine by PID label
Examine the stack trace
Common causes:
Infinite loop in message handler
Blocking channel operation
Deadlock with another process via synchronous calls
External service call without timeout
Solution approach:
Never use blocking operations (channels, mutexes) in actor callbacks
Always use timeouts for external calls
Use asynchronous messaging patterns where possible
Scenario: Memory Growth
Symptoms:
Heap size increases over time
Process eventually killed by OOM
Investigation:
Capture heap profile:
In pprof, use top to see largest allocators:
Use list to examine specific functions:
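For example (the function name given to list is a placeholder):

```shell
go tool pprof http://localhost:9009/debug/pprof/heap
(pprof) top
(pprof) list HandleMessage
```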
Common causes:
Messages accumulating in mailbox faster than processing
Actor state holding references to large data
Unbounded caches or buffers in actor state
Scenario: Distributed Deadlock
Symptoms:
Two or more processes stop responding
Circular dependency in synchronous calls
Investigation:
Identify stuck processes from shutdown logs
For each process, capture its goroutine stack
Look for waitResponse in stack traces (indicates waiting for synchronous call response)
Map the call targets to build a dependency graph
Prevention:
Prefer asynchronous messaging over synchronous calls
Design clear hierarchies where calls flow in one direction
Use timeouts on all synchronous operations
Consider using request-response patterns with explicit message types
Scenario: Process Crash Investigation
Symptoms:
Process terminates unexpectedly
TerminateReasonPanic in logs
Investigation:
Build with --tags norecover to get full panic stack
Run the scenario that triggers the crash
Examine the complete stack trace
With norecover, the panic propagates with full context:
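The output then resembles a normal Go panic (paths and names here are placeholders):

```
panic: runtime error: invalid memory address or nil pointer dereference
goroutine 18 [running]:
main.(*OrderActor).HandleMessage(...)
        /app/order_actor.go:53 +0x1b
```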
This shows exactly which line in your code triggered the panic.
Observer Integration
The Observer tool provides a web interface for inspecting running nodes. While not strictly a debugging tool, it complements profiler-based debugging by providing:
Real-time process list with state and mailbox sizes
Application and supervision tree visualization
Network topology view
Message inspection capabilities
Observer runs at http://localhost:9911 by default when included in your node.
Best Practices
Always use build tags in development: Run with --tags pprof during development to have profiler and goroutine labels available when needed.
Configure reasonable shutdown timeout: A shorter timeout (30-60 seconds) in development helps identify stuck processes quickly.
Use framework logging: The framework's Log() method automatically includes PID/Alias in log output, enabling correlation with profiler data.
Use structured logging: The framework's logging system supports log levels and structured fields. Add context with AddFields() for correlation:
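A sketch of the idea (the exact AddFields signature varies by version; the field type here is a placeholder):

```go
log := process.Log()
log.AddFields(gen.LogField{Name: "request_id", Value: requestID}) // field type assumed
log.Info("processing order") // PID is included automatically
```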
For scoped logging, use PushFields()/PopFields() to save and restore field sets.
Profile regularly: Periodic profiling during development helps catch performance regressions before production.
Test shutdown paths: Explicitly test graceful shutdown to verify all actors terminate cleanly.
Summary
Debugging actor systems requires tools that bridge the gap between logical actors and runtime goroutines. Ergo Framework provides this bridge through:
Build tags that enable profiling and diagnostics without production overhead
Goroutine labels that link runtime goroutines to their actor (PID) and meta process (Alias) identities
Shutdown diagnostics that identify processes preventing clean termination
Observer integration for visual inspection of running systems
Combined with Go's standard profiling tools, these capabilities enable effective debugging of even complex distributed systems.
HTTP streaming: Connection must keep HTTP response open and stream events to client. Standard HTTP handlers return immediately - SSE requires long-lived responses.
Asynchronous writing: Backend actors must be able to push events to the client at any time - notifications, updates, data changes from the actor system.
This is exactly what meta-processes solve. The SSE connection meta-process holds the HTTP response open. Actor Handler receives messages from backend actors and writes formatted SSE events to the response stream.
Components
Two meta-processes work together:
SSE Handler: Implements http.Handler interface. When HTTP request arrives, sets SSE headers and spawns Connection meta-process. Returns after connection closes.
SSE Connection: Meta-process managing one SSE connection. Actor Handler receives messages from actors, formats them as SSE events, writes to HTTP response stream. Connection lives until client disconnects or error occurs.
For client-side connections:
SSE Client Connection: Meta-process connecting to external SSE endpoint. External Reader continuously reads SSE stream, parses events, sends them to application actors.
Creating SSE Server
Use sse.CreateHandler to create handler meta-process:
Handler options:
ProcessPool: List of process names that will receive messages from SSE connections. When connection is established, handler round-robins across this pool to select which process handles this connection. If empty, connection sends to parent process.
Heartbeat: Interval for sending comment heartbeats to keep connection alive. Default 30 seconds. Heartbeats prevent proxies and load balancers from closing idle connections.
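A wiring sketch using the option names above (the option struct name, the gen.Atom type for process names, and the mux registration are assumptions; consult the sse package for exact types):

```go
handler := sse.CreateHandler(sse.HandlerOptions{ // struct name assumed
	ProcessPool: []gen.Atom{"handler1", "handler2", "handler3"},
	Heartbeat:   30 * time.Second,
})
id, err := process.SpawnMeta(handler, gen.MetaOptions{})
// register the handler (it implements http.Handler) with your web server,
// e.g. mux.Handle("/events", handler)
```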
Connection Lifecycle
When client connects:
HTTP request arrives with Accept: text/event-stream
Handler sets SSE response headers
Handler spawns Connection meta-process
Connection sends MessageConnect to application
Connection blocks waiting for client disconnect
Actor Handler waits for backend messages
During connection lifetime:
Server events: Application sends message -> Actor Handler formats and writes SSE event
Four message types flow between connections and actors:
sse.MessageConnect: Sent when connection established.
Receive this to track new connections:
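A sketch (assuming sse.MessageConnect carries the connection Alias in an ID field, mirroring MessageDisconnect):

```go
case sse.MessageConnect:
	h.connections[m.ID] = struct{}{} // h.connections: map[gen.Alias]struct{}
	h.Log().Info("Client connected: %s", m.ID)
```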
sse.MessageDisconnect: Sent when connection closes.
Receive this to clean up connection state:
sse.Message: Event to send to client (server) or received from server (client).
Send events to client:
Wire format for the above message:
sse.MessageLastEventID: Sent when client reconnects with Last-Event-ID header.
Handle reconnection to resume from last event:
SSE Wire Format
SSE events follow a simple text format:
event: - Event type. Client listens with addEventListener("type", ...). Optional, defaults to "message".
id: - Event ID. Client sends as Last-Event-ID header on reconnect. Optional.
retry: - Suggested reconnection delay. Client uses this if connection drops. Optional.
data: - Event payload. Can span multiple lines, each prefixed with data:. Required.
Empty line terminates event.
The sse.Message struct maps directly to this format. Multi-line data is handled automatically.
Client Connections
Create client-side SSE connections with sse.CreateConnection:
Connection options:
URL: SSE server endpoint. Use http:// or https:// scheme.
Process: Process name that will receive events from server. If empty, sends to parent process.
Headers: Custom HTTP headers for the request. Useful for authentication.
LastEventID: Initial Last-Event-ID header value for resuming from specific event.
ReconnectInterval: Default reconnection delay. Can be overridden by server's retry: field. Default 3 seconds.
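A client-side sketch with the options above (the option struct name and spawning details are assumptions; token is assumed context):

```go
conn := sse.CreateConnection(sse.ConnectionOptions{ // struct name assumed
	URL:               "https://events.example.com/stream",
	Process:           "event_handler", // receives sse.MessageConnect, sse.Message, ...
	Headers:           map[string]string{"Authorization": "Bearer " + token},
	ReconnectInterval: 3 * time.Second,
})
id, err := process.SpawnMeta(conn, gen.MetaOptions{})
```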
Client connections receive the same message types. External Reader parses SSE stream and sends sse.Message to application:
Network Transparency
Connection meta-processes have gen.Alias identifiers that work across the cluster. Any actor on any node can send events to any connection:
Network transparency makes every SSE connection addressable like any other actor. Backend logic scattered across cluster nodes can push updates to specific clients without intermediaries.
Process Pool Distribution
Handler accepts ProcessPool - list of process names to receive connection messages. Handler distributes connections across this pool using round-robin:
Connection 1 sends to "handler1", connection 2 to "handler2", connection 3 to "handler3", connection 4 to "handler1", etc. This distributes load across multiple handler processes.
Useful for scaling: spawn multiple handler processes, each managing a subset of connections. This prevents a single handler from becoming a bottleneck.
Differences from WebSocket
Aspect
WebSocket
SSE
Direction
Bidirectional
Server to client only
Protocol
Upgrade to ws://
Standard HTTP streaming
Choose SSE when:
Server pushes updates to clients (notifications, live feeds, dashboards)
Clients only need to receive, not send through same connection
Working with proxies that may not support WebSocket
Want automatic reconnection with event replay
Choose WebSocket when:
True bidirectional communication needed
Binary data transfer required
Low latency in both directions critical
Meta-Process
A meta-process solves a specific problem: how to integrate blocking I/O with the actor model without breaking its guarantees. It runs two goroutines - one executes your blocking I/O code, the other handles actor messages. This separation preserves sequential message processing while allowing continuous external I/O operations.
Meta-processes are owned by their parent process. When the parent terminates, all its meta-processes terminate with it. This dependency is by design - meta-processes extend the parent's capabilities rather than existing as independent entities in the supervision tree.
The Problem
Actors work sequentially. One message arrives, gets processed, completes. Next message. This simplicity eliminates race conditions and makes reasoning straightforward.
Blocking I/O breaks this model. Call net.Listener.Accept() in a message handler and the actor freezes. The goroutine blocks waiting for connections. Other messages pile up unprocessed. The actor becomes unresponsive.
The obvious fix fails. Spawn a goroutine for Accept() and now two goroutines access the actor's state concurrently. You need locks. The sequential guarantee vanishes. The actor model collapses into traditional concurrent programming with all its complexity.
Meta-processes preserve both. One goroutine blocks on I/O. Another goroutine processes messages sequentially. Neither interferes with the other.
Two Goroutines, Two Purposes
When a meta-process starts, the framework launches two goroutines:
External Reader: Runs your Start() method from beginning to end. This goroutine is meant for blocking operations - Accept() loops, ReadFrom() calls, reading from pipes. When external events occur, this goroutine sends messages into the actor system using Send(). It never processes incoming messages.
Actor Handler: Created on-demand when messages arrive in the mailbox. Processes messages sequentially by calling your HandleMessage() and HandleCall() methods. When the mailbox empties, this goroutine terminates. Next time messages arrive, a new actor handler spawns. This goroutine never does I/O directly - it handles requests from actors.
The External Reader runs continuously from spawn until termination. The Actor Handler comes and goes based on message traffic.
Why Regular Processes Cannot Do This
Processes have one goroutine that must handle everything. If it blocks on I/O, message processing stops. If it spawns additional goroutines for I/O, the actor model breaks.
Meta-processes separate concerns. The External Reader handles I/O. The Actor Handler handles messages. Both run independently.
Restrictions Explained
Meta-processes cannot make synchronous calls. Which goroutine should block waiting for the response? The External Reader is blocked on external I/O. The Actor Handler might not be running. Neither can reliably wait for responses.
Meta-processes cannot create links or monitors. When a linked process terminates, it sends an exit signal as a message. The Actor Handler processes messages, but only when running. Signals could be delayed or lost if the Actor Handler is not active. Incoming links and monitors work because other processes send signals that queue in the mailbox. Creating outgoing links requires guarantees that meta-processes cannot provide.
These are not arbitrary limitations. They follow from having two goroutines with distinct responsibilities.
Behavior Implementation
Init() runs once during creation. Initialize state, store the MetaProcess reference, prepare resources. Return an error to prevent spawning.
Start() runs in the External Reader. This is where your blocking I/O lives. Loop forever accepting connections. Block reading datagrams. Read from pipes. When Start() returns, the meta-process terminates.
HandleMessage() processes regular messages sent by actors. Runs in the Actor Handler. Return nil to continue, return an error to terminate.
HandleCall() processes synchronous requests from actors. Return (result, nil) to send the result back. Return (nil, error) to send an error. The framework handles the response automatically.
Terminate() runs during shutdown regardless of how termination occurred. Close resources, flush buffers, clean up. Do not block or panic here.
HandleInspect() returns diagnostic information as string key-value pairs. Used by monitoring tools. Inspect requests are sent to the system queue (high priority) and processed before regular messages. You can inspect meta processes from within a process context using process.InspectMeta(alias) or directly from the node using node.InspectMeta(alias). Both methods only work for local meta processes (same node).
Three States
Sleep: External Reader is running (usually blocked on I/O), Actor Handler does not exist. Mailbox may contain messages waiting to be processed. This is the resting state when no actors are communicating with the meta-process.
Running: Both goroutines active. External Reader continues I/O operations. Actor Handler processes messages from the mailbox. Both work simultaneously without blocking each other.
Terminated: Both goroutines stopped. Start() returned and Actor Handler completed its final message.
Transitions are automatic. Message arrives → Actor Handler spawns → Sleep becomes Running. Mailbox empties → Actor Handler exits → Running becomes Sleep. Start() returns → Terminated regardless of current state.
Data Flow
The External Reader blocks reading while the Actor Handler simultaneously blocks writing. Two blocking operations, two goroutines, neither prevents the other.
Creating Meta-Processes
Define your behavior:
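A condensed sketch of a UDP reader behavior using the callbacks described above (only Init and Start are shown; the gen.MetaBehavior method set and exact signatures should be checked against your framework version):

```go
type udpServer struct {
	meta gen.MetaProcess // reference stored during Init
	conn *net.UDPConn
}

func (u *udpServer) Init(process gen.MetaProcess) error {
	u.meta = process
	return nil
}

func (u *udpServer) Start() error {
	buf := make([]byte, 65536)
	for {
		// the External Reader goroutine blocks here without freezing the parent actor
		n, _, err := u.conn.ReadFromUDP(buf)
		if err != nil {
			return err // returning from Start terminates the meta-process
		}
		// hand the datagram to the parent actor as an ordinary message
		u.meta.Send(u.meta.Parent(), append([]byte(nil), buf[:n]...))
	}
}
```

HandleMessage, HandleCall, Terminate, and HandleInspect would complete the behavior as described in the previous section.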
Spawn from a process:
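Spawning is a single call on the parent (SpawnMeta and gen.MetaOptions are assumed names from the framework's meta API; udpServer is a hypothetical behavior):

```go
id, err := process.SpawnMeta(&udpServer{conn: conn}, gen.MetaOptions{})
```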
The meta-process lives as long as its parent lives. When Server terminates, the UDP server terminates automatically.
State-Based Operations
Different operations are available in different states:
All states (Sleep, Running, Terminated):
Send(), SendWithPriority() - External Reader sends in Sleep, Actor Handler sends in Running
ID(), Parent() - Identity never changes
Running only:
SendResponse(), SendResponseError() - Only Actor Handler has the gen.Ref from HandleCall()
SetSendPriority(), SetCompression() - Actor Handler controls these
Sleep and Running (not Terminated):
Spawn() - Both goroutines can spawn child meta-processes
The External Reader operates in Sleep state and has minimal capabilities - just sending messages and spawning children. The Actor Handler operates in Running state and has full capabilities for processing requests.
Shared State
Both goroutines access the same struct fields. Use atomic operations for shared counters and flags:
Avoid complex synchronization. If you need mutexes, the design probably belongs in a regular process with meta-processes handling only I/O.
Common Patterns
External events to actors: External Reader reads events, sends them to actors for processing.
Actor-controlled I/O: Actors send commands, Actor Handler executes them against external resources.
Full-duplex communication: External Reader reads, Actor Handler writes, both operate on the same connection.
Server accepting connections: External Reader accepts connections, spawns child meta-processes for each.
Bridging external event sources with actors (monitoring filesystems, listening to OS signals)
Wrapping synchronous APIs that cannot be made asynchronous
Do not use meta-processes when:
Implementing business logic
Managing application state
Coordinating between actors
Processing messages that do not involve blocking I/O
Meta-processes sit at the boundary between the external world and the actor system. They translate blocking operations into asynchronous messages and execute actor commands using blocking APIs. Regular processes implement everything else.
# Find actor goroutines by PID
curl -s "http://localhost:9009/debug/pprof/goroutine?debug=1" | grep -B5 'labels:.*pid'
# Find meta process goroutines by Alias
curl -s "http://localhost:9009/debug/pprof/goroutine?debug=1" | grep -B5 'labels:.*meta'
type MessageDisconnect struct {
ID gen.Alias // Connection meta-process identifier
}
case sse.MessageDisconnect:
delete(h.connections, m.ID)
h.Log().Info("Client disconnected: %s", m.ID)
type Message struct {
ID gen.Alias // Connection identifier
Event string // Event type (optional)
Data []byte // Event data (can be multi-line)
MsgID string // Event ID for reconnection (optional)
Retry int // Retry hint in milliseconds (optional)
}
// Simple data event
h.SendAlias(connID, sse.Message{
Data: []byte("Hello, client!"),
})
// Named event with ID
h.SendAlias(connID, sse.Message{
Event: "update",
Data: []byte(`{"temperature": 23.5}`),
MsgID: "42",
})
// Broadcast to all connections
for connID := range h.connections {
h.SendAlias(connID, sse.Message{
Event: "broadcast",
Data: []byte("Server announcement"),
})
}
event: update
id: 42
data: {"temperature": 23.5}
type MessageLastEventID struct {
ID gen.Alias // Connection identifier
LastEventID string // ID from client header
}
case sse.MessageLastEventID:
h.Log().Info("Client reconnected, last event: %s", m.LastEventID)
// Send missed events since LastEventID
h.sendMissedEvents(m.ID, m.LastEventID)
func (h *EventHandler) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
case sse.MessageConnect:
h.Log().Info("Connected to server")
case sse.Message:
h.Log().Info("Event: %s, Data: %s", m.Event, string(m.Data))
case sse.MessageDisconnect:
h.Log().Info("Disconnected from server")
}
return nil
}
// Actor on node1 sends to connection on node2
actor.SendAlias(connectionAlias, sse.Message{
Event: "notification",
Data: []byte("Update from backend service"),
})
In the actor model, messages are typically fire-and-forget. You send a message, and it either arrives or it doesn't. For local communication, errors are immediate - if the process doesn't exist or the mailbox is full, Send returns an error. But for remote communication, Send succeeds as soon as the message reaches the network layer. You don't know if it arrived at the remote node, if the target process exists, or if the mailbox had space.
This works fine for many scenarios. Asynchronous messaging doesn't require confirmation. Actors process what arrives and ignore what doesn't. Systems are resilient because actors don't wait for acknowledgments - they keep working.
But some operations need certainty. A payment authorization must definitely be recorded or definitely fail - "maybe it worked" isn't acceptable. A distributed transaction coordinator needs to know that all participants received the commit message before proceeding. Critical state updates can't be silently lost.
Important Delivery provides guaranteed message delivery through acknowledgment. When you send with the important flag, the framework tracks the message, waits for confirmation from the recipient, and reports errors if delivery fails.
The Problem: Network Opacity
Without important delivery, remote communication is opaque:
The remote Send succeeds even if:
The remote process doesn't exist
The remote process's mailbox is full
The remote node received the message but dropped it
You only discover problems through absence - no response arrives, timeouts fire, but you don't know why. Did the request get lost? Did the process crash? Is it just slow?
The Solution: Confirmed Delivery
Important delivery makes remote communication transparent - errors are immediate, just like local:
The framework sends the message, waits for acknowledgment from the remote node, and reports the outcome. Either the message is in the recipient's mailbox (success) or you get an error explaining what went wrong (failure). No ambiguity.
How to Use Important Delivery
There are two ways to enable important delivery:
Method 1: Per-message explicit methods
Use SendImportant and CallImportant instead of Send and Call:
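A sketch inside an actor callback (the message types are hypothetical; the error values are the ones listed at the end of this chapter):

```go
// guaranteed send: blocks until the remote node ACKs or reports an error
if err := process.SendImportant(targetPID, UpdateBalance{Delta: 100}); err != nil {
	// gen.ErrProcessUnknown, gen.ErrProcessMailboxFull, gen.ErrTimeout, ...
}

// guaranteed request delivery: immediate error if the request cannot be delivered
result, err := process.CallImportant(targetPID, AuthorizePayment{ID: paymentID})
```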
Method 2: Process-level flag
Set the important delivery flag on the process - all outgoing messages use important delivery:
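The call that sets it (method name assumed from the framework's process API):

```go
process.SetImportantDelivery(true) // all subsequent Send/Call use important delivery
```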
The process-level flag affects all outgoing messages: Send, SendPID, SendProcessID, SendAlias, and Call requests. You don't need to use special methods - regular Send and Call automatically include the important flag.
Use the flag when the process primarily deals with critical messages. Use explicit methods when only specific messages require guarantees.
How Important Delivery Works
Send with Important Delivery
Here's what happens when you send a message with important delivery:
The sender blocks until the acknowledgment arrives. The remote node attempts delivery and sends either success (ACK) or failure (error). The sender's SendImportant unblocks with the result.
For local sends, the behavior is identical to regular Send - immediate error if the process doesn't exist or mailbox is full. The important flag only affects remote sends.
Call with Important Delivery
Call requests already have a response channel (the caller waits for HandleCall to return), so important delivery works differently. The ACK is only sent if there's an error - if delivery succeeds, no ACK is sent, and the caller waits for the actual response:
The key difference from regular Call: with CallImportant, if the remote process doesn't exist or its mailbox is full, you get an immediate error instead of waiting for timeout. If delivery succeeds, you wait for the response just like regular Call.
Without the important flag, ErrProcessUnknown looks like timeout - you can't tell if the process is slow, dead, or never existed. With important delivery, you know immediately.
Combining Call and Response Delivery
Things get interesting when you combine important delivery on requests with important delivery on responses. There are four combinations, each with different guarantees.
Regular Call + Regular Response
Guarantees: None. Request may be lost. Response may be lost. Timeout is ambiguous.
Use case: Fast, non-critical operations where occasional loss is acceptable.
Regular Call + Important Response (RR-2PC)
Guarantees: Response delivery is confirmed. If the handler returns a result, the caller will receive it (or get an error if delivery fails). Request delivery is not confirmed - the handler might never receive the request.
Use case: The handler's work is critical, the caller must know if it succeeded. Example: committing a transaction. If the transaction commits, the caller must know. But it's okay if the request gets lost (request is idempotent, can be retried).
How it works:
The handler blocks after processing until the caller acknowledges the response. If the caller crashes before sending ACK, the handler's SendResponseImportant returns ErrResponseIgnored or ErrTimeout.
The request has no guarantee - it might be lost, and the caller would timeout. But if the handler processed the request and sends a response, that response is guaranteed to be delivered.
Important Call + Regular Response
Guarantees: Request delivery is confirmed. The handler will receive the request (or caller gets an error immediately). Response delivery is not confirmed - response may be lost.
Use case: The handler must receive the request, but the response is less critical or can be retried. Example: triggering a background job. The job must start, but if the status response is lost, the caller can query status later.
How it works:
The caller gets immediate confirmation that the request arrived, then waits for the response. If the response gets lost, the caller times out - but knows the handler received and processed the request.
Important Call + Important Response (FR-2PC)
Guarantees: Both request and response delivery are confirmed. The handler definitely receives the request, and the caller definitely receives the response. No ambiguity at any point.
Use case: Critical operations where both request and response must be guaranteed. Example: distributed transaction commit coordination, financial operations, critical state synchronization.
How it works:
With FR-2PC:
The caller gets immediate error if request can't be delivered (no ambiguous timeout)
If request is delivered, caller waits for response
The handler blocks after sending response until caller confirms receipt
This is the most reliable pattern but also the most expensive. Use it only when guaranteed delivery is essential.
FR-2PC as Foundation for 3PC
FR-2PC provides the messaging reliability needed to implement Three-Phase Commit (3PC) and other distributed transaction protocols at the application level.
Traditional Two-Phase Commit (2PC) has a blocking problem: if the coordinator crashes after participants vote "yes" but before sending commit/abort, participants don't know what to do. They're stuck.
Three-Phase Commit solves this by adding a pre-commit phase:
Prepare: Can you commit?
Pre-commit: Everyone said yes, get ready to commit
Commit: Now commit
If the coordinator crashes after pre-commit, participants know the outcome was "commit" and can proceed independently.
But 3PC only works if messages are reliably delivered. If a pre-commit message gets lost and a participant doesn't receive it, the protocol breaks - some participants think we're committing, others are still waiting.
FR-2PC guarantees that messages are delivered or errors are reported. This lets you implement 3PC confidently:
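A pre-commit phase sketch on top of CallImportant (all names here are hypothetical):

```go
for _, participant := range participants {
	if _, err := process.CallImportant(participant, PreCommit{Tx: txID}); err != nil {
		// the participant definitely did not receive pre-commit:
		// abort deterministically instead of guessing about lost messages
		return coordinator.Abort(txID)
	}
}
// every participant acknowledged pre-commit; proceed to the commit phase
```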
FR-2PC ensures that:
If CallImportant returns nil, the participant received the message
If CallImportant returns an error, the participant didn't receive the message
No ambiguous timeouts where you don't know if the message arrived
This determinism is essential for 3PC. Without it, you'd need complex timeout-based recovery that can't distinguish "participant is slow" from "participant is dead" from "message was lost."
Performance Considerations
Important delivery adds overhead:
Extra round trip: Sender waits for ACK before proceeding
Sender blocks: Can't process other messages while waiting
Network traffic: Additional ACK messages
For SendImportant, the sender blocks until ACK arrives (success or error) or timeout. For CallImportant, the sender gets immediate error if delivery fails, or waits for response if delivery succeeds (no extra ACK on success).
The blocking is process-local - only the sending actor waits. Other actors on the node continue normally. But the sending actor's mailbox isn't processed during the wait.
Use important delivery selectively:
Use for: Critical state updates, transaction coordination, payment processing, data synchronization
Most actor communication doesn't need guarantees. The actor model is resilient because actors handle partial failure gracefully. Important delivery is for the cases where partial failure isn't acceptable - where certainty is worth the cost.
Local vs Remote Behavior
Important delivery only affects remote communication. For local sends:
Local mailbox operations are synchronous - pushing to the mailbox either succeeds or fails immediately. The important flag is unnecessary because there's no network uncertainty. The framework silently treats local important sends as regular sends.
This means your code works identically for local and remote processes. You can use SendImportant everywhere without checking if the target is local or remote - the framework optimizes local communication automatically.
Error Types
Important delivery produces specific errors:
ErrProcessUnknown - The remote process doesn't exist. Without important delivery, you'd discover this through timeout. With important delivery, you know immediately.
ErrProcessMailboxFull - The remote process exists but its mailbox is full. Without important delivery, the message would queue in the network layer or be dropped. With important delivery, you get immediate feedback.
ErrTimeout - The remote node received the message but didn't send ACK within the timeout period. This is different from Call timeout - it means the node is unresponsive or overloaded.
ErrResponseIgnored - For important responses, the caller is no longer waiting (timed out or terminated). The response couldn't be delivered. Without important delivery, the handler wouldn't know the response was ignored.
ErrNoConnection - Cannot establish connection to the remote node. This error occurs for both regular and important sends, but important delivery surfaces it immediately instead of silently queueing.
These specific errors let you handle different failure modes appropriately - retry for ErrTimeout, provision more resources for ErrProcessMailboxFull, fail immediately for ErrProcessUnknown.
Summary
Important delivery trades performance for certainty. Messages are guaranteed to be delivered or errors are reported immediately. Use it when:
The operation is critical and must succeed or definitely fail
Ambiguous timeouts are unacceptable
You're implementing distributed protocols that require guaranteed delivery
For most actor communication, fire-and-forget messaging is sufficient. The actor model handles uncertainty through supervision, retries, and eventual consistency. Important delivery is for the cases where uncertainty itself is the problem.
Service Discovery
How nodes find each other and establish connections
Service discovery solves a fundamental problem in distributed systems: how does one node find another node when all it has is a name?
When you send a message to a remote process, the target identifier contains the node name - a gen.PID includes the node where that process runs, a gen.ProcessID specifies both process name and node, and a gen.Alias includes the node. But what does that node name mean in network terms? What IP address? What port? Is TLS required? What protocol versions are supported? Service discovery answers these questions, translating logical node names into concrete connection parameters.
Consider a simple scenario. Node A wants to send to a process on node B. The process has a gen.PID that includes the node name "[email protected]". That's the logical address, but it's not enough to open a TCP connection. The node needs to translate that into connection parameters:
The IP address or hostname to connect to
The port number where node B is listening
Whether TLS is required for this connection
Which handshake and protocol versions node B supports
Which acceptor to use if node B has multiple listeners
This information changes dynamically. Nodes start and stop. Ports change. TLS gets enabled or disabled. You don't want to hardcode these details into your application. You want discovery to happen automatically, and you want it to stay current.
The Embedded Registrar
Every node includes a registrar component that handles discovery. When a node starts, its registrar attempts to become a server by binding to port 4499 - TCP on localhost:4499 for registration and UDP on 0.0.0.0:4499 for resolution. If the TCP bind succeeds, the registrar runs in server mode. If the port is already taken (another node is using it), the registrar switches to client mode and connects to the existing server.
This design means one node per host acts as the discovery server for all other nodes on that host. Whichever node started first becomes the server. The rest are clients.
When a node's registrar runs in server mode, it:
Listens on TCP localhost:4499 for registration from same-host nodes
Listens on UDP 0.0.0.0:4499 (all interfaces) for resolution queries from any host
Maintains a registry of which nodes are running and how to reach them
Responds to queries with current connection information
When a node's registrar runs in client mode, it:
Connects via TCP to the local registrar server at localhost:4499
Forwards its own registration to the server over TCP
Performs discovery queries via UDP (to localhost for same-host, to remote hosts for cross-host)
Maintains the TCP connection until termination (for registration keepalive)
This dual-mode design provides automatic failover. If the server node terminates, its TCP connections close. The remaining nodes detect the disconnection, and they race to bind port 4499. The winner becomes the new server. The others reconnect as clients. Discovery continues without manual intervention.
Registration
When a node starts, it registers with the registrar. This registration happens over the TCP connection (for same-host nodes) or through initial discovery queries (for the server itself).
What gets registered:
Node name (must be unique on the host)
List of acceptors this node is running
For each acceptor: port number, handshake version, protocol version, TLS flag
The TCP connection from client to server stays open. It serves two purposes: maintaining registration (if the connection drops, the node is considered dead) and enabling the server to push updates (though the current implementation doesn't use this capability).
If a node tries to register a name that's already taken, the registrar returns gen.ErrTaken. Node names must be unique within a host. Across hosts, the same name is fine - node names include the hostname for disambiguation.
Resolution
When a node needs to connect to a remote node, it queries the registrar for connection information.
The resolution mechanism depends on whether the querying node is running the registrar in server mode:
If the node runs the registrar server and the target is on the same host, resolution is a direct function call - no network involved. The server looks up the target in its local registry and returns the acceptor information immediately.
If the node is a registrar client, resolution uses UDP regardless of whether the target is same-host or cross-host. The node extracts the hostname from the target node name (worker@otherhost becomes otherhost), sends a UDP packet to that host on port 4499, and waits for a response. For same-host queries, this means UDP to localhost:4499. For cross-host queries, it's UDP to the remote host. The registrar server (wherever it is) looks up the node and sends back the acceptor list via UDP reply.
This UDP-based resolution is stateless. No connection is maintained. Each query is independent. This keeps it lightweight but means there's no push notification when remote nodes change - you only discover changes when you query again. The TCP connection between client and server is used only for registration and keepalive, not for resolution queries.
The resolution response includes everything needed to establish a connection:
Acceptor port number
Handshake protocol version
Network protocol version
TLS flag (whether encryption is required)
Multiple acceptors are supported. If a node has three acceptors listening on different ports with different configurations, all three appear in the resolution response. The connecting node tries them in order until one succeeds.
Application Discovery
Central registrars (etcd and Saturn) provide application discovery - finding which nodes in your cluster are running specific applications. The embedded registrar doesn't support this feature.
When an application starts on a node, it registers an application route with the registrar. The registrar stores this deployment information, and other nodes can then discover where the application is running.
The response includes the node name, application state, running mode, and a weight value. Multiple nodes can run the same application - the resolver returns all of them.
Load Balancing with Weights
Weights enable intelligent load distribution across application instances.
When multiple nodes run the same application, each registration includes a weight. Higher weights indicate preference - nodes with more resources, better performance, or strategic positioning get higher weights. When you resolve an application, you get all instances with their weights.
You choose which instance to use based on your load balancing strategy:
Weighted random - Randomly select, but favor higher weights. Worker3 gets picked 2x more often than worker1, 4x more than worker2.
Round-robin with weights - Cycle through instances, but send proportionally more requests to higher-weighted nodes. Send 4 requests to worker3, 2 to worker1, 1 to worker2, then repeat.
Least-loaded - Track active requests per instance, prefer higher-weight nodes when load is equal.
Geographic routing - Set weights based on proximity. Same datacenter gets weight 100, same region gets 50, cross-region gets 10.
The weight is metadata - the registrar doesn't enforce any particular strategy. Your application decides how to interpret weights.
Use Cases for Application Discovery
Service mesh - Applications discover service endpoints dynamically. Your "api" application needs to send requests to the "workers" application. Instead of hardcoding which nodes run workers, you resolve it at runtime. When workers scale up or down, discovery reflects the current topology.
Job distribution - A scheduler needs to distribute jobs across worker nodes. Resolve the "workers" application, get the list of available instances with their weights, and distribute jobs proportionally. If a worker node goes down, the next resolution returns fewer instances automatically.
Application migration - You're moving an application from old nodes to new nodes. Start the application on new nodes with low weights. Verify it works correctly. Gradually increase weights on new nodes while decreasing weights on old nodes. Traffic shifts smoothly. Once migration completes, stop the application on old nodes.
Feature flags - Run experimental versions of an application on a subset of nodes with specific weights. Route a percentage of traffic to the experimental version. If it performs well, increase its weight. If it fails, remove its registration entirely.
Multi-region deployment - Deploy applications across regions. Use weights to prefer local regions. A node in us-east resolves the application and gets instances from all regions, but us-east instances have weight 100, us-west has weight 20, eu has weight 10. Most traffic stays local, but you can still route to other regions if needed.
Configuration Management
Central registrars provide cluster-wide configuration storage. The embedded registrar doesn't support this - each node maintains its own configuration independently.
Configuration lives in the registrar's key-value store. For etcd, this is etcd's native key-value storage. For Saturn, it's stored in the Raft-replicated state. Any node can read configuration, creating a single source of truth for cluster settings.
Configuration values can be any type - strings, numbers, booleans, nested structures. The registrar encodes them using EDF, so complex configuration is supported.
Configuration Patterns
Global configuration - Settings that apply cluster-wide. Database connection strings, external service URLs, feature flags. Store them in the registrar, and all nodes read the same values. When you update a configuration item in the registrar, new nodes get the updated value automatically.
Per-node configuration - Node-specific settings stored with the node name as a key prefix. Store node:worker1:cpu_limit, node:worker2:cpu_limit separately. Each node reads its own configuration using its name. This enables heterogeneous clusters where nodes have different capabilities.
Per-application configuration - Settings specific to an application. Store under an application key prefix: app:workers:batch_size, app:workers:concurrency. When the application starts on any node, it reads this configuration from the registrar.
Environment-based configuration - Different values for dev/staging/production. Use key prefixes: prod:database_url, staging:database_url, dev:database_url. Nodes set an environment variable indicating their environment and read the appropriate keys.
Configuration hierarchy - Combine multiple patterns with fallbacks. Read app:workers:batch_size, fall back to default:batch_size, fall back to hardcoded default. This provides specificity where needed and defaults everywhere else.
Dynamic Configuration Updates
Configuration in the registrar is static from the framework's perspective - it doesn't push updates to running nodes. When you change a configuration item in etcd or Saturn, running nodes don't see the change automatically. They have the value they read during startup or their last query.
To implement dynamic configuration updates, use the registrar event system.
Both etcd and Saturn registrars support events and push notifications immediately when:
Configuration changes - EventConfigUpdate with item name and new value
Nodes join/leave - EventNodeJoined / EventNodeLeft with node name
Application lifecycle events - EventApplicationLoaded, EventApplicationStarted, EventApplicationStopping, EventApplicationStopped, EventApplicationUnloaded with application name, node, weight, and mode
Each registrar defines its own event types in its package (ergo.services/registrar/etcd or ergo.services/registrar/saturn). The event structures are identical, but you must use the correct package import for your registrar. This lets you react to cluster changes in real-time.
The embedded registrar doesn't support events.
With event notifications from etcd or Saturn registrars, nodes learn about configuration changes within milliseconds.
Use Cases for Configuration Management
Database connection strings - Instead of deploying configuration files to every node, store the connection string in the registrar. Nodes read it on startup. When you rotate credentials or migrate to a new database, update the registrar. Restart nodes gradually, and they pick up the new connection string automatically. No configuration file deployment needed.
Feature flags - Enable or disable features dynamically across the cluster. Store feature:new_algorithm:enabled in the registrar. Applications check this flag when deciding which code path to use. Change the flag in the registrar, restart applications (or use events for live updates), and the feature rolls out cluster-wide.
Capacity planning - Store node capacity information: CPU limits, memory limits, concurrent job limits. Applications read these limits and respect them when distributing work. When you upgrade hardware, update the capacity values in the registrar. Applications discover the new capacity automatically.
Service discovery integration - Combine application discovery with configuration. Store connection parameters for each application deployment. When you resolve the "workers" application, you get not just the node names but also their specific configurations - which worker pool size, which queue they're processing, which priority level they handle.
Staged rollouts - Store configuration with version tags. Set config:version to "v2". Nodes read their configuration version on startup. Half your cluster uses v1 configuration, half uses v2. Monitor behavior. If v2 performs better, update all nodes to v2. If it causes problems, roll back to v1. Configuration versioning enables controlled changes.
Cluster-wide coordination - Store cluster-wide state that multiple nodes need to coordinate on. Leader election metadata, distributed lock information, shared counters. This isn't what the registrar is designed for (use dedicated coordination services for complex coordination), but simple coordination needs can be met with registrar configuration storage.
Failover and Reliability
The embedded registrar has built-in automatic failover.
When a registrar server node terminates:
Its TCP connections to client nodes (on the same host) close
Client nodes detect the disconnection
Each client attempts to bind localhost:4499
The first to succeed becomes the new server
The rest connect to the new server as clients
Everyone re-registers their routes with the new server
This failover is automatic and takes a few milliseconds. Discovery continues without interruption.
For cross-host discovery, the same failover mechanism applies to each host independently. If a remote host's registrar server node goes down, another node on that host immediately takes over the server role. From the perspective of nodes on other hosts, discovery to that host continues working - they send UDP queries to the host, and whichever node is currently the registrar server responds. The failover is invisible to external hosts because the UDP queries are addressed to the host (port 4499), not to a specific node.
Limitations of the Embedded Registrar
The embedded registrar is minimal by design. It provides route resolution only. What it doesn't provide:
No application discovery - You can discover where nodes are, but not where specific applications are running. Want to find which nodes are running the "workers" application? You have to query every node individually or maintain that mapping yourself.
No load balancing metadata - There's no weight system for distributing load across multiple instances of the same application. You can't express that some nodes have more capacity or should receive more traffic.
No centralized configuration - Configuration lives with each node. There's no cluster-wide config store. If you want to change a setting across the cluster, you modify each node individually through node environment variables or configuration files.
No event notifications - Discovery is pull-based. You query when you need information. The registrar doesn't push updates when things change. If a node joins or leaves, or an application starts or stops, you only discover the change when you query again.
No topology awareness - The registrar doesn't understand your cluster structure. It treats all nodes equally. If you have nodes in different datacenters or regions, the registrar provides no metadata to help you route efficiently based on proximity or cost.
Limited scalability - The UDP query model works for small to medium clusters but doesn't scale to hundreds of nodes efficiently. Cross-host discovery has no caching - every query hits the network. For large clusters, this generates significant network traffic.
These limitations don't matter for development or small deployments. Two nodes on your laptop? Three nodes in a single datacenter? The embedded registrar works fine. But for production clusters, especially large ones or those requiring dynamic topology, you want the richer feature set of etcd or Saturn registrars.
External Registrars
External registrars replace the embedded implementation with centralized discovery services.
etcd registrar (ergo.services/registrar/etcd) uses etcd as the discovery backend. All nodes register their routes in etcd on startup. All discovery queries go to etcd. This centralizes cluster state: any node can discover any other node, applications can advertise their deployment locations, configuration can be stored in etcd's key-value store.
The etcd registrar implementation maintains registration through HTTP polling - each node makes a registration request every second to keep its entry alive. This works well for small to medium clusters (50-70 nodes) but creates overhead at larger scales. The polling approach reflects etcd's design for web services rather than continuous cluster communication. Despite this limitation, etcd provides proven reliability, extensive tooling, and operational familiarity for teams already using etcd in their infrastructure.
Saturn registrar (ergo.services/registrar/saturn) is purpose-built for Ergo clusters. It's an external Raft-based registry designed specifically for the framework's communication patterns. Instead of polling, Saturn maintains persistent connections and pushes updates immediately when cluster state changes. This makes it more efficient at scale - Saturn can handle clusters with thousands of nodes without the overhead of constant HTTP polling. The immediate event propagation means nodes learn about topology changes instantly rather than waiting for the next poll interval.
Which registrar you choose depends on your deployment:
Small clusters (< 10 nodes), same host or trusted network: embedded registrar
Medium clusters (10-70 nodes), existing etcd infrastructure: etcd registrar
Large clusters (70+ nodes) or real-time requirements: Saturn registrar
The choice is transparent to application code. You specify the registrar in gen.NodeOptions.Network.Registrar at startup. Everything else - registration, resolution, failover - works the same way regardless of which registrar you use.
Registrar Configuration
For the embedded registrar, configuration is minimal.
Setting DisableServer: true prevents the node from becoming a registrar server. It will always run in client mode. This is useful if you have a dedicated node that should handle discovery and you don't want application nodes competing for the server role.
For external registrars, configuration includes the service endpoint.
The node connects to the registrar during startup. If the connection fails, startup fails. Discovery is considered essential - if you can't register and discover, the node can't participate in the cluster, so there's no point in starting.
Discovery in Practice
Service discovery is invisible during normal operation. You send messages, make calls, establish links - discovery happens automatically behind the scenes.
Where discovery becomes visible is during debugging and operations. When connections fail, understanding discovery helps diagnose why. Is the registrar unreachable? Is the target node not registered? Are the acceptor configurations incompatible?
The registrar provides an Info() method that shows its status.
This information helps you understand what discovery features are available and whether the registrar is functioning correctly.
For deeper understanding of how discovery integrates with connection establishment and message routing, see the Network Stack chapter. For configuring explicit routes that bypass discovery, see Static Routes.
Metrics
The metrics actor provides observability for Ergo applications by collecting and exposing runtime statistics in Prometheus format. Instead of manually instrumenting your code with counters and gauges scattered throughout, the metrics actor centralizes telemetry into a single process that exposes an HTTP endpoint for Prometheus to scrape.
This approach separates monitoring concerns from application logic. Your actors focus on business functionality while the metrics actor handles collection, aggregation, and exposure of operational data. Prometheus or compatible monitoring systems poll the /metrics endpoint periodically, building time-series data for alerting and visualization.
Actor systems present unique monitoring challenges. Traditional thread-based applications have predictable resource usage patterns - you monitor thread pools, request queues, and database connections. Actor systems are more dynamic - processes spawn and terminate constantly, messages flow asynchronously through mailboxes, and work distribution depends on supervision trees and message routing.
The metrics actor addresses this by tracking:
Process metrics - How many processes exist, how many are running vs. idle vs. zombie. This reveals whether your node is under load or experiencing process leaks.
Memory metrics - Heap allocation and actual memory used. Actor systems can accumulate small allocations across thousands of processes. Memory metrics help identify whether garbage collection keeps pace with allocation.
Network metrics - For distributed Ergo clusters, tracking bytes and messages flowing between nodes reveals network bottlenecks, routing inefficiencies, or failing connections.
Application metrics - How many applications are loaded and running. Applications failing to start or terminating unexpectedly appear in these counts.
These base metrics provide system-level visibility. For application-specific metrics (request rates, business transactions, custom counters), you extend the metrics actor with your own Prometheus collectors.
ActorBehavior Interface
The metrics actor extends gen.ProcessBehavior with a specialized interface.
Only Init() is required - register your custom metrics and return options; all other callbacks have default implementations you can override as needed.
You have two main patterns:
Periodic collection - Implement CollectMetrics() to query state at intervals. Use when metrics reflect current state from other actors or external sources.
Event-driven updates - Implement HandleMessage() or HandleEvent() to update metrics when events occur. Use when your application produces natural event streams or publishes events.
How It Works
When you spawn the metrics actor:
HTTP endpoint starts at the configured host and port. The /metrics endpoint immediately serves Prometheus-formatted data.
Base metrics collect automatically. Node information (processes, memory, CPU) and network statistics (connected nodes, message rates) update at the configured interval.
Custom metrics update via CollectMetrics() callback or HandleMessage() processing, depending on your implementation.
Prometheus scrapes the /metrics endpoint and receives current values for all registered collectors (base + custom).
The actor handles HTTP serving and registry management. You focus on defining metrics and updating their values.
Basic Usage
Spawn the metrics actor like any other process.
Default configuration:
Host: localhost
Port: 3000
CollectInterval: 10 seconds
The HTTP endpoint starts automatically during initialization. The first metrics collection happens immediately, and subsequent collections run at the configured interval.
Configuration
Customize the HTTP endpoint and collection frequency.
Host determines which network interface the HTTP server binds to. Use "localhost" to restrict access to local connections only (development, testing). Use "0.0.0.0" to accept connections from any interface (production, containerized environments).
Port should not conflict with other services. Prometheus conventionally uses 9090, but many Ergo applications use that for other purposes. Choose a port that doesn't collide with your application's HTTP servers, Observer UI (default 9911), or other metrics exporters.
CollectInterval controls how frequently the actor queries node statistics. Shorter intervals provide more granular time-series data but increase CPU usage for collection. Longer intervals reduce overhead but miss short-lived spikes. For most applications, 10-15 seconds balances responsiveness with resource usage. Prometheus typically scrapes every 15-60 seconds, so collecting more frequently than your scrape interval wastes resources.
Base Metrics
The metrics actor automatically exposes these Prometheus metrics without any configuration:
Node Metrics
ergo_node_uptime_seconds (Gauge) - Time since node started. Useful for detecting node restarts and calculating availability.
ergo_processes_total (Gauge) - Total number of processes including running, idle, and zombie. High counts suggest process leaks or inefficient cleanup.
Network Metrics
ergo_connected_nodes_total (Gauge, no labels) - Number of remote nodes connected. For distributed systems, this should match your expected cluster size.
ergo_remote_node_uptime_seconds (Gauge, labeled by node) - Time since the connection to each remote node was established.
Network metrics use labels (node="...") to separate per-node data. This creates multiple time series - one per connected node. Prometheus queries can aggregate across labels or filter to specific nodes.
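Illustratively, the scraped text might look like this (the values and node names are invented):

```
# TYPE ergo_connected_nodes_total gauge
ergo_connected_nodes_total 2
# TYPE ergo_remote_node_uptime_seconds gauge
ergo_remote_node_uptime_seconds{node="worker1@host2"} 1275
ergo_remote_node_uptime_seconds{node="worker2@host3"} 342
```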
Custom Metrics
Extend the metrics actor by embedding metrics.Actor. You register custom Prometheus collectors in Init() and update them via CollectMetrics() or HandleMessage().
Approach 1: Periodic Collection
Implement CollectMetrics() to poll state at regular intervals.
Use this when metrics reflect state you need to query - current values from other actors, computed aggregates, external API calls.
Approach 2: Event-Driven Updates
Update metrics immediately when events occur. Application actors send events to the metrics actor, and your handler updates the corresponding collectors.
Use this when your application naturally produces events. Metrics update in real-time without polling.
Metric Types
Prometheus defines four metric types, each suited for different use cases:
Counter - Monotonically increasing value. Use for events that accumulate (requests processed, errors occurred, bytes sent). Counters never decrease except on process restart. Prometheus queries typically use rate() to calculate per-second rates or increase() for total change over a time window.
Gauge - Value that can go up or down. Use for current state (active connections, queue depth, memory usage, CPU utilization). Gauges represent snapshots. Prometheus queries can graph them directly or use functions like avg_over_time() to smooth spikes.
Histogram - Observations bucketed into configurable ranges. Use for latency or size distributions. Histograms let you calculate percentiles (p50, p95, p99) in Prometheus queries. They're more resource-intensive than gauges because they maintain multiple buckets per metric.
Summary - Similar to histogram but calculates quantiles client-side. Use when you need precise quantiles but can't predict bucket boundaries. Summaries are more expensive than histograms because they track exact quantiles, not approximations.
For most use cases, counters and gauges suffice. Use histograms when you need latency percentiles. Avoid summaries unless you have specific reasons - histograms are more flexible for Prometheus queries.
Integration with Prometheus
Configure Prometheus to scrape the metrics endpoint:
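A minimal prometheus.yml sketch for a static target; it assumes the metrics actor's default localhost:3000 endpoint:

```yaml
scrape_configs:
  - job_name: "ergo"
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:3000"]  # metrics actor default host:port
```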
Prometheus fetches /metrics every 15 seconds, parses the text format, and stores time-series data. You can then query, alert, and visualize metrics using Prometheus queries or Grafana dashboards.
For dynamic discovery in Kubernetes or cloud environments, use Prometheus service discovery instead of static targets. The metrics actor itself doesn't need to know about Prometheus - it just exposes an HTTP endpoint.
Observer Integration
The metrics actor includes built-in Observer support via HandleInspect(). When you inspect it in Observer UI (http://localhost:9911), you see:
Total number of registered metrics
HTTP endpoint URL for Prometheus scraping
Collection interval
Current values for all metrics (base + custom)
This works automatically for custom metrics - register them in Init() and they appear in Observer alongside base metrics.
If you need custom inspection behavior, override HandleInspect() in your implementation.
For detailed configuration options, see the metrics.Options struct and ActorBehavior interface in the package. For examples of custom metrics, see the example directory.
Network Transparency
Making distributed communication feel local
Network transparency means the location of a process - whether it's in the same goroutine, on the same node, or on a remote node halfway across the world - doesn't change how you interact with it. You send messages the same way. You make calls with the same API. You establish links and monitors with the same methods. The framework handles the complexity of discovering nodes, encoding messages, and routing them across the network.
This isn't just convenient. It's fundamental to building distributed systems in the actor model. If remote operations looked different from local operations, you'd be constantly checking location and branching your logic. That locality awareness would spread throughout your code, making it brittle and hard to reason about. Network transparency lets you design systems as collections of communicating actors, and deployment topology becomes an operational concern rather than a code concern.
But transparency has limits. Networks are slower than in-process communication. They fail in ways local operations don't. Messages can be lost. Connections drop. Remote nodes crash or become unreachable. The framework makes remote operations look local, but the network's physical reality still matters.
What Transparency Means in Practice
Consider a simple example. You have a gen.PID and you want to send it a message:
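A sketch of what such a send might look like from inside an actor callback; OrderRequest is an application-defined type, and the error handling is shown for completeness:

```go
// 'pid' may identify a local or a remote process - the call is the same.
// OrderRequest is an application-defined message type.
if err := actor.Send(pid, OrderRequest{ID: 42}); err != nil {
	// delivery problems (e.g. no connection to the remote node) surface here
}
```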
This code is identical whether pid points to a local process or a remote one. You don't check. You don't call different methods. You just send.
Behind the scenes, the framework does different things:
For a local process: The message is placed directly in the recipient's mailbox queue. The framework checks the priority, selects the appropriate queue (Main, System, or Urgent), and pushes the message. If the process is sleeping, it wakes up. The entire operation happens in microseconds.
For a remote process: The node extracts the node name from the gen.PID, checks if a connection to that node exists, discovers the node's address if needed, establishes a connection pool if necessary, encodes your OrderRequest using EDF, wraps it in a protocol frame, and sends it over TCP. The remote node receives the frame, decodes it, and routes it to the recipient's mailbox. This takes milliseconds. (An acknowledgment round-trip is added only when Important delivery is enabled.)
From your code's perspective, both operations look identical. The framework abstracts the complexity.
The Transparency Illusion
Network transparency is an illusion carefully maintained by the framework. Several mechanisms work together to create this effect.
Unified addressing - Every process has a gen.PID that includes the node name. Local and remote processes have the same identifier structure. You don't need different types for "local process" and "remote process". A gen.PID is just a gen.PID, and it works everywhere.
Automatic routing - When you send to a process, the framework examines the node portion of the identifier. If it matches the local node, the message is delivered locally. If it doesn't match, the framework initiates discovery to find the remote node and routes the message over the network. You don't trigger this logic explicitly - it happens automatically.
Location independence - You can receive a gen.PID from anywhere - as a return value, in a message, from a registry lookup - and immediately use it for communication. You don't need to check where it's from or set up connections. The framework handles it.
Failure semantics - When you send to a local process that doesn't exist, you get an error immediately. When you send to a remote process that doesn't exist, you get... nothing, by default. The message is sent over the network, and if nobody's listening, it's silently dropped. This asymmetry breaks the transparency illusion. The Important delivery flag fixes this: with Important enabled, sending to a missing remote process gives you an immediate error, just like local delivery. The framework makes the network behave like local memory.
How Messages Cross The Network
When you send a message to a remote process, what actually happens? The framework performs a complex series of operations to transform your Go value into bytes, transmit them over TCP, and reconstruct them on the receiving side. Understanding this flow helps you design efficient distributed systems and debug problems when they arise.
The sequence diagram below shows the complete message transmission pipeline, from the moment you call Send to the moment the recipient's HandleMessage is invoked:
When you send a message, the framework:
Encodes your value using EDF, transforming it into a byte sequence
Compresses it if the message exceeds the compression threshold (default 1024 bytes)
Frames it with protocol headers containing metadata (message type, sender, recipient, priority)
The remote node reverses this:
Reads the frame from the TCP connection
Decompresses if the compression flag is set
Decodes the bytes back into a Go value using EDF
This entire pipeline is invisible. You call Send, and the framework executes these steps. The receiving process calls HandleMessage, and it receives your value as if you'd passed it locally.
EDF: Ergo Data Format
EDF (Ergo Data Format) is a binary serialization format designed for distributed actor systems. It solves a fundamental problem: how do you serialize Go values - structs, slices, maps, framework types like gen.PID - across the network with the performance of code-generated serializers like Protocol Buffers, but without requiring code generation?
The answer is dynamic specialization. When you register a type, EDF analyzes its structure and builds specialized encoding and decoding functions specifically for that type. For structs, it creates functions for each field and composes them into a single encoder. This happens once at registration time, not during encoding. When you send a message, EDF uses these pre-built functions - no reflection, no runtime type analysis.
This approach delivers Protocol Buffers-class performance without .proto files or protoc code generation.
Registration happens at runtime - no build step, no generated files. You call edf.RegisterTypeOf() in your init() function, and EDF builds the optimized encoders. Framework types like gen.PID, gen.Ref, and gen.Event have native support with specialized encodings. During node handshake, both sides exchange their registered type lists and negotiate short numeric IDs, turning a full type name into 3 bytes on the wire. Field names aren't encoded - only field values in declaration order.
Performance benchmarks (see benchmarks/serial/) show encoding is 50-100% faster than Protocol Buffers, while decoding is 20-60% slower. The encoding advantage comes from the specialized functions built during registration.
EDF enforces strict type contracts - both nodes must register identical type definitions. Type identity is the full package path plus type name, not just the type name. For example, Order in package github.com/myapp/orders becomes #github.com/myapp/orders/Order. Two packages with the same type name Order are different types in EDF - this is Go's type system enforced at the protocol level.
This strict typing is a deliberate design choice that pushes version management to the application level. When you need to evolve a message type, you version it explicitly in your code:
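A sketch of explicit versioning (the type, actor, and method names are illustrative; edf.RegisterTypeOf is the registration call described later in this chapter):

```go
// V1 and V2 coexist as distinct registered types.
type OrderV1 struct {
	ID    string
	Total float64
}

type OrderV2 struct {
	ID       string
	Total    float64
	Currency string // field added in V2
}

func init() {
	edf.RegisterTypeOf(OrderV1{})
	edf.RegisterTypeOf(OrderV2{})
}

// The receiving actor branches on the concrete type.
func (a *OrderActor) HandleMessage(from gen.PID, message any) error {
	switch m := message.(type) {
	case OrderV1:
		return a.processV1(m)
	case OrderV2:
		return a.processV2(m)
	}
	return nil
}
```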
Your actors handle both versions, routing logic based on the type received. This approach is essential for canary deployments where old and new versions coexist - each node declares what it understands, and the application code manages compatibility. Protocol-level backward compatibility would hide versioning from your code, making canary rollouts harder to control.
Type Constraints
EDF imposes size limits on certain types. These limits balance memory safety with practical message sizes.
Atoms (gen.Atom) - Maximum 255 bytes. Atoms are used for names - node names, process names, event names. Names longer than 255 bytes are uncommon and likely indicate a design issue. The 255-byte limit keeps name handling efficient.
Strings - Maximum 65,535 bytes (2^16-1). This covers most string use cases. For larger text (documents, logs, large payloads), use binary encoding ([]byte) instead, which supports up to 4GB.
Errors - Maximum 32,767 bytes (2^15-1). Error messages longer than 32KB are unusual. If you need to send detailed diagnostic information, use a separate field in your message struct.
Binary ([]byte) - Maximum 4,294,967,295 bytes (2^32-1, ~4GB). This is the largest single value EDF can encode. Messages containing multi-gigabyte binaries work but are inefficient. Consider chunking large data into multiple messages or using meta processes for streaming.
Collections (map, array, slice) - Maximum 2^32 elements. A map can have up to 4 billion entries. A slice can have 4 billion elements. These limits are unlikely to be hit in practice - a slice of 4 billion int64 values would consume 32GB of memory.
These limits are enforced during encoding. If you attempt to encode a 70,000 byte string, the encoder returns an error. The message isn't sent. On the receiving side, if a malicious sender tries to send an oversized value, the decoder rejects it and closes the connection.
Type Registration Requirements
For custom types to cross the network, both sending and receiving nodes must register them. Registration tells EDF how to encode and decode the type, and creates a numeric ID that's shared during handshake for efficient encoding.
Register types during initialization:
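For example (a sketch; the import path for the edf package is assumed):

```go
package messages

import "ergo.services/ergo/net/edf"

type Order struct {
	ID    string
	Items []string
	Total float64
}

func init() {
	// init() runs before main(), so registration completes before any
	// connection (and its handshake) can be established.
	if err := edf.RegisterTypeOf(Order{}); err != nil {
		panic(err)
	}
}
```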
Registration Requirements
Only exported fields - Structs must have all fields exported (starting with uppercase). This is by design: exported fields define your actor's contract. When actors communicate - locally or across the network - they exchange messages according to explicit contracts. Unexported fields are implementation details, internal state that shouldn't cross actor boundaries. If registration encounters unexported fields, it fails with "struct Order has unexported field(s)".
No pointer types - EDF rejects pointer types and structs containing pointer fields. This is by design: pointers are a local memory optimization and shouldn't be part of network contracts. A *Database field is meaningless to a remote actor - it can't dereference your memory address. Pointers express local sharing semantics that don't translate across address spaces.
For distributed references, use framework types designed for remote access: gen.PID (process reference), gen.Alias (named reference), gen.Ref (call reference). These work across nodes and provide location-independent semantics.
Nested types must be registered first - If your type contains other custom types, register the inner types before the outer type:
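Using the Person/Address example discussed below (a sketch):

```go
type Address struct {
	City   string
	Street string
}

type Person struct {
	Name    string
	Address Address // nested custom type
}

func init() {
	// Inner type first; registering Person before Address would fail with
	// "type Address must be registered first".
	edf.RegisterTypeOf(Address{})
	edf.RegisterTypeOf(Person{})
}
```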
The order matters because registration builds the encoding schema by examining fields. When registering Person, EDF sees the Address field. If Address isn't registered yet, registration fails with "type Address must be registered first". If Address is already registered, EDF references its schema, creating an efficient nested encoding.
Custom Marshaling for Special Cases
If your type has unexported fields or needs special encoding, implement custom marshaling:
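A sketch of the edf.Marshaler side (MarshalEDF taking an io.Writer follows the description below; the UnmarshalEDF signature shown is an assumption):

```go
// Counter has an unexported field, so normal registration would reject it.
type Counter struct {
	value uint64
}

// MarshalEDF writes directly into EDF's reusable buffer (the io.Writer),
// avoiding intermediate allocations.
func (c Counter) MarshalEDF(w io.Writer) error {
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], c.value)
	_, err := w.Write(buf[:])
	return err
}

// UnmarshalEDF restores the value from the received bytes.
func (c *Counter) UnmarshalEDF(data []byte) error {
	if len(data) < 8 {
		return errors.New("counter: short buffer")
	}
	c.value = binary.BigEndian.Uint64(data)
	return nil
}
```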
EDF supports both edf.Marshaler/Unmarshaler and Go's standard encoding.BinaryMarshaler/Unmarshaler interfaces. The key difference is performance: edf.Marshaler writes directly to EDF's internal buffer (io.Writer), avoiding intermediate allocations. When you call MarshalEDF(w), the io.Writer is EDF's reusable buffer - your bytes go straight to the wire. With encoding.BinaryMarshaler, you must allocate and return a []byte, which EDF then copies into its buffer.
For high-throughput message types, prefer edf.Marshaler. For types that already implement the standard interfaces, or for messages sent rarely, encoding.BinaryMarshaler works fine.
Encoding Errors
Go's error type is an interface. Encoding an error requires special handling because interfaces don't have a fixed structure.
Framework errors (gen.ErrProcessUnknown, gen.TerminateReasonNormal, etc.) are pre-registered when the node starts. They have numeric IDs and encode compactly as 3 bytes: type tag 0x9c + 2-byte ID.
Custom errors need registration:
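For example (a sketch; edf.RegisterError is the assumed registration call):

```go
var ErrOrderRejected = errors.New("order rejected")

func init() {
	// Register on every node that needs to compare this error with
	// errors.Is - identity is only preserved if both sides register it.
	edf.RegisterError(ErrOrderRejected)
}
```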
Registered errors encode as 3 bytes total (type tag + 2-byte ID where ID > 32767). Unregistered errors encode as type tag + 2-byte length + error string bytes. On decoding, the framework checks if it has a local error with that string. If it does, it returns the local error instance. If not, it creates a new error with fmt.Errorf(string).
This means error identity can be preserved across nodes if both sides register the error. If only one side registers it, you get an error with the correct message but not the same instance. Code comparing errors with errors.Is needs both sides to register for correct behavior.
Type Registration Timing
Type registration must happen before connection establishment. During handshake, nodes exchange their registered type lists and error lists. These lists become the encoding dictionaries for that connection.
If you register a type after a connection is established, that type isn't in the dictionary. Attempting to send a value of that type fails - the encoder can't find it in the shared schema. The only way to use the newly registered type is to disconnect and reconnect, forcing a new handshake that includes the type.
This is why registration typically happens in init() functions. The registration runs before main(), which runs before node startup, which runs before any connections are established. By the time connections form, all types are registered.
For dynamic type registration (registering types based on runtime configuration or plugin loading), you have limited options:
Register before node start - Load your configuration, determine which types you need, register them all, then start the node. This works but requires knowing all types upfront.
Coordinate reconnection - Register the new type, disconnect existing connections to nodes that need the type, wait for reconnection with new handshake. This is complex and causes temporary communication loss.
Use custom marshaling - Implement edf.Marshaler/Unmarshaler or encoding.BinaryMarshaler/Unmarshaler. These don't require pre-registration - they work immediately. The tradeoff is you write the encoding logic yourself.
Most applications register types statically in init() and avoid these complications.
Compression
Large messages are automatically compressed to reduce network bandwidth. Compression is transparent - you configure it on the process or node, and the framework applies it automatically when appropriate.
When compression is enabled, the framework checks the encoded message size before transmission. If it exceeds the compression threshold (default 1024 bytes), the message is compressed using the configured algorithm. The protocol frame's message type (byte 7) is set to 0xc8 (200, protoMessageZ) and byte 8 contains the compression type ID (100=LZW, 101=ZLIB, 102=GZIP), so the receiving node knows to decompress before decoding.
Configure compression in process options:
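A sketch (the Compression field layout and the constant names in gen are assumptions based on the Type, Level, and Threshold settings described below):

```go
node.Spawn(createMyActor, gen.ProcessOptions{
	Compression: gen.Compression{
		Enable:    true,
		Type:      gen.CompressionTypeGZIP,     // or ZLIB, LZW
		Level:     gen.CompressionLevelDefault, // or BestSize, BestSpeed
		Threshold: 1024, // bytes; smaller messages skip compression
	},
})
```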
Or adjust it dynamically:
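For instance (a sketch; these gen.Process setters are assumed to mirror the compression options):

```go
// Inside an actor callback, adjust compression at runtime.
process.SetCompression(true)
process.SetCompressionType(gen.CompressionTypeZLIB)
process.SetCompressionThreshold(2048) // only compress messages above 2KB
```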
Type determines the compression algorithm. GZIP (ID=102) provides good compression ratios with reasonable speed. ZLIB (ID=101) is similar but with slightly different format. LZW (ID=100) is faster but produces lower compression. Choose based on your CPU/bandwidth tradeoff.
Level trades compression time for compression ratio. CompressionLevelBestSize produces smaller messages but takes longer. CompressionLevelBestSpeed compresses quickly but produces larger output. CompressionLevelDefault balances both.
Threshold sets the minimum size for compression. Messages smaller than the threshold aren't compressed, even if compression is enabled. Compressing tiny messages adds overhead without reducing size meaningfully. The default 1024 bytes is reasonable - messages below 1KB go uncompressed, larger messages get compressed.
Compression happens per-message. Each message is independently compressed or not, based on its size. This keeps compression stateless and allows the receiver to decode messages in any order.
Caching and Optimization
During handshake, nodes exchange caching dictionaries for frequently used values. This caching reduces message sizes significantly.
Atom caching - Node names, process names, event names - these atoms appear repeatedly in messages. Every gen.PID contains the node name. Every message frame contains sender and recipient identifiers. Instead of encoding "mynode@localhost" repeatedly (2-byte length + 16 bytes = 18 bytes), the handshake assigns it a numeric ID. Cached atoms encode as 2 bytes (uint16 ID, where ID > 255). All subsequent uses of that atom encode as the 2-byte ID.
Type caching - Registered types get numeric IDs. A User struct registered on both sides gets an agreed-upon ID. Messages containing User values encode the ID instead of the full type name and structure. A typical struct name like "#mypackage/User" might be 20-30 bytes - cached, it's 3 bytes (0x83 + 2-byte cache ID where ID > 4095).
Error caching - Registered errors get IDs. Framework errors are pre-registered with well-known IDs. Custom errors get IDs during handshake. Error responses that might encode as 50+ bytes (error string message) encode as 3 bytes with caching (type tag + 2-byte ID where ID > 32767).
The caches are bidirectional - both nodes maintain the same mappings. During encoding, the sender looks up the cache and uses IDs. During decoding, the receiver looks up IDs and reconstructs values. The cache persists for the connection lifetime. If the connection drops and reconnects, a new handshake creates a new cache.
This caching is automatic. You don't manage the cache or invalidate entries. The framework handles it. You just benefit from smaller messages.
Important Delivery
Network transparency breaks down when dealing with failures. Sending to a local process that doesn't exist returns an error immediately - the framework checks the process table and sees the PID isn't registered. Sending to a remote process that doesn't exist returns... nothing. The message is encoded, sent to the remote node, and the remote node silently drops it because there's no recipient. Your code doesn't know the process was missing.
This asymmetry makes debugging difficult. Is the remote process slow to respond, or does it not exist? Did the message get lost in the network, or was it never received? The fire-and-forget nature of normal Send provides no feedback.
The Important delivery flag fixes this:
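For example (a sketch; the message type is illustrative):

```go
// SendImportant blocks until the remote node acknowledges delivery.
if err := process.SendImportant(pid, OrderRequest{ID: "o-42"}); err != nil {
	// An error here means the message was not delivered: unknown process,
	// unreachable node, or gen.ErrTimeout if no acknowledgment arrived.
	process.Log().Error("delivery failed: %s", err)
}
```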
With Important delivery:
The message is sent to the remote node with an Important flag in the frame (bit 7 of priority byte set)
The remote node attempts delivery to the recipient's mailbox
If delivery succeeds, the remote node sends an acknowledgment back
If the acknowledgment arrives, SendImportant returns nil. If an error response arrives, it returns the error. If the timeout expires, it returns gen.ErrTimeout.
This gives you the same semantics as local delivery: immediate error feedback when something goes wrong. The network becomes transparent for failures too, not just successes.
The cost is latency. Normal Send returns immediately - it queues the message and continues. SendImportant blocks until the remote node responds, adding a network round-trip. For messages that must be delivered, this cost is worth it. For best-effort messages where occasional loss is acceptable, stick with normal Send.
For detailed exploration of Important Delivery patterns, reliability guarantees, and protocols like RR-2PC and FR-2PC, see .
Protocol Frame Structure
EDF-encoded messages are wrapped in ENP (Ergo Network Protocol) frames for transmission over TCP.
Each frame has an 8-byte header:
Byte 0: Magic byte (78 for ENP)
Byte 1: Protocol version (1 for current version)
Bytes 2-5: Frame length (uint32, total size in bytes)
Byte 6: Order byte (preserves per-sender message ordering)
Byte 7: Message type (e.g. 0xc8/protoMessageZ for compressed messages)
For PID messages, the frame contains:
Sender PID (8 bytes - just the ID, node is known from connection)
Priority byte (bits 0-6 = priority 0-2, bit 7 = Important delivery flag)
Optional reference (8 bytes - first uint64 of Ref.ID, only if Important)
The order byte (byte 6) preserves message ordering per sender. It's calculated as senderPID.ID % 255, ensuring messages from the same sender have the same order value. This guarantees sequential processing on the receiving side even if messages arrive on different TCP connections in the pool. Messages from different senders have different order values, enabling parallel processing.
When the receiving node reads a frame from TCP, it extracts the order byte and routes the frame to the appropriate receive queue. The framework creates 4 receive queues per TCP connection in the pool, so a 3-connection pool has 12 receive queues total. Frames are distributed to queues based on order_byte % queue_count. Each queue is processed by a dedicated goroutine that decodes frames and delivers messages to recipients. This parallel processing improves throughput while preserving per-sender ordering.
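The routing arithmetic can be shown in isolation (plain Go; the queue counts are illustrative):

```go
package main

import "fmt"

// routeFrame picks a receive queue for a frame so that all messages from one
// sender land in the same queue, preserving per-sender ordering.
func routeFrame(senderID uint64, queueCount int) int {
	orderByte := senderID % 255        // byte 6 of the frame header
	return int(orderByte) % queueCount // queue index within the pool
}

func main() {
	const queues = 12 // e.g. 3 TCP connections x 4 queues each
	// Frames from the same sender always map to the same queue.
	fmt.Println(routeFrame(1000, queues) == routeFrame(1000, queues))
	// Different senders usually map to different queues, enabling parallelism.
	fmt.Println(routeFrame(1000, queues), routeFrame(1001, queues))
}
```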
Limits of Transparency
Network transparency is powerful but not magical. The network has physical properties that can't be abstracted away.
Latency - Remote operations are slower. A local Send takes microseconds. A remote Send takes milliseconds. That's three orders of magnitude. For a single message, it's negligible. For thousands of messages, the difference is dramatic. Design systems to minimize remote calls, batch operations, and use asynchronous patterns.
Bandwidth - Network links have finite capacity. Sending millions of small messages can saturate a network connection. Encoding and decoding adds CPU overhead. Compression helps but costs CPU time. Be mindful of message volume and size. Local operations have effectively infinite bandwidth - remote operations don't.
Failures - Networks fail in ways local memory doesn't. Packets get lost. Connections drop. Nodes become unreachable. DNS fails. Firewalls block traffic. Local operations either succeed instantly or fail with a clear error. Remote operations can timeout, leaving you uncertain whether they succeeded. Design for these failure modes with timeouts, retries, and idempotent operations.
Partial failures - In a distributed system, some nodes can fail while others continue working. A local system either works entirely or crashes entirely. A distributed system can be partially operational - some nodes reachable, others not. This partial failure is the hardest aspect of distributed systems. The framework can't hide it entirely.
Ordering - Message ordering is preserved per-sender within a connection. Messages from process A to process B arrive in the order sent. But messages from different senders can interleave arbitrarily. And if a connection drops and reconnects, messages sent during disconnection are lost or delayed. Don't assume global ordering across the cluster.
Network transparency makes distributed programming feel local. But distributed programming has fundamental differences from local programming. The transparency is a tool that simplifies common cases - it doesn't eliminate the need to think about distributed system challenges.
Practical Implications
Understanding network transparency helps you design better distributed systems.
Use local clustering - Group processes that communicate frequently on the same node. If processes exchange hundreds of messages per second, put them locally. Their communication is microseconds instead of milliseconds, and you avoid network overhead.
Prefer async over sync - Use Send (asynchronous) instead of Call (synchronous) for remote communication when possible. Async messaging doesn't block the sender, improving throughput. Sync calls over the network tie up your process waiting for responses.
Design for message batching - Send one message with 100 items instead of 100 messages with 1 item each. Network overhead is per-message. Batching amortizes that overhead.
Handle failures explicitly - Use timeouts on sync calls. Use Important delivery for critical messages. Monitor connection health. Don't assume remote operations succeed - check errors and have fallback logic.
Keep messages small - Encoding and network transmission costs scale with message size. Large messages cause memory allocation, encoding overhead, network congestion. If you're sending megabytes of data, consider whether it belongs in messages or should use a different mechanism (file transfer, streaming, database).
Leverage compression - Enable compression for processes that send large messages. The CPU cost of compression is usually worth the network bandwidth savings. But don't compress tiny messages - the overhead exceeds the benefit.
Register types early - Do all type registration in init() functions before the node starts. Avoid dynamic type registration that requires connection cycling. Static registration is simpler and more reliable.
For details on how the network stack implements transparency, see . For understanding how nodes discover each other, see .
Web
HTTP and actors speak different languages. HTTP is fundamentally synchronous - a request arrives, blocks waiting for processing, gets a response, connection closes. The actor model is fundamentally asynchronous - messages arrive in mailboxes, get processed sequentially one at a time, responses are separate messages sent whenever ready.
Integrating these two worlds is possible, but the integration strategy matters. Choose wrong and you lose the benefits of both models. Choose right and you get HTTP's ubiquity with actors' concurrency and distribution capabilities.
This chapter shows two integration approaches, ordered from simple to complex. The simple approach works for most cases and keeps the entire HTTP ecosystem available. The meta-process approach trades tooling for deeper actor integration, enabling patterns impossible with standard HTTP stacks.
Before reaching for meta-processes, understand what you're giving up and what you're gaining. The simple approach might be all you need.
Simple Approach: Call from HTTP Handlers
The straightforward way: run a standard HTTP server, call actors from handlers using node.Call(), let network transparency distribute requests across the cluster.
This keeps HTTP and actors separate. HTTP handles protocol concerns - routing, middleware, headers, status codes. Actors handle business logic - state management, processing, coordination. Clean separation.
Basic Pattern
The HTTP server runs outside the actor system in a separate goroutine. Handlers call actors synchronously using node.Call(). Actors can be anywhere - same node, remote node, doesn't matter. Network transparency routes the call.
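A sketch of the pattern (the "orders" process name and the request type are assumptions):

```go
// GetOrderRequest is an illustrative request type.
type GetOrderRequest struct {
	ID string
}

func startHTTP(node gen.Node) error {
	http.HandleFunc("/order", func(w http.ResponseWriter, r *http.Request) {
		// Blocks only this handler goroutine; the actor may be local or remote.
		result, err := node.Call(gen.Atom("orders"),
			GetOrderRequest{ID: r.URL.Query().Get("id")})
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		fmt.Fprintf(w, "%v", result)
	})
	return http.ListenAndServe(":8080", nil)
}
```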
Why This Works
Call() blocks the HTTP handler goroutine, not an actor. Go's HTTP server creates one goroutine per connection. Blocking in a handler is normal - that goroutine waits, others continue serving requests.
The actor receiving the call processes it asynchronously in its own message loop. Multiple handlers can call the same actor concurrently. The actor processes one request at a time from its mailbox. This isolates the actor from HTTP concurrency.
Network transparency means the actor can be anywhere:
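For example (node and process names are illustrative):

```go
// gen.ProcessID names a registered process on a specific node.
func callOrders(node gen.Node, req any) (any, error) {
	target := gen.ProcessID{Name: "orders", Node: "backend@host2"}
	// Change Node to "gateway@host1" and the actor is local -
	// the call site doesn't change.
	return node.Call(target, req)
}
```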
Change Node to move the actor. Code stays the same. Distribute load across nodes by routing different requests to different actors.
Cluster Load Distribution
Network transparency means actors can run anywhere in the cluster. The HTTP gateway becomes a router that distributes requests across backend nodes.
Simple consistent hashing distributes load evenly while maintaining request affinity:
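A minimal sketch of the hashing step (plain Go, using modulo hashing over a fixed backend list; the node names are illustrative):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickBackend maps a request key (e.g. a user ID) to a stable backend node,
// so repeated requests for the same key hit the same node's warm cache.
func pickBackend(key string, backends []string) string {
	h := fnv.New32a()
	h.Write([]byte(key))
	return backends[h.Sum32()%uint32(len(backends))]
}

func main() {
	backends := []string{"backend1@host", "backend2@host", "backend3@host"}
	// The same key always resolves to the same node.
	fmt.Println(pickBackend("alice", backends) == pickBackend("alice", backends))
}
```

An HTTP handler would call pickBackend with the user ID, then route the node.Call to a gen.ProcessID on the chosen node.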
Requests for user "alice" always go to the same backend node. That node caches alice's data in memory. Subsequent requests hit warm cache. Change clusterSize to add nodes - hashing redistributes load automatically while preserving most affinity.
For dynamic topology where nodes join and leave unpredictably, use application discovery. Central registrars (etcd, Saturn) track which nodes are running which applications in real-time:
Application discovery returns all nodes currently running the service. Each node reports its weight. Nodes with higher weights (more resources, better hardware, closer proximity) receive proportionally more traffic. Nodes that crash disappear from discovery immediately. New nodes appear as soon as they register. The HTTP gateway adapts to cluster topology changes without restarts.
For details on application discovery and central registrars, see .
Standard HTTP Tooling
This approach keeps the entire HTTP ecosystem available:
OpenAPI generation: Tools like swag/swaggo analyze HTTP handlers and generate OpenAPI specs. They see standard net/http handlers, so generation works normally.
Middleware: Standard HTTP middleware wraps handlers - authentication, logging, CORS, rate limiting. Actors are completely invisible to middleware.
Routing: Use any router - http.ServeMux (Go 1.22+), gorilla/mux, chi, echo. They all work with standard handlers.
Testing: Test HTTP handlers with httptest. Test actors separately with unit tests. Clean separation of concerns.
The actor system is an implementation detail. HTTP sees standard handlers. Clients see standard HTTP. Deployment tools see standard HTTP servers. Only the handler implementation uses actors internally.
When This Approach Works
Use this when:
You need standard HTTP tooling (OpenAPI, gRPC-gateway, middleware ecosystems)
Load balancing happens at the nginx/kubernetes level, not actor level
Backpressure from actors doesn't matter (actors process at their speed, HTTP clients wait)
This covers most HTTP/actor integration cases. The HTTP layer is stateless. Actors hold state and logic. HTTP routes requests to actors. Clean architecture.
For details on synchronous request handling in actors, see .
Meta-Process Approach: Deep Integration
Meta-processes convert HTTP into asynchronous actor messages. Instead of calling actors synchronously from handlers, requests become messages flowing into the actor system.
This approach enables:
Backpressure: actors control request rate through mailbox capacity
Addressable connections: each WebSocket/SSE connection becomes an independent actor with gen.Alias identifier - any actor anywhere in the cluster can send messages directly to specific client connections through network transparency. This is the killer feature for real-time systems (chat, multiplayer games, live dashboards, collaborative editing) where backend logic must push updates to specific clients across cluster nodes. Impossible with the simple approach.
Per-request routing: route to different actor pools based on request content
Standard HTTP routing and middleware still work - meta.WebHandler implements http.Handler and integrates with http.ServeMux or any router. You can wrap handlers in middleware for authentication, logging, CORS. What you lose is introspection-based tooling (OpenAPI generation, gRPC-gateway) because request processing happens inside actors, invisible to HTTP layer analysis tools.
Architecture
Two meta-processes work together:
meta.WebServer: External Reader runs http.Server.Serve(listener). Blocks there forever until listener fails. The http.Server creates its own goroutines for each HTTP connection - those goroutines call handlers, not the External Reader. Actor Handler never runs (no messages received).
meta.WebHandler: Implements http.Handler interface. External Reader blocks in Start() waiting for termination. When http.Server (running in WebServer) accepts a connection, it spawns a goroutine that calls handler.ServeHTTP(). Inside ServeHTTP():
Create context with timeout
Send meta.MessageWebRequest to worker actor
Block on <-ctx.Done() waiting for worker to call Done()
Actor Handler never runs - HandleMessage() and HandleCall() are empty stubs.
Compare this with typical meta-processes like TCP or UDP:
TCP/UDP meta-processes:
External Reader actively loops reading from socket, sends messages to actors
Actor Handler receives messages from actors, writes to socket
Both goroutines do real work - continuous bidirectional I/O
Web meta-processes:
WebServer's External Reader passively blocks in http.Server.Serve() doing nothing - http.Server does all the work internally
WebHandler's External Reader passively blocks on channel doing nothing - just waiting for termination
Neither has an active Actor Handler - no messages arrive in their mailboxes
Web meta-processes are unusual. They use the meta-process mechanism not for bidirectional I/O but for lifecycle management and integration with the actor system. The External Reader goroutines exist only to keep the meta-process alive while http.Server runs. The actual HTTP handling happens in goroutines spawned by http.Server, which are completely outside the meta-process architecture.
This works because http.Server already solves concurrency - it spawns goroutines per connection. The meta-process just wraps it for integration with actor lifecycle and messaging.
Basic Setup
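A sketch of wiring the two meta-processes together; the constructor and option names are assumptions based on the meta package described in this section:

```go
// WebApp is a hypothetical actor that owns the HTTP stack.
type WebApp struct {
	act.Actor
}

func (w *WebApp) Init(args ...any) error {
	mux := http.NewServeMux()

	// The handler forwards each HTTP request to the "worker" process
	// as a meta.MessageWebRequest.
	handler := meta.CreateWebHandler(meta.WebHandlerOptions{
		Worker:         "worker",
		RequestTimeout: 5 * time.Second,
	})
	if _, err := w.SpawnMeta(handler, gen.MetaOptions{}); err != nil {
		return err
	}
	mux.Handle("/", handler)

	// The server runs http.Server.Serve() inside its External Reader.
	server, err := meta.CreateWebServer(meta.WebServerOptions{
		Host:    "localhost",
		Port:    8080,
		Handler: mux,
	})
	if err != nil {
		return err
	}
	_, err = w.SpawnMeta(server, gen.MetaOptions{})
	return err
}
```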
When a request arrives:
WebServer's External Reader is blocked in http.Server.Serve()
http.Server accepts connection, spawns its own goroutine for this connection
That goroutine calls handler.ServeHTTP() (handler is WebHandler)
Critical: ServeHTTP() executes in http.Server goroutines, not in meta-process goroutines. WebHandler's External Reader remains blocked in Start() waiting for termination. WebHandler's Actor Handler never spawns because no messages arrive in its mailbox.
Worker Implementation
Workers receive meta.MessageWebRequest as regular messages in their mailbox:
The pattern: receive MessageWebRequest, process it, write to ResponseWriter, call Done(). The Done() call unblocks the ServeHTTP() goroutine waiting in WebHandler.
Using act.WebWorker: Framework provides act.WebWorker that automatically extracts MessageWebRequest, routes to HTTP-method-specific callbacks (HandleGet, HandlePost, etc.), and calls Done() after processing. Use this instead of manual message handling - it eliminates boilerplate and ensures Done() is always called. See for details.
Concurrent Processing with act.Pool
Single worker processes requests sequentially. Use act.Pool to process multiple requests concurrently:
Pool distributes incoming requests across 20 workers. Each worker processes one request at a time. System handles 20 concurrent requests.
Capacity control: worker mailboxes bound how many requests the backend accepts. With 20 workers and a mailbox size of 10, system capacity is 220 requests (20 in processing plus 20 × 10 = 200 queued). Beyond this, requests are shed - the pool cannot forward to workers whose mailboxes are full.
This limits load on backend systems. Database handles 20 concurrent queries maximum. External API gets 20 parallel requests maximum. Worker mailboxes buffer bursts without overwhelming downstream services.
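The pool-plus-bounded-mailboxes behavior can be sketched with buffered channels standing in for worker mailboxes. This is a plain-Go illustration, not act.Pool itself; `pool`, `dispatch`, and `newPool` are invented names. With no worker draining, exactly the queued capacity (PoolSize × WorkerMailboxSize) is accepted and the rest is shed.

```go
package main

import (
	"errors"
	"fmt"
)

var errMailboxFull = errors.New("mailbox full")

// pool dispatches requests round-robin to workers with bounded mailboxes.
// When every candidate mailbox is full, the request is shed.
type pool struct {
	workers []chan int // buffered channels stand in for worker mailboxes
	next    int
}

func newPool(size, mailbox int) *pool {
	p := &pool{workers: make([]chan int, size)}
	for i := range p.workers {
		p.workers[i] = make(chan int, mailbox)
	}
	return p
}

// dispatch tries each worker once, starting from the round-robin cursor.
func (p *pool) dispatch(req int) error {
	for i := 0; i < len(p.workers); i++ {
		w := p.workers[(p.next+i)%len(p.workers)]
		select {
		case w <- req: // enqueued
			p.next = (p.next + i + 1) % len(p.workers)
			return nil
		default: // this mailbox is full, try the next worker
		}
	}
	return errMailboxFull // all mailboxes full - shed the request
}

func main() {
	p := newPool(20, 10)
	accepted := 0
	for i := 0; i < 300; i++ {
		if p.dispatch(i) == nil {
			accepted++
		}
	}
	// No worker is draining here, so only the queued capacity is available:
	// 20 mailboxes x 10 slots = 200 accepted, the remaining 100 shed.
	fmt.Println(accepted)
}
```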
Worker failures are handled automatically. Pool spawns replacement workers when crashes are detected. Other workers continue processing during restart.
Stateful Connections: WebSocket
HTTP request-response is stateless. WebSocket is the opposite: long-lived bidirectional connections that remain open for hours or days.
The framework provides WebSocket meta-process implementation in the extra library (ergo.services/meta/websocket). Each connection becomes an independent meta-process with gen.Alias identifier, addressable from anywhere in the cluster.
Each connection is an independent meta-process:
External Reader continuously reads messages from client
Actor Handler receives messages from backend actors, writes to client
Both operate simultaneously - full-duplex bidirectional communication
Killer feature: cluster-wide addressability. Any actor on any node can send messages directly to specific client connections:
Network transparency makes every WebSocket connection addressable like any other actor. Backend logic scattered across cluster nodes can push updates to specific clients without routing through intermediaries.
This is impossible with the simple approach. node.Call() is request-response. WebSocket requires continuous streaming both directions. Meta-processes provide the architecture: one goroutine reading from client, another writing to client, both operating on the same connection.
For WebSocket implementation and usage examples, see .
Choosing an Approach
Start with the simple approach. Use node.Call() from standard HTTP handlers. This works for most cases and keeps the entire HTTP ecosystem available - OpenAPI generation, middleware, familiar patterns.
Move to meta-processes when you specifically need:
WebSocket or long-lived connections: Each connection must be an addressable actor that backend logic can push updates to. The simple approach cannot do this - it's request-response only. Meta-processes make each connection an independent actor with cluster-wide addressability.
Capacity control through mailbox limits: Backend accepts exactly PoolSize × WorkerMailboxSize requests, no more. Beyond this, requests are rejected. This prevents memory exhaustion during overload. The simple approach queues unbounded requests in HTTP server.
The simple approach handles thousands of requests per second with proper actor distribution. Use meta-processes only when the simple approach cannot provide required capabilities.
Actor
The actor model requires sequential message processing - each actor handles one message at a time in a dedicated goroutine. This eliminates data races within the actor but shifts complexity to the message handling loop: reading from multiple mailbox queues in priority order, dispatching to different handlers based on message type, managing state transitions, converting exit signals to regular messages when trapping is enabled.
You could implement this yourself with gen.ProcessBehavior, but you'd rewrite the same logic for every actor. act.Actor solves this. It implements the low-level gen.ProcessBehavior interface and provides a higher-level act.ActorBehavior interface with straightforward callbacks: Init for initialization, HandleMessage for asynchronous messages, HandleCall for synchronous requests, Terminate for cleanup. You write business logic, act.Actor handles the mailbox mechanics.
Creating an Actor
Embed act.Actor in your struct and implement the act.ActorBehavior callbacks you need:
Spawn it like any process:
The factory function is called each time you spawn. Each process gets a fresh instance with its own state. This isolation is fundamental to the actor model - actors share nothing except messages.
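The per-spawn isolation can be shown with a minimal factory in plain Go (`counter` and `factory` are illustrative, not Ergo types): each call returns a fresh instance, so mutating one never affects another.

```go
package main

import "fmt"

// counter plays the role of an actor's private state.
type counter struct{ n int }

func (c *counter) handle() int { c.n++; return c.n }

// factory is called once per spawn, so every "process" gets its own state.
func factory() *counter { return &counter{} }

func main() {
	a, b := factory(), factory()
	a.handle()
	a.handle()
	// b is untouched: instances share nothing except the messages you send.
	fmt.Println(a.handle(), b.handle())
}
```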
Callback Interface
act.ActorBehavior defines the callbacks act.Actor will invoke:
All callbacks are optional. act.Actor provides default implementations that log warnings for unhandled messages. Implement only what you need.
Since act.Actor embeds gen.Process, you have direct access to all process methods: Send, Call, Spawn, Link, RegisterName, etc. No need to store references - they're built in.
Initialization
Init runs once when the process spawns. The args parameter contains whatever you passed to Spawn:
If Init returns an error, the process is cleaned up and removed. Spawn returns immediately with that error. Use this for validation: check arguments, verify resources, refuse to start if preconditions aren't met.
During Init, the process is in ProcessStateInit. All operations are available: Spawn, Send, SetEnv, RegisterName, CreateAlias, RegisterEvent, Link*, Monitor*, Call*, and property setters.
Any resources created during Init (names, aliases, events, links, monitors) are properly cleaned up if initialization fails.
Message Handling
Messages arrive in the mailbox and sit in one of four queues: Urgent, System, Main, or Log. act.Actor processes them in priority order:
Urgent - Maximum priority messages (MessagePriorityMax)
System - High priority messages (MessagePriorityHigh)
Main - Normal priority messages (MessagePriorityNormal, default)
Log - Logging messages (lowest priority)
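The drain order can be sketched as four queues popped strictly by priority - a plain-Go illustration of the rule, not the framework's mailbox implementation (`mailbox`, `next`, and `drain` are invented names):

```go
package main

import "fmt"

// mailbox with four priority queues, drained strictly in order:
// urgent first, log only when everything else is empty.
type mailbox struct {
	urgent, system, main, log []string
}

// next pops the highest-priority pending message, or "" if all are empty.
func (m *mailbox) next() string {
	for _, q := range []*[]string{&m.urgent, &m.system, &m.main, &m.log} {
		if len(*q) > 0 {
			msg := (*q)[0]
			*q = (*q)[1:]
			return msg
		}
	}
	return ""
}

func drain(m *mailbox) []string {
	var order []string
	for msg := m.next(); msg != ""; msg = m.next() {
		order = append(order, msg)
	}
	return order
}

func main() {
	m := &mailbox{
		log:    []string{"log entry"},
		main:   []string{"normal msg"},
		system: []string{"event"},
		urgent: []string{"inspect"},
	}
	fmt.Println(drain(m)) // urgent, then system, then main, then log
}
```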
When a message arrives in Urgent, System, or Main, act.Actor calls HandleMessage:
The return value determines whether the actor continues or terminates:
Return nil to keep running
Return gen.TerminateReasonNormal for clean shutdown
Return any other error to terminate (logged as error)
The from parameter tells you who sent the message. Use it for replies. If you don't need replies, ignore it.
Synchronous Requests
When someone calls process.Call(pid, request), act.Actor invokes your HandleCall:
The error return value controls process termination, not the caller's response:
(result, nil) - Send result to caller, continue running
(result, gen.TerminateReasonNormal) - Send result, then terminate cleanly
(nil, someError) - Terminate immediately with someError (caller times out)
To send an application error to the caller, return it as the result value:
This separation between transport errors (err return from Call) and application errors (result as error) is fundamental to actor communication. See for deeper discussion of error channels and when to use SendResponseError.
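The two error channels can be demonstrated in plain Go. The `call` function below is a stand-in for an actor call, not the Ergo API: the returned `err` plays the transport-error role, while application errors travel inside the result value and are detected with a type assertion.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("user not found")

// call mimics the two error channels of an actor call:
// the returned err is transport-level (timeout, dead process),
// while application errors travel inside the result value.
func call(request string) (result any, err error) {
	switch request {
	case "get-user":
		return errNotFound, nil // application error: delivered as the result
	case "dead-process":
		return nil, errors.New("timeout") // transport error: call itself failed
	default:
		return "ok", nil
	}
}

func main() {
	result, err := call("get-user")
	if err != nil {
		fmt.Println("transport failed:", err) // retry, pick another node, ...
		return
	}
	// The call succeeded; now check whether the *application* reported an error.
	if appErr, ok := result.(error); ok {
		fmt.Println("application error:", appErr)
		return
	}
	fmt.Println("result:", result)
}
```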
Asynchronous Handling of Synchronous Requests
Sometimes you can't respond immediately. Maybe you need to query another service, or delegate work to a pool of workers. Return (nil, nil) from HandleCall to defer the response:
The gen.Ref identifies the request. The caller blocks waiting for a response with that ref. You can send the response from any process - the one that received the request, a worker, or even a remote process. Just call SendResponse(callerPID, ref, result).
The ref has a deadline (from the caller's timeout). Check if it's still alive before doing expensive work:
Termination
To stop an actor, return a non-nil error from HandleMessage or HandleCall:
Termination reasons:
gen.TerminateReasonNormal - Clean shutdown, not logged as error
gen.TerminateReasonKill - Process was killed via node.Kill(pid)
gen.TerminateReasonShutdown - Node is stopping (sent by parent or node)
gen.TerminateReasonPanic - Panic occurred in callback (framework catches it)
Any other error - Application-specific failure (logged as error)
After termination is triggered, act.Actor calls your Terminate callback:
At this point, the process is in ProcessStateTerminated and has been removed from the node. Most gen.Process methods return gen.ErrNotAllowed. You can still send messages (fire-and-forget), but you can't make calls, create links, or spawn children.
If a panic occurs during Init, HandleMessage, or HandleCall, the framework catches it, logs the stack trace, and terminates the process with gen.TerminateReasonPanic. The Terminate callback still runs, giving you a chance to clean up.
Trapping Exit Signals
By default, when an actor receives an exit signal (via SendExit or from a linked process), it terminates immediately. Enable TrapExit to convert exit signals into regular messages:
Exit signal messages:
gen.MessageExitPID - From a process (SendExit or link)
gen.MessageExitProcessID - From a named process link
gen.MessageExitAlias - From an alias link
gen.MessageExitEvent - From an event link
gen.MessageExitNode - From a node link (network disconnect)
Exception: Exit signals from the parent process cannot be trapped. If your parent terminates (and you created a link with LinkParent option or via Link/LinkPID), you terminate regardless of TrapExit. This ensures supervision trees can forcefully terminate subtrees.
Use TrapExit when you want to handle failures gracefully - log them, restart workers, switch to fallback services. Don't use it if you want standard supervision behavior (child fails → parent restarts it).
Split Handle
By default, HandleMessage and HandleCall are invoked regardless of how the process was addressed - by PID, by registered name, or by alias. Enable SetSplitHandle(true) to route based on address type:
The same split applies to HandleCall* variants. Use this when you want different behavior for internal communication (PID) versus public API (registered name) versus temporary sessions (alias).
Most actors don't need this. Leave split handle disabled and use HandleMessage/HandleCall for everything.
Specialized Callbacks
Logging
If your actor is registered as a logger (via node.AddLogger(pid, level)), it receives log messages in the Log queue:
Log messages have the lowest priority. They're processed after Urgent, System, and Main are empty. This prevents logging from starving regular message processing.
Events
If your actor subscribed to an event (via LinkEvent or MonitorEvent), it receives event messages:
Events arrive in the System queue (high priority). Use them for cross-cutting concerns where multiple actors need to react to the same occurrence.
Inspection
Actors can expose runtime state for monitoring and debugging via the HandleInspect callback:
Inspect the actor from within a process context or directly from the node:
Both methods only work for local processes (same node). Inspection requests go to the Urgent queue and bypass normal message processing. Keep HandleInspect implementation fast - don't do expensive computations or I/O. Return only string values (serialization limitation). The optional item parameters allow filtering which fields to return, though most implementations ignore them and return all fields.
Actor Pools
For workload distribution, use act.Pool instead of implementing manual worker management. See for details.
Patterns and Pitfalls
Don't spawn goroutines in callbacks. The actor model is sequential - one message at a time. Spawning goroutines breaks this, introducing data races on actor state. If you need concurrency, spawn child actors and send them messages.
Don't block on channels or mutexes. Callbacks run in the actor's goroutine. Blocking it starves message processing. Use async message passing (Send) instead of sync primitives.
Don't store gen.Process references. The embedded act.Actor provides all process methods. Storing additional references wastes memory and can cause confusion about which instance is authoritative.
Return errors for termination, not for caller responses. HandleCall's error return terminates the process. To send errors to callers, return them as the result value.
Use ref.IsAlive() before expensive async work. When handling calls asynchronously, check if the caller is still waiting before spending resources on the response.
Enable TrapExit only when needed. Default behavior (terminate on exit signal) works for most actors. Trap only when you have specific failure handling logic.
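The first two pitfalls come down to one rule: all state access stays inside the actor's single message loop. A plain-Go sketch of that loop (`command`, `run`, and `send` are illustrative names) shows why no mutex is needed - only one goroutine ever touches the state:

```go
package main

import "fmt"

// command sent to the actor; reply carries the result back.
type command struct {
	delta int
	reply chan int
}

// run is the actor's single goroutine: it owns the state and handles
// one message at a time, so no mutex is ever needed.
func run(mailbox chan command) {
	state := 0
	for cmd := range mailbox {
		state += cmd.delta
		cmd.reply <- state
	}
}

func send(mailbox chan command, delta int) int {
	reply := make(chan int, 1)
	mailbox <- command{delta: delta, reply: reply}
	return <-reply
}

func main() {
	mailbox := make(chan command)
	go run(mailbox)
	send(mailbox, 5)
	send(mailbox, 3)
	fmt.Println(send(mailbox, 2)) // state accumulates sequentially
}
```

Spawning an extra goroutine that reads or writes `state` directly would reintroduce exactly the data races the model is designed to eliminate.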
Supervisor
Actors fail. They panic, encounter errors, or lose external resources. In traditional systems, you add defensive code: catch exceptions, retry operations, validate state. This spreads failure handling throughout your codebase, mixing recovery logic with business logic.
The actor model takes a different approach: let it crash. When an actor fails, terminate it cleanly and restart it in a known-good state. This requires something watching the actor and managing its lifecycle - a supervisor.
act.Supervisor is an actor that manages child processes. It starts them during initialization, monitors them for failures, and applies restart strategies when they terminate. Supervisors can manage other supervisors, creating hierarchical fault tolerance trees where failures are isolated and recovered automatically.
Like act.Actor, the act.Supervisor struct implements the low-level gen.ProcessBehavior interface.
// Local - errors are immediate
err := process.Send(localPID, message)
if err != nil {
    // ErrProcessUnknown or ErrProcessMailboxFull
    // You know immediately something is wrong
}

// Remote - errors are hidden
err = process.Send(remotePID, message)
if err != nil {
    // Only reports local problems (serialization, no connection)
    // Cannot report remote problems (process missing, mailbox full)
}
// Message sent to network, no idea if it arrived

err := process.SendImportant(remotePID, message)
if err != nil {
    // Immediate errors:
    // - ErrProcessUnknown: process doesn't exist on remote node
    // - ErrProcessMailboxFull: process exists but mailbox is full
    // - ErrTimeout: remote node received message but no confirmation
    // - ErrNoConnection: cannot reach remote node
}
// If no error, message is definitely in the recipient's mailbox
func (a *Actor) HandleMessage(from gen.PID, message any) error {
    switch msg := message.(type) {
    case CriticalUpdate:
        // This message must be delivered or we need to know it failed
        if err := a.SendImportant(targetPID, msg); err != nil {
            a.Log().Error("failed to send critical update: %s", err)
            return err
        }
        a.Log().Info("critical update confirmed delivered")
    }
    return nil
}
func (a *Actor) Init(args ...any) error {
    // Enable important delivery for all messages from this process
    a.SetImportantDelivery(true)
    return nil
}

func (a *Actor) HandleMessage(from gen.PID, message any) error {
    // Send uses important delivery automatically
    err := a.Send(targetPID, message)
    if err != nil {
        // Immediate confirmation or error
    }
    return nil
}
type Coordinator struct {
    act.Actor
    participants []gen.PID
}

func (c *Coordinator) Prepare() error {
    c.SetImportantDelivery(true) // FR-2PC for all messages

    // Phase 1: Prepare
    for _, p := range c.participants {
        result, err := c.CallImportant(p, PrepareRequest{})
        if err != nil {
            // Participant unreachable - abort
            return c.abort()
        }
        if result != "yes" {
            // Participant voted no - abort
            return c.abort()
        }
    }

    // Phase 2: Pre-commit (guaranteed delivery)
    for _, p := range c.participants {
        if _, err := c.CallImportant(p, PreCommitRequest{}); err != nil {
            // This is a problem - participant didn't receive pre-commit
            // But FR-2PC guarantees we know immediately
            return c.handlePreCommitFailure(p, err)
        }
    }

    // Phase 3: Commit (guaranteed delivery)
    for _, p := range c.participants {
        if _, err := c.CallImportant(p, CommitRequest{}); err != nil {
            // Participant didn't receive commit
            // Need recovery protocol
            return c.handleCommitFailure(p, err)
        }
    }
    return nil
}
// Local send - immediate error, important flag ignored
err := process.SendImportant(localPID, message)
if err != nil {
    // ErrProcessUnknown or ErrProcessMailboxFull
    // No ACK needed, mailbox operation is synchronous
}
// For etcd registrar
import "ergo.services/registrar/etcd"

// For Saturn registrar
import "ergo.services/registrar/saturn"

registrar, _ := node.Network().Registrar()
event, err := registrar.Event()
if err != nil {
    // registrar doesn't support events (embedded registrar only)
}

// Link to the event to receive notifications
process.LinkEvent(event)
// In your HandleEvent callback (etcd example):
func (w *Worker) HandleEvent(message gen.MessageEvent) error {
    switch ev := message.Message.(type) {
    case etcd.EventConfigUpdate:
        // Configuration item changed
        w.Log().Info("config updated: %s = %v", ev.Item, ev.Value)
        w.loadConfig()
    case etcd.EventNodeJoined:
        // New node joined the cluster
        w.Log().Info("node joined: %s", ev.Name)
        w.checkNewNode(ev.Name)
    case etcd.EventNodeLeft:
        // Node left the cluster
        w.Log().Info("node left: %s", ev.Name)
        w.handleNodeDown(ev.Name)
    case etcd.EventApplicationLoaded:
        // Application loaded on a node
        w.Log().Info("application %s loaded on %s (weight: %d)",
            ev.Name, ev.Node, ev.Weight)
    case etcd.EventApplicationStarted:
        // Application started running
        w.Log().Info("application %s started on %s (mode: %s, weight: %d)",
            ev.Name, ev.Node, ev.Mode, ev.Weight)
        w.refreshServices()
    case etcd.EventApplicationStopping:
        // Application is stopping
        w.Log().Info("application %s stopping on %s", ev.Name, ev.Node)
    case etcd.EventApplicationStopped:
        // Application stopped completely
        w.Log().Info("application %s stopped on %s", ev.Name, ev.Node)
        w.refreshServices()
    case etcd.EventApplicationUnloaded:
        // Application unloaded from node
        w.Log().Info("application %s unloaded from %s", ev.Name, ev.Node)
    }
    return nil
}
// For Saturn registrar, use saturn.EventConfigUpdate, saturn.EventNodeJoined, etc.
// The event types are identical in structure but defined in separate packages.
network := node.Network()
registrar, err := network.Registrar()
if err != nil {
    // node has no registrar configured
}

info := registrar.Info()
// info.Server - registrar endpoint
// info.EmbeddedServer - true if running as server
// info.SupportConfig - whether config storage is available
// info.SupportRegisterApplication - whether app routing is available
Processes actively handling messages. Low relative to total suggests most processes are idle (good) or blocked (bad - investigate what they're waiting for).
ergo_processes_zombie (Gauge) - Processes terminated but not yet fully cleaned up. These should be transient. Persistent zombies indicate bugs in termination handling.
ergo_memory_used_bytes (Gauge) - Total memory obtained from the OS (runtime.MemStats.Sys).
ergo_memory_alloc_bytes (Gauge) - Bytes of allocated heap objects (runtime.MemStats.Alloc).
ergo_cpu_user_seconds (Gauge) - CPU time spent executing user code. Increases as the node does work; the rate of change indicates CPU utilization.
ergo_cpu_system_seconds (Gauge) - CPU time spent in the kernel (system calls). High system time relative to user time suggests I/O bottlenecks or excessive syscalls.
ergo_applications_total (Gauge) - Number of applications loaded. Should match your expected count; unexpected changes indicate applications starting or stopping.
ergo_applications_running (Gauge) - Applications currently active. Compare to total to identify stopped or failed applications.
ergo_registered_names_total (Gauge) - Processes registered with atom names. High counts suggest heavy use of named processes for routing.
ergo_registered_aliases_total (Gauge) - Total number of registered aliases, including aliases created by processes via CreateAlias() and aliases identifying meta-processes.
ergo_registered_events_total (Gauge) - Event subscriptions active in the node. High counts indicate extensive pub/sub usage.
(per node) - Uptime of each connected remote node. Resets when the remote node restarts.
ergo_remote_messages_in_total (Gauge, per node) - Messages received from each remote node. Rate indicates traffic volume.
ergo_remote_messages_out_total (Gauge, per node) - Messages sent to each remote node. Asymmetric in/out rates may reveal routing issues.
ergo_remote_bytes_in_total (Gauge, per node) - Bytes received from each remote node. A disproportionate bytes-to-messages ratio suggests large messages or inefficient serialization.
ergo_remote_bytes_out_total (Gauge, per node) - Bytes sent to each remote node. Monitors network bandwidth usage per peer.
Only Init is mandatory. All other methods are optional - act.Supervisor provides default implementations that log warnings. The Init method returns SupervisorSpec which defines the supervisor's behavior, children, and restart strategy.
Creating a Supervisor
Embed act.Supervisor and implement Init to define the supervision spec:
The supervisor spawns all children during Init (except Simple One For One, which starts with zero children). Each child is linked bidirectionally to the supervisor (LinkChild and LinkParent set automatically). If a child terminates, the supervisor receives an exit signal and applies the restart strategy.
Children are started sequentially in declaration order. If any child's spawn fails (the factory's ProcessInit returns an error), the supervisor terminates immediately with that error. This ensures the supervision tree is fully initialized or not at all - no partial states.
Supervision Types
The Type field in SupervisorSpec determines what happens when a child fails.
One For One
Each child is independent. When one child terminates, only that child is restarted. Other children continue running unaffected.
If worker2 crashes, the supervisor restarts only worker2. worker1 and worker3 keep running. Use this when children are independent - databases, caches, API handlers that don't depend on each other.
Each child runs with a registered name (the Name from the spec). This means only one instance per child spec. To run multiple instances of the same worker, use Simple One For One instead.
All For One
Children are tightly coupled. When any child terminates, all children are stopped and restarted together.
If cache crashes, the supervisor stops processor and api (in reverse order if KeepOrder is true, simultaneously otherwise), then restarts all three in declaration order. Use this when children share state or dependencies that can't survive partial failures.
Rest For One
When a child terminates, only children started after it are affected. Children started before it continue running.
If cache crashes, the supervisor stops api, then restarts cache and api in order. database is unaffected. Use this for dependency chains where later children depend on earlier ones, but earlier ones don't depend on later ones.
With KeepOrder: true, children are stopped sequentially (last to first). With KeepOrder: false, they stop simultaneously. Either way, restart happens in declaration order after all affected children have stopped.
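Which children are affected by a Rest For One failure is a pure function of declaration order. A small plain-Go sketch (`restForOne` is an illustrative helper, not a framework function) makes the stop/restart sets explicit:

```go
package main

import "fmt"

// restForOne returns which children must be stopped (in stop order) and
// which must be restarted when the child at index `failed` crashes.
// Children declared before the failed one are left alone.
func restForOne(children []string, failed int) (stop, restart []string) {
	// Children started after the failed one are stopped, last to first.
	for i := len(children) - 1; i > failed; i-- {
		stop = append(stop, children[i])
	}
	// The failed child and everything after it restart in declaration order.
	restart = append(restart, children[failed:]...)
	return stop, restart
}

func main() {
	children := []string{"database", "cache", "api"}
	stop, restart := restForOne(children, 1) // "cache" crashed
	fmt.Println("stop:", stop)       // only "api" - database is untouched
	fmt.Println("restart:", restart) // "cache" then "api", in order
}
```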
Simple One For One
All children run the same code, spawned dynamically instead of at supervisor startup.
The supervisor starts with zero children. Call supervisor.StartChild("worker", "custom-args") to spawn instances:
Each instance is independent. They're not registered by name (no SpawnRegister), so you track them by PID. When an instance terminates, only that instance is restarted (if the restart strategy allows). Other instances continue running.
Use Simple One For One for worker pools where you dynamically scale the number of identical workers based on load. The child spec is a template - each StartChild creates a new instance from that template.
Restart Strategies
The Restart.Strategy field determines when children are restarted.
Transient (Default)
Restart only on abnormal termination. If a child returns gen.TerminateReasonNormal or gen.TerminateReasonShutdown, it's not restarted:
Use this for workers that can gracefully stop - maybe they finished their work, or received a shutdown command. Crashes (panics, errors, kills) trigger restarts. Normal termination doesn't.
Temporary
Never restart, regardless of termination reason:
The child runs once. If it terminates (normal or crash), it stays terminated. Use this for initialization tasks or processes that shouldn't be restarted automatically.
Permanent
Always restart, regardless of termination reason:
Even gen.TerminateReasonNormal triggers restart. Use this for critical processes that must always be running - maybe a health monitor or connection manager that should never stop.
With Permanent strategy, DisableAutoShutdown is ignored, and the Significant flag has no effect - every child termination triggers restart.
Restart Intensity
Restarts aren't free. If a child crashes repeatedly, restarting it repeatedly just wastes resources. The Intensity and Period options limit restart frequency:
The supervisor tracks restart timestamps (in milliseconds). When a child terminates and needs restart, the supervisor checks: have there been more than Intensity restarts in the last Period seconds? If yes, the restart intensity is exceeded. The supervisor stops all children and terminates itself with act.ErrSupervisorRestartsExceeded.
Old restarts outside the period window are discarded from tracking. This is a sliding window: if your child crashes 5 times in 10 seconds, then runs stable for 11 seconds, then crashes again - the counter resets. It's 1 restart in the window, not 6 total.
Default values are Intensity: 5 and Period: 5 if you don't specify them.
Significant Children
In All For One and Rest For One supervisors, the Significant flag marks children whose termination can trigger supervisor shutdown:
With SupervisorStrategyTransient:
Significant child terminates normally → supervisor stops all children and terminates
Non-significant child → restart strategy applies regardless of termination reason
With SupervisorStrategyTemporary:
Significant child terminates (any reason) → supervisor stops all children and terminates
Non-significant child → no restart, child stays terminated
With SupervisorStrategyPermanent:
Significant flag is ignored
All terminations trigger restart
For One For One and Simple One For One, Significant is always ignored.
Use significant children when a specific child's clean termination means "mission accomplished, shut down the subtree." Example: a batch processor that finishes its work and terminates normally should stop the entire supervision tree, not get restarted.
Auto Shutdown
By default, if all children terminate normally (not crashes) and none are significant, the supervisor stops itself with gen.TerminateReasonNormal. This is auto shutdown.
Enable DisableAutoShutdown to keep the supervisor running even with zero children:
Auto shutdown is ignored for Simple One For One supervisors (they're designed for dynamic children) and ignored when using Permanent strategy.
Use auto shutdown when your supervisor's purpose is managing those specific children. When they're all gone, the supervisor has no purpose. Disable it when the supervisor manages dynamically added children or should stay alive to accept management commands.
Keep Order
For All For One and Rest For One, the KeepOrder flag controls how children are stopped:
With KeepOrder: true:
Children stop one at a time, last to first
Supervisor waits for each child to fully terminate before stopping the next
Slow but orderly - useful when children have shutdown dependencies
With KeepOrder: false (default):
All affected children receive SendExit simultaneously
They terminate in parallel
Fast but unordered - use when children can shut down independently
After stopping (either way), children restart sequentially in declaration order. KeepOrder only affects stopping, not starting.
For One For One and Simple One For One, KeepOrder is ignored (only one child is affected).
Dynamic Management
Supervisors provide methods for runtime adjustments:
Critical: These methods fail with act.ErrSupervisorStrategyActive if called while the supervisor is executing a restart strategy. The supervisor is in supStateStrategy mode - it's stopping children, waiting for exit signals, or starting replacements. You must wait for it to return to supStateNormal before making management calls.
When the supervisor is applying a strategy, it processes only the Urgent queue (where exit signals arrive) and ignores System and Main queues. This ensures exit signals are handled promptly without interference from management commands or regular messages.
For Simple One For One supervisors, StartChild with args stores those args for that specific child instance. When that instance restarts (due to crash, kill, etc.), it uses the stored args, not the template args from the spec. For other supervisor types (One For One, All For One, Rest For One), StartChild with args updates the spec's args for future restarts.
Child Callbacks
Enable EnableHandleChild: true to receive notifications when children start or stop:
These callbacks run after the restart strategy completes. For example:
Child crashes
Supervisor applies restart strategy (stops affected children if needed)
Supervisor starts replacement children
ThenHandleChildTerminate is called for the terminated child
ThenHandleChildStart is called for the replacement
The callbacks are invoked as regular messages sent by the supervisor to itself. They arrive in the Main queue, so they're processed after the restart logic (which happens in the exit signal handler).
If HandleChildStart or HandleChildTerminate returns an error, the supervisor terminates with that error. Use these callbacks for integration with external systems, not for restart decisions - restart logic is handled by the supervisor type and strategy.
Supervisor as a Regular Actor
Supervisors are actors. They have mailboxes, handle messages, and can communicate with other processes:
This lets you build management APIs: query supervisor state, scale children dynamically, reconfigure at runtime. The supervisor processes these messages between handling exit signals.
Observer Integration
Supervisors provide runtime inspection via the HandleInspect method, which is automatically integrated with the Observer monitoring tool. When you call gen.Process.Inspect() on a supervisor, it returns detailed metrics about its current state:
One For One / All For One / Rest For One:
type: Supervisor type ("One For One", "All For One", "Rest For One")
period: Time window in seconds for restart intensity
keep_order: Whether children stop sequentially (All/Rest For One only)
auto_shutdown: Whether supervisor stops when all children terminate
restarts_count: Number of restart timestamps currently tracked
children_total: Total child specs defined
children_running: Currently running children
children_disabled: Disabled children that won't restart
Simple One For One:
type: "Simple One For One"
strategy: Restart strategy
intensity: Maximum restart count within period
period: Time window in seconds
restarts_count: Number of restart timestamps tracked
specs_total: Total child spec templates
specs_disabled: Disabled specs
instances_total: Total running instances across all specs
child:<name>: Number of running instances for specific child spec
child:<name>:args: Number of instances with custom args for specific child spec
The Observer UI displays this information in real-time, letting you monitor supervision trees, track restart patterns, and identify failing components. You can also query this data programmatically:
Both methods only work for local supervisors (same node). This integration makes it easy to diagnose issues in production: check restart counts to identify unstable processes, verify child counts match expected scaling, monitor which instances have custom configurations.
Restart Intensity Behavior
Understanding restart intensity is critical for reliable systems. Here's exactly how it works:
The supervisor maintains a list of restart timestamps in milliseconds. When a child terminates and restart is needed:
Append current timestamp to the list
Remove timestamps older than Period seconds
If list length > Intensity, intensity is exceeded
If exceeded: stop all children, terminate supervisor with act.ErrSupervisorRestartsExceeded
If not exceeded: proceed with restart
Example with Intensity: 3, Period: 5:
But if the child runs stable between crashes:
The sliding window means intermittent failures don't accumulate. Only rapid repeated failures exceed intensity.
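The sliding-window bookkeeping can be sketched in plain Go. This is an illustrative simulation of the steps above, not the framework's code:

```go
package main

import "fmt"

// intensityExceeded mirrors the supervisor's bookkeeping: append the new
// restart timestamp, drop timestamps older than the period, and report
// whether more than `intensity` restarts remain in the window.
func intensityExceeded(restarts *[]int64, nowMs int64, intensity int, periodSec int64) bool {
	// Step 1: append the current timestamp.
	*restarts = append(*restarts, nowMs)
	// Step 2: remove timestamps older than Period seconds.
	cutoff := nowMs - periodSec*1000
	kept := (*restarts)[:0]
	for _, ts := range *restarts {
		if ts >= cutoff {
			kept = append(kept, ts)
		}
	}
	*restarts = kept
	// Step 3: exceeded if the list is longer than Intensity.
	return len(*restarts) > intensity
}

func main() {
	var restarts []int64
	// Rapid crashes with Intensity: 3, Period: 5 - the fourth crash at 3s
	// still has all previous timestamps inside the 5s window.
	fmt.Println(intensityExceeded(&restarts, 0, 3, 5))    // false
	fmt.Println(intensityExceeded(&restarts, 1000, 3, 5)) // false
	fmt.Println(intensityExceeded(&restarts, 2000, 3, 5)) // false
	fmt.Println(intensityExceeded(&restarts, 3000, 3, 5)) // true: exceeded

	// Stable gaps: each crash evicts the previous one from the window.
	restarts = nil
	fmt.Println(intensityExceeded(&restarts, 0, 3, 5))     // false
	fmt.Println(intensityExceeded(&restarts, 6000, 3, 5))  // false
	fmt.Println(intensityExceeded(&restarts, 12000, 3, 5)) // false
}
```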
Shutdown Behavior
When a supervisor terminates (receives exit signal, calls terminate from HandleMessage, or crashes), it stops all children first:
Send gen.TerminateReasonShutdown via SendExit to all running children
Wait for all children to terminate
Call Terminate callback
Remove supervisor from node
With KeepOrder: true (All For One / Rest For One), children stop sequentially. With KeepOrder: false, they stop in parallel. Either way, the supervisor waits for all to finish before terminating itself.
If a non-child process sends the supervisor an exit signal (via Link or SendExit), the supervisor initiates shutdown. This is how parent supervisors stop child supervisors - send an exit signal, and the entire subtree shuts down cleanly.
Dynamic Children (Simple One For One)
Simple One For One supervisors start with empty children and spawn them on demand:
Start instances with StartChild:
Each call spawns a new worker. The args passed to StartChild are stored for that specific instance. When the restart strategy triggers (child crashes, exceeds intensity, etc.), the child restarts with the same args it was originally started with, not the template args from the spec. This ensures each worker instance maintains its configuration across restarts.
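Putting this together, a Simple One For One setup might look like the following sketch. The field and constant names (act.SupervisorTypeSimpleOneForOne, act.SupervisorChildSpec) are assumed from the act package as described in this chapter; verify against the actual API:

```go
// Sketch only - names assumed from the act package described above.
func (s *WorkerSup) Init(args ...any) (act.SupervisorSpec, error) {
	return act.SupervisorSpec{
		Type: act.SupervisorTypeSimpleOneForOne,
		Children: []act.SupervisorChildSpec{
			// Template only - no children start until StartChild is called.
			{Name: "worker", Factory: createWorker},
		},
		Restart: act.SupervisorRestart{
			Intensity: 5,
			Period:    5,
		},
	}, nil
}

// Later: supervisor.StartChild("worker", "config-A") spawns one instance;
// each call stores its own args for use across restarts.
```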
Workers are not registered by name (no SpawnRegister). You track them by PID from the return value or via supervisor.Children().
Disabling a child spec stops all running instances with that spec name:
Simple One For One ignores DisableAutoShutdown - the supervisor never auto-shuts down, even with zero children. It's designed for dynamic workloads where zero children is a valid state.
Patterns and Pitfalls
Set restart intensity carefully. Too low and transient failures kill your supervisor. Too high and crash loops consume resources. Start with defaults (Intensity: 5, Period: 5) and tune based on observed behavior.
Use Significant sparingly. Marking a child significant couples its lifecycle to the entire supervision tree. This is powerful but reduces isolation. Prefer non-significant children and handle critical failures at a higher supervision level.
Don't call management methods during restart. StartChild, AddChild, EnableChild, DisableChild fail with ErrSupervisorStrategyActive if the supervisor is mid-restart. Wait for the restart to complete (check via Inspect or wait for HandleChildStart callback).
Disable auto shutdown for dynamic supervisors. If your supervisor uses AddChild to add children at runtime, enable DisableAutoShutdown. Otherwise, it terminates when it starts with zero children or when all dynamically added children eventually stop.
Use HandleChildStart for integration, not validation. By the time HandleChildStart is called, the child is already spawned and linked. Returning an error terminates the supervisor, but doesn't prevent the child from running. Use child's Init for validation instead.
KeepOrder is only for stopping. Children always start sequentially in declaration order. KeepOrder controls only the stopping phase of All For One and Rest For One restarts.
Simple One For One args are persistent per instance. Args passed to StartChild are stored and used for that specific instance across all restarts. If you start a worker with StartChild("worker", "config-A") and it crashes, the restarted instance receives "config-A" again, not the template args from the child spec. This persistence ensures each worker maintains its identity and configuration through failures. If you need different args for a restart, you must manually stop the old instance and start a new one with different args.
type Worker struct {
act.Actor
counter int
}
func (w *Worker) Init(args ...any) error {
w.counter = 0
w.Log().Info("worker %s starting", w.PID())
return nil
}
func (w *Worker) HandleMessage(from gen.PID, message any) error {
switch msg := message.(type) {
case IncrementRequest:
w.counter += msg.Amount
w.Send(from, IncrementResponse{Counter: w.counter})
}
return nil
}
func (w *Worker) Terminate(reason error) {
w.Log().Info("worker stopped: %s", reason)
}
// Factory function for spawning
func createWorker() gen.ProcessBehavior {
return &Worker{}
}
func (w *Worker) Terminate(reason error) {
w.Log().Info("worker %s stopping: %s", w.PID(), reason)
// Clean up resources
w.closeConnections()
w.sendFinalStats()
}
func (w *Worker) Init(args ...any) error {
w.SetTrapExit(true)
return nil
}
func (w *Worker) HandleMessage(from gen.PID, message any) error {
switch msg := message.(type) {
case gen.MessageExitPID:
w.Log().Info("linked process %s terminated: %s", msg.PID, msg.Reason)
// Decide how to handle it
if msg.Reason == gen.TerminateReasonPanic {
// Linked worker panicked, maybe restart it
w.restartWorker(msg.PID)
}
// Don't terminate - we're trapping
return nil
case gen.MessageExitNode:
w.Log().Warning("node %s disconnected", msg.Name)
// Handle network partition
return nil
}
return nil
}
func (w *Worker) Init(args ...any) error {
w.SetSplitHandle(true)
w.RegisterName("worker_service")
alias, _ := w.CreateAlias()
w.publicAPI = alias
return nil
}
func (w *Worker) HandleMessage(from gen.PID, message any) error {
// Messages sent to PID directly (internal use)
w.Log().Debug("internal message from %s", from)
return nil
}
func (w *Worker) HandleMessageName(name gen.Atom, from gen.PID, message any) error {
// Messages sent to registered name "worker_service" (public API)
w.Log().Info("public API call via name %s", name)
return nil
}
func (w *Worker) HandleMessageAlias(alias gen.Alias, from gen.PID, message any) error {
// Messages sent to alias (temporary session)
w.Log().Debug("session message via alias %s", alias)
return nil
}
func (w *Worker) HandleLog(message gen.MessageLog) error {
// Format and write log message
fmt.Printf("[%s] %s: %s\n", message.Level, message.PID, message.Message)
return nil
}
func (w *Worker) HandleEvent(message gen.MessageEvent) error {
switch message.Name {
case "config_updated":
w.reloadConfig()
case "cache_invalidated":
w.clearCache()
}
return nil
}
DisableAutoShutdown: false, // Default - supervisor stops when children stop
DisableAutoShutdown: true, // Supervisor stays alive with zero children
Restart: act.SupervisorRestart{
KeepOrder: true, // Stop sequentially in reverse order
}
// Start a child from the spec (if not already running)
err := supervisor.StartChild("worker")
// Start with different args (overrides spec)
err := supervisor.StartChild("worker", "new-config")
// Add a new child spec and start it
err := supervisor.AddChild(act.SupervisorChildSpec{
Name: "new_worker",
Factory: createWorker,
})
// Disable a child (stops it, won't restart on crash)
err := supervisor.DisableChild("worker")
// Re-enable a disabled child (starts it again)
err := supervisor.EnableChild("worker")
// Get list of children
children := supervisor.Children()
for _, child := range children {
fmt.Printf("Spec: %s, PID: %s, Disabled: %v\n",
child.Spec, child.PID, child.Disabled)
}
func (s *AppSupervisor) Init(args ...any) (act.SupervisorSpec, error) {
return act.SupervisorSpec{
EnableHandleChild: true,
// ... rest of spec
}, nil
}
func (s *AppSupervisor) HandleChildStart(name gen.Atom, pid gen.PID) error {
s.Log().Info("child %s started with PID %s", name, pid)
// Maybe register in service discovery, send init message
return nil
}
func (s *AppSupervisor) HandleChildTerminate(name gen.Atom, pid gen.PID, reason error) error {
s.Log().Info("child %s (PID %s) terminated: %s", name, pid, reason)
// Maybe deregister from service discovery, clean up resources
return nil
}
func (s *AppSupervisor) HandleMessage(from gen.PID, message any) error {
switch msg := message.(type) {
case ScaleCommand:
if msg.Up {
s.AddWorkers(msg.Count)
} else {
s.RemoveWorkers(msg.Count)
}
case HealthCheckRequest:
children := s.Children()
s.Send(from, HealthResponse{
Running: len(children),
Healthy: s.countHealthy(children),
})
}
return nil
}
func (s *AppSupervisor) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
switch request.(type) {
case GetChildrenRequest:
return s.Children(), nil
}
return nil, nil
}
// From within a process context
info, err := process.Inspect(supervisorPID)
// Directly from the node
info, err := node.Inspect(supervisorPID)
// Returns map[string]string with metrics above
Rapid crashes (Intensity: 3, Period: 5):
Time 0s: Child crashes → restart (count: 1)
Time 1s: Child crashes → restart (count: 2)
Time 2s: Child crashes → restart (count: 3)
Time 3s: Child crashes → EXCEEDED (count: 4 within 5s window)
→ Stop all children, supervisor terminates
Stable child (crashes spaced beyond the window):
Time 0s: Child crashes → restart (count: 1)
Time 6s: Child crashes → restart (count: 1, previous outside window)
Time 12s: Child crashes → restart (count: 1, previous outside window)
Network services need to accept TCP connections, read data from sockets, and write responses - all blocking operations that don't fit the one-message-at-a-time actor model. You could spawn goroutines for each connection, but that breaks actor isolation: you need synchronization and careful lifecycle management, and you lose the benefits of supervision trees.

TCP meta-processes solve this by wrapping socket I/O in actors. The framework handles accept loops, connection management, and data buffering. Your actors receive messages when connections arrive or data is read. To send data, you send a message to the connection's meta-process. The actor model stays intact while integrating with blocking TCP operations.
Ergo provides two TCP meta-processes: TCPServer for accepting connections, and TCPConnection for handling established connections (both incoming and outgoing).
UDP
UDP is fundamentally different from TCP. There are no connections, no ordering guarantees, no reliability. Datagrams arrive independently, potentially out of order, possibly duplicated, or lost entirely. This makes UDP simpler than TCP, but also requires different handling patterns.
Traditional UDP servers use blocking ReadFrom calls in loops. This doesn't fit the actor model's one-message-at-a-time processing. You could spawn goroutines to read packets, but this breaks actor isolation and requires manual synchronization.
UDP meta-process wraps the socket in an actor. It runs a read loop in the Start goroutine, sending each received datagram as a message to your actor. To send datagrams, you send messages to the UDP server's meta-process. The actor model stays intact while integrating with blocking UDP operations.
Unlike TCP, UDP has no connections. One meta-process handles the entire socket - all incoming datagrams from all remote addresses. There's no per-connection state, no connection lifecycle, no connect/disconnect messages. Just datagrams in, datagrams out.
type Order struct {
ID int64
Items []string
}
func init() {
edf.RegisterTypeOf(Order{}) // Analyzed once, functions built
}
// Later, during message sending:
process.Send(to, Order{ID: 42, Items: []string{"item1"}}) // Uses pre-built encoder
package orders
type OrderV1 struct { ID int64 } // #github.com/myapp/orders/OrderV1
type OrderV2 struct { ID int64; Priority int } // #github.com/myapp/orders/OrderV2
type Order struct {
ID int64
Items []string
}
func init() {
edf.RegisterTypeOf(Order{})
}
type Order struct {
ID int64 // Exported - part of the contract
items []Item // Unexported - internal state, registration fails
}
type Order struct {
ID int64
Cache *OrderCache // Registration fails - pointer is local optimization
}
type Address struct {
City string
Street string
}
type Person struct {
Name string
Address Address
}
func init() {
edf.RegisterTypeOf(Address{}) // register child first
edf.RegisterTypeOf(Person{}) // then parent
}
err := process.SendImportant(remotePID, message)
if err != nil {
// Definitely failed - remote process doesn't exist,
// or mailbox is full, or connection dropped
}
func main() {
// Start node
node, err := ergo.StartNode("gateway@localhost", gen.NodeOptions{})
if err != nil {
panic(err)
}
defer node.Stop()
// Start HTTP server with node reference
server := &APIServer{node: node}
if err := server.Start(); err != nil {
panic(err)
}
}
type APIServer struct {
node gen.Node
mux *http.ServeMux
}
func (a *APIServer) Start() error {
a.mux = http.NewServeMux()
a.mux.HandleFunc("/users/{id}", a.handleGetUser)
a.mux.HandleFunc("/orders", a.handleCreateOrder)
return http.ListenAndServe(":8080", a.mux)
}
func (a *APIServer) handleGetUser(w http.ResponseWriter, r *http.Request) {
userID := r.PathValue("id")
// Call actor anywhere in the cluster
result, err := a.node.Call(
gen.ProcessID{Name: "user-service", Node: "backend@node1"},
GetUserRequest{ID: userID},
)
if err != nil {
http.Error(w, "Service unavailable", http.StatusServiceUnavailable)
return
}
if errResult, ok := result.(error); ok {
http.Error(w, errResult.Error(), http.StatusNotFound)
return
}
user := result.(User)
json.NewEncoder(w).Encode(user)
}
// Same call works regardless of actor location
result, err := node.Call(
gen.ProcessID{Name: "user-service", Node: "backend@node1"},
request,
)
func (a *APIServer) handleGetUser(w http.ResponseWriter, r *http.Request) {
userID := r.PathValue("id")
// Route requests for the same user to the same node
// This improves cache locality - user data stays hot
nodeID := consistentHash(userID, a.clusterSize)
targetNode := fmt.Sprintf("backend@node%d", nodeID)
result, err := a.node.Call(
gen.ProcessID{Name: "user-service", Node: targetNode},
GetUserRequest{ID: userID},
)
// handle result...
}
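The consistentHash helper above is hypothetical; one minimal way to implement it is a stable hash modulo the cluster size. Note this is simple modulo hashing - keys move when clusterSize changes, unlike a true consistent-hash ring:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// consistentHash maps a key to a node index in [1, clusterSize].
// Hypothetical helper for the routing example above: FNV-1a hash of the
// key, taken modulo the cluster size.
func consistentHash(key string, clusterSize int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32())%clusterSize + 1
}

func main() {
	// The same user always routes to the same node.
	fmt.Println(consistentHash("user-42", 4) == consistentHash("user-42", 4)) // true
	n := consistentHash("user-42", 4)
	fmt.Println(n >= 1 && n <= 4) // true
}
```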
func (a *APIServer) handleRequest(w http.ResponseWriter, r *http.Request) {
// Application discovery requires central registrar (etcd or Saturn)
// See: networking/service-discovering.md
registrar, err := a.node.Network().Registrar()
if err != nil {
http.Error(w, "Service unavailable", http.StatusServiceUnavailable)
return
}
resolver := registrar.Resolver()
routes, err := resolver.ResolveApplication("user-service")
if err != nil || len(routes) == 0 {
http.Error(w, "Service unavailable", http.StatusServiceUnavailable)
return
}
// Select node based on weight, load, health, proximity
target := a.selectNode(routes)
result, err := a.node.Call(
gen.ProcessID{Name: "user-service", Node: target.Node},
GetUserRequest{ID: r.PathValue("id")},
)
// handle result...
}
func (a *APIServer) selectNode(routes []gen.ApplicationRoute) gen.ApplicationRoute {
// Weighted random selection
totalWeight := 0
for _, r := range routes {
totalWeight += r.Weight
}
if totalWeight <= 0 {
return routes[0] // guard: rand.Intn panics when totalWeight is 0
}
pick := rand.Intn(totalWeight)
for _, r := range routes {
pick -= r.Weight
if pick < 0 {
return r
}
}
return routes[0]
}
// Standard middleware works
func authMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
if !isAuthorized(r) {
http.Error(w, "Unauthorized", http.StatusUnauthorized)
return
}
next.ServeHTTP(w, r)
})
}
mux.Handle("/api/", authMiddleware(http.HandlerFunc(a.handleAPI)))
type WebService struct {
act.Actor
}
func (w *WebService) Init(args ...any) error {
// Spawn worker that will handle HTTP requests
_, err := w.SpawnRegister("web-worker",
func() gen.ProcessBehavior { return &WebWorker{} },
gen.ProcessOptions{},
)
if err != nil {
return err
}
// Create HTTP multiplexer
mux := http.NewServeMux()
// Create handler meta-process pointing to worker
handler := meta.CreateWebHandler(meta.WebHandlerOptions{
Worker: "web-worker",
RequestTimeout: 5 * time.Second,
})
// Spawn handler meta-process
handlerID, err := w.SpawnMeta(handler, gen.MetaOptions{})
if err != nil {
return err
}
// Register handler with mux (handler implements http.Handler)
// Standard middleware works - handler is just http.Handler
mux.Handle("/", authMiddleware(rateLimitMiddleware(handler)))
// Create web server meta-process
server, err := meta.CreateWebServer(meta.WebServerOptions{
Host: "localhost",
Port: 8080,
Handler: mux,
})
if err != nil {
return err
}
// Spawn server meta-process
serverID, err := w.SpawnMeta(server, gen.MetaOptions{})
if err != nil {
server.Terminate(err)
return err
}
w.Log().Info("HTTP server listening on :8080 (server=%s, handler=%s)",
serverID, handlerID)
return nil
}
type WebWorker struct {
act.Actor
}
func (w *WebWorker) HandleMessage(from gen.PID, message any) error {
request, ok := message.(meta.MessageWebRequest)
if !ok {
return nil
}
defer request.Done() // Always call Done to unblock ServeHTTP
// Process HTTP request
switch request.Request.Method {
case "GET":
user := w.getUserFromDB(request.Request.URL.Query().Get("id"))
json.NewEncoder(request.Response).Encode(user)
case "POST":
var order Order
if err := json.NewDecoder(request.Request.Body).Decode(&order); err != nil {
http.Error(request.Response, "invalid JSON body", http.StatusBadRequest)
return nil
}
w.createOrder(order)
request.Response.WriteHeader(http.StatusCreated)
default:
http.Error(request.Response, "Method not supported", http.StatusMethodNotAllowed)
}
return nil
}
type WebWorkerPool struct {
act.Pool
}
func (p *WebWorkerPool) Init(args ...any) (act.PoolOptions, error) {
return act.PoolOptions{
PoolSize: 20, // 20 concurrent workers
WorkerMailboxSize: 10, // Each worker queues up to 10 requests
WorkerFactory: func() gen.ProcessBehavior { return &WebWorker{} },
}, nil
}
func (w *WebService) Init(args ...any) error {
// Spawn pool with registered name
_, err := w.SpawnRegister("web-worker",
func() gen.ProcessBehavior { return &WebWorkerPool{} },
gen.ProcessOptions{},
)
if err != nil {
return err
}
handler := meta.CreateWebHandler(meta.WebHandlerOptions{
Worker: "web-worker",
})
_, err = w.SpawnMeta(handler, gen.MetaOptions{})
// rest of setup...
}
// Chat room broadcasts to all connected clients
for _, connAlias := range room.connections {
room.Send(connAlias, ChatMessage{From: sender, Text: text})
}
// Game server on node1 pushes update to player connection on node2
gameServer.Send(playerConnAlias, StateUpdate{HP: hp, Position: pos})
// Backend actor pushes notification to user's browser
backend.Send(userConnAlias, Notification{Text: "Task completed"})
TCP Server: Accepting Connections
Create a TCP server with meta.CreateTCPServer:
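A sketch of creating and spawning the server from an actor's Init. The option fields are assumed by analogy with the web server options shown earlier; verify against the meta package:

```go
// Sketch - option field names assumed; verify against the meta package.
server, err := meta.CreateTCPServer(meta.TCPServerOptions{
	Host: "localhost",
	Port: 9090,
})
if err != nil {
	return err
}
serverID, err := w.SpawnMeta(server, gen.MetaOptions{})
if err != nil {
	// Must terminate on failure, or the listening socket stays bound.
	server.Terminate(err)
	return err
}
w.Log().Info("tcp server started: %s", serverID)
```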
The server opens a TCP socket and enters an accept loop. When a connection arrives, the server spawns a new TCPConnection meta-process to handle it. Each connection runs in its own meta-process, isolated from other connections.
If SpawnMeta fails, you must call server.Terminate(err) to close the listening socket. Without this, the port remains bound and unusable until the process exits.
The server runs forever, accepting connections and spawning handlers. When the parent actor terminates, the server terminates too (cascading termination), closing the listening socket and stopping all connection handlers.
TCP Connection: Handling I/O
When the server accepts a connection, it automatically spawns a TCPConnection meta-process. This meta-process reads data from the socket and sends it to your actor. To write data, you send messages to the connection's meta-process.
MessageTCPConnect arrives when the connection is established. It contains the connection's meta-process ID (m.ID), remote address, and local address. Save the ID if you need to track connections or send data later.
MessageTCP arrives when data is read from the socket. m.Data contains the bytes read (up to ReadBufferSize at a time). To send data, send a MessageTCP back to the connection's ID. The meta-process writes it to the socket.
MessageTCPDisconnect arrives when the connection closes (client disconnected, network error, or you terminated the connection). After this, the connection meta-process is dead - sending to its ID returns an error.
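A connection handler covering all three messages might look like this sketch - an echo server that also tracks live connections. The RemoteAddr field name is assumed from the description above:

```go
// Sketch - message field names assumed from the descriptions above.
func (w *ConnWorker) HandleMessage(from gen.PID, message any) error {
	switch m := message.(type) {
	case meta.MessageTCPConnect:
		// Save the connection's meta-process ID for later sends.
		w.Log().Info("connection %s from %s", m.ID, m.RemoteAddr)
		w.connections[m.ID] = struct{}{}
	case meta.MessageTCP:
		// Echo the received bytes back to the same connection.
		w.Send(m.ID, meta.MessageTCP{Data: m.Data})
	case meta.MessageTCPDisconnect:
		// Always clean up - the meta-process behind m.ID is gone.
		delete(w.connections, m.ID)
	}
	return nil
}
```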
If the connection meta-process cannot send messages to your actor (actor crashed, mailbox full), it terminates the connection and stops. This ensures failed actors don't leak connections.
Routing to Workers
By default, all connections send messages to the parent actor - the one that spawned the server. For a server handling many connections, this creates a bottleneck. All connections compete for the parent's mailbox, and messages are processed sequentially.
Use ProcessPool to distribute connections across multiple workers:
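Assuming a ProcessPool field on the server options (a sketch - the worker names match the round-robin description below, and each worker must be spawned before the server starts):

```go
// Sketch - field name assumed from this section's description.
server, err := meta.CreateTCPServer(meta.TCPServerOptions{
	Host: "localhost",
	Port: 9090,
	ProcessPool: []gen.Atom{
		"tcp_worker_0", "tcp_worker_1", "tcp_worker_2", // ... up to tcp_worker_9
	},
})
```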
The server distributes connections round-robin across the pool. Connection 1 goes to tcp_worker_0, connection 2 goes to tcp_worker_1, and so on. After tcp_worker_9, it wraps back to tcp_worker_0.
Each worker handles its connections independently. If a worker crashes, its connections terminate (they can't send messages anymore). The supervisor restarts the worker, which begins handling new connections. The distribution is stateless - the server doesn't track which worker handles which connection.
Do not use act.Pool in ProcessPool. act.Pool forwards messages to any available worker, breaking the connection-to-worker binding. If connection A sends message 1 to worker X and message 2 to worker Y, the protocol state becomes corrupted. Use a list of individual process names instead.
Workers are typically actors that maintain per-connection state:
Client Connections
To initiate outgoing TCP connections, use meta.CreateTCPConnection:
CreateTCPConnection connects to the remote host immediately. If the connection fails (host unreachable, connection refused), it returns an error. If successful, it returns a meta-process behavior ready to spawn.
The spawned meta-process sends MessageTCPConnect when ready, then streams received data as MessageTCP messages. To send data, send MessageTCP to the connection's ID.
Client connections use the same TCPConnection meta-process as server-side connections. The only difference is how they're created: CreateTCPConnection initiates a connection, while the server spawns connections automatically on accept.
Chunking: Message Framing
Raw TCP is a byte stream, not a message stream. If you send two 100-byte messages, they might arrive as one 200-byte read, or three reads (150 bytes, 40 bytes, 10 bytes). You must frame messages to detect boundaries.
Enable chunking for automatic framing:
Fixed-length messages:
Every MessageTCP contains exactly 256 bytes. The meta-process buffers reads until 256 bytes accumulate, then sends them. If a socket read returns 512 bytes, you receive two MessageTCP messages.
Header-based messages:
The meta-process reads the 4-byte header, extracts the length as a big-endian integer, waits for the full payload, then sends the complete message (header + payload) as one MessageTCP.
Protocol example:
You receive:
First MessageTCP: 14 bytes (4 + 10)
Second MessageTCP: 260 bytes (4 + 256)
If both messages arrive in one socket read (274 bytes total), the meta-process splits them automatically. If the header arrives first and the payload arrives later (slow connection), the meta-process waits for the complete message.
MaxLength protects against malformed or malicious messages. If the header claims a message is 4GB, the meta-process terminates with gen.ErrTooLarge instead of allocating 4GB of memory.
HeaderLengthSize can be 1, 2, or 4 bytes (big-endian). HeaderLengthPosition specifies the offset within the header. Example for a protocol with type + flags + length:
Without chunking, you receive raw bytes as the meta-process reads them. You must buffer and frame messages yourself - typically by accumulating data in your actor's state and detecting message boundaries manually.
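Manual framing with a 4-byte big-endian length prefix can be sketched in plain Go - the same accumulate-and-split logic the chunking option automates:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// splitFrames extracts complete length-prefixed messages from buf and
// returns them along with the leftover bytes. Each frame is a 4-byte
// big-endian payload length followed by the payload.
func splitFrames(buf []byte) (frames [][]byte, rest []byte) {
	for {
		if len(buf) < 4 {
			return frames, buf // incomplete header - wait for more data
		}
		payloadLen := int(binary.BigEndian.Uint32(buf))
		total := 4 + payloadLen
		if len(buf) < total {
			return frames, buf // incomplete payload - wait for more data
		}
		frames = append(frames, buf[:total])
		buf = buf[total:]
	}
}

// frame prepends the 4-byte big-endian length header to a payload.
func frame(payload []byte) []byte {
	msg := make([]byte, 4+len(payload))
	binary.BigEndian.PutUint32(msg, uint32(len(payload)))
	copy(msg[4:], payload)
	return msg
}

func main() {
	// Two messages (10-byte and 256-byte payloads) arriving in one read.
	stream := append(frame(make([]byte, 10)), frame(make([]byte, 256))...)
	frames, rest := splitFrames(stream)
	fmt.Println(len(frames), len(frames[0]), len(frames[1]), len(rest)) // 2 14 260 0

	// Partial read: header arrived, payload still in flight.
	frames, rest = splitFrames(stream[:8])
	fmt.Println(len(frames), len(rest)) // 0 8
}
```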
Buffer Management
The meta-process allocates buffers for reading socket data. By default, each read allocates a new buffer, which becomes garbage after you process it. For high-throughput servers, this causes GC pressure.
Use a buffer pool:
The meta-process gets buffers from the pool when reading. When you receive MessageTCP, the Data field is a buffer from the pool. Return it to the pool after processing:
When you send MessageTCP to write data, the meta-process automatically returns the buffer to the pool after writing (if a pool is configured). Don't use the buffer after sending.
If you need to store data beyond the current message, copy it:
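The ownership rules can be illustrated with a plain sync.Pool standing in for the configured buffer pool (an illustrative sketch, not the framework's pool type):

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool stands in for the meta-process buffer pool configured in the
// options - the discipline is the same: copy what you keep, return the rest.
var bufPool = sync.Pool{
	New: func() any { return make([]byte, 4096) },
}

func handleData(data []byte, saved *[][]byte) {
	// Need the bytes later? Copy them - the pooled buffer will be reused.
	keep := make([]byte, len(data))
	copy(keep, data)
	*saved = append(*saved, keep)
	// Done with the pooled buffer: return it so the next read can reuse it.
	bufPool.Put(data[:cap(data)])
}

func main() {
	var saved [][]byte
	buf := bufPool.Get().([]byte)[:5]
	copy(buf, "hello")
	handleData(buf, &saved)

	// The pooled buffer may be reused and overwritten by the next read...
	next := bufPool.Get().([]byte)[:5]
	copy(next, "world")
	// ...but the saved copy is unaffected.
	fmt.Println(string(saved[0]), bytes.Equal(saved[0], []byte("hello"))) // hello true
}
```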
Buffer pools are essential for servers handling thousands of connections or high throughput. For low-volume clients, the GC overhead is negligible - skip the pool for simplicity.
Write Keepalive
Some protocols require periodic writes to keep connections alive. If no data is sent for a timeout period, the peer disconnects. You could send keepalive messages with timers, but this is tedious and error-prone.
Enable automatic keepalive:
The meta-process wraps the socket with a keepalive writer. If nothing is written for 30 seconds, it automatically sends a null byte. The peer receives it as normal data. Design your protocol to ignore keepalive messages.
Keepalive bytes can be anything: a ping message, a heartbeat packet, or a protocol-specific keepalive. The peer sees them as regular socket data.
This is application-level keepalive (layer 7), not TCP keepalive (layer 4). Both can be used simultaneously.
TCP Keepalive (OS-Level)
TCP has built-in keepalive at the protocol level. Enable it with KeepAlivePeriod:
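Assuming the KeepAlivePeriod field named here lives on the server options (a sketch - verify the exact field placement against the meta package):

```go
// Sketch - field placement assumed; the option name is from this section.
server, err := meta.CreateTCPServer(meta.TCPServerOptions{
	Host:            "localhost",
	Port:            9090,
	KeepAlivePeriod: 60 * time.Second, // OS probes every 60s when idle
	// 0 disables TCP keepalive (default); -1 uses the OS default
})
```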
The OS sends TCP keepalive probes every 60 seconds when the connection is idle. If the peer doesn't respond, the connection is closed. This detects dead connections (network partition, crashed peer) without application involvement.
Set KeepAlivePeriod to 0 to disable TCP keepalive (default). Set it to -1 for OS default behavior (typically 2 hours on Linux, varies by platform).
TCP keepalive (OS-level) and write buffer keepalive (application-level) serve different purposes:
TCP keepalive: Detects dead connections
Write keepalive: Satisfies application protocols that require periodic data
Most servers need TCP keepalive to clean up dead connections. Some protocols also need write keepalive to satisfy their requirements.
TLS Encryption
Enable TLS with a certificate manager:
The server wraps accepted connections with TLS. The certificate manager provides certificates dynamically (for SNI, certificate rotation, etc.). See CertManager for details.
For client connections:
The client establishes a TLS connection during CreateTCPConnection. By default, the client verifies the server's certificate. To skip verification (testing only):
Never use InsecureSkipVerify in production. It disables certificate validation, making you vulnerable to man-in-the-middle attacks.
With TLS enabled, data is encrypted automatically. Your actor sends and receives plaintext MessageTCP - the meta-process handles encryption/decryption transparently.
Process Routing
For both server and client connections, you can route messages to a specific process:
If Process is not set (client) or ProcessPool is empty (server), messages go to the parent actor.
For servers, ProcessPool enables load distribution. For clients, Process enables separation of concerns - the actor that initiates connections doesn't need to handle the protocol.
Inspection
TCP meta-processes support inspection for debugging:
Use this for monitoring, debugging, or displaying connection status in management interfaces.
Patterns and Pitfalls
Pattern: Connection registry
Track all active connections. Useful for monitoring, rate limiting, or forced disconnection.
Pattern: Protocol state machine
Maintain per-connection protocol state for complex protocols with multiple stages (handshake, authentication, data transfer).
Pattern: Broadcast to all connections
Send the same data to all active connections. Useful for chat servers, pub/sub systems, or monitoring dashboards.
Pitfall: Not handling MessageTCPDisconnect
After disconnect, the connection state remains in memory forever. Always clean up on disconnect.
Pitfall: act.Pool in ProcessPool
If worker_pool is an act.Pool, messages from one connection are distributed across multiple workers. Connection A's messages might go to worker 1, then worker 2, then worker 1 again. Protocol state is split across workers, causing corruption.
Use individual process names, not pools.
Pitfall: Blocking in message handler
If the worker handles multiple connections, one slow operation blocks all of them. The worker can't process messages from other connections while blocked.
Solution: Spawn a goroutine for slow operations, or use a worker pool (one worker per connection).
Pitfall: Forgetting to return buffers
Pool buffers are reused. Storing them directly leads to data corruption. Always copy, then return.
TCP meta-processes handle the complexity of socket I/O, connection management, and buffering - letting you focus on protocol implementation while maintaining the actor model's isolation and supervision benefits.
Creating a UDP Server
Create a UDP server with meta.CreateUDPServer:
The server opens a UDP socket and enters a read loop. For each received datagram, it sends MessageUDP to your actor. Your actor processes it and optionally sends a response by sending MessageUDP back to the server's meta-process ID.
If SpawnMeta fails, call server.Terminate(err) to close the socket. Without this, the port remains bound until the process exits.
The server runs forever, reading datagrams and forwarding them as messages. When the parent actor terminates, the server terminates too (cascading termination), closing the socket.
Handling Datagrams
The UDP server sends MessageUDP for each received datagram:
MessageUDP contains:
ID: The UDP server's meta-process ID (same for all datagrams)
Addr: Remote address that sent this datagram (net.Addr - typically *net.UDPAddr)
Data: The datagram payload (up to BufferSize bytes)
To send a datagram, send MessageUDP to the server's ID with the destination address and payload. The server writes it to the socket with WriteTo. The ID field is ignored when sending (it's only used for incoming datagrams).
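A minimal echo handler might look like this sketch; the MessageUDP fields follow the list above:

```go
// Sketch - echoes each datagram back to its sender.
func (w *UDPWorker) HandleMessage(from gen.PID, message any) error {
	m, ok := message.(meta.MessageUDP)
	if !ok {
		return nil
	}
	w.Log().Debug("datagram from %s: %d bytes", m.Addr, len(m.Data))
	// Send to the server's meta-process ID with the destination address
	// set; the ID field itself is ignored on outgoing datagrams.
	w.Send(m.ID, meta.MessageUDP{Addr: m.Addr, Data: m.Data})
	return nil
}
```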
Unlike TCP:
No connect/disconnect messages - datagrams are independent
Addr changes for each datagram - track remote addresses yourself if needed
No message framing - each UDP datagram is a complete message
No ordering guarantees - process datagrams as they arrive
Connectionless Nature
UDP has no connections. Each datagram is independent. The same remote address might send multiple datagrams, but there's no session state. If you need state per remote address, maintain it yourself:
Because UDP has no connection lifecycle, you need application-level timeout logic to clean up stale state. The server doesn't know when clients "disconnect" - they just stop sending datagrams.
Routing to Workers
By default, all datagrams go to the parent actor. For servers handling high datagram rates, this creates a bottleneck. Use Process to route to a different handler:
All datagrams go to metrics_collector instead of the parent. This enables separation of concerns - the actor that creates the UDP server doesn't need to handle datagrams.
Unlike TCP's ProcessPool, UDP only has a single Process field. You can route to an act.Pool:
Each datagram is forwarded to the pool, which distributes them across workers. This works for UDP because datagrams are independent - there's no per-connection state to corrupt. For TCP, ProcessPool uses round-robin to maintain connection-to-worker binding. For UDP, the pool can distribute freely.
Use pools when datagram processing is CPU-intensive or slow (database writes, external API calls). Workers process datagrams in parallel, maximizing throughput.
Buffer Management
The UDP server allocates a buffer for each datagram read. By default, it allocates a new buffer every time, which becomes garbage after you process it. For high datagram rates, this causes GC pressure.
Use a buffer pool:
The server gets buffers from the pool when reading. When you receive MessageUDP, the Data field is a buffer from the pool. Return it to the pool after processing:
When you send MessageUDP to write a datagram, the server automatically returns the buffer to the pool after writing (if a pool is configured). Don't use the buffer after sending.
If you need to store data beyond the current message, copy it:
Buffer pools are essential for servers receiving thousands of datagrams per second. For low-volume servers (a few datagrams per second), the GC overhead is negligible - skip the pool for simplicity.
Buffer Size
UDP datagrams are limited by the network's Maximum Transmission Unit (MTU). IPv4 networks typically have a 1500-byte MTU; IPv6 guarantees a 1280-byte minimum. After subtracting the IP and UDP headers (28 bytes for IPv4: a 20-byte IP header plus an 8-byte UDP header; 48 bytes for IPv6: 40 + 8), you get 1472 bytes of payload on typical IPv4 networks and 1232 bytes at the IPv6 minimum MTU.
Datagrams larger than MTU are fragmented at the IP layer. Fragmented datagrams are reassembled by the receiving OS before ReadFrom returns. However, if any fragment is lost, the entire datagram is discarded - UDP reliability degrades.
The default BufferSize is 65000 bytes (close to UDP's theoretical maximum of 65507 bytes). This handles any UDP datagram, but it's wasteful if your protocol uses smaller messages:
If a datagram is larger than BufferSize, it's truncated - you receive only the first BufferSize bytes. The rest is discarded. Set BufferSize to the maximum expected datagram size for your protocol.
Smaller buffers reduce memory usage (important with buffer pools). Larger buffers avoid truncation but waste memory if datagrams are typically small.
No Chunking
Unlike TCP, the UDP meta-process has no chunking support. UDP datagrams are atomic - each datagram is a complete message. There's no byte stream to split or reassemble. The protocol boundary is the datagram boundary.
If your protocol sends multi-datagram messages, you must handle reassembly yourself:
UDP delivers datagrams out of order. Fragment 2 might arrive before fragment 1. Your reassembly logic must handle this. Use sequence numbers, timeouts for incomplete sets, and protection against memory exhaustion (limit maximum incomplete messages).
Most UDP protocols avoid multi-datagram messages entirely. Keep messages under MTU size for reliability and simplicity.
Loss tolerance: Don't rely on every datagram arriving. Either accept loss (game state updates, sensor readings) or implement application-level acknowledgment and retransmission.
Duplicate tolerance: Process datagrams idempotently. If the same datagram arrives twice, the result is the same. Use sequence numbers to detect and discard duplicates:
Reordering tolerance: Don't assume datagrams arrive in send order. Use timestamps or sequence numbers to handle reordering:
Corruption detection: UDP has a 16-bit checksum, but it's weak. Critical data should have application-level integrity checks (CRC32, hash, signature).
Most importantly: design your protocol so datagram loss doesn't break functionality. UDP is for scenarios where loss is acceptable (real-time updates) or where you implement your own reliability layer (QUIC, custom protocols).
Inspection
UDP server supports inspection for debugging:
Use this for monitoring datagram counts, bandwidth usage, or displaying server status.
Patterns and Pitfalls
Pattern: Metrics aggregation
Aggregate many datagrams into periodic summaries. Lossy protocols (like StatsD) rely on volume - losing a few datagrams doesn't affect aggregate accuracy.
Pattern: Request-response with timeout
Implement application-level reliability with timeouts and retries. UDP doesn't guarantee delivery, so you must detect and handle failures.
Pattern: Broadcast responder
Respond to broadcast discovery requests. Track sender address from MessageUDP.Addr and reply directly.
Pitfall: Not returning buffers
Pool buffers are reused immediately. Storing them leads to data corruption when the pool reuses the buffer for the next datagram.
Pitfall: Assuming reliability
Some chunks will be lost. The server waits forever for missing chunks, or processes incomplete data. Either accept loss (send redundant data) or implement acknowledgment and retransmission.
Pitfall: Large datagrams
IP-level fragmentation significantly increases loss probability. If any fragment is lost, the entire datagram is discarded. Keep datagrams under 1472 bytes for reliability, or 512 bytes for internet-wide compatibility.
Pitfall: Not handling duplicates
Network equipment can duplicate UDP datagrams (switch mirroring, retransmission logic). Process commands idempotently or track sequence numbers.
The UDP meta-process handles the complexity of socket I/O and datagram delivery while maintaining actor isolation. Design your protocol for UDP's unreliable, unordered, connectionless nature - and leverage its simplicity and low latency where reliability isn't critical.
type EchoServer struct {
act.Actor
}
func (e *EchoServer) Init(args ...any) error {
options := meta.TCPServerOptions{
Host: "0.0.0.0", // Listen on all interfaces
Port: 8080,
}
server, err := meta.CreateTCPServer(options)
if err != nil {
return fmt.Errorf("failed to create TCP server: %w", err)
}
// Start the server meta-process
serverID, err := e.SpawnMeta(server, gen.MetaOptions{})
if err != nil {
// Failed to spawn - close the listening socket
server.Terminate(err)
return fmt.Errorf("failed to spawn TCP server: %w", err)
}
e.Log().Info("TCP server listening on %s:%d (id: %s)",
options.Host, options.Port, serverID)
return nil
}
func (e *EchoServer) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
case meta.MessageTCPConnect:
// New connection established
e.Log().Info("client connected: %s -> %s (id: %s)",
m.RemoteAddr, m.LocalAddr, m.ID)
// Send welcome message
e.Send(m.ID, meta.MessageTCP{
Data: []byte("Welcome to echo server!\n"),
})
case meta.MessageTCP:
// Received data from client
e.Log().Info("received %d bytes from %s", len(m.Data), m.ID)
// Echo it back
e.Send(m.ID, meta.MessageTCP{
Data: m.Data,
})
case meta.MessageTCPDisconnect:
// Connection closed
e.Log().Info("client disconnected: %s", m.ID)
}
return nil
}
type TCPDispatcher struct {
act.Actor
}
func (d *TCPDispatcher) Init(args ...any) error {
// Start workers and collect their registered names for the pool
workers := make([]gen.Atom, 10)
for i := range workers {
name := gen.Atom(fmt.Sprintf("tcp_worker_%d", i))
if _, err := d.SpawnRegister(name, createWorker, gen.ProcessOptions{}); err != nil {
return err
}
workers[i] = name
}
// Configure server with the worker pool
options := meta.TCPServerOptions{
Port: 8080,
ProcessPool: workers,
}
server, err := meta.CreateTCPServer(options)
if err != nil {
return err
}
_, err = d.SpawnMeta(server, gen.MetaOptions{})
if err != nil {
server.Terminate(err)
return err
}
return nil
}
type TCPWorker struct {
act.Actor
connections map[gen.Alias]*ConnectionState
}
type ConnectionState struct {
remoteAddr net.Addr
buffer []byte
// ... protocol state
}
func (w *TCPWorker) Init(args ...any) error {
// Initialize the connection table - writing to a nil map panics
w.connections = make(map[gen.Alias]*ConnectionState)
return nil
}
func (w *TCPWorker) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
case meta.MessageTCPConnect:
w.connections[m.ID] = &ConnectionState{
remoteAddr: m.RemoteAddr,
}
case meta.MessageTCP:
state := w.connections[m.ID]
w.processData(m.ID, state, m.Data)
case meta.MessageTCPDisconnect:
delete(w.connections, m.ID)
}
return nil
}
type HTTPClient struct {
act.Actor
connID gen.Alias
}
func (c *HTTPClient) Init(args ...any) error {
options := meta.TCPConnectionOptions{
Host: "example.com",
Port: 80,
}
connection, err := meta.CreateTCPConnection(options)
if err != nil {
return fmt.Errorf("failed to connect: %w", err)
}
connID, err := c.SpawnMeta(connection, gen.MetaOptions{})
if err != nil {
connection.Terminate(err)
return fmt.Errorf("failed to spawn connection: %w", err)
}
c.connID = connID
return nil
}
func (c *HTTPClient) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
case meta.MessageTCPConnect:
// Connection established, send HTTP request
request := "GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
c.Send(m.ID, meta.MessageTCP{
Data: []byte(request),
})
case meta.MessageTCP:
// Received HTTP response
c.Log().Info("response: %s", string(m.Data))
case meta.MessageTCPDisconnect:
// Server closed connection
c.Log().Info("connection closed by server")
}
return nil
}
options := meta.TCPServerOptions{
Port: 8080,
ReadChunk: meta.ChunkOptions{
Enable: true,
FixedLength: 256, // Every message is exactly 256 bytes
},
}
options := meta.TCPServerOptions{
Port: 8080,
ReadBufferSize: 8192,
ReadChunk: meta.ChunkOptions{
Enable: true,
// Protocol: [4-byte length][payload]
HeaderSize: 4,
HeaderLengthPosition: 0,
HeaderLengthSize: 4,
HeaderLengthIncludesHeader: false, // Length is payload only
MaxLength: 1048576, // Max 1MB per message
},
}
func (w *Worker) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
case meta.MessageTCP:
// Process data
result := w.processPacket(m.Data)
// Send response
w.Send(m.ID, meta.MessageTCP{Data: result})
// Return read buffer to pool
bufferPool.Put(m.Data)
}
return nil
}
case meta.MessageTCP:
state := w.connections[m.ID]
// Store in connection state - must copy
state.buffer = append(state.buffer, m.Data...)
// Return original buffer
bufferPool.Put(m.Data)
type ProtocolHandler struct {
act.Actor
connections map[gen.Alias]*ProtocolState
}
type ProtocolState struct {
state int // Current state in protocol state machine
buffer []byte
}
func (h *ProtocolHandler) Init(args ...any) error {
// Initialize the connection table - writing to a nil map panics
h.connections = make(map[gen.Alias]*ProtocolState)
return nil
}
func (h *ProtocolHandler) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
case meta.MessageTCPConnect:
h.connections[m.ID] = &ProtocolState{state: STATE_INITIAL}
case meta.MessageTCP:
state := h.connections[m.ID]
state.buffer = append(state.buffer, m.Data...)
// Process buffered data according to current state
for {
complete, nextState := h.processState(m.ID, state)
if !complete {
break
}
state.state = nextState
}
bufferPool.Put(m.Data)
}
return nil
}
func (m *ConnectionManager) broadcastMessage(data []byte) {
for connID := range m.connections {
m.Send(connID, meta.MessageTCP{Data: data})
}
}
// WRONG: Connection state leaked
func (w *Worker) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
case meta.MessageTCPConnect:
w.connections[m.ID] = &State{}
case meta.MessageTCP:
w.connections[m.ID].process(m.Data)
// No MessageTCPDisconnect handler!
}
return nil
}
// WRONG: Protocol state corrupted
options := meta.TCPServerOptions{
ProcessPool: []gen.Atom{"worker_pool"}, // Don't use act.Pool!
}
// WRONG: Blocks actor, stalls other connections
func (w *Worker) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
case meta.MessageTCP:
// Slow database query
result := w.db.Query("SELECT * FROM large_table")
w.Send(m.ID, meta.MessageTCP{Data: result})
}
return nil
}
// WRONG: Buffer leaked
func (w *Worker) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
case meta.MessageTCP:
// Store data, never return buffer
w.dataQueue = append(w.dataQueue, m.Data)
}
return nil
}
// CORRECT: Copy if storing
func (w *Worker) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
case meta.MessageTCP:
copied := make([]byte, len(m.Data))
copy(copied, m.Data)
w.dataQueue = append(w.dataQueue, copied)
bufferPool.Put(m.Data)
}
return nil
}
type DNSServer struct {
act.Actor
udpID gen.Alias
}
func (d *DNSServer) Init(args ...any) error {
options := meta.UDPServerOptions{
Host: "0.0.0.0",
Port: 53,
BufferSize: 512, // DNS messages are typically small
}
server, err := meta.CreateUDPServer(options)
if err != nil {
return fmt.Errorf("failed to create UDP server: %w", err)
}
udpID, err := d.SpawnMeta(server, gen.MetaOptions{})
if err != nil {
// Failed to spawn - close the socket
server.Terminate(err)
return fmt.Errorf("failed to spawn UDP server: %w", err)
}
d.udpID = udpID
d.Log().Info("DNS server listening on %s:%d (id: %s)",
options.Host, options.Port, udpID)
return nil
}
func (d *DNSServer) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
case meta.MessageUDP:
// Received UDP datagram
d.Log().Info("received %d bytes from %s", len(m.Data), m.Addr)
// Parse DNS query
query, err := d.parseDNSQuery(m.Data)
if err != nil {
d.Log().Warning("invalid DNS query from %s: %s", m.Addr, err)
return nil
}
// Build DNS response
response := d.buildDNSResponse(query)
// Send response back to the same address
d.Send(d.udpID, meta.MessageUDP{
Addr: m.Addr,
Data: response,
})
}
return nil
}
type GameServer struct {
act.Actor
udpID gen.Alias
players map[string]*PlayerState // Key: remote address string
}
type PlayerState struct {
addr net.Addr
lastSeen time.Time
position Vector3
health int
}
func (g *GameServer) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
case meta.MessageUDP:
addrStr := m.Addr.String()
// Get or create player state
player, exists := g.players[addrStr]
if !exists {
player = &PlayerState{
addr: m.Addr,
health: 100,
}
g.players[addrStr] = player
g.Log().Info("new player: %s", addrStr)
}
// Update last seen
player.lastSeen = time.Now()
// Process game packet
g.processGamePacket(player, m.Data)
case CleanupTick:
// Remove stale players
now := time.Now()
for addr, player := range g.players {
if now.Sub(player.lastSeen) > 30*time.Second {
delete(g.players, addr)
g.Log().Info("player timeout: %s", addr)
}
}
}
return nil
}
func (s *StatsServer) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
case meta.MessageUDP:
// Process datagram
s.processMetric(m.Data)
// Return buffer to pool
bufferPool.Put(m.Data)
}
return nil
}
case meta.MessageUDP:
// Store in queue - must copy
copied := make([]byte, len(m.Data))
copy(copied, m.Data)
s.queue = append(s.queue, copied)
// Return original buffer
bufferPool.Put(m.Data)
// DNS server - queries rarely exceed 512 bytes
options := meta.UDPServerOptions{
Port: 53,
BufferSize: 512,
}
// Game server - small position updates
options := meta.UDPServerOptions{
Port: 9999,
BufferSize: 128,
}
// Media streaming - large packets OK
options := meta.UDPServerOptions{
Port: 5004,
BufferSize: 8192,
}
type ReassemblyHandler struct {
act.Actor
fragments map[uint32]*FragmentSet // Key: message ID
}
type FragmentSet struct {
fragments []*Fragment
received map[int]bool
total int
}
func (r *ReassemblyHandler) Init(args ...any) error {
// Initialize the fragment table - writing to a nil map panics
r.fragments = make(map[uint32]*FragmentSet)
return nil
}
func (r *ReassemblyHandler) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
case meta.MessageUDP:
// Parse fragment header
msgID, fragNum, totalFrags := r.parseFragmentHeader(m.Data)
// Get or create fragment set
set, exists := r.fragments[msgID]
if !exists {
set = &FragmentSet{
fragments: make([]*Fragment, totalFrags),
received: make(map[int]bool),
total: totalFrags,
}
r.fragments[msgID] = set
}
// Store fragment - copy the data, since the buffer is returned to the pool below
fragData := make([]byte, len(m.Data))
copy(fragData, m.Data)
set.fragments[fragNum] = &Fragment{data: fragData}
set.received[fragNum] = true
// Check if complete
if len(set.received) == set.total {
complete := r.reassemble(set.fragments)
r.processMessage(complete)
delete(r.fragments, msgID)
}
bufferPool.Put(m.Data)
}
return nil
}
type Player struct {
lastSequence uint32
}
func (g *GameServer) processGamePacket(player *Player, data []byte) {
// Guard against malformed packets before reading the header
if len(data) < 4 {
return
}
seq := binary.BigEndian.Uint32(data[0:4])
// Discard old/duplicate packets
if seq <= player.lastSequence {
return
}
player.lastSequence = seq
// Process packet
}
type Measurement struct {
timestamp time.Time
value float64
}
func (s *StatsCollector) processMeasurement(m Measurement) {
// Store measurements in order by timestamp
s.insertSorted(m)
}
type DiscoveryServer struct {
act.Actor
udpID gen.Alias
}
func (d *DiscoveryServer) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
case meta.MessageUDP:
if string(m.Data) == "DISCOVER" {
response := d.buildDiscoveryResponse()
// Reply to sender
d.Send(d.udpID, meta.MessageUDP{
Addr: m.Addr,
Data: response,
})
}
bufferPool.Put(m.Data)
}
return nil
}
// WRONG: Buffer leaked
func (s *Server) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
case meta.MessageUDP:
// Store in queue without copying
s.queue = append(s.queue, m.Data) // Buffer still referenced!
}
return nil
}
// CORRECT: Copy before storing
func (s *Server) HandleMessage(from gen.PID, message any) error {
switch m := message.(type) {
case meta.MessageUDP:
copied := make([]byte, len(m.Data))
copy(copied, m.Data)
s.queue = append(s.queue, copied)
bufferPool.Put(m.Data) // Return original
}
return nil
}
// WRONG: Assumes all datagrams arrive
func (c *Client) sendTransaction(tx Transaction) {
// Send 10 chunks
for i := 0; i < 10; i++ {
chunk := tx.getChunk(i)
c.Send(c.udpID, meta.MessageUDP{
Addr: c.serverAddr,
Data: chunk,
})
}
// Server will process when all 10 arrive... right? WRONG!
}
// WRONG: Likely to be fragmented or lost
data := make([]byte, 8000) // 8KB datagram
c.Send(c.udpID, meta.MessageUDP{
Addr: serverAddr,
Data: data,
})
// WRONG: Processes duplicate commands
func (g *GameServer) handleCommand(player *Player, cmd Command) {
switch cmd.Type {
case CmdFireWeapon:
player.ammo-- // Duplicate datagram = fire twice!
g.spawnProjectile(player)
}
}
// CORRECT: Idempotent with sequence tracking
func (g *GameServer) handleCommand(player *Player, cmd Command) {
if cmd.Sequence <= player.lastSequence {
return // Duplicate or old command
}
player.lastSequence = cmd.Sequence
switch cmd.Type {
case CmdFireWeapon:
player.ammo--
g.spawnProjectile(player)
}
}
Message Versioning
Evolving message contracts in distributed clusters
Distributed systems evolve. Services gain features, data models change, and deployments happen gradually. During a rolling upgrade, some nodes run new code while others still run the old version. A message sent from a new node must be understood by an old node, and vice versa.
EDF serializes messages by their exact Go type. Change a struct - and you have a new, incompatible type. This is intentional: explicit versioning catches breaking changes at compile time rather than hiding them until production.
This article explains how to version messages so your cluster handles upgrades gracefully.
Explicit Versioning
Unlike Protobuf or Avro, EDF does not provide automatic backward compatibility. There are no optional fields, no field numbers, no schema evolution. A struct is its type. Change the struct - create a new type.
The approach is straightforward: create a new type for each version.
Both types coexist in the codebase. The receiver handles whichever version arrives:
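The receiving side reduces to an ordinary type switch. A self-contained sketch with illustrative `MessageOrderV1`/`MessageOrderV2` shapes:

```go
package main

import "fmt"

// Two coexisting versions of the same message.
type MessageOrderV1 struct {
	ID     string
	Amount int
}

type MessageOrderV2 struct {
	ID       string
	Amount   int
	Priority int // new in V2
}

// handle processes whichever version arrives. The V1 default priority
// is an explicit decision here, not a silent zero on the wire.
func handle(message any) string {
	switch m := message.(type) {
	case MessageOrderV1:
		return fmt.Sprintf("order %s, priority 0 (v1 default)", m.ID)
	case MessageOrderV2:
		return fmt.Sprintf("order %s, priority %d", m.ID, m.Priority)
	default:
		return "unknown message"
	}
}

func main() {
	fmt.Println(handle(MessageOrderV1{ID: "a1", Amount: 10}))
	fmt.Println(handle(MessageOrderV2{ID: "a2", Amount: 10, Priority: 3}))
}
```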
All message types must be registered with EDF before connection establishment:
There are two ways to organize versioned types: version in the type name or version in the package path. Both work with EDF. Choose based on your team's preferences.
Important: Do not confuse package path versioning with Go modules v2+. Go modules v2+ requires changing both go.mod and all import paths when bumping major version (company.com/events/v2). This forces all consumers to update imports simultaneously, creates diamond dependency problems, and generally causes more pain than it solves. Keep your module below v2.0.0 to avoid triggering this mechanism.
Version in Type Name
All versions live in the same package:
Handler uses type names directly:
Advantages:
Single import for all versions
All versions visible in one place - evolution is clear
One registration file for all types
Simpler directory structure
Version in Package Path
Each version is a separate package:
Handler uses package aliases:
Advantages:
Clean type names without version suffix
Familiar to Protobuf users
Clear directory separation between versions
Removing a version means deleting a directory
Module Organization
For projects where message versions evolve in parallel, place go.mod in each domain directory:
The /v1/ and /v2/ segments are in the middle of the module path, not at the end. Go only applies v2+ import path requirements when /vN is the final path element, so company.com/messaging/v1/events is safe.
This structure allows:
V1 to continue receiving new message types while V2 is developed
Each domain to have isolated dependencies
Clean removal - deleting a directory removes the module entirely
Tagging submodules: Git tags for nested modules must include the path prefix. For module company.com/messaging/v1/events located at v1/events/, use tag v1/events/v0.1.0, not just v0.1.0.
Which to Choose
This documentation uses version in type name for examples. The approach keeps related versions together and requires less import management. However, version in path is equally valid if your team prefers cleaner type names.
Whichever you choose, stay consistent across the codebase.
The versioning mechanism is clear. The next question: where should these types live, and who controls their evolution?
Message Scopes
The answer depends on how the message is used. Not all messages are equal - some travel between two specific services, others broadcast across the entire cluster.
Private Messages
Direct communication between specific services. Request/response patterns between known parties.
Owner: receiver
Payment Service defines what it accepts. Order Service adapts to Payment's contract.
Cluster-Wide Events
Domain events published to multiple subscribers. Any service can subscribe.
Owner: shared repository
Events represent domain facts, not service-specific contracts. Ownership belongs to a shared module that all services import.
Scope determines ownership. Who decides when to create V2? Who approves changes?
| Scope | Owner | Module | Changes approved by |
| --- | --- | --- | --- |
| Private messages | Receiver | receiver-api/ | Receiver team |
| Cluster-wide events | Shared | shared events module | All consumer teams |
The receiver owns private contracts because it implements the logic. Multiple senders may use the same contract, but they all adapt to what the receiver accepts. This follows the Consumer-Driven Contracts pattern. Events are shared because they represent domain facts, not service-specific APIs.
Private Contract Ownership
Payment Service owns its API contract:
Order Service imports and uses it:
Payment team decides when to create V2. Order team adapts.
Cluster Event Ownership
Events require broader coordination:
Breaking changes require sign-off from all consumers.
Repository Organization
With ownership defined, the repository structure follows naturally. Private contracts live with their receivers. Cluster-wide events live in a shared module.
Version in Type Name
Version in Package Path
Registration Helper
All message types must be registered with EDF before connection establishment - during handshake, nodes exchange their registered type lists which become the encoding dictionaries. Registration typically happens in init() functions before node startup. There are two approaches: centralized registration in the shared module or manual registration in each client.
Centralized registration uses init() to register all types when the package is imported:
When clients import the package to use message types, init() runs automatically at program startup and registers all types:
No risk of forgetting a type.
Manual registration means each client registers only the types it uses. This gives more control but introduces risk: a missing registration is only detected at runtime - "no encoder for type" when sending, "unknown reg type for decoding" when receiving. For most projects, centralized registration is simpler and safer. Choose based on your needs.
For message isolation patterns within a single codebase, see Project Structure.
Compatibility Rules
EDF enforces strict type identity. Any struct change breaks wire compatibility.
| Change | Compatible | Action |
| --- | --- | --- |
| Add field | No | Create new version |
| Remove field | No | Create new version |
This differs from Protobuf/Avro where adding optional fields is compatible. In EDF, every change requires explicit versioning.
Yes, this means more work upfront. But consider the alternative: Protobuf lets you add an optional Priority field, and everything "just works" - until you spend three days debugging why orders aren't prioritized correctly. Turns out half your cluster sends the new field, half ignores it, and the receivers silently default missing values to zero. Good luck finding that in logs.
EDF makes this impossible. The receiver either handles OrderV2 with its Priority field, or it doesn't - and you know this at compile time, not at 3 AM when on-call.
Version Lifecycle
With compatibility rules clear, how do versions evolve over time?
When to Create New Version
Any change from the compatibility table above requires a new version. Additionally, create a new version when changing field semantics (same type, different meaning).
Deprecation
Mark deprecated versions:
Log when receiving deprecated versions:
Removal
Remove only when:
All senders upgraded to V2
Monitoring confirms zero V1 traffic
Deprecation period passed
Remove in order:
Stop accepting (return error for V1)
Remove from registration
Delete type definition
Rolling Upgrades
Back to the scenario from the introduction: you're deploying a new version, nodes restart one by one, and for some time the cluster runs mixed code versions. How do you handle this?
Upgrade Strategy
Deploy V2 types to shared module
Update receivers to handle V1 and V2
Rolling restart receiver nodes
Update senders to send V2
Rolling restart sender nodes
Deprecate V1 after all nodes upgraded
Remove V1 after deprecation period
Coexistence Period
Receivers must support both versions during the upgrade window.
Supporting multiple versions means your handler has multiple code paths. As versions accumulate, this becomes messy. The Anti-Corruption Layer pattern isolates version translation:
Use in handler:
Single implementation handles V2. ACL converts V1 to V2. When V1 is removed, delete the ACL function - no changes to business logic needed.
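A sketch of the ACL function, again with illustrative `MessageOrderV1`/`MessageOrderV2` shapes: every defaulting decision for missing V1 fields lives in one conversion function, and the business logic sees only the newest version.

```go
package main

import "fmt"

type MessageOrderV1 struct {
	ID     string
	Amount int
}

type MessageOrderV2 struct {
	ID       string
	Amount   int
	Priority int
}

// toV2 is the anti-corruption layer: defaults for fields that V1
// lacks are chosen here, and nowhere else.
func toV2(m MessageOrderV1) MessageOrderV2 {
	return MessageOrderV2{ID: m.ID, Amount: m.Amount, Priority: 0}
}

// processOrder is the single business-logic path; it only knows V2.
func processOrder(m MessageOrderV2) string {
	return fmt.Sprintf("%s/%d/%d", m.ID, m.Amount, m.Priority)
}

func handle(message any) string {
	switch m := message.(type) {
	case MessageOrderV1:
		return processOrder(toV2(m)) // normalize at the boundary
	case MessageOrderV2:
		return processOrder(m)
	}
	return ""
}

func main() {
	fmt.Println(handle(MessageOrderV1{ID: "a1", Amount: 5})) // a1/5/0
	fmt.Println(handle(MessageOrderV2{ID: "a2", Amount: 7, Priority: 2}))
}
```

When V1 is removed, only the `MessageOrderV1` case and `toV2` are deleted; `processOrder` is untouched.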
Contract Testing
With version handling and ACL in place, how do you verify it actually works? Contract tests verify compatibility:
Test ACL conversion:
Run contract tests in CI before merging changes to shared modules.
Naming Conventions
Consistent naming makes code self-documenting. When you see a type name, you should immediately know: is this async or sync? Is it a request or event? What version?
Async Messages
Prefix with Message, suffix with version:
The prefix signals fire-and-forget semantics. When reading code, MessageXXX means no response is expected. If someone writes Call(pid, MessageOrderShippedV1{}), the mismatch is immediately visible.
Sync Messages
Use Request/Response suffix:
Paired naming makes contracts explicit. ChargeRequest implies ChargeResponse exists. The caller knows to expect a result.
Events
Domain events use past tense without prefix:
Events describe facts that already happened, not requests for action. Past tense (Created, Received) distinguishes them from commands (Create, Charge).
Version Suffix
If using version in type name strategy, always suffix with version number:
If using version in path strategy, the package path carries the version and type names stay clean.
Common Mistakes
These patterns emerge repeatedly in production systems. Avoid them:
Changing existing type instead of creating new version
Forgetting to register new types
Long coexistence periods
Supporting V1 for months creates maintenance burden. Set clear deprecation deadlines and enforce them.
Registering after connection established
Types must be registered before node starts. Dynamic registration requires connection cycling.
Summary
Message versioning in EDF is explicit by design. No hidden compatibility rules, no runtime surprises.
| Aspect | Private Messages | Cluster Events |
| --- | --- | --- |
| Nature | Service API contract | Domain fact |
| Owner | Receiver (implements logic) | Shared (belongs to domain) |
Key principles:
Version in type name or package path, never in Go module path
Receiver owns private contracts
Shared repository for domain events
Test version compatibility
Set deprecation deadlines
Use ACL to isolate version translation
Leader
Distributed leader election for coordinating work across a cluster
Distributed systems often require coordination - ensuring only one node writes to prevent conflicts, scheduling tasks exactly once, or managing exclusive access to shared resources. This coordination demands selecting one node as the leader while others follow. The leader actor implements this election mechanism, handling failures, network issues, and dynamic cluster changes automatically.
When you embed leader.Actor in your process, it participates in distributed leader election with other instances across your Ergo cluster. The framework manages the election protocol - tracking terms, exchanging votes, broadcasting heartbeats. Your code focuses on what matters: what to do when elected leader, and how to behave as a follower.
The Coordination Problem
Consider a typical scenario: you have a multi-replica service that needs to perform periodic cleanup. If every replica runs cleanup independently, you waste resources and might corrupt data through concurrent modifications. You need exactly one replica to run cleanup while others stand ready to take over if it fails.
Traditional solutions involve external systems - ZooKeeper, etcd, or distributed locks in databases. These work, but add operational complexity. You need to deploy and maintain additional infrastructure. Your application depends on external services being available, correctly configured, and network-accessible. Each external dependency is another potential failure point.
The leader actor embeds coordination directly into your Ergo cluster. No external dependencies. Election happens through actor message passing using the same network protocols your application already uses. If your Ergo nodes can communicate with each other, they can elect a leader.
How Election Works
The election protocol follows Raft consensus principles, adapted for actor message passing. Understanding the mechanism requires knowing about three concepts: states, terms, and quorum.
States and Transitions
Every process starts as a follower. This is the initial state - passive, waiting to hear from a leader. If no heartbeats arrive within the election timeout, the follower transitions to candidate and starts an election. If the candidate receives enough votes, it becomes leader. If it discovers another leader or loses the election, it reverts to follower.
The transitions are deliberate. Followers conserve resources by remaining passive. Only when leadership is needed (timeout occurs) does a node become active by candidacy. Leadership is earned through votes, not asserted unilaterally.
Terms and Logical Time
Elections happen in numbered terms. Terms increment monotonically - term 1, term 2, term 3, and so on. Each term has at most one leader. When a candidate starts an election, it increments the term. When nodes communicate, they include their current term. If a node sees a higher term, it updates immediately and acknowledges the new term.
Terms solve a subtle problem: distinguishing stale information from current state. Without terms, a network partition could cause confusion - is this heartbeat from the current leader, or from a partitioned node that thinks it's still leader? Terms provide a logical clock that orders events without requiring synchronized system clocks.
This mechanism ensures that newer elections always supersede older ones, regardless of network delays or partitions.
Quorum and Split-Brain Prevention
To become leader, a candidate needs votes from a majority of nodes. In a three-node cluster, that's two votes (including voting for itself). In a five-node cluster, three votes. The majority requirement prevents split-brain - a dangerous scenario where multiple nodes believe they're leader simultaneously.
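The majority arithmetic is worth pinning down; a short sketch:

```go
package main

import "fmt"

// quorum returns the minimum votes needed for a majority of n nodes.
func quorum(n int) int { return n/2 + 1 }

func main() {
	fmt.Println(quorum(3), quorum(5)) // 2 3
	// A 3/2 partition of five nodes: only the larger side reaches quorum.
	fmt.Println(3 >= quorum(5), 2 >= quorum(5)) // true false
}
```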
Consider a network partition splitting five nodes into groups of 3 and 2:
Only the majority side can elect a leader. The minority side remains leaderless, preventing conflicting leadership. When the partition heals, the minority nodes recognize the higher term from the majority side's leader and follow it.
Election Sequence
Here's what happens when a cluster starts:
Election timeouts are randomized, so typically one node times out first and wins the election before others start their own campaigns. This reduces the chance of split votes.
Leader Maintenance
Once elected, the leader sends periodic heartbeats to all followers:
Heartbeats serve two purposes: they suppress elections on followers (by resetting their timeouts), and they act as a liveness signal. If heartbeats stop, followers know the leader has failed and trigger a new election.
Peer Discovery
Nodes discover each other dynamically. You can provide bootstrap addresses - a list of known peers to contact initially. Whenever a node sends or receives election messages, it monitors the sender. Over time, all nodes discover all peers, even if they didn't initially know about each other.
Discovery is automatic. You can provide a bootstrap list for faster initial synchronization, or start with an empty list and add peers dynamically using the Join() method. Bootstrap accelerates cluster formation but isn't required - nodes discover each other through any election message exchange.
Using the Leader Actor
To create a leader-electing process, embed leader.Actor in your struct and implement the leader.ActorBehavior interface:
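A minimal skeleton using the callback names documented later in this chapter. Treat the exact signatures as approximate and check the leader package; TaskWorker, the "tasks" cluster ID, and the node names are made up for illustration:

```go
// TaskWorker competes for leadership of a task-processing cluster.
type TaskWorker struct {
	leader.Actor
}

// Init returns the election configuration for this process.
func (w *TaskWorker) Init(args ...any) (leader.Options, error) {
	return leader.Options{
		ClusterID: "tasks", // must match across all members
		Bootstrap: []gen.ProcessID{
			{Name: "worker", Node: "node1@localhost"},
			{Name: "worker", Node: "node2@localhost"},
			{Name: "worker", Node: "node3@localhost"},
		},
	}, nil
}

// HandleBecomeLeader starts exclusive work; returning an error rejects leadership.
func (w *TaskWorker) HandleBecomeLeader() error {
	w.Log().Info("elected leader (term %d), starting queue processing", w.Term())
	return nil
}

// HandleBecomeFollower stops exclusive work; an empty PID means no leader yet.
func (w *TaskWorker) HandleBecomeFollower(leader gen.PID) error {
	if leader == (gen.PID{}) {
		w.Log().Info("no leader elected yet")
		return nil
	}
	w.Log().Info("following %s", leader)
	return nil
}
```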
Spawn it like any actor, passing cluster configuration:
When you spawn identical processes on three nodes with the same ClusterID and Bootstrap, they form a cluster. Within milliseconds, one becomes leader and starts processing tasks. The others stand by as followers.
The ActorBehavior Interface
The interface extends gen.ProcessBehavior with leader-specific callbacks:
Mandatory Callbacks
Init returns election configuration. The Options specify ClusterID (identifying which cluster this process belongs to), Bootstrap (initial peers to contact), and optional timing parameters for election and heartbeat intervals.
HandleBecomeLeader is called when this process becomes leader. Start exclusive work here - processing task queues, scheduling cron jobs, claiming resources. Return an error to reject leadership and trigger a new election.
HandleBecomeFollower is called when this process follows a leader. The leader parameter identifies the leader's PID. If leader is empty (gen.PID{}), it means no leader is currently elected. Stop exclusive work here. Followers should redirect requests to the leader or buffer them until leadership is established.
Optional Callbacks
HandlePeerJoined notifies when a new peer joins the cluster. Use this to track cluster size for capacity planning, or to send initialization messages to newcomers.
HandlePeerLeft notifies when a peer crashes or disconnects. Use this to detect cluster degradation or to clean up peer-specific state.
HandleTermChanged notifies when the election term increases. This is useful for distributed log replication or versioned command processing - the term can serve as a logical timestamp for ordering operations.
The other callbacks (HandleMessage, HandleCall, Terminate, HandleInspect) work as they do in regular actors. leader.Actor provides default implementations that log warnings, so you only override what you need.
Error Handling
If any callback returns an error, the actor terminates. This includes leadership callbacks - returning an error from HandleBecomeLeader causes the process to reject leadership, step down, and terminate. This is intentional: if initialization of leader responsibilities fails (can't open files, can't connect to database, etc.), it's better to terminate and let a supervisor restart with clean state than to limp along as a broken leader.
Configuration Options
The Options struct controls election behavior:
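Sketched as a literal - the field names come from the descriptions below, the values are illustrative:

```go
options := leader.Options{
	ClusterID:          "tasks", // must match across the election cluster
	Bootstrap:          []gen.ProcessID{ /* initial peers, including self */ },
	ElectionTimeoutMin: 150 * time.Millisecond, // defaults shown
	ElectionTimeoutMax: 300 * time.Millisecond,
	HeartbeatInterval:  50 * time.Millisecond,
}
```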
ClusterID must match across all processes in the same election cluster. Processes with different cluster IDs ignore each other, allowing multiple independent elections in the same Ergo cluster.
Bootstrap lists the initial peers to contact on startup. Can be empty - in this case, use the Join() method to add peers dynamically. When provided, each process should include itself in the list. At startup, processes send vote requests to bootstrap peers even if they haven't discovered them yet. This accelerates initial election and cluster formation.
ElectionTimeoutMin and ElectionTimeoutMax define the randomization range for election timeouts. Actual timeouts are randomly chosen from this range to reduce the chance of simultaneous elections. Defaults (150-300ms) work well for local networks.
HeartbeatInterval controls how often leaders send heartbeats. Must be significantly smaller than ElectionTimeoutMin - typically at least 3x smaller. The default (50ms) provides a 3x safety margin against the default election timeout.
Tuning for Network Conditions
For local clusters (single datacenter, low latency):
For geographically distributed clusters (high latency, possible packet loss):
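As configuration sketches - the local values are the documented defaults, while the geo-distributed numbers are illustrative, not recommendations from the framework:

```go
// Local cluster: the defaults already fit
local := leader.Options{
	ElectionTimeoutMin: 150 * time.Millisecond,
	ElectionTimeoutMax: 300 * time.Millisecond,
	HeartbeatInterval:  50 * time.Millisecond,
}

// Geo-distributed cluster: widen everything, keep roughly the 3x
// ratio between heartbeat interval and minimum election timeout
geo := leader.Options{
	ElectionTimeoutMin: 1 * time.Second,
	ElectionTimeoutMax: 2 * time.Second,
	HeartbeatInterval:  300 * time.Millisecond,
}
```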
The tradeoff: longer timeouts increase failover time but reduce false elections during network hiccups. Shorter timeouts provide fast failover but risk spurious elections if networks are slow.
API Methods
The embedded leader.Actor provides methods for querying state and communicating with peers:
State Queries
IsLeader() bool - Returns true if this process is currently the leader.
Leader() gen.PID - Returns the current leader's PID, or empty if no leader elected yet.
Term() uint64 - Returns the current election term.
ClusterID() string - Returns the cluster identifier.
Peer Information
Peers() []gen.PID - Returns a snapshot of discovered peers. The slice is a copy, so you can iterate safely.
PeerCount() int - Returns the number of known peers.
HasPeer(pid gen.PID) bool - Checks if a specific PID is a known peer.
Bootstrap() []gen.ProcessID - Returns the bootstrap peer list.
Communication
Broadcast(message any) - Sends a message to all discovered peers. Useful for disseminating information or coordinating state across the cluster.
BroadcastBootstrap(message any) - Sends a message to all bootstrap peers (excluding self). Useful for announcements before peer discovery completes.
Join(peer gen.ProcessID) - Manually adds a peer to the cluster by sending it a vote request. Use this for dynamic cluster growth when new nodes join after initial bootstrap.
Example: Leader-Only Processing
Example: Broadcasting State Updates
Common Patterns
Single Writer Coordination
Only the leader writes to external storage:
Task Scheduling
Only the leader schedules periodic tasks:
Forwarding to Leader
Followers forward writes to the leader:
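Using the state-query methods described later in this chapter (IsLeader, Leader), a follower-side handler might look like this sketch - WriteRequest is a hypothetical message type:

```go
func (w *TaskWorker) HandleMessage(from gen.PID, message any) error {
	if req, ok := message.(WriteRequest); ok && !w.IsLeader() {
		if leader := w.Leader(); leader != (gen.PID{}) {
			return w.Send(leader, req) // forward writes to the leader
		}
		return nil // no leader yet: drop, buffer, or reply with an error
	}
	// leader (or non-write message): handle locally
	return nil
}
```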
Dynamic Cluster Membership
You can start a node with an empty bootstrap list and add peers dynamically:
Network Partitions and Split-Brain
Network partitions are inevitable in distributed systems. The election algorithm handles them safely through the quorum requirement.
Partition Scenario
Consider a five-node cluster that splits into groups of 3 and 2:
Group A (majority side) - Node1 remains leader because it can send heartbeats to Node2 and Node3, which acknowledge them. The majority side continues operating normally.
Group B (minority side) - Node4 and Node5 don't receive heartbeats from Node1. They trigger elections, but neither can get 3 votes (only 2 nodes total in their partition). They remain leaderless and reject write requests.
This asymmetry is intentional. Only one side can have a leader, preventing split-brain writes that would corrupt data.
Partition Healing
When the network partition heals:
The minority nodes recognize the majority leader's heartbeats and rejoin the cluster. If they had incremented their term during failed election attempts, they would detect the higher term and update accordingly.
Integration with Applications
For real applications, the leader actor is a building block for distributed systems patterns:
Distributed Key-Value Store
Extend the leader actor with log replication for a linearizable KV store:
Distributed Lock Service
Implement distributed locks where the leader grants leases:
Limitations and Trade-offs
The leader election actor solves coordination, but it's not a complete distributed database. Understanding what it doesn't provide is as important as knowing what it does.
No automatic log replication - The actor handles leader election but doesn't replicate application state. If you need replicated state machines, you must implement log replication yourself on top of the election foundation.
No persistence - Election state exists only in memory. If all nodes restart simultaneously, the cluster performs a fresh election. For state that must survive restarts, use external storage or implement persistence in your application.
Cluster membership is dynamic discovery, not consensus - Nodes discover peers through message exchange, not through a formal membership protocol. This is sufficient for most use cases but isn't suitable for scenarios requiring precise, consensus-based membership changes.
Leader election is not instantly consistent - During network partitions or failures, there may be brief periods with no leader, or where nodes have inconsistent views of leadership. This is fundamental to distributed consensus and cannot be avoided.
The actor provides the foundation - stable leader election with safety guarantees. Building complete distributed systems (databases, coordination services) requires additional mechanisms built on this foundation.
Observability
The leader actor integrates with Ergo's inspection system:
Monitor leadership changes in your logging:
Track cluster health by monitoring peer counts and leadership stability over time.
Port
Actors communicate through message passing within the framework. But what if you need to integrate with an external program written in Python, C, or any other language? You could spawn goroutines to manage stdin/stdout, handle protocol framing, deal with buffer management - but this breaks the actor model and spreads I/O complexity throughout your code.
Port meta-process solves this by wrapping external programs as actors. The external program runs as a child process. You send messages to the Port, and it writes them to the program's stdin. The Port reads from stdout and sends you messages. From your actor's perspective, you're just exchanging messages with another actor - the external program's details are abstracted away.
This enables clean integration with legacy systems, specialized libraries in other languages, or any tool that uses stdin/stdout for communication. The actor model stays intact while bridging to external processes.
// Version 1
type OrderCreatedV1 struct {
	OrderID int64
}

// Version 2 - new field
type OrderCreatedV2 struct {
	OrderID  int64
	Priority int
}

func (a *Actor) HandleMessage(from gen.PID, message any) error {
	switch m := message.(type) {
	case OrderCreatedV1:
		return a.handleOrderV1(m)
	case OrderCreatedV2:
		return a.handleOrderV2(m)
	}
	return nil
}

func init() {
	types := []any{
		OrderCreatedV1{},
		OrderCreatedV2{},
	}
	for _, t := range types {
		if err := edf.RegisterTypeOf(t); err != nil && err != gen.ErrTaken {
			panic(err)
		}
	}
}
type MessageOrderShippedV1 struct {
	OrderID   int64
	TrackingN string
}

type ChargeRequestV1 struct {
	OrderID int64
	Amount  int64
}

type ChargeResponseV1 struct {
	TransactionID string
	Status        string
}
type OrderCreatedV1 struct { ... }
type PaymentReceivedV1 struct { ... }

type OrderV1 struct { ... }  // correct
type Order struct { ... }    // avoid - unclear versioning
type OrderNew struct { ... } // avoid - not a version number

// Wrong - breaks existing consumers
type Order struct {
	ID       int64
	Priority int // added field breaks wire format
}

// Correct - create new version (in type name or new package path)
type OrderV2 struct {
	ID       int64
	Priority int
}

// Type exists but not registered - encoding fails at runtime
type OrderV3 struct { ... }

// Must register before node starts
edf.RegisterTypeOf(OrderV3{})
Create a Port with meta.CreatePort and spawn it as a meta-process:
The Port starts the external program and establishes three pipes: stdin (for writing), stdout (for reading), and stderr (for errors). The program runs as a child process managed by the Port meta-process.
When the Port starts, it sends MessagePortStart to your actor. When the external program terminates (or the Port is stopped), it sends MessagePortTerminate. Between these, you exchange data messages.
Text Mode: Line-Based Communication
By default, Port operates in text mode. It reads stdout line by line and sends each line as MessagePortText. It reads stderr the same way and sends errors as MessagePortError.
Text mode uses bufio.Scanner internally, which splits input by lines (newline delimiter). You can customize the splitting logic:
Text mode is simple and works well for line-oriented protocols: command-response pairs, JSON-per-line, log output, or any text-based format. But it's not suitable for binary protocols.
Binary Mode: Raw Bytes
For binary protocols (Protobuf, MessagePack, custom framing), enable binary mode:
In binary mode, the Port reads raw bytes from stdout and sends them as MessagePortData. You send binary data using MessagePortData messages:
The Port reads up to ReadBufferSize bytes at a time from stdout and sends each chunk as MessagePortData. There's no framing or splitting - you receive raw bytes as the Port reads them. If your protocol has message boundaries, you must track them yourself.
Stderr is always processed in text mode, even when binary mode is enabled. Stderr messages arrive as MessagePortError.
Chunking: Automatic Message Framing
Reading raw bytes means dealing with partial messages. A 1KB message might arrive as three separate MessagePortData messages (512 bytes, 400 bytes, 88 bytes), or multiple messages might arrive together in one chunk. You need to buffer, reassemble, and detect message boundaries.
Chunking solves this by automatically framing messages. Instead of receiving raw bytes, you receive complete chunks - one MessagePortData per message, properly framed.
Fixed-Length Chunks
If every message is the same size, use fixed-length chunking:
The Port buffers stdout until it has 256 bytes, then sends them as one MessagePortData. If a read returns 512 bytes, you receive two MessagePortData messages (256 bytes each). If a read returns 100 bytes, the Port waits for more data before sending.
This is efficient for fixed-size protocols: binary structs, fixed-width encodings, or any format where every message has the same length.
Header-Based Chunking
Most binary protocols use variable-length messages with a header that specifies the length. Chunking can parse these headers automatically:
This configuration matches a protocol where:
Every message starts with a 4-byte header
The header contains a 4-byte big-endian integer (bytes 0-3)
The integer specifies the payload length (header not included)
Messages are: [4-byte length][payload]
The Port reads the header, extracts the length, waits for the full payload to arrive, then sends the complete message (header + payload) as MessagePortData.
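The reassembly the Port performs is equivalent to this self-contained sketch (not the Port's actual code) for a 4-byte big-endian header whose length excludes the header:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// splitFrames reassembles [4-byte big-endian length][payload] messages
// from a raw byte stream. It returns the complete frames (header plus
// payload) and any leftover bytes that must wait for more data.
func splitFrames(buf []byte) (frames [][]byte, rest []byte) {
	for {
		if len(buf) < 4 {
			return frames, buf // incomplete header: wait for more data
		}
		payloadLen := int(binary.BigEndian.Uint32(buf[:4]))
		total := 4 + payloadLen // header not included in the length field
		if len(buf) < total {
			return frames, buf // incomplete payload: wait for more data
		}
		frames = append(frames, buf[:total])
		buf = buf[total:]
	}
}

func main() {
	// Two messages written back-to-back: 3-byte and 1-byte payloads.
	stream := []byte{0, 0, 0, 3, 'a', 'b', 'c', 0, 0, 0, 1, 'x'}
	frames, rest := splitFrames(stream)
	fmt.Println(len(frames), len(rest)) // 2 0: two complete frames, no leftover
}
```

With chunking enabled, this buffering and splitting happens inside the Port, and each complete frame arrives as one MessagePortData.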
Example protocol:
With the configuration above, you receive two MessagePortData messages:
If the external program writes both messages at once (274 bytes total), the Port automatically splits them. If the program writes slowly (header arrives, then payload arrives later), the Port waits for the complete message before sending.
Header length options:
HeaderLengthSize can be 1, 2, or 4 bytes. All lengths are big-endian. The Port reads the header, extracts the length value, computes the total message size (adding header size if HeaderLengthIncludesHeader is false), and buffers until the complete message arrives.
MaxLength protection:
If the header specifies a length exceeding MaxLength, the Port terminates with gen.ErrTooLarge. This protects against malformed messages or malicious programs that claim a message is 4GB (causing memory exhaustion).
Set MaxLength based on your protocol's reasonable maximum. Leave it zero for no limit (use cautiously).
Buffer Management
The Port allocates buffers for reading stdout. By default, each read allocates a new buffer, which is sent in MessagePortData and becomes garbage when you're done with it. For high-throughput ports, this causes GC pressure.
Use a buffer pool to reuse buffers:
The Port gets buffers from the pool when reading stdout. When you receive MessagePortData, the Data field is a buffer from the pool. You must return it to the pool when done:
If you forget to return buffers, the pool will allocate new ones, defeating the purpose. If you return a buffer and then access it later, you'll get corrupted data (the buffer is reused by the Port for the next read).
When you send MessagePortData to write to stdin, the Port automatically returns the buffer to the pool after writing (if a pool is configured). You don't need to do anything:
Buffer pools are critical for high-throughput scenarios. For low-volume ports (a few messages per second), the GC overhead is negligible - skip the pool for simplicity.
Write Keepalive
Some external programs expect periodic input to stay alive. If stdin goes silent for too long, they timeout or disconnect. You could send keepalive messages from your actor (with timers), but that's tedious and error-prone.
Enable automatic keepalive:
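A configuration sketch - only the two keepalive field names come from this chapter; how binary mode itself is enabled is an assumption to verify against the meta package:

```go
options := meta.PortOptions{
	Cmd: "worker",
	// Binary mode must be enabled; keepalive is unavailable in text mode.
	WriteBufferKeepAlive:       []byte{0},       // bytes written when stdin is idle
	WriteBufferKeepAlivePeriod: 5 * time.Second, // silence threshold before sending
}
```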
The Port wraps stdin with a keepalive flusher. If nothing is written for WriteBufferKeepAlivePeriod, it automatically sends WriteBufferKeepAlive bytes. This keeps the connection alive without any action from your actor.
The keepalive message can be anything: a null byte, a specific protocol message, a ping command. The external program receives it as normal stdin input. Design your protocol to ignore or handle keepalive messages.
Keepalive is only available in binary mode. In text mode, you need to send keepalive messages manually.
Environment Variables
The external program inherits environment variables based on your configuration:
EnableEnvOS: Includes the operating system's environment. This gives the program access to PATH, HOME, USER, and other system variables. Useful when the program needs to find other executables or access user-specific paths.
EnableEnvMeta: Includes environment variables from the meta-process (inherited from its parent actor). Meta-processes share their parent's environment. If the parent has MY_VAR=value, the Port's external program sees MY_VAR=value too.
Env: Custom variables specific to this Port. These are always included regardless of the other flags.
Order of precedence (if duplicate names):
Custom Env (highest priority)
Meta-process environment
OS environment (lowest priority)
Routing Messages
By default, all Port messages (start, terminate, data, errors) go to the parent process - the actor that spawned the Port. For single-port scenarios, this is fine. For multiple ports or advanced architectures, you want routing:
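A sketch of the routing configuration - the Process and Tag field names come from this chapter, while "data_handler" and "worker-1" are placeholder values:

```go
options := meta.PortOptions{
	Cmd:     "worker",
	Process: "data_handler", // registered name that receives all Port messages
	Tag:     "worker-1",     // included in every message from this Port
}
```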
All Port messages are sent to the process registered as data_handler. This enables:
Worker pools:
The Port sends all messages to a pool, which distributes them across workers. Multiple ports can share the same pool for load balancing.
Centralized handlers:
Both ports send messages to python_manager, which coordinates multiple Python scripts.
Distinguishing ports with tags:
The Tag field appears in all Port messages. The manager uses it to distinguish which port sent the message:
If Process is empty or not registered, messages go to the parent process.
Port Messages
Messages you receive from the Port:
MessagePortStart - Port started successfully, external program is running:
Sent once after the external program starts. Use this to send initialization commands.
MessagePortTerminate - Port stopped, external program exited:
Sent when the external program terminates (exit, crash, killed) or when you terminate the Port. After this, the Port is dead - you cannot send it more messages.
MessagePortText - Line from stdout (text mode only):
Sent for each line read from stdout in text mode. The delimiter (newline or custom) is stripped from Text.
MessagePortData - Binary data from stdout (binary mode only):
In binary mode without chunking, Data contains whatever bytes the Port read (up to ReadBufferSize). With chunking, Data contains one complete chunk.
If ReadBufferPool is configured, Data is from the pool - return it when done.
MessagePortError - Line from stderr (always text mode):
Sent for each line read from stderr. Stderr is always processed in text mode, even when binary mode is enabled for stdout.
Messages you send to the Port:
MessagePortText - Send text to stdin (text mode):
Writes Text to stdin. Newlines are not added automatically - include them if your protocol needs them.
MessagePortData - Send binary data to stdin (binary mode):
Writes Data to stdin. If ReadBufferPool is configured, the Port returns the buffer to the pool after writing. Don't use the buffer after sending.
Termination and Cleanup
When the external program exits (normally or by crashing), the Port sends MessagePortTerminate and terminates itself. The Port also kills the external program if:
The Port is terminated (you call process.SendExit to the Port's ID)
The Port's parent terminates (cascading termination)
An error occurs reading stdout (broken pipe, I/O error)
The Port calls Kill() on the child process and waits for it to exit. This ensures cleanup happens even if the program is misbehaving.
Stderr is read in a separate goroutine. This means stderr messages can arrive after MessagePortTerminate if the program wrote to stderr just before exiting. Design your actor to handle this ordering.
Inspection
Port supports inspection for debugging:
Returns a map with Port status:
Use this for monitoring, debugging, or displaying Port status in management UIs.
Patterns and Pitfalls
Pattern: Request-response wrapper
Wrap a Port to provide synchronous Call semantics. Useful for RPC-style protocols.
Pattern: Supervised restart
Supervise the actor that spawns ports. If the actor crashes, the supervisor restarts it, which re-spawns ports. Ports inherit parent lifecycle - when the actor terminates, all its ports terminate.
Pattern: Backpressure with buffer pool
Limit memory usage by capping concurrent buffers. If processing is slow, the semaphore blocks, which blocks the actor's message loop, which applies backpressure to the Port.
Pitfall: Forgetting to return buffers
Pool buffers are reused. If you store them, they'll be overwritten by future reads. Copy data if you need to keep it.
Pitfall: Blocking on stdin writes
If the external program stops reading stdin (buffer full, process blocked), the Port blocks writing. The Port's HandleMessage is blocked, so it can't send you more stdout data. Deadlock.
Solution: Design your protocol so the external program never stops reading stdin. Use flow control or chunking to prevent overflows.
Pitfall: Ignoring MessagePortError
Stderr messages arrive as MessagePortError. If you don't handle them, warnings and errors from the external program are lost. Always handle stderr or explicitly decide to ignore it.
Pitfall: Not handling MessagePortTerminate
After MessagePortTerminate, the Port is dead. Sending messages returns errors. Handle termination: restart the Port, fail gracefully, or terminate your actor.
Port meta-processes enable clean integration with external programs. They handle process management, I/O buffering, protocol framing, and lifecycle coordination - letting you focus on the protocol logic while maintaining the actor model's isolation and simplicity.
type Controller struct {
	act.Actor

	portID gen.Alias
}

func (c *Controller) Init(args ...any) error {
	// Define port options
	options := meta.PortOptions{
		Cmd:  "python3",
		Args: []string{"processor.py", "--mode=batch"},
		Env: map[gen.Env]string{
			"WORKER_ID": "worker-1",
		},
	}

	// Create port behavior
	portBehavior, err := meta.CreatePort(options)
	if err != nil {
		return fmt.Errorf("failed to create port: %w", err)
	}

	// Spawn as meta-process
	portID, err := c.SpawnMeta(portBehavior, gen.MetaOptions{})
	if err != nil {
		return fmt.Errorf("failed to spawn port: %w", err)
	}
	c.portID = portID

	c.Log().Info("spawned port for %s (id: %s)", options.Cmd, portID)
	return nil
}

func (c *Controller) HandleMessage(from gen.PID, message any) error {
	switch m := message.(type) {
	case meta.MessagePortStart:
		c.Log().Info("port started: %s", m.ID)
		// Send initial command
		c.Send(m.ID, meta.MessagePortText{Text: "INIT worker-1\n"})

	case meta.MessagePortText:
		// Received line from stdout
		c.Log().Info("port output: %s", m.Text)
		c.processOutput(m.Text)

	case meta.MessagePortError:
		// Received line from stderr
		c.Log().Warning("port error: %s", m.Error)

	case meta.MessagePortTerminate:
		c.Log().Info("port terminated: %s", m.ID)
		// Restart or cleanup
	}
	return nil
}

func (c *Controller) processCommand(cmd string) {
	// Send command to external program
	c.Send(c.portID, meta.MessagePortText{
		Text: cmd + "\n",
	})
}
options := meta.PortOptions{
	Cmd: "processor",

	// Custom split function for stdout
	SplitFuncStdout: func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		// Find null-terminated strings instead of newlines
		if i := bytes.IndexByte(data, 0); i >= 0 {
			return i + 1, data[:i], nil
		}
		if atEOF && len(data) > 0 {
			return len(data), data, nil
		}
		return 0, nil, nil
	},

	// Custom split function for stderr (optional)
	SplitFuncStderr: bufio.ScanWords, // Split stderr by words
}
func (c *Controller) HandleMessage(from gen.PID, message any) error {
	switch m := message.(type) {
	case meta.MessagePortData:
		// Process the data
		c.processData(m.Data)
		// Return buffer to pool
		bufferPool.Put(m.Data)
	}
	return nil
}
buf := bufferPool.Get().([]byte)
// Fill buf with data
c.Send(portID, meta.MessagePortData{Data: buf})
// Port returns buf to pool after writing
bufferPool := &sync.Pool{
	New: func() any {
		return make([]byte, 8192)
	},
}
// Limit concurrent buffers
sem := make(chan struct{}, 100) // Max 100 buffers in flight

func (c *Controller) HandleMessage(from gen.PID, message any) error {
	switch m := message.(type) {
	case meta.MessagePortData:
		// Acquire semaphore (blocks if 100 buffers in use)
		sem <- struct{}{}
		go func() {
			defer func() {
				<-sem                  // Release semaphore
				bufferPool.Put(m.Data) // Return buffer
			}()
			// Process data (can be slow)
			c.processData(m.Data)
		}()
	}
	return nil
}
// WRONG: Buffer leaked
func (c *Controller) HandleMessage(from gen.PID, message any) error {
	switch m := message.(type) {
	case meta.MessagePortData:
		c.dataQueue = append(c.dataQueue, m.Data) // Stored, never returned!
	}
	return nil
}

// CORRECT: Copy if you need to store
func (c *Controller) HandleMessage(from gen.PID, message any) error {
	switch m := message.(type) {
	case meta.MessagePortData:
		copied := make([]byte, len(m.Data))
		copy(copied, m.Data)
		c.dataQueue = append(c.dataQueue, copied)
		bufferPool.Put(m.Data) // Return original
	}
	return nil
}
// Port writes are blocking
c.Send(portID, meta.MessagePortData{Data: largeBuffer})
// ^ This Send doesn't block, but the Port's write to stdin might
// WRONG: Stderr ignored
func (c *Controller) HandleMessage(from gen.PID, message any) error {
	switch m := message.(type) {
	case meta.MessagePortData:
		c.process(m.Data)
		// No case for MessagePortError!
	}
	return nil
}
// WRONG: Port terminated, but actor keeps trying to use it
func (c *Controller) processData(data []byte) {
	c.Send(c.portID, meta.MessagePortData{Data: data})
	// ^ Fails if port terminated
}
Changes to the shared events/ module affect all consumers:

| Change | Compatible | Action |
|---|---|---|
| Change field type | No | Create new version |
| Rename field | No | Create new version |
| Reorder fields | No | Create new version |

| Module | Changes |
|---|---|
| receiver-api/ | Receiver team decides |
| events/ | All consumers coordinate |
Handling Sync Requests
Handling synchronous requests in the asynchronous actor model
The actor model is fundamentally asynchronous. Processes send messages and continue immediately without waiting for responses. This asynchrony is core to the model - actors don't block, they process messages one at a time from their mailbox, and they scale because thousands of actors can run concurrently without threads blocking on I/O or responses.
But real systems often need synchronous patterns. A client makes a request and must wait for a response before continuing. An HTTP handler receives a request and can't return to the client until the response is ready. A database query needs to block until the data arrives. These synchronous requirements don't disappear just because your system uses actors.
The challenge is satisfying these synchronous requirements without actually blocking the actor. If an actor blocks waiting for a response, it can't process other messages in its mailbox. The actor becomes unresponsive to everything else. This defeats the purpose of the actor model - you want concurrent message processing, not sequential blocking.
This chapter explores how to handle synchronous-style requests while maintaining asynchronous actor behavior. You'll learn how the framework implements request-response, how to handle Call requests efficiently, and how to process them asynchronously even when the caller is blocked waiting.
The Nature of Synchronous Calls in Actors
In traditional synchronous code, when you call a function, you wait for it to return:
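For example, with a stand-in query function (not a real driver):

```go
package main

import (
	"fmt"
	"time"
)

// query stands in for a blocking database call.
func query(sql string) string {
	time.Sleep(10 * time.Millisecond) // the calling goroutine is parked here
	return "rows for " + sql
}

func main() {
	result := query("SELECT 1") // execution stops until query returns
	fmt.Println(result)
}
```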
The calling thread stops. The operating system schedules other threads. Eventually the query completes, the thread wakes up, and execution continues. This is fine when you have many threads - some block, others run. But it's wasteful, and it doesn't scale to tens of thousands of concurrent operations.
In the actor model, you send a message and continue:
The sender doesn't block. The message goes into the database actor's mailbox. When the database actor processes it, it sends a response message back. The original sender handles that response later in its own message loop. This is how actors achieve massive concurrency - no actor ever blocks waiting, so you can run thousands of actors with a small thread pool.
But what if the sender legitimately needs to wait? What if it's an HTTP handler that can't return to the client until the query completes?
The framework provides Call for this:
From the caller's perspective, this looks synchronous - you call, you wait, you get a result. But from the system's perspective, it's asynchronous:
The caller sends a request message with a unique reference (gen.Ref)
The caller's goroutine blocks waiting for a response with that reference
The recipient receives the request as a HandleCall invocation
The result returned from HandleCall is sent back, tagged with the same reference
The caller's goroutine wakes up, and Call returns that result
The caller blocks, but blocking is isolated to that one actor. The actor's goroutine is suspended (cheap), not spinning (expensive). Other actors run normally. The recipient processes the request whenever it gets to it in its mailbox, not immediately. The entire system remains asynchronous, but individual actors can use synchronous-style APIs when needed.
Basic HandleCall Implementation
When a process receives a Call request, the framework invokes HandleCall:
Critical distinction: The error you return from HandleCall is not the response to the caller - it's the termination reason for your process!
return result, nil - Send result to caller, continue running
return errorValue, nil - Send errorValue to caller, continue running
When you return a non-nil result from HandleCall, the framework automatically sends it as a response message to the caller. The caller's blocked Call unblocks and returns your result. Any value can be a result - integers, strings, structs, even errors.
If you need to send an error to the caller, return the error as the result value, not as the error return:
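A sketch of the pattern - lookupUser is hypothetical, and the HandleCall signature shown here is approximate:

```go
func (a *App) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
	user, err := a.lookupUser(request)
	if err != nil {
		// The error travels to the caller as the result value.
		// Returning it as the second value would terminate this process.
		return err, nil
	}
	return user, nil
}
```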
The second return value (error) is for terminating your process. Return gen.TerminateReasonNormal to gracefully stop, or any other error for abnormal termination. If you return both a result and gen.TerminateReasonNormal, the framework sends the result first, then terminates your process.
From the caller's side:
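A sketch of the calling side - GetUser is a hypothetical request type, and Call's exact signature may differ:

```go
result, err := process.Call(serverPID, GetUser{ID: 42})
if err != nil {
	// Framework-level failure: gen.ErrTimeout, a network error,
	// or the target process terminated.
	return err
}
if appErr, ok := result.(error); ok {
	// Application-level error sent back by HandleCall as the result.
	return appErr
}
// result holds the actual response value
```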
The caller blocks at Call until your HandleCall returns. This can be milliseconds (local, fast computation) or seconds (remote, slow operation). The caller can specify a timeout - if no response arrives within the timeout, Call returns nil, gen.ErrTimeout.
Note the distinction: err from Call is a framework-level error (timeout, network failure, process terminated). The result itself might be an error value sent by your HandleCall - that's application-level.
Why Not Just Use Channels?
You might wonder: why not just use Go channels for request-response?
This breaks the actor model in subtle ways:
Shared memory - Channels are shared memory. Passing a channel in a message creates a direct communication path outside the actor system. If the worker is on a remote node, the channel doesn't work (channels don't serialize). Your code becomes non-portable between local and remote.
Blocking semantics - Blocking on a channel blocks the actor's goroutine, but the actor is still "running" from the framework's perspective. The actor can't process other messages while blocked. With Call, the framework knows the actor is waiting for a response and can properly account for it (the actor is in ProcessStateWaitResponse).
Timeout coordination - Channels don't have built-in timeouts. You'd wrap them in select with time.After, but timeout cleanup is tricky. With Call, timeouts are built-in, and references have deadlines that the receiver can check.
No network transparency - Call works identically for local and remote processes. Channels don't. If you use channels for local request-response, your code won't work when you move to a distributed deployment.
The framework's Call mechanism is designed specifically for request-response in the actor model, works across the network, and integrates properly with the actor lifecycle.
Handling Requests with Worker Pools
A common pattern is a server process that receives many Call requests. If processing each request takes time (database query, HTTP call, complex computation), handling them sequentially in HandleCall creates a bottleneck. One slow request delays all subsequent requests.
The solution is act.Pool - a specialized actor that automatically distributes requests across a pool of worker actors:
Notice what's not in this code - there's no HandleCall for the Server. You don't need one.
act.Pool automatically intercepts all incoming Call requests and forwards them to workers. When you send a Call to the Server PID, the Pool:
Receives the Call request in its mailbox
Pops an available worker from the pool
Forwards the entire request (from, ref, message) to the worker
Returns the worker to the pool (reusable for next request)
The worker receives the Call request with the original caller's PID and ref. When the worker returns a result from HandleCall, it goes directly to the original caller, bypassing the Pool entirely. The Pool is just a router.
From the caller's perspective:
This gives you concurrent request processing:
10 Call requests arrive at the Server simultaneously
Pool forwards each to a different worker
All 10 workers process concurrently
Each worker responds directly to its caller
Pool remains free to route more requests
The caller's experience is unchanged - they call, they block, they get a result. They don't know about the pool. The concurrency is entirely internal to the server.
Worker resilience:
If a worker crashes or becomes unresponsive, the Pool automatically spawns a replacement worker. Worker failures don't affect the Pool's availability - other workers continue processing requests while the Pool restarts failed workers in the background.
If all workers are busy (mailboxes full), incoming requests queue up in the Pool's mailbox until a worker becomes available.
For more details on Pool configuration and advanced patterns, see the act.Pool documentation.
Asynchronous Processing of Synchronous Requests
Sometimes you need to handle a Call request asynchronously within a single actor, without workers. Maybe you're waiting for a timer, or you need to make another Call before you can respond, or you want to batch multiple requests.
You can do this manually:
The pattern:
HandleCall stores from and ref for later
HandleCall returns (nil, nil) - async handling
Later (timer, another message, whatever), you process the request
Call SendResponse(from, ref, result) to send the result
Caller's blocked Call unblocks with your result
You must respond eventually, or the caller will time out. If you lose track of the ref or forget to respond, the caller waits until the timeout expires and gets gen.ErrTimeout.
The result you send with SendResponse can be any value - strings, numbers, structs, even errors. If you want to send an error to the caller, just send it as a normal result value:
The caller receives it as result (first return value from Call) and can check if it's an error.
SendResponse vs SendResponseError: Two Channels for Results
When you handle Call requests asynchronously, you send responses later using SendResponse. But there's also SendResponseError. What's the difference, and when do you use each?
The difference is in which return value the caller receives from Call.
SendResponse sends to the result channel:
Whatever you send appears as the first return value (result). The second return value (err) is nil, meaning no framework error occurred. The result can be anything - strings, numbers, structs, even errors:
The caller must check if the result is an error:
SendResponseError sends to the error channel:
The error appears as the second return value (err), exactly where framework errors like timeout and network failures appear. The first return value (result) is nil.
From the caller's perspective, there's no difference between an error from SendResponseError and a framework error:
The problem with mixing channels
The framework uses the error channel for transport errors - problems with the messaging infrastructure. Your application uses it for business logic results. When you call SendResponseError, you're mixing these two concerns.
Consider a typical caller error handling:
This makes sense for transport errors - network glitches, temporary overload. But if the database actor uses SendResponseError for "record not found", the caller retries unnecessarily. The record won't appear in one second.
The caller has no way to distinguish. Both arrive through the error channel.
When mixing is justified
Despite this issue, SendResponseError has legitimate uses. The key is: use it for errors that should be handled like transport errors.
Imagine a database query actor. It receives queries, executes them against a database, and returns results. What errors can occur?
Application errors - problems with the query itself:
Bad SQL syntax
Permission denied
Constraint violation
These are not infrastructure problems. The actor is working fine, the database is up, the request was processed. The query just has issues. The caller should see these as results, not transport failures.
Infrastructure errors - problems with the database connection:
Database server is down
Network to database lost
Connection pool exhausted
Too many simultaneous connections
These are infrastructure problems. The actor couldn't process the request because a dependency is unavailable. From the caller's perspective, this is the same as if the actor itself were unreachable (timeout) or the node were down (network failure). The caller should handle all of these identically - retry, fallback, circuit breaking.
Here's how to implement this:
The caller handles both channels naturally:
This works because the caller wants to handle infrastructure failures identically, regardless of whether they originate from the framework (timeout, network) or from the application (database down). Both represent unavailable service, both trigger the same fallback logic.
Guideline
Use SendResponse for all normal cases, including expected errors (validation, not found, unauthorized). These are results - the request was processed, here's what happened.
Use SendResponseError only when the error represents an infrastructure failure that the caller should treat the same as transport errors - retry with backoff, circuit breaking, fallback to alternative services.
If in doubt, use SendResponse. It keeps transport and application concerns separate, giving the caller maximum clarity.
Using Ref.IsAlive for Timeout Awareness
When you handle requests asynchronously, the caller might time out before you respond. Imagine:
Caller makes a Call with 5 second timeout
Your HandleCall stores the request, returns nil (async)
6 seconds pass
Caller's timeout fires, Call returns gen.ErrTimeout
Your actor finishes processing and calls SendResponse
Your response arrives after the caller stopped waiting. The caller won't receive it (it's not waiting on that ref anymore). Your work was wasted.
You can detect this with ref.IsAlive():
ref.IsAlive() checks the deadline embedded in the reference. When the caller made the Call with a timeout, the framework created a reference with MakeRefWithDeadline(now + timeout). The deadline is stored in ref.ID[2] as a unix timestamp. IsAlive() compares it to the current time - if the deadline passed, it returns false.
This lets you skip processing expired requests. If a request took too long to reach the front of the queue, and the caller already gave up, don't waste resources computing a response nobody will receive.
But be careful: IsAlive() returning false doesn't mean the caller is definitely gone. It means the deadline passed. The caller might have disappeared for other reasons (crash, network disconnect), or they might still exist but already moved on. It's a hint for optimization, not a guarantee about caller state.
If you send a response after the deadline, nothing bad happens. The response message arrives, the receiver checks if anyone is waiting for that ref, finds nobody, and drops the message. It's just wasted work - harmless but inefficient.
Common Patterns and Pitfalls
Pattern: Immediate vs deferred
Some requests you can answer immediately, others need async processing. Mix both in the same HandleCall based on the situation.
Pattern: Batch processing
Accumulate requests, process them together, respond to each individually. Efficient for operations with high setup cost (database connections, API requests with rate limits).
Pitfall: Losing references
You need both from and ref to send a response. Store them together.
Pitfall: Confusing result errors with termination errors
This is the most common mistake. Remember: the error return from HandleCall terminates your process, it doesn't go to the caller (except the special case of gen.TerminateReasonNormal with a non-nil result).
Pitfall: Blocking in HandleCall
Even though the caller is blocked waiting, your actor shouldn't block. If you sleep for 5 seconds, you can't handle other messages during that time. Other callers will queue up waiting. If this is unavoidable (calling a blocking API you don't control), spawn a worker to handle it or use act.Pool.
The Path to Important Delivery
Everything discussed so far assumes the response message arrives. But what if it doesn't? Networks drop packets. Remote processes crash. Connections fail.
When a response is lost, the caller blocks until timeout. Eventually Call returns gen.ErrTimeout, but you don't know if the request was processed or not. Did the receiver handle it and the response got lost? Or did the request itself get lost before reaching the receiver?
This uncertainty is a fundamental problem in distributed systems. The framework's Call mechanism gives you request-response semantics, but it doesn't guarantee the response arrives. It's "best effort" - works reliably for local calls and stable network connections, but no guarantees.
For many use cases, this is fine. Timeouts are acceptable. Callers can retry. Idempotent operations tolerate retries. But some operations can't tolerate uncertainty. A payment authorization must definitely succeed or definitely fail - timeout isn't acceptable.
The solution is Important Delivery. When you enable the Important flag, the framework changes from "best effort" to "confirmed delivery." Responses don't just get sent, they get acknowledged. If the response fails to deliver, you know immediately rather than waiting for timeout.
Important Delivery makes the network transparent for failures, not just successes. It turns request-response from "probably works" into "definitely works or definitely fails, no ambiguity."
We'll explore Important Delivery in depth in the next chapter. For now, understand that everything you've learned about Call and HandleCall still applies. Important Delivery is a layer on top, not a replacement. You'll still handle requests the same way - the framework just makes delivery more reliable.
For details on how messages and calls flow through the network, see the networking chapters. For understanding delivery guarantees, continue to the next chapter on Important Delivery.
result, err := process.Call(databasePID, QueryRequest{SQL: "SELECT * FROM users"})
// blocked here, but only this actor is blocked
// other actors continue running normally
type Calculator struct {
act.Actor
}
func (c *Calculator) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
switch req := request.(type) {
case AddRequest:
result := req.A + req.B
return result, nil
case DivideRequest:
if req.B == 0 {
// Return error as the result value, not as termination reason
return fmt.Errorf("division by zero"), nil
}
result := req.A / req.B
return result, nil
default:
// Return error as the result value
return fmt.Errorf("unknown request type"), nil
}
}
// WRONG - terminates the process!
if invalid {
return nil, fmt.Errorf("invalid request")
}
// CORRECT - sends error to caller
if invalid {
return fmt.Errorf("invalid request"), nil
}
// Somewhere in another actor
result, err := process.Call(calculatorPID, AddRequest{A: 10, B: 20})
if err != nil {
// This is a framework error (timeout, connection lost, etc)
process.Log().Error("call failed: %s", err)
return err
}
// Check if the result itself is an error (application-level error)
if errResult, ok := result.(error); ok {
process.Log().Error("calculator returned error: %s", errResult)
return errResult
}
sum := result.(int)
process.Log().Info("10 + 20 = %d", sum)
// Tempting but wrong in actor model
response := make(chan Result)
process.Send(workerPID, Request{Data: data, ResponseChan: response})
result := <-response // block waiting
type Server struct {
act.Pool
}
type Worker struct {
act.Actor
}
func (s *Server) Init(args ...any) (act.PoolOptions, error) {
return act.PoolOptions{
PoolSize: 10, // 10 worker actors
WorkerFactory: func() gen.ProcessBehavior { return &Worker{} },
}, nil
}
// No HandleCall needed for Server! Pool handles forwarding automatically.
func (w *Worker) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
// Process the request
switch req := request.(type) {
case QueryRequest:
// Simulate slow operation
time.Sleep(100 * time.Millisecond)
result := fmt.Sprintf("Result for: %s", req.Query)
return result, nil
default:
// Return error as result value, not termination reason
return fmt.Errorf("unknown request"), nil
}
}
// Caller doesn't know about the pool
result, err := process.Call(serverPID, QueryRequest{Query: "data"})
// Result comes from whichever worker handled it
type AsyncHandler struct {
act.Actor
pending map[gen.Ref]pendingRequest
}
type pendingRequest struct {
from gen.PID
data any
}
func (a *AsyncHandler) Init(args ...any) error {
a.pending = make(map[gen.Ref]pendingRequest)
return nil
}
func (a *AsyncHandler) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
switch req := request.(type) {
case BatchRequest:
// Store the request for later
a.pending[ref] = pendingRequest{from: from, data: req}
// Maybe set a timer to process after accumulating more requests
a.SendAfter(a.PID(), BatchTrigger{}, 100 * time.Millisecond)
// Return nil to handle asynchronously
return nil, nil
case ImmediateRequest:
// This one we can answer immediately
return "immediate result", nil
}
// Return error as result value
return fmt.Errorf("unknown request"), nil
}
func (a *AsyncHandler) HandleMessage(from gen.PID, message any) error {
switch msg := message.(type) {
case BatchTrigger:
// Time to respond to all pending requests
for ref, pr := range a.pending {
result := a.processBatch(pr.data)
a.SendResponse(pr.from, ref, result)
}
a.pending = make(map[gen.Ref]pendingRequest) // clear
}
return nil
}
if invalid {
a.SendResponse(pr.from, ref, fmt.Errorf("validation failed"))
}
// Handler sends an error as a result
a.SendResponse(caller, ref, fmt.Errorf("user not found"))
// Caller receives
result, err := process.Call(handler, request)
// result = error("user not found")
// err = nil
result, err := process.Call(handler, request)
if err != nil {
// Framework problem - timeout, network, process died
return fmt.Errorf("call failed: %w", err)
}
if errResult, ok := result.(error); ok {
// Application-level error
return fmt.Errorf("operation failed: %w", errResult)
}
// Success - use result
processResult(result)
result, err := process.Call(handler, request)
if err != nil {
// Could be:
// - Timeout (gen.ErrTimeout)
// - Network failure (gen.ErrNoConnection)
// - Process crashed (gen.ErrProcessTerminated)
// - OR: Handler sent via SendResponseError
// Caller cannot distinguish!
return fmt.Errorf("call failed: %w", err)
}
result, err := process.Call(databaseActor, query)
if err != nil {
// Retry logic for transport errors
time.Sleep(1 * time.Second)
result, err = process.Call(databaseActor, query)
if err != nil {
return err // Give up
}
}
type DatabaseActor struct {
act.Actor
db *sql.DB
pending map[gen.Ref]pendingRequest
}
type pendingRequest struct {
from gen.PID
query string
}
// internal message that triggers asynchronous query execution
type executeQuery struct {
ref gen.Ref
}
func (d *DatabaseActor) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
query := request.(string)
// Store for async processing
d.pending[ref] = pendingRequest{from: from, query: query}
// Trigger async processing
d.Send(d.PID(), executeQuery{ref: ref})
return nil, nil // Will respond asynchronously
}
func (d *DatabaseActor) HandleMessage(from gen.PID, message any) error {
switch msg := message.(type) {
case executeQuery:
pr := d.pending[msg.ref]
// Execute query
rows, err := d.db.Query(pr.query)
if err != nil {
// Distinguish error types
if isInfrastructureError(err) {
// Database down, connection lost, etc
// Send as transport error - caller should retry/fallback
d.SendResponseError(pr.from, msg.ref, fmt.Errorf("database unavailable: %w", err))
} else {
// Bad SQL, permission denied, etc
// Send as application result - caller should show to user
d.SendResponse(pr.from, msg.ref, fmt.Errorf("query failed: %w", err))
}
delete(d.pending, msg.ref)
return nil
}
// Success
d.SendResponse(pr.from, msg.ref, rows)
delete(d.pending, msg.ref)
}
return nil
}
func isInfrastructureError(err error) bool {
// Check for connection-related errors
if strings.Contains(err.Error(), "connection refused") {
return true
}
if strings.Contains(err.Error(), "too many connections") {
return true
}
// ... other infrastructure error checks
return false
}
result, err := process.Call(databaseActor, "SELECT * FROM users")
if err != nil {
// Infrastructure problem:
// - Database is down (SendResponseError)
// - Actor timed out (gen.ErrTimeout)
// - Network failure (gen.ErrNoConnection)
// All handled the same way - try fallback
process.Log().Warning("database unavailable, using cache: %s", err)
return useFallbackCache()
}
// Check if result is an error
if errResult, ok := result.(error); ok {
// Application error - bad query, permission denied, etc
// Don't retry, don't fallback - show to user
return fmt.Errorf("query error: %w", errResult)
}
// Success
return result
func (a *AsyncHandler) HandleMessage(from gen.PID, message any) error {
switch msg := message.(type) {
case BatchTrigger:
for ref, pr := range a.pending {
// Check if the caller is still waiting
if !ref.IsAlive() {
// Timeout expired, don't bother processing
a.Log().Warning("request %s expired, skipping", ref)
delete(a.pending, ref)
continue
}
// Still waiting, process and respond
result := a.processBatch(pr.data)
a.SendResponse(pr.from, ref, result)
delete(a.pending, ref)
}
}
return nil
}
func (a *Handler) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
switch req := request.(type) {
case CachedRequest:
// We have the answer immediately
if result, found := a.cache[req.Key]; found {
return result, nil
}
// Cache miss, fetch asynchronously
a.pending[ref] = pendingRequest{from: from, data: req}
a.fetchFromBackend(req.Key, ref)
return nil, nil
case WriteRequest:
// Writes are fast, handle synchronously
a.data[req.Key] = req.Value
return "ok", nil
}
// Return error as result value
return fmt.Errorf("unknown request"), nil
}
type Batcher struct {
act.Actor
pending []pendingRequest
timer gen.CancelFunc
}
type pendingRequest struct {
from gen.PID
ref gen.Ref
data any
}
func (b *Batcher) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
// Add to batch
b.pending = append(b.pending, pendingRequest{from, ref, request})
// Start timer if this is the first request
if len(b.pending) == 1 {
// SendAfter returns (CancelFunc, error); keep the cancel function
b.timer, _ = b.SendAfter(b.PID(), Flush{}, 100 * time.Millisecond)
}
// If batch is full, flush immediately
if len(b.pending) >= 100 {
if b.timer != nil {
b.timer() // cancel timer
}
b.flush()
}
return nil, nil
}
func (b *Batcher) flush() {
// Process all pending requests in one batch
results := b.processBatch(b.pending)
for i, pr := range b.pending {
if pr.ref.IsAlive() {
b.SendResponse(pr.from, pr.ref, results[i])
}
}
b.pending = b.pending[:0] // clear, keep capacity
}
// WRONG: Storing only the reference
func (a *Handler) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
a.pendingRefs = append(a.pendingRefs, ref) // Lost the 'from'!
return nil, nil
}
// Later - how do we respond?
func (a *Handler) respond() {
for _, ref := range a.pendingRefs {
a.SendResponse(???, ref, result) // Who do we send to?
}
}
// WRONG: This terminates your process!
func (a *Handler) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
if !a.isAuthorized(from) {
return nil, fmt.Errorf("unauthorized") // OOPS! Process terminates
}
return a.process(request), nil
}
// CORRECT: Send error as result to caller
func (a *Handler) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
if !a.isAuthorized(from) {
return fmt.Errorf("unauthorized"), nil // Caller gets error, process continues
}
return a.process(request), nil
}
// ALSO CORRECT: For async handling
func (a *Handler) HandleMessage(from gen.PID, message any) error {
switch msg := message.(type) {
case processedResult:
// Send any result - value or error, doesn't matter
a.SendResponse(msg.caller, msg.ref, msg.result)
}
return nil
}
// WRONG: Blocks the actor
func (a *Handler) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
time.Sleep(5 * time.Second) // Actor can't process other messages!
return "done", nil
}
Pub/Sub Internals
How the Pub/Sub system works internally
This document explains how Ergo Framework's pub/sub system works under the hood. It's written for developers who want to understand the architecture, network behavior, and performance characteristics when building distributed systems.
For basic usage, see the Links and Monitors and the Events chapters. This document assumes you're familiar with those concepts and focuses on how the system works internally.
The Unified Architecture
Links, monitors, and events look like separate features when you use them. But underneath, they share the same mechanism. Understanding this unification explains why the system behaves consistently and why certain optimizations work.
The Core Concept
Every interaction in the pub/sub system follows one pattern:
A consumer subscribes to a target and receives notifications about that target.
This applies whether you're linking to a process, monitoring a registered name, or subscribing to an event stream. The differences are in what you subscribe to and what notifications you receive.
Three Components of Every Subscription
1. Consumer - The process creating the subscription. This is the process that will receive notifications when something happens to the target.
2. Target - What the consumer subscribes to. Targets come in several types: process PIDs, registered process names, aliases, nodes, and registered events.
3. Subscription Type - How the consumer wants to receive notifications. A link is a hard relationship: by default, the consumer terminates when the target does (the exit signal arrives in the Urgent queue). A monitor is a soft, one-way relationship: the consumer receives a down message in its System queue and keeps running. The combination of target type and subscription type determines what message you receive.
Implicit vs Explicit Events
The targets divide into two categories based on what notifications they generate:
Implicit Events - Processes, names, aliases, and nodes generate termination notifications automatically. The target doesn't do anything special - when it terminates (or disconnects, for nodes), the framework generates notifications for all subscribers.
Explicit Events - Registered events generate both published messages AND termination notifications. A producer process explicitly registers an event and publishes messages to it. When the producer terminates or unregisters the event, subscribers also receive termination notification.
The key difference: implicit events give you one notification (termination). Explicit events give you N published messages plus termination notification.
Why Unification Matters
This unified architecture has practical benefits:
Consistent behavior - The same subscription and notification mechanics work for all target types. Once you understand how monitors work for processes, you understand how they work for events.
Shared optimizations - Network optimizations (covered later) apply to all subscription types. Whether you're monitoring 100 remote processes or subscribing 100 consumers to a remote event, the same sharing mechanism kicks in.
Predictable cleanup - Termination cleanup works identically for all subscriptions. When a process terminates, all its subscriptions are cleaned up using the same code path.
How Local Subscriptions Work
When you subscribe to a target on the same node, the operation is simple and fast.
What You Experience
The call returns instantly. There's no network communication, no blocking. The node records your subscription in memory.
What Happens Internally
The Target Manager maintains subscription records. When a process terminates, Target Manager looks up all subscribers and delivers notifications to their mailboxes. For links, notifications go to the Urgent queue. For monitors, they go to the System queue.
Guarantees
Instant subscription - No waiting, no blocking. The subscription is recorded synchronously.
Guaranteed notification - If the target terminates after you subscribe, you will receive notification. The notification mechanism is part of the termination process itself.
Asynchronous delivery - Notifications arrive in your mailbox like any other message. You process them in your HandleMessage callback.
Automatic cleanup - When the target terminates and you receive notification, the subscription is removed automatically. You don't need to unsubscribe.
How Remote Subscriptions Work
Remote subscriptions involve network communication but provide the same guarantees as local subscriptions.
What You Experience
The call blocks while the subscription request travels to the remote node and the response returns. This typically takes milliseconds on a local network.
What Happens Internally
The subscription request travels to the remote node's Target Manager. It validates that the target exists (returning an error if not), records the subscription, and sends confirmation. Once established, termination notifications travel back over the network.
Subscription Validation
Both local and remote subscriptions validate that the target exists, returning an error if it does not. For events, the target node additionally validates that the event is registered.
Guaranteed Notification Delivery
Remote subscriptions guarantee you receive exactly one termination notification. This guarantee holds even when networks fail. Two paths can deliver your notification:
Path 1: Normal Delivery
The target terminates normally. The remote node sends the notification over the network, and you receive it with the actual termination reason.
Path 2: Connection Failure
The network connection fails before the notification arrives (or before the target even terminates). Your local node detects the disconnection and generates notifications for all subscriptions to targets on the failed node.
Why This Works
You're guaranteed notification through one of two mechanisms:
Remote node delivers it (normal case)
Local node generates it when detecting connection failure (failover)
The Reason field tells you which path occurred. Your code typically handles both the same way - the target is no longer accessible regardless of why.
This failover mechanism compensates for network unreliability. You write code assuming notifications always arrive, because they do.
Network Optimization: Shared Subscriptions
This section describes the optimization that makes distributed pub/sub practical at scale. Without it, many common patterns would be impractical.
The Problem
Consider a realistic scenario: 100 worker processes on one node all monitor the same coordinator process on a remote node.
Naive implementation: Each MonitorPID call creates a separate network subscription. Result:
100 network round-trips to create subscriptions
100 subscription records on the remote node
100 network messages when the coordinator terminates
This doesn't scale. With 1000 workers, you'd have 1000 network messages just to deliver one termination notification.
What Actually Happens
The framework automatically detects when multiple local processes subscribe to the same remote target and shares the network subscription.
What you observe:
The first subscription to a remote target requires network communication. Every subsequent subscription from the same node to the same target returns instantly - it shares the existing network subscription.
How Notification Delivery Works
When the remote target terminates:
Remote node sends ONE notification message to your node
Your node receives it and looks up all local subscribers to that target
Your node delivers individual notifications to each subscriber's mailbox
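The fan-out step can be sketched in plain Go. This is an illustrative model of the mechanism described above, not framework code - the node, pid, and notification types are invented for the example:

```go
package main

import "fmt"

// notification models a termination message for a remote target.
type notification struct {
	target string
	reason string
}

type pid int

// node models the local side: which local processes subscribed to
// each remote target, and each process's mailbox.
type node struct {
	subscribers map[string][]pid
	mailboxes   map[pid][]notification
}

// onRemoteNotification is invoked once per network message; copying
// the notification to N local subscribers is purely local work.
func (n *node) onRemoteNotification(msg notification) int {
	subs := n.subscribers[msg.target]
	for _, p := range subs {
		n.mailboxes[p] = append(n.mailboxes[p], msg)
	}
	return len(subs)
}

func main() {
	n := &node{
		subscribers: map[string][]pid{"coordinator": {1, 2, 3}},
		mailboxes:   map[pid][]notification{},
	}
	delivered := n.onRemoteNotification(notification{"coordinator", "terminated"})
	fmt.Println(delivered) // 3 local deliveries from 1 network message
}
```

The network carries one message regardless of how many local subscribers exist; only the last, in-memory hop scales with subscriber count.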
Performance Characteristics
For 100 local subscribers to one remote target: without sharing, subscription setup costs 100 network round-trips and the target's termination produces 100 network messages. With sharing, setup is 1 round-trip (subsequent subscriptions are recorded locally and return instantly) and termination produces 1 network message, fanned out locally.
Impact on Event Publishing
The same optimization applies to event publishing. When you publish an event with subscribers on multiple nodes:
The framework groups subscribers by node and sends ONE message per node. The producer performs a single publish regardless of subscriber count; each subscriber still receives its own copy of the message in its mailbox, delivered by its local node.
Real-World Scale Example
Consider a market data feed with 1 million subscribers distributed across 10 nodes:
When the producer publishes one price update:
A naive approach (one network message per subscriber) would send 1,000,000 messages. With per-node grouping, the producer sends 10 messages - one per node - and each node fans out locally to its 100,000 subscribers.
The optimization transforms O(N) network cost (where N = total subscribers) into O(M) cost (where M = number of nodes). For distributed systems with many subscribers per node, this is the difference between practical and impossible.
Why This Matters for System Design
This optimization enables patterns that would be impractical otherwise: worker pools monitoring coordinators, distributed caching with invalidation, hierarchical supervision across nodes, and high-frequency event streaming.
When Sharing Doesn't Apply
The optimization applies when multiple processes on the SAME node subscribe to the SAME remote target: these share one network subscription. Subscriptions from different nodes, or to different targets, each require their own.
Buffered Events: Partial Optimization
Buffered events receive partial optimization. The subscription is shared, but each subscriber must retrieve buffer contents individually.
Why Buffers Complicate Sharing
Event buffers store recent messages for new subscribers. When a subscriber joins, they receive the buffered messages along with the subscription.
The problem: different subscribers joining at different times need different buffer contents.
If subscriptions were fully shared, all subscribers would receive the same buffer - incorrect for late subscribers.
What Actually Happens
First subscriber: Network round-trip to create subscription AND retrieve buffer.
Subsequent subscribers: Network round-trip to retrieve current buffer (subscription already exists).
Published messages: Still optimized - one network message per node, distributed locally.
Performance Comparison
| Aspect                   | Unbuffered Event     | Buffered Event                                   |
|--------------------------|----------------------|--------------------------------------------------|
| First subscription       | Network round-trip   | Network round-trip (creates subscription, fetches buffer) |
| Additional subscriptions | Instant (shared)     | Network round-trip (fetches current buffer)      |
| Published messages       | One message per node | One message per node                             |
When to Use Buffers
Use buffers when:
New subscribers need recent history (last N configuration updates)
Subscribers might miss messages during brief disconnections
State can be reconstructed from recent messages
Subscriber count is moderate
Avoid buffers when:
Real-time streaming where history isn't useful
High subscriber count across many nodes (each pays network cost)
Messages are only meaningful at publish time
Memory constraints (buffers consume memory on producer node)
Practical guidance:
Producer Notifications
Producers can receive notifications when subscriber interest changes. This enables demand-driven data production.
Enabling Notifications
What You Receive
When Notifications Arrive
| Transition          | Notification      |
|---------------------|-------------------|
| 0 → 1 subscribers   | MessageEventStart |
| 1 → 2 subscribers   | (none)            |
| 2 → 1 subscribers   | (none)            |
| 1 → 0 subscribers   | MessageEventStop  |
You only receive notifications when crossing the zero threshold. The notifications answer: "is anyone listening?" - not "how many are listening?"
Practical Use Case: On-Demand Data Production
The producer idles when nobody's listening, avoiding unnecessary API calls and resource usage. When subscribers appear, it starts producing. When all subscribers leave, it stops.
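The toggle can be sketched as follows. The message types here are self-contained stand-ins for the framework's gen.MessageEventStart/Stop, so the example runs without the framework; a real producer would handle these inside its HandleMessage callback.

```go
package main

import "fmt"

// Stand-ins for gen.MessageEventStart / gen.MessageEventStop.
type MessageEventStart struct{ Name string }
type MessageEventStop struct{ Name string }

// producer starts and stops work on the zero-subscriber threshold.
type producer struct{ producing bool }

func (p *producer) handle(message any) {
	switch message.(type) {
	case MessageEventStart:
		p.producing = true // first subscriber appeared: start producing
	case MessageEventStop:
		p.producing = false // last subscriber left: go idle
	}
}

func main() {
	p := &producer{}
	p.handle(MessageEventStart{Name: "prices"})
	fmt.Println(p.producing) // true
	p.handle(MessageEventStop{Name: "prices"})
	fmt.Println(p.producing) // false
}
```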
Network Transparency
Notifications work across nodes. Remote subscribers count toward "someone is listening":
The producer doesn't know or care whether subscribers are local or remote. The notification mechanism handles it transparently.
Multiple Events
Each event tracks subscribers independently:
Automatic Cleanup
Subscriptions clean up automatically when any participant terminates. This eliminates resource leaks from forgotten subscriptions.
When Target Terminates
All subscribers receive notification. The subscription ceases to exist - there's nothing to unsubscribe from.
When Subscriber Terminates
Your subscriptions are removed from:
Local subscription records
Remote nodes (for remote subscriptions)
If you were the last local subscriber to a remote target, the network subscription is removed. Otherwise, it stays for remaining local subscribers.
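This "last local subscriber" bookkeeping amounts to reference counting. A minimal sketch of the idea (not the framework's actual internals): the network round-trip happens only on the 0→1 and 1→0 transitions.

```go
package main

import "fmt"

// subTable counts how many local processes share one network
// subscription per remote target.
type subTable struct {
	refs map[string]int
}

func newSubTable() *subTable { return &subTable{refs: make(map[string]int)} }

// add returns true when a real network subscription must be created
// (this caller is the first local subscriber).
func (t *subTable) add(target string) bool {
	t.refs[target]++
	return t.refs[target] == 1
}

// remove returns true when the network subscription must be torn down
// (this caller was the last local subscriber).
func (t *subTable) remove(target string) bool {
	if t.refs[target] == 0 {
		return false
	}
	t.refs[target]--
	if t.refs[target] == 0 {
		delete(t.refs, target)
		return true
	}
	return false
}

func main() {
	tbl := newSubTable()
	fmt.Println(tbl.add("coordinator@nodeB"))    // true: network round-trip
	fmt.Println(tbl.add("coordinator@nodeB"))    // false: shared, instant
	fmt.Println(tbl.remove("coordinator@nodeB")) // false: one subscriber remains
	fmt.Println(tbl.remove("coordinator@nodeB")) // true: last one, tear down
}
```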
When Event Producer Terminates
Subscribers can't distinguish explicit UnregisterEvent from producer termination - both deliver termination notification with reason gen.ErrUnregistered.
When Network Connection Fails
All subscriptions involving the failed node are cleaned up. If the node reconnects later, you need to re-subscribe - the framework doesn't automatically restore subscriptions.
Explicit Unsubscription
You can explicitly remove subscriptions:
Explicit unsubscription is useful when:
You want to stop watching before termination
You're switching to a different target
You're implementing connection retry logic
But in most cases, you don't need explicit unsubscription. Let termination handle cleanup.
Cleanup Order Guarantees
When a process terminates, cleanup happens in a specific order:
Process state changes to Terminated
All outgoing subscriptions (where process is consumer) are removed
All incoming subscriptions (where process is target) generate notifications
Process resources are freed
This ordering ensures:
You don't receive notifications after your process starts terminating
Subscribers to you receive notifications before your resources are freed
No race conditions between notification delivery and cleanup
Summary
| Concept                | How It Works                                                                   |
|------------------------|--------------------------------------------------------------------------------|
| Subscription sharing   | Multiple local subscribers share one network subscription to the remote target |
| Event publishing       | One network message per subscriber node, local fanout to subscribers           |
| Buffered events        | Shared delivery, but each subscriber retrieves the buffer individually         |
| Producer notifications | MessageEventStart/Stop when crossing the zero-subscriber threshold             |
Key Performance Insights
For subscriptions:
First subscription to remote target: network round-trip
Additional subscriptions to same target: instant
Unbuffered events: full sharing
For notifications:
One network message per subscriber node
Local distribution to all subscribers on that node
Cost scales with number of nodes, not number of subscribers
For cleanup:
Automatic on any termination
No resource leaks possible
No manual unsubscription required
| Target | Identifier example                            | What it represents   |
|--------|-----------------------------------------------|----------------------|
| Alias  | gen.Alias{...}                                | A process alias      |
| Node   | gen.Atom("node@host")                         | A network connection |
| Event  | gen.Event{Name: "prices", Node: "node@host"}  | A registered event   |
Building a Cluster
Building production clusters with Ergo technologies
Ergo provides a complete technology stack for building distributed systems. Service discovery, load balancing, failover, observability - all integrated and working together. No external dependencies except the registrar. No API gateways, service meshes, or orchestration layers between your services.
This chapter shows how to use Ergo technologies to build production clusters. You'll see how service discovery enables automatic load balancing, how the leader actor provides failover, how metrics and Observer give you visibility into cluster state. Each technology solves a specific problem; together they cover the full spectrum of distributed system requirements.
LinkPID(target) → MessageExitPID when target terminates
MonitorPID(target) → MessageDownPID when target terminates
LinkEvent(event) → MessageExitEvent when event ends
MonitorEvent(event) → MessageDownEvent when event ends
// Subscribe to process - implicit event
process.MonitorPID(targetPID)
// When targetPID terminates, you receive MessageDownPID
// The target process didn't send anything - framework generated it
// Producer registers event - explicit event source
token, _ := producer.RegisterEvent("prices", gen.EventOptions{})
// Producer publishes messages
producer.SendEvent("prices", token, PriceUpdate{...})
// Subscriber receives:
// - MessageEvent for each published message
// - MessageDownEvent when producer terminates or unregisters
// Subscribe to local process
err := process.MonitorPID(localTarget)
// Returns immediately - no waiting
// Subscribe to remote process
err := process.MonitorPID(remotePID)
// Blocks briefly during network round-trip
// May return error if target doesn't exist or network fails
// If target doesn't exist, you get an error
err := process.MonitorPID(nonExistentPID)
// err == gen.ErrProcessUnknown
// If target exists but already terminated
err := process.MonitorPID(terminatedPID)
// err == gen.ErrProcessTerminated
// If event isn't registered, you get an error
_, err := process.MonitorEvent(gen.Event{Name: "unknown", Node: "node@host"})
// err == gen.ErrEventUnknown
func (w *Worker) HandleMessage(from gen.PID, message any) error {
switch msg := message.(type) {
case gen.MessageDownPID:
if msg.Reason == gen.ErrNoConnection {
// Connection failed
// Target might still be running on an isolated node
// Or it might have terminated - we can't know
log.Printf("Lost connection to target's node")
}
}
return nil
}
case gen.MessageDownPID:
// Whether normal termination or connection failure,
// the target is gone from our perspective
w.handleTargetGone(msg.PID, msg.Reason)
// On node A, 100 worker processes all monitor the same coordinator on node B
coordinatorPID := gen.PID{Node: "nodeB@host", ID: 500}
for i := 0; i < 100; i++ {
workers[i].MonitorPID(coordinatorPID)
}
// First worker subscribes
err := worker1.MonitorPID(coordinatorPID)
// Takes a few milliseconds - network round-trip
// Second worker subscribes to SAME target
err := worker2.MonitorPID(coordinatorPID)
// Returns instantly - no network communication
// 98 more workers subscribe
// All return instantly
// All 100 workers receive notification
// But only ONE network message was sent
// Your node distributed it locally
func (w *Worker) HandleMessage(from gen.PID, message any) error {
switch message.(type) {
case gen.MessageDownPID:
// You can't tell if you're the only subscriber
// or one of 1000 subscribers
// The timing and behavior are identical
}
return nil
}
// Producer on node A publishes
process.SendEvent("market.prices", token, PriceUpdate{Symbol: "BTC", Price: 42000})
// Publish returns immediately
process.SendEvent("market.prices", token, update)
// You don't wait for delivery
// You don't know how many subscribers there are
// You don't know which nodes they're on
func (c *Consumer) HandleEvent(message gen.MessageEvent) error {
// Event arrives in your mailbox
// Same timing whether you're the only subscriber or one of thousands
// Same timing whether producer is local or remote
return nil
}
Configuration:
- 1 producer on node A
- 10 consumer nodes (B through K)
- 100,000 subscribers per consumer node
- 1,000,000 total subscribers
Total subscribers: 1000000
Consumer nodes: 10
Subscribers per node: 100000
Time to publish: 64.125µs
Time to deliver all: 342.414375ms
Network messages sent: 10 (1 per consumer node)
Delivery rate: 2920438 msg/sec
// 50 workers on each of 10 nodes monitor a shared coordinator
// Network cost: 10 subscriptions, not 500
for i := 0; i < 50; i++ {
worker := SpawnWorker()
worker.MonitorPID(coordinatorPID)
}
// Cache instances on every node subscribe to invalidation events
// When data changes, ONE message per node delivers invalidation
// Each node updates all its local cache instances
// Multiple supervisors can monitor the same critical process
// Notification cost stays constant regardless of supervisor count
// Price feed publishes thousands of updates per second
// Cost per update: one message per subscriber NODE
// Not one message per subscriber PROCESS
// Same node, same remote target
processA.MonitorPID(remoteTarget) // Network round-trip
processB.MonitorPID(remoteTarget) // Instant (shared)
processC.MonitorPID(remoteTarget) // Instant (shared)
// Different remote targets
processA.MonitorPID(remoteTarget1) // Network round-trip
processB.MonitorPID(remoteTarget2) // Network round-trip (different target)
// Different nodes subscribing to same target
// (Each node has its own subscription to the target)
nodeX_process.MonitorPID(remoteTarget) // Network round-trip
nodeY_process.MonitorPID(remoteTarget) // Network round-trip (from different node)
// Producer creates event with 100-message buffer
token, _ := process.RegisterEvent("prices", gen.EventOptions{
Buffer: 100,
})
// Producer publishes messages over time
process.SendEvent("prices", token, msg1) // Stored in buffer
process.SendEvent("prices", token, msg2) // Stored in buffer
// ... more messages ...
// Subscriber joins and receives buffer
buffered, _ := process.MonitorEvent(event)
for _, msg := range buffered {
// These are recent messages published before subscription
_ = msg
}
// Process 1 subscribes at 10:00:00
buffered1, _ := process1.MonitorEvent(event)
// Receives messages 1-100
// Producer publishes messages 101-150
// Process 2 subscribes at 10:00:30
buffered2, _ := process2.MonitorEvent(event)
// Must receive messages 51-150 (different from process 1!)
// Good: Configuration updates, moderate subscribers
token, _ := process.RegisterEvent("config.updates", gen.EventOptions{
Buffer: 10, // Last 10 config changes
})
// Good: State snapshots for late joiners
token, _ := process.RegisterEvent("game.state", gen.EventOptions{
Buffer: 1, // Just the latest state
})
// Better without buffer: High-frequency price feed
token, _ := process.RegisterEvent("prices.realtime", gen.EventOptions{
Buffer: 0, // Full sharing optimization
})
// Better without buffer: High subscriber count
token, _ := process.RegisterEvent("system.metrics", gen.EventOptions{
Buffer: 0, // Thousands of subscribers, skip buffer overhead
})
// Producer on node A
token, _ := process.RegisterEvent("data", gen.EventOptions{Notify: true})
// Subscriber on node B subscribes
// Producer receives MessageEventStart
// Subscriber on node B unsubscribes (and was the only subscriber)
// Producer receives MessageEventStop
token1, _ := process.RegisterEvent("prices.stocks", gen.EventOptions{Notify: true})
token2, _ := process.RegisterEvent("prices.crypto", gen.EventOptions{Notify: true})
// MessageEventStart for "prices.stocks" when first stock subscriber appears
// MessageEventStart for "prices.crypto" when first crypto subscriber appears
// These are independent - one can have subscribers while other doesn't
// You subscribed to this process
process.MonitorPID(target)
// Target terminates for any reason
// - Returns error from callback
// - Panics
// - Receives kill signal
// - Node shuts down
// You receive notification
case gen.MessageDownPID:
// Subscription is automatically removed
// No cleanup needed on your part
// You created several subscriptions
process.MonitorPID(target1)
process.MonitorPID(target2)
process.MonitorEvent(event)
// Your process terminates
// All subscriptions are automatically removed
// No cleanup code needed
// Producer registered an event
token, _ := producer.RegisterEvent("prices", gen.EventOptions{Buffer: 100})
// Producer terminates (or explicitly calls UnregisterEvent)
// All subscribers receive termination notification:
// - Links receive MessageExitEvent
// - Monitors receive MessageDownEvent
// Event resources are cleaned up:
// - Buffer memory freed
// - Event name available for re-registration
// You have subscriptions to targets on node B
process.MonitorPID(pid_on_nodeB)
process.MonitorProcessID(name_on_nodeB)
process.MonitorEvent(event_on_nodeB)
// Connection to node B fails
// (Network partition, node crash, etc.)
// You receive termination for ALL subscriptions to that node:
case gen.MessageDownPID:
if msg.Reason == gen.ErrNoConnection {
// Network failure, not process termination
}
case gen.MessageDownProcessID:
if msg.Reason == gen.ErrNoConnection {
// Network failure
}
case gen.MessageDownEvent:
if msg.Reason == gen.ErrNoConnection {
// Network failure
}
// Remove link
process.UnlinkPID(target)
process.UnlinkEvent(event)
// Remove monitor
process.DemonitorPID(target)
process.DemonitorEvent(event)
The Integration Cost Problem
Traditional microservice architectures pay a heavy integration tax. Each service needs:
HTTP/gRPC endpoints for communication
Client libraries with retry logic and circuit breakers
Service mesh sidecars for traffic management
API gateways for routing and load balancing
Health check endpoints and probes
Metrics exporters and tracing spans
Configuration management and secret injection
Each layer adds latency, complexity, and failure modes. A simple call between two services traverses client library, sidecar proxy, load balancer, another sidecar, server library. Each hop serializes, deserializes, and can fail independently.
Ergo eliminates these layers. Processes communicate directly through message passing. The framework handles serialization, routing, load balancing, and failure detection. No sidecars, no API gateways, no client libraries.
One network hop. One serialization. Built-in load balancing and failover. This isn't a philosophical difference - it's orders of magnitude less infrastructure to deploy, maintain, and debug.
Service Discovery with Registrars
Service discovery is the foundation of clustering. How does node A find node B? How does a process locate the right service instance? Ergo provides three registrar options, each suited for different scales and requirements.
Embedded Registrar
The embedded registrar requires no external infrastructure. The first node on a host becomes the registrar server; others connect as clients.
Cross-host discovery uses UDP queries. When node 2 needs to reach node 4, it asks its local registrar server (node 1), which queries node 4's host via UDP.
Use for: Development, testing, single-host deployments, simple multi-host setups without firewalls blocking UDP.
Limitations: No application discovery, no configuration management, no event notifications.
etcd Registrar
etcd provides centralized discovery with application routing, configuration management, and event notifications. Nodes register with etcd and maintain leases for automatic cleanup.
etcd registrar capabilities:
| Feature               | Description                                        |
|-----------------------|----------------------------------------------------|
| Node discovery        | Find all nodes in the cluster                      |
| Application discovery | Find which nodes run specific applications         |
| Weighted routing      | Load balance based on application weights          |
| Configuration         | Centralized configuration with per-node overrides  |
Use for: Teams already running etcd, clusters up to 50-70 nodes, deployments needing application discovery.
Saturn Registrar
Saturn is purpose-built for Ergo. Instead of polling (like etcd), it maintains persistent connections and pushes updates immediately. Topology changes propagate in milliseconds.
Saturn vs etcd:
| Aspect             | etcd              | Saturn              |
|--------------------|-------------------|---------------------|
| Update propagation | Polling (seconds) | Push (milliseconds) |
| Connection model   | HTTP requests     | Persistent TCP      |
Use for: Large clusters, real-time topology awareness, production systems where discovery latency matters.
Application Discovery and Load Balancing
Applications are the unit of deployment in Ergo. A node can load multiple applications, start them with different modes, and register them with the registrar. Other nodes discover applications and route requests based on weights.
Registering Applications
When you start an application, it automatically registers with the registrar (if using etcd or Saturn):
The registrar now knows: application "api" is running on this node with weight 100.
Discovering Applications
Other nodes can discover where applications run:
Output might show:
Weighted Load Balancing
Weights enable traffic distribution. A node with weight 100 receives twice as much traffic as a node with weight 50. Use this for:
Canary deployments: New version with weight 10, stable with weight 90
Capacity matching: Powerful nodes get higher weights
Graceful draining: Set weight to 0 before maintenance
During canary and rolling deployments, the cluster runs mixed code versions. Messages sent from new nodes must be understood by old nodes, and vice versa. Ensure your message types support version coexistence as described in Message Versioning.
Routing Requests
Once you know where applications run, route requests using weighted selection:
This is application-level load balancing without external infrastructure. No load balancer service, no sidecar proxies.
Running Multiple Instances for Load Balancing
Horizontal scaling means running the same application on multiple nodes. Each instance handles a portion of traffic. Add nodes to increase capacity; remove nodes to reduce costs.
Deployment Pattern
Each client discovers all api instances and distributes requests based on weights.
Implementation
On each worker node:
On coordinator/client nodes:
Scaling Operations
Scale up: Start new node with the same application. It registers with the registrar. Other nodes discover it through events or next resolution.
Scale down: Set weight to 0 (drain), wait for in-flight work, stop the node. Registrar removes the registration when the lease expires.
Reacting to Topology Changes
Subscribe to registrar events to react when instances join or leave:
No polling. No service mesh. Events arrive within milliseconds (Saturn) or at the next poll cycle (etcd).
Running Multiple Instances for Failover
Failover means having standby instances ready to take over when the primary fails. The leader actor implements distributed leader election - exactly one instance is active (leader) while others wait (followers).
The Leader Actor
The leader.Actor from ergo.services/actor/leader implements Raft-based leader election. Embed it in your actor to participate in elections:
Election Mechanics
All instances start as followers
If no heartbeats arrive, a follower becomes candidate
Candidate requests votes from peers
Majority vote wins; candidate becomes leader
Leader sends periodic heartbeats
If leader fails, followers detect timeout and elect new leader
Failover Scenario
Failover happens automatically. No manual intervention. The surviving nodes elect a new leader within the election timeout (150-300ms by default).
Use Cases
Single-writer coordination: Only the leader writes to prevent conflicts.
Task scheduling: Only the leader runs periodic tasks.
Leader election requires a majority (quorum) to prevent split-brain:
| Cluster Size | Quorum | Tolerated Failures |
|--------------|--------|--------------------|
| 3 nodes      | 2      | 1                  |
| 5 nodes      | 3      | 2                  |
If a network partition splits 5 nodes into groups of 3 and 2:
The group of 3 can elect a leader (has quorum)
The group of 2 cannot (no quorum)
This prevents both sides from having leaders and making conflicting decisions.
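The quorum arithmetic behind the table is simple majority math:

```go
package main

import "fmt"

// quorum returns the majority size for a cluster of n nodes.
func quorum(n int) int { return n/2 + 1 }

// toleratedFailures is how many nodes can fail while the
// remainder can still elect a leader.
func toleratedFailures(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{3, 5} {
		fmt.Printf("%d nodes: quorum %d, tolerates %d failures\n",
			n, quorum(n), toleratedFailures(n))
	}
}
```

This is also why clusters are usually sized with odd node counts: going from 3 to 4 nodes raises the quorum to 3 without tolerating any additional failures.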
Observability with Metrics
The metrics actor from ergo.services/actor/metrics exposes Prometheus-format metrics. Base metrics are collected automatically; you add custom metrics for application-specific telemetry.
Basic Setup
This starts an HTTP server at :9090/metrics with base metrics:
| Metric                    | Description         |
|---------------------------|---------------------|
| ergo_node_uptime_seconds  | Node uptime         |
| ergo_processes_total      | Total process count |
| ergo_processes_running    | Actively processing |
| ergo_memory_used_bytes    | Memory in use       |
Custom Metrics
Extend the metrics actor for application-specific telemetry:
Update metrics from your application:
Prometheus Integration
Now you have cluster-wide visibility: process counts, memory usage, network traffic, custom business metrics - all in Prometheus/Grafana.
Inspecting with Observer
Observer is a web UI for cluster inspection. Run it as an application within your node or as a standalone tool.
Configuration: Update settings via etcd; changes propagate immediately
All of this with:
No API gateways
No service mesh
No load balancer services
No orchestration layers
No client libraries with retry logic
Just Ergo nodes communicating directly through message passing.
Summary
Ergo provides integrated technologies for building production clusters:
| Technology   | Purpose                       | Package                                        |
|--------------|-------------------------------|------------------------------------------------|
| Registrars   | Service discovery             | ergo.services/registrar/etcd, registrar/saturn |
| Applications | Deployment units with weights | Core framework                                 |
These components eliminate the integration layers that dominate traditional microservice architectures. Instead of building infrastructure, you build applications.
How to Structure Projects Built with Ergo Framework
The same codebase can run as a monolith on your laptop or as distributed services across a data center. This flexibility comes from one principle: applications are the unit of composition. How you organize your project determines whether you can use this flexibility or fight against it.
This chapter covers project organization, message isolation patterns, deployment strategies, and evolution paths. The goal is a structure that supports both development simplicity and production scalability without code changes.
The Flexibility Promise
Ergo's network transparency means a process doesn't know if it's talking to a neighbor in the same node or a remote process across the network. The same Send() call works either way. But this only helps if your code is organized to take advantage of it.
Consider two deployment scenarios:
Development: All applications in one process for fast iteration.
Production: Applications distributed across nodes for scalability.
The application code is identical in both cases. Only the entry point changes - which applications start on which nodes.
This works because:
Applications are self-contained functional units
Messages define contracts between applications
The framework handles routing transparently
Your project structure must preserve these properties. Mix them up, and you lose deployment flexibility.
Directory Layout
A well-structured project separates entry points from applications from shared code:
Entry Points (cmd/)
Each directory in cmd/ produces a different binary with a different deployment topology.
Monolith - everything together:
Distributed - each application on its own node:
The application code (apps/api, apps/worker) is identical. The entry point decides what runs where.
Applications (apps/)
Each subdirectory in apps/ is a self-contained application. An application is:
A cohesive functional unit
Deployable independently
Composed of actors with a supervision tree
Communicating via messages
Application structure:
Application definition:
Applications should not import each other. If apps/api imports apps/worker, you've created a compile-time dependency that limits deployment flexibility.
Service-Level Types (types/)
When applications need to communicate, they need shared message types. The types/ directory holds these contracts:
Both apps/orders and apps/shipping can import types without importing each other. This breaks the circular dependency while maintaining strong typing.
Shared Libraries (lib/)
Non-actor code that multiple applications use goes in lib/:
Libraries must be:
Stateless - no global variables, no goroutines
Pure - same inputs produce same outputs
Actor-agnostic - no dependency on gen.Process
Libraries are safe to call from actor callbacks because they don't block or manage state.
Message Isolation Levels
Messages define contracts between actors. The visibility of message types controls who can send them and where they can travel. Ergo uses Go's export rules plus EDF serialization requirements to create four isolation levels.
Understanding these levels is critical for proper encapsulation.
Level 1: Application-Internal (Same Node)
Messages used only within a single application instance on one node.
Characteristics:
Type is unexported (scheduleTask)
Fields are unexported (taskID, not TaskID)
Cannot be imported by other packages
Use when:
Communication between actors in the same application
Level 2: Application-Internal (Cross-Node)
Messages between instances of the same application across nodes.
Characteristics:
Type is unexported (replicateState)
Fields are exported (Version, not version)
Cannot be imported by other packages
Use when:
Replication between application instances
Cluster-internal coordination
Messages that other applications shouldn't see
Level 3: Cross-Application (Same Node Only)
Messages between different applications on the same node.
Characteristics:
Type is exported (StatusQuery)
Fields are unexported (taskID, not TaskID)
CAN be imported by other packages
Use when:
Local service queries
Same-node optimization paths
Explicitly preventing network transmission
This level is intentionally restrictive. If someone tries to send StatusQuery to a remote node, serialization fails at runtime. The unexported fields act as a built-in guard against accidental network use.
Level 4: Service-Level (Everywhere)
Messages that form public contracts between applications across the cluster.
Characteristics:
Type is exported (ProcessTask)
Fields are exported (TaskID)
CAN be imported by any package
Use when:
Public API between applications
Events that multiple applications subscribe to
Commands sent across application boundaries
Summary Table
| Level | Scope                            | Type       | Fields     | Serializable | Import |
|-------|----------------------------------|------------|------------|--------------|--------|
| 1     | Application-internal, same node  | unexported | unexported | No           | No     |
| 2     | Application-internal, cross-node | unexported | exported   | Yes          | No     |
| 3     | Cross-application, same node     | exported   | unexported | No           | Yes    |
| 4     | Service-level, everywhere        | exported   | exported   | Yes          | Yes    |
Choosing the Right Level
Start with Level 1 (maximum restriction). Only increase visibility when needed:
Does another application need this message?
No → Keep type unexported (Level 1 or 2)
Yes → Export type (Level 3 or 4)
Application Design Patterns
Supervision Structure
Applications typically have a supervision tree:
Configuration via Options
Applications accept configuration through an Options struct:
Entry points configure options based on deployment:
Inter-Application Communication
Applications discover each other through application names, not node names:
When running as monolith, routes returns the local node. When distributed, it returns remote nodes. The code doesn't change.
Event Publishing
Applications publish events for loose coupling:
Events decouple applications. Orders doesn't know who listens. Shipping doesn't know where Orders runs.
Deployment Patterns
Pattern 1: Development Monolith
Everything in one process for fast iteration:
Benefits:
Single binary to run
No network setup
Easy debugging
Fast startup
Pattern 2: Distributed Production
Each application on dedicated nodes:
Each binary runs one application:
Benefits:
Independent scaling per tier
Fault isolation
Resource optimization
Zero-downtime updates
Pattern 3: Hybrid Deployment
Group related applications for efficiency:
Benefits:
Reduced network hops for common paths
Fewer nodes to manage
Right-sized for actual traffic patterns
Testing Strategies
Unit Testing Actors
Test actors in isolation using the testing framework:
Integration Testing Applications
Test complete applications:
Testing Distributed Scenarios
Test multiple nodes:
Evolution and Refactoring
Starting Simple
Begin with a monolith:
Extracting Applications
When the monolith grows, extract bounded contexts:
Step 1: Identify boundaries in the combined application.
Step 2: Create separate application packages.
Step 3: Update the entry point.
Step 4: When ready, create distributed entry points.
The application code never changes. Only entry points and deployment.
Merging Applications
If you over-distributed:
No application code changes. Just different composition.
Best Practices
Application Boundaries
Do:
One application per bounded context
Applications that scale together can be one application
Applications that deploy together can be one application
Don't:
Create applications for single actors
Split applications by technical layer (web/service/data)
Create circular dependencies between applications
Good:
Bad:
Message Design
Do:
Start with Level 1 (most restrictive)
Increase visibility only when needed
Document which level each message uses
Don't:
Default to Level 4 for everything
Mix isolation levels arbitrarily
Use any or interface{} for messages
Dependencies
Do:
Applications import types/ for shared contracts
Applications import lib/ for utilities
Entry points import applications
Don't:
Applications import other applications
Libraries depend on applications
Create import cycles
Configuration
Do:
Use Options structs for application config
Validate in CreateApp or Load
Provide sensible defaults
Don't:
Hard-code configuration in actors
Read os.Getenv directly in actors
Store configuration in global variables
What's Next
This chapter covered project organization for flexible deployment. As your system grows into a distributed cluster, two topics become essential:
Building a Cluster - service discovery, load balancing, failover, and observability
Message Versioning - evolving message contracts during rolling upgrades
// No configuration needed - embedded registrar is the default
node, _ := ergo.StartNode("service@localhost", gen.NodeOptions{})
// On remote node: enable spawn
network := node.Network()
network.EnableSpawn("worker", createWorker, "coordinator@host")
// On coordinator: spawn remotely
remote, _ := coordinator.Network().GetNode("worker@host")
pid, _ := remote.Spawn("worker", gen.ProcessOptions{}, WorkerConfig{BatchSize: 100})
// Send work to the remote process
coordinator.Send(pid, ProcessJob{Data: jobData})
// On remote node: load app and enable remote start
node.ApplicationLoad(&WorkerApp{}, gen.ApplicationSpec{Name: "workers"})
network.EnableApplicationStart("workers", "coordinator@host")
// On coordinator: start remotely
remote, _ := coordinator.Network().GetNode("worker@host")
remote.ApplicationStartPermanent("workers", gen.ApplicationOptions{})
1. Node-specific in cluster: /cluster/{cluster}/config/{node}/{item}
2. Cluster-wide default: /cluster/{cluster}/config/*/{item}
3. Global default: /config/global/{item}
# Set config via etcdctl
etcdctl put services/ergo/cluster/production/config/*/db.pool_size "int:20"
etcdctl put services/ergo/cluster/production/config/*/cache.enabled "bool:true"
etcdctl put services/ergo/cluster/production/config/node1/db.pool_size "int:50"
// Read config in your application
registrar, _ := node.Network().Registrar()
config, _ := registrar.Config("db.pool_size", "cache.enabled")
poolSize := config["db.pool_size"].(int64) // 50 on node1, 20 on others
cacheEnabled := config["cache.enabled"].(bool) // true
func (a *App) HandleEvent(ev gen.MessageEvent) error {
	switch msg := ev.Message.(type) {
	case etcd.EventConfigUpdate:
		a.Log().Info("config changed: %s = %v", msg.Item, msg.Value)
		switch msg.Item {
		case "log.level":
			a.updateLogLevel(msg.Value.(string))
		case "cache.size":
			a.resizeCache(msg.Value.(int64))
		}
	}
	return nil
}
// types/events.go
package types

import (
	"time"

	"ergo.services/ergo/net/edf"
)

// Events published by the orders application
type OrderCreated struct {
	OrderID    string
	CustomerID string
	Total      int64
	CreatedAt  time.Time
}

type OrderCompleted struct {
	OrderID     string
	CompletedAt time.Time
}

func init() {
	// Register for network serialization
	edf.RegisterTypeOf(OrderCreated{})
	edf.RegisterTypeOf(OrderCompleted{})
}
// lib/config/config.go
package config

import "os"

func DatabaseURL() string {
	return os.Getenv("DATABASE_URL")
}
// lib/models/order.go
package models

type Order struct {
	ID         string
	CustomerID string
	Items      []OrderItem
	Total      int64
}
// apps/worker/messages.go
package worker

// Unexported type, unexported fields.
// Cannot be referenced outside this package.
// Cannot be serialized (unexported fields).
type scheduleTask struct {
	taskID   string
	priority int
	data     []byte
}

type taskCompleted struct {
	taskID string
	result []byte
}
// apps/worker/messages.go
package worker

// Unexported type, EXPORTED fields.
// Cannot be referenced outside this package.
// CAN be serialized (exported fields).
type replicateState struct {
	Version   int64 // Exported for EDF
	TaskIDs   []string
	Positions map[string]int
}

type syncRequest struct {
	FromVersion int64
	ToVersion   int64
}
// apps/worker/messages.go
package worker

// EXPORTED type, unexported fields.
// CAN be referenced by other packages.
// Cannot be serialized (unexported fields).
type StatusQuery struct {
	taskID string // unexported - prevents network use
}

type StatusResponse struct {
	taskID   string
	status   string
	progress int
}
// types/commands.go
package types

import "ergo.services/ergo/net/edf"

// EXPORTED type, EXPORTED fields.
// CAN be referenced by any package.
// CAN be serialized.
type ProcessTask struct {
	TaskID   string
	Priority int
	Payload  []byte
}

type TaskResult struct {
	TaskID string
	Status string
	Output []byte
	Error  string
}

func init() {
	edf.RegisterTypeOf(ProcessTask{})
	edf.RegisterTypeOf(TaskResult{})
}
apps/
├── order_api/ # Just API handlers
├── order_service/ # Just business logic
├── order_repository/ # Just data access
└── order_events/ # Just events
Unit
A zero-dependency library for testing Ergo Framework actors with a fluent API.
Introduced in Ergo Framework 3.1.0 and above (not yet released; available in the v310 branch).
The Ergo Unit Testing Library makes testing actor-based systems simple and reliable. It provides tools designed specifically for the unique challenges of testing actors, with zero external dependencies and an intuitive, readable API.
What You'll Learn
This guide takes you from simple actor tests to complex distributed scenarios. Here's the journey:
Getting Started (You Are Here!)
Your First Test - Simple echo and counter examples
Built-in Assertions - Simple tools for common checks
Basic Message Testing - Verify actors send the right messages
Intermediate Skills (Next Steps)
Configuration Testing - Test environment-driven behavior
Basic Process Spawning - Test actor creation and lifecycle
Advanced Features (When You Need Them)
Actor Termination - Test error handling and graceful shutdowns
Exit Signals - Manage process lifecycles in supervision trees
Scheduled Operations - Test cron jobs and time-based behavior
Expert Level (Complex Scenarios)
Dynamic Value Capture - Handle generated IDs, timestamps, and random data
Complex Workflows - Test multi-step business processes
Performance & Load Testing - Verify behavior under stress
Tip: The documentation follows this learning path. You can jump to advanced topics if needed, but starting from the beginning ensures you understand the foundations.
Why Testing Actors is Different
Traditional testing tools don't work well with actors. Here's why:
The Challenge: Actors Are Not Functions
Regular code testing follows a simple pattern:
But actors are fundamentally different:
They run asynchronously - you send a message and the response comes later
They maintain state - previous messages affect future behavior
They spawn other actors - creating complex hierarchies
What Makes Actor Testing Hard
Asynchronous Communication
Message Flow Complexity
Dynamic Process Creation
State Changes Over Time
How This Library Solves Actor Testing
The Ergo Unit Testing Library addresses each of these challenges:
Event Capture - See Everything Your Actor Does
Instead of guessing what happened, the library automatically captures every actor operation:
Fluent Assertions - Test What Matters
Express your test intentions clearly:
Dynamic Value Handling - Work With Generated Data
Capture and reuse dynamically generated values:
State Testing Through Behavior - Verify State Changes
Test state indirectly by verifying behavioral changes:
Why Zero Dependencies Matter
Actor testing is complex enough without dependency management headaches:
No version conflicts - Works with any Go testing setup
No external tools - Everything needed is built-in
Simple imports - Just import "ergo.services/ergo/testing/unit"
Core Concepts
Now that you understand why actor testing is different, let's explore the key concepts that make this library work.
The Event-Driven Testing Model
Everything your actor does becomes a testable "event".
When you run this simple test:
Here's what happens behind the scenes:
Your actor receives the message - Normal actor behavior
Your actor sends a response - Normal actor behavior
The library captures a SendEvent - Testing magic
The library automatically captures these events:
SendEvent - When your actor sends a message
SpawnEvent - When your actor creates child processes
LogEvent - When your actor writes log messages
Why Events Matter
Events solve the fundamental challenge of testing asynchronous systems:
Instead of this (impossible):
You do this (works perfectly):
The Fluent Assertion API
The library provides a readable, chainable API that expresses test intentions clearly:
Benefits of the fluent API:
Readable - Tests read like English sentences
Discoverable - IDE autocomplete guides you through options
Flexible - Chain only the validations you need
Installation
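The unit library ships inside the main ergo module, so there is nothing extra to install beyond the framework itself. Until 3.1.0 is tagged, that means pulling the v310 branch mentioned above (standard `go get` branch resolution assumed; this is a sketch, not an official instruction):

```shell
# The unit library lives in the main ergo module - no extra dependency.
go get ergo.services/ergo@v310
```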
Your First Actor Test
Let's start with the simplest possible actor test to understand the basics:
A Simple Echo Actor
Testing the Echo Actor
What Just Happened?
This simple test demonstrates the core pattern:
unit.Spawn() - Creates a test actor in an isolated environment
actor.SendMessage() - Sends a message to your actor (just as production code would)
actor.ShouldSend() - Verifies that your actor sent the expected message
Key insight: You're not testing internal state - you're testing behavior. You verify what the actor does (sends messages) rather than what it contains (internal variables).
Why This Works
The testing library automatically captures everything your actor does:
Every message sent by your actor
Every process spawned by your actor
Every log message written by your actor
Then it provides fluent assertions to verify these captured events.
Adding Slightly More Complexity
Let's test an actor that maintains some state:
This shows how you test stateful behavior without accessing internal state - by observing how the actor's responses change over time.
Built-in Assertions
Before diving into complex actor testing, let's cover the simple assertion utilities you'll use throughout your tests.
Why Built-in Assertions Matter for Actor Testing:
Actor tests often need to verify simple conditions alongside complex event assertions. Rather than forcing you to import external testing libraries (which could conflict with your project dependencies), the unit testing library provides everything you need:
Available Assertions
Equality Testing:
Boolean Testing:
Nil Testing:
String Testing:
Type Testing:
Why Zero Dependencies Matter
No Import Conflicts:
Consistent Error Messages: All assertions provide clear, consistent error messages that integrate well with the actor testing output.
Framework Agnostic: Works with any Go testing setup - standard go test, IDE test runners, CI/CD systems, etc.
Basic Message Testing
Now that you understand the fundamentals, let's explore message testing in more depth.
What Comes Next
Now you'll learn how to test different aspects of actor behavior, building from simple to complex:
Fundamentals (You're here!)
Basic message sending and receiving
Simple process creation
Logging and observability
Configuration testing
Intermediate Skills
Complex message patterns
Event inspection and debugging
Actor lifecycle and termination
Error handling and recovery
Advanced Features
Scheduled operations (cron jobs)
Network and distribution
Performance and load testing
Basic Logging Testing
Logging is crucial for production actors - it provides visibility into what your actors are doing and helps with debugging. Let's learn how to test logging behavior.
Why Test Logging?
Logging tests ensure:
Your actors provide sufficient information for monitoring
Debug information is available when needed
Log levels are respected (don't log debug in production)
Simple Logging Test
Testing Different Log Levels
Testing Log Content
Logging Best Practices for Testing
Structure your log messages to make them easy to test:
Test log levels appropriately:
Error - Test that errors are logged when they occur
Warning - Test that concerning but non-fatal events are captured
Info - Test that important business events are recorded
Intermediate Skills
Now that you've mastered the basics, let's tackle more complex testing scenarios.
Configuration and Environment Testing
Real actors often behave differently based on configuration. Let's test this:
The Spawn function creates an isolated testing environment for your actor. Unlike production actors that run in a complex node environment, test actors run in a controlled sandbox where every operation is captured for verification.
Key Benefits:
Isolation: Each test actor runs independently without affecting other tests
Deterministic: Test outcomes are predictable and repeatable
Observable: All actor operations are automatically captured as events
Example Actor:
Test Implementation:
Configuration Options - Fine-Tuning the Test Environment
Test configuration allows you to simulate different runtime conditions without requiring complex setup:
Environment Variables (WithEnv): Test how your actors behave with different configurations without changing production code. Useful for testing feature flags, database URLs, timeout values, and other configuration-driven behavior.
Log Levels (WithLogLevel): Control the verbosity of test output and verify that your actors log appropriately at different levels. Critical for testing monitoring and debugging capabilities.
Process Hierarchy (WithParent, WithRegister): Test actors that need to interact with parent processes or require specific naming for registration-based lookups.
Message Testing
ShouldSend() - Verifying Actor Communication
Message testing is the heart of actor validation. Since actors communicate exclusively through messages, verifying message flow is crucial for ensuring correct behavior.
Why Message Testing Matters:
Validates Integration: Ensures actors communicate correctly with their dependencies
Confirms Business Logic: Verifies that the right messages are sent in response to inputs
Detects Side Effects: Catches unintended message sends that could cause bugs
When testing complex message structures or dynamic content, the library provides powerful matching capabilities:
Pattern Matching Benefits:
Partial Validation: Test only the fields that matter for your specific test case
Dynamic Content Handling: Validate messages with timestamps, UUIDs, or generated IDs
Type Safety: Ensure messages are of the correct type even when content varies
Process Spawning
ShouldSpawn() - Testing Process Lifecycle Management
Process spawning is a fundamental actor pattern for building hierarchical systems. The testing library provides comprehensive tools for verifying that actors create, configure, and manage child processes correctly.
Why Process Spawning Tests Matter:
Resource Management: Ensure actors don't spawn too many or too few processes
Configuration Propagation: Verify that child processes receive correct configuration
Error Handling: Test behavior when process spawning fails
Example Actor:
Test Implementation:
Dynamic Process Testing - Handling Generated Values
Real-world actors often generate dynamic values like session IDs, request tokens, or timestamps. The library provides sophisticated tools for capturing and validating these dynamic values.
Dynamic Value Testing Scenarios:
Session Management: Test actors that create sessions with generated IDs
Request Tracking: Verify that request tokens are properly generated and used
Time-based Operations: Validate actors that schedule work or create timestamps
Remote Spawn Testing
ShouldRemoteSpawn() - Testing Distributed Actor Creation
Remote spawn testing allows you to verify that actors correctly create processes on remote nodes in a distributed system. The testing library captures RemoteSpawnEvent operations and provides fluent assertions for validation.
Why Test Remote Spawning:
Distribution Logic: Ensure actors spawn processes on the correct remote nodes
Load Distribution: Verify round-robin or other distribution strategies work correctly
Error Handling: Test behavior when remote nodes are unavailable
Example Actor:
Test Implementation:
Advanced Remote Spawn Patterns:
Multi-Node Distribution: Test round-robin or other distribution strategies across multiple nodes
Error Scenarios: Verify proper error handling when nodes are unavailable
Event Inspection: Direct inspection of RemoteSpawnEvent for detailed validation
Actor Termination Testing
ShouldTerminate() - Testing Actor Lifecycle Completion
Actor termination is a critical aspect of actor systems. Actors can terminate for various reasons: normal completion, explicit shutdown, or errors. The testing library provides comprehensive tools for validating termination behavior and ensuring proper cleanup.
Why Test Actor Termination:
Resource Cleanup: Ensure actors properly clean up resources when terminating
Error Propagation: Verify that errors are handled correctly and lead to appropriate termination
Graceful Shutdown: Test that actors respond correctly to shutdown signals
Termination Reasons:
gen.TerminateReasonNormal - Normal completion of actor work
Custom errors - Abnormal termination due to specific errors
Example Actor:
Test Implementation:
Advanced Termination Patterns:
Exit Signal Testing
ShouldSendExit() - Testing Graceful Process Termination
Exit signals (SendExit and SendExitMeta) are used to gracefully terminate other processes. This is different from actor self-termination - it's about one actor telling another to exit. The testing library provides comprehensive assertions for validating exit signal behavior.
Why Test Exit Signals:
Graceful Shutdown: Ensure supervisors can properly terminate child processes
Resource Cleanup: Verify that exit signals trigger proper cleanup in target processes
Error Propagation: Test that failure conditions are communicated via exit signals
Cron Job Testing
Cron job testing allows you to validate scheduled operations in your actors without waiting for real time to pass. The testing library provides comprehensive mock time support and detailed cron job lifecycle management.
Why Test Cron Jobs:
Schedule Validation: Ensure cron expressions are correct and jobs run at expected times
Job Management: Test job addition, removal, enabling, and disabling operations
Execution Logic: Verify that scheduled operations perform correctly when triggered
Cron Testing Features:
Mock Time Support: Control time flow for deterministic testing
Job Lifecycle Testing: Validate job creation, scheduling, execution, and cleanup
Event Tracking: Monitor all cron-related operations and state changes
Example Actor:
Test Implementation:
Cron Testing Methods
Job Lifecycle Assertions:
Mock Time Control:
Advanced Cron Patterns:
Built-in Assertions
The library includes a comprehensive set of zero-dependency assertion functions that cover common testing scenarios without requiring external testing frameworks:
Why Built-in Assertions:
Zero Dependencies: Avoid version conflicts and complex dependency management
Consistent Interface: All assertions follow the same pattern and error reporting
Testing Framework Agnostic: Works with any Go testing approach
Advanced Features
Dynamic Value Capture - Testing Generated Content
Real-world actors frequently generate dynamic values like timestamps, UUIDs, session IDs, or auto-incrementing counters. Traditional testing approaches struggle with these values because they're unpredictable. The library provides sophisticated capture mechanisms to handle these scenarios elegantly.
The Challenge of Dynamic Values:
Timestamps: Created at runtime, impossible to predict exact values
UUIDs: Randomly generated, different in every test run
Auto-incrementing IDs: Dependent on execution order and system state
The Solution - Value Capture:
Capture Strategies:
Immediate Capture: Capture values as soon as they're generated
Pattern Matching: Use validation functions to identify and validate dynamic content
Structured Matching: Validate message structure while ignoring specific dynamic fields
Event Inspection - Deep System Analysis
For complex testing scenarios or debugging difficult issues, the library provides direct access to the complete event timeline. This allows you to perform sophisticated analysis of actor behavior beyond what's possible with standard assertions.
Events() - Complete Event History
Access all captured events for detailed analysis:
LastEvent() - Most Recent Operation
Get the most recently captured event:
ClearEvents() - Reset Event History
Clear all captured events, useful for isolating test phases:
Event Inspection Use Cases:
Performance Analysis: Count operations to identify performance bottlenecks
Workflow Validation: Ensure complex multi-step processes execute in the correct order
Error Investigation: Analyze the complete event sequence leading to failures
Timeout Support - Assertion Timing Control
The library provides timeout support for assertions that might need time-based validation:
Timeout Function Usage:
Assertion Wrapping: Wrap assertion functions to add timeout behavior
Integration Testing: Useful when testing with external systems that might have delays
Performance Validation: Ensure assertions complete within expected time limits
Testing Patterns and Best Practices
Test Organization Strategies
Single Responsibility Testing: Each test should focus on one specific behavior or scenario. This makes tests easier to understand, debug, and maintain.
State Isolation: Each test should start with a clean state and not depend on other tests. Use actor.ClearEvents() when needed to reset event history between test phases.
Error Path Testing: Don't just test the happy path. Actor systems need robust error handling, so test failure scenarios thoroughly:
Message Design for Testability
Structured Messages: Design your messages to be easily testable by using structured types rather than primitive values:
Predictable vs Dynamic Content: Separate predictable content from dynamic content in your messages to make testing easier:
Performance Testing Considerations
Event Overhead: While event capture is lightweight, be aware that every operation creates events. For performance-critical tests, you can:
Clear events periodically with ClearEvents()
Focus assertions on specific time windows
Use event inspection to identify performance bottlenecks
Scaling Testing: Test how your actors behave under load by simulating multiple concurrent operations:
Best Practices
Use descriptive test names that clearly indicate what behavior is being tested
Test all message types your actor handles, including edge cases
Capture dynamic values early using the Capture() method for generated IDs
This testing library provides comprehensive coverage for all Ergo Framework actor patterns while maintaining zero external dependencies and excellent readability. By following these patterns and practices, you can build robust, well-tested actor systems that behave correctly in both simple and complex scenarios.
Complete Examples and Use Cases
The library includes comprehensive test examples organized into feature-specific files that demonstrate all capabilities through real-world scenarios:
Feature-Based Test Files
basic_test.go - Fundamental Actor Testing
Basic actor functionality and message handling
Dynamic value capture and validation
Built-in assertions and event tracking
network_test.go - Distributed System Testing
Remote node simulation and connectivity
Network configuration and route management
Remote spawn operations and event capture
workflow_test.go - Complex Business Logic
Multi-step order processing workflows
State machine validation and transitions
Business process orchestration
call_test.go - Synchronous Communication
Call operations and response handling
Async call patterns and timeouts
Send/response communication flows
cron_test.go - Scheduled Operations
Cron job lifecycle management
Mock time control and schedule validation
Job execution tracking and assertions
termination_test.go - Actor Lifecycle Management
Actor termination handling and cleanup
Exit signal testing (SendExit/SendExitMeta)
Normal vs abnormal termination scenarios
Comprehensive Test Examples
Complex State Machine Testing (workflow_test.go)
Multi-step order processing workflow
Validation, payment, and fulfillment pipeline
Getting Started with Examples
Learning Path
Start with Basic Examples: basic_test.go - Core functionality and patterns
Explore Message Testing: basic_test.go - Message flow and assertions
Learn Process Management: basic_test.go - Spawn operations and lifecycle
Each test file provides complete, working implementations of specific actor patterns and demonstrates best practices for testing each scenario. All tests include comprehensive comments explaining the testing strategy and validation approach.
Complex Message Patterns
As your actors become more sophisticated, your message testing needs to handle more complex scenarios:
Testing Message Sequences
Testing Conditional Logic
Basic Process Spawning
Many actors need to create child processes. Here's how to test this:
Capturing Dynamic Process IDs
When actors spawn processes, you often need to use the generated PID in subsequent tests:
Event Inspection for Debugging
When tests fail, you need to understand what actually happened:
Failure Injection Testing
Overview
The Ergo Unit Testing Library includes a failure injection system that allows you to test how your actors handle various error conditions. This is essential for building robust actor systems that can gracefully handle failures in production.
Method Failure Injection
Access failure injection through the actor's Process() method:
Available Failure Methods
The failure injection system provides several methods on TestProcess:
Common Use Cases
Testing Spawn Failures
Testing Message Send Failures
Testing Intermittent Failures
Testing Pattern-Based Failures
Testing One-Time Failures
Advanced Testing Scenarios
Testing Supervisor Restart Strategies
Testing Method Call Tracking
Best Practices
Clear Events Between Test Phases: Use ClearEvents() when transitioning between test phases to avoid assertion confusion.
Test Recovery: Always test that your actors can recover after failures are cleared or when using one-time failures.
Verify Call Counts: Use GetMethodCallCount() to ensure methods are called the expected number of times.
Common Pitfalls
Event Accumulation: Events accumulate across multiple operations. Use ClearEvents() to reset between test phases.
Timing Issues: Some assertions may need time to complete. Use appropriate timeouts and consider async patterns.
Message Ordering: In high-throughput scenarios, message ordering might not be guaranteed. Test for this explicitly.
Conclusion
The Ergo Framework unit testing library provides comprehensive tools for testing actor-based systems. From simple message exchanges to complex distributed workflows, you can validate every aspect of your actor behavior with confidence.
Key Takeaways:
Start Simple: Begin with basic message testing and gradually add complexity
Test Comprehensively: Cover happy paths, error conditions, and edge cases
Use Fluent Assertions: Take advantage of the readable assertion API
The library's zero-dependency design, comprehensive feature set, and integration with Go's testing framework make it the ideal choice for building robust, well-tested actor systems with the Ergo Framework.
Next Steps:
Explore the complete test examples in the framework repository
Start with simple actors and gradually build complexity
Integrate testing into your development workflow
Happy testing!
Basic Logging Testing - Verify your actors provide good observability
Event Inspection - Debug and analyze actor behavior
Network & Distribution - Test multi-node actor systems
They communicate only via messages - no direct access to internal state
They can fail and restart - requiring lifecycle testing
Fast execution - No overhead from external libraries
You verify the captured event - Your assertion
TerminateEvent - When your actor shuts down
Precise - Specify exactly what matters for each test
When your actor terminates
Sensitive operations are properly audited
Debug - Test that detailed troubleshooting info is available
Configurable: Fine-tune the testing environment to match your needs
Tests Message Content: Validates that message payloads contain correct data
Negative Testing: Verify that certain messages are NOT sent in specific scenarios
Supervision Trees: Validate that supervisors manage their children appropriately
Resource Allocation: Test dynamic assignment of resources to processes
Resource Management: Validate that remote spawning respects capacity limits
Negative Assertions: Ensure remote spawns don't happen under certain conditions
Supervision Trees: Validate that supervisors handle child termination appropriately
Supervision Trees: Validate that supervisors manage process lifecycles correctly
Time Control: Use mock time to test time-dependent behavior deterministically
Schedule Simulation: Test complex scheduling scenarios without real time delays
Actor-Specific: Designed specifically for the needs of actor testing
Process IDs: Assigned by the actor system, not controllable in tests
Cross-Reference Testing: Use captured values in multiple assertions to ensure consistency
Integration Testing: Verify that multiple actors interact correctly in complex scenarios
Test error conditions not just the happy path
Use pattern matching for complex message validation
Clear events between test phases when needed with ClearEvents()
Configure appropriate log levels for debugging vs production testing
Test temporal behaviors with timeout mechanisms
Validate distributed scenarios using network simulation
Organize tests by behavior rather than by implementation details
Core testing patterns and best practices
Multi-node interaction patterns
Error handling and recovery scenarios
Concurrent request management
Time-dependent behavior testing
Resource cleanup validation
State transition validation and error handling
Process Management (basic_test.go)
Dynamic worker spawning and management
Resource capacity limits and monitoring
Worker lifecycle (start, stop, restart)
Advanced Pattern Matching (basic_test.go)
Structure matching with partial validation
Dynamic value handling and field validation
Complex conditional message matching
Remote Spawn Testing (network_test.go)
Remote spawn operations on multiple nodes
Round-robin distribution testing
Error handling for unavailable nodes
Event inspection and workflow validation
Cron Job Management (cron_test.go)
Job scheduling and execution validation
Mock time control for deterministic testing
Schedule expression testing and validation
Actor Termination (termination_test.go)
Normal and abnormal termination scenarios
Exit signal testing and process cleanup
Termination reason validation
Post-termination behavior verification
Concurrent Operations (call_test.go)
Multi-client concurrent request handling
Resource contention and capacity management
Load testing and performance validation
Environment & Configuration (basic_test.go)
Environment variable management
Runtime configuration changes
Feature flag and conditional behavior testing
Master Synchronous Communication: call_test.go - Calls and responses
Study Complex Workflows: workflow_test.go - Business logic testing
Practice Network Testing: network_test.go - Distributed operations
Pattern Matching: Use pattern-based failures to test scenarios where only specific inputs should fail.
Combine with Supervision: Test how supervisors handle child failures by injecting spawn failures during restart attempts.
State Leakage: Each test should start with clean state. Don't rely on previous test state.
Failure Persistence: Remember that SetMethodFailure persists until cleared, while SetMethodFailureOnce only fails once.
Inspect Events: Use event inspection for debugging and understanding actor behavior
Organize Tests: Structure tests by behavior and keep them focused
Handle Async Patterns: Use appropriate timeouts and pattern matching for async operations
Use the debugging features when tests fail
Share testing patterns with your team
// Traditional testing - call function, check result
result := calculateTax(income, rate)
assert.Equal(t, 1500.0, result)
// This doesn't work with actors:
actor.SendMessage("process_order")
result := actor.GetResult() // No direct way to get result
// An actor might send multiple messages to different targets:
actor.SendMessage("start_workflow")
// How do you verify it sent the right messages to the right places?
// Actors spawn other actors with generated IDs:
actor.SendMessage("create_worker")
// How do you test the spawned worker when you don't know its PID?
// Actor behavior changes based on message history:
actor.SendMessage("login", user1)
actor.SendMessage("login", user2)
actor.SendMessage("get_users")
// How do you verify the internal state without breaking encapsulation?
actor.SendMessage("process_order")
// Library automatically captures:
// - What messages were sent
// - Which processes were spawned
// - What was logged
// - When the actor terminated
actor.SendMessage("create_session")
sessionResult := actor.ShouldSpawn().Once().Capture()
sessionPID := sessionResult.PID // Use the actual generated PID in further tests
actor.SendMessage("process_order")
result := actor.WaitForResult() // Actors don't work this way
actor.SendMessage("process_order")
// Verify the actor did what it should do:
actor.ShouldSend().To("database").Message(SaveOrder{...}).Assert()
actor.ShouldSend().To("inventory").Message(CheckStock{...}).Assert()
actor.ShouldLog().Level(Info).Containing("Processing order").Assert()
package main

import (
	"testing"

	"ergo.services/ergo/act"
	"ergo.services/ergo/gen"
	"ergo.services/ergo/testing/unit"
)

// EchoActor - receives a message and sends it back
type EchoActor struct {
	act.Actor
}

func (e *EchoActor) HandleMessage(from gen.PID, message any) error {
	// Simply echo the message back to sender
	e.Send(from, message)
	return nil
}

// Factory function to create the actor
func newEchoActor() gen.ProcessBehavior {
	return &EchoActor{}
}

func TestEchoActor_BasicBehavior(t *testing.T) {
	// 1. Create a test actor
	actor, err := unit.Spawn(t, newEchoActor)
	if err != nil {
		t.Fatal(err)
	}

	// 2. Create a sender PID (who is sending the message)
	sender := gen.PID{Node: "test", ID: 123}

	// 3. Send a message to the actor
	actor.SendMessage(sender, "hello world")

	// 4. Verify the actor sent the message back
	actor.ShouldSend().
		To(sender).             // Should send to the original sender
		Message("hello world"). // Should send back the same message
		Once().                 // Should happen exactly once
		Assert()                // Check that it actually happened
}
type CounterActor struct {
act.Actor
count int
}
func (c *CounterActor) HandleMessage(from gen.PID, message any) error {
switch message {
case "increment":
c.count++
c.Send(from, c.count)
case "get":
c.Send(from, c.count)
case "reset":
c.count = 0
c.Send(from, "reset complete")
}
return nil
}
func TestCounterActor_StatefulBehavior(t *testing.T) {
actor, _ := unit.Spawn(t, func() gen.ProcessBehavior { return &CounterActor{} })
client := gen.PID{Node: "test", ID: 456}
// Test incrementing
actor.SendMessage(client, "increment")
actor.ShouldSend().To(client).Message(1).Once().Assert()
actor.SendMessage(client, "increment")
actor.ShouldSend().To(client).Message(2).Once().Assert()
// Test getting current value
actor.SendMessage(client, "get")
actor.ShouldSend().To(client).Message(2).Once().Assert()
// Test reset
actor.SendMessage(client, "reset")
actor.ShouldSend().To(client).Message("reset complete").Once().Assert()
// Verify reset worked
actor.SendMessage(client, "get")
actor.ShouldSend().To(client).Message(0).Once().Assert()
}
func TestActorWithBuiltInAssertions(t *testing.T) {
actor, _ := unit.Spawn(t, newEchoActor)
// Use built-in assertions for simple checks
unit.NotNil(t, actor, "Actor should be created successfully")
unit.Equal(t, false, actor.IsTerminated(), "New actor should not be terminated")
// Combine with actor-specific assertions
actor.SendMessage(gen.PID{Node: "test", ID: 1}, "hello")
actor.ShouldSend().Message("hello").Once().Assert()
}
unit.Equal(t, expected, actual) // Values must be equal
unit.NotEqual(t, unexpected, actual) // Values must be different
unit.True(t, condition) // Condition must be true
unit.False(t, condition) // Condition must be false
unit.Nil(t, value) // Value must be nil
unit.NotNil(t, value) // Value must not be nil
unit.Contains(t, "hello world", "world") // String must contain substring
unit.IsType(t, "", actualValue) // Value must be of specific type
// This could cause version conflicts:
import "github.com/stretchr/testify/assert"
import "github.com/other/testing/lib"
// This always works:
import "ergo.services/ergo/testing/unit"
func TestGreeter_LogsWelcomeMessage(t *testing.T) {
actor, _ := unit.Spawn(t, newGreeter, unit.WithLogLevel(gen.LogLevelInfo))
actor.SendMessage(gen.PID{}, Welcome{Name: "Alice"})
// Verify the actor logged the welcome
actor.ShouldLog().
Level(gen.LogLevelInfo).
Containing("Welcome Alice").
Once().
Assert()
}
func TestDataProcessor_LogLevels(t *testing.T) {
actor, _ := unit.Spawn(t, newDataProcessor, unit.WithLogLevel(gen.LogLevelDebug))
actor.SendMessage(gen.PID{}, ProcessData{Data: "sample"})
// Should log at info level for important events
actor.ShouldLog().Level(gen.LogLevelInfo).Containing("Processing started").Once().Assert()
// Should log at debug level for detailed info
actor.ShouldLog().Level(gen.LogLevelDebug).Containing("Processing sample data").Once().Assert()
// Should never log at error level for normal operations
actor.ShouldLog().Level(gen.LogLevelError).Times(0).Assert()
}
// Good: Structured, predictable format
log.Info("User login: user=%s success=%t", userID, success)
// Poor: Hard to test reliably
log.Info("User " + userID + " tried to login and it " + result)
type messageCounter struct {
act.Actor
count int
}
func (m *messageCounter) Init(args ...any) error {
m.count = 0
m.Log().Info("Counter initialized")
return nil
}
func (m *messageCounter) HandleMessage(from gen.PID, message any) error {
switch message {
case "increment":
m.count++
m.Send("output", CountChanged{Count: m.count})
m.Log().Debug("Count incremented to %d", m.count)
return nil
case "get_count":
m.Send(from, CountResponse{Count: m.count})
return nil
case "reset":
m.count = 0
m.Send("output", CountReset{})
return nil
}
return nil
}
type CountChanged struct{ Count int }
type CountResponse struct{ Count int }
type CountReset struct{}
func factoryMessageCounter() gen.ProcessBehavior {
return &messageCounter{}
}
func TestMessageCounter_BasicUsage(t *testing.T) {
// Create test actor with configuration
actor, err := unit.Spawn(t, factoryMessageCounter,
unit.WithLogLevel(gen.LogLevelDebug),
unit.WithEnv(map[gen.Env]any{
"test_mode": true,
"timeout": 30,
}),
)
if err != nil {
t.Fatal(err)
}
// Test initialization
actor.ShouldLog().Level(gen.LogLevelInfo).Containing("Counter initialized").Once().Assert()
// Test message handling
actor.SendMessage(gen.PID{}, "increment")
actor.ShouldSend().To("output").Message(CountChanged{Count: 1}).Once().Assert()
actor.ShouldLog().Level(gen.LogLevelDebug).Containing("Count incremented to 1").Once().Assert()
// Test state query
actor.SendMessage(gen.PID{Node: "test", ID: 123}, "get_count")
actor.ShouldSend().To(gen.PID{Node: "test", ID: 123}).Message(CountResponse{Count: 1}).Once().Assert()
// Test reset
actor.SendMessage(gen.PID{}, "reset")
actor.ShouldSend().To("output").Message(CountReset{}).Once().Assert()
}
// Available options for unit.Spawn()
unit.WithLogLevel(gen.LogLevelDebug) // Set log level
unit.WithEnv(map[gen.Env]any{"key": "value"}) // Environment variables
unit.WithParent(gen.PID{Node: "parent", ID: 100}) // Parent process
unit.WithRegister(gen.Atom("registered_name")) // Register with name
unit.WithNodeName(gen.Atom("test_node@localhost")) // Node name
type notificationService struct {
act.Actor
subscribers []gen.PID
}
func (n *notificationService) HandleMessage(from gen.PID, message any) error {
switch msg := message.(type) {
case Subscribe:
n.subscribers = append(n.subscribers, msg.PID)
n.Send(msg.PID, SubscriptionConfirmed{})
return nil
case Broadcast:
for _, subscriber := range n.subscribers {
n.Send(subscriber, Notification{
ID: msg.ID,
Message: msg.Message,
Sender: from,
})
}
n.Send("analytics", BroadcastSent{
ID: msg.ID,
Subscribers: len(n.subscribers),
})
return nil
}
return nil
}
type Subscribe struct{ PID gen.PID }
type SubscriptionConfirmed struct{}
type Broadcast struct{ ID string; Message string }
type Notification struct{ ID, Message string; Sender gen.PID }
type BroadcastSent struct{ ID string; Subscribers int }
func factoryNotificationService() gen.ProcessBehavior { return &notificationService{} }
func TestNotificationService_MessageSending(t *testing.T) {
actor, _ := unit.Spawn(t, factoryNotificationService)
subscriber1 := gen.PID{Node: "test", ID: 101}
subscriber2 := gen.PID{Node: "test", ID: 102}
// Test subscription
actor.SendMessage(gen.PID{}, Subscribe{PID: subscriber1})
actor.SendMessage(gen.PID{}, Subscribe{PID: subscriber2})
// Verify subscription confirmations
actor.ShouldSend().To(subscriber1).Message(SubscriptionConfirmed{}).Once().Assert()
actor.ShouldSend().To(subscriber2).Message(SubscriptionConfirmed{}).Once().Assert()
// Test broadcast
broadcaster := gen.PID{Node: "test", ID: 200}
actor.SendMessage(broadcaster, Broadcast{ID: "msg-123", Message: "Hello World"})
// Verify notifications sent to all subscribers
actor.ShouldSend().To(subscriber1).MessageMatching(func(msg any) bool {
if notif, ok := msg.(Notification); ok {
return notif.ID == "msg-123" &&
notif.Message == "Hello World" &&
notif.Sender == broadcaster
}
return false
}).Once().Assert()
actor.ShouldSend().To(subscriber2).MessageMatching(func(msg any) bool {
if notif, ok := msg.(Notification); ok {
return notif.ID == "msg-123" && notif.Message == "Hello World"
}
return false
}).Once().Assert()
// Verify analytics
actor.ShouldSend().To("analytics").Message(BroadcastSent{
ID: "msg-123",
Subscribers: 2,
}).Once().Assert()
// Test multiple sends to same target
actor.SendMessage(broadcaster, Broadcast{ID: "msg-124", Message: "Second message"})
actor.ShouldSend().To("analytics").Times(2).Assert() // Total of 2 analytics messages
}
// Message type matching
actor.ShouldSend().MessageMatching(unit.IsTypeGeneric[CountChanged]()).Assert()
// Field-based matching
actor.ShouldSend().MessageMatching(unit.HasField("Count", unit.Equals(5))).Assert()
// Structure matching with custom field validation
actor.ShouldSend().MessageMatching(
unit.StructureMatching(Notification{}, map[string]unit.Matcher{
"ID": unit.Equals("msg-123"),
"Sender": unit.IsValidPID(),
}),
).Assert()
// Never sent verification
actor.ShouldNotSend().To("error_handler").Message("error").Assert()
type workerSupervisor struct {
act.Actor
workers map[string]gen.PID
maxWorkers int
}
func (w *workerSupervisor) Init(args ...any) error {
w.workers = make(map[string]gen.PID)
w.maxWorkers = 3
return nil
}
func (w *workerSupervisor) HandleMessage(from gen.PID, message any) error {
switch msg := message.(type) {
case StartWorker:
if len(w.workers) >= w.maxWorkers {
w.Send(from, WorkerError{Error: "max workers reached"})
return nil
}
// Spawn worker with dynamic name
workerPID, err := w.Spawn(factoryWorker, gen.ProcessOptions{}, msg.WorkerID)
if err != nil {
w.Send(from, WorkerError{Error: err.Error()})
return nil
}
w.workers[msg.WorkerID] = workerPID
w.Send(from, WorkerStarted{WorkerID: msg.WorkerID, PID: workerPID})
w.Send("monitor", SupervisorStatus{
ActiveWorkers: len(w.workers),
MaxWorkers: w.maxWorkers,
})
return nil
case StopWorker:
if pid, exists := w.workers[msg.WorkerID]; exists {
w.SendExit(pid, gen.TerminateReasonShutdown)
delete(w.workers, msg.WorkerID)
w.Send(from, WorkerStopped{WorkerID: msg.WorkerID})
}
return nil
case StopAllWorkers:
// Capture the count before the map is emptied
stopped := len(w.workers)
for workerID, pid := range w.workers {
w.SendExit(pid, gen.TerminateReasonShutdown)
delete(w.workers, workerID)
}
w.Send(from, AllWorkersStopped{Count: stopped})
return nil
}
return nil
}
type StartWorker struct{ WorkerID string }
type StopWorker struct{ WorkerID string }
type StopAllWorkers struct{}
type WorkerStarted struct{ WorkerID string; PID gen.PID }
type WorkerStopped struct{ WorkerID string }
type WorkerError struct{ Error string }
type AllWorkersStopped struct{ Count int }
type SupervisorStatus struct{ ActiveWorkers, MaxWorkers int }
func factoryWorker() gen.ProcessBehavior { return &worker{} }
func factoryWorkerSupervisor() gen.ProcessBehavior { return &workerSupervisor{} }
type worker struct{ act.Actor }
func (w *worker) HandleMessage(from gen.PID, message any) error { return nil }
func TestWorkerSupervisor_SpawnManagement(t *testing.T) {
actor, _ := unit.Spawn(t, factoryWorkerSupervisor)
client := gen.PID{Node: "test", ID: 999}
// Test worker spawning
actor.SendMessage(client, StartWorker{WorkerID: "worker-1"})
// Capture the spawn event to get the PID
spawnResult := actor.ShouldSpawn().Factory(factoryWorker).Once().Capture()
unit.NotNil(t, spawnResult)
// Verify worker started response
actor.ShouldSend().To(client).MessageMatching(func(msg any) bool {
if started, ok := msg.(WorkerStarted); ok {
return started.WorkerID == "worker-1" && started.PID == spawnResult.PID
}
return false
}).Once().Assert()
// Verify monitor notification
actor.ShouldSend().To("monitor").Message(SupervisorStatus{
ActiveWorkers: 1,
MaxWorkers: 3,
}).Once().Assert()
// Test multiple workers
actor.SendMessage(client, StartWorker{WorkerID: "worker-2"})
actor.SendMessage(client, StartWorker{WorkerID: "worker-3"})
// Should have spawned 3 workers total
actor.ShouldSpawn().Factory(factoryWorker).Times(3).Assert()
// Test max worker limit
actor.SendMessage(client, StartWorker{WorkerID: "worker-4"})
actor.ShouldSend().To(client).Message(WorkerError{Error: "max workers reached"}).Once().Assert()
// Should still only have 3 spawned workers
actor.ShouldSpawn().Factory(factoryWorker).Times(3).Assert()
// Test stopping a worker
actor.SendMessage(client, StopWorker{WorkerID: "worker-1"})
actor.ShouldSend().To(client).Message(WorkerStopped{WorkerID: "worker-1"}).Once().Assert()
}
func TestDynamicProcessCreation(t *testing.T) {
actor, _ := unit.Spawn(t, factoryTaskProcessor)
// Test dynamic process creation with captured PIDs
actor.SendMessage(gen.PID{}, CreateSessionWorker{UserID: "user123"})
// Capture the spawn to get dynamic PID
spawnResult := actor.ShouldSpawn().Once().Capture()
sessionPID := spawnResult.PID
// Verify session was registered with the dynamic PID
actor.ShouldSend().To("session_registry").MessageMatching(func(msg any) bool {
if reg, ok := msg.(SessionRegistered); ok {
return reg.UserID == "user123" && reg.SessionPID == sessionPID
}
return false
}).Once().Assert()
// Test sending work to the dynamic session
actor.SendMessage(gen.PID{}, SendToSession{
UserID: "user123",
Task: "process_data",
})
// Should route to the captured session PID
actor.ShouldSend().To(sessionPID).MessageMatching(func(msg any) bool {
if task, ok := msg.(SessionTask); ok {
return task.Task == "process_data"
}
return false
}).Once().Assert()
}
// Required message types for this example:
type CreateSessionWorker struct{ UserID string }
type SessionRegistered struct{ UserID string; SessionPID gen.PID }
type SendToSession struct{ UserID, Task string }
type SessionTask struct{ Task string }
// factoryTaskProcessor() gen.ProcessBehavior function would be defined separately
type distributedCoordinator struct {
act.Actor
nodeAvailability map[gen.Atom]bool
roundRobin int
}
func (dc *distributedCoordinator) HandleMessage(from gen.PID, message any) error {
switch msg := message.(type) {
case SpawnRemoteWorker:
if !dc.isNodeAvailable(msg.NodeName) {
dc.Send(from, RemoteSpawnError{
NodeName: msg.NodeName,
Error: "node not available",
})
return nil
}
// Use RemoteSpawn which generates RemoteSpawnEvent
pid, err := dc.RemoteSpawn(msg.NodeName, msg.WorkerName, gen.ProcessOptions{}, msg.Config)
if err != nil {
dc.Send(from, RemoteSpawnError{NodeName: msg.NodeName, Error: err.Error()})
return nil
}
dc.Send(from, RemoteWorkerSpawned{
NodeName: msg.NodeName,
WorkerName: msg.WorkerName,
PID: pid,
})
return nil
case SpawnRemoteService:
// Use RemoteSpawnRegister which generates RemoteSpawnEvent with registration
pid, err := dc.RemoteSpawnRegister(msg.NodeName, msg.ServiceName, msg.RegisterName, gen.ProcessOptions{})
if err != nil {
dc.Send(from, RemoteSpawnError{NodeName: msg.NodeName, Error: err.Error()})
return nil
}
dc.Send(from, RemoteServiceSpawned{
NodeName: msg.NodeName,
ServiceName: msg.ServiceName,
RegisterName: msg.RegisterName,
PID: pid,
})
return nil
}
return nil
}
type SpawnRemoteWorker struct{ NodeName, WorkerName gen.Atom; Config map[string]any }
type SpawnRemoteService struct{ NodeName, ServiceName, RegisterName gen.Atom }
type RemoteWorkerSpawned struct{ NodeName, WorkerName gen.Atom; PID gen.PID }
type RemoteServiceSpawned struct{ NodeName, ServiceName, RegisterName gen.Atom; PID gen.PID }
type RemoteSpawnError struct{ NodeName gen.Atom; Error string }
// Availability is seeded for this example; real code would track node status dynamically
func (dc *distributedCoordinator) isNodeAvailable(name gen.Atom) bool {
return dc.nodeAvailability[name]
}
func factoryDistributedCoordinator() gen.ProcessBehavior {
return &distributedCoordinator{nodeAvailability: map[gen.Atom]bool{"worker@node1": true}}
}
func TestDistributedCoordinator_RemoteSpawn(t *testing.T) {
actor, _ := unit.Spawn(t, factoryDistributedCoordinator)
// Setup remote nodes for testing
actor.CreateRemoteNode("worker@node1", true) // Available
actor.CreateRemoteNode("worker@node2", false) // Unavailable
clientPID := gen.PID{Node: "test", ID: 100}
actor.ClearEvents() // Clear initialization events
// Test basic remote spawn
actor.SendMessage(clientPID, SpawnRemoteWorker{
NodeName: "worker@node1",
WorkerName: "data-processor",
Config: map[string]any{"timeout": 30},
})
// Verify remote spawn event
actor.ShouldRemoteSpawn().
ToNode("worker@node1").
WithName("data-processor").
Once().
Assert()
// Test remote spawn with registration
actor.SendMessage(clientPID, SpawnRemoteService{
NodeName: "worker@node1",
ServiceName: "user-service",
RegisterName: "users",
})
// Verify remote spawn with register
actor.ShouldRemoteSpawn().
ToNode("worker@node1").
WithName("user-service").
WithRegister("users").
Once().
Assert()
// Test total remote spawns
actor.ShouldRemoteSpawn().Times(2).Assert()
// Test negative assertion - should not spawn on unavailable node
actor.SendMessage(clientPID, SpawnRemoteWorker{
NodeName: "worker@node2",
WorkerName: "test-worker",
})
actor.ShouldNotRemoteSpawn().ToNode("worker@node2").Assert()
}
type connectionManager struct {
act.Actor
connections map[string]*Connection
maxRetries int
}
func (c *connectionManager) Init(args ...any) error {
c.connections = make(map[string]*Connection)
c.maxRetries = 3
c.Log().Info("Connection manager started")
return nil
}
func (c *connectionManager) HandleMessage(from gen.PID, message any) error {
switch msg := message.(type) {
case CreateConnection:
conn := &Connection{ID: msg.ID, Status: "active"}
c.connections[msg.ID] = conn
c.Send(from, ConnectionCreated{ID: msg.ID})
c.Log().Info("Created connection %s", msg.ID)
return nil
case CloseConnection:
if conn, exists := c.connections[msg.ID]; exists {
conn.Close()
delete(c.connections, msg.ID)
c.Send(from, ConnectionClosed{ID: msg.ID})
c.Log().Info("Closed connection %s", msg.ID)
}
return nil
case ConnectionError:
c.Log().Error("Connection error for %s: %s", msg.ID, msg.Error)
msg.RetryCount++
if msg.RetryCount >= c.maxRetries {
c.Log().Error("Max retries exceeded for connection %s", msg.ID)
return fmt.Errorf("connection failed after %d retries: %s", c.maxRetries, msg.Error)
}
// Retry the connection
c.Send(c.PID(), CreateConnection{ID: msg.ID})
return nil
case string:
// String commands are values, not types, so they need a nested value switch
switch msg {
case "shutdown":
// Graceful shutdown - close all connections
for id, conn := range c.connections {
conn.Close()
c.Log().Info("Shutdown: closed connection %s", id)
}
c.Send("monitor", ShutdownComplete{ConnectionsClosed: len(c.connections)})
return gen.TerminateReasonShutdown
case "force_error":
// Simulate critical error
return fmt.Errorf("critical system error: database unavailable")
}
}
return nil
}
type CreateConnection struct{ ID string }
type CloseConnection struct{ ID string }
type ConnectionCreated struct{ ID string }
type ConnectionClosed struct{ ID string }
type ConnectionError struct{ ID, Error string; RetryCount int }
type ShutdownComplete struct{ ConnectionsClosed int }
type Connection struct {
ID string
Status string
}
func (c *Connection) Close() { c.Status = "closed" }
func factoryConnectionManager() gen.ProcessBehavior {
return &connectionManager{}
}
func TestConnectionManager_TerminationHandling(t *testing.T) {
actor, _ := unit.Spawn(t, factoryConnectionManager)
client := gen.PID{Node: "test", ID: 100}
// Test normal operation first
actor.SendMessage(client, CreateConnection{ID: "conn-1"})
actor.ShouldSend().To(client).Message(ConnectionCreated{ID: "conn-1"}).Once().Assert()
// Verify actor is not terminated during normal operation
unit.Equal(t, false, actor.IsTerminated())
unit.Nil(t, actor.TerminationReason())
// Test graceful shutdown
actor.SendMessage(client, "shutdown")
// Verify shutdown message sent
actor.ShouldSend().To("monitor").MessageMatching(func(msg any) bool {
if shutdown, ok := msg.(ShutdownComplete); ok {
return shutdown.ConnectionsClosed == 1
}
return false
}).Once().Assert()
// Verify graceful termination
unit.Equal(t, true, actor.IsTerminated())
unit.Equal(t, gen.TerminateReasonShutdown, actor.TerminationReason())
// Verify termination event was captured
actor.ShouldTerminate().
WithReason(gen.TerminateReasonShutdown).
Once().
Assert()
}
func TestConnectionManager_ErrorTermination(t *testing.T) {
actor, _ := unit.Spawn(t, factoryConnectionManager)
// Test abnormal termination due to critical error
actor.SendMessage(gen.PID{}, "force_error")
// Verify actor terminated with error
unit.Equal(t, true, actor.IsTerminated())
unit.NotNil(t, actor.TerminationReason())
unit.Contains(t, actor.TerminationReason().Error(), "critical system error")
// Verify termination event with specific error
actor.ShouldTerminate().
ReasonMatching(func(reason error) bool {
return strings.Contains(reason.Error(), "database unavailable")
}).
Once().
Assert()
}
func TestConnectionManager_RetryBeforeTermination(t *testing.T) {
actor, _ := unit.Spawn(t, factoryConnectionManager)
// Test retry logic before termination
actor.SendMessage(gen.PID{}, CreateConnection{ID: "conn-retry"})
actor.ClearEvents() // Clear creation events
// Send connection errors that should trigger retries
for i := 0; i < 2; i++ {
actor.SendMessage(gen.PID{}, ConnectionError{
ID: "conn-retry",
Error: "network timeout",
RetryCount: i,
})
// Should not terminate yet
unit.Equal(t, false, actor.IsTerminated())
// Should retry by sending CreateConnection
actor.ShouldSend().To(actor.PID()).MessageMatching(func(msg any) bool {
if create, ok := msg.(CreateConnection); ok {
return create.ID == "conn-retry"
}
return false
}).Once().Assert()
}
// Final error that exceeds max retries
actor.SendMessage(gen.PID{}, ConnectionError{
ID: "conn-retry",
Error: "network timeout",
RetryCount: 3, // Exceeds maxRetries
})
// Now should terminate with error
unit.Equal(t, true, actor.IsTerminated())
unit.Contains(t, actor.TerminationReason().Error(), "connection failed after 3 retries")
// Verify termination assertion
actor.ShouldTerminate().
ReasonMatching(func(reason error) bool {
return strings.Contains(reason.Error(), "retries") &&
strings.Contains(reason.Error(), "network timeout")
}).
Once().
Assert()
}
func TestTerminatedActor_NoFurtherProcessing(t *testing.T) {
actor, _ := unit.Spawn(t, factoryConnectionManager)
// Terminate the actor
actor.SendMessage(gen.PID{}, "force_error")
unit.Equal(t, true, actor.IsTerminated())
actor.ClearEvents() // Clear termination events
// Try to send more messages - should not be processed
actor.SendMessage(gen.PID{}, CreateConnection{ID: "should-not-work"})
// Should not process the message (no CreateConnection response)
actor.ShouldNotSend().To(gen.PID{}).Message(ConnectionCreated{ID: "should-not-work"}).Assert()
// Should not create any new events
events := actor.Events()
unit.Equal(t, 0, len(events), "Terminated actor should not process messages")
}
#### Termination Testing Methods
**TestActor Termination Status:**
```go
// Check if actor is terminated
isTerminated := actor.IsTerminated() // bool
// Get termination reason (nil if not terminated)
reason := actor.TerminationReason() // error or nil
// Test that actor should terminate
actor.ShouldTerminate().Once().Assert()
// Test with specific reason
actor.ShouldTerminate().WithReason(gen.TerminateReasonShutdown).Assert()
// Test with reason matching
actor.ShouldTerminate().ReasonMatching(func(reason error) bool {
return strings.Contains(reason.Error(), "expected error")
}).Assert()
// Test that actor should NOT terminate
actor.ShouldNotTerminate().Assert()
// Test multiple termination attempts
actor.ShouldTerminate().Times(1).Assert() // Should terminate exactly once
// Capture termination for detailed analysis
terminationResult := actor.ShouldTerminate().Once().Capture()
unit.NotNil(t, terminationResult)
unit.Equal(t, expectedReason, terminationResult.Reason)
// Test termination with timeout
success := unit.WithTimeout(func() {
actor.SendMessage(gen.PID{}, "shutdown")
actor.ShouldTerminate().Once().Assert()
}, 5*time.Second)
unit.True(t, success(), "Actor should terminate within timeout")
type processSupervisor struct {
act.Actor
workers map[string]gen.PID
maxWorkers int
}
func (p *processSupervisor) Init(args ...any) error {
p.workers = make(map[string]gen.PID)
p.maxWorkers = 5
return nil
}
func (p *processSupervisor) HandleMessage(from gen.PID, message any) error {
switch msg := message.(type) {
case StartWorker:
if len(p.workers) >= p.maxWorkers {
p.Send(from, WorkerStartError{Error: "max workers reached"})
return nil
}
workerPID, err := p.Spawn(factoryWorkerProcess, gen.ProcessOptions{}, msg.WorkerID)
if err != nil {
p.Send(from, WorkerStartError{Error: err.Error()})
return nil
}
p.workers[msg.WorkerID] = workerPID
p.Send(from, WorkerStarted{WorkerID: msg.WorkerID, PID: workerPID})
return nil
case StopWorker:
if workerPID, exists := p.workers[msg.WorkerID]; exists {
// Send exit signal to worker
p.SendExit(workerPID, gen.TerminateReasonShutdown)
delete(p.workers, msg.WorkerID)
p.Send(from, WorkerStopped{WorkerID: msg.WorkerID})
p.Log().Info("Sent exit signal to worker %s", msg.WorkerID)
} else {
p.Send(from, WorkerStopError{WorkerID: msg.WorkerID, Error: "worker not found"})
}
return nil
case EmergencyShutdown:
// Send exit signals to all workers with error reason
shutdownReason := fmt.Errorf("emergency shutdown: %s", msg.Reason)
for workerID, workerPID := range p.workers {
p.SendExit(workerPID, shutdownReason)
p.Log().Warning("Emergency shutdown: sent exit to worker %s", workerID)
}
// Send meta exit signal to monitoring system
p.SendExitMeta(gen.PID{Node: "monitor", ID: 999}, shutdownReason)
p.Send(from, EmergencyShutdownComplete{
WorkersTerminated: len(p.workers),
Reason: msg.Reason,
})
p.workers = make(map[string]gen.PID) // Clear workers map
return nil
case TerminateWorkerWithError:
if workerPID, exists := p.workers[msg.WorkerID]; exists {
errorReason := fmt.Errorf("worker error: %s", msg.Error)
p.SendExit(workerPID, errorReason)
delete(p.workers, msg.WorkerID)
p.Send(from, WorkerTerminated{
WorkerID: msg.WorkerID,
Reason: msg.Error,
})
}
return nil
}
return nil
}
type StartWorker struct{ WorkerID string }
type StopWorker struct{ WorkerID string }
type EmergencyShutdown struct{ Reason string }
type TerminateWorkerWithError struct{ WorkerID, Error string }
type WorkerStarted struct{ WorkerID string; PID gen.PID }
type WorkerStopped struct{ WorkerID string }
type WorkerStartError struct{ Error string }
type WorkerStopError struct{ WorkerID, Error string }
type EmergencyShutdownComplete struct{ WorkersTerminated int; Reason string }
type WorkerTerminated struct{ WorkerID, Reason string }
type workerProcess struct{ act.Actor }
func (w *workerProcess) HandleMessage(from gen.PID, message any) error { return nil }
func factoryWorkerProcess() gen.ProcessBehavior { return &workerProcess{} }
func factoryProcessSupervisor() gen.ProcessBehavior { return &processSupervisor{} }
func TestProcessSupervisor_ExitSignals(t *testing.T) {
actor, _ := unit.Spawn(t, factoryProcessSupervisor)
client := gen.PID{Node: "test", ID: 100}
// Start some workers
actor.SendMessage(client, StartWorker{WorkerID: "worker-1"})
actor.SendMessage(client, StartWorker{WorkerID: "worker-2"})
// Capture worker PIDs for validation
spawn1 := actor.ShouldSpawn().Factory(factoryWorkerProcess).Once().Capture()
spawn2 := actor.ShouldSpawn().Factory(factoryWorkerProcess).Once().Capture()
worker1PID := spawn1.PID
worker2PID := spawn2.PID
actor.ClearEvents() // Clear spawn events
// Test graceful worker stop
actor.SendMessage(client, StopWorker{WorkerID: "worker-1"})
// Verify exit signal sent to worker
actor.ShouldSendExit().
To(worker1PID).
WithReason(gen.TerminateReasonShutdown).
Once().
Assert()
// Verify stop confirmation
actor.ShouldSend().To(client).Message(WorkerStopped{WorkerID: "worker-1"}).Once().Assert()
// Test worker termination with custom error
actor.SendMessage(client, TerminateWorkerWithError{
WorkerID: "worker-2",
Error: "memory leak detected",
})
// Verify exit signal with custom error reason
actor.ShouldSendExit().
To(worker2PID).
ReasonMatching(func(reason error) bool {
return strings.Contains(reason.Error(), "memory leak detected")
}).
Once().
Assert()
// Verify termination response
actor.ShouldSend().To(client).MessageMatching(func(msg any) bool {
if terminated, ok := msg.(WorkerTerminated); ok {
return terminated.WorkerID == "worker-2" &&
terminated.Reason == "memory leak detected"
}
return false
}).Once().Assert()
}
func TestProcessSupervisor_EmergencyShutdown(t *testing.T) {
actor, _ := unit.Spawn(t, factoryProcessSupervisor)
client := gen.PID{Node: "test", ID: 100}
// Start multiple workers
for i := 1; i <= 3; i++ {
actor.SendMessage(client, StartWorker{WorkerID: fmt.Sprintf("worker-%d", i)})
}
// Capture all worker PIDs
workers := make([]gen.PID, 3)
for i := 0; i < 3; i++ {
spawn := actor.ShouldSpawn().Factory(factoryWorkerProcess).Once().Capture()
workers[i] = spawn.PID
}
actor.ClearEvents() // Clear spawn events
// Trigger emergency shutdown
actor.SendMessage(client, EmergencyShutdown{Reason: "system overload"})
// Verify exit signals sent to all workers
for _, workerPID := range workers {
actor.ShouldSendExit().
To(workerPID).
ReasonMatching(func(reason error) bool {
return strings.Contains(reason.Error(), "emergency shutdown") &&
strings.Contains(reason.Error(), "system overload")
}).
Once().
Assert()
}
// Verify meta exit signal sent to monitoring
monitorPID := gen.PID{Node: "monitor", ID: 999}
actor.ShouldSendExitMeta().
To(monitorPID).
ReasonMatching(func(reason error) bool {
return strings.Contains(reason.Error(), "system overload")
}).
Once().
Assert()
// Verify shutdown completion message
actor.ShouldSend().To(client).MessageMatching(func(msg any) bool {
if complete, ok := msg.(EmergencyShutdownComplete); ok {
return complete.WorkersTerminated == 3 &&
complete.Reason == "system overload"
}
return false
}).Once().Assert()
// Verify total exit signals (3 workers + 1 meta)
actor.ShouldSendExit().Times(3).Assert()
actor.ShouldSendExitMeta().Times(1).Assert()
}
func TestExitSignal_NegativeAssertions(t *testing.T) {
actor, _ := unit.Spawn(t, factoryProcessSupervisor)
client := gen.PID{Node: "test", ID: 100}
// Try to stop non-existent worker
actor.SendMessage(client, StopWorker{WorkerID: "non-existent"})
// Should not send any exit signals
actor.ShouldNotSendExit().Assert()
actor.ShouldNotSendExitMeta().Assert()
// Should send error response instead
actor.ShouldSend().To(client).MessageMatching(func(msg any) bool {
if stopError, ok := msg.(WorkerStopError); ok {
return stopError.WorkerID == "non-existent" &&
stopError.Error == "worker not found"
}
return false
}).Once().Assert()
}
// Test that exit signal was sent
actor.ShouldSendExit().To(targetPID).Once().Assert()
// Test with specific reason
actor.ShouldSendExit().To(targetPID).WithReason(gen.TerminateReasonShutdown).Assert()
// Test with reason matching
actor.ShouldSendExit().ReasonMatching(func(reason error) bool {
return strings.Contains(reason.Error(), "expected error")
}).Assert()
// Test meta exit signals
actor.ShouldSendExitMeta().To(monitorPID).WithReason(errorReason).Assert()
// Negative assertions
actor.ShouldNotSendExit().To(targetPID).Assert()
actor.ShouldNotSendExitMeta().Assert()
// Test multiple exit signals
actor.ShouldSendExit().Times(3).Assert() // Should send exactly 3 exit signals
// Test exit signals to specific targets
actor.ShouldSendExit().To(worker1PID).Once().Assert()
actor.ShouldSendExit().To(worker2PID).Once().Assert()
// Capture exit signal for detailed analysis
exitResult := actor.ShouldSendExit().Once().Capture()
unit.NotNil(t, exitResult)
unit.Equal(t, expectedPID, exitResult.To)
unit.Equal(t, expectedReason, exitResult.Reason)
// Combined assertions
actor.ShouldSendExit().To(workerPID).WithReason(gen.TerminateReasonShutdown).Once().Assert()
actor.ShouldSendExitMeta().To(monitorPID).ReasonMatching(func(r error) bool {
return strings.Contains(r.Error(), "shutdown complete")
}).Once().Assert()
func TestTaskScheduler_CronJobs(t *testing.T) {
actor, _ := unit.Spawn(t, factoryTaskScheduler)
client := gen.PID{Node: "test", ID: 100}
// Test basic job scheduling
actor.SendMessage(client, ScheduleTask{
TaskID: "daily-backup",
Schedule: "0 2 * * *", // Daily at 2 AM
})
// Verify cron job was added
actor.ShouldAddCronJob().
WithSchedule("0 2 * * *").
Once().
Assert()
// Verify scheduling response
actor.ShouldSend().To(client).MessageMatching(func(msg any) bool {
if scheduled, ok := msg.(TaskScheduled); ok {
return scheduled.TaskID == "daily-backup" && scheduled.JobID != ""
}
return false
}).Once().Assert()
// Test job execution by triggering it
actor.TriggerCronJob("0 2 * * *") // Manually trigger the scheduled job
// Verify job execution
actor.ShouldExecuteCronJob().
WithSchedule("0 2 * * *").
Once().
Assert()
// Verify task execution message
actor.ShouldSend().To("output").MessageMatching(func(msg any) bool {
if executed, ok := msg.(TaskExecuted); ok {
return executed.TaskID == "daily-backup" && executed.Count == 1
}
return false
}).Once().Assert()
}
func TestTaskScheduler_MockTimeControl(t *testing.T) {
actor, _ := unit.Spawn(t, factoryTaskScheduler)
client := gen.PID{Node: "test", ID: 100}
// Set initial mock time
baseTime := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
actor.SetCronMockTime(baseTime)
// Schedule a job for every minute
actor.SendMessage(client, ScheduleTask{
TaskID: "minute-task",
Schedule: "* * * * *", // Every minute
})
cronJob := actor.ShouldAddCronJob().Once().Capture()
actor.ClearEvents()
// Advance time by 1 minute - should trigger the job
actor.SetCronMockTime(baseTime.Add(1 * time.Minute))
// Verify job executed
actor.ShouldExecuteCronJob().
WithJobID(cronJob.ID).
Once().
Assert()
// Advance time by another minute
actor.SetCronMockTime(baseTime.Add(2 * time.Minute))
// Should execute again
actor.ShouldExecuteCronJob().
WithJobID(cronJob.ID).
Times(2). // Total of 2 executions
Assert()
}
// Test that cron job was added
actor.ShouldAddCronJob().WithSchedule("0 2 * * *").Once().Assert()
// Test job execution
actor.ShouldExecuteCronJob().WithSchedule("0 * * * *").Times(3).Assert()
// Test job removal
actor.ShouldRemoveCronJob().WithJobID("job-123").Once().Assert()
// Test job enable/disable
actor.ShouldEnableCronJob().WithJobID("job-123").Once().Assert()
actor.ShouldDisableCronJob().WithJobID("job-123").Once().Assert()
// Negative assertions
actor.ShouldNotAddCronJob().Assert()
actor.ShouldNotExecuteCronJob().Assert()
// Set mock time for deterministic testing
baseTime := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
actor.SetCronMockTime(baseTime)
// Advance time to trigger scheduled jobs
actor.SetCronMockTime(baseTime.Add(1 * time.Hour))
// Manually trigger cron jobs for testing
actor.TriggerCronJob("0 * * * *") // Trigger hourly job
actor.TriggerCronJob("job-id-123") // Trigger by job ID
// Capture cron job for detailed analysis
cronJob := actor.ShouldAddCronJob().Once().Capture()
jobID := cronJob.ID
schedule := cronJob.Schedule
// Test multiple job executions with time control
for i := 0; i < 5; i++ {
actor.SetCronMockTime(baseTime.Add(time.Duration(i) * time.Minute))
actor.TriggerCronJob("* * * * *") // Every minute
}
actor.ShouldExecuteCronJob().Times(5).Assert()
func TestDynamicValues(t *testing.T) {
actor, _ := unit.Spawn(t, factorySessionManager)
// Send request that will generate dynamic session ID
actor.SendMessage(gen.PID{}, CreateSession{UserID: "user123"})
// Capture the spawn to get the dynamic session PID
spawnResult := actor.ShouldSpawn().Once().Capture()
sessionPID := spawnResult.PID
// Use captured PID in subsequent assertions
actor.ShouldSend().MessageMatching(func(msg any) bool {
if created, ok := msg.(SessionCreated); ok {
return created.SessionPID == sessionPID && created.UserID == "user123"
}
return false
}).Once().Assert()
}
func TestEventInspection(t *testing.T) {
actor, _ := unit.Spawn(t, factoryComplexActor)
// Perform operations
actor.SendMessage(gen.PID{}, ComplexOperation{})
// Get all events for inspection
events := actor.Events()
var sendCount, spawnCount, logCount, remoteSpawnCount int
for _, event := range events {
switch event.(type) {
case unit.SendEvent:
sendCount++
case unit.SpawnEvent:
spawnCount++
case unit.LogEvent:
logCount++
case unit.RemoteSpawnEvent:
remoteSpawnCount++
}
}
unit.True(t, sendCount > 0, "Should have send events")
unit.True(t, spawnCount == 2, "Should spawn exactly 2 processes")
unit.True(t, logCount >= 1, "Should have log events")
}
func TestLastEvent(t *testing.T) {
actor, _ := unit.Spawn(t, factoryExampleActor)
actor.SendMessage(gen.PID{}, "test")
// Get the most recent event
lastEvent := actor.LastEvent()
unit.NotNil(t, lastEvent, "Should have a last event")
unit.Equal(t, "send", lastEvent.Type())
if sendEvent, ok := lastEvent.(unit.SendEvent); ok {
unit.Equal(t, "test", sendEvent.Message)
}
}
func TestClearEvents(t *testing.T) {
actor, _ := unit.Spawn(t, factoryExampleActor)
// Perform some operations
actor.SendMessage(gen.PID{}, "setup")
actor.ShouldSend().Once().Assert()
// Clear events before main test
actor.ClearEvents()
// Now test the main functionality
actor.SendMessage(gen.PID{}, "main_operation")
// Only the main operation events are captured
events := actor.Events()
unit.Equal(t, 1, len(events), "Should only have main operation event")
}
import (
"testing"
"time"
"ergo.services/ergo/testing/unit"
)
func TestWithTimeout(t *testing.T) {
actor, _ := unit.Spawn(t, factoryExampleActor)
// Test that assertion completes within timeout
success := unit.WithTimeout(func() {
actor.SendMessage(gen.PID{}, "test")
actor.ShouldSend().Once().Assert()
}, 5*time.Second)
unit.True(t, success, "Assertion should complete within timeout")
}
// Good: Tests one specific behavior
func TestUserManager_CreateUser_Success(t *testing.T) { ... }
func TestUserManager_CreateUser_DuplicateEmail(t *testing.T) { ... }
func TestUserManager_CreateUser_InvalidData(t *testing.T) { ... }
// Poor: Tests multiple behaviors in one test
func TestUserManager_AllOperations(t *testing.T) { ... }
func TestWorkerSupervisor_MaxWorkersReached(t *testing.T) {
// Test that supervisor properly rejects requests when at capacity
// Test that appropriate error messages are sent
// Test that the supervisor remains functional after rejecting requests
}
// Good: Easy to test with pattern matching
type UserCreated struct {
UserID string
Email string
Created time.Time
}
// Poor: Hard to validate in tests
type GenericMessage struct {
Type string
Data map[string]interface{}
}
type OrderProcessed struct {
OrderID string // Predictable - can be set in test
Total float64 // Predictable - can be set in test
ProcessedAt time.Time // Dynamic - use pattern matching
RequestID string // Dynamic - capture and validate
}
import (
"fmt"
"testing"
"ergo.services/ergo/testing/unit"
)
func TestWorkerPool_ConcurrentRequests(t *testing.T) {
actor, _ := unit.Spawn(t, factoryWorkerPool)
// Send multiple requests concurrently
// Send a burst of requests; the test harness delivers them to the
// actor's mailbox, where they are processed one at a time
for i := 0; i < 100; i++ {
actor.SendMessage(gen.PID{}, ProcessRequest{ID: fmt.Sprintf("req-%d", i)})
}
// Verify all requests were processed
actor.ShouldSend().To("output").Times(100).Assert()
}
// Note: This example assumes you have defined:
// - type ProcessRequest struct{ ID string }
// - factoryWorkerPool() gen.ProcessBehavior function
// Import the testing library
import "ergo.services/ergo/testing/unit"
# Run all tests
go test -v ergo.services/ergo/testing/unit
# Run feature-specific tests
go test -v -run TestBasic ergo.services/ergo/testing/unit
go test -v -run TestNetwork ergo.services/ergo/testing/unit
go test -v -run TestWorkflow ergo.services/ergo/testing/unit
go test -v -run TestCall ergo.services/ergo/testing/unit
go test -v -run TestCron ergo.services/ergo/testing/unit
go test -v -run TestTermination ergo.services/ergo/testing/unit
func TestDatabaseActor_ConfigurationBehavior(t *testing.T) {
// Test with different configurations
// Development configuration
devActor, _ := unit.Spawn(t, newDatabaseActor,
unit.WithEnv(map[gen.Env]any{
"DB_POOL_SIZE": 5,
"LOG_QUERIES": true,
}))
devActor.SendMessage(gen.PID{}, ExecuteQuery{SQL: "SELECT * FROM users"})
devActor.ShouldLog().Level(gen.LogLevelDebug).Containing("SELECT * FROM users").Assert()
// Production configuration
prodActor, _ := unit.Spawn(t, newDatabaseActor,
unit.WithEnv(map[gen.Env]any{
"DB_POOL_SIZE": 50,
"LOG_QUERIES": false,
}))
prodActor.SendMessage(gen.PID{}, ExecuteQuery{SQL: "SELECT * FROM users"})
prodActor.ShouldLog().Level(gen.LogLevelDebug).Times(0).Assert() // No query logging in prod
}
func TestOrderProcessor_WorkflowSteps(t *testing.T) {
actor, _ := unit.Spawn(t, newOrderProcessor)
client := gen.PID{Node: "client", ID: 1}
// Start an order
actor.SendMessage(client, CreateOrder{Items: []string{"book", "pen"}})
// Should trigger a sequence of operations
actor.ShouldSend().To("inventory").Message("check_availability").Once().Assert()
actor.ShouldSend().To("payment").Message("calculate_total").Once().Assert()
actor.ShouldSend().To("shipping").Message("estimate_delivery").Once().Assert()
// Should send status back to client
actor.ShouldSend().To(client).MessageMatching(func(msg any) bool {
if status, ok := msg.(OrderStatus); ok {
return status.Status == "processing"
}
return false
}).Once().Assert()
}
func TestSecurityGate_AccessControl(t *testing.T) {
actor, _ := unit.Spawn(t, newSecurityGate)
// Test admin access
admin := gen.PID{Node: "admin", ID: 1}
actor.SendMessage(admin, AccessRequest{Resource: "admin_panel", User: "admin"})
actor.ShouldSend().To(admin).Message(AccessGranted{}).Once().Assert()
// Test regular user access to admin panel
user := gen.PID{Node: "user", ID: 2}
actor.SendMessage(user, AccessRequest{Resource: "admin_panel", User: "regular_user"})
actor.ShouldSend().To(user).Message(AccessDenied{Reason: "insufficient privileges"}).Once().Assert()
// Test regular user access to public resources
actor.SendMessage(user, AccessRequest{Resource: "public_content", User: "regular_user"})
actor.ShouldSend().To(user).Message(AccessGranted{}).Once().Assert()
}
func TestTaskManager_WorkerCreation(t *testing.T) {
actor, _ := unit.Spawn(t, newTaskManager)
client := gen.PID{Node: "client", ID: 1}
// Request a new worker
actor.SendMessage(client, CreateWorker{TaskType: "data_processing"})
// Should spawn a worker process
actor.ShouldSpawn().Once().Assert()
// Should confirm to client
actor.ShouldSend().To(client).MessageMatching(func(msg any) bool {
if response, ok := msg.(WorkerCreated); ok {
return response.TaskType == "data_processing"
}
return false
}).Once().Assert()
}
func TestSessionManager_UserSessions(t *testing.T) {
actor, _ := unit.Spawn(t, newSessionManager)
client := gen.PID{Node: "client", ID: 1}
// Create a session for a user
actor.SendMessage(client, CreateSession{UserID: "alice"})
// Capture the spawned session process
sessionSpawn := actor.ShouldSpawn().Once().Capture()
sessionPID := sessionSpawn.PID
// Verify session was registered
actor.ShouldSend().To(client).MessageMatching(func(msg any) bool {
if response, ok := msg.(SessionCreated); ok {
return response.UserID == "alice" && response.SessionPID == sessionPID
}
return false
}).Once().Assert()
// Send work to the session
actor.SendMessage(client, SendToSession{UserID: "alice", Data: "important_data"})
// Should route to the captured session PID
actor.ShouldSend().To(sessionPID).Message("important_data").Once().Assert()
}
func TestComplexActor_DebugFailures(t *testing.T) {
actor, _ := unit.Spawn(t, newComplexActor)
// Perform some operations
actor.SendMessage(gen.PID{}, TriggerComplexWorkflow{})
// If something goes wrong, inspect all events
events := actor.Events()
t.Logf("Total events captured: %d", len(events))
for i, event := range events {
t.Logf("Event %d: %s - %s", i, event.Type(), event.String())
}
// Clear events and test specific behavior
actor.ClearEvents()
actor.SendMessage(gen.PID{}, SimpleBehavior{})
// Now only simple behavior events are captured
simpleEvents := actor.Events()
unit.Equal(t, 1, len(simpleEvents), "Should only have one event after clearing")
}
func TestActorWithFailureInjection(t *testing.T) {
actor, err := unit.Spawn(t, factoryMyActor)
if err != nil {
t.Fatal(err)
}
// Inject failure for spawn operations
actor.Process().SetMethodFailure("Spawn", errors.New("resource limit exceeded"))
// Test how the actor handles spawn failures
actor.SendMessage(gen.PID{}, CreateWorker{WorkerType: "data_processor"})
// Verify the actor handles the failure gracefully
actor.ShouldSend().MessageMatching(func(msg any) bool {
if err, ok := msg.(WorkerCreationError); ok {
return strings.Contains(err.Error, "resource limit exceeded")
}
return false
}).Once().Assert()
}
// Fail every call to the method
actor.Process().SetMethodFailure("Send", errors.New("network error"))
// Fail only once
actor.Process().SetMethodFailureOnce("Spawn", errors.New("temporary failure"))
// Fail after N successful calls
actor.Process().SetMethodFailureAfter("Send", 3, errors.New("rate limit"))
// Fail when arguments match a pattern
actor.Process().SetMethodFailurePattern("RegisterName", "worker", errors.New("pattern match"))
// Clear specific failure
actor.Process().ClearMethodFailure("Send")
// Clear all failures
actor.Process().ClearMethodFailures()
// Get call count for a method
count := actor.Process().GetMethodCallCount("Spawn")
func TestProcessor_IntermittentFailures(t *testing.T) {
processor, _ := unit.Spawn(t, factoryDataProcessor)
// Fail after 2 successful operations
processor.Process().SetMethodFailureAfter("Send", 2, errors.New("network timeout"))
// First two sends succeed
processor.SendMessage(gen.PID{}, ProcessData{ID: "1"})
processor.SendMessage(gen.PID{}, ProcessData{ID: "2"})
processor.ShouldSend().Times(2).Assert()
// Third send fails
processor.SendMessage(gen.PID{}, ProcessData{ID: "3"})
processor.ShouldSend().MessageMatching(func(msg any) bool {
if err, ok := msg.(ProcessingError); ok {
return strings.Contains(err.Error, "network timeout")
}
return false
}).Once().Assert()
}
func TestRegistry_PatternFailures(t *testing.T) {
registry, _ := unit.Spawn(t, factoryRegistry)
// Fail registration for names containing "temp"
registry.Process().SetMethodFailurePattern("RegisterName", "temp", errors.New("temporary names not allowed"))
// Normal registration succeeds
registry.SendMessage(gen.PID{}, Register{Name: "service"})
registry.ShouldSend().Message(RegisterSuccess{Name: "service"}).Once().Assert()
// Temporary registration fails
registry.SendMessage(gen.PID{}, Register{Name: "temp_worker"})
registry.ShouldSend().MessageMatching(func(msg any) bool {
if err, ok := msg.(RegisterError); ok {
return strings.Contains(err.Error, "temporary names not allowed")
}
return false
}).Once().Assert()
}
func TestResilience_RecoveryFromFailure(t *testing.T) {
actor, _ := unit.Spawn(t, factoryResilientActor)
// Inject one-time failure
actor.Process().SetMethodFailureOnce("Send", errors.New("temporary network error"))
// First attempt fails
actor.SendMessage(gen.PID{}, SendData{Data: "attempt1"})
actor.ShouldSend().MessageMatching(func(msg any) bool {
if err, ok := msg.(SendError); ok {
return strings.Contains(err.Error, "temporary network error")
}
return false
}).Once().Assert()
// Second attempt succeeds (failure was one-time only)
actor.SendMessage(gen.PID{}, SendData{Data: "attempt2"})
actor.ShouldSend().Message(SendSuccess{Data: "attempt2"}).Once().Assert()
}
func TestSupervisor_RestartBehavior(t *testing.T) {
supervisor, _ := unit.Spawn(t, factoryOneForOneSupervisor)
// Start children
supervisor.SendMessage(gen.PID{}, StartChildren{Count: 3})
supervisor.ShouldSpawn().Times(3).Assert()
// Clear events before failure injection
supervisor.ClearEvents()
// Make child restarts fail after first success
supervisor.Process().SetMethodFailureAfter("Spawn", 1, errors.New("restart failed"))
// Simulate child failure requiring restart
supervisor.SendMessage(gen.PID{}, ChildFailed{ID: "child-2"})
// Verify supervisor attempts restart and handles failure
supervisor.ShouldSpawn().Once().Assert() // First restart attempt
supervisor.ShouldSend().MessageMatching(func(msg any) bool {
if status, ok := msg.(SupervisorStatus); ok {
return status.RestartsFailed == 1
}
return false
}).Once().Assert()
}
func TestRateLimiter_CallCounting(t *testing.T) {
limiter, _ := unit.Spawn(t, factoryRateLimiter)
// Send multiple requests
for i := 0; i < 5; i++ {
limiter.SendMessage(gen.PID{}, Request{ID: i})
}
// Check how many times Send was called
sendCount := limiter.Process().GetMethodCallCount("Send")
unit.Equal(t, 5, sendCount, "Should have called Send 5 times")
// Inject failure after checking count
limiter.Process().SetMethodFailure("Send", errors.New("rate limit exceeded"))
// Next request should fail
limiter.SendMessage(gen.PID{}, Request{ID: 6})
limiter.ShouldSend().MessageMatching(func(msg any) bool {
if err, ok := msg.(RateLimitError); ok {
return err.CallCount == 6 // Should include the failed attempt
}
return false
}).Once().Assert()
}