
Ergo Framework Documentation


Overview

Building reliable concurrent and distributed systems is hard. In Go, you might start with goroutines and channels. As the system grows, you add mutexes to protect shared state. Then you need to coordinate across multiple services, so you introduce message queues or RPC. Before long, you're managing synchronization primitives, handling partial failures, and debugging race conditions that only appear under load.

Ergo Framework offers a different foundation. Think of it as making goroutines addressable and message-passing-only, then extending that model across a cluster. Processes are like goroutines - lightweight, multiplexed onto OS threads - but isolated and communicating only through messages. Each process has an identifier that works whether the process is local or on a remote node. Sending a message looks the same either way.

The actor model isn't new. Erlang proved these patterns work for systems requiring massive concurrency and high reliability. Ergo brings them to Go: no external dependencies, familiar Go idioms, and performance that doesn't sacrifice correctness for speed.

Core Components

The framework consists of a few fundamental pieces that work together.

A node provides the runtime environment. It manages process lifecycles, routes messages, handles network connections, and provides services like logging and scheduled tasks. When you start a node, you get infrastructure. When you spawn a process, the node handles the mechanics.

Processes are lightweight actors. Each has a mailbox where messages queue up, priority-sorted into urgent, system, main, and log queues. The process handles messages one at a time in its own goroutine. When the mailbox empties, the goroutine sleeps. This makes processes efficient - you can have thousands without resource problems. It also makes them safe - sequential message handling means no race conditions within a process.

Supervision trees provide fault tolerance. Supervisors monitor worker processes. When a worker crashes, the supervisor restarts it according to a configured strategy. Supervisors can supervise other supervisors, creating a hierarchy. Failures are isolated to subtrees. The rest of the system continues running while the failed part recovers.

Meta processes solve a specific problem: integrating blocking I/O with the actor model. HTTP servers block waiting for requests. TCP servers block accepting connections. A meta process uses two goroutines - one runs your blocking code (like http.ListenAndServe), the other handles messages from other actors. This bridges synchronous APIs with asynchronous actor communication.

Network Transparency

The framework treats local and remote processes identically. Send a message to a process on the same node or a process on a remote node - the code is the same. The framework handles the difference.

When you send to a remote process, the node extracts the target node from the process identifier, discovers that node's address (through static routes or a registrar), establishes a connection if needed, encodes the message, and sends it. The remote node receives it, decodes it, and delivers it to the target process's mailbox. This happens automatically. Your code just sends a message.

This transparency extends to failure detection. Use the Important delivery flag and you get the same error semantics for remote processes as for local ones. Without it, a message to a missing remote process times out (was it slow or dead?). With it, you get immediate error notification (process doesn't exist), just like local delivery. The network becomes transparent not just for success cases but for failures too.

Nodes discover each other through a registrar. By default, each node runs a minimal registrar. Nodes on the same host find each other through localhost. For remote nodes, the framework queries the registrar on the remote host. For production clusters, configure an external registrar like etcd or Saturn for centralized discovery, cluster configuration, and application deployment tracking.

What This Enables

You write business logic using message passing between processes. The framework handles concurrency (processes run in parallel but each is sequential internally), fault tolerance (supervisors restart failures), and distribution (messages route automatically to remote processes). You're not writing code to manage connections, encode messages, or handle network failures explicitly. Those are solved problems handled by the framework.

Systems built this way have useful properties. They scale by adding nodes and distributing processes across them. The code doesn't change - deployment topology is operational configuration. They handle failures through supervision rather than defensive programming everywhere. They evolve through composition - add new process types, adjust supervision strategies, change message flows - without restructuring the foundation.

The development experience differs from typical microservices. No REST endpoints to define. No service discovery to configure (it's built in). No serialization libraries to manage (the framework handles it). No retry logic scattered throughout (supervision handles recovery). You model your domain as processes exchanging messages, and the framework provides the infrastructure.

Performance

Lock-free queues in process mailboxes avoid contention. Processes sleep when idle, consuming no CPU. Connection pooling uses multiple TCP connections per remote node for parallel delivery. These design choices add up to performance comparable to hand-written concurrent code, but without the complexity.

The real performance benefit is development velocity. You're not debugging race conditions or deadlocks. You're not coordinating distributed transactions. You're not managing connection pools or implementing retry logic. The framework handles those concerns, leaving you to focus on what your system does.

Benchmarks measuring message passing, network communication, and serialization performance are available at github.com/ergo-services/benchmarks.

Zero Dependencies

The framework uses only the Go standard library. No external dependencies means no version conflicts, no supply chain vulnerabilities, no surprise breaking changes from third-party packages. The requirement is just Go 1.20 or higher.

This isn't ideological purity. It's practical stability. The framework's behavior depends only on Go itself. Updates are predictable. Supply chain is simple. The code you write today will compile and run the same way years from now, assuming Go maintains backward compatibility (which it does).

For detailed explanations of these concepts, start with the Actor Model chapter and explore the Basics section. For API documentation, see the godoc comments in the source code.


Network Protocols

The Ergo Framework allows nodes to run with various network stacks. You can replace the default network stack or run alternative stacks alongside it. For more information, refer to the Network Stack section.

This library contains implementations of network stacks that are not part of the standard Ergo Framework library.

Loggers

An extra library of logger implementations that are not included in the standard Ergo Framework library. This library contains packages with a narrow specialization. It also includes packages that have external dependencies, as Ergo Framework adheres to a "zero dependency" policy.

Meta-Processes

An extra library of meta-process implementations not included in the standard Ergo Framework library. This library contains packages with a narrow specialization. It also includes packages with external dependencies, as Ergo Framework adheres to a "zero dependency" policy.

Applications

The additional application library for Ergo Framework contains packages with a narrow specialization or external dependencies since Ergo Framework adheres to the "zero dependencies" principle.

You can find the source code of these applications in the application library repository at https://github.com/ergo-services/application.

Registrars

An extra library of registrars or client implementations not included in the standard Ergo Framework library. This library contains packages with a narrow specialization. It also includes packages with external dependencies, as Ergo Framework follows a "zero dependency" policy.

Available Registrars

Saturn Client
A client library for the central registrar. Provides service discovery, configuration management, and real-time cluster event notifications through a centralized registrar service.

Features:

  • Centralized service discovery

  • Real-time event notifications

  • Configuration management

  • TLS security support

etcd Client

A client library for etcd, a distributed key-value store. Provides decentralized service discovery, hierarchical configuration management with type conversion, and automatic lease management.

Features:

  • Distributed service discovery

  • Hierarchical configuration with type conversion from strings ("int:123", "float:3.14")

  • Automatic lease management and cleanup

Choose Saturn for centralized management with a dedicated registrar service, or etcd for a distributed approach with built-in consensus and reliability guarantees.

Actors

An extra library of actor implementations not included in the standard Ergo Framework library. This library contains packages with a narrow specialization. It also includes packages with external dependencies, as Ergo Framework adheres to a "zero dependency" policy.

Leader

Distributed leader election actor implementing a Raft-inspired consensus algorithm. Provides coordination primitives for building systems that require single leader selection across a cluster.

Use cases: Task schedulers, resource managers, single-writer databases, distributed locks, cluster coordinators.

Metrics
Prometheus metrics exporter actor that automatically collects and exposes Ergo node and network telemetry via HTTP endpoint. Provides observability for monitoring cluster health and resource usage.

Use cases: Production monitoring, performance analysis, capacity planning, debugging distributed systems.

CertManager

TLS Certificate Management

Network communication in production systems needs encryption. TLS provides this, but managing TLS certificates introduces operational challenges. Certificates expire. Security incidents require rotation. Updating certificates traditionally means restarting services, causing downtime.

The naive approach loads certificates at startup from files. When you need to update a certificate, you replace the file and restart the service. For a single service, this works. For distributed systems with dozens of nodes and services, coordinating restarts for certificate updates becomes an operational burden.

Ergo Framework provides gen.CertManager for live certificate updates. Load a certificate at startup, and you can update it later without restarting. All components using that certificate manager - node acceptors, web servers, TCP servers - automatically use the updated certificate for new connections.

    Observer

    The Application Observer provides a convenient web interface to view node status, network activity, and running processes in a node built with Ergo Framework. Additionally, it allows you to inspect the internal state of processes or meta-processes. The application can also be used as a standalone Observer tool. For more details, see the section Inspecting With Observer. You can add the Observer application to your node during startup by including it in the node's startup options:

    import (
    	"ergo.services/ergo"
    	"ergo.services/application/observer"
    	"ergo.services/ergo/gen"
    )
    
    func main() {
    	opt := gen.NodeOptions{
    		Applications: []gen.ApplicationBehavior{
    			observer.CreateApp(observer.Options{}),
    		},
    	}
    	node, err := ergo.StartNode("example@localhost", opt)
    	if err != nil {
    		panic(err)
    	}
    	node.Wait()
    }

    The function observer.CreateApp takes observer.Options as an argument, allowing you to configure the Observer application. You can set:

    • Port: The port number for the web server (default: 9911 if not specified).

    • Host: The interface name (default: localhost).

    • LogLevel: The logging level for the Observer application (useful for debugging). The default is gen.LogLevelInfo.

    Creating a Certificate Manager

    Create a certificate manager with an initial certificate:

    For development or testing, generate a self-signed certificate:

    Note: Self-signed certificates require setting InsecureSkipVerify: true in network options to bypass certificate validation. This is acceptable for development but never use it in production.

    Pass the certificate manager to node options:

    The node's network stack uses this certificate manager for TLS connections. Acceptors use it for incoming connections. Outgoing connections use it for client certificates if needed.

    Updating Certificates

    Update the certificate while the node is running:

    The update takes effect immediately for new connections. Existing connections continue using the old certificate until they close. This allows graceful rotation - new connections get the new certificate, old connections finish naturally.

    Components using the certificate manager obtain certificates through GetCertificate or GetCertificateFunc. These methods return the current certificate, so updates automatically propagate to all users of the manager.

    Certificate Lifecycle

    The typical pattern involves periodic certificate renewal. A cron job or external process watches for approaching expiration. When renewal is needed, it obtains a new certificate (from Let's Encrypt, an internal CA, or however your infrastructure manages certificates) and calls Update on the certificate manager.

    The certificate manager is passive - it doesn't handle renewal itself. It provides the mechanism for live updates. Your renewal logic is external, allowing integration with whatever certificate provisioning system you use.

    This separation is intentional. Certificate renewal policies vary widely. Some organizations use Let's Encrypt with automated renewal. Others use internal CAs with manual processes. Some rotate certificates on a schedule, others only when necessary. The certificate manager doesn't impose policy - it just enables live updates however you choose to implement them.

    Mutual TLS

    For scenarios requiring client certificate authentication, use gen.CertAuthManager. It extends CertManager with CA pool management for verifying certificates on both sides of the connection. This enables mutual TLS (mTLS) where servers verify client certificates and clients verify server certificates.

    All settings support runtime updates, just like certificate rotation.

    For detailed configuration and examples, see Mutual TLS.

    For complete certificate manager methods and usage, refer to the gen.CertManager interface documentation in the code.

    The code snippets for the sections above, in order.

    Loading an initial certificate from PEM files and creating the manager:

    cert, err := tls.LoadX509KeyPair("cert.pem", "key.pem")
    if err != nil {
        panic(err)
    }

    certManager := gen.CreateCertManager(cert)

    Generating a self-signed certificate for development:

    cert, err := lib.GenerateSelfSignedCert("MyService v1.0")
    if err != nil {
        panic(err)
    }

    certManager := gen.CreateCertManager(cert)

    Passing the certificate manager to node options:

    options.CertManager = certManager
    node, err := ergo.StartNode("node@localhost", options)

    Updating the certificate while the node is running:

    newCert, err := tls.LoadX509KeyPair("new_cert.pem", "new_key.pem")
    if err != nil {
        return err
    }

    certManager.Update(newCert)

    Configuring mutual TLS with gen.CertAuthManager:

    certManager := gen.CreateCertAuthManager(cert)
    certManager.SetClientCAs(clientCAPool)    // server verifies clients
    certManager.SetRootCAs(serverCAPool)      // client verifies servers
    certManager.SetClientAuth(tls.RequireAndVerifyClientCert)

    Supervision Tree

    Process control and fault tolerance

    Building reliable systems means accepting an uncomfortable truth: failures will happen. Hardware fails. Networks partition. Bugs exist in code. The question isn't whether your processes will crash, but what happens when they do.

    The supervision tree model provides an answer. Instead of trying to prevent all failures, you structure your system so failures are expected, isolated, and automatically recovered from.

    The Supervision Principle

    The model divides processes into two distinct roles:

    Workers do the actual work. They handle requests, process data, manage state, and inevitably, sometimes crash when things go wrong.

    Supervisors watch over workers. Their only job is to start child processes and restart them when they fail. Supervisors don't do application work - they manage lifecycle.

    This separation is crucial. If workers handled their own restart logic, a bug in that logic would prevent recovery. By moving restart responsibility to a separate supervisor, you ensure that failures in workers can always be recovered.

    How Supervision Works

    A supervisor starts its children and monitors them. When a child crashes, the supervisor decides what to do based on its restart strategy. Should it restart just this one child? Restart all children? Restart all children in a specific order?

    The strategy depends on the relationships between children. If they're independent, restart just the failed one. If they depend on each other, restart all of them to ensure consistent state. If they have startup dependencies, restart in order.

    Supervisors can supervise other supervisors, forming a tree. At the top might be an application supervisor. Below it, supervisors for different subsystems. Below those, the actual workers. When a worker crashes, only its portion of the tree is affected. The rest of the system continues running.

    Fault Tolerance Through Isolation

    This tree structure creates fault isolation boundaries. A crashed database worker doesn't affect the HTTP handler workers. A failed cache process doesn't take down the authentication processes. Each supervision subtree handles its own failures without cascading them upward.

    The Erlang community calls this "let it crash." It sounds reckless, but it's actually disciplined. Instead of defensive programming trying to handle every possible error, you let processes fail and rely on supervisors to restart them in a clean state. Often, a fresh restart clears transient problems that would be difficult to handle explicitly.

    Supervision in Ergo Framework

    Ergo Framework implements supervision through the act.Supervisor actor. When you create a supervisor, you specify its children and restart strategy. The framework handles the monitoring and restart logic.

    Workers are typically act.Actor implementations - regular actors that do application work. Supervisors are act.Supervisor implementations - actors whose behavior is managing children.

    Because supervisors are also actors, they can be supervised. This is how you build the tree: supervisors supervising supervisors supervising workers, all the way down.

    The tree structure emerges from how you compose supervisors and workers. There's no special tree-building API. You just nest supervisors, and the tree forms naturally.

    Building Reliable Systems

    The supervision tree model leads to systems with interesting properties.

    Self-healing - Failures trigger automatic recovery. Most transient problems resolve themselves through restart.

    Graceful degradation - When a subsystem fails, only that part stops working. The rest continues serving requests.

    Operational simplicity - Instead of complex error handling throughout your code, you centralize recovery logic in supervisors.

    The trade-off is that you need to design processes that can restart cleanly. State that must survive restarts needs to be externalized - in databases, in other processes, or rebuilt from messages. But this discipline leads to more robust designs anyway.

    Where to Go From Here

    Understanding supervision requires seeing it in practice. The chapter on supervisors covers the specifics: restart strategies, child specifications, and practical patterns for structuring your application.

    The combination of the actor model (isolated processes, message passing) and supervision trees (automatic recovery) gives you the tools to build systems that handle failures gracefully. It's a different approach than traditional error handling, but one that scales well to distributed systems where failures are inevitable.

    Links And Monitors

    Linking and Monitoring Mechanisms

    Building reliable systems from independent processes requires solving a fundamental coordination problem. When a process terminates - whether from a crash, graceful shutdown, or network failure - other processes that depend on it or supervise it need to know. Without this knowledge, a supervisor can't restart failed workers, dependent processes continue attempting to use unavailable services, and the system degrades silently.

    The challenge is detecting termination without breaking isolation. Processes can't share memory or directly observe each other's state. The traditional approach in distributed systems uses heartbeats: processes periodically signal they're alive, and silence implies failure. But heartbeats introduce overhead, timing sensitivity, and the fundamental ambiguity of distinguishing "slow" from "dead."

    Ergo Framework provides a different mechanism. Processes explicitly declare relationships - links and monitors - and the framework delivers termination notifications through these channels. When a process terminates, the node automatically notifies all processes that established relationships with it. The notification is immediate, deterministic, and part of the normal message flow.

    Links and monitors both deliver termination notifications, but they differ in what happens next. A link couples your lifecycle to the target's - when it terminates, you terminate. A monitor simply informs you of termination, leaving the response up to you. The choice depends on whether you need failure propagation or just failure awareness.

    Saturn - Central Registrar

    Ergo Service Registry and Discovery

    Saturn is a tool designed to simplify the management of clusters of nodes created using the Ergo Framework. It offers the following features:

    • A unified registry for node registration within a cluster.

    • The ability to manage multiple clusters simultaneously.

    Actor Model

    The Actor Model and Its Properties

    The actor model is a computational approach to building concurrent systems, first proposed in the 1970s. At its core is a simple yet powerful idea: instead of having program components share memory and coordinate through locks, they communicate by sending messages to each other.

    The Fundamental Concept

    In the actor model, everything is an actor. An actor is an independent entity that has its own private state and processes incoming messages one at a time. Actors never directly access each other's state. Instead, they send messages and wait for responses if needed.

    This might seem like a constraint, but it's actually what makes the model powerful. By eliminating shared state, we eliminate entire classes of concurrency bugs that plague traditional multi-threaded programs.

    Links: Coupling Lifecycles

    Creating a link to another process declares a dependency. You're stating that your operation depends on the target's continued existence. When the target terminates, you receive an exit signal - a high-priority message that typically causes your termination as well.

    Exit signals arrive in the Urgent queue, bypassing normal message ordering. The default behavior is immediate termination when an exit signal arrives. This cascading failure makes sense in many scenarios. If a worker's connection to a critical service is gone, the worker has nothing useful to do and should terminate cleanly.

    But sometimes you want to handle exit signals explicitly. Actors can enable exit signal trapping through act.Actor. When trapping is enabled, exit signals are delivered as gen.MessageExit* messages to your HandleMessage callback. You can examine the signal, check the termination reason, and decide whether to terminate or attempt recovery.

    Each exit message type carries the termination reason in its Reason field. The reason tells you what happened: normal shutdown (gen.TerminateReasonNormal), abnormal crash, panic (gen.TerminateReasonPanic), forced kill (gen.TerminateReasonKill), or network failure (gen.ErrNoConnection). This context lets you make informed decisions about how to react.

    The framework provides linking methods for different identification schemes. LinkPID takes a process identifier and links to that specific process instance. When it terminates, you receive gen.MessageExitPID. LinkProcessID links to a registered name rather than a specific instance. If the process terminates or unregisters the name, you receive gen.MessageExitProcessID. LinkAlias works with process aliases - termination or alias deletion triggers gen.MessageExitAlias.

    You can also link to node connections with LinkNode. If the connection to the specified node is lost, you receive gen.MessageExitNode. This is useful for processes that can't operate when a particular remote node is unavailable.

    The generic Link method accepts any target type and dispatches to the appropriate typed method. Use it when the target type varies, or use the specific methods when you know the type.

    The Unidirectional Nature

    Links in Ergo are unidirectional, and this deserves emphasis because it differs from Erlang.

    When you execute process.LinkPID(target), you establish a relationship where target's termination affects you. The link points from you to the target. If the target terminates, you receive an exit signal. But if you terminate, the target is unaffected. The link doesn't point backward.

    Erlang's links are bidirectional. If process A links to process B in Erlang, either terminating causes the other to terminate. This symmetry can be useful, but it also creates unexpected cascading failures. In Ergo, if you want bidirectional coupling, you create two links: A links to B, and B links to A.

    The unidirectional design gives you precise control. Consider a shared service with multiple workers. Each worker links to the service (if the service dies, workers should too). But the service doesn't link back to workers (a worker crash shouldn't kill the service). Unidirectional links express this asymmetric dependency naturally.

    Monitors: Observation Without Coupling

    Monitors provide lifecycle awareness without lifecycle coupling. You track when something terminates, but you don't terminate yourself.

    The quintessential monitor use case is supervision. A supervisor monitors worker processes. When a worker terminates, the supervisor receives a down message. The message includes the worker's PID or identifier and the termination reason. The supervisor examines this information, consults its restart strategy, and decides whether to spawn a replacement. The supervisor continues running regardless of how many workers have crashed.

    Down messages arrive in the System queue with high priority (but lower than Urgent exit signals). Each gen.MessageDown* type includes a Reason field. For MonitorPID, you receive gen.MessageDownPID with the target's PID and reason. For MonitorProcessID, you receive gen.MessageDownProcessID with the registered name and reason. The reason might indicate normal termination, a crash, or a special case like name unregistration (gen.ErrUnregistered).

    Monitoring registered names or aliases handles invalidation gracefully. If you monitor a process by name and that process unregisters its name, you receive a down message with reason gen.ErrUnregistered. The process might still be running, but it's no longer accessible by that name, which is what you were monitoring. Same logic applies to alias deletion - you're notified that the thing you were monitoring is no longer valid.

    Node monitoring tracks connection health. MonitorNode sends you gen.MessageDownNode when the connection to a remote node is lost. The reason is gen.ErrNoConnection. This is useful for detecting network partitions or remote node crashes without linking (which would terminate your process).

    Network Transparency in Practice

    Links and monitors work across nodes without changing their semantics or your code.

    When you link or monitor a remote target, the framework sends a request to the remote node. The remote node records that your process is watching the target. This setup happens during your Link* or Monitor* call and involves a network round-trip. The operation can fail if the remote node is unreachable or the target doesn't exist - check the error return.

    Once established, the remote node tracks your subscription. When the target terminates on the remote node, the remote node sends a notification message back to your node. Your node routes it to your process's mailbox. From your perspective, it's just another message - you don't see the network mechanics.

    Network failures complicate this. If the connection to the remote node fails while your link or monitor is active, your local node detects the disconnection. It looks up which local processes had links or monitors to targets on that failed node. For links, it sends exit signals with reason gen.ErrNoConnection. For monitors, it sends down messages with the same reason.

    This unified handling means you write the same error handling code for local and remote targets. The notification mechanism is consistent. The reason field distinguishes between target termination and network failure, but the notification path is identical.

    Removing Links and Monitors

    Links and monitors aren't permanent. You can remove them explicitly or they're removed automatically when participants terminate.

    To remove a link, use the corresponding Unlink* method with the same target. UnlinkPID, UnlinkProcessID, UnlinkAlias, UnlinkNode each remove the link created by their Link* counterpart. If you never created the link, unlinking returns an error. For monitors, the Demonitor* methods work the same way.

    When the target terminates and you receive notification, the link or monitor is automatically removed. You receive one notification per relationship. If the target is later restarted (by a supervisor), you won't receive notification about that new instance unless you create a new link or monitor to it.

    When you terminate (the process that created the link or monitor), your relationships are cleaned up automatically. The target doesn't receive notification that you stopped watching. This asymmetry is intentional - the target doesn't track who's watching it, so it doesn't care when watchers go away.

    hashtag
    Practical Usage Patterns

    Several common patterns emerge from combining links and monitors.

    Workers often link to infrastructure processes they depend on. A worker processing HTTP requests might link to a database connection pool process. If the pool terminates (perhaps during a deployment), the worker receives an exit signal and terminates. The worker's supervisor detects the termination, waits a moment (hoping the database pool restarts), and spawns a new worker. The new worker links to the (now running) pool and resumes processing.

    Supervisors monitor their children. Each worker termination triggers a down message. The supervisor checks the reason. If it's gen.TerminateReasonNormal, the worker finished its task and doesn't need restart. If it's an error or panic, the supervisor spawns a replacement. The supervisor's continued operation despite worker failures is the whole point of the supervisor pattern.
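    As a sketch, a managing process might handle monitor notifications like this. The gen.MessageDownPID field names are assumptions based on this chapter's description, and spawnWorker is a hypothetical helper:

```go
func (s *Manager) HandleMessage(from gen.PID, message any) error {
	switch m := message.(type) {
	case gen.MessageDownPID:
		if m.Reason == gen.TerminateReasonNormal {
			return nil // worker finished its task; no replacement needed
		}
		// error or panic: spawn a replacement worker
		if err := s.spawnWorker(); err != nil {
			return err
		}
	}
	return nil
}
```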

    Load balancers monitor backend processes. Each backend termination updates the balancer's routing table. The balancer continues routing to available backends. When a backend restarts, it might need to register with the balancer, which would then monitor it again.

    Parent-child relationships often use LinkChild and LinkParent options in gen.ProcessOptions. These provide a convenient way to create links automatically after process initialization completes. You can also call Link methods directly during initialization if needed. If either participant terminates, the other receives an exit signal.
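    A sketch of spawning with automatic linking, using the option names from this chapter (createWorker is a hypothetical process factory):

```go
pid, err := parent.Spawn(createWorker, gen.ProcessOptions{
	LinkParent: true, // link the new child to its parent after init completes
})
if err != nil {
	return err
}
// From here on, if either side terminates, the other receives an exit signal.
```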

    hashtag
    The Difference That Matters

    Links propagate failure. Monitors report failure. Choose based on whether the watcher should terminate when the target terminates.

    If continued operation without the target is meaningless, use a link. If you can adapt to the target's absence (by finding a replacement, degrading gracefully, or restarting the target), use a monitor.

    The unidirectional nature of links matters more than you might initially think. It lets you express asymmetric dependencies precisely. Workers depend on services, but services don't depend on individual workers. Clients depend on servers, but servers don't depend on individual clients. Links point from the dependent to the dependency, making the relationship clear.

    For event-based publish/subscribe patterns using links and monitors, see the Events chapter. For supervision trees built on monitors, see Supervisor.

    hashtag
    What Makes an Actor

    An actor consists of three things:

    Private State - Data that belongs exclusively to this actor. No other actor can read or modify it directly.

    Behavior - The logic that determines how the actor responds to messages. This can change over time as the actor processes different messages.

    Mailbox - A queue where incoming messages wait to be processed. The actor pulls messages from this queue one at a time.

    When an actor receives a message, it can do three things: send messages to other actors, create new actors, or decide how to handle the next message. That's it. Simple, but sufficient to build complex systems.

    hashtag
    Why Sequential Processing Matters

    Each actor processes messages sequentially, one after another. This is not a limitation but a design choice that provides important guarantees.

    Consider what happens in traditional concurrent programming: multiple threads might access the same data simultaneously. To prevent corruption, you need locks. But locks introduce their own problems - deadlocks, contention, and complex reasoning about what state the data is in at any given moment.

    Actors sidestep this entirely. Since only one message is processed at a time, the actor's state can only be in one of a finite number of well-defined states. There are no race conditions because there's no race - only one thing happens at a time within an actor.

    hashtag
    Location Transparency

    One of the most powerful aspects of the actor model is location transparency. When you send a message to an actor, you don't need to know whether it's running in the same process, on the same machine, or halfway around the world. The semantics are the same.

    This makes distribution almost trivial. Code written for a single machine can scale to a distributed system without fundamental changes. The complexity of network communication is handled by the framework, not by your application logic.
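    For instance, sending to a local process and to a named process on another node uses the same call. This is a sketch: Ping and the node name are hypothetical, and the Send/gen.ProcessID shapes follow the framework's addressing model described here:

```go
process.Send(localPID, Ping{})                                          // same node
process.Send(gen.ProcessID{Name: "worker", Node: "app@remote"}, Ping{}) // remote node
```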

    hashtag
    Real-World Implementations

    The actor model isn't just theory. It powers real production systems handling massive scale.

    Erlang pioneered the practical application of the actor model. The language and its BEAM virtual machine have been running telecommunications systems since the 1980s. Systems that need to handle millions of concurrent connections with high reliability naturally gravitate toward Erlang's actor model implementation.

    Akka brought the actor model to the Java ecosystem. It's used in systems that need to process high-volume transactions, manage complex workflows, or handle real-time data streams. Companies building reactive systems often choose Akka for its proven scalability patterns.

    Orleans demonstrated that the actor model works well in cloud environments. Its virtual actor pattern, where actors are automatically created and destroyed based on demand, showed how the model adapts to modern distributed computing challenges.

    hashtag
    How This Applies to Go

    Go has goroutines and channels, which seem similar to actors and message passing. But there's a crucial difference: goroutines are not isolated. They can share memory, which means you still need locks and face the same concurrency challenges as traditional threading.

    Ergo Framework brings true actor model semantics to Go. Each process is an isolated actor. The framework enforces the constraint that actors don't share memory and communicate only through messages. This gives you the benefits of the actor model - no race conditions, simpler concurrent logic, natural distribution - while writing Go code.

    The single-goroutine-per-actor constraint might seem limiting at first. In practice, it's liberating. You write sequential code within each actor, and concurrency emerges naturally from having many actors processing messages in parallel.

    hashtag
    The Actor Mindset

    Working with the actor model requires a shift in thinking. Instead of thinking about shared data structures protected by locks, you think about independent entities sending messages to each other.

    A typical pattern: instead of having multiple threads access a shared cache, you have a cache actor. Want to read from the cache? Send it a message. Want to write? Send a different message. The cache actor processes these requests sequentially, so there's no possibility of corruption. No locks needed.
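    A sketch of such a cache actor. The callback shapes follow the framework's actor behavior; CachePut and CacheGet are hypothetical message types:

```go
type CachePut struct{ Key, Value string }
type CacheGet struct{ Key string }

type Cache struct {
	act.Actor
	data map[string]string // private state, never shared
}

func (c *Cache) Init(args ...any) error {
	c.data = make(map[string]string)
	return nil
}

// Writes arrive as asynchronous messages.
func (c *Cache) HandleMessage(from gen.PID, message any) error {
	if put, ok := message.(CachePut); ok {
		c.data[put.Key] = put.Value
	}
	return nil
}

// Reads arrive as synchronous calls. Sequential processing means
// no locks are needed around c.data.
func (c *Cache) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
	if get, ok := request.(CacheGet); ok {
		return c.data[get.Key], nil
	}
	return nil, nil
}
```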

    This pattern scales beautifully. Need more throughput? Add more cache actors, each handling a portion of the key space. Need fault tolerance? Supervise the cache actors, so they restart if they crash. Need distribution? Put cache actors on different machines. The code structure remains the same.

    hashtag
    Moving Forward

    The actor model offers a different way to think about concurrent programming. Rather than wrestling with locks and shared memory, you design systems as independent actors exchanging messages. The constraints of the model - sequential processing, message passing only, isolated state - eliminate the complexity that makes traditional concurrent programming difficult.

    Ergo Framework brings this programming model to Go. It enforces actor model principles while leveraging Go's strengths: lightweight goroutines, efficient scheduling, and a simple language. The result is a way to build concurrent and distributed systems that's both powerful and approachable.

    The following chapters explore how these concepts manifest in Ergo Framework's implementation. Process covers the lifecycle and capabilities of actors. Node explains how actors are managed and how they communicate across networks.

    Saturn

  • The capability to manage the configuration of the entire cluster without restarting the nodes connected to Saturn (configuration changes are applied on the fly).

  • Notifications to all cluster participants about changes in the status of applications running on nodes connected to Saturn.

  • The source code of the saturn tool is available on the project's page: https://github.com/ergo-services/tools.

    hashtag
    Installation

    To install saturn, use the following command:

    $ go install ergo.services/tools/saturn@latest

    Available arguments:

    • host: Specifies the hostname to use for incoming connections.

    • port: Port number for incoming connections. The default value is 4499.

    • path: Path to the configuration file saturn.yaml.

    • debug: Enables debug mode for outputting detailed information.

    • version: Displays the current version of Saturn.

    hashtag
    Starting Saturn

    To start Saturn, a configuration file named saturn.yaml is required. By default, Saturn expects this file to be located in the current directory. You can specify a different location for the configuration file using the -path argument.

    You can find an example configuration file in the project's Git repository.

    hashtag
    Configuration file structure

    The saturn.yaml configuration file contains two root elements:

    1. Saturn: This section includes settings for the Saturn server.

      • You can configure the Token for access by remote nodes and specify certificate files for TLS connections.

      • By default, a self-signed certificate is used. For clients to accept this certificate, they must enable the InsecureSkipVerify option when creating the client.

      • Changes to this section require a restart of the Saturn server.

    2. Clusters: This section includes the configurations for clusters.

      • Changes in this section are automatically reloaded and sent to the registered nodes as updated configuration messages, without requiring a restart of Saturn.

      • The settings can target:

        • All nodes in all clusters.

        • Only nodes with a specified name in all clusters.

        • Only nodes within a specific cluster.

        • Only a node with a specified name within a specific cluster.

    If the name of a configuration element ends with the suffix .file, the value of that element is treated as a path to a file. The content of that file is then sent to the nodes as a []byte.

    To configure settings for all nodes in all clusters, use the Clusters section in the saturn.yaml configuration file. Here, you can define global settings that will apply to every node within every cluster managed by Saturn:

    in this example:

    • Var1, Var2, Var3, and Var4 will be applied to all nodes in all clusters.

    • However, the value of Var1 for nodes named node@host in any cluster will be overridden with the value 456.

    If nodes are registered without specifying a Cluster in saturn.Options, they become part of the general cluster. Configuration for the general cluster should be provided in the Cluster@ section.

    In the example above:

    • The variable Var1 is set to 789 for the general cluster (all nodes in the general cluster will receive Var1: 789).

    • However, for the node node@host within the general cluster, Var1 will be overridden to 456.

    Thus, all nodes in the general cluster will inherit Var1: 789, except for node@host, which will specifically have Var1: 456. Other nodes in the general cluster will retain the default values from the Cluster@ section unless they are explicitly overridden in the configuration.

    To specify settings for a particular cluster, use the element name Cluster@<cluster name> in the configuration file:

    hashtag
    Service Discovery

    Saturn can manage multiple clusters simultaneously, but resolve requests from nodes are handled only within their own cluster.

    The name of a registered node must be unique within its cluster.

    When a node registers, it informs the registrar which cluster it belongs to. Additionally, the node reports the applications running on it. Other nodes in the same cluster receive notifications about the newly connected node and its applications. Any changes in application statuses are also reported to the registrar, which in turn notifies all participants in the cluster.

    For more details, see the Saturn Client section.

    Events

    Publish/Subscribe Event Mechanism

    The actor model excels at point-to-point communication. Process A sends a message to process B. Process C makes a request to process D. Each interaction has a specific sender and receiver.

    But some scenarios need one-to-many communication. A price feed updates and dozens of trading strategies need the new price. A user logs in and multiple subsystems need notification. A sensor reading arrives and various monitoring processes need to react. You could send individual messages to each interested process, but then the producer needs to track all consumers. When consumers come and go, the producer's consumer list becomes a maintenance burden.

    Events solve this with publish/subscribe semantics. A producer registers an event and publishes values to it. Consumers subscribe to the event without the producer knowing who they are. The framework handles message distribution - when the producer publishes an event, all current subscribers receive it. Subscribers can come and go dynamically, and the producer's code doesn't change.

    hashtag
    Registering Events

    A process becomes an event producer by calling RegisterEvent with an event name and options. The call returns a token - a unique reference that proves ownership. Only the process holding this token (or a process it delegates to) can publish events under this name.

    The Notify option controls whether the producer receives notifications about subscriber changes. When enabled, the producer receives gen.MessageEventStart when the first subscriber appears and gen.MessageEventStop when the last subscriber leaves. This allows the producer to start or stop expensive operations based on demand. If nobody's watching the price feed, why fetch prices?

    The Buffer option specifies how many recent events to keep. When a new subscriber joins, it receives the buffered events as a catch-up mechanism. Set this to zero if events are only relevant at the moment they're published. Set it to a reasonable number if new subscribers should see recent history.

    Events are identified by name and node. The combination must be unique. Two processes on the same node can't register events with the same name. But processes on different nodes can register events with the same name - they're different events.
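    A sketch of event registration using the options described above (the RegisterEvent call shape follows this chapter's description):

```go
token, err := process.RegisterEvent("price_update", gen.EventOptions{
	Notify: true, // receive MessageEventStart/MessageEventStop on demand changes
	Buffer: 10,   // new subscribers catch up on the last 10 events
})
if err != nil {
	return err
}
```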

    hashtag
    Publishing Events

    Publishing an event sends it to all current subscribers.

    You pass your application data directly. The framework wraps it in gen.MessageEvent automatically, adding the event identifier and timestamp. Subscribers receive the complete gen.MessageEvent structure containing your data.

    The producer uses the token obtained during registration. If you try to publish with an incorrect token, the operation fails. This prevents unauthorized processes from publishing events they don't own.

    Event publishing is fire-and-forget. The producer doesn't wait for acknowledgment or know how many subscribers received the event. The framework handles distribution asynchronously.
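    Publishing then looks like this. A sketch: PriceUpdate is a hypothetical payload type, and the SendEvent call shape follows this chapter's description:

```go
err := process.SendEvent("price_update", token, PriceUpdate{Symbol: "XYZ", Price: 42.5})
if err != nil {
	// incorrect token or unregistered event
	return err
}
```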

    hashtag
    Subscribing to Events

    Processes subscribe to events through links or monitors, the same mechanisms used for process lifecycle tracking.

    LinkEvent creates a link to an event. You receive event messages as they're published. If the event producer terminates or unregisters the event, you receive an exit signal. The link semantics apply - by default, you'd terminate too.

    MonitorEvent creates a monitor on an event. You receive event messages and a down notification if the producer terminates or the event is unregistered, but you don't terminate automatically.

    Both methods return buffered events upon successful subscription:

    The buffered events let subscribers catch up on what happened before they joined. If the buffer size was 10 and 5 events have been published, new subscribers receive those 5 events immediately.

    For local events, you can omit the node name: gen.Event{Name: "price_update"}. The framework fills in the local node name. For remote events, specify the full event identifier including the remote node name.
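    A sketch of subscribing and consuming events. The MonitorEvent return shape and gen.MessageEvent wrapping follow this chapter; handlePrice is a hypothetical helper:

```go
// Subscribe; the returned slice holds buffered events for catch-up.
events, err := process.MonitorEvent(gen.Event{Name: "price_update"})
if err != nil {
	return err
}
for _, ev := range events {
	handlePrice(ev.Message)
}

// Later, published events arrive as gen.MessageEvent in the mailbox:
func (c *Consumer) HandleMessage(from gen.PID, message any) error {
	if ev, ok := message.(gen.MessageEvent); ok {
		handlePrice(ev.Message)
	}
	return nil
}
```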

    hashtag
    Event Lifecycle

    Events exist from registration until unregistration or producer termination.

    When you register an event, it becomes available for subscription. Processes on any node can subscribe if they know the event name and node. The framework tracks all subscribers and distributes published events to them.

    When the producer terminates, the event is automatically unregistered. All subscribers receive termination notifications (exit signals for links, down messages for monitors). The event name becomes available for registration again.

    The producer can explicitly unregister an event with UnregisterEvent. This triggers the same notifications to subscribers. Use this when you're done publishing events but your process continues running.

    If a subscriber terminates or unsubscribes (via UnlinkEvent or DemonitorEvent), the producer doesn't receive notification unless Notify was enabled. With Notify, the producer receives gen.MessageEventStop when the last subscriber leaves.

    hashtag
    Network Transparency

    Events work across nodes seamlessly. A producer on node A can publish events that subscribers on nodes B, C, and D receive. The framework handles the network distribution.

    When you subscribe to a remote event, the framework sends a subscribe request to the remote node. The remote node records your subscription. When the producer publishes an event on the remote node, the remote node sends it to all remote subscribers, including you.

    If the network connection fails, subscribers receive termination notifications with reason gen.ErrNoConnection. This is consistent with how links and monitors handle network failures for processes.

    The buffered events work across nodes too. When you subscribe to a remote event, the remote node sends you the buffered events as part of the subscription response. This catch-up mechanism works regardless of where the producer and subscribers are located.

    hashtag
    Token Delegation

    Event tokens can be delegated. The producer can give its token to another process, allowing that process to publish events under the producer's event registration.

    This enables patterns where event generation is separated from event registration. A coordinator registers the event and distributes the token to worker processes. Workers publish events as data becomes available. Subscribers don't know or care which process instance published each event - they just receive events on the registered event name.

    Token delegation also allows rotating producers. A primary process registers an event and holds the token. A backup process can take over using the same token if the primary fails. Subscribers see a continuous event stream even as the producing process changes.

    hashtag
    Event Messages

    Event messages have a specific structure:

    Each gen.MessageEvent contains:

    • Event - The event identifier (name and node)

    • Message - Your application data (any type)

    • Timestamp - When the event was published (nanoseconds since epoch)

    Subscribers receive these wrapped messages and extract the application data. The wrapping provides context: which event the data came from and when it was published, so subscribers can handle events from multiple sources or correlate timing.

    hashtag
    Practical Patterns

    Events fit several common scenarios.

    Data streaming - A sensor process registers an event and publishes readings. Multiple monitoring processes subscribe. Each reading goes to all monitors. If a monitor crashes and restarts, it subscribes again and receives recent buffered readings to catch up.

    State change notification - A user session process registers an event and publishes state changes (login, logout, permission change). Authorization processes subscribe and update their caches. The session process doesn't track who's interested in its state changes.

    System telemetry - Processes publish metrics as events. Monitoring processes subscribe and aggregate. If the monitoring process restarts, buffered events provide recent history to rebuild state.

    Workflow coordination - An order processing system publishes order state events. Inventory, shipping, and billing processes subscribe. Each subsystem reacts to relevant state changes. The order process doesn't orchestrate the subsystems - they coordinate through events.

    For more information on links and monitors as they apply to processes and nodes, see the Links and Monitors chapter.

    Rotate

    The rotate logger writes log messages to files with automatic rotation based on time intervals. Instead of a single growing log file that eventually fills the disk, the logger creates new files periodically and optionally compresses old ones. This keeps disk usage predictable and makes log files manageable for analysis and archival.

    The logger operates asynchronously - log messages enter a queue and a background goroutine writes them to the file. This design prevents blocking your processes when disk I/O is slow. Logging happens in the background while your actors continue processing messages without waiting for disk writes to complete.

    hashtag
    File Rotation Mechanics

    Rotation happens based on time periods. You configure a duration - one minute, one hour, one day - and the logger creates a new file every period. The active file is always named <Prefix>.log. When the period ends, the logger:

    1. Copies the active file to a timestamped filename: <Prefix>.YYYYMMDDHHmi.log

    2. Optionally compresses it with gzip: <Prefix>.YYYYMMDDHHmi.log.gz

    3. Truncates the active file to start fresh for the new period

    This approach ensures the active file always has the same name. You can tail it (tail -f <Prefix>.log) and it works across rotations. The timestamped copies accumulate in the log directory, creating a chronological archive.

    The timestamp format is YYYYMMDDHHmi - year, month, day, hour, minute. This format sorts lexicographically, so ls -l shows files in chronological order. It's compact but human-readable.

    hashtag
    Asynchronous Writing

    The logger uses an internal lock-free queue (MPSC - multi-producer single-consumer). When any process logs a message, it pushes to the queue and returns immediately. A single background goroutine pops messages from the queue and writes them to the file.

    This design has several advantages:

    Non-blocking - Logging never blocks your process. If the disk is slow or the file system stalls, your actors continue running. The queue absorbs bursts of messages.

    Ordering - Messages from a single producer maintain order. The queue preserves submission order, so logs reflect the actual sequence of events within each process.

    Batching - The background goroutine processes messages continuously. If multiple messages arrive quickly, it writes them in a tight loop, reducing syscall overhead.

    hashtag
    Configuration

    The logger requires a rotation period and accepts several optional parameters:

    Period - The rotation interval. Minimum is time.Minute. Smaller periods create more files with less data each. Larger periods create fewer files with more data each. Choose based on how you analyze logs - if you search specific time ranges, shorter periods help. If you archive logs by day, use 24 * time.Hour.

    Path - Directory for log files. Defaults to ./logs relative to the executable. The logger creates the directory if it doesn't exist. Supports ~ for home directory expansion (~/logs becomes /home/user/logs). Use absolute paths in production to avoid ambiguity.

    Prefix - Filename prefix. Defaults to the executable name. The active file is <Prefix>.log, rotated files are <Prefix>.YYYYMMDDHHmi.log[.gz]. Use meaningful prefixes if multiple services log to the same directory.

    Compress - Enables gzip compression for rotated files. The active file stays uncompressed for fast writing. When rotating, the logger compresses the copy, reducing disk usage by 5-10x for text logs. Compressed files have .log.gz extension. Use compression if disk space matters more than CPU for compression.

    Depth - Limits the number of retained log files. When rotating, if the number of files exceeds Depth, the logger deletes the oldest file. Set to 0 (default) for unlimited retention. Set to a specific number (e.g., 24) to keep the last 24 periods. This prevents unbounded disk usage.

    TimeFormat - Timestamp format in log messages. Same as colored logger - any format from time package or custom layout. Empty string uses nanosecond timestamps. Choose based on readability vs. precision.

    IncludeName - Includes registered process names in log messages. Helps identify which process logged what.

    IncludeBehavior - Includes behavior type names in log messages. Useful during development to understand code flow.

    ShortLevelName - Uses abbreviated level names ([TRC], [DBG], etc.) instead of full names. Saves space in log files.

    hashtag
    Basic Usage

    Configure the rotate logger in node options:
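    A sketch matching the bullets below. It assumes a CreateLogger constructor in the rotate package taking the option names described above; the node-options field layout (Log.Loggers) is also an assumption:

```go
rl, err := rotate.CreateLogger(rotate.Options{
	Period:   time.Hour,        // rotate every hour
	Path:     "/var/log/myapp", // log directory
	Prefix:   "myapp",          // myapp.log, myapp.YYYYMMDDHHmi.log.gz
	Compress: true,             // gzip rotated files
	Depth:    24,               // keep last 24 hourly files (24 hours of history)
})
if err != nil {
	panic(err)
}

node, err := ergo.StartNode("myapp@localhost", gen.NodeOptions{
	Log: gen.LogOptions{
		Loggers: []gen.Logger{
			{Name: "rotate", Logger: rl},
		},
	},
})
```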

    This configuration:

    • Rotates every hour

    • Stores logs in /var/log/myapp/

    • Names files myapp.log (active) and myapp.202411191200.log.gz (rotated with compression)

    For detailed logger configuration options, see the rotate.Options struct in the rotate package. For understanding how loggers integrate with the framework, see the Logging chapter.

    Mutual TLS

    Mutual TLS authentication between nodes

    Standard TLS provides server authentication - the client verifies the server's certificate. Mutual TLS (mTLS) adds client authentication - both sides present and verify certificates. Only clients with certificates signed by a trusted CA can connect.

    hashtag
    Configuration

    NodeOptions.CertManager is used for:

    • Default acceptor (created automatically on port 15000)

    • All outgoing connections

    To override per-acceptor, use AcceptorOptions.CertManager.

    hashtag
    CertAuthManager

    gen.CertAuthManager extends CertManager with CA pool and authentication settings:

    Server-side settings:

    • ClientCAs - the CA pool used to verify certificates presented by connecting clients (set via SetClientCAs).

    • ClientAuth - the client authentication policy, a standard crypto/tls ClientAuthType value (set via SetClientAuth).

    ClientAuth values (standard crypto/tls constants):

    • tls.NoClientCert - no client certificate is requested.

    • tls.RequestClientCert - a certificate is requested but not required.

    • tls.RequireAnyClientCert - a certificate is required but not verified against ClientCAs.

    • tls.VerifyClientCertIfGiven - a certificate is optional but verified if presented.

    • tls.RequireAndVerifyClientCert - a certificate is required and verified (full mutual TLS).

    Client-side settings:

    • RootCAs - the CA pool used to verify server certificates on outgoing connections (set via SetRootCAs).

    • ServerName - the expected server name (SNI), used when the certificate's name doesn't match the connection address.

    hashtag
    Runtime Certificate Rotation

    Certificates can be rotated without restart:
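    A minimal sketch, assuming the CertManager provides an Update method for swapping in a renewed certificate:

```go
// Load the renewed certificate and hand it to the manager.
newCert, err := tls.LoadX509KeyPair("node.pem", "node-key.pem")
if err != nil {
	return err
}
certManager.Update(newCert) // subsequent connections use newCert
```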

    New connections use the updated certificate. Existing connections keep their original certificate.

    CA pools and ClientAuth are fixed at startup. Restart the node to change these settings.

    To use different certificates for specific destinations, see .

    hashtag
    Troubleshooting

    Connection rejected with certificate error

    Verify the client certificate is signed by a CA in the server's ClientCAs pool. Check certificate expiration dates.

    Server certificate verification failed

    The server's certificate must be signed by a CA in the client's RootCAs pool. For development, disable verification with NetworkOptions.InsecureSkipVerify: true.

    SNI mismatch

    Set ServerName on the client's CertAuthManager if the certificate's Common Name doesn't match the connection address.

    Certificate rotation not taking effect

    Updates apply to new connections only. Close existing connections to force reconnection with new certificate.

    CA pool changes not taking effect

    CA pools are fixed at startup. Restart the node to apply changes.

    Node

    What is a Node in Ergo Framework?

    A node is the runtime environment where your actors live. Think of it as the container that hosts processes, routes messages between them, and handles the complexities of distributed communication.

    When you start a node, you're launching a complete system with several subsystems working together: process management, message routing, networking, and logging. Each subsystem has a specific responsibility, and they coordinate to provide the foundation for your application.

    hashtag
    What a Node Provides

    Process Management - The node tracks every process running on it. When you spawn a process, the node assigns it a unique PID, registers it in the process table, and manages its lifecycle. When a process terminates, the node cleans up its resources and notifies any processes that were linked or monitoring it.

    Process

    What is a Process in Ergo Framework

    A process is an actor - a lightweight entity that handles messages sequentially in its own goroutine. It's the fundamental building block of an Ergo application.

    Every process has a mailbox where incoming messages wait to be processed. The mailbox contains four queues with different priorities: Urgent for critical system messages, System for framework control, Main for regular application messages, and Log for logging. When the process wakes up to handle messages, it processes them in priority order, taking from Urgent first, then System, then Main, and finally Log.

    The process runs only when it has messages to handle. When the mailbox is empty, the process sleeps, consuming no CPU. When a message arrives, the process wakes, handles the message, and sleeps again if nothing else is waiting. This efficiency is why you can have thousands of processes in a single application.

    hashtag

    Generic Types

    Data Types and Interfaces Used in Ergo Framework

    Ergo Framework uses several specialized types for identifying and addressing processes, nodes, and other entities in the system. Understanding these types is essential for working with the framework.

    hashtag
    Identifiers and Names

    hashtag

    Colored

    The colored logger provides visual clarity for console output by applying color highlighting to log messages. Instead of monochrome text where errors blend with informational messages, each log level gets a distinct color, and framework types are highlighted automatically. This makes it easier to scan logs during development and debugging.

    The logger writes directly to standard output with immediate formatting - no buffering, no delays. When a process logs a message, it appears instantly in your terminal with colors applied. This synchronous approach keeps logs simple and predictable during interactive development.

    hashtag
    Visual Organization

    Color helps your eyes parse logs quickly. Log levels use consistent colors:

    Saturn Client

    This package implements the gen.Registrar interface and serves as a client library for the central registrar, Saturn. In addition to the primary Service Discovery function, it automatically notifies all connected nodes about cluster configuration changes.

    To create a client, use the Create function from the saturn package. The function requires:

    • The hostname where the central registrar is running (default port: 4499).

    Cron

    Schedule tasks on a repetitive basis

    Applications often need tasks to run periodically. Generate a daily report at midnight. Clean up expired sessions every hour. Send weekly summary emails. Poll an external API every five minutes.

    You could implement this yourself - spawn a process that sleeps, wakes up, performs the task, and sleeps again. But then you're managing wake times, handling timezone changes, accounting for daylight saving time transitions, and ensuring the scheduler itself stays alive. The scheduling logic becomes scattered across your application.

    Cron provides scheduled task execution as a framework service. You declare what should run and when using the familiar crontab syntax. The framework handles timing, execution, and all the edge cases around time-based scheduling.

    hashtag

    $ go install ergo.services/tools/saturn@latest
    Clusters:
        Var1: 123
        Var2: 12.3
        Var3: "123"
        Var4.file: "./myfile.txt"
        
        node@host:
            Var1: 456
    Clusters:
        Var1: 123
        Cluster@:
            Var1: 789
            node@host:
                Var1: 456
    Clusters:
        Var1: 123
        Cluster@mycluster:
            Var1: 321
            node@host: 654
    func startSecureNode(name string) (gen.Node, error) {
        // Load node certificate (signed by cluster CA)
        cert, err := tls.LoadX509KeyPair(
            fmt.Sprintf("%s.pem", name),
            fmt.Sprintf("%s-key.pem", name),
        )
        if err != nil {
            return nil, err
        }
    
        // Load cluster CA
        caCert, err := os.ReadFile("cluster-ca.pem")
        if err != nil {
            return nil, err
        }
        caPool := x509.NewCertPool()
        caPool.AppendCertsFromPEM(caCert)
    
        certManager := gen.CreateCertAuthManager(cert)
        certManager.SetClientCAs(caPool)                          // verify incoming
        certManager.SetClientAuth(tls.RequireAndVerifyClientCert) // require client cert
        certManager.SetRootCAs(caPool)                            // verify outgoing
    
        return ergo.StartNode(gen.Atom(name), gen.NodeOptions{
            CertManager: certManager,
        })
    }

  • All nodes in all clusters.

  • Only nodes with a specified name in all clusters.

  • Only nodes within a specific cluster.

  • Only a node with a specified name within a specific cluster.


  • Deletes old files if depth limit is configured

  • Keeps last 24 hourly files (24 hours of history)

  • Deletes files older than 24 hours automatically


    ClientAuth - How strictly to enforce client certificates:

  • tls.NoClientCert - Don't request client certificate (default)

  • tls.RequestClientCert - Request but don't require

  • tls.RequireAnyClientCert - Require certificate, don't verify against CA

  • tls.VerifyClientCertIfGiven - Verify against CA if provided

  • tls.RequireAndVerifyClientCert - Require and verify against CA

    The remaining certificate options:

  • ClientCAs - CA pool to verify client certificates

  • RootCAs - CA pool to verify server certificates

  • ServerName - Server name for SNI (if different from host)

    Static Routes

    Message Routing - When a process sends a message, the node figures out where it needs to go. Local process? Route it directly to the mailbox. Remote process? Establish a network connection if needed and send it there. The sender doesn't need to know these details.

    Network Stack - The node handles all network communication. It discovers other nodes, establishes connections, encodes messages, and manages the complexity of distributed communication. This is what makes network transparency possible.

    Pub/Sub System - Links, monitors, and events all work through a publisher/subscriber mechanism in the node core. When a process terminates or an event fires, the node knows who's subscribed and delivers the notifications.

    Logging - Every log message goes through the node, which fans it out to registered loggers. This centralized logging makes it easy to capture, filter, and route log output.

    hashtag
    Starting a Node

    A node needs a name. The format is name@hostname, where the hostname determines which network interface to use for incoming connections.

    The name must be unique on the host. Two nodes with the same name can't run on the same machine, but nodes with different names can coexist.

    The gen.NodeOptions parameter configures the node: which applications to start, environment variables, network settings, logging configuration. If you specify applications in the options, the node loads and starts them automatically. If any application fails to start, the entire node startup fails - this ensures you don't end up in a partially initialized state.

    hashtag
    Process Lifecycle

    The node manages the complete process lifecycle.

    When you spawn a process, the node creates it, registers it in the process table, calls its ProcessInit callback, and transitions it to the sleep state. The process is now live and can receive messages.

    When the process terminates (either naturally or through an exit signal), the node calls ProcessTerminate, removes it from the process table, and notifies any processes that were linked or monitoring. Resources are cleaned up, and the gen.PID becomes invalid.

    Processes can register names, making them addressable by name rather than PID. This is useful for well-known processes that other parts of the system need to find. The node maintains a name registry, ensuring each name maps to exactly one process.

    hashtag
    Message Routing

    Message routing is one of the node's core responsibilities.

    When a process sends a message locally, the node simply places it in the recipient's mailbox. The recipient's goroutine wakes up (if it was sleeping), processes the message, and goes back to sleep if no more messages are waiting.

    When the message goes to a remote process, things are more interesting. The node checks if a connection exists to the remote node. If not, it discovers the remote node's address (through the registrar or static routes) and establishes a connection. The message is encoded into the Ergo Data Format, optionally compressed, and sent over the network. The remote node receives it, decodes it, and delivers it to the recipient's mailbox.

    From the sender's perspective, both paths look identical. That's network transparency.

    hashtag
    Network Communication

    Making remote message delivery work like local delivery requires solving three problems: finding remote nodes, establishing connections, and ensuring compatibility.

    The first problem is discovery. When you send to a remote process, the node extracts which node that process belongs to from its identifier. Every node runs a small registrar service by default. For nodes on the same host, you query the local registrar. For nodes on different hosts, you query the registrar on that remote host - the framework derives the hostname from the node name and sends the query there. The registrar responds with connection information.

    This default approach works for simple setups but has limitations. You're querying individual hosts, which requires them to be directly reachable. There's no cluster-wide view, no centralized configuration, no way to discover which applications are running where.

    That's where etcd or Saturn come in. Instead of each node being its own island with a local registrar, you run a centralized registry service. All nodes register there when they start. All discovery queries go there. The central registrar becomes the source of truth for the cluster, providing not just discovery but configuration management, application tracking, and topology change notifications. It transforms independent nodes into a coordinated cluster.

    Once a node is discovered, connections are established. Multiple TCP connections form a pool to that node, enabling parallel message delivery. The connections negotiate protocol details during handshake: which protocol version to use, whether compression is supported, what features are enabled. This negotiation allows nodes with different capabilities to work together.

    hashtag
    Environment and Configuration

    Nodes have environment variables that all processes inherit. This provides a way to configure behavior without hardcoding values. A process can override inherited variables or add its own, creating a hierarchy: process environment overrides parent, which overrides leader, which overrides node.

    Environment variables are case-insensitive. Whether you set "database_url" or "DATABASE_URL", the process sees the same value. This eliminates a common source of configuration bugs.

    hashtag
    Shutdown

    Stopping a node can be graceful or forced.

    Graceful shutdown sends exit signals to all processes and waits for them to clean up. Processes receive gen.TerminateReasonShutdown and can save state, close connections, or send final messages before terminating. Once all processes have stopped, the network stack shuts down, and the node exits.

    Forced shutdown kills all processes immediately without waiting for cleanup. This is useful when you need to stop quickly, but processes don't get a chance to clean up properly.

    One subtlety: if you call Stop from within a process, you create a deadlock. The process can't terminate because it's waiting for Stop to complete, but Stop is waiting for all processes (including this one) to terminate. The solution is either to call Stop in a separate goroutine or use StopForce, which doesn't wait.
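The deadlock and its fix can be sketched with plain Go primitives - a stand-in node that tracks processes with a WaitGroup, not the real gen.Node API:

```go
package main

import (
	"fmt"
	"sync"
)

// miniNode is a stand-in for a node: it tracks running processes with a
// WaitGroup, and Stop waits for all of them - just as graceful node
// shutdown waits for every process to terminate.
type miniNode struct {
	wg   sync.WaitGroup
	quit chan struct{}
}

func (n *miniNode) spawn(f func(*miniNode)) {
	n.wg.Add(1)
	go func() {
		defer n.wg.Done()
		f(n)
	}()
}

// Stop signals shutdown and waits for every process to exit.
func (n *miniNode) Stop() {
	close(n.quit)
	n.wg.Wait()
}

func run() string {
	n := &miniNode{quit: make(chan struct{})}
	n.spawn(func(n *miniNode) {
		// Calling n.Stop() directly here would deadlock: Stop waits for
		// this process, and this process waits for Stop. The fix is to
		// run Stop in a separate goroutine and return normally.
		go n.Stop()
		<-n.quit // observe the shutdown signal, then terminate
	})
	n.wg.Wait()
	return "stopped"
}

func main() {
	fmt.Println(run())
}
```

StopForce sidesteps the problem differently: it doesn't wait for processes at all.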

    hashtag
    Shutdown Timeout

    Graceful shutdown can hang indefinitely if a process is stuck - perhaps blocked on a channel, waiting for an external resource, or caught in incorrect logic. To prevent this, the node has a shutdown timeout. If processes don't terminate within this period, the node force exits with error code 1.

    The default timeout is 3 minutes. You can change it through gen.NodeOptions:

    During shutdown, the node logs which processes are still running. Every 5 seconds, it prints a warning with the first 10 pending processes, showing their PID, registered name (if any), behavior type, state, and mailbox queue length. This diagnostic output helps identify what's blocking the shutdown:

    The state tells you what the process is doing: running means it's handling a message, sleep means it's idle waiting for messages. The queue count shows how many messages are waiting. A process stuck in running with a growing queue indicates it's blocked in a callback and not processing its mailbox.

    hashtag
    Node Incarnation

    Every node has a creation timestamp assigned when it starts. This timestamp is embedded in every gen.PID, gen.Ref, and gen.Alias that the node creates.

    When two nodes connect, they exchange their creation timestamps during the handshake. Each connection stores the remote node's creation value.

    Before sending any message to a remote process, the framework compares the target's Creation field against the stored creation of that remote node. If they differ, the operation returns gen.ErrProcessIncarnation immediately - no network message is sent.

    This mechanism handles a common distributed systems problem: what happens when a remote node restarts? After restart, the node gets a new creation timestamp. Any gen.PID or gen.Alias from before the restart now contains the old creation value. When you try to send a message using that stale identifier, the framework detects the mismatch and returns an error instead of delivering the message to a wrong process.

    The check applies to all remote operations: Send, Call, Link, Unlink, Monitor, Demonitor, SendExit, and SendResponse.
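A minimal sketch of this pre-send check, using stand-in types rather than the real gen package:

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in types - not the real gen package - illustrating how a stale
// identifier is rejected before any network traffic happens.
type PID struct {
	Node     string
	ID       uint64
	Creation int64 // node start time, embedded when the PID is created
}

var ErrProcessIncarnation = errors.New("identifier belongs to a previous incarnation of the node")

// checkIncarnation mimics the pre-send check: the connection stores the
// remote node's current creation value; a PID minted before a restart
// carries the old value and is rejected locally.
func checkIncarnation(target PID, remoteCreation int64) error {
	if target.Creation != remoteCreation {
		return ErrProcessIncarnation
	}
	return nil
}

func main() {
	stale := PID{Node: "node@host", ID: 1013, Creation: 1700000000}
	// the remote node restarted and now reports a newer creation value
	fmt.Println(checkIncarnation(stale, 1700005000)) // prints the error
}
```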

    hashtag
    The Node's Role

    The node is infrastructure, not application logic. It provides the mechanisms - process management, message routing, networking - that your actors use to accomplish work.

    This separation is important. Your actors focus on application logic: handling requests, processing data, managing state. The node handles the plumbing: routing messages, establishing connections, managing lifecycles. You don't write code to discover remote nodes or encode messages. The node does that.

    This is what makes the framework approachable. You write actors that send and receive messages, and the node makes it all work, whether processes are local or distributed across a cluster.

    The following chapters dive into specific node capabilities. Process explains the actor lifecycle and operations. Networking covers distributed communication. Links and Monitors explains how processes track each other.

    Identifying Processes

    A process identifier (gen.PID) uniquely identifies a process across the entire distributed system. It contains three components: the node name where the process runs, a unique sequential number within that node, and a creation timestamp.

    The creation timestamp is the node's startup time. If a node restarts, the creation value changes, which means PIDs from before the restart are distinguishable from PIDs after. If you try to send a message to a gen.PID with an old creation value, you get an error. This prevents messages from being delivered to the wrong process after a node restart.

    Besides PIDs, processes can be identified by registered names. A process can register one name, making it addressable as gen.ProcessID{Name: "worker", Node: "node@host"}. This is useful for well-known processes that other parts of the system need to find without knowing their gen.PID.

    Processes can also create aliases - temporary identifiers that provide additional addressing options. Unlike registered names (one per process), a process can create unlimited aliases using gen.Alias. They're useful when you need multiple ways to address the same process, such as in request-response patterns or when implementing services with multiple endpoints.

    hashtag
    Process Lifecycle

    A process goes through several states during its lifetime.

    It starts in Init, where the ProcessInit callback runs. In this state, the process can spawn children, send messages, register names, create aliases, register events, establish links and monitors, and make synchronous calls.

    After initialization succeeds, the process enters Sleep and is ready to receive messages. When a message arrives, the process transitions to Running, handles the message, and returns to Sleep.

    If the process makes a synchronous call, it enters WaitResponse while waiting for the reply. Once the response arrives, it returns to Running and continues processing.

    Eventually the process terminates. This can happen in several ways: it returns an error from its message handler, it receives an exit signal, the node kills it, or a panic occurs. The ProcessTerminate callback runs, allowing cleanup. Then the process is removed from the node, and its resources are freed.

    hashtag
    Starting Processes

    You spawn processes through a factory function that creates instances of your actor.

    The factory is called each time you spawn - each process gets a fresh instance. This isolation is important for the actor model.
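The fresh-instance guarantee can be illustrated with a stand-in factory; the type names below are illustrative, not the framework's:

```go
package main

import "fmt"

// Stand-ins mirroring the shape of the factory pattern: a factory
// returns a fresh behavior instance for every spawn, so no state leaks
// between processes.
type ProcessBehavior interface{ Init() }

type worker struct{ counter int }

func (w *worker) Init() {}

func factory() ProcessBehavior { return &worker{} }

// spawn calls the factory each time, the way a node's spawn does.
func spawn(f func() ProcessBehavior) ProcessBehavior {
	b := f()
	b.Init()
	return b
}

func main() {
	a := spawn(factory)
	b := spawn(factory)
	fmt.Println(a != b) // two spawns, two distinct instances
}
```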

    gen.ProcessOptions configures the new process: mailbox size, environment variables, compression settings, message priority, linking behavior, and initialization timeout. Most options have sensible defaults. The main ones you'll configure are MailboxSize (to limit memory) and Env (to pass configuration).

    InitTimeout limits how long ProcessInit can take. Zero uses the default (5 seconds). If initialization exceeds this timeout, the process is terminated with gen.ErrTimeout and spawn returns an error. For remote spawn and application processes, the maximum allowed value is 15 seconds - exceeding this limit returns gen.ErrNotAllowed.

    Two options deserve explanation: LinkParent and LinkChild. These options provide a convenient way to establish links automatically after initialization completes. If LinkChild is set, the parent links to the child. If LinkParent is set, the child links to the parent. These links only work for process-spawned children, not node-spawned processes. Note that you can also call Link methods directly during initialization if needed.

    hashtag
    Message Handling

    Processes are defined by implementing the gen.ProcessBehavior interface. This is a low-level interface with three callbacks: ProcessInit for initialization, ProcessRun for the message processing loop, and ProcessTerminate for cleanup.

    In practice, you rarely implement gen.ProcessBehavior directly. Instead, you use act.Actor, which implements gen.ProcessBehavior and provides a more convenient abstraction. act.Actor gives you HandleMessage and HandleCall callbacks - straightforward methods where you write your message handling logic without worrying about the mailbox mechanics.

    The ProcessInit callback runs once during startup. Use it to initialize state, spawn children, configure properties. If it returns an error or exceeds the InitTimeout, the process is cleaned up and removed - it terminates immediately.

    The ProcessTerminate callback runs during shutdown. Use it for cleanup: close files, send final messages, log termination. It receives the termination reason, so you can distinguish between normal shutdown and errors.

    act.Actor handles the ProcessRun loop for you, calling your HandleMessage and HandleCall methods as messages arrive. This separation between the low-level interface (gen.ProcessBehavior) and the high-level abstraction (act.Actor) keeps the framework flexible while making common cases simple.
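A rough sketch of this split - a low-level run loop that owns the mailbox and dispatches to high-level callbacks - using stand-in types, not the real act.Actor API:

```go
package main

import (
	"errors"
	"fmt"
)

// message is a stand-in mailbox item; call marks synchronous requests.
type message struct {
	call    bool
	payload any
}

// handler plays the role of the high-level callbacks.
type handler interface {
	HandleMessage(payload any) error
	HandleCall(request any) (any, error)
}

// run is the ProcessRun analogue: it drains the mailbox and terminates
// when a callback returns an error (the termination reason).
func run(h handler, mailbox <-chan message) error {
	for m := range mailbox {
		if m.call {
			if _, err := h.HandleCall(m.payload); err != nil {
				return err
			}
			continue
		}
		if err := h.HandleMessage(m.payload); err != nil {
			return err
		}
	}
	return nil
}

type echo struct{}

func (echo) HandleMessage(p any) error {
	if p == "stop" {
		return errors.New("normal") // plays the role of a normal-shutdown reason
	}
	return nil
}

func (echo) HandleCall(req any) (any, error) { return req, nil }

func main() {
	mb := make(chan message, 3)
	mb <- message{payload: "hello"}
	mb <- message{call: true, payload: 42}
	mb <- message{payload: "stop"}
	close(mb)
	fmt.Println(run(echo{}, mb)) // the loop ends with reason "normal"
}
```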

    hashtag
    Environment Variables

    Processes inherit environment variables when they spawn. At that moment, variables are copied from multiple sources and merged with a priority order: node variables (lowest priority), then application, then leader, then parent, then variables specified in gen.ProcessOptions (highest priority). If the same variable exists in multiple sources, the higher priority value wins.

    Once a process is running, its environment is independent. If the node changes an environment variable, running processes don't see the change. Only newly spawned processes inherit the updated values. This isolation is important - it means a process's configuration is stable for its lifetime.

    When a process queries a variable with Env or EnvList, it looks only in its own environment - the merged copy created at spawn time. The hierarchy (Process > Parent > Leader > Application > Node) determines what was copied during spawning, not what's queried during lookup.

    Variables are case-insensitive. "database_url", "DATABASE_URL", and "Database_Url" are all the same variable. This eliminates configuration mistakes from case mismatches.

    Use SetEnv to modify variables during Init or Running states. Pass nil as the value to delete a variable. Changes affect only this process - they don't propagate to children, parents, or the node.
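The merge-and-override behavior can be sketched in plain Go - an illustrative model, not the framework's implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// mergeEnv models the spawn-time merge: lower-priority maps are copied
// first, higher-priority maps overwrite; keys are case-insensitive.
func mergeEnv(sources ...map[string]any) map[string]any {
	env := make(map[string]any)
	for _, src := range sources { // callers pass lowest priority first
		for k, v := range src {
			env[strings.ToUpper(k)] = v
		}
	}
	return env
}

// setEnv mirrors the SetEnv semantics described above: nil deletes.
func setEnv(env map[string]any, name string, value any) {
	key := strings.ToUpper(name)
	if value == nil {
		delete(env, key)
		return
	}
	env[key] = value
}

func main() {
	node := map[string]any{"database_url": "db://default", "pool": 10}
	parent := map[string]any{"DATABASE_URL": "db://parent"}
	opts := map[string]any{"Pool": 50}

	env := mergeEnv(node, parent, opts) // node < parent < options
	fmt.Println(env["DATABASE_URL"])    // db://parent - parent overrode node
	fmt.Println(env["POOL"])            // 50 - options overrode node

	setEnv(env, "pool", nil) // delete, case-insensitively
	_, ok := env["POOL"]
	fmt.Println(ok) // false
}
```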

    hashtag
    Termination

    Processes typically terminate themselves by returning an error from ProcessRun. In act.Actor, this manifests as returning an error from HandleMessage, HandleCall, or other handler callbacks. Return gen.TerminateReasonNormal for clean shutdown, or any other error to indicate why termination occurred. The process transitions to Terminated, runs its ProcessTerminate callback for cleanup, and is removed from the node.

    If a panic occurs during message handling, the framework catches it, logs the stack trace, and terminates the process with gen.TerminateReasonPanic. The ProcessTerminate callback still runs, giving the process a chance to clean up despite the panic.

    Processes can also be terminated externally. Sending an exit signal with SendExit delivers a high-priority termination request to the process's Urgent queue. Actors can trap these signals and handle them as regular messages, allowing graceful shutdown. This is how supervision trees restart workers - send an exit signal, wait for clean termination, then spawn a replacement.

    The most forceful option is Kill. If the process is idle (Sleep state), it transitions directly to Terminated and ProcessTerminate is called. If the process is actively handling a message (Running or WaitResponse states), it's marked as Zombee. In Zombee state, all operations return gen.ErrNotAllowed. The process finishes its current message, then terminates and calls ProcessTerminate. Use Kill when you need to stop a process that isn't responding to exit signals.

    Regardless of how termination happens, the node performs comprehensive cleanup. Events the process registered are unregistered. Its registered name becomes available for reuse. Aliases are deleted. Links and monitors are removed. If the process was acting as a logger, it's removed from the logging system. Meta processes spawned by this process are terminated. This ensures no dangling references remain after a process is gone.

    hashtag
    State-Based Access Control

    Not all Process interface methods work in all states. This isn't arbitrary - it reflects what's actually possible.

    During Init, the process can spawn children, send messages, register names, create aliases, register events, establish links and monitors, and make synchronous calls.

    During Running, everything is available. The process is fully operational.

    During Terminated, only sending messages works. You can't spawn new children or create new resources - the process is shutting down.

    These restrictions are enforced by the framework. If you call a method in the wrong state, you get gen.ErrNotAllowed. This prevents subtle bugs where operations appear to succeed but silently fail because the process isn't in the right state.

    The details of which methods work in which states are documented in the gen.Process godoc. In practice, you rarely hit these restrictions unless you're doing unusual things during initialization or shutdown.

    For a deeper understanding of process operations and lifecycle management, refer to the gen.Process interface documentation in the code.

    gen.Atom

    gen.Atom is a specialized string used for names - node names, process names, event names. While technically just a string, treating it as a distinct type allows the framework to optimize how these names are handled in the network stack.

    Atoms appear in single quotes when printed:

    The network stack caches atoms and maps them to numeric IDs to reduce bandwidth when the same names appear repeatedly in messages.

    hashtag
    gen.PID

    A gen.PID uniquely identifies a process. It contains the node name where the process lives, a unique sequential ID, and a creation timestamp. The creation timestamp changes when a node restarts, allowing you to detect if you're talking to a reincarnation of a node rather than the original.

    gen.PID values print with the node name hashed for brevity:

    The hash (90A29F11) is a CRC32 of the node name. This keeps the printed form compact while remaining unique.
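Assuming the standard IEEE CRC32 polynomial, the compact form can be reproduced like this (the PID layout in the output is illustrative):

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// nodeHash shortens a node name for printing: a CRC32 of the name,
// rendered as eight uppercase hex digits.
func nodeHash(name string) string {
	return fmt.Sprintf("%08X", crc32.ChecksumIEEE([]byte(name)))
}

func main() {
	h := nodeHash("demo@localhost")
	// compact printed form of a PID, e.g. <XXXXXXXX.0.1013>
	fmt.Printf("<%s.0.1013>\n", h)
}
```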

    hashtag
    gen.ProcessID

    A gen.ProcessID identifies a process by its registered name rather than gen.PID. This is useful when you need to address a process but don't know its gen.PID, or when the gen.PID might change across restarts but the name remains constant.

    hashtag
    gen.Ref

    gen.Ref values are unique identifiers generated by nodes. They're used for correlating requests and responses in synchronous calls, and as tokens when registering events.

    A gen.Ref is guaranteed unique within a node for its lifetime. The structure includes the node name, creation time, and a unique ID array.

    References can also embed deadlines (stored in ID[2]) for timeout tracking. Recipients can check ref.IsAlive() to see if a request is still valid.

    hashtag
    gen.Alias

    gen.Alias is like a temporary gen.PID. Processes create aliases for additional addressability without registering names. Meta processes use aliases as their primary identifier.

    Aliases use the same structure as references but print with a different prefix:

    hashtag
    gen.Event

    gen.Event values represent named message streams that processes can subscribe to. A gen.Event identifier consists of a name and the node where it's registered.

    hashtag
    gen.Env

    Environment variable names in Ergo are case-insensitive. The gen.Env type ensures this by converting to uppercase.

    This allows processes to inherit environment variables from parents, leaders, and the node, with consistent naming regardless of how they're specified.

    hashtag
    Core Interfaces

    The framework defines several interfaces that provide access to different parts of the system.

    hashtag
    gen.Node

    The gen.Node interface is what you get when you start a node. It provides methods for spawning processes, managing applications, configuring networking, and controlling the node lifecycle.

    Node operations can be called from any goroutine. The node manages processes but isn't itself an actor.

    hashtag
    gen.Process

    The gen.Process interface represents a running actor. It provides methods for sending messages, spawning children, linking to other processes, and managing the actor's lifecycle.

    Actors typically embed this interface:

    Process methods enforce state-based access control. Some operations are only available when the process is in certain states, ensuring actor model constraints are maintained.

    hashtag
    gen.Network

    The gen.Network interface manages distributed communication. It handles connections to remote nodes, routing, and service discovery.

    Network transparency means sending messages to remote processes uses the same API as local processes. The gen.Network interface is where you configure how that transparency is achieved.

    hashtag
    gen.RemoteNode

    A gen.RemoteNode represents a connection to another Ergo node. Through this interface, you can spawn processes on the remote node or start applications there.

    The remote operations require the target node to have enabled the corresponding permissions.

    hashtag
    Type Design Philosophy

    These types reflect a few design decisions worth understanding.

    Hashing for readability - Node names are hashed in output to keep logs and traces readable while maintaining uniqueness. Full names can be verbose, especially in distributed systems with descriptive naming.

    Separate types for concepts - gen.PID, gen.ProcessID, gen.Alias, and gen.Event are distinct types even though they could have been unified. Each represents a different way of addressing or identifying something in the system, and the type system helps keep these concepts clear.

    Network-aware design - Many types include the node name. This isn't just for completeness - it's what enables network transparency. A gen.PID tells you not just which process, but which node, allowing the framework to route messages appropriately.

    For detailed API documentation of these interfaces and types, refer to the godoc comments in the source code.

    , unless specified in saturn.Options)

  • A token for connecting to Saturn

  • A set of options (saturn.Options)

  • Then, set this client in the gen.NetworkOptions.Registrar option

    Using saturn.Options, you can specify:

    • Cluster - The cluster name for your node

    • Port - The port number for the central Saturn registrar

    • KeepAlive - The keep-alive parameter for the TCP connection with Saturn

    • InsecureSkipVerify - Option to ignore TLS certificate verification

    When the node starts, it will register with the Saturn central registrar in the specified cluster.

    Additionally, this library registers a gen.Event and generates messages based on events received from the central Saturn registrar within the specified cluster. This allows the node to stay informed of any updates or changes within the cluster, ensuring real-time event-driven communication and responsiveness to cluster configurations:

    • saturn.EventNodeJoined - Triggered when another node is registered in the same cluster.

    • saturn.EventNodeLeft - Triggered when a node disconnects from the central registrar

    • saturn.EventApplicationLoaded - An application was loaded on a remote node. Use ResolveApplication from the gen.Resolver interface to get application details

    • saturn.EventApplicationStarted - Triggered when an application starts on a remote node.

    • saturn.EventApplicationStopping - Triggered when an application begins stopping on a remote node.

  • saturn.EventApplicationStopped - Triggered when an application is stopped on a remote node.

    • saturn.EventApplicationUnloaded - Triggered when an application is unloaded on a remote node

    • saturn.EventConfigUpdate - The node's configuration was updated

    To receive such messages, you need to subscribe to Saturn client events using the LinkEvent or MonitorEvent methods from the gen.Process interface. You can obtain the name of the registered event using the Event method from the gen.Registrar interface. This allows your node to listen for important cluster events like node joins, application starts, configuration updates, and more, ensuring real-time updates and handling of cluster changes.

    Using the saturn.EventApplication* events and the Remote Start Application feature, you can dynamically manage the functionality of your cluster. The saturn.EventConfigUpdate events allow you to adjust the cluster configuration on the fly without restarting nodes, such as updating the cookie value for all nodes or refreshing the TLS certificate. Refer to the Saturn - Central Registrar section for more details.

    You can also use the Config and ConfigItem methods from the gen.Registrar interface to retrieve configuration parameters from the registrar.

    To get information about available applications in the cluster, use the ResolveApplication method from the gen.Resolver interface, which returns a list of gen.ApplicationRoute structures:

    • Name - The name of the application

    • Node - The name of the node where the application is loaded or running

    • Weight - The weight assigned to the application in gen.ApplicationSpec

    • Mode - The application's startup mode (gen.ApplicationModeTemporary, gen.ApplicationModePermanent, gen.ApplicationModeTransient)

    • State - The current state of the application (gen.ApplicationStateLoaded, gen.ApplicationStateRunning, gen.ApplicationStateStopping)

    You can access the gen.Resolver interface using the Resolver method from the gen.Registrar interface.

    hashtag
    How It Works

    Every minute, the cron system wakes up and evaluates all job specifications against the current time. Jobs whose specifications match the current minute are queued for execution. Each queued job then runs in its own goroutine.

    This design is stateless - no pre-calculated schedules, no complex data structures to maintain. When you add a job, it participates in the next evaluation. When you remove a job, it stops participating. Timezone and daylight saving time transitions are handled naturally because each evaluation uses current time rules.

    The stateless approach has implications. Multiple executions of the same job can run concurrently if the job takes longer than its interval. A job scheduled every minute that takes two minutes to complete will have two instances running simultaneously. If your job can't handle concurrent execution, implement serialization in the action itself - for example, send a message to a named process that processes requests sequentially.
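One way to implement that serialization is to funnel every trigger into a single sequential worker - analogous to sending gen.MessageCron to one named process:

```go
package main

import "fmt"

// serialize routes n triggers to one worker goroutine that handles
// them strictly one at a time, so overlapping cron firings never run
// the job body concurrently.
func serialize(n int) []int {
	triggers := make(chan int)
	done := make(chan []int)

	go func() { // the single "named process"
		var results []int
		for t := range triggers {
			results = append(results, t) // the job body runs sequentially here
		}
		done <- results
	}()

	for i := 1; i <= n; i++ {
		triggers <- i // each cron firing just enqueues work
	}
	close(triggers)
	return <-done
}

func main() {
	fmt.Println(serialize(3)) // [1 2 3]
}
```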

    hashtag
    Defining Jobs

    A job specification declares what should run and when:

    The Name identifies the job uniquely within the node. The Spec uses crontab format to define the schedule. The Location specifies which timezone to use when interpreting the schedule. The Action defines what happens when the schedule triggers.

    Optionally, Fallback can specify a process to notify if the action fails, providing centralized error handling for scheduled tasks.

    hashtag
    Actions

    Actions define what happens when a job runs.

    The simplest action sends a message. The job triggers, the cron system sends gen.MessageCron to the specified process, and the process handles it through normal message processing. This integrates cleanly with the actor model - the scheduled work happens inside an actor's message handler.

    For work that needs isolation per execution, spawn a process. Each time the job triggers, a fresh process spawns, performs the work, and terminates. If one execution crashes, the next starts clean. The spawned process receives environment variables identifying which job spawned it and when (gen.CronEnvNodeName, gen.CronEnvJobName, gen.CronEnvJobActionTime).

    For distributed systems, spawn on a remote node. A job on the coordinator can trigger work on data nodes. The remote node must have enabled spawn permissions for the process name. This pattern centralizes scheduling while distributing execution.

    Custom actions implement the gen.CronAction interface. The Do method receives the job name, node reference, and execution time in the job's timezone. Return an error to trigger fallback handling.

    hashtag
    Crontab Format

    Cron uses standard crontab syntax: five fields specifying minute, hour, day-of-month, month, and day-of-week.

    Common patterns:

    • 0 * * * * - Every hour

    • 0 0 * * * - Every day at midnight

    • */15 * * * * - Every 15 minutes

    • 0 9-17 * * 1-5 - At the start of every hour from 9 AM to 5 PM on weekdays

    • 0 0 1 * * - First day of each month

    • 0 0 * * 5#2 - Second Friday of each month

    • 0 0 L * * - Last day of each month

    Macros provide common schedules: @hourly, @daily, @weekly, @monthly.

    hashtag
    Managing Jobs

    Jobs can be defined at node startup in gen.NodeOptions.Cron.Jobs, or managed dynamically through the gen.Cron interface.

    Add jobs with AddJob. Remove them with RemoveJob. Temporarily disable with DisableJob (useful for maintenance windows), and resume with EnableJob. Query status with Info and JobInfo, which show execution history and errors.

    The Schedule and JobSchedule methods preview upcoming executions. Since the implementation evaluates specifications on-demand rather than maintaining pre-calculated schedules, these methods perform the same evaluation logic for a future time range. Use them to verify your crontab specs are correct or to detect scheduling conflicts.

    hashtag
    Timezone Handling

    Each job has its own timezone. A job with Location: time.UTC scheduled for midnight runs at UTC midnight. A job with a New York timezone runs at New York midnight. The physical location of the node doesn't matter - jobs run in their configured timezone.

    This matters for distributed systems where jobs serve different regions. One node can run jobs for multiple timezones. A cleanup job for European users runs at European midnight. A report job for Asian users runs at Asian business hours. Same node, different timezones, correct local timing.

    hashtag
    Daylight Saving Time

    Timezone transitions are handled carefully.

    When clocks spring forward, an hour disappears. A job scheduled for 2:00 AM doesn't run on the spring-forward date because 2:00 AM doesn't exist that day. The cron system detects the time adjustment and skips execution rather than running at the wrong time.

    When clocks fall back, an hour repeats. A job scheduled during that hour runs once, not twice. The system tracks actual wall clock progression to avoid duplicate execution.

    This behavior ensures jobs run when intended, not at arbitrary times that happen to match the specification after time adjustments.

    hashtag
    Error Handling

    If a job action returns an error and the job has a configured fallback, the system sends gen.MessageCronFallback to the fallback process. The message includes the job name, execution time, error, and an optional tag for identifying the job source.

    This allows centralizing monitoring of failed scheduled tasks. A single fallback process can receive failures from all jobs, log them, send alerts, or take corrective action.

    For complete crontab specification syntax and additional examples, refer to the gen.Cron interface documentation in the code.

    token, err := process.RegisterEvent("price_update", gen.EventOptions{
        Notify: true,
        Buffer: 10,
    })
    process.SendEvent("price_update", token, PriceUpdate{Symbol: "BTC", Price: 42000})
    lastEvents, err := process.LinkEvent(gen.Event{Name: "price_update", Node: "node@host"})
    for _, event := range lastEvents {
        // Process historical events
        price := event.Message.(PriceUpdate)
    }
    package main
    
    import (
        "time"
    
        "ergo.services/ergo"
        "ergo.services/ergo/gen"
        "ergo.services/logger/rotate"
    )
    
    func main() {
        options := rotate.Options{
            Period:   time.Hour,
            Path:     "/var/log/myapp",
            Prefix:   "myapp",
            Compress: true,
            Depth:    24,
        }
    
        logger, err := rotate.CreateLogger(options)
        if err != nil {
            panic(err)
        }
    
        nodeOpts := gen.NodeOptions{
            Log: gen.LogOptions{
                Loggers: []gen.Logger{
                    {Name: "rotate", Logger: logger},
                },
            },
        }
    
        node, err := ergo.StartNode("demo@localhost", nodeOpts)
        if err != nil {
            panic(err)
        }
    
        node.Log().Info("Node started, logging to /var/log/myapp/myapp.log")
        node.Wait()
    }
    type CertAuthManager interface {
        CertManager
    
        // server-side
        SetClientCAs(pool *x509.CertPool)
        SetClientAuth(auth tls.ClientAuthType)
    
        // client-side
        SetRootCAs(pool *x509.CertPool)
        SetServerName(name string) // for SNI
    }
    newCert, _ := tls.LoadX509KeyPair("new.pem", "new-key.pem")
    certManager.Update(newCert)
    node, err := ergo.StartNode("myapp@localhost", gen.NodeOptions{})
    if err != nil {
        panic(err)
    }
    defer node.Wait()
    options := gen.NodeOptions{
        ShutdownTimeout: 30 * time.Second,
    }
    [warning] node 'myapp@localhost' is still waiting for 3 process(es) to terminate:
    [warning]   <ABC123.0.1004> ('worker_1', main.Worker) state: running, queue: 1
    [warning]   <ABC123.0.1005> ('worker_2', main.Worker) state: running, queue: 0
    [warning]   <ABC123.0.1006> (main.Worker) state: running, queue: 5
    type Worker struct {
        act.Actor
    }
    
    func createWorker() gen.ProcessBehavior {
        return &Worker{}
    }
    
    pid, err := node.Spawn(createWorker, gen.ProcessOptions{})
    type Worker struct {
        act.Actor
    }
    
    func (w *Worker) HandleMessage(from gen.PID, message any) error {
        // Handle the message
        // Return nil to continue, return error to terminate
        return nil
    }
    fmt.Printf("%s", gen.Atom("myprocess"))
    // Output: 'myprocess'
    pid := gen.PID{Node: "node@localhost", ID: 1001, Creation: 1685523227}
    fmt.Printf("%s", pid)
    // Output: <90A29F11.0.1001>
    processID := gen.ProcessID{Name: "worker", Node: "node@localhost"}
    fmt.Printf("%s", processID)
    // Output: <90A29F11.'worker'>
    ref := node.MakeRef()
    fmt.Printf("%s", ref)
    // Output: Ref#<90A29F11.128194.23952.0>
    alias := process.CreateAlias()
    fmt.Printf("%s", alias)
    // Output: Alias#<90A29F11.128194.23952.0>
    event := gen.Event{Name: "user_login", Node: "node@localhost"}
    fmt.Printf("%s", event)
    // Output: Event#<90A29F11:'user_login'>
    env := gen.Env("database_url")
    fmt.Printf("%s", env)
    // Output: DATABASE_URL
    node, err := ergo.StartNode("mynode@localhost", gen.NodeOptions{})
    // node implements gen.Node interface
    type MyActor struct {
        act.Actor  // Embeds gen.Process
    }
    
    func (a *MyActor) HandleMessage(from gen.PID, message any) error {
        // Process methods are available directly
        a.Send(from, "reply")
        return nil
    }
    network := node.Network()
    remoteNode, err := network.GetNode("remote@otherhost")
    remoteNode, err := network.GetNode("node@remotehost")
    pid, err := remoteNode.Spawn("worker", gen.ProcessOptions{}, args)
    import (
         "ergo.services/ergo"
         "ergo.services/ergo/gen"
    
         "ergo.services/registrar/saturn"
    )
    
    func main() {
         var options gen.NodeOptions
         ...
         host := "localhost"
         token := "IwOBhgAEAGzPt"
         options.Network.Registrar = saturn.Create(host, token, saturn.Options{})
         ...
         node, err := ergo.StartNode("demo@localhost", options)
         ...
    }
    type myActor struct {
        act.Actor
    }
    
    func (m *myActor) HandleMessage(from gen.PID, message any) error {
        reg, err := m.Node().Network().Registrar()
        if err != nil {
            m.Log().Error("unable to get Registrar interface: %s", err)
            return nil
        }
        ev, err := reg.Event()
        if err != nil {
            m.Log().Error("Registrar has no registered Event: %s", err)
            return nil
        }
    
        m.MonitorEvent(ev)
        return nil
    }
    
    func (m *myActor) HandleEvent(event gen.MessageEvent) error {
        m.Log().Info("got event message: %v", event)
        return nil
    }
    type ApplicationRoute struct {
    	Node   Atom
    	Name   Atom
    	Weight int
    	Mode   ApplicationMode
    	State  ApplicationState
    }
    job := gen.CronJob{
        Name:     "daily_report",
        Spec:     "0 0 * * *",
        Location: time.UTC,
        Action:   gen.CreateCronActionMessage("reporter", gen.MessagePriorityNormal),
    }
    action := gen.CreateCronActionMessage("worker", gen.MessagePriorityNormal)
    action := gen.CreateCronActionSpawn(createReportWorker, gen.CronActionSpawnOptions{})
    action := gen.CreateCronActionRemoteSpawn("worker@datanode", "report_worker", gen.CronActionSpawnOptions{})
  • Trace - Faint white (low importance, background noise)

  • Debug - Magenta (development information)

  • Info - White (normal operation)

  • Warning - Yellow (attention needed)

  • Error - Red bold (problems occurred)

  • Panic - White on red background bold (critical failures)

  • Framework types also get color highlighting:

    • gen.Atom - Green (names and identifiers)

    • gen.PID - Blue (process identifiers)

    • gen.ProcessID - Blue (named processes)

    • gen.Ref - Cyan (references)

    • gen.Alias - Cyan (meta-process identifiers)

    • gen.Event - Cyan (event names)

    When you log process.Log().Info("started %s", pid), the PID renders in blue automatically. You don't annotate it - the logger detects the type and applies color. This works for any framework type used as an argument.

    hashtag
    Log Format

    Each log message follows a consistent structure:

    Timestamp appears first. By default, it's the Unix timestamp in nanoseconds. You can configure any format from Go's time package, or define your own. Nanosecond timestamps are sortable and precise, useful when correlating logs with traces or metrics.

    Level shows the severity. The bracket format [INFO] or short form [INF] makes levels easy to grep. Color reinforces the level visually - you don't need to read the text to know something is an error.

    Source identifies where the message originated:

    • Node logs - Show the node name in green (CRC32 hash for compactness)

    • Network logs - Show both local and peer node names

    • Process logs - Show PID in blue, optionally the registered name in green, optionally the behavior type

    • Meta-process logs - Show alias in cyan, optionally the behavior type

    The optional components (name, behavior) are controlled by configuration. During development, you might want behavior names to understand which actor logged something. In production, you might omit them to reduce output.

    Message is your formatted string with arguments. Framework types in arguments get color highlighting automatically.

    hashtag
    Configuration

    The logger accepts several options during creation:

    TimeFormat - Sets timestamp format. Any format from time package works (time.RFC3339, time.Kitchen, custom layouts). Leave empty for nanosecond timestamps. Nanoseconds are precise but hard to read. RFC3339 is human-friendly but verbose. Choose based on your use case.

    ShortLevelName - Uses abbreviated level names: [TRC], [DBG], [INF], [WRN], [ERR], [PNC]. Saves horizontal space in the terminal. Full names are clearer for people unfamiliar with the abbreviations.

    IncludeName - Adds the registered process name to the source. If a process registers as "worker", logs show the name in green next to the PID. Helpful when you have many processes and want to identify them by role rather than PID.

    IncludeBehavior - Adds the behavior type name to the source. Logs show which actor implementation generated the message. Useful during development to understand code flow. In production, this adds noise if you have good message content.

    IncludeFields - Includes structured logging fields in the output. Fields appear below the message with faint color. Useful when your log messages use context fields for correlation (request IDs, user IDs, etc.).

    DisableBanner - Disables the Ergo logo banner on startup. The banner announces framework version and adds visual flair. Disable it in production or when running tests where the banner clutters output.

    hashtag
    Basic Usage

    Register the colored logger in node options:

    The default logger writes to stdout too, but without colors. If you don't disable it, you get each message twice - once colored, once plain. Disabling the default logger ensures only the colored version appears.

    For detailed logger configuration options, see the colored.Options struct in the package. For understanding how loggers integrate with the framework, see Logging.

    Application

    Grouping and Managing Actors as a Unit

    An application groups related actors and manages them as a unit. Instead of starting individual processes and tracking their lifecycles manually, you define an application that specifies which actors to start, in what order, and how the group should behave if individual actors fail.

    Think of an application as a recipe. It lists the components (actors and supervisors), describes their startup order, and specifies the rules for what happens when things go wrong. The node follows this recipe when starting the application and monitors the running components according to the specified mode.

    hashtag
    The Need for Applications

    Starting processes one at a time works for simple systems. But as complexity grows, you face coordination problems. Which processes should start first? What if one fails to start - do you continue or abort? If a critical component terminates, should the service keep running in a degraded state or shut down cleanly?

    These aren't implementation details - they're architectural decisions about your service's structure and fault tolerance policy. Applications let you declare these decisions explicitly rather than scattering the logic throughout your code. The specification documents what your service consists of. The mode declares your termination policy. The framework enforces both.

    hashtag
    Defining an Application

    Applications implement the gen.ApplicationBehavior interface:

    The Load callback returns the application specification - what this application consists of and how it should behave. The Start callback runs after all processes start successfully. The Terminate callback runs when the application stops.

    A typical application specification:

    The Group lists processes to start. Processes start in the order listed. If a process has a Name, it's registered with that name, making it discoverable. Processes without names are anonymous.

    Application names and process names exist in separate namespaces. An application named "api" and a process named "api" do not conflict - you can have both registered simultaneously. However, using the same name for both creates confusion when reading code or debugging. Avoid identical names even though the framework allows it.

    hashtag
    Application Modes

    The mode determines what happens when a process in the application terminates.

    Temporary Mode - The application continues running despite individual process terminations. Only when all processes have stopped does the application itself terminate. This mode is for applications where components can fail and restart independently (typically via supervisors) without stopping the whole application.

    Transient Mode - The application stops if any process terminates abnormally (crashes, panics, errors). Normal termination doesn't trigger shutdown. When an abnormal termination occurs, all remaining processes receive exit signals and the application shuts down. Use this mode when abnormal failures indicate a systemic problem that requires stopping the entire service.

    Permanent Mode - The application stops if any process terminates, regardless of reason. Even normal termination of one process triggers shutdown of all others and the application itself. This mode is for applications where all components must run together - if one stops, the whole application is incomplete.

    hashtag
    Loading and Starting

    Applications go through two phases: loading and starting.

    Loading calls your Load callback, validates the specification, and registers the application with the node. The application is loaded but not running. This separation allows you to load multiple applications and resolve dependencies before starting any of them.

    Starting launches the processes in the Group according to their order. If dependencies are specified in ApplicationSpec.Depends, the node ensures those applications are running first. If any process fails to start (including initialization timeout), previously started processes are killed and the application fails to start.

    Application processes have a maximum InitTimeout of 15 seconds (3x DefaultRequestTimeout). Setting a higher value in gen.ProcessOptions returns gen.ErrNotAllowed and prevents the application from starting.

    Once all processes are running, the Start callback is called and the application enters the running state.

    hashtag
    Dependencies

    Applications can depend on other applications or network services. If application B depends on application A, the node ensures A is running before starting B. Dependencies are declared in ApplicationSpec.Depends.

    This allows you to structure complex systems with clear startup ordering. A database connection pool application starts before the API server application. The API server starts before the web frontend application. The framework handles the ordering automatically.

    hashtag
    Stopping Applications

    Applications stop in three ways.

    You can call ApplicationStop, which sends exit signals to all processes and waits for them to terminate gracefully (5 second timeout by default). Once all processes have stopped, the Terminate callback runs and the application transitions to the loaded state.

    You can call ApplicationStopForce, which kills all processes immediately without waiting. Less graceful, but guaranteed to stop quickly.

    The application can stop itself based on its mode. In Transient or Permanent mode, process failures trigger automatic shutdown according to the mode's rules.

    hashtag
    Environment and Configuration

    Applications have environment variables that all their processes inherit. These override node-level variables but are overridden by process-specific variables. This creates a natural layering: node provides defaults, application provides service-specific values, processes can override for their specific needs.

    hashtag
    Tags for Instance Selection

    Running multiple instances of the same application across a cluster creates a selection problem. Which instance should handle the request? In blue/green deployments, you run two versions and route traffic based on readiness. Canary deployments send a percentage to the new version. Some instances enter maintenance mode while others serve production traffic.

    Tags provide metadata for making these decisions. Label each application instance with tags describing its deployment state, version, or role:

    Tags are always available through node.ApplicationInfo() or remoteNode.ApplicationInfo(). For clusters using centralized registrars (etcd, Saturn), tags are also published during application route registration. This enables cluster-wide discovery: query the registrar and receive all application instances with their tags.

    The embedded in-memory registrar does not support application route registration, so tags in single-node or statically-routed deployments are only accessible via direct ApplicationInfo() calls, not through resolver queries.

    In clusters with centralized registrars:

    Common tag patterns:

    • Blue/green deployment: "blue", "green"

    • Canary rollout: "canary", "stable"

    • Maintenance state: "maintenance", "active", "draining"

    Tags separate deployment strategy from application code. Your application doesn't know it's the "blue" deployment - that's configuration. The routing logic queries tags and makes decisions based on current cluster state.

    hashtag
    Process Role Mapping

    Applications contain multiple processes with specific responsibilities. An API server handles requests. A connection pool manages database connections. A cache manager stores frequently accessed data. These are logical roles, but the actual process names might be versioned, generated, or environment-specific.

    The Map field bridges this gap. Define a mapping from logical role (string) to actual process name (Atom):

    To communicate with a process by role, get the application info, look up the role in the map, then use the returned name:

    This works for both local and remote applications. When querying a remote application, RemoteNode.ApplicationInfo() retrieves the map from the remote node, letting you discover process names without prior knowledge of the remote application's internal structure.

    Why use mapping:

    • Version changes: Update "api_server_v2" to "api_server_v3" without changing client code

    • Implementation swaps: Map "db" to different pool implementations based on deployment

    • Remote discovery: Remote nodes query the map to find process names in foreign applications

    The map provides a service contract. External code knows the application has an "api" role and a "db" role. The actual implementations can change as long as the roles remain consistent.

    hashtag
    The Application Pattern

    Applications provide structure to your actor system. Instead of scattered process creation throughout your code, applications centralize the "what runs in this service" question. The specification documents your system's structure. The mode declares your fault tolerance policy. The dependency mechanism ensures correct startup ordering.

    This organization becomes especially valuable in distributed systems where services start on different nodes. An application can be started remotely on another node, bringing all its components with the correct configuration and dependencies.

    For more details on application lifecycle and options, refer to the gen.ApplicationBehavior and gen.ApplicationSpec documentation in the code.

    Boilerplate Code Generation

    The ergo tool allows you to generate the structure and source code for a project based on the Ergo Framework. To install it, use the following command:

    go install ergo.tools/ergo@latest

    Alternatively, you can build it from the source code available at https://github.com/ergo-services/tools.

    When using the ergo tool, you need to follow a specific template for providing arguments:

    Parent:Actor{param1:value1,param2:value2...}

    • Parent can be a supervisor (specified earlier with -with-sup) or an application (specified earlier with -with-app).

    • Actor can be an actor (added earlier with -with-actor) or a supervisor (specified earlier with -with-sup).

    This structured approach ensures the proper hierarchy and parameters are defined for your actors and supervisors.

    hashtag
    Available Arguments and Parameters

    • -init <node_name>: a required argument that sets the name of the node for your service. Available parameters:

      • tls: enables encryption for network connections (a self-signed certificate will be used).

    hashtag
    Example

    For clarity, let's use all available arguments for ergo in the following example:

    Pay attention to the values of the -with-tcp and -with-web arguments — they are enclosed in double quotes. If an argument has multiple parameters, they are separated by commas without spaces. However, since commas are argument delimiters for the shell interpreter, we enclose the entire value of the argument in double quotes to ensure the shell correctly processes the parameters.

    In our example, we specified two loggers: colored and rotate. This allows for colored log messages in the standard output as well as logging to files with log rotation functionality. In this case, the default logger is disabled to prevent duplicate log messages from appearing on the standard output.

    Additionally, we included the observer application. By default, this interface is accessible at http://localhost:9911.

    As a result of the generation process, we get a well-structured project whose source code is ready for execution:

    The generated code is ready for compilation and execution:

    Since this example includes the observer application, you can open http://localhost:9911 in your browser to access the web interface for the node and its running processes.

    WebWorker

    WebWorker is a specialized actor for handling HTTP requests sent as meta.MessageWebRequest messages. It automatically routes requests to HTTP-method-specific callbacks and ensures the request completion signal is called.

    Used with meta.WebHandler to convert HTTP requests into actor messages. See the Web section for integration approaches.

    hashtag
    Purpose

    <timestamp> <level> <source> [name] [behavior]: <message>
    package main
    
    import (
        "ergo.services/ergo"
        "ergo.services/ergo/gen"
        "ergo.services/logger/colored"
    )
    
    func main() {
        logger := gen.Logger{
            Name:   "colored",
            Logger: colored.CreateLogger(colored.Options{}),
        }
    
        options := gen.NodeOptions{}
        options.Log.Loggers = []gen.Logger{logger}
        
        // Disable default logger to avoid duplicate output
        options.Log.DefaultLogger.Disable = true
    
        node, err := ergo.StartNode("demo@localhost", options)
        if err != nil {
            panic(err)
        }
        
        node.Log().Info("Node started with colored logger")
        node.Wait()
    }
  • Version tracking: "v1.0.0", "v2.0.0"

  • Geographic region: "us-east", "eu-west"

  • Stable interface: Clients depend on roles ("api", "db"), not implementation details

    When WebHandler sends meta.MessageWebRequest to an actor, that actor must:

    1. Extract the message from the mailbox

    2. Determine HTTP method

    3. Process the request

    4. Write response to http.ResponseWriter

    5. Call Done() to unblock the waiting HTTP handler

    WebWorker automates this. Embed it, implement method-specific callbacks, and the framework handles routing and cleanup.

    hashtag
    Basic Usage

    Embed act.WebWorker and implement callbacks for HTTP methods you handle:

    Spawn worker with registered name:

    When HTTP request arrives:

    1. WebHandler sends meta.MessageWebRequest to "api-worker"

    2. WebWorker detects message type, extracts HTTP method

    3. WebWorker calls appropriate Handle* method

    4. Your callback processes request, writes response

    5. WebWorker calls Done() automatically

    6. HTTP handler unblocks, response sent to client

    hashtag
    Available Callbacks

    All callbacks are optional. Implement only the methods you need:

    HTTP methods:

    • HandleGet(from gen.PID, writer http.ResponseWriter, request *http.Request) error

    • HandlePost(from gen.PID, writer http.ResponseWriter, request *http.Request) error

    • HandlePut(from gen.PID, writer http.ResponseWriter, request *http.Request) error

    • HandlePatch(from gen.PID, writer http.ResponseWriter, request *http.Request) error

    • HandleDelete(from gen.PID, writer http.ResponseWriter, request *http.Request) error

    • HandleHead(from gen.PID, writer http.ResponseWriter, request *http.Request) error

    • HandleOptions(from gen.PID, writer http.ResponseWriter, request *http.Request) error

    Actor callbacks:

    • Init(args ...any) error - initialization

    • HandleMessage(from gen.PID, message any) error - non-HTTP messages

    • HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) - synchronous requests

    • HandleEvent(message gen.MessageEvent) error - event subscriptions

    • Terminate(reason error) - cleanup

    • HandleInspect(from gen.PID, item ...string) map[string]string - introspection

    Unimplemented HTTP methods return 501 Not Implemented automatically.

    hashtag
    Error Handling

    Return nil to continue processing requests. Return non-nil error to terminate the worker:

    Returning error terminates the worker. Use this for fatal errors only (database connection lost, critical resource unavailable). For transient errors (validation, not found, conflict), write error response and return nil.

    hashtag
    Using with act.Pool

    Single worker processes one request at a time. Use act.Pool for concurrent processing:

    Spawn pool instead of single worker:

    WebHandler sends requests to pool. Pool distributes across 10 workers. System handles 10 concurrent requests.

    For details on pools, see Pool.

    hashtag
    Handling Non-HTTP Messages

    WebWorker processes meta.MessageWebRequest specially, but also receives regular messages:

    This allows workers to receive configuration updates, control messages, or other actor communication while processing HTTP requests.

    hashtag
    Implementation Details

    WebWorker implements gen.ProcessBehavior at a low level. It manages the mailbox loop, detects meta.MessageWebRequest, routes by HTTP method, and calls Done() after processing.

    The Done() call is critical. It cancels the context that WebHandler blocks on. Without it, the HTTP request would time out. WebWorker guarantees Done() is called even if your callback panics or returns an error.

    Default implementations exist for all callbacks. Unimplemented HTTP methods log a warning and return 501 Not Implemented. This lets you implement only the methods you need, without boilerplate for unsupported methods.

    Web
    type ApplicationBehavior interface {
        Load(node Node, args ...any) (ApplicationSpec, error)
        Start(mode ApplicationMode)
        Terminate(reason error)
    }
    func (a *MyApp) Load(node gen.Node, args ...any) (gen.ApplicationSpec, error) {
        return gen.ApplicationSpec{
            Name: "myapp",
            Group: []gen.ApplicationMemberSpec{
                {Name: "worker", Factory: createWorker},
                {Factory: createSupervisor},
            },
            Mode: gen.ApplicationModeTransient,
        }, nil
    }
    func (a *MyApp) Load(node gen.Node, args ...any) (gen.ApplicationSpec, error) {
        return gen.ApplicationSpec{
            Name: "api_service",
            Tags: []gen.Atom{"blue", "v2.1.0"},
            // ... rest of spec
        }, nil
    }
    // Query registrar for all instances
    routes, err := resolver.ResolveApplication("api_service")
    // Returns []ApplicationRoute, each with Node, Tags, Weight, State
    
    // Filter by tag
    for _, route := range routes {
        hasBlue := false
        for _, tag := range route.Tags {
            if tag == "blue" {
                hasBlue = true
                break
            }
        }
        if hasBlue {
            remoteNode, _ := network.GetNode(route.Node)
            info, _ := remoteNode.ApplicationInfo("api_service")
            // Use this instance
        }
    }
    func (a *MyApp) Load(node gen.Node, args ...any) (gen.ApplicationSpec, error) {
        return gen.ApplicationSpec{
            Name: "backend",
            Map: map[string]gen.Atom{
                "api":   "api_server_v2",
                "db":    "postgres_pool",
                "cache": "redis_manager",
            },
            Group: []gen.ApplicationMemberSpec{
                {Name: "api_server_v2", Factory: createAPI},
                {Name: "postgres_pool", Factory: createDB},
                {Name: "redis_manager", Factory: createCache},
            },
        }, nil
    }
    // Query application info (works locally or remotely)
    info, err := node.ApplicationInfo("backend")
    // or: info, err := remoteNode.ApplicationInfo("backend")
    
    // Find process name by role
    apiName, found := info.Map["api"]
    if found {
        // Use the actual process name to communicate
        response, err := node.Call(apiName, APIRequest{})
    }
    type APIWorker struct {
        act.WebWorker
    }
    
    func (w *APIWorker) HandleGet(from gen.PID, writer http.ResponseWriter, request *http.Request) error {
        // Process GET request
        user := w.lookupUser(request.URL.Query().Get("id"))
        json.NewEncoder(writer).Encode(user)
        return nil
    }
    
    func (w *APIWorker) HandlePost(from gen.PID, writer http.ResponseWriter, request *http.Request) error {
        // Process POST request
        var data CreateRequest
        json.NewDecoder(request.Body).Decode(&data)
    
        result := w.createResource(data)
        writer.WriteHeader(http.StatusCreated)
        json.NewEncoder(writer).Encode(result)
        return nil
    }
    
    func (w *APIWorker) HandleDelete(from gen.PID, writer http.ResponseWriter, request *http.Request) error {
        id := request.URL.Query().Get("id")
        w.deleteResource(id)
        writer.WriteHeader(http.StatusNoContent)
        return nil
    }
    type WebService struct {
        act.Actor
    }
    
    func (s *WebService) Init(args ...any) error {
        // Spawn worker
        _, err := s.SpawnRegister("api-worker",
            func() gen.ProcessBehavior { return &APIWorker{} },
            gen.ProcessOptions{},
        )
        if err != nil {
            return err
        }
    
        // Create WebHandler pointing to worker
        handler := meta.CreateWebHandler(meta.WebHandlerOptions{
            Worker: "api-worker",
        })
    
        _, err = s.SpawnMeta(handler, gen.MetaOptions{})
        // rest of setup...
    }
    func (w *APIWorker) HandlePost(from gen.PID, writer http.ResponseWriter, request *http.Request) error {
        var data CreateRequest
        if err := json.NewDecoder(request.Body).Decode(&data); err != nil {
            // Invalid JSON - return error to client, continue processing
            http.Error(writer, "Invalid JSON", http.StatusBadRequest)
            return nil
        }
    
        if err := w.createResource(data); err != nil {
            // Transient error - return error to client, continue processing
            http.Error(writer, "Create failed", http.StatusInternalServerError)
            return nil
        }
    
        writer.WriteHeader(http.StatusCreated)
        return nil
    }
    type APIWorkerPool struct {
        act.Pool
    }
    
    func (p *APIWorkerPool) Init(args ...any) (act.PoolOptions, error) {
        return act.PoolOptions{
            PoolSize:          10,
            WorkerMailboxSize: 20,
            WorkerFactory:     func() gen.ProcessBehavior { return &APIWorker{} },
        }, nil
    }
    _, err := s.SpawnRegister("api-worker",
        func() gen.ProcessBehavior { return &APIWorkerPool{} },
        gen.ProcessOptions{},
    )
    func (w *APIWorker) HandleMessage(from gen.PID, message any) error {
        // meta.MessageWebRequest handled automatically by WebWorker
        // Other messages reach this callback
        switch m := message.(type) {
        case ConfigUpdate:
            w.config = m.Config
            w.Log().Info("Configuration updated")
        }
        return nil
    }
    • module: allows you to specify the module name for the go.mod file.
  • -path <path>: specifies the path for the code of the generated project.

  • -with-actor <name>: adds an actor (based on act.Actor).

  • -with-app <name>: adds an application. Available parameters:

    • mode: specifies the application's start mode (temp - Temporary, perm - Permanent, trans - Transient). The default mode is trans. Example: -with-app MyApp{mode:perm}

  • -with-sup <name>: adds a supervisor (based on act.Supervisor). Available parameters:

    • type: specifies the type of supervisor (ofo - One For One, sofo - Simple One For One, afo - All For One, rfo - Rest For One). The default type is ofo.

    • strategy: specifies the restart strategy for the supervisor (temp - Temporary, perm - Permanent, trans - Transient). The default strategy is trans.

  • -with-pool <name>: adds a process pool actor (based on act.Pool). Available parameters:

    • size: Specifies the number of worker processes in the pool. By default, 3 processes are started.

  • -with-web <name>: adds a Web server (based on act.Pool and act.WebHandler). Available parameters:

    • host: specifies the hostname for the Web server.

    • port: specifies the port number for the Web server. The default is 9090.

    • tls: enables encryption for the Web server using the node's CertManager.

  • -with-tcp <name>: adds a TCP server actor (based on act.Actor and meta.TCPServer meta-process). Available parameters:

    • host: specifies the hostname for the TCP server.

    • port: specifies the port number for the TCP server. The default is 7654.

    • tls: enables encryption for the TCP server using the node's CertManager.

  • -with-udp <name>: adds a UDP server actor (based on act.Pool, meta.UDPServer, and act.Actor as worker processes). Available parameters:

    • host: specifies the hostname for the UDP server.

    • port: specifies the port number for the UDP server. The default is 7654.

  • -with-msg <name>: adds a message type for network interactions.

  • -with-logger <name>: adds a logger from the extended library. Available loggers: colored, rotate

  • -with-observer: adds the Observer application for inspecting the node.


    Remote Spawn Process

    Spawning processes on remote nodes

    Remote spawning means starting a process on another node from your code. You call a method, provide a factory name and options, and a process starts on the remote node. From the caller's perspective, it's nearly identical to spawning locally - you get back a gen.PID and can communicate with it immediately.

    This capability enables dynamic workload distribution. Your node needs to process a job but doesn't have capacity? Spawn a worker on a remote node with available resources. Your application needs to scale horizontally? Spawn processes across multiple nodes and distribute load. Remote spawning makes the cluster feel like one large computing resource rather than isolated nodes.

    But remote spawning isn't automatic. Security matters. You don't want arbitrary nodes spawning arbitrary processes on your infrastructure. The framework requires explicit permission - the remote node must enable each process factory individually and can restrict which nodes are allowed to use it.

    hashtag
    Security Model

    Remote spawning is disabled by default at the framework level. To enable it, set the EnableRemoteSpawn flag in your node's network configuration:
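    A minimal sketch of that configuration (assuming gen.NetworkFlags carries the EnableRemoteSpawn field; adjust to your framework version):

```go
node, err := ergo.StartNode("worker@otherhost", gen.NodeOptions{
    Network: gen.NetworkOptions{
        // These flags are advertised to peers during the handshake.
        Flags: gen.NetworkFlags{
            Enable:            true,
            EnableRemoteSpawn: true,
        },
    },
})
if err != nil {
    panic(err)
}
```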

    This flag is a global switch. With it disabled, all remote spawn requests fail immediately with gen.ErrNotAllowed. With it enabled, requests proceed to the next level of security: per-factory permission.

    hashtag
    Enabling Process Factories

    Even with EnableRemoteSpawn turned on, remote nodes can't spawn anything until you explicitly enable specific process factories:
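    A sketch of enabling a factory. The EnableSpawn signature (factory name, factory function, optional allowed nodes) and the Worker type are assumptions for illustration:

```go
// Hypothetical worker actor used for illustration.
type Worker struct {
    act.Actor
}

func createWorker() gen.ProcessBehavior { return &Worker{} }

// Allow remote nodes to spawn processes via the factory name "worker".
if err := node.Network().EnableSpawn("worker", createWorker); err != nil {
    node.Log().Error("unable to enable spawn: %s", err)
}
```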

    Now remote nodes can request spawning using the factory name "worker". The factory function createWorker returns a gen.ProcessBehavior, just like local spawning. When a remote spawn request arrives with name "worker", the framework calls createWorker() to instantiate the process.

    The factory name is the permission token. Remote nodes must use this exact name when requesting spawns. If they request "worker" and you haven't enabled it, the request fails. If they request "admin_process" without permission, it fails. You control the namespace of what's spawnable.

    hashtag
    Access Control Lists

    By default, EnableSpawn allows all nodes to use the factory. But you can restrict it to specific nodes:

    Now only those two nodes can spawn workers. Requests from other nodes fail with gen.ErrNotAllowed.

    You can update the access list dynamically:

    Calling EnableSpawn again with the same factory name updates the access list. The factory must be the same (same type) - you can't change which factory is associated with a name after the first EnableSpawn call. Attempting to do so returns an error.
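    Both the restriction and the later update might look like this (a sketch; passing the allowed node names as trailing arguments is an assumption):

```go
// Restrict the "worker" factory to two trusted nodes.
node.Network().EnableSpawn("worker", createWorker,
    "scheduler@node1", "scheduler@node2")

// Update the access list later by calling EnableSpawn again
// with the same factory.
node.Network().EnableSpawn("worker", createWorker,
    "scheduler@node1", "scheduler@node2", "scheduler@node3")
```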

    hashtag
    Disabling Access

    To remove nodes from the access list:

    This removes scheduler@node2 from the allowed list. Other nodes in the list remain allowed.

    To completely disable a factory:

    Without any node arguments, DisableSpawn removes the factory entirely. All future spawn requests for that name fail.

    To re-enable the factory with an open access list (any node can spawn):

    This is the explicit "allow all nodes" configuration.
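    The three operations above, sketched with the assumed DisableSpawn/EnableSpawn signatures:

```go
// Remove a single node from the access list.
node.Network().DisableSpawn("worker", "scheduler@node2")

// Remove the factory entirely - all spawn requests for "worker" fail.
node.Network().DisableSpawn("worker")

// Re-enable with an open access list: any node may spawn "worker".
node.Network().EnableSpawn("worker", createWorker)
```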

    hashtag
    Spawning on Remote Nodes

    To spawn a process on a remote node, first get a gen.RemoteNode interface:
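    A sketch of resolving the remote node handle (method names assumed from the gen.Network interface):

```go
remote, err := node.Network().GetNode("worker@otherhost")
if err != nil {
    // discovery failed or the connection could not be established
    return err
}
```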

    GetNode establishes a connection if needed. If a connection already exists, it returns immediately. If discovery or connection fails, you get an error.

    With the remote node handle, spawn a process:
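    A sketch of the spawn call; DoWork is a hypothetical message type:

```go
pid, err := remote.Spawn("worker", gen.ProcessOptions{})
if err != nil {
    // factory not enabled, access denied, or network failure
    return err
}
// pid is a regular gen.PID - message it like any local process.
node.Send(pid, DoWork{})
```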

    The gen.ProcessOptions are the same as local spawning: mailbox size, compression settings, parent process options. The remote node respects these options when creating the process.

    hashtag
    Spawn with Arguments

    You can pass initialization arguments to the remote process:
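    For example (a sketch; the argument values are illustrative):

```go
// Arguments are delivered to the factory-created process's Init callback.
// They must be EDF-serializable (register custom types with
// edf.RegisterTypeOf on both nodes).
pid, err := remote.Spawn("worker", gen.ProcessOptions{}, "task-queue", 42)
```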

    These arguments are passed to the factory's Init callback, just like local spawning. The arguments must be serializable via EDF - primitives, registered structs, framework types. Complex arguments require type registration on both sides.

    hashtag
    Spawn with Registration

    To spawn and register the process with a name:
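    A sketch, assuming SpawnRegister takes the registration name first; Ping is a hypothetical message type:

```go
pid, err := remote.SpawnRegister("worker-001", "worker", gen.ProcessOptions{})
if err != nil {
    return err
}
// The process is now reachable by name from anywhere in the cluster:
node.Send(gen.ProcessID{Name: "worker-001", Node: "worker@otherhost"}, Ping{})
```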

    The first argument is the registration name. The remote process is registered under that name on the remote node, allowing other processes on that node (or other nodes) to find it via gen.ProcessID{Name: "worker-001", Node: "worker@otherhost"}.

    hashtag
    Spawning from Processes

    The gen.Process interface provides methods for remote spawning from within a process:
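    A sketch of spawning from inside an actor; the RemoteSpawn signature (target node, factory name, options) and the Scheduler type are assumptions:

```go
func (s *Scheduler) HandleMessage(from gen.PID, message any) error {
    // The spawned process inherits this process's application name,
    // log level, and (if ExposeEnvRemoteSpawn is set) environment.
    pid, err := s.RemoteSpawn("worker@otherhost", "worker", gen.ProcessOptions{})
    if err != nil {
        return err
    }
    s.Send(pid, message)
    return nil
}
```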

    This differs from using RemoteNode.Spawn in a subtle but important way: the spawned process inherits properties from the calling process, not from the node.

    Inherited properties:

    • Application name - if the caller is part of an application, the remote process becomes part of that application too

    • Logging level - the remote process uses the same log level as the caller

    • Environment variables - if ExposeEnvRemoteSpawn security flag is enabled, the remote process gets a copy of the caller's environment

    This inheritance enables application-level distribution. If your application spawns processes remotely using process.RemoteSpawn, those processes belong to your application's supervision tree (conceptually), inherit your configuration, and operate as extensions of your application rather than independent processes.

    The same inheritance applies.

    hashtag
    Parent Relationship and Inheritance

    Remote spawn behavior differs based on whether you spawn from a process or from the node:

    From a process (process.RemoteSpawn):

    The spawned process inherits attributes from the calling process:

    • Parent PID: Set to the calling process's PID

    • Group Leader: Set to the calling process's group leader

    • Application: Set to the calling process's application name (if the caller belongs to an application)

    • Log Level: Inherits the calling process's log level

    • Environment: Inherits the calling process's environment (if SecurityOptions.ExposeEnvRemoteSpawn is enabled)

    The remote process can send messages to its parent using process.Parent(). If LinkChild: true is set in options, the link is established after spawn. However, the parent is on a different node - if the network connection drops, the remote process receives an exit signal for the lost parent and may terminate if linked.

    From the node (RemoteNode.Spawn):

    The spawned process receives attributes from the requesting node's core:

    • Parent PID: Set to the requesting node's core PID

    • Group Leader: Set to the requesting node's core PID

    • Application: Not set (empty - the process doesn't belong to any application)

    • Log Level: Inherits the requesting node's default log level

    • Environment: Inherits the requesting node's environment (if SecurityOptions.ExposeEnvRemoteSpawn is enabled)

    This creates independent processes without application affiliation. Use this for standalone remote workers that don't need to be part of an application's logical structure.

    hashtag
    Environment Variable Inheritance

    By default, remote processes don't inherit environment variables. This is a security decision - you probably don't want to expose your node's configuration to remote processes.

    To enable environment inheritance:
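    A minimal sketch, assuming the flag lives in gen.SecurityOptions as the source states:

```go
node, err := ergo.StartNode("scheduler@node1", gen.NodeOptions{
    Security: gen.SecurityOptions{
        // Share the calling process's environment with processes it
        // spawns remotely via process.RemoteSpawn.
        ExposeEnvRemoteSpawn: true,
    },
})
```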

    Now when you use process.RemoteSpawn, the remote process receives a copy of the calling process's environment. The remote node reads these values and sets them on the spawned process.

    Important: Environment variable values must be EDF-serializable. Strings, numbers, booleans work fine. Custom types require registration via edf.RegisterTypeOf. If an environment variable contains a non-serializable value (e.g., a channel, function, or unregistered struct), the remote spawn fails entirely with an error like "no encoder for type <type>". The framework doesn't skip problematic variables - any non-serializable value causes the entire spawn request to fail.

    Environment inheritance only works with process.RemoteSpawn. Using RemoteNode.Spawn doesn't inherit environment because there's no calling process - it's a node-level operation.

    hashtag
    How It Works

    When you call remote.Spawn:

    1. Check capabilities - The local node checks if the remote node's EnableRemoteSpawn flag is true (learned during handshake). If false, fail immediately.

    2. Create spawn message - Package the factory name, process options, and arguments into a MessageSpawn protocol message. Include a reference for tracking the response.

    3. Send request - Encode and send the message to the remote node, then wait for a response (the operation is synchronous - remote spawning blocks until the remote node replies).

    4. Remote processing - The remote node receives the message, checks that the factory is enabled and the requesting node is allowed, calls the factory function, and spawns the process with the given options.

    5. Response - The remote node sends back a MessageResult containing either the spawned PID or an error. The local node receives this, resolves the waiting request, and returns the PID to the caller.

    If anything fails (factory not found, access denied, remote node terminating, initialization timeout), the error is returned to the caller. The entire operation is synchronous from the caller's perspective - you call Spawn and block until the process is created or an error occurs.

    hashtag
    Practical Considerations

    Performance - Remote spawning is slower than local spawning. There's network latency, message encoding, and a synchronous request-response roundtrip. If you're spawning hundreds of processes, doing it remotely will be noticeably slower. Consider spawning a pool locally and distributing work via messages rather than spawning on-demand remotely.

    Timeouts - Remote spawn has a maximum InitTimeout of 15 seconds (3x DefaultRequestTimeout). If the remote process's ProcessInit takes longer, spawn fails with gen.ErrTimeout. Setting InitTimeout higher than 15 seconds returns gen.ErrNotAllowed immediately without attempting the spawn.

    Failure modes - Remote spawn can fail in ways local spawn can't. The network connection can drop mid-request. The remote node can crash before responding. The factory might exist but lack permission. Handle errors explicitly and have fallback strategies (retry, spawn locally, defer the work).

    Resource ownership - A process spawned on a remote node runs on that node's resources (CPU, memory). It's part of that node's process table. If the remote node terminates, the process dies. If you're distributing workload, be aware of which node owns which processes.

    Linking - Both LinkChild and LinkParent options work for remote spawn. The link is established after the remote process is created. If the network connection drops, linked processes receive exit signals for the lost peer.

    Application membership - Processes spawned via RemoteNode.Spawn don't belong to any application. Processes spawned via process.RemoteSpawn inherit the caller's application. This affects supervision, lifecycle, and monitoring.

    Registration names - Use SpawnRegister carefully. The name you provide is registered on the remote node. If that name is already taken, spawn fails. Ensure your naming strategy avoids conflicts, especially if multiple nodes are spawning on the same target.

    hashtag
    When to Use Remote Spawn

    Dynamic scaling - Your application detects high load and spawns additional workers on remote nodes to handle the burst. When load decreases, workers terminate naturally and resources are freed.

    Specialized hardware - Some nodes have GPUs, fast storage, or special network access. Spawn processes on those nodes when you need their capabilities, rather than sending data back and forth.

    Fault isolation - Spawn risky operations on remote nodes. If they crash or consume excessive resources, they don't affect your local node's stability.

    Data locality - If data lives on a specific node (in memory, on local disk), spawn processing near the data rather than transferring it across the network.

    Heterogeneous clusters - Different nodes run different process types. Scheduler nodes spawn job processors on worker nodes. API nodes spawn request handlers on computation nodes. Remote spawning enables this separation.

    Remote spawning isn't always the right answer. For static topologies where processes have fixed homes, use supervision trees and let supervisors spawn locally. For message-passing workloads where spawning overhead matters, use process pools and distribute work via messages. Remote spawning shines when you need dynamic, on-demand process creation across a cluster.

    For understanding the underlying network mechanics, see Network Stack. For controlling connections to remote nodes, see Static Routes.

    WebSocket

    WebSocket provides persistent bidirectional connections between clients and servers. Unlike HTTP request-response, a WebSocket connection remains open for extended periods, allowing both client and server to send messages at any time.

    The framework provides WebSocket meta-process implementation that integrates WebSocket connections with the actor model. Each connection becomes an independent actor addressable from anywhere in the cluster.

    hashtag
    The Integration Problem

    WebSocket connections need two capabilities simultaneously:

    Continuous reading: Connection must block reading messages from the client. When a message arrives, forward it to application actors for processing.

    Asynchronous writing: Backend actors must be able to push messages to the client at any time - notifications, updates, events from the actor system.

    This is exactly what meta-processes solve. External Reader continuously reads from the WebSocket. Actor Handler receives messages from backend actors and writes to the WebSocket. Both operate concurrently on the same connection.

    hashtag
    Components

    Two meta-processes work together:

    WebSocket Handler: Implements http.Handler interface. When HTTP request arrives, upgrades it to WebSocket connection using gorilla/websocket library. Spawns Connection meta-process for each upgrade. Returns immediately - does not block.

    WebSocket Connection: Meta-process managing one WebSocket connection. External Reader continuously reads messages from client, sends them to application actors. Actor Handler receives messages from actors, writes them to client. Connection lives until client disconnects or error occurs.

    hashtag
    Creating WebSocket Server

    Use websocket.CreateHandler to create handler meta-process:
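    A sketch of creating and starting the handler from an actor's Init; option names follow the list below, and the exact websocket package API is assumed:

```go
func (s *WebService) Init(args ...any) error {
    handler := websocket.CreateHandler(websocket.HandlerOptions{
        // Connections are distributed across this pool round-robin;
        // if empty, connection messages go to this (parent) process.
        ProcessPool:      []gen.Atom{"handler1", "handler2", "handler3"},
        HandshakeTimeout: 15 * time.Second,
        // Accept all origins (the default rejects cross-origin requests).
        CheckOrigin: func(r *http.Request) bool { return true },
    })
    if _, err := s.SpawnMeta(handler, gen.MetaOptions{}); err != nil {
        return err
    }
    // Register the handler with your HTTP server's mux as usual.
    return nil
}
```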

    Handler options:

    ProcessPool: List of process names that will receive messages from WebSocket connections. When connection is established, handler round-robins across this pool to select which process receives messages from this connection. If empty, connection sends to parent process.

    HandshakeTimeout: Maximum time for WebSocket upgrade handshake. Default 15 seconds.

    EnableCompression: Enable per-message compression. Reduces bandwidth for text messages.

    CheckOrigin: Function to verify request origin. Return true to accept, false to reject. Default rejects cross-origin requests. Use func(r *http.Request) bool { return true } to accept all origins.

    hashtag
    Connection Lifecycle

    When client connects:

    1. HTTP request arrives, handler upgrades to WebSocket

    2. Handler spawns Connection meta-process

    3. Connection sends MessageConnect to application

    During connection lifetime:

    • Client messages: External Reader reads → sends to application

    • Server messages: Application sends → Actor Handler writes to client

    • Both directions operate simultaneously

    When client disconnects:

    1. ReadMessage() returns error

    2. External Reader sends MessageDisconnect to application

    3. Connection closes socket

    hashtag
    Messages

    Three message types flow between connections and actors:

    websocket.MessageConnect: Sent when connection established.

    Receive this to track new connections:

    websocket.MessageDisconnect: Sent when connection closes.

    Receive this to clean up connection state:

    websocket.Message: Client message received or server message to send.

    Receive messages from client:

    Send messages to client:

    When sending, Type defaults to MessageTypeText if not set. ID field is ignored - target is specified in SendAlias() call.
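    The three message types can be handled in one callback. A sketch, assuming each message carries the connection's alias in an ID field and that Handler tracks connections in a map:

```go
func (h *Handler) HandleMessage(from gen.PID, message any) error {
    switch m := message.(type) {
    case websocket.MessageConnect:
        h.conns[m.ID] = struct{}{} // track the new connection

    case websocket.MessageDisconnect:
        delete(h.conns, m.ID) // clean up connection state

    case websocket.Message:
        // Echo the client's payload back. The target alias goes in
        // SendAlias; the message's ID field is ignored on send, and
        // Type defaults to text.
        return h.SendAlias(m.ID, websocket.Message{Body: m.Body})
    }
    return nil
}
```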

    hashtag
    Network Transparency

    Connection meta-processes have gen.Alias identifiers that work across the cluster. Any actor on any node can send messages to any connection:

    Network transparency makes every WebSocket connection addressable like any other actor. Backend logic scattered across cluster nodes can push updates to specific clients without intermediaries.

    hashtag
    Client Connections

    Create client-side WebSocket connections with websocket.CreateConnection:
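    A sketch of the client side; the URL option is shown as a url.URL value, though your version may take a plain string:

```go
conn, err := websocket.CreateConnection(websocket.ConnectionOptions{
    URL:     url.URL{Scheme: "wss", Host: "example.com:8443", Path: "/ws"},
    Process: "client-handler", // empty means: deliver to the parent process
})
if err != nil {
    return err // dial failed
}
if _, err := s.SpawnMeta(conn, gen.MetaOptions{}); err != nil {
    conn.Terminate(err) // close the already-dialed socket
    return err
}
```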

    CreateConnection performs WebSocket dial during creation. If dial fails, error is returned. If successful, connection is established but meta-process is not started yet. Call SpawnMeta() to start the meta-process. If spawn fails, call conn.Terminate(err) to close the connection.

    Connection options:

    URL: WebSocket server address. Use ws:// or wss:// scheme.

    Process: Process name that will receive messages from server. If empty, sends to parent process.

    HandshakeTimeout: Maximum time for connection handshake. Default 15 seconds.

    EnableCompression: Enable compression. Must match server setting.

    Client connections work identically to server connections. External Reader reads from server, Actor Handler sends to server. Messages use the same websocket.Message type.

    hashtag
    Process Pool Distribution

    Handler accepts ProcessPool - list of process names to receive connection messages. Handler distributes connections across this pool using round-robin:

    Connection 1 sends to "handler1", connection 2 to "handler2", connection 3 to "handler3", connection 4 to "handler1", etc. This distributes load across multiple handler processes.

    Useful for scaling: spawn multiple handler processes, each managing subset of connections. Prevents single handler from becoming bottleneck.

    Behind the NAT

    Running nodes behind NAT or load balancers

    When a node starts, it registers its routes with a registrar. A route contains connection parameters: port number, TLS flag, handshake version, protocol version, and optionally a host address. When another node needs to connect, it resolves the target node's routes from the registrar and uses these parameters to establish a connection.

    The host address in the route is optional. When empty, the connecting node extracts the host from the target's node name. If you're connecting to node@10.0.1.50, the framework extracts 10.0.1.50 and connects to that address on the resolved port.

    This works when node names reflect reachable addresses. But when a node is behind NAT, its node name contains a private IP that external nodes can't reach. The solution is to include a public address in the route itself using RouteHost and RoutePort.
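    A sketch of advertising a public address, assuming RouteHost/RoutePort are fields of the acceptor options (verify the exact placement in your framework version):

```go
node, err := ergo.StartNode("app@10.0.1.50", gen.NodeOptions{
    Network: gen.NetworkOptions{
        Acceptors: []gen.AcceptorOptions{
            {
                Port: 11000, // local listening port (behind NAT)
                // Public address registered with the registrar; other
                // nodes use it to reach this node from outside:
                RouteHost: "203.0.113.7",
                RoutePort: 11000,
            },
        },
    },
})
```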

    Remote Start Application

    Starting applications on remote nodes

    Remote application starting means launching an application on another node from your code. The remote node has the application loaded but not running. You send a start request, and the application starts on that node with the mode and options you specify. The application runs under the remote node's supervision, part of the remote node's application tree.

    This capability enables dynamic application deployment and orchestration. You have a cluster of nodes, each with applications loaded but waiting. A coordinator node decides which applications should run where, based on load, topology, or scheduling logic. Remote application starting makes this coordination explicit and controllable.

    Like remote spawning, remote application starting isn't automatic. Security matters. You don't want arbitrary nodes starting arbitrary applications. The framework requires explicit permission - the remote node must enable each application individually and can restrict which nodes are allowed to start it.

    Inspecting With Observer

    hashtag
    Installation and starting

    To install the observer tool, you need to have Golang compiler version 1.20 or higher. Run the following command:
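    Assuming the observer module lives at ergo.services/observer (verify the path against the project's README):

```shell
go install ergo.services/observer@latest

# then start the installed binary
observer
```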

    Available arguments for starting observer:

    $ ergo -path /tmp/project \
          -init demo{tls} \
          -with-app MyApp \
          -with-actor MyApp:MyActorInApp \
          -with-sup MyApp:MySup \
          -with-actor MySup:MyActorInSup \
          -with-tcp "MySup:MyTCP{port:12345,tls}" \
          -with-udp MySup:MyUDP{port:54321} \
          -with-pool MySup:MyPool{size:4} \
          -with-web "MyWeb{port:8888,tls}" \
          -with-msg MyMsg1 \
          -with-msg MyMsg2 \
          -with-logger colored \
          -with-logger rotate \
          -with-observer
          
    Generating project "/tmp/project/demo"...
       generating "/tmp/project/demo/apps/myapp/myactorinapp.go"
       generating "/tmp/project/demo/apps/myapp/myactorinsup.go"
       generating "/tmp/project/demo/cmd/myweb.go"
       generating "/tmp/project/demo/cmd/myweb_worker.go"
       generating "/tmp/project/demo/apps/myapp/mytcp.go"
       generating "/tmp/project/demo/apps/myapp/myudp.go"
       generating "/tmp/project/demo/apps/myapp/myudp_worker.go"
       generating "/tmp/project/demo/apps/myapp/mypool.go"
       generating "/tmp/project/demo/apps/myapp/mypool_worker.go"
       generating "/tmp/project/demo/apps/myapp/mysup.go"
       generating "/tmp/project/demo/apps/myapp/myapp.go"
       generating "/tmp/project/demo/types.go"
       generating "/tmp/project/demo/cmd/demo.go"
       generating "/tmp/project/demo/README.md"
       generating "/tmp/project/demo/go.mod"
       generating "/tmp/project/demo/go.sum"
    
    Successfully completed.
     demo
    ├── apps
    │  └── myapp
    │     ├── myactorinapp.go
    │     ├── myactorinsup.go
    │     ├── myapp.go
    │     ├── mypool.go
    │     ├── mypool_worker.go
    │     ├── mysup.go
    │     ├── mytcp.go
    │     ├── myudp.go
    │     └── myudp_worker.go
    ├── cmd
    │  ├── demo.go
    │  ├── myweb.go
    │  └── myweb_worker.go
    ├── go.mod
    ├── go.sum
    ├── README.md
    └── types.go
    hashtag
    Security Model

    Remote application starting is disabled by default at the framework level. To enable it, set the EnableRemoteApplicationStart flag in your node's network configuration:
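    A minimal sketch of that configuration (assuming gen.NetworkFlags carries the EnableRemoteApplicationStart field):

```go
node, err := ergo.StartNode("worker@host2", gen.NodeOptions{
    Network: gen.NetworkOptions{
        Flags: gen.NetworkFlags{
            Enable:                       true,
            EnableRemoteApplicationStart: true,
        },
    },
})
```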

    This flag is a global switch. With it disabled, all remote application start requests fail immediately with gen.ErrNotAllowed. With it enabled, requests proceed to per-application permission.

    hashtag
    Enabling Applications

    Even with EnableRemoteApplicationStart turned on, remote nodes can't start anything until you explicitly enable specific applications:
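    A sketch of enabling an application; the ApplicationLoad return values and the WorkersApp type are assumptions for illustration:

```go
// The application must already be loaded on this node.
if _, err := node.ApplicationLoad(&WorkersApp{}); err != nil {
    panic(err)
}

// Allow any node to start the "workers" application remotely.
if err := node.Network().EnableApplicationStart("workers"); err != nil {
    panic(err)
}
```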

    Now remote nodes can request starting the "workers" application. The application must be loaded on this node (via node.ApplicationLoad). If it's not loaded, remote start requests fail with gen.ErrApplicationUnknown. If it's already running, remote start requests fail because you can't start a running application again.

    The application name is the permission token. Remote nodes must use this exact name when requesting starts. If they request "workers" and you haven't enabled it, the request fails. If they request "admin_app" without permission, it fails. You control what's startable remotely.

    hashtag
    Access Control Lists

    By default, EnableApplicationStart allows all nodes to start the application. But you can restrict it to specific nodes:

    Now only those two nodes can start the workers application. Requests from other nodes fail with gen.ErrNotAllowed.

    You can update the access list dynamically:

    Calling EnableApplicationStart again with the same application name updates the access list.
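    Both the restriction and the later update might look like this (a sketch; passing allowed node names as trailing arguments is an assumption):

```go
// Restrict remote starting to two trusted nodes.
node.Network().EnableApplicationStart("workers",
    "scheduler@node1", "scheduler@node2")

// Calling it again with the same application name updates the list.
node.Network().EnableApplicationStart("workers",
    "scheduler@node1", "scheduler@node2", "scheduler@node3")
```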

    hashtag
    Disabling Access

    To remove nodes from the access list:

    This removes scheduler@node2 from the allowed list. Other nodes in the list remain allowed.

    To completely disable remote starting for an application:

    Without any node arguments, DisableApplicationStart removes the permission entirely. All future start requests for that application fail.

    To re-enable with an open access list (any node can start):

    This is the explicit "allow all nodes" configuration.
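    The three operations above, sketched with the assumed DisableApplicationStart/EnableApplicationStart signatures:

```go
// Remove one node from the access list.
node.Network().DisableApplicationStart("workers", "scheduler@node2")

// Disable remote starting for the application entirely.
node.Network().DisableApplicationStart("workers")

// Re-enable with an open access list: any node may start it.
node.Network().EnableApplicationStart("workers")
```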

    hashtag
    Starting Applications on Remote Nodes

    To start an application on a remote node, first get a gen.RemoteNode interface:

    With the remote node handle, start an application:
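    A sketch of both steps; the ApplicationStart signature (name plus an options struct) is an assumption:

```go
remote, err := node.Network().GetNode("worker@host2")
if err != nil {
    return err
}
// Blocks until the remote node confirms the start or reports an error.
if err := remote.ApplicationStart("workers", gen.ApplicationOptions{}); err != nil {
    return err
}
```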

    The application starts on the remote node. The start is synchronous - the call blocks until the remote node confirms the application started or returns an error.

    hashtag
    Application Startup Modes

    Applications have three startup modes: Temporary, Transient, and Permanent. These modes control restart behavior when the application terminates. For remote starts, you can specify the mode explicitly:

    If you use ApplicationStart without specifying a mode, the application starts with the mode it was loaded with (set during ApplicationLoad).

    The mode affects how the remote node's application supervisor handles termination. If the application crashes, does it restart automatically? The mode determines this. Choose based on your operational requirements - critical services should be Permanent, optional services can be Temporary, and services that should restart only on failure can be Transient.

    For details on application modes, see Application Startup Modes.

    hashtag
    Application Parent and Process Hierarchy

    When an application starts remotely, parent tracking is set at multiple levels:

    Application Parent: Set to the requesting node name:

    Process Parent for Group Members: Processes started directly by the application (listed in Group) receive the requesting node's core PID as their parent:

    Process Parent for Descendants: If those processes spawn children, the children receive their spawning process PID as parent (normal process hierarchy):

    Only the first-level processes (application group members) have the cross-node parent relationship. Subsequent generations follow standard process parent-child relationships within the local node.

    This parent information is for tracking and auditing, not supervision. The application is supervised by the local application supervisor on the remote node. Terminating the requesting node does not affect the running application.

    hashtag
    Environment Variable Inheritance

    By default, remote applications don't inherit environment variables from the requesting node. To enable environment inheritance:
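
The option is set on the requesting node at startup, as in the example collected at the end of this page:

```go
node, err := ergo.StartNode("scheduler@localhost", gen.NodeOptions{
    Security: gen.SecurityOptions{
        ExposeEnvRemoteApplicationStart: true, // allow env inheritance for remote app starts
    },
})
```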

    Now when you start an application remotely, the application's processes receive a copy of the requesting node's core environment. This enables configuration propagation - your scheduler node has configuration in its environment, and applications started remotely inherit it.

    Important: Environment variable values must be EDF-serializable. Strings, numbers, booleans work fine. Custom types require registration via edf.RegisterTypeOf. If an environment variable contains a non-serializable value (e.g., a channel, function, or unregistered struct), the remote application start fails entirely with an error like "no encoder for type <type>". The framework doesn't skip problematic variables - any non-serializable value causes the entire start request to fail.
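
If an environment variable carries a custom struct, register the type up front. A sketch - the type name here is hypothetical, and the error return is assumed per the edf package:

```go
// make WorkerConfig encodable by EDF so it can travel in the environment
// (WorkerConfig is a hypothetical example type)
if err := edf.RegisterTypeOf(WorkerConfig{}); err != nil {
    panic(err)
}
```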

    hashtag
    How It Works

    When you call remote.ApplicationStart:

    1. Check capabilities - The local node checks if the remote node's EnableRemoteApplicationStart flag is true (learned during handshake). If false, fail immediately.

    2. Create start message - Package the application name, startup mode, and options into a MessageApplicationStart protocol message. Include a reference for tracking the response.

    3. Send request - Encode and send the message to the remote node. Wait for a response (this is synchronous - remote application start blocks until the remote node replies).

    4. Remote processing - The remote node receives the message, checks if the application is enabled for remote start, checks if the requesting node is allowed, verifies the application exists and isn't already running, calls the application's start logic with the given mode.

    5. Response - The remote node sends back a MessageResult containing either success or an error. The local node receives this, resolves the waiting request, and returns the result to the caller.

    If anything fails (application not found, access denied, already running, remote node terminating), the error is returned to the caller. The entire operation is synchronous - you call ApplicationStart and block until the application is running or an error occurs.

    hashtag
    Practical Considerations

    Idempotency - Starting an already-running application returns an error. If you're unsure of the application's state, query it first using remote.ApplicationInfo to check if it's already running. Or handle the error gracefully and treat "already running" as success.

    Startup time - Some applications take time to start - they might load configuration, establish connections, initialize state. The remote start call blocks during this entire startup sequence. If startup is slow, the caller waits. For long-running startup logic, consider using async patterns or monitoring application state separately.

    Failure modes - Remote application start can fail in ways local start can't. The network connection can drop mid-request. The remote node can crash before responding. The application might fail to start for reasons specific to that node (missing dependencies, configuration issues). Handle errors explicitly.

    Resource contention - An application starting on a remote node consumes that node's resources (CPU, memory, file descriptors). If multiple nodes simultaneously request starting applications on the same remote node, it could become resource-constrained. Coordinate start requests to avoid overwhelming nodes.

    Application lifecycle - Once started remotely, the application runs until explicitly stopped or until the remote node terminates. The requesting node has no automatic control over the running application. If you want to stop it later, you need to send another request (the framework doesn't currently support remote application stop, but you can implement custom coordination via messages).

    Supervision independence - The application is supervised by the remote node, not by the requesting node. If the requesting node crashes, the application keeps running. If the remote node crashes, the application terminates. This independence is important for operational reasoning - the application's lifecycle is tied to where it runs, not to who started it.

    Configuration management - Applications often need configuration. With ExposeEnvRemoteApplicationStart, you can propagate environment variables. But this creates coupling - the application depends on the requesting node's configuration. Consider whether configuration should come from the remote node's local environment, from a centralized configuration service, or from the requesting node. The right answer depends on your architecture.

    hashtag
    When to Use Remote Application Start

    Dynamic orchestration - A coordinator node decides which applications should run on which nodes based on cluster state, resource availability, or scheduling logic. The coordinator starts applications dynamically as needed.

    Staged deployment - Applications are pre-loaded on nodes but not started. A deployment controller starts them in a specific order, waiting for health checks between stages. This enables controlled rollouts.

    Capacity management - Some applications run only during high-load periods. A resource manager monitors load and starts applications on additional nodes when needed, then stops them when load decreases.

    Geographic distribution - Applications are loaded across multiple regions. A traffic manager starts applications in specific regions based on user distribution, latency requirements, or failover needs.

    Testing and validation - Test frameworks load applications on test nodes but don't start them until test execution. Tests start applications with specific configurations, run scenarios, then stop them. This enables repeatable, isolated testing.

    Maintenance windows - During maintenance, you stop applications on a node, perform updates, then start them again. Remote start enables coordinated maintenance across a cluster without manually SSHing to each node.

    Remote application starting is about control and coordination. If your cluster has static application deployment (applications always run on specific nodes), you don't need this feature - use supervision trees and let supervisors start applications automatically. If your cluster has dynamic application deployment (applications move between nodes based on conditions), remote application starting enables that flexibility.

    For understanding the underlying network mechanics, see Network Stack. For controlling connections to remote nodes, see Static Routes. For understanding application lifecycle and modes, see Application.

    node, err := ergo.StartNode("worker@localhost", gen.NodeOptions{
        Network: gen.NetworkOptions{
            Flags: gen.NetworkFlags{
                Enable:            true,
                EnableRemoteSpawn: true,  // allow remote nodes to spawn processes
            },
        },
    })
    network := node.Network()
    
    err := network.EnableSpawn("worker", createWorker)
    if err != nil {
        // handle error
    }
    // Allow only these nodes to spawn workers
    network.EnableSpawn("worker", createWorker, 
        "scheduler@node1", 
        "scheduler@node2",
    )
    // Add more nodes to the allowed list
    network.EnableSpawn("worker", createWorker,
        "scheduler@node1",
        "scheduler@node2", 
        "scheduler@node3",  // newly allowed
    )
    // Remove specific nodes
    network.DisableSpawn("worker", "scheduler@node2")
    // No nodes can spawn workers anymore
    network.DisableSpawn("worker")
    // Re-enable for all nodes
    network.EnableSpawn("worker", createWorker)  // no node arguments
    network := node.Network()
    remote, err := network.GetNode("worker@otherhost")
    if err != nil {
        return err  // node unreachable, no route, etc
    }
    pid, err := remote.Spawn("worker", gen.ProcessOptions{})
    if err != nil {
        // handle error - not allowed, factory not found, remote node terminated, etc
    }
    
    // pid is the process running on the remote node
    process.Send(pid, WorkRequest{Job: "process-data"})
    pid, err := remote.Spawn("worker", gen.ProcessOptions{}, 
        ConfigData{WorkerID: 42, BatchSize: 100},
    )
    pid, err := remote.SpawnRegister("worker-001", "worker", gen.ProcessOptions{})
    pid, err := process.RemoteSpawn("worker@otherhost", "worker", gen.ProcessOptions{})
    pid, err := process.RemoteSpawnRegister(
        "worker@otherhost",
        "worker", 
        "worker-001",  // registration name
        gen.ProcessOptions{},
    )
    node, err := ergo.StartNode("myapp@localhost", gen.NodeOptions{
        Security: gen.SecurityOptions{
            ExposeEnvRemoteSpawn: true,  // allow env inheritance for remote spawn
        },
    })
    node, err := ergo.StartNode("worker@localhost", gen.NodeOptions{
        Network: gen.NetworkOptions{
            Flags: gen.NetworkFlags{
                Enable:                       true,
                EnableRemoteApplicationStart: true,  // allow remote nodes to start apps
            },
        },
    })
    network := node.Network()
    
    err := network.EnableApplicationStart("workers")
    if err != nil {
        // handle error
    }
    // Allow only these nodes to start the workers app
    network.EnableApplicationStart("workers",
        "scheduler@node1",
        "scheduler@node2",
    )
    // Add more nodes to the allowed list
    network.EnableApplicationStart("workers",
        "scheduler@node1",
        "scheduler@node2",
        "scheduler@node3",  // newly allowed
    )
    // Remove specific nodes
    network.DisableApplicationStart("workers", "scheduler@node2")
    // No nodes can start this application remotely anymore
    network.DisableApplicationStart("workers")
    // Re-enable for all nodes
    network.EnableApplicationStart("workers")  // no node arguments
    network := node.Network()
    remote, err := network.GetNode("worker@otherhost")
    if err != nil {
        return err  // node unreachable, no route, etc
    }
    err := remote.ApplicationStart("workers", gen.ApplicationOptions{})
    if err != nil {
        // handle error - not allowed, app not loaded, already running, etc
    }
    // Start as temporary (not restarted if it terminates)
    err := remote.ApplicationStartTemporary("workers", gen.ApplicationOptions{})
    
    // Start as transient (restarted only if it terminates abnormally)
    err := remote.ApplicationStartTransient("workers", gen.ApplicationOptions{})
    
    // Start as permanent (always restarted if it terminates)
    err := remote.ApplicationStartPermanent("workers", gen.ApplicationOptions{})
    // On the remote node
    info, err := node.ApplicationInfo("workers")
    // info.Parent == "scheduler@node1" (requesting node name)
    processInfo, err := node.ProcessInfo(workerPID)
    // processInfo.Parent == core PID of the requesting node (scheduler@node1)
    // Worker spawns a child process
    childPID, _ := worker.Spawn(factory, options)
    childInfo, _ := node.ProcessInfo(childPID)
    // childInfo.Parent == workerPID (not the remote core PID)
    node, err := ergo.StartNode("scheduler@localhost", gen.NodeOptions{
        Security: gen.SecurityOptions{
            ExposeEnvRemoteApplicationStart: true,  // allow env inheritance for remote app starts
        },
    })

    hashtag
    How Route Resolution Works

    Understanding the resolution flow clarifies why NAT causes problems and how RouteHost solves them.

    When a node registers with any registrar (embedded, etcd, or Saturn), it sends its routes - the host and port that other nodes should use to connect.

    The registrar stores these routes exactly as received. When another node later resolves worker@10.0.1.50, the connecting node checks whether route.Host is set; if it's empty, it extracts the host from the node name as a fallback.

    hashtag
    The NAT Problem

    When a node is behind NAT, its node name contains a private IP - say worker@10.0.1.50. The external node resolves the route, gets an empty host, extracts 10.0.1.50 from the node name, and tries to connect to a private IP that's unreachable from the internet.

    hashtag
    The Solution: RouteHost and RoutePort

    Tell the node what address to advertise by setting RouteHost and RoutePort in AcceptorOptions:
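
A sketch of the acceptor configuration, assuming acceptors are listed in gen.NetworkOptions (the four fields match the field reference below; the addresses are illustrative):

```go
node, err := ergo.StartNode("worker@10.0.1.50", gen.NodeOptions{
    Network: gen.NetworkOptions{
        Acceptors: []gen.AcceptorOptions{
            {
                Host:      "0.0.0.0",      // interface to bind locally
                Port:      15000,          // actual listening port
                RouteHost: "203.0.113.50", // public address to advertise
                RoutePort: 15000,          // public port to advertise
            },
        },
    },
})
```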

    Now the route registered with the registrar includes the public address.

    When another node resolves the name, it sees a non-empty Host in the route and uses it directly - there is no fallback to node-name extraction. The connection goes to the public address, NAT forwards it, and the connection succeeds.

    hashtag
    Field Reference

    The relevant AcceptorOptions fields and their purposes:

    • Host: network interface to bind the listener socket

    • Port: TCP port to listen on

    • RouteHost: host address to advertise in route registration

    • RoutePort: port number to advertise in route registration

    Host and RouteHost are independent:

    • Host: "0.0.0.0" binds to all interfaces but is useless as a connectable address

    • RouteHost: "203.0.113.50" is what other nodes use to connect

    hashtag
    Registrar Behavior

    All registrars (embedded, etcd, Saturn) handle routes identically:

    1. Registration: Store routes exactly as provided, including Host field

    2. Resolution: Return routes exactly as stored

    3. Connection: Connecting node uses route.Host if set, otherwise extracts from node name

    The embedded registrar sends resolution queries via UDP to the host portion of the node name. For worker@10.0.1.50, it queries 10.0.1.50:4499. This works because the registrar query goes to the private network (where the registrar runs), not to the NAT-ed node directly.

    External registrars (etcd, Saturn) use their central server for all queries. The node name's host portion is irrelevant for resolution since queries go to etcd/Saturn, not to the target host.

    hashtag
    Common Scenarios

    hashtag
    Same Port Forwarding

    NAT forwards the same port (15000 external = 15000 internal):
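
A configuration fragment in the shape of the field reference above (addresses are illustrative):

```go
// NAT: 203.0.113.50:15000 -> 10.0.1.50:15000
gen.AcceptorOptions{
    Port:      15000,
    RouteHost: "203.0.113.50",
    RoutePort: 15000,
}
```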

    hashtag
    Different Port Forwarding

    NAT maps different ports (32000 external -> 15000 internal):
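
A configuration fragment for this case (addresses are illustrative):

```go
// NAT: 203.0.113.50:32000 -> 10.0.1.50:15000
gen.AcceptorOptions{
    Port:      15000,          // actual listening port
    RouteHost: "203.0.113.50",
    RoutePort: 32000,          // externally visible port
}
```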

    hashtag
    DNS Name Instead of IP

    Advertise a DNS name for flexibility:
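
A configuration fragment with a hypothetical DNS name:

```go
gen.AcceptorOptions{
    Port:      15000,
    RouteHost: "worker.example.com", // DNS name instead of an IP
    RoutePort: 15000,
}
```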

    The DNS name is stored in the route. Connecting nodes resolve DNS at connection time, getting the current IP.

    hashtag
    Kubernetes NodePort

    Pod behind NodePort service:
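
A configuration fragment for this case (the NodePort value and node address are hypothetical):

```go
// Kubernetes: NodePort 31500 on the cluster node forwards to the pod's port 15000
gen.AcceptorOptions{
    Port:      15000,
    RouteHost: "k8s-node.example.com", // cluster node's external address
    RoutePort: 31500,
}
```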

    hashtag
    Local Network Considerations

    Setting RouteHost affects all nodes that resolve your address, including nodes on the same local network. If local nodes should use internal addresses while external nodes use public addresses, you have several options.

    hashtag
    Multiple Acceptors

    Run acceptors on different ports for internal and external access:
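
A sketch, assuming acceptors are listed in gen.NetworkOptions (addresses are illustrative):

```go
node, err := ergo.StartNode("worker@10.0.1.50", gen.NodeOptions{
    Network: gen.NetworkOptions{
        Acceptors: []gen.AcceptorOptions{
            // internal acceptor: advertises the private address by default
            {Host: "10.0.1.50", Port: 15000},
            // external acceptor: advertises the public NAT address
            {Host: "0.0.0.0", Port: 15001, RouteHost: "203.0.113.50", RoutePort: 15001},
        },
    },
})
```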

    Both routes are registered. Local nodes can connect via either. External nodes can only use the one with RouteHost set.

    hashtag
    Static Routes on Local Nodes

    Configure local nodes to bypass registrar resolution:
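
A sketch, assuming an AddRoute-style call as described in the Static Routes chapter - the match pattern, gen.NetworkRoute shape, and weight argument are assumptions, so check that chapter for the exact signature:

```go
// local nodes dial the internal IP directly, skipping registrar resolution
node.Network().AddRoute("worker@10.0.1.50", gen.NetworkRoute{
    Route: gen.Route{Host: "10.0.1.50", Port: 15000},
}, 100)
```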

    Static routes are checked before registrar resolution. Local nodes use the static route (internal IP), external nodes use registrar resolution (public IP from RouteHost).

    hashtag
    Hairpin NAT

    Hairpin NAT (also called NAT loopback) allows internal nodes to connect using the public IP address.

    When you set RouteHost: "203.0.113.50", all nodes - including local ones - receive this public address from the registrar and try to connect to it.

    Without hairpin NAT support, internal nodes that dial the public address fail to connect - the NAT device won't turn the traffic back inward. With hairpin NAT support, the connection succeeds: the NAT device rewrites the traffic and forwards it back to the internal node, so the public address works from inside the network too.

    The traffic makes a "hairpin turn" at the NAT device - goes toward the external interface, turns around, comes back to the internal network.

    This is a network infrastructure configuration on your router/firewall, not an application change. Check your NAT device documentation for "hairpin NAT", "NAT loopback", or "NAT reflection" settings.

    hashtag
    Relation to Static Routes

    RouteHost/RoutePort and static routes solve opposite problems:

    • You're behind NAT and others can't reach you: set RouteHost/RoutePort to advertise your public address.

    • Others are behind NAT and you can't reach them: configure static routes with their public addresses.

    In complex topologies, you might use both. Your node advertises its public address via RouteHost. It also configures static routes to reach other nodes through specific gateways.

    hashtag
    Troubleshooting

    External nodes can't connect

    1. Verify NAT/firewall forwards traffic to your node

    2. Check RouteHost and RoutePort match your NAT configuration

    3. Confirm the public address is reachable from outside

    Local nodes unnecessarily using public address

    Expected when RouteHost is set. Use multiple acceptors or static routes to give local nodes a direct path.

    Wrong port advertised

    If using PortRange and the first port is unavailable, the node binds to a different port. RoutePort (if set) still advertises your configured value. Ensure NAT forwards to the actual bound port, or ensure your configured port is available.

    Embedded registrar resolution fails for cross-network nodes

    The embedded registrar sends UDP queries to hostname:4499 extracted from the target node name. If worker@10.0.1.50 is behind NAT, external nodes send UDP to 10.0.1.50:4499, which is unreachable. Use external registrars (etcd, Saturn) for cross-network deployments, or configure static routes.

  • -help: displays information about the available arguments.

  • -version: prints the current version of the Observer tool.

  • -host: specifies the interface name for the Web server to run on (default: "localhost").

  • -port: defines the port number for the Web server (default: 9911).

  • -cookie: sets the default cookie value used for connecting to other nodes.

  • If you are running observer on a server for continuous operation, it is recommended to use the environment variable COOKIE instead of the -cookie argument. Using sensitive data in command-line arguments is insecure.

    After starting observer, it initially has no connections to other nodes, so you will be prompted to specify the node you want to connect to.

    Once you establish a connection with a remote node, the Observer application main page will open, displaying information about that node.

    If you have integrated the Observer application into your node, upon opening the Observer page, you will immediately land on the main page showing information about the node where the Observer application was launched.

    hashtag
    Info (main page)

    On this tab, you will find general information about the node and the ability to manage its logging level. Changing the logging level only affects the node itself and any newly started processes, but it does not impact processes that are already running.

    Graphs provide real-time information over the last 60 seconds, including the total number of processes, the number of processes in the running state, and memory usage data. Memory usage is divided into used, which indicates how much memory was reserved from the operating system, and allocated, which shows how much of that reserved memory is currently being used by the Golang runtime.

    In addition to these details, you can view information about the available loggers on the node and their respective logging levels. For more details, refer to the Logging section. Environment variables will also be displayed here, but only if the ExposeEnvInfo option was enabled in the gen.NodeOptions.Security settings when the inspected node was started.

    hashtag
    Network (main page)

    The Network tab displays information about the node's network stack.

    The Mode indicates how the network stack was started (enabled, hidden, or disabled).

    The Registrar section shows the properties of the registrar in use, including its capabilities. Embedded Server indicates whether the registrar is running in server mode, while the Server field shows the address and port number of the registrar with which the node is registered.

    Additionally, the tab provides information about the default handshake and protocol versions used for outgoing connections.

    The Flags section lists the set of flags that define the functionality available to remote nodes.

    The Acceptors section lists the node's acceptors, with detailed information available for each. This list will be empty if the network stack is running in hidden mode.

    Since the node can work with multiple network stacks simultaneously, some acceptors may have different registrar parameters and handshake/protocol versions. For an example of simultaneous usage of the Erlang and Ergo Framework network stacks, refer to the Erlang section.

    The Connected Nodes section displays a list of active connections with remote nodes. For each connection, you can view detailed information, including the version of the handshake used when the connection was established and the protocol currently in use. The Flags section shows which features are available to the node when interacting with the remote node.

    Since the ENP protocol supports a pool of TCP connections within a single network connection, you will find information about the Pool Size (the number of TCP connections). The Pool DSN field will be empty if this is an incoming connection for the node or if the protocol does not support TCP connection pooling.

    Graphs provide a summary of the number of received/sent messages and network traffic over the last 60 seconds, offering a quick overview of communication activity and data flow.

    hashtag
    Process list (main page)

    On the Processes List tab, you can view general information about the processes running on the node. The number of processes displayed is controlled by the Start from and Limit parameters.

    By default, the list is sorted by the process identifier. However, you can choose different sorting options:

    • Top Running: displays processes that have spent the most time in the running state.

    • Top Messaging: sorts processes by the number of sent/received messages in descending order.

    • Top Mailbox: helps identify processes with the highest number of messages in their mailbox, which can be an indication that the process is struggling to handle the load efficiently.

    For each process, you can view brief information:

    The Behavior field shows the type of object that the process represents.

    Application field indicates the application to which the process belongs. This property is inherited from the parent, so all processes started within an application and their child processes will share the same value.

    Mailbox Messages displays the total number of messages across all queues in the process's mailbox.

    Running Time shows the total time the process has spent in the running state, which occurs when the process is actively handling messages from its queue.

    By clicking on the process identifier, you will be directed to a page with more detailed information about that specific process.

    hashtag
    Log (main page)

    All log messages from the node, processes, network stack, or meta-processes are displayed here. When you connect to the Observer via a browser, the Observer's backend sends a request to the inspector to start a log process with specified logging levels (this log process is visible on the main Info tab).

    When you change the set of logging levels, the Observer's backend requests the start of a new log process (the old log process will automatically terminate).

    To reduce the load on the browser, the number of displayed log messages is limited, but you can adjust this by setting the desired number in the Last field.

    The Play/Pause button allows you to stop or resume the log process, which is useful if you want to halt the flow of log messages and focus on examining the already received logs in more detail.

    hashtag
    Process information

    This page displays detailed information about the process, including its state, uptime, and other key metrics.

    The fallback parameters specify which process will receive redirected messages in case the current process's mailbox becomes full. However, if the Mailbox Size is unlimited, these fallback parameters are ignored.

    The Message Priority field shows the priority level used for messages sent by this process.

    Keep Network Order is a parameter applied only to messages sent over the network. It ensures that all messages sent by this process to a remote process are delivered in the same order as they were sent. This parameter is enabled by default, but it can be disabled in certain cases to improve performance.

    The Important Delivery setting indicates whether the important flag is enabled for messages sent to remote nodes. Enabling this option forces the remote node to send an acknowledgment confirming that the message was successfully delivered to the recipient's mailbox.

    The Compression parameters allow you to enable message compression for network transmissions and define the compression settings.

    Graphs on this page help you assess the load on the process, displaying data over the last 60 seconds.

    Additionally, you can find detailed information about any aliases, links, and monitors created by this process, as well as any registered events and started meta-processes.

    The list of environment variables is displayed only if the ExposeEnvInfo option was enabled in the node's gen.NodeOptions.Security settings.

    Additionally, on this page, you can send a message to the process, send an exit signal, or even forcibly stop the process using the kill command. These options are available in the context menu.

    hashtag
    Inspect (process page)

    If the behavior of this process implements the HandleInspect method, the response from the process to the inspect request will be displayed here. The Observer sends these requests once per second while you are on this tab.

    In the example screenshot above, you can see the inspection of a process based on act.Pool. Upon receiving the inspect request, it returns information about the pool of processes and metrics such as the number of messages processed.

    hashtag
    Log (process page)

    The Log tab on the process information page displays a list of log messages generated by the specific process.

    Please note that since the Observer uses a single stream for logging, any changes to the logging levels will also affect the content of the Log tab on the main page.

    hashtag
    Meta-process information

    On this page, you'll find detailed information about the meta-process, along with graphs showing data for the last 60 seconds related to incoming/outgoing messages and the number of messages in its mailbox. The meta-process has only two message queues: main and system.

    You can also send a message to the meta-process or issue an exit signal. However, it is not possible to forcibly stop the meta-process using the kill command.

    hashtag
    Inspect (meta-process page)

    If the meta-process's behavior implements the HandleInspect method, the response from the meta-process to the inspect request will be displayed on this tab. The Observer sends this request once per second while you are on the tab.

    hashtag
    Log (meta-process page)

    On the Log tab of the meta-process, you will see log messages generated by that specific meta-process. Changing the logging levels will also affect the content of the Log tab on the main page.

    Logging

    Logging system and logger implementations

    Understanding what happens inside a running system requires logging. But logging in distributed actor systems isn't straightforward. Messages pass between dozens of processes. Processes spawn dynamically, handle requests, and terminate. Network connections form and break. Following a single request's path through the system means tracking its journey across multiple processes, possibly across multiple nodes.

    Traditional logging compounds the problem. Each component writes to its own log. Process logs go to one file, network logs to another, node events to a third. When something goes wrong, you're piecing together a timeline from scattered sources, correlating by timestamp and hoping you've found all the relevant entries. It's detective work when you need diagnostic clarity.

    Ergo Framework centralizes the logging flow while keeping distribution flexible. Every log call - whether from a process, meta process, or the node itself - flows through a single logging system. That system distributes messages to registered loggers based on configurable filters. One logger might write everything to the console. Another might write only errors to a file. A third might send metrics to a monitoring system. The architecture is simple: centralized input, filtered distribution to multiple outputs.

    hashtag
    How Messages Flow

    When code calls process.Log().Info("message"), the framework creates a gen.MessageLog structure. This contains the timestamp, severity level, source identifier, message format and arguments, and any attached structured fields. The message enters the node's logging subsystem.

    The subsystem maintains loggers organized by severity level. Each logger, when registered, declares which levels it handles - perhaps just errors and panics, perhaps everything from debug upward. When a log message arrives, the subsystem looks up which loggers are registered for that message's level and calls their Log methods.

    This is fan-out distribution. A single info-level message goes to every logger registered for info level. The default logger writes it to stdout. A file logger appends it to a file. A metrics logger counts it. Each logger receives the same message and processes it independently.

    Hidden Loggers (introduced in v3.2.0) - Prefix a logger name with "." to create a hidden logger that's excluded from fan-out. Hidden loggers only receive logs from processes that explicitly call SetLogger(name). This creates truly isolated logging streams - bidirectional isolation. For example, register ".debug" as a hidden logger, then have a specific process use SetLogger(".debug"). That process's logs go only to the hidden logger (not to other loggers), and the hidden logger receives logs only from that process (not from fan-out). This is useful for separating verbose debugging output or creating per-process log files without mixing logs from other processes.

    You can also use SetLogger("filename") to send a process's logs to a specific logger. The process's logs go only to that logger, but the logger still receives fan-out logs from other processes. This routes verbose process logs to a dedicated destination but doesn't create isolation - the logger sees both the process's logs and system-wide fan-out.

    hashtag
    Severity Levels

    The framework provides six severity levels, ordered from most to least verbose:

    gen.LogLevelTrace - Framework internals, message routing, network packets. Extremely verbose, intended only for deep debugging of the framework itself.

    gen.LogLevelDebug - Application debugging information. Useful during development but typically disabled in production.

    gen.LogLevelInfo - Normal informational messages. This is the default level. Startup events, request handling, normal operations.

    gen.LogLevelWarning - Conditions that merit attention but don't prevent operation. Deprecated API usage, approaching resource limits, retry scenarios.

    gen.LogLevelError - Errors that prevent specific operations but don't crash the system. Failed requests, unavailable resources, validation failures.

    gen.LogLevelPanic - Critical errors requiring immediate attention. Despite the name, logging at this level doesn't trigger a panic - it's just the highest severity marker.

    Setting a level creates a threshold. Set a process to gen.LogLevelWarning and it logs warnings, errors, and panics, but suppresses info, debug, and trace. Each level implicitly includes all higher severity levels.
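The threshold rule can be sketched in plain Go. The numeric ordering below is an illustrative assumption for the sketch, not the framework's actual gen.LogLevel* constant values:

```go
package main

import "fmt"

// Illustrative severity ordering, most verbose first.
// The real gen.LogLevel* constants may use different values.
type LogLevel int

const (
	LevelTrace LogLevel = iota
	LevelDebug
	LevelInfo
	LevelWarning
	LevelError
	LevelPanic
)

// allowed reports whether a message at level msg passes a
// threshold: the threshold includes itself and every more
// severe level.
func allowed(threshold, msg LogLevel) bool {
	return msg >= threshold
}

func main() {
	fmt.Println(allowed(LevelWarning, LevelInfo))  // false: info is suppressed
	fmt.Println(allowed(LevelWarning, LevelError)) // true: errors pass
}
```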

    Two special levels control behavior rather than representing severity:

    gen.LogLevelDefault - Sentinel meaning "inherit." Nodes with this level become gen.LogLevelInfo. Processes with this level inherit from their parent, leader, or node. This default-then-inherit pattern allows hierarchical log level configuration.

    gen.LogLevelDisabled - Stops all logging from the source. The framework doesn't even create log messages. Use this to completely silence a source without removing loggers.

    Trace deserves special mention. It's so verbose that enabling it accidentally could flood storage. You can't enable it dynamically via SetLevel. It must be set at startup through gen.NodeOptions.Log.Level or gen.ProcessOptions.LogLevel. This restriction prevents operational mistakes.

    The node starts at gen.LogLevelInfo. Processes inherit this unless their spawn options specify otherwise. After startup, you can adjust a process's level dynamically with SetLevel, allowing surgical verbosity changes during debugging.

    hashtag
    Identifying Log Sources

    The logging subsystem differentiates between four source types: node, process, meta process, and network. Each carries its source information in a typed structure - gen.MessageLogNode, gen.MessageLogProcess, gen.MessageLogMeta, or gen.MessageLogNetwork. This typing allows custom loggers to handle different sources differently, perhaps routing network logs to one destination and process logs to another.

    The default logger formats each source type distinctly in its output:

    Node logs show the node name as a CRC32 hash:

    Process logs show the full PID:

    With IncludeName enabled, the registered name appears:

    With both IncludeName and IncludeBehavior enabled, the actor type appears:

    Meta process logs show the alias:

    Network logs show local and remote node hashes:

    These visual distinctions make scanning logs easier. At a glance, you can distinguish node events from process activity, meta process operations from network communications. The format itself tells you what layer of the system generated each message.

    hashtag
    Adding Context with Fields

    Beyond the message text, you can attach structured fields - key-value pairs providing context. Fields enable correlation across log entries and make logs machine-parseable.

    Consider a request handler. It receives a request with an ID. Every log entry related to that request should include the ID, allowing you to filter logs to just that request's activity:
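A sketch of how that might look inside an actor callback. The AddFields/DeleteFields call shapes are assumptions based on the descriptions in this chapter, and Request is a hypothetical message type:

```go
func (h *Handler) HandleMessage(from gen.PID, message any) error {
	req := message.(Request)

	// Attach the request ID once; every log entry that follows
	// carries it until it's deleted. (The exact AddFields
	// signature is an assumption.)
	h.Log().AddFields("request_id", req.ID)

	h.Log().Info("handling request")
	// ... process the request ...
	h.Log().Info("request complete")

	// Remove it so the next request starts clean.
	h.Log().DeleteFields("request_id")
	return nil
}
```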

    With IncludeFields enabled in the logger configuration, output shows:

    Fields appear on a separate line below the message, prefixed with "fields" and aligned with the timestamp. Multiple fields are space-separated, each formatted as key:value. In JSON output, fields become separate JSON properties at the message's top level.

    Fields only appear in output if the logger is configured to include them. The default logger requires gen.NodeOptions.Log.DefaultLogger.IncludeFields = true. Without this, fields are tracked internally but not displayed - useful if some loggers need fields while others don't.

    Fields accumulate. Call AddFields multiple times and you add more fields rather than replacing existing ones. This supports incremental context building. Add session_id when the session starts. Add transaction_id when beginning a transaction. Add payment_id when processing payment. Each subsequent log includes all accumulated fields.

    Remove fields with DeleteFields:

    This clears the named fields from subsequent logs.

    hashtag
    Field Scoping

    Field scoping handles nested contexts where you need temporary fields that shouldn't persist beyond a specific operation.

    PushFields saves the current field set and starts a new scope. Add temporary fields, perform the operation (with those fields appearing in logs), then PopFields to restore the previous field set:

    Output shows:

    The operation field exists only within the push/pop scope. After popping, logs include only session_id.

    Scopes can nest. Each PushFields returns the stack depth. Each PopFields returns the new depth. This supports complex nested contexts - a request containing a transaction containing multiple operations, each adding its own contextual fields that disappear when the operation completes.

    One restriction protects consistency: you can't delete fields while the field stack has active frames. If you've pushed fields, pop back to the base level before deleting. This prevents deleting a field that a pending pop might restore, which would leave the field state inconsistent.
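The scoping rules can be modeled with an ordinary stack. This plain-Go sketch mirrors the described semantics (save on push, restore on pop, no deletes while frames are active); it is not the framework's implementation:

```go
package main

import "fmt"

// fieldStack models PushFields/PopFields semantics.
type fieldStack struct {
	current map[string]string
	saved   []map[string]string
}

func newFieldStack() *fieldStack {
	return &fieldStack{current: map[string]string{}}
}

func (s *fieldStack) Add(k, v string) { s.current[k] = v }

// Push saves the current field set and starts a new scope;
// returns the stack depth.
func (s *fieldStack) Push() int {
	snapshot := map[string]string{}
	for k, v := range s.current {
		snapshot[k] = v
	}
	s.saved = append(s.saved, snapshot)
	return len(s.saved)
}

// Pop restores the previous field set; returns the new depth.
func (s *fieldStack) Pop() int {
	n := len(s.saved)
	s.current = s.saved[n-1]
	s.saved = s.saved[:n-1]
	return len(s.saved)
}

// Delete is refused while frames are active, mirroring the
// framework's consistency restriction.
func (s *fieldStack) Delete(k string) error {
	if len(s.saved) > 0 {
		return fmt.Errorf("pop active frames before deleting")
	}
	delete(s.current, k)
	return nil
}

func main() {
	s := newFieldStack()
	s.Add("session_id", "abc")
	s.Push()
	s.Add("operation", "checkout") // temporary, scoped field
	fmt.Println(len(s.current))    // 2: session_id + operation
	s.Pop()
	fmt.Println(len(s.current))    // 1: only session_id remains
}
```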

    hashtag
    The Default Logger

    Every node starts with a default logger writing to os.Stdout. Configure it through gen.NodeOptions.Log.DefaultLogger:

    TimeFormat controls timestamp display. Empty means nanoseconds since epoch. Any Go time format works - time.DateTime, time.RFC3339, or custom formats.

    IncludeBehavior adds actor type names to process logs, showing which implementation generated each message.

    IncludeName adds registered process names to process logs, making output more readable than PIDs alone.

    IncludeFields controls whether structured fields appear in output.

    EnableJSON switches to JSON format, with each message as a single-line JSON object.

    To disable the default logger entirely, set Disable: true. Do this when using only custom loggers.
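Pulling those options together, a configuration might look like this. The field names are the ones described above; treat the overall shape as a sketch against gen.NodeOptions:

```go
var options gen.NodeOptions

// Human-readable timestamps instead of epoch nanoseconds.
options.Log.DefaultLogger.TimeFormat = time.DateTime

// Show registered names and actor types alongside PIDs.
options.Log.DefaultLogger.IncludeName = true
options.Log.DefaultLogger.IncludeBehavior = true

// Display structured fields attached via AddFields.
options.Log.DefaultLogger.IncludeFields = true

// Or emit each message as a single-line JSON object:
// options.Log.DefaultLogger.EnableJSON = true

// When only custom loggers should run:
// options.Log.DefaultLogger.Disable = true
```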

    hashtag
    Adding Custom Loggers

    Custom loggers implement gen.LoggerBehavior:

    The Log method receives each message. The Terminate method handles cleanup when the logger is removed or the node shuts down.
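The interface is small. The exact signatures below are an assumption based on the descriptions in this chapter; see the framework sources for the authoritative definition:

```go
// Sketch of gen.LoggerBehavior.
type LoggerBehavior interface {
	// Log receives every message at the levels this logger
	// registered for.
	Log(message gen.MessageLog)

	// Terminate is called when the logger is removed or the
	// node shuts down.
	Terminate()
}
```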

    Register a logger with node.LoggerAdd:

    The filter (final arguments) specifies which levels this logger handles. The logger receives only messages at those levels. Omit the filter to use gen.DefaultLogFilter, which includes all levels from Trace through Panic.

    Loggers are stored per-level internally. Registering for Error and Panic stores the logger in both level maps. When an error occurs, the framework looks up the Error map and delivers the message to all loggers in that map.

    Logger names must be unique. Reusing a name returns gen.ErrTaken. Remove a logger with LoggerDelete before adding a new one with the same name.

    The Log method is called synchronously. If it blocks, it delays the logging path. For expensive operations - compressing logs, sending over network, database writes - make Log queue the work and return immediately, processing asynchronously.
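One way to keep Log fast is to hand the message to a buffered channel and do the expensive work on a separate goroutine. A self-contained sketch of that pattern (the LogEntry type here is a stand-in for gen.MessageLog):

```go
package main

import (
	"fmt"
	"sync"
)

// LogEntry stands in for gen.MessageLog.
type LogEntry struct{ Text string }

// asyncLogger queues entries and writes them in the background,
// so the caller of Log never blocks on the slow sink.
type asyncLogger struct {
	queue chan LogEntry
	done  sync.WaitGroup
}

func newAsyncLogger(buffer int, sink func(LogEntry)) *asyncLogger {
	l := &asyncLogger{queue: make(chan LogEntry, buffer)}
	l.done.Add(1)
	go func() {
		defer l.done.Done()
		for entry := range l.queue {
			sink(entry) // expensive work happens here
		}
	}()
	return l
}

// Log enqueues and returns immediately. When the queue is full
// it drops the entry (one of several possible overflow policies).
func (l *asyncLogger) Log(entry LogEntry) {
	select {
	case l.queue <- entry:
	default: // saturated: dropping beats blocking the log path
	}
}

// Terminate flushes the queue and stops the worker.
func (l *asyncLogger) Terminate() {
	close(l.queue)
	l.done.Wait()
}

func main() {
	var count int
	l := newAsyncLogger(16, func(e LogEntry) { count++ })
	for i := 0; i < 5; i++ {
		l.Log(LogEntry{Text: fmt.Sprintf("msg %d", i)})
	}
	l.Terminate()
	fmt.Println(count) // 5
}
```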

    hashtag
    Process-Based Loggers

    A process can act as a logger, receiving log messages through its mailbox. This integrates logging with the actor model.

    Implement the HandleLog callback in your actor:

    Register the process as a logger:

    Process-based logging queues messages asynchronously. The Log call places the message in the process's Log mailbox and triggers the process. The process handles log messages through HandleLog, processing them sequentially. The code that generated the log continues immediately without waiting.

    This queuing prevents blocking. If the logger process is busy or the logging logic is expensive, messages queue and are processed when ready. The logging path stays fast.

    One detail matters: when a logger process terminates, it's automatically removed from the logging system. No need to call LoggerDeletePID explicitly.

    hashtag
    Using Multiple Loggers

    The fan-out architecture supports multiple loggers operating simultaneously with different purposes.

    A typical production configuration disables the default logger and adds specialized loggers:

    The colored logger handles debug through panic for console display during development. The rotate logger receives everything and writes to rotating files. Trace messages don't appear anywhere because no logger is registered for trace level.

    Loggers can be added and removed dynamically. Start with console logging during development. Add file logging in staging. In production, remove console, keep files, add metrics forwarding. The system adapts without code changes.

    hashtag
    Controlling Verbosity

    Different processes often need different verbosity. Most processes log at Info. Increase a troublesome process to Debug temporarily. Keep infrastructure processes at Warning to reduce noise.

    For processes generating high-volume logs, route them to a dedicated logger using a hidden logger. A trading engine logging every order would overwhelm general logs:

    This creates isolation - the trading process logs only to .trading, and .trading receives only trading process logs. Other processes and loggers are unaffected. Without the hidden logger (using a regular logger name), the logger would also receive fan-out logs from all other processes.
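A sketch of that setup, assuming a process-based logger and the registration calls named in this chapter (the orders-logger process and the LoggerAddPID signature are assumptions; the framework exposes LoggerDeletePID, so a PID-based registration counterpart is implied):

```go
// Spawn a process that implements HandleLog, then register it
// as a hidden logger: the leading "." excludes it from fan-out.
ordersLogger, err := node.Spawn(createOrdersLogger, gen.ProcessOptions{})
if err != nil {
	panic(err)
}
if err := node.LoggerAddPID(ordersLogger, ".trading"); err != nil {
	panic(err)
}

// Inside the trading engine process: route its logs only to
// the hidden logger.
process.Log().SetLogger(".trading")
```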

    Process-based loggers enable sophisticated handling. A logger process can aggregate metrics - count errors per minute, track which processes log most frequently. It can detect patterns - the same error repeating indicates a stuck condition. It can forward to external systems - send errors to Slack, metrics to Prometheus. As an actor, it maintains state, can be supervised for reliability, and integrates naturally with the rest of your system.

    hashtag
    Logger Implementations

    The framework provides two logger implementations in separate packages for common needs:

    Colored (ergo.services/logger/colored) - Terminal output with ANSI colors. Highlights Ergo types (PIDs, Atoms, Refs) and colorizes log levels (yellow for warnings, red for errors, etc.). Visual clarity for development, but has performance overhead. Not suitable for high-volume production logging.

    Rotate (ergo.services/logger/rotate) - File logging with automatic rotation. Supports size-based and time-based rotation. Compresses old logs with gzip. Configurable retention policies. Production-ready for long-running systems generating substantial logs.

    Both integrate with the logging system through node.LoggerAdd. You can combine them - colored for console during development, rotate for persistent storage, both receiving the same filtered log stream.

    For implementation details and configuration options, see the Colored and Rotate pages in the extra library documentation.

    Network Stack

    Understanding the network stack for distributed communication

    The network stack makes remote messaging work like local messaging. When you send to a process on another node, the framework discovers where that node is, establishes a connection if needed, encodes the message, sends it over TCP, and delivers it to the recipient's mailbox. From your perspective, it's just Send(pid, message) - whether the PID is local or remote.

    This transparency requires three systems working together: service discovery to find nodes, connection management to establish reliable links, and message encoding to serialize data for transmission. Each system handles a specific problem, and together they create the illusion that remote communication is just local communication.

    hashtag
    The Big Picture

    When you send a message to a remote process:

    1. Routing decision - The framework examines the node portion of the PID. Local node? Direct mailbox delivery. Remote node? Continue to step 2.

    2. Connection lookup - Check if a connection to that node already exists. If yes, use it. If no, continue to step 3.

    3. Discovery - Query the registrar (or check static routes) to find where the remote node is listening: hostname, port, TLS requirements, protocol versions.

    4. Connection establishment - Dial the remote node, run the authentication handshake, and build the connection pool.

    5. Encoding and transmission - Serialize the message with EDF, wrap it in a protocol frame, and send it over one of the pooled TCP connections for delivery to the recipient's mailbox.

    This entire pipeline is invisible to your code. You call Send, and the framework does the rest.

    hashtag
    Service Discovery

    Before connecting to a remote node, the framework needs to know where that node is. Service discovery translates logical node names (such as mynode@localhost) into connection parameters (IP, port, TLS, protocol versions).

    The embedded registrar provides basic discovery:

    • One node per host runs a registrar server (whoever started first)

    • Other nodes connect as clients

    • Same-host discovery is direct (no network)

    For production clusters, external registrars provide more features:

    • etcd - Centralized discovery, application routing, configuration storage, HTTP polling for registration

    • Saturn - Purpose-built for Ergo, immediate event propagation, efficient at scale

    The embedded registrar works for development and small deployments. For larger clusters or dynamic topologies, use etcd or Saturn. The choice is transparent to your code - you specify the registrar at node startup, and everything else works identically.

    For details, see the Service Discovery chapter.

    hashtag
    Static Routes

    Discovery is dynamic - nodes register themselves, and others query to find them. But sometimes you want explicit control. Maybe nodes have fixed addresses. Maybe you're behind a firewall that blocks discovery. Maybe you're connecting to external systems.

    Static routes let you hardcode connection parameters:

    Now when connecting to that node, the framework uses your route directly. No discovery query. No registrar involvement. You've taken control.

    Static routes support pattern matching ("prod-.*"), multiple routes with failover weights, and hybrid approaches (use patterns for selection, resolvers for address lookup). You can configure per-route cookies, certificates, network flags, and atom mappings.

    The framework checks static routes first, always. If a static route exists, discovery is bypassed. If static routes fail or don't exist, the framework falls back to discovery.

    For details, see the Static Routes chapter.

    hashtag
    Connection Establishment

    Once the framework knows where to connect (from discovery or static routes), it establishes a connection pool.

    hashtag
    Handshake

    The handshake performs mutual authentication using challenge-response. Node A connects to node B:

    1. A sends hello with random salt and digest (computed from salt + cookie)

    2. B verifies digest - if cookies match, digest is correct

    3. B sends its own challenge

    4. A answers B's challenge with the matching digest, completing mutual authentication

    If TLS is enabled, certificate fingerprints are exchanged and verified too.

    After authentication, nodes exchange introduction messages:

    • Node names and version information

    • Network flags (capabilities: remote spawn? important delivery? fragmentation?)

    • Caching dictionaries (atoms, types, errors that will be used frequently)

    The flags negotiation ensures nodes with different feature sets can work together. Features not supported by both sides are disabled for that connection.

    The caching dictionaries enable efficiency. Instead of encoding "mynode@localhost" repeatedly (19 bytes), it gets a cache ID and subsequent uses encode as 2 bytes.

    hashtag
    Connection Pool

    After handshake, the accepting node tells the dialing node to create a connection pool:

    • Pool size (default 3 TCP connections)

    • Acceptor addresses to connect to

    The dialing node opens additional TCP connections using a shortened join handshake (skips full authentication since the first connection already authenticated). These connections join the pool, forming a single logical connection with multiple physical TCP links.

    Multiple connections enable parallel message delivery. Each message goes to a connection based on the sender's identity (derived from sender PID). Messages from the same sender always use the same connection, preserving order. Messages from different senders use different connections, enabling parallelism.

    The receiving side creates 4 receive queues per TCP connection. A 3-connection pool has 12 receive queues processing messages concurrently. This parallel processing improves throughput while preserving per-sender message ordering.

    hashtag
    Message Encoding and Transmission

    Once a connection exists, messages flow through encoding and framing.

    hashtag
    EDF (Ergo Data Format)

    EDF is a binary encoding specifically designed for the framework's communication patterns. It's type-aware - each value is prefixed with a type tag (e.g., 0x95 for int64, 0xaa for PID, 0x9d for slice). The decoder reads the tag and knows what follows.

    Framework types like gen.PID and gen.Ref have optimized encodings. Structs are encoded field-by-field in declaration order (no field names on the wire). Custom types must be registered on both sides - registration happens during init(), and during handshake nodes exchange their type lists to agree on encoding.

    Compression is automatic. If a message exceeds the compression threshold (default 1024 bytes), it's compressed using GZIP, ZLIB, or LZW. The protocol frame indicates compression, so the receiver decompresses before decoding.
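The threshold decision is easy to picture with the standard library. This sketch uses GZIP only and hardcodes the 1024-byte default; the framework also supports ZLIB and LZW and makes the threshold configurable:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

const compressionThreshold = 1024 // default threshold

// maybeCompress returns the payload, gzip-compressed when it
// exceeds the threshold, plus a flag for the protocol frame so
// the receiver knows to decompress before decoding.
func maybeCompress(payload []byte) ([]byte, bool) {
	if len(payload) <= compressionThreshold {
		return payload, false
	}
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write(payload)
	w.Close()
	return buf.Bytes(), true
}

func main() {
	small := make([]byte, 100)
	large := make([]byte, 100_000) // zeros compress extremely well

	_, c1 := maybeCompress(small)
	out, c2 := maybeCompress(large)
	fmt.Println(c1, c2)                // false true
	fmt.Println(len(out) < len(large)) // true
}
```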

    For details on EDF - type tags, struct encoding, registration requirements, compression, caching - see .

    hashtag
    ENP (Ergo Network Protocol)

    ENP wraps encoded messages in frames for transmission. Each frame has an 8-byte header with magic byte, protocol version, frame length, order byte, and message type. The frame body contains sender/recipient identifiers and the EDF-encoded payload.

    The order byte preserves message ordering per sender. Messages from the same sender have the same order value and route to the same receive queue, guaranteeing sequential processing. Messages from different senders have different order values and route to different queues, enabling parallel processing.
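The routing rule, same sender to same queue, can be sketched as a simple modulo over the order byte. The hash used here is an illustrative choice, not the framework's; only the stable sender-to-queue mapping matters:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const queuesPerPool = 12 // e.g. 3 connections x 4 receive queues

// orderByte derives a stable value from the sender's identity.
// (FNV here is an illustrative stand-in.)
func orderByte(senderPID string) byte {
	h := fnv.New32a()
	h.Write([]byte(senderPID))
	return byte(h.Sum32())
}

// queueFor maps an order byte to a receive queue. Messages from
// the same sender always land in the same queue, preserving
// per-sender ordering; different senders spread across queues.
func queueFor(order byte) int {
	return int(order) % queuesPerPool
}

func main() {
	a := queueFor(orderByte("<8821.0.1017>"))
	b := queueFor(orderByte("<8821.0.1017>")) // same sender
	c := queueFor(orderByte("<8821.0.2042>")) // different sender
	fmt.Println(a == b) // true: ordering preserved per sender
	fmt.Println(a, c)
}
```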

    For details on protocol framing, order bytes, receive queue distribution, and the exact byte layout, see .

    hashtag
    Network Transparency in Practice

    Network transparency means remote operations look like local operations. You send to a PID without checking if it's local or remote. You establish links and monitors the same way regardless of location. The framework handles discovery, encoding, and transmission automatically.

    But transparency has limits:

    • Latency - Remote sends take milliseconds vs microseconds for local

    • Bandwidth - Network links have finite capacity, local operations don't

    • Failures - Networks fail in ways local memory doesn't (packets lost, connections drop, nodes unreachable)

    The framework makes distributed programming feel local, but you still need to design for network realities: use timeouts, handle connection failures, prefer async over sync, batch messages, keep payloads small.

    For deep understanding of how transparency works - EDF encoding, struct serialization, type registration, important delivery, failure semantics - see the Network Transparency chapter.

    hashtag
    Network Configuration

    Configure the network stack in gen.NodeOptions.Network:

    Mode - NetworkModeEnabled enables full networking with acceptors. NetworkModeHidden allows outgoing connections only (no acceptors). NetworkModeDisabled disables networking entirely.

    Cookie - Shared secret for authentication. All nodes must use the same cookie to communicate. Set explicitly for distributed deployments.

    MaxMessageSize - Maximum incoming message size. Protects against memory exhaustion. Default unlimited (fine for trusted clusters).

    Flags - Control capabilities. Remote nodes learn your flags during handshake and can only use features you've enabled. EnableRemoteSpawn allows spawning (with explicit permission per process). EnableImportantDelivery enables delivery confirmation.

    Acceptors - Define listeners for incoming connections. Multiple acceptors on different ports are supported. Each can have its own cookie, TLS, and protocol.
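As a sketch, a configuration using the options above might look like this. The field and constant names are the ones described in this chapter; the exact shape of the Flags struct is an assumption:

```go
var options gen.NodeOptions

// Full networking: accept incoming connections and dial out.
options.Network.Mode = gen.NetworkModeEnabled

// Shared secret; every node in the cluster must use the same value.
options.Network.Cookie = "shared-secret"

// Cap incoming message size to protect against memory exhaustion.
options.Network.MaxMessageSize = 32 * 1024 * 1024

// Capabilities advertised to remote nodes during handshake.
options.Network.Flags.EnableRemoteSpawn = true
options.Network.Flags.EnableImportantDelivery = true
```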

    hashtag
    Custom Network Stacks

    The framework provides three extension points:

    gen.NetworkHandshake - Control connection establishment and authentication. Implement this to change how nodes authenticate or how connection pools are created.

    gen.NetworkProto - Control message encoding and transmission. The Erlang distribution protocol is implemented as a custom proto, allowing Ergo nodes to join Erlang clusters.

    gen.Connection - The actual connection handling. Implement this for custom framing, routing, or error handling.

    You can register multiple handshakes and protos, allowing one node to support multiple protocol stacks simultaneously:

    This enables migration scenarios (gradually migrate from Erlang to Ergo) and integration scenarios (connect to systems using different protocols).

    hashtag
    Remote Operations

    Once connections exist, you can spawn processes and start applications on remote nodes:

    Remote spawning requires the remote node to explicitly enable it:

    Without explicit permission, remote spawn requests fail. This prevents arbitrary code execution.

    The same pattern applies to starting applications:

    Requires:

    This security model ensures you control exactly what remote nodes can do on your node.

    hashtag
    Where to Go Next

    This chapter provided an overview of how the network stack operates. For deeper understanding:

    • Service Discovery - How nodes find each other, application routing, configuration management, embedded vs external registrars

    • Network Transparency - How messages are encoded, EDF details, protocol framing, compression, caching, important delivery

    • Static Routes - Explicit routing configuration, pattern matching, failover, proxy routes

    Each of these chapters dives deep into its specific topic, giving you the details needed for production deployments.

    etcd Client

    This package implements the gen.Registrar interface and serves as a client library for etcd, a distributed key-value store that provides a reliable way to store data that needs to be accessed by a distributed system or cluster of machines. In addition to the primary Service Discovery function, it automatically notifies all connected nodes about cluster configuration changes and supports hierarchical configuration management with type conversion.

    To create a client, use the Create function from the etcd package. The function requires a set of options etcd.Options to configure the connection and behavior.

    Then, set this client in the gen.NetworkOption.Registrar options:

    Using etcd.Options, you can specify:

    • Cluster - The cluster name for your node (default: "default")

    • Endpoints - List of etcd endpoints (default: ["localhost:2379"])

    • Username - Username for etcd authentication (optional)

    When the node starts, it will register with the etcd cluster and maintain a lease to ensure automatic cleanup if the node becomes unavailable.
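Putting the pieces together, a sketch of creating the client and handing it to the node (the option fields are the ones listed above; Create's exact signature is an assumption):

```go
registrar, err := etcd.Create(etcd.Options{
	Cluster:   "default",
	Endpoints: []string{"localhost:2379"},
	// Username: "ergo", // optional etcd authentication
})
if err != nil {
	panic(err)
}

var options gen.NodeOptions
options.Network.Registrar = registrar
```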

    hashtag
    Configuration Management

    The etcd registrar provides hierarchical configuration management with four priority levels:

    1. Cross-cluster node-specific: services/ergo/config/{cluster}/{node}/{item}

    2. Cluster node-specific: services/ergo/cluster/{cluster}/config/{node}/{item}

    3. Cluster-wide default: services/ergo/cluster/{cluster}/config/*/{item}

    hashtag
    Typed Configuration

    The etcd registrar supports typed configuration values using string prefixes. Configuration values are stored as strings in etcd and automatically converted to the appropriate Go types when read by the registrar:

    • "int:123" → int64(123)

    • "float:3.14" → float64(3.14)

    Important: All configuration values must be stored as strings in etcd. The type conversion happens automatically when the registrar reads the configuration.
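The prefix convention itself can be demonstrated with a few lines of plain Go (a sketch of the convention, not the registrar's code; unprefixed values stay strings):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// convert applies the prefix convention to a raw etcd string value.
func convert(raw string) any {
	switch {
	case strings.HasPrefix(raw, "int:"):
		if v, err := strconv.ParseInt(raw[len("int:"):], 10, 64); err == nil {
			return v
		}
	case strings.HasPrefix(raw, "float:"):
		if v, err := strconv.ParseFloat(raw[len("float:"):], 64); err == nil {
			return v
		}
	}
	return raw // no recognized prefix: keep as string
}

func main() {
	fmt.Printf("%T %v\n", convert("int:123"), convert("int:123"))       // int64 123
	fmt.Printf("%T %v\n", convert("float:3.14"), convert("float:3.14")) // float64 3.14
	fmt.Printf("%T %v\n", convert("plain"), convert("plain"))           // string plain
}
```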

    Example configuration setup using etcdctl:

    Access configuration in your application:

    hashtag
    Event System

    The etcd registrar registers a gen.Event and generates messages based on changes in the etcd cluster within the specified cluster. This allows the node to stay informed of any updates or changes within the cluster, ensuring real-time event-driven communication and responsiveness to cluster configurations:

    • etcd.EventNodeJoined - Triggered when another node is registered in the same cluster

    • etcd.EventNodeLeft - Triggered when a node disconnects or its lease expires

    • etcd.EventApplicationLoaded - An application was loaded on a remote node

    To receive such messages, you need to subscribe to etcd client events using the LinkEvent or MonitorEvent methods from the gen.Process interface. You can obtain the name of the registered event using the Event method from the gen.Registrar interface:

    hashtag
    Application Discovery

    To get information about available applications in the cluster, use the ResolveApplication method from the gen.Resolver interface, which returns a list of gen.ApplicationRoute structures:

    • Name - The name of the application

    • Node - The name of the node where the application is loaded or running

    • Weight - The weight assigned to the application in gen.ApplicationSpec

    You can access the gen.Resolver interface using the Resolver method from the gen.Registrar interface:

    hashtag
    Node Discovery

    Get a list of all nodes in the cluster:

    hashtag
    Data Storage Structure

    The etcd registrar organizes data in etcd using the following key structure:

    Important Architecture Notes:

    • Routes (nodes/applications) use edf.Encode + base64 encoding and are stored in the routes/ subpath. These entries are managed by the registrar - don't modify them manually.

    • Configuration uses string encoding with type prefixes and is stored in the config/ subpath

    hashtag
    Example

    A fully featured example can be found in the docker directory.

    This example demonstrates how to run multiple Ergo nodes using etcd as a registrar for service discovery. It showcases service discovery, actor communication, typed configuration management, and real-time configuration event monitoring across a cluster.

    hashtag
    Development and Testing

    The etcd registrar includes comprehensive testing infrastructure:

    hashtag
    Docker Testing Setup

    Use the included Docker Compose setup for testing:

    hashtag
    Manual etcd Operations

    For debugging and manual operations:

    The etcd registrar provides a robust, scalable solution for service discovery and configuration management in distributed Ergo applications, with the reliability and consistency guarantees of etcd.

    Pool

    A single actor processes messages sequentially. This is fundamental to the actor model - it eliminates race conditions and makes reasoning about state straightforward. But it also means one actor can become a bottleneck. If messages arrive faster than the actor can process them, the mailbox grows, latency increases, and eventually the system stalls.

    The standard solution is to run multiple workers. Instead of sending requests to one actor, distribute them across several identical actors processing in parallel. This works, but now you need routing logic: pick a worker, check if it's alive, handle mailbox overflow, restart dead workers. This boilerplate appears in every pool implementation.

    act.Pool solves this. It's an actor that manages a pool of worker actors and automatically distributes incoming messages and requests across them. You send to the pool's PID, the pool forwards to an available worker. The pool handles worker lifecycle, automatic restarts, and load balancing. From the sender's perspective, it's just one actor. Under the hood, it's N workers processing in parallel.

    hashtag
    Creating a Pool

    Like act.Actor provides callbacks for regular actors, act.Pool uses the act.PoolBehavior interface:

    The key difference from ActorBehavior: Init returns PoolOptions that define the pool configuration. All callbacks are optional except Init.

    Embed act.Pool in your struct and implement Init to configure workers:

    The pool spawns workers during initialization. Each worker is linked to the pool (via LinkParent: true). If a worker crashes, the pool receives an exit signal and can restart it.

    Workers are created using the WorkerFactory. This is the same factory pattern as regular Spawn - it returns a gen.ProcessBehavior instance. The workers can be act.Actor, act.Pool (nested pools), or custom behaviors.
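A minimal pool definition might look like this, assuming the act package API described here (the worker struct, sizes, and the exact Init signature are illustrative):

```go
// Pool actor: embeds act.Pool and configures workers in Init.
type requestPool struct {
	act.Pool
}

func (p *requestPool) Init(args ...any) (act.PoolOptions, error) {
	return act.PoolOptions{
		PoolSize:          5,            // number of workers
		WorkerMailboxSize: 10,           // per-worker buffer
		WorkerFactory:     createWorker, // same factory pattern as Spawn
	}, nil
}

// Workers are ordinary actors; they don't know they're pooled.
type worker struct {
	act.Actor
}

func createWorker() gen.ProcessBehavior { return &worker{} }
func createPool() gen.ProcessBehavior   { return &requestPool{} }
```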

    hashtag
    Rate Limiting Through Pool Configuration

    The combination of PoolSize and WorkerMailboxSize provides a natural rate limiting mechanism. The pool can buffer at most PoolSize × WorkerMailboxSize messages. If all workers are busy and their mailboxes are full, new messages are rejected:

    When a sender tries to send beyond this limit, they receive ErrProcessMailboxFull (if using important delivery) or the message is dropped with a log entry. This backpressure prevents the system from accepting more work than it can handle.

    For external APIs (HTTP, gRPC), this translates to returning "503 Service Unavailable" when the pool is saturated. The pool size controls maximum concurrency, and the mailbox size controls burst capacity. Tune both based on your worker processing speed and acceptable latency.
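The capacity math and the resulting backpressure can be sketched with buffered channels standing in for worker mailboxes. When every buffer is full, submit fails, which is the point where an HTTP front end would answer 503:

```go
package main

import "fmt"

const (
	poolSize          = 3
	workerMailboxSize = 4
	// Total buffered work the pool can accept:
	capacity = poolSize * workerMailboxSize // 12
)

// submit tries each worker's mailbox in turn; a saturated pool
// rejects the job, propagating backpressure to the caller.
func submit(mailboxes []chan string, job string) bool {
	for _, mb := range mailboxes {
		select {
		case mb <- job:
			return true
		default: // this worker is full, try the next
		}
	}
	return false // all mailboxes full: caller should back off (e.g. 503)
}

func main() {
	mailboxes := make([]chan string, poolSize)
	for i := range mailboxes {
		mailboxes[i] = make(chan string, workerMailboxSize)
	}

	accepted := 0
	for i := 0; i < capacity+5; i++ { // offer more than capacity
		if submit(mailboxes, fmt.Sprintf("job-%d", i)) {
			accepted++
		}
	}
	fmt.Println(accepted) // 12: exactly poolSize x workerMailboxSize
}
```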

    hashtag
    Automatic Message Distribution

    When you send a message or make a call to the pool, act.Pool automatically forwards it to an available worker:

    Forwarding happens for messages in the Main queue (normal priority). The pool maintains a FIFO queue of worker PIDs. When a message arrives:

    1. Pop a worker from the queue

    2. Forward the message using Forward (preserves original sender and ref)

    3. Check the result - on success, the worker returns to the back of the FIFO queue; if its mailbox is full, pop the next worker and retry

    If all workers have full mailboxes, the message is dropped and logged. The pool doesn't have its own buffer beyond the workers' mailboxes. This is intentional - backpressure should propagate to senders.

    The pool forwards Regular messages, Requests, and Events. Exit signals and Inspect requests are handled by the pool itself (they're not forwarded to workers).

    hashtag
    Workers and the Original Sender

    Workers receive the original sender's PID, not the pool's PID. When a worker processes a forwarded message, from points to whoever sent to the pool:

    The same applies to Call requests. Workers see the original caller's from and ref. When they return a result or call SendResponse, it goes directly to the original caller, bypassing the pool entirely.

    This is why forwarding is transparent. The worker doesn't know it's part of a pool. It processes messages as if they were sent directly to it.

    hashtag
    Intercepting Pool Messages

    Automatic forwarding applies only to the Main queue (normal priority). Urgent and System queues are handled by the pool itself through HandleMessage and HandleCall callbacks:

    The same applies to synchronous requests:

    Important: High-priority requests that return (nil, nil) from HandleCall are not forwarded to workers. They're simply ignored, and the caller times out. Forwarding only happens for Main queue messages. If you want a request to be handled, either:

    • Send it with normal priority (goes to workers)

    • Handle it explicitly in pool's HandleCall and return a result

    Use high priority only for pool management that should be handled by the pool itself, not for work that should go to workers.

    hashtag
    Dynamic Pool Management

    Adjust the pool size at runtime with AddWorkers and RemoveWorkers:

    AddWorkers spawns new workers with the same factory and options used during initialization. They're added to the FIFO queue and immediately available for work.

    RemoveWorkers takes workers from the queue and sends them gen.TerminateReasonNormal via SendExit. The workers terminate gracefully, finishing any in-progress work before shutting down.

    Both methods return the new pool size after the operation. They fail if called while the pool is not in the Running state.

    hashtag
    Worker Restarts

    Workers are linked to the pool with LinkParent: true. When a worker crashes, the pool receives an exit signal. The forward mechanism detects this (ErrProcessUnknown / ErrProcessTerminated), spawns a replacement with the same factory and arguments, and forwards the message to the new worker.

    This is automatic restart, not supervision. The pool doesn't track worker history or apply restart strategies. It just replaces dead workers immediately when detected during forwarding. If you need sophisticated restart strategies, use a Supervisor to manage the pool and its workers.

    hashtag
    Pool Statistics

    Pools expose internal metrics via Inspect:

    Use this for monitoring pool health. High messages_unhandled indicates workers are overwhelmed. High worker_restarts suggests worker stability issues.

    hashtag
    When to Use Pools

    Use a pool when:

    • One actor is a bottleneck (mailbox growing, latency increasing)

    • Work items are independent (no ordering dependencies)

    • Workers are stateless or can reconstruct state cheaply

    Don't use a pool when:

    • Work items depend on previous items (pools don't guarantee ordering)

    • Workers maintain critical state that can't be lost on restart

    • Concurrency isn't the bottleneck (single actor is fast enough)

    Pools are for horizontal scaling of stateless work. If workers need state coordination, use multiple independent actors with explicit routing instead.

    hashtag
    Patterns and Pitfalls

    Set WorkerMailboxSize to bound worker queues and propagate backpressure. Unbounded mailboxes let workers accumulate huge queues, hiding the overload until memory is exhausted. Bounded mailboxes cause forwarding to try the next worker, eventually pushing backpressure all the way to the sender.

    Exit signals are not forwarded. The pool handles Exit messages itself rather than passing them to workers. If you need to broadcast a shutdown to all workers, iterate over them manually and send to each worker PID.

    Monitor forwarding metrics. If messages_unhandled increases, your pool is undersized or workers are too slow. Scale up with AddWorkers or optimize worker processing.

    Use priority for pool management. Send management commands with MessagePriorityHigh to ensure they go to the pool, not forwarded to workers.

    Nested pools are possible but rarely useful. A pool of pools adds latency without much benefit. Prefer one pool with more workers over nested layers.

    type WebService struct {
        act.Actor
    }
    
    func (w *WebService) Init(args ...any) error {
        // Create WebSocket handler
        wsHandler := websocket.CreateHandler(websocket.HandlerOptions{
            ProcessPool:       []gen.Atom{"ws-handler"},
            HandshakeTimeout:  15 * time.Second,
            EnableCompression: true,
            CheckOrigin:       func(r *http.Request) bool { return true },
        })
    
        // Spawn handler meta-process
        _, err := w.SpawnMeta(wsHandler, gen.MetaOptions{})
        if err != nil {
            return err
        }
    
        // Register with HTTP mux
        mux := http.NewServeMux()
        mux.Handle("/ws", wsHandler)
    
        // Create web server
        server, err := meta.CreateWebServer(meta.WebServerOptions{
            Host:    "localhost",
            Port:    8080,
            Handler: mux,
        })
        if err != nil {
            return err
        }
    
        _, err = w.SpawnMeta(server, gen.MetaOptions{})
        return err
    }
    type MessageConnect struct {
        ID         gen.Alias  // Connection meta-process identifier
        RemoteAddr net.Addr   // Client address
        LocalAddr  net.Addr   // Server address
    }
    func (h *Handler) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case websocket.MessageConnect:
            h.connections[m.ID] = ConnectionInfo{
                RemoteAddr: m.RemoteAddr,
                ConnectedAt: time.Now(),
            }
            h.Log().Info("Client connected: %s from %s", m.ID, m.RemoteAddr)
        }
        return nil
    }
    type MessageDisconnect struct {
        ID gen.Alias  // Connection meta-process identifier
    }
    case websocket.MessageDisconnect:
        delete(h.connections, m.ID)
        h.Log().Info("Client disconnected: %s", m.ID)
    type Message struct {
        ID   gen.Alias    // Connection identifier
        Type MessageType  // Message type (text, binary, ping, pong, close)
        Body []byte       // Message payload
    }
    
    const (
        MessageTypeText   MessageType = 1
        MessageTypeBinary MessageType = 2
        MessageTypeClose  MessageType = 8
        MessageTypePing   MessageType = 9
        MessageTypePong   MessageType = 10
    )
    case websocket.Message:
        h.Log().Info("Received from %s: %s", m.ID, string(m.Body))
        // Process message, maybe reply
        h.SendAlias(m.ID, websocket.Message{Body: []byte("ack")})
    // Send to specific connection
    h.SendAlias(connID, websocket.Message{
        Type: websocket.MessageTypeText,
        Body: []byte("notification"),
    })
    
    // Broadcast to all connections
    for connID := range h.connections {
        h.SendAlias(connID, websocket.Message{
            Body: []byte("broadcast message"),
        })
    }
    // Actor on node1 sends to connection on node2
    actor.SendAlias(connectionAlias, websocket.Message{
        Body: []byte("update from backend"),
    })
    func (c *Client) Init(args ...any) error {
        conn, err := websocket.CreateConnection(websocket.ConnectionOptions{
            URL:               url.URL{Scheme: "ws", Host: "server:8080", Path: "/ws"},
            Process:           "message-handler",
            HandshakeTimeout:  15 * time.Second,
            EnableCompression: true,
        })
        if err != nil {
            return err
        }
    
        connID, err := c.SpawnMeta(conn, gen.MetaOptions{})
        if err != nil {
            conn.Terminate(err)
            return err
        }
    
        c.Log().Info("Connected to server: %s", connID)
        return nil
    }
    wsHandler := websocket.CreateHandler(websocket.HandlerOptions{
        ProcessPool: []gen.Atom{"handler1", "handler2", "handler3"},
    })
    // What gets registered (simplified)
    MessageRegisterRoutes{
        Node:   "[email protected]",
        Routes: []gen.Route{
            {
                Host:             "",      // empty by default
                Port:             15000,
                TLS:              false,
                HandshakeVersion: ...,
                ProtoVersion:     ...,
            },
        },
    }
    node, err := ergo.StartNode("[email protected]", gen.NodeOptions{
        Network: gen.NetworkOptions{
            Acceptors: []gen.AcceptorOptions{
                {
                    Host:      "0.0.0.0",        // listen on all interfaces
                    Port:      15000,            // listen on this port
                    RouteHost: "203.0.113.50",   // advertise this host
                    RoutePort: 32000,            // advertise this port
                },
            },
        },
    })
    // What gets registered
    Routes: []gen.Route{
        {
            Host:             "203.0.113.50",  // from RouteHost
            Port:             32000,           // from RoutePort
            TLS:              false,
            HandshakeVersion: ...,
            ProtoVersion:     ...,
        },
    }
    Acceptors: []gen.AcceptorOptions{
        {
            Host:      "0.0.0.0",
            Port:      15000,
            RouteHost: "203.0.113.50",
            // RoutePort not set - uses actual port 15000
        },
    }
    Acceptors: []gen.AcceptorOptions{
        {
            Host:      "0.0.0.0",
            Port:      15000,
            RouteHost: "203.0.113.50",
            RoutePort: 32000,
        },
    }
    Acceptors: []gen.AcceptorOptions{
        {
            Host:      "0.0.0.0",
            Port:      15000,
            RouteHost: "worker.prod.example.com",
        },
    }
    Acceptors: []gen.AcceptorOptions{
        {
            Host:      "0.0.0.0",
            Port:      15000,              // container port
            RouteHost: os.Getenv("NODE_IP"), // Kubernetes node IP
            RoutePort: 32000,              // NodePort
        },
    }
    Acceptors: []gen.AcceptorOptions{
        {
            Host: "10.0.1.50",  // internal only, no RouteHost
            Port: 15000,
        },
        {
            Host:      "0.0.0.0",
            Port:      15001,
            RouteHost: "203.0.113.50",
            RoutePort: 32000,
        },
    }
    // On local nodes
    route := gen.NetworkRoute{
        Route: gen.Route{
            Host: "10.0.1.50",
            Port: 15000,
        },
    }
    network.AddRoute("[email protected]", route, 100)
    $ go install ergo.services/tools/observer@latest
    import (
         "ergo.services/ergo"
         "ergo.services/ergo/gen"
    
         "ergo.services/registrar/etcd"
    )
    
    func main() {
         var options gen.NodeOptions
         ...
         registrarOptions := etcd.Options{
             Endpoints: []string{"localhost:2379"},
             Cluster:   "production",
         }
         options.Network.Registrar = etcd.Create(registrarOptions)
         ...
         node, err := ergo.StartNode("demo@localhost", options)
         ...
    }

  • Connection establishment - Open TCP connections to the remote node, perform mutual authentication via handshake, negotiate capabilities, exchange caching dictionaries, create a connection pool.

  • Message transmission - Encode the message into bytes (EDF), optionally compress it, wrap it in a protocol frame (ENP), send it over one of the TCP connections in the pool.

  • Remote delivery - The receiving node reads the frame, decompresses if needed, decodes back to Go values, routes to the recipient's mailbox.

  • Cross-host discovery uses UDP queries
  • Automatic failover if the server node dies

  • A verifies B's response
  • Both sides authenticated

  • Partial failures - Some nodes work while others fail (local systems fail entirely or work entirely)

  • Password - Password for etcd authentication (optional)

  • TLS - TLS configuration for secure connections (optional)

  • InsecureSkipVerify - Option to ignore TLS certificate verification

  • DialTimeout - Connection timeout (default: 10s)

  • RequestTimeout - Request timeout (default: 10s)

  • KeepAlive - Keep-alive timeout (default: 10s)

  • Global default: services/ergo/config/global/{item}

  • "bool:true" → bool(true), "bool:false" → bool(false)
  • "hello" → "hello" (strings without prefixes remain unchanged)
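    Assuming these rules, the conversion can be sketched as a small helper. This is a hypothetical illustration of the prefix rules, not the registrar's actual code:

    ```go
    package main

    import (
        "fmt"
        "strconv"
        "strings"
    )

    // convertConfigValue models the registrar's type-prefix rules:
    // values are stored as strings and converted on read.
    func convertConfigValue(raw string) any {
        switch {
        case strings.HasPrefix(raw, "int:"):
            if v, err := strconv.ParseInt(raw[len("int:"):], 10, 64); err == nil {
                return v // int64
            }
        case strings.HasPrefix(raw, "float:"):
            if v, err := strconv.ParseFloat(raw[len("float:"):], 64); err == nil {
                return v // float64
            }
        case strings.HasPrefix(raw, "bool:"):
            if v, err := strconv.ParseBool(raw[len("bool:"):]); err == nil {
                return v // bool
            }
        }
        return raw // strings without a prefix remain unchanged
    }

    func main() {
        fmt.Println(convertConfigValue("int:5432"))   // int64(5432)
        fmt.Println(convertConfigValue("float:0.75")) // float64(0.75)
        fmt.Println(convertConfigValue("bool:true"))  // bool(true)
        fmt.Println(convertConfigValue("hello"))      // "hello"
    }
    ```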

  • etcd.EventApplicationStarted - Triggered when an application starts on a remote node

  • etcd.EventApplicationStopping - Triggered when an application begins stopping on a remote node

  • etcd.EventApplicationStopped - Triggered when an application is stopped on a remote node

  • etcd.EventConfigUpdate - The cluster configuration was updated

  • Mode - The application's startup mode (gen.ApplicationModeTemporary, gen.ApplicationModePermanent, gen.ApplicationModeTransient)

  • State - The current state of the application (gen.ApplicationStateLoaded, gen.ApplicationStateRunning, gen.ApplicationStateStopping)

  • GitHub - Ergo Services Examples

  • Success → push worker back to queue

  • ErrProcessUnknown / ErrProcessTerminated → spawn replacement, forward to it

  • ErrProcessMailboxFull → push worker back, try next worker

  • Repeat until successful or all workers tried

    2024-07-31 07:53:57 [info] 6EE4478D: node started successfully
    2024-07-31 07:53:57 [info] <6EE4478D.0.1017>: processing request
    2024-07-31 07:53:57 [info] <6EE4478D.0.1017> 'worker': processing request
    2024-07-31 07:53:57 [info] <6EE4478D.0.1017> 'worker' main.MyWorker: processing request
    2024-07-31 07:53:57 [info] Alias#<6EE4478D.123663.24065.0>: handling HTTP request
    2024-07-31 07:53:57 [info] 6EE4478D-90A29F11: connection established
    func (a *OrderProcessor) HandleMessage(from gen.PID, message any) error {
        order := message.(Order)
    
        a.Log().AddFields(
            gen.LogField{Name: "order_id", Value: order.ID},
            gen.LogField{Name: "customer_id", Value: order.CustomerID},
        )
    
        a.Log().Info("processing order")
        a.Log().Debug("validating payment")
    
        return nil
    }
    2024-11-12 15:30:45 [info] <6EE4478D.0.1017>: processing order
                       fields order_id:12345 customer_id:67890
    2024-11-12 15:30:45 [debug] <6EE4478D.0.1017>: validating payment
                        fields order_id:12345 customer_id:67890
    a.Log().DeleteFields("order_id", "customer_id")
    a.Log().AddFields(gen.LogField{Name: "session_id", Value: "abc123"})
    
    a.Log().PushFields()
    a.Log().AddFields(gen.LogField{Name: "operation", Value: "payment"})
    a.Log().Info("processing payment")
    
    a.Log().PopFields()
    a.Log().Info("payment complete")
    2024-11-12 15:30:45 [info] <6EE4478D.0.1017>: processing payment
                       fields session_id:abc123 operation:payment
    2024-11-12 15:30:45 [info] <6EE4478D.0.1017>: payment complete
                       fields session_id:abc123
    options.Log.DefaultLogger.TimeFormat = time.DateTime
    options.Log.DefaultLogger.IncludeBehavior = true
    options.Log.DefaultLogger.IncludeName = true
    options.Log.DefaultLogger.IncludeFields = true
    type LoggerBehavior interface {
        Log(message MessageLog)
        Terminate()
    }
    node.LoggerAdd("errors", errorLogger, gen.LogLevelError, gen.LogLevelPanic)
    type MyLogger struct {
        act.Actor
    }
    
    func (ml *MyLogger) HandleLog(message gen.MessageLog) error {
        switch m := message.Source.(type) {
        case gen.MessageLogNode:
            // Handle node log
        case gen.MessageLogProcess:
            // Process log - has PID, name, behavior
        case gen.MessageLogMeta:
            // Meta process log - has Alias
        case gen.MessageLogNetwork:
            // Network log - has local and remote nodes
        }
        return nil
    }
    pid, err := node.Spawn(createMyLogger, gen.ProcessOptions{})
    node.LoggerAddPID(pid, "mylogger", gen.LogLevelError, gen.LogLevelPanic)
    options.Log.DefaultLogger.Disable = true
    
    coloredLogger := colored.CreateLogger(colored.Options{})
    node.LoggerAdd("console", coloredLogger,
        gen.LogLevelDebug, gen.LogLevelInfo, gen.LogLevelWarning,
        gen.LogLevelError, gen.LogLevelPanic)
    
    rotateLogger := rotate.CreateLogger(rotate.Options{Path: "/var/log/myapp"})
    node.LoggerAdd("file", rotateLogger)  // No filter = all levels
    // Debugging a specific process
    node.SetLogLevelProcess(suspiciousPID, gen.LogLevelDebug)
    
    // Later, restore normal level
    node.SetLogLevelProcess(suspiciousPID, gen.LogLevelInfo)
    // Register hidden logger for trading
    tradingFileLogger := rotate.CreateLogger(rotate.Options{Path: "/var/log/trading"})
    node.LoggerAdd(".trading", tradingFileLogger)
    
    // Trading process uses only the hidden logger
    tradingProcess.Log().SetLogger(".trading")
    route := gen.NetworkRoute{
        Route: gen.Route{
            Host: "10.0.1.50",
            Port: 4370,
            TLS:  true,
        },
    }
    network.AddRoute("[email protected]", route, 100)
    node, err := ergo.StartNode("myapp@localhost", gen.NodeOptions{
        Network: gen.NetworkOptions{
            Mode:           gen.NetworkModeEnabled,
            Cookie:         "secret-cluster-cookie",
            MaxMessageSize: 10 * 1024 * 1024, // 10MB
            Flags: gen.NetworkFlags{
                Enable:                       true,
                EnableRemoteSpawn:            true,
                EnableRemoteApplicationStart: true,
                EnableImportantDelivery:      true,
            },
            Acceptors: []gen.AcceptorOptions{
                {
                    Port:       15000,
                    PortRange:  10,
                    BufferSize: 64 * 1024,
                },
            },
        },
    })
    node, err := ergo.StartNode("myapp@localhost", gen.NodeOptions{
        Network: gen.NetworkOptions{
            Handshake: customHandshake,
            Proto:     customProto,
            Acceptors: []gen.AcceptorOptions{
                {Port: 15000, Proto: ergoProto},   // Ergo protocol
                {Port: 16000, Proto: erlangProto}, // Erlang protocol
            },
        },
    })
    remote, err := node.Network().GetNode("worker@otherhost")
    if err != nil {
        return err
    }
    
    pid, err := remote.Spawn("worker_name", gen.ProcessOptions{})
    // On the remote node
    node.Network().EnableSpawn("worker_name", createWorker)
    remote.ApplicationStart("myapp", gen.ApplicationOptions{})
    node.Network().EnableApplicationStart("myapp")
    # Node-specific integer configuration (stored as string, converted to int64)
    etcdctl put services/ergo/cluster/production/config/web1/database.port "int:5432"
    
    # Cluster-wide float configuration (stored as string, converted to float64)
    etcdctl put services/ergo/cluster/production/config/*/cache.ratio "float:0.75"
    
    # Boolean configuration (stored as string, converted to bool)
    etcdctl put services/ergo/cluster/production/config/*/debug.enabled "bool:true"
    etcdctl put services/ergo/cluster/production/config/web1/ssl.enabled "bool:false"
    
    # Application-specific configuration (visible to all nodes using wildcard format)
    etcdctl put services/ergo/cluster/production/config/*/myapp.cache.size "int:256"
    etcdctl put services/ergo/cluster/production/config/*/client.timeout "int:30"
    
    # Global string configuration (stored and returned as string)
    etcdctl put services/ergo/config/global/log.level "info"
    registrar, err := node.Network().Registrar()
    if err != nil {
        return err
    }
    
    // Get single configuration item
    port, err := registrar.ConfigItem("database.port")
    if err != nil {
        return err
    }
    // port will be int64(5432)
    
    // Get multiple configuration items
    config, err := registrar.Config("database.port", "cache.ratio", "debug.enabled", "log.level")
    if err != nil {
        return err
    }
    // config["database.port"] = int64(5432)
    // config["cache.ratio"] = float64(0.75) 
    // config["debug.enabled"] = bool(true)
    // config["log.level"] = "info"
    type myActor struct {
        act.Actor
    }
    
    func (m *myActor) HandleMessage(from gen.PID, message any) error {
        reg, err := m.Node().Network().Registrar()
        if err != nil {
            m.Log().Error("unable to get Registrar interface: %s", err)
            return nil
        }
        
        ev, err := reg.Event()
        if err != nil {
            m.Log().Error("Registrar has no registered Event: %s", err)
            return nil
        }
        
        m.MonitorEvent(ev)
        return nil
    }
    
    func (m *myActor) HandleEvent(event gen.MessageEvent) error {
        switch msg := event.Message.(type) {
        case etcd.EventNodeJoined:
            m.Log().Info("Node %s joined cluster", msg.Name)
        case etcd.EventApplicationStarted:
            m.Log().Info("Application %s started on node %s", msg.Name, msg.Node)
        case etcd.EventConfigUpdate:
            m.Log().Info("Configuration %s updated", msg.Item)
            
            // Handle specific configuration changes
            if msg.Item == "ssl.enabled" {
                if enabled, ok := msg.Value.(bool); ok {
                    m.Log().Info("SSL %s", map[bool]string{true: "enabled", false: "disabled"}[enabled])
                }
            }
        }
        return nil
    }
    type ApplicationRoute struct {
        Node   Atom
        Name   Atom
        Weight int
        Mode   ApplicationMode
        State  ApplicationState
    }
    resolver := registrar.Resolver()
    
    // Resolve application routes
    routes, err := resolver.ResolveApplication("web-server")
    if err != nil {
        return err
    }
    
    for _, route := range routes {
        log.Printf("Application %s running on node %s (weight: %d, state: %s)", 
            route.Name, route.Node, route.Weight, route.State)
    }
    nodes, err := registrar.Nodes()
    if err != nil {
        return err
    }
    
    for _, nodeName := range nodes {
        log.Printf("Node in cluster: %s", nodeName)
    }
    services/ergo/cluster/{cluster}/
    ├── routes/                         # Non-overlapping with config paths
    │   ├── nodes/{node}               # Node registration with lease (edf.Encode + base64)
    │   └── applications/{app}/{node}  # Application routes (edf.Encode + base64)
    └── config/                        # Configuration data (string + type prefixes) 
        ├── {node}/{item}             # Node-specific config
        └── */{item}                  # Cluster-wide config
    
    services/ergo/config/
    ├── {cluster}/{node}/{item}        # Cross-cluster node config
    └── global/{item}                  # Global config
    # Start etcd for testing
    make start-etcd
    
    # Run tests with coverage
    make test-coverage
    
    # Run integration tests only
    make test-integration
    
    # Clean up
    make clean
    # Check cluster health
    etcdctl --endpoints=localhost:12379 endpoint health
    
    # List all keys in cluster
    etcdctl --endpoints=localhost:12379 get --prefix "services/ergo/"
    
    # Set configuration manually (values must be strings)
    etcdctl --endpoints=localhost:12379 put \
      "services/ergo/cluster/production/config/web1/database.timeout" "int:30"
    
    etcdctl --endpoints=localhost:12379 put \
      "services/ergo/cluster/production/config/web1/debug.enabled" "bool:true"
    
    # Watch for changes
    etcdctl --endpoints=localhost:12379 watch --prefix "services/ergo/cluster/production/"
    type PoolBehavior interface {
        gen.ProcessBehavior
        
        Init(args ...any) (PoolOptions, error)
        
        HandleMessage(from gen.PID, message any) error
        HandleCall(from gen.PID, ref gen.Ref, request any) (any, error)
        Terminate(reason error)
        
        HandleEvent(message gen.MessageEvent) error
        HandleInspect(from gen.PID, item ...string) map[string]string
    }
    type WorkerPool struct {
        act.Pool
    }
    
    func (p *WorkerPool) Init(args ...any) (act.PoolOptions, error) {
        return act.PoolOptions{
            PoolSize:          5,                    // 5 workers
            WorkerFactory:     createWorker,         // Factory for workers
            WorkerMailboxSize: 100,                  // Limit each worker to 100 messages
            WorkerArgs:        []any{"config"},      // Args passed to worker Init
        }, nil
    }
    
    func createPoolFactory() gen.ProcessBehavior {
        return &WorkerPool{}
    }
    
    // Spawn the pool
    poolPID, err := node.Spawn(createPoolFactory, gen.ProcessOptions{})
    // Rate limit: 5 workers × 20 messages = 100 requests max in flight
    return act.PoolOptions{
        PoolSize:          5,
        WorkerMailboxSize: 20,
        WorkerFactory:     createAPIWorker,
    }, nil
    // Send a message to the pool
    process.Send(poolPID, WorkRequest{Data: "task1"})
    
    // The pool forwards to a worker transparently
    // The worker's HandleMessage receives it
    // Sender
    process.Send(poolPID, "hello")
    
    // Worker's HandleMessage
    func (w *Worker) HandleMessage(from gen.PID, message any) error {
        // 'from' is the original sender's PID, not the pool's PID
        w.Send(from, "reply")  // Reply goes to original sender
        return nil
    }
    // Normal priority - forwarded to worker automatically
    process.Send(poolPID, WorkRequest{})
    
    // High priority - handled by pool's HandleMessage
    process.SendWithPriority(poolPID, ManagementCommand{}, gen.MessagePriorityHigh)
    
    // Pool's HandleMessage - invoked for Urgent/System messages
    func (p *WorkerPool) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case ManagementCommand:
            count, _ := p.AddWorkers(msg.AdditionalWorkers)
            p.Log().Info("scaled to %d workers", count)
        
        default:
            p.Log().Warning("unhandled message: %T", message)
        }
        return nil
    }
    // Normal priority - forwarded to worker
    result, err := process.Call(poolPID, WorkRequest{})
    
    // High priority - handled by pool's HandleCall
    stats, err := process.CallWithPriority(poolPID, GetPoolStatsRequest{}, gen.MessagePriorityHigh)
    
    // Pool's HandleCall - invoked for Urgent/System requests
    func (p *WorkerPool) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
        switch req := request.(type) {
        case GetPoolStatsRequest:
            return PoolStats{
                WorkerCount: p.pool.Len(),
                Forwarded:   p.forwarded,
            }, nil
        
        default:
            p.Log().Warning("unhandled request: %T", request)
            return nil, nil  // Caller will timeout
        }
    }
    func (p *WorkerPool) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case ScaleUpCommand:
            newSize, err := p.AddWorkers(msg.Count)
            if err != nil {
                p.Log().Error("failed to add workers: %s", err)
                return nil
            }
            p.Log().Info("scaled up to %d workers", newSize)
        
        case ScaleDownCommand:
            newSize, err := p.RemoveWorkers(msg.Count)
            if err != nil {
                p.Log().Error("failed to remove workers: %s", err)
                return nil
            }
            p.Log().Info("scaled down to %d workers", newSize)
        }
        return nil
    }
    stats, err := node.Inspect(poolPID)
    // stats contains:
    // - "pool_size": configured number of workers
    // - "worker_behavior": type name of worker behavior
    // - "worker_mailbox_size": mailbox limit per worker
    // - "worker_restarts": count of workers restarted
    // - "messages_forwarded": total messages forwarded to workers
    // - "messages_unhandled": messages dropped (all workers full)

    Port to advertise in route registration (0 = use actual listening port)

    Erlang

    Erlang network stack

    This package implements the Erlang network stack, including the DIST protocol, ETF data format, EPMD registrar functionality, and the Handshake mechanism.

    It is compatible with OTP-23 to OTP-27. The source code is available on the project's GitHub page at https://github.com/ergo-services/proto in the erlang23 directory.

    Note that the source code is distributed under the Business Source License 1.1 and cannot be used for production or commercial purposes without a license, which can be purchased on the project's sponsor page.

    hashtag
    EPMD

    The epmd package implements the gen.Registrar interface. To create it, use the epmd.Create function with the following options:

    • Port: Registrar port number (default: 4369).

    • EnableRouteTLS: Enables TLS for all gen.Route responses on resolve requests. This is necessary if the Erlang cluster uses TLS.

  • DisableServer: Disables the internal server mode, useful when using the Erlang-provided epmd daemon.

    To use this package, include ergo.services/proto/erlang23/epmd.

    hashtag
    Handshake

    The handshake package implements the gen.NetworkHandshake interface. To create a handshake instance, use the handshake.Create function with the following options:

    • Flags: Defines the supported functionality of the Erlang network stack. The default is set by handshake.DefaultFlags().

    • UseVersion5: Enables handshake version 5 mode (default is version 6).

    To use this package, include ergo.services/proto/erlang23/handshake.

    hashtag
    DIST protocol

    The ergo.services/proto/erlang23/dist package implements the gen.NetworkProto and gen.Connection interfaces. To create it, use the dist.Create function and provide dist.Options as an argument, where you can specify the FragmentationUnit size in bytes. This value is used for fragmenting large messages. The default size is 65000 bytes.

    To use this package, include ergo.services/proto/erlang23/dist.

    hashtag
    ETF data format

    Erlang uses the ETF (Erlang Term Format) for encoding messages transmitted over the network. Due to differences in data types between Golang and Erlang, decoding received messages involves converting the data to their corresponding Golang types:

  • integer -> int64

    • float number -> float64

    When encoding data in the Erlang ETF format:

    • map -> map #{}

    • slice/array -> list []

    You can also use the functions etf.TermIntoStruct and etf.TermProplistIntoStruct for decoding data. These functions take into account etf: tags on struct fields, allowing the values to map correctly to the corresponding struct fields when decoding proplist data.

    To automatically decode data into a struct, you can register the struct type using etf.RegisterTypeOf. This function takes the object of the type being registered and decoding options etf.RegisterTypeOption. The options include:

    • Name - The name of the registered type. By default, the type name is taken using the reflect package in the format #/pkg/path/TypeName

    • Strict - Determines whether the data must strictly match the struct. If disabled, non-matching data will be decoded into any.

    To be decoded automatically, the data sent from Erlang must be a tuple whose first element is an atom matching the type name registered in Golang. For example, if the struct MyValue is registered under the name myvalue, an Erlang process would send {myvalue, "hello", 123}.

    hashtag
    Ergo-node in Erlang-cluster

    If you want to use the Erlang network stack by default in your node, you need to specify this in gen.NetworkOptions when starting the node:

    In this case, all outgoing and incoming connections will be handled by the Erlang network stack. For a complete example, see the erlang project in the examples repository at https://github.com/ergo-services/examples.

    If you want to maintain the ability to accept connections from Ergo nodes while using the Erlang network stack as the main one, you need to add an acceptor in the gen.NetworkOptions settings:

    Please note that if the list of acceptors is empty when starting the node, it will launch an acceptor with the network stack using Registrar, Handshake, and Proto from gen.NetworkOptions.

    If you set the options.Network.Acceptor, you must explicitly define the parameters for all necessary acceptors. In the example, acceptorErlang is created with empty gen.AcceptorOptions (the Erlang stack from gen.NetworkOptions will be used), while for acceptorErgo, the Ergo Framework stack (Registrar, Handshake, and Proto) is explicitly defined.

    In this example, you can establish incoming and outgoing connections using the Erlang network stack. However, the Ergo Framework network stack can only be used for incoming connections. To create outgoing network connections using the Ergo Framework stack, you need to configure a static route for a group of nodes by defining a match pattern:

    For more detailed information, please refer to the Static Routes section.

    hashtag
    Erlang-node in Ergo-cluster

    If your cluster primarily uses the Ergo Framework network stack by default and you want to enable interaction with Erlang nodes, you'll need to add an acceptor using the Erlang network stack. Additionally, you must define a static route for Erlang nodes using a match pattern:

    hashtag
    Actor GenServer

    The erlang23.GenServer actor implements the low-level gen.ProcessBehavior interface, enabling it to handle messages and synchronous requests from processes running on an Erlang node. The following message types are used for communication in Erlang:

    • regular messages - sent from Erlang using erlang:send or the Pid ! message syntax

    • cast-messages - sent from Erlang with gen_server:cast

    • call-requests - sent from Erlang with gen_server:call

    erlang23.GenServer uses the erlang23.GenServerBehavior interface to interact with your object. This interface defines a set of callback methods for your object, which allow it to handle incoming messages and requests. All methods in this interface are optional, meaning you can choose to implement only the ones relevant to your specific use case:

    The callback method HandleInfo is invoked when an asynchronous message is received from an Erlang process using erlang:send or via the Send* methods of the gen.Process interface. The HandleCast callback method is called when a cast message is sent using gen_server:cast from an Erlang process. Synchronous requests sent with gen_server:call or Call* methods are handled by the HandleCall callback method.

    If your actor only needs to handle regular messages from Erlang processes, you can use the standard act.Actor and process asynchronous messages in the HandleMessage callback method.

    To start a process based on erlang23.GenServer, create an object embedding erlang23.GenServer and implement a factory function for it.

    Example:

    To send a cast message, use the Cast method of erlang23.GenServer.

    To send regular messages, use the Send* methods of the embedded gen.Process interface. Synchronous requests are made using the Call* methods of the gen.Process interface.

    Like act.Actor, an actor based on erlang23.GenServer supports the TrapExit functionality to intercept exit signals. Use the SetTrapExit and TrapExit methods of your object to manage this functionality, allowing your process to handle exit signals rather than terminating immediately when receiving them.

    Debugging

    Debugging distributed actor systems presents unique challenges. Traditional debugging tools struggle with concurrent message passing, process isolation, and distributed state. This article covers the debugging capabilities built into Ergo Framework and demonstrates practical techniques for troubleshooting common issues.

    hashtag
    Build Tags

    Ergo Framework uses Go build tags to enable debugging features without affecting production performance. These tags control compile-time behavior, ensuring zero overhead when disabled.

    Static Routes

    Controlling outgoing connections with static routing

    When your code sends a message to a remote process, the framework needs to establish a connection to that node. But how does it know where the node is? By default, it asks the system (the Registrar) to look up the node's address. This works well for dynamic clusters where nodes come and go.

    But sometimes you want more control. Maybe you know exactly where certain nodes are. Maybe you're behind a firewall and can't use dynamic discovery. Maybe you want to connect to external systems with fixed addresses. Static routes let you hardcode connection information directly, bypassing the discovery process entirely.

    This isn't just about convenience. It's about control. When you define a static route, you're saying "I know better than the discovery system where this node is, and here's exactly how to reach it." The framework respects that - static routes are checked first, before any discovery queries.


    SSE

    SSE provides unidirectional server-to-client streaming over HTTP. Unlike WebSocket's bidirectional connections, SSE is designed for scenarios where the server pushes updates to clients - live feeds, notifications, real-time dashboards.

    The framework provides an SSE meta-process implementation that integrates SSE connections with the actor model. Each connection becomes an independent actor, addressable from anywhere in the cluster.

    hashtag
    The Integration Problem

    SSE connections need two capabilities: pushing events out to the connected client, and receiving messages from other processes in the cluster.

    The remaining type mappings from the ETF data format section (Erlang -> Golang decoding):

    • big number -> big.Int from math/big, or int64/uint64

    • map -> map[any]any

    • binary -> []byte

    • list -> etf.List ([]any)

    • tuple -> etf.Tuple ([]any), or a registered struct type

    • string -> []any; convert to a string using etf.TermToString

    • atom -> gen.Atom

    • pid -> gen.Pid

    • ref -> gen.Ref

    • ref (alias) -> gen.Alias

    • atom true/false -> bool

    And for encoding (Golang -> Erlang ETF):

    • struct -> map with field names as keys (honoring etf: tags on struct fields)

    • registered struct type -> tuple with the first element being the registered struct name, followed by the field values in order

    • []byte -> binary

    • int*/float*/big.Int -> number

    • string -> string

    • gen.Atom -> atom

    • gen.Pid -> pid

    • gen.Ref -> ref

    • gen.Alias -> ref (alias)

    • bool -> atom true/false
    hashtag
    How It Works

    The framework maintains an internal routing table. When you create an outgoing connection to a remote node, the framework:

    1. Checks static routes first - Looks in the routing table for a match

    2. Falls back to discovery - If no static route exists, queries the Registrar

    3. Tries proxy routes - If direct connection fails, attempts proxy routes

    This order is important. Static routes always win. If you've defined a route for "prod-.*" that matches prod-db@example.com, the framework uses your route and never asks the Registrar. You've taken control.

    The routing table uses pattern matching. When the framework needs to connect to prod-db@example.com, it checks all static routes against that name using Go's regexp.MatchString. Any routes whose patterns match become candidates. If multiple routes match, they're sorted by weight (higher weights first), and the framework tries them in order until one succeeds.
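The selection logic can be sketched in plain Go. This is a simplified model for illustration only - the staticRoute type and the host values are hypothetical stand-ins, not the framework's internal routing-table representation:

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// staticRoute is a reduced model of a routing-table entry:
// just the match pattern, its weight, and a target host.
type staticRoute struct {
	pattern string
	weight  int
	host    string
}

// matchRoutes returns all routes whose patterns match the node name,
// sorted by weight in descending order - the order in which
// connection attempts would be made.
func matchRoutes(routes []staticRoute, node string) []staticRoute {
	var matched []staticRoute
	for _, r := range routes {
		if ok, _ := regexp.MatchString(r.pattern, node); ok {
			matched = append(matched, r)
		}
	}
	sort.Slice(matched, func(i, j int) bool {
		return matched[i].weight > matched[j].weight
	})
	return matched
}

func main() {
	table := []staticRoute{
		{pattern: "prod-.*", weight: 100, host: "10.0.1.50"},
		{pattern: ".*@example.com", weight: 200, host: "10.0.2.50"},
		{pattern: "^staging-.*", weight: 100, host: "10.0.3.50"},
	}
	// Two patterns match; the weight-200 route is tried first.
	for _, r := range matchRoutes(table, "prod-db@example.com") {
		fmt.Printf("%s (weight %d) -> %s\n", r.pattern, r.weight, r.host)
	}
}
```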

    hashtag
    Adding Static Routes

    To add a static route, use AddRoute from the network interface:

    This tells the framework: "When connecting to prod-db@example.com, use host 10.0.1.50 on port 4370 with TLS enabled. This route has weight 100."

    The match pattern is a regular expression. Exact names like "prod-db@example.com" match only that node. Patterns like "prod-.*" match multiple nodes - prod-db@example.com, prod-cache@example.com, prod-api@example.com. Use anchors (^ and $) for precise matching: "^prod-db@example.com$" matches exactly that name and nothing else.

    The weight determines priority when multiple routes match the same node. Higher numbers mean higher priority. If you have two routes for "prod-.*" - one with weight 100 (the default datacenter) and one with weight 200 (a faster backup datacenter) - the framework tries weight 200 first.

    hashtag
    Pattern Matching Examples

    When the framework looks up prod-db1@example.com, it finds all matching routes: the prefix match (prod-.*), the suffix match (.*@example.com), and the complex pattern (^prod-db[0-9]+@example.com$). It sorts them by weight and tries the highest-weight route first.

    hashtag
    Route Configuration

    The gen.NetworkRoute struct gives you fine-grained control over how connections are established:

    hashtag
    Direct Connection

    The simplest route specifies connection parameters directly:

    When the framework uses this route, it connects to the specified host and port with TLS. The handshake and protocol versions default to the node's configured versions if you don't specify them explicitly.

    hashtag
    Route with Resolver

    You can combine static patterns with dynamic resolution:

    This hybrid approach uses the pattern to select which nodes use this route, then queries the resolver for connection details. The Route fields override any values returned by the resolver. In this example, even if the resolver returns a non-TLS route, the framework forces TLS. If the resolver returns staging-db@internal but you've specified Host: "custom.example.com", the framework connects to your specified host instead.

    Why would you do this? Imagine you have a staging environment behind a bastion host. The staging nodes register themselves in the discovery system with their internal addresses, but you need to connect through a specific gateway. The resolver pattern matches staging nodes, the resolver gets you the node's details, but your route configuration redirects the connection through your gateway.

    hashtag
    Custom Cookie

    Each route can override the node's default authentication cookie:

    This is essential when connecting to nodes outside your cluster. Your internal nodes use one cookie (say, "internal-cluster-secret"). An external partner's nodes use a different cookie (say, "shared-secret-with-partner"). Without per-route cookies, you'd have to use the same cookie everywhere or give up on connecting to external systems.

    hashtag
    Custom Certificates

    For TLS connections, you can specify a custom certificate manager:

    Different routes can use different certificates. Your production nodes might use certificates from one CA. A partner's nodes might use certificates from another CA. Each route gets its own certificate manager, allowing you to maintain separate trust chains.

    Setting InsecureSkipVerify: true disables certificate validation. Use this only for testing or when connecting to nodes with self-signed certificates you trust but can't properly validate.

    hashtag
    Custom Network Flags

    You can override network capabilities for specific routes:

    This is about defense. When you connect to an external node, you probably don't want them spawning arbitrary processes on your node or starting applications remotely. Custom flags let you expose only the features you're comfortable with for that specific connection.

    hashtag
    Atom Mapping

    Some advanced scenarios require translating atom values during communication:

    When sending to this route, the framework automatically replaces mynode@localhost with legacy_node in all messages. On receiving, it reverses the mapping. This is rarely needed - most systems agree on naming conventions. But when integrating with legacy systems or systems with incompatible naming schemes, atom mapping saves you from rewriting every piece of code that references those atoms.

    hashtag
    Per-Route Logging

    You can set the logging level for a specific connection:

    Normally your network stack runs at INFO or WARNING level. But when debugging a specific connection, you want TRACE logs for that connection without drowning in logs from all other connections. Per-route logging gives you surgical debugging.

    hashtag
    Multiple Routes and Failover

    The framework tries routes in weight order when multiple patterns match the same node:

    When connecting to prod-db@example.com, both patterns match. The framework sorts them by weight and tries weight-200 first. If that connection fails (host unreachable, handshake failure, timeout), it tries weight-100. This gives you automatic failover.

    Important limitation: You can't add the same pattern twice. AddRoute returns gen.ErrTaken if the pattern already exists - the pattern is the routing table key. To achieve multi-route failover for a single node, you need different patterns that both match:

    Both patterns match prod-db@example.com, but they're different strings, so both can be added to the routing table.

    Alternatively, use a resolver-based route. The resolver can return multiple addresses, and the framework tries them in order, letting the resolver handle failover logic.

    hashtag
    Querying Routes

    To see if a route exists for a node:

    This queries the routing table without establishing a connection. You get back all routes whose patterns match the node name, sorted by weight. The highest-weight route is first - that's the one the framework would try first when actually connecting.

    hashtag
    Removing Routes

    To remove a static route:

    The pattern you pass to RemoveRoute must exactly match the pattern you used in AddRoute. It's not a regex match - it's a literal string key lookup in the routing table. If you added "prod-.*", you must remove "prod-.*" exactly.
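A minimal model of why removal is a literal key lookup - the map below is an illustrative stand-in, not the framework's actual routing table:

```go
package main

import "fmt"

func main() {
	// The routing table is keyed by the pattern string itself.
	table := map[string]string{
		"prod-.*": "10.0.1.50:4370",
	}

	// Removing by a name the pattern *matches* does nothing -
	// "prod-db@example.com" is not a key in the table.
	delete(table, "prod-db@example.com")
	fmt.Println(len(table)) // still 1

	// Removing by the exact pattern string works.
	delete(table, "prod-.*")
	fmt.Println(len(table)) // 0
}
```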

    Removing a route doesn't affect existing connections. If you have an active connection to prod-db@example.com and you remove its static route, the connection stays alive. Removing a route only affects future connection attempts. The next time the framework needs to connect to that node, it won't find the static route and will fall back to discovery.

    hashtag
    Proxy Routes

    Sometimes you can't connect directly to a node. Maybe it's behind a firewall. Maybe it's in a private network. Proxy routes let you connect through an intermediate node:

    When the framework needs to connect to backend-1@internal.local, it establishes a connection to gateway@dmz.local first, then asks the gateway to proxy the connection to the final destination. The gateway handles forwarding messages between you and the backend node.

    Proxy routes have the same pattern matching and weight semantics as direct routes. You can define multiple proxy routes for the same pattern with different weights for failover.

    hashtag
    Proxy Configuration

    MaxHop limits proxy chaining. If the gateway itself needs to proxy through another node, and that node proxies through yet another node, MaxHop prevents infinite loops. The default is 8. Each proxy hop decrements the counter. When it reaches zero, the framework refuses to proxy further.
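The hop-limit idea can be sketched as follows (a simplified illustration with made-up node names, not the framework's proxy implementation):

```go
package main

import (
	"errors"
	"fmt"
)

// forward simulates proxying a connection request along a chain of
// intermediate nodes, decrementing the hop counter at every proxy.
func forward(chain []string, maxHop int) error {
	for _, node := range chain {
		if maxHop == 0 {
			// Counter exhausted: refuse to proxy any further.
			return errors.New("too many hops")
		}
		maxHop--
		fmt.Println("proxying via", node)
	}
	return nil
}

func main() {
	chain := []string{"gateway1", "gateway2", "gateway3", "gateway4"}
	if err := forward(chain, 3); err != nil {
		fmt.Println("refused:", err) // chain is longer than MaxHop allows
	}
}
```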

    The Flags control what operations the proxy allows. Maybe your gateway allows monitoring remote processes but doesn't allow spawning processes through the proxy. This gives you granular security control at the proxy level.

    hashtag
    Static Routes vs Discovery

    Static routes are checked first, always. When the framework needs to connect to a node:

    1. Check routing table - Pattern match against static routes

    2. Try static routes - Attempt connection using matched routes (by weight order)

    3. Query discovery - If no static route exists or all failed, ask the Registrar

    4. Try discovered routes - Attempt connection using discovered addresses

    5. Try proxy discovery - If direct connection fails, try discovered proxy routes

    6. Fail - Return gen.ErrNoRoute

    This priority order means static routes override discovery. If you have a static route for prod-db pointing to 10.0.1.50, the framework never asks the Registrar for prod-db's address. It just uses your route. This is by design - you're explicitly taking control.

    But combining them is powerful. You can define static routes with resolvers:

    Now all production nodes use the static route for pattern matching, but the resolver for address lookup. You get the control of static routes (selecting which nodes use this configuration) with the dynamism of discovery (nodes can move without updating your code).

    hashtag
    When to Use Static Routes

    Fixed infrastructure - If your nodes run on specific servers with static IPs, static routes are simpler than running a discovery service. Add routes for your database, cache, and API servers, and you're done.

    Firewall restrictions - When discovery protocols can't traverse your firewall, static routes work around it. The internal nodes discover each other normally. External access uses static routes pointing to your gateway.

    External integration - Connecting to nodes outside your cluster almost always requires static routes. You don't control their discovery system (if they even have one). You just need to reach specific addresses.

    Testing - Hardcoding routes during development lets you point at local test nodes without configuring a full discovery system.

    Performance - Static routes eliminate discovery latency. The framework connects immediately without the resolver round-trip. For frequently accessed nodes, this shaves milliseconds off connection establishment.

    Security boundaries - Different routes can use different cookies and certificates. When integrating multiple trust domains, static routes let you configure each boundary explicitly.

    Static routes aren't a replacement for discovery. They're a tool for cases where discovery doesn't fit. Most production clusters use discovery for internal nodes (dynamic, automatic) and static routes for fixed external connections (explicit, controlled). The framework supports both, and they work together.

    For details on how connections are established, see Network Stack. For understanding the discovery system that static routes bypass, see Service Discovery.

    type MyValue struct {
        MyString string
        MyInt    int32
    }
    
    ...
    // register type MyValue with name "myvalue"
    etf.RegisterTypeOf(MyValue{}, etf.RegisterTypeOptions{Name: "myvalue", Strict: true})
    ...
    > erlang:send(Pid, {myvalue, "hello", 123}).
    import (
        "fmt"
        
        "ergo.services/ergo"
        "ergo.services/ergo/gen"
        "ergo.services/proto/erlang23/dist"
        "ergo.services/proto/erlang23/epmd"
        "ergo.services/proto/erlang23/handshake"
    )
    
    func main() {
        var options gen.NodeOptions
        
        // set cookie
        options.Network.Cookie = "123"
        
        // set Erlang Network Stack for this node
        options.Network.Registrar = epmd.Create(epmd.Options{})
        options.Network.Handshake = handshake.Create(handshake.Options{})
        options.Network.Proto = dist.Create(dist.Options{})
    
        // starting node
        node, err := ergo.StartNode(gen.Atom(OptionNodeName), options)
        if err != nil {
            fmt.Printf("Unable to start node '%s': %s\n", OptionNodeName, err)
            return
        }
        
        node.Wait()
    }
    import (
        "fmt"
        "ergo.services/ergo"
        "ergo.services/ergo/gen"
        
        // Ergo Network Stack
        hs "ergo.services/ergo/net/handshake"
        "ergo.services/ergo/net/proto"
        "ergo.services/ergo/net/registrar"
    
        // Erlang Network Stack    
        "ergo.services/proto/erlang23/dist"
        "ergo.services/proto/erlang23/epmd"
        "ergo.services/proto/erlang23/handshake"
    )
    
    func main() {
        ...
        acceptorErlang := gen.AcceptorOptions{}
        acceptorErgo := gen.AcceptorOptions{
            Registrar: registrar.Create(registrar.Options{}),
            Handshake: hs.Create(hs.Options{}),
            Proto:     proto.Create(),
        }
        options.Network.Acceptors = append(options.Network.Acceptors, 
                                        acceptorErlang, acceptorErgo)
        // starting node
        node, err := ergo.StartNode(gen.Atom(OptionNodeName), options)
    ...
    // starting node
    node, err := ergo.StartNode(gen.Atom(OptionNodeName), options)
    // add static route  
    route := gen.NetworkRoute{
        Resolver: acceptorErgo.Registrar.Resolver(),
    }
    match := ".ergonodes.local"
    if err := node.Network().AddRoute(match, route, 1); err != nil {
        panic(err)
    }
    import (
        "fmt"
        
        "ergo.services/ergo"
        "ergo.services/ergo/gen"
        "ergo.services/proto/erlang23/dist"
        "ergo.services/proto/erlang23/epmd"
        "ergo.services/proto/erlang23/handshake"
    )
    
    func main() {
        var options gen.NodeOptions
        
        // set cookie
        options.Network.Cookie = "123"
        
        // add acceptors
        acceptorErgo := gen.AcceptorOptions{}
        acceptorErlang := gen.AcceptorOptions{
            Registrar: epmd.Create(epmd.Options{}),
            Handshake: handshake.Create(handshake.Options{}),
            Proto:     dist.Create(dist.Options{}),
        }
        options.Network.Acceptors = append(options.Network.Acceptors, 
                                        acceptorErgo, acceptorErlang)
    
        // starting node
        node, err := ergo.StartNode(gen.Atom(OptionNodeName), options)
        if err != nil {
            fmt.Printf("Unable to start node '%s': %s\n", OptionNodeName, err)
            return
        }
        
        // add static route  
        route := gen.NetworkRoute{
            Resolver: acceptorErlang.Registrar.Resolver(),
        }
        if err := node.Network().AddRoute(".erlangnodes.local", route, 1); err != nil {
            panic(err)
        }
        
        node.Wait()
    }
    type GenServerBehavior interface {
    	gen.ProcessBehavior
    
    	Init(args ...any) error
    	HandleInfo(message any) error
    	HandleCast(message any) error
    	HandleCall(from gen.PID, ref gen.Ref, request any) (any, error)
    	Terminate(reason error)
    
    	HandleEvent(message gen.MessageEvent) error
    	HandleInspect(from gen.PID, item ...string) map[string]string
    }
    import "ergo.services/proto/erlang23"
    
    func factory_MyActor() gen.ProcessBehavior {
        return &MyActor{}
    }
    
    type MyActor struct {
        erlang23.GenServer
    }
    func (ma *MyActor) HandleInfo(message any) error {
        ...
        ma.Cast(Pid, "cast message")
        return nil
    }
    network := node.Network()
    
    route := gen.NetworkRoute{
        Route: gen.Route{
            Host: "10.0.1.50",
            Port: 4370,
            TLS:  true,
        },
    }
    
    err := network.AddRoute("prod-db@example.com", route, 100)
    if err != nil {
        // handle error
    }
    // Exact match - only this specific node
    network.AddRoute("database@prod", route1, 100)
    
    // Prefix match - all production nodes
    network.AddRoute("prod-.*", route2, 100)
    
    // Suffix match - all nodes in a domain
    network.AddRoute(".*@example.com", route3, 100)
    
    // Complex pattern - production databases only
    network.AddRoute("^prod-db[0-9]+@example.com$", route4, 100)
    route := gen.NetworkRoute{
        Route: gen.Route{
            Host: "192.168.1.100",
            Port: 4370,
            TLS:  true,
            HandshakeVersion: handshake.Version(), // optional, uses default if not set
            ProtoVersion:     proto.Version(),     // optional, uses default if not set
        },
    }
    route := gen.NetworkRoute{
        Resolver: registrar.Resolver(), // use specific registrar
        Route: gen.Route{
            Host: "custom.example.com", // override resolved host
            TLS:  true,                 // force TLS
        },
    }
    
    network.AddRoute("staging-.*", route, 100)
    route := gen.NetworkRoute{
        Route: gen.Route{
            Host: "partner.external.com",
            Port: 4370,
        },
        Cookie: "shared-secret-with-partner",
    }
    customCert := node.CertManager() // or create a new one
    route := gen.NetworkRoute{
        Route: gen.Route{
            Host: "secure.partner.com",
            Port: 4370,
            TLS:  true,
        },
        Cert:               customCert,
        InsecureSkipVerify: false, // enforce certificate validation
    }
    route := gen.NetworkRoute{
        Route: gen.Route{
            Host: "readonly.external.com",
            Port: 4370,
        },
        Flags: gen.NetworkFlags{
            Enable:                       true,
            EnableRemoteSpawn:            false, // don't let them spawn on us
            EnableRemoteApplicationStart: false, // don't let them start apps on us
            EnableImportantDelivery:      true,  // but do support important delivery
        },
    }
    route := gen.NetworkRoute{
        Route: gen.Route{
            Host: "legacy.system.com",
            Port: 4370,
        },
        AtomMapping: map[gen.Atom]gen.Atom{
            "mynode@localhost":  "legacy_node",
            "process_manager":   "proc_mgr",
        },
    }
    route := gen.NetworkRoute{
        Route: gen.Route{
            Host: "debug.target.com",
            Port: 4370,
        },
        LogLevel: gen.LogLevelTrace, // detailed logging for this route only
    }
    // Primary datacenter - wider pattern
    primaryRoute := gen.NetworkRoute{
        Route: gen.Route{Host: "10.0.1.50", Port: 4370, TLS: true},
    }
    network.AddRoute("^prod-db@.*", primaryRoute, 200)
    
    // Backup datacenter - more specific pattern
    backupRoute := gen.NetworkRoute{
        Route: gen.Route{Host: "10.0.2.50", Port: 4370, TLS: true},
    }
    network.AddRoute("prod-db@example.com", backupRoute, 100)
    // These are different patterns that match the same node
    network.AddRoute("^prod-db@example.com$", primaryRoute, 200)  // exact match with anchors
    network.AddRoute("prod-db@example.com", backupRoute, 100)     // substring match
    routes, err := network.Route("prod-db@example.com")
    if err == gen.ErrNoRoute {
        // no static route defined
    } else {
        // routes contains all matching routes, sorted by weight descending
        for i, route := range routes {
            fmt.Printf("Route %d: %s:%d\n", i+1, route.Route.Host, route.Route.Port)
        }
    }
    err := network.RemoveRoute("prod-db@example.com")
    if err == gen.ErrUnknown {
        // no such route existed
    }
    proxyRoute := gen.NetworkProxyRoute{
        Route: gen.ProxyRoute{
        To:    "backend-1@internal.local",  // final destination
        Proxy: "gateway@dmz.local",         // intermediate node
        },
    }
    
    network.AddProxyRoute("backend-.*@internal.local", proxyRoute, 100)
    proxyRoute := gen.NetworkProxyRoute{
        Route: gen.ProxyRoute{
            To:    "target@backend",
            Proxy: "gateway@dmz",
        },
        Cookie: "gateway-specific-cookie",        // authenticate to gateway
        MaxHop: 3,                                // limit proxy chain depth
        Flags: gen.NetworkProxyFlags{
            Enable:                  true,
            EnableLink:              true,        // allow link operations through proxy
            EnableMonitor:           true,        // allow monitor operations
            EnableSpawn:             false,       // don't allow spawning through proxy
        },
    }
    route := gen.NetworkRoute{
        Resolver: etcdRegistrar.Resolver(),
        Route: gen.Route{
            TLS: true,  // force TLS even if resolver says otherwise
        },
    }
    network.AddRoute("prod-.*", route, 100)
    hashtag
    The pprof Tag

    The pprof tag enables the built-in profiler and goroutine labeling; build with go build -tags pprof. This activates:

    • pprof HTTP endpoint at http://localhost:9009/debug/pprof/

    • PID labels on actor goroutines and Alias labels on meta process goroutines for identification in profiler output

    The endpoint address can be customized via environment variables:

    • PPROF_HOST - host to bind (default: localhost)

    • PPROF_PORT - port to listen on (default: 9009)

    The profiler endpoint exposes standard Go profiling data:

    • /debug/pprof/goroutine - Stack traces of all goroutines

    • /debug/pprof/heap - Heap memory allocations

    • /debug/pprof/profile - CPU profile (30-second sample)

    • /debug/pprof/block - Blocking profile (goroutines blocked on synchronization primitives)

    hashtag
    The norecover Tag

    By default, Ergo Framework recovers from panics in actor callbacks to prevent a single misbehaving actor from crashing the entire node. While this improves resilience in production, it can hide bugs during development.

    With norecover, panics propagate normally, providing full stack traces and allowing debuggers to catch the exact failure point. This is particularly useful when:

    • Investigating nil pointer dereferences in message handlers

    • Tracking down type assertion failures

    • Understanding the call sequence leading to a panic

    hashtag
    The trace Tag

    The trace tag enables verbose logging of framework internals; build with go build -tags trace. This produces detailed output about:

    • Process lifecycle events (spawn, terminate, state changes)

    • Message routing decisions

    • Network connection establishment and teardown

    • Supervision tree operations

    To see trace output, also set the node's log level to gen.LogLevelTrace.

    hashtag
    Combining Tags

    Tags can be combined for comprehensive debugging, for example: go build -tags "pprof,norecover,trace". This enables all debugging features simultaneously. Use this combination when investigating complex issues that span multiple subsystems.

    hashtag
    Profiler Integration

    The Go profiler is a powerful tool for understanding runtime behavior. Ergo Framework enhances its usefulness by labeling goroutines with their identifiers.

    hashtag
    Identifying Actor and Meta Process Goroutines

    When built with the pprof tag, each actor's goroutine carries a label containing its PID, and each meta process goroutine carries a label with its Alias. This creates a direct link between the logical identity and the runtime goroutine.

    To find labeled goroutines, fetch the goroutine profile from http://localhost:9009/debug/pprof/goroutine?debug=2 and search the output for the pid or alias labels.


    Meta processes have two goroutines with different roles:

    • "role":"reader" - External Reader goroutine running the Start() method (blocking I/O)

    • "role":"handler" - Actor Handler goroutine processing messages (HandleMessage/HandleCall)

    The output shows:

    • The goroutine's stack trace

    • The identifier label (PID for actors, Alias for meta processes)

    • The exact location in your code where the goroutine is currently executing

    hashtag
    Debugging Stuck Processes

    During graceful shutdown, Ergo Framework logs processes that are taking too long to terminate. These logs include PIDs that can be matched against profiler output.

    Consider a shutdown scenario where the node reports:

    To investigate why <ABC123.0.1005> is stuck:

1. Capture the goroutine profile:

2. Search for the specific PID:

3. Analyze the stack trace to understand what the actor is waiting on.

    The debug=2 parameter provides full stack traces with argument values, which is more verbose than debug=1 but contains more diagnostic information.

    hashtag
    Common Patterns in Stack Traces

    Different types of blocking have characteristic stack traces:

    Blocked on channel receive:

    Blocked on mutex:

    Blocked on network I/O:

    Blocked on synchronous call (waiting for response):

    Understanding these patterns helps quickly identify the root cause of stuck processes.

    hashtag
    Shutdown Diagnostics

    Ergo Framework provides built-in diagnostics during graceful shutdown. When ShutdownTimeout is configured (default: 3 minutes), the framework logs pending processes every 5 seconds.

    The shutdown log includes:

    • PID: Process identifier for correlation with profiler

    • State: Current process state (running, sleep, etc.)

    • Queue: Number of messages waiting in the mailbox

    A process with state=running and queue=0 is actively processing something (likely stuck in a callback). A process with state=running and queue>0 is stuck while new messages continue to arrive. A process with state=sleep and queue=0 is idle - during shutdown this typically means the process is waiting for its children to terminate first (normal supervision tree behavior).

    hashtag
    Practical Debugging Scenarios

    hashtag
    Scenario: Message Handler Never Returns

    Symptoms:

    • Process stops responding to messages

    • Other processes waiting on Call timeout

    • Shutdown hangs on specific process

    Investigation:

    1. Note the PID from shutdown logs or observer

    2. Capture goroutine profile with debug=2

    3. Find the goroutine by PID label

    4. Examine the stack trace

    Common causes:

    • Infinite loop in message handler

    • Blocking channel operation

    • Deadlock with another process via synchronous calls

    • External service call without timeout

    Solution approach:

    • Never use blocking operations (channels, mutexes) in actor callbacks

    • Always use timeouts for external calls

    • Use asynchronous messaging patterns where possible

    hashtag
    Scenario: Memory Growth

    Symptoms:

    • Heap size increases over time

    • Process eventually killed by OOM

    Investigation:

1. Capture heap profile:

2. In pprof, use top to see largest allocators:

3. Use list to examine specific functions:

    Common causes:

    • Messages accumulating in mailbox faster than processing

    • Actor state holding references to large data

    • Unbounded caches or buffers in actor state

    hashtag
    Scenario: Distributed Deadlock

    Symptoms:

    • Two or more processes stop responding

    • Circular dependency in synchronous calls

    Investigation:

    1. Identify stuck processes from shutdown logs

    2. For each process, capture its goroutine stack

    3. Look for waitResponse in stack traces (indicates waiting for synchronous call response)

    4. Map the call targets to build a dependency graph

    Prevention:

    • Prefer asynchronous messaging over synchronous calls

    • Design clear hierarchies where calls flow in one direction

    • Use timeouts on all synchronous operations

    • Consider using request-response patterns with explicit message types

    hashtag
    Scenario: Process Crash Investigation

    Symptoms:

    • Process terminates unexpectedly

    • TerminateReasonPanic in logs

    Investigation:

    1. Build with --tags norecover to get full panic stack

    2. Run the scenario that triggers the crash

    3. Examine the complete stack trace

    With norecover, the panic propagates with full context:

    This shows exactly which line in your code triggered the panic.

    hashtag
    Observer Integration

The Observer tool provides a web interface for inspecting running nodes. While not strictly a debugging tool, it complements profiler-based debugging by providing:

    • Real-time process list with state and mailbox sizes

    • Application and supervision tree visualization

    • Network topology view

    • Message inspection capabilities

    Observer runs at http://localhost:9911 by default when included in your node.

    hashtag
    Best Practices

    1. Always use build tags in development: Run with --tags pprof during development to have profiler and goroutine labels available when needed.

    2. Configure reasonable shutdown timeout: A shorter timeout (30-60 seconds) in development helps identify stuck processes quickly.

    3. Use framework logging: The framework's Log() method automatically includes PID/Alias in log output, enabling correlation with profiler data.

    4. Use structured logging: The framework's logging system supports log levels and structured fields. Add context with AddFields() for correlation:

      For scoped logging, use PushFields()/PopFields() to save and restore field sets.

    5. Profile regularly: Periodic profiling during development helps catch performance regressions before production.

    6. Test shutdown paths: Explicitly test graceful shutdown to verify all actors terminate cleanly.
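The scoped logging mentioned in point 4 can be sketched as follows (PushFields/PopFields as named above; check the gen package for exact signatures):

```go
func (a *MyActor) handleBilling() {
    log := a.Log()
    log.PushFields(gen.LogField{Name: "phase", Value: "billing"})
    defer log.PopFields() // restore the previous field set on return

    log.Info("charging card") // carries the "phase" field
}
```

Unlike AddFields/DeleteFields, the push/pop pair restores whatever fields were active before the call, which composes safely across nested handlers.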

    hashtag
    Summary

    Debugging actor systems requires tools that bridge the gap between logical actors and runtime goroutines. Ergo Framework provides this bridge through:

    • Build tags that enable profiling and diagnostics without production overhead

    • Goroutine labels that link runtime goroutines to their actor (PID) and meta process (Alias) identities

    • Shutdown diagnostics that identify processes preventing clean termination

    • Observer integration for visual inspection of running systems

    Combined with Go's standard profiling tools, these capabilities enable effective debugging of even complex distributed systems.

    HTTP streaming: Connection must keep HTTP response open and stream events to client. Standard HTTP handlers return immediately - SSE requires long-lived responses.

    Asynchronous writing: Backend actors must be able to push events to the client at any time - notifications, updates, data changes from the actor system.

    This is exactly what meta-processes solve. The SSE connection meta-process holds the HTTP response open. Actor Handler receives messages from backend actors and writes formatted SSE events to the response stream.

    hashtag
    Components

    Two meta-processes work together:

    SSE Handler: Implements http.Handler interface. When HTTP request arrives, sets SSE headers and spawns Connection meta-process. Returns after connection closes.

    SSE Connection: Meta-process managing one SSE connection. Actor Handler receives messages from actors, formats them as SSE events, writes to HTTP response stream. Connection lives until client disconnects or error occurs.

    For client-side connections:

    SSE Client Connection: Meta-process connecting to external SSE endpoint. External Reader continuously reads SSE stream, parses events, sends them to application actors.

    hashtag
    Creating SSE Server

    Use sse.CreateHandler to create handler meta-process:

    Handler options:

    ProcessPool: List of process names that will receive messages from SSE connections. When connection is established, handler round-robins across this pool to select which process handles this connection. If empty, connection sends to parent process.

    Heartbeat: Interval for sending comment heartbeats to keep connection alive. Default 30 seconds. Heartbeats prevent proxies and load balancers from closing idle connections.

    hashtag
    Connection Lifecycle

    When client connects:

    1. HTTP request arrives with Accept: text/event-stream

    2. Handler sets SSE response headers

    3. Handler spawns Connection meta-process

    4. Connection sends MessageConnect to application

    5. Connection blocks waiting for client disconnect

    6. Actor Handler waits for backend messages

    During connection lifetime:

    • Server events: Application sends message -> Actor Handler formats and writes SSE event

    • Heartbeats: Periodic comment lines keep connection alive

    • Connection remains open until client disconnects

    When client disconnects:

    1. HTTP request context is cancelled

    2. Connection sends MessageDisconnect to application

    3. Meta-process terminates

    4. HTTP handler returns

    hashtag
    Messages

    Four message types flow between connections and actors:

    sse.MessageConnect: Sent when connection established.

    Receive this to track new connections:

    sse.MessageDisconnect: Sent when connection closes.

    Receive this to clean up connection state:

    sse.Message: Event to send to client (server) or received from server (client).

    Send events to client:

    Wire format for the above message:

    sse.MessageLastEventID: Sent when client reconnects with Last-Event-ID header.

    Handle reconnection to resume from last event:

    hashtag
    SSE Wire Format

    SSE events follow a simple text format:

    • event: - Event type. Client listens with addEventListener("type", ...). Optional, defaults to "message".

    • id: - Event ID. Client sends as Last-Event-ID header on reconnect. Optional.

    • retry: - Suggested reconnection delay. Client uses this if connection drops. Optional.

    • data: - Event payload. Can span multiple lines, each prefixed with data:. Required.

    • Empty line terminates event.

    The sse.Message struct maps directly to this format. Multi-line data is handled automatically.
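As an illustration of the mapping, a standalone encoder for this wire format (a framework-independent sketch, not the library's internal encoder) fits in a few lines:

```go
package main

import (
	"fmt"
	"strings"
)

// formatSSE encodes one SSE event: optional "event:" and "id:"
// fields, one "data:" line per line of payload, and a blank line
// terminating the event.
func formatSSE(event, id string, data []byte) string {
	var b strings.Builder
	if event != "" {
		fmt.Fprintf(&b, "event: %s\n", event)
	}
	if id != "" {
		fmt.Fprintf(&b, "id: %s\n", id)
	}
	for _, line := range strings.Split(string(data), "\n") {
		fmt.Fprintf(&b, "data: %s\n", line)
	}
	b.WriteString("\n") // empty line terminates the event
	return b.String()
}

func main() {
	// Output:
	// event: update
	// id: 42
	// data: {"temperature": 23.5}
	fmt.Print(formatSSE("update", "42", []byte(`{"temperature": 23.5}`)))
}
```

Multi-line payloads simply produce one `data:` line per line, which is exactly what browsers reassemble on the client side.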

    hashtag
    Client Connections

    Create client-side SSE connections with sse.CreateConnection:

    Connection options:

    URL: SSE server endpoint. Use http:// or https:// scheme.

    Process: Process name that will receive events from server. If empty, sends to parent process.

    Headers: Custom HTTP headers for the request. Useful for authentication.

    LastEventID: Initial Last-Event-ID header value for resuming from specific event.

    ReconnectInterval: Default reconnection delay. Can be overridden by server's retry: field. Default 3 seconds.

    Client connections receive the same message types. External Reader parses SSE stream and sends sse.Message to application:

    hashtag
    Network Transparency

    Connection meta-processes have gen.Alias identifiers that work across the cluster. Any actor on any node can send events to any connection:

    Network transparency makes every SSE connection addressable like any other actor. Backend logic scattered across cluster nodes can push updates to specific clients without intermediaries.

    hashtag
    Process Pool Distribution

    Handler accepts ProcessPool - list of process names to receive connection messages. Handler distributes connections across this pool using round-robin:

    Connection 1 sends to "handler1", connection 2 to "handler2", connection 3 to "handler3", connection 4 to "handler1", etc. This distributes load across multiple handler processes.

    Useful for scaling: spawn multiple handler processes, each managing subset of connections. Prevents single handler from becoming bottleneck.

    hashtag
    Differences from WebSocket

| Aspect | WebSocket | SSE |
| --- | --- | --- |
| Direction | Bidirectional | Server to client only |
| Protocol | Upgrade to ws:// | Standard HTTP streaming |
| Client to server | WriteMessage() | Not supported (use separate HTTP requests) |
| Browser support | Requires WebSocket API | Native EventSource API |
| Reconnection | Manual implementation | Built-in with Last-Event-ID |
| Binary data | Supported | Text only (base64 encode if needed) |
| Proxy support | May require configuration | Works through standard HTTP proxies |

    Choose SSE when:

    • Server pushes updates to clients (notifications, live feeds, dashboards)

    • Clients only need to receive, not send through same connection

    • Working with proxies that may not support WebSocket

    • Want automatic reconnection with event replay

    Choose WebSocket when:

    • True bidirectional communication needed

    • Binary data transfer required

    • Low latency in both directions critical

    Meta-Process

    A meta-process solves a specific problem: how to integrate blocking I/O with the actor model without breaking its guarantees. It runs two goroutines - one executes your blocking I/O code, the other handles actor messages. This separation preserves sequential message processing while allowing continuous external I/O operations.

    Meta-processes are owned by their parent process. When the parent terminates, all its meta-processes terminate with it. This dependency is by design - meta-processes extend the parent's capabilities rather than existing as independent entities in the supervision tree.

    hashtag
    The Problem

    Actors work sequentially. One message arrives, gets processed, completes. Next message. This simplicity eliminates race conditions and makes reasoning straightforward.

    Blocking I/O breaks this model. Call net.Listener.Accept() in a message handler and the actor freezes. The goroutine blocks waiting for connections. Other messages pile up unprocessed. The actor becomes unresponsive.

    The obvious fix fails. Spawn a goroutine for Accept() and now two goroutines access the actor's state concurrently. You need locks. The sequential guarantee vanishes. The actor model collapses into traditional concurrent programming with all its complexity.

    Meta-processes preserve both. One goroutine blocks on I/O. Another goroutine processes messages sequentially. Neither interferes with the other.

    hashtag
    Two Goroutines, Two Purposes

    When a meta-process starts, the framework launches two goroutines:

    External Reader: Runs your Start() method from beginning to end. This goroutine is meant for blocking operations - Accept() loops, ReadFrom() calls, reading from pipes. When external events occur, this goroutine sends messages into the actor system using Send(). It never processes incoming messages.

    Actor Handler: Created on-demand when messages arrive in the mailbox. Processes messages sequentially by calling your HandleMessage() and HandleCall() methods. When the mailbox empties, this goroutine terminates. Next time messages arrive, a new actor handler spawns. This goroutine never does I/O directly - it handles requests from actors.

    The External Reader runs continuously from spawn until termination. The Actor Handler comes and goes based on message traffic.

    hashtag
    Why Regular Processes Cannot Do This

| Aspect | Process | Meta-Process |
| --- | --- | --- |
| Goroutines | One per process | Two per meta-process |
| Identifier type | gen.PID | gen.Alias |
| States | Init → Sleep → Running → Terminated | Sleep → Running → Terminated |
| Spawn children | Processes and meta-processes | Meta-processes only |
| Synchronous calls | Can make calls with Call() | Cannot make calls |
| Links and monitors | Can create and receive | Can only receive |
| Message queues | 4 queues (urgent, system, main, log) | 2 queues (system, main) |

    Processes have one goroutine that must handle everything. If it blocks on I/O, message processing stops. If it spawns additional goroutines for I/O, the actor model breaks.

    Meta-processes separate concerns. The External Reader handles I/O. The Actor Handler handles messages. Both run independently.

    hashtag
    Restrictions Explained

    Meta-processes cannot make synchronous calls. Which goroutine should block waiting for the response? The External Reader is blocked on external I/O. The Actor Handler might not be running. Neither can reliably wait for responses.

    Meta-processes cannot create links or monitors. When a linked process terminates, it sends an exit signal as a message. The Actor Handler processes messages, but only when running. Signals could be delayed or lost if the Actor Handler is not active. Incoming links and monitors work because other processes send signals that queue in the mailbox. Creating outgoing links requires guarantees that meta-processes cannot provide.

    These are not arbitrary limitations. They follow from having two goroutines with distinct responsibilities.

    hashtag
    Behavior Implementation

    Init() runs once during creation. Initialize state, store the MetaProcess reference, prepare resources. Return an error to prevent spawning.

    Start() runs in the External Reader. This is where your blocking I/O lives. Loop forever accepting connections. Block reading datagrams. Read from pipes. When Start() returns, the meta-process terminates.

    HandleMessage() processes regular messages sent by actors. Runs in the Actor Handler. Return nil to continue, return an error to terminate.

    HandleCall() processes synchronous requests from actors. Return (result, nil) to send the result back. Return (nil, error) to send an error. The framework handles the response automatically.

    Terminate() runs during shutdown regardless of how termination occurred. Close resources, flush buffers, clean up. Do not block or panic here.

    HandleInspect() returns diagnostic information as string key-value pairs. Used by monitoring tools. Inspect requests are sent to the system queue (high priority) and processed before regular messages. You can inspect meta processes from within a process context using process.InspectMeta(alias) or directly from the node using node.InspectMeta(alias). Both methods only work for local meta processes (same node).

    hashtag
    Three States

    Sleep: External Reader is running (usually blocked on I/O), Actor Handler does not exist. Mailbox may contain messages waiting to be processed. This is the resting state when no actors are communicating with the meta-process.

    Running: Both goroutines active. External Reader continues I/O operations. Actor Handler processes messages from the mailbox. Both work simultaneously without blocking each other.

    Terminated: Both goroutines stopped. Start() returned and Actor Handler completed its final message.

    Transitions are automatic. Message arrives → Actor Handler spawns → Sleep becomes Running. Mailbox empties → Actor Handler exits → Running becomes Sleep. Start() returns → Terminated regardless of current state.

    hashtag
    Data Flow

    The External Reader blocks reading while the Actor Handler simultaneously blocks writing. Two blocking operations, two goroutines, neither prevents the other.

    hashtag
    Creating Meta-Processes

    Define your behavior:
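As a rough sketch, a UDP reader behavior might look like this (method names follow the descriptions above; gen.MetaProcess methods such as Send() and Parent() are used as this chapter describes them - check the gen package for the exact signatures):

```go
package server

import (
	"net"

	"ergo.services/ergo/gen"
)

// udpServer streams incoming datagrams to its parent process.
type udpServer struct {
	process gen.MetaProcess
	conn    net.PacketConn
}

func (u *udpServer) Init(process gen.MetaProcess) error {
	u.process = process // keep the reference for Send()/Parent()
	conn, err := net.ListenPacket("udp", ":12345")
	if err != nil {
		return err // a non-nil error prevents spawning
	}
	u.conn = conn
	return nil
}

// Start runs in the External Reader - blocking I/O belongs here.
func (u *udpServer) Start() error {
	buf := make([]byte, 65535)
	for {
		n, _, err := u.conn.ReadFrom(buf) // blocks
		if err != nil {
			return err // returning terminates the meta-process
		}
		datagram := make([]byte, n)
		copy(datagram, buf[:n])
		// translate the external event into an actor message
		u.process.Send(u.process.Parent(), datagram)
	}
}

// HandleMessage and HandleCall run in the Actor Handler.
func (u *udpServer) HandleMessage(from gen.PID, message any) error {
	return nil
}

func (u *udpServer) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
	return nil, nil
}

func (u *udpServer) Terminate(reason error) {
	u.conn.Close() // release the socket; do not block here
}

func (u *udpServer) HandleInspect(from gen.PID, item ...string) map[string]string {
	return map[string]string{"listen": u.conn.LocalAddr().String()}
}
```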

    Spawn from a process:
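A sketch using SpawnMeta (the same method the SSE examples in this documentation use), assuming a hypothetical udpServer type implementing the meta behavior:

```go
func (s *Server) Init(args ...any) error {
	// udpServer is a hypothetical gen.MetaBehavior implementation
	id, err := s.SpawnMeta(&udpServer{}, gen.MetaOptions{})
	if err != nil {
		return err
	}
	s.Log().Info("started UDP meta-process %s", id)
	return nil
}
```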

    The meta-process lives as long as its parent lives. When Server terminates, the UDP server terminates automatically.

    hashtag
    State-Based Operations

    Different operations are available in different states:

    All states (Sleep, Running, Terminated):

    • Send(), SendWithPriority() - External Reader sends in Sleep, Actor Handler sends in Running

    • ID(), Parent() - Identity never changes

    Running only:

    • SendResponse(), SendResponseError() - Only Actor Handler has the gen.Ref from HandleCall()

    • SetSendPriority(), SetCompression() - Actor Handler controls these

    Sleep and Running (not Terminated):

    • Spawn() - Both goroutines can spawn child meta-processes

    The External Reader operates in Sleep state and has minimal capabilities - just sending messages and spawning children. The Actor Handler operates in Running state and has full capabilities for processing requests.

    hashtag
    Shared State

    Both goroutines access the same struct fields. Use atomic operations for shared counters and flags:
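A framework-independent sketch of the pattern - a packet counter bumped on one side and a shutdown flag set on the other, mirroring the External Reader and Actor Handler roles:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// metaState holds fields shared by both goroutines; atomic types
// make the accesses safe without a mutex.
type metaState struct {
	packets atomic.Int64 // bumped by the reader-side goroutine
	closing atomic.Bool  // set by the handler-side goroutine
}

func main() {
	var s metaState
	var wg sync.WaitGroup

	wg.Add(2)
	go func() { // reader side: count "incoming packets"
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			if s.closing.Load() {
				return // stop once shutdown was requested
			}
			s.packets.Add(1)
		}
	}()
	go func() { // handler side: request shutdown
		defer wg.Done()
		s.closing.Store(true)
	}()
	wg.Wait()

	fmt.Println(s.packets.Load() <= 1000) // always true
}
```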

    Avoid complex synchronization. If you need mutexes, the design probably belongs in a regular process with meta-processes handling only I/O.

    hashtag
    Common Patterns

    External events to actors: External Reader reads events, sends them to actors for processing.

    Actor-controlled I/O: Actors send commands, Actor Handler executes them against external resources.

    Full-duplex communication: External Reader reads, Actor Handler writes, both operate on the same connection.

    Server accepting connections: External Reader accepts connections, spawns child meta-processes for each.

    hashtag
    When to Use Meta-Processes

    Use meta-processes when:

    • Operating on blocking I/O (TCP accept, UDP read, pipe read, file read)

    • Bridging external event sources with actors (monitoring filesystems, listening to OS signals)

    • Wrapping synchronous APIs that cannot be made asynchronous

    Do not use meta-processes when:

    • Implementing business logic

    • Managing application state

    • Coordinating between actors

    • Processing messages that do not involve blocking I/O

    Meta-processes sit at the boundary between the external world and the actor system. They translate blocking operations into asynchronous messages and execute actor commands using blocking APIs. Regular processes implement everything else.

For complete examples, see TCP, UDP, Web, and Port.

    go run --tags pprof ./cmd
    go run --tags norecover ./cmd
    go run --tags trace ./cmd
    options := gen.NodeOptions{
        Log: gen.LogOptions{
            Level: gen.LogLevelTrace,
        },
    }
    go run --tags "pprof,norecover,trace" ./cmd
    # Find actor goroutines by PID
    curl -s "http://localhost:9009/debug/pprof/goroutine?debug=1" | grep -B5 'labels:.*pid'
    
    # Find meta process goroutines by Alias
    curl -s "http://localhost:9009/debug/pprof/goroutine?debug=1" | grep -B5 'labels:.*meta'
    1 @ 0x100c17fa0 0x100c18abc 0x100c19def ...
    # labels: {"pid":"<ABC123.0.1005>"}
    #   main.(*Worker).HandleMessage+0x27  /path/worker.go:45
    1 @ 0x100c17fa0 0x100c18abc 0x100c19def ...
    # labels: {"meta":"Alias#<ABC123.0.1.2>", "role":"reader"}
    #   main.(*TCPServer).Start+0x1bc  /path/tcp_server.go:52
    [warning] shutdown: waiting for 3 processes
    [warning]   <ABC123.0.1005> state=running queue=5
    [warning]   <ABC123.0.1012> state=running queue=0
    [warning]   <ABC123.0.1018> state=sleep queue=0
    curl -s "http://localhost:9009/debug/pprof/goroutine?debug=2" > goroutines.txt
    grep -A30 'pid.*ABC123.0.1005' goroutines.txt
    runtime.chanrecv1
        /usr/local/go/src/runtime/chan.go:442
    sync.(*Mutex).Lock
        /usr/local/go/src/sync/mutex.go:81
    internal/poll.(*FD).Read
        /usr/local/go/src/internal/poll/fd_unix.go:163
    ergo.services/ergo/node.(*process).waitResponse
        /path/node/process.go:1961
    options := gen.NodeOptions{
        ShutdownTimeout: 30 * time.Second, // shorter timeout for debugging
    }
    curl -s "http://localhost:9009/debug/pprof/heap" > heap.prof
    go tool pprof heap.prof
    (pprof) top 10
    (pprof) list HandleMessage
    panic: runtime error: invalid memory address or nil pointer dereference
    
    goroutine 42 [running]:
    main.(*MyActor).HandleMessage(0x140001a2000, {0x100d12345, 0x140001b0000})
        /path/myactor.go:45 +0x1bc
    type WebService struct {
        act.Actor
        connections map[gen.Alias]bool
    }
    
    func (w *WebService) Init(args ...any) error {
        w.connections = make(map[gen.Alias]bool)
    
        // Create SSE handler
        sseHandler := sse.CreateHandler(sse.HandlerOptions{
            ProcessPool: []gen.Atom{"sse-handler"},
            Heartbeat:   30 * time.Second,
        })
    
        // Spawn handler meta-process
        _, err := w.SpawnMeta(sseHandler, gen.MetaOptions{})
        if err != nil {
            return err
        }
    
        // Register with HTTP mux
        mux := http.NewServeMux()
        mux.Handle("/events", sseHandler)
    
        // Create web server
        server, err := meta.CreateWebServer(meta.WebServerOptions{
            Host:    "localhost",
            Port:    8080,
            Handler: mux,
        })
        if err != nil {
            return err
        }
    
        _, err = w.SpawnMeta(server, gen.MetaOptions{})
        return err
    }
    type MessageConnect struct {
        ID         gen.Alias      // Connection meta-process identifier
        RemoteAddr net.Addr       // Client address
        LocalAddr  net.Addr       // Server address
        Request    *http.Request  // Original HTTP request
    }
    func (h *Handler) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case sse.MessageConnect:
            h.connections[m.ID] = true
            h.Log().Info("Client connected: %s from %s", m.ID, m.RemoteAddr)
    
            // Send welcome event
            h.SendAlias(m.ID, sse.Message{
                Event: "welcome",
                Data:  []byte("Connected successfully"),
            })
        }
        return nil
    }
    type MessageDisconnect struct {
        ID gen.Alias  // Connection meta-process identifier
    }
    case sse.MessageDisconnect:
        delete(h.connections, m.ID)
        h.Log().Info("Client disconnected: %s", m.ID)
    type Message struct {
        ID    gen.Alias  // Connection identifier
        Event string     // Event type (optional)
        Data  []byte     // Event data (can be multi-line)
        MsgID string     // Event ID for reconnection (optional)
        Retry int        // Retry hint in milliseconds (optional)
    }
    // Simple data event
    h.SendAlias(connID, sse.Message{
        Data: []byte("Hello, client!"),
    })
    
    // Named event with ID
    h.SendAlias(connID, sse.Message{
        Event: "update",
        Data:  []byte(`{"temperature": 23.5}`),
        MsgID: "42",
    })
    
    // Broadcast to all connections
    for connID := range h.connections {
        h.SendAlias(connID, sse.Message{
            Event: "broadcast",
            Data:  []byte("Server announcement"),
        })
    }
    event: update
    id: 42
    data: {"temperature": 23.5}
    
    type MessageLastEventID struct {
        ID          gen.Alias  // Connection identifier
        LastEventID string     // ID from client header
    }
    case sse.MessageLastEventID:
        h.Log().Info("Client reconnected, last event: %s", m.LastEventID)
        // Send missed events since LastEventID
        h.sendMissedEvents(m.ID, m.LastEventID)
    event: <event-type>
    id: <event-id>
    retry: <milliseconds>
    data: <line1>
    data: <line2>
    
    func (c *Client) Init(args ...any) error {
        conn := sse.CreateConnection(sse.ConnectionOptions{
            URL:               url.URL{Scheme: "http", Host: "server:8080", Path: "/events"},
            Process:           "event-handler",
            Headers:           http.Header{"Authorization": []string{"Bearer token"}},
            LastEventID:       "42",
            ReconnectInterval: 5 * time.Second,
        })
    
        connID, err := c.SpawnMeta(conn, gen.MetaOptions{})
        if err != nil {
            return err
        }
    
        c.Log().Info("Connected to SSE server: %s", connID)
        return nil
    }
    func (h *EventHandler) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case sse.MessageConnect:
            h.Log().Info("Connected to server")
    
        case sse.Message:
            h.Log().Info("Event: %s, Data: %s", m.Event, string(m.Data))
    
        case sse.MessageDisconnect:
            h.Log().Info("Disconnected from server")
        }
        return nil
    }
    // Actor on node1 sends to connection on node2
    actor.SendAlias(connectionAlias, sse.Message{
        Event: "notification",
        Data:  []byte("Update from backend service"),
    })
    sseHandler := sse.CreateHandler(sse.HandlerOptions{
        ProcessPool: []gen.Atom{"handler1", "handler2", "handler3"},
    })

    Goroutine blocking events

    /debug/pprof/mutex

    Mutex contention

    func (a *MyActor) HandleMessage(from gen.PID, message any) error {
        log := a.Log()
        log.AddFields(
            gen.LogField{Name: "request_id", Value: requestID},
            gen.LogField{Name: "user_id", Value: userID},
        )
        defer log.DeleteFields("request_id", "user_id")
    
        log.Info("processing request")
        // all log messages now include request_id and user_id
        return nil
    }

    gen.PID

    gen.Alias

    States

    Init → Sleep → Running → Terminated

    Sleep → Running → Terminated

    Spawn children

    Processes and meta-processes

    Meta-processes only

    Synchronous calls

    Can make calls with Call()

    Cannot make calls

    Links and monitors

    Can create and receive

    Can only receive

    Env(), EnvList(), EnvDefault() - Configuration access
  • Log() - Logging always available

  • SendPriority(), Compression() - Read settings

  • Implementing network servers where accept loop must run continuously

    Goroutines

    One per process

    Two per meta-process

    Message queues

    4 queues (urgent, system, main, log)

    2 queues (system, main)

    TCP
    UDP
    Web
    Port

    Identifier type

    Client to server

    WriteMessage()

    Not supported (use separate HTTP requests)

    Browser support

    Requires WebSocket API

    Native EventSource API

    Reconnection

    Manual implementation

    Built-in with Last-Event-ID

    Binary data

    Supported

    Text only (base64 encode if needed)

    Proxy support

    May require configuration

    Works through standard HTTP proxies

    Important Delivery Flag

    Guaranteed message delivery with acknowledgment

    In the actor model, messages are typically fire-and-forget. You send a message, and it either arrives or it doesn't. For local communication, errors are immediate - if the process doesn't exist or the mailbox is full, Send returns an error. But for remote communication, Send succeeds as soon as the message reaches the network layer. You don't know if it arrived at the remote node, if the target process exists, or if the mailbox had space.

    This works fine for many scenarios. Asynchronous messaging doesn't require confirmation. Actors process what arrives and ignore what doesn't. Systems are resilient because actors don't wait for acknowledgments - they keep working.

    But some operations need certainty. A payment authorization must definitely be recorded or definitely fail - "maybe it worked" isn't acceptable. A distributed transaction coordinator needs to know that all participants received the commit message before proceeding. Critical state updates can't be silently lost.

    Important Delivery provides guaranteed message delivery through acknowledgment. When you send with the important flag, the framework tracks the message, waits for confirmation from the recipient, and reports errors if delivery fails.

    hashtag
    The Problem: Network Opacity

    Without important delivery, remote communication is opaque:

    The remote Send succeeds even if:

    • The remote process doesn't exist

    • The remote process's mailbox is full

    • The remote node received the message but dropped it

    You only discover problems through absence - no response arrives, timeouts fire, but you don't know why. Did the request get lost? Did the process crash? Is it just slow?

    hashtag
    The Solution: Confirmed Delivery

    Important delivery makes remote communication transparent - errors are immediate, just like local:

    The framework sends the message, waits for acknowledgment from the remote node, and reports the outcome. Either the message is in the recipient's mailbox (success) or you get an error explaining what went wrong (failure). No ambiguity.

    hashtag
    How to Use Important Delivery

    There are two ways to enable important delivery:

    Method 1: Per-message explicit methods

    Use SendImportant and CallImportant instead of Send and Call:
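A sketch from inside an actor (the argument forms mirror Send and Call; the Debit and Authorize message types here are hypothetical):

```go
// Per-message important delivery: the error reflects the actual
// delivery outcome, even across the network.
err := a.SendImportant(gen.ProcessID{Name: "ledger", Node: "node2@host"}, Debit{Amount: 100})
if err != nil {
	// delivery definitively failed: unknown process, full mailbox,
	// or unreachable node - handle it explicitly
	return err
}

// CallImportant fails immediately if the request cannot be
// delivered; otherwise it waits for the response as usual.
result, err := a.CallImportant(remote, Authorize{TxID: tx})
```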

    Method 2: Process-level flag

    Set the important delivery flag on the process - all outgoing messages use important delivery:
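A sketch of the process-level flag (the setter name is assumed from this chapter's description; see gen.Process for the exact API):

```go
// Enable important delivery for every outgoing message.
a.SetImportantDelivery(true)

// Plain Send/Call now carry the important flag automatically;
// Commit is a hypothetical message type.
if err := a.Send("coordinator", Commit{TxID: tx}); err != nil {
	a.Log().Error("commit not delivered: %s", err)
}
```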

    The process-level flag affects all outgoing messages: Send, SendPID, SendProcessID, SendAlias, and Call requests. You don't need to use special methods - regular Send and Call automatically include the important flag.

    Use the flag when the process primarily deals with critical messages. Use explicit methods when only specific messages require guarantees.

    hashtag
    How Important Delivery Works

    hashtag
    Send with Important Delivery

    Here's what happens when you send a message with important delivery:

    The sender blocks until the acknowledgment arrives. The remote node attempts delivery and sends either success (ACK) or failure (error). The sender's SendImportant unblocks with the result.

    For local sends, the behavior is identical to regular Send - immediate error if the process doesn't exist or mailbox is full. The important flag only affects remote sends.

    hashtag
    Call with Important Delivery

    Call requests already have a response channel (the caller waits for HandleCall to return), so important delivery works differently. The ACK is only sent if there's an error - if delivery succeeds, no ACK is sent, and the caller waits for the actual response:
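    A caller-side sketch:

    ```go
    // CallImportant: delivery failures surface immediately instead of as timeouts.
    balance, err := process.CallImportant(accountPID, GetBalance{Account: acc})
    if err != nil {
        if errors.Is(err, gen.ErrProcessUnknown) {
            // known immediately - the process is gone, no need to wait out a timeout
        }
        return err
    }
    ```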

    The key difference from regular Call: with CallImportant, if the remote process doesn't exist or its mailbox is full, you get an immediate error instead of waiting for timeout. If delivery succeeds, you wait for the response just like regular Call.

    Without the important flag, ErrProcessUnknown looks like a timeout - you can't tell if the process is slow, dead, or never existed. With important delivery, you know immediately.

    hashtag
    Combining Call and Response Delivery

    Things get interesting when you combine important delivery on requests with important delivery on responses. There are four combinations, each with different guarantees.

    hashtag
    Regular Call + Regular Response

    Guarantees: None. Request may be lost. Response may be lost. Timeout is ambiguous.

    Use case: Fast, non-critical operations where occasional loss is acceptable.

    hashtag
    Regular Call + Important Response (RR-2PC)

    Guarantees: Response delivery is confirmed. If the handler returns a result, the caller will receive it (or get an error if delivery fails). Request delivery is not confirmed - the handler might never receive the request.

    Protocol name: RR-2PC (Response-Reliable Two-Phase Commit)

    Use case: The handler's work is critical, the caller must know if it succeeded. Example: committing a transaction. If the transaction commits, the caller must know. But it's okay if the request gets lost (request is idempotent, can be retried).

    How it works:
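    A handler-side sketch, assuming the deferred-response pattern (return nil from HandleCall and reply manually; the SendResponseImportant signature is assumed to mirror SendResponse, and commit/rollback are hypothetical helpers):

    ```go
    func (t *TxManager) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
        result, err := t.commit(request) // hypothetical business logic
        if err != nil {
            return nil, err
        }
        // Important response: blocks until the caller acknowledges receipt.
        if err := t.SendResponseImportant(from, ref, result); err != nil {
            // caller timed out or terminated - it never saw the result
            t.rollback(request) // hypothetical compensation
        }
        return nil, nil // response already sent manually
    }
    ```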

    The handler blocks after processing until the caller acknowledges the response. If the caller crashes before sending ACK, the handler's SendResponseImportant returns ErrResponseIgnored or ErrTimeout.

    The request has no guarantee - it might be lost, and the caller would timeout. But if the handler processed the request and sends a response, that response is guaranteed to be delivered.

    hashtag
    Important Call + Regular Response

    Guarantees: Request delivery is confirmed. The handler will receive the request (or caller gets an error immediately). Response delivery is not confirmed - response may be lost.

    Use case: The handler must receive the request, but the response is less critical or can be retried. Example: triggering a background job. The job must start, but if the status response is lost, the caller can query status later.

    How it works:
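    A caller-side sketch (StartJob and the later status query are illustrative):

    ```go
    status, err := process.CallImportant(jobRunnerPID, StartJob{ID: jobID})
    switch {
    case err == nil:
        // job started, status received
    case errors.Is(err, gen.ErrTimeout):
        // the runner definitely received the request - only the response was lost;
        // query job status later instead of re-submitting
    default:
        // the request was never delivered - safe to retry
    }
    ```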

    The caller gets immediate confirmation that the request arrived, then waits for the response. If the response gets lost, the caller times out - but knows the handler received and processed the request.

    hashtag
    Important Call + Important Response (FR-2PC)

    Guarantees: Both request and response delivery are confirmed. The handler definitely receives the request, and the caller definitely receives the response. No ambiguity at any point.

    Protocol name: FR-2PC (Fully-Reliable Two-Phase Commit)

    Use case: Critical operations where both request and response must be guaranteed. Example: distributed transaction commit coordination, financial operations, critical state synchronization.

    How it works:
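    A sketch combining both sides (type names are illustrative):

    ```go
    // Coordinator side: request delivery is confirmed, then waits for the response.
    func (c *Coordinator) prepare(participant gen.PID, txID string) error {
        _, err := c.CallImportant(participant, Prepare{Tx: txID})
        return err // nil only if the request was delivered and a response arrived
    }

    // Participant side: reply with a confirmed response (deferred-reply pattern).
    func (p *Participant) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
        if err := p.SendResponseImportant(from, ref, Vote{OK: true}); err != nil {
            // the coordinator never received the vote
        }
        return nil, nil
    }
    ```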

    With FR-2PC:

    • The caller gets immediate error if request can't be delivered (no ambiguous timeout)

    • If request is delivered, caller waits for response

    • The handler blocks after sending response until caller confirms receipt

    This is the most reliable pattern but also the most expensive. Use it only when guaranteed delivery is essential.

    hashtag
    FR-2PC as Foundation for 3PC

    FR-2PC provides the messaging reliability needed to implement Three-Phase Commit (3PC) and other distributed transaction protocols at the application level.

    Traditional Two-Phase Commit (2PC) has a blocking problem: if the coordinator crashes after participants vote "yes" but before sending commit/abort, participants don't know what to do. They're stuck.

    Three-Phase Commit solves this by adding a pre-commit phase:

    1. Prepare: Can you commit?

    2. Pre-commit: Everyone said yes, get ready to commit

    3. Commit: Now commit

    If the coordinator crashes after pre-commit, participants know the outcome was "commit" and can proceed independently.

    But 3PC only works if messages are reliably delivered. If a pre-commit message gets lost and a participant doesn't receive it, the protocol breaks - some participants think we're committing, others are still waiting.

    FR-2PC guarantees that messages are delivered or errors are reported. This lets you implement 3PC confidently:
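    A coordinator-side sketch of the pre-commit phase (participants and the abort helper are illustrative):

    ```go
    // Phase 2 (pre-commit): every participant must receive it, or we abort.
    for _, participant := range participants {
        if _, err := c.CallImportant(participant, PreCommit{Tx: txID}); err != nil {
            // delivery definitely failed - abort deterministically
            c.abort(txID, participants) // hypothetical rollback
            return err
        }
    }
    ```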

    FR-2PC ensures that:

    • If CallImportant returns nil, the participant received the message

    • If CallImportant returns an error, the participant didn't receive the message

    • No ambiguous timeouts where you don't know if the message arrived

    This determinism is essential for 3PC. Without it, you'd need complex timeout-based recovery that can't distinguish "participant is slow" from "participant is dead" from "message was lost."

    hashtag
    Performance Considerations

    Important delivery adds overhead:

    • Extra round trip: Sender waits for ACK before proceeding

    • Sender blocks: Can't process other messages while waiting

    • Network traffic: Additional ACK messages

    For SendImportant, the sender blocks until ACK arrives (success or error) or timeout. For CallImportant, the sender gets immediate error if delivery fails, or waits for response if delivery succeeds (no extra ACK on success).

    The blocking is process-local - only the sending actor waits. Other actors on the node continue normally. But the sending actor's mailbox isn't processed during the wait.

    Use important delivery selectively:

    • Use for: Critical state updates, transaction coordination, payment processing, data synchronization

    • Don't use for: High-frequency updates, informational messages, monitoring events, retryable operations

    Most actor communication doesn't need guarantees. The actor model is resilient because actors handle partial failure gracefully. Important delivery is for the cases where partial failure isn't acceptable - where certainty is worth the cost.

    hashtag
    Local vs Remote Behavior

    Important delivery only affects remote communication. For local sends:
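    The same code covers both cases (a sketch):

    ```go
    // Local target: behaves exactly like Send - immediate result, no ACK round trip.
    err := process.SendImportant(localPID, Update{Seq: 1})

    // Remote target: blocks until the remote node acknowledges delivery.
    err = process.SendImportant(remotePID, Update{Seq: 1})
    ```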

    Local mailbox operations are synchronous - pushing to the mailbox either succeeds or fails immediately. The important flag is unnecessary because there's no network uncertainty. The framework silently treats local important sends as regular sends.

    This means your code works identically for local and remote processes. You can use SendImportant everywhere without checking if the target is local or remote - the framework optimizes local communication automatically.

    hashtag
    Error Types

    Important delivery produces specific errors:

    ErrProcessUnknown - The remote process doesn't exist. Without important delivery, you'd discover this through timeout. With important delivery, you know immediately.

    ErrProcessMailboxFull - The remote process exists but its mailbox is full. Without important delivery, the message would queue in the network layer or be dropped. With important delivery, you get immediate feedback.

    ErrTimeout - The remote node received the message but didn't send ACK within the timeout period. This is different from Call timeout - it means the node is unresponsive or overloaded.

    ErrResponseIgnored - For important responses, the caller is no longer waiting (timed out or terminated). The response couldn't be delivered. Without important delivery, the handler wouldn't know the response was ignored.

    ErrNoConnection - Cannot establish connection to the remote node. This error occurs for both regular and important sends, but important delivery surfaces it immediately instead of silently queueing.

    These specific errors let you handle different failure modes appropriately - retry for ErrTimeout, provision more resources for ErrProcessMailboxFull, fail immediately for ErrProcessUnknown.
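    A sketch of per-error handling (the retry and backoff helpers are hypothetical):

    ```go
    err := process.SendImportant(target, update)
    switch {
    case err == nil:
        // delivered
    case errors.Is(err, gen.ErrTimeout):
        retryLater(target, update) // node unresponsive - worth retrying
    case errors.Is(err, gen.ErrProcessMailboxFull):
        backoff() // recipient overloaded - slow down
    case errors.Is(err, gen.ErrProcessUnknown), errors.Is(err, gen.ErrNoConnection):
        return err // fail fast - retrying blindly won't help
    }
    ```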

    hashtag
    Summary

    Important delivery trades performance for certainty. Messages are guaranteed to be delivered or errors are reported immediately. Use it when:

    • The operation is critical and must succeed or definitely fail

    • Ambiguous timeouts are unacceptable

    • You're implementing distributed protocols that require guaranteed delivery

    For most actor communication, fire-and-forget messaging is sufficient. The actor model handles uncertainty through supervision, retries, and eventual consistency. Important delivery is for the cases where uncertainty itself is the problem.

    For more on handling synchronous requests, see Sync Request Handling.

    Service Discovering

    How nodes find each other and establish connections

    Service discovery solves a fundamental problem in distributed systems: how does one node find another node when all it has is a name?

    When you send a message to a remote process, the target identifier contains the node name - a gen.PID includes the node where that process runs, a gen.ProcessID specifies both process name and node, and a gen.Alias includes the node. But what does that node name mean in network terms? What IP address? What port? Is TLS required? What protocol versions are supported? Service discovery answers these questions, translating logical node names into concrete connection parameters.

    Metrics

    The metrics actor provides observability for Ergo applications by collecting and exposing runtime statistics in Prometheus format. Instead of manually instrumenting your code with counters and gauges scattered throughout, the metrics actor centralizes telemetry into a single process that exposes an HTTP endpoint for Prometheus to scrape.

    This approach separates monitoring concerns from application logic. Your actors focus on business functionality while the metrics actor handles collection, aggregation, and exposure of operational data. Prometheus or compatible monitoring systems poll the /metrics endpoint periodically, building time-series data for alerting and visualization.

    hashtag
    Why Monitor Actors

    type MetaBehavior interface {
        Init(process MetaProcess) error
        Start() error
        HandleMessage(from PID, message any) error
        HandleCall(from PID, ref Ref, request any) (any, error)
        Terminate(reason error)
        HandleInspect(from PID, item ...string) map[string]string
    }
    type UDPServer struct {
        gen.MetaProcess
        socket net.PacketConn
        target gen.PID
    }
    
    func (u *UDPServer) Init(process gen.MetaProcess) error {
        u.MetaProcess = process
        u.target = process.Parent()
        return nil
    }
    
    func (u *UDPServer) Start() error {
        // External Reader - continuous read loop
        for {
            buf := make([]byte, 65536)
            n, addr, err := u.socket.ReadFrom(buf)
            if err != nil {
                return err
            }
            u.Send(u.target, Datagram{Data: buf[:n], From: addr})
        }
    }
    
    func (u *UDPServer) HandleMessage(from gen.PID, message any) error {
        // Actor Handler - write on demand
        switch msg := message.(type) {
        case SendDatagram:
            u.socket.WriteTo(msg.Data, msg.To)
        }
        return nil
    }
    
    func (u *UDPServer) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
        return nil, nil
    }
    
    func (u *UDPServer) Terminate(reason error) {
        u.socket.Close()
    }
    
    func (u *UDPServer) HandleInspect(from gen.PID, item ...string) map[string]string {
        return map[string]string{"local_addr": u.socket.LocalAddr().String()}
    }
    type Server struct {
        act.Actor
    }
    
    func (s *Server) Init(args ...any) error {
        socket, err := net.ListenPacket("udp", ":8080")
        if err != nil {
            return err
        }
    
        udpServer := &UDPServer{socket: socket}
        alias, err := s.SpawnMeta(udpServer, gen.MetaOptions{})
        if err != nil {
            socket.Close()
            return err
        }
    
        s.Log().Info("UDP server listening on :8080 as %s", alias)
        return nil
    }
    type TCPConnection struct {
        gen.MetaProcess
        conn     net.Conn
        bytesIn  uint64  // accessed by both goroutines
        bytesOut uint64  // accessed by both goroutines
    }
    
    func (t *TCPConnection) Start() error {
        // External Reader
        buf := make([]byte, 4096)
        for {
            n, err := t.conn.Read(buf)
            if err != nil {
                return err
            }
            atomic.AddUint64(&t.bytesIn, uint64(n))
            t.Send(t.Parent(), Data{Bytes: buf[:n]})
        }
    }
    
    func (t *TCPConnection) HandleMessage(from gen.PID, message any) error {
        // Actor Handler
        if msg, ok := message.(Data); ok {
            n, err := t.conn.Write(msg.Bytes)
            atomic.AddUint64(&t.bytesOut, uint64(n))
            return err
        }
        return nil
    }
    
    func (t *TCPConnection) HandleInspect(from gen.PID, item ...string) map[string]string {
        // Actor Handler
        in := atomic.LoadUint64(&t.bytesIn)
        out := atomic.LoadUint64(&t.bytesOut)
        return map[string]string{
            "bytes_in":  fmt.Sprintf("%d", in),
            "bytes_out": fmt.Sprintf("%d", out),
        }
    }
    func (r *FileReader) Start() error {
        file, err := os.Open(r.filename)
        if err != nil {
            return err
        }
        defer file.Close()
    
        scanner := bufio.NewScanner(file)
        for scanner.Scan() {
            r.Send(r.processor, Line{Text: scanner.Text()})
        }
        return scanner.Err()
    }
    func (e *CommandExecutor) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case RunCommand:
            output, err := exec.Command(msg.Cmd, msg.Args...).Output()
            e.Send(from, CommandResult{Output: output, Error: err})
        }
        return nil
    }
    func (t *TCPConnection) Start() error {
        // Continuous reading
        buf := make([]byte, 4096)
        for {
            n, err := t.conn.Read(buf)
            if err != nil {
                return err
            }
            t.Send(t.target, Received{Data: buf[:n]})
        }
    }
    
    func (t *TCPConnection) HandleMessage(from gen.PID, message any) error {
        // On-demand writing
        if msg, ok := message.(Send); ok {
            _, err := t.conn.Write(msg.Data)
            return err
        }
        return nil
    }
    func (t *TCPServer) Start() error {
        for {
            conn, err := t.listener.Accept()
            if err != nil {
                return err
            }
    
            handler := &TCPConnection{conn: conn}
            if _, err := t.Spawn(handler, gen.MetaOptions{}); err != nil {
                conn.Close()
                t.Log().Error("failed to spawn connection handler: %s", err)
            }
        }
    }
    hashtag
    The Discovery Problem

    Consider a simple scenario. Node A wants to send to a process on node B. The process has a gen.PID that includes the node name "[email protected]". That's the logical address, but it's not enough to open a TCP connection. The node needs to translate that into connection parameters:

    • The IP address or hostname to connect to

    • The port number where node B is listening

    • Whether TLS is required for this connection

    • Which handshake and protocol versions node B supports

    • Which acceptor to use if node B has multiple listeners

    This information changes dynamically. Nodes start and stop. Ports change. TLS gets enabled or disabled. You don't want to hardcode these details into your application. You want discovery to happen automatically, and you want it to stay current.

    hashtag
    The Embedded Registrar

    Every node includes a registrar component that handles discovery. When a node starts, its registrar attempts to become a server by binding to port 4499 - TCP on localhost:4499 for registration and UDP on 0.0.0.0:4499 for resolution. If the TCP bind succeeds, the registrar runs in server mode. If the port is already taken (another node is using it), the registrar switches to client mode and connects to the existing server.

    This design means one node per host acts as the discovery server for all other nodes on that host. Whichever node started first becomes the server. The rest are clients.

    When a node's registrar runs in server mode, it:

    • Listens on TCP localhost:4499 for registration from same-host nodes

    • Listens on UDP 0.0.0.0:4499 (all interfaces) for resolution queries from any host

    • Maintains a registry of which nodes are running and how to reach them

    • Responds to queries with current connection information

    When a node's registrar runs in client mode, it:

    • Connects via TCP to the local registrar server at localhost:4499

    • Forwards its own registration to the server over TCP

    • Performs discovery queries via UDP (to localhost for same-host, to remote hosts for cross-host)

    • Maintains the TCP connection until termination (for registration keepalive)

    This dual-mode design provides automatic failover. If the server node terminates, its TCP connections close. The remaining nodes detect the disconnection, and they race to bind port 4499. The winner becomes the new server. The others reconnect as clients. Discovery continues without manual intervention.

    hashtag
    Registration

    When a node starts, it registers with the registrar. This registration happens over the TCP connection (for same-host nodes) or through initial discovery queries (for the server itself).

    What gets registered:

    • Node name (must be unique on the host)

    • List of acceptors this node is running

    • For each acceptor: port number, handshake version, protocol version, TLS flag

    The TCP connection from client to server stays open. It serves two purposes: maintaining registration (if the connection drops, the node is considered dead) and enabling the server to push updates (though the current implementation doesn't use this capability).

    If a node tries to register a name that's already taken, the registrar returns gen.ErrTaken. Node names must be unique within a host. Across hosts, the same name is fine - node names include the hostname for disambiguation.

    hashtag
    Resolution

    When a node needs to connect to a remote node, it queries the registrar for connection information.

    The resolution mechanism depends on whether the querying node is running the registrar in server mode:

    If the node runs the registrar server and the target is on the same host, resolution is a direct function call - no network involved. The server looks up the target in its local registry and returns the acceptor information immediately.

    If the node is a registrar client, resolution uses UDP regardless of whether the target is same-host or cross-host. The node extracts the hostname from the target node name (worker@otherhost becomes otherhost), sends a UDP packet to that host on port 4499, and waits for a response. For same-host queries, this means UDP to localhost:4499. For cross-host queries, it's UDP to the remote host. The registrar server (wherever it is) looks up the node and sends back the acceptor list via UDP reply.

    This UDP-based resolution is stateless. No connection is maintained. Each query is independent. This keeps it lightweight but means there's no push notification when remote nodes change - you only discover changes when you query again. The TCP connection between client and server is used only for registration and keepalive, not for resolution queries.

    The resolution response includes everything needed to establish a connection:

    • Acceptor port number

    • Handshake protocol version

    • Network protocol version

    • TLS flag (whether encryption is required)

    Multiple acceptors are supported. If a node has three acceptors listening on different ports with different configurations, all three appear in the resolution response. The connecting node tries them in order until one succeeds.

    hashtag
    Application Discovery

    Central registrars (etcd and Saturn) provide application discovery - finding which nodes in your cluster are running specific applications. The embedded registrar doesn't support this feature.

    When an application starts on a node, it registers an application route with the registrar:

    The registrar stores this deployment information. Other nodes can then discover where the application is running:
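    A resolution sketch - the method names here (Registrar, Resolver, ResolveApplication) are assumptions about the resolver API; consult your registrar package for the exact interface:

    ```go
    registrar, err := node.Network().Registrar()
    if err != nil {
        return err
    }
    routes, err := registrar.Resolver().ResolveApplication("workers")
    if err != nil {
        return err
    }
    for _, route := range routes {
        // route carries the node name, application state, mode, and weight
    }
    ```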

    The response includes the node name, application state, running mode, and a weight value. Multiple nodes can run the same application - the resolver returns all of them.

    hashtag
    Load Balancing with Weights

    Weights enable intelligent load distribution across application instances.

    When multiple nodes run the same application, each registration includes a weight. Higher weights indicate preference - nodes with more resources, better performance, or strategic positioning get higher weights. When you resolve an application, you get all instances with their weights:

    You choose which instance to use based on your load balancing strategy:

    Weighted random - Randomly select, but favor higher weights. Worker3 gets picked 2x more often than worker1, 4x more than worker2.

    Round-robin with weights - Cycle through instances, but send proportionally more requests to higher-weighted nodes. Send 4 requests to worker3, 2 to worker1, 1 to worker2, then repeat.

    Least-loaded - Track active requests per instance, prefer higher-weight nodes when load is equal.

    Geographic routing - Set weights based on proximity. Same datacenter gets weight 100, same region gets 50, cross-region gets 10.

    The weight is metadata - the registrar doesn't enforce any particular strategy. Your application decides how to interpret weights.
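    The weighted random strategy reduces to a proportional pick; a minimal sketch in plain Go (the Instance type stands in for whatever route structure your resolver returns):

    ```go
    package main

    import (
        "fmt"
        "math/rand"
    )

    // Instance mirrors the fields a resolved application route carries
    // (illustrative names, not the framework's types).
    type Instance struct {
        Node   string
        Weight int
    }

    // pickWeighted selects an instance with probability proportional to its weight.
    func pickWeighted(instances []Instance) Instance {
        total := 0
        for _, in := range instances {
            total += in.Weight
        }
        n := rand.Intn(total)
        for _, in := range instances {
            if n < in.Weight {
                return in
            }
            n -= in.Weight
        }
        return instances[len(instances)-1]
    }

    func main() {
        instances := []Instance{
            {Node: "worker1", Weight: 2},
            {Node: "worker2", Weight: 1},
            {Node: "worker3", Weight: 4},
        }
        counts := map[string]int{}
        for i := 0; i < 7000; i++ {
            counts[pickWeighted(instances).Node]++
        }
        // worker3 (weight 4) should win roughly 2x worker1 and 4x worker2
        fmt.Println(counts["worker3"] > counts["worker1"], counts["worker1"] > counts["worker2"])
    }
    ```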

    hashtag
    Use Cases for Application Discovery

    Service mesh - Applications discover service endpoints dynamically. Your "api" application needs to send requests to the "workers" application. Instead of hardcoding which nodes run workers, you resolve it at runtime. When workers scale up or down, discovery reflects the current topology.

    Job distribution - A scheduler needs to distribute jobs across worker nodes. Resolve the "workers" application, get the list of available instances with their weights, and distribute jobs proportionally. If a worker node goes down, the next resolution returns fewer instances automatically.

    Application migration - You're moving an application from old nodes to new nodes. Start the application on new nodes with low weights. Verify it works correctly. Gradually increase weights on new nodes while decreasing weights on old nodes. Traffic shifts smoothly. Once migration completes, stop the application on old nodes.

    Feature flags - Run experimental versions of an application on a subset of nodes with specific weights. Route a percentage of traffic to the experimental version. If it performs well, increase its weight. If it fails, remove its registration entirely.

    Multi-region deployment - Deploy applications across regions. Use weights to prefer local regions. A node in us-east resolves the application and gets instances from all regions, but us-east instances have weight 100, us-west has weight 20, eu has weight 10. Most traffic stays local, but you can still route to other regions if needed.

    hashtag
    Configuration Management

    Central registrars provide cluster-wide configuration storage. The embedded registrar doesn't support this - each node maintains its own configuration independently.

    Configuration lives in the registrar's key-value store. For etcd, this is etcd's native key-value storage. For Saturn, it's stored in the Raft-replicated state. Any node can read configuration, creating a single source of truth for cluster settings:
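    A read sketch - Config and ConfigItem here are assumptions about the registrar interface; check the gen.Registrar reference for the exact methods:

    ```go
    registrar, err := node.Network().Registrar()
    if err != nil {
        return err
    }
    // Single item
    dbURL, err := registrar.ConfigItem("database_url")
    // Batch read
    settings, err := registrar.Config("feature:new_algorithm:enabled", "batch_size")
    ```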

    Configuration values can be any type - strings, numbers, booleans, nested structures. The registrar encodes them using EDF, so complex configuration is supported.

    hashtag
    Configuration Patterns

    Global configuration - Settings that apply cluster-wide. Database connection strings, external service URLs, feature flags. Store them in the registrar, and all nodes read the same values. When you update a configuration item in the registrar, new nodes get the updated value automatically.

    Per-node configuration - Node-specific settings stored with the node name as a key prefix. Store node:worker1:cpu_limit, node:worker2:cpu_limit separately. Each node reads its own configuration using its name. This enables heterogeneous clusters where nodes have different capabilities.

    Per-application configuration - Settings specific to an application. Store under an application key prefix: app:workers:batch_size, app:workers:concurrency. When the application starts on any node, it reads this configuration from the registrar.

    Environment-based configuration - Different values for dev/staging/production. Use key prefixes: prod:database_url, staging:database_url, dev:database_url. Nodes set an environment variable indicating their environment and read the appropriate keys.

    Configuration hierarchy - Combine multiple patterns with fallbacks. Read app:workers:batch_size, fall back to default:batch_size, fall back to hardcoded default. This provides specificity where needed and defaults everywhere else.
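    The hierarchy pattern reduces to a first-match lookup over key prefixes; a minimal sketch (key names are illustrative):

    ```go
    package main

    import "fmt"

    // lookupConfig returns the value for the first key present in cfg,
    // falling back through the list, then to the hardcoded default.
    func lookupConfig(cfg map[string]any, keys []string, def any) any {
        for _, k := range keys {
            if v, ok := cfg[k]; ok {
                return v
            }
        }
        return def
    }

    func main() {
        cfg := map[string]any{"default:batch_size": 100}
        // "app:workers:batch_size" is absent, so the cluster-wide default wins.
        v := lookupConfig(cfg, []string{"app:workers:batch_size", "default:batch_size"}, 50)
        fmt.Println(v)
    }
    ```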

    hashtag
    Dynamic Configuration Updates

    Configuration in the registrar is static from the framework's perspective - it doesn't push updates to running nodes. When you change a configuration item in etcd or Saturn, running nodes don't see the change automatically. They have the value they read during startup or their last query.

    To implement dynamic configuration updates, use the registrar event system:
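    A subscription sketch - Event() on the registrar and the event handling shown are assumptions; check your registrar package (ergo.services/registrar/etcd or .../saturn) for the exact names:

    ```go
    registrar, err := node.Network().Registrar()
    if err != nil {
        return err
    }
    // Assumed API: the registrar exposes a gen.Event that can be monitored.
    ev, err := registrar.Event()
    if err != nil {
        return err
    }
    if _, err := process.MonitorEvent(ev); err != nil {
        return err
    }
    // Then, in HandleMessage, react to e.g. etcd.EventConfigUpdate,
    // which carries the item name and its new value.
    ```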

    Both etcd and Saturn registrars support events and push notifications immediately when:

    • Configuration changes - EventConfigUpdate with item name and new value

    • Nodes join/leave - EventNodeJoined / EventNodeLeft with node name

    • Applications lifecycle - EventApplicationLoaded, EventApplicationStarted, EventApplicationStopping, EventApplicationStopped, EventApplicationUnloaded with application name, node, weight, and mode

    Each registrar defines its own event types in its package (ergo.services/registrar/etcd or ergo.services/registrar/saturn). The event structures are identical, but you must use the correct package import for your registrar. This lets you react to cluster changes in real-time.

    Embedded registrar doesn't support events.

    With event notifications from etcd or Saturn registrars, nodes learn about configuration changes within milliseconds.

    hashtag
    Use Cases for Configuration Management

    Database connection strings - Instead of deploying configuration files to every node, store the connection string in the registrar. Nodes read it on startup. When you rotate credentials or migrate to a new database, update the registrar. Restart nodes gradually, and they pick up the new connection string automatically. No configuration file deployment needed.

    Feature flags - Enable or disable features dynamically across the cluster. Store feature:new_algorithm:enabled in the registrar. Applications check this flag when deciding which code path to use. Change the flag in the registrar, restart applications (or use events for live updates), and the feature rolls out cluster-wide.

    Capacity planning - Store node capacity information: CPU limits, memory limits, concurrent job limits. Applications read these limits and respect them when distributing work. When you upgrade hardware, update the capacity values in the registrar. Applications discover the new capacity automatically.

    Service discovery integration - Combine application discovery with configuration. Store connection parameters for each application deployment. When you resolve the "workers" application, you get not just the node names but also their specific configurations - which worker pool size, which queue they're processing, which priority level they handle.

    Staged rollouts - Store configuration with version tags. Set config:version to "v2". Nodes read their configuration version on startup. Half your cluster uses v1 configuration, half uses v2. Monitor behavior. If v2 performs better, update all nodes to v2. If it causes problems, roll back to v1. Configuration versioning enables controlled changes.

    Cluster-wide coordination - Store cluster-wide state that multiple nodes need to coordinate on. Leader election metadata, distributed lock information, shared counters. This isn't what the registrar is designed for (use dedicated coordination services for complex coordination), but simple coordination needs can be met with registrar configuration storage.

    hashtag
    Failover and Reliability

    The embedded registrar has built-in automatic failover.

    When a registrar server node terminates:

    1. Its TCP connections to client nodes (on the same host) close

    2. Client nodes detect the disconnection

    3. Each client attempts to bind localhost:4499

    4. The first to succeed becomes the new server

    5. The rest connect to the new server as clients

    6. Everyone re-registers their routes with the new server

    This failover is automatic and takes a few milliseconds. Discovery continues without interruption.

    For cross-host discovery, the same failover mechanism applies to each host independently. If a remote host's registrar server node goes down, another node on that host immediately takes over the server role. From the perspective of nodes on other hosts, discovery to that host continues working - they send UDP queries to the host, and whichever node is currently the registrar server responds. The failover is invisible to external hosts because the UDP queries are addressed to the host (port 4499), not to a specific node.

    hashtag
    Limitations of the Embedded Registrar

    The embedded registrar is minimal by design. It provides route resolution only. What it doesn't provide:

    No application discovery - You can discover where nodes are, but not where specific applications are running. Want to find which nodes are running the "workers" application? You have to query every node individually or maintain that mapping yourself.

    No load balancing metadata - There's no weight system for distributing load across multiple instances of the same application. You can't express that some nodes have more capacity or should receive more traffic.

    No centralized configuration - Configuration lives with each node. There's no cluster-wide config store. If you want to change a setting across the cluster, you modify each node individually through node environment variables or configuration files.

    No event notifications - Discovery is pull-based. You query when you need information. The registrar doesn't push updates when things change. If a node joins or leaves, or an application starts or stops, you only discover the change when you query again.

    No topology awareness - The registrar doesn't understand your cluster structure. It treats all nodes equally. If you have nodes in different datacenters or regions, the registrar provides no metadata to help you route efficiently based on proximity or cost.

    Limited scalability - The UDP query model works for small to medium clusters but doesn't scale to hundreds of nodes efficiently. Cross-host discovery has no caching - every query hits the network. For large clusters, this generates significant network traffic.

    These limitations don't matter for development or small deployments. Two nodes on your laptop? Three nodes in a single datacenter? The embedded registrar works fine. But for production clusters, especially large ones or those requiring dynamic topology, you want the richer feature set of etcd or Saturn registrars.

    hashtag
    External Registrars

    External registrars replace the embedded implementation with centralized discovery services.

    etcd registrar (ergo.services/registrar/etcd) uses etcd as the discovery backend. All nodes register their routes in etcd on startup. All discovery queries go to etcd. This centralizes cluster state: any node can discover any other node, applications can advertise their deployment locations, configuration can be stored in etcd's key-value store.

    The etcd registrar implementation maintains registration through HTTP polling - each node makes a registration request every second to keep its entry alive. This works well for small to medium clusters (50-70 nodes) but creates overhead at larger scales. The polling approach reflects etcd's design for web services rather than continuous cluster communication. Despite this limitation, etcd provides proven reliability, extensive tooling, and operational familiarity for teams already using etcd in their infrastructure.

    Saturn registrar (ergo.services/registrar/saturn) is purpose-built for Ergo clusters. It's an external Raft-based registry designed specifically for the framework's communication patterns. Instead of polling, Saturn maintains persistent connections and pushes updates immediately when cluster state changes. This makes it more efficient at scale - Saturn can handle clusters with thousands of nodes without the overhead of constant HTTP polling. The immediate event propagation means nodes learn about topology changes instantly rather than waiting for the next poll interval.

    Which registrar you choose depends on your deployment:

    • Small clusters (< 10 nodes), same host or trusted network: embedded registrar

    • Medium clusters (10-70 nodes), existing etcd infrastructure: etcd registrar

    • Large clusters (70+ nodes) or real-time requirements: Saturn registrar

    The choice is transparent to application code. You specify the registrar in gen.NodeOptions.Network.Registrar at startup. Everything else - registration, resolution, failover - works the same way regardless of which registrar you use.

    hashtag
    Registrar Configuration

    For the embedded registrar, configuration is minimal:
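A minimal sketch of disabling the server role, assuming the embedded registrar package (ergo.services/ergo/net/registrar) exposes a Create function and an Options struct; only the DisableServer field is documented here, the rest is an assumption:

```go
// Sketch only - option and function names besides DisableServer are
// assumptions; check the registrar package for the exact API.
options := gen.NodeOptions{}
options.Network.Registrar = registrar.Create(registrar.Options{
	DisableServer: true, // always run in client mode, never bind :4499
})
node, err := ergo.StartNode("app@localhost", options)
```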

    Setting DisableServer: true prevents the node from becoming a registrar server. It will always run in client mode. This is useful if you have a dedicated node that should handle discovery and you don't want application nodes competing for the server role.

    For external registrars, configuration includes the service endpoint:
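A hedged sketch of pointing a node at an external registrar; the etcd registrar's option names here (Cluster in particular) are assumptions, so consult ergo.services/registrar/etcd for the real fields:

```go
// Sketch only - etcd.Create and the Cluster field are assumed names.
options := gen.NodeOptions{}
options.Network.Registrar = etcd.Create(etcd.Options{
	Cluster: []string{"http://127.0.0.1:2379"}, // etcd endpoints (assumed)
})
// If the registrar is unreachable, StartNode fails - discovery is essential.
node, err := ergo.StartNode("app@host1", options)
```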

    The node connects to the registrar during startup. If the connection fails, startup fails. Discovery is considered essential - if you can't register and discover, the node can't participate in the cluster, so there's no point in starting.

    hashtag
    Discovery in Practice

    Service discovery is invisible during normal operation. You send messages, make calls, establish links - discovery happens automatically behind the scenes.

    Where discovery becomes visible is during debugging and operations. When connections fail, understanding discovery helps diagnose why. Is the registrar unreachable? Is the target node not registered? Are the acceptor configurations incompatible?

    The registrar provides an Info() method that shows its status:
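A sketch of querying that status, assuming the network stack exposes the registrar through node.Network() as elsewhere in the framework:

```go
// Sketch: assumes node.Network().Registrar() returns the active
// registrar; Info() reports which optional features it supports.
registrar, err := node.Network().Registrar()
if err == nil {
	info := registrar.Info()
	node.Log().Info("registrar info: %v", info)
}
```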

    This information helps you understand what discovery features are available and whether the registrar is functioning correctly.

    For deeper understanding of how discovery integrates with connection establishment and message routing, see the Network Stack chapter. For configuring explicit routes that bypass discovery, see Static Routes.

    Actor systems present unique monitoring challenges. Traditional thread-based applications have predictable resource usage patterns - you monitor thread pools, request queues, and database connections. Actor systems are more dynamic - processes spawn and terminate constantly, messages flow asynchronously through mailboxes, and work distribution depends on supervision trees and message routing.

    The metrics actor addresses this by tracking:

    Process metrics - How many processes exist, how many are running vs. idle vs. zombie. This reveals whether your node is under load or experiencing process leaks.

    Memory metrics - Heap allocation and actual memory used. Actor systems can accumulate small allocations across thousands of processes. Memory metrics help identify whether garbage collection keeps pace with allocation.

    Network metrics - For distributed Ergo clusters, tracking bytes and messages flowing between nodes reveals network bottlenecks, routing inefficiencies, or failing connections.

    Application metrics - How many applications are loaded and running. Applications failing to start or terminating unexpectedly appear in these counts.

    These base metrics provide system-level visibility. For application-specific metrics (request rates, business transactions, custom counters), you extend the metrics actor with your own Prometheus collectors.

    hashtag
    ActorBehavior Interface

    The metrics actor extends gen.ProcessBehavior with a specialized interface:
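An assumed shape of that interface, reconstructed from the callbacks named in this chapter (Init, CollectMetrics, HandleMessage, HandleEvent, HandleInspect); the exact signatures are assumptions, so check the metrics package for the authoritative definition:

```go
// Assumed interface sketch - only Init is mandatory; the embedded
// metrics.Actor provides defaults for the rest.
type ActorBehavior interface {
	gen.ProcessBehavior

	Init(args ...any) (Options, error)
	CollectMetrics()
	HandleMessage(from gen.PID, message any) error
	HandleEvent(event gen.MessageEvent) error
	HandleInspect(from gen.PID, item ...string) map[string]string
}
```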

    Only Init() is required - register your custom metrics and return options; all other callbacks have default implementations you can override as needed.

    You have two main patterns:

    Periodic collection - Implement CollectMetrics() to query state at intervals. Use when metrics reflect current state from other actors or external sources.

    Event-driven updates - Implement HandleMessage() or HandleEvent() to update metrics when events occur. Use when your application produces natural event streams or publishes events.

    hashtag
    How It Works

    When you spawn the metrics actor:

    1. HTTP endpoint starts at the configured host and port. The /metrics endpoint immediately serves Prometheus-formatted data.

    2. Base metrics collect automatically. Node information (processes, memory, CPU) and network statistics (connected nodes, message rates) update at the configured interval.

    3. Custom metrics update via CollectMetrics() callback or HandleMessage() processing, depending on your implementation.

    4. Prometheus scrapes the /metrics endpoint and receives current values for all registered collectors (base + custom).

    The actor handles HTTP serving and registry management. You focus on defining metrics and updating their values.

    hashtag
    Basic Usage

    Spawn the metrics actor like any other process:
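A minimal spawn sketch; the factory name metrics.Factory is an assumption (see the metrics package for the actual constructor), while node.Spawn with gen.ProcessOptions is the standard spawn call:

```go
// Sketch - metrics.Factory is an assumed factory name.
pid, err := node.Spawn(metrics.Factory, gen.ProcessOptions{})
if err != nil {
	panic(err)
}
// /metrics is now being served with the default configuration below.
_ = pid
```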

    Default configuration:

    • Host: localhost

    • Port: 3000

    • CollectInterval: 10 seconds

    The HTTP endpoint starts automatically during initialization. The first metrics collection happens immediately, and subsequent collections run at the configured interval.

    hashtag
    Configuration

    Customize the HTTP endpoint and collection frequency:
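A sketch of a custom metrics actor, assuming the embedding-plus-Init pattern described above; the Options field names (Host, Port, CollectInterval) are the ones documented in this chapter:

```go
// Sketch - assumes metrics.Actor embedding and an Init callback
// that returns metrics.Options.
type MyMetrics struct {
	metrics.Actor
}

func (m *MyMetrics) Init(args ...any) (metrics.Options, error) {
	return metrics.Options{
		Host:            "0.0.0.0",        // accept external scrapes
		Port:            9100,             // avoid colliding with app HTTP servers
		CollectInterval: 15 * time.Second, // match the Prometheus scrape interval
	}, nil
}
```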

    Host determines which network interface the HTTP server binds to. Use "localhost" to restrict access to local connections only (development, testing). Use "0.0.0.0" to accept connections from any interface (production, containerized environments).

    Port should not conflict with other services. Prometheus conventionally uses 9090, but many Ergo applications use that for other purposes. Choose a port that doesn't collide with your application's HTTP servers, Observer UI (default 9911), or other metrics exporters.

    CollectInterval controls how frequently the actor queries node statistics. Shorter intervals provide more granular time-series data but increase CPU usage for collection. Longer intervals reduce overhead but miss short-lived spikes. For most applications, 10-15 seconds balances responsiveness with resource usage. Prometheus typically scrapes every 15-60 seconds, so collecting more frequently than your scrape interval wastes resources.

    hashtag
    Base Metrics

    The metrics actor automatically exposes these Prometheus metrics without any configuration:

    hashtag
    Node Metrics

    | Metric | Type | Description |
    | --- | --- | --- |
    | ergo_node_uptime_seconds | Gauge | Time since node started. Useful for detecting node restarts and calculating availability. |
    | ergo_processes_total | Gauge | Total number of processes including running, idle, and zombie. High counts suggest process leaks or inefficient cleanup. |

    hashtag
    Network Metrics

    | Metric | Type | Labels | Description |
    | --- | --- | --- | --- |
    | ergo_connected_nodes_total | Gauge | - | Number of remote nodes connected. For distributed systems, this should match your expected cluster size. |
    | ergo_remote_node_uptime_seconds | Gauge | node | Uptime of each connected remote node. |

    Network metrics use labels (node="...") to separate per-node data. This creates multiple time series - one per connected node. Prometheus queries can aggregate across labels or filter to specific nodes.

    hashtag
    Custom Metrics

    Extend the metrics actor by embedding metrics.Actor. You register custom Prometheus collectors in Init() and update them via CollectMetrics() or HandleMessage().

    hashtag
    Approach 1: Periodic Collection

    Implement CollectMetrics() to poll state at regular intervals:
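A sketch of the periodic pattern, assuming a CollectMetrics callback on the embedded metrics.Actor and using the Prometheus client library (github.com/prometheus/client_golang); fetchQueueDepth is a hypothetical helper:

```go
// Sketch - CollectMetrics is invoked at each CollectInterval tick.
type MyMetrics struct {
	metrics.Actor
	queueDepth prometheus.Gauge // registered in Init (not shown)
}

func (m *MyMetrics) CollectMetrics() {
	// Poll current state and publish it as a gauge value.
	depth := m.fetchQueueDepth() // hypothetical helper querying another actor
	m.queueDepth.Set(float64(depth))
}
```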

    Use this when metrics reflect state you need to query - current values from other actors, computed aggregates, external API calls.

    hashtag
    Approach 2: Event-Driven Updates

    Update metrics immediately when events occur:
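A sketch of the event-driven pattern, assuming HandleMessage is invoked for messages sent to the metrics actor; RequestHandled and the collector fields are hypothetical:

```go
// Hypothetical event type sent by application actors.
type RequestHandled struct {
	Duration time.Duration
}

func (m *MyMetrics) HandleMessage(from gen.PID, message any) error {
	switch msg := message.(type) {
	case RequestHandled:
		m.requests.Inc()                          // prometheus.Counter
		m.latency.Observe(msg.Duration.Seconds()) // prometheus.Histogram
	}
	return nil
}
```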

    Application actors send events to the metrics actor:
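On the sending side this is an ordinary Send; RequestHandled and metricsPID are illustrative names:

```go
// Inside an application actor: report the event after handling a request.
err := a.Send(metricsPID, RequestHandled{Duration: time.Since(start)})
```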

    Use this when your application naturally produces events. Metrics update in real-time without polling.

    hashtag
    Metric Types

    Prometheus defines four metric types, each suited for different use cases:

    Counter - Monotonically increasing value. Use for events that accumulate (requests processed, errors occurred, bytes sent). Counters never decrease except on process restart. Prometheus queries typically use rate() to calculate per-second rates or increase() for total change over a time window.

    Gauge - Value that can go up or down. Use for current state (active connections, queue depth, memory usage, CPU utilization). Gauges represent snapshots. Prometheus queries can graph them directly or use functions like avg_over_time() to smooth spikes.

    Histogram - Observations bucketed into configurable ranges. Use for latency or size distributions. Histograms let you calculate percentiles (p50, p95, p99) in Prometheus queries. They're more resource-intensive than gauges because they maintain multiple buckets per metric.

    Summary - Similar to histogram but calculates quantiles client-side. Use when you need precise quantiles but can't predict bucket boundaries. Summaries are more expensive than histograms because they track exact quantiles, not approximations.

    For most use cases, counters and gauges suffice. Use histograms when you need latency percentiles. Avoid summaries unless you have specific reasons - histograms are more flexible for Prometheus queries.

    hashtag
    Integration with Prometheus

    Configure Prometheus to scrape the metrics endpoint:
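A minimal prometheus.yml scrape job; the target matches the metrics actor's default host and port (localhost:3000):

```yaml
scrape_configs:
  - job_name: "ergo-node"
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:3000"]  # host:port of the metrics actor
```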

    Prometheus fetches /metrics every 15 seconds, parses the text format, and stores time-series data. You can then query, alert, and visualize metrics using Prometheus queries or Grafana dashboards.

    For dynamic discovery in Kubernetes or cloud environments, use Prometheus service discovery instead of static targets. The metrics actor itself doesn't need to know about Prometheus - it just exposes an HTTP endpoint.

    hashtag
    Observer Integration

    The metrics actor includes built-in Observer support via HandleInspect(). When you inspect it in Observer UI (http://localhost:9911), you see:

    • Total number of registered metrics

    • HTTP endpoint URL for Prometheus scraping

    • Collection interval

    • Current values for all metrics (base + custom)

    This works automatically for custom metrics - register them in Init() and they appear in Observer alongside base metrics.

    If you need custom inspection behavior, override HandleInspect() in your implementation:
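A sketch of an override, assuming the HandleInspect signature follows the framework's inspection callback style (gen.PID sender plus optional item filters, returning key/value pairs for Observer):

```go
// Assumed callback signature - return the key/value pairs
// Observer should display for this process.
func (m *MyMetrics) HandleInspect(from gen.PID, item ...string) map[string]string {
	return map[string]string{
		"endpoint":   "http://localhost:9100/metrics",
		"collectors": "3 custom collectors registered",
	}
}
```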

    For detailed configuration options, see the metrics.Options struct and ActorBehavior interface in the package. For examples of custom metrics, see the example directory.

    Network Transparency

    Making distributed communication feel local

    Network transparency means the location of a process - whether it's in the same goroutine, on the same node, or on a remote node halfway across the world - doesn't change how you interact with it. You send messages the same way. You make calls with the same API. You establish links and monitors with the same methods. The framework handles the complexity of discovering nodes, encoding messages, and routing them across the network.

    This isn't just convenient. It's fundamental to building distributed systems in the actor model. If remote operations looked different from local operations, you'd be constantly checking location and branching your logic. That locality awareness would spread throughout your code, making it brittle and hard to reason about. Network transparency lets you design systems as collections of communicating actors, and deployment topology becomes an operational concern rather than a code concern.

    But transparency has limits. Networks are slower than in-process communication. They fail in ways local operations don't. Messages can be lost. Connections drop. Remote nodes crash or become unreachable. The framework makes remote operations look local, but the network's physical reality still matters.

    hashtag
    What Transparency Means in Practice

    Consider a simple example. You have a gen.PID and you want to send it a message:
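For instance (OrderRequest and its fields are illustrative, matching the example used later in this chapter):

```go
// Inside an actor callback: the same Send works whether pid points
// to a local or a remote process.
if err := a.Send(pid, OrderRequest{ID: 42}); err != nil {
	a.Log().Error("unable to send: %s", err)
}
```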

    This code is identical whether pid points to a local process or a remote one. You don't check. You don't call different methods. You just send.

    Behind the scenes, the framework does different things:

    For a local process: The message is placed directly in the recipient's mailbox queue. The framework checks the priority, selects the appropriate queue (Main, System, or Urgent), and pushes the message. If the process is sleeping, it wakes up. The entire operation happens in microseconds.

    For a remote process: The node extracts the node name from the gen.PID, checks if a connection to that node exists, discovers the node's address if needed, establishes a connection pool if necessary, encodes your OrderRequest using EDF, wraps it in a protocol frame, sends it over TCP, and waits for the remote node to acknowledge delivery. The remote node receives the frame, decodes it, routes it to the recipient's mailbox, and sends an acknowledgment back. This takes milliseconds.

    From your code's perspective, both operations look identical. The framework abstracts the complexity.

    hashtag
    The Transparency Illusion

    Network transparency is an illusion carefully maintained by the framework. Several mechanisms work together to create this effect.

    Unified addressing - Every process has a gen.PID that includes the node name. Local and remote processes have the same identifier structure. You don't need different types for "local process" and "remote process". A gen.PID is just a gen.PID, and it works everywhere.

    Automatic routing - When you send to a process, the framework examines the node portion of the identifier. If it matches the local node, the message is delivered locally. If it doesn't match, the framework initiates discovery to find the remote node and routes the message over the network. You don't trigger this logic explicitly - it happens automatically.

    Location independence - You can receive a gen.PID from anywhere - as a return value, in a message, from a registry lookup - and immediately use it for communication. You don't need to check where it's from or set up connections. The framework handles it.

    Failure semantics - When you send to a local process that doesn't exist, you get an error immediately. When you send to a remote process that doesn't exist, you get... nothing, by default. The message is sent over the network, and if nobody's listening, it's silently dropped. This asymmetry breaks the transparency illusion. The Important delivery flag fixes this: with Important enabled, sending to a missing remote process gives you an immediate error, just like local delivery. The framework makes the network behave like local memory.

    hashtag
    How Messages Cross The Network

    When you send a message to a remote process, what actually happens? The framework performs a complex series of operations to transform your Go value into bytes, transmit them over TCP, and reconstruct them on the receiving side. Understanding this flow helps you design efficient distributed systems and debug problems when they arise.

    The complete message transmission pipeline runs from the moment you call Send to the moment the recipient's HandleMessage is invoked.

    When you send a message, the framework:

    1. Encodes your value using EDF, transforming it into a byte sequence

    2. Compresses it if the message exceeds the compression threshold (default 1024 bytes)

    3. Frames it with protocol headers containing metadata (message type, sender, recipient, priority)

    The remote node reverses this:

    1. Reads the frame from the TCP connection

    2. Decompresses if the compression flag is set

    3. Decodes the bytes back into a Go value using EDF

    This entire pipeline is invisible. You call Send, and the framework executes these steps. The receiving process calls HandleMessage, and it receives your value as if you'd passed it locally.

    hashtag
    EDF: Ergo Data Format

    EDF (Ergo Data Format) is a binary serialization format designed for distributed actor systems. It solves a fundamental problem: how do you serialize Go values - structs, slices, maps, framework types like gen.PID - across the network with the performance of code-generated serializers like Protocol Buffers, but without requiring code generation?

    The answer is dynamic specialization. When you register a type, EDF analyzes its structure and builds specialized encoding and decoding functions specifically for that type. For structs, it creates functions for each field and composes them into a single encoder. This happens once at registration time, not during encoding. When you send a message, EDF uses these pre-built functions - no reflection, no runtime type analysis.

    This approach delivers Protocol Buffers-class performance without .proto files or protoc code generation.

    Registration happens at runtime - no build step, no generated files. You call edf.RegisterTypeOf() in your init() function, and EDF builds the optimized encoders. Framework types like gen.PID, gen.Ref, and gen.Event have native support with specialized encodings. During node handshake, both sides exchange their registered type lists and negotiate short numeric IDs, turning a full type name into 3 bytes on the wire. Field names aren't encoded - only field values in declaration order.

    Performance benchmarks (see benchmarks/serial/) show encoding is 50-100% faster than Protocol Buffers, while decoding is 20-60% slower. The encoding advantage comes from the specialized functions built during registration.

    EDF enforces strict type contracts - both nodes must register identical type definitions. Type identity is the full package path plus type name, not just the type name. For example, Order in package github.com/myapp/orders becomes #github.com/myapp/orders/Order. Two packages with the same type name Order are different types in EDF - this is Go's type system enforced at the protocol level.

    This strict typing is a deliberate design choice that pushes version management to the application level. When you need to evolve a message type, you version it explicitly in your code:
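A minimal, self-contained sketch of this pattern: both versions are distinct types (each would be registered with edf.RegisterTypeOf in a real system), and the receiver routes on the concrete type it gets. OrderV1/OrderV2 and the handler are illustrative:

```go
package main

import "fmt"

// Two explicit versions of the same logical message.
type OrderV1 struct {
	ID    int
	Total float64
}

type OrderV2 struct {
	ID       int
	Total    float64
	Currency string // added in V2
}

// handleOrder routes on the concrete message type, so nodes running
// old and new senders can coexist during a canary rollout.
func handleOrder(message any) string {
	switch m := message.(type) {
	case OrderV1:
		// Old senders: assume a default currency.
		return fmt.Sprintf("order %d: %.2f USD (v1)", m.ID, m.Total)
	case OrderV2:
		return fmt.Sprintf("order %d: %.2f %s (v2)", m.ID, m.Total, m.Currency)
	}
	return "unknown message"
}

func main() {
	fmt.Println(handleOrder(OrderV1{ID: 1, Total: 9.99}))
	fmt.Println(handleOrder(OrderV2{ID: 2, Total: 19.99, Currency: "EUR"}))
}
```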

    Your actors handle both versions, routing logic based on the type received. This approach is essential for canary deployments where old and new versions coexist - each node declares what it understands, and the application code manages compatibility. Protocol-level backward compatibility would hide versioning from your code, making canary rollouts harder to control.

    hashtag
    Type Constraints

    EDF imposes size limits on certain types. These limits balance memory safety with practical message sizes.

    Atoms (gen.Atom) - Maximum 255 bytes. Atoms are used for names - node names, process names, event names. Names longer than 255 bytes are uncommon and likely indicate a design issue. The 255-byte limit keeps name handling efficient.

    Strings - Maximum 65,535 bytes (2^16-1). This covers most string use cases. For larger text (documents, logs, large payloads), use binary encoding ([]byte) instead, which supports up to 4GB.

    Errors - Maximum 32,767 bytes (2^15-1). Error messages longer than 32KB are unusual. If you need to send detailed diagnostic information, use a separate field in your message struct.

    Binary ([]byte) - Maximum 4,294,967,295 bytes (2^32-1, ~4GB). This is the largest single value EDF can encode. Messages containing multi-gigabyte binaries work but are inefficient. Consider chunking large data into multiple messages or using meta processes for streaming.

    Collections (map, array, slice) - Maximum 2^32 elements. A map can have up to 4 billion entries. A slice can have 4 billion elements. These limits are unlikely to be hit in practice - a slice of 4 billion int64 values would consume 32GB of memory.

    These limits are enforced during encoding. If you attempt to encode a 70,000 byte string, the encoder returns an error. The message isn't sent. On the receiving side, if a malicious sender tries to send an oversized value, the decoder rejects it and closes the connection.

    hashtag
    Type Registration Requirements

    For custom types to cross the network, both sending and receiving nodes must register them. Registration tells EDF how to encode and decode the type, and creates a numeric ID that's shared during handshake for efficient encoding.

    Register types during initialization:
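A typical registration sketch using edf.RegisterTypeOf (named earlier in this chapter); OrderRequest and OrderResponse are illustrative message types:

```go
// Register in init() so registration precedes any connection handshake.
func init() {
	if err := edf.RegisterTypeOf(OrderRequest{}); err != nil {
		panic(err)
	}
	if err := edf.RegisterTypeOf(OrderResponse{}); err != nil {
		panic(err)
	}
}
```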

    hashtag
    Registration Requirements

    Only exported fields - Structs must have all fields exported (starting with uppercase). This is by design: exported fields define your actor's contract. When actors communicate - locally or across the network - they exchange messages according to explicit contracts. Unexported fields are implementation details, internal state that shouldn't cross actor boundaries. If registration encounters unexported fields, it fails with "struct Order has unexported field(s)".

    No pointer types - EDF rejects pointer types and structs containing pointer fields. This is by design: pointers are a local memory optimization and shouldn't be part of network contracts. A *Database field is meaningless to a remote actor - it can't dereference your memory address. Pointers express local sharing semantics that don't translate across address spaces.

    For distributed references, use framework types designed for remote access: gen.PID (process reference), gen.Alias (named reference), gen.Ref (call reference). These work across nodes and provide location-independent semantics.

    Nested types must be registered first - If your type contains other custom types, register the inner types before the outer type:
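A sketch using the Person/Address example discussed below:

```go
type Address struct {
	City string
	Zip  string
}

type Person struct {
	Name    string
	Address Address // nested custom type
}

func init() {
	// Inner type first, outer type second - reversing this order
	// fails because Person's schema references Address.
	edf.RegisterTypeOf(Address{})
	edf.RegisterTypeOf(Person{})
}
```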

    The order matters because registration builds the encoding schema by examining fields. When registering Person, EDF sees the Address field. If Address isn't registered yet, registration fails with "type Address must be registered first". If Address is already registered, EDF references its schema, creating an efficient nested encoding.

    hashtag
    Custom Marshaling for Special Cases

    If your type has unexported fields or needs special encoding, implement custom marshaling:
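A self-contained sketch of the standard-interface route: a type with unexported fields implementing Go's encoding.BinaryMarshaler/BinaryUnmarshaler, which EDF accepts as described below (the higher-performance edf.Marshaler variant writes to an io.Writer instead). The Duration type is illustrative:

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Duration has an unexported field, so plain EDF registration would
// fail. The standard marshaling interfaces let it cross the wire as
// an opaque byte payload.
type Duration struct {
	millis uint64
}

func (d Duration) MarshalBinary() ([]byte, error) {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, d.millis)
	return buf, nil
}

func (d *Duration) UnmarshalBinary(data []byte) error {
	if len(data) != 8 {
		return errors.New("invalid length")
	}
	d.millis = binary.BigEndian.Uint64(data)
	return nil
}

func main() {
	in := Duration{millis: 1500}
	raw, _ := in.MarshalBinary()
	var out Duration
	_ = out.UnmarshalBinary(raw)
	fmt.Println(out.millis == in.millis) // true: lossless round-trip
}
```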

    EDF supports both edf.Marshaler/Unmarshaler and Go's standard encoding.BinaryMarshaler/Unmarshaler interfaces. The key difference is performance: edf.Marshaler writes directly to EDF's internal buffer (io.Writer), avoiding intermediate allocations. When you call MarshalEDF(w), the io.Writer is EDF's reusable buffer - your bytes go straight to the wire. With encoding.BinaryMarshaler, you must allocate and return a []byte, which EDF then copies into its buffer.

    For high-throughput message types, prefer edf.Marshaler. For types that implement standard interfaces or rarely-sent messages, encoding.BinaryMarshaler works fine.

    hashtag
    Encoding Errors

    Go's error type is an interface. Encoding an error requires special handling because interfaces don't have a fixed structure.

    Framework errors (gen.ErrProcessUnknown, gen.TerminateReasonNormal, etc.) are pre-registered when the node starts. They have numeric IDs and encode compactly as 3 bytes: type tag 0x9c + 2-byte ID.

    Custom errors need registration:
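A hedged sketch; the edf.RegisterError name and signature are assumptions based on the registration APIs in this chapter:

```go
var ErrInventoryEmpty = errors.New("inventory empty")

func init() {
	// Assumed API shape: registering gives the error a numeric ID so
	// it encodes as 3 bytes and keeps identity across nodes.
	if err := edf.RegisterError(ErrInventoryEmpty); err != nil {
		panic(err)
	}
}
```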

    Registered errors encode as 3 bytes total (type tag + 2-byte ID where ID > 32767). Unregistered errors encode as type tag + 2-byte length + error string bytes. On decoding, the framework checks if it has a local error with that string. If it does, it returns the local error instance. If not, it creates a new error with fmt.Errorf(string).

    This means error identity can be preserved across nodes if both sides register the error. If only one side registers it, you get an error with the correct message but not the same instance. Code comparing errors with errors.Is needs both sides to register for correct behavior.

    hashtag
    Type Registration Timing

    Type registration must happen before connection establishment. During handshake, nodes exchange their registered type lists and error lists. These lists become the encoding dictionaries for that connection.

    If you register a type after a connection is established, that type isn't in the dictionary. Attempting to send a value of that type fails - the encoder can't find it in the shared schema. The only way to use the newly registered type is to disconnect and reconnect, forcing a new handshake that includes the type.

    This is why registration typically happens in init() functions. The registration runs before main(), which runs before node startup, which runs before any connections are established. By the time connections form, all types are registered.

    For dynamic type registration (registering types based on runtime configuration or plugin loading), you have limited options:

    Register before node start - Load your configuration, determine which types you need, register them all, then start the node. This works but requires knowing all types upfront.

    Coordinate reconnection - Register the new type, disconnect existing connections to nodes that need the type, wait for reconnection with new handshake. This is complex and causes temporary communication loss.

    Use custom marshaling - Implement edf.Marshaler/Unmarshaler or encoding.BinaryMarshaler/Unmarshaler. These don't require pre-registration - they work immediately. The tradeoff is you write the encoding logic yourself.

    Most applications register types statically in init() and avoid these complications.

    hashtag
    Compression

    Large messages are automatically compressed to reduce network bandwidth. Compression is transparent - you configure it on the process or node, and the framework applies it automatically when appropriate.

    When compression is enabled, the framework checks the encoded message size before transmission. If it exceeds the compression threshold (default 1024 bytes), the message is compressed using the configured algorithm. The protocol frame's message type (byte 7) is set to 0xc8 (200, protoMessageZ) and byte 8 contains the compression type ID (100=LZW, 101=ZLIB, 102=GZIP), so the receiving node knows to decompress before decoding.

    Configure compression in process options:
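A sketch; the Compression struct layout is assumed from the option names used in this chapter (Type, Level, Threshold) and the constants named below:

```go
// Sketch - field and constant names follow this chapter's text.
options := gen.ProcessOptions{
	Compression: gen.Compression{
		Enable:    true,
		Type:      gen.CompressionTypeGZIP,
		Level:     gen.CompressionLevelDefault,
		Threshold: 1024, // bytes; smaller messages go uncompressed
	},
}
pid, err := node.Spawn(factoryWorker, options)
```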

    Or adjust it dynamically:

    Type determines the compression algorithm. GZIP (ID=102) provides good compression ratios with reasonable speed. ZLIB (ID=101) is similar but with slightly different format. LZW (ID=100) is faster but produces lower compression. Choose based on your CPU/bandwidth tradeoff.

    Level trades compression time for compression ratio. CompressionLevelBestSize produces smaller messages but takes longer. CompressionLevelBestSpeed compresses quickly but produces larger output. CompressionLevelDefault balances both.

    Threshold sets the minimum size for compression. Messages smaller than the threshold aren't compressed, even if compression is enabled. Compressing tiny messages adds overhead without reducing size meaningfully. The default 1024 bytes is reasonable - messages below 1KB go uncompressed, larger messages get compressed.

    Compression happens per-message. Each message is independently compressed or not, based on its size. This keeps compression stateless and allows the receiver to decode messages in any order.

    hashtag
    Caching and Optimization

    During handshake, nodes exchange caching dictionaries for frequently used values. This caching reduces message sizes significantly.

    Atom caching - Node names, process names, event names - these atoms appear repeatedly in messages. Every gen.PID contains the node name. Every message frame contains sender and recipient identifiers. Instead of encoding "mynode@localhost" repeatedly (2-byte length + 16 bytes = 18 bytes), the handshake assigns it a numeric ID. Cached atoms encode as 2 bytes (uint16 ID, where ID > 255). All subsequent uses of that atom encode as the 2-byte ID.

    Type caching - Registered types get numeric IDs. A User struct registered on both sides gets an agreed-upon ID. Messages containing User values encode the ID instead of the full type name and structure. A typical struct name like "#mypackage/User" might be 20-30 bytes - cached, it's 3 bytes (0x83 + 2-byte cache ID where ID > 4095).

    Error caching - Registered errors get IDs. Framework errors are pre-registered with well-known IDs. Custom errors get IDs during handshake. Error responses that might encode as 50+ bytes (error string message) encode as 3 bytes with caching (type tag + 2-byte ID where ID > 32767).

    The caches are bidirectional - both nodes maintain the same mappings. During encoding, the sender looks up the cache and uses IDs. During decoding, the receiver looks up IDs and reconstructs values. The cache persists for the connection lifetime. If the connection drops and reconnects, a new handshake creates a new cache.

    This caching is automatic. You don't manage the cache or invalidate entries. The framework handles it. You just benefit from smaller messages.
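    To see the arithmetic, here is an illustrative sketch (not the actual EDF wire format - the encoding helpers are made up for the example): an uncached atom costs a 2-byte length prefix plus its bytes, while a cached atom costs just its 2-byte uint16 ID agreed during the handshake.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Uncached: 2-byte length prefix followed by the atom's bytes.
func encodeUncached(atom string) []byte {
	out := make([]byte, 2+len(atom))
	binary.BigEndian.PutUint16(out, uint16(len(atom)))
	copy(out[2:], atom)
	return out
}

// Cached: just the 2-byte ID from the handshake dictionary.
func encodeCached(id uint16) []byte {
	out := make([]byte, 2)
	binary.BigEndian.PutUint16(out, id)
	return out
}

func main() {
	cache := map[string]uint16{"mynode@localhost": 256} // IDs start above 255
	atom := "mynode@localhost"

	fmt.Println(len(encodeUncached(atom)))      // 18 bytes uncached
	fmt.Println(len(encodeCached(cache[atom]))) // 2 bytes cached
}
```

    For an atom that appears in every frame, that 18-to-2 reduction compounds quickly.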

    hashtag
    Important Delivery

    Network transparency breaks down when dealing with failures. Sending to a local process that doesn't exist returns an error immediately - the framework checks the process table and sees the PID isn't registered. Sending to a remote process that doesn't exist returns... nothing. The message is encoded, sent to the remote node, and the remote node silently drops it because there's no recipient. Your code doesn't know the process was missing.

    This asymmetry makes debugging difficult. Is the remote process slow to respond, or does it not exist? Did the message get lost in the network, or was it never received? The fire-and-forget nature of normal Send provides no feedback.

    The Important delivery flag fixes this:

    With Important delivery:

    1. The message is sent to the remote node with an Important flag in the frame (bit 7 of priority byte set)

    2. The remote node attempts delivery to the recipient's mailbox

    3. If delivery succeeds, the remote node sends an acknowledgment back

    If the acknowledgment arrives, SendImportant returns nil. If an error response arrives, it returns the error. If the timeout expires, it returns gen.ErrTimeout.

    This gives you the same semantics as local delivery: immediate error feedback when something goes wrong. The network becomes transparent for failures too, not just successes.

    The cost is latency. Normal Send returns immediately - it queues the message and continues. SendImportant blocks until the remote node responds, adding a network round-trip. For messages that must be delivered, this cost is worth it. For best-effort messages where occasional loss is acceptable, stick with normal Send.

    For detailed exploration of Important Delivery patterns, reliability guarantees, and protocols like RR-2PC and FR-2PC, see .

    hashtag
    Protocol Frame Structure

    EDF-encoded messages are wrapped in ENP (Ergo Network Protocol) frames for transmission over TCP.

    Each frame has an 8-byte header:

    • Byte 0: Magic byte (78 for ENP)

    • Byte 1: Protocol version (1 for current version)

    • Bytes 2-5: Frame length (uint32, total size in bytes)

    • Byte 6: Order byte (preserves per-sender message ordering, described below)

    For PID messages, the frame contains:

    • Sender PID (8 bytes - just the ID, node is known from connection)

    • Priority byte (bits 0-6 = priority 0-2, bit 7 = Important delivery flag)

    • Optional reference (8 bytes - first uint64 of Ref.ID, only if Important)

    The order byte (byte 6) preserves message ordering per sender. It's calculated as senderPID.ID % 255, ensuring messages from the same sender always carry the same order value. This guarantees sequential processing on the receiving side even if messages arrive on different TCP connections in the pool. Messages from different senders usually carry different order values, enabling parallel processing.

    When the receiving node reads a frame from TCP, it extracts the order byte and routes the frame to the appropriate receive queue. Each TCP connection in the pool gets 4 receive queues, so a 3-connection pool has 12 receive queues total. Frames are distributed to queues based on order_byte % queue_count. Each queue is processed by a dedicated goroutine that decodes frames and delivers messages to recipients. This parallel processing improves throughput while preserving per-sender ordering.
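    The routing described above can be sketched in a few lines (the function names are illustrative; the formulas are the ones from the text):

```go
package main

import "fmt"

// Order byte: the sender's PID ID modulo 255.
func orderByte(senderID uint64) uint8 {
	return uint8(senderID % 255)
}

// Queue selection: order byte modulo the number of receive queues.
// Frames from the same sender always pick the same queue, preserving order.
func queueFor(senderID uint64, queueCount int) int {
	return int(orderByte(senderID)) % queueCount
}

func main() {
	const queues = 12 // 3-connection pool × 4 receive queues per connection

	// Same sender → same queue, every time.
	fmt.Println(queueFor(1001, queues) == queueFor(1001, queues))

	// Different senders usually spread across queues.
	fmt.Println(queueFor(1001, queues), queueFor(1002, queues), queueFor(1003, queues))
}
```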

    hashtag
    Limits of Transparency

    Network transparency is powerful but not magical. The network has physical properties that can't be abstracted away.

    Latency - Remote operations are slower. A local Send takes microseconds. A remote Send takes milliseconds. That's three orders of magnitude. For a single message, it's negligible. For thousands of messages, the difference is dramatic. Design systems to minimize remote calls, batch operations, and use asynchronous patterns.

    Bandwidth - Network links have finite capacity. Sending millions of small messages can saturate a network connection. Encoding and decoding adds CPU overhead. Compression helps but costs CPU time. Be mindful of message volume and size. Local operations have effectively infinite bandwidth - remote operations don't.

    Failures - Networks fail in ways local memory doesn't. Packets get lost. Connections drop. Nodes become unreachable. DNS fails. Firewalls block traffic. Local operations either succeed instantly or fail with a clear error. Remote operations can timeout, leaving you uncertain whether they succeeded. Design for these failure modes with timeouts, retries, and idempotent operations.

    Partial failures - In a distributed system, some nodes can fail while others continue working. A local system either works entirely or crashes entirely. A distributed system can be partially operational - some nodes reachable, others not. This partial failure is the hardest aspect of distributed systems. The framework can't hide it entirely.

    Ordering - Message ordering is preserved per-sender within a connection. Messages from process A to process B arrive in the order sent. But messages from different senders can interleave arbitrarily. And if a connection drops and reconnects, messages sent during disconnection are lost or delayed. Don't assume global ordering across the cluster.

    Network transparency makes distributed programming feel local. But distributed programming has fundamental differences from local programming. The transparency is a tool that simplifies common cases - it doesn't eliminate the need to think about distributed system challenges.

    hashtag
    Practical Implications

    Understanding network transparency helps you design better distributed systems.

    Use local clustering - Group processes that communicate frequently on the same node. If processes exchange hundreds of messages per second, put them locally. Their communication is microseconds instead of milliseconds, and you avoid network overhead.

    Prefer async over sync - Use Send (asynchronous) instead of Call (synchronous) for remote communication when possible. Async messaging doesn't block the sender, improving throughput. Sync calls over the network tie up your process waiting for responses.

    Design for message batching - Send one message with 100 items instead of 100 messages with 1 item each. Network overhead is per-message. Batching amortizes that overhead.

    Handle failures explicitly - Use timeouts on sync calls. Use Important delivery for critical messages. Monitor connection health. Don't assume remote operations succeed - check errors and have fallback logic.

    Keep messages small - Encoding and network transmission costs scale with message size. Large messages cause memory allocation, encoding overhead, network congestion. If you're sending megabytes of data, consider whether it belongs in messages or should use a different mechanism (file transfer, streaming, database).

    Leverage compression - Enable compression for processes that send large messages. The CPU cost of compression is usually worth the network bandwidth savings. But don't compress tiny messages - the overhead exceeds the benefit.

    Register types early - Do all type registration in init() functions before the node starts. Avoid dynamic type registration that requires connection cycling. Static registration is simpler and more reliable.

    For details on how the network stack implements transparency, see . For understanding how nodes discover each other, see .

    Web

    HTTP and actors speak different languages. HTTP is fundamentally synchronous - a request arrives, blocks waiting for processing, gets a response, connection closes. The actor model is fundamentally asynchronous - messages arrive in mailboxes, get processed sequentially one at a time, responses are separate messages sent whenever ready.

    Integrating these two worlds is possible, but the integration strategy matters. Choose wrong and you lose the benefits of both models. Choose right and you get HTTP's ubiquity with actors' concurrency and distribution capabilities.

    This chapter shows two integration approaches, ordered from simple to complex. The simple approach works for most cases and keeps the entire HTTP ecosystem available. The meta-process approach trades tooling for deeper actor integration, enabling patterns impossible with standard HTTP stacks.

    Before reaching for meta-processes, understand what you're giving up and what you're gaining. The simple approach might be all you need.

    hashtag
    Simple Approach: Call from HTTP Handlers

    The straightforward way: run a standard HTTP server, call actors from handlers using node.Call(), let network transparency distribute requests across the cluster.

    This keeps HTTP and actors separate. HTTP handles protocol concerns - routing, middleware, headers, status codes. Actors handle business logic - state management, processing, coordination. Clean separation.

    hashtag
    Basic Pattern

    The HTTP server runs outside the actor system in a separate goroutine. Handlers call actors synchronously using node.Call(). Actors can be anywhere - same node, remote node, doesn't matter. Network transparency routes the call.

    hashtag
    Why This Works

    Call() blocks the HTTP handler goroutine, not an actor. Go's HTTP server creates one goroutine per connection. Blocking in a handler is normal - that goroutine waits, others continue serving requests.

    The actor receiving the call processes it asynchronously in its own message loop. Multiple handlers can call the same actor concurrently. The actor processes one request at a time from its mailbox. This isolates the actor from HTTP concurrency.

    Network transparency means the actor can be anywhere:

    Change Node to move the actor. Code stays the same. Distribute load across nodes by routing different requests to different actors.

    hashtag
    Cluster Load Distribution

    Network transparency means actors can run anywhere in the cluster. The HTTP gateway becomes a router that distributes requests across backend nodes.

    Simple consistent hashing distributes load evenly while maintaining request affinity:
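    A minimal sketch of that routing, using a plain modulo hash (the backend node names and cluster size are made up for the example; a true consistent-hashing ring would preserve more affinity when the cluster resizes):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

var backends = []string{"node1@host", "node2@host", "node3@host"}

// backendFor deterministically maps a user ID to one backend node,
// so repeated requests for the same user hit the same warm cache.
func backendFor(userID string) string {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return backends[h.Sum32()%uint32(len(backends))]
}

func main() {
	// The same user always routes to the same backend.
	fmt.Println(backendFor("alice") == backendFor("alice"))
}
```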

    Requests for user "alice" always go to the same backend node. That node caches alice's data in memory. Subsequent requests hit warm cache. Change clusterSize to add nodes - hashing redistributes load automatically while preserving most affinity.

    For dynamic topology where nodes join and leave unpredictably, use application discovery. Central registrars (etcd, Saturn) track which nodes are running which applications in real-time:

    Application discovery returns all nodes currently running the service. Each node reports its weight. Nodes with higher weights (more resources, better hardware, closer proximity) receive proportionally more traffic. Nodes that crash disappear from discovery immediately. New nodes appear as soon as they register. The HTTP gateway adapts to cluster topology changes without restarts.

    For details on application discovery and central registrars, see .

    hashtag
    Standard HTTP Tooling

    This approach keeps the entire HTTP ecosystem available:

    OpenAPI generation: Tools like swag/swaggo analyze HTTP handlers and generate OpenAPI specs. They see standard net/http handlers, so generation works normally.

    Middleware: Standard HTTP middleware wraps handlers - authentication, logging, CORS, rate limiting. Actors are completely invisible to middleware.

    Routing: Use any router - http.ServeMux (Go 1.22+), gorilla/mux, chi, echo. They all work with standard handlers.

    Testing: Test HTTP handlers with httptest. Test actors separately with unit tests. Clean separation of concerns.

    The actor system is an implementation detail. HTTP sees standard handlers. Clients see standard HTTP. Deployment tools see standard HTTP servers. Only the handler implementation uses actors internally.

    hashtag
    When This Approach Works

    Use this when:

    • You need standard HTTP tooling (OpenAPI, gRPC-gateway, middleware ecosystems)

    • Load balancing happens at the nginx/kubernetes level, not actor level

    • Backpressure from actors doesn't matter (actors process at their speed, HTTP clients wait)

    This covers most HTTP/actor integration cases. The HTTP layer is stateless. Actors hold state and logic. HTTP routes requests to actors. Clean architecture.

    For details on synchronous request handling in actors, see .

    hashtag
    Meta-Process Approach: Deep Integration

    Meta-processes convert HTTP into asynchronous actor messages. Instead of calling actors synchronously from handlers, requests become messages flowing into the actor system.

    This approach enables:

    • Backpressure: actors control request rate through mailbox capacity

    • Addressable connections: each WebSocket/SSE connection becomes an independent actor with gen.Alias identifier - any actor anywhere in the cluster can send messages directly to specific client connections through network transparency. This is the killer feature for real-time systems (chat, multiplayer games, live dashboards, collaborative editing) where backend logic must push updates to specific clients across cluster nodes. Impossible with the simple approach.

    • Per-request routing: route to different actor pools based on request content

    Standard HTTP routing and middleware still work - meta.WebHandler implements http.Handler and integrates with http.ServeMux or any router. You can wrap handlers in middleware for authentication, logging, CORS. What you lose is introspection-based tooling (OpenAPI generation, gRPC-gateway) because request processing happens inside actors, invisible to HTTP layer analysis tools.

    hashtag
    Architecture

    Two meta-processes work together:

    meta.WebServer: External Reader runs http.Server.Serve(listener) and blocks there until the listener fails. The http.Server creates its own goroutines for each HTTP connection - those goroutines call handlers, not the External Reader. Actor Handler never runs (no messages received).

    meta.WebHandler: Implements http.Handler interface. External Reader blocks in Start() waiting for termination. When http.Server (running in WebServer) accepts a connection, it spawns a goroutine that calls handler.ServeHTTP(). Inside ServeHTTP():

    1. Create context with timeout

    2. Send meta.MessageWebRequest to worker actor

    3. Block on <-ctx.Done() waiting for worker to call Done()

    Actor Handler never runs - HandleMessage() and HandleCall() are empty stubs.

    Worker actors receive meta.MessageWebRequest containing:

    • http.ResponseWriter - write response here

    • *http.Request - the HTTP request

    • Done() function - call this to unblock ServeHTTP

    hashtag
    Data Flow

    Compare this with typical meta-processes like TCP or UDP:

    TCP/UDP meta-processes:

    • External Reader actively loops reading from socket, sends messages to actors

    • Actor Handler receives messages from actors, writes to socket

    • Both goroutines do real work - continuous bidirectional I/O

    Web meta-processes:

    • WebServer's External Reader passively blocks in http.Server.Serve() doing nothing - http.Server does all the work internally

    • WebHandler's External Reader passively blocks on channel doing nothing - just waiting for termination

    • Neither has an active Actor Handler - no messages arrive in their mailboxes

    Web meta-processes are unusual. They use the meta-process mechanism not for bidirectional I/O but for lifecycle management and integration with the actor system. The External Reader goroutines exist only to keep the meta-process alive while http.Server runs. The actual HTTP handling happens in goroutines spawned by http.Server, which are completely outside the meta-process architecture.

    This works because http.Server already solves concurrency - it spawns goroutines per connection. The meta-process just wraps it for integration with actor lifecycle and messaging.

    hashtag
    Basic Setup

    When a request arrives:

    1. WebServer's External Reader is blocked in http.Server.Serve()

    2. http.Server accepts connection, spawns its own goroutine for this connection

    3. That goroutine calls handler.ServeHTTP() (handler is WebHandler)

    Critical: ServeHTTP() executes in http.Server goroutines, not in meta-process goroutines. WebHandler's External Reader remains blocked in Start() waiting for termination. WebHandler's Actor Handler never spawns because no messages arrive in its mailbox.

    hashtag
    Worker Implementation

    Workers receive meta.MessageWebRequest as regular messages in their mailbox:

    The pattern: receive MessageWebRequest, process it, write to ResponseWriter, call Done(). The Done() call unblocks the ServeHTTP() goroutine waiting in WebHandler.

    Using act.WebWorker: Framework provides act.WebWorker that automatically extracts MessageWebRequest, routes to HTTP-method-specific callbacks (HandleGet, HandlePost, etc.), and calls Done() after processing. Use this instead of manual message handling - it eliminates boilerplate and ensures Done() is always called. See for details.

    hashtag
    Concurrent Processing with act.Pool

    Single worker processes requests sequentially. Use act.Pool to process multiple requests concurrently:

    Pool distributes incoming requests across 20 workers. Each worker processes one request at a time. System handles 20 concurrent requests.

    Capacity control: PoolSize × WorkerMailboxSize queued requests, plus one in-flight request per worker, defines the maximum the backend accepts. With 20 workers and mailbox size 10, system capacity is 220 requests (20 processing + 200 queued). Beyond this, requests are shed - the pool cannot forward to workers with full mailboxes.

    This limits load on backend systems. Database handles 20 concurrent queries maximum. External API gets 20 parallel requests maximum. Worker mailboxes buffer bursts without overwhelming downstream services.
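    The capacity arithmetic can be demonstrated with buffered channels standing in for worker mailboxes (sizes chosen for the example): once every mailbox is full, further requests are shed instead of queueing unboundedly.

```go
package main

import "fmt"

// dispatch round-robins requests across worker mailboxes; a full mailbox
// means the request is shed rather than queued without bound.
func dispatch(requests int, mailboxes []chan int) (accepted, shed int) {
	for req := 0; req < requests; req++ {
		select {
		case mailboxes[req%len(mailboxes)] <- req: // room in this mailbox
			accepted++
		default: // mailbox full: shed the request
			shed++
		}
	}
	return
}

func main() {
	const poolSize, mailboxSize = 3, 2 // queued capacity: 3 × 2 = 6

	mailboxes := make([]chan int, poolSize)
	for i := range mailboxes {
		mailboxes[i] = make(chan int, mailboxSize)
	}

	accepted, shed := dispatch(10, mailboxes)
	fmt.Println(accepted, shed) // 6 accepted, 4 shed
}
```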

    Worker failures are handled automatically. Pool spawns replacement workers when crashes are detected. Other workers continue processing during restart.

    hashtag
    Stateful Connections: WebSocket

    HTTP request-response is stateless. WebSocket is the opposite - long-lived bidirectional connections remaining open for hours or days.

    The framework provides WebSocket meta-process implementation in the extra library (ergo.services/meta/websocket). Each connection becomes an independent meta-process with gen.Alias identifier, addressable from anywhere in the cluster.

    Each connection is an independent meta-process:

    • External Reader continuously reads messages from client

    • Actor Handler receives messages from backend actors, writes to client

    • Both operate simultaneously - full-duplex bidirectional communication

    Killer feature: cluster-wide addressability. Any actor on any node can send messages directly to specific client connections:

    Network transparency makes every WebSocket connection addressable like any other actor. Backend logic scattered across cluster nodes can push updates to specific clients without routing through intermediaries.

    This is impossible with the simple approach. node.Call() is request-response. WebSocket requires continuous streaming both directions. Meta-processes provide the architecture: one goroutine reading from client, another writing to client, both operating on the same connection.

    For WebSocket implementation and usage examples, see .

    hashtag
    Choosing an Approach

    Start with the simple approach. Use node.Call() from standard HTTP handlers. This works for most cases and keeps the entire HTTP ecosystem available - OpenAPI generation, middleware, familiar patterns.

    Move to meta-processes when you specifically need:

    WebSocket or long-lived connections: Each connection must be an addressable actor that backend logic can push updates to. The simple approach cannot do this - it's request-response only. Meta-processes make each connection an independent actor with cluster-wide addressability.

    Capacity control through mailbox limits: Backend queues at most PoolSize × WorkerMailboxSize requests (plus those being processed), no more. Beyond this, requests are rejected. This prevents memory exhaustion during overload. The simple approach queues unbounded requests in HTTP server.

    The simple approach handles thousands of requests per second with proper actor distribution. Use meta-processes only when the simple approach cannot provide required capabilities.

    Actor

    The actor model requires sequential message processing - each actor handles one message at a time in a dedicated goroutine. This eliminates data races within the actor but shifts complexity to the message handling loop: reading from multiple mailbox queues in priority order, dispatching to different handlers based on message type, managing state transitions, converting exit signals to regular messages when trapping is enabled.

    You could implement this yourself with gen.ProcessBehavior, but you'd rewrite the same logic for every actor. act.Actor solves this. It implements the low-level gen.ProcessBehavior interface and provides a higher-level act.ActorBehavior interface with straightforward callbacks: Init for initialization, HandleMessage for asynchronous messages, HandleCall for synchronous requests, Terminate for cleanup. You write business logic, act.Actor handles the mailbox mechanics.

    hashtag
    Creating an Actor

    Embed act.Actor in your struct and implement the act.ActorBehavior callbacks you need:

    Spawn it like any process:

    The factory function is called each time you spawn. Each process gets a fresh instance with its own state. This isolation is fundamental to the actor model - actors share nothing except messages.

    hashtag
    Callback Interface

    act.ActorBehavior defines the callbacks act.Actor will invoke:

    All callbacks are optional. act.Actor provides default implementations that log warnings for unhandled messages. Implement only what you need.

    Since act.Actor embeds gen.Process, you have direct access to all process methods: Send, Call, Spawn, Link, RegisterName, etc. No need to store references - they're built in.

    hashtag
    Initialization

    Init runs once when the process spawns. The args parameter contains whatever you passed to Spawn:

    If Init returns an error, the process is cleaned up and removed. Spawn returns immediately with that error. Use this for validation: check arguments, verify resources, refuse to start if preconditions aren't met.

    During Init, the process is in ProcessStateInit. All operations are available: Spawn, Send, SetEnv, RegisterName, CreateAlias, RegisterEvent, Link*, Monitor*, Call*, and property setters.

    Any resources created during Init (names, aliases, events, links, monitors) are properly cleaned up if initialization fails.

    hashtag
    Message Handling

    Messages arrive in the mailbox and sit in one of four queues: Urgent, System, Main, or Log. act.Actor processes them in priority order:

    1. Urgent - Maximum priority messages (MessagePriorityMax)

    2. System - High priority messages (MessagePriorityHigh)

    3. Main - Normal priority messages (MessagePriorityNormal)

    4. Log - Log messages, processed only when the other queues are empty
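    The drain order can be sketched as four plain slices, always taking from the highest-priority non-empty queue first (a dependency-free illustration, not the framework's mailbox implementation):

```go
package main

import "fmt"

type mailbox struct {
	urgent, system, main, log []string
}

// next pops the first message from the highest-priority non-empty queue.
func (m *mailbox) next() (string, bool) {
	for _, q := range []*[]string{&m.urgent, &m.system, &m.main, &m.log} {
		if len(*q) > 0 {
			msg := (*q)[0]
			*q = (*q)[1:]
			return msg, true
		}
	}
	return "", false
}

func main() {
	m := &mailbox{
		main:   []string{"normal-1", "normal-2"},
		system: []string{"high-1"},
		log:    []string{"log-1"},
	}
	// Drains as: high-1, normal-1, normal-2, log-1.
	for msg, ok := m.next(); ok; msg, ok = m.next() {
		fmt.Println(msg)
	}
}
```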

    When a message arrives in Urgent, System, or Main, act.Actor calls HandleMessage:

    The return value determines whether the actor continues or terminates:

    • Return nil to keep running

    • Return gen.TerminateReasonNormal for clean shutdown

    • Return any other error to terminate (logged as error)

    The from parameter tells you who sent the message. Use it for replies. If you don't need replies, ignore it.

    hashtag
    Synchronous Requests

    When someone calls process.Call(pid, request), act.Actor invokes your HandleCall:

    The error return value controls process termination, not the caller's response:

    • (result, nil) - Send result to caller, continue running

    • (result, gen.TerminateReasonNormal) - Send result, then terminate cleanly

    To send an application error to the caller, return it as the result value:

    This separation between transport errors (err return from Call) and application errors (result as error) is fundamental to actor communication. See for deeper discussion of error channels and when to use SendResponseError.

    hashtag
    Asynchronous Handling of Synchronous Requests

    Sometimes you can't respond immediately. Maybe you need to query another service, or delegate work to a pool of workers. Return (nil, nil) from HandleCall to defer the response:

    The gen.Ref identifies the request. The caller blocks waiting for a response with that ref. You can send the response from any process - the one that received the request, a worker, or even a remote process. Just call SendResponse(callerPID, ref, result).
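    As a dependency-free illustration of this defer-and-reply-later pattern (the ref type, pending map, and method names here are local stand-ins - in the framework the caller blocks inside Call() and any process completes it with SendResponse):

```go
package main

import "fmt"

type ref int

type server struct {
	pending map[ref]chan any // caller wait-points keyed by request ref
	nextRef ref
}

// handleCall produces no result yet: it records the caller's ref and
// defers the response.
func (s *server) handleCall(request any) (ref, chan any) {
	s.nextRef++
	ch := make(chan any, 1)
	s.pending[s.nextRef] = ch
	return s.nextRef, ch
}

// sendResponse completes a deferred call - possibly from another goroutine,
// a worker, or (in the real framework) another process entirely.
func (s *server) sendResponse(r ref, result any) {
	if ch, ok := s.pending[r]; ok {
		ch <- result
		delete(s.pending, r)
	}
}

func main() {
	s := &server{pending: map[ref]chan any{}}
	r, wait := s.handleCall("expensive query")

	go s.sendResponse(r, "result ready") // a worker finishes later

	fmt.Println(<-wait) // the caller unblocks once the response arrives
}
```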

    The ref has a deadline (from the caller's timeout). Check if it's still alive before doing expensive work:

    hashtag
    Termination

    To stop an actor, return a non-nil error from HandleMessage or HandleCall:

    Termination reasons:

    • gen.TerminateReasonNormal - Clean shutdown, not logged as error

    • gen.TerminateReasonKill - Process was killed via node.Kill(pid)

    • gen.TerminateReasonPanic - A panic occurred in a callback; the framework logs the stack trace

    After termination is triggered, act.Actor calls your Terminate callback:

    At this point, the process is in ProcessStateTerminated and has been removed from the node. Most gen.Process methods return gen.ErrNotAllowed. You can still send messages (fire-and-forget), but you can't make calls, create links, or spawn children.

    If a panic occurs during Init, HandleMessage, or HandleCall, the framework catches it, logs the stack trace, and terminates the process with gen.TerminateReasonPanic. The Terminate callback still runs, giving you a chance to clean up.

    hashtag
    Trapping Exit Signals

    By default, when an actor receives an exit signal (via SendExit or from a linked process), it terminates immediately. Enable TrapExit to convert exit signals into regular messages:

    Exit signal messages:

    • gen.MessageExitPID - From a process (SendExit or link)

    • gen.MessageExitProcessID - From a named process link

    • gen.MessageExitAlias - From an alias link

    Exception: Exit signals from the parent process cannot be trapped. If your parent terminates (and you created a link with LinkParent option or via Link/LinkPID), you terminate regardless of TrapExit. This ensures supervision trees can forcefully terminate subtrees.

    Use TrapExit when you want to handle failures gracefully - log them, restart workers, switch to fallback services. Don't use it if you want standard supervision behavior (child fails → parent restarts it).

    hashtag
    Split Handle

    By default, HandleMessage and HandleCall are invoked regardless of how the process was addressed - by PID, by registered name, or by alias. Enable SetSplitHandle(true) to route based on address type:

    The same split applies to HandleCall* variants. Use this when you want different behavior for internal communication (PID) versus public API (registered name) versus temporary sessions (alias).

    Most actors don't need this. Leave split handle disabled and use HandleMessage/HandleCall for everything.

    hashtag
    Specialized Callbacks

    hashtag
    Logging

    If your actor is registered as a logger (via node.AddLogger(pid, level)), it receives log messages in the Log queue:

    Log messages have the lowest priority. They're processed after Urgent, System, and Main are empty. This prevents logging from starving regular message processing.

    hashtag
    Events

    If your actor subscribed to an event (via LinkEvent or MonitorEvent), it receives event messages:

    Events arrive in the System queue (high priority). Use them for cross-cutting concerns where multiple actors need to react to the same occurrence.

    hashtag
    Inspection

    Actors can expose runtime state for monitoring and debugging via the HandleInspect callback:

    Inspect the actor from within a process context or directly from the node:

    Both methods only work for local processes (same node). Inspection requests go to the Urgent queue and bypass normal message processing. Keep HandleInspect implementation fast - don't do expensive computations or I/O. Return only string values (serialization limitation). The optional item parameters allow filtering which fields to return, though most implementations ignore them and return all fields.

    hashtag
    Actor Pools

    For workload distribution, use act.Pool instead of implementing manual worker management. See for details.

    hashtag
    Patterns and Pitfalls

    Don't spawn goroutines in callbacks. The actor model is sequential - one message at a time. Spawning goroutines breaks this, introducing data races on actor state. If you need concurrency, spawn child actors and send them messages.

    Don't block on channels or mutexes. Callbacks run in the actor's goroutine. Blocking it starves message processing. Use async message passing (Send) instead of sync primitives.

    Don't store gen.Process references. The embedded act.Actor provides all process methods. Storing additional references wastes memory and can cause confusion about which instance is authoritative.

    Return errors for termination, not for caller responses. HandleCall's error return terminates the process. To send errors to callers, return them as the result value.

    Use ref.IsAlive() before expensive async work. When handling calls asynchronously, check if the caller is still waiting before spending resources on the response.

    Enable TrapExit only when needed. Default behavior (terminate on exit signal) works for most actors. Trap only when you have specific failure handling logic.

    Supervisor

    Actors fail. They panic, encounter errors, or lose external resources. In traditional systems, you add defensive code: catch exceptions, retry operations, validate state. This spreads failure handling throughout your codebase, mixing recovery logic with business logic.

    The actor model takes a different approach: let it crash. When an actor fails, terminate it cleanly and restart it in a known-good state. This requires something watching the actor and managing its lifecycle - a supervisor.

    act.Supervisor is an actor that manages child processes. It starts them during initialization, monitors them for failures, and applies restart strategies when they terminate. Supervisors can manage other supervisors, creating hierarchical fault tolerance trees where failures are isolated and recovered automatically.

    Like act.Actor, the act.Supervisor struct implements the low-level gen.ProcessBehavior interface.

    // Local - errors are immediate
    err := process.Send(localPID, message)
    if err != nil {
        // ErrProcessUnknown or ErrProcessMailboxFull
        // You know immediately something is wrong
    }
    
    // Remote - errors are hidden
    err = process.Send(remotePID, message)
    if err != nil {
        // Only reports local problems (serialization, no connection)
        // Cannot report remote problems (process missing, mailbox full)
    }
    // Message sent to network, no idea if it arrived
    err := process.SendImportant(remotePID, message)
    if err != nil {
        // Immediate errors:
        // - ErrProcessUnknown: process doesn't exist on remote node
        // - ErrProcessMailboxFull: process exists but mailbox is full
        // - ErrTimeout: remote node received message but no confirmation
        // - ErrNoConnection: cannot reach remote node
    }
    // If no error, message is definitely in the recipient's mailbox
    func (a *Actor) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case CriticalUpdate:
            // This message must be delivered or we need to know it failed
            if err := a.SendImportant(targetPID, msg); err != nil {
                a.Log().Error("failed to send critical update: %s", err)
                return err
            }
            a.Log().Info("critical update confirmed delivered")
        }
        return nil
    }
    func (a *Actor) Init(args ...any) error {
        // Enable important delivery for all messages from this process
        a.SetImportantDelivery(true)
        return nil
    }
    
    func (a *Actor) HandleMessage(from gen.PID, message any) error {
        // Send uses important delivery automatically
        err := a.Send(targetPID, message)
        if err != nil {
            // Immediate confirmation or error
        }
        return nil
    }
    // Caller
    result, err := process.Call(target, request)
    
    // Handler
    func (h *Handler) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
        result := h.process(request)
        return result, nil  // Framework sends with SendResponse
    }
    // Caller
    result, err := process.Call(target, request)
    
    // Handler
    func (h *Handler) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
        h.SetImportantDelivery(true)  // Or use SendResponseImportant explicitly
        result := h.process(request)
        return result, nil  // Framework sends with SendResponseImportant
    }
    // Caller
    result, err := process.CallImportant(target, request)
    
    // Handler
    func (h *Handler) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
        result := h.process(request)
        return result, nil  // Regular SendResponse
    }
    // Caller
    result, err := process.CallImportant(target, request)
    
    // Handler
    func (h *Handler) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
        h.SetImportantDelivery(true)
        result := h.process(request)
        return result, nil  // Framework sends with SendResponseImportant
    }
    type Coordinator struct {
        act.Actor
        participants []gen.PID
    }
    
    func (c *Coordinator) Prepare() error {
        c.SetImportantDelivery(true)  // FR-2PC for all messages
        
        // Phase 1: Prepare
        for _, p := range c.participants {
            result, err := c.CallImportant(p, PrepareRequest{})
            if err != nil {
                // Participant unreachable - abort
                return c.abort()
            }
            if result != "yes" {
                // Participant voted no - abort
                return c.abort()
            }
        }
        
        // Phase 2: Pre-commit (guaranteed delivery)
        for _, p := range c.participants {
            result, err := c.CallImportant(p, PreCommitRequest{})
            if err != nil {
                // This is a problem - participant didn't receive pre-commit
                // But FR-2PC guarantees we know immediately
                return c.handlePreCommitFailure(p, err)
            }
        }
        
        // Phase 3: Commit (guaranteed delivery)
        for _, p := range c.participants {
            _, err := c.CallImportant(p, CommitRequest{})
            if err != nil {
                // Participant didn't receive commit
                // Need recovery protocol
                return c.handleCommitFailure(p, err)
            }
        }
        
        return nil
    }
    // Local send - immediate error, important flag ignored
    err := process.SendImportant(localPID, message)
    if err != nil {
        // ErrProcessUnknown or ErrProcessMailboxFull
        // No ACK needed, mailbox operation is synchronous
    }
    route := gen.ApplicationRoute{
        Node:   node.Name(),
        Name:   "workers",
        Weight: 100,
        Mode:   gen.ApplicationModePermanent,
        State:  gen.ApplicationStateRunning,
    }
    registrar.RegisterApplicationRoute(route)
    registrar, _ := node.Network().Registrar()
    resolver := registrar.Resolver()
    routes, err := resolver.ResolveApplication("workers")
    // routes contains all nodes running "workers" application
    routes, _ := resolver.ResolveApplication("workers")
    // routes = []gen.ApplicationRoute{
    //   {Name: "workers", Node: "worker1@host1", Weight: 100, Mode: Permanent, State: Running},
    //   {Name: "workers", Node: "worker2@host2", Weight: 50,  Mode: Permanent, State: Running},
    //   {Name: "workers", Node: "worker3@host3", Weight: 200, Mode: Permanent, State: Running},
    // }
    registrar, _ := node.Network().Registrar()
    
    // Get single config item
    dbURL, err := registrar.ConfigItem("database_url")
    
    // Get multiple items
    config, err := registrar.Config("database_url", "cache_size", "log_level")
    // config = map[string]any{
    //     "database_url": "postgres://...",
    //     "cache_size": 1024,
    //     "log_level": "info",
    // }
    // For etcd registrar
    import "ergo.services/registrar/etcd"
    
    // For Saturn registrar
    import "ergo.services/registrar/saturn"
    
    registrar, _ := node.Network().Registrar()
    event, err := registrar.Event()
    if err != nil {
        // registrar doesn't support events (embedded registrar only)
    }
    
    // Link to the event to receive notifications
    process.LinkEvent(event)
    
    // In your HandleEvent callback (etcd example):
    func (w *Worker) HandleEvent(message gen.MessageEvent) error {
        switch ev := message.Message.(type) {
        
        case etcd.EventConfigUpdate:
            // Configuration item changed
            w.Log().Info("config updated: %s = %v", ev.Item, ev.Value)
            w.loadConfig()
            
        case etcd.EventNodeJoined:
            // New node joined the cluster
            w.Log().Info("node joined: %s", ev.Name)
            w.checkNewNode(ev.Name)
            
        case etcd.EventNodeLeft:
            // Node left the cluster
            w.Log().Info("node left: %s", ev.Name)
            w.handleNodeDown(ev.Name)
            
        case etcd.EventApplicationLoaded:
            // Application loaded on a node
            w.Log().Info("application %s loaded on %s (weight: %d)", 
                ev.Name, ev.Node, ev.Weight)
                
        case etcd.EventApplicationStarted:
            // Application started running
            w.Log().Info("application %s started on %s (mode: %s, weight: %d)", 
                ev.Name, ev.Node, ev.Mode, ev.Weight)
            w.refreshServices()
            
        case etcd.EventApplicationStopping:
            // Application is stopping
            w.Log().Info("application %s stopping on %s", ev.Name, ev.Node)
            
        case etcd.EventApplicationStopped:
            // Application stopped completely
            w.Log().Info("application %s stopped on %s", ev.Name, ev.Node)
            w.refreshServices()
            
        case etcd.EventApplicationUnloaded:
            // Application unloaded from node
            w.Log().Info("application %s unloaded from %s", ev.Name, ev.Node)
        }
        return nil
    }
    
    // For Saturn registrar, use saturn.EventConfigUpdate, saturn.EventNodeJoined, etc.
    // The event types are identical in structure but defined in separate packages.
    import "ergo.services/ergo/net/registrar"
    
    node, err := ergo.StartNode("myapp@localhost", gen.NodeOptions{
        Network: gen.NetworkOptions{
            Registrar: registrar.Create(registrar.Options{
                Port: 4499, // default
                DisableServer: false, // allow server mode
            }),
        },
    })
    import "ergo.services/registrar/etcd"
    
    node, err := ergo.StartNode("[email protected]", gen.NodeOptions{
        Network: gen.NetworkOptions{
            Registrar: etcd.Create(etcd.Options{
                Endpoints: []string{"etcd1:2379", "etcd2:2379", "etcd3:2379"},
                // ... authentication, TLS, etc
            }),
        },
    })
    network := node.Network()
    registrar, err := network.Registrar()
    if err != nil {
        // node has no registrar configured
    }
    
    info := registrar.Info()
    // info.Server - registrar endpoint
    // info.EmbeddedServer - true if running as server
    // info.SupportConfig - whether config storage is available
    // info.SupportRegisterApplication - whether app routing is available
    type ActorBehavior interface {
        gen.ProcessBehavior
    
        Init(args ...any) (Options, error)
    
        HandleMessage(from gen.PID, message any) error
        HandleCall(from gen.PID, ref gen.Ref, message any) (any, error)
        HandleEvent(event gen.MessageEvent) error
        HandleInspect(from gen.PID, item ...string) map[string]string
    
        CollectMetrics() error
        Terminate(reason error)
    }
    package main
    
    import (
        "ergo.services/actor/metrics"
        "ergo.services/ergo"
        "ergo.services/ergo/gen"
    )
    
    func main() {
        node, _ := ergo.StartNode("mynode@localhost", gen.NodeOptions{})
        defer node.Stop()
    
        // Spawn metrics actor with defaults
        node.Spawn(metrics.Factory, gen.ProcessOptions{}, metrics.Options{})
    
        // Metrics available at http://localhost:3000/metrics
        node.Wait()
    }
    options := metrics.Options{
        Host:            "0.0.0.0",        // Listen on all interfaces
        Port:            9090,              // Prometheus default port
        CollectInterval: 5 * time.Second,  // Collect every 5 seconds
    }
    
    node.Spawn(metrics.Factory, gen.ProcessOptions{}, options)
    type AppMetrics struct {
        metrics.Actor
    
        activeUsers   prometheus.Gauge
        queueDepth    prometheus.Gauge
    }
    
    func (m *AppMetrics) Init(args ...any) (metrics.Options, error) {
        m.activeUsers = prometheus.NewGauge(prometheus.GaugeOpts{
            Name: "myapp_active_users",
            Help: "Current number of active users",
        })
    
        m.queueDepth = prometheus.NewGauge(prometheus.GaugeOpts{
            Name: "myapp_queue_depth",
            Help: "Current queue depth",
        })
    
        m.Registry().MustRegister(m.activeUsers, m.queueDepth)
    
        return metrics.Options{
            Port:            9090,
            CollectInterval: 5 * time.Second,
        }, nil
    }
    
    func (m *AppMetrics) CollectMetrics() error {
        // Called every CollectInterval
        // Query other processes for current state
        
        count, err := m.Call(userService, getActiveUsersMessage{})
        if err != nil {
            m.Log().Warning("failed to get user count: %s", err)
            return nil // Non-fatal, continue
        }
        m.activeUsers.Set(float64(count.(int)))
        
        depth, err := m.Call(queueService, getDepthMessage{})
        if err != nil {
            m.Log().Warning("failed to get queue depth: %s", err)
            return nil // Non-fatal, continue
        }
        m.queueDepth.Set(float64(depth.(int)))
        
        return nil
    }
    type AppMetrics struct {
        metrics.Actor
    
        requestsTotal  prometheus.Counter
        errorsTotal    prometheus.Counter
        requestLatency prometheus.Histogram
    }
    
    func (m *AppMetrics) Init(args ...any) (metrics.Options, error) {
        m.requestsTotal = prometheus.NewCounter(prometheus.CounterOpts{
            Name: "myapp_requests_total",
            Help: "Total requests processed",
        })
    
        m.errorsTotal = prometheus.NewCounter(prometheus.CounterOpts{
            Name: "myapp_errors_total",
            Help: "Total errors encountered",
        })
    
        m.requestLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
            Name:    "myapp_request_duration_seconds",
            Help:    "Request latency distribution",
            Buckets: prometheus.DefBuckets,
        })
    
        m.Registry().MustRegister(m.requestsTotal, m.errorsTotal, m.requestLatency)
    
        return metrics.Options{Port: 9090}, nil
    }
    
    func (m *AppMetrics) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case requestCompletedMessage:
            m.requestsTotal.Inc()
            m.requestLatency.Observe(msg.duration.Seconds())
        case errorOccurredMessage:
            m.errorsTotal.Inc()
        }
        return nil
    }
    // In your request handler actor
    func (h *RequestHandler) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case ProcessRequest:
            start := time.Now()
            // ... process request ...
            elapsed := time.Since(start)
            
            // Send metrics event
            h.Send(metricsPID, requestCompletedMessage{duration: elapsed})
        }
        return nil
    }
    scrape_configs:
      - job_name: 'ergo-nodes'
        static_configs:
          - targets:
              - 'localhost:3000'
              - 'node1.example.com:3000'
              - 'node2.example.com:3000'
        scrape_interval: 15s
    func (m *AppMetrics) HandleInspect(from gen.PID, item ...string) map[string]string {
        result := make(map[string]string)
        
        // Custom inspection logic
        result["status"] = "healthy"
        result["custom_info"] = "some value"
        
        return result
    }

    ergo_processes_running

    Gauge

    Processes actively handling messages. Low relative to total suggests most processes are idle (good) or blocked (bad - investigate what they're waiting for).

    ergo_processes_zombie

    Gauge

    Processes terminated but not yet fully cleaned up. These should be transient. Persistent zombies indicate bugs in termination handling.

    ergo_memory_used_bytes

    Gauge

    Total memory obtained from OS (uses runtime.MemStats.Sys).

    ergo_memory_alloc_bytes

    Gauge

    Bytes of allocated heap objects (uses runtime.MemStats.Alloc).

    ergo_cpu_user_seconds

    Gauge

    CPU time spent executing user code. Increases as the node does work. Rate of change indicates CPU utilization.

    ergo_cpu_system_seconds

    Gauge

    CPU time spent in kernel (system calls). High system time relative to user time suggests I/O bottlenecks or excessive syscalls.

    ergo_applications_total

    Gauge

    Number of applications loaded. Should match your expected count. Unexpected changes indicate applications starting or stopping.

    ergo_applications_running

    Gauge

    Applications currently active. Compare to total to identify stopped or failed applications.

    ergo_registered_names_total

    Gauge

    Processes registered with atom names. High counts suggest heavy use of named processes for routing.

    ergo_registered_aliases_total

    Gauge

    Total number of registered aliases. Includes aliases created by processes via CreateAlias() and aliases identifying meta-processes.

    ergo_registered_events_total

    Gauge

    Event subscriptions active in the node. High counts indicate extensive pub/sub usage.

    (label: node)

    Uptime of each connected remote node. Resets when the remote node restarts.

    ergo_remote_messages_in_total

    Gauge (label: node)

    Messages received from each remote node. Rate indicates traffic volume.

    ergo_remote_messages_out_total

    Gauge (label: node)

    Messages sent to each remote node. Asymmetric in/out rates may reveal routing issues.

    ergo_remote_bytes_in_total

    Gauge (label: node)

    Bytes received from each remote node. Disproportionate bytes-to-messages ratio suggests large messages or inefficient serialization.

    ergo_remote_bytes_out_total

    Gauge (label: node)

    Bytes sent to each remote node. Monitors network bandwidth usage per peer.

    To create a supervisor, you embed act.Supervisor in your struct and implement the act.SupervisorBehavior interface.

    Only Init is mandatory. All other methods are optional - act.Supervisor provides default implementations that log warnings. The Init method returns SupervisorSpec which defines the supervisor's behavior, children, and restart strategy.

    Creating a Supervisor

    Embed act.Supervisor and implement Init to define the supervision spec:

    The supervisor spawns all children during Init (except Simple One For One, which starts with zero children). Each child is linked bidirectionally to the supervisor (LinkChild and LinkParent set automatically). If a child terminates, the supervisor receives an exit signal and applies the restart strategy.

    Children are started sequentially in declaration order. If any child's spawn fails (the factory's ProcessInit returns an error), the supervisor terminates immediately with that error. This ensures the supervision tree is fully initialized or not at all - no partial states.

    Supervision Types

    The Type field in SupervisorSpec determines what happens when a child fails.

    One For One

    Each child is independent. When one child terminates, only that child is restarted. Other children continue running unaffected.

    If worker2 crashes, the supervisor restarts only worker2. worker1 and worker3 keep running. Use this when children are independent - databases, caches, API handlers that don't depend on each other.

    Each child runs with a registered name (the Name from the spec). This means only one instance per child spec. To run multiple instances of the same worker, use Simple One For One instead.

    All For One

    Children are tightly coupled. When any child terminates, all children are stopped and restarted together.

    If cache crashes, the supervisor stops processor and api (in reverse order if KeepOrder is true, simultaneously otherwise), then restarts all three in declaration order. Use this when children share state or dependencies that can't survive partial failures.

    Rest For One

    When a child terminates, only children started after it are affected. Children started before it continue running.

    If cache crashes, the supervisor stops api, then restarts cache and api in order. database is unaffected. Use this for dependency chains where later children depend on earlier ones, but earlier ones don't depend on later ones.

    With KeepOrder: true, children are stopped sequentially (last to first). With KeepOrder: false, they stop simultaneously. Either way, restart happens in declaration order after all affected children have stopped.

    Simple One For One

    All children run the same code, spawned dynamically instead of at supervisor startup.

    The supervisor starts with zero children. Call supervisor.StartChild("worker", "custom-args") to spawn instances.

    Each instance is independent. They're not registered by name (no SpawnRegister), so you track them by PID. When an instance terminates, only that instance is restarted (if the restart strategy allows). Other instances continue running.

    Use Simple One For One for worker pools where you dynamically scale the number of identical workers based on load. The child spec is a template - each StartChild creates a new instance from that template.

    Restart Strategies

    The Restart.Strategy field determines when children are restarted.

    Transient (Default)

    Restart only on abnormal termination. If a child returns gen.TerminateReasonNormal or gen.TerminateReasonShutdown, it's not restarted.

    Use this for workers that can gracefully stop - maybe they finished their work, or received a shutdown command. Crashes (panics, errors, kills) trigger restarts. Normal termination doesn't.

    Temporary

    Never restart, regardless of termination reason.

    The child runs once. If it terminates (normal or crash), it stays terminated. Use this for initialization tasks or processes that shouldn't be restarted automatically.

    Permanent

    Always restart, regardless of termination reason.

    Even gen.TerminateReasonNormal triggers restart. Use this for critical processes that must always be running - maybe a health monitor or connection manager that should never stop.

    With Permanent strategy, DisableAutoShutdown is ignored, and the Significant flag has no effect - every child termination triggers restart.

    Restart Intensity

    Restarts aren't free. If a child crashes repeatedly, restarting it repeatedly just wastes resources. The Intensity and Period options limit restart frequency.

    The supervisor tracks restart timestamps (in milliseconds). When a child terminates and needs restart, the supervisor checks: have there been more than Intensity restarts in the last Period seconds? If yes, the restart intensity is exceeded. The supervisor stops all children and terminates itself with act.ErrSupervisorRestartsExceeded.

    Old restarts outside the period window are discarded from tracking. This is a sliding window: if your child crashes 5 times in 10 seconds, then runs stable for 11 seconds, then crashes again - the counter resets. It's 1 restart in the window, not 6 total.

    Default values are Intensity: 5 and Period: 5 if you don't specify them.

    Significant Children

    In All For One and Rest For One supervisors, the Significant flag marks children whose termination can trigger supervisor shutdown:

    With SupervisorStrategyTransient:

    • Significant child terminates normally → supervisor stops all children and terminates

    • Significant child crashes → restart strategy applies

    • Non-significant child → restart strategy applies regardless of termination reason

    With SupervisorStrategyTemporary:

    • Significant child terminates (any reason) → supervisor stops all children and terminates

    • Non-significant child → no restart, child stays terminated

    With SupervisorStrategyPermanent:

    • Significant flag is ignored

    • All terminations trigger restart

    For One For One and Simple One For One, Significant is always ignored.

    Use significant children when a specific child's clean termination means "mission accomplished, shut down the subtree." Example: a batch processor that finishes its work and terminates normally should stop the entire supervision tree, not get restarted.

    Auto Shutdown

    By default, if all children terminate normally (not crashes) and none are significant, the supervisor stops itself with gen.TerminateReasonNormal. This is auto shutdown.

    Enable DisableAutoShutdown to keep the supervisor running even with zero children.

    Auto shutdown is ignored for Simple One For One supervisors (they're designed for dynamic children) and ignored when using Permanent strategy.

    Use auto shutdown when your supervisor's purpose is managing those specific children. When they're all gone, the supervisor has no purpose. Disable it when the supervisor manages dynamically added children or should stay alive to accept management commands.

    Keep Order

    For All For One and Rest For One, the KeepOrder flag controls how children are stopped:

    With KeepOrder: true:

    • Children stop one at a time, last to first

    • Supervisor waits for each child to fully terminate before stopping the next

    • Slow but orderly - useful when children have shutdown dependencies

    With KeepOrder: false (default):

    • All affected children receive SendExit simultaneously

    • They terminate in parallel

    • Fast but unordered - use when children can shut down independently

    After stopping (either way), children restart sequentially in declaration order. KeepOrder only affects stopping, not starting.

    For One For One and Simple One For One, KeepOrder is ignored (only one child is affected).

    Dynamic Management

    Supervisors provide methods for runtime adjustments: StartChild, AddChild, EnableChild, and DisableChild.

    Critical: These methods fail with act.ErrSupervisorStrategyActive if called while the supervisor is executing a restart strategy. The supervisor is in supStateStrategy mode - it's stopping children, waiting for exit signals, or starting replacements. You must wait for it to return to supStateNormal before making management calls.

    When the supervisor is applying a strategy, it processes only the Urgent queue (where exit signals arrive) and ignores System and Main queues. This ensures exit signals are handled promptly without interference from management commands or regular messages.

    For Simple One For One supervisors, StartChild with args stores those args for that specific child instance. When that instance restarts (due to crash, kill, etc.), it uses the stored args, not the template args from the spec. For other supervisor types (One For One, All For One, Rest For One), StartChild with args updates the spec's args for future restarts.

    Child Callbacks

    Enable EnableHandleChild: true to receive notifications when children start or stop:

    These callbacks run after the restart strategy completes. For example:

    1. Child crashes

    2. Supervisor applies restart strategy (stops affected children if needed)

    3. Supervisor starts replacement children

    4. Then HandleChildTerminate is called for the terminated child

    5. Then HandleChildStart is called for the replacement

    The callbacks are invoked as regular messages sent by the supervisor to itself. They arrive in the Main queue, so they're processed after the restart logic (which happens in the exit signal handler).

    If HandleChildStart or HandleChildTerminate returns an error, the supervisor terminates with that error. Use these callbacks for integration with external systems, not for restart decisions - restart logic is handled by the supervisor type and strategy.

    Supervisor as a Regular Actor

    Supervisors are actors. They have mailboxes, handle messages, and can communicate with other processes.

    This lets you build management APIs: query supervisor state, scale children dynamically, reconfigure at runtime. The supervisor processes these messages between handling exit signals.

    Observer Integration

    Supervisors provide runtime inspection via the HandleInspect method, which is automatically integrated with the Observer monitoring tool. When you call gen.Process.Inspect() on a supervisor, it returns detailed metrics about its current state:

    One For One / All For One / Rest For One:

    • type: Supervisor type ("One For One", "All For One", "Rest For One")

    • strategy: Restart strategy (Transient, Temporary, Permanent)

    • intensity: Maximum restart count within period

    • period: Time window in seconds for restart intensity

    • keep_order: Whether children stop sequentially (All/Rest For One only)

    • auto_shutdown: Whether supervisor stops when all children terminate

    • restarts_count: Number of restart timestamps currently tracked

    • children_total: Total child specs defined

    • children_running: Currently running children

    • children_disabled: Disabled children that won't restart

    Simple One For One:

    • type: "Simple One For One"

    • strategy: Restart strategy

    • intensity: Maximum restart count within period

    • period: Time window in seconds

    • restarts_count: Number of restart timestamps tracked

    • specs_total: Total child spec templates

    • specs_disabled: Disabled specs

    • instances_total: Total running instances across all specs

    • child:<name>: Number of running instances for specific child spec

    • child:<name>:args: Number of instances with custom args for specific child spec

    The Observer UI displays this information in real-time, letting you monitor supervision trees, track restart patterns, and identify failing components. You can also query this data programmatically:

    Inspection only works for local supervisors (same node). This integration makes it easy to diagnose issues in production: check restart counts to identify unstable processes, verify child counts match expected scaling, and monitor which instances have custom configurations.

    Restart Intensity Behavior

    Understanding restart intensity is critical for reliable systems. Here's exactly how it works:

    The supervisor maintains a list of restart timestamps in milliseconds. When a child terminates and restart is needed:

    1. Append current timestamp to the list

    2. Remove timestamps older than Period seconds

    3. If list length > Intensity, intensity is exceeded

    4. If exceeded: stop all children, terminate supervisor with act.ErrSupervisorRestartsExceeded

    5. If not exceeded: proceed with restart

    Example with Intensity: 3, Period: 5: a fourth crash within five seconds exceeds the intensity and the supervisor terminates. But if the child runs stable between crashes, old timestamps slide out of the window before the next failure is recorded.

    The sliding window means intermittent failures don't accumulate. Only rapid repeated failures exceed intensity.

    Shutdown Behavior

    When a supervisor terminates (receives exit signal, calls terminate from HandleMessage, or crashes), it stops all children first:

    1. Send gen.TerminateReasonShutdown via SendExit to all running children

    2. Wait for all children to terminate

    3. Call Terminate callback

    4. Remove supervisor from node

    With KeepOrder: true (All For One / Rest For One), children stop sequentially. With KeepOrder: false, they stop in parallel. Either way, the supervisor waits for all to finish before terminating itself.

    If a non-child process sends the supervisor an exit signal (via Link or SendExit), the supervisor initiates shutdown. This is how parent supervisors stop child supervisors - send an exit signal, and the entire subtree shuts down cleanly.

    Dynamic Children (Simple One For One)

    Simple One For One supervisors start with empty children and spawn them on demand.

    Start instances with StartChild:

    Each call spawns a new worker. The args passed to StartChild are stored for that specific instance. When the restart strategy triggers (child crashes, exceeds intensity, etc.), the child restarts with the same args it was originally started with, not the template args from the spec. This ensures each worker instance maintains its configuration across restarts.

    Workers are not registered by name (no SpawnRegister). You track them by PID from the return value or via supervisor.Children().

    Disabling a child spec stops all running instances with that spec name.

    Simple One For One ignores DisableAutoShutdown - the supervisor never auto-shuts down, even with zero children. It's designed for dynamic workloads where zero children is a valid state.

    Patterns and Pitfalls

    Set restart intensity carefully. Too low and transient failures kill your supervisor. Too high and crash loops consume resources. Start with defaults (Intensity: 5, Period: 5) and tune based on observed behavior.

    Use Significant sparingly. Marking a child significant couples its lifecycle to the entire supervision tree. This is powerful but reduces isolation. Prefer non-significant children and handle critical failures at a higher supervision level.

    Don't call management methods during restart. StartChild, AddChild, EnableChild, DisableChild fail with ErrSupervisorStrategyActive if the supervisor is mid-restart. Wait for the restart to complete (check via Inspect or wait for HandleChildStart callback).

    Disable auto shutdown for dynamic supervisors. If your supervisor uses AddChild to add children at runtime, set DisableAutoShutdown: true. Otherwise, the supervisor terminates when it starts with zero children or when all dynamically added children eventually stop.

    Use HandleChildStart for integration, not validation. By the time HandleChildStart is called, the child is already spawned and linked. Returning an error terminates the supervisor, but it doesn't prevent the child from having run. Use the child's Init callback for validation instead.

    KeepOrder is only for stopping. Children always start sequentially in declaration order. KeepOrder controls only the stopping phase of All For One and Rest For One restarts.

    Simple One For One args are persistent per instance. Args passed to StartChild are stored and used for that specific instance across all restarts. If you start a worker with StartChild("worker", "config-A") and it crashes, the restarted instance receives "config-A" again, not the template args from the child spec. This persistence ensures each worker maintains its identity and configuration through failures. If you need different args for a restart, you must manually stop the old instance and start a new one with different args.

    type Worker struct {
        act.Actor
        counter int
    }
    
    func (w *Worker) Init(args ...any) error {
        w.counter = 0
        w.Log().Info("worker %s starting", w.PID())
        return nil
    }
    
    func (w *Worker) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case IncrementRequest:
            w.counter += msg.Amount
            w.Send(from, IncrementResponse{Counter: w.counter})
        }
        return nil
    }
    
    func (w *Worker) Terminate(reason error) {
        w.Log().Info("worker stopped: %s", reason)
    }
    
    // Factory function for spawning
    func createWorker() gen.ProcessBehavior {
        return &Worker{}
    }
    pid, err := node.Spawn(createWorker, gen.ProcessOptions{})
    type ActorBehavior interface {
        gen.ProcessBehavior
        
        // Core lifecycle
        Init(args ...any) error
        HandleMessage(from gen.PID, message any) error
        HandleCall(from gen.PID, ref gen.Ref, request any) (any, error)
        Terminate(reason error)
        
        // Split handle callbacks (opt-in via SetSplitHandle)
        HandleMessageName(name gen.Atom, from gen.PID, message any) error
        HandleMessageAlias(alias gen.Alias, from gen.PID, message any) error
        HandleCallName(name gen.Atom, from gen.PID, ref gen.Ref, request any) (any, error)
        HandleCallAlias(alias gen.Alias, from gen.PID, ref gen.Ref, request any) (any, error)
        
        // Specialized callbacks
        HandleLog(message gen.MessageLog) error
        HandleEvent(message gen.MessageEvent) error
        HandleInspect(from gen.PID, item ...string) map[string]string
    }
    pid, err := node.Spawn(createWorker, gen.ProcessOptions{}, "config", 42)
    
    // In your actor:
    func (w *Worker) Init(args ...any) error {
        if len(args) > 0 {
            w.config = args[0].(string)
        }
        if len(args) > 1 {
            w.maxCount = args[1].(int)
        }
        return nil
    }
    func (w *Worker) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case WorkRequest:
            result := w.process(msg)
            w.Send(from, WorkResponse{Result: result})
        
        case StatusQuery:
            w.Send(from, StatusResponse{Status: w.status})
        
        case StopCommand:
            return gen.TerminateReasonNormal  // Terminate gracefully
        }
        
        return nil  // Continue running
    }
    func (w *Worker) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
    switch request.(type) {
        case GetCounterRequest:
            return CounterResponse{Counter: w.counter}, nil
        
        case ResetCounterRequest:
            old := w.counter
            w.counter = 0
            return ResetResponse{OldValue: old}, nil
        
        default:
            w.Log().Warning("unknown request type: %T from %s", request, from)
            return nil, nil  // Don't respond to unknown requests
        }
    }
    func (w *Worker) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
        switch req := request.(type) {
        case DivideRequest:
            if req.Divisor == 0 {
                return fmt.Errorf("division by zero"), nil
            }
            return req.Dividend / req.Divisor, nil
        }
        
        w.Log().Warning("unknown request type: %T from %s", request, from)
        return nil, nil
    }
    
    // Caller side:
    result, err := process.Call(workerPID, DivideRequest{10, 0})
    if err != nil {
        // Framework error (timeout, process unknown, etc.)
        log.Printf("call failed: %s", err)
        return
    }
    
    if e, ok := result.(error); ok {
        // Application error returned by HandleCall
        log.Printf("operation failed: %s", e)
        return
    }
    
    // Success - use result
    log.Printf("result: %v", result)
    func (w *Worker) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
        switch req := request.(type) {
        case ExpensiveQuery:
            // Send to worker pool
            w.Send(w.workerPool, PoolRequest{
                Query:  req,
                Caller: from,
                Ref:    ref,
            })
            // Return nil, nil to handle asynchronously
            return nil, nil
        }
        return nil, nil
    }
    
    // Later, when the worker pool replies:
    func (w *Worker) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case PoolResponse:
            // Send response to original caller
            w.SendResponse(msg.Caller, msg.Ref, msg.Result)
        }
        return nil
    }
    if !ref.IsAlive() {
        w.Log().Warning("caller timed out, discarding work")
        return nil
    }
    func (w *Worker) HandleMessage(from gen.PID, message any) error {
        switch message.(type) {
        case ShutdownCommand:
            return gen.TerminateReasonNormal  // Clean shutdown
        
        case PanicCommand:
            return fmt.Errorf("intentional failure")  // Error shutdown
        }
        return nil
    }
    func (w *Worker) Terminate(reason error) {
        w.Log().Info("worker %s stopping: %s", w.PID(), reason)
        // Clean up resources
        w.closeConnections()
        w.sendFinalStats()
    }
    func (w *Worker) Init(args ...any) error {
        w.SetTrapExit(true)
        return nil
    }
    
    func (w *Worker) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case gen.MessageExitPID:
            w.Log().Info("linked process %s terminated: %s", msg.PID, msg.Reason)
            // Decide how to handle it
            if msg.Reason == gen.TerminateReasonPanic {
                // Linked worker panicked, maybe restart it
                w.restartWorker(msg.PID)
            }
            // Don't terminate - we're trapping
            return nil
        
        case gen.MessageExitNode:
            w.Log().Warning("node %s disconnected", msg.Name)
            // Handle network partition
            return nil
        }
        return nil
    }
    func (w *Worker) Init(args ...any) error {
        w.SetSplitHandle(true)
        w.RegisterName("worker_service")
        alias, _ := w.CreateAlias()
        w.publicAPI = alias
        return nil
    }
    
    func (w *Worker) HandleMessage(from gen.PID, message any) error {
        // Messages sent to PID directly (internal use)
        w.Log().Debug("internal message from %s", from)
        return nil
    }
    
    func (w *Worker) HandleMessageName(name gen.Atom, from gen.PID, message any) error {
        // Messages sent to registered name "worker_service" (public API)
        w.Log().Info("public API call via name %s", name)
        return nil
    }
    
    func (w *Worker) HandleMessageAlias(alias gen.Alias, from gen.PID, message any) error {
        // Messages sent to alias (temporary session)
        w.Log().Debug("session message via alias %s", alias)
        return nil
    }
    func (w *Worker) HandleLog(message gen.MessageLog) error {
        // Format and write log message
        fmt.Printf("[%s] %s: %s\n", message.Level, message.PID, message.Message)
        return nil
    }
    func (w *Worker) HandleEvent(message gen.MessageEvent) error {
        switch message.Name {
        case "config_updated":
            w.reloadConfig()
        case "cache_invalidated":
            w.clearCache()
        }
        return nil
    }
    func (w *Worker) HandleInspect(from gen.PID, item ...string) map[string]string {
        return map[string]string{
            "counter":     fmt.Sprintf("%d", w.counter),
            "status":      w.status,
            "queue_depth": fmt.Sprintf("%d", w.queueDepth),
        }
    }
    // From within another process
    info, err := process.Inspect(workerPID)
    
    // Directly from the node
    info, err := node.Inspect(workerPID)
    type SupervisorBehavior interface {
        gen.ProcessBehavior
    
        // Init invoked on supervisor spawn - MANDATORY
        Init(args ...any) (SupervisorSpec, error)
    
        // HandleChildStart invoked when a child starts (if EnableHandleChild is true)
        HandleChildStart(name gen.Atom, pid gen.PID) error
    
        // HandleChildTerminate invoked when a child terminates (if EnableHandleChild is true)
        HandleChildTerminate(name gen.Atom, pid gen.PID, reason error) error
    
        // HandleMessage invoked for regular messages
        HandleMessage(from gen.PID, message any) error
    
        // HandleCall invoked for synchronous requests
        HandleCall(from gen.PID, ref gen.Ref, request any) (any, error)
    
        // HandleEvent invoked for subscribed events
        HandleEvent(message gen.MessageEvent) error
    
        // HandleInspect invoked for inspection requests
        HandleInspect(from gen.PID, item ...string) map[string]string
    
        // Terminate invoked on supervisor termination
        Terminate(reason error)
    }
    type AppSupervisor struct {
        act.Supervisor
    }
    
    func (s *AppSupervisor) Init(args ...any) (act.SupervisorSpec, error) {
        return act.SupervisorSpec{
            Type: act.SupervisorTypeOneForOne,
            Children: []act.SupervisorChildSpec{
                {
                    Name:    "database",
                    Factory: createDBWorker,
                    Args:    []any{"postgres://..."},
                },
                {
                    Name:    "api",
                    Factory: createAPIServer,
                    Args:    []any{8080},
                },
            },
            Restart: act.SupervisorRestart{
                Strategy:  act.SupervisorStrategyTransient,
                Intensity: 5,
                Period:    5,
            },
        }, nil
    }
    
    func createSupervisorFactory() gen.ProcessBehavior {
        return &AppSupervisor{}
    }
    
    // Spawn the supervisor
    pid, err := node.Spawn(createSupervisorFactory, gen.ProcessOptions{})
    Type: act.SupervisorTypeOneForOne,
    Children: []act.SupervisorChildSpec{
        {Name: "worker1", Factory: createWorker},
        {Name: "worker2", Factory: createWorker},
        {Name: "worker3", Factory: createWorker},
    },
    Type: act.SupervisorTypeAllForOne,
    Children: []act.SupervisorChildSpec{
        {Name: "cache", Factory: createCache},
        {Name: "processor", Factory: createProcessor},  // Depends on cache
        {Name: "api", Factory: createAPI},              // Depends on both
    },
    Type: act.SupervisorTypeRestForOne,
    Children: []act.SupervisorChildSpec{
        {Name: "database", Factory: createDB},       // Independent
        {Name: "cache", Factory: createCache},       // Depends on database
        {Name: "api", Factory: createAPI},           // Depends on cache
    },
    Type: act.SupervisorTypeSimpleOneForOne,
    Children: []act.SupervisorChildSpec{
        {
            Name:    "worker",
            Factory: createWorker,
            Args:    []any{"default-config"},
        },
    },
    // Start 5 worker instances
    for i := 0; i < 5; i++ {
        supervisor.StartChild("worker", fmt.Sprintf("worker-%d", i))
    }
    Restart: act.SupervisorRestart{
        Strategy: act.SupervisorStrategyTransient,  // Default
    }
    Restart: act.SupervisorRestart{
        Strategy: act.SupervisorStrategyTemporary,
    }
    Restart: act.SupervisorRestart{
        Strategy: act.SupervisorStrategyPermanent,
    }
    Restart: act.SupervisorRestart{
        Strategy:  act.SupervisorStrategyTransient,
        Intensity: 5,   // Maximum 5 restarts
        Period:    10,  // Within 10 seconds
    }
    Children: []act.SupervisorChildSpec{
        {
            Name:        "critical_service",
            Factory:     createCriticalService,
            Significant: true,  // If this stops cleanly, supervisor stops
        },
        {
            Name:    "helper",
            Factory: createHelper,
            // Significant: false (default)
        },
    },
    DisableAutoShutdown: false,  // Default - supervisor stops when children stop
    DisableAutoShutdown: true,  // Supervisor stays alive with zero children
    Restart: act.SupervisorRestart{
        KeepOrder: true,  // Stop sequentially in reverse order
    }
    // Start a child from the spec (if not already running)
    err := supervisor.StartChild("worker")
    
    // Start with different args (overrides spec)
    err = supervisor.StartChild("worker", "new-config")
    
    // Add a new child spec and start it
    err = supervisor.AddChild(act.SupervisorChildSpec{
        Name:    "new_worker",
        Factory: createWorker,
    })
    
    // Disable a child (stops it, won't restart on crash)
    err = supervisor.DisableChild("worker")
    
    // Re-enable a disabled child (starts it again)
    err = supervisor.EnableChild("worker")
    
    // Get list of children
    children := supervisor.Children()
    for _, child := range children {
        fmt.Printf("Spec: %s, PID: %s, Disabled: %v\n", 
            child.Spec, child.PID, child.Disabled)
    }
    func (s *AppSupervisor) Init(args ...any) (act.SupervisorSpec, error) {
        return act.SupervisorSpec{
            EnableHandleChild: true,
            // ... rest of spec
        }, nil
    }
    
    func (s *AppSupervisor) HandleChildStart(name gen.Atom, pid gen.PID) error {
        s.Log().Info("child %s started with PID %s", name, pid)
        // Maybe register in service discovery, send init message
        return nil
    }
    
    func (s *AppSupervisor) HandleChildTerminate(name gen.Atom, pid gen.PID, reason error) error {
        s.Log().Info("child %s (PID %s) terminated: %s", name, pid, reason)
        // Maybe deregister from service discovery, clean up resources
        return nil
    }
    func (s *AppSupervisor) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case ScaleCommand:
            if msg.Up {
                s.AddWorkers(msg.Count)
            } else {
                s.RemoveWorkers(msg.Count)
            }
        
        case HealthCheckRequest:
            children := s.Children()
            s.Send(from, HealthResponse{
                Running: len(children),
                Healthy: s.countHealthy(children),
            })
        }
        return nil
    }
    
    func (s *AppSupervisor) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
        switch request.(type) {
        case GetChildrenRequest:
            return s.Children(), nil
        }
        return nil, nil
    }
    // From within a process context
    info, err := process.Inspect(supervisorPID)
    
    // Directly from the node
    info, err := node.Inspect(supervisorPID)
    
    // Returns map[string]string with metrics above
    With Intensity: 3 and Period: 5, a crash loop trips the limit:
    
    Time 0s:  Child crashes → restart (count: 1)
    Time 1s:  Child crashes → restart (count: 2)
    Time 2s:  Child crashes → restart (count: 3)
    Time 3s:  Child crashes → EXCEEDED (count: 4 within 5s window)
              → Stop all children, supervisor terminates
    With the same settings, crashes spaced wider than the period never accumulate:
    
    Time 0s:  Child crashes → restart (count: 1)
    Time 6s:  Child crashes → restart (count: 1, previous outside window)
    Time 12s: Child crashes → restart (count: 1, previous outside window)
    Type: act.SupervisorTypeSimpleOneForOne,
    Children: []act.SupervisorChildSpec{
        {
            Name:    "worker",  // Template name
            Factory: createWorker,
            Args:    []any{"default-config"},
        },
    },
    // Start 10 workers with different args
    for i := 0; i < 10; i++ {
        supervisor.StartChild("worker", fmt.Sprintf("worker-%d", i))
    }
    // Stops all "worker" instances
    supervisor.DisableChild("worker")
  • Transmits the frame over one of the TCP connections in the pool to the remote node
  • Receives acknowledgment if Important delivery is enabled

  • Routes the message to the recipient's mailbox
  • Sends acknowledgment back if Important delivery was requested

  • If delivery fails (no such process, mailbox full), the remote node sends an error response back
  • The sender waits for the response (either acknowledgment or error) with a timeout

  • Byte 6: Order byte (derived from sender PID for message ordering)
  • Byte 7: Message type (101 for PID message, 121 for call request, 129 for response, 200 for compressed, etc.)

  • Recipient PID (8 bytes)
  • EDF-encoded message payload

  • You want simple deployment (separate HTTP gateway, actor backend)
  • Unified monitoring: HTTP requests visible as actor messages in system introspection

  • Return (http.Server sends response to client)

    All real work happens in ServeHTTP() executed by http.Server goroutines

  • ServeHTTP() sends meta.MessageWebRequest to worker actor using Send()

  • ServeHTTP() blocks on <-ctx.Done()

  • Worker actor receives message in its mailbox, processes it in HandleMessage()

  • Worker writes HTTP response to ResponseWriter, calls Done()

  • Done() cancels context, unblocking ServeHTTP()

  • ServeHTTP() returns, connection goroutine completes

  • Connection has state (subscriptions, session data) managed by the actor

    hashtag
    TCP

    Network services need to accept TCP connections, read data from sockets, and write responses - all blocking operations that don't fit the one-message-at-a-time actor model. You could spawn goroutines for each connection, but this breaks actor isolation. You need synchronization, careful lifecycle management, and lose the benefits of supervision trees.

    TCP meta-processes solve this by wrapping socket I/O in actors. The framework handles accept loops, connection management, and data buffering. Your actors receive messages when connections arrive or data is read. To send data, you send a message to the connection's meta-process. The actor model stays intact while integrating with blocking TCP operations.

    Ergo provides two TCP meta-processes: TCPServer for accepting connections, and TCPConnection for handling established connections (both incoming and outgoing).

    hashtag

    UDP

    UDP is fundamentally different from TCP. There are no connections, no ordering guarantees, no reliability. Datagrams arrive independently, potentially out of order, possibly duplicated, or lost entirely. This makes UDP simpler than TCP, but also requires different handling patterns.

    Traditional UDP servers use blocking ReadFrom calls in loops. This doesn't fit the actor model's one-message-at-a-time processing. You could spawn goroutines to read packets, but this breaks actor isolation and requires manual synchronization.

    UDP meta-process wraps the socket in an actor. It runs a read loop in the Start goroutine, sending each received datagram as a message to your actor. To send datagrams, you send messages to the UDP server's meta-process. The actor model stays intact while integrating with blocking UDP operations.

    Unlike TCP, UDP has no connections. One meta-process handles the entire socket - all incoming datagrams from all remote addresses. There's no per-connection state, no connection lifecycle, no connect/disconnect messages. Just datagrams in, datagrams out.
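    To make this concrete, here is a sketch of a UDP echo actor. It assumes the UDP meta-process mirrors the TCP API shown later in this chapter - meta.CreateUDPServer, meta.UDPServerOptions, and a meta.MessageUDP carrying ID, Addr, and Data are assumptions, not verified signatures:

```go
type UDPEcho struct {
    act.Actor
}

func (u *UDPEcho) Init(args ...any) error {
    // Option names here mirror the TCP/web meta-processes - treat
    // them as assumptions.
    server, err := meta.CreateUDPServer(meta.UDPServerOptions{
        Host: "localhost",
        Port: 9090,
    })
    if err != nil {
        return err
    }
    if _, err := u.SpawnMeta(server, gen.MetaOptions{}); err != nil {
        server.Terminate(err) // release the socket if spawn failed
        return err
    }
    return nil
}

func (u *UDPEcho) HandleMessage(from gen.PID, message any) error {
    if m, ok := message.(meta.MessageUDP); ok {
        // Echo the datagram back: send a MessageUDP to the
        // meta-process ID, addressed to the original sender.
        return u.Send(m.ID, meta.MessageUDP{Addr: m.Addr, Data: m.Data})
    }
    return nil
}
```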

    process.Send(pid, OrderRequest{OrderID: 12345, Items: []string{"item1", "item2"}})
    type Order struct {
        ID    int64
        Items []string
    }
    
    func init() {
        edf.RegisterTypeOf(Order{})  // Analyzed once, functions built
    }
    
    // Later, during message sending:
    process.Send(to, Order{ID: 42, Items: []string{"item1"}})  // Uses pre-built encoder
    package orders
    
    type OrderV1 struct { ID int64 }                    // #github.com/myapp/orders/OrderV1
    type OrderV2 struct { ID int64; Priority int }      // #github.com/myapp/orders/OrderV2
    type Order struct {
        ID    int64
        Items []string
    }
    
    func init() {
        edf.RegisterTypeOf(Order{})
    }
    type Order struct {
        ID    int64   // Exported - part of the contract
        items []Item  // Unexported - internal state, registration fails
    }
    type Order struct {
        ID    int64
        Cache *OrderCache  // Registration fails - pointer is local optimization
    }
    type Address struct {
        City   string
        Street string
    }
    
    type Person struct {
        Name    string
        Address Address
    }
    
    func init() {
        edf.RegisterTypeOf(Address{})  // register child first
        edf.RegisterTypeOf(Person{})   // then parent
    }
    type Config struct {
        public  string
        private int
    }
    
    // Option 1: edf.Marshaler/Unmarshaler (recommended for performance)
    func (c Config) MarshalEDF(w io.Writer) error {
        buf := make([]byte, 0, 256)
        buf = append(buf, c.public...)
        buf = binary.BigEndian.AppendUint64(buf, uint64(c.private))
        _, err := w.Write(buf)
        return err
    }
    
    func (c *Config) UnmarshalEDF(b []byte) error {
        c.public = string(b[:len(b)-8])
        c.private = int(binary.BigEndian.Uint64(b[len(b)-8:]))
        return nil
    }
    
    // Option 2: encoding.BinaryMarshaler/Unmarshaler (standard interface)
    func (c Config) MarshalBinary() ([]byte, error) {
        buf := make([]byte, 0, 256)
        buf = append(buf, c.public...)
        buf = binary.BigEndian.AppendUint64(buf, uint64(c.private))
        return buf, nil
    }
    
    func (c *Config) UnmarshalBinary(b []byte) error {
        c.public = string(b[:len(b)-8])
        c.private = int(binary.BigEndian.Uint64(b[len(b)-8:]))
        return nil
    }
    var (
        ErrInvalidOrder = errors.New("invalid order")
        ErrOutOfStock   = errors.New("out of stock")
    )
    
    func init() {
        edf.RegisterError(ErrInvalidOrder)
        edf.RegisterError(ErrOutOfStock)
    }
    pid, err := node.Spawn(createWorker, gen.ProcessOptions{
        Compression: gen.Compression{
            Enable:    true,
            Type:      gen.CompressionTypeGZIP,
            Level:     gen.CompressionLevelDefault,
            Threshold: 1024,
        },
    })
    process.SetCompression(true)
    process.SetCompressionType(gen.CompressionTypeGZIP)
    process.SetCompressionLevel(gen.CompressionLevelBestSpeed)
    process.SetCompressionThreshold(2048)
    err := process.SendImportant(remotePID, message)
    if err != nil {
        // Definitely failed - remote process doesn't exist,
        // or mailbox is full, or connection dropped
    }
    func main() {
        // Start node
        node, err := ergo.StartNode("gateway@localhost", gen.NodeOptions{})
        if err != nil {
            panic(err)
        }
        defer node.Stop()
    
        // Start HTTP server with node reference
        server := &APIServer{node: node}
        if err := server.Start(); err != nil {
            panic(err)
        }
    }
    
    type APIServer struct {
        node gen.Node
        mux  *http.ServeMux
    }
    
    func (a *APIServer) Start() error {
        a.mux = http.NewServeMux()
        a.mux.HandleFunc("/users/{id}", a.handleGetUser)
        a.mux.HandleFunc("/orders", a.handleCreateOrder)
    
        return http.ListenAndServe(":8080", a.mux)
    }
    
    func (a *APIServer) handleGetUser(w http.ResponseWriter, r *http.Request) {
        userID := r.PathValue("id")
    
        // Call actor anywhere in the cluster
        result, err := a.node.Call(
            gen.ProcessID{Name: "user-service", Node: "backend@node1"},
            GetUserRequest{ID: userID},
        )
    
        if err != nil {
            http.Error(w, "Service unavailable", http.StatusServiceUnavailable)
            return
        }
    
        if errResult, ok := result.(error); ok {
            http.Error(w, errResult.Error(), http.StatusNotFound)
            return
        }
    
        user := result.(User)
        json.NewEncoder(w).Encode(user)
    }
    // Same call works regardless of actor location
    result, err := node.Call(
        gen.ProcessID{Name: "user-service", Node: "backend@node1"},
        request,
    )
    func (a *APIServer) handleGetUser(w http.ResponseWriter, r *http.Request) {
        userID := r.PathValue("id")
    
        // Route requests for the same user to the same node
        // This improves cache locality - user data stays hot
        nodeID := consistentHash(userID, a.clusterSize)
        targetNode := fmt.Sprintf("backend@node%d", nodeID)
    
        result, err := a.node.Call(
            gen.ProcessID{Name: "user-service", Node: targetNode},
            GetUserRequest{ID: userID},
        )
    
        // handle result...
    }
    func (a *APIServer) handleRequest(w http.ResponseWriter, r *http.Request) {
        // Application discovery requires central registrar (etcd or Saturn)
        // See: networking/service-discovering.md
        registrar, err := a.node.Network().Registrar()
        if err != nil {
            http.Error(w, "Service unavailable", http.StatusServiceUnavailable)
            return
        }
    
        resolver := registrar.Resolver()
        routes, err := resolver.ResolveApplication("user-service")
        if err != nil || len(routes) == 0 {
            http.Error(w, "Service unavailable", http.StatusServiceUnavailable)
            return
        }
    
        // Select node based on weight, load, health, proximity
        target := a.selectNode(routes)
    
        result, err := a.node.Call(
            gen.ProcessID{Name: "user-service", Node: target.Node},
            GetUserRequest{ID: r.PathValue("id")},
        )
    
        // handle result...
    }
    
    func (a *APIServer) selectNode(routes []gen.ApplicationRoute) gen.ApplicationRoute {
        // Weighted random selection
        totalWeight := 0
        for _, r := range routes {
            totalWeight += r.Weight
        }
    
        pick := rand.Intn(totalWeight)
        for _, r := range routes {
            pick -= r.Weight
            if pick < 0 {
                return r
            }
        }
    
        return routes[0]
    }
    // Standard middleware works
    func authMiddleware(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if !isAuthorized(r) {
                http.Error(w, "Unauthorized", http.StatusUnauthorized)
                return
            }
            next.ServeHTTP(w, r)
        })
    }
    
    mux.Handle("/api/", authMiddleware(http.HandlerFunc(a.handleAPI)))
    type WebService struct {
        act.Actor
    }
    
    func (w *WebService) Init(args ...any) error {
        // Spawn worker that will handle HTTP requests
        _, err := w.SpawnRegister("web-worker",
            func() gen.ProcessBehavior { return &WebWorker{} },
            gen.ProcessOptions{},
        )
        if err != nil {
            return err
        }
    
        // Create HTTP multiplexer
        mux := http.NewServeMux()
    
        // Create handler meta-process pointing to worker
        handler := meta.CreateWebHandler(meta.WebHandlerOptions{
            Worker:         "web-worker",
            RequestTimeout: 5 * time.Second,
        })
    
        // Spawn handler meta-process
        handlerID, err := w.SpawnMeta(handler, gen.MetaOptions{})
        if err != nil {
            return err
        }
    
        // Register handler with mux (handler implements http.Handler)
        // Standard middleware works - handler is just http.Handler
        mux.Handle("/", authMiddleware(rateLimitMiddleware(handler)))
    
        // Create web server meta-process
        server, err := meta.CreateWebServer(meta.WebServerOptions{
            Host:    "localhost",
            Port:    8080,
            Handler: mux,
        })
        if err != nil {
            return err
        }
    
        // Spawn server meta-process
        serverID, err := w.SpawnMeta(server, gen.MetaOptions{})
        if err != nil {
            server.Terminate(err)
            return err
        }
    
        w.Log().Info("HTTP server listening on :8080 (server=%s, handler=%s)",
            serverID, handlerID)
        return nil
    }
    type WebWorker struct {
        act.Actor
    }
    
    func (w *WebWorker) HandleMessage(from gen.PID, message any) error {
        request, ok := message.(meta.MessageWebRequest)
        if !ok {
            return nil
        }
    
        defer request.Done()  // Always call Done to unblock ServeHTTP
    
        // Process HTTP request
        switch request.Request.Method {
        case "GET":
            user := w.getUserFromDB(request.Request.URL.Query().Get("id"))
            json.NewEncoder(request.Response).Encode(user)
    
        case "POST":
            var order Order
            json.NewDecoder(request.Request.Body).Decode(&order)
            w.createOrder(order)
            request.Response.WriteHeader(http.StatusCreated)
    
        default:
            http.Error(request.Response, "Method not supported", http.StatusMethodNotAllowed)
        }
    
        return nil
    }
    type WebWorkerPool struct {
        act.Pool
    }
    
    func (p *WebWorkerPool) Init(args ...any) (act.PoolOptions, error) {
        return act.PoolOptions{
            PoolSize:          20,   // 20 concurrent workers
            WorkerMailboxSize: 10,   // Each worker queues up to 10 requests
            WorkerFactory:     func() gen.ProcessBehavior { return &WebWorker{} },
        }, nil
    }
    
    func (w *WebService) Init(args ...any) error {
        // Spawn pool with registered name
        _, err := w.SpawnRegister("web-worker",
            func() gen.ProcessBehavior { return &WebWorkerPool{} },
            gen.ProcessOptions{},
        )
        if err != nil {
            return err
        }
    
        handler := meta.CreateWebHandler(meta.WebHandlerOptions{
            Worker: "web-worker",
        })
    
        _, err = w.SpawnMeta(handler, gen.MetaOptions{})
        // rest of setup...
    }
    // Chat room broadcasts to all connected clients
    for _, connAlias := range room.connections {
        room.Send(connAlias, ChatMessage{From: sender, Text: text})
    }
    
    // Game server on node1 pushes update to player connection on node2
    gameServer.Send(playerConnAlias, StateUpdate{HP: hp, Position: pos})
    
    // Backend actor pushes notification to user's browser
    backend.Send(userConnAlias, Notification{Text: "Task completed"})
    hashtag
    TCP Server: Accepting Connections

    Create a TCP server with meta.CreateTCPServer:

    The server opens a TCP socket and enters an accept loop. When a connection arrives, the server spawns a new TCPConnection meta-process to handle it. Each connection runs in its own meta-process, isolated from other connections.

    If SpawnMeta fails, you must call server.Terminate(err) to close the listening socket. Without this, the port remains bound and unusable until the process exits.

    The server runs forever, accepting connections and spawning handlers. When the parent actor terminates, the server terminates too (cascading termination), closing the listening socket and stopping all connection handlers.

    hashtag
    TCP Connection: Handling I/O

    When the server accepts a connection, it automatically spawns a TCPConnection meta-process. This meta-process reads data from the socket and sends it to your actor. To write data, you send messages to the connection's meta-process.

    MessageTCPConnect arrives when the connection is established. It contains the connection's meta-process ID (m.ID), remote address, and local address. Save the ID if you need to track connections or send data later.

    MessageTCP arrives when data is read from the socket. m.Data contains the bytes read (up to ReadBufferSize at a time). To send data, send a MessageTCP back to the connection's ID. The meta-process writes it to the socket.

    MessageTCPDisconnect arrives when the connection closes (client disconnected, network error, or you terminated the connection). After this, the connection meta-process is dead - sending to its ID returns an error.

    If the connection meta-process cannot send messages to your actor (actor crashed, mailbox full), it terminates the connection and stops. This ensures failed actors don't leak connections.

    hashtag
    Routing to Workers

    By default, all connections send messages to the parent actor - the one that spawned the server. For a server handling many connections, this creates a bottleneck. All connections compete for the parent's mailbox, and messages are processed sequentially.

    Use ProcessPool to distribute connections across multiple workers:

    The server distributes connections round-robin across the pool. Connection 1 goes to tcp_worker_0, connection 2 goes to tcp_worker_1, and so on. After tcp_worker_9, it wraps back to tcp_worker_0.

    Each worker handles its connections independently. If a worker crashes, its connections terminate (they can't send messages anymore). The supervisor restarts the worker, which begins handling new connections. The distribution is stateless - the server doesn't track which worker handles which connection.

    Do not use act.Pool in ProcessPool. act.Pool forwards messages to any available worker, breaking the connection-to-worker binding. If connection A sends message 1 to worker X and message 2 to worker Y, the protocol state becomes corrupted. Use a list of individual process names instead.

    Workers are typically actors that maintain per-connection state:

    hashtag
    Client Connections

    To initiate outgoing TCP connections, use meta.CreateTCPConnection:

    CreateTCPConnection connects to the remote host immediately. If the connection fails (host unreachable, connection refused), it returns an error. If successful, it returns a meta-process behavior ready to spawn.

    The spawned meta-process sends MessageTCPConnect when ready, then streams received data as MessageTCP messages. To send data, send MessageTCP to the connection's ID.

    Client connections use the same TCPConnection meta-process as server-side connections. The only difference is how they're created: CreateTCPConnection initiates a connection, while the server spawns connections automatically on accept.

    hashtag
    Chunking: Message Framing

    Raw TCP is a byte stream, not a message stream. If you send two 100-byte messages, they might arrive as one 200-byte read, or three reads (150 bytes, 40 bytes, 10 bytes). You must frame messages to detect boundaries.

    Enable chunking for automatic framing:

    Fixed-length messages:

    Every MessageTCP contains exactly 256 bytes. The meta-process buffers reads until 256 bytes accumulate, then sends them. If a socket read returns 512 bytes, you receive two MessageTCP messages.

    Header-based messages:

    The meta-process reads the 4-byte header, extracts the length as a big-endian integer, waits for the full payload, then sends the complete message (header + payload) as one MessageTCP.

    Protocol example:

    You receive:

    • First MessageTCP: 14 bytes (4 + 10)

    • Second MessageTCP: 260 bytes (4 + 256)

    If both messages arrive in one socket read (274 bytes total), the meta-process splits them automatically. If the header arrives first and the payload arrives later (slow connection), the meta-process waits for the complete message.

    MaxLength protects against malformed or malicious messages. If the header claims a message is 4GB, the meta-process terminates with gen.ErrTooLarge instead of allocating 4GB of memory.

    HeaderLengthSize can be 1, 2, or 4 bytes (big-endian). HeaderLengthPosition specifies the offset within the header. Example for a protocol with type + flags + length:

    Without chunking, you receive raw bytes as the meta-process reads them. You must buffer and frame messages yourself - typically by accumulating data in your actor's state and detecting message boundaries manually.
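
    The manual approach can be sketched as follows — a minimal example assuming a [4-byte big-endian length][payload] protocol. extractFrames is an illustrative helper, not part of the meta package; in an actor you would call it on the accumulated per-connection buffer each time MessageTCP arrives.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// extractFrames splits complete [4-byte length][payload] frames off the
// front of buffer and returns the leftover bytes for the next read.
func extractFrames(buffer []byte) (frames [][]byte, rest []byte) {
	for {
		if len(buffer) < 4 {
			return frames, buffer // incomplete header - wait for more data
		}
		length := int(binary.BigEndian.Uint32(buffer[:4]))
		if len(buffer) < 4+length {
			return frames, buffer // incomplete payload - wait for more data
		}
		frames = append(frames, buffer[4:4+length])
		buffer = buffer[4+length:]
	}
}

func main() {
	// Two frames split awkwardly across reads: "hi" (len 2) complete,
	// "ergo" (len 4) only partially received.
	stream := []byte{0, 0, 0, 2, 'h', 'i', 0, 0, 0, 4, 'e', 'r'}
	frames, rest := extractFrames(stream)
	fmt.Println(len(frames), string(frames[0]), len(rest)) // 1 hi 6
}
```

    This is exactly the buffering the chunking options do for you - enabling ReadChunk is almost always simpler.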

    hashtag
    Buffer Management

    The meta-process allocates buffers for reading socket data. By default, each read allocates a new buffer, which becomes garbage after you process it. For high-throughput servers, this causes GC pressure.

    Use a buffer pool:

    The meta-process gets buffers from the pool when reading. When you receive MessageTCP, the Data field is a buffer from the pool. Return it to the pool after processing:

    When you send MessageTCP to write data, the meta-process automatically returns the buffer to the pool after writing (if a pool is configured). Don't use the buffer after sending.

    If you need to store data beyond the current message, copy it:

    Buffer pools are essential for servers handling thousands of connections or high throughput. For low-volume clients, the GC overhead is negligible - skip the pool for simplicity.

    hashtag
    Write Keepalive

    Some protocols require periodic writes to keep connections alive. If no data is sent for a timeout period, the peer disconnects. You could send keepalive messages with timers, but this is tedious and error-prone.

    Enable automatic keepalive:

    The meta-process wraps the socket with a keepalive writer. If nothing is written for 30 seconds, it automatically sends a null byte. The peer receives it as normal data. Design your protocol to ignore keepalive messages.

    Keepalive bytes can be anything: a ping message, a heartbeat packet, or a protocol-specific keepalive. The peer sees them as regular socket data.

    This is application-level keepalive (layer 7), not TCP keepalive (layer 4). Both can be used simultaneously.

    hashtag
    TCP Keepalive (OS-Level)

    TCP has built-in keepalive at the protocol level. Enable it with KeepAlivePeriod:

    The OS sends TCP keepalive probes every 60 seconds when the connection is idle. If the peer doesn't respond, the connection is closed. This detects dead connections (network partition, crashed peer) without application involvement.

    Set KeepAlivePeriod to 0 to disable TCP keepalive (default). Set it to -1 for OS default behavior (typically 2 hours on Linux, varies by platform).

    TCP keepalive (OS-level) and write buffer keepalive (application-level) serve different purposes:

    • TCP keepalive: Detects dead connections

    • Write keepalive: Satisfies application protocols that require periodic data

    Most servers need TCP keepalive to clean up dead connections. Some protocols also need write keepalive to satisfy their requirements.

    hashtag
    TLS Encryption

    Enable TLS with a certificate manager:

    The server wraps accepted connections with TLS. The certificate manager provides certificates dynamically (for SNI, certificate rotation, etc.). See CertManager for details.

    For client connections:

    The client establishes a TLS connection during CreateTCPConnection. By default, the client verifies the server's certificate. To skip verification (testing only):

    Never use InsecureSkipVerify in production. It disables certificate validation, making you vulnerable to man-in-the-middle attacks.

    With TLS enabled, data is encrypted automatically. Your actor sends and receives plaintext MessageTCP - the meta-process handles encryption/decryption transparently.

    hashtag
    Process Routing

    For both server and client connections, you can route messages to a specific process:

    If Process is not set (client) or ProcessPool is empty (server), messages go to the parent actor.

    For servers, ProcessPool enables load distribution. For clients, Process enables separation of concerns - the actor that initiates connections doesn't need to handle the protocol.

    hashtag
    Inspection

    TCP meta-processes support inspection for debugging:

    Use this for monitoring, debugging, or displaying connection status in management interfaces.

    hashtag
    Patterns and Pitfalls

    Pattern: Connection registry

    Track all active connections. Useful for monitoring, rate limiting, or forced disconnection.

    Pattern: Protocol state machine

    Maintain per-connection protocol state for complex protocols with multiple stages (handshake, authentication, data transfer).

    Pattern: Broadcast to all connections

    Send the same data to all active connections. Useful for chat servers, pub/sub systems, or monitoring dashboards.

    Pitfall: Not handling MessageTCPDisconnect

    After disconnect, the connection state remains in memory forever. Always clean up on disconnect.

    Pitfall: act.Pool in ProcessPool

    If worker_pool is an act.Pool, messages from one connection are distributed across multiple workers. Connection A's messages might go to worker 1, then worker 2, then worker 1 again. Protocol state is split across workers, causing corruption.

    Use individual process names, not pools.

    Pitfall: Blocking in message handler

    If the worker handles multiple connections, one slow operation blocks all of them. The worker can't process messages from other connections while blocked.

    Solution: Spawn a goroutine for slow operations, or use a worker pool (one worker per connection).

    Pitfall: Forgetting to return buffers

    Pool buffers are reused. Storing them directly leads to data corruption. Always copy, then return.

    TCP meta-processes handle the complexity of socket I/O, connection management, and buffering - letting you focus on protocol implementation while maintaining the actor model's isolation and supervision benefits.

    hashtag
    Creating a UDP Server

    Create a UDP server with meta.CreateUDPServer:

    The server opens a UDP socket and enters a read loop. For each received datagram, it sends MessageUDP to your actor. Your actor processes it and optionally sends a response by sending MessageUDP back to the server's meta-process ID.

    If SpawnMeta fails, call server.Terminate(err) to close the socket. Without this, the port remains bound until the process exits.

    The server runs forever, reading datagrams and forwarding them as messages. When the parent actor terminates, the server terminates too (cascading termination), closing the socket.

    hashtag
    Handling Datagrams

    The UDP server sends MessageUDP for each received datagram:

    MessageUDP contains:

    • ID: The UDP server's meta-process ID (same for all datagrams)

    • Addr: Remote address that sent this datagram (net.Addr - typically *net.UDPAddr)

    • Data: The datagram payload (up to BufferSize bytes)

    To send a datagram, send MessageUDP to the server's ID with the destination address and payload. The server writes it to the socket with WriteTo. The ID field is ignored when sending (it's only used for incoming datagrams).

    Unlike TCP:

    • No connect/disconnect messages - datagrams are independent

    • Addr changes for each datagram - track remote addresses yourself if needed

    • No message framing - each UDP datagram is a complete message

    • No ordering guarantees - process datagrams as they arrive
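
    Putting receive and reply together, a minimal echo handler might look like this (EchoUDP is an illustrative actor type; the fields follow the MessageUDP description above):

```go
type EchoUDP struct {
    act.Actor
}

func (e *EchoUDP) HandleMessage(from gen.PID, message any) error {
    switch m := message.(type) {
    case meta.MessageUDP:
        // Reply to the sender: Addr selects the destination address,
        // ID addresses the UDP server meta-process that performs the write.
        e.Send(m.ID, meta.MessageUDP{
            Addr: m.Addr,
            Data: m.Data,
        })
    }
    return nil
}
```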

    hashtag
    Connectionless Nature

    UDP has no connections. Each datagram is independent. The same remote address might send multiple datagrams, but there's no session state. If you need state per remote address, maintain it yourself:

    Because UDP has no connection lifecycle, you need application-level timeout logic to clean up stale state. The server doesn't know when clients "disconnect" - they just stop sending datagrams.

    hashtag
    Routing to Workers

    By default, all datagrams go to the parent actor. For servers handling high datagram rates, this creates a bottleneck. Use Process to route to a different handler:

    All datagrams go to metrics_collector instead of the parent. This enables separation of concerns - the actor that creates the UDP server doesn't need to handle datagrams.

    Unlike TCP's ProcessPool, UDP only has a single Process field. You can route to an act.Pool:

    Each datagram is forwarded to the pool, which distributes them across workers. This works for UDP because datagrams are independent - there's no per-connection state to corrupt. For TCP, ProcessPool uses round-robin to maintain connection-to-worker binding. For UDP, the pool can distribute freely.

    Use pools when datagram processing is CPU-intensive or slow (database writes, external API calls). Workers process datagrams in parallel, maximizing throughput.
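
    A sketch of the wiring, to be placed inside your actor's Init (the pool name, UDPWorkerPool type, and port are illustrative assumptions):

```go
// Spawn an act.Pool that distributes datagrams across its workers.
_, err := p.SpawnRegister("udp_workers",
    func() gen.ProcessBehavior { return &UDPWorkerPool{} },
    gen.ProcessOptions{},
)
if err != nil {
    return err
}

// Route all datagrams to the pool instead of the parent actor.
options := meta.UDPServerOptions{
    Port:    9000,
    Process: "udp_workers",
}
```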

    hashtag
    Buffer Management

    The UDP server allocates a buffer for each datagram read. By default, it allocates a new buffer every time, which becomes garbage after you process it. For high datagram rates, this causes GC pressure.

    Use a buffer pool:

    The server gets buffers from the pool when reading. When you receive MessageUDP, the Data field is a buffer from the pool. Return it to the pool after processing:

    When you send MessageUDP to write a datagram, the server automatically returns the buffer to the pool after writing (if a pool is configured). Don't use the buffer after sending.

    If you need to store data beyond the current message, copy it:

    Buffer pools are essential for servers receiving thousands of datagrams per second. For low-volume servers (a few datagrams per second), the GC overhead is negligible - skip the pool for simplicity.

    hashtag
    Buffer Size

    UDP datagrams are limited by the network's Maximum Transmission Unit (MTU). IPv4 networks typically have 1500-byte MTU, IPv6 has 1280-byte minimum. After subtracting IP and UDP headers (28 bytes for IPv4, 48 bytes for IPv6), you get:

    • IPv4 safe maximum: 1472 bytes (1500 - 28)

    • IPv6 safe maximum: 1232 bytes (1280 - 48)

    • Internet-safe maximum: 512 bytes (DNS requirement)

    Datagrams larger than MTU are fragmented at the IP layer. Fragmented datagrams are reassembled by the receiving OS before ReadFrom returns. However, if any fragment is lost, the entire datagram is discarded - UDP reliability degrades.

    The default BufferSize is 65000 bytes (close to UDP's theoretical maximum of 65507 bytes). This handles any UDP datagram, but it's wasteful if your protocol uses smaller messages:

    If a datagram is larger than BufferSize, it's truncated - you receive only the first BufferSize bytes. The rest is discarded. Set BufferSize to the maximum expected datagram size for your protocol.

    Smaller buffers reduce memory usage (important with buffer pools). Larger buffers avoid truncation but waste memory if datagrams are typically small.
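
    For example, a protocol whose datagrams never exceed the IPv4-safe maximum might configure (port number illustrative):

```go
options := meta.UDPServerOptions{
    Port:       9000,
    BufferSize: 1472, // IPv4 MTU-safe maximum; larger datagrams are truncated
}
```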

    hashtag
    No Chunking

    Unlike TCP, UDP meta-process has no chunking support. UDP datagrams are atomic - each datagram is a complete message. There's no byte stream to split or reassemble. The protocol boundary is the datagram boundary.

    If your protocol sends multi-datagram messages, you must handle reassembly yourself:

    UDP delivers datagrams out of order. Fragment 2 might arrive before fragment 1. Your reassembly logic must handle this. Use sequence numbers, timeouts for incomplete sets, and protection against memory exhaustion (limit maximum incomplete messages).

    Most UDP protocols avoid multi-datagram messages entirely. Keep messages under MTU size for reliability and simplicity.

    hashtag
    Unreliability and Idempotence

    UDP datagrams can be:

    • Lost: Network congestion, router overload, buffer overflow

    • Duplicated: Network retransmission, switch mirroring

    • Reordered: Different paths through the network

    • Corrupted: Rare, but possible despite checksums

    Design your protocol to handle these:

    Loss tolerance: Don't rely on every datagram arriving. Either accept loss (game state updates, sensor readings) or implement application-level acknowledgment and retransmission.

    Duplicate tolerance: Process datagrams idempotently. If the same datagram arrives twice, the result is the same. Use sequence numbers to detect and discard duplicates:
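
    A minimal sliding-window dedup sketch using sequence numbers (illustrative only; a real protocol also needs sequence-number wraparound handling):

```go
package main

import "fmt"

// dedup tracks recently seen sequence numbers inside a fixed window.
type dedup struct {
	seen   map[uint32]bool
	window uint32 // how far back we remember
	max    uint32 // highest sequence observed
}

// accept reports whether seq is new, recording it if so.
func (d *dedup) accept(seq uint32) bool {
	if d.seen[seq] {
		return false // duplicate - discard
	}
	d.seen[seq] = true
	if seq > d.max {
		d.max = seq
	}
	// Forget entries that fell out of the window to bound memory.
	for s := range d.seen {
		if s+d.window < d.max {
			delete(d.seen, s)
		}
	}
	return true
}

func main() {
	d := &dedup{seen: map[uint32]bool{}, window: 1000}
	fmt.Println(d.accept(1), d.accept(2), d.accept(1)) // true true false
}
```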

    Reordering tolerance: Don't assume datagrams arrive in send order. Use timestamps or sequence numbers to handle reordering:

    Corruption detection: UDP has a 16-bit checksum, but it's weak. Critical data should have application-level integrity checks (CRC32, hash, signature).

    Most importantly: design your protocol so datagram loss doesn't break functionality. UDP is for scenarios where loss is acceptable (real-time updates) or where you implement your own reliability layer (QUIC, custom protocols).

    hashtag
    Inspection

    UDP server supports inspection for debugging:

    Use this for monitoring datagram counts, bandwidth usage, or displaying server status.

    hashtag
    Patterns and Pitfalls

    Pattern: Metrics aggregation

    Aggregate many datagrams into periodic summaries. Lossy protocols (like StatsD) rely on volume - losing a few datagrams doesn't affect aggregate accuracy.

    Pattern: Request-response with timeout

    Implement application-level reliability with timeouts and retries. UDP doesn't guarantee delivery, so you must detect and handle failures.

    Pattern: Broadcast responder

    Respond to broadcast discovery requests. Track sender address from MessageUDP.Addr and reply directly.

    Pitfall: Not returning buffers

    Pool buffers are reused immediately. Storing them leads to data corruption when the pool reuses the buffer for the next datagram.

    Pitfall: Assuming reliability

    Some datagrams will inevitably be lost. A protocol that splits messages into chunks and assumes every chunk arrives will wait forever for missing chunks, or process incomplete data. Either accept loss (send redundant data) or implement acknowledgment and retransmission.

    Pitfall: Large datagrams

    IP-level fragmentation significantly increases loss probability. If any fragment is lost, the entire datagram is discarded. Keep datagrams under 1472 bytes for reliability, or 512 bytes for internet-wide compatibility.

    Pitfall: Not handling duplicates

    Network equipment can duplicate UDP datagrams (switch mirroring, retransmission logic). Process commands idempotently or track sequence numbers.

    UDP meta-process handles the complexity of socket I/O and datagram delivery while maintaining actor isolation. Design your protocol for UDP's unreliable, unordered, connectionless nature - and leverage its simplicity and low latency where reliability isn't critical.

    type EchoServer struct {
        act.Actor
    }
    
    func (e *EchoServer) Init(args ...any) error {
        options := meta.TCPServerOptions{
            Host: "0.0.0.0",  // Listen on all interfaces
            Port: 8080,
        }
        
        server, err := meta.CreateTCPServer(options)
        if err != nil {
            return fmt.Errorf("failed to create TCP server: %w", err)
        }
        
        // Start the server meta-process
        serverID, err := e.SpawnMeta(server, gen.MetaOptions{})
        if err != nil {
            // Failed to spawn - close the listening socket
            server.Terminate(err)
            return fmt.Errorf("failed to spawn TCP server: %w", err)
        }
        
        e.Log().Info("TCP server listening on %s:%d (id: %s)", 
            options.Host, options.Port, serverID)
        return nil
    }
    func (e *EchoServer) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case meta.MessageTCPConnect:
            // New connection established
            e.Log().Info("client connected: %s -> %s (id: %s)", 
                m.RemoteAddr, m.LocalAddr, m.ID)
            
            // Send welcome message
            e.Send(m.ID, meta.MessageTCP{
                Data: []byte("Welcome to echo server!\n"),
            })
            
        case meta.MessageTCP:
            // Received data from client
            e.Log().Info("received %d bytes from %s", len(m.Data), m.ID)
            
            // Echo it back
            e.Send(m.ID, meta.MessageTCP{
                Data: m.Data,
            })
            
        case meta.MessageTCPDisconnect:
            // Connection closed
            e.Log().Info("client disconnected: %s", m.ID)
        }
        return nil
    }
    type TCPDispatcher struct {
        act.Actor
    }
    
    func (d *TCPDispatcher) Init(args ...any) error {
        // Start worker pool
        for i := 0; i < 10; i++ {
            workerName := gen.Atom(fmt.Sprintf("tcp_worker_%d", i))
            _, err := d.SpawnRegister(workerName, createWorker, gen.ProcessOptions{})
            if err != nil {
                return err
            }
        }
        
        // Configure server with worker pool
        options := meta.TCPServerOptions{
            Port: 8080,
            ProcessPool: []gen.Atom{
                "tcp_worker_0",
                "tcp_worker_1",
                "tcp_worker_2",
                "tcp_worker_3",
                "tcp_worker_4",
                "tcp_worker_5",
                "tcp_worker_6",
                "tcp_worker_7",
                "tcp_worker_8",
                "tcp_worker_9",
            },
        }
        
        server, err := meta.CreateTCPServer(options)
        if err != nil {
            return err
        }
        
        _, err = d.SpawnMeta(server, gen.MetaOptions{})
        if err != nil {
            server.Terminate(err)
            return err
        }
        
        return nil
    }
    type TCPWorker struct {
        act.Actor
        connections map[gen.Alias]*ConnectionState
    }
    
    type ConnectionState struct {
        remoteAddr net.Addr
        buffer     []byte
        // ... protocol state
    }
    
    func (w *TCPWorker) Init(args ...any) error {
        // Initialize the map to avoid assigning into a nil map
        w.connections = make(map[gen.Alias]*ConnectionState)
        return nil
    }
    
    func (w *TCPWorker) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case meta.MessageTCPConnect:
            w.connections[m.ID] = &ConnectionState{
                remoteAddr: m.RemoteAddr,
            }
            
        case meta.MessageTCP:
            state := w.connections[m.ID]
            w.processData(m.ID, state, m.Data)
            
        case meta.MessageTCPDisconnect:
            delete(w.connections, m.ID)
        }
        return nil
    }
    type HTTPClient struct {
        act.Actor
        connID gen.Alias
    }
    
    func (c *HTTPClient) Init(args ...any) error {
        options := meta.TCPConnectionOptions{
            Host: "example.com",
            Port: 80,
        }
        
        connection, err := meta.CreateTCPConnection(options)
        if err != nil {
            return fmt.Errorf("failed to connect: %w", err)
        }
        
        connID, err := c.SpawnMeta(connection, gen.MetaOptions{})
        if err != nil {
            connection.Terminate(err)
            return fmt.Errorf("failed to spawn connection: %w", err)
        }
        
        c.connID = connID
        return nil
    }
    
    func (c *HTTPClient) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case meta.MessageTCPConnect:
            // Connection established, send HTTP request
            request := "GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
            c.Send(m.ID, meta.MessageTCP{
                Data: []byte(request),
            })
            
        case meta.MessageTCP:
            // Received HTTP response
            c.Log().Info("response: %s", string(m.Data))
            
        case meta.MessageTCPDisconnect:
            // Server closed connection
            c.Log().Info("connection closed by server")
        }
        return nil
    }
    options := meta.TCPServerOptions{
        Port: 8080,
        ReadChunk: meta.ChunkOptions{
            Enable:      true,
            FixedLength: 256,  // Every message is exactly 256 bytes
        },
    }
    options := meta.TCPServerOptions{
        Port: 8080,
        ReadBufferSize: 8192,
        ReadChunk: meta.ChunkOptions{
            Enable: true,
            
            // Protocol: [4-byte length][payload]
            HeaderSize:                 4,
            HeaderLengthPosition:       0,
            HeaderLengthSize:           4,
            HeaderLengthIncludesHeader: false,  // Length is payload only
            
            MaxLength: 1048576,  // Max 1MB per message
        },
    }
    Message 1: [0x00 0x00 0x00 0x0A] [10 bytes payload]
    Message 2: [0x00 0x00 0x01 0x00] [256 bytes payload]
    // Protocol: [type][flags][length-MSB][length-LSB][payload]
    ReadChunk: meta.ChunkOptions{
        Enable:               true,
        HeaderSize:           4,
        HeaderLengthPosition: 2,  // Length starts at byte 2
        HeaderLengthSize:     2,  // 2-byte length
    }
    bufferPool := &sync.Pool{
        New: func() any {
            return make([]byte, 8192)
        },
    }
    
    options := meta.TCPServerOptions{
        Port:           8080,
        ReadBufferSize: 8192,
        ReadBufferPool: bufferPool,
    }
    func (w *Worker) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case meta.MessageTCP:
            // Process data
            result := w.processPacket(m.Data)
            
            // Send response
            w.Send(m.ID, meta.MessageTCP{Data: result})
            
            // Return read buffer to pool
            bufferPool.Put(m.Data)
        }
        return nil
    }
    case meta.MessageTCP:
        state := w.connections[m.ID]
        
        // Store in connection state - must copy
        state.buffer = append(state.buffer, m.Data...)
        
        // Return original buffer
        bufferPool.Put(m.Data)
    options := meta.TCPServerOptions{
        Port: 8080,
        WriteBufferKeepAlive:       []byte{0x00},  // Send null byte
        WriteBufferKeepAlivePeriod: 30 * time.Second,
    }
    options := meta.TCPServerOptions{
        Port: 8080,
        Advanced: meta.TCPAdvancedOptions{
            KeepAlivePeriod: 60 * time.Second,
        },
    }
    certManager := createCertManager()  // Your certificate manager
    
    options := meta.TCPServerOptions{
        Port:        8443,
        CertManager: certManager,
    }
    options := meta.TCPConnectionOptions{
        Host:        "example.com",
        Port:        443,
        CertManager: certManager,
    }
    options := meta.TCPConnectionOptions{
        Host:               "self-signed-server.local",
        Port:               443,
        CertManager:        certManager,
        InsecureSkipVerify: true,  // Don't verify server certificate
    }
    // Server: all connections send messages to "connection_manager"
    serverOpts := meta.TCPServerOptions{
        Port:        8080,
        ProcessPool: []gen.Atom{"connection_manager"},
    }
    
    // Client: this connection sends messages to "http_handler"
    clientOpts := meta.TCPConnectionOptions{
        Host:    "example.com",
        Port:    80,
        Process: "http_handler",
    }
    // Inspect server
    serverInfo, _ := process.Call(serverID, gen.Inspect{})
    // Returns: map[string]string{"listener": "0.0.0.0:8080"}
    
    // Inspect connection
    connInfo, _ := process.Call(connID, gen.Inspect{})
    // Returns: map[string]string{
    //     "local":     "192.168.1.10:8080",
    //     "remote":    "192.168.1.20:54321",
    //     "process":   "tcp_worker_3",
    //     "bytes in":  "1048576",
    //     "bytes out": "524288",
    // }
    type ConnectionManager struct {
        act.Actor
        connections map[gen.Alias]*ConnectionInfo
    }
    
    type ConnectionInfo struct {
        remoteAddr net.Addr
        startTime  time.Time
    }
    
    func (m *ConnectionManager) Init(args ...any) error {
        // Initialize the map to avoid assigning into a nil map
        m.connections = make(map[gen.Alias]*ConnectionInfo)
        return nil
    }
    
    func (m *ConnectionManager) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case meta.MessageTCPConnect:
            m.connections[msg.ID] = &ConnectionInfo{
                remoteAddr: msg.RemoteAddr,
                startTime:  time.Now(),
            }
            m.Log().Info("connection #%d: %s", len(m.connections), msg.RemoteAddr)
            
        case meta.MessageTCPDisconnect:
            info := m.connections[msg.ID]
            duration := time.Since(info.startTime)
            m.Log().Info("connection closed: %s (duration: %s)", 
                info.remoteAddr, duration)
            delete(m.connections, msg.ID)
        }
        return nil
    }
    type ProtocolHandler struct {
        act.Actor
        connections map[gen.Alias]*ProtocolState
    }
    
    type ProtocolState struct {
        state  int  // Current state in protocol state machine
        buffer []byte
    }
    
    func (h *ProtocolHandler) Init(args ...any) error {
        // Initialize the map to avoid assigning into a nil map
        h.connections = make(map[gen.Alias]*ProtocolState)
        return nil
    }
    
    func (h *ProtocolHandler) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case meta.MessageTCPConnect:
            h.connections[m.ID] = &ProtocolState{state: STATE_INITIAL}
            
        case meta.MessageTCP:
            state := h.connections[m.ID]
            state.buffer = append(state.buffer, m.Data...)
            
            // Process buffered data according to current state
            for {
                complete, nextState := h.processState(m.ID, state)
                if !complete {
                    break
                }
                state.state = nextState
            }
            
            bufferPool.Put(m.Data)
        }
        return nil
    }
    func (m *ConnectionManager) broadcastMessage(data []byte) {
        for connID := range m.connections {
            m.Send(connID, meta.MessageTCP{Data: data})
        }
    }
    // WRONG: Connection state leaked
    func (w *Worker) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case meta.MessageTCPConnect:
            w.connections[m.ID] = &State{}
            
        case meta.MessageTCP:
            w.connections[m.ID].process(m.Data)
            
        // No MessageTCPDisconnect handler!
        }
        return nil
    }
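
    A corrected version of the handler above removes the per-connection state on disconnect (sketch mirroring the WRONG example):

```go
// CORRECT: Clean up connection state on disconnect
func (w *Worker) HandleMessage(from gen.PID, message any) error {
    switch m := message.(type) {
    case meta.MessageTCPConnect:
        w.connections[m.ID] = &State{}

    case meta.MessageTCP:
        w.connections[m.ID].process(m.Data)

    case meta.MessageTCPDisconnect:
        delete(w.connections, m.ID) // free per-connection state
    }
    return nil
}
```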
    // WRONG: Protocol state corrupted
    options := meta.TCPServerOptions{
        ProcessPool: []gen.Atom{"worker_pool"},  // Don't use act.Pool!
    }
    // WRONG: Blocks actor, stalls other connections
    func (w *Worker) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case meta.MessageTCP:
            // Slow database query
            result := w.db.Query("SELECT * FROM large_table")
            w.Send(m.ID, meta.MessageTCP{Data: result})
        }
        return nil
    }
    // WRONG: Buffer leaked
    func (w *Worker) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case meta.MessageTCP:
            // Store data, never return buffer
            w.dataQueue = append(w.dataQueue, m.Data)
        }
        return nil
    }
    
    // CORRECT: Copy if storing
    func (w *Worker) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case meta.MessageTCP:
            copied := make([]byte, len(m.Data))
            copy(copied, m.Data)
            w.dataQueue = append(w.dataQueue, copied)
            bufferPool.Put(m.Data)
        }
        return nil
    }
    type DNSServer struct {
        act.Actor
        udpID gen.Alias
    }
    
    func (d *DNSServer) Init(args ...any) error {
        options := meta.UDPServerOptions{
            Host:       "0.0.0.0",
            Port:       53,
            BufferSize: 512,  // DNS messages are typically small
        }
        
        server, err := meta.CreateUDPServer(options)
        if err != nil {
            return fmt.Errorf("failed to create UDP server: %w", err)
        }
        
        udpID, err := d.SpawnMeta(server, gen.MetaOptions{})
        if err != nil {
            // Failed to spawn - close the socket
            server.Terminate(err)
            return fmt.Errorf("failed to spawn UDP server: %w", err)
        }
        
        d.udpID = udpID
        d.Log().Info("DNS server listening on %s:%d (id: %s)", 
            options.Host, options.Port, udpID)
        return nil
    }
    func (d *DNSServer) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case meta.MessageUDP:
            // Received UDP datagram
            d.Log().Info("received %d bytes from %s", len(m.Data), m.Addr)
            
            // Parse DNS query
            query, err := d.parseDNSQuery(m.Data)
            if err != nil {
                d.Log().Warning("invalid DNS query from %s: %s", m.Addr, err)
                return nil
            }
            
            // Build DNS response
            response := d.buildDNSResponse(query)
            
            // Send response back to the same address
            d.Send(d.udpID, meta.MessageUDP{
                Addr: m.Addr,
                Data: response,
            })
        }
        return nil
    }
    type GameServer struct {
        act.Actor
        udpID   gen.Alias
        players map[string]*PlayerState  // Key: remote address string
    }
    
    type PlayerState struct {
        addr       net.Addr
        lastSeen   time.Time
        position   Vector3
        health     int
    }
    
    func (g *GameServer) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case meta.MessageUDP:
            addrStr := m.Addr.String()
            
            // Get or create player state
            player, exists := g.players[addrStr]
            if !exists {
                player = &PlayerState{
                    addr:   m.Addr,
                    health: 100,
                }
                g.players[addrStr] = player
                g.Log().Info("new player: %s", addrStr)
            }
            
            // Update last seen
            player.lastSeen = time.Now()
            
            // Process game packet
            g.processGamePacket(player, m.Data)
            
        case CleanupTick:
            // Remove stale players
            now := time.Now()
            for addr, player := range g.players {
                if now.Sub(player.lastSeen) > 30*time.Second {
                    delete(g.players, addr)
                    g.Log().Info("player timeout: %s", addr)
                }
            }
        }
        return nil
    }
    options := meta.UDPServerOptions{
        Port:    8125,
        Process: "metrics_collector",
    }
    // Start worker pool
    poolPID, _ := process.SpawnRegister("udp_pool", createWorkerPool, gen.ProcessOptions{})
    
    options := meta.UDPServerOptions{
        Port:    8125,
        Process: "udp_pool",  // act.Pool is OK for UDP
    }
    bufferPool := &sync.Pool{
        New: func() any {
            return make([]byte, 1500)  // MTU size
        },
    }
    
    options := meta.UDPServerOptions{
        Port:       8125,
        BufferSize: 1500,
        BufferPool: bufferPool,
    }
    func (s *StatsServer) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case meta.MessageUDP:
            // Process datagram
            s.processMetric(m.Data)
            
            // Return buffer to pool
            bufferPool.Put(m.Data)
        }
        return nil
    }
    case meta.MessageUDP:
        // Store in queue - must copy
        copied := make([]byte, len(m.Data))
        copy(copied, m.Data)
        s.queue = append(s.queue, copied)
        
        // Return original buffer
        bufferPool.Put(m.Data)
    // DNS server - queries rarely exceed 512 bytes
    options := meta.UDPServerOptions{
        Port:       53,
        BufferSize: 512,
    }
    
    // Game server - small position updates
    options := meta.UDPServerOptions{
        Port:       9999,
        BufferSize: 128,
    }
    
    // Media streaming - large packets OK
    options := meta.UDPServerOptions{
        Port:       5004,
        BufferSize: 8192,
    }
    type ReassemblyHandler struct {
        act.Actor
        fragments map[uint32]*FragmentSet  // Key: message ID
    }
    
    type FragmentSet struct {
        fragments []*Fragment
        received  map[int]bool
        total     int
    }
    
    func (r *ReassemblyHandler) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case meta.MessageUDP:
            // Parse fragment header
            msgID, fragNum, totalFrags := r.parseFragmentHeader(m.Data)
            
            // Get or create fragment set
            set, exists := r.fragments[msgID]
            if !exists {
                set = &FragmentSet{
                    fragments: make([]*Fragment, totalFrags),
                    received:  make(map[int]bool),
                    total:     totalFrags,
                }
                r.fragments[msgID] = set
            }
            
            // Store fragment - copy the payload, because the pooled
            // buffer (m.Data) is returned to bufferPool below
            fragData := make([]byte, len(m.Data))
            copy(fragData, m.Data)
            set.fragments[fragNum] = &Fragment{data: fragData}
            set.received[fragNum] = true
            
            // Check if complete
            if len(set.received) == set.total {
                complete := r.reassemble(set.fragments)
                r.processMessage(complete)
                delete(r.fragments, msgID)
            }
            
            bufferPool.Put(m.Data)
        }
        return nil
    }
    type Player struct {
        lastSequence uint32
    }
    
    func (g *GameServer) processGamePacket(player *Player, data []byte) {
        seq := binary.BigEndian.Uint32(data[0:4])
        
        // Discard old/duplicate packets
        if seq <= player.lastSequence {
            return
        }
        
        player.lastSequence = seq
        // Process packet
    }
    type Measurement struct {
        timestamp time.Time
        value     float64
    }
    
    func (s *StatsCollector) processMeasurement(m Measurement) {
        // Store measurements in order by timestamp
        s.insertSorted(m)
    }
    serverInfo, _ := process.Call(udpID, gen.Inspect{})
    // Returns: map[string]string{
    //     "listener":  "0.0.0.0:8125",
    //     "process":   "metrics_collector",
    //     "bytes in":  "10485760",
    //     "bytes out": "1048576",
    // }
    type MetricsCollector struct {
        act.Actor
        metrics map[string]*Metric
    }
    
    func (m *MetricsCollector) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case meta.MessageUDP:
            // Parse StatsD format: "metric.name:value|type"
            name, value, metricType := m.parseStatsD(msg.Data)
            
            metric := m.metrics[name]
            if metric == nil {
                metric = &Metric{}
                m.metrics[name] = metric
            }
            
            metric.update(value, metricType)
            bufferPool.Put(msg.Data)
            
        case FlushTick:
            // Periodically flush aggregated metrics
            m.flushMetrics()
            m.metrics = make(map[string]*Metric)
        }
        return nil
    }
    type DNSClient struct {
        act.Actor
        udpID   gen.Alias
        pending map[uint16]*PendingQuery  // Key: DNS query ID
    }
    
    func (c *DNSClient) query(domain string) {
        queryID := c.nextQueryID()
        query := c.buildDNSQuery(queryID, domain)
        
        // Send query
        c.Send(c.udpID, meta.MessageUDP{
            Addr: c.dnsServerAddr,
            Data: query,
        })
        
        // Store pending query with timeout
        c.pending[queryID] = &PendingQuery{
            domain:  domain,
            sent:    time.Now(),
            timeout: c.SendAfter(c.PID(), QueryTimeout{queryID}, 5*time.Second),
        }
    }
    
    func (c *DNSClient) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case meta.MessageUDP:
            // Parse DNS response
            queryID := binary.BigEndian.Uint16(m.Data[0:2])
            pending := c.pending[queryID]
            if pending != nil {
                pending.timeout()  // Cancel timeout
                c.handleResponse(m.Data)
                delete(c.pending, queryID)
            }
            bufferPool.Put(m.Data)
            
        case QueryTimeout:
            // Query timed out - maybe retry
            pending := c.pending[m.queryID]
            if pending != nil {
                c.Log().Warning("DNS query timeout: %s", pending.domain)
                delete(c.pending, m.queryID)
            }
        }
        return nil
    }
    type DiscoveryServer struct {
        act.Actor
        udpID gen.Alias
    }
    
    func (d *DiscoveryServer) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case meta.MessageUDP:
            if string(m.Data) == "DISCOVER" {
                response := d.buildDiscoveryResponse()
                
                // Reply to sender
                d.Send(d.udpID, meta.MessageUDP{
                    Addr: m.Addr,
                    Data: response,
                })
            }
            bufferPool.Put(m.Data)
        }
        return nil
    }
    // WRONG: Buffer leaked
    func (s *Server) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case meta.MessageUDP:
            // Store in queue without copying
            s.queue = append(s.queue, m.Data)  // Buffer still referenced!
        }
        return nil
    }
    
    // CORRECT: Copy before storing
    func (s *Server) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case meta.MessageUDP:
            copied := make([]byte, len(m.Data))
            copy(copied, m.Data)
            s.queue = append(s.queue, copied)
            bufferPool.Put(m.Data)  // Return original
        }
        return nil
    }
    // WRONG: Assumes all datagrams arrive
    func (c *Client) sendTransaction(tx Transaction) {
        // Send 10 chunks
        for i := 0; i < 10; i++ {
            chunk := tx.getChunk(i)
            c.Send(c.udpID, meta.MessageUDP{
                Addr: c.serverAddr,
                Data: chunk,
            })
        }
        // Server will process when all 10 arrive... right? WRONG!
    }
    // WRONG: Likely to be fragmented or lost
    data := make([]byte, 8000)  // 8KB datagram
    c.Send(c.udpID, meta.MessageUDP{
        Addr: serverAddr,
        Data: data,
    })
    // WRONG: Processes duplicate commands
    func (g *GameServer) handleCommand(player *Player, cmd Command) {
        switch cmd.Type {
        case CmdFireWeapon:
            player.ammo--  // Duplicate datagram = fire twice!
            g.spawnProjectile(player)
        }
    }
    
    // CORRECT: Idempotent with sequence tracking
    func (g *GameServer) handleCommand(player *Player, cmd Command) {
        if cmd.Sequence <= player.lastSequence {
            return  // Duplicate or old command
        }
        player.lastSequence = cmd.Sequence
        
        switch cmd.Type {
        case CmdFireWeapon:
            player.ammo--
            g.spawnProjectile(player)
        }
    }

    Message Versioning

    Evolving message contracts in distributed clusters

    Distributed systems evolve. Services gain features, data models change, and deployments happen gradually. During a rolling upgrade, some nodes run new code while others still run the old version. A message sent from a new node must be understood by an old node, and vice versa.

    EDF serializes messages by their exact Go type. Change a struct - and you have a new, incompatible type. This is intentional: explicit versioning catches breaking changes at compile time rather than hiding them until production.

    This article explains how to version messages so your cluster handles upgrades gracefully.

    hashtag
    Explicit Versioning


    Unlike Protobuf or Avro, EDF does not provide automatic backward compatibility. There are no optional fields, no field numbers, no schema evolution. A struct is its type. Change the struct - create a new type.

    The approach is straightforward: create a new type for each version.

    Both types coexist in the codebase. The receiver handles whichever version arrives:
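A minimal sketch of the pattern - the type and field names here are illustrative, not framework APIs:

```go
package main

import "fmt"

// V1: the original contract.
type MessageOrderV1 struct {
	OrderID string
	Amount  float64
}

// V2 adds Currency. It is a distinct type, not a mutation of V1.
type MessageOrderV2 struct {
	OrderID  string
	Amount   float64
	Currency string
}

// The receiver accepts both versions for as long as V1 senders exist.
func handleOrder(message any) (string, error) {
	switch m := message.(type) {
	case MessageOrderV1:
		// Old senders: apply a documented default for the field V1 lacks.
		return fmt.Sprintf("order %s: %.2f USD (v1 default)", m.OrderID, m.Amount), nil
	case MessageOrderV2:
		return fmt.Sprintf("order %s: %.2f %s", m.OrderID, m.Amount, m.Currency), nil
	default:
		return "", fmt.Errorf("unknown message type %T", message)
	}
}

func main() {
	out, _ := handleOrder(MessageOrderV1{OrderID: "a1", Amount: 10})
	fmt.Println(out)
	out, _ = handleOrder(MessageOrderV2{OrderID: "a2", Amount: 10, Currency: "EUR"})
	fmt.Println(out)
}
```

In an actor this switch lives in HandleMessage, exactly like the TCP/UDP handlers shown earlier.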

    All message types must be registered with EDF before connection establishment:
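A fragment, assuming illustrative MessageOrderV1/V2 types; edf.RegisterTypeOf is the registration helper described in Network Transparency (its import path may differ across framework versions):

```go
// Register before node startup - typically in an init() of the
// package that defines the types.
if err := edf.RegisterTypeOf(MessageOrderV1{}); err != nil {
	panic(err)
}
if err := edf.RegisterTypeOf(MessageOrderV2{}); err != nil {
	panic(err)
}
```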

    For details on EDF and type registration, see Network Transparency.

    hashtag
    Versioning Strategies

    There are two ways to organize versioned types: version in the type name or version in the package path. Both work with EDF. Choose based on your team's preferences.

Important: Do not confuse package path versioning with Go modules v2+. Go modules v2+ requires changing both go.mod and all import paths when bumping major version (company.com/events/v2). This forces all consumers to update imports simultaneously, creates diamond dependency problems, and generally causes more pain than it solves. Keep your module below v2.0.0 to avoid triggering this mechanism.

    hashtag
    Version in Type Name

    All versions live in the same package:

    Handler uses type names directly:
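A sketch with illustrative file and type names:

```go
// events/orders.go - every version of the contract lives in one package
package events

type MessageOrderV1 struct {
	OrderID string
	Amount  float64
}

type MessageOrderV2 struct {
	OrderID  string
	Amount   float64
	Currency string
}
```

The receiver then switches on events.MessageOrderV1 and events.MessageOrderV2 directly, with no import aliasing needed.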

    Advantages:

    • Single import for all versions

    • All versions visible in one place - evolution is clear

    • One registration file for all types

    • Simpler directory structure

    hashtag
    Version in Package Path

    Each version is a separate package:

    Handler uses package aliases:
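A sketch using the company.com/messaging module paths from the next section; type and package names are illustrative:

```go
// company.com/messaging/v1/events/orders.go
//     package events
//     type MessageOrder struct { OrderID string; Amount float64 }
//
// company.com/messaging/v2/events/orders.go
//     package events
//     type MessageOrder struct { OrderID string; Amount float64; Currency string }

import (
	eventsv1 "company.com/messaging/v1/events"
	eventsv2 "company.com/messaging/v2/events"
)

func handle(message any) {
	switch m := message.(type) {
	case eventsv1.MessageOrder:
		_ = m // legacy contract
	case eventsv2.MessageOrder:
		_ = m // current contract
	}
}
```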

    Advantages:

    • Clean type names without version suffix

    • Familiar to Protobuf users

    • Clear directory separation between versions

    • Removing a version means deleting a directory

    hashtag
    Module Organization

    For projects where message versions evolve in parallel, place go.mod in each domain directory:
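One possible layout, using the module paths discussed below:

```
messaging/
├── v1/
│   └── events/
│       ├── go.mod      module company.com/messaging/v1/events
│       └── orders.go
└── v2/
    └── events/
        ├── go.mod      module company.com/messaging/v2/events
        └── orders.go
```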

    The /v1/ and /v2/ segments are in the middle of the module path, not at the end. Go only applies v2+ import path requirements when /vN is the final path element, so company.com/messaging/v1/events is safe.

    This structure allows:

    • V1 to continue receiving new message types while V2 is developed

    • Each domain to have isolated dependencies

    • Clean removal - deleting a directory removes the module entirely

    Tagging submodules: Git tags for nested modules must include the path prefix. For module company.com/messaging/v1/events located at v1/events/, use tag v1/events/v0.1.0, not just v0.1.0.

    hashtag
    Which to Choose

    This documentation uses version in type name for examples. The approach keeps related versions together and requires less import management. However, version in path is equally valid if your team prefers cleaner type names.

    Whichever you choose, stay consistent across the codebase.

    The versioning mechanism is clear. The next question: where should these types live, and who controls their evolution?

    hashtag
    Message Scopes

    The answer depends on how the message is used. Not all messages are equal - some travel between two specific services, others broadcast across the entire cluster.

    hashtag
    Private Messages

    Direct communication between specific services. Request/response patterns between known parties.

    Owner: receiver

    Payment Service defines what it accepts. Order Service adapts to Payment's contract.

    hashtag
    Cluster-Wide Events

    Domain events published to multiple subscribers. Any service can subscribe.

    Owner: shared repository

    Events represent domain facts, not service-specific contracts. Ownership belongs to a shared module that all services import.

    For event publishing patterns, see Events.

    hashtag
    Ownership Rules

    Scope determines ownership. Who decides when to create V2? Who approves changes?

    Scope
    Owner
    Module
    Changes approved by

    Private messages

    Receiver

    receiver-api/

    Receiver team

    Cluster-wide events

    Shared

    shared events module

    All consumer teams

    The receiver owns private contracts because it implements the logic. Multiple senders may use the same contract, but they all adapt to what the receiver accepts. This follows the Consumer-Driven Contracts pattern. Events are shared because they represent domain facts, not service-specific APIs.

    hashtag
    Private Contract Ownership

    Payment Service owns its API contract:

    Order Service imports and uses it:
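A sketch - the contract module lives with the receiver (module and type names are illustrative):

```go
// payment-api/messages.go - owned by the Payment team
package paymentapi

type ChargeRequestV1 struct {
	OrderID string
	Amount  float64
}

type ChargeResponseV1 struct {
	TransactionID string
	Success       bool
}
```

Order Service imports paymentapi and sends ChargeRequestV1 via Call. When the Payment team publishes V2, the Order team migrates during the coexistence window.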

    Payment team decides when to create V2. Order team adapts.

    hashtag
    Cluster Event Ownership

    Events require broader coordination:

    Breaking changes require sign-off from all consumers.

    hashtag
    Repository Organization

    With ownership defined, the repository structure follows naturally. Private contracts live with their receivers. Cluster-wide events live in a shared module.

    hashtag
    Version in Type Name

    hashtag
    Version in Package Path

    hashtag
    Registration Helper

    All message types must be registered with EDF before connection establishment - during handshake, nodes exchange their registered type lists which become the encoding dictionaries. Registration typically happens in init() functions before node startup. There are two approaches: centralized registration in the shared module or manual registration in each client.

    Centralized registration uses init() to register all types when the package is imported:

    When clients import the package to use message types, init() runs automatically at program startup and registers all types:
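A sketch of centralized registration, assuming illustrative MessageOrderV1/V2 types and the edf.RegisterTypeOf helper described in Network Transparency (the exact import path may differ in your framework version):

```go
package events

import (
	"fmt"

	"ergo.services/ergo/net/edf"
)

// init runs when any client imports this package, so every node that
// uses these types registers all of them before startup.
func init() {
	for _, t := range []any{
		MessageOrderV1{},
		MessageOrderV2{},
	} {
		if err := edf.RegisterTypeOf(t); err != nil {
			panic(fmt.Sprintf("EDF registration of %T failed: %s", t, err))
		}
	}
}
```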

    No risk of forgetting a type.

    Manual registration means each client registers only the types it uses. This gives more control but introduces risk: a missing registration is only detected at runtime - "no encoder for type" when sending, "unknown reg type for decoding" when receiving. For most projects, centralized registration is simpler and safer. Choose based on your needs.

    For message isolation patterns within a single codebase, see Project Structure.

    hashtag
    Compatibility Rules

    EDF enforces strict type identity. Any struct change breaks wire compatibility.

    Change
    Compatible
    Action

    Add field

    No

    Create new version

    Remove field

    No

    Create new version

    This differs from Protobuf/Avro where adding optional fields is compatible. In EDF, every change requires explicit versioning.

    Yes, this means more work upfront. But consider the alternative: Protobuf lets you add an optional Priority field, and everything "just works" - until you spend three days debugging why orders aren't prioritized correctly. Turns out half your cluster sends the new field, half ignores it, and the receivers silently default missing values to zero. Good luck finding that in logs.

    EDF makes this impossible. The receiver either handles OrderV2 with its Priority field, or it doesn't - and you know this at compile time, not at 3 AM when on-call.

    hashtag
    Version Lifecycle

    With compatibility rules clear, how do versions evolve over time?

    hashtag
    When to Create New Version

    Any change from the compatibility table above requires a new version. Additionally, create a new version when changing field semantics (same type, different meaning).

    hashtag
    Deprecation

    Mark deprecated versions:

    Log when receiving deprecated versions:
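For example, using the standard Go "Deprecated:" doc-comment convention (type names are illustrative):

```go
// Deprecated: use MessageOrderV2.
type MessageOrderV1 struct {
	OrderID string
	Amount  float64
}
```

In the receiver, make deprecated traffic visible so monitoring can confirm when it reaches zero:

```go
case MessageOrderV1:
	w.Log().Warning("deprecated MessageOrderV1 from %s - sender needs upgrade", from)
	// continue handling V1 until the deprecation period ends
```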

    hashtag
    Removal

    Remove only when:

    1. All senders upgraded to V2

    2. Monitoring confirms zero V1 traffic

    3. Deprecation period passed

    Remove in order:

    1. Stop accepting (return error for V1)

    2. Remove from registration

    3. Delete type definition

    hashtag
    Rolling Upgrades

    Back to the scenario from the introduction: you're deploying a new version, nodes restart one by one, and for some time the cluster runs mixed code versions. How do you handle this?

    hashtag
    Upgrade Strategy

    1. Deploy V2 types to shared module

    2. Update receivers to handle V1 and V2

    3. Rolling restart receiver nodes

    4. Update senders to send V2

    5. Rolling restart sender nodes

    6. Deprecate V1 after all nodes upgraded

    7. Remove V1 after deprecation period

    hashtag
    Coexistence Period

    Receivers must support both versions during the upgrade window.

    For deployment patterns with weighted routing, see Building a Cluster.

    hashtag
    Anti-Corruption Layer

    Supporting multiple versions means your handler has multiple code paths. As versions accumulate, this becomes messy. The Anti-Corruption Layer pattern isolates version translation:

    Use in handler:
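A minimal sketch of the pattern with illustrative types - the translation function sits at the boundary, so business logic only ever sees V2:

```go
package main

import "fmt"

type MessageOrderV1 struct {
	OrderID string
	Amount  float64
}

type MessageOrderV2 struct {
	OrderID  string
	Amount   float64
	Currency string
}

// The ACL: every legacy version converts to the current one at the edge.
func convertOrderV1(m MessageOrderV1) MessageOrderV2 {
	return MessageOrderV2{
		OrderID:  m.OrderID,
		Amount:   m.Amount,
		Currency: "USD", // documented default for V1 senders
	}
}

func handleOrder(message any) error {
	var order MessageOrderV2
	switch m := message.(type) {
	case MessageOrderV1:
		order = convertOrderV1(m) // translate at the boundary
	case MessageOrderV2:
		order = m
	default:
		return fmt.Errorf("unsupported message %T", message)
	}
	return processOrder(order) // single V2-only implementation
}

func processOrder(o MessageOrderV2) error {
	fmt.Printf("processing %s: %.2f %s\n", o.OrderID, o.Amount, o.Currency)
	return nil
}

func main() {
	handleOrder(MessageOrderV1{OrderID: "a1", Amount: 5})
}
```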

    Single implementation handles V2. ACL converts V1 to V2. When V1 is removed, delete the ACL function - no changes to business logic needed.

    hashtag
    Contract Testing

    With version handling and ACL in place, how do you verify it actually works? Contract tests verify compatibility:

    Test ACL conversion:
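A sketch of a contract check for the ACL conversion, with illustrative types; in practice this would live in a _test.go file and run in CI:

```go
package main

import "fmt"

type MessageOrderV1 struct {
	OrderID string
	Amount  float64
}

type MessageOrderV2 struct {
	OrderID  string
	Amount   float64
	Currency string
}

func convertOrderV1(m MessageOrderV1) MessageOrderV2 {
	return MessageOrderV2{OrderID: m.OrderID, Amount: m.Amount, Currency: "USD"}
}

// checkV1Contract verifies that every V1 field survives conversion and
// that defaults for the new fields are the documented ones.
func checkV1Contract() error {
	in := MessageOrderV1{OrderID: "ord-1", Amount: 9.99}
	out := convertOrderV1(in)
	if out.OrderID != in.OrderID || out.Amount != in.Amount {
		return fmt.Errorf("V1 fields lost in conversion: %+v", out)
	}
	if out.Currency != "USD" {
		return fmt.Errorf("unexpected default currency %q", out.Currency)
	}
	return nil
}

func main() {
	if err := checkV1Contract(); err != nil {
		panic(err)
	}
	fmt.Println("V1 -> V2 contract holds")
}
```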

    Run contract tests in CI before merging changes to shared modules.

    For actor testing patterns, see Unit Testing.

    hashtag
    Naming Conventions

    Consistent naming makes code self-documenting. When you see a type name, you should immediately know: is this async or sync? Is it a request or event? What version?

    hashtag
    Async Messages

    Prefix with Message, suffix with version:

    The prefix signals fire-and-forget semantics. When reading code, MessageXXX means no response is expected. If someone writes Call(pid, MessageOrderShippedV1{}), the mismatch is immediately visible.
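For example (illustrative declarations):

```go
// Async: fire-and-forget, no response expected
type MessageOrderShippedV1 struct {
	OrderID string
}

type MessageInventoryReservedV2 struct {
	OrderID string
	SKU     string
}
```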

    hashtag
    Sync Messages

    Use Request/Response suffix:

    Paired naming makes contracts explicit. ChargeRequest implies ChargeResponse exists. The caller knows to expect a result.
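For example (illustrative declarations):

```go
// Sync: paired request/response used with Call
type ChargeRequestV1 struct {
	OrderID string
	Amount  float64
}

type ChargeResponseV1 struct {
	TransactionID string
	Success       bool
}
```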

    hashtag
    Events

    Domain events use past tense without prefix:

    Events describe facts that already happened, not requests for action. Past tense (Created, Received) distinguishes them from commands (Create, Charge).
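For example (illustrative declarations):

```go
// Events: domain facts, named in past tense
type OrderCreatedV1 struct {
	OrderID   string
	CreatedAt int64
}

type PaymentReceivedV1 struct {
	OrderID string
	Amount  float64
}
```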

    hashtag
    Version Suffix

    If using version in type name strategy, always suffix with version number:

    If using version in path strategy, the package path carries the version and type names stay clean.

    hashtag
    Common Mistakes

    These patterns emerge repeatedly in production systems. Avoid them:

    Changing existing type instead of creating new version
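A sketch of the mistake, using illustrative types:

```go
// WRONG: mutating a registered type breaks wire compatibility
type MessageOrder struct {
	OrderID  string
	Amount   float64
	Priority int // added to the existing struct - old nodes cannot decode it
}

// CORRECT: new version, old type untouched
type MessageOrderV2 struct {
	OrderID  string
	Amount   float64
	Priority int
}
```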

    Forgetting to register new types

    Long coexistence periods

    Supporting V1 for months creates maintenance burden. Set clear deprecation deadlines and enforce them.

    Registering after connection established

    Types must be registered before node starts. Dynamic registration requires connection cycling.

    hashtag
    Summary

    Message versioning in EDF is explicit by design. No hidden compatibility rules, no runtime surprises.

    Aspect
    Private Messages
    Cluster Events

    Nature

    Service API contract

    Domain fact

    Owner

    Receiver (implements logic)

    Shared (belongs to domain)

    Key principles:

    • Version in type name or package path, never in Go module path

    • Receiver owns private contracts

    • Shared repository for domain events

    • Test version compatibility

    • Set deprecation deadlines

    • Use ACL to isolate version translation

    Leader

    Distributed leader election for coordinating work across a cluster

    Distributed systems often require coordination - ensuring only one node writes to prevent conflicts, scheduling tasks exactly once, or managing exclusive access to shared resources. This coordination demands selecting one node as the leader while others follow. The leader actor implements this election mechanism, handling failures, network issues, and dynamic cluster changes automatically.

    When you embed leader.Actor in your process, it participates in distributed leader election with other instances across your Ergo cluster. The framework manages the election protocol - tracking terms, exchanging votes, broadcasting heartbeats. Your code focuses on what matters: what to do when elected leader, and how to behave as a follower.

    hashtag
    The Coordination Problem

    Consider a typical scenario: you have a multi-replica service that needs to perform periodic cleanup. If every replica runs cleanup independently, you waste resources and might corrupt data through concurrent modifications. You need exactly one replica to run cleanup while others stand ready to take over if it fails.

    Traditional solutions involve external systems - ZooKeeper, etcd, or distributed locks in databases. These work, but add operational complexity. You need to deploy and maintain additional infrastructure. Your application depends on external services being available, correctly configured, and network-accessible. Each external dependency is another potential failure point.

    The leader actor embeds coordination directly into your Ergo cluster. No external dependencies. Election happens through actor message passing using the same network protocols your application already uses. If your Ergo nodes can communicate with each other, they can elect a leader.

    hashtag
    How Election Works

    The election protocol follows Raft consensus principles, adapted for actor message passing. Understanding the mechanism requires knowing about three concepts: states, terms, and quorum.

    hashtag
    States and Transitions

    Every process starts as a follower. This is the initial state - passive, waiting to hear from a leader. If no heartbeats arrive within the election timeout, the follower transitions to candidate and starts an election. If the candidate receives enough votes, it becomes leader. If it discovers another leader or loses the election, it reverts to follower.

    The transitions are deliberate. Followers conserve resources by remaining passive. Only when leadership is needed (timeout occurs) does a node become active by candidacy. Leadership is earned through votes, not asserted unilaterally.

    hashtag
    Terms and Logical Time

    Elections happen in numbered terms. Terms increment monotonically - term 1, term 2, term 3, and so on. Each term has at most one leader. When a candidate starts an election, it increments the term. When nodes communicate, they include their current term. If a node sees a higher term, it updates immediately and acknowledges the new term.

    Terms solve a subtle problem: distinguishing stale information from current state. Without terms, a network partition could cause confusion - is this heartbeat from the current leader, or from a partitioned node that thinks it's still leader? Terms provide a logical clock that orders events without requiring synchronized system clocks.

    This mechanism ensures that newer elections always supersede older ones, regardless of network delays or partitions.

    hashtag
    Quorum and Split-Brain Prevention

    To become leader, a candidate needs votes from a majority of nodes. In a three-node cluster, that's two votes (including voting for itself). In a five-node cluster, three votes. The majority requirement prevents split-brain - a dangerous scenario where multiple nodes believe they're leader simultaneously.

    Consider a network partition splitting five nodes into groups of 3 and 2:

    Only the majority side can elect a leader. The minority side remains leaderless, preventing conflicting leadership. When the partition heals, the minority nodes recognize the higher term from the majority side's leader and follow it.

    hashtag
    Election Sequence

    Here's what happens when a cluster starts:

    Election timeouts are randomized, so typically one node times out first and wins the election before others start their own campaigns. This reduces the chance of split votes.

    hashtag
    Leader Maintenance

    Once elected, the leader sends periodic heartbeats to all followers:

    Heartbeats serve two purposes: they suppress elections on followers (by resetting their timeouts), and they act as a liveness signal. If heartbeats stop, followers know the leader has failed and trigger a new election.

    hashtag
    Peer Discovery

    Nodes discover each other dynamically. You provide bootstrap addresses - a list of known peers to contact initially. When a node sends or receives election messages, it monitors the sender. Over time, all nodes discover all peers, even if they didn't initially know about each other.

    Discovery is automatic. You can provide a bootstrap list for faster initial synchronization, or start with an empty list and add peers dynamically using the Join() method. Bootstrap accelerates cluster formation but isn't required - nodes discover each other through any election message exchange.

    hashtag
    Using the Leader Actor

    To create a leader-electing process, embed leader.Actor in your struct and implement the leader.ActorBehavior interface:

    Spawn it like any actor, passing cluster configuration:

    When you spawn identical processes on three nodes with the same ClusterID and Bootstrap, they form a cluster. Within milliseconds, one becomes leader and starts processing tasks. The others stand by as followers.
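A sketch of the pattern described above. The import path, Init signature, and Options field names are assumptions reconstructed from this section - check them against your framework version:

```go
package main

import (
	"ergo.services/ergo/gen"
	"ergo.services/extra/leader" // illustrative import path
)

// Coordinator runs cleanup on exactly one replica.
type Coordinator struct {
	leader.Actor
}

// Init returns the election configuration.
func (c *Coordinator) Init(args ...any) (leader.Options, error) {
	return leader.Options{
		ClusterID: "cleanup",
		// Bootstrap: initial peers, including this process itself
	}, nil
}

func (c *Coordinator) HandleBecomeLeader() error {
	c.Log().Info("elected leader - starting exclusive cleanup")
	// start exclusive work here; returning an error rejects leadership
	return nil
}

func (c *Coordinator) HandleBecomeFollower(l gen.PID) error {
	if l == (gen.PID{}) {
		c.Log().Info("no leader elected yet")
		return nil
	}
	c.Log().Info("following leader %s - exclusive work stopped", l)
	return nil
}

// Spawn on each node like any actor:
//   node.Spawn(func() gen.ProcessBehavior { return &Coordinator{} }, gen.ProcessOptions{})
```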

    hashtag
    The ActorBehavior Interface

    The interface extends gen.ProcessBehavior with leader-specific callbacks:
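A sketch of the interface reconstructed from the callbacks described in this section; exact signatures (especially parameter types of the optional callbacks) may differ in your framework version:

```go
type ActorBehavior interface {
	gen.ProcessBehavior

	// Mandatory
	Init(args ...any) (Options, error)
	HandleBecomeLeader() error
	HandleBecomeFollower(leader gen.PID) error

	// Optional - leader.Actor provides default implementations
	HandlePeerJoined(peer gen.PID) error
	HandlePeerLeft(peer gen.PID) error
	HandleTermChanged(term uint64) error

	// Standard actor callbacks
	HandleMessage(from gen.PID, message any) error
	HandleCall(from gen.PID, ref gen.Ref, request any) (any, error)
	Terminate(reason error)
	HandleInspect(from gen.PID, item ...string) map[string]string
}
```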

    hashtag
    Mandatory Callbacks

    Init returns election configuration. The Options specify ClusterID (identifying which cluster this process belongs to), Bootstrap (initial peers to contact), and optional timing parameters for election and heartbeat intervals.

    HandleBecomeLeader is called when this process becomes leader. Start exclusive work here - processing task queues, scheduling cron jobs, claiming resources. Return an error to reject leadership and trigger a new election.

    HandleBecomeFollower is called when this process follows a leader. The leader parameter identifies the leader's PID. If leader is empty (gen.PID{}), it means no leader is currently elected. Stop exclusive work here. Followers should redirect requests to the leader or buffer them until leadership is established.

    hashtag
    Optional Callbacks

    HandlePeerJoined notifies when a new peer joins the cluster. Use this to track cluster size for capacity planning, or to send initialization messages to newcomers.

    HandlePeerLeft notifies when a peer crashes or disconnects. Use this to detect cluster degradation or to clean up peer-specific state.

    HandleTermChanged notifies when the election term increases. This is useful for distributed log replication or versioned command processing - the term can serve as a logical timestamp for ordering operations.

    The other callbacks (HandleMessage, HandleCall, Terminate, HandleInspect) work as they do in regular actors. leader.Actor provides default implementations that log warnings, so you only override what you need.

    hashtag
    Error Handling

    If any callback returns an error, the actor terminates. This includes leadership callbacks - returning an error from HandleBecomeLeader causes the process to reject leadership, step down, and terminate. This is intentional: if initialization of leader responsibilities fails (can't open files, can't connect to database, etc.), it's better to terminate and let a supervisor restart with clean state than to limp along as a broken leader.

    hashtag
    Configuration Options

    The Options struct controls election behavior:

    ClusterID must match across all processes in the same election cluster. Processes with different cluster IDs ignore each other, allowing multiple independent elections in the same Ergo cluster.

    Bootstrap lists the initial peers to contact on startup. Can be empty - in this case, use the Join() method to add peers dynamically. When provided, each process should include itself in the list. At startup, processes send vote requests to bootstrap peers even if they haven't discovered them yet. This accelerates initial election and cluster formation.

    ElectionTimeoutMin and ElectionTimeoutMax define the randomization range for election timeouts. Actual timeouts are randomly chosen from this range to reduce the chance of simultaneous elections. Defaults (150-300ms) work well for local networks.

    HeartbeatInterval controls how often leaders send heartbeats. Must be significantly smaller than ElectionTimeoutMin - typically at least 3x smaller. The default (50ms) provides a 3x safety margin against the default election timeout.

    hashtag
    Tuning for Network Conditions

    For local clusters (single datacenter, low latency):

    For geographically distributed clusters (high latency, possible packet loss):

The tradeoff: longer timeouts increase failover time but reduce false elections during network hiccups. Shorter timeouts provide fast failover but risk spurious elections when the network slows down.

    hashtag
    API Methods

    The embedded leader.Actor provides methods for querying state and communicating with peers:

    hashtag
    State Queries

    IsLeader() bool - Returns true if this process is currently the leader.

    Leader() gen.PID - Returns the current leader's PID, or empty if no leader elected yet.

    Term() uint64 - Returns the current election term.

    ClusterID() string - Returns the cluster identifier.

    hashtag
    Peer Information

    Peers() []gen.PID - Returns a snapshot of discovered peers. The slice is a copy, so you can iterate safely.

    PeerCount() int - Returns the number of known peers.

    HasPeer(pid gen.PID) bool - Checks if a specific PID is a known peer.

    Bootstrap() []gen.ProcessID - Returns the bootstrap peer list.

    hashtag
    Communication

    Broadcast(message any) - Sends a message to all discovered peers. Useful for disseminating information or coordinating state across the cluster.

    BroadcastBootstrap(message any) - Sends a message to all bootstrap peers (excluding self). Useful for announcements before peer discovery completes.

    Join(peer gen.ProcessID) - Manually adds a peer to the cluster by sending it a vote request. Use this for dynamic cluster growth when new nodes join after initial bootstrap.

    hashtag
    Example: Leader-Only Processing

    hashtag
    Example: Broadcasting State Updates

    hashtag
    Common Patterns

    hashtag
    Single Writer Coordination

    Only the leader writes to external storage:

    hashtag
    Task Scheduling

    Only the leader schedules periodic tasks:

    hashtag
    Forwarding to Leader

    Followers forward writes to the leader:

    hashtag
    Dynamic Cluster Membership

    You can start a node with an empty bootstrap list and add peers dynamically:

    hashtag
    Network Partitions and Split-Brain

    Network partitions are inevitable in distributed systems. The election algorithm handles them safely through the quorum requirement.

    hashtag
    Partition Scenario

    Consider a five-node cluster that splits into groups of 3 and 2:

    Group A (majority side) - Node1 remains leader because it can send heartbeats to Node2 and Node3, which acknowledge them. The majority side continues operating normally.

    Group B (minority side) - Node4 and Node5 don't receive heartbeats from Node1. They trigger elections, but neither can get 3 votes (only 2 nodes total in their partition). They remain leaderless and reject write requests.

    This asymmetry is intentional. Only one side can have a leader, preventing split-brain writes that would corrupt data.

    hashtag
    Partition Healing

    When the network partition heals:

The minority nodes recognize the majority leader's heartbeats and rejoin the cluster. If terms diverged during the minority side's failed election attempts, the mismatch is detected as soon as messages flow again, and all nodes converge on a single term and leader.

    hashtag
    Integration with Applications

    For real applications, the leader actor is a building block for distributed systems patterns:

    hashtag
    Distributed Key-Value Store

    Extend the leader actor with log replication for a linearizable KV store:

    hashtag
    Distributed Lock Service

    Implement distributed locks where the leader grants leases:

    hashtag
    Limitations and Trade-offs

    The leader election actor solves coordination, but it's not a complete distributed database. Understanding what it doesn't provide is as important as knowing what it does.

    No automatic log replication - The actor handles leader election but doesn't replicate application state. If you need replicated state machines, you must implement log replication yourself on top of the election foundation.

    No persistence - Election state exists only in memory. If all nodes restart simultaneously, the cluster performs a fresh election. For state that must survive restarts, use external storage or implement persistence in your application.

    Cluster membership is dynamic discovery, not consensus - Nodes discover peers through message exchange, not through a formal membership protocol. This is sufficient for most use cases but isn't suitable for scenarios requiring precise, consensus-based membership changes.

    Leader election is not instantly consistent - During network partitions or failures, there may be brief periods with no leader, or where nodes have inconsistent views of leadership. This is fundamental to distributed consensus and cannot be avoided.

    The actor provides the foundation - stable leader election with safety guarantees. Building complete distributed systems (databases, coordination services) requires additional mechanisms built on this foundation.

    hashtag
    Observability

    The leader actor integrates with Ergo's inspection system:

    Monitor leadership changes in your logging:

    Track cluster health by monitoring peer counts and leadership stability over time.

    Port

    Actors communicate through message passing within the framework. But what if you need to integrate with an external program written in Python, C, or any other language? You could spawn goroutines to manage stdin/stdout, handle protocol framing, deal with buffer management - but this breaks the actor model and spreads I/O complexity throughout your code.

    Port meta-process solves this by wrapping external programs as actors. The external program runs as a child process. You send messages to the Port, and it writes them to the program's stdin. The Port reads from stdout and sends you messages. From your actor's perspective, you're just exchanging messages with another actor - the external program's details are abstracted away.

    This enables clean integration with legacy systems, specialized libraries in other languages, or any tool that uses stdin/stdout for communication. The actor model stays intact while bridging to external processes.


    // Version 1
    type OrderCreatedV1 struct {
        OrderID int64
    }
    
    // Version 2 - new field
    type OrderCreatedV2 struct {
        OrderID  int64
        Priority int
    }
    func (a *Actor) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case OrderCreatedV1:
            return a.handleOrderV1(m)
        case OrderCreatedV2:
            return a.handleOrderV2(m)
        }
        return nil
    }
    func init() {
        types := []any{
            OrderCreatedV1{},
            OrderCreatedV2{},
        }
        for _, t := range types {
            if err := edf.RegisterTypeOf(t); err != nil && err != gen.ErrTaken {
                panic(err)
            }
        }
    }
    events/
    ├── order_created_v1.go
    ├── order_created_v2.go
    └── register.go
    import "company.com/events"
    
    events.OrderCreatedV1{}
    events.OrderCreatedV2{}
    switch m := message.(type) {
    case events.OrderCreatedV1:
        // ...
    case events.OrderCreatedV2:
        // ...
    }
    messaging/
    ├── v1/
    │   └── events/
    │       └── order_created.go
    └── v2/
        └── events/
            └── order_created.go
    import eventsv1 "company.com/messaging/v1/events"
    import eventsv2 "company.com/messaging/v2/events"
    
    eventsv1.OrderCreated{}
    eventsv2.OrderCreated{}
    switch m := message.(type) {
    case eventsv1.OrderCreated:
        // ...
    case eventsv2.OrderCreated:
        // ...
    }
    messaging/
    ├── v1/
    │   ├── events/
    │   │   ├── go.mod              # module company.com/messaging/v1/events
    │   │   └── order_created.go
    │   └── payment/
    │       ├── go.mod              # module company.com/messaging/v1/payment
    │       └── charge.go
    └── v2/
        ├── events/
        │   ├── go.mod              # module company.com/messaging/v2/events
        │   └── order_created.go
        └── payment/
            ├── go.mod              # module company.com/messaging/v2/payment
            └── charge.go
    // payment-api/charge_v1.go
    package paymentapi
    
    type ChargeRequestV1 struct {
        OrderID int64
        Amount  int64
    }
    
    type ChargeResponseV1 struct {
        TransactionID string
        Status        string
    }
    import paymentapi "company.com/payment-api"
    
    response, err := a.Call(paymentPID, paymentapi.ChargeRequestV1{
        OrderID: order.ID,
        Amount:  order.Total,
    })
    events/
    ├── OWNERS.md           # who approves changes
    ├── CHANGELOG.md        # version history
    └── order/
        ├── created_v1.go
        └── created_v2.go
    # OWNERS.md
    
    Maintainers (approve all changes):
    - platform-team
    
    Reviewers (approve breaking changes):
    - order-team
    - payment-team
    - analytics-team
    company.com/
    │
    ├── events/                     # cluster-wide events
    │   ├── go.mod                  # module company.com/events
    │   ├── order_created_v1.go
    │   ├── order_created_v2.go
    │   ├── payment_received_v1.go
    │   └── register.go
    │
    ├── payment-api/                # Payment Service contract
    │   ├── go.mod                  # module company.com/payment-api
    │   ├── charge_v1.go
    │   └── refund_v1.go
    │
    ├── order-service/
    │   ├── go.mod                  # requires: events, payment-api
    │   ├── internal/
    │   └── cmd/
    │
    └── payment-service/
        ├── go.mod                  # requires: events
        ├── internal/
        └── cmd/
    company.com/
    │
    ├── messaging/                  # cluster-wide events and contracts
    │   ├── v1/
    │   │   ├── events/
    │   │   │   ├── go.mod          # module company.com/messaging/v1/events
    │   │   │   ├── order_created.go
    │   │   │   └── payment_received.go
    │   │   └── payment/
    │   │       ├── go.mod          # module company.com/messaging/v1/payment
    │   │       └── charge.go
    │   └── v2/
    │       ├── events/
    │       │   ├── go.mod          # module company.com/messaging/v2/events
    │       │   └── order_created.go
    │       └── payment/
    │           ├── go.mod          # module company.com/messaging/v2/payment
    │           └── charge.go
    │
    ├── order-service/
    │   ├── go.mod                  # requires: messaging/v1/events, messaging/v1/payment
    │   ├── internal/
    │   └── cmd/
    │
    └── payment-service/
        ├── go.mod                  # requires: messaging/v1/events
        ├── internal/
        └── cmd/
    // events/register.go
    package events
    
    import (
        "ergo.services/ergo/gen"
        "ergo.services/ergo/net/edf"
    )
    
    func init() {
        types := []any{
            OrderCreatedV1{},
            OrderCreatedV2{},
            PaymentReceivedV1{},
        }
        for _, t := range types {
            if err := edf.RegisterTypeOf(t); err != nil && err != gen.ErrTaken {
                panic(err)
            }
        }
    }
    import "company.com/events"
    
    // Using events.OrderCreatedV1 means the package is imported,
    // init() has already run, types are registered
    // Deprecated: use ChargeRequestV2. Remove after 2025-Q3.
    type ChargeRequestV1 struct {
        OrderID int64
        Amount  int64
    }
    case ChargeRequestV1:
        a.Log().Warning("deprecated ChargeRequestV1 from %s", from)
        return a.handleChargeV1(m)
    // internal/acl/charge.go
    package acl
    
    import api "company.com/payment-api"
    
    func ChargeV1ToV2(v1 api.ChargeRequestV1) api.ChargeRequestV2 {
        return api.ChargeRequestV2{
            OrderID:  v1.OrderID,
            Amount:   v1.Amount,
            Currency: "USD", // default for V1 clients
        }
    }
    func (a *Actor) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case api.ChargeRequestV1:
            v2 := acl.ChargeV1ToV2(m)
            return a.processCharge(v2)
        case api.ChargeRequestV2:
            return a.processCharge(m)
        }
        return nil
    }
    func TestPaymentActorAcceptsBothVersions(t *testing.T) {
        tc := unit.NewTestCase(t, "test@localhost")
        defer tc.Stop()
    
        actor := tc.Spawn(createPaymentActor)
    
        // V1 works
        actor.Send(ChargeRequestV1{OrderID: 1, Amount: 100})
        actor.ShouldSend().Message(ChargeResponseV1{}).Once().Assert()
    
        // V2 works
        actor.Send(ChargeRequestV2{OrderID: 2, Amount: 200, Currency: "EUR"})
        actor.ShouldSend().Message(ChargeResponseV2{}).Once().Assert()
    }
    func TestACLConvertsV1ToV2(t *testing.T) {
        v1 := ChargeRequestV1{OrderID: 123, Amount: 500}
        v2 := acl.ChargeV1ToV2(v1)
    
        assert.Equal(t, v1.OrderID, v2.OrderID)
        assert.Equal(t, v1.Amount, v2.Amount)
        assert.Equal(t, "USD", v2.Currency) // default
    }
    type MessageOrderShippedV1 struct {
        OrderID   int64
        TrackingN string
    }
    type ChargeRequestV1 struct {
        OrderID int64
        Amount  int64
    }
    
    type ChargeResponseV1 struct {
        TransactionID string
        Status        string
    }
    type OrderCreatedV1 struct { ... }
    type PaymentReceivedV1 struct { ... }
    type OrderV1 struct { ... }   // correct
    type Order struct { ... }     // avoid - unclear versioning
    type OrderNew struct { ... }  // avoid - not a version number
    // Wrong - breaks existing consumers
    type Order struct {
        ID       int64
        Priority int    // added field breaks wire format
    }
    
    // Correct - create new version (in type name or new package path)
    type OrderV2 struct {
        ID       int64
        Priority int
    }
    // Type exists but not registered - encoding fails at runtime
    type OrderV3 struct { ... }
    
    // Must register before node starts
    edf.RegisterTypeOf(OrderV3{})
    type Coordinator struct {
        leader.Actor
        
        tasks      []Task
        processing bool
    }
    
    func (c *Coordinator) Init(args ...any) (leader.Options, error) {
        clusterID := args[0].(string)
        bootstrap := args[1].([]gen.ProcessID)
        
        c.tasks = make([]Task, 0)
        
        return leader.Options{
            ClusterID: clusterID,
            Bootstrap: bootstrap,
        }, nil
    }
    
    func (c *Coordinator) HandleBecomeLeader() error {
        c.Log().Info("elected as leader - starting task processor")
        c.processing = true
        c.startProcessing()
        return nil
    }
    
func (c *Coordinator) HandleBecomeFollower(leader gen.PID) error {
    if leader == (gen.PID{}) {
        c.Log().Info("no leader elected yet")
    } else {
        c.Log().Info("following leader: %s", leader)
    }
    // stop exclusive work in both cases - this process is not the leader
    c.processing = false
    c.stopProcessing()
    return nil
}
    
    func (c *Coordinator) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case AddTaskRequest:
            if !c.IsLeader() {
                // Not leader - reject or forward
                c.Send(from, NotLeaderError{Leader: c.Leader()})
                return nil
            }
            
            c.tasks = append(c.tasks, msg.Task)
            c.Send(from, TaskAccepted{})
            
        case ProcessNextTask:
            if c.IsLeader() && len(c.tasks) > 0 {
                task := c.tasks[0]
                c.tasks = c.tasks[1:]
                c.executeTask(task)
            }
        }
        return nil
    }
    
    func (c *Coordinator) Terminate(reason error) {
        c.Log().Info("coordinator stopping: %s", reason)
    }
    clusterID := "task-coordinators"
    bootstrap := []gen.ProcessID{
        {Name: "coordinator", Node: "node1@host"},
        {Name: "coordinator", Node: "node2@host"},
        {Name: "coordinator", Node: "node3@host"},
    }
    
    factory := func() gen.ProcessBehavior {
        return &Coordinator{}
    }
    
    pid, err := node.SpawnRegister("coordinator", factory, 
        gen.ProcessOptions{}, clusterID, bootstrap)
    type ActorBehavior interface {
        gen.ProcessBehavior
        
        Init(args ...any) (Options, error)
        HandleMessage(from gen.PID, message any) error
        HandleCall(from gen.PID, ref gen.Ref, request any) (any, error)
        Terminate(reason error)
        HandleInspect(from gen.PID, item ...string) map[string]string
        
        // Leadership transitions
        HandleBecomeLeader() error
        HandleBecomeFollower(leader gen.PID) error
        
        // Cluster membership changes
        HandlePeerJoined(peer gen.PID) error
        HandlePeerLeft(peer gen.PID) error
        
        // Term changes (for log replication or versioning)
        HandleTermChanged(oldTerm, newTerm uint64) error
    }
    type Options struct {
        ClusterID string            // Required: identifies the cluster
        Bootstrap []gen.ProcessID   // Optional: initial peers to contact
        
        ElectionTimeoutMin int      // Minimum timeout (ms, default: 150)
        ElectionTimeoutMax int      // Maximum timeout (ms, default: 300)
        HeartbeatInterval  int      // Heartbeat frequency (ms, default: 50)
    }
    leader.Options{
        ClusterID:          "my-cluster",
        Bootstrap:          peers,
        ElectionTimeoutMin: 150,
        ElectionTimeoutMax: 300,
        HeartbeatInterval:  50,
    }
    leader.Options{
        ClusterID:          "my-cluster",
        Bootstrap:          peers,
        ElectionTimeoutMin: 1000,  // 1 second
        ElectionTimeoutMax: 2000,  // 2 seconds
        HeartbeatInterval:  250,   // 4x safety margin
    }
    func (c *Coordinator) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case ProcessTask:
            if !c.IsLeader() {
                // Forward to leader
                if leader := c.Leader(); leader != (gen.PID{}) {
                    c.Send(leader, msg)
                } else {
                    c.Send(from, ErrorNoLeader{})
                }
                return nil
            }
            
            // We're leader - process the task
            result := c.executeTask(msg.Task)
            c.Send(from, TaskResult{Result: result})
        }
        return nil
    }
    func (c *Coordinator) HandleBecomeLeader() error {
        c.Log().Info("elected leader - broadcasting initial state")
        
        // Notify all peers of current state
        c.Broadcast(StateUpdate{
            Term:  c.Term(),
            State: c.getCurrentState(),
        })
        
        return nil
    }
    
    func (c *Coordinator) HandlePeerJoined(peer gen.PID) error {
        if c.IsLeader() {
            // New peer joined - send them current state
            c.Send(peer, StateUpdate{
                Term:  c.Term(),
                State: c.getCurrentState(),
            })
        }
        return nil
    }
    func (c *Coordinator) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case WriteRequest:
            if !c.IsLeader() {
                c.Send(from, ErrorNotLeader{Leader: c.Leader()})
                return nil
            }
            
            // Only leader performs writes
            err := c.database.Write(msg.Key, msg.Value)
            if err != nil {
                c.Send(from, WriteError{Err: err})
            } else {
                c.Send(from, WriteSuccess{})
            }
        }
        return nil
    }
    func (c *Coordinator) HandleBecomeLeader() error {
        c.schedulerActive = true
        c.scheduleNextTask()
        return nil
    }
    
    func (c *Coordinator) HandleBecomeFollower(leader gen.PID) error {
        c.schedulerActive = false
        return nil
    }
    
    func (c *Coordinator) scheduleNextTask() {
        if !c.schedulerActive {
            return
        }
        
        c.SendAfter(c.PID(), RunScheduledTask{}, 10*time.Second)
    }
    
    func (c *Coordinator) HandleMessage(from gen.PID, message any) error {
        switch message.(type) {
        case RunScheduledTask:
            if c.IsLeader() {
                c.executeScheduledWork()
                c.scheduleNextTask()  // Reschedule
            }
        }
        return nil
    }
    func (c *Coordinator) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case WriteCommand:
            if c.IsLeader() {
                // Process locally
                c.applyWrite(msg)
                c.Send(from, WriteSuccess{})
            } else {
                // Forward to leader
                leader := c.Leader()
                if leader == (gen.PID{}) {
                    c.Send(from, ErrorNoLeader{})
                } else {
                    c.Send(leader, ForwardedWrite{
                        OriginalSender: from,
                        Command:        msg,
                    })
                }
            }
            
        case ForwardedWrite:
            // Leader received forwarded write
            if c.IsLeader() {
                c.applyWrite(msg.Command)
                c.Send(msg.OriginalSender, WriteSuccess{})
            }
        }
        return nil
    }
    // Start isolated node with no bootstrap
    pid, err := node.SpawnRegister("coordinator", createCoordinator,
        gen.ProcessOptions{}, 
        "cluster-id", 
        []gen.ProcessID{})  // Empty bootstrap
    
    // Later, add peers dynamically
    coordinator.Join(gen.ProcessID{
        Name: "coordinator",
        Node: "node2@host",
    })
    
    coordinator.Join(gen.ProcessID{
        Name: "coordinator", 
        Node: "node3@host",
    })
    type KVStore struct {
        leader.Actor
        
        data      map[string]string
        log       []LogEntry
        commitIdx uint64
    }
    
    func (kv *KVStore) HandleBecomeLeader() error {
        // Leader starts replicating log to followers
        kv.startReplication()
        return nil
    }
    
    func (kv *KVStore) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case PutRequest:
            if !kv.IsLeader() {
                return kv.forwardToLeader(from, msg)
            }
            
            // Append to log
            entry := LogEntry{
                Term:    kv.Term(),
                Command: msg,
            }
            kv.log = append(kv.log, entry)
            
            // Replicate to followers (simplified)
            kv.Broadcast(ReplicateEntry{Entry: entry})
            
        case ReplicateEntry:
            // Follower receives log entry
            kv.log = append(kv.log, msg.Entry)
        }
        return nil
    }
    type LockService struct {
        leader.Actor
        
        locks map[string]LockInfo
    }
    
    func (ls *LockService) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case AcquireLock:
            if !ls.IsLeader() {
                ls.Send(from, ErrorNotLeader{Leader: ls.Leader()})
                return nil
            }
            
            if _, held := ls.locks[msg.Resource]; held {
                ls.Send(from, LockDenied{})
            } else {
                ls.locks[msg.Resource] = LockInfo{
                    Holder: from,
                    Expiry: time.Now().Add(msg.Duration),
                }
                ls.Send(from, LockGranted{})
            }
            
        case ReleaseLock:
            if ls.IsLeader() {
                delete(ls.locks, msg.Resource)
            }
        }
        return nil
    }
    
    func (ls *LockService) HandleBecomeFollower(leader gen.PID) error {
        // Leader changed - invalidate all locks
        ls.locks = make(map[string]LockInfo)
        return nil
    }
    info, err := node.Inspect(coordinatorPID)
    fmt.Printf("Cluster: %s\n", info["cluster"])
    fmt.Printf("Term: %s\n", info["term"])
    fmt.Printf("IsLeader: %s\n", info["leader"])
    fmt.Printf("Peers: %s\n", info["peers"])
    func (c *Coordinator) HandleBecomeLeader() error {
        c.Log().Info("became leader at term=%d with %d peers", 
            c.Term(), c.PeerCount())
        // Record metric, emit event, update monitoring dashboard
        return nil
    }
    
    func (c *Coordinator) HandlePeerLeft(peer gen.PID) error {
        c.Log().Warning("peer left: %s (remaining: %d)", 
            peer, c.PeerCount())
        return nil
    }
hashtag
Creating a Port

    Create a Port with meta.CreatePort and spawn it as a meta-process:
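A minimal sketch of what this looks like. The option fields (Cmd, Args) and the SpawnMeta call follow the conventions used elsewhere in this documentation, but treat the exact names as assumptions rather than a definitive API reference:

```go
// Sketch: wrap an external program as a meta-process.
// Field names (Cmd, Args) are assumptions based on this chapter's descriptions.
port, err := meta.CreatePort(meta.PortOptions{
	Cmd:  "python3",
	Args: []string{"worker.py"},
})
if err != nil {
	return err
}

// Spawn the Port as a meta-process of the current actor; messages produced
// from the external program's stdout will arrive in this actor's mailbox.
id, err := actor.SpawnMeta(port, gen.MetaOptions{})
```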

    The Port starts the external program and establishes three pipes: stdin (for writing), stdout (for reading), and stderr (for errors). The program runs as a child process managed by the Port meta-process.

    When the Port starts, it sends MessagePortStart to your actor. When the external program terminates (or the Port is stopped), it sends MessagePortTerminate. Between these, you exchange data messages.

    hashtag
    Text Mode: Line-Based Communication

    By default, Port operates in text mode. It reads stdout line by line and sends each line as MessagePortText. It reads stderr the same way and sends errors as MessagePortError.

    Text mode uses bufio.Scanner internally, which splits input by lines (newline delimiter). You can customize the splitting logic:

    Text mode is simple and works well for line-oriented protocols: command-response pairs, JSON-per-line, log output, or any text-based format. But it's not suitable for binary protocols.

    hashtag
    Binary Mode: Raw Bytes

    For binary protocols (Protobuf, MessagePack, custom framing), enable binary mode:

    In binary mode, the Port reads raw bytes from stdout and sends them as MessagePortData. You send binary data using MessagePortData messages:

    The Port reads up to ReadBufferSize bytes at a time from stdout and sends each chunk as MessagePortData. There's no framing or splitting - you receive raw bytes as the Port reads them. If your protocol has message boundaries, you must track them yourself.

    Stderr is always processed in text mode, even when binary mode is enabled. Stderr messages arrive as MessagePortError.

    hashtag
    Chunking: Automatic Message Framing

    Reading raw bytes means dealing with partial messages. A 1KB message might arrive as three separate MessagePortData messages (512 bytes, 400 bytes, 88 bytes), or multiple messages might arrive together in one chunk. You need to buffer, reassemble, and detect message boundaries.

    Chunking solves this by automatically framing messages. Instead of receiving raw bytes, you receive complete chunks - one MessagePortData per message, properly framed.

    hashtag
    Fixed-Length Chunks

    If every message is the same size, use fixed-length chunking:

    The Port buffers stdout until it has 256 bytes, then sends them as one MessagePortData. If a read returns 512 bytes, you receive two MessagePortData messages (256 bytes each). If a read returns 100 bytes, the Port waits for more data before sending.

    This is efficient for fixed-size protocols: binary structs, fixed-width encodings, or any format where every message has the same length.

    hashtag
    Header-Based Chunking

    Most binary protocols use variable-length messages with a header that specifies the length. Chunking can parse these headers automatically:

    This configuration matches a protocol where:

    • Every message starts with a 4-byte header

    • The header contains a 4-byte big-endian integer (bytes 0-3)

    • The integer specifies the payload length (header not included)

    • Messages are: [4-byte length][payload]

    The Port reads the header, extracts the length, waits for the full payload to arrive, then sends the complete message (header + payload) as MessagePortData.

    Example protocol:

    With the configuration above, you receive two MessagePortData messages:

    • First: 14 bytes (4-byte header + 10-byte payload)

    • Second: 260 bytes (4-byte header + 256-byte payload)

    If the external program writes both messages at once (274 bytes total), the Port automatically splits them. If the program writes slowly (header arrives, then payload arrives later), the Port waits for the complete message before sending.

    Header length options:

    HeaderLengthSize can be 1, 2, or 4 bytes. All lengths are big-endian. The Port reads the header, extracts the length value, computes the total message size (adding header size if HeaderLengthIncludesHeader is false), and buffers until the complete message arrives.

    MaxLength protection:

    If the header specifies a length exceeding MaxLength, the Port terminates with gen.ErrTooLarge. This protects against malformed messages or malicious programs that claim a message is 4GB (causing memory exhaustion).

    Set MaxLength based on your protocol's reasonable maximum. Leave it zero for no limit (use cautiously).

    hashtag
    Buffer Management

    The Port allocates buffers for reading stdout. By default, each read allocates a new buffer, which is sent in MessagePortData and becomes garbage when you're done with it. For high-throughput ports, this causes GC pressure.

    Use a buffer pool to reuse buffers:

    The Port gets buffers from the pool when reading stdout. When you receive MessagePortData, the Data field is a buffer from the pool. You must return it to the pool when done:

    If you forget to return buffers, the pool will allocate new ones, defeating the purpose. If you return a buffer and then access it later, you'll get corrupted data (the buffer is reused by the Port for the next read).

    When you send MessagePortData to write to stdin, the Port automatically returns the buffer to the pool after writing (if a pool is configured). You don't need to do anything:

    Buffer pools are critical for high-throughput scenarios. For low-volume ports (a few messages per second), the GC overhead is negligible - skip the pool for simplicity.

    hashtag
    Write Keepalive

    Some external programs expect periodic input to stay alive. If stdin goes silent for too long, they timeout or disconnect. You could send keepalive messages from your actor (with timers), but that's tedious and error-prone.

    Enable automatic keepalive:

    The Port wraps stdin with a keepalive flusher. If nothing is written for WriteBufferKeepAlivePeriod, it automatically sends WriteBufferKeepAlive bytes. This keeps the connection alive without any action from your actor.

    The keepalive message can be anything: a null byte, a specific protocol message, a ping command. The external program receives it as normal stdin input. Design your protocol to ignore or handle keepalive messages.

    Keepalive is only available in binary mode. In text mode, you need to send keepalive messages manually.

    hashtag
    Environment Variables

    The external program inherits environment variables based on your configuration:

    EnableEnvOS: Includes the operating system's environment. This gives the program access to PATH, HOME, USER, and other system variables. Useful when the program needs to find other executables or access user-specific paths.

    EnableEnvMeta: Includes environment variables from the meta-process (inherited from its parent actor). Meta-processes share their parent's environment. If the parent has MY_VAR=value, the Port's external program sees MY_VAR=value too.

    Env: Custom variables specific to this Port. These are always included regardless of the other flags.

    Order of precedence (if duplicate names):

    1. Custom Env (highest priority)

    2. Meta-process environment

    3. OS environment (lowest priority)
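    This precedence can be illustrated in plain Go (a sketch of the merging rule, not the framework's implementation):

    ```go
    package main

    import "fmt"

    // mergeEnv applies the precedence listed above: OS variables lowest,
    // meta-process variables next, custom Env highest. Illustrative only.
    func mergeEnv(osEnv, metaEnv, custom map[string]string) map[string]string {
    	merged := make(map[string]string)
    	for k, v := range osEnv {
    		merged[k] = v
    	}
    	for k, v := range metaEnv {
    		merged[k] = v // meta-process values override OS values
    	}
    	for k, v := range custom {
    		merged[k] = v // custom Env overrides everything
    	}
    	return merged
    }

    func main() {
    	env := mergeEnv(
    		map[string]string{"LOG_LEVEL": "error", "HOME": "/home/user"},
    		map[string]string{"LOG_LEVEL": "info"},
    		map[string]string{"LOG_LEVEL": "debug"},
    	)
    	fmt.Println(env["LOG_LEVEL"], env["HOME"]) // prints: debug /home/user
    }
    ```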

    hashtag
    Routing Messages

    By default, all Port messages (start, terminate, data, errors) go to the parent process - the actor that spawned the Port. For single-port scenarios, this is fine. For multiple ports or advanced architectures, you want routing:

    All Port messages are sent to the process registered as data_handler. This enables:

    Worker pools:

    The Port sends all messages to a pool, which distributes them across workers. Multiple ports can share the same pool for load balancing.

    Centralized handlers:

    Both ports send messages to python_manager, which coordinates multiple Python scripts.

    Distinguishing ports with tags:

    The Tag field appears in all Port messages. The manager uses it to distinguish which port sent the message:

    If Process is empty or not registered, messages go to the parent process.

    hashtag
    Port Messages

    Messages you receive from the Port:

    MessagePortStart - Port started successfully, external program is running:

    Sent once after the external program starts. Use this to send initialization commands.

    MessagePortTerminate - Port stopped, external program exited:

    Sent when the external program terminates (exits, crashes, or is killed) or when you terminate the Port. After this, the Port is dead - you cannot send it more messages.

    MessagePortText - Line from stdout (text mode only):

    Sent for each line read from stdout in text mode. The delimiter (newline or custom) is stripped from Text.

    MessagePortData - Binary data from stdout (binary mode only):

    In binary mode without chunking, Data contains whatever bytes the Port read (up to ReadBufferSize). With chunking, Data contains one complete chunk.

    If ReadBufferPool is configured, Data is from the pool - return it when done.

    MessagePortError - Line from stderr (always text mode):

    Sent for each line read from stderr. Stderr is always processed in text mode, even when binary mode is enabled for stdout.

    Messages you send to the Port:

    MessagePortText - Send text to stdin (text mode):

    Writes Text to stdin. Newlines are not added automatically - include them if your protocol needs them.

    MessagePortData - Send binary data to stdin (binary mode):

    Writes Data to stdin. If ReadBufferPool is configured, the Port returns the buffer to the pool after writing. Don't use the buffer after sending.

    hashtag
    Termination and Cleanup

    When the external program exits (normally or due to a crash), the Port sends MessagePortTerminate and terminates itself. The Port also kills the external program if:

    • The Port is terminated (you call process.SendExit to the Port's ID)

    • The Port's parent terminates (cascading termination)

    • An error occurs reading stdout (broken pipe, I/O error)

    The Port calls Kill() on the child process and waits for it to exit. This ensures cleanup happens even if the program is misbehaving.

    Stderr is read in a separate goroutine. This means stderr messages can arrive after MessagePortTerminate if the program wrote to stderr just before exiting. Design your actor to handle this ordering.

    hashtag
    Inspection

    Port supports inspection for debugging:

    Returns a map with Port status:

    Use this for monitoring, debugging, or displaying Port status in management UIs.

    hashtag
    Patterns and Pitfalls

    Pattern: Request-response wrapper

    Wrap a Port to provide synchronous Call semantics. Useful for RPC-style protocols.

    Pattern: Supervised restart

    Supervise the actor that spawns ports. If the actor crashes, the supervisor restarts it, which re-spawns ports. Ports inherit parent lifecycle - when the actor terminates, all its ports terminate.

    Pattern: Backpressure with buffer pool

    Limit memory usage by capping concurrent buffers. If processing is slow, the semaphore blocks, which blocks the actor's message loop, which applies backpressure to the Port.

    Pitfall: Forgetting to return buffers

    Pool buffers are reused. If you store them, they'll be overwritten by future reads. Copy data if you need to keep it.

    Pitfall: Blocking on stdin writes

    If the external program stops reading stdin (buffer full, process blocked), the Port blocks writing. The Port's HandleMessage is blocked, so it can't send you more stdout data. Deadlock.

    Solution: Design your protocol so the external program never stops reading stdin. Use flow control or chunking to prevent overflows.

    Pitfall: Ignoring MessagePortError

    Stderr messages arrive as MessagePortError. If you don't handle them, warnings and errors from the external program are lost. Always handle stderr or explicitly decide to ignore it.

    Pitfall: Not handling MessagePortTerminate

    After MessagePortTerminate, the Port is dead. Sending messages returns errors. Handle termination: restart the Port, fail gracefully, or terminate your actor.

    Port meta-processes enable clean integration with external programs. They handle process management, I/O buffering, protocol framing, and lifecycle coordination - letting you focus on the protocol logic while maintaining the actor model's isolation and simplicity.

    type Controller struct {
        act.Actor
        portID gen.Alias
    }
    
    func (c *Controller) Init(args ...any) error {
        // Define port options
        options := meta.PortOptions{
            Cmd:  "python3",
            Args: []string{"processor.py", "--mode=batch"},
            Env: map[gen.Env]string{
                "WORKER_ID": "worker-1",
            },
        }
        
        // Create port behavior
        portBehavior, err := meta.CreatePort(options)
        if err != nil {
            return fmt.Errorf("failed to create port: %w", err)
        }
        
        // Spawn as meta-process
        portID, err := c.SpawnMeta(portBehavior, gen.MetaOptions{})
        if err != nil {
            return fmt.Errorf("failed to spawn port: %w", err)
        }
        
        c.portID = portID
        c.Log().Info("spawned port for %s (id: %s)", options.Cmd, portID)
        return nil
    }
    func (c *Controller) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case meta.MessagePortStart:
            c.Log().Info("port started: %s", m.ID)
            // Send initial command
            c.Send(m.ID, meta.MessagePortText{Text: "INIT worker-1\n"})
            
        case meta.MessagePortText:
            // Received line from stdout
            c.Log().Info("port output: %s", m.Text)
            c.processOutput(m.Text)
            
        case meta.MessagePortError:
            // Received line from stderr
            c.Log().Warning("port error: %s", m.Error)
            
        case meta.MessagePortTerminate:
            c.Log().Info("port terminated: %s", m.ID)
            // Restart or cleanup
        }
        return nil
    }
    
    func (c *Controller) processCommand(cmd string) {
        // Send command to external program
        c.Send(c.portID, meta.MessagePortText{
            Text: cmd + "\n",
        })
    }
    options := meta.PortOptions{
        Cmd: "processor",
        
        // Custom split function for stdout
        SplitFuncStdout: func(data []byte, atEOF bool) (advance int, token []byte, err error) {
            // Find null-terminated strings instead of newlines
            if i := bytes.IndexByte(data, 0); i >= 0 {
                return i + 1, data[:i], nil
            }
            if atEOF && len(data) > 0 {
                return len(data), data, nil
            }
            return 0, nil, nil
        },
        
        // Custom split function for stderr (optional)
        SplitFuncStderr: bufio.ScanWords, // Split stderr by words
    }
    options := meta.PortOptions{
        Cmd: "binary-processor",
        Binary: meta.PortBinaryOptions{
            Enable:         true,
            ReadBufferSize: 16384,  // 16KB read buffer
        },
    }
    func (c *Controller) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case meta.MessagePortStart:
            // Send binary request
            request := encodeRequest("GET", "/data")
            c.Send(m.ID, meta.MessagePortData{Data: request})
            
        case meta.MessagePortData:
            // Received binary data from stdout
            response := decodeResponse(m.Data)
            c.handleResponse(response)
        }
        return nil
    }
    options := meta.PortOptions{
        Cmd: "fixed-protocol",
        Binary: meta.PortBinaryOptions{
            Enable: true,
            ReadChunk: meta.ChunkOptions{
                Enable:      true,
                FixedLength: 256,  // Every message is exactly 256 bytes
            },
        },
    }
    options := meta.PortOptions{
        Cmd: "length-prefix-protocol",
        Binary: meta.PortBinaryOptions{
            Enable: true,
            ReadChunk: meta.ChunkOptions{
                Enable: true,
                
                // Header structure
                HeaderSize:           4,  // 4-byte header
                HeaderLengthPosition: 0,  // Length starts at byte 0
                HeaderLengthSize:     4,  // Length is a 4-byte integer
                
                // Does length include the header?
                HeaderLengthIncludesHeader: false,  // Length is payload only
                
                // Safety limit
                MaxLength: 1048576,  // Max 1MB per message
            },
        },
    }
    Message 1: [0x00 0x00 0x00 0x0A] [10 bytes of payload]
    Message 2: [0x00 0x00 0x01 0x00] [256 bytes of payload]
    // Length is in bytes 2-3 (2-byte length at offset 2)
    HeaderLengthPosition: 2,
    HeaderLengthSize:     2,
    
    // Length includes the header (length = total message size)
    HeaderLengthIncludesHeader: true,
    
    // Protocol: [type][flags][length-MSB][length-LSB][payload]
    //            byte0  byte1  byte2       byte3      bytes 4+
    MaxLength: 65536,  // Reject messages larger than 64KB
    bufferPool := &sync.Pool{
        New: func() any {
            return make([]byte, 16384)
        },
    }
    
    options := meta.PortOptions{
        Cmd: "high-throughput",
        Binary: meta.PortBinaryOptions{
            Enable:         true,
            ReadBufferSize: 16384,
            ReadBufferPool: bufferPool,
        },
    }
    func (c *Controller) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case meta.MessagePortData:
            // Process the data
            c.processData(m.Data)
            
            // Return buffer to pool
            bufferPool.Put(m.Data)
        }
        return nil
    }
    buf := bufferPool.Get().([]byte)
    // Fill buf with data
    c.Send(portID, meta.MessagePortData{Data: buf})
    // Port returns buf to pool after writing
    options := meta.PortOptions{
        Cmd: "keepalive-required",
        Binary: meta.PortBinaryOptions{
            Enable:                     true,
            WriteBufferKeepAlive:       []byte{0x00}, // Send null byte
            WriteBufferKeepAlivePeriod: 5 * time.Second,
        },
    }
    options := meta.PortOptions{
        Cmd: "processor",
        
        // Enable OS environment variables (PATH, HOME, etc)
        EnableEnvOS: true,
        
        // Enable meta-process environment variables
        EnableEnvMeta: true,
        
        // Custom environment variables
        Env: map[gen.Env]string{
            "WORKER_ID": "worker-1",
            "LOG_LEVEL": "debug",
        },
    }
    options := meta.PortOptions{
        Cmd:     "worker",
        Process: "data_handler",  // Send all messages to this registered process
    }
    options := meta.PortOptions{
        Cmd:     "processor",
        Process: "worker_pool",  // act.Pool actor
    }
    options := meta.PortOptions{
        Cmd:     "python3",
        Args:    []string{"script1.py"},
        Process: "python_manager",
    }
    
    options2 := meta.PortOptions{
        Cmd:     "python3",
        Args:    []string{"script2.py"},
        Process: "python_manager",
    }
    options1 := meta.PortOptions{
        Cmd:     "worker",
        Tag:     "input-processor",
        Process: "manager",
    }
    
    options2 := meta.PortOptions{
        Cmd:     "worker",
        Tag:     "output-formatter",
        Process: "manager",
    }
    func (m *Manager) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case meta.MessagePortData:
            switch msg.Tag {
            case "input-processor":
                m.handleInput(msg.Data)
            case "output-formatter":
                m.handleOutput(msg.Data)
            }
        }
        return nil
    }
    type MessagePortStart struct {
        ID  gen.Alias  // Port's meta-process ID
        Tag string     // Tag from PortOptions
    }
    type MessagePortTerminate struct {
        ID  gen.Alias
        Tag string
    }
    type MessagePortText struct {
        ID   gen.Alias
        Tag  string
        Text string  // One line (delimiter removed)
    }
    type MessagePortData struct {
        ID   gen.Alias
        Tag  string
        Data []byte  // Raw bytes or complete chunk
    }
    type MessagePortError struct {
        ID    gen.Alias
        Tag   string
        Error error  // Line from stderr as an error
    }
    c.Send(portID, meta.MessagePortText{
        Text: "COMMAND arg1 arg2\n",
    })
    c.Send(portID, meta.MessagePortData{
        Data: encodedMessage,
    })
    result, err := process.Call(portID, gen.Inspect{})
    map[string]string{
        "tag":               "worker-1",
        "cmd":               "/usr/bin/python3",
        "args":              "[script.py --mode=batch]",
        "pid":               "12345",       // OS process ID
        "binary":            "true",        // Binary mode enabled
        "binary.read_chunk": "true",        // Chunking enabled
        "env":               "[WORKER_ID=worker-1]",
        "pwd":               "/path/to/working/dir",
        "bytesIn":           "1048576",     // Bytes read from stdout
        "bytesOut":          "524288",      // Bytes written to stdin
    }
    type PortWrapper struct {
        act.Actor
        portID   gen.Alias
        pending  map[string]pendingCall // request ID -> blocked caller
        sequence uint64
    }
    
    type pendingCall struct {
        from gen.PID
        ref  gen.Ref
    }
    
    func (w *PortWrapper) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
        // Generate a unique request ID. Messages are handled sequentially,
        // so a plain counter is safe here.
        w.sequence++
        reqID := strconv.FormatUint(w.sequence, 10)
        
        // Remember the caller under this request ID
        w.pending[reqID] = pendingCall{from: from, ref: ref}
        
        // Send to port with the ID prefixed
        w.Send(w.portID, meta.MessagePortText{
            Text: fmt.Sprintf("%s:%s\n", reqID, request),
        })
        
        // nil result - the response is sent later via SendResponse
        return nil, nil
    }
    
    func (w *PortWrapper) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case meta.MessagePortText:
            // Parse response: "reqID:result"
            parts := strings.SplitN(strings.TrimSpace(m.Text), ":", 2)
            if len(parts) != 2 {
                return nil
            }
            reqID, result := parts[0], parts[1]
            
            // Find the blocked caller and complete its Call
            if pc, found := w.pending[reqID]; found {
                w.SendResponse(pc.from, pc.ref, result)
                delete(w.pending, reqID)
            }
        }
        return nil
    }
    type PortSupervisor struct {
        act.Supervisor
    }
    
    func (s *PortSupervisor) Init(args ...any) (act.SupervisorSpec, error) {
        return act.SupervisorSpec{
            Children: []act.SupervisorChildSpec{
                {
                    Name:    "port_manager",
                    Factory: createPortManager,
                },
            },
            Restart: act.SupervisorRestart{
                Strategy:  act.SupervisorStrategyPermanent,
                Intensity: 5,
                Period:    10,
            },
        }, nil
    }
    bufferPool := &sync.Pool{
        New: func() any {
            return make([]byte, 8192)
        },
    }
    
    // Limit concurrent buffers
    sem := make(chan struct{}, 100) // Max 100 buffers in flight
    
    func (c *Controller) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case meta.MessagePortData:
            // Acquire semaphore (blocks if 100 buffers in use)
            sem <- struct{}{}
            
            go func() {
                defer func() {
                    <-sem             // Release semaphore
                    bufferPool.Put(m.Data) // Return buffer
                }()
                
                // Process data (can be slow)
                c.processData(m.Data)
            }()
        }
        return nil
    }
    // WRONG: Buffer leaked
    func (c *Controller) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case meta.MessagePortData:
            c.dataQueue = append(c.dataQueue, m.Data) // Stored, never returned!
        }
        return nil
    }
    
    // CORRECT: Copy if you need to store
    func (c *Controller) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case meta.MessagePortData:
            copied := make([]byte, len(m.Data))
            copy(copied, m.Data)
            c.dataQueue = append(c.dataQueue, copied)
            bufferPool.Put(m.Data) // Return original
        }
        return nil
    }
    // Port writes are blocking
    c.Send(portID, meta.MessagePortData{Data: largeBuffer})
    // ^ This Send doesn't block, but the Port's write to stdin might
    // WRONG: Stderr ignored
    func (c *Controller) HandleMessage(from gen.PID, message any) error {
        switch m := message.(type) {
        case meta.MessagePortData:
            c.process(m.Data)
        // No case for MessagePortError!
        }
        return nil
    }
    // WRONG: Port terminated, but actor keeps trying to use it
    func (c *Controller) processData(data []byte) {
        c.Send(c.portID, meta.MessagePortData{Data: data})
        // ^ Fails if port terminated
    }

    | Module | Affected by changes |
    | --- | --- |
    | events/ | All consumers |

    | Change | Compatible | Action |
    | --- | --- | --- |
    | Change field type | No | Create new version |
    | Rename field | No | Create new version |
    | Reorder fields | No | Create new version |

    | Module | Changes |
    | --- | --- |
    | receiver-api/ | Receiver team decides |
    | events/ | All consumers coordinate |


    Handling Sync Requests

    Handling synchronous requests in the asynchronous actor model

    The actor model is fundamentally asynchronous. Processes send messages and continue immediately without waiting for responses. This asynchrony is core to the model - actors don't block, they process messages one at a time from their mailbox, and they scale because thousands of actors can run concurrently without threads blocking on I/O or responses.

    But real systems often need synchronous patterns. A client makes a request and must wait for a response before continuing. An HTTP handler receives a request and can't return to the client until the response is ready. A database query needs to block until the data arrives. These synchronous requirements don't disappear just because your system uses actors.

    The challenge is satisfying these synchronous requirements without actually blocking the actor. If an actor blocks waiting for a response, it can't process other messages in its mailbox. The actor becomes unresponsive to everything else. This defeats the purpose of the actor model - you want concurrent message processing, not sequential blocking.

    This chapter explores how to handle synchronous-style requests while maintaining asynchronous actor behavior. You'll learn how the framework implements request-response, how to handle Call requests efficiently, and how to process them asynchronously even when the caller is blocked waiting.

    hashtag
    The Nature of Synchronous Calls in Actors

    In traditional synchronous code, when you call a function, you wait for it to return:
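    A plain-Go sketch of this (queryDatabase is a hypothetical stand-in for any blocking operation):

    ```go
    package main

    import (
    	"fmt"
    	"time"
    )

    // queryDatabase stands in for any blocking operation.
    func queryDatabase(q string) string {
    	time.Sleep(10 * time.Millisecond) // simulate I/O latency
    	return "rows for " + q
    }

    func main() {
    	// Execution stops here until the query returns.
    	result := queryDatabase("SELECT * FROM users")
    	fmt.Println(result)
    }
    ```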

    The calling thread stops. The operating system schedules other threads. Eventually the query completes, the thread wakes up, and execution continues. This is fine when you have many threads - some block, others run. But it's wasteful, and it doesn't scale to tens of thousands of concurrent operations.

    In the actor model, you send a message and continue:
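    A plain-Go sketch of the fire-and-forget idea, using a buffered channel as a stand-in for an actor's mailbox (illustrative, not the framework's API):

    ```go
    package main

    import "fmt"

    func main() {
    	// A buffered channel stands in for an actor's mailbox.
    	mailbox := make(chan string, 16)

    	// "Send" deposits the message and returns immediately.
    	mailbox <- "query: SELECT * FROM users"
    	fmt.Println("sender continues without waiting")

    	// The receiving actor processes it later, in its own loop.
    	fmt.Println("processing:", <-mailbox)
    }
    ```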

    The sender doesn't block. The message goes into the database actor's mailbox. When the database actor processes it, it sends a response message back. The original sender handles that response later in its own message loop. This is how actors achieve massive concurrency - no actor ever blocks waiting, so you can run thousands of actors with a small thread pool.

    But what if the sender legitimately needs to wait? What if it's an HTTP handler that can't return to the client until the query completes?

    The framework provides Call for this:

    From the caller's perspective, this looks synchronous - you call, you wait, you get a result. But from the system's perspective, it's asynchronous:

    1. The caller sends a request message with a unique reference (gen.Ref)

    2. The caller's goroutine blocks waiting for a response with that reference

    3. The recipient receives the request as a HandleCall invocation

    4. When HandleCall returns a result, the framework sends it back to the caller, tagged with the same reference

    5. The caller's goroutine wakes up, and Call returns the result

    The caller blocks, but blocking is isolated to that one actor. The actor's goroutine is suspended (cheap), not spinning (expensive). Other actors run normally. The recipient processes the request whenever it gets to it in its mailbox, not immediately. The entire system remains asynchronous, but individual actors can use synchronous-style APIs when needed.

    hashtag
    Basic HandleCall Implementation

    When a process receives a Call request, the framework invokes HandleCall:

    Critical distinction: The error you return from HandleCall is not the response to the caller - it's the termination reason for your process!

    • return result, nil - Send result to caller, continue running

    • return errorValue, nil - Send errorValue to caller, continue running

    • return nil, nil - Send nothing yet; handle the request asynchronously and respond later with SendResponse

    • return nil, err - Send nothing; terminate your process with reason err

    When you return a non-nil result from HandleCall, the framework automatically sends it as a response message to the caller. The caller's blocked Call unblocks and returns your result. Any value can be a result - integers, strings, structs, even errors.

    If you need to send an error to the caller, return the error as the result value, not as the error return:

    The second return value (error) is for terminating your process. Return gen.TerminateReasonNormal to gracefully stop, or any other error for abnormal termination. If you return both a result and gen.TerminateReasonNormal, the framework sends the result first, then terminates your process.

    From the caller's side:

    The caller blocks at Call until your HandleCall returns. This can be milliseconds (local, fast computation) or seconds (remote, slow operation). The caller can specify a timeout - if no response arrives within the timeout, Call returns nil, gen.ErrTimeout.

    Note the distinction: err from Call is a framework-level error (timeout, network failure, process terminated). The result itself might be an error value sent by your HandleCall - that's application-level.

    hashtag
    Why Not Just Use Channels?

    You might wonder: why not just use Go channels for request-response?
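    Such an approach, sketched in plain Go (names are illustrative), typically embeds a reply channel inside the request:

    ```go
    package main

    import "fmt"

    // request smuggles a reply channel inside the "message" - the
    // channel-based pattern discussed here.
    type request struct {
    	query string
    	reply chan string
    }

    func worker(in chan request) {
    	for req := range in {
    		req.reply <- "result for " + req.query
    	}
    }

    func main() {
    	in := make(chan request)
    	go worker(in)

    	r := request{query: "SELECT 1", reply: make(chan string)}
    	in <- r
    	fmt.Println(<-r.reply) // the caller blocks on shared memory, not a message
    }
    ```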

    This breaks the actor model in subtle ways:

    Shared memory - Channels are shared memory. Passing a channel in a message creates a direct communication path outside the actor system. If the worker is on a remote node, the channel doesn't work (channels don't serialize). Your code becomes non-portable between local and remote.

    Blocking semantics - Blocking on a channel blocks the actor's goroutine, but the actor is still "running" from the framework's perspective. The actor can't process other messages while blocked. With Call, the framework knows the actor is waiting for a response and can properly account for it (the actor is in ProcessStateWaitResponse).

    Timeout coordination - Channels don't have built-in timeouts. You'd wrap them in select with time.After, but timeout cleanup is tricky. With Call, timeouts are built-in, and references have deadlines that the receiver can check.

    No network transparency - Call works identically for local and remote processes. Channels don't. If you use channels for local request-response, your code won't work when you move to a distributed deployment.

    The framework's Call mechanism is designed specifically for request-response in the actor model, works across the network, and integrates properly with the actor lifecycle.

    hashtag
    Handling Requests with Worker Pools

    A common pattern is a server process that receives many Call requests. If processing each request takes time (database query, HTTP call, complex computation), handling them sequentially in HandleCall creates a bottleneck. One slow request delays all subsequent requests.

    The solution is act.Pool - a specialized actor that automatically distributes requests across a pool of worker actors:

    Notice what's not in this code - there's no HandleCall for the Server. You don't need one.

    act.Pool automatically intercepts all incoming Call requests and forwards them to workers. When you send a Call to the Server PID, the Pool:

    1. Receives the Call request in its mailbox

    2. Pops an available worker from the pool

    3. Forwards the entire request (from, ref, message) to the worker

    The worker receives the Call request with the original caller's PID and ref. When the worker returns a result from HandleCall, it goes directly to the original caller, bypassing the Pool entirely. The Pool is just a router.

    From the caller's perspective:

    This gives you concurrent request processing:

    • 10 Call requests arrive at the Server simultaneously

    • Pool forwards each to a different worker

    • All 10 workers process concurrently

    The caller's experience is unchanged - they call, they block, they get a result. They don't know about the pool. The concurrency is entirely internal to the server.

    Worker resilience:

    If a worker crashes or becomes unresponsive, the Pool automatically spawns a replacement worker. Worker failures don't affect the Pool's availability - other workers continue processing requests while the Pool restarts failed workers in the background.

    If all workers are busy (mailboxes full), incoming requests queue up in the Pool's mailbox until a worker becomes available.

    For more details on Pool configuration and advanced patterns, see the act.Pool chapter.

    hashtag
    Asynchronous Processing of Synchronous Requests

    Sometimes you need to handle a Call request asynchronously within a single actor, without workers. Maybe you're waiting for a timer, or you need to make another Call before you can respond, or you want to batch multiple requests.

    You can do this manually:

    The pattern:

    1. HandleCall stores from and ref for later

    2. HandleCall returns (nil, nil) - async handling

    3. Later, when the result is ready, the actor calls SendResponse(from, ref, result) to unblock the caller

    You must respond eventually, or the caller will timeout. If you lose track of the ref or forget to respond, the caller waits until timeout and gets gen.ErrTimeout.

    The result you send with SendResponse can be any value - strings, numbers, structs, even errors. If you want to send an error to the caller, just send it as a normal result value:

    The caller receives it as result (first return value from Call) and can check if it's an error.

    hashtag
    SendResponse vs SendResponseError: Two Channels for Results

    When you handle Call requests asynchronously, you send responses later using SendResponse. But there's also SendResponseError. What's the difference, and when do you use each?

    The difference is in which return value the caller receives from Call.

    SendResponse sends to the result channel:

    Whatever you send appears as the first return value (result). The second return value (err) is nil, meaning no framework error occurred. The result can be anything - strings, numbers, structs, even errors:

    The caller must check if the result is an error:
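    A plain-Go sketch of that caller-side check (describeResult is an illustrative helper, not a framework API):

    ```go
    package main

    import (
    	"errors"
    	"fmt"
    )

    // describeResult shows the caller-side check: the Call succeeded at the
    // transport level, but the result value itself may be an error.
    func describeResult(result any) string {
    	if e, ok := result.(error); ok {
    		return "application error: " + e.Error()
    	}
    	return fmt.Sprintf("value: %v", result)
    }

    func main() {
    	fmt.Println(describeResult("alice"))
    	fmt.Println(describeResult(errors.New("record not found")))
    }
    ```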

    SendResponseError sends to the error channel:

    The error appears as the second return value (err), exactly where framework errors like timeout and network failures appear. The first return value (result) is nil.

    From the caller's perspective, there's no difference between an error from SendResponseError and a framework error:

    The problem with mixing channels

    The framework uses the error channel for transport errors - problems with the messaging infrastructure. Your application uses it for business logic results. When you call SendResponseError, you're mixing these two concerns.

    Consider a typical caller error handling:
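    A plain-Go sketch of such retry logic; callService and errUnavailable are illustrative stand-ins for process.Call and framework errors such as gen.ErrTimeout:

    ```go
    package main

    import (
    	"errors"
    	"fmt"
    	"time"
    )

    var errUnavailable = errors.New("service unavailable")

    // callService stands in for process.Call; it fails transiently,
    // the way a timeout or network error might.
    func callService(attempt int) (string, error) {
    	if attempt < 3 {
    		return "", errUnavailable
    	}
    	return "ok", nil
    }

    // callWithRetry retries on any error from the error channel - sensible
    // for transport failures, wasteful if application errors arrive there too.
    func callWithRetry(maxAttempts int) (string, error) {
    	var result string
    	var err error
    	for attempt := 1; attempt <= maxAttempts; attempt++ {
    		result, err = callService(attempt)
    		if err == nil {
    			return result, nil
    		}
    		time.Sleep(5 * time.Millisecond) // backoff before retrying
    	}
    	return "", err
    }

    func main() {
    	result, err := callWithRetry(5)
    	fmt.Println(result, err)
    }
    ```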

    This makes sense for transport errors - network glitches, temporary overload. But if the database actor uses SendResponseError for "record not found", the caller retries unnecessarily. The record won't appear in one second.

    The caller has no way to distinguish. Both arrive through the error channel.

    When mixing is justified

    Despite this issue, SendResponseError has legitimate uses. The key is: use it for errors that should be handled like transport errors.

    Imagine a database query actor. It receives queries, executes them against a database, and returns results. What errors can occur?

    Application errors - problems with the query itself:

    • Bad SQL syntax

    • Permission denied

    • Constraint violation

    These are not infrastructure problems. The actor is working fine, the database is up, the request was processed. The query just has issues. The caller should see these as results, not transport failures.

    Infrastructure errors - problems with the database connection:

    • Database server is down

    • Network to database lost

    • Connection pool exhausted

    • Too many simultaneous connections

    These are infrastructure problems. The actor couldn't process the request because a dependency is unavailable. From the caller's perspective, this is the same as if the actor itself were unreachable (timeout) or the node were down (network failure). The caller should handle all of these identically - retry, fallback, circuit breaking.

    Here's how to implement this:

    The caller handles both channels naturally:

    This works because the caller wants to handle infrastructure failures identically, regardless of whether they originate from the framework (timeout, network) or from the application (database down). Both represent unavailable service, both trigger the same fallback logic.

    Guideline

    Use SendResponse for all normal cases, including expected errors (validation, not found, unauthorized). These are results - the request was processed, here's what happened.

    Use SendResponseError only when the error represents an infrastructure failure that the caller should treat the same as transport errors - retry with backoff, circuit breaking, fallback to alternative services.

    If in doubt, use SendResponse. It keeps transport and application concerns separate, giving the caller maximum clarity.

    hashtag
    Using Ref.IsAlive for Timeout Awareness

    When you handle requests asynchronously, the caller might timeout before you respond. Imagine:

    1. Caller makes a Call with 5 second timeout

    2. Your HandleCall stores the request, returns nil (async)

    3. 6 seconds pass

    4. Caller's timeout fires, and Call returns gen.ErrTimeout

    5. Your actor finishes processing and calls SendResponse
    Your response arrives after the caller stopped waiting. The caller won't receive it (it's not waiting on that ref anymore). Your work was wasted.

    You can detect this with ref.IsAlive():

    ref.IsAlive() checks the deadline embedded in the reference. When the caller made the Call with a timeout, the framework created a reference with MakeRefWithDeadline(now + timeout). The deadline is stored in ref.ID[2] as a unix timestamp. IsAlive() compares it to the current time - if the deadline passed, it returns false.
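    The deadline check described above can be modeled without the framework. This is an illustrative sketch, not the framework's actual implementation - the `ref` struct, `isAlive`, and `makeRefWithDeadline` below only mirror the description (deadline stored in `ID[2]` as a unix timestamp, compared against the current time):

    ```go
    package main

    import (
    	"fmt"
    	"time"
    )

    // ref models the deadline-carrying reference described above:
    // ID[2] holds the caller's deadline as a unix timestamp (0 = no deadline).
    type ref struct {
    	ID [3]uint64
    }

    // isAlive reports whether the deadline embedded in the reference
    // has not yet passed.
    func (r ref) isAlive() bool {
    	deadline := int64(r.ID[2])
    	if deadline == 0 {
    		return true // no deadline set
    	}
    	return time.Now().Unix() < deadline
    }

    // makeRefWithDeadline creates a reference that expires after timeout.
    func makeRefWithDeadline(timeout time.Duration) ref {
    	var r ref
    	r.ID[2] = uint64(time.Now().Add(timeout).Unix())
    	return r
    }

    func main() {
    	live := makeRefWithDeadline(5 * time.Second)

    	var expired ref
    	expired.ID[2] = uint64(time.Now().Add(-time.Second).Unix())

    	fmt.Println(live.isAlive())    // true: deadline is in the future
    	fmt.Println(expired.isAlive()) // false: deadline already passed
    }
    ```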

    This lets you skip processing expired requests. If a request took too long to reach the front of the queue, and the caller already gave up, don't waste resources computing a response nobody will receive.

    But be careful: IsAlive() returning false doesn't mean the caller is definitely gone. It means the deadline passed. The caller might have disappeared for other reasons (crash, network disconnect), or they might still exist but already moved on. It's a hint for optimization, not a guarantee about caller state.

    If you send a response after the deadline, nothing bad happens. The response message arrives, the receiver checks if anyone is waiting for that ref, finds nobody, and drops the message. It's just wasted work - harmless but inefficient.

    hashtag
    Common Patterns and Pitfalls

    Pattern: Immediate vs deferred

    Some requests you can answer immediately, others need async processing. Mix both in the same HandleCall based on the situation.

    Pattern: Batch processing

    Accumulate requests, process them together, respond to each individually. Efficient for operations with high setup cost (database connections, API requests with rate limits).

    Pitfall: Losing references

    You need both from and ref to send a response. Store them together.

    Pitfall: Confusing result errors with termination errors

    This is the most common mistake. Remember: the error return from HandleCall terminates your process, it doesn't go to the caller (except the special case of gen.TerminateReasonNormal with a non-nil result).

    Pitfall: Blocking in HandleCall

    Even though the caller is blocked waiting, your actor shouldn't block. If you sleep for 5 seconds, you can't handle other messages during that time. Other callers will queue up waiting. If this is unavoidable (calling a blocking API you don't control), spawn a worker to handle it or use act.Pool.

    hashtag
    The Path to Important Delivery

    Everything discussed so far assumes the response message arrives. But what if it doesn't? Networks drop packets. Remote processes crash. Connections fail.

    When a response is lost, the caller blocks until timeout. Eventually Call returns gen.ErrTimeout, but you don't know if the request was processed or not. Did the receiver handle it and the response got lost? Or did the request itself get lost before reaching the receiver?

    This uncertainty is a fundamental problem in distributed systems. The framework's Call mechanism gives you request-response semantics, but it doesn't guarantee the response arrives. It's "best effort" - works reliably for local calls and stable network connections, but no guarantees.

    For many use cases, this is fine. Timeouts are acceptable. Callers can retry. Idempotent operations tolerate retries. But some operations can't tolerate uncertainty. A payment authorization must definitely succeed or definitely fail - timeout isn't acceptable.

    The solution is Important Delivery. When you enable the Important flag, the framework changes from "best effort" to "confirmed delivery." Responses don't just get sent, they get acknowledged. If the response fails to deliver, you know immediately rather than waiting for timeout.

    Important Delivery makes the network transparent for failures, not just successes. It turns request-response from "probably works" into "definitely works or definitely fails, no ambiguity."

    We'll explore Important Delivery in depth in the next chapter. For now, understand that everything you've learned about Call and HandleCall still applies. Important Delivery is a layer on top, not a replacement. You'll still handle requests the same way - the framework just makes delivery more reliable.

    For details on how messages and calls flow through the network, see Network Transparency. For understanding delivery guarantees, continue to Important Delivery.

    result := database.Query("SELECT * FROM users")
    // blocked here until query completes
    processResult(result)
    database.Send(QueryRequest{SQL: "SELECT * FROM users"})
    // immediately continues, doesn't wait
    doOtherWork()
    result, err := process.Call(databasePID, QueryRequest{SQL: "SELECT * FROM users"})
    // blocked here, but only this actor is blocked
    // other actors continue running normally
    type Calculator struct {
        act.Actor
    }
    
    func (c *Calculator) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
        switch req := request.(type) {
        case AddRequest:
            result := req.A + req.B
            return result, nil
            
        case DivideRequest:
            if req.B == 0 {
                // Return error as the result value, not as termination reason
                return fmt.Errorf("division by zero"), nil
            }
            result := req.A / req.B
            return result, nil
            
        default:
            // Return error as the result value
            return fmt.Errorf("unknown request type"), nil
        }
    }
    // WRONG - terminates the process!
    if invalid {
        return nil, fmt.Errorf("invalid request")
    }
    
    // CORRECT - sends error to caller
    if invalid {
        return fmt.Errorf("invalid request"), nil
    }
    // Somewhere in another actor
    result, err := process.Call(calculatorPID, AddRequest{A: 10, B: 20})
    if err != nil {
        // This is a framework error (timeout, connection lost, etc)
        process.Log().Error("call failed: %s", err)
        return err
    }
    
    // Check if the result itself is an error (application-level error)
    if errResult, ok := result.(error); ok {
        process.Log().Error("calculator returned error: %s", errResult)
        return errResult
    }
    
    sum := result.(int)
    process.Log().Info("10 + 20 = %d", sum)
    // Tempting but wrong in actor model
    response := make(chan Result)
    process.Send(workerPID, Request{Data: data, ResponseChan: response})
    result := <-response  // block waiting
    type Server struct {
        act.Pool
    }
    
    type Worker struct {
        act.Actor
    }
    
    func (s *Server) Init(args ...any) (act.PoolOptions, error) {
        return act.PoolOptions{
            PoolSize:      10,  // 10 worker actors
            WorkerFactory: func() gen.ProcessBehavior { return &Worker{} },
        }, nil
    }
    
    // No HandleCall needed for Server! Pool handles forwarding automatically.
    
    func (w *Worker) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
        // Process the request
        switch req := request.(type) {
        case QueryRequest:
            // Simulate slow operation
            time.Sleep(100 * time.Millisecond)
            result := fmt.Sprintf("Result for: %s", req.Query)
            return result, nil
            
        default:
            // Return error as result value, not termination reason
            return fmt.Errorf("unknown request"), nil
        }
    }
    // Caller doesn't know about the pool
    result, err := process.Call(serverPID, QueryRequest{Query: "data"})
    // Result comes from whichever worker handled it
    type AsyncHandler struct {
        act.Actor
        pending map[gen.Ref]pendingRequest
    }
    
    type pendingRequest struct {
        from gen.PID
        data any
    }
    
    func (a *AsyncHandler) Init(args ...any) error {
        a.pending = make(map[gen.Ref]pendingRequest)
        return nil
    }
    
    func (a *AsyncHandler) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
        switch req := request.(type) {
        case BatchRequest:
            // Store the request for later
            a.pending[ref] = pendingRequest{from: from, data: req}
            
            // Maybe set a timer to process after accumulating more requests
            a.SendAfter(a.PID(), BatchTrigger{}, 100 * time.Millisecond)
            
            // Return nil to handle asynchronously
            return nil, nil
            
        case ImmediateRequest:
            // This one we can answer immediately
            return "immediate result", nil
        }
        
        // Return error as result value
        return fmt.Errorf("unknown request"), nil
    }
    
    func (a *AsyncHandler) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case BatchTrigger:
            // Time to respond to all pending requests
            for ref, pr := range a.pending {
                result := a.processBatch(pr.data)
                a.SendResponse(pr.from, ref, result)
            }
            a.pending = make(map[gen.Ref]pendingRequest)  // clear
        }
        return nil
    }
    if invalid {
        a.SendResponse(pr.from, ref, fmt.Errorf("validation failed"))
    }
    // Handler
    a.SendResponse(caller, ref, "success")
    
    // Caller receives
    result, err := process.Call(handler, request)
    // result = "success"
    // err = nil
    // Handler sends an error as a result
    a.SendResponse(caller, ref, fmt.Errorf("user not found"))
    
    // Caller receives
    result, err := process.Call(handler, request)
    // result = error("user not found")
    // err = nil
    result, err := process.Call(handler, request)
    if err != nil {
        // Framework problem - timeout, network, process died
        return fmt.Errorf("call failed: %w", err)
    }
    
    if errResult, ok := result.(error); ok {
        // Application-level error
        return fmt.Errorf("operation failed: %w", errResult)
    }
    
    // Success - use result
    processResult(result)
    // Handler
    a.SendResponseError(caller, ref, fmt.Errorf("database unavailable"))
    
    // Caller receives
    result, err := process.Call(handler, request)
    // result = nil
    // err = error("database unavailable")
    result, err := process.Call(handler, request)
    if err != nil {
        // Could be:
        // - Timeout (gen.ErrTimeout)
        // - Network failure (gen.ErrNoConnection)
        // - Process crashed (gen.ErrProcessTerminated)
        // - OR: Handler sent via SendResponseError
        // Caller cannot distinguish!
        return fmt.Errorf("call failed: %w", err)
    }
    result, err := process.Call(databaseActor, query)
    if err != nil {
        // Retry logic for transport errors
        time.Sleep(1 * time.Second)
        result, err = process.Call(databaseActor, query)
        if err != nil {
            return err  // Give up
        }
    }
    type DatabaseActor struct {
        act.Actor
        db      *sql.DB
        pending map[gen.Ref]pendingRequest
    }
    
    type pendingRequest struct {
        from  gen.PID
        query string
    }
    
    func (d *DatabaseActor) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
        query := request.(string)
        
        // Store for async processing
        d.pending[ref] = pendingRequest{from: from, query: query}
        
        // Trigger async processing
        d.Send(d.PID(), executeQuery{ref: ref})
        
        return nil, nil  // Will respond asynchronously
    }
    
    func (d *DatabaseActor) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case executeQuery:
            pr := d.pending[msg.ref]
            
            // Execute query
            rows, err := d.db.Query(pr.query)
            
            if err != nil {
                // Distinguish error types
                if isInfrastructureError(err) {
                    // Database down, connection lost, etc
                    // Send as transport error - caller should retry/fallback
                    d.SendResponseError(pr.from, msg.ref, fmt.Errorf("database unavailable: %w", err))
                } else {
                    // Bad SQL, permission denied, etc
                    // Send as application result - caller should show to user
                    d.SendResponse(pr.from, msg.ref, fmt.Errorf("query failed: %w", err))
                }
                delete(d.pending, msg.ref)
                return nil
            }
            
            // Success
            d.SendResponse(pr.from, msg.ref, rows)
            delete(d.pending, msg.ref)
        }
        return nil
    }
    
    func isInfrastructureError(err error) bool {
        // Check for connection-related errors
        if strings.Contains(err.Error(), "connection refused") {
            return true
        }
        if strings.Contains(err.Error(), "too many connections") {
            return true
        }
        // ... other infrastructure error checks
        return false
    }
    result, err := process.Call(databaseActor, "SELECT * FROM users")
    if err != nil {
        // Infrastructure problem:
        // - Database is down (SendResponseError)
        // - Actor timed out (gen.ErrTimeout)
        // - Network failure (gen.ErrNoConnection)
        // All handled the same way - try fallback
        
        process.Log().Warning("database unavailable, using cache: %s", err)
        return useFallbackCache()
    }
    
    // Check if result is an error
    if errResult, ok := result.(error); ok {
        // Application error - bad query, permission denied, etc
        // Don't retry, don't fallback - show to user
        return fmt.Errorf("query error: %w", errResult)
    }
    
    // Success
    return result
    func (a *AsyncHandler) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case BatchTrigger:
            for ref, pr := range a.pending {
                // Check if the caller is still waiting
                if !ref.IsAlive() {
                    // Timeout expired, don't bother processing
                    a.Log().Warning("request %s expired, skipping", ref)
                    delete(a.pending, ref)
                    continue
                }
                
                // Still waiting, process and respond
                result := a.processBatch(pr.data)
                a.SendResponse(pr.from, ref, result)
                delete(a.pending, ref)
            }
        }
        return nil
    }
    func (a *Handler) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
        switch req := request.(type) {
        case CachedRequest:
            // We have the answer immediately
            if result, found := a.cache[req.Key]; found {
                return result, nil
            }
            // Cache miss, fetch asynchronously
            a.pending[ref] = pendingRequest{from: from, data: req}
            a.fetchFromBackend(req.Key, ref)
            return nil, nil
            
        case WriteRequest:
            // Writes are fast, handle synchronously
            a.data[req.Key] = req.Value
            return "ok", nil
        }
        // Return error as result value
        return fmt.Errorf("unknown request"), nil
    }
    type Batcher struct {
        act.Actor
        pending []pendingRequest
        timer   gen.CancelFunc
    }
    
    type pendingRequest struct {
        from gen.PID
        ref  gen.Ref
        data any
    }
    
    func (b *Batcher) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
        // Add to batch
        b.pending = append(b.pending, pendingRequest{from, ref, request})
        
        // Start timer if this is the first request
        if len(b.pending) == 1 {
            b.timer, _ = b.SendAfter(b.PID(), Flush{}, 100 * time.Millisecond)
        }
        
        // If batch is full, flush immediately
        if len(b.pending) >= 100 {
            if b.timer != nil {
                b.timer()  // cancel timer
            }
            b.flush()
        }
        
        return nil, nil
    }
    
    func (b *Batcher) flush() {
        // Process all pending requests in one batch
        results := b.processBatch(b.pending)
        
        for i, pr := range b.pending {
            if pr.ref.IsAlive() {
                b.SendResponse(pr.from, pr.ref, results[i])
            }
        }
        
        b.pending = b.pending[:0]  // clear, keep capacity
    }
    // WRONG: Storing only the reference
    func (a *Handler) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
        a.pendingRefs = append(a.pendingRefs, ref)  // Lost the 'from'!
        return nil, nil
    }
    
    // Later - how do we respond?
    func (a *Handler) respond() {
        for _, ref := range a.pendingRefs {
            a.SendResponse(???, ref, result)  // Who do we send to?
        }
    }
    // WRONG: This terminates your process!
    func (a *Handler) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
        if !a.isAuthorized(from) {
            return nil, fmt.Errorf("unauthorized")  // OOPS! Process terminates
        }
        return a.process(request), nil
    }
    
    // CORRECT: Send error as result to caller
    func (a *Handler) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
        if !a.isAuthorized(from) {
            return fmt.Errorf("unauthorized"), nil  // Caller gets error, process continues
        }
        return a.process(request), nil
    }
    
    // ALSO CORRECT: For async handling
    func (a *Handler) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case processedResult:
            // Send any result - value or error, doesn't matter
            a.SendResponse(msg.caller, msg.ref, msg.result)
        }
        return nil
    }
    // WRONG: Blocks the actor
    func (a *Handler) HandleCall(from gen.PID, ref gen.Ref, request any) (any, error) {
        time.Sleep(5 * time.Second)  // Actor can't process other messages!
        return "done", nil
    }

    Pub/Sub Internals

    How the Pub/Sub system works internally

    This document explains how Ergo Framework's pub/sub system works under the hood. It's written for developers who want to understand the architecture, network behavior, and performance characteristics when building distributed systems.

    For basic usage, see Links and Monitors and Events. This document assumes you're familiar with those concepts and focuses on how the system works internally.

    hashtag
    The Unified Architecture

    Links, monitors, and events look like separate features when you use them. But underneath, they share the same mechanism. Understanding this unification explains why the system behaves consistently and why certain optimizations work.

    hashtag
    The Core Concept

    Every interaction in the pub/sub system follows one pattern:

    A consumer subscribes to a target and receives notifications about that target.

    This applies whether you're linking to a process, monitoring a registered name, or subscribing to an event stream. The differences are in what you subscribe to and what notifications you receive.

    hashtag
    Three Components of Every Subscription

    1. Consumer - The process creating the subscription. This is the process that will receive notifications when something happens to the target.

    2. Target - What the consumer subscribes to. Targets come in several types:

    Target Type
    Example
    What It Represents

    3. Subscription Type - How the consumer wants to receive notifications:

    Type
    Creates
    Notification
    Effect on Consumer

    The combination of target type and subscription type determines what message you receive:

    hashtag
    Implicit vs Explicit Events

    The targets divide into two categories based on what notifications they generate:

    Implicit Events - Processes, names, aliases, and nodes generate termination notifications automatically. The target doesn't do anything special - when it terminates (or disconnects, for nodes), the framework generates notifications for all subscribers.

    Explicit Events - Registered events generate both published messages AND termination notifications. A producer process explicitly registers an event and publishes messages to it. When the producer terminates or unregisters the event, subscribers also receive termination notification.

    The key difference: implicit events give you one notification (termination). Explicit events give you N published messages plus termination notification.

    hashtag
    Why Unification Matters

    This unified architecture has practical benefits:

    Consistent behavior - The same subscription and notification mechanics work for all target types. Once you understand how monitors work for processes, you understand how they work for events.

    Shared optimizations - Network optimizations (covered later) apply to all subscription types. Whether you're monitoring 100 remote processes or subscribing 100 consumers to a remote event, the same sharing mechanism kicks in.

    Predictable cleanup - Termination cleanup works identically for all subscriptions. When a process terminates, all its subscriptions are cleaned up using the same code path.

    hashtag
    How Local Subscriptions Work

    When you subscribe to a target on the same node, the operation is simple and fast.

    hashtag
    What You Experience

    The call returns instantly. There's no network communication, no blocking. The node records your subscription in memory.

    hashtag
    What Happens Internally

    The Target Manager maintains subscription records. When a process terminates, Target Manager looks up all subscribers and delivers notifications to their mailboxes. For links, notifications go to the Urgent queue. For monitors, they go to the System queue.

    hashtag
    Guarantees

    Instant subscription - No waiting, no blocking. The subscription is recorded synchronously.

    Guaranteed notification - If the target terminates after you subscribe, you will receive notification. The notification mechanism is part of the termination process itself.

    Asynchronous delivery - Notifications arrive in your mailbox like any other message. You process them in your HandleMessage callback.

    Automatic cleanup - When the target terminates and you receive notification, the subscription is removed automatically. You don't need to unsubscribe.

    hashtag
    How Remote Subscriptions Work

    Remote subscriptions involve network communication but provide the same guarantees as local subscriptions.

    hashtag
    What You Experience

    The call blocks while the subscription request travels to the remote node and the response returns. This typically takes milliseconds on a local network.

    hashtag
    What Happens Internally

    The subscription request travels to the remote node's Target Manager. It validates that the target exists (returning an error if not), records the subscription, and sends confirmation. Once established, termination notifications travel back over the network.

    hashtag
    Subscription Validation

    Both local and remote subscriptions validate that the target exists:

    For events, the target node validates the event is registered:

    hashtag
    Guaranteed Notification Delivery

    Remote subscriptions guarantee you receive exactly one termination notification. This guarantee holds even when networks fail. Two paths can deliver your notification:

    Path 1: Normal Delivery

    The target terminates normally. The remote node sends the notification over the network. You receive it with the actual termination reason:

    Path 2: Connection Failure

    The network connection fails before the notification arrives (or before the target even terminates). Your local node detects the disconnection and generates notifications for all subscriptions to targets on the failed node:

    Why This Works

    You're guaranteed notification through one of two mechanisms:

    1. Remote node delivers it (normal case)

    2. Local node generates it when detecting connection failure (failover)

    The Reason field tells you which path occurred. Your code typically handles both the same way - the target is no longer accessible regardless of why.

    This failover mechanism compensates for network unreliability. You write code assuming notifications always arrive, because they do.
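    Handling code for both paths typically looks the same. A framework-free sketch of the idea - `errNoConnection` here stands in for the framework's connection-failure reason, and `handleDown` is an illustrative handler, not a framework callback:

    ```go
    package main

    import (
    	"errors"
    	"fmt"
    )

    // errNoConnection stands in for the reason used when the local node
    // generates the notification itself after detecting a disconnect.
    var errNoConnection = errors.New("no connection")

    // handleDown reacts to a termination notification. Both delivery paths
    // carry a reason; typical code treats them identically, because the
    // target is no longer accessible either way.
    func handleDown(reason error) string {
    	if errors.Is(reason, errNoConnection) {
    		// Path 2: local node detected the connection failure
    		return "target unreachable: connection lost"
    	}
    	// Path 1: remote node reported the actual termination reason
    	return "target terminated: " + reason.Error()
    }

    func main() {
    	fmt.Println(handleDown(errors.New("shutdown")))
    	fmt.Println(handleDown(errNoConnection))
    }
    ```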

    hashtag
    Network Optimization: Shared Subscriptions

    This section describes the optimization that makes distributed pub/sub practical at scale. Without it, many common patterns would be impractical.

    hashtag
    The Problem

    Consider a realistic scenario:

    Naive implementation: Each MonitorPID call creates a separate network subscription. Result:

    • 100 network round-trips to create subscriptions

    • 100 subscription records on the remote node

    • 100 network messages when the coordinator terminates

    This doesn't scale. With 1000 workers, you'd have 1000 network messages just to deliver one termination notification.

    hashtag
    What Actually Happens

    The framework automatically detects when multiple local processes subscribe to the same remote target and shares the network subscription.

    What you observe:

    The first subscription to a remote target requires network communication. Every subsequent subscription from the same node to the same target returns instantly - it shares the existing network subscription.
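    The sharing behavior can be modeled as a reference-counted subscription table. This is a simplified illustration of the mechanism described above, not the framework's internal data structure - only the first subscriber to a given remote target pays the network round-trip:

    ```go
    package main

    import "fmt"

    // key identifies a remote target: the node it lives on and its id.
    type key struct {
    	node   string
    	target string
    }

    // table tracks how many local processes share each remote subscription.
    type table struct {
    	refs map[key]int
    }

    // subscribe registers a local subscriber and reports whether a network
    // round-trip is needed (only for the first subscriber to this target).
    func (t *table) subscribe(k key) (networkCall bool) {
    	t.refs[k]++
    	return t.refs[k] == 1
    }

    func main() {
    	t := &table{refs: make(map[key]int)}
    	k := key{node: "node2@host", target: "coordinator"}

    	fmt.Println(t.subscribe(k)) // true: first subscriber, round-trip
    	fmt.Println(t.subscribe(k)) // false: shares existing subscription
    	fmt.Println(t.subscribe(k)) // false
    }
    ```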

    hashtag
    How Notification Delivery Works

    When the remote target terminates:

    1. Remote node sends ONE notification message to your node

    2. Your node receives it and looks up all local subscribers to that target

    3. Your node delivers individual notifications to each subscriber's mailbox
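    The steps above can be sketched by counting messages. In this illustrative model (the node names and subscriber ids are made up), one wire message crosses the network per node, and the fan-out to individual mailboxes happens locally:

    ```go
    package main

    import "fmt"

    func main() {
    	// Subscribers to one remote target, grouped by the node they live on.
    	subscribersByNode := map[string][]string{
    		"node1@host": {"p1", "p2", "p3"},
    		"node2@host": {"p4", "p5"},
    	}

    	wireMessages := 0 // messages crossing the network
    	delivered := 0    // notifications placed into local mailboxes

    	for _, subs := range subscribersByNode {
    		wireMessages++ // ONE message per node, regardless of subscriber count
    		for range subs {
    			delivered++ // fan-out happens locally on the receiving node
    		}
    	}

    	fmt.Println(wireMessages) // one per node
    	fmt.Println(delivered)    // one per subscriber
    }
    ```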

    hashtag
    Performance Characteristics

    Operation
    Without Sharing
    With Sharing

    Network cost comparison for 100 subscribers:

    Phase
    Without Sharing
    With Sharing

    hashtag
    Impact on Event Publishing

    The same optimization applies to event publishing. When you publish an event with subscribers on multiple nodes:

    The framework groups subscribers by node and sends ONE message per node:

    What the producer sees:

    What subscribers see:

    hashtag
    Real-World Scale Example

    Consider a market data feed with 1 million subscribers distributed across 10 nodes:

    When the producer publishes one price update:

    Approach
    Network Messages

    The optimization transforms O(N) network cost (where N = total subscribers) into O(M) cost (where M = number of nodes). For distributed systems with many subscribers per node, this is the difference between practical and impossible.

    Actual benchmark results (from the benchmark):

    hashtag
    Why This Matters for System Design

    This optimization enables patterns that would be impractical otherwise:

    Worker pools monitoring coordinators:

    Distributed caching with invalidation:

    Hierarchical supervision across nodes:

    High-frequency event streaming:

    hashtag
    When Sharing Doesn't Apply

    The optimization applies when multiple processes on the SAME node subscribe to the SAME remote target.

    These share:

    These don't share:

    hashtag
    Buffered Events: Partial Optimization

    Buffered events receive partial optimization. The subscription is shared, but each subscriber must retrieve buffer contents individually.

    hashtag
    Why Buffers Complicate Sharing

    Event buffers store recent messages for new subscribers:

    When a subscriber joins, they receive the buffered messages:

    The problem: different subscribers joining at different times need different buffer contents.

    If subscriptions were fully shared, all subscribers would receive the same buffer - incorrect for late subscribers.

    hashtag
    What Actually Happens

    First subscriber: Network round-trip to create subscription AND retrieve buffer.

    Subsequent subscribers: Network round-trip to retrieve current buffer (subscription already exists).

    Published messages: Still optimized - one network message per node, distributed locally.
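    Why the buffer retrieval can't be shared becomes clear from a small model. This sketch (the `buffer` type is illustrative, not the framework's implementation) keeps the last N messages; subscribers joining at different times get different snapshots, so each must fetch its own:

    ```go
    package main

    import "fmt"

    // buffer keeps the last n published messages for late subscribers.
    type buffer struct {
    	n     int
    	items []string
    }

    func (b *buffer) publish(msg string) {
    	b.items = append(b.items, msg)
    	if len(b.items) > b.n {
    		b.items = b.items[len(b.items)-b.n:]
    	}
    }

    // snapshot is what a subscriber retrieves at join time. Subscribers
    // joining at different moments see different contents, which is why
    // this retrieval cannot be shared the way the subscription itself is.
    func (b *buffer) snapshot() []string {
    	return append([]string(nil), b.items...)
    }

    func main() {
    	b := &buffer{n: 2}
    	b.publish("update-1")
    	early := b.snapshot()
    	b.publish("update-2")
    	b.publish("update-3")
    	late := b.snapshot()

    	fmt.Println(early) // early subscriber's buffer
    	fmt.Println(late)  // late subscriber's buffer differs
    }
    ```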

    hashtag
    Performance Comparison

    Aspect
    Unbuffered Event
    Buffered Event

    hashtag
    When to Use Buffers

    Use buffers when:

    • New subscribers need recent history (last N configuration updates)

    • Subscribers might miss messages during brief disconnections

    • State can be reconstructed from recent messages

    Avoid buffers when:

    • Real-time streaming where history isn't useful

    • High subscriber count across many nodes (each pays network cost)

    • Messages are only meaningful at publish time

    Practical guidance:

    hashtag
    Producer Notifications

    Producers can receive notifications when subscriber interest changes. This enables demand-driven data production.

    hashtag
    Enabling Notifications

    hashtag
    What You Receive

    hashtag
    When Notifications Arrive

    Transition
    Notification

    You only receive notifications when crossing the zero threshold. The notifications answer: "is anyone listening?" - not "how many are listening?"
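    The zero-threshold rule can be sketched as a counter that reports a notification only on the 0 → 1 and 1 → 0 transitions (conceptual sketch, not the framework's code):

    ```go
    package main

    import "fmt"

    type tracker struct{ count int }

    func (t *tracker) subscribe() string {
        t.count++
        if t.count == 1 {
            return "MessageEventStart" // first subscriber appeared
        }
        return "" // 1->2, 2->3, ...: no notification
    }

    func (t *tracker) unsubscribe() string {
        t.count--
        if t.count == 0 {
            return "MessageEventStop" // last subscriber left
        }
        return ""
    }

    func main() {
        t := &tracker{}
        fmt.Printf("%q\n", t.subscribe())   // "MessageEventStart"
        fmt.Printf("%q\n", t.subscribe())   // ""
        fmt.Printf("%q\n", t.unsubscribe()) // ""
        fmt.Printf("%q\n", t.unsubscribe()) // "MessageEventStop"
    }
    ```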

    hashtag
    Practical Use Case: On-Demand Data Production

    The producer idles when nobody's listening, avoiding unnecessary API calls and resource usage. When subscribers appear, it starts producing. When all subscribers leave, it stops.

    hashtag
    Network Transparency

    Notifications work across nodes. Remote subscribers count toward "someone is listening":

    The producer doesn't know or care whether subscribers are local or remote. The notification mechanism handles it transparently.

    hashtag
    Multiple Events

    Each event tracks subscribers independently:

    hashtag
    Automatic Cleanup

    Subscriptions clean up automatically when any participant terminates. This eliminates resource leaks from forgotten subscriptions.

    hashtag
    When Target Terminates

    All subscribers receive notification. The subscription ceases to exist - there's nothing to unsubscribe from.

    hashtag
    When Subscriber Terminates

    Your subscriptions are removed from:

    • Local subscription records

    • Remote nodes (for remote subscriptions)

    If you were the last local subscriber to a remote target, the network subscription is removed. Otherwise, it stays for remaining local subscribers.
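    This last-subscriber behavior is reference counting in essence. A conceptual sketch (target names are illustrative; the framework's bookkeeping is internal):

    ```go
    package main

    import "fmt"

    type shared struct {
        refs map[string]int // remote target -> local subscriber count
    }

    // subscribe returns true when a network round-trip is needed
    // (first local subscriber creates the network subscription).
    func (s *shared) subscribe(target string) bool {
        s.refs[target]++
        return s.refs[target] == 1
    }

    // unsubscribe returns true when the network subscription is removed
    // (last local subscriber left).
    func (s *shared) unsubscribe(target string) bool {
        s.refs[target]--
        if s.refs[target] == 0 {
            delete(s.refs, target)
            return true
        }
        return false
    }

    func main() {
        s := &shared{refs: map[string]int{}}
        fmt.Println(s.subscribe("coordinator@nodeB"))   // true  (network round-trip)
        fmt.Println(s.subscribe("coordinator@nodeB"))   // false (shared, instant)
        fmt.Println(s.unsubscribe("coordinator@nodeB")) // false (one subscriber remains)
        fmt.Println(s.unsubscribe("coordinator@nodeB")) // true  (last one: remove remotely)
    }
    ```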

    hashtag
    When Event Producer Terminates

    Subscribers can't distinguish an explicit UnregisterEvent from producer termination - both deliver a termination notification with reason gen.ErrUnregistered.

    hashtag
    When Network Connection Fails

    All subscriptions involving the failed node are cleaned up. If the node reconnects later, you need to re-subscribe - the framework doesn't automatically restore subscriptions.

    hashtag
    Explicit Unsubscription

    You can explicitly remove subscriptions:

    Explicit unsubscription is useful when:

    • You want to stop watching before termination

    • You're switching to a different target

    • You're implementing connection retry logic

    But in most cases, you don't need explicit unsubscription. Let termination handle cleanup.

    hashtag
    Cleanup Order Guarantees

    When a process terminates, cleanup happens in a specific order:

    1. Process state changes to Terminated

    2. All outgoing subscriptions (where process is consumer) are removed

    3. All incoming subscriptions (where process is target) generate notifications

    This ordering ensures:

    • You don't receive notifications after your process starts terminating

    • Subscribers to you receive notifications before your resources are freed

    • No race conditions between notification delivery and cleanup

    hashtag
    Summary

    Concept
    How It Works

    hashtag
    Key Performance Insights

    For subscriptions:

    • First subscription to remote target: network round-trip

    • Additional subscriptions to same target: instant

    • Unbuffered events: full sharing

    For notifications:

    • One network message per subscriber node

    • Local distribution to all subscribers on that node

    • Cost scales with number of nodes, not number of subscribers

    For cleanup:

    • Automatic on any termination

    • No resource leaks possible

    • No manual unsubscription required

    gen.Alias{...}

    A process alias

    Node

    gen.Atom("node@host")

    A network connection

    Event

    gen.Event{Name: "prices", Node: "node@host"}

    A registered event

    Continues running

    N network messages

    1 network message

    Unsubscribe (not last)

    1 network round-trip

    0 (instant)

    Unsubscribe (last)

    1 network round-trip

    1 network round-trip

    100 round-trips

    1 round-trip

    Total

    300 network operations

    3 network operations

    1 message per node

    1 message per node

    Termination notification

    1 message per node

    1 message per node

    Subscriber count is moderate

    Memory constraints (buffers consume memory on producer node)

    Process resources are freed

    Event publishing

    One network message per subscriber node, local fanout to subscribers

    Buffered events

    Shared delivery, but each subscriber retrieves buffer individually

    Producer notifications

    MessageEventStart/Stop when crossing zero subscriber threshold

    Automatic cleanup

    All subscriptions cleaned up on any termination

    Buffered events: shared delivery, individual buffer retrieval

    PID

    gen.PID{Node: "node@host", ID: 100}

    A specific process instance

    ProcessID

    gen.ProcessID{Name: "worker", Node: "node@host"}

    A registered name

    Link

    Exit signal

    MessageExit* → Urgent queue

    Terminates by default

    Monitor

    Down message

    First subscription

    1 network round-trip

    1 network round-trip

    N additional subscriptions

    N network round-trips

    0 (instant)

    Subscribe all

    100 round-trips

    1 round-trip

    Notification

    100 messages

    1 message

    Without optimization

    1,000,000

    With optimization

    10

    First subscription

    1 network round-trip

    1 network round-trip

    Additional subscriptions

    Instant (shared)

    Network round-trip (buffer retrieval)

    0 → 1 (first subscriber)

    MessageEventStart

    1 → 0 (last subscriber leaves)

    MessageEventStop

    1 → 2, 2 → 3, etc.

    None

    3 → 2, 2 → 1, etc.

    Unified architecture

    Links, monitors, and events share the same subscription mechanism

    Local subscriptions

    Instant creation, guaranteed notification, asynchronous delivery

    Remote subscriptions

    Network round-trip, guaranteed notification via normal or failover path

    Shared subscriptions

    distributed-pub-sub-1M

    Alias

    MessageDown* → System queue

    Notification delivery

    Unsubscribe all

    Published messages

    None

    Multiple local subscribers share one network subscription to remote target

    Building a Cluster

    Building production clusters with Ergo technologies

    Ergo provides a complete technology stack for building distributed systems. Service discovery, load balancing, failover, observability - all integrated and working together. No external dependencies except the registrar. No API gateways, service meshes, or orchestration layers between your services.

    This chapter shows how to use Ergo technologies to build production clusters. You'll see how service discovery enables automatic load balancing, how the leader actor provides failover, how metrics and Observer give you visibility into cluster state. Each technology solves a specific problem; together they cover the full spectrum of distributed system requirements.

    hashtag
    The Integration Cost Problem

    LinkPID(target)      → MessageExitPID when target terminates
    MonitorPID(target)   → MessageDownPID when target terminates
    
    LinkEvent(event)     → MessageExitEvent when event ends
    MonitorEvent(event)  → MessageDownEvent when event ends
    // Subscribe to process - implicit event
    process.MonitorPID(targetPID)
    
    // When targetPID terminates, you receive MessageDownPID
    // The target process didn't send anything - framework generated it
    // Producer registers event - explicit event source
    token, _ := producer.RegisterEvent("prices", gen.EventOptions{})
    
    // Producer publishes messages
    producer.SendEvent("prices", token, PriceUpdate{...})
    
    // Subscriber receives:
    // - MessageEvent for each published message
    // - MessageDownEvent when producer terminates or unregisters
    // Subscribe to local process
    err := process.MonitorPID(localTarget)
    // Returns immediately - no waiting
    // Subscribe to remote process
    err := process.MonitorPID(remotePID)
    // Blocks briefly during network round-trip
    // May return error if target doesn't exist or network fails
    // If target doesn't exist, you get an error
    err := process.MonitorPID(nonExistentPID)
    // err == gen.ErrProcessUnknown
    
    // If target exists but already terminated
    err := process.MonitorPID(terminatedPID)
    // err == gen.ErrProcessTerminated
    // If event isn't registered, you get an error
    _, err := process.MonitorEvent(gen.Event{Name: "unknown", Node: "node@host"})
    // err == gen.ErrEventUnknown
    func (w *Worker) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case gen.MessageDownPID:
            // Normal termination reasons:
            // - gen.TerminateReasonNormal (clean shutdown)
            // - gen.TerminateReasonShutdown (requested shutdown)
            // - gen.TerminateReasonPanic (crash)
            // - gen.TerminateReasonKill (forced kill)
            // - Custom error (process returned error)
            log.Printf("Target terminated: %v", msg.Reason)
        }
        return nil
    }
    func (w *Worker) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case gen.MessageDownPID:
            if msg.Reason == gen.ErrNoConnection {
                // Connection failed
                // Target might still be running on an isolated node
                // Or it might have terminated - we can't know
                log.Printf("Lost connection to target's node")
            }
        }
        return nil
    }
    case gen.MessageDownPID:
        // Whether normal termination or connection failure,
        // the target is gone from our perspective
        w.handleTargetGone(msg.PID, msg.Reason)
    // On node A, 100 worker processes all monitor the same coordinator on node B
    coordinatorPID := gen.PID{Node: "nodeB@host", ID: 500}
    
    for i := 0; i < 100; i++ {
        workers[i].MonitorPID(coordinatorPID)
    }
    // First worker subscribes
    err := worker1.MonitorPID(coordinatorPID)
    // Takes a few milliseconds - network round-trip
    
    // Second worker subscribes to SAME target
    err := worker2.MonitorPID(coordinatorPID)
    // Returns instantly - no network communication
    
    // 98 more workers subscribe
    // All return instantly
    // All 100 workers receive notification
    // But only ONE network message was sent
    // Your node distributed it locally
    
    func (w *Worker) HandleMessage(from gen.PID, message any) error {
        switch message.(type) {
        case gen.MessageDownPID:
            // You can't tell if you're the only subscriber
            // or one of 1000 subscribers
            // The timing and behavior are identical
        }
        return nil
    }
    // Producer on node A publishes
    process.SendEvent("market.prices", token, PriceUpdate{Symbol: "BTC", Price: 42000})
    // Publish returns immediately
    process.SendEvent("market.prices", token, update)
    // You don't wait for delivery
    // You don't know how many subscribers there are
    // You don't know which nodes they're on
    func (c *Consumer) HandleEvent(message gen.MessageEvent) error {
        // Event arrives in your mailbox
        // Same timing whether you're the only subscriber or one of thousands
        // Same timing whether producer is local or remote
        return nil
    }
    Configuration:
    - 1 producer on node A
    - 10 consumer nodes (B through K)
    - 100,000 subscribers per consumer node
    - 1,000,000 total subscribers
    Total subscribers:       1000000
    Consumer nodes:          10
    Subscribers per node:    100000
    
    Time to publish:         64.125µs
    Time to deliver all:     342.414375ms
    Network messages sent:   10 (1 per consumer node)
    Delivery rate:           2920438 msg/sec
    // 50 workers on each of 10 nodes monitor a shared coordinator
    // Network cost: 10 subscriptions, not 500
    for i := 0; i < 50; i++ {
        worker := SpawnWorker()
        worker.MonitorPID(coordinatorPID)
    }
    // Cache instances on every node subscribe to invalidation events
    // When data changes, ONE message per node delivers invalidation
    // Each node updates all its local cache instances
    // Multiple supervisors can monitor the same critical process
    // Notification cost stays constant regardless of supervisor count
    // Price feed publishes thousands of updates per second
    // Cost per update: one message per subscriber NODE
    // Not one message per subscriber PROCESS
    // Same node, same remote target
    processA.MonitorPID(remoteTarget)  // Network round-trip
    processB.MonitorPID(remoteTarget)  // Instant (shared)
    processC.MonitorPID(remoteTarget)  // Instant (shared)
    // Different remote targets
    processA.MonitorPID(remoteTarget1)  // Network round-trip
    processB.MonitorPID(remoteTarget2)  // Network round-trip (different target)
    
    // Different nodes subscribing to same target
    // (Each node has its own subscription to the target)
    nodeX_process.MonitorPID(remoteTarget)  // Network round-trip
    nodeY_process.MonitorPID(remoteTarget)  // Network round-trip (from different node)
    // Producer creates event with 100-message buffer
    token, _ := process.RegisterEvent("prices", gen.EventOptions{
        Buffer: 100,
    })
    
    // Producer publishes messages over time
    process.SendEvent("prices", token, msg1)  // Stored in buffer
    process.SendEvent("prices", token, msg2)  // Stored in buffer
    // ... more messages ...
    // Subscriber joins and receives buffer
    buffered, _ := process.MonitorEvent(event)
    for _, msg := range buffered {
        // These are recent messages published before subscription
    }
    // Process 1 subscribes at 10:00:00
    buffered1, _ := process1.MonitorEvent(event)
    // Receives messages 1-100
    
    // Producer publishes messages 101-150
    
    // Process 2 subscribes at 10:00:30
    buffered2, _ := process2.MonitorEvent(event)
    // Must receive messages 51-150 (different from process 1!)
    // Good: Configuration updates, moderate subscribers
    token, _ := process.RegisterEvent("config.updates", gen.EventOptions{
        Buffer: 10,  // Last 10 config changes
    })
    
    // Good: State snapshots for late joiners
    token, _ := process.RegisterEvent("game.state", gen.EventOptions{
        Buffer: 1,  // Just the latest state
    })
    
    // Better without buffer: High-frequency price feed
    token, _ := process.RegisterEvent("prices.realtime", gen.EventOptions{
        Buffer: 0,  // Full sharing optimization
    })
    
    // Better without buffer: High subscriber count
    token, _ := process.RegisterEvent("system.metrics", gen.EventOptions{
        Buffer: 0,  // Thousands of subscribers, skip buffer overhead
    })
    token, _ := process.RegisterEvent("expensive.data", gen.EventOptions{
        Notify: true,
    })
    func (p *Producer) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case gen.MessageEventStart:
            // First subscriber appeared
            // Subscriber count: 0 → 1
            p.Log().Info("First subscriber for event: %s", msg.Name)
    
        case gen.MessageEventStop:
            // Last subscriber left
            // Subscriber count: 1 → 0
            p.Log().Info("No more subscribers for event: %s", msg.Name)
        }
        return nil
    }
    type PriceFeeder struct {
        act.Actor
    
        token   gen.Ref
        polling bool
    }
    
    func (p *PriceFeeder) Init(args ...any) error {
        // Register event with notifications enabled
        token, err := p.RegisterEvent("prices", gen.EventOptions{
            Notify: true,
        })
        if err != nil {
            return err
        }
        p.token = token
        // Don't start polling yet - wait for subscribers
        return nil
    }
    
    func (p *PriceFeeder) HandleMessage(from gen.PID, message any) error {
        switch message.(type) {
        case gen.MessageEventStart:
            // Someone wants prices - start polling external API
            p.Log().Info("Starting price polling - subscribers appeared")
            p.polling = true
            p.schedulePoll()
    
        case gen.MessageEventStop:
            // Nobody wants prices - stop wasting resources
            p.Log().Info("Stopping price polling - no subscribers")
            p.polling = false
    
        case pollTick:
            if p.polling {
                price := p.fetchPriceFromAPI()
                p.SendEvent("prices", p.token, price)
                p.schedulePoll()
            }
        }
        return nil
    }
    
    func (p *PriceFeeder) schedulePoll() {
        p.SendAfter(p.PID(), pollTick{}, time.Second)
    }
    // Producer on node A
    token, _ := process.RegisterEvent("data", gen.EventOptions{Notify: true})
    
    // Subscriber on node B subscribes
    // Producer receives MessageEventStart
    
    // Subscriber on node B unsubscribes (and was the only subscriber)
    // Producer receives MessageEventStop
    token1, _ := process.RegisterEvent("prices.stocks", gen.EventOptions{Notify: true})
    token2, _ := process.RegisterEvent("prices.crypto", gen.EventOptions{Notify: true})
    
    // MessageEventStart for "prices.stocks" when first stock subscriber appears
    // MessageEventStart for "prices.crypto" when first crypto subscriber appears
    // These are independent - one can have subscribers while other doesn't
    // You subscribed to this process
    process.MonitorPID(target)
    
    // Target terminates for any reason
    // - Returns error from callback
    // - Panics
    // - Receives kill signal
    // - Node shuts down
    
    // You receive notification
    case gen.MessageDownPID:
        // Subscription is automatically removed
        // No cleanup needed on your part
    // You created several subscriptions
    process.MonitorPID(target1)
    process.MonitorPID(target2)
    process.MonitorEvent(event)
    
    // Your process terminates
    // All subscriptions are automatically removed
    // No cleanup code needed
    // Producer registered an event
    token, _ := producer.RegisterEvent("prices", gen.EventOptions{Buffer: 100})
    
    // Producer terminates (or explicitly calls UnregisterEvent)
    
    // All subscribers receive termination notification:
    // - Links receive MessageExitEvent
    // - Monitors receive MessageDownEvent
    
    // Event resources are cleaned up:
    // - Buffer memory freed
    // - Event name available for re-registration
    // You have subscriptions to targets on node B
    process.MonitorPID(pid_on_nodeB)
    process.MonitorProcessID(name_on_nodeB)
    process.MonitorEvent(event_on_nodeB)
    
    // Connection to node B fails
    // (Network partition, node crash, etc.)
    
    // You receive termination for ALL subscriptions to that node:
    case gen.MessageDownPID:
        if msg.Reason == gen.ErrNoConnection {
            // Network failure, not process termination
        }
    
    case gen.MessageDownProcessID:
        if msg.Reason == gen.ErrNoConnection {
            // Network failure
        }
    
    case gen.MessageDownEvent:
        if msg.Reason == gen.ErrNoConnection {
            // Network failure
        }
    // Remove link
    process.UnlinkPID(target)
    process.UnlinkEvent(event)
    
    // Remove monitor
    process.DemonitorPID(target)
    process.DemonitorEvent(event)
    Traditional microservice architectures pay a heavy integration tax. Each service needs:

    • HTTP/gRPC endpoints for communication

    • Client libraries with retry logic and circuit breakers

    • Service mesh sidecars for traffic management

    • API gateways for routing and load balancing

    • Health check endpoints and probes

    • Metrics exporters and tracing spans

    • Configuration management and secret injection

    Each layer adds latency, complexity, and failure modes. A simple call between two services traverses client library, sidecar proxy, load balancer, another sidecar, server library. Each hop serializes, deserializes, and can fail independently.

    Ergo eliminates these layers. Processes communicate directly through message passing. The framework handles serialization, routing, load balancing, and failure detection. No sidecars, no API gateways, no client libraries.

    One network hop. One serialization. Built-in load balancing and failover. This isn't a philosophical difference - it's orders of magnitude less infrastructure to deploy, maintain, and debug.

    hashtag
    Service Discovery with Registrars

    Service discovery is the foundation of clustering. How does node A find node B? How does a process locate the right service instance? Ergo provides three registrar options, each suited for different scales and requirements.

    hashtag
    Embedded Registrar

    The embedded registrar requires no external infrastructure. The first node on a host becomes the registrar server; others connect as clients.

    Cross-host discovery uses UDP queries. When node 2 needs to reach node 4, it asks its local registrar server (node 1), which queries node 4's host via UDP.

    Use for: Development, testing, single-host deployments, simple multi-host setups without firewalls blocking UDP.

    Limitations: No application discovery, no configuration management, no event notifications.

    hashtag
    etcd Registrar

    etcd provides centralized discovery with application routing, configuration management, and event notifications. Nodes register with etcd and maintain leases for automatic cleanup.

    etcd registrar capabilities:

    Feature
    Description

    Node discovery

    Find all nodes in the cluster

    Application discovery

    Find which nodes run specific applications

    Weighted routing

    Load balance based on application weights

    Configuration

    Use for: Teams already running etcd, clusters up to 50-70 nodes, deployments needing application discovery.

    hashtag
    Saturn Registrar

    Saturn is purpose-built for Ergo. Instead of polling (like etcd), it maintains persistent connections and pushes updates immediately. Topology changes propagate in milliseconds.

    Saturn vs etcd:

    Aspect
    etcd
    Saturn

    Update propagation

    Polling (seconds)

    Push (milliseconds)

    Connection model

    HTTP requests

    Persistent TCP

    Use for: Large clusters, real-time topology awareness, production systems where discovery latency matters.

    hashtag
    Application Discovery and Load Balancing

    Applications are the unit of deployment in Ergo. A node can load multiple applications, start them with different modes, and register them with the registrar. Other nodes discover applications and route requests based on weights.

    hashtag
    Registering Applications

    When you start an application, it automatically registers with the registrar (if using etcd or Saturn):

    The registrar now knows: application "api" is running on this node with weight 100.

    hashtag
    Discovering Applications

    Other nodes can discover where applications run:

    Output might show:

    hashtag
    Weighted Load Balancing

    Weights enable traffic distribution. A node with weight 100 receives twice as much traffic as a node with weight 50. Use this for:

    • Canary deployments: New version with weight 10, stable with weight 90

    • Capacity matching: Powerful nodes get higher weights

    • Graceful draining: Set weight to 0 before maintenance
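    The weighted split can be sketched as a pure-Go selection function (node names and the roll distribution are illustrative; in Ergo the weights come from application registration):

    ```go
    package main

    import "fmt"

    type instance struct {
        node   string
        weight int
    }

    // pick maps a value in [0, totalWeight) onto an instance, so traffic
    // splits proportionally to weights (weight 0 drains a node entirely).
    func pick(instances []instance, r int) string {
        total := 0
        for _, in := range instances {
            total += in.weight
        }
        r %= total
        for _, in := range instances {
            if r < in.weight {
                return in.node
            }
            r -= in.weight
        }
        return ""
    }

    func main() {
        cluster := []instance{
            {"stable@host1", 90}, // stable version
            {"canary@host2", 10}, // canary deployment
        }
        // Count picks over an even spread of 100 roll values.
        hits := map[string]int{}
        for r := 0; r < 100; r++ {
            hits[pick(cluster, r)]++
        }
        fmt.Println(hits["stable@host1"], hits["canary@host2"]) // 90 10
    }
    ```

    In production the roll value would come from a random source per request; the proportions converge to the weight ratio either way.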

    During canary and rolling deployments, the cluster runs mixed code versions. Messages sent from new nodes must be understood by old nodes, and vice versa. Ensure your message types support version coexistence as described in Message Versioning.

    hashtag
    Routing Requests

    Once you know where applications run, route requests using weighted selection:

    This is application-level load balancing without external infrastructure. No load balancer service, no sidecar proxies.

    hashtag
    Running Multiple Instances for Load Balancing

    Horizontal scaling means running the same application on multiple nodes. Each instance handles a portion of traffic. Add nodes to increase capacity; remove nodes to reduce costs.

    hashtag
    Deployment Pattern

    Each client discovers all api instances and distributes requests based on weights.

    hashtag
    Implementation

    On each worker node:

    On coordinator/client nodes:

    hashtag
    Scaling Operations

    Scale up: Start new node with the same application. It registers with the registrar. Other nodes discover it through events or next resolution.

    Scale down: Set weight to 0 (drain), wait for in-flight work, stop the node. Registrar removes the registration when the lease expires.

    hashtag
    Reacting to Topology Changes

    Subscribe to registrar events to react when instances join or leave:

    No polling. No service mesh. Events arrive within milliseconds (Saturn) or at the next poll cycle (etcd).

    hashtag
    Running Multiple Instances for Failover

    Failover means having standby instances ready to take over when the primary fails. The leader actor implements distributed leader election - exactly one instance is active (leader) while others wait (followers).

    hashtag
    The Leader Actor

    The leader.Actor from ergo.services/actor/leader implements Raft-based leader election. Embed it in your actor to participate in elections:

    hashtag
    Election Mechanics

    1. All instances start as followers

    2. If no heartbeats arrive, a follower becomes candidate

    3. Candidate requests votes from peers

    4. Majority vote wins; candidate becomes leader

    5. Leader sends periodic heartbeats

    6. If leader fails, followers detect timeout and elect new leader
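    Step 2 relies on randomized election timeouts: each follower waits a different random interval before becoming a candidate, so they don't all stand for election at once and split the vote. A sketch using the default 150-300ms window (the exact randomization inside the leader actor is an implementation detail):

    ```go
    package main

    import (
        "fmt"
        "math/rand"
        "time"
    )

    // electionTimeout returns a random duration in [150ms, 300ms).
    func electionTimeout(r *rand.Rand) time.Duration {
        min, max := 150*time.Millisecond, 300*time.Millisecond
        return min + time.Duration(r.Int63n(int64(max-min)))
    }

    func main() {
        r := rand.New(rand.NewSource(1))
        for i := 0; i < 3; i++ {
            t := electionTimeout(r)
            // Every sampled timeout falls inside the window.
            fmt.Println(t >= 150*time.Millisecond && t < 300*time.Millisecond)
        }
    }
    ```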

    hashtag
    Failover Scenario

    Failover happens automatically. No manual intervention. The surviving nodes elect a new leader within the election timeout (150-300ms by default).

    hashtag
    Use Cases

    Single-writer coordination: Only the leader writes to prevent conflicts.

    Task scheduling: Only the leader runs periodic tasks.

    Distributed locks: Leader grants exclusive access.

    hashtag
    Quorum and Split-Brain

    Leader election requires a majority (quorum) to prevent split-brain:

    Cluster Size
    Quorum
    Tolerated Failures

    3 nodes

    2

    1

    5 nodes

    3

    2

    If a network partition splits 5 nodes into groups of 3 and 2:

    • The group of 3 can elect a leader (has quorum)

    • The group of 2 cannot (no quorum)

    This prevents both sides from having leaders and making conflicting decisions.
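    The quorum rule is a strict majority - more than half the cluster - which the partition example above reduces to simple arithmetic:

    ```go
    package main

    import "fmt"

    // quorum returns the minimum group size that constitutes a majority.
    func quorum(clusterSize int) int {
        return clusterSize/2 + 1
    }

    func main() {
        fmt.Println(quorum(3), quorum(5)) // 2 3

        // A partition splits a 5-node cluster into groups of 3 and 2:
        fmt.Println(3 >= quorum(5)) // true  - can elect a leader
        fmt.Println(2 >= quorum(5)) // false - cannot (prevents split-brain)
    }
    ```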

    hashtag
    Observability with Metrics

    The metrics actor from ergo.services/actor/metrics exposes Prometheus-format metrics. Base metrics are collected automatically; you add custom metrics for application-specific telemetry.

    hashtag
    Basic Setup

    This starts an HTTP server at :9090/metrics with base metrics:

    Metric
    Description

    ergo_node_uptime_seconds

    Node uptime

    ergo_processes_total

    Total process count

    ergo_processes_running

    Actively processing

    ergo_memory_used_bytes

    hashtag
    Custom Metrics

    Extend the metrics actor for application-specific telemetry:

    Update metrics from your application:

    hashtag
    Prometheus Integration

    Now you have cluster-wide visibility: process counts, memory usage, network traffic, custom business metrics - all in Prometheus/Grafana.

    hashtag
    Inspecting with Observer

    Observer is a web UI for cluster inspection. Run it as an application within your node or as a standalone tool.

    hashtag
    Embedding Observer

    Open http://localhost:9911 to see:

    • Node info: Uptime, memory, CPU, process counts

    • Network: Connected nodes, acceptors, traffic graphs

    • Process list: All processes with state, mailbox depth, runtime

    • Process details: Links, monitors, aliases, environment

    • Logs: Real-time log stream with filtering

    hashtag
    Standalone Observer Tool

    For inspecting remote nodes without embedding:

    Connect to any node in your cluster and inspect its state remotely.

    hashtag
    Process Inspection

    Observer calls HandleInspect on processes to get internal state:

    This data appears in the Observer UI, updated every second.

    hashtag
    Debugging Production Issues

    Observer helps diagnose:

    • Memory leaks: Watch ergo_memory_alloc_bytes, find processes with growing mailboxes

    • Stuck processes: Check "Top Running" sort to find processes consuming CPU

    • Message backlogs: "Top Mailbox" shows processes falling behind

    • Network issues: Traffic graphs show bytes/messages per remote node

    • Process relationships: Links and monitors show supervision structure

    hashtag
    Remote Operations

    Ergo supports starting processes and applications on remote nodes. This enables dynamic workload distribution and orchestration.

    hashtag
    Remote Spawn

    Start a process on a remote node:

    The spawned process runs on the remote node but can communicate with any process in the cluster.

    hashtag
    Remote Application Start

    Start an application on a remote node:

    Use this for:

    • Dynamic orchestration: Coordinator decides which apps run where

    • Staged deployment: Start apps in order, waiting for health checks

    • Capacity management: Start/stop apps based on load

    hashtag
    Configuration Management

    etcd and Saturn registrars provide cluster-wide configuration with hierarchical overrides.

    hashtag
    Configuration Hierarchy

    Node-specific overrides cluster-wide, which overrides global.
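    The override order amounts to a lookup with fallback. A conceptual sketch (keys and values are illustrative, not the registrar's API):

    ```go
    package main

    import "fmt"

    // resolve checks scopes from most to least specific:
    // node-specific, then cluster-wide, then global.
    func resolve(key string, node, cluster, global map[string]string) (string, bool) {
        for _, scope := range []map[string]string{node, cluster, global} {
            if v, ok := scope[key]; ok {
                return v, true
            }
        }
        return "", false
    }

    func main() {
        global := map[string]string{"pool_size": "10", "log_level": "info"}
        cluster := map[string]string{"pool_size": "20"}
        node := map[string]string{"log_level": "debug"}

        v, _ := resolve("pool_size", node, cluster, global)
        fmt.Println(v) // 20 (cluster-wide override wins over global)
        v, _ = resolve("log_level", node, cluster, global)
        fmt.Println(v) // debug (node-specific override wins over global)
    }
    ```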

    hashtag
    Typed Configuration

    Values are stored as strings with type prefixes:
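    The idea can be sketched as follows - note the prefixes here (int:, bool:) are illustrative assumptions, not the registrar's actual syntax:

    ```go
    package main

    import (
        "fmt"
        "strconv"
        "strings"
    )

    // parse converts a prefix-typed string into a typed value.
    func parse(raw string) any {
        switch {
        case strings.HasPrefix(raw, "int:"):
            n, _ := strconv.Atoi(strings.TrimPrefix(raw, "int:"))
            return n
        case strings.HasPrefix(raw, "bool:"):
            b, _ := strconv.ParseBool(strings.TrimPrefix(raw, "bool:"))
            return b
        default:
            return raw // plain string
        }
    }

    func main() {
        fmt.Println(parse("int:42"))    // 42
        fmt.Println(parse("bool:true")) // true
        fmt.Println(parse("hello"))     // hello
    }
    ```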

    hashtag
    Configuration Events

    React to config changes in real-time:

    No restart required. Configuration propagates to all nodes automatically.

    hashtag
    Putting It Together

    Here's a complete example: a job processing cluster with load balancing, failover, metrics, and observability.

    hashtag
    Architecture

    hashtag
    Coordinator Node

    hashtag
    Worker Node

    hashtag
    What You Get

    • Load balancing: Jobs distribute across workers based on weights

    • Failover: If coordinator leader fails, another takes over in <300ms

    • Discovery: Workers auto-register; coordinators discover them via events

    • Metrics: Prometheus scrapes all nodes for cluster-wide visibility

    • Inspection: Observer UI shows processes, mailboxes, network traffic

    • Configuration: Update settings via etcd; changes propagate immediately

    All of this with:

    • No API gateways

    • No service mesh

    • No load balancer services

    • No orchestration layers

    • No client libraries with retry logic

    Just Ergo nodes communicating directly through message passing.

    hashtag
    Summary

    Ergo provides integrated technologies for building production clusters:

    Technology
    Purpose
    Package

    Registrars

    Service discovery

    ergo.services/registrar/etcd, registrar/saturn

    Applications

    Deployment units with weights

    Core framework

    These components eliminate the integration layers that dominate traditional microservice architectures. Instead of building infrastructure, you build applications.

    For implementation details, see:

    • Message Versioning

    • Service Discovering

    • Leader Actor

    Project Structure

    How to Structure Projects Built with Ergo Framework

    The same codebase can run as a monolith on your laptop or as distributed services across a data center. This flexibility comes from one principle: applications are the unit of composition. How you organize your project determines whether you can use this flexibility or fight against it.

    This chapter covers project organization, message isolation patterns, deployment strategies, and evolution paths. The goal is a structure that supports both development simplicity and production scalability without code changes.

    hashtag
    The Flexibility Promise

    Ergo's network transparency means a process doesn't know whether it's talking to a neighbor on the same node or to a remote process across the network. The same Send() call works either way. But this only helps if your code is organized to take advantage of it.

    Consider two deployment scenarios:

    Development: All applications in one process for fast iteration.

    Production: Applications distributed across nodes for scalability.

    The application code is identical in both cases. Only the entry point changes - which applications start on which nodes.

    This works because:

    • Applications are self-contained functional units

    • Messages define contracts between applications

    • The framework handles routing transparently

    Your project structure must preserve these properties. Mix them up, and you lose deployment flexibility.

    hashtag
    Directory Layout

    A well-structured project separates entry points from applications from shared code:

    hashtag
    Entry Points (cmd/)

    Each directory in cmd/ produces a different binary with a different deployment topology.

    Monolith - everything together:

    Distributed - each application on its own node:

    The application code (apps/api, apps/worker) is identical. The entry point decides what runs where.

    hashtag
    Applications (apps/)

    Each subdirectory in apps/ is a self-contained application. An application is:

    • A cohesive functional unit

    • Deployable independently

    • Composed of actors with a supervision tree

    • Communicating via messages

    Application structure:

    Application definition:

    Applications should not import each other. If apps/api imports apps/worker, you've created a compile-time dependency that limits deployment flexibility.

    hashtag
    Service-Level Types (types/)

    When applications need to communicate, they need shared message types. The types/ directory holds these contracts:

    Both apps/orders and apps/shipping can import types without importing each other. This breaks the circular dependency while maintaining strong typing.

    hashtag
    Shared Libraries (lib/)

    Non-actor code that multiple applications use goes in lib/:

    Libraries must be:

    • Stateless - no global variables, no goroutines

    • Pure - same inputs produce same outputs

    • Actor-agnostic - no dependency on gen.Process

    Libraries are safe to call from actor callbacks because they don't block or manage state.
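A minimal sketch of such a library (the lib/pricing package and its function are hypothetical): pure, stateless, and free of any actor dependency.

```go
// lib/pricing/pricing.go (package pricing) - hypothetical shared library.
// Stateless and actor-agnostic: no globals, no goroutines, no gen.Process,
// so it is safe to call from any actor callback.

// Total returns the order total in cents for the given unit price and
// quantity, applying a whole-percent discount.
func Total(unitPriceCents int64, quantity int, discountPercent int) int64 {
	total := unitPriceCents * int64(quantity)
	return total - total*int64(discountPercent)/100
}
```

Because the function is pure, it needs no synchronization and can be tested without starting a node.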

    hashtag
    Message Isolation Levels

    Messages define contracts between actors. The visibility of message types controls who can send them and where they can travel. Ergo uses Go's export rules plus EDF serialization requirements to create four isolation levels.

    Understanding these levels is critical for proper encapsulation.

    hashtag
    Level 1: Application-Internal (Same Node)

    Messages used only within a single application instance on one node.

    Characteristics:

    • Type is unexported (scheduleTask)

    • Fields are unexported (taskID, not TaskID)

    • Cannot be imported by other packages

    Use when:

    • Communication between actors in the same application

    • Messages never leave the local node

    • Implementation details that shouldn't be exposed
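A minimal sketch of a Level 1 message (the field set beyond taskID is illustrative):

```go
// apps/worker/messages.go (package worker)

// scheduleTask is application-internal: unexported type with unexported
// fields. Other packages cannot reference it, and EDF cannot serialize
// it, so it can never leave the local node.
type scheduleTask struct {
	taskID   string
	priority int
}
```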

    hashtag
    Level 2: Application-Cluster (Same Application, Multiple Nodes)

    Messages between instances of the same application across nodes.

    Characteristics:

    • Type is unexported (replicateState)

    • Fields are exported (Version, not version)

    • Cannot be imported by other packages

    Use when:

    • Replication between application instances

    • Cluster-internal coordination

    • Messages that other applications shouldn't see
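A sketch of a Level 2 message (the Data field is an illustrative addition alongside the Version field named above):

```go
// apps/worker/messages.go (package worker)

// replicateState stays unexported, so only the worker application can
// use it - but its fields are exported, which lets EDF serialize it
// for worker-to-worker traffic across nodes.
type replicateState struct {
	Version uint64
	Data    []byte
}
```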

    hashtag
    Level 3: Cross-Application (Same Node Only)

    Messages between different applications on the same node.

    Characteristics:

    • Type is exported (StatusQuery)

    • Fields are unexported (taskID, not TaskID)

    • CAN be imported by other packages

    Use when:

    • Local service queries

    • Same-node optimization paths

    • Explicitly preventing network transmission

    This level is intentionally restrictive. If someone tries to send StatusQuery to a remote node, serialization fails. The unexported fields act as a compile-time guard against accidental network use.
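A sketch of a Level 3 message; the constructor is a hypothetical convenience, since importers cannot set the unexported field directly:

```go
// apps/storage/messages.go (package storage) - illustrative.

// StatusQuery is exported so other applications on the same node can
// import and send it, but its unexported field makes EDF serialization
// fail - a guard against accidental network use.
type StatusQuery struct {
	taskID string
}

// NewStatusQuery lets importing packages construct the message despite
// the unexported field.
func NewStatusQuery(taskID string) StatusQuery {
	return StatusQuery{taskID: taskID}
}
```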

    hashtag
    Level 4: Service-Level (Everywhere)

    Messages that form public contracts between applications across the cluster.

    Characteristics:

    • Type is exported (ProcessTask)

    • Fields are exported (TaskID)

    • CAN be imported by any package

    Use when:

    • Public API between applications

    • Events that multiple applications subscribe to

    • Commands sent across application boundaries
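A sketch of a Level 4 message (the Payload field is illustrative):

```go
// types/commands.go (package types)

// ProcessTask is a service-level contract: exported type, exported
// fields. Register it with EDF in an init() - as types/events.go does
// with edf.RegisterTypeOf(ProcessTask{}) - so it can be serialized
// for network transmission.
type ProcessTask struct {
	TaskID  string
	Payload []byte
}
```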

    hashtag
    Summary Table

    Level
    Scope
    Type
    Fields
    Serializable
    Import

    hashtag
    Choosing the Right Level

    Start with Level 1 (maximum restriction). Only increase visibility when needed:

    1. Does another application need this message?

      • No → Keep type unexported (Level 1 or 2)

      • Yes → Export type (Level 3 or 4)

    hashtag
    Application Design Patterns

    hashtag
    Supervision Structure

    Applications typically have a supervision tree:
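As a sketch of such a tree, assuming the act package's supervisor API (spec field and constant names may differ between framework versions; the child factories are illustrative):

```go
// apps/worker/supervisor.go - hypothetical supervisor that restarts
// each failed child independently.
package worker

import (
	"ergo.services/ergo/act"
	"ergo.services/ergo/gen"
)

func createSupervisor() gen.ProcessBehavior { return &workerSup{} }

type workerSup struct {
	act.Supervisor
}

func (s *workerSup) Init(args ...any) (act.SupervisorSpec, error) {
	return act.SupervisorSpec{
		Type: act.SupervisorTypeOneForOne, // restart only the failed child
		Children: []act.SupervisorChildSpec{
			{Name: "queue", Factory: createQueue},
			{Name: "processor", Factory: createProcessor},
		},
	}, nil
}
```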

    hashtag
    Configuration via Options

    Applications accept configuration through an Options struct:

    Entry points configure options based on deployment:

    hashtag
    Inter-Application Communication

    Applications discover each other through application names, not node names:
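A sketch from the caller's side (actor and application names are illustrative; the registrar and resolver calls mirror the examples elsewhere on this page):

```go
// Inside an actor in the orders application: resolve where the
// "shipping" application currently runs - locally or remotely.
func (o *OrderActor) notifyShipping(order ShippingRequest) error {
	registrar, err := o.Node().Network().Registrar()
	if err != nil {
		return err
	}
	routes, err := registrar.Resolver().ResolveApplication("shipping")
	if err != nil {
		return err
	}
	for _, route := range routes {
		o.Log().Info("shipping available on %s", route.Node)
	}
	return nil
}
```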

    When running as monolith, routes returns the local node. When distributed, it returns remote nodes. The code doesn't change.

    hashtag
    Event Publishing

    Applications publish events for loose coupling:
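As a sketch (the RegisterEvent/SendEvent calls below follow the gen.Process event API; exact signatures may differ across framework versions, and all names are illustrative):

```go
// Producer side, inside an orders actor: register a named event once,
// then publish to whoever subscribed.
token, err := p.RegisterEvent("order_created", gen.EventOptions{})
if err != nil {
	return err
}
p.SendEvent("order_created", token, types.OrderCreated{OrderID: orderID})

// Consumer side, anywhere in the cluster: subscribe by name.
// p.MonitorEvent(gen.Event{Name: "order_created", Node: "orders@host1"})
```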

    Events decouple applications. Orders doesn't know who listens. Shipping doesn't know where Orders runs.

    hashtag
    Deployment Patterns

    hashtag
    Pattern 1: Development Monolith

    Everything in one process for fast iteration:

    Benefits:

    • Single binary to run

    • No network setup

    • Easy debugging

    • Fast startup

    hashtag
    Pattern 2: Distributed Production

    Each application on dedicated nodes:

    Each binary runs one application:

    Benefits:

    • Independent scaling per tier

    • Fault isolation

    • Resource optimization

    • Zero-downtime updates

    hashtag
    Pattern 3: Hybrid Deployment

    Group related applications for efficiency:
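A sketch of such a grouping, following the same entry-point shape as the other cmd/ examples (node and host names are illustrative): one "edge" node runs api and worker together, while storage runs elsewhere.

```go
// cmd/edge/main.go - api and worker share a node.
package main

import (
	"myproject/apps/api"
	"myproject/apps/worker"

	"ergo.services/ergo"
	"ergo.services/ergo/gen"
	"ergo.services/registrar/etcd"
)

func main() {
	node, _ := ergo.StartNode("edge@edge-1", gen.NodeOptions{
		Applications: []gen.ApplicationBehavior{
			api.CreateApp(api.Options{Port: 8080}),
			worker.CreateApp(worker.Options{Concurrency: 10}),
		},
		Network: gen.NetworkOptions{
			Registrar: etcd.Create(etcd.Options{
				Endpoints: []string{"etcd:2379"},
				Cluster:   "production",
			}),
		},
	})
	node.Wait()
}
```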

    Benefits:

    • Reduced network hops for common paths

    • Fewer nodes to manage

    • Right-sized for actual traffic patterns

    hashtag
    Testing Strategies

    hashtag
    Unit Testing Actors

    Test actors in isolation using the testing framework:
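A sketch of the idea in a plain go test (the dedicated testing helpers are covered in the Testing chapter; the factory and message names here are illustrative):

```go
// apps/worker/processor_test.go - boots a throwaway node.
package worker

import (
	"testing"

	"ergo.services/ergo"
	"ergo.services/ergo/gen"
)

func TestProcessorSpawns(t *testing.T) {
	node, err := ergo.StartNode("test@localhost", gen.NodeOptions{})
	if err != nil {
		t.Fatal(err)
	}
	defer node.Stop()

	pid, err := node.Spawn(createProcessor, gen.ProcessOptions{})
	if err != nil {
		t.Fatal(err)
	}

	// Deliver a message; the actor's HandleMessage processes it.
	if err := node.Send(pid, scheduleTask{taskID: "t1"}); err != nil {
		t.Fatal(err)
	}
}
```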

    hashtag
    Integration Testing Applications

    Test complete applications:
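An illustrative integration test: start the whole worker application on a test node and assert it is present via ApplicationInfo (the module path and option values are assumptions).

```go
package worker_test

import (
	"testing"

	"myproject/apps/worker"

	"ergo.services/ergo"
	"ergo.services/ergo/gen"
)

func TestWorkerApplication(t *testing.T) {
	node, err := ergo.StartNode("test@localhost", gen.NodeOptions{
		Applications: []gen.ApplicationBehavior{
			worker.CreateApp(worker.Options{Concurrency: 1}),
		},
	})
	if err != nil {
		t.Fatal(err)
	}
	defer node.Stop()

	if _, err := node.ApplicationInfo("worker"); err != nil {
		t.Fatalf("worker application not running: %s", err)
	}
}
```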

    hashtag
    Testing Distributed Scenarios

    Test multiple nodes:
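A sketch of a two-node test in one process; placing the shared cookie in NetworkOptions is an assumption, so check it against your framework version.

```go
package cluster_test

import (
	"testing"

	"ergo.services/ergo"
	"ergo.services/ergo/gen"
)

func TestCrossNode(t *testing.T) {
	opts := gen.NodeOptions{
		Network: gen.NetworkOptions{Cookie: "test-cookie"},
	}

	node1, err := ergo.StartNode("node1@localhost", opts)
	if err != nil {
		t.Fatal(err)
	}
	defer node1.Stop()

	node2, err := ergo.StartNode("node2@localhost", opts)
	if err != nil {
		t.Fatal(err)
	}
	defer node2.Stop()

	// Establish a connection from node1 to node2.
	if _, err := node1.Network().GetNode("node2@localhost"); err != nil {
		t.Fatalf("connect failed: %s", err)
	}
}
```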

    hashtag
    Evolution and Refactoring

    hashtag
    Starting Simple

    Begin with a monolith:

    hashtag
    Extracting Applications

    When the monolith grows, extract bounded contexts:

    Step 1: Identify boundaries in the combined application.

    Step 2: Create separate application packages.

    Step 3: Update the entry point.

    Step 4: When ready, create distributed entry points.

    The application code never changes. Only entry points and deployment.

    hashtag
    Merging Applications

    If you over-distributed:

    No application code changes. Just different composition.

    hashtag
    Best Practices

    hashtag
    Application Boundaries

    Do:

    • One application per bounded context

    • Applications that scale together can be one application

    • Applications that deploy together can be one application

    Don't:

    • Create applications for single actors

    • Split applications by technical layer (web/service/data)

    • Create circular dependencies between applications

    Good:

    Bad:

    hashtag
    Message Design

    Do:

    • Start with Level 1 (most restrictive)

    • Increase visibility only when needed

    • Document which level each message uses

    Don't:

    • Default to Level 4 for everything

    • Mix isolation levels arbitrarily

    • Use any or interface{} for messages

    hashtag
    Dependencies

    Do:

    • Applications import types/ for shared contracts

    • Applications import lib/ for utilities

    • Entry points import applications

    Don't:

    • Applications import other applications

    • Libraries depend on applications

    • Create import cycles

    hashtag
    Configuration

    Do:

    • Use Options structs for application config

    • Validate in CreateApp or Load

    • Provide sensible defaults

    Don't:

    • Hard-code configuration in actors

    • Read os.Getenv directly in actors

    • Store configuration in global variables

    hashtag
    What's Next

    This article covered project organization for flexible deployment. As your system grows into a distributed cluster, two topics become essential:

    • Building a Cluster - service discovery, load balancing, failover, and observability

    • Message Versioning - evolving message contracts during rolling upgrades

    // No configuration needed - embedded registrar is the default
    node, _ := ergo.StartNode("service@localhost", gen.NodeOptions{})
    import "ergo.services/registrar/etcd"
    
    options := gen.NodeOptions{
        Network: gen.NetworkOptions{
            Registrar: etcd.Create(etcd.Options{
                Endpoints: []string{"etcd1:2379", "etcd2:2379", "etcd3:2379"},
                Cluster:   "production",
            }),
        },
    }
    node, _ := ergo.StartNode("service@host", options)
    import "ergo.services/registrar/saturn"
    
    options := gen.NodeOptions{
        Network: gen.NetworkOptions{
            Registrar: saturn.Create("saturn.example.com", "your-token", saturn.Options{
                Cluster: "production",
            }),
        },
    }
    node, _ := ergo.StartNode("service@host", options)
    // Define application spec with weight
    spec := gen.ApplicationSpec{
        Name:   "api",
        Group:  []gen.ApplicationMemberSpec{{Factory: createAPIHandler}},
        Weight: 100,  // higher weight = more traffic
    }
    
    node.ApplicationLoad(app, spec)
    node.ApplicationStart("api", gen.ApplicationOptions{})
    registrar, _ := node.Network().Registrar()
    resolver := registrar.Resolver()
    
    routes, _ := resolver.ResolveApplication("api")
    for _, route := range routes {
        fmt.Printf("api on %s (weight: %d, state: %s)\n",
            route.Node, route.Weight, route.State)
    }
    api on node1@host1 (weight: 100, state: running)
    api on node2@host2 (weight: 100, state: running)
    api on node3@host3 (weight: 50, state: running)
    // Canary deployment
    // Stable nodes
    stableSpec := gen.ApplicationSpec{Name: "api", Weight: 90}
    // Canary node
    canarySpec := gen.ApplicationSpec{Name: "api", Weight: 10}
    func (c *Client) selectNode(routes []gen.ApplicationRoute) gen.Atom {
        // Filter running instances
        var running []gen.ApplicationRoute
        for _, r := range routes {
            if r.State == gen.ApplicationStateRunning {
                running = append(running, r)
            }
        }
    
        // Weighted random selection
        totalWeight := 0
        for _, r := range running {
            totalWeight += r.Weight
        }
        if totalWeight == 0 {
            // no running instance (or all weights are zero) - nothing to select
            return ""
        }

        pick := rand.Intn(totalWeight)
        cumulative := 0
        for _, r := range running {
            cumulative += r.Weight
            if pick < cumulative {
                return r.Node
            }
        }
        return running[0].Node
    }
    func main() {
        options := gen.NodeOptions{
            Network: gen.NetworkOptions{
                Registrar: etcd.Create(etcd.Options{
                    Endpoints: []string{"etcd:2379"},
                    Cluster:   "production",
                }),
            },
        }
    
        node, _ := ergo.StartNode("worker@"+hostname(), options)
    
        // Load and start the application
        node.ApplicationLoad(&WorkerApp{}, gen.ApplicationSpec{
            Name:   "worker",
            Weight: 100,
        })
        node.ApplicationStartPermanent("worker", gen.ApplicationOptions{})
    
        node.Wait()
    }
    func (c *Coordinator) distributeWork(job Job) error {
        registrar, _ := c.Node().Network().Registrar()
        routes, _ := registrar.Resolver().ResolveApplication("worker")
    
        // Select node based on weights
        targetNode := c.selectNode(routes)
    
        // Get connection and send work
        remote, _ := c.Network().GetNode(targetNode)
        return remote.Send("worker_handler", job)
    }
    // Drain before shutdown
    info, _ := node.ApplicationInfo("worker")
    // Update weight through your deployment tooling
    // Wait for in-flight work...
    node.Stop()
    func (c *Coordinator) Init(args ...any) error {
        registrar, _ := c.Node().Network().Registrar()
        event, _ := registrar.Event()
        c.MonitorEvent(event)
        return nil
    }
    
    func (c *Coordinator) HandleEvent(ev gen.MessageEvent) error {
        switch msg := ev.Message.(type) {
    
        case etcd.EventApplicationStarted:
            if msg.Name == "worker" {
                c.Log().Info("worker started on %s (weight: %d)", msg.Node, msg.Weight)
                c.refreshWorkerList()
            }
    
        case etcd.EventApplicationStopped:
            if msg.Name == "worker" {
                c.Log().Info("worker stopped on %s", msg.Node)
                c.refreshWorkerList()
            }
    
        case etcd.EventNodeLeft:
            c.Log().Warning("node left: %s", msg.Name)
            c.handleNodeFailure(msg.Name)
        }
        return nil
    }
    import "ergo.services/actor/leader"
    
    type Scheduler struct {
        leader.Actor
    
        jobQueue []Job
        active   bool
    }
    
    func (s *Scheduler) Init(args ...any) (leader.Options, error) {
        return leader.Options{
            ClusterID: "scheduler-cluster",
            Bootstrap: []gen.ProcessID{
                {Name: "scheduler", Node: "node1@host1"},
                {Name: "scheduler", Node: "node2@host2"},
                {Name: "scheduler", Node: "node3@host3"},
            },
        }, nil
    }
    
    func (s *Scheduler) HandleBecomeLeader() error {
        s.Log().Info("elected as leader - starting job processing")
        s.active = true
        s.startProcessingJobs()
        return nil
    }
    
    func (s *Scheduler) HandleBecomeFollower(leaderPID gen.PID) error {
        s.Log().Info("following leader: %s", leaderPID)
        s.active = false
        s.stopProcessingJobs()
        return nil
    }
    func (s *Scheduler) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case SubmitJob:
            if !s.IsLeader() {
                // Forward to leader
                s.Send(s.Leader(), msg)
                return nil
            }
            s.jobQueue = append(s.jobQueue, msg.Job)
        }
        return nil
    }
    func (s *Scheduler) HandleBecomeLeader() error {
        s.SendAfter(s.PID(), RunScheduledTasks{}, 10*time.Second)
        return nil
    }
    
    func (s *Scheduler) HandleMessage(from gen.PID, message any) error {
        switch message.(type) {
        case RunScheduledTasks:
            if s.IsLeader() {
                s.executeScheduledTasks()
                s.SendAfter(s.PID(), RunScheduledTasks{}, 10*time.Second)
            }
        }
        return nil
    }
    func (s *Scheduler) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case AcquireLock:
            if !s.IsLeader() {
                s.Send(from, NotLeader{Leader: s.Leader()})
                return nil
            }
            if s.locks[msg.Resource] != nil {
                s.Send(from, LockDenied{})
            } else {
                s.locks[msg.Resource] = &Lock{Holder: from, Expiry: time.Now().Add(msg.TTL)}
                s.Send(from, LockGranted{})
            }
        }
        return nil
    }
    import "ergo.services/actor/metrics"
    
    node.Spawn(metrics.Factory, gen.ProcessOptions{}, metrics.Options{
        Host: "0.0.0.0",
        Port: 9090,
        CollectInterval: 10 * time.Second,
    })
    type AppMetrics struct {
        metrics.Actor
    
        requestsTotal  prometheus.Counter
        requestLatency prometheus.Histogram
        activeJobs     prometheus.Gauge
    }
    
    func (m *AppMetrics) Init(args ...any) (metrics.Options, error) {
        m.requestsTotal = prometheus.NewCounter(prometheus.CounterOpts{
            Name: "app_requests_total",
            Help: "Total requests processed",
        })
    
        m.requestLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
            Name:    "app_request_duration_seconds",
            Buckets: prometheus.DefBuckets,
        })
    
        m.activeJobs = prometheus.NewGauge(prometheus.GaugeOpts{
            Name: "app_active_jobs",
            Help: "Currently processing jobs",
        })
    
        m.Registry().MustRegister(m.requestsTotal, m.requestLatency, m.activeJobs)
    
        return metrics.Options{Port: 9090}, nil
    }
    // In your request handler
    func (h *Handler) HandleMessage(from gen.PID, message any) error {
        start := time.Now()
    
        // Process request...
    
        // Send metrics update
        h.Send(metricsPID, RequestCompleted{Duration: time.Since(start)})
        return nil
    }
    
    // In metrics actor
    func (m *AppMetrics) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case RequestCompleted:
            m.requestsTotal.Inc()
            m.requestLatency.Observe(msg.Duration.Seconds())
        case JobStarted:
            m.activeJobs.Inc()
        case JobCompleted:
            m.activeJobs.Dec()
        }
        return nil
    }
    # prometheus.yml
    scrape_configs:
      - job_name: 'ergo-cluster'
        static_configs:
          - targets:
              - 'node1:9090'
              - 'node2:9090'
              - 'node3:9090'
    import "ergo.services/application/observer"
    
    options := gen.NodeOptions{
        Applications: []gen.ApplicationBehavior{
            observer.CreateApp(observer.Options{
                Host: "localhost",
                Port: 9911,
            }),
        },
    }
    node, _ := ergo.StartNode("mynode@localhost", options)
    go install ergo.services/tools/observer@latest
    observer -cookie "your-cluster-cookie"
    func (w *Worker) HandleInspect(from gen.PID, item ...string) map[string]string {
        return map[string]string{
            "queue_depth":  fmt.Sprintf("%d", len(w.queue)),
            "processed":    fmt.Sprintf("%d", w.processedCount),
            "current_job":  w.currentJob.ID,
            "uptime":       time.Since(w.startTime).String(),
        }
    }
    // On remote node: enable spawn
    network := node.Network()
    network.EnableSpawn("worker", createWorker, "coordinator@host")
    
    // On coordinator: spawn remotely
    remote, _ := coordinator.Network().GetNode("worker@host")
    pid, _ := remote.Spawn("worker", gen.ProcessOptions{}, WorkerConfig{BatchSize: 100})
    
    // Send work to the remote process
    coordinator.Send(pid, ProcessJob{Data: jobData})
    // On remote node: load app and enable remote start
    node.ApplicationLoad(&WorkerApp{}, gen.ApplicationSpec{Name: "workers"})
    network.EnableApplicationStart("workers", "coordinator@host")
    
    // On coordinator: start remotely
    remote, _ := coordinator.Network().GetNode("worker@host")
    remote.ApplicationStartPermanent("workers", gen.ApplicationOptions{})
    1. Node-specific in cluster:     /cluster/{cluster}/config/{node}/{item}
    2. Cluster-wide default:         /cluster/{cluster}/config/*/{item}
    3. Global default:               /config/global/{item}
    # Set config via etcdctl (quote the key so the shell does not glob the '*')
    etcdctl put 'services/ergo/cluster/production/config/*/db.pool_size' "int:20"
    etcdctl put 'services/ergo/cluster/production/config/*/cache.enabled' "bool:true"
    etcdctl put services/ergo/cluster/production/config/node1/db.pool_size "int:50"
    // Read config in your application
    registrar, _ := node.Network().Registrar()
    config, _ := registrar.Config("db.pool_size", "cache.enabled")
    
    poolSize := config["db.pool_size"].(int64)    // 50 on node1, 20 on others
    cacheEnabled := config["cache.enabled"].(bool) // true
    func (a *App) HandleEvent(ev gen.MessageEvent) error {
        switch msg := ev.Message.(type) {
        case etcd.EventConfigUpdate:
            a.Log().Info("config changed: %s = %v", msg.Item, msg.Value)
    
            switch msg.Item {
            case "log.level":
                a.updateLogLevel(msg.Value.(string))
            case "cache.size":
                a.resizeCache(msg.Value.(int64))
            }
        }
        return nil
    }
    package main
    
    import (
        "ergo.services/actor/leader"
        "ergo.services/actor/metrics"
        "ergo.services/application/observer"
        "ergo.services/ergo"
        "ergo.services/ergo/gen"
        "ergo.services/registrar/etcd"
    )
    
    type Coordinator struct {
        leader.Actor
        workers []gen.ApplicationRoute
    }
    
    func (c *Coordinator) Init(args ...any) (leader.Options, error) {
        // Subscribe to registrar events
        reg, _ := c.Node().Network().Registrar()
        ev, _ := reg.Event()
        c.MonitorEvent(ev)
    
        // Initial worker discovery
        c.refreshWorkers()
    
        return leader.Options{
            ClusterID: "coordinators",
            Bootstrap: []gen.ProcessID{
                {Name: "coordinator", Node: "coord1@host1"},
                {Name: "coordinator", Node: "coord2@host2"},
                {Name: "coordinator", Node: "coord3@host3"},
            },
        }, nil
    }
    
    func (c *Coordinator) HandleBecomeLeader() error {
        c.Log().Info("became leader - starting job distribution")
        c.SendAfter(c.PID(), DistributeJobs{}, time.Second)
        return nil
    }
    
    func (c *Coordinator) HandleBecomeFollower(leader gen.PID) error {
        c.Log().Info("following %s", leader)
        return nil
    }
    
    func (c *Coordinator) HandleEvent(ev gen.MessageEvent) error {
        switch ev.Message.(type) {
        case etcd.EventApplicationStarted, etcd.EventApplicationStopped:
            c.refreshWorkers()
        }
        return nil
    }
    
    func (c *Coordinator) refreshWorkers() {
        reg, _ := c.Node().Network().Registrar()
        c.workers, _ = reg.Resolver().ResolveApplication("worker")
    }
    
    func main() {
        options := gen.NodeOptions{
            Network: gen.NetworkOptions{
                Registrar: etcd.Create(etcd.Options{
                    Endpoints: []string{"etcd:2379"},
                    Cluster:   "production",
                }),
            },
            Applications: []gen.ApplicationBehavior{
                observer.CreateApp(observer.Options{Port: 9911}),
            },
        }
    
        node, _ := ergo.StartNode("coord1@host1", options)
    
        // Start metrics
        node.Spawn(metrics.Factory, gen.ProcessOptions{}, metrics.Options{Port: 9090})
    
        // Start coordinator
        node.SpawnRegister("coordinator", func() gen.ProcessBehavior {
            return &Coordinator{}
        }, gen.ProcessOptions{})
    
        node.Wait()
    }
    package main
    
    import (
        "ergo.services/actor/metrics"
        "ergo.services/ergo"
        "ergo.services/ergo/gen"
        "ergo.services/registrar/etcd"
    )
    
    type WorkerApp struct{}
    
    func (w *WorkerApp) Load(node gen.Node, args ...any) (gen.ApplicationSpec, error) {
        return gen.ApplicationSpec{
            Name:   "worker",
            Weight: 100,
            Group: []gen.ApplicationMemberSpec{
                {Name: "handler", Factory: createHandler},
            },
        }, nil
    }
    
    func (w *WorkerApp) Start(mode gen.ApplicationMode) {}
    func (w *WorkerApp) Terminate(reason error) {}
    
    func main() {
        options := gen.NodeOptions{
            Network: gen.NetworkOptions{
                Registrar: etcd.Create(etcd.Options{
                    Endpoints: []string{"etcd:2379"},
                    Cluster:   "production",
                }),
            },
        }
    
        node, _ := ergo.StartNode("worker1@host1", options)
    
        // Start metrics
        node.Spawn(metrics.Factory, gen.ProcessOptions{}, metrics.Options{Port: 9090})
    
        // Load and start worker application
        node.ApplicationLoad(&WorkerApp{})
        node.ApplicationStartPermanent("worker", gen.ApplicationOptions{})
    
        node.Wait()
    }

    • Level 1: cannot be serialized for network transmission - maximum encapsulation

    • Level 2: CAN be serialized (EDF requires exported fields) - internal to the application, but network-capable

    • Level 3: cannot be serialized (unexported fields block EDF) - cross-application but local-only

    • Level 4: CAN be serialized - full network transparency

    | Level | Scope | Type | Fields | Serializable | Import |
    |---|---|---|---|---|---|
    | 1 | Within app, same node | unexported | unexported | No | No |
    | 2 | Same app, any node | unexported | Exported | Yes | No |
    | 3 | Cross-app, same node | Exported | unexported | No | Yes |
    | 4 | Everywhere | Exported | Exported | Yes | Yes |

    2. Does this message cross node boundaries?

      • No → Keep fields unexported (Level 1 or 3)

      • Yes → Export fields (Level 2 or 4)

    Do:

    • Register Level 4 types with EDF

    • Read environment in entry points

    Don't:

    • Include pointers in network messages

    • Hierarchical config with type conversion

    • Events: real-time notifications of cluster changes

    | | etcd | Saturn |
    |---|---|---|
    | Scalability | 50-70 nodes | Thousands of nodes |
    | Event latency | Next poll cycle | Immediate |
    | Infrastructure | General-purpose KV store | Purpose-built for Ergo |

    A 7-node etcd cluster has a quorum of 4 and tolerates 3 node failures.

    | Metric | Description |
    |---|---|
    |  | Memory from OS |
    | ergo_memory_alloc_bytes | Heap allocation |
    | ergo_connected_nodes_total | Remote connections |
    | ergo_remote_messages_in_total | Messages received per node |
    | ergo_remote_messages_out_total | Messages sent per node |
    | ergo_remote_bytes_in_total | Bytes received per node |
    | ergo_remote_bytes_out_total | Bytes sent per node |

    | Technology | Purpose | Package |
    |---|---|---|
    | Leader Actor | Failover via leader election | ergo.services/actor/leader |
    | Metrics Actor | Prometheus observability | ergo.services/actor/metrics |
    | Observer | Web UI for inspection | ergo.services/application/observer, tools/observer |
    | Remote Spawn | Dynamic process creation | Core framework |
    | Remote App Start | Dynamic application deployment | Core framework |
    | Configuration | Hierarchical config management | Registrar feature |

    • Metrics Actor

    • Observer Application

    • etcd Client

    • Saturn Client
    project/
    ├── cmd/                        # Entry points
    │   ├── monolith/main.go       # All apps together
    │   ├── api/main.go            # API node
    │   ├── worker/main.go         # Worker node
    │   └── storage/main.go        # Storage node
    │
    ├── apps/                       # Application packages
    │   ├── api/
    │   │   ├── app.go             # Application definition
    │   │   ├── handler.go         # Request handling actor
    │   │   ├── router.go          # Routing logic
    │   │   ├── messages.go        # Internal messages
    │   │   └── supervisor.go      # Supervision tree
    │   ├── worker/
    │   │   ├── app.go
    │   │   ├── processor.go
    │   │   ├── queue.go
    │   │   ├── messages.go
    │   │   └── types.go
    │   └── storage/
    │       ├── app.go
    │       ├── reader.go
    │       ├── writer.go
    │       ├── messages.go
    │       └── types.go
    │
    ├── types/                      # Service-level contracts
    │   ├── events.go              # Cross-application events
    │   └── commands.go            # Cross-application commands
    │
    ├── lib/                        # Shared non-actor code
    │   ├── config/                # Configuration utilities
    │   └── models/                # Domain models
    │
    └── go.mod
    // cmd/monolith/main.go
    package main
    
    import (
        "myproject/apps/api"
        "myproject/apps/worker"
        "myproject/apps/storage"
    
        "ergo.services/ergo"
        "ergo.services/ergo/gen"
    )
    
    func main() {
        node, _ := ergo.StartNode("app@localhost", gen.NodeOptions{
            Applications: []gen.ApplicationBehavior{
                api.CreateApp(api.Options{Port: 8080}),
                worker.CreateApp(worker.Options{Concurrency: 10}),
                storage.CreateApp(storage.Options{DSN: "postgres://..."}),
            },
        })
        node.Wait()
    }
    // cmd/api/main.go
    package main
    
    import (
        "myproject/apps/api"
    
        "ergo.services/ergo"
        "ergo.services/ergo/gen"
        "ergo.services/registrar/etcd"
    )
    
    func main() {
        node, _ := ergo.StartNode("api@api-server-1", gen.NodeOptions{
            Applications: []gen.ApplicationBehavior{
                api.CreateApp(api.Options{Port: 8080}),
            },
            Network: gen.NetworkOptions{
                Registrar: etcd.Create(etcd.Options{
                    Endpoints: []string{"etcd:2379"},
                    Cluster:   "production",
                }),
            },
        })
        node.Wait()
    }
    // cmd/worker/main.go
    package main
    
    import (
        "myproject/apps/worker"
    
        "ergo.services/ergo"
        "ergo.services/ergo/gen"
        "ergo.services/registrar/etcd"
    )
    
    func main() {
        node, _ := ergo.StartNode("worker@worker-1", gen.NodeOptions{
            Applications: []gen.ApplicationBehavior{
                worker.CreateApp(worker.Options{Concurrency: 50}),
            },
            Network: gen.NetworkOptions{
                Registrar: etcd.Create(etcd.Options{
                    Endpoints: []string{"etcd:2379"},
                    Cluster:   "production",
                }),
            },
        })
        node.Wait()
    }
    apps/worker/
    ├── app.go              # Application behavior implementation
    ├── processor.go        # Main processing actor
    ├── queue.go            # Queue management actor
    ├── supervisor.go       # Supervision strategy
    ├── messages.go         # Message types (see isolation levels)
    └── types.go            # Domain types used in messages
    // apps/worker/app.go
    package worker
    
    import "ergo.services/ergo/gen"
    
    type Options struct {
        Concurrency int
        QueueSize   int
    }
    
    func CreateApp(opts Options) gen.ApplicationBehavior {
        return &app{options: opts}
    }
    
    type app struct {
        options Options
    }
    
    func (a *app) Load(node gen.Node, args ...any) (gen.ApplicationSpec, error) {
        return gen.ApplicationSpec{
            Name:        "worker",
            Description: "Background task processing",
            Weight:      100,
            Group: []gen.ApplicationMemberSpec{
                {Name: "queue", Factory: a.createQueue},
                {Name: "supervisor", Factory: a.createSupervisor},
            },
            Env: gen.EnvList{
                {"CONCURRENCY", a.options.Concurrency},
                {"QUEUE_SIZE", a.options.QueueSize},
            },
        }, nil
    }
    
    func (a *app) Start(mode gen.ApplicationMode) {}
    func (a *app) Terminate(reason error) {}
    // types/events.go
    package types
    
    import (
        "time"
        "ergo.services/ergo/net/edf"
    )
    
    // Events published by the orders application
    type OrderCreated struct {
        OrderID    string
        CustomerID string
        Total      int64
        CreatedAt  time.Time
    }
    
    type OrderCompleted struct {
        OrderID     string
        CompletedAt time.Time
    }
    
    func init() {
        // Register for network serialization
        edf.RegisterTypeOf(OrderCreated{})
        edf.RegisterTypeOf(OrderCompleted{})
    }
    // lib/config/config.go
    package config
    
    import "os"
    
    func DatabaseURL() string {
        return os.Getenv("DATABASE_URL")
    }
    
    // lib/models/order.go
    package models
    
    type Order struct {
        ID         string
        CustomerID string
        Items      []OrderItem
        Total      int64
    }
    // apps/worker/messages.go
    package worker
    
    // Unexported type, unexported fields
    // Cannot be referenced outside this package
    // Cannot be serialized (unexported fields)
    type scheduleTask struct {
        taskID   string
        priority int
        data     []byte
    }
    
    type taskCompleted struct {
        taskID string
        result []byte
    }
    // apps/worker/messages.go
    package worker
    
    // Unexported type, EXPORTED fields
    // Cannot be referenced outside this package
    // CAN be serialized (exported fields)
    type replicateState struct {
        Version   int64   // Exported for EDF
        TaskIDs   []string
        Positions map[string]int
    }
    
    type syncRequest struct {
        FromVersion int64
        ToVersion   int64
    }
    // apps/worker/messages.go
    package worker
    
    // EXPORTED type, unexported fields
    // CAN be referenced by other packages
    // Cannot be serialized (unexported fields)
    type StatusQuery struct {
        taskID string  // unexported - prevents network use
    }
    
    type StatusResponse struct {
        taskID   string
        status   string
        progress int
    }
    // types/commands.go
    package types
    
    import "ergo.services/ergo/net/edf"
    
    // EXPORTED type, EXPORTED fields
    // CAN be referenced by any package
    // CAN be serialized
    type ProcessTask struct {
        TaskID   string
        Priority int
        Payload  []byte
    }
    
    type TaskResult struct {
        TaskID string
        Status string
        Output []byte
        Error  string
    }
    
    func init() {
        edf.RegisterTypeOf(ProcessTask{})
        edf.RegisterTypeOf(TaskResult{})
    }
    // apps/worker/app.go
    func (a *app) Load(node gen.Node, args ...any) (gen.ApplicationSpec, error) {
        return gen.ApplicationSpec{
            Name: "worker",
            Group: []gen.ApplicationMemberSpec{
                {Name: "queue_manager", Factory: createQueueManager},
                {Name: "processor_sup", Factory: a.createProcessorSupervisor},
            },
        }, nil
    }
    
    // apps/worker/supervisor.go
    func (a *app) createProcessorSupervisor() gen.ProcessBehavior {
        return &processorSup{options: a.options}
    }
    
    type processorSup struct {
        act.Supervisor
        options Options
    }
    
    func (s *processorSup) Init(args ...any) (act.SupervisorSpec, error) {
        children := make([]act.SupervisorChildSpec, s.options.Concurrency)
        for i := 0; i < s.options.Concurrency; i++ {
            children[i] = act.SupervisorChildSpec{
                Name:    gen.Atom(fmt.Sprintf("processor_%d", i)),
                Factory: createProcessor,
            }
        }
    
        return act.SupervisorSpec{
            Type:     act.SupervisorTypeOneForOne,
            Restart:  act.SupervisorRestart{Strategy: act.SupervisorStrategyTemporary},
            Children: children,
        }, nil
    }
    // apps/worker/app.go
    type Options struct {
        Concurrency   int
        QueueSize     int
        RetryAttempts int
        RetryDelay    time.Duration
    }
    
    func DefaultOptions() Options {
        return Options{
            Concurrency:   10,
            QueueSize:     1000,
            RetryAttempts: 3,
            RetryDelay:    time.Second,
        }
    }
    
    func CreateApp(opts Options) gen.ApplicationBehavior {
        // Validate options
        if opts.Concurrency < 1 {
            opts.Concurrency = DefaultOptions().Concurrency
        }
        return &app{options: opts}
    }
    // cmd/worker/main.go
    func main() {
        opts := worker.Options{
            Concurrency:   getEnvInt("WORKER_CONCURRENCY", 50),
            QueueSize:     getEnvInt("WORKER_QUEUE_SIZE", 10000),
            RetryAttempts: getEnvInt("WORKER_RETRY_ATTEMPTS", 5),
        }
    
        node, _ := ergo.StartNode("worker@host", gen.NodeOptions{
            Applications: []gen.ApplicationBehavior{
                worker.CreateApp(opts),
            },
        })
        node.Wait()
    }
    // apps/api/handler.go
    func (h *Handler) processRequest(req Request) error {
        // Discover worker application
        registrar, _ := h.Node().Network().Registrar()
        routes, _ := registrar.Resolver().ResolveApplication("worker")
    
        if len(routes) == 0 {
            return errors.New("no workers available")
        }
    
        // Select a worker (weighted random)
        target := h.selectRoute(routes)
    
        // Send message (works whether local or remote)
        remote, _ := h.Node().Network().GetNode(target.Node)
        _, err := remote.Call("queue_manager", types.ProcessTask{
            TaskID:  req.ID,
            Payload: req.Data,
        })
    
        return err
    }
    // apps/orders/manager.go
    func (m *Manager) completeOrder(orderID string) error {
        // ... complete order logic ...
    
        // Publish event (Level 4 - service-level)
        m.SendEvent("orders", "completed", types.OrderCompleted{
            OrderID:     orderID,
            CompletedAt: time.Now(),
        })
    
        return nil
    }
    
    // apps/shipping/listener.go
    func (l *Listener) Init(args ...any) error {
        // Subscribe to order events
        event, _ := l.RegisterEvent("orders", "completed")
        l.LinkEvent(event)
        return nil
    }
    
    func (l *Listener) HandleEvent(ev gen.MessageEvent) error {
        switch e := ev.Message.(type) {
        case types.OrderCompleted:
            l.createShipment(e.OrderID)
        }
        return nil
    }
    // cmd/dev/main.go
    func main() {
        node, _ := ergo.StartNode("dev@localhost", gen.NodeOptions{
            Applications: []gen.ApplicationBehavior{
                api.CreateApp(api.Options{Port: 8080}),
                worker.CreateApp(worker.Options{Concurrency: 5}),
                storage.CreateApp(storage.Options{DSN: "dev.db"}),
                observer.CreateApp(observer.Options{Port: 9911}),
            },
        })
    
        node.Log().Info("Development server: http://localhost:8080")
        node.Log().Info("Observer UI: http://localhost:9911")
        node.Wait()
    }
    // cmd/api/main.go
    node, _ := ergo.StartNode("api@api-1", gen.NodeOptions{
        Applications: []gen.ApplicationBehavior{
            api.CreateApp(api.Options{Port: 8080}),
        },
        Network: gen.NetworkOptions{
            Registrar: etcd.Create(etcd.Options{
                Endpoints: []string{"etcd:2379"},
                Cluster:   "production",
            }),
        },
    })
    // cmd/frontend/main.go - API + lightweight worker
    node, _ := ergo.StartNode("frontend@web-1", gen.NodeOptions{
        Applications: []gen.ApplicationBehavior{
            api.CreateApp(api.Options{Port: 8080}),
            worker.CreateApp(worker.Options{Concurrency: 5}), // light tasks
        },
        Network: gen.NetworkOptions{Registrar: registrar},
    })
    
    // cmd/backend/main.go - Heavy processing + storage
    node, _ := ergo.StartNode("backend@compute-1", gen.NodeOptions{
        Applications: []gen.ApplicationBehavior{
            worker.CreateApp(worker.Options{Concurrency: 100}), // heavy tasks
            storage.CreateApp(storage.Options{DSN: "prod.db"}),
        },
        Network: gen.NetworkOptions{Registrar: registrar},
    })
    // apps/worker/processor_test.go
    package worker
    
    import (
        "testing"
        "ergo.services/ergo/testing/unit"
    )
    
    func TestProcessorHandlesTask(t *testing.T) {
        actor, err := unit.Spawn(t, createProcessor)
        if err != nil {
            t.Fatal(err)
        }
    
        // Send internal message (Level 1)
        actor.SendMessage(actor.PID(), scheduleTask{
            taskID:   "task-1",
            priority: 1,
            data:     []byte("test"),
        })
    
        // Verify the completion message was sent
        actor.ShouldSend().Message(taskCompleted{
            taskID: "task-1",
            result: []byte("processed"),
        })
    }
    // apps/worker/integration_test.go
    package worker_test
    
    import (
        "testing"
        "myproject/apps/worker"
    
        "ergo.services/ergo"
        "ergo.services/ergo/gen"
    )
    
    func TestWorkerApplication(t *testing.T) {
        node, err := ergo.StartNode("test@localhost", gen.NodeOptions{
            Applications: []gen.ApplicationBehavior{
                worker.CreateApp(worker.Options{Concurrency: 2}),
            },
        })
        if err != nil {
            t.Fatal(err)
        }
        defer node.Stop()
    
        // Verify application started
        info, err := node.ApplicationInfo("worker")
        if err != nil {
            t.Fatal(err)
        }
        if info.State != gen.ApplicationStateRunning {
            t.Fatalf("expected running, got %s", info.State)
        }
    
        // Send test message and verify behavior
        // ...
    }
    func TestCrossNodeCommunication(t *testing.T) {
        // Start API node
        apiNode, _ := ergo.StartNode("api@localhost:15001", gen.NodeOptions{
            Applications: []gen.ApplicationBehavior{
                api.CreateApp(api.Options{}),
            },
        })
        defer apiNode.Stop()
    
        // Start worker node
        workerNode, _ := ergo.StartNode("worker@localhost:15002", gen.NodeOptions{
            Applications: []gen.ApplicationBehavior{
                worker.CreateApp(worker.Options{}),
            },
        })
        defer workerNode.Stop()
    
        // Connect nodes
        apiNode.Network().AddRoute("worker@localhost:15002", gen.NetworkRoute{
            Route: gen.Route{Host: "localhost", Port: 15002},
        }, 100)
    
        // Test cross-node message passing
        remote, _ := apiNode.Network().GetNode("worker@localhost:15002")
        result, err := remote.Call("queue_manager", types.ProcessTask{
            TaskID: "test-task",
        })
    
        if err != nil {
            t.Fatal(err)
        }
        _ = result // Verify result...
    }
    // cmd/main.go - Everything together
    func main() {
        node, _ := ergo.StartNode("app@localhost", gen.NodeOptions{
            Applications: []gen.ApplicationBehavior{
                combined.CreateApp(), // One big application
            },
        })
        node.Wait()
    }
    combined/
    ├── order_handler.go      → apps/orders/
    ├── order_processor.go    → apps/orders/
    ├── shipping_handler.go   → apps/shipping/
    ├── shipping_tracker.go   → apps/shipping/
    └── ...
    // apps/orders/app.go
    package orders
    
    func CreateApp(opts Options) gen.ApplicationBehavior {
        return &app{options: opts}
    }
    
    // apps/shipping/app.go
    package shipping
    
    func CreateApp(opts Options) gen.ApplicationBehavior {
        return &app{options: opts}
    }
    // cmd/main.go - Multiple applications, still one node
    func main() {
        node, _ := ergo.StartNode("app@localhost", gen.NodeOptions{
            Applications: []gen.ApplicationBehavior{
                orders.CreateApp(orders.Options{}),
                shipping.CreateApp(shipping.Options{}),
            },
        })
        node.Wait()
    }
    // cmd/orders/main.go
    func main() {
        node, _ := ergo.StartNode("orders@orders-1", gen.NodeOptions{
            Applications: []gen.ApplicationBehavior{
                orders.CreateApp(orders.Options{}),
            },
            Network: gen.NetworkOptions{Registrar: registrar},
        })
        node.Wait()
    }
    // Before: Two binaries
    // cmd/orders/main.go → orders.CreateApp()
    // cmd/shipping/main.go → shipping.CreateApp()
    
    // After: One binary
    // cmd/fulfillment/main.go
    func main() {
        node, _ := ergo.StartNode("fulfillment@host", gen.NodeOptions{
            Applications: []gen.ApplicationBehavior{
                orders.CreateApp(orders.Options{}),
                shipping.CreateApp(shipping.Options{}),
            },
        })
        node.Wait()
    }
    apps/
    ├── orders/      # Complete order lifecycle
    ├── shipping/    # Complete shipping lifecycle
    └── inventory/   # Complete inventory management
    apps/
    ├── order_api/           # Just API handlers
    ├── order_service/       # Just business logic
    ├── order_repository/    # Just data access
    └── order_events/        # Just events

    Unit

    A zero-dependency library for testing Ergo Framework actors with fluent API


    Introduced in Ergo Framework 3.1.0 and above (not yet released; currently available in the v310 branch)

    The Ergo Unit Testing Library makes testing actor-based systems simple and reliable. It provides specialized tools designed specifically for the unique challenges of testing actors, with zero external dependencies and an intuitive, readable API.

    hashtag
    What You'll Learn

    This guide takes you from simple actor tests to complex distributed scenarios. Here's the journey:

    hashtag
    Getting Started (You Are Here!)

    • Your First Test - Simple echo and counter examples

    • Built-in Assertions - Simple tools for common checks

    • Basic Message Testing - Verify actors send the right messages

    hashtag
    Intermediate Skills (Next Steps)

    • Configuration Testing - Test environment-driven behavior

    • Complex Message Patterns - Handle sophisticated message flows

    • Basic Process Spawning - Test actor creation and lifecycle

    hashtag
    Advanced Features (When You Need Them)

    • Actor Termination - Test error handling and graceful shutdowns

    • Exit Signals - Manage process lifecycles in supervision trees

    • Scheduled Operations - Test cron jobs and time-based behavior

    hashtag
    Expert Level (Complex Scenarios)

    • Dynamic Value Capture - Handle generated IDs, timestamps, and random data

    • Complex Workflows - Test multi-step business processes

    • Performance & Load Testing - Verify behavior under stress

    Tip: The documentation follows this learning path. You can jump to advanced topics if needed, but starting from the beginning ensures you understand the foundations.

    hashtag
    Why Testing Actors is Different

    Traditional testing tools don't work well with actors. Here's why:

    hashtag
    The Challenge: Actors Are Not Functions

    Regular code testing follows a simple pattern:
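    For ordinary synchronous code, that pattern is simply "call, then assert". A generic Go illustration (not part of the Ergo API):

    ```go
    package calc

    import "testing"

    // Sum is a plain synchronous function.
    func Sum(a, b int) int { return a + b }

    // Testing it is one call followed by one assertion.
    func TestSum(t *testing.T) {
        if got := Sum(2, 3); got != 5 {
            t.Fatalf("expected 5, got %d", got)
        }
    }
    ```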

    But actors are fundamentally different:

    • They run asynchronously - you send a message and the response comes later

    • They maintain state - previous messages affect future behavior

    • They spawn other actors - creating complex hierarchies

    hashtag
    What Makes Actor Testing Hard

    1. Asynchronous Communication

    2. Message Flow Complexity

    3. Dynamic Process Creation

    4. State Changes Over Time

    hashtag
    How This Library Solves Actor Testing

    The Ergo Unit Testing Library addresses each of these challenges:

    hashtag
    Event Capture - See Everything Your Actor Does

    Instead of guessing what happened, the library automatically captures every actor operation:

    hashtag
    Fluent Assertions - Test What Matters

    Express your test intentions clearly:

    hashtag
    Dynamic Value Handling - Work With Generated Data

    Capture and reuse dynamically generated values:

    hashtag
    State Testing Through Behavior - Verify State Changes

    Test state indirectly by verifying behavioral changes:

    hashtag
    Why Zero Dependencies Matter

    Actor testing is complex enough without dependency management headaches:

    • No version conflicts - Works with any Go testing setup

    • No external tools - Everything needed is built-in

    • Simple imports - Just import "ergo.services/ergo/testing/unit"

    hashtag
    Core Concepts

    Now that you understand why actor testing is different, let's explore the key concepts that make this library work.

    hashtag
    The Event-Driven Testing Model

    Everything your actor does becomes a testable "event".

    When you run this simple test:
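    For instance, a test along these lines (the `greeter` actor here is hypothetical, and while `unit.Spawn`, `SendMessage`, and `ShouldSend` are the names used throughout this guide, exact signatures may differ in your version):

    ```go
    package example

    import (
        "testing"

        "ergo.services/ergo/act"
        "ergo.services/ergo/gen"
        "ergo.services/ergo/testing/unit"
    )

    // greeter replies "pong" to every message it receives.
    type greeter struct{ act.Actor }

    func createGreeter() gen.ProcessBehavior { return &greeter{} }

    func (g *greeter) HandleMessage(from gen.PID, message any) error {
        return g.Send(from, "pong")
    }

    func TestGreeterReplies(t *testing.T) {
        actor, err := unit.Spawn(t, createGreeter)
        if err != nil {
            t.Fatal(err)
        }
        actor.SendMessage(actor.PID(), "ping") // steps 1-2: receive and respond
        actor.ShouldSend().Message("pong")     // step 3: assert on the captured SendEvent
    }
    ```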

    Here's what happens behind the scenes:

    1. Your actor receives the message - Normal actor behavior

    2. Your actor sends a response - Normal actor behavior

    3. The library captures a SendEvent - Testing magic

    The library automatically captures these events:

    • SendEvent - When your actor sends a message

    • SpawnEvent - When your actor creates child processes

    • LogEvent - When your actor writes log messages

    hashtag
    Why Events Matter

    Events solve the fundamental challenge of testing asynchronous systems:

    Instead of this (impossible):

    You do this (works perfectly):

    hashtag
    The Fluent Assertion API

    The library provides a readable, chainable API that expresses test intentions clearly:
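    A sketch of what such a chain can look like, taken from inside a test body (the individual chain methods `To`, `Message`, and `Once` are assumptions based on this guide; check the package for the exact set):

    ```go
    // Inside a test body: chain only the validations you need.
    actor.ShouldSend().
        To(actor.PID()).      // destination of the message
        Message("task-done"). // payload that was sent
        Once()                // exactly one matching send
    ```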

    Benefits of the fluent API:

    • Readable - Tests read like English sentences

    • Discoverable - IDE autocomplete guides you through options

    • Flexible - Chain only the validations you need

    hashtag
    Installation

    hashtag
    Your First Actor Test

    Let's start with the simplest possible actor test to understand the basics:

    hashtag
    A Simple Echo Actor
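    A sketch of the actor, built on the standard `act.Actor` behavior (package name and file layout are illustrative):

    ```go
    package example

    import (
        "ergo.services/ergo/act"
        "ergo.services/ergo/gen"
    )

    // echo replies to the sender with the exact message it received.
    type echo struct {
        act.Actor
    }

    func createEcho() gen.ProcessBehavior {
        return &echo{}
    }

    func (e *echo) HandleMessage(from gen.PID, message any) error {
        return e.Send(from, message)
    }
    ```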

    hashtag
    Testing the Echo Actor
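    A sketch of the test, assuming the echo actor is exposed through a `createEcho` factory and using the fluent API names from this guide (exact signatures may differ):

    ```go
    func TestEchoRepliesWithSameMessage(t *testing.T) {
        // createEcho is the factory of the echo actor shown above.
        actor, err := unit.Spawn(t, createEcho)
        if err != nil {
            t.Fatal(err)
        }

        actor.SendMessage(actor.PID(), "hello")

        // The echo actor should have sent "hello" back.
        actor.ShouldSend().Message("hello")
    }
    ```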

    hashtag
    What Just Happened?

    This simple test demonstrates the core pattern:

    1. unit.Spawn() - Creates a test actor in an isolated environment

    2. actor.SendMessage() - Sends a message to your actor (as production code would)

    3. actor.ShouldSend() - Verifies that your actor sent the expected message

    Key insight: You're not testing internal state - you're testing behavior. You verify what the actor does (sends messages) rather than what it contains (internal variables).

    hashtag
    Why This Works

    The testing library automatically captures everything your actor does:

    • Every message sent by your actor

    • Every process spawned by your actor

    • Every log message written by your actor

    Then it provides fluent assertions to verify these captured events.

    hashtag
    Adding Slightly More Complexity

    Let's test an actor that maintains some state:
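    A sketch of a counting actor and its test; the state itself is never inspected, only the replies (API names follow this guide and may differ in your version):

    ```go
    package example

    import (
        "testing"

        "ergo.services/ergo/act"
        "ergo.services/ergo/gen"
        "ergo.services/ergo/testing/unit"
    )

    // counter keeps internal state and reports it on each "inc".
    type counter struct {
        act.Actor
        count int
    }

    func createCounter() gen.ProcessBehavior { return &counter{} }

    func (c *counter) HandleMessage(from gen.PID, message any) error {
        if message == "inc" {
            c.count++
            return c.Send(from, c.count)
        }
        return nil
    }

    func TestCounterIncrements(t *testing.T) {
        actor, err := unit.Spawn(t, createCounter)
        if err != nil {
            t.Fatal(err)
        }

        // The changing replies reveal the state without exposing it.
        actor.SendMessage(actor.PID(), "inc")
        actor.ShouldSend().Message(1)

        actor.SendMessage(actor.PID(), "inc")
        actor.ShouldSend().Message(2)
    }
    ```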

    This shows how you test stateful behavior without accessing internal state - by observing how the actor's responses change over time.

    hashtag
    Built-in Assertions

    Before diving into complex actor testing, let's cover the simple assertion utilities you'll use throughout your tests.

    Why Built-in Assertions Matter for Actor Testing:

    Actor tests often need to verify simple conditions alongside complex event assertions. Rather than forcing you to import external testing libraries (which could conflict with your project dependencies), the unit testing library provides everything you need:

    hashtag
    Available Assertions

    Equality Testing:

    Boolean Testing:

    Nil Testing:

    String Testing:

    Type Testing:
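    Hypothetical usage of the built-in helpers, grouped by the categories above. The names (`unit.Equal`, `unit.True`, `unit.Nil`, `unit.Contains`, `unit.IsType`) are an assumption about the package surface, so check the `unit` package for the exact set and signatures:

    ```go
    func TestBuiltinAssertions(t *testing.T) {
        unit.Equal(t, 42, 40+2)                  // equality testing
        unit.True(t, len("ok") == 2)             // boolean testing
        unit.Nil(t, error(nil))                  // nil testing
        unit.Contains(t, "hello world", "hello") // string testing
        unit.IsType(t, "", "any string value")   // type testing
    }
    ```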

    hashtag
    Why Zero Dependencies Matter

    No Import Conflicts:

    Consistent Error Messages: All assertions provide clear, consistent error messages that integrate well with the actor testing output.

    Framework Agnostic: Works with any Go testing setup - standard go test, IDE test runners, CI/CD systems, etc.

    hashtag
    Basic Message Testing

    Now that you understand the fundamentals, let's explore message testing in more depth.

    hashtag
    What Comes Next

    Now you'll learn how to test different aspects of actor behavior, building from simple to complex:

    Fundamentals (You're here!)

    • Basic message sending and receiving

    • Simple process creation

    • Logging and observability

    • Configuration testing

    Intermediate Skills

    • Complex message patterns

    • Event inspection and debugging

    • Actor lifecycle and termination

    • Error handling and recovery

    Advanced Features

    • Scheduled operations (cron jobs)

    • Network and distribution

    • Performance and load testing

    hashtag
    Basic Logging Testing

    Logging is crucial for production actors - it provides visibility into what your actors are doing and helps with debugging. Let's learn how to test logging behavior.

    hashtag
    Why Test Logging?

    Logging tests ensure:

    • Your actors provide sufficient information for monitoring

    • Debug information is available when needed

    • Log levels are respected (don't log debug in production)

    hashtag
    Simple Logging Test
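    A sketch of a logging actor and its test. `Log().Info()` is the standard process logger; the `ShouldLog` chain is an assumption based on the captured LogEvent described earlier:

    ```go
    package example

    import (
        "testing"

        "ergo.services/ergo/act"
        "ergo.services/ergo/gen"
        "ergo.services/ergo/testing/unit"
    )

    // reporter logs an info line for every task it handles.
    type reporter struct{ act.Actor }

    func createReporter() gen.ProcessBehavior { return &reporter{} }

    func (r *reporter) HandleMessage(from gen.PID, message any) error {
        r.Log().Info("task received: %v", message)
        return nil
    }

    func TestReporterLogs(t *testing.T) {
        actor, err := unit.Spawn(t, createReporter)
        if err != nil {
            t.Fatal(err)
        }
        actor.SendMessage(actor.PID(), "task-1")

        // Assert on the captured LogEvent.
        actor.ShouldLog().Level(gen.LogLevelInfo).Containing("task-1")
    }
    ```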

    hashtag
    Testing Different Log Levels

    hashtag
    Testing Log Content

    hashtag
    Logging Best Practices for Testing

    Structure your log messages to make them easy to test:

    Test log levels appropriately:

    • Error - Test that errors are logged when they occur

    • Warning - Test that concerning but non-fatal events are captured

    • Info - Test that important business events are recorded

    hashtag
    Intermediate Skills

    Now that you've mastered the basics, let's tackle more complex testing scenarios.

    hashtag
    Configuration and Environment Testing

    Real actors often behave differently based on configuration. Let's test this:

    The Spawn function creates an isolated testing environment for your actor. Unlike production actors that run in a complex node environment, test actors run in a controlled sandbox where every operation is captured for verification.

    Key Benefits:

    • Isolation: Each test actor runs independently without affecting other tests

    • Deterministic: Test outcomes are predictable and repeatable

    • Observable: All actor operations are automatically captured as events

    Example Actor:

    Test Implementation:
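    As a sketch of both pieces: a hypothetical actor that reads its limit from the environment, and a test that injects that environment via `WithEnv`. The `Env` lookup and the option shapes are assumptions; consult your version for exact signatures:

    ```go
    package example

    import (
        "testing"

        "ergo.services/ergo/act"
        "ergo.services/ergo/gen"
        "ergo.services/ergo/testing/unit"
    )

    // limiter reads its threshold from the process environment on startup.
    type limiter struct {
        act.Actor
        max int
    }

    func createLimiter() gen.ProcessBehavior { return &limiter{} }

    func (l *limiter) Init(args ...any) error {
        // Lookup shape is a sketch; the environment is injected by the test.
        if v, exists := l.Env("MAX_TASKS"); exists {
            l.max = v.(int)
        }
        return nil
    }

    func (l *limiter) HandleMessage(from gen.PID, message any) error {
        return l.Send(from, l.max)
    }

    func TestLimiterReadsEnv(t *testing.T) {
        actor, err := unit.Spawn(t, createLimiter,
            unit.WithEnv(map[gen.Env]any{"MAX_TASKS": 7}),
        )
        if err != nil {
            t.Fatal(err)
        }
        actor.SendMessage(actor.PID(), "max?")
        actor.ShouldSend().Message(7)
    }
    ```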

    hashtag
    Configuration Options - Fine-Tuning the Test Environment

    Test configuration allows you to simulate different runtime conditions without requiring complex setup:

    Environment Variables (WithEnv): Test how your actors behave with different configurations without changing production code. Useful for testing feature flags, database URLs, timeout values, and other configuration-driven behavior.

    Log Levels (WithLogLevel): Control the verbosity of test output and verify that your actors log appropriately at different levels. Critical for testing monitoring and debugging capabilities.

    Process Hierarchy (WithParent, WithRegister): Test actors that need to interact with parent processes or require specific naming for registration-based lookups.

    hashtag
    Message Testing

    hashtag
    ShouldSend() - Verifying Actor Communication

    Message testing is the heart of actor validation. Since actors communicate exclusively through messages, verifying message flow is crucial for ensuring correct behavior.

    Why Message Testing Matters:

    • Validates Integration: Ensures actors communicate correctly with their dependencies

    • Confirms Business Logic: Verifies that the right messages are sent in response to inputs

    • Detects Side Effects: Catches unintended message sends that could cause bugs

    Example Actor:

    Test Implementation:
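    As a sketch of both pieces: a hypothetical routing actor that forwards valid orders to a registered name, and a test that verifies destination, payload, and count (the `To`/`Message`/`Once` chain is an assumption based on this guide):

    ```go
    package example

    import (
        "testing"

        "ergo.services/ergo/act"
        "ergo.services/ergo/gen"
        "ergo.services/ergo/testing/unit"
    )

    type orderMsg struct {
        ID    string
        Total int64
    }

    // router forwards valid orders to a named billing process.
    type router struct{ act.Actor }

    func createRouter() gen.ProcessBehavior { return &router{} }

    func (r *router) HandleMessage(from gen.PID, message any) error {
        if order, ok := message.(orderMsg); ok && order.Total > 0 {
            return r.Send(gen.Atom("billing"), order)
        }
        return nil
    }

    func TestRouterForwardsOrders(t *testing.T) {
        actor, err := unit.Spawn(t, createRouter)
        if err != nil {
            t.Fatal(err)
        }

        actor.SendMessage(actor.PID(), orderMsg{ID: "o-1", Total: 100})

        // The right message went to the right destination, exactly once.
        actor.ShouldSend().
            To(gen.Atom("billing")).
            Message(orderMsg{ID: "o-1", Total: 100}).
            Once()
    }
    ```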

    hashtag
    Advanced Message Matching - Flexible Validation Patterns

    When testing complex message structures or dynamic content, the library provides powerful matching capabilities:
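    For example, validating structure while ignoring dynamic content might look like the following fragment from inside a test body. `MessageMatching` is an assumed name for the matcher-style assertion; your version may expose it differently:

    ```go
    type receipt struct {
        OrderID  string
        IssuedAt time.Time
    }

    // Assert on shape and the fields that matter, ignoring generated values.
    actor.ShouldSend().MessageMatching(func(msg any) bool {
        r, ok := msg.(receipt)
        return ok && r.OrderID == "o-42" // IssuedAt is dynamic, deliberately not compared
    })
    ```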

    Pattern Matching Benefits:

    • Partial Validation: Test only the fields that matter for your specific test case

    • Dynamic Content Handling: Validate messages with timestamps, UUIDs, or generated IDs

    • Type Safety: Ensure messages are of the correct type even when content varies

    hashtag
    Process Spawning

    hashtag
    ShouldSpawn() - Testing Process Lifecycle Management

    Process spawning is a fundamental actor pattern for building hierarchical systems. The testing library provides comprehensive tools for verifying that actors create, configure, and manage child processes correctly.

    Why Process Spawning Tests Matter:

    • Resource Management: Ensure actors don't spawn too many or too few processes

    • Configuration Propagation: Verify that child processes receive correct configuration

    • Error Handling: Test behavior when process spawning fails

    Example Actor:

    Test Implementation:
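    As a sketch of both pieces: a hypothetical pool manager that spawns one child per requested slot, and a test that counts the captured SpawnEvents (the `Times` chain method is an assumption):

    ```go
    package example

    import (
        "testing"

        "ergo.services/ergo/act"
        "ergo.services/ergo/gen"
        "ergo.services/ergo/testing/unit"
    )

    type startPool struct{ size int }

    type poolWorker struct{ act.Actor }

    func createPoolWorker() gen.ProcessBehavior { return &poolWorker{} }

    // manager spawns one child per requested pool slot.
    type manager struct{ act.Actor }

    func createManager() gen.ProcessBehavior { return &manager{} }

    func (m *manager) HandleMessage(from gen.PID, message any) error {
        if req, ok := message.(startPool); ok {
            for i := 0; i < req.size; i++ {
                if _, err := m.Spawn(createPoolWorker, gen.ProcessOptions{}); err != nil {
                    return err
                }
            }
        }
        return nil
    }

    func TestManagerSpawnsWorkers(t *testing.T) {
        actor, err := unit.Spawn(t, createManager)
        if err != nil {
            t.Fatal(err)
        }
        actor.SendMessage(actor.PID(), startPool{size: 3})

        // Three SpawnEvents should have been captured.
        actor.ShouldSpawn().Times(3)
    }
    ```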

    hashtag
    Dynamic Process Testing - Handling Generated Values

    Real-world actors often generate dynamic values like session IDs, request tokens, or timestamps. The library provides sophisticated tools for capturing and validating these dynamic values.

    Dynamic Value Testing Scenarios:

    • Session Management: Test actors that create sessions with generated IDs

    • Request Tracking: Verify that request tokens are properly generated and used

    • Time-based Operations: Validate actors that schedule work or create timestamps

    hashtag
    Remote Spawn Testing

    hashtag
    ShouldRemoteSpawn() - Testing Distributed Actor Creation

    Remote spawn testing allows you to verify that actors correctly create processes on remote nodes in a distributed system. The testing library captures RemoteSpawnEvent operations and provides fluent assertions for validation.

    Why Test Remote Spawning:

    • Distribution Logic: Ensure actors spawn processes on the correct remote nodes

    • Load Distribution: Verify round-robin or other distribution strategies work correctly

    • Error Handling: Test behavior when remote nodes are unavailable

    Example Actor:

    Test Implementation:
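    As a sketch of both pieces: a hypothetical dispatcher that spawns a named handler on a remote node, and a test asserting on the captured RemoteSpawnEvent. The remote-spawn call shape and the `Node`/`Name` chain methods are assumptions; only `ShouldRemoteSpawn` is taken from this guide:

    ```go
    package example

    import (
        "testing"

        "ergo.services/ergo/act"
        "ergo.services/ergo/gen"
        "ergo.services/ergo/testing/unit"
    )

    // dispatcher creates a handler process on a remote worker node.
    type dispatcher struct{ act.Actor }

    func createDispatcher() gen.ProcessBehavior { return &dispatcher{} }

    func (d *dispatcher) HandleMessage(from gen.PID, message any) error {
        remote, err := d.Node().Network().GetNode("worker@host")
        if err != nil {
            return err
        }
        // "handler" must be registered as a spawnable factory on the remote node.
        _, err = remote.Spawn("handler", gen.ProcessOptions{})
        return err
    }

    func TestDispatcherSpawnsRemotely(t *testing.T) {
        actor, err := unit.Spawn(t, createDispatcher)
        if err != nil {
            t.Fatal(err)
        }
        actor.SendMessage(actor.PID(), "dispatch")

        // A RemoteSpawnEvent should have been captured.
        actor.ShouldRemoteSpawn().Node("worker@host").Name("handler")
    }
    ```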

    Advanced Remote Spawn Patterns:

    • Multi-Node Distribution: Test round-robin or other distribution strategies across multiple nodes

    • Error Scenarios: Verify proper error handling when nodes are unavailable

    • Event Inspection: Direct inspection of RemoteSpawnEvent for detailed validation

    hashtag
    Actor Termination Testing

    hashtag
    ShouldTerminate() - Testing Actor Lifecycle Completion

    Actor termination is a critical aspect of actor systems. Actors can terminate for various reasons: normal completion, explicit shutdown, or errors. The testing library provides comprehensive tools for validating termination behavior and ensuring proper cleanup.

    Why Test Actor Termination:

    • Resource Cleanup: Ensure actors properly clean up resources when terminating

    • Error Propagation: Verify that errors are handled correctly and lead to appropriate termination

    • Graceful Shutdown: Test that actors respond correctly to shutdown signals

    Termination Reasons:

    • gen.TerminateReasonNormal - Normal completion of actor work

    • gen.TerminateReasonShutdown - Graceful shutdown request

    • Custom errors - Abnormal termination due to specific errors

    Example Actor:

    Test Implementation:
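    As a sketch of both pieces: an actor that terminates by returning a reason from its handler (`gen.TerminateReasonNormal`, `gen.TerminateReasonShutdown`, and custom errors, as listed above), plus a test for the abnormal case. The `WithReason` chain method is an assumption:

    ```go
    package example

    import (
        "errors"
        "testing"

        "ergo.services/ergo/act"
        "ergo.services/ergo/gen"
        "ergo.services/ergo/testing/unit"
    )

    var errBadInput = errors.New("bad input")

    // task terminates itself depending on the message it receives.
    type task struct{ act.Actor }

    func createTask() gen.ProcessBehavior { return &task{} }

    func (a *task) HandleMessage(from gen.PID, message any) error {
        switch message {
        case "done":
            return gen.TerminateReasonNormal // normal completion
        case "stop":
            return gen.TerminateReasonShutdown // graceful shutdown
        default:
            return errBadInput // abnormal termination
        }
    }

    func TestTaskTerminatesOnBadInput(t *testing.T) {
        actor, err := unit.Spawn(t, createTask)
        if err != nil {
            t.Fatal(err)
        }
        actor.SendMessage(actor.PID(), "oops")

        // The captured termination carries the custom error.
        actor.ShouldTerminate().WithReason(errBadInput)
    }
    ```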

    Advanced Termination Patterns:

    hashtag
    Exit Signal Testing

    hashtag
    ShouldSendExit() - Testing Graceful Process Termination

    Exit signals (SendExit and SendExitMeta) are used to gracefully terminate other processes. This is different from actor self-termination - it's about one actor telling another to exit. The testing library provides comprehensive assertions for validating exit signal behavior.

    Why Test Exit Signals:

    • Graceful Shutdown: Ensure supervisors can properly terminate child processes

    • Resource Cleanup: Verify that exit signals trigger proper cleanup in target processes

    • Error Propagation: Test that failure conditions are communicated via exit signals

    Example Actor:

    Test Implementation:
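    As a sketch of both pieces: a supervisor-like actor that asks a child to exit, and a test asserting on the captured exit signal. `SendExit` is the standard process API; the `WithReason` chain method is an assumption:

    ```go
    package example

    import (
        "testing"

        "ergo.services/ergo/act"
        "ergo.services/ergo/gen"
        "ergo.services/ergo/testing/unit"
    )

    // shutdowner tells its child process to exit on request.
    type shutdowner struct {
        act.Actor
        child gen.PID // set when the worker is spawned (omitted in this sketch)
    }

    func createShutdowner() gen.ProcessBehavior { return &shutdowner{} }

    func (s *shutdowner) HandleMessage(from gen.PID, message any) error {
        if message == "shutdown" {
            return s.SendExit(s.child, gen.TerminateReasonShutdown)
        }
        return nil
    }

    func TestShutdownSendsExit(t *testing.T) {
        actor, err := unit.Spawn(t, createShutdowner)
        if err != nil {
            t.Fatal(err)
        }
        actor.SendMessage(actor.PID(), "shutdown")

        // The exit signal and its reason were captured.
        actor.ShouldSendExit().WithReason(gen.TerminateReasonShutdown)
    }
    ```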

    hashtag
    Exit Signal Testing Methods

    Basic Exit Signal Assertions:

    Advanced Exit Signal Patterns:

    hashtag
    Cron Testing

    hashtag
    ShouldAddCronJob(), ShouldExecuteCronJob() - Testing Scheduled Operations

    Cron job testing allows you to validate scheduled operations in your actors without waiting for real time to pass. The testing library provides comprehensive mock time support and detailed cron job lifecycle management.

    Why Test Cron Jobs:

    • Schedule Validation: Ensure cron expressions are correct and jobs run at expected times

    • Job Management: Test job addition, removal, enabling, and disabling operations

    • Execution Logic: Verify that scheduled operations perform correctly when triggered

    Cron Testing Features:

    • Mock Time Support: Control time flow for deterministic testing

    • Job Lifecycle Testing: Validate job creation, scheduling, execution, and cleanup

    • Event Tracking: Monitor all cron-related operations and state changes

    Example Actor:

    Test Implementation:
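    A sketch of such a test. Only `ShouldAddCronJob` and `ShouldExecuteCronJob` are taken from this guide; `createReporter` stands for a hypothetical scheduled actor, and `AdvanceTime` plus the `Name`/`Spec` chain methods are assumptions about the mock-time API:

    ```go
    package example

    import (
        "testing"
        "time"

        "ergo.services/ergo/testing/unit"
    )

    func TestNightlyReport(t *testing.T) {
        // createReporter is the scheduled actor under test (hypothetical).
        actor, err := unit.Spawn(t, createReporter)
        if err != nil {
            t.Fatal(err)
        }

        // Init is expected to have registered the job.
        actor.ShouldAddCronJob().Name("nightly-report").Spec("0 3 * * *")

        // Advance mock time instead of waiting until 03:00.
        actor.AdvanceTime(24 * time.Hour)
        actor.ShouldExecuteCronJob().Name("nightly-report")
    }
    ```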

    hashtag
    Cron Testing Methods

    Job Lifecycle Assertions:

    Mock Time Control:

    Advanced Cron Patterns:

    hashtag
    Built-in Assertions

    The library includes a comprehensive set of zero-dependency assertion functions that cover common testing scenarios without requiring external testing frameworks:

    Why Built-in Assertions:

    • Zero Dependencies: Avoid version conflicts and complex dependency management

    • Consistent Interface: All assertions follow the same pattern and error reporting

    • Testing Framework Agnostic: Works with any Go testing approach

    hashtag
    Advanced Features

    hashtag
    Dynamic Value Capture - Testing Generated Content

    Real-world actors frequently generate dynamic values like timestamps, UUIDs, session IDs, or auto-incrementing counters. Traditional testing approaches struggle with these values because they're unpredictable. The library provides sophisticated capture mechanisms to handle these scenarios elegantly.

    The Challenge of Dynamic Values:

    • Timestamps: Created at runtime, impossible to predict exact values

    • UUIDs: Randomly generated, different in every test run

    • Auto-incrementing IDs: Dependent on execution order and system state

    The Solution - Value Capture:

    Capture Strategies:

    • Immediate Capture: Capture values as soon as they're generated

    • Pattern Matching: Use validation functions to identify and validate dynamic content

    • Structured Matching: Validate message structure while ignoring specific dynamic fields

    hashtag
    Event Inspection - Deep System Analysis

    For complex testing scenarios or debugging difficult issues, the library provides direct access to the complete event timeline. This allows you to perform sophisticated analysis of actor behavior beyond what's possible with standard assertions.

    hashtag
    Events() - Complete Event History

    Access all captured events for detailed analysis:

    hashtag
    LastEvent() - Most Recent Operation

    Get the most recently captured event:

    hashtag
    ClearEvents() - Reset Event History

    Clear all captured events, useful for isolating test phases:

    Event Inspection Use Cases:

    • Performance Analysis: Count operations to identify performance bottlenecks

    • Workflow Validation: Ensure complex multi-step processes execute in the correct order

    • Error Investigation: Analyze the complete event sequence leading to failures

    • Integration Testing: Verify that multiple actors interact correctly in complex scenarios

    hashtag
    Timeout Support - Assertion Timing Control

    The library provides timeout support for assertions that might need time-based validation:

    Timeout Function Usage:

    • Assertion Wrapping: Wrap assertion functions to add timeout behavior

    • Integration Testing: Useful when testing with external systems that might have delays

    • Performance Validation: Ensure assertions complete within expected time limits

    hashtag
    Testing Patterns and Best Practices

    hashtag
    Test Organization Strategies

    Single Responsibility Testing: Each test should focus on one specific behavior or scenario. This makes tests easier to understand, debug, and maintain.

    State Isolation: Each test should start with a clean state and not depend on other tests. Use actor.ClearEvents() when needed to reset event history between test phases.

    Error Path Testing: Don't just test the happy path. Actor systems need robust error handling, so test failure scenarios thoroughly:

    hashtag
    Message Design for Testability

    Structured Messages: Design your messages to be easily testable by using structured types rather than primitive values:

    Predictable vs Dynamic Content: Separate predictable content from dynamic content in your messages to make testing easier:

    hashtag
    Performance Testing Considerations

    Event Overhead: While event capture is lightweight, be aware that every operation creates events. For performance-critical tests, you can:

    • Clear events periodically with ClearEvents()

    • Focus assertions on specific time windows

    • Use event inspection to identify performance bottlenecks

    Scaling Testing: Test how your actors behave under load by simulating multiple concurrent operations:

    hashtag
    Best Practices

    1. Use descriptive test names that clearly indicate what behavior is being tested

    2. Test all message types your actor handles, including edge cases

    3. Capture dynamic values early using the Capture() method for generated IDs

    4. Test error conditions, not just the happy path

    5. Use pattern matching for complex message validation

    6. Clear events between test phases when needed with ClearEvents()

    7. Configure appropriate log levels for debugging vs production testing

    8. Test temporal behaviors with timeout mechanisms

    9. Validate distributed scenarios using network simulation

    10. Organize tests by behavior rather than by implementation details

    This testing library provides comprehensive coverage for all Ergo Framework actor patterns while maintaining zero external dependencies and excellent readability. By following these patterns and practices, you can build robust, well-tested actor systems that behave correctly in both simple and complex scenarios.

    hashtag
    Complete Examples and Use Cases

    The library includes comprehensive test examples organized into feature-specific files that demonstrate all capabilities through real-world scenarios:

    hashtag
    Feature-Based Test Files

    basic_test.go - Fundamental Actor Testing

    • Basic actor functionality and message handling

    • Dynamic value capture and validation

    • Built-in assertions and event tracking

    • Core testing patterns and best practices

    network_test.go - Distributed System Testing

    • Remote node simulation and connectivity

    • Network configuration and route management

    • Remote spawn operations and event capture

    • Multi-node interaction patterns

    workflow_test.go - Complex Business Logic

    • Multi-step order processing workflows

    • State machine validation and transitions

    • Business process orchestration

    • Error handling and recovery scenarios

    call_test.go - Synchronous Communication

    • Call operations and response handling

    • Async call patterns and timeouts

    • Send/response communication flows

    • Concurrent request management

    cron_test.go - Scheduled Operations

    • Cron job lifecycle management

    • Mock time control and schedule validation

    • Job execution tracking and assertions

    • Time-dependent behavior testing

    termination_test.go - Actor Lifecycle Management

    • Actor termination handling and cleanup

    • Exit signal testing (SendExit/SendExitMeta)

    • Normal vs abnormal termination scenarios

    • Resource cleanup validation

    hashtag
    Comprehensive Test Examples

    1. Complex State Machine Testing (workflow_test.go)

      • Multi-step order processing workflow

      • Validation, payment, and fulfillment pipeline

      • State transition validation and error handling

    2. Process Management (basic_test.go)

      • Dynamic worker spawning and management

      • Resource capacity limits and monitoring

      • Worker lifecycle (start, stop, restart)

    3. Advanced Pattern Matching (basic_test.go)

      • Structure matching with partial validation

      • Dynamic value handling and field validation

      • Complex conditional message matching

    4. Remote Spawn Testing (network_test.go)

      • Remote spawn operations on multiple nodes

      • Round-robin distribution testing

      • Error handling for unavailable nodes

      • Event inspection and workflow validation

    5. Cron Job Management (cron_test.go)

      • Job scheduling and execution validation

      • Mock time control for deterministic testing

      • Schedule expression testing and validation

    6. Actor Termination (termination_test.go)

      • Normal and abnormal termination scenarios

      • Exit signal testing and process cleanup

      • Termination reason validation

      • Post-termination behavior verification

    7. Concurrent Operations (call_test.go)

      • Multi-client concurrent request handling

      • Resource contention and capacity management

      • Load testing and performance validation

    8. Environment & Configuration (basic_test.go)

      • Environment variable management

      • Runtime configuration changes

      • Feature flag and conditional behavior testing

    hashtag
    Getting Started with Examples

    hashtag
    Learning Path

    1. Start with Basic Examples: basic_test.go - Core functionality and patterns

    2. Explore Message Testing: basic_test.go - Message flow and assertions

    3. Learn Process Management: basic_test.go - Spawn operations and lifecycle

    4. Master Synchronous Communication: call_test.go - Calls and responses

    5. Study Complex Workflows: workflow_test.go - Business logic testing

    6. Practice Network Testing: network_test.go - Distributed operations

    7. Explore Scheduling: cron_test.go - Time-based operations

    8. Understand Termination: termination_test.go - Lifecycle completion

    Each test file provides complete, working implementations of specific actor patterns and demonstrates best practices for testing each scenario. All tests include comprehensive comments explaining the testing strategy and validation approach.

    hashtag
    Configuration and Environment Testing

    Real actors often behave differently based on configuration. Let's test this:

    hashtag
    Complex Message Patterns

    As your actors become more sophisticated, your message testing needs to handle more complex scenarios:

    hashtag
    Testing Message Sequences

    hashtag
    Testing Conditional Logic

    hashtag
    Basic Process Spawning

    Many actors need to create child processes. Here's how to test this:

    hashtag
    Capturing Dynamic Process IDs

    When actors spawn processes, you often need to use the generated PID in subsequent tests:

    hashtag
    Event Inspection for Debugging

    When tests fail, you need to understand what actually happened:

    hashtag
    Failure Injection Testing

    hashtag
    Overview

    The Ergo Unit Testing Library includes a failure injection system that allows you to test how your actors handle various error conditions. This is essential for building robust actor systems that can gracefully handle failures in production.

    hashtag
    Method Failure Injection

    Access failure injection through the actor's Process() method:

    hashtag
    Available Failure Methods

    The failure injection system provides several methods on TestProcess:

    hashtag
    Common Use Cases

    hashtag
    Testing Spawn Failures

    hashtag
    Testing Message Send Failures

    hashtag
    Testing Intermittent Failures

    hashtag
    Testing Pattern-Based Failures

    hashtag
    Testing One-Time Failures

    hashtag
    Advanced Testing Scenarios

    hashtag
    Testing Supervisor Restart Strategies

    hashtag
    Testing Method Call Tracking

    hashtag
    Best Practices

    1. Clear Events Between Test Phases: Use ClearEvents() when transitioning between test phases to avoid assertion confusion.

    2. Test Recovery: Always test that your actors can recover after failures are cleared or when using one-time failures.

    3. Verify Call Counts: Use GetMethodCallCount() to ensure methods are called the expected number of times.

    4. Pattern Matching: Use pattern-based failures to test scenarios where only specific inputs should fail.

    5. Combine with Supervision: Test how supervisors handle child failures by injecting spawn failures during restart attempts.

    hashtag
    Common Pitfalls

    1. Event Accumulation: Events accumulate across multiple operations. Use ClearEvents() to reset between test phases.

    2. Timing Issues: Some assertions may need time to complete. Use appropriate timeouts and consider async patterns.

    3. Message Ordering: In high-throughput scenarios, message ordering might not be guaranteed. Test for this explicitly.

    4. State Leakage: Each test should start with clean state. Don't rely on previous test state.

    5. Failure Persistence: Remember that SetMethodFailure persists until cleared, while SetMethodFailureOnce only fails once.

    hashtag
    Conclusion

    The Ergo Framework unit testing library provides comprehensive tools for testing actor-based systems. From simple message exchanges to complex distributed workflows, you can validate every aspect of your actor behavior with confidence.

    Key Takeaways:

    • Start Simple: Begin with basic message testing and gradually add complexity

    • Test Comprehensively: Cover happy paths, error conditions, and edge cases

    • Use Fluent Assertions: Take advantage of the readable assertion API

    • Inspect Events: Use event inspection for debugging and understanding actor behavior

    • Organize Tests: Structure tests by behavior and keep them focused

    • Handle Async Patterns: Use appropriate timeouts and pattern matching for async operations

    The library's zero-dependency design, comprehensive feature set, and integration with Go's testing framework make it the ideal choice for building robust, well-tested actor systems with the Ergo Framework.

    Next Steps:

    1. Explore the complete test examples in the framework repository

    2. Start with simple actors and gradually build complexity

    3. Integrate testing into your development workflow

    4. Use the debugging features when tests fail

    5. Share testing patterns with your team

    Happy testing!

    // Traditional testing - call function, check result
    result := calculateTax(income, rate)
    assert.Equal(t, 1500.0, result)
    // This doesn't work with actors:
    actor.SendMessage("process_order")
    result := actor.GetResult() // No direct way to get result
    // An actor might send multiple messages to different targets:
    actor.SendMessage("start_workflow")
    // How do you verify it sent the right messages to the right places?
    // Actors spawn other actors with generated IDs:
    actor.SendMessage("create_worker")
    // How do you test the spawned worker when you don't know its PID?
    // Actor behavior changes based on message history:
    actor.SendMessage("login", user1)
    actor.SendMessage("login", user2)  
    actor.SendMessage("get_users")
    // How do you verify the internal state without breaking encapsulation?
    actor.SendMessage("process_order")
    // Library automatically captures:
    // - What messages were sent
    // - Which processes were spawned  
    // - What was logged
    // - When the actor terminated
    actor.SendMessage("create_user", userData)
    actor.ShouldSend().To("database").Message(SaveUser{...}).Once().Assert()
    actor.ShouldSpawn().Factory(userWorkerFactory).Once().Assert()
    actor.ShouldLog().Level(Info).Containing("User created").Assert()
    actor.SendMessage("create_session")
    sessionResult := actor.ShouldSpawn().Once().Capture()
    sessionPID := sessionResult.PID // Use the actual generated PID in further tests
    actor.SendMessage("login", user1)
    actor.SendMessage("get_status")
    actor.ShouldSend().To(user1).Message(StatusResponse{LoggedIn: true}).Assert()
    actor.SendMessage(sender, "hello")
    actor.ShouldSend().To(sender).Message("hello").Assert()
    actor.SendMessage("process_order")
    result := actor.WaitForResult() // Actors don't work this way
    actor.SendMessage("process_order")
    // Verify the actor did what it should do:
    actor.ShouldSend().To("database").Message(SaveOrder{...}).Assert()
    actor.ShouldSend().To("inventory").Message(CheckStock{...}).Assert()
    actor.ShouldLog().Level(Info).Containing("Processing order").Assert()
    // Basic pattern: Actor.Should[Action]().Details().Assert()
    
    actor.ShouldSend().To(recipient).Message(content).Once().Assert()
    actor.ShouldSpawn().Factory(workerFactory).Times(3).Assert()
    actor.ShouldLog().Level(Info).Containing("started").Assert()
    actor.ShouldTerminate().WithReason(normalShutdown).Assert()
    go get ergo.services/ergo/testing/unit
    package main
    
    import (
        "testing"
        "ergo.services/ergo/act"
        "ergo.services/ergo/gen"
        "ergo.services/ergo/testing/unit"
    )
    
    // EchoActor - receives a message and sends it back
    type EchoActor struct {
        act.Actor
    }
    
    func (e *EchoActor) HandleMessage(from gen.PID, message any) error {
        // Simply echo the message back to sender
        e.Send(from, message)
        return nil
    }
    
    // Factory function to create the actor
    func newEchoActor() gen.ProcessBehavior {
        return &EchoActor{}
    }
    func TestEchoActor_BasicBehavior(t *testing.T) {
        // 1. Create a test actor
        actor, err := unit.Spawn(t, newEchoActor)
        if err != nil {
            t.Fatal(err)
        }
    
        // 2. Create a sender PID (who is sending the message)
        sender := gen.PID{Node: "test", ID: 123}
    
        // 3. Send a message to the actor
        actor.SendMessage(sender, "hello world")
    
        // 4. Verify the actor sent the message back
        actor.ShouldSend().
            To(sender).                    // Should send to the original sender
            Message("hello world").        // Should send back the same message
            Once().                        // Should happen exactly once
            Assert()                       // Check that it actually happened
    }
    type CounterActor struct {
        act.Actor
        count int
    }
    
    func (c *CounterActor) HandleMessage(from gen.PID, message any) error {
        switch message {
        case "increment":
            c.count++
            c.Send(from, c.count)
        case "get":
            c.Send(from, c.count)
        case "reset":
            c.count = 0
            c.Send(from, "reset complete")
        }
        return nil
    }
    
    func TestCounterActor_StatefulBehavior(t *testing.T) {
        actor, _ := unit.Spawn(t, func() gen.ProcessBehavior { return &CounterActor{} })
        client := gen.PID{Node: "test", ID: 456}
    
        // Test incrementing
        actor.SendMessage(client, "increment")
        actor.ShouldSend().To(client).Message(1).Once().Assert()
    
        actor.SendMessage(client, "increment")
        actor.ShouldSend().To(client).Message(2).Once().Assert()
    
        // Test getting current value
        actor.SendMessage(client, "get")
        actor.ShouldSend().To(client).Message(2).Once().Assert()
    
        // Test reset
        actor.SendMessage(client, "reset")
        actor.ShouldSend().To(client).Message("reset complete").Once().Assert()
    
        // Verify reset worked
        actor.SendMessage(client, "get")
        actor.ShouldSend().To(client).Message(0).Once().Assert()
    }
    func TestActorWithBuiltInAssertions(t *testing.T) {
        actor, _ := unit.Spawn(t, newEchoActor)
        
        // Use built-in assertions for simple checks
        unit.NotNil(t, actor, "Actor should be created successfully")
        unit.Equal(t, false, actor.IsTerminated(), "New actor should not be terminated")
        
        // Combine with actor-specific assertions
        actor.SendMessage(gen.PID{Node: "test", ID: 1}, "hello")
        actor.ShouldSend().Message("hello").Once().Assert()
    }
    unit.Equal(t, expected, actual)        // Values must be equal
    unit.NotEqual(t, unexpected, actual)   // Values must be different
    unit.True(t, condition)               // Condition must be true
    unit.False(t, condition)              // Condition must be false
    unit.Nil(t, value)                    // Value must be nil
    unit.NotNil(t, value)                 // Value must not be nil
    unit.Contains(t, "hello world", "world")  // String must contain substring
    unit.IsType(t, "", actualValue)       // Value must be of specific type
    // This could cause version conflicts:
    import "github.com/stretchr/testify/assert"
    import "github.com/other/testing/lib"
    
    // This always works:
    import "ergo.services/ergo/testing/unit"
    func TestGreeter_LogsWelcomeMessage(t *testing.T) {
        actor, _ := unit.Spawn(t, newGreeter, unit.WithLogLevel(gen.LogLevelInfo))
        
        actor.SendMessage(gen.PID{}, Welcome{Name: "Alice"})
        
        // Verify the actor logged the welcome
        actor.ShouldLog().
            Level(gen.LogLevelInfo).
            Containing("Welcome Alice").
            Once().
            Assert()
    }
    func TestDataProcessor_LogLevels(t *testing.T) {
        actor, _ := unit.Spawn(t, newDataProcessor, unit.WithLogLevel(gen.LogLevelDebug))
        
        actor.SendMessage(gen.PID{}, ProcessData{Data: "sample"})
        
        // Should log at info level for important events
        actor.ShouldLog().Level(gen.LogLevelInfo).Containing("Processing started").Once().Assert()
        
        // Should log at debug level for detailed info
        actor.ShouldLog().Level(gen.LogLevelDebug).Containing("Processing sample data").Once().Assert()
        
        // Should never log at error level for normal operations
        actor.ShouldLog().Level(gen.LogLevelError).Times(0).Assert()
    }
    func TestAuditLogger_SecurityEvents(t *testing.T) {
        actor, _ := unit.Spawn(t, newAuditLogger)
        
        actor.SendMessage(gen.PID{}, LoginAttempt{User: "admin", Success: false})
        
        // Verify security events are properly logged
        actor.ShouldLog().MessageMatching(func(msg string) bool {
            return strings.Contains(msg, "SECURITY") && 
                   strings.Contains(msg, "admin") && 
                   strings.Contains(msg, "failed")
        }).Once().Assert()
    }
    // Good: Structured, predictable format
    log.Info("User login: user=%s success=%t", userID, success)
    
    // Poor: Hard to test reliably  
    log.Info("User " + userID + " tried to login and it " + result)
    type messageCounter struct {
        act.Actor
        count int
    }
    
    func (m *messageCounter) Init(args ...any) error {
        m.count = 0
        m.Log().Info("Counter initialized")
        return nil
    }
    
    func (m *messageCounter) HandleMessage(from gen.PID, message any) error {
        // Value switch: the string messages are compared directly.
        // (A type switch cannot have value cases like "increment".)
        switch message {
        case "increment":
            m.count++
            m.Send("output", CountChanged{Count: m.count})
            m.Log().Debug("Count incremented to %d", m.count)
            return nil
        case "get_count":
            m.Send(from, CountResponse{Count: m.count})
            return nil
        case "reset":
            m.count = 0
            m.Send("output", CountReset{})
            return nil
        }
        return nil
    }
    
    type CountChanged struct{ Count int }
    type CountResponse struct{ Count int }
    type CountReset struct{}
    
    func factoryMessageCounter() gen.ProcessBehavior {
        return &messageCounter{}
    }
    func TestMessageCounter_BasicUsage(t *testing.T) {
        // Create test actor with configuration
        actor, err := unit.Spawn(t, factoryMessageCounter,
            unit.WithLogLevel(gen.LogLevelDebug),
            unit.WithEnv(map[gen.Env]any{
                "test_mode": true,
                "timeout":   30,
            }),
        )
        if err != nil {
            t.Fatal(err)
        }
    
        // Test initialization
        actor.ShouldLog().Level(gen.LogLevelInfo).Containing("Counter initialized").Once().Assert()
    
        // Test message handling
        actor.SendMessage(gen.PID{}, "increment")
        actor.ShouldSend().To("output").Message(CountChanged{Count: 1}).Once().Assert()
        actor.ShouldLog().Level(gen.LogLevelDebug).Containing("Count incremented to 1").Once().Assert()
    
        // Test state query
        actor.SendMessage(gen.PID{Node: "test", ID: 123}, "get_count")
        actor.ShouldSend().To(gen.PID{Node: "test", ID: 123}).Message(CountResponse{Count: 1}).Once().Assert()
    
        // Test reset
        actor.SendMessage(gen.PID{}, "reset")
        actor.ShouldSend().To("output").Message(CountReset{}).Once().Assert()
    }
    // Available options for unit.Spawn()
    unit.WithLogLevel(gen.LogLevelDebug)                    // Set log level
    unit.WithEnv(map[gen.Env]any{"key": "value"})          // Environment variables
    unit.WithParent(gen.PID{Node: "parent", ID: 100})      // Parent process
    unit.WithRegister(gen.Atom("registered_name"))         // Register with name
    unit.WithNodeName(gen.Atom("test_node@localhost"))     // Node name
    type notificationService struct {
        act.Actor
        subscribers []gen.PID
    }
    
    func (n *notificationService) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case Subscribe:
            n.subscribers = append(n.subscribers, msg.PID)
            n.Send(msg.PID, SubscriptionConfirmed{})
            return nil
        case Broadcast:
            for _, subscriber := range n.subscribers {
                n.Send(subscriber, Notification{
                    ID:      msg.ID,
                    Message: msg.Message,
                    Sender:  from,
                })
            }
            n.Send("analytics", BroadcastSent{
                ID:          msg.ID,
                Subscribers: len(n.subscribers),
            })
            return nil
        }
        return nil
    }
    
    type Subscribe struct{ PID gen.PID }
    type SubscriptionConfirmed struct{}
    type Broadcast struct{ ID string; Message string }
    type Notification struct{ ID, Message string; Sender gen.PID }
    type BroadcastSent struct{ ID string; Subscribers int }
    func TestNotificationService_MessageSending(t *testing.T) {
        actor, _ := unit.Spawn(t, factoryNotificationService)
    
        subscriber1 := gen.PID{Node: "test", ID: 101}
        subscriber2 := gen.PID{Node: "test", ID: 102}
    
        // Test subscription
        actor.SendMessage(gen.PID{}, Subscribe{PID: subscriber1})
        actor.SendMessage(gen.PID{}, Subscribe{PID: subscriber2})
    
        // Verify subscription confirmations
        actor.ShouldSend().To(subscriber1).Message(SubscriptionConfirmed{}).Once().Assert()
        actor.ShouldSend().To(subscriber2).Message(SubscriptionConfirmed{}).Once().Assert()
    
        // Test broadcast
        broadcaster := gen.PID{Node: "test", ID: 200}
        actor.SendMessage(broadcaster, Broadcast{ID: "msg-123", Message: "Hello World"})
    
        // Verify notifications sent to all subscribers
        actor.ShouldSend().To(subscriber1).MessageMatching(func(msg any) bool {
            if notif, ok := msg.(Notification); ok {
                return notif.ID == "msg-123" && 
                       notif.Message == "Hello World" &&
                       notif.Sender == broadcaster
            }
            return false
        }).Once().Assert()
    
        actor.ShouldSend().To(subscriber2).MessageMatching(func(msg any) bool {
            if notif, ok := msg.(Notification); ok {
                return notif.ID == "msg-123" && notif.Message == "Hello World"
            }
            return false
        }).Once().Assert()
    
        // Verify analytics
        actor.ShouldSend().To("analytics").Message(BroadcastSent{
            ID:          "msg-123",
            Subscribers: 2,
        }).Once().Assert()
    
        // Test multiple sends to same target
        actor.SendMessage(broadcaster, Broadcast{ID: "msg-124", Message: "Second message"})
        actor.ShouldSend().To("analytics").Times(2).Assert() // Total of 2 analytics messages
    }
    // Message type matching
    actor.ShouldSend().MessageMatching(unit.IsTypeGeneric[CountChanged]()).Assert()
    
    // Field-based matching
    actor.ShouldSend().MessageMatching(unit.HasField("Count", unit.Equals(5))).Assert()
    
    // Structure matching with custom field validation
    actor.ShouldSend().MessageMatching(
        unit.StructureMatching(Notification{}, map[string]unit.Matcher{
            "ID":      unit.Equals("msg-123"),
            "Sender":  unit.IsValidPID(),
        }),
    ).Assert()
    
    // Never sent verification
    actor.ShouldNotSend().To("error_handler").Message("error").Assert()
    type workerSupervisor struct {
        act.Actor
        workers    map[string]gen.PID
        maxWorkers int
    }
    
    func (w *workerSupervisor) Init(args ...any) error {
        w.workers = make(map[string]gen.PID)
        w.maxWorkers = 3
        return nil
    }
    
    func (w *workerSupervisor) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case StartWorker:
            if len(w.workers) >= w.maxWorkers {
                w.Send(from, WorkerError{Error: "max workers reached"})
                return nil
            }
    
            // Spawn worker with dynamic name
            workerPID, err := w.Spawn(factoryWorker, gen.ProcessOptions{}, msg.WorkerID)
            if err != nil {
                w.Send(from, WorkerError{Error: err.Error()})
                return nil
            }
    
            w.workers[msg.WorkerID] = workerPID
            w.Send(from, WorkerStarted{WorkerID: msg.WorkerID, PID: workerPID})
            w.Send("monitor", SupervisorStatus{
                ActiveWorkers: len(w.workers),
                MaxWorkers:    w.maxWorkers,
            })
            return nil
    
        case StopWorker:
            if pid, exists := w.workers[msg.WorkerID]; exists {
                w.SendExit(pid, gen.TerminateReasonShutdown)
                delete(w.workers, msg.WorkerID)
                w.Send(from, WorkerStopped{WorkerID: msg.WorkerID})
            }
            return nil
    
        case StopAllWorkers:
            // Capture the count before the map is emptied
            stopped := len(w.workers)
            for workerID, pid := range w.workers {
                w.SendExit(pid, gen.TerminateReasonShutdown)
                delete(w.workers, workerID)
            }
            w.Send(from, AllWorkersStopped{Count: stopped})
            return nil
        }
        return nil
    }
    
    type StartWorker struct{ WorkerID string }
    type StopWorker struct{ WorkerID string }
    type StopAllWorkers struct{}
    type WorkerStarted struct{ WorkerID string; PID gen.PID }
    type WorkerStopped struct{ WorkerID string }
    type WorkerError struct{ Error string }
    type AllWorkersStopped struct{ Count int }
    type SupervisorStatus struct{ ActiveWorkers, MaxWorkers int }
    
    func factoryWorker() gen.ProcessBehavior { return &worker{} }
    func factoryWorkerSupervisor() gen.ProcessBehavior { return &workerSupervisor{} }
    
    type worker struct{ act.Actor }
    func (w *worker) HandleMessage(from gen.PID, message any) error { return nil }
    func TestWorkerSupervisor_SpawnManagement(t *testing.T) {
        actor, _ := unit.Spawn(t, factoryWorkerSupervisor)
        client := gen.PID{Node: "test", ID: 999}
    
        // Test worker spawning
        actor.SendMessage(client, StartWorker{WorkerID: "worker-1"})
    
        // Capture the spawn event to get the PID
        spawnResult := actor.ShouldSpawn().Factory(factoryWorker).Once().Capture()
        unit.NotNil(t, spawnResult)
    
        // Verify worker started response
        actor.ShouldSend().To(client).MessageMatching(func(msg any) bool {
            if started, ok := msg.(WorkerStarted); ok {
                return started.WorkerID == "worker-1" && started.PID == spawnResult.PID
            }
            return false
        }).Once().Assert()
    
        // Verify monitor notification
        actor.ShouldSend().To("monitor").Message(SupervisorStatus{
            ActiveWorkers: 1,
            MaxWorkers:    3,
        }).Once().Assert()
    
        // Test multiple workers
        actor.SendMessage(client, StartWorker{WorkerID: "worker-2"})
        actor.SendMessage(client, StartWorker{WorkerID: "worker-3"})
    
        // Should have spawned 3 workers total
        actor.ShouldSpawn().Factory(factoryWorker).Times(3).Assert()
    
        // Test max worker limit
        actor.SendMessage(client, StartWorker{WorkerID: "worker-4"})
        actor.ShouldSend().To(client).Message(WorkerError{Error: "max workers reached"}).Once().Assert()
        
        // Should still only have 3 spawned workers
        actor.ShouldSpawn().Factory(factoryWorker).Times(3).Assert()
    
        // Test stopping a worker
        actor.SendMessage(client, StopWorker{WorkerID: "worker-1"})
        actor.ShouldSend().To(client).Message(WorkerStopped{WorkerID: "worker-1"}).Once().Assert()
    }

    func TestDynamicProcessCreation(t *testing.T) {
        actor, _ := unit.Spawn(t, factoryTaskProcessor)
    
        // Test dynamic process creation with captured PIDs
        actor.SendMessage(gen.PID{}, CreateSessionWorker{UserID: "user123"})
    
        // Capture the spawn to get dynamic PID
        spawnResult := actor.ShouldSpawn().Once().Capture()
        sessionPID := spawnResult.PID
    
        // Verify session was registered with the dynamic PID
        actor.ShouldSend().To("session_registry").MessageMatching(func(msg any) bool {
            if reg, ok := msg.(SessionRegistered); ok {
                return reg.UserID == "user123" && reg.SessionPID == sessionPID
            }
            return false
        }).Once().Assert()
    
        // Test sending work to the dynamic session
        actor.SendMessage(gen.PID{}, SendToSession{
            UserID: "user123", 
            Task:   "process_data",
        })
    
        // Should route to the captured session PID
        actor.ShouldSend().To(sessionPID).MessageMatching(func(msg any) bool {
            if task, ok := msg.(SessionTask); ok {
                return task.Task == "process_data"
            }
            return false
        }).Once().Assert()
    }
    
    // Required message types for this example:
    type CreateSessionWorker struct{ UserID string }
    type SessionRegistered struct{ UserID string; SessionPID gen.PID }
    type SendToSession struct{ UserID, Task string }
    type SessionTask struct{ Task string }
    // factoryTaskProcessor() gen.ProcessBehavior function would be defined separately

    type distributedCoordinator struct {
        act.Actor
        nodeAvailability map[gen.Atom]bool
        roundRobin      int
    }
    
    func (dc *distributedCoordinator) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case SpawnRemoteWorker:
            if !dc.isNodeAvailable(msg.NodeName) {
                dc.Send(from, RemoteSpawnError{
                    NodeName: msg.NodeName,
                    Error:    "node not available",
                })
                return nil
            }
    
            // Use RemoteSpawn which generates RemoteSpawnEvent
            pid, err := dc.RemoteSpawn(msg.NodeName, msg.WorkerName, gen.ProcessOptions{}, msg.Config)
            if err != nil {
                dc.Send(from, RemoteSpawnError{NodeName: msg.NodeName, Error: err.Error()})
                return nil
            }
    
            dc.Send(from, RemoteWorkerSpawned{
                NodeName:   msg.NodeName,
                WorkerName: msg.WorkerName,
                PID:        pid,
            })
            return nil
    
        case SpawnRemoteService:
            // Use RemoteSpawnRegister which generates RemoteSpawnEvent with registration
            pid, err := dc.RemoteSpawnRegister(msg.NodeName, msg.ServiceName, msg.RegisterName, gen.ProcessOptions{})
            if err != nil {
                dc.Send(from, RemoteSpawnError{NodeName: msg.NodeName, Error: err.Error()})
                return nil
            }
    
            dc.Send(from, RemoteServiceSpawned{
                NodeName:     msg.NodeName,
                ServiceName:  msg.ServiceName,
                RegisterName: msg.RegisterName,
                PID:          pid,
            })
            return nil
        }
        return nil
    }
    
    type SpawnRemoteWorker struct{ NodeName, WorkerName gen.Atom; Config map[string]any }
    type SpawnRemoteService struct{ NodeName, ServiceName, RegisterName gen.Atom }
    type RemoteWorkerSpawned struct{ NodeName, WorkerName gen.Atom; PID gen.PID }
    type RemoteServiceSpawned struct{ NodeName, ServiceName, RegisterName gen.Atom; PID gen.PID }
    type RemoteSpawnError struct{ NodeName gen.Atom; Error string }

    func TestDistributedCoordinator_RemoteSpawn(t *testing.T) {
        actor, _ := unit.Spawn(t, factoryDistributedCoordinator)
    
        // Setup remote nodes for testing
        actor.CreateRemoteNode("worker@node1", true)  // Available
        actor.CreateRemoteNode("worker@node2", false) // Unavailable
    
        clientPID := gen.PID{Node: "test", ID: 100}
        actor.ClearEvents() // Clear initialization events
    
        // Test basic remote spawn
        actor.SendMessage(clientPID, SpawnRemoteWorker{
            NodeName:   "worker@node1",
            WorkerName: "data-processor",
            Config:     map[string]any{"timeout": 30},
        })
    
        // Verify remote spawn event
        actor.ShouldRemoteSpawn().
            ToNode("worker@node1").
            WithName("data-processor").
            Once().
            Assert()
    
        // Test remote spawn with registration
        actor.SendMessage(clientPID, SpawnRemoteService{
            NodeName:     "worker@node1",
            ServiceName:  "user-service",
            RegisterName: "users",
        })
    
        // Verify remote spawn with register
        actor.ShouldRemoteSpawn().
            ToNode("worker@node1").
            WithName("user-service").
            WithRegister("users").
            Once().
            Assert()
    
        // Test total remote spawns
        actor.ShouldRemoteSpawn().Times(2).Assert()
    
        // Test negative assertion - should not spawn on unavailable node
        actor.SendMessage(clientPID, SpawnRemoteWorker{
            NodeName:   "worker@node2",
            WorkerName: "test-worker",
        })
    
        actor.ShouldNotRemoteSpawn().ToNode("worker@node2").Assert()
    }

    type connectionManager struct {
        act.Actor
        connections map[string]*Connection
        maxRetries  int
    }
    
    func (c *connectionManager) Init(args ...any) error {
        c.connections = make(map[string]*Connection)
        c.maxRetries = 3
        c.Log().Info("Connection manager started")
        return nil
    }
    
    func (c *connectionManager) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case CreateConnection:
            conn := &Connection{ID: msg.ID, Status: "active"}
            c.connections[msg.ID] = conn
            c.Send(from, ConnectionCreated{ID: msg.ID})
            c.Log().Info("Created connection %s", msg.ID)
            return nil
    
        case CloseConnection:
            if conn, exists := c.connections[msg.ID]; exists {
                conn.Close()
                delete(c.connections, msg.ID)
                c.Send(from, ConnectionClosed{ID: msg.ID})
                c.Log().Info("Closed connection %s", msg.ID)
            }
            return nil
    
        case "shutdown":
            // Graceful shutdown - close all connections
            for id, conn := range c.connections {
                conn.Close()
                c.Log().Info("Shutdown: closed connection %s", id)
            }
            c.Send("monitor", ShutdownComplete{ConnectionsClosed: len(c.connections)})
            return gen.TerminateReasonShutdown
    
        case ConnectionError:
            c.Log().Error("Connection error for %s: %s", msg.ID, msg.Error)
            msg.RetryCount++
            
            if msg.RetryCount >= c.maxRetries {
                c.Log().Error("Max retries exceeded for connection %s", msg.ID)
                return fmt.Errorf("connection failed after %d retries: %s", c.maxRetries, msg.Error)
            }
            
            // Retry the connection
            c.Send(c.PID(), CreateConnection{ID: msg.ID})
            return nil
    
        case "force_error":
            // Simulate critical error
            return fmt.Errorf("critical system error: database unavailable")
        }
        return nil
    }
    
    type CreateConnection struct{ ID string }
    type CloseConnection struct{ ID string }
    type ConnectionCreated struct{ ID string }
    type ConnectionClosed struct{ ID string }
    type ConnectionError struct{ ID, Error string; RetryCount int }
    type ShutdownComplete struct{ ConnectionsClosed int }
    
    type Connection struct {
        ID     string
        Status string
    }
    
    func (c *Connection) Close() { c.Status = "closed" }
    
    func factoryConnectionManager() gen.ProcessBehavior {
        return &connectionManager{}
    }

    func TestConnectionManager_TerminationHandling(t *testing.T) {
        actor, _ := unit.Spawn(t, factoryConnectionManager)
        client := gen.PID{Node: "test", ID: 100}
    
        // Test normal operation first
        actor.SendMessage(client, CreateConnection{ID: "conn-1"})
        actor.ShouldSend().To(client).Message(ConnectionCreated{ID: "conn-1"}).Once().Assert()
        
        // Verify actor is not terminated during normal operation
        unit.Equal(t, false, actor.IsTerminated())
        unit.Nil(t, actor.TerminationReason())
    
        // Test graceful shutdown
        actor.SendMessage(client, "shutdown")
        
        // Verify shutdown message sent
        actor.ShouldSend().To("monitor").MessageMatching(func(msg any) bool {
            if shutdown, ok := msg.(ShutdownComplete); ok {
                return shutdown.ConnectionsClosed == 1
            }
            return false
        }).Once().Assert()
    
        // Verify graceful termination
        unit.Equal(t, true, actor.IsTerminated())
        unit.Equal(t, gen.TerminateReasonShutdown, actor.TerminationReason())
    
        // Verify termination event was captured
        actor.ShouldTerminate().
            WithReason(gen.TerminateReasonShutdown).
            Once().
            Assert()
    }
    
    func TestConnectionManager_ErrorTermination(t *testing.T) {
        actor, _ := unit.Spawn(t, factoryConnectionManager)
    
        // Test abnormal termination due to critical error
        actor.SendMessage(gen.PID{}, "force_error")
    
        // Verify actor terminated with error
        unit.Equal(t, true, actor.IsTerminated())
        unit.NotNil(t, actor.TerminationReason())
        unit.Contains(t, actor.TerminationReason().Error(), "critical system error")
    
        // Verify termination event with specific error
        actor.ShouldTerminate().
            ReasonMatching(func(reason error) bool {
                return strings.Contains(reason.Error(), "database unavailable")
            }).
            Once().
            Assert()
    }
    
    func TestConnectionManager_RetryBeforeTermination(t *testing.T) {
        actor, _ := unit.Spawn(t, factoryConnectionManager)
    
        // Test retry logic before termination
        actor.SendMessage(gen.PID{}, CreateConnection{ID: "conn-retry"})
        actor.ClearEvents() // Clear creation events
    
        // Send connection errors that should trigger retries
        for i := 0; i < 2; i++ {
            actor.SendMessage(gen.PID{}, ConnectionError{
                ID:         "conn-retry",
                Error:      "network timeout",
                RetryCount: i,
            })
    
            // Should not terminate yet
            unit.Equal(t, false, actor.IsTerminated())
            
            // Should retry by sending CreateConnection
            actor.ShouldSend().To(actor.PID()).MessageMatching(func(msg any) bool {
                if create, ok := msg.(CreateConnection); ok {
                    return create.ID == "conn-retry"
                }
                return false
            }).Once().Assert()
        }
    
        // Final error that exceeds max retries
        actor.SendMessage(gen.PID{}, ConnectionError{
            ID:         "conn-retry",
            Error:      "network timeout",
            RetryCount: 3, // Exceeds maxRetries
        })
    
        // Now should terminate with error
        unit.Equal(t, true, actor.IsTerminated())
        unit.Contains(t, actor.TerminationReason().Error(), "connection failed after 3 retries")
    
        // Verify termination assertion
        actor.ShouldTerminate().
            ReasonMatching(func(reason error) bool {
                return strings.Contains(reason.Error(), "retries") && 
                       strings.Contains(reason.Error(), "network timeout")
            }).
            Once().
            Assert()
    }
    
    func TestTerminatedActor_NoFurtherProcessing(t *testing.T) {
        actor, _ := unit.Spawn(t, factoryConnectionManager)
    
        // Terminate the actor
        actor.SendMessage(gen.PID{}, "force_error")
        unit.Equal(t, true, actor.IsTerminated())
    
        actor.ClearEvents() // Clear termination events
    
        // Try to send more messages - should not be processed
        actor.SendMessage(gen.PID{}, CreateConnection{ID: "should-not-work"})
        
        // Should not process the message (no CreateConnection response)
        actor.ShouldNotSend().To(gen.PID{}).Message(ConnectionCreated{ID: "should-not-work"}).Assert()
        
        // Should not create any new events
        events := actor.Events()
        unit.Equal(t, 0, len(events), "Terminated actor should not process messages")
    }
    
    #### Termination Testing Methods
    
    **TestActor Termination Status:**
    ```go
    // Check if actor is terminated
    isTerminated := actor.IsTerminated() // bool
    
    // Get termination reason (nil if not terminated)
    reason := actor.TerminationReason() // error or nil
    
    // Test that actor should terminate
    actor.ShouldTerminate().Once().Assert()
    
    // Test with specific reason
    actor.ShouldTerminate().WithReason(gen.TerminateReasonShutdown).Assert()
    
    // Test with reason matching
    actor.ShouldTerminate().ReasonMatching(func(reason error) bool {
        return strings.Contains(reason.Error(), "expected error")
    }).Assert()
    
    // Test that actor should NOT terminate
    actor.ShouldNotTerminate().Assert()
    // Test multiple termination attempts
    actor.ShouldTerminate().Times(1).Assert() // Should terminate exactly once
    
    // Capture termination for detailed analysis
    terminationResult := actor.ShouldTerminate().Once().Capture()
    unit.NotNil(t, terminationResult)
    unit.Equal(t, expectedReason, terminationResult.Reason)
    
    // Test termination with timeout
    success := unit.WithTimeout(func() {
        actor.SendMessage(gen.PID{}, "shutdown")
        actor.ShouldTerminate().Once().Assert()
    }, 5*time.Second)
    unit.True(t, success, "Actor should terminate within timeout")
    ```

    type processSupervisor struct {
        act.Actor
        workers map[string]gen.PID
        maxWorkers int
    }
    
    func (p *processSupervisor) Init(args ...any) error {
        p.workers = make(map[string]gen.PID)
        p.maxWorkers = 5
        return nil
    }
    
    func (p *processSupervisor) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case StartWorker:
            if len(p.workers) >= p.maxWorkers {
                p.Send(from, WorkerStartError{Error: "max workers reached"})
                return nil
            }
    
            workerPID, err := p.Spawn(factoryWorkerProcess, gen.ProcessOptions{}, msg.WorkerID)
            if err != nil {
                p.Send(from, WorkerStartError{Error: err.Error()})
                return nil
            }
    
            p.workers[msg.WorkerID] = workerPID
            p.Send(from, WorkerStarted{WorkerID: msg.WorkerID, PID: workerPID})
            return nil
    
        case StopWorker:
            if workerPID, exists := p.workers[msg.WorkerID]; exists {
                // Send exit signal to worker
                p.SendExit(workerPID, gen.TerminateReasonShutdown)
                delete(p.workers, msg.WorkerID)
                p.Send(from, WorkerStopped{WorkerID: msg.WorkerID})
                p.Log().Info("Sent exit signal to worker %s", msg.WorkerID)
            } else {
                p.Send(from, WorkerStopError{WorkerID: msg.WorkerID, Error: "worker not found"})
            }
            return nil
    
        case EmergencyShutdown:
            // Send exit signals to all workers with error reason
            shutdownReason := fmt.Errorf("emergency shutdown: %s", msg.Reason)
            
            for workerID, workerPID := range p.workers {
                p.SendExit(workerPID, shutdownReason)
                p.Log().Warning("Emergency shutdown: sent exit to worker %s", workerID)
            }
            
            // Send meta exit signal to monitoring system
            p.SendExitMeta(gen.PID{Node: "monitor", ID: 999}, shutdownReason)
            
            p.Send(from, EmergencyShutdownComplete{
                WorkersTerminated: len(p.workers),
                Reason:           msg.Reason,
            })
            
            p.workers = make(map[string]gen.PID) // Clear workers map
            return nil
    
        case TerminateWorkerWithError:
            if workerPID, exists := p.workers[msg.WorkerID]; exists {
                errorReason := fmt.Errorf("worker error: %s", msg.Error)
                p.SendExit(workerPID, errorReason)
                delete(p.workers, msg.WorkerID)
                
                p.Send(from, WorkerTerminated{
                    WorkerID: msg.WorkerID,
                    Reason:   msg.Error,
                })
            }
            return nil
        }
        return nil
    }
    
    type StartWorker struct{ WorkerID string }
    type StopWorker struct{ WorkerID string }
    type EmergencyShutdown struct{ Reason string }
    type TerminateWorkerWithError struct{ WorkerID, Error string }
    
    type WorkerStarted struct{ WorkerID string; PID gen.PID }
    type WorkerStopped struct{ WorkerID string }
    type WorkerStartError struct{ Error string }
    type WorkerStopError struct{ WorkerID, Error string }
    type EmergencyShutdownComplete struct{ WorkersTerminated int; Reason string }
    type WorkerTerminated struct{ WorkerID, Reason string }
    
    type workerProcess struct{ act.Actor }
    func (w *workerProcess) HandleMessage(from gen.PID, message any) error { return nil }
    func factoryWorkerProcess() gen.ProcessBehavior { return &workerProcess{} }
    func factoryProcessSupervisor() gen.ProcessBehavior { return &processSupervisor{} }

    func TestProcessSupervisor_ExitSignals(t *testing.T) {
        actor, _ := unit.Spawn(t, factoryProcessSupervisor)
        client := gen.PID{Node: "test", ID: 100}
    
        // Start some workers
        actor.SendMessage(client, StartWorker{WorkerID: "worker-1"})
        actor.SendMessage(client, StartWorker{WorkerID: "worker-2"})
    
        // Capture worker PIDs for validation
        spawn1 := actor.ShouldSpawn().Factory(factoryWorkerProcess).Once().Capture()
        spawn2 := actor.ShouldSpawn().Factory(factoryWorkerProcess).Once().Capture()
        
        worker1PID := spawn1.PID
        worker2PID := spawn2.PID
    
        actor.ClearEvents() // Clear spawn events
    
        // Test graceful worker stop
        actor.SendMessage(client, StopWorker{WorkerID: "worker-1"})
    
        // Verify exit signal sent to worker
        actor.ShouldSendExit().
            To(worker1PID).
            WithReason(gen.TerminateReasonShutdown).
            Once().
            Assert()
    
        // Verify stop confirmation
        actor.ShouldSend().To(client).Message(WorkerStopped{WorkerID: "worker-1"}).Once().Assert()
    
        // Test worker termination with custom error
        actor.SendMessage(client, TerminateWorkerWithError{
            WorkerID: "worker-2",
            Error:    "memory leak detected",
        })
    
        // Verify exit signal with custom error reason
        actor.ShouldSendExit().
            To(worker2PID).
            ReasonMatching(func(reason error) bool {
                return strings.Contains(reason.Error(), "memory leak detected")
            }).
            Once().
            Assert()
    
        // Verify termination response
        actor.ShouldSend().To(client).MessageMatching(func(msg any) bool {
            if terminated, ok := msg.(WorkerTerminated); ok {
                return terminated.WorkerID == "worker-2" && 
                       terminated.Reason == "memory leak detected"
            }
            return false
        }).Once().Assert()
    }
    
    func TestProcessSupervisor_EmergencyShutdown(t *testing.T) {
        actor, _ := unit.Spawn(t, factoryProcessSupervisor)
        client := gen.PID{Node: "test", ID: 100}
    
        // Start multiple workers
        for i := 1; i <= 3; i++ {
            actor.SendMessage(client, StartWorker{WorkerID: fmt.Sprintf("worker-%d", i)})
        }
    
        // Capture all worker PIDs
        workers := make([]gen.PID, 3)
        for i := 0; i < 3; i++ {
            spawn := actor.ShouldSpawn().Factory(factoryWorkerProcess).Once().Capture()
            workers[i] = spawn.PID
        }
    
        actor.ClearEvents() // Clear spawn events
    
        // Trigger emergency shutdown
        actor.SendMessage(client, EmergencyShutdown{Reason: "system overload"})
    
        // Verify exit signals sent to all workers
        for _, workerPID := range workers {
            actor.ShouldSendExit().
                To(workerPID).
                ReasonMatching(func(reason error) bool {
                    return strings.Contains(reason.Error(), "emergency shutdown") &&
                           strings.Contains(reason.Error(), "system overload")
                }).
                Once().
                Assert()
        }
    
        // Verify meta exit signal sent to monitoring
        monitorPID := gen.PID{Node: "monitor", ID: 999}
        actor.ShouldSendExitMeta().
            To(monitorPID).
            ReasonMatching(func(reason error) bool {
                return strings.Contains(reason.Error(), "system overload")
            }).
            Once().
            Assert()
    
        // Verify shutdown completion message
        actor.ShouldSend().To(client).MessageMatching(func(msg any) bool {
            if complete, ok := msg.(EmergencyShutdownComplete); ok {
                return complete.WorkersTerminated == 3 && 
                       complete.Reason == "system overload"
            }
            return false
        }).Once().Assert()
    
        // Verify total exit signals (3 workers + 1 meta)
        actor.ShouldSendExit().Times(3).Assert()
        actor.ShouldSendExitMeta().Times(1).Assert()
    }
    
    func TestExitSignal_NegativeAssertions(t *testing.T) {
        actor, _ := unit.Spawn(t, factoryProcessSupervisor)
        client := gen.PID{Node: "test", ID: 100}
    
        // Try to stop non-existent worker
        actor.SendMessage(client, StopWorker{WorkerID: "non-existent"})
    
        // Should not send any exit signals
        actor.ShouldNotSendExit().Assert()
        actor.ShouldNotSendExitMeta().Assert()
    
        // Should send error response instead
        actor.ShouldSend().To(client).MessageMatching(func(msg any) bool {
            if stopError, ok := msg.(WorkerStopError); ok {
                return stopError.WorkerID == "non-existent" && 
                       stopError.Error == "worker not found"
            }
            return false
        }).Once().Assert()
    }
    #### Exit Signal Testing Methods
    
    **Exit Signal Assertions:**
    ```go
    // Test that exit signal was sent
    actor.ShouldSendExit().To(targetPID).Once().Assert()
    
    // Test with specific reason
    actor.ShouldSendExit().To(targetPID).WithReason(gen.TerminateReasonShutdown).Assert()
    
    // Test with reason matching
    actor.ShouldSendExit().ReasonMatching(func(reason error) bool {
        return strings.Contains(reason.Error(), "expected error")
    }).Assert()
    
    // Test meta exit signals
    actor.ShouldSendExitMeta().To(monitorPID).WithReason(errorReason).Assert()
    
    // Negative assertions
    actor.ShouldNotSendExit().To(targetPID).Assert()
    actor.ShouldNotSendExitMeta().Assert()
    
    // Test multiple exit signals
    actor.ShouldSendExit().Times(3).Assert() // Should send exactly 3 exit signals
    
    // Test exit signals to specific targets
    actor.ShouldSendExit().To(worker1PID).Once().Assert()
    actor.ShouldSendExit().To(worker2PID).Once().Assert()
    
    // Capture exit signal for detailed analysis
    exitResult := actor.ShouldSendExit().Once().Capture()
    unit.NotNil(t, exitResult)
    unit.Equal(t, expectedPID, exitResult.To)
    unit.Equal(t, expectedReason, exitResult.Reason)
    
    // Combined assertions
    actor.ShouldSendExit().To(workerPID).WithReason(gen.TerminateReasonShutdown).Once().Assert()
    actor.ShouldSendExitMeta().To(monitorPID).ReasonMatching(func(r error) bool {
        return strings.Contains(r.Error(), "shutdown complete")
    }).Once().Assert()
    ```

    type taskScheduler struct {
        act.Actor
        taskCounter int
        schedules   map[string]gen.CronJobSchedule
    }
    
    func (t *taskScheduler) Init(args ...any) error {
        t.taskCounter = 0
        t.schedules = make(map[string]gen.CronJobSchedule)
        t.Log().Info("Task scheduler started")
        return nil
    }
    
    func (t *taskScheduler) HandleMessage(from gen.PID, message any) error {
        switch msg := message.(type) {
        case ScheduleTask:
            // Add a new cron job
            jobID, err := t.Cron().AddJob(msg.Schedule, gen.CronJobFunction(func() {
                t.taskCounter++
                t.Send("output", TaskExecuted{
                    TaskID:    msg.TaskID,
                    Count:     t.taskCounter,
                    Timestamp: time.Now(),
                })
                t.Log().Info("Executed scheduled task %s (count: %d)", msg.TaskID, t.taskCounter)
            }))
            
            if err != nil {
                t.Send(from, ScheduleError{TaskID: msg.TaskID, Error: err.Error()})
                return nil
            }
    
            t.schedules[msg.TaskID] = gen.CronJobSchedule{ID: jobID, Schedule: msg.Schedule}
            t.Send(from, TaskScheduled{TaskID: msg.TaskID, JobID: jobID})
            t.Log().Debug("Scheduled task %s with job ID %s", msg.TaskID, jobID)
            return nil
    
        case UnscheduleTask:
            if schedule, exists := t.schedules[msg.TaskID]; exists {
                err := t.Cron().RemoveJob(schedule.ID)
                if err != nil {
                    t.Send(from, UnscheduleError{TaskID: msg.TaskID, Error: err.Error()})
                    return nil
                }
                
                delete(t.schedules, msg.TaskID)
                t.Send(from, TaskUnscheduled{TaskID: msg.TaskID})
                t.Log().Debug("Unscheduled task %s", msg.TaskID)
            } else {
                t.Send(from, UnscheduleError{TaskID: msg.TaskID, Error: "task not found"})
            }
            return nil
    
        case EnableTask:
            if schedule, exists := t.schedules[msg.TaskID]; exists {
                err := t.Cron().EnableJob(schedule.ID)
                if err != nil {
                    t.Send(from, TaskError{TaskID: msg.TaskID, Error: err.Error()})
                    return nil
                }
                t.Send(from, TaskEnabled{TaskID: msg.TaskID})
            }
            return nil
    
        case DisableTask:
            if schedule, exists := t.schedules[msg.TaskID]; exists {
                err := t.Cron().DisableJob(schedule.ID)
                if err != nil {
                    t.Send(from, TaskError{TaskID: msg.TaskID, Error: err.Error()})
                    return nil
                }
                t.Send(from, TaskDisabled{TaskID: msg.TaskID})
            }
            return nil
    
        case GetTaskInfo:
            if schedule, exists := t.schedules[msg.TaskID]; exists {
                info, err := t.Cron().JobInfo(schedule.ID)
                if err != nil {
                    t.Send(from, TaskError{TaskID: msg.TaskID, Error: err.Error()})
                    return nil
                }
                t.Send(from, TaskInfo{
                    TaskID:   msg.TaskID,
                    JobID:    schedule.ID,
                    Schedule: schedule.Schedule,
                    Enabled:  info.Enabled,
                    NextRun:  info.NextRun,
                })
            } else {
                t.Send(from, TaskError{TaskID: msg.TaskID, Error: "task not found"})
            }
            return nil
        }
        return nil
    }
    
    type ScheduleTask struct{ TaskID, Schedule string }
    type UnscheduleTask struct{ TaskID string }
    type EnableTask struct{ TaskID string }
    type DisableTask struct{ TaskID string }
    type GetTaskInfo struct{ TaskID string }
    
    type TaskScheduled struct{ TaskID, JobID string }
    type TaskUnscheduled struct{ TaskID string }
    type TaskEnabled struct{ TaskID string }
    type TaskDisabled struct{ TaskID string }
    type TaskExecuted struct{ TaskID string; Count int; Timestamp time.Time }
    type TaskInfo struct{ TaskID, JobID, Schedule string; Enabled bool; NextRun time.Time }
    type ScheduleError struct{ TaskID, Error string }
    type UnscheduleError struct{ TaskID, Error string }
    type TaskError struct{ TaskID, Error string }
    
    func factoryTaskScheduler() gen.ProcessBehavior {
        return &taskScheduler{}
    }

    func TestTaskScheduler_CronJobs(t *testing.T) {
        actor, _ := unit.Spawn(t, factoryTaskScheduler)
        client := gen.PID{Node: "test", ID: 100}
    
        // Test basic job scheduling
        actor.SendMessage(client, ScheduleTask{
            TaskID:   "daily-backup",
            Schedule: "0 2 * * *", // Daily at 2 AM
        })
    
        // Verify cron job was added
        actor.ShouldAddCronJob().
            WithSchedule("0 2 * * *").
            Once().
            Assert()
    
        // Verify scheduling response
        actor.ShouldSend().To(client).MessageMatching(func(msg any) bool {
            if scheduled, ok := msg.(TaskScheduled); ok {
                return scheduled.TaskID == "daily-backup" && scheduled.JobID != ""
            }
            return false
        }).Once().Assert()
    
        // Test job execution by triggering it
        actor.TriggerCronJob("0 2 * * *") // Manually trigger the scheduled job
    
        // Verify job execution
        actor.ShouldExecuteCronJob().
            WithSchedule("0 2 * * *").
            Once().
            Assert()
    
        // Verify task execution message
        actor.ShouldSend().To("output").MessageMatching(func(msg any) bool {
            if executed, ok := msg.(TaskExecuted); ok {
                return executed.TaskID == "daily-backup" && executed.Count == 1
            }
            return false
        }).Once().Assert()
    }
    
    func TestTaskScheduler_MockTimeControl(t *testing.T) {
        actor, _ := unit.Spawn(t, factoryTaskScheduler)
        client := gen.PID{Node: "test", ID: 100}
    
        // Set initial mock time
        baseTime := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
        actor.SetCronMockTime(baseTime)
    
        // Schedule a job for every minute
        actor.SendMessage(client, ScheduleTask{
            TaskID:   "minute-task",
            Schedule: "* * * * *", // Every minute
        })
    
        cronJob := actor.ShouldAddCronJob().Once().Capture()
        actor.ClearEvents()
    
        // Advance time by 1 minute - should trigger the job
        actor.SetCronMockTime(baseTime.Add(1 * time.Minute))
    
        // Verify job executed
        actor.ShouldExecuteCronJob().
            WithJobID(cronJob.ID).
            Once().
            Assert()
    
        // Advance time by another minute
        actor.SetCronMockTime(baseTime.Add(2 * time.Minute))
    
        // Should execute again
        actor.ShouldExecuteCronJob().
            WithJobID(cronJob.ID).
            Times(2). // Total of 2 executions
            Assert()
    }
    // Test that cron job was added
    actor.ShouldAddCronJob().WithSchedule("0 2 * * *").Once().Assert()
    
    // Test job execution
    actor.ShouldExecuteCronJob().WithSchedule("0 * * * *").Times(3).Assert()
    
    // Test job removal
    actor.ShouldRemoveCronJob().WithJobID("job-123").Once().Assert()
    
    // Test job enable/disable
    actor.ShouldEnableCronJob().WithJobID("job-123").Once().Assert()
    actor.ShouldDisableCronJob().WithJobID("job-123").Once().Assert()
    
    // Negative assertions
    actor.ShouldNotAddCronJob().Assert()
    actor.ShouldNotExecuteCronJob().Assert()
    // Set mock time for deterministic testing
    baseTime := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
    actor.SetCronMockTime(baseTime)
    
    // Advance time to trigger scheduled jobs
    actor.SetCronMockTime(baseTime.Add(1 * time.Hour))
    
    // Manually trigger cron jobs for testing
    actor.TriggerCronJob("0 * * * *") // Trigger hourly job
    actor.TriggerCronJob("job-id-123") // Trigger by job ID
    // Capture cron job for detailed analysis
    cronJob := actor.ShouldAddCronJob().Once().Capture()
    jobID := cronJob.ID
    schedule := cronJob.Schedule
    
    // Test multiple job executions with time control
    for i := 0; i < 5; i++ {
        actor.SetCronMockTime(baseTime.Add(time.Duration(i) * time.Minute))
        actor.TriggerCronJob("* * * * *") // Every minute
    }
    actor.ShouldExecuteCronJob().Times(5).Assert()
    func TestBuiltInAssertions(t *testing.T) {
        // Equality assertions
        unit.Equal(t, "expected", "expected")
        unit.NotEqual(t, "different", "value")
    
        // Boolean assertions  
        unit.True(t, true)
        unit.False(t, false)
    
        // Nil assertions
        unit.Nil(t, nil)
        unit.NotNil(t, "not nil")
    
        // String assertions
        unit.Contains(t, "hello world", "world")
        
        // Type assertions
        unit.IsType(t, "", "string value")
    }
    func TestDynamicValues(t *testing.T) {
        actor, _ := unit.Spawn(t, factorySessionManager)
    
        // Send request that will generate dynamic session ID
        actor.SendMessage(gen.PID{}, CreateSession{UserID: "user123"})
    
        // Capture the spawn to get the dynamic session PID
        spawnResult := actor.ShouldSpawn().Once().Capture()
        sessionPID := spawnResult.PID
    
        // Use captured PID in subsequent assertions
        actor.ShouldSend().MessageMatching(func(msg any) bool {
            if created, ok := msg.(SessionCreated); ok {
                return created.SessionPID == sessionPID && created.UserID == "user123"
            }
            return false
        }).Once().Assert()
    }
    func TestEventInspection(t *testing.T) {
        actor, _ := unit.Spawn(t, factoryComplexActor)
    
        // Perform operations
        actor.SendMessage(gen.PID{}, ComplexOperation{})
    
        // Get all events for inspection
        events := actor.Events()
        
        var sendCount, spawnCount, logCount, remoteSpawnCount int
        for _, event := range events {
            switch event.(type) {
            case unit.SendEvent:
                sendCount++
            case unit.SpawnEvent:
                spawnCount++
            case unit.LogEvent:
                logCount++
            case unit.RemoteSpawnEvent:
                remoteSpawnCount++
            }
        }
    
        unit.True(t, sendCount > 0, "Should have send events")
        unit.True(t, spawnCount == 2, "Should spawn exactly 2 processes")
        unit.True(t, logCount >= 1, "Should have log events")
    }
    func TestLastEvent(t *testing.T) {
        actor, _ := unit.Spawn(t, factoryExampleActor)
    
        actor.SendMessage(gen.PID{}, "test")
        
        // Get the most recent event
        lastEvent := actor.LastEvent()
        unit.NotNil(t, lastEvent, "Should have a last event")
        unit.Equal(t, "send", lastEvent.Type())
        
        if sendEvent, ok := lastEvent.(unit.SendEvent); ok {
            unit.Equal(t, "test", sendEvent.Message)
        }
    }
    func TestClearEvents(t *testing.T) {
        actor, _ := unit.Spawn(t, factoryExampleActor)
    
        // Perform some operations
        actor.SendMessage(gen.PID{}, "setup")
        actor.ShouldSend().Once().Assert()
    
        // Clear events before main test
        actor.ClearEvents()
    
        // Now test the main functionality
        actor.SendMessage(gen.PID{}, "main_operation")
        
        // Only the main operation events are captured
        events := actor.Events()
        unit.Equal(t, 1, len(events), "Should only have main operation event")
    }
import (
    "testing"
    "time"

    "ergo.services/ergo/gen"
    "ergo.services/ergo/testing/unit"
)
    
    func TestWithTimeout(t *testing.T) {
        actor, _ := unit.Spawn(t, factoryExampleActor)
    
        // Test that assertion completes within timeout
        success := unit.WithTimeout(func() {
            actor.SendMessage(gen.PID{}, "test")
            actor.ShouldSend().Once().Assert()
        }, 5*time.Second)
    
        unit.True(t, success(), "Assertion should complete within timeout")
    }
    // Good: Tests one specific behavior
    func TestUserManager_CreateUser_Success(t *testing.T) { ... }
    func TestUserManager_CreateUser_DuplicateEmail(t *testing.T) { ... }
    func TestUserManager_CreateUser_InvalidData(t *testing.T) { ... }
    
    // Poor: Tests multiple behaviors in one test
    func TestUserManager_AllOperations(t *testing.T) { ... }
    func TestWorkerSupervisor_MaxWorkersReached(t *testing.T) {
        // Test that supervisor properly rejects requests when at capacity
        // Test that appropriate error messages are sent
        // Test that the supervisor remains functional after rejecting requests
    }
    // Good: Easy to test with pattern matching
    type UserCreated struct {
        UserID   string
        Email    string
        Created  time.Time
    }
    
    // Poor: Hard to validate in tests
    type GenericMessage struct {
        Type string
        Data map[string]interface{}
    }
type OrderProcessed struct {
    OrderID     string    // Predictable - can be set in test
    Total       float64   // Predictable - can be set in test
    ProcessedAt time.Time // Dynamic - use pattern matching
    RequestID   string    // Dynamic - capture and validate
}
import (
    "fmt"
    "testing"

    "ergo.services/ergo/gen"
    "ergo.services/ergo/testing/unit"
)
    
    func TestWorkerPool_ConcurrentRequests(t *testing.T) {
        actor, _ := unit.Spawn(t, factoryWorkerPool)
        
        // Send multiple requests concurrently
        for i := 0; i < 100; i++ {
            actor.SendMessage(gen.PID{}, ProcessRequest{ID: fmt.Sprintf("req-%d", i)})
        }
        
        // Verify all requests were processed
        actor.ShouldSend().To("output").Times(100).Assert()
    }
    
    // Note: This example assumes you have defined:
    // - type ProcessRequest struct{ ID string }
    // - factoryWorkerPool() gen.ProcessBehavior function
// Import the testing library in your test files
import "ergo.services/ergo/testing/unit"

# Run all tests
go test -v ergo.services/ergo/testing/unit

# Run feature-specific tests
go test -v -run TestBasic ergo.services/ergo/testing/unit
go test -v -run TestNetwork ergo.services/ergo/testing/unit
go test -v -run TestWorkflow ergo.services/ergo/testing/unit
go test -v -run TestCall ergo.services/ergo/testing/unit
go test -v -run TestCron ergo.services/ergo/testing/unit
go test -v -run TestTermination ergo.services/ergo/testing/unit
    func TestDatabaseActor_ConfigurationBehavior(t *testing.T) {
        // Test with different configurations
        
        // Development configuration
        devActor, _ := unit.Spawn(t, newDatabaseActor, 
            unit.WithEnv(map[gen.Env]any{
                "DB_POOL_SIZE": 5,
                "LOG_QUERIES":  true,
            }))
        
        devActor.SendMessage(gen.PID{}, ExecuteQuery{SQL: "SELECT * FROM users"})
        devActor.ShouldLog().Level(gen.LogLevelDebug).Containing("SELECT * FROM users").Assert()
        
        // Production configuration  
        prodActor, _ := unit.Spawn(t, newDatabaseActor,
            unit.WithEnv(map[gen.Env]any{
                "DB_POOL_SIZE": 50,
                "LOG_QUERIES":  false,
            }))
        
        prodActor.SendMessage(gen.PID{}, ExecuteQuery{SQL: "SELECT * FROM users"})
        prodActor.ShouldLog().Level(gen.LogLevelDebug).Times(0).Assert() // No query logging in prod
    }
    func TestOrderProcessor_WorkflowSteps(t *testing.T) {
        actor, _ := unit.Spawn(t, newOrderProcessor)
        client := gen.PID{Node: "client", ID: 1}
        
        // Start an order
        actor.SendMessage(client, CreateOrder{Items: []string{"book", "pen"}})
        
        // Should trigger a sequence of operations
        actor.ShouldSend().To("inventory").Message("check_availability").Once().Assert()
        actor.ShouldSend().To("payment").Message("calculate_total").Once().Assert()
        actor.ShouldSend().To("shipping").Message("estimate_delivery").Once().Assert()
        
        // Should send status back to client
        actor.ShouldSend().To(client).MessageMatching(func(msg any) bool {
            if status, ok := msg.(OrderStatus); ok {
                return status.Status == "processing"
            }
            return false
        }).Once().Assert()
    }
    func TestSecurityGate_AccessControl(t *testing.T) {
        actor, _ := unit.Spawn(t, newSecurityGate)
        
        // Test admin access
        admin := gen.PID{Node: "admin", ID: 1}
        actor.SendMessage(admin, AccessRequest{Resource: "admin_panel", User: "admin"})
        actor.ShouldSend().To(admin).Message(AccessGranted{}).Once().Assert()
        
        // Test regular user access to admin panel
        user := gen.PID{Node: "user", ID: 2}
        actor.SendMessage(user, AccessRequest{Resource: "admin_panel", User: "regular_user"})
        actor.ShouldSend().To(user).Message(AccessDenied{Reason: "insufficient privileges"}).Once().Assert()
        
        // Test regular user access to public resources
        actor.SendMessage(user, AccessRequest{Resource: "public_content", User: "regular_user"})
        actor.ShouldSend().To(user).Message(AccessGranted{}).Once().Assert()
    }
    func TestTaskManager_WorkerCreation(t *testing.T) {
        actor, _ := unit.Spawn(t, newTaskManager)
        client := gen.PID{Node: "client", ID: 1}
        
        // Request a new worker
        actor.SendMessage(client, CreateWorker{TaskType: "data_processing"})
        
        // Should spawn a worker process
        actor.ShouldSpawn().Once().Assert()
        
        // Should confirm to client
        actor.ShouldSend().To(client).MessageMatching(func(msg any) bool {
            if response, ok := msg.(WorkerCreated); ok {
                return response.TaskType == "data_processing"
            }
            return false
        }).Once().Assert()
    }
    func TestSessionManager_UserSessions(t *testing.T) {
        actor, _ := unit.Spawn(t, newSessionManager)
        client := gen.PID{Node: "client", ID: 1}
        
        // Create a session for a user
        actor.SendMessage(client, CreateSession{UserID: "alice"})
        
        // Capture the spawned session process
        sessionSpawn := actor.ShouldSpawn().Once().Capture()
        sessionPID := sessionSpawn.PID
        
        // Verify session was registered
        actor.ShouldSend().To(client).MessageMatching(func(msg any) bool {
            if response, ok := msg.(SessionCreated); ok {
                return response.UserID == "alice" && response.SessionPID == sessionPID
            }
            return false
        }).Once().Assert()
        
        // Send work to the session
        actor.SendMessage(client, SendToSession{UserID: "alice", Data: "important_data"})
        
        // Should route to the captured session PID
        actor.ShouldSend().To(sessionPID).Message("important_data").Once().Assert()
    }
    func TestComplexActor_DebugFailures(t *testing.T) {
        actor, _ := unit.Spawn(t, newComplexActor)
        
        // Perform some operations
        actor.SendMessage(gen.PID{}, TriggerComplexWorkflow{})
        
        // If something goes wrong, inspect all events
        events := actor.Events()
        t.Logf("Total events captured: %d", len(events))
        
        for i, event := range events {
            t.Logf("Event %d: %s - %s", i, event.Type(), event.String())
        }
        
        // Clear events and test specific behavior
        actor.ClearEvents()
        actor.SendMessage(gen.PID{}, SimpleBehavior{})
        
        // Now only simple behavior events are captured
        simpleEvents := actor.Events()
        unit.Equal(t, 1, len(simpleEvents), "Should only have one event after clearing")
    }
    func TestActorWithFailureInjection(t *testing.T) {
        actor, err := unit.Spawn(t, factoryMyActor)
        if err != nil {
            t.Fatal(err)
        }
        
        // Inject failure for spawn operations
        actor.Process().SetMethodFailure("Spawn", errors.New("resource limit exceeded"))
        
        // Test how the actor handles spawn failures
        actor.SendMessage(gen.PID{}, CreateWorker{WorkerType: "data_processor"})
        
        // Verify the actor handles the failure gracefully
        actor.ShouldSend().MessageMatching(func(msg any) bool {
            if err, ok := msg.(WorkerCreationError); ok {
                return strings.Contains(err.Error, "resource limit exceeded")
            }
            return false
        }).Once().Assert()
    }
    // Fail every call to the method
    actor.Process().SetMethodFailure("Send", errors.New("network error"))
    
    // Fail only once
    actor.Process().SetMethodFailureOnce("Spawn", errors.New("temporary failure"))
    
    // Fail after N successful calls
    actor.Process().SetMethodFailureAfter("Send", 3, errors.New("rate limit"))
    
    // Fail when arguments match a pattern
    actor.Process().SetMethodFailurePattern("RegisterName", "worker", errors.New("pattern match"))
    
    // Clear specific failure
    actor.Process().ClearMethodFailure("Send")
    
    // Clear all failures
    actor.Process().ClearMethodFailures()
    
    // Get call count for a method
    count := actor.Process().GetMethodCallCount("Spawn")
    func TestSupervisor_SpawnFailures(t *testing.T) {
        supervisor, _ := unit.Spawn(t, factorySupervisor)
        
        // Inject spawn failure
        supervisor.Process().SetMethodFailure("Spawn", errors.New("resource exhausted"))
        
        supervisor.SendMessage(gen.PID{}, StartChild{ID: "worker-1"})
        
        // Verify supervisor handles spawn failure
        supervisor.ShouldSend().MessageMatching(func(msg any) bool {
            if resp, ok := msg.(StartChildResponse); ok {
                return !resp.Success && strings.Contains(resp.Error, "resource exhausted")
            }
            return false
        }).Once().Assert()
    }
    func TestRouter_SendFailures(t *testing.T) {
        router, _ := unit.Spawn(t, factoryMessageRouter)
        
        // Inject send failure
        router.Process().SetMethodFailure("Send", errors.New("destination unreachable"))
        
        router.SendMessage(gen.PID{}, RouteMessage{
            Destination: "remote_service",
            Message:     "important_data",
        })
        
        // Verify router handles send failure
        router.ShouldSend().MessageMatching(func(msg any) bool {
            if err, ok := msg.(RoutingError); ok {
                return strings.Contains(err.Error, "destination unreachable")
            }
            return false
        }).Once().Assert()
    }
    func TestProcessor_IntermittentFailures(t *testing.T) {
        processor, _ := unit.Spawn(t, factoryDataProcessor)
        
        // Fail after 2 successful operations
        processor.Process().SetMethodFailureAfter("Send", 2, errors.New("network timeout"))
        
        // First two sends succeed
        processor.SendMessage(gen.PID{}, ProcessData{ID: "1"})
        processor.SendMessage(gen.PID{}, ProcessData{ID: "2"})
        processor.ShouldSend().Times(2).Assert()
        
        // Third send fails
        processor.SendMessage(gen.PID{}, ProcessData{ID: "3"})
        processor.ShouldSend().MessageMatching(func(msg any) bool {
            if err, ok := msg.(ProcessingError); ok {
                return strings.Contains(err.Error, "network timeout")
            }
            return false
        }).Once().Assert()
    }
    func TestRegistry_PatternFailures(t *testing.T) {
        registry, _ := unit.Spawn(t, factoryRegistry)
        
        // Fail registration for names containing "temp"
        registry.Process().SetMethodFailurePattern("RegisterName", "temp", errors.New("temporary names not allowed"))
        
        // Normal registration succeeds
        registry.SendMessage(gen.PID{}, Register{Name: "service"})
        registry.ShouldSend().Message(RegisterSuccess{Name: "service"}).Once().Assert()
        
        // Temporary registration fails
        registry.SendMessage(gen.PID{}, Register{Name: "temp_worker"})
        registry.ShouldSend().MessageMatching(func(msg any) bool {
            if err, ok := msg.(RegisterError); ok {
                return strings.Contains(err.Error, "temporary names not allowed")
            }
            return false
        }).Once().Assert()
    }
    func TestResilience_RecoveryFromFailure(t *testing.T) {
        actor, _ := unit.Spawn(t, factoryResilientActor)
        
        // Inject one-time failure
        actor.Process().SetMethodFailureOnce("Send", errors.New("temporary network error"))
        
        // First attempt fails
        actor.SendMessage(gen.PID{}, SendData{Data: "attempt1"})
        actor.ShouldSend().MessageMatching(func(msg any) bool {
            if err, ok := msg.(SendError); ok {
                return strings.Contains(err.Error, "temporary network error")
            }
            return false
        }).Once().Assert()
        
        // Second attempt succeeds (failure was one-time only)
        actor.SendMessage(gen.PID{}, SendData{Data: "attempt2"})
        actor.ShouldSend().Message(SendSuccess{Data: "attempt2"}).Once().Assert()
    }
    func TestSupervisor_RestartBehavior(t *testing.T) {
        supervisor, _ := unit.Spawn(t, factoryOneForOneSupervisor)
        
        // Start children
        supervisor.SendMessage(gen.PID{}, StartChildren{Count: 3})
        supervisor.ShouldSpawn().Times(3).Assert()
        
        // Clear events before failure injection
        supervisor.ClearEvents()
        
        // Make child restarts fail after first success
        supervisor.Process().SetMethodFailureAfter("Spawn", 1, errors.New("restart failed"))
        
        // Simulate child failure requiring restart
        supervisor.SendMessage(gen.PID{}, ChildFailed{ID: "child-2"})
        
        // Verify supervisor attempts restart and handles failure
        supervisor.ShouldSpawn().Once().Assert() // First restart attempt
        supervisor.ShouldSend().MessageMatching(func(msg any) bool {
            if status, ok := msg.(SupervisorStatus); ok {
                return status.RestartsFailed == 1
            }
            return false
        }).Once().Assert()
    }
    func TestRateLimiter_CallCounting(t *testing.T) {
        limiter, _ := unit.Spawn(t, factoryRateLimiter)
        
        // Send multiple requests
        for i := 0; i < 5; i++ {
            limiter.SendMessage(gen.PID{}, Request{ID: i})
        }
        
        // Check how many times Send was called
        sendCount := limiter.Process().GetMethodCallCount("Send")
        unit.Equal(t, 5, sendCount, "Should have called Send 5 times")
        
        // Inject failure after checking count
        limiter.Process().SetMethodFailure("Send", errors.New("rate limit exceeded"))
        
        // Next request should fail
        limiter.SendMessage(gen.PID{}, Request{ID: 6})
        limiter.ShouldSend().MessageMatching(func(msg any) bool {
            if err, ok := msg.(RateLimitError); ok {
                return err.CallCount == 6 // Should include the failed attempt
            }
            return false
        }).Once().Assert()
    }