RockPaperScissorsTrainer

This example demonstrates Melodie’s Trainer module using a Rock-Paper-Scissors game model. In this model, agents start with heterogeneous payoff preferences and strategy weights. The trainer then uses a genetic algorithm (GA) to evolve these strategy parameters, aiming for higher accumulated payoffs for each agent.

Trainer: Project Structure

examples/rock_paper_scissors_trainer
├── core/
│   ├── agent.py            # Defines agent's strategy, actions, and payoff logic
│   ├── data_collector.py   # Collects micro and macro simulation results
│   ├── environment.py      # Manages pairwise agent battles
│   ├── model.py            # Contains the main simulation loop
│   ├── scenario.py         # Defines scenarios and generates agent parameters
│   └── trainer.py          # Configures and runs the GA-based training
├── data/
│   ├── input/
│   │   ├── SimulatorScenarios.csv
│   │   ├── TrainerScenarios.csv
│   │   └── TrainerParamsScenarios.csv
│   └── output/
└── main.py

Trainer: GA Concepts

While the Calibrator tunes parameters to match an external target, the Trainer is designed for an internal optimization process. Its goal is to allow agents to “learn” and evolve their individual behaviors to maximize their own objectives within the model’s world.

Core Idea: Maximizing Utility

  • Agent-Level Objective: Each agent has a goal, which is quantified by a utility function. In this example, the utility for an agent is its accumulated_payoff at the end of a simulation run. The Trainer’s objective is to find the behavioral parameters that maximize this utility for each agent.

  • Parameters: You specify which agent-level parameter(s) the Trainer should evolve. Here, it’s the three strategy weights: strategy_param_1, strategy_param_2, and strategy_param_3.

  • Utility Function: You must implement a utility(agent) method. This function is called after a simulation run and must return a single float value representing that agent’s performance or “fitness.”
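
Putting these pieces together, a single candidate parameter set is scored roughly as sketched below. This is a simplified conceptual illustration, not Melodie's internal code; the function and parameter names are placeholders.

# Conceptual sketch of how one candidate parameter set is scored.
# "model", "agent", and "utility" stand in for the pieces described above.
def evaluate_candidate(model, agent, candidate_params: dict, utility) -> float:
    # 1. Write the candidate strategy weights onto the agent.
    for name, value in candidate_params.items():
        setattr(agent, name, value)
    # 2. Run one full simulation with these weights in place.
    model.run()
    # 3. Read the agent's utility (accumulated payoff) back as its fitness.
    return utility(agent)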

Genetic Algorithm (GA) for Agent Learning

The Trainer runs a separate genetic algorithm for each agent.

  • Chromosome: An agent’s set of trainable parameters (e.g., the three strategy weights) is treated as its “chromosome.”

  • Population: For each agent, the GA maintains a “population” of these chromosomes (i.e., different sets of strategy weights).

  • Fitness: To evaluate a chromosome, the model is run with the agent using that set of strategy weights. The agent’s final utility (accumulated payoff) serves as the fitness score.

  • Evolution: Through generations, the GA for each agent independently selects, breeds, and mutates its population of strategy weights, converging on the strategy that yields the highest personal payoff for that agent, given the behavior of all other agents.

  • Parameter Encoding: The Trainer uses the same binary encoding mechanism as the Calibrator to represent continuous strategy parameters for the GA. For a detailed explanation of this process, please see the Parameter Encoding: From Float to Binary section in the CovidContagionCalibrator documentation.
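
To make the binary encoding concrete, the sketch below shows one common way to decode a fixed-length bit string into a float within given bounds. It is an illustrative example only, not Melodie's implementation; the helper name and the linear-mapping scheme are assumptions for exposition.

# Illustrative decoding of a GA bit string into a bounded float (hypothetical helper).
def decode_bits(bits, param_min: float, param_max: float) -> float:
    """Map a bit list such as [1, 0, 0, 1, ...] linearly onto [param_min, param_max]."""
    code_length = len(bits)  # plays the role of strategy_param_code_length
    as_int = int("".join(str(b) for b in bits), 2)
    max_int = 2 ** code_length - 1
    return param_min + (param_max - param_min) * as_int / max_int

# Example: an 8-bit chromosome segment decoded into the search space [0, 100].
print(decode_bits([1, 0, 0, 1, 1, 0, 1, 0], 0.0, 100.0))  # about 60.39

A longer code length yields a finer grid over the same bounds, which is why strategy_param_code_length controls the precision of the trained parameters.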

Parameter Configuration (`TrainerParamsScenarios.csv`)

This file controls the behavior of the genetic algorithm for all agents:

  • id: Unique identifier for a set of GA parameters.

  • path_num: How many independent training processes (paths) to run. Each path is a complete run of the GA, helping to test the robustness of the evolutionary outcome.

  • generation_num: The number of generations the GA will run for.

  • strategy_population: The size of the chromosome population maintained for each agent.

  • mutation_prob: The probability of random mutations.

  • strategy_param_code_length: The number of bits used to encode each strategy parameter in its binary chromosome segment, which determines the parameter's precision.

  • strategy_param_1_min/max, etc.: These define the search space bounds for each parameter being trained.
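
As a hypothetical illustration only (the values below are invented, not the shipped configuration), a row of TrainerParamsScenarios.csv could look like this:

id,path_num,generation_num,strategy_population,mutation_prob,strategy_param_code_length,strategy_param_1_min,strategy_param_1_max,strategy_param_2_min,strategy_param_2_max,strategy_param_3_min,strategy_param_3_max
0,1,50,100,0.02,20,0,100,0,100,0,100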

Trainer: How It Works

  • Simulation Loop: In each period, agents normalize their strategy weights into action probabilities (rock, paper, or scissors). They are then randomly paired to play one round, and their payoffs are recorded. At the end of the simulation run, each agent’s long-term share of actions is calculated.

  • Training Loop: The RPSTrainer evolves the strategy parameters for every agent. The fitness is each agent’s accumulated_payoff. Training settings come from TrainerParamsScenarios.csv.

  • Input Data: The scenarios for the simulator and trainer are defined in their respective CSV files. Agent-level parameters are generated dynamically within the RPSScenario class. agent_num must be an even number, as agents are paired up in each period.
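
For orientation, a TrainerScenarios.csv row could look roughly like the following, with columns matching the attributes used by RPSScenario and the model (id, agent_num, period_num, and the payoff bounds). The exact column set and the values shown here are assumptions, not the shipped file:

id,agent_num,period_num,payoff_win_min,payoff_win_max,payoff_lose_min,payoff_lose_max,payoff_tie
0,100,200,1.0,2.0,-2.0,-1.0,0.0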

Trainer: Results

The following plots, taken from the original model, show the evolution of the mean and the coefficient of variation of the total accumulated payoff across all agents over the course of training. This illustrates the effectiveness of the training process.

(Figures: total payoff mean across generations; total payoff coefficient of variation across generations.)

Trainer: Running the Model

You can run both the simulator and the trainer using the main script:

python examples/rock_paper_scissors_trainer/main.py

The trainer clears the output directory before it runs. Therefore, after executing the command, you will only find the trainer’s output tables in examples/rock_paper_scissors_trainer/data/output. If you need the simulator’s results, you should run it separately (for example, by commenting out the run_trainer call in main.py).
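
For reference, the training entry point essentially wires a Config object to the trainer shown in the code section below. The following is a minimal, hypothetical sketch of that step, not the shipped main.py; the Config argument names and the trainer.train() call are assumptions that may differ from your Melodie version.

# Hypothetical sketch of a training entry point (not the shipped main.py).
from Melodie import Config

from examples.rock_paper_scissors_trainer.core.model import RPSModel
from examples.rock_paper_scissors_trainer.core.scenario import RPSScenario
from examples.rock_paper_scissors_trainer.core.trainer import RPSTrainer

cfg = Config(
    project_name="RockPaperScissorsTrainer",
    project_root="examples/rock_paper_scissors_trainer",
    input_folder="examples/rock_paper_scissors_trainer/data/input",
    output_folder="examples/rock_paper_scissors_trainer/data/output",
)  # argument names assumed; check Melodie's Config signature


def run_trainer() -> None:
    trainer = RPSTrainer(
        config=cfg,
        scenario_cls=RPSScenario,
        model_cls=RPSModel,
        processors=4,
    )
    trainer.train()  # method name assumed; see Melodie's Trainer API


if __name__ == "__main__":
    run_trainer()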

Parallel Execution Mode

The Trainer supports two parallelization modes, controlled by the parallel_mode parameter when creating the trainer instance:

  • ``parallel_mode="process"`` (default): Uses subprocess-based parallelism via multiprocessing. This is the traditional approach and works on all Python versions. It is recommended for most use cases.

  • ``parallel_mode="thread"``: Uses thread-based parallelism via ThreadPoolExecutor. This mode is recommended for Python 3.13+ (free-threaded/No-GIL builds) as it can provide better performance by avoiding the overhead of process creation and data serialization. In older Python versions, this mode will still run but may be limited by the Global Interpreter Lock (GIL).

You can specify the mode when creating the trainer:

trainer = RPSTrainer(
    config=cfg,
    scenario_cls=RPSScenario,
    model_cls=RPSModel,
    processors=4,
    parallel_mode="thread",  # or "process" (default)
)

Trainer: Code

This section shows the key code implementation for the trainer model.

Model Structure

Defined in core/model.py.

from typing import TYPE_CHECKING

from Melodie import Model

from examples.rock_paper_scissors_trainer.core.agent import RPSAgent
from examples.rock_paper_scissors_trainer.core.data_collector import RPSDataCollector
from examples.rock_paper_scissors_trainer.core.environment import RPSEnvironment
from examples.rock_paper_scissors_trainer.core.scenario import RPSScenario

if TYPE_CHECKING:
    from Melodie import AgentList


class RPSModel(Model):
    """
    The main model class for the Rock-Paper-Scissors simulation.

    It sets up the agents, environment, and data collector, and defines the
    main simulation loop that runs for a specified number of periods.
    """

    scenario: "RPSScenario"
    data_collector: RPSDataCollector

    def create(self) -> None:
        self.agents: "AgentList[RPSAgent]" = self.create_agent_list(RPSAgent)
        self.environment: RPSEnvironment = self.create_environment(RPSEnvironment)
        self.data_collector = self.create_data_collector(RPSDataCollector)

    def setup(self) -> None:
        # Populates the agent list using the dynamically generated parameter table
        # from the scenario.
        self.agents.setup_agents(
            agents_num=self.scenario.agent_num,
            params_df=self.scenario.agent_params,
        )

    def run(self) -> None:
        """Executes the main simulation loop for `scenario.period_num` periods."""
        for period in range(self.scenario.period_num):
            self.environment.agents_setup_data(self.agents)
            self.environment.run_game_rounds(self.agents)
            self.environment.agents_calc_action_share(period, self.agents)
            self.data_collector.collect(period)
        self.data_collector.save()

Environment Logic

Defined in core/environment.py.

import random
from typing import TYPE_CHECKING

from Melodie import AgentList, Environment

from examples.rock_paper_scissors_trainer.core.agent import RPSAgent

if TYPE_CHECKING:
    from examples.rock_paper_scissors_trainer.core.scenario import RPSScenario


class RPSEnvironment(Environment):
    """
    The environment for the Rock-Paper-Scissors model.

    It is responsible for orchestrating the agent interactions by randomly
    pairing them up for battles in each period. It also tracks the total
    accumulated payoff across all agents as a macro-level indicator.
    """

    scenario: "RPSScenario"

    def setup(self) -> None:
        """Initializes environment-level properties."""
        self.total_accumulated_payoff = 0.0

    @staticmethod
    def agents_setup_data(agents: "AgentList[RPSAgent]") -> None:
        """
        Prepares derived variables for all agents at the beginning of each period.
        """
        for agent in agents:
            agent.setup_action_prob()
            agent.setup_action_payoff()

    def run_game_rounds(self, agents: "AgentList[RPSAgent]") -> None:
        """
        In each period, randomly pairs up all agents to play one round of the game.
        """
        assert self.scenario.agent_num % 2 == 0, "scenario.agent_num must be even."
        agent_ids = list(range(self.scenario.agent_num))
        random.shuffle(agent_ids)
        for idx in range(0, len(agent_ids), 2):
            opponent_idx = idx + 1
            if opponent_idx >= len(agent_ids):
                break
            self.agents_battle(agents[agent_ids[idx]], agents[agent_ids[opponent_idx]])

    def agents_battle(self, agent_1: "RPSAgent", agent_2: "RPSAgent") -> None:
        """
        Executes a single battle between two agents, determines the outcome,
        and updates their payoffs.
        """
        agent_1.id_competitor = agent_2.id
        agent_2.id_competitor = agent_1.id

        agent_1.select_action()
        agent_2.select_action()

        if agent_1.action == agent_2.action:
            agent_1.result = agent_2.result = "tie"
        elif (
            (agent_1.action == "rock" and agent_2.action == "paper")
            or (agent_1.action == "paper" and agent_2.action == "scissors")
            or (agent_1.action == "scissors" and agent_2.action == "rock")
        ):
            agent_1.result = "lose"
            agent_2.result = "win"
        else:
            agent_1.result = "win"
            agent_2.result = "lose"

        agent_1.set_action_payoff()
        agent_2.set_action_payoff()
        self.total_accumulated_payoff += agent_1.payoff + agent_2.payoff

    def agents_calc_action_share(self, period: int, agents: "AgentList[RPSAgent]") -> None:
        """
        Triggers the calculation of action shares for all agents at the very
        end of the simulation run.
        """
        if period == self.scenario.period_num - 1:
            for agent in agents:
                agent.calc_action_percentage()

Agent Behavior

Defined in core/agent.py.

import random
from typing import Dict, Tuple, TYPE_CHECKING

from Melodie import Agent

if TYPE_CHECKING:
    from examples.rock_paper_scissors_trainer.core.scenario import RPSScenario


class RPSAgent(Agent):
    """
    Represents an agent playing the Rock-Paper-Scissors game.

    Each agent has its own unique payoff settings for winning, losing, or tying,
    as well as three strategy parameters that determine the probabilities of
    choosing rock, paper, or scissors. These strategy parameters are the target
    for the evolutionary training by the Trainer module. The agent also tracks
    its own play history and accumulated payoffs.
    """

    scenario: "RPSScenario"

    def setup(self) -> None:
        # Payoff settings injected via agent params dataframe.
        self.payoff_rock_win: float = 0.0
        self.payoff_rock_lose: float = 0.0
        self.payoff_paper_win: float = 0.0
        self.payoff_paper_lose: float = 0.0
        self.payoff_scissors_win: float = 0.0
        self.payoff_scissors_lose: float = 0.0
        self.payoff_tie: float = 0.0

        # Strategy weights to be trained by the Trainer.
        self.strategy_param_1: float = 0.0
        self.strategy_param_2: float = 0.0
        self.strategy_param_3: float = 0.0

        self._reset_counters()

    def _reset_counters(self) -> None:
        """Initializes or resets the agent's state for a new simulation run."""
        self.id_competitor: int = 0
        self.action: str = ""
        self.result: str = ""
        self.payoff: float = 0.0
        self.accumulated_payoff: float = 0.0
        self.n_rock: int = 0
        self.n_paper: int = 0
        self.n_scissors: int = 0
        self.share_rock: float = 0.0
        self.share_paper: float = 0.0
        self.share_scissors: float = 0.0

    def setup_action_prob(self) -> None:
        """
        Normalizes the three strategy parameters into a probability distribution
        for choosing rock, paper, or scissors.
        """
        if self.strategy_param_1 == self.strategy_param_2 == self.strategy_param_3 == 0:
            # Avoid division by zero if a chromosome is all zeros during training.
            self.strategy_param_1 = self.strategy_param_2 = self.strategy_param_3 = 1.0
        total = self.strategy_param_1 + self.strategy_param_2 + self.strategy_param_3
        self.action_prob = {
            "rock": self.strategy_param_1 / total,
            "paper": self.strategy_param_2 / total,
            "scissors": self.strategy_param_3 / total,
        }

    def setup_action_payoff(self) -> None:
        """
        Creates a lookup dictionary that maps (action, outcome) pairs to their
        corresponding payoffs. This helps to keep the battle logic clean.
        """
        self.action_payoff: Dict[Tuple[str, str], float] = {
            ("rock", "win"): self.payoff_rock_win,
            ("rock", "lose"): self.payoff_rock_lose,
            ("paper", "win"): self.payoff_paper_win,
            ("paper", "lose"): self.payoff_paper_lose,
            ("scissors", "win"): self.payoff_scissors_win,
            ("scissors", "lose"): self.payoff_scissors_lose,
            ("rock", "tie"): self.payoff_tie,
            ("paper", "tie"): self.payoff_tie,
            ("scissors", "tie"): self.payoff_tie,
        }

    def select_action(self) -> None:
        """
        Selects an action (rock, paper, or scissors) based on the current
        strategy probabilities and updates the action counter.
        """
        rand = random.random()
        if rand <= self.action_prob["rock"]:
            self.action = "rock"
            self.n_rock += 1
        elif rand <= self.action_prob["rock"] + self.action_prob["paper"]:
            self.action = "paper"
            self.n_paper += 1
        else:
            self.action = "scissors"
            self.n_scissors += 1

    def set_action_payoff(self) -> None:
        """
        Records the payoff for the last action taken and adds it to the
        accumulated payoff for the current simulation run.
        """
        self.payoff = self.action_payoff[(self.action, self.result)]
        self.accumulated_payoff += self.payoff

    def calc_action_percentage(self) -> None:
        """
        Calculates the agent's long-term share of each action at the end of a
        simulation run. This is useful for analyzing the evolved strategies.
        """
        if self.scenario.period_num > 0:
            self.share_rock = self.n_rock / self.scenario.period_num
            self.share_paper = self.n_paper / self.scenario.period_num
            self.share_scissors = self.n_scissors / self.scenario.period_num

Data Collection Setup

Defined in core/data_collector.py.

from Melodie import DataCollector


class RPSDataCollector(DataCollector):
    """
    A custom data collector for the Rock-Paper-Scissors model.

    It is configured to save detailed agent-level data about each game round
    (e.g., opponent, action, result, payoff) as well as the final evolved
    strategy shares. It also records the environment-level total payoff.
    """

    def setup(self) -> None:
        self.add_agent_property("agents", "id_competitor")
        self.add_agent_property("agents", "action")
        self.add_agent_property("agents", "n_rock")
        self.add_agent_property("agents", "n_paper")
        self.add_agent_property("agents", "n_scissors")
        self.add_agent_property("agents", "result")
        self.add_agent_property("agents", "payoff")
        self.add_agent_property("agents", "accumulated_payoff")
        self.add_agent_property("agents", "share_rock")
        self.add_agent_property("agents", "share_paper")
        self.add_agent_property("agents", "share_scissors")
        self.add_environment_property("total_accumulated_payoff")

Scenario Definition

Defined in core/scenario.py.

from typing import Any, Dict

import numpy as np
import pandas as pd
from Melodie import Scenario


class RPSScenario(Scenario):
    """
    Defines and manages scenarios for the Rock-Paper-Scissors model.

    This class handles loading all input data and, notably, dynamically
    generates the agent parameter dataframe for each scenario. This approach
    is used instead of a static input file to ensure that each simulation
    run uses a unique but reproducible set of heterogeneous agent parameters
    based on the scenario's defined bounds.
    """

    def setup(self):
        # Payoff parameters, loaded from scenario tables.
        self.payoff_win_min = 0.0
        self.payoff_win_max = 0.0
        self.payoff_lose_min = 0.0
        self.payoff_lose_max = 0.0
        self.payoff_tie = 0.0

    def load_data(self) -> None:
        self.agent_params = self._generate_agent_params()

    def _generate_agent_params(self) -> pd.DataFrame:
        """
        Dynamically builds the agent parameter dataframe.

        For each agent, it generates random payoff values within the bounds
        specified by the current scenario (`self.payoff_win_min`, etc.) and
        initial random strategy parameters. Using the scenario `id` as a seed
        for the random number generator ensures that the parameter set is
        reproducible for each specific scenario.
        """
        assert self.agent_num > 0, "agent_num must be positive."
        rng = np.random.default_rng(self.id)

        def generator(agent_id: int) -> Dict[str, Any]:
            return {
                "id": agent_id,
                "id_scenario": self.id,
                "payoff_rock_win": rng.uniform(self.payoff_win_min, self.payoff_win_max),
                "payoff_rock_lose": rng.uniform(self.payoff_lose_min, self.payoff_lose_max),
                "payoff_paper_win": rng.uniform(self.payoff_win_min, self.payoff_win_max),
                "payoff_paper_lose": rng.uniform(self.payoff_lose_min, self.payoff_lose_max),
                "payoff_scissors_win": rng.uniform(self.payoff_win_min, self.payoff_win_max),
                "payoff_scissors_lose": rng.uniform(self.payoff_lose_min, self.payoff_lose_max),
                "payoff_tie": self.payoff_tie,
                "strategy_param_1": rng.uniform(0, 100),
                "strategy_param_2": rng.uniform(0, 100),
                "strategy_param_3": rng.uniform(0, 100),
            }

        return pd.DataFrame(generator(agent_id) for agent_id in range(self.agent_num))

Trainer Definition

Defined in core/trainer.py.

from Melodie import Trainer

from examples.rock_paper_scissors_trainer.core.agent import RPSAgent


class RPSTrainer(Trainer):
    """
    A custom Trainer for evolving agents' strategies in the Rock-Paper-Scissors model.

    This trainer uses a genetic algorithm to tune the three strategy parameters
    for each agent. The goal is to maximize the agent's final accumulated payoff,
    which serves as the fitness function for the evolutionary process.
    """

    def setup(self) -> None:
        # Configures the trainer to target the three strategy parameters for all
        # agents in the 'agents' list.
        self.add_agent_training_property(
            "agents",
            ["strategy_param_1", "strategy_param_2", "strategy_param_3"],
            lambda scenario: list(range(scenario.agent_num)),
        )

    def collect_data(self) -> None:
        # Specifies which properties to save during the training process.
        # This allows for analysis of how strategies and outcomes evolve over generations.
        self.add_agent_property("agents", "strategy_param_1")
        self.add_agent_property("agents", "strategy_param_2")
        self.add_agent_property("agents", "strategy_param_3")
        self.add_agent_property("agents", "share_rock")
        self.add_agent_property("agents", "share_paper")
        self.add_agent_property("agents", "share_scissors")
        self.add_environment_property("total_accumulated_payoff")

    def utility(self, agent: RPSAgent) -> float:
        """
        Defines the fitness function for the genetic algorithm.

        The trainer will aim to maximize this value. In this model, the fitness
        is simply the agent's total accumulated payoff at the end of a simulation run.
        """
        return agent.accumulated_payoff