RockPaperScissorsTrainer

This example demonstrates Melodie’s Trainer module using a Rock-Paper-Scissors game model. In this model, agents start with heterogeneous payoff preferences and strategy weights. The trainer then uses a genetic algorithm (GA) to evolve these strategy parameters, aiming for higher accumulated payoffs for each agent.

Trainer: Project Structure

examples/rock_paper_scissors_trainer
├── core/
│   ├── agent.py            # Defines agent's strategy, actions, and payoff logic
│   ├── data_collector.py   # Collects micro and macro simulation results
│   ├── environment.py      # Manages pairwise agent battles
│   ├── model.py            # Contains the main simulation loop
│   ├── scenario.py         # Defines scenarios and generates agent parameters
│   └── trainer.py          # Configures and runs the GA-based training
├── data/
│   ├── input/
│   │   ├── SimulatorScenarios.csv
│   │   ├── TrainerScenarios.csv
│   │   └── TrainerParamsScenarios.csv
│   └── output/
└── main.py

Trainer: GA Concepts

While the Calibrator tunes parameters to match an external target, the Trainer is designed for an internal optimization process. Its goal is to allow agents to “learn” and evolve their individual behaviors to maximize their own objectives within the model’s world.

Core Idea: Maximizing Utility

  • Agent-Level Objective: Each agent has a goal, which is quantified by a utility function. In this example, the utility for an agent is its accumulated_payoff at the end of a simulation run. The Trainer’s objective is to find the behavioral parameters that maximize this utility for each agent.

  • Parameters: You specify which agent-level parameter(s) the Trainer should evolve. Here, it’s the three strategy weights: strategy_param_1, strategy_param_2, and strategy_param_3.

  • Utility Function: You must implement a utility(agent) method. This function is called after a simulation run and must return a single float value representing that agent’s performance or “fitness.”
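
Putting these pieces together, a single candidate parameter set is scored roughly as sketched below. This is a simplified conceptual illustration, not Melodie's internal code; the function and parameter names are placeholders.

# Conceptual sketch of how one candidate parameter set is scored.
# "model", "agent", and "utility" stand in for the pieces described above.
def evaluate_candidate(model, agent, candidate_params: dict, utility) -> float:
    # 1. Write the candidate strategy weights onto the agent.
    for name, value in candidate_params.items():
        setattr(agent, name, value)
    # 2. Run one full simulation with these weights in place.
    model.run()
    # 3. Read the agent's utility (accumulated payoff) back as its fitness.
    return utility(agent)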

Genetic Algorithm (GA) for Agent Learning

The Trainer runs a separate genetic algorithm for each agent.

  • Chromosome: An agent’s set of trainable parameters (e.g., the three strategy weights) is treated as its “chromosome.”

  • Population: For each agent, the GA maintains a “population” of these chromosomes (i.e., different sets of strategy weights).

  • Fitness: To evaluate a chromosome, the model is run with the agent using that set of strategy weights. The agent’s final utility (accumulated payoff) serves as the fitness score.

  • Evolution: Through generations, the GA for each agent independently selects, breeds, and mutates its population of strategy weights, converging on the strategy that yields the highest personal payoff for that agent, given the behavior of all other agents.

  • Parameter Encoding: The Trainer uses the same binary encoding mechanism as the Calibrator to represent continuous strategy parameters for the GA. For a detailed explanation of this process, please see the Parameter Encoding: From Float to Binary section in the CovidContagionCalibrator documentation.
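
To make the binary encoding concrete, the sketch below shows one common way to decode a fixed-length bit string into a float within given bounds. It is an illustrative example only, not Melodie's implementation; the helper name and the linear-mapping scheme are assumptions for exposition.

# Illustrative decoding of a GA bit string into a bounded float (hypothetical helper).
def decode_bits(bits, param_min: float, param_max: float) -> float:
    """Map a bit list such as [1, 0, 0, 1, ...] linearly onto [param_min, param_max]."""
    code_length = len(bits)  # plays the role of strategy_param_code_length
    as_int = int("".join(str(b) for b in bits), 2)
    max_int = 2 ** code_length - 1
    return param_min + (param_max - param_min) * as_int / max_int

# Example: an 8-bit chromosome segment decoded into the search space [0, 100].
print(decode_bits([1, 0, 0, 1, 1, 0, 1, 0], 0.0, 100.0))  # about 60.39

A longer code length yields a finer grid over the same bounds, which is why strategy_param_code_length controls the precision of the trained parameters.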

Parameter Configuration (`TrainerParamsScenarios.csv`)

This file controls the behavior of the genetic algorithm for all agents:

  • id: Unique identifier for a set of GA parameters.

  • path_num: How many independent training processes (paths) to run. Each path is a complete run of the GA, helping to test the robustness of the evolutionary outcome.

  • generation_num: The number of generations the GA will run for.

  • strategy_population: The size of the chromosome population maintained for each agent.

  • mutation_prob: The probability of random mutations.

  • strategy_param_code_length: The number of bits used to encode each strategy parameter in its binary chromosome segment, which determines the parameter's precision.

  • strategy_param_1_min/max, etc.: These define the search space bounds for each parameter being trained.
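
As a hypothetical illustration only (the values below are invented, not the shipped configuration), a row of TrainerParamsScenarios.csv could look like this:

id,path_num,generation_num,strategy_population,mutation_prob,strategy_param_code_length,strategy_param_1_min,strategy_param_1_max,strategy_param_2_min,strategy_param_2_max,strategy_param_3_min,strategy_param_3_max
0,1,50,100,0.02,20,0,100,0,100,0,100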

Trainer: How It Works

  • Simulation Loop: In each period, agents normalize their strategy weights into action probabilities (rock, paper, or scissors). They are then randomly paired to play one round, and their payoffs are recorded. At the end of the simulation run, each agent’s long-term share of actions is calculated.

  • Training Loop: The RPSTrainer evolves the strategy parameters for every agent. The fitness is each agent’s accumulated_payoff. Training settings come from TrainerParamsScenarios.csv.

  • Input Data: The scenarios for the simulator and trainer are defined in their respective CSV files. Agent-level parameters are generated dynamically within the RPSScenario class. agent_num must be an even number, as agents are paired up in each period.
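
For orientation, a TrainerScenarios.csv row could look roughly like the following, with columns matching the attributes used by RPSScenario and the model (id, agent_num, period_num, and the payoff bounds). The exact column set and the values shown here are assumptions, not the shipped file:

id,agent_num,period_num,payoff_win_min,payoff_win_max,payoff_lose_min,payoff_lose_max,payoff_tie
0,100,200,1.0,2.0,-2.0,-1.0,0.0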

Trainer: Results

The following plots, taken from the original model, show the evolution of the mean and the coefficient of variation of the total accumulated payoff across all agents over the course of training. This illustrates the effectiveness of the training process.

(Figures: total payoff mean across generations; total payoff coefficient of variation across generations.)

Trainer: Running the Model

You can run both the simulator and the trainer using the main script:

python examples/rock_paper_scissors_trainer/main.py

The trainer clears the output directory before it runs. Therefore, after executing the command, you will only find the trainer’s output tables in examples/rock_paper_scissors_trainer/data/output. If you need the simulator’s results, you should run it separately (for example, by commenting out the run_trainer call in main.py).
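
For reference, the training entry point essentially wires a Config object to the trainer shown in the code section below. The following is a minimal, hypothetical sketch of that step, not the shipped main.py; the Config argument names and the trainer.train() call are assumptions that may differ from your Melodie version.

# Hypothetical sketch of a training entry point (not the shipped main.py).
from Melodie import Config

from examples.rock_paper_scissors_trainer.core.model import RPSModel
from examples.rock_paper_scissors_trainer.core.scenario import RPSScenario
from examples.rock_paper_scissors_trainer.core.trainer import RPSTrainer

cfg = Config(
    project_name="RockPaperScissorsTrainer",
    project_root="examples/rock_paper_scissors_trainer",
    input_folder="examples/rock_paper_scissors_trainer/data/input",
    output_folder="examples/rock_paper_scissors_trainer/data/output",
)  # argument names assumed; check Melodie's Config signature


def run_trainer() -> None:
    trainer = RPSTrainer(
        config=cfg,
        scenario_cls=RPSScenario,
        model_cls=RPSModel,
        processors=4,
    )
    trainer.train()  # method name assumed; see Melodie's Trainer API


if __name__ == "__main__":
    run_trainer()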

Parallel Execution Mode

The Trainer supports two parallelization modes, controlled by the parallel_mode parameter when creating the trainer instance:

  • ``parallel_mode="process"`` (default): Uses subprocess-based parallelism via multiprocessing. This is the traditional approach and works on all Python versions. It is recommended for most use cases.

  • ``parallel_mode="thread"``: Uses thread-based parallelism via ThreadPoolExecutor. This mode is recommended for Python 3.13+ (free-threaded/No-GIL builds) as it can provide better performance by avoiding the overhead of process creation and data serialization. In older Python versions, this mode will still run but may be limited by the Global Interpreter Lock (GIL).

You can specify the mode when creating the trainer:

trainer = RPSTrainer(
    config=cfg,
    scenario_cls=RPSScenario,
    model_cls=RPSModel,
    processors=4,
    parallel_mode="thread",  # or "process" (default)
)

Trainer: Code

This section shows the key code implementation for the trainer model.

Model Structure

Defined in core/model.py.

from typing import TYPE_CHECKING

from Melodie import Model

from examples.rock_paper_scissors_trainer.core.agent import RPSAgent
from examples.rock_paper_scissors_trainer.core.data_collector import RPSDataCollector
from examples.rock_paper_scissors_trainer.core.environment import RPSEnvironment
from examples.rock_paper_scissors_trainer.core.scenario import RPSScenario

if TYPE_CHECKING:
    from Melodie import AgentList


class RPSModel(Model):
    """
    The main model class for the Rock-Paper-Scissors simulation.

    It sets up the agents, environment, and data collector, and defines the
    main simulation loop that runs for a specified number of periods.
    """

    scenario: "RPSScenario"
    data_collector: RPSDataCollector

    def create(self) -> None:
        self.agents: "AgentList[RPSAgent]" = self.create_agent_list(RPSAgent)
        self.environment: RPSEnvironment = self.create_environment(RPSEnvironment)
        self.data_collector = self.create_data_collector(RPSDataCollector)

    def setup(self) -> None:
        # Populates the agent list using the dynamically generated parameter table
        # from the scenario.
        self.agents.setup_agents(
            agents_num=self.scenario.agent_num,
            params_df=self.scenario.agent_params,
        )

    def run(self) -> None:
        """Executes the main simulation loop for `scenario.period_num` periods."""
        for period in range(self.scenario.period_num):
            self.environment.agents_setup_data(self.agents)
            self.environment.run_game_rounds(self.agents)
            self.environment.agents_calc_action_share(period, self.agents)
            self.data_collector.collect(period)
        self.data_collector.save()

Environment Logic

Defined in core/environment.py.

import random
from typing import TYPE_CHECKING

from Melodie import AgentList, Environment

from examples.rock_paper_scissors_trainer.core.agent import RPSAgent

if TYPE_CHECKING:
    from examples.rock_paper_scissors_trainer.core.scenario import RPSScenario


class RPSEnvironment(Environment):
    """
    The environment for the Rock-Paper-Scissors model.

    It is responsible for orchestrating the agent interactions by randomly
    pairing them up for battles in each period. It also tracks the total
    accumulated payoff across all agents as a macro-level indicator.
    """

    scenario: "RPSScenario"

    def setup(self) -> None:
        """Initializes environment-level properties."""
        self.total_accumulated_payoff = 0.0

    @staticmethod
    def agents_setup_data(agents: "AgentList[RPSAgent]") -> None:
        """
        Prepares derived variables for all agents at the beginning of each period.
        """
        for agent in agents:
            agent.setup_action_prob()
            agent.setup_action_payoff()

    def run_game_rounds(self, agents: "AgentList[RPSAgent]") -> None:
        """
        In each period, randomly pairs up all agents to play one round of the game.
        """
        assert self.scenario.agent_num % 2 == 0, "scenario.agent_num must be even."
        agent_ids = list(range(self.scenario.agent_num))
        random.shuffle(agent_ids)
        for idx in range(0, len(agent_ids), 2):
            opponent_idx = idx + 1
            if opponent_idx >= len(agent_ids):
                break
            self.agents_battle(agents[agent_ids[idx]], agents[agent_ids[opponent_idx]])

    def agents_battle(self, agent_1: "RPSAgent", agent_2: "RPSAgent") -> None:
        """
        Executes a single battle between two agents, determines the outcome,
        and updates their payoffs.
        """
        agent_1.id_competitor = agent_2.id
        agent_2.id_competitor = agent_1.id

        agent_1.select_action()
        agent_2.select_action()

        if agent_1.action == agent_2.action:
            agent_1.result = agent_2.result = "tie"
        elif (
            (agent_1.action == "rock" and agent_2.action == "paper")
            or (agent_1.action == "paper" and agent_2.action == "scissors")
            or (agent_1.action == "scissors" and agent_2.action == "rock")
        ):
            agent_1.result = "lose"
            agent_2.result = "win"
        else:
            agent_1.result = "win"
            agent_2.result = "lose"

        agent_1.set_action_payoff()
        agent_2.set_action_payoff()
        self.total_accumulated_payoff += agent_1.payoff + agent_2.payoff

    def agents_calc_action_share(self, period: int, agents: "AgentList[RPSAgent]") -> None:
        """
        Triggers the calculation of action shares for all agents at the very
        end of the simulation run.
        """
        if period == self.scenario.period_num - 1:
            for agent in agents:
                agent.calc_action_percentage()

Agent Behavior

Defined in core/agent.py.

import random
from typing import Dict, Tuple, TYPE_CHECKING

from Melodie import Agent

if TYPE_CHECKING:
    from examples.rock_paper_scissors_trainer.core.scenario import RPSScenario


class RPSAgent(Agent):
    """
    Represents an agent playing the Rock-Paper-Scissors game.

    Each agent has its own unique payoff settings for winning, losing, or tying,
    as well as three strategy parameters that determine the probabilities of
    choosing rock, paper, or scissors. These strategy parameters are the target
    for the evolutionary training by the Trainer module. The agent also tracks
    its own play history and accumulated payoffs.
    """

    scenario: "RPSScenario"

    def setup(self) -> None:
        # Payoff settings injected via agent params dataframe.
        self.payoff_rock_win: float = 0.0
        self.payoff_rock_lose: float = 0.0
        self.payoff_paper_win: float = 0.0
        self.payoff_paper_lose: float = 0.0
        self.payoff_scissors_win: float = 0.0
        self.payoff_scissors_lose: float = 0.0
        self.payoff_tie: float = 0.0

        # Strategy weights to be trained by the Trainer.
        self.strategy_param_1: float = 0.0
        self.strategy_param_2: float = 0.0
        self.strategy_param_3: float = 0.0

        self._reset_counters()

    def _reset_counters(self) -> None:
        """Initializes or resets the agent's state for a new simulation run."""
        self.id_competitor: int = 0
        self.action: str = ""
        self.result: str = ""
        self.payoff: float = 0.0
        self.accumulated_payoff: float = 0.0
        self.n_rock: int = 0
        self.n_paper: int = 0
        self.n_scissors: int = 0
        self.share_rock: float = 0.0
        self.share_paper: float = 0.0
        self.share_scissors: float = 0.0

    def setup_action_prob(self) -> None:
        """
        Normalizes the three strategy parameters into a probability distribution
        for choosing rock, paper, or scissors.
        """
        if self.strategy_param_1 == self.strategy_param_2 == self.strategy_param_3 == 0:
            # Avoid division by zero if a chromosome is all zeros during training.
            self.strategy_param_1 = self.strategy_param_2 = self.strategy_param_3 = 1.0
        total = self.strategy_param_1 + self.strategy_param_2 + self.strategy_param_3
        self.action_prob = {
            "rock": self.strategy_param_1 / total,
            "paper": self.strategy_param_2 / total,
            "scissors": self.strategy_param_3 / total,
        }

    def setup_action_payoff(self) -> None:
        """
        Creates a lookup dictionary that maps (action, outcome) pairs to their
        corresponding payoffs. This helps to keep the battle logic clean.
        """
        self.action_payoff: Dict[Tuple[str, str], float] = {
            ("rock", "win"): self.payoff_rock_win,
            ("rock", "lose"): self.payoff_rock_lose,
            ("paper", "win"): self.payoff_paper_win,
            ("paper", "lose"): self.payoff_paper_lose,
            ("scissors", "win"): self.payoff_scissors_win,
            ("scissors", "lose"): self.payoff_scissors_lose,
            ("rock", "tie"): self.payoff_tie,
            ("paper", "tie"): self.payoff_tie,
            ("scissors", "tie"): self.payoff_tie,
        }

    def select_action(self) -> None:
        """
        Selects an action (rock, paper, or scissors) based on the current
        strategy probabilities and updates the action counter.
        """
        rand = random.random()
        if rand <= self.action_prob["rock"]:
            self.action = "rock"
            self.n_rock += 1
        elif rand <= self.action_prob["rock"] + self.action_prob["paper"]:
            self.action = "paper"
            self.n_paper += 1
        else:
            self.action = "scissors"
            self.n_scissors += 1

    def set_action_payoff(self) -> None:
        """
        Records the payoff for the last action taken and adds it to the
        accumulated payoff for the current simulation run.
        """
        self.payoff = self.action_payoff[(self.action, self.result)]
        self.accumulated_payoff += self.payoff

    def calc_action_percentage(self) -> None:
        """
        Calculates the agent's long-term share of each action at the end of a
        simulation run. This is useful for analyzing the evolved strategies.
        """
        if self.scenario.period_num > 0:
            self.share_rock = self.n_rock / self.scenario.period_num
            self.share_paper = self.n_paper / self.scenario.period_num
            self.share_scissors = self.n_scissors / self.scenario.period_num

Data Collection Setup

Defined in core/data_collector.py.

from Melodie import DataCollector


class RPSDataCollector(DataCollector):
    """
    A custom data collector for the Rock-Paper-Scissors model.

    It is configured to save detailed agent-level data about each game round
    (e.g., opponent, action, result, payoff) as well as the final evolved
    strategy shares. It also records the environment-level total payoff.
    """

    def setup(self) -> None:
        self.add_agent_property("agents", "id_competitor")
        self.add_agent_property("agents", "action")
        self.add_agent_property("agents", "n_rock")
        self.add_agent_property("agents", "n_paper")
        self.add_agent_property("agents", "n_scissors")
        self.add_agent_property("agents", "result")
        self.add_agent_property("agents", "payoff")
        self.add_agent_property("agents", "accumulated_payoff")
        self.add_agent_property("agents", "share_rock")
        self.add_agent_property("agents", "share_paper")
        self.add_agent_property("agents", "share_scissors")
        self.add_environment_property("total_accumulated_payoff")

Scenario Definition

Defined in core/scenario.py.

from typing import Any, Dict

import numpy as np
import pandas as pd
from Melodie import Scenario


class RPSScenario(Scenario):
    """
    Defines and manages scenarios for the Rock-Paper-Scissors model.

    This class handles loading all input data and, notably, dynamically
    generates the agent parameter dataframe for each scenario. This approach
    is used instead of a static input file to ensure that each simulation
    run uses a unique but reproducible set of heterogeneous agent parameters
    based on the scenario's defined bounds.
    """

    def setup(self):
        # Payoff parameters, loaded from scenario tables.
        self.payoff_win_min = 0.0
        self.payoff_win_max = 0.0
        self.payoff_lose_min = 0.0
        self.payoff_lose_max = 0.0
        self.payoff_tie = 0.0

    def load_data(self) -> None:
        self.agent_params = self._generate_agent_params()

    def _generate_agent_params(self) -> pd.DataFrame:
        """
        Dynamically builds the agent parameter dataframe.

        For each agent, it generates random payoff values within the bounds
        specified by the current scenario (`self.payoff_win_min`, etc.) and
        initial random strategy parameters. Using the scenario `id` as a seed
        for the random number generator ensures that the parameter set is
        reproducible for each specific scenario.
        """
        assert self.agent_num > 0, "agent_num must be positive."
        rng = np.random.default_rng(self.id)

        def generator(agent_id: int) -> Dict[str, Any]:
            return {
                "id": agent_id,
                "id_scenario": self.id,
                "payoff_rock_win": rng.uniform(self.payoff_win_min, self.payoff_win_max),
                "payoff_rock_lose": rng.uniform(self.payoff_lose_min, self.payoff_lose_max),
                "payoff_paper_win": rng.uniform(self.payoff_win_min, self.payoff_win_max),
                "payoff_paper_lose": rng.uniform(self.payoff_lose_min, self.payoff_lose_max),
                "payoff_scissors_win": rng.uniform(self.payoff_win_min, self.payoff_win_max),
                "payoff_scissors_lose": rng.uniform(self.payoff_lose_min, self.payoff_lose_max),
                "payoff_tie": self.payoff_tie,
                "strategy_param_1": rng.uniform(0, 100),
                "strategy_param_2": rng.uniform(0, 100),
                "strategy_param_3": rng.uniform(0, 100),
            }

        return pd.DataFrame(generator(agent_id) for agent_id in range(self.agent_num))

Trainer Definition

Defined in core/trainer.py.

from Melodie import Trainer

from examples.rock_paper_scissors_trainer.core.agent import RPSAgent


class RPSTrainer(Trainer):
    """
    A custom Trainer for evolving agents' strategies in the Rock-Paper-Scissors model.

    This trainer uses a genetic algorithm to tune the three strategy parameters
    for each agent. The goal is to maximize the agent's final accumulated payoff,
    which serves as the fitness function for the evolutionary process.
    """

    def setup(self) -> None:
        # Configures the trainer to target the three strategy parameters for all
        # agents in the 'agents' list.
        self.add_agent_training_property(
            "agents",
            ["strategy_param_1", "strategy_param_2", "strategy_param_3"],
            lambda scenario: list(range(scenario.agent_num)),
        )

    def collect_data(self) -> None:
        # Specifies which properties to save during the training process.
        # This allows for analysis of how strategies and outcomes evolve over generations.
        self.add_agent_property("agents", "strategy_param_1")
        self.add_agent_property("agents", "strategy_param_2")
        self.add_agent_property("agents", "strategy_param_3")
        self.add_agent_property("agents", "share_rock")
        self.add_agent_property("agents", "share_paper")
        self.add_agent_property("agents", "share_scissors")
        self.add_environment_property("total_accumulated_payoff")

    def utility(self, agent: RPSAgent) -> float:
        """
        Defines the fitness function for the genetic algorithm.

        The trainer will aim to maximize this value. In this model, the fitness
        is simply the agent's total accumulated payoff at the end of a simulation run.
        """
        return agent.accumulated_payoff