Implementation Details
========================

.. figure:: /Documentation/images/architecture.png
   :width: 900
   :align: center
   :alt: architecture

--------------------------------------------------------------

This section describes the agent architecture, the training process, and how a large language model (LLM) is integrated into the reinforcement learning agent that plays checkers.

Agent Architecture
-------------------

The reinforcement learning agent is designed with a neural network to approximate the Q-values for state-action pairs. Below are the key components of the agent architecture:

- **Input Layer:**

  - Accepts a 5-dimensional feature vector representing the game state.

- **Hidden Layers:**

  - Two dense layers:

    - First layer: 32 neurons with ReLU activation.
    - Second layer: 16 neurons with ReLU activation and L2 regularization (regularization parameter = 0.1).

- **Output Layer:**

  - A single neuron with ReLU activation, predicting the Q-value for the input state.

- **Optimizer and Loss Function:**

  - Optimizer: Nadam (Nesterov-accelerated adaptive moment estimation).
  - Loss Function: Binary cross-entropy.
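
The layer list above maps naturally onto a Keras ``Sequential`` model. The following is a minimal sketch, assuming TensorFlow/Keras as the framework (the text does not name one). The helper name ``build_q_network`` is hypothetical, the L2 penalty is assumed to apply to the layer's kernel weights, and Nadam is left at its default settings because no optimizer hyperparameters are given here.

.. code-block:: python

   import tensorflow as tf


   def build_q_network() -> tf.keras.Model:
       """Build a 5 -> 32 -> 16 -> 1 dense network (hypothetical helper)."""
       model = tf.keras.Sequential([
           tf.keras.Input(shape=(5,)),                    # 5-dimensional state features
           tf.keras.layers.Dense(32, activation="relu"),  # first hidden layer
           tf.keras.layers.Dense(
               16,
               activation="relu",
               # Assumption: the L2 penalty (0.1) is applied to the kernel weights.
               kernel_regularizer=tf.keras.regularizers.L2(0.1),
           ),
           tf.keras.layers.Dense(1, activation="relu"),   # single Q-value output
       ])
       # Optimizer and loss as listed above; Nadam uses its default settings.
       model.compile(optimizer=tf.keras.optimizers.Nadam(),
                     loss="binary_crossentropy")
       return model


   model = build_q_network()

Calling ``model.summary()`` on the result is a quick way to confirm that the layer sizes match the list above.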

Training Process
----------------

The training process involves a hybrid approach combining reinforcement learning and supervised fine-tuning. Below are the detailed steps:

1. **Self-Play:**

   - The agent plays against itself or other opponents (e.g., random moves, minimax strategy).

2. **Exploration vs. Exploitation:**

   - An exploration probability (initially 0.95) determines whether the agent explores random actions or exploits the policy learned so far.

3. **Reward Assignment:**

   - Rewards are assigned based on the game outcome:

     - Win: +10
     - Loss: -10
     - Draw: Neutral (implicit handling).

4. **Temporal Difference Learning:**

   - The Q-values are updated using the temporal difference formula:

     .. math::

        Q(s, a) \leftarrow Q(s, a) + \alpha \cdot \left( r + \gamma \cdot \max_{a'} Q(s', a') - Q(s, a) \right)

     Where:

     - :math:`\alpha`: Learning rate (set to 0.5).
     - :math:`\gamma`: Discount factor (set to 0.95).
     - :math:`r`: Reward for the current state.
     - :math:`Q(s', a')`: Predicted future Q-values in the next state :math:`s'`; the maximum is taken over the next actions :math:`a'`.

5. **Supervised Fine-Tuning:**

   - The updated Q-values serve as training labels, and the model is fine-tuned on them with backpropagation.

6. **Performance Tracking:**

   - Win rates, losses, and rewards are tracked across generations.
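
Steps 2 to 4 can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions, not the project's actual code: the function names ``choose_action`` and ``td_update`` are hypothetical, a draw is assumed to be worth 0 (the text only calls it neutral), and a terminal state is assumed to have no future value.

.. code-block:: python

   import random

   ALPHA = 0.5   # learning rate, from the text
   GAMMA = 0.95  # discount factor, from the text
   # Win/loss rewards are from the text; draw = 0.0 is an assumption.
   REWARDS = {"win": 10.0, "loss": -10.0, "draw": 0.0}


   def choose_action(actions, q_value, explore_prob, rng=random):
       """Explore with probability explore_prob, otherwise pick the best-valued action."""
       if rng.random() < explore_prob:
           return rng.choice(actions)      # exploration: uniformly random move
       return max(actions, key=q_value)    # exploitation: highest estimated Q-value


   def td_update(q_sa, reward, next_q_values):
       """Return the updated Q(s, a) according to the temporal difference formula."""
       # Assumption: a terminal state (no next actions) contributes no future value.
       best_next = max(next_q_values, default=0.0)
       return q_sa + ALPHA * (reward + GAMMA * best_next - q_sa)


   # Worked example: Q(s, a) = 2.0, r = 0, best next Q-value = 4.0
   # gives 2.0 + 0.5 * (0 + 0.95 * 4.0 - 2.0), which is approximately 2.9.
   new_q = td_update(2.0, 0.0, [1.0, 4.0])

In the fine-tuning step, values such as ``new_q`` would play the role of the labels the network is trained on.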

LLM Integration
----------------

The LLM dynamically narrows the action space that the agent has to consider. It is integrated as follows:

1. **Action Space Filtering:**

   - For each game state, the LLM evaluates all possible actions and filters them down to the :math:`n` most promising ones.

2. **Efficient Decision-Making:**

   - The RL agent evaluates only the reduced action space, so fewer candidate moves need to be scored on each turn.

3. **Seamless Interaction:**

   - The LLM is queried alongside the RL agent during play, with the aim of adding minimal latency while filtering actions.
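
The filtering step can be sketched as a top-:math:`n` selection over scored actions. In the example below, ``filter_actions`` and ``dummy_score`` are hypothetical names, and ``dummy_score`` is only a stand-in for the LLM call; how the LLM is actually prompted and how its answer is parsed is not covered here.

.. code-block:: python

   from typing import Callable, List, Sequence, TypeVar

   Action = TypeVar("Action")


   def filter_actions(state: object,
                      actions: Sequence[Action],
                      score_action: Callable[[object, Action], float],
                      n: int) -> List[Action]:
       """Keep the n highest-scoring actions; ties keep their original order."""
       ranked = sorted(actions, key=lambda a: score_action(state, a), reverse=True)
       return ranked[:n]


   def dummy_score(state, action):
       # Stand-in for an LLM evaluation: longer move strings (multi-jumps) score higher.
       return len(action)


   moves = ["a3-b4", "c3-d4", "e3xg5xe7", "g3-h4"]
   shortlist = filter_actions(None, moves, dummy_score, n=2)
   # shortlist == ["e3xg5xe7", "a3-b4"]

The RL agent would then compute Q-values only for the moves in ``shortlist``.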

Generations
-----------

The training process runs for multiple generations, each consisting of several games. Key details include:

- **Generations:**

  - A total of 25 generations.

- **Games per Generation:**

  - Each generation involves 10 games against a chosen opponent.

- **Exploration Decay:**

  - The exploration probability decreases over generations, allowing the agent to rely more on exploitation.
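
The outer loop can be sketched as follows. The generation count, games per generation, and initial exploration probability come from this page; the multiplicative decay factor of 0.9 is purely an assumption, since the decay schedule is not specified here. The names ``exploration_schedule`` and ``run_training`` are hypothetical, and the game itself is replaced by a random placeholder.

.. code-block:: python

   import random

   GENERATIONS = 25
   GAMES_PER_GENERATION = 10
   INITIAL_EXPLORE_PROB = 0.95
   DECAY = 0.9  # assumed per-generation decay factor; not given in the text


   def exploration_schedule(generations=GENERATIONS,
                            start=INITIAL_EXPLORE_PROB,
                            decay=DECAY):
       """Return the exploration probability used in each generation."""
       return [start * decay ** g for g in range(generations)]


   def run_training(play_game, schedule):
       """Play a fixed number of games per generation and record the win rate."""
       win_rates = []
       for explore_prob in schedule:
           results = [play_game(explore_prob) for _ in range(GAMES_PER_GENERATION)]
           win_rates.append(results.count("win") / GAMES_PER_GENERATION)
       return win_rates


   schedule = exploration_schedule()
   # Placeholder game: a real run would play checkers and update the model.
   win_rates = run_training(lambda p: random.choice(["win", "loss", "draw"]), schedule)

Any monotonically decreasing schedule would fit the description above; the geometric form is used only because it is short to write down.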

Next Steps
----------

- **Interface Details:**
  Learn how to interact with the trained model and visualize its performance in the next section.