How do I specify multiple, heterogeneous actions for rl.env.MATLABEnvironment in the Reinforcement Learning Toolbox (or another way, if there is one)?

How would I go about defining very different actions for ActionInfo? It's a "specification object", but I'm not sure where this is actually defined. The documentation and examples all use a single array of actions. My concern is that if every action had to share the same limits, the action space would contain redundant regions and the agent would have a much harder time optimizing. See the example below for two very different actions I might need:
% Define action info (one scalar spec per action, each with its own limits)
ActionInfo(1) = rlNumericSpec([1 1],'LowerLimit',4,'UpperLimit',10);
ActionInfo(1).Name = 'speed';
ActionInfo(2) = rlNumericSpec([1 1],'LowerLimit',3000,'UpperLimit',10000);
ActionInfo(2).Name = 'distance';
Thanks for your help and this awesome toolbox!

Accepted Answer

Emmanouil Tzorakoleftherakis
Edited: Emmanouil Tzorakoleftherakis on 18 Feb 2021
Hello,
I probably don't understand your objective, but the two actions you mention above (distance and speed) are still scalars. What difference would it make to split them like that? Do you want different layers for each of the two for feature extraction?
If you want two scalar actions with different limits, you can do
ActionInfo = rlNumericSpec([2 1],'LowerLimit',[4;3000],'UpperLimit',[10;10000])
If you want to see how to use rlNumericSpec with different types of inputs, this example that uses heterogeneous observations (scalars and images) may be helpful. Check obsInfo to see how it's set up, or type
edit rl.env.AbstractSimplePendlumWithImage
to look into how the predefined environment is created (see lines 41-44).
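For reference, a heterogeneous spec array along those lines might look like the sketch below; the dimensions and names here are illustrative assumptions, not the exact values from the predefined environment:
% Illustrative sketch: a vector of specs, one element per observation
% channel, each with its own size, limits, and name (values assumed)
obsInfo(1) = rlNumericSpec([50 50 1]);    % image channel
obsInfo(1).Name = 'pendImage';
obsInfo(2) = rlNumericSpec([1 1],'LowerLimit',-inf,'UpperLimit',inf);
obsInfo(2).Name = 'angularRate';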
  6 Comments
Felix Windels on 17 Feb 2022
Referring to the original question: is it possible for the actions to end up outside the limits 0 to 1?
ActionInfo = rlNumericSpec([4 1],'LowerLimit',[0;0;0;0],'UpperLimit',[1;1;1;1]);
I think I have followed the programming from the example above accurately. Maybe there is a more elegant way, since all 4 actions have the same limits?
What stands out is that, when training different agents, I get episode rewards far below the minimum I would expect. One possible explanation would be that the action limits are being violated.
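For what it's worth, one common safeguard (an assumption on my part, not something confirmed in this thread) is to clamp incoming actions to the spec limits at the top of step(), since some agents' exploration noise can push actions outside the declared range:
% Sketch: clamp the received action to the declared spec limits in step()
actInfo = getActionInfo(this);
Action = min(max(Action(:), actInfo.LowerLimit), actInfo.UpperLimit);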
Thanks for any help and suggestions
Here is the complete environment:
A note: each property whose name ends in _time is an array with 8760 entries (annual hours). I generate the final script for transfer to the RL app with a short piece of code that fills these in, together with the "simple" scalars eta, Batt_P, etc.
classdef Environmentcontinuous < rl.env.MATLABEnvironment
    %ENVIRONMENTCONTINUOUS: Template for defining a custom environment in MATLAB.

    % Properties (set the properties' attributes accordingly)
    properties
        eta = [];
        Batt_P = [];
        Batt_cap = [];
        P_Inverter = [];
        PV_Peak = [];
        PV_gen_time = [];
        Load_time = [];
        Price_time = [];
        Compense_time = [];
    end
    properties
        % Initialize the system state
        State = [0;0;0;0;0;1]
    end
    properties(Access = protected)
        % Internal flag to indicate episode termination
        IsDone = false;
    end
    % Necessary methods
    methods
        % Constructor method creates an instance of the environment.
        % Change the class name and constructor name accordingly.
        function this = Environmentcontinuous()
            % Initialize observation settings
            ObservationInfo = rlNumericSpec([5 1]);
            ObservationInfo.Name = 'Observation';
            ObservationInfo.Description = 'Load_obs, PV_gen_obs, batt_obs, Price_obs, Compense_obs';
            % Initialize action settings
            ActionInfo = rlNumericSpec([4 1],'LowerLimit',[0;0;0;0],'UpperLimit',[1;1;1;1]);
            ActionInfo.Name = 'Prozentuale_Verwendung';
            ActionInfo.Description = ['Use of PV power to cover the load, ' ...
                'use of the remaining PV power to charge the battery, ' ...
                'use of the available battery charge to cover the load, ' ...
                'use of the available charging power to charge the battery'];
            % The following line implements built-in functions of the RL env
            this = this@rl.env.MATLABEnvironment(ObservationInfo,ActionInfo);
            % updateActionInfo(this);
        end
        % Apply the system dynamics and simulate the environment with the
        % given action for one step.
        function [Observation,Reward,IsDone,LoggedSignals] = step(this,Action)
            % Values from the time series
            Load_sys = this.Load_time(this.State(6));     % load in kWh
            PV_gen = this.PV_gen_time(this.State(6));     % PV generation in kWh
            Price = this.Price_time(this.State(6));       % electricity price for the current hour
            Compense = this.Compense_time(this.State(6)); % feed-in compensation for the current hour
            Batt_stor = this.State(3);
            % Observations for the next timestep (zero once the series ends)
            if this.State(6) < length(this.PV_gen_time)
                Load_obs = this.Load_time(this.State(6)+1);
                PV_gen_obs = this.PV_gen_time(this.State(6)+1);
                Price_obs = this.Price_time(this.State(6)+1);
                Compense_obs = this.Compense_time(this.State(6)+1);
            else
                Load_obs = 0;
                PV_gen_obs = 0;
                Price_obs = 0;
                Compense_obs = 0;
            end
            % Four values Action(1-4) are passed as the action, each
            % continuous between 0 and 1:
            % Action(1): share of the available PV power used to cover the load
            % Action(2): share of the remaining PV power used to charge the battery
            % Action(3): share of the available battery charge applied to the remaining load
            % Action(4): share of the available charging power used to charge the battery from the grid
            % With these four values, the code below should be able to represent
            % every conceivable energy flow. This design was chosen because the
            % observations (PV feed-in, load, battery state of charge) do not
            % only influence the agent's "decision"; they would otherwise also
            % act as hard limits on the action. Moreover, the agent's decisions
            % (e.g. using all of the PV power to cover the load) affect further
            % actions within the same timestep: no power would then be left for
            % feeding into the grid or for charging the battery.
            PV_Load = min([Action(1)*PV_gen Load_sys this.P_Inverter]); % load covered by PV
            PV_Batt = max(min([(PV_gen-PV_Load)*Action(2) this.Batt_P this.Batt_cap-Batt_stor]), 0); % battery charging from PV, clamped at zero
            Batt_stor = Batt_stor + PV_Batt;                % battery charge after charging with PV power
            PV_Grid = max(PV_gen-PV_Load-PV_Batt, 0);       % PV fed into the grid
            % Load covered by the battery: minimum of 1) the load remaining
            % after PV coverage, 2) the available battery charge, 3) the
            % difference between the maximum charging power and the power
            % already used, clamped at zero
            Batt_Load = this.eta * max(min([(Load_sys-PV_Load)*Action(3) Batt_stor*Action(3) (this.Batt_P-PV_Batt)*Action(3)]), 0);
            Batt_stor = Batt_stor - Batt_Load;              % battery charge after covering the load from the battery
            Grid_Load = (Load_sys-PV_Load-Batt_Load);       % load covered by grid import
            Grid_Batt = max(min([Action(4)*(this.Batt_P-PV_Batt-Batt_Load) Action(4)*(this.Batt_cap-Batt_stor)]), 0)*this.eta; % battery charging from the grid
            Reward = PV_Grid*Compense - (Grid_Load+Grid_Batt)*Price;
            % Battery state of charge
            Batt_obs = Batt_stor;
            LoggedSignals = [];
            % The IsDone flag is checked before the timestep index is
            % incremented; otherwise the episode would be cut off after 8759
            % timesteps. Alternatively:
            % IsDone = this.State(6) == length(this.PV_gen_time)+1
            % That form makes it possible to change the episode length in the
            % input data arbitrarily.
            IsDone = this.State(6) == length(this.PV_gen_time);
            this.IsDone = IsDone;
            if this.State(6) < length(this.PV_gen_time)
                this.State(6) = this.State(6)+1;
            end
            % Build the observation array
            Observation = [Load_obs;PV_gen_obs;Batt_obs;Price_obs;Compense_obs];
            % Update the system state
            this.State = [Observation; this.State(6)];
        end
        % Reset the environment to its initial state and output the initial
        % observation: Load_obs, PV_gen_obs, batt_obs, Price_obs, Compense_obs
        function InitialObservation = reset(this)
            Load_obs0 = this.Load_time(1);
            PV_obs0 = this.PV_gen_time(1);
            batt_obs0 = 0;
            Price_obs0 = 0;
            Compense_obs0 = 0;
            timestep0 = 1;
            InitialObservation = [Load_obs0;PV_obs0;batt_obs0;Price_obs0;Compense_obs0];
            this.State = [InitialObservation; timestep0];
            notifyEnvUpdated(this);
        end
    end
    % Optional methods (set the methods' attributes accordingly)
    methods
    end
    methods(Access = protected)
        % (optional) Update the visualization every time the environment is
        % updated (i.e. when notifyEnvUpdated is called)
        % function envUpdatedCallback(this)
    end
end
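As a quick sanity check on a custom environment like this, the toolbox function validateEnvironment runs reset and a simulation step and verifies that the outputs match the declared specs. A minimal usage sketch, with placeholder values assumed for the properties:
% Sketch: validate the custom environment before training (values assumed)
env = Environmentcontinuous();
env.eta = 0.95; env.Batt_P = 5; env.Batt_cap = 10; env.P_Inverter = 10;
env.PV_gen_time = rand(8760,1);    env.Load_time = rand(8760,1);
env.Price_time = 0.3*ones(8760,1); env.Compense_time = 0.08*ones(8760,1);
validateEnvironment(env)  % errors if the environment and its specs disagree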
Imola Fodor on 4 Mar 2024
Hi, I wonder if it is possible to specify one action that comes from a finite set of n options, and another (numeric) action giving the number of steps for which it should be repeated? Or should I have n+1 actions, one of them meaning "keep whatever was chosen previously", so that I am still calling the agent at each step?
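For context, mixing an rlFiniteSetSpec and an rlNumericSpec in a single action channel is not generally supported by the built-in agents (at least in the releases discussed here), so one workaround is to keep the action discrete and let the environment itself repeat it for a fixed number of internal sub-steps. A hypothetical sketch; RepeatingEnv, RepeatCount, and innerStep are illustrative names, not toolbox APIs:
% Hypothetical sketch: the agent is queried once per macro-step and the
% environment repeats the chosen action internally, accumulating reward
classdef RepeatingEnv < rl.env.MATLABEnvironment
    properties
        RepeatCount = 4;  % assumed fixed repetition length
    end
    methods
        function this = RepeatingEnv()
            obsInfo = rlNumericSpec([1 1]);
            actInfo = rlFiniteSetSpec(1:3);  % three discrete choices (assumed)
            this = this@rl.env.MATLABEnvironment(obsInfo,actInfo);
        end
        function [Observation,Reward,IsDone,LoggedSignals] = step(this,Action)
            Reward = 0;
            for k = 1:this.RepeatCount       % repeat the chosen action
                [Observation,r,IsDone,LoggedSignals] = innerStep(this,Action);
                Reward = Reward + r;
                if IsDone, break; end
            end
        end
        function InitialObservation = reset(this)
            InitialObservation = 0;
        end
    end
    methods(Access = private)
        function [obs,r,done,logged] = innerStep(this,Action)
            % Placeholder dynamics for illustration only
            obs = randn; r = -abs(obs - Action); done = false; logged = [];
        end
    end
end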


More Answers (1)

John Doe on 18 Feb 2021
thank you!! awesomeness

Release

R2020b
