How do I specify multiple, heterogeneous actions for rl.env.MATLABEnvironment in the Reinforcement Learning Toolbox (or another way, if there is one)?

How would I go about defining very different actions for ActionInfo? It's a "specification object", but I'm not sure where this is actually defined. The documentation and examples all use a single array of actions. My concern is that if every action had to share the same limits, the action space would contain redundant regions and the agent would have a much harder time optimizing. See the example below for two very different actions I might need:
% Define action info (one scalar spec per action, each with its own limits)
ActionInfo(1) = rlNumericSpec([1 1],'LowerLimit',4,'UpperLimit',10);
ActionInfo(1).Name = 'speed';
ActionInfo(2) = rlNumericSpec([1 1],'LowerLimit',3000,'UpperLimit',10000);
ActionInfo(2).Name = 'distance';
Thanks for your help and this awesome toolbox!

Accepted Answer

Emmanouil Tzorakoleftherakis
Edited: Emmanouil Tzorakoleftherakis on 18 Feb 2021
Hello,
I probably don't understand your objective, but the two actions you mention above (distance and speed) are still scalars. What difference would it make to split them like that? Do you want different layers for each of the two for feature extraction?
If you want two scalar actions with different limits, you can do
ActionInfo = rlNumericSpec([2 1],'LowerLimit',[4;3000],'UpperLimit',[10;10000])
If you want to see how to use rlNumericSpec with different types of inputs, this example that uses heterogeneous observations (scalars and images) may be helpful. Check obsInfo to see how it's set up, or type
edit rl.env.AbstractSimplePendlumWithImage
to look into how the predefined environment is created (see lines 41-44).
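For reference, a heterogeneous spec array along those lines might look like the sketch below; the dimensions and names here are illustrative assumptions, not the exact values from the predefined environment:
% Illustrative sketch: a vector of specs, one element per observation
% channel, each with its own size, limits, and name (values assumed)
obsInfo(1) = rlNumericSpec([50 50 1]);    % image channel
obsInfo(1).Name = 'pendImage';
obsInfo(2) = rlNumericSpec([1 1],'LowerLimit',-inf,'UpperLimit',inf);
obsInfo(2).Name = 'angularRate';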
  6 Comments
Felix Windels on 17 Feb 2022
Referring to the original question: is it possible for the actions to end up outside the limits 0 to 1?
ActionInfo = rlNumericSpec([4 1],'LowerLimit',[0;0;0;0],'UpperLimit',[1;1;1;1]);
I think I have followed the programming from the example above accurately. Maybe there is a more elegant way, since all 4 actions have the same limits?
What stands out is that, when training different agents, I get episode rewards far below the minimum I would expect. One possible explanation would be that the action limits are being violated.
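For what it's worth, one common safeguard (an assumption on my part, not something confirmed in this thread) is to clamp incoming actions to the spec limits at the top of step(), since some agents' exploration noise can push actions outside the declared range:
% Sketch: clamp the received action to the declared spec limits in step()
actInfo = getActionInfo(this);
Action = min(max(Action(:), actInfo.LowerLimit), actInfo.UpperLimit);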
Thanks for any help and suggestions
Here is the complete environment:
A note: each property whose name ends in _time is an array with 8760 entries (annual hours). I generate the final script for transfer to the RL app with a short piece of code that fills these in, together with the "simple" scalars eta, Batt_P, etc.
classdef Environmentcontinuous < rl.env.MATLABEnvironment
    %ENVIRONMENTCONTINUOUS: Template for defining a custom environment in MATLAB.

    % Properties (set the properties' attributes accordingly)
    properties
        eta = [];
        Batt_P = [];
        Batt_cap = [];
        P_Inverter = [];
        PV_Peak = [];
        PV_gen_time = [];
        Load_time = [];
        Price_time = [];
        Compense_time = [];
    end
    properties
        % Initialize the system state
        State = [0;0;0;0;0;1]
    end
    properties(Access = protected)
        % Internal flag to indicate episode termination
        IsDone = false;
    end
    % Necessary methods
    methods
        % Constructor method creates an instance of the environment.
        % Change the class name and constructor name accordingly.
        function this = Environmentcontinuous()
            % Initialize observation settings
            ObservationInfo = rlNumericSpec([5 1]);
            ObservationInfo.Name = 'Observation';
            ObservationInfo.Description = 'Load_obs, PV_gen_obs, batt_obs, Price_obs, Compense_obs';
            % Initialize action settings
            ActionInfo = rlNumericSpec([4 1],'LowerLimit',[0;0;0;0],'UpperLimit',[1;1;1;1]);
            ActionInfo.Name = 'Prozentuale_Verwendung';
            ActionInfo.Description = ['Use of PV power to cover the load, ' ...
                'use of the remaining PV power to charge the battery, ' ...
                'use of the available battery charge to cover the load, ' ...
                'use of the available charging power to charge the battery'];
            % The following line implements built-in functions of the RL env
            this = this@rl.env.MATLABEnvironment(ObservationInfo,ActionInfo);
            % updateActionInfo(this);
        end
        % Apply the system dynamics and simulate the environment with the
        % given action for one step.
        function [Observation,Reward,IsDone,LoggedSignals] = step(this,Action)
            % Values from the time series
            Load_sys = this.Load_time(this.State(6));     % load in kWh
            PV_gen = this.PV_gen_time(this.State(6));     % PV generation in kWh
            Price = this.Price_time(this.State(6));       % electricity price for the current hour
            Compense = this.Compense_time(this.State(6)); % feed-in compensation for the current hour
            Batt_stor = this.State(3);
            % Observations for the next timestep (zero once the series ends)
            if this.State(6) < length(this.PV_gen_time)
                Load_obs = this.Load_time(this.State(6)+1);
                PV_gen_obs = this.PV_gen_time(this.State(6)+1);
                Price_obs = this.Price_time(this.State(6)+1);
                Compense_obs = this.Compense_time(this.State(6)+1);
            else
                Load_obs = 0;
                PV_gen_obs = 0;
                Price_obs = 0;
                Compense_obs = 0;
            end
            % Four values Action(1-4) are passed as the action, each
            % continuous between 0 and 1:
            % Action(1): share of the available PV power used to cover the load
            % Action(2): share of the remaining PV power used to charge the battery
            % Action(3): share of the available battery charge applied to the remaining load
            % Action(4): share of the available charging power used to charge the battery from the grid
            % With these four values, the code below should be able to represent
            % every conceivable energy flow. This design was chosen because the
            % observations (PV feed-in, load, battery state of charge) do not
            % only influence the agent's "decision"; they would otherwise also
            % act as hard limits on the action. Moreover, the agent's decisions
            % (e.g. using all of the PV power to cover the load) affect further
            % actions within the same timestep: no power would then be left for
            % feeding into the grid or for charging the battery.
            PV_Load = min([Action(1)*PV_gen Load_sys this.P_Inverter]); % load covered by PV
            PV_Batt = max(min([(PV_gen-PV_Load)*Action(2) this.Batt_P this.Batt_cap-Batt_stor]), 0); % battery charging from PV, clamped at zero
            Batt_stor = Batt_stor + PV_Batt;                % battery charge after charging with PV power
            PV_Grid = max(PV_gen-PV_Load-PV_Batt, 0);       % PV fed into the grid
            % Load covered by the battery: minimum of 1) the load remaining
            % after PV coverage, 2) the available battery charge, 3) the
            % difference between the maximum charging power and the power
            % already used, clamped at zero
            Batt_Load = this.eta * max(min([(Load_sys-PV_Load)*Action(3) Batt_stor*Action(3) (this.Batt_P-PV_Batt)*Action(3)]), 0);
            Batt_stor = Batt_stor - Batt_Load;              % battery charge after covering the load from the battery
            Grid_Load = (Load_sys-PV_Load-Batt_Load);       % load covered by grid import
            Grid_Batt = max(min([Action(4)*(this.Batt_P-PV_Batt-Batt_Load) Action(4)*(this.Batt_cap-Batt_stor)]), 0)*this.eta; % battery charging from the grid
            Reward = PV_Grid*Compense - (Grid_Load+Grid_Batt)*Price;
            % Battery state of charge
            Batt_obs = Batt_stor;
            LoggedSignals = [];
            % The IsDone flag is checked before the timestep index is
            % incremented; otherwise the episode would be cut off after 8759
            % timesteps. Alternatively:
            % IsDone = this.State(6) == length(this.PV_gen_time)+1
            % That form makes it possible to change the episode length in the
            % input data arbitrarily.
            IsDone = this.State(6) == length(this.PV_gen_time);
            this.IsDone = IsDone;
            if this.State(6) < length(this.PV_gen_time)
                this.State(6) = this.State(6)+1;
            end
            % Build the observation array
            Observation = [Load_obs;PV_gen_obs;Batt_obs;Price_obs;Compense_obs];
            % Update the system state
            this.State = [Observation; this.State(6)];
        end
        % Reset the environment to its initial state and output the initial
        % observation: Load_obs, PV_gen_obs, batt_obs, Price_obs, Compense_obs
        function InitialObservation = reset(this)
            Load_obs0 = this.Load_time(1);
            PV_obs0 = this.PV_gen_time(1);
            batt_obs0 = 0;
            Price_obs0 = 0;
            Compense_obs0 = 0;
            timestep0 = 1;
            InitialObservation = [Load_obs0;PV_obs0;batt_obs0;Price_obs0;Compense_obs0];
            this.State = [InitialObservation; timestep0];
            notifyEnvUpdated(this);
        end
    end
    % Optional methods (set the methods' attributes accordingly)
    methods
    end
    methods(Access = protected)
        % (optional) Update the visualization every time the environment is
        % updated (i.e. when notifyEnvUpdated is called)
        % function envUpdatedCallback(this)
    end
end
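As a quick sanity check on a custom environment like this, the toolbox function validateEnvironment runs reset and a simulation step and verifies that the outputs match the declared specs. A minimal usage sketch, with placeholder values assumed for the properties:
% Sketch: validate the custom environment before training (values assumed)
env = Environmentcontinuous();
env.eta = 0.95; env.Batt_P = 5; env.Batt_cap = 10; env.P_Inverter = 10;
env.PV_gen_time = rand(8760,1);    env.Load_time = rand(8760,1);
env.Price_time = 0.3*ones(8760,1); env.Compense_time = 0.08*ones(8760,1);
validateEnvironment(env)  % errors if the environment and its specs disagree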
Imola Fodor on 4 Mar 2024
Hi, I wonder if it is possible to specify one action that comes from a finite set of n options, and another (numeric) action giving the number of steps for which it should be repeated? Or should I have n+1 actions, one of them meaning "keep whatever was chosen previously", so that I am still calling the agent at each step?
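For context, mixing an rlFiniteSetSpec and an rlNumericSpec in a single action channel is not generally supported by the built-in agents (at least in the releases discussed here), so one workaround is to keep the action discrete and let the environment itself repeat it for a fixed number of internal sub-steps. A hypothetical sketch; RepeatingEnv, RepeatCount, and innerStep are illustrative names, not toolbox APIs:
% Hypothetical sketch: the agent is queried once per macro-step and the
% environment repeats the chosen action internally, accumulating reward
classdef RepeatingEnv < rl.env.MATLABEnvironment
    properties
        RepeatCount = 4;  % assumed fixed repetition length
    end
    methods
        function this = RepeatingEnv()
            obsInfo = rlNumericSpec([1 1]);
            actInfo = rlFiniteSetSpec(1:3);  % three discrete choices (assumed)
            this = this@rl.env.MATLABEnvironment(obsInfo,actInfo);
        end
        function [Observation,Reward,IsDone,LoggedSignals] = step(this,Action)
            Reward = 0;
            for k = 1:this.RepeatCount       % repeat the chosen action
                [Observation,r,IsDone,LoggedSignals] = innerStep(this,Action);
                Reward = Reward + r;
                if IsDone, break; end
            end
        end
        function InitialObservation = reset(this)
            InitialObservation = 0;
        end
    end
    methods(Access = private)
        function [obs,r,done,logged] = innerStep(this,Action)
            % Placeholder dynamics for illustration only
            obs = randn; r = -abs(obs - Action); done = false; logged = [];
        end
    end
end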


More Answers (1)

John Doe on 18 Feb 2021
thank you!! awesomeness

Release

R2020b
