Custom RL environment creation

Hello everyone,
I am trying to implement the following custom environment for my RL agent. There is a rectangular area, as shown in the figure, divided into smaller grids of some fixed dimension (say 20 m x 20 m). Every grid has a profit. The agent starts at the lower-left corner (x=0, y=0), moves in each step (say every 0.25 s), and finally stays at the position that gives the highest profit sum. The total duration is 30 s, so once it finds the best place, it stays there for the rest of the duration and enjoys the profit. The profit sum is calculated from the agent's coverage (say the coverage range of the agent is 50 m): every grid whose centre lies within that range contributes its profit. However, the agent not only wants to maximize the profit sum but also wants to minimize its travel distance, so essentially the optimization is maximizing (profit sum / traveled distance). The agent can take two continuous actions: distance (D) and angle of movement (Theta). D is in [0, 12.5 m] and Theta is in [0, 359.9999 deg]. This means the position of the agent updates in each step as follows:
x(new) = x(old) + D*cos(Theta)
y(new) = y(old) + D*sin(Theta)
I thought the observation space would be a tuple of four elements:
[Profit collected, Distance traveled, x pos of the agent, y pos of the agent].
And my reward is something like delta(profit)/delta(distance traveled), where delta(profit) is the profit gained (or lost) by moving in a step, and likewise for delta(distance traveled), because my target is to gain high profit with little movement. I am trying to implement the custom environment by following
https://in.mathworks.com/help/reinforcement-learning/ug/create-custom-matlab-environment-from-template.html
I have implemented part of it, but the last two methods are not clear to me, and I am doubtful about the overall code as well. The respective sections are commented in the code. Any suggestions?
Another thing, how do I introduce multiple agents into the scenario?
Thanks for reading such a long question.
classdef MyEnvironment < rl.env.MATLABEnvironment
%MYENVIRONMENT: Template for defining custom environment in MATLAB.
%% Properties (set properties' attributes accordingly)
properties
% Specify and initialize environment's necessary properties
% X and Y grid numbers
XGrid = 10
YGrid = 10
% Grid size in meter (square grids)
GridSize = 20.0
% Full dimension of the Grid
XMax = 200
YMax = 200
% Max and Min Angle the agent can move in degree (in each step)
MaxAngle = 359.9999
MinAngle = 0
% Sample time (S)
Ts = 0.25
% Max Distance the agent can travel in meter (in each sample time)
MaxD = 50 % in 1s agent can travel 50m
MaxDistance = 12.50 % = 50*0.25
MinDistance = 0
% System dynamics don't change over this interval (in sec)
FixDuration = 30
SimuDuration = 30 % for now Simulation time is same as Fix duration i.e. no change in profit over time
no_of_steps = 120 % no. of steps possible in one episode (30/0.25)
% Coverage range (in m) -- the agent covers grid centres up to this range
CovRange = 50
% Penalty when the agent goes outside the boundary
PenaltyForGoingOutside = -100
% Grid centre positions and profits (struct array); filled in by the constructor
Grid
% Total number of grids (XGrid*YGrid); set by the constructor
Total_Grids
end
properties
% Initialize system state 4 values. All zeros.
% They are -- [Collected Profit sum, Traveled distance, x, and y pos of the
% agent]
% [ProfitSum, DistTravel, agent_x, agent_y]'
State = zeros(4,1)
end
properties(Access = protected)
% Initialize internal flag to indicate episode termination
IsDone = false
end
%% Necessary Methods
methods
% Constructor method creates an instance of the environment
function this = MyEnvironment()
% Initialize Observation settings
ObservationInfo = rlNumericSpec([4 1]);
ObservationInfo.Name = 'Grid States';
ObservationInfo.Description = 'Profit, Distance, x, y';
% Initialize Action settings
% Two actions -- distance and angle. Both limited by the
% provided range
ActionInfo = rlNumericSpec([2 1],'LowerLimit',[0;0],'UpperLimit',[12.5;359.9999]);
ActionInfo.Name = 'dist;angle';
% The following line implements built-in functions of RL env
this = this@rl.env.MATLABEnvironment(ObservationInfo,ActionInfo);
% Grids' centre position and profit details
% Grid is a struct which stores 3 values. x pos, y pos and profit
% for each grid.
% This information is necessary to compute the coverage profit sum
Total_Grids = this.XGrid*this.YGrid;
Grid(Total_Grids) = struct();
G = 1;
for i = 1:this.XGrid
for j = 1:this.YGrid
Grid(G).X = (this.GridSize/2)+ (j-1)*(this.GridSize); % x pos of each grid centre
Grid(G).Y = (this.GridSize/2)+ (i-1)*(this.GridSize); % y pos of each grid centre
G = G + 1;
end
end
Profits = randi([0,20],Total_Grids,1); % random profit of each grid
for G = 1:Total_Grids
Grid(G).Profit = Profits(G); % stored in Grid structure
end
% Store grid information in properties so that step() can access this.Grid and this.Total_Grids
this.Grid = Grid;
this.Total_Grids = Total_Grids;
% Initialize property values and pre-compute necessary values
updateActionInfo(this);
end
% Apply system dynamics and simulate the environment with the
% given action for one step.
function [Observation,Reward,IsDone,LoggedSignals] = step(this,Action)
LoggedSignals = [];
% n is used to count total number of steps taken.
% when n == possible time steps in one episode then end the
% episode
persistent n
if isempty(n)
n = 1;
else
n = n+1;
end
% Get actions
[dist,angle] = getMovement(this,Action);
% Unpack state vector
Profit = this.State(1);
Distance = this.State(2);
x = this.State(3);
y = this.State(4);
% Computation of the necessary values
CosTheta = cosd(angle);
SinTheta = sind(angle);
x_new = x + dist*CosTheta;
y_new = y + dist*SinTheta;
% To compute the new profit after taking the actions
% Idea is if the centre of a grid is within the coverage range
% of the agent, then it is covered and its profit is obtained.
P = 0;
for k = 1: this.Total_Grids
if sqrt((x_new-this.Grid(k).X)^2 + (y_new-this.Grid(k).Y)^2)<= this.CovRange
P = P + this.Grid(k).Profit;
end
end
new_Profit = P;
dist_Traveled = Distance + dist; % cumulative distance traveled so far
delta_profit = new_Profit-Profit;
delta_dist = dist;
% New Observation (column vector, matching the [4 1] observation spec)
Observation = [new_Profit; dist_Traveled; x_new; y_new];
% Update system states
this.State = Observation;
% Check terminal condition
if n == this.no_of_steps
this.IsDone = true;
n = 0; % reset the persistent counter so the next episode counts from 1 again
end
% Reward:
% If the agent goes outside the region, penalize it
if (x_new > this.XMax || y_new > this.YMax || x_new < 0 || y_new < 0)
penalty = this.PenaltyForGoingOutside;
else
penalty = 0;
end
Reward = 10*(delta_profit/delta_dist)+ penalty;
end
% Reset environment to initial state and output initial observation
function InitialObservation = reset(this)
% Profit sum goes to 0
P0 = 0;
% Distance travelled goes to 0
D0 = 0;
% Initial x pos of the robot
X0 = 0;
% Initial y pos of the robot
Y0 = 0;
InitialObservation = [P0;D0;X0;Y0];
this.State = InitialObservation;
end
end
methods
% Helper methods to create the environment
% Not sure how to update these two methods???
function [dist,angle] = getMovement(this,action)
% Continuous actions: clip each element of the action vector to its valid range
% (the Elements-based check from the cart-pole template only applies to discrete action specs)
dist = max(this.MinDistance, min(this.MaxDistance, action(1)));
angle = max(this.MinAngle, min(this.MaxAngle, action(2)));
end
% Update the action info based on the property values
% Not sure how to update this in my case???
function updateActionInfo(this)
this.ActionInfo.LowerLimit = [this.MinDistance; this.MinAngle];
this.ActionInfo.UpperLimit = [this.MaxDistance; this.MaxAngle];
end
end
end
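For reference, once the class file is saved (assumed here as MyEnvironment.m) on the MATLAB path, the environment can be constructed and given a quick sanity check; validateEnvironment runs reset and step once and reports interface errors:
env = MyEnvironment();              % construct the custom environment
obsInfo = getObservationInfo(env);  % [4 1] numeric spec
actInfo = getActionInfo(env);       % [2 1] numeric spec with the distance/angle limits
validateEnvironment(env)            % exercises reset/step to catch interface problems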

Accepted Answer

Emmanouil Tzorakoleftherakis
Hello,
Based on the updated files you sent on this post, you are setting this.IsDone; however, this is a class property, which is different from the IsDone output that 'step' must return. You need to set both to eliminate the error you are seeing.
There is an additional error which happens after that, and it is due to how the reward (line 168 in the attached file) is defined. Specifically, there is a division by zero - make sure you account for that in your reward logic.
Hope that helps
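For illustration, the terminal and reward section of step could look something like the sketch below (only a sketch against the class posted above; skipping the ratio when delta_dist is zero is just one way to handle the division by zero):
% Set both the output flag and the class property
if n == this.no_of_steps
this.IsDone = true;
end
IsDone = this.IsDone;
% Guard against division by zero when the agent does not move
if delta_dist > 0
Reward = 10*(delta_profit/delta_dist) + penalty;
else
Reward = penalty; % no movement, so no ratio-based reward this step
end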


Release

R2020b
