Stochastic Policy Navigation

November 26, 2025

Overview

This project was built as part of CSCE 5210 at the University of North Texas. The goal was to compute an optimal navigation policy for an agent moving through a grid-world environment with terminal states (positive and negative rewards), obstacles, 8-directional movement (including diagonals), and stochastic action noise.

The full implementation is available in the Google Colab notebook.

The system uses a value-iteration approach to find the best policy — a mapping from each cell to the direction the agent should move — under configurable parameters like discount factor, living reward, and noise probability.
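
Concretely, the update rule is the standard Bellman backup: each cell's value becomes the living reward plus the discounted best achievable successor value. The sketch below is illustrative only (not the notebook code) and runs a few sweeps on a tiny one-dimensional world with a single terminal reward:

```python
# Illustrative sketch, not the project code: deterministic transitions
# on a 1-D world so the Bellman backup is easy to follow.
GAMMA = 0.9          # discount factor
LIVING_REWARD = -1   # per-step penalty

def sweep(values, terminal):
    """One Bellman backup per non-terminal cell: V(s) = R + gamma * max V(s')."""
    new_values = dict(values)
    for cell in values:
        if cell == terminal:
            continue  # terminal value stays fixed
        neighbors = [cell - 1, cell + 1]
        candidates = [values[n] for n in neighbors if n in values]
        new_values[cell] = LIVING_REWARD + GAMMA * max(candidates)
    return new_values

values = {0: 0.0, 1: 0.0, 2: 100.0}  # cell 2 is the +100 terminal
for _ in range(10):
    values = sweep(values, terminal=2)
```

After a handful of sweeps the values converge: the cell next to the terminal settles at -1 + 0.9 * 100 = 89, and the value propagates outward from there.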

Architecture

Grid Environment

The environment is built on a NetworkX graph using nx.grid_2d_graph extended with diagonal edges. Terminal states are placed at fixed positions: a positive reward at (2,0) and negative rewards at (0,0) and (4,0). Obstacles can be placed anywhere on the grid:

class Grid:
  def __init__(self, reward, starting_cell, rows, cols, obstacles):
    self.data = nx.grid_2d_graph(rows, cols)

    # Add diagonal edges
    for x in range(rows):
      for y in range(cols):
        if x + 1 < rows and y + 1 < cols:
          self.data.add_edge((x, y), (x + 1, y + 1))
        if x + 1 < rows and y - 1 >= 0:
          self.data.add_edge((x, y), (x + 1, y - 1))

    nx.set_node_attributes(self.data, 0, NODE_KEY_VAL)
    nx.set_node_attributes(self.data, NODE_TYPE_OPEN, NODE_KEY_TYPE)

    reward_node_values = {
      (0,0): -reward,
      (2,0): reward,
      (4,0): -reward
    }
    nx.set_node_attributes(self.data, reward_node_values, NODE_KEY_VAL)

Each node stores a value (the computed utility) and a type (direction, terminal, obstacle, or open). The default layout is a 5x8 grid, giving the agent a meaningful space to navigate.
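
As a quick illustration of the attribute scheme, the snippet below rebuilds the same node setup directly with NetworkX. The string values used for NODE_KEY_VAL, NODE_KEY_TYPE, and NODE_TYPE_OPEN are hypothetical stand-ins, not necessarily the constants defined in the notebook:

```python
import networkx as nx

# Hypothetical stand-ins for the notebook's attribute-key constants.
NODE_KEY_VAL = "val"
NODE_KEY_TYPE = "type"
NODE_TYPE_OPEN = "open"

g = nx.grid_2d_graph(5, 8)
nx.set_node_attributes(g, 0, NODE_KEY_VAL)               # default utility of 0
nx.set_node_attributes(g, NODE_TYPE_OPEN, NODE_KEY_TYPE) # all cells start open

# Terminal rewards in column 0, mirroring the constructor above.
nx.set_node_attributes(g, {(0, 0): -100, (2, 0): 100, (4, 0): -100}, NODE_KEY_VAL)

# Attributes are read back through each node's attribute dict:
positive_terminal = g.nodes[(2, 0)][NODE_KEY_VAL]  # 100
```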

Direction System and Movement Validation

The agent can move in 8 directions — 4 cardinal (north, south, east, west) and 4 intercardinal (north-east, north-west, south-east, south-west). The DirectionEvaluator class handles boundary checking and determines valid moves from any cell:

DIRECTION_MAP = {
  NORTH:       (-1,  0),
  SOUTH:       (1,  0),
  WEST:        (0, -1),
  EAST:        (0,  1),
  NORTH_WEST:  (-1, -1),
  NORTH_EAST:  (-1,  1),
  SOUTH_WEST:  (1, -1),
  SOUTH_EAST:  (1,  1),
}

An important design choice is how noise directions relate to the intended direction. If the agent intends a cardinal direction, it might accidentally move in an intercardinal direction instead (and vice versa). This creates realistic uncertainty in the agent's movement.
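
One plausible way to derive those noise alternatives is geometric adjacency: an intended move can slip into the directions that share its row or column offset on the other side of the cardinal/intercardinal boundary. The helper below is a hedged sketch of that idea; the notebook's DirectionEvaluator may encode the mapping differently:

```python
# Sketch only: derive noise alternatives from the offset geometry.
DIRECTION_MAP = {
    "N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1),
    "NW": (-1, -1), "NE": (-1, 1), "SW": (1, -1), "SE": (1, 1),
}
CARDINAL = {"N", "S", "E", "W"}

def noise_alternatives(direction):
    """Directions the agent may slip into when `direction` was intended."""
    dr, dc = DIRECTION_MAP[direction]
    others = []
    for name, (r, c) in DIRECTION_MAP.items():
        if name == direction:
            continue
        # A noise direction crosses the cardinal/intercardinal boundary...
        crosses = (name in CARDINAL) != (direction in CARDINAL)
        # ...and shares the intended move's nonzero row or column offset.
        adjacent = (r == dr and dr != 0) or (c == dc and dc != 0)
        if crosses and adjacent:
            others.append(name)
    return sorted(others)
```

Under this reading, an intended N can slip into NE or NW, and an intended NE can slip into N or E.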

PolicyFinder with Probabilistic Transitions

The PolicyFinder class implements the value computation. For each cell and intended direction, it calculates the expected value accounting for noise — the intended direction succeeds with probability 1-noise, and the remaining probability is split equally among valid alternative directions:

def compute_by_direction(self, cell, direction):
  """Expected value of intending `direction` from `cell`, under noise."""
  vals = []
  evaluator = DirectionEvaluator(
    cell=cell, dimensions=self.grid.get_dimensions()
  )
  for d in self.directions:
    # Evaluate the outcome of actually moving in d (intended or noisy).
    next_cell = evaluator.get_next_cell(direction=d)
    if next_cell is None:
      continue

    cell_val = self.grid.get_cell_value(cell=next_cell)
    if cell_val is None:
      continue

    # The intended direction succeeds with probability self.p; the
    # remaining 1 - self.p is split across valid noise alternatives.
    p = self.p
    if d != direction:
      all_valid_moves_from_cell = evaluator.get_all_valid_moves(direction=direction)
      total_moves = len(all_valid_moves_from_cell)
      if total_moves == 0:
        continue
      p = (1 - self.p) / total_moves

    vals.append(cell_val * p)

  # The expectation is the probability-weighted sum over outcomes.
  return sum(vals)

The compute_cell_value function evaluates all 8 directions and picks the one with the highest expected discounted reward plus living reward. This gives both the optimal value and the optimal direction for each cell.
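
Since compute_cell_value is only described in prose here, the following is a hedged sketch of its shape. The callable and the constants are stand-ins for the notebook's internals, not the actual implementation:

```python
# Illustrative constants; the notebook takes these as configuration.
GAMMA = 0.9
LIVING_REWARD = -1

def compute_cell_value(cell, directions, compute_by_direction):
    """Evaluate every direction; return (best value, best direction)."""
    best_val, best_dir = float("-inf"), None
    for d in directions:
        expected = compute_by_direction(cell, d)
        # Discounted expected successor value plus the per-step cost.
        val = LIVING_REWARD + GAMMA * expected
        if val > best_val:
            best_val, best_dir = val, d
    return best_val, best_dir
```

For example, with expected successor values of 10 for N and 50 for E, the function picks E with value -1 + 0.9 * 50 = 44.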

Frontier-Based Value Iteration

Rather than initializing and iterating over the entire grid at once, the policy computation starts from a seed cell near the positive terminal and expands outward. The compute_policy method iterates until changes fall below a termination threshold:

def compute_policy(self, max_iterations=500):
  keep_going = True
  iterations = 0
  # Cap the loop at max_iterations in case convergence is never reached.
  while keep_going and iterations < max_iterations:
    keep_going = False
    iterations += 1
    next_vals = []
    for cell in self.grid.get_update_list():
      cur_val = self.grid.get_cell_value(cell=cell)
      next_val, next_d = self.compute_cell_value(cell=cell)

      update_list_full = self.grid.update_list_full
      change_above_threshold = abs(cur_val - next_val) > self.termination_const
      if not update_list_full or change_above_threshold:
        keep_going = True

      self.grid.set_cell(cell=cell, val=next_val, direction=next_d)

      if not update_list_full:
        potential_frontier = self.expand_frontier(cell=cell)
        next_vals.extend([v for v in potential_frontier if v not in next_vals])

    self.grid.extend_update_list(next_vals)

This frontier expansion means cells closer to the reward get values first, which then propagate outward to inform more distant cells. Once all cells are in the update list, the algorithm continues iterating until convergence.
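
The ring-growing idea can be sketched independently of the project classes. The names below are illustrative, not the notebook's expand_frontier:

```python
# Sketch of frontier growth: each pass adds the untouched neighbors of
# the cells updated so far, so values ripple outward from the seed.
def neighbors8(cell, rows, cols):
    """Yield the in-bounds 8-connected neighbors of a cell."""
    r, c = cell
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr, dc) != (0, 0) and 0 <= r + dr < rows and 0 <= c + dc < cols:
                yield (r + dr, c + dc)

def grow_frontier(update_list, rows, cols):
    """Return the cells one ring beyond the current update list."""
    seen = set(update_list)
    ring = []
    for cell in update_list:
        for n in neighbors8(cell, rows, cols):
            if n not in seen:
                seen.add(n)
                ring.append(n)
    return ring
```

Seeding at the positive terminal (2,0) on a 5x8 grid, the first ring is its five in-bounds neighbors; repeating the call with the grown list eventually covers the whole grid.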

Policy Evaluation

After the policy is computed, the get_policy_graph method converts the grid into a directed graph where edges follow the chosen policy directions. Then get_state_paths finds all paths from each starting position in the top row to the reward cell and computes the maximum cumulative reward:

def get_state_paths(self):
  # (0, 0) is skipped: it holds a negative terminal, not a start state.
  start_states = [(0, i) for i in range(1, 8)]
  G = self.get_policy_graph()
  max_val = 0
  sum_path = lambda path: sum(self.get_cell_value(cell=v) for v in path)

  for state in start_states:
    paths = list(nx.all_simple_paths(G=G, source=state, target=self.reward_cell))
    if len(paths) == 0:
      continue
    max_val = max(max_val, max(map(sum_path, paths)))

  return max_val

This evaluation metric captures how well the policy guides the agent from various starting positions to the goal.
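
The get_policy_graph conversion itself is not shown above. A hedged sketch, assuming the computed policy is available as a mapping from each cell to its chosen direction offset (names here are illustrative), might look like:

```python
import networkx as nx

def policy_graph(policy, rows, cols):
    """Build a DiGraph with one out-edge per cell, following the policy."""
    G = nx.DiGraph()
    G.add_nodes_from(policy)
    for (r, c), (dr, dc) in policy.items():
        nxt = (r + dr, c + dc)
        if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols:
            G.add_edge((r, c), nxt)
    return G
```

Because every cell has at most one out-edge, each start state traces a single path, and nx.all_simple_paths recovers it directly.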

Scenarios

The project explores two configurations:

  • R1: A 5x8 grid with no obstacles, sweeping the living reward from -20 to 0 to find the value that produces the highest-quality policy. The living reward penalizes each step, encouraging the agent to find shorter paths while balancing risk from the negative terminal states.
  • R2: The same setup but with an obstacle placed at position (3,4), forcing the policy to route around it and showing how obstacles reshape the optimal navigation strategy.
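
The R1 sweep reduces to a small search over integer living rewards. In the sketch below, the evaluate callable stands in for "build the grid, run PolicyFinder, score the result with get_state_paths"; this is a hedged outline, not the notebook's driver code:

```python
# Sketch of the R1 parameter sweep; evaluate(living_reward) is assumed
# to return the policy-quality score for that living reward.
def sweep_living_reward(evaluate, low=-20, high=0):
    """Return (best_living_reward, best_score) over the integer sweep."""
    best_lr, best_score = None, float("-inf")
    for lr in range(low, high + 1):
        score = evaluate(lr)
        if score > best_score:
            best_lr, best_score = lr, score
    return best_lr, best_score
```

Plugging in the real evaluation makes it straightforward to compare scores across the sweep and report the living reward that yields the best policy.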