The Overall Model
Why This Model Exists
Model-based GUI automation requires coordinating several critical concerns:
- Perception: What's currently visible on screen?
- Knowledge Representation: What states, elements, and transitions exist in the GUI?
- State Awareness: Where are we currently in the GUI environment?
- Decision Making: What path should we take to reach our goal?
- Execution: How do we perform actions and transitions reliably?
Traditional process-based automation embeds all of these concerns within sequential scripts, creating tight coupling and exponential complexity. The Overall Model provides a formal framework that separates these concerns into six interrelated components, enabling robust, maintainable automation.
The Big Picture
The Overall Model unifies perception, knowledge, and execution into a cohesive system:
             Domain Knowledge (User-Defined)
                            ↓
                   State Structure (Ω)
             [States, Elements, Transitions]
                            ↓
      ┌─────────────────────┴─────────────────────┐
      ↓                                           ↓
Strategic Knowledge (Framework)            Visible GUI (Ξ)
[a, M, τ, §]                               [Screen Pixels]
      ↓                                           ↓
  Actions → State Updates ← Element Detection ────┘
      ↓
Navigation & Recovery
The framework continuously compares its internal belief of active states (S_Ξ) against the reality of the screen (Ξ), enabling detection of unexpected events and dynamic adaptation.
Formal Definition
The overall model of GUI automation is defined as a six-tuple (Ξ, Ω, a, M, τ, §).
This tuple unifies the core components into a formal system that specifies how Brobot perceives the GUI, manages state, performs actions, and navigates the interface.
Component Definitions
Ξ - The Visible GUI
Formal Definition:
The visible GUI Ξ contains:
- The scene, or the pixel output of the screen
- E_Ξ = f(Ξ) ⊆ E, the set of all GUI elements in the visible GUI
where f(Ξ) is the element extraction function that identifies which GUI elements from the State Structure are currently visible on screen.
Implementation: BrobotScreenCapture.java, UnifiedCaptureService.java
Concept: Ξ represents the ground truth of what's actually on screen at any moment. Unlike process-based approaches that use the screen as transient input, Brobot treats Ξ as an explicit component that can be queried and verified. This enables a crucial feedback loop: the framework continuously compares its belief about active states against the reality of Ξ, detecting unexpected events (like popup windows) and confirming transition outcomes.
The scene captures pixel-level representation of the entire GUI, including all elements, background images, and white space. This detailed information is crucial for analyzing the effects of individual actions and determining the next feasible steps, especially in stochastic environments where GUI changes may not follow predictable patterns.
Ω - The State Structure
Formal Definition:
The State Structure is a tuple Ω = (E, S, T) defined by:
- E = {e₁, e₂, ..., eₙ}: The set of all GUI elements selected to model the environment (images, regions, locations, text patterns, etc.)
- S: The set of all GUI states, where:
  - Each state s ∈ S is a subset of E (a state is a collection of related GUI elements)
  - Multiple states can be active simultaneously: S_Ξ ⊆ S
  - A state is active if and only if at least one of its elements is visible: s ∈ S_Ξ ⟺ s ∩ E_Ξ ≠ ∅
- T: The set of all transitions between states
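The activation rule s ∩ E_Ξ ≠ ∅ can be illustrated with plain sets. The following is a minimal sketch, not Brobot's implementation: elements and states are modeled as strings, and the names `StateStructureSketch`, `activeStates`, and `exampleStates` are hypothetical.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: states as named subsets of E, activity as s ∩ E_Ξ ≠ ∅. */
class StateStructureSketch {

    /** Returns S_Ξ: the names of all states that share at least one element with E_Ξ. */
    static Set<String> activeStates(Map<String, Set<String>> states, Set<String> visibleElements) {
        Set<String> active = new LinkedHashSet<>();
        for (Map.Entry<String, Set<String>> state : states.entrySet()) {
            // Collections.disjoint is true when the two sets have no element in common.
            if (!Collections.disjoint(state.getValue(), visibleElements)) {
                active.add(state.getKey());
            }
        }
        return active;
    }

    /** Builds a small example S in which each state is a subset of E. */
    static Map<String, Set<String>> exampleStates() {
        Map<String, Set<String>> states = new LinkedHashMap<>();
        states.put("LoginPage", Set.of("usernameField", "loginButton"));
        states.put("Dashboard", Set.of("dashboardHeader"));
        states.put("CookieBanner", Set.of("acceptCookiesButton"));
        return states;
    }
}
```

With E_Ξ = {loginButton, acceptCookiesButton}, this sketch reports both LoginPage and CookieBanner as active, which shows how two states can be active at the same time.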
Implementation: State.java, StateImage.java, StateRegion.java, StateTransition.java
Concept: The State Structure externalizes domain knowledge about the GUI environment into an explicit, machine-readable format. This is the developer-defined "map" of states, elements, and transitions specific to your application.
Process-based approaches embed environment knowledge within action sequences, creating tight coupling. By externalizing this knowledge into Ω, the framework enables:
- Independent testing of environment representation
- Reusable state definitions across multiple automation tasks
- Clear separation between "what exists" (states) and "how to navigate" (transitions)
State Definition: A state in model-based GUI automation is a collection of related GUI elements. State objects are often grouped spatially or appear at the same time. Objects used together in a process are likely candidates for belonging to the same state. However, the definition of a state is subjective and should make sense in the context of your automation goals.
a - The Action Model
Formal Definition:
An atomic action is a tuple a = (o_a, E_a, ζ) comprising:
- o_a: parameters or options associated with the action
- E_a: the elements acted on
- ζ: the success function specific to the action type
The action function is defined as:
f_a: (a, Ξ, Θ) → (Ξ', r_a) where:
- a: the action, which can modify or observe the GUI
- Ξ: the visible GUI before the action
- Θ: environmental stochasticity (all external events and factors that can unpredictably influence the GUI state)
- Ξ': the resulting GUI after the action is performed
- r_a: data describing the action's results
From these action results, state information is derived by the function:
S_a: r_a → P(S × {True, False})
This function maps action results to a set of (state, boolean) pairs indicating which states should be activated (True) or deactivated (False).
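One possible reading of S_a is sketched below. The derivation rule is an assumption made for illustration, since the text above does not specify it: a state is marked True when any of its searched elements was found, and False when searched elements were missing and none was found. The sketch also assumes each element belongs to exactly one state; `StateInfoSketch` and `deriveStateInfo` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of S_a: turning an action result r_a into (state, boolean) pairs. */
class StateInfoSketch {

    /**
     * @param owner   maps each element to the single state assumed to contain it
     * @param found   elements the action searched for and found
     * @param missing elements the action searched for and did not find
     * @return state name mapped to True (activate) or False (deactivate)
     */
    static Map<String, Boolean> deriveStateInfo(
            Map<String, String> owner, Set<String> found, Set<String> missing) {
        Map<String, Boolean> info = new LinkedHashMap<>();
        // Assumption: a missing element proposes its state for deactivation...
        for (String element : missing) {
            info.put(owner.get(element), Boolean.FALSE);
        }
        // ...unless another element of the same state was found, which takes precedence.
        for (String element : found) {
            info.put(owner.get(element), Boolean.TRUE);
        }
        return info;
    }
}
```

Returning a map keeps at most one boolean per state, which is a simplification of the set of pairs P(S × {True, False}) in the formal definition.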
Implementation: Action.java, ClickOptions.java, PatternFindOptions.java, TypeOptions.java
Concept: The Action Model defines atomic operations for interacting with the GUI (click, type, find, move, drag, etc.). Process-based tools perform such actions too; what the Action Model adds is a standardized result structure (r_a) that serves as a consistent interface contract. This allows the State Management system to reliably process outcomes without custom logic for each action type.
The model abstracts environmental stochasticity (Θ) from an action's internal logic, narrowing the implementation scope of each action to its core task. Actions aren't responsible for handling unpredictable events like popup windows—instead, the framework's state management and path traversal components handle the consequences of such events, preventing code duplication and centralizing error recovery logic.
Environmental Stochasticity (Θ): Unexpected environmental events can significantly influence action execution. For example, a popup window appearing before a click action will cause the click to act on the popup instead of the intended button. By maintaining synchronization between the framework's knowledge of the GUI and the reality of Ξ, the system can detect and reconcile such discrepancies.
M - The State Management System
Formal Definition:
The state management model is defined as M = (S_Ξ) where S_Ξ ⊆ S is the current set of active states.
The state management function processes updates from actions or transitions:
f_M: (S_Ξ, S_a, S_t) → S'_Ξ where:
- S_a and S_t are state information from actions and transitions respectively, each a set of (s, value) pairs where s ∈ S and value ∈ {True, False}
- S_+ = {s ∈ S | (s, True) ∈ (S_a ∪ S_t)}: states to activate
- S_- = {s ∈ S | (s, False) ∈ (S_a ∪ S_t)}: states to deactivate
- S'_Ξ = (S_Ξ ∪ S_+) \ S_-: the resulting set of active states
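The update rule is a union followed by a set difference, which the following sketch spells out. It is an illustration under assumptions rather than the code in StateMemory.java: states are strings, S_a and S_t are passed as maps from state name to boolean, and `StateManagementSketch` is a hypothetical name.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of f_M: (S_Ξ, S_a, S_t) → S'_Ξ. */
class StateManagementSketch {

    /** Returns (S_Ξ ∪ S_+) \ S_- without modifying the inputs. */
    static Set<String> update(Set<String> active,
                              Map<String, Boolean> fromActions,
                              Map<String, Boolean> fromTransitions) {
        Set<String> toActivate = new LinkedHashSet<>();
        Set<String> toDeactivate = new LinkedHashSet<>();
        collect(fromActions, toActivate, toDeactivate);
        collect(fromTransitions, toActivate, toDeactivate);

        Set<String> next = new LinkedHashSet<>(active);
        next.addAll(toActivate);      // S_Ξ ∪ S_+
        next.removeAll(toDeactivate); // ... \ S_-
        return next;
    }

    /** Sorts (state, True) pairs into S_+ and (state, False) pairs into S_-. */
    private static void collect(Map<String, Boolean> info, Set<String> plus, Set<String> minus) {
        for (Map.Entry<String, Boolean> pair : info.entrySet()) {
            (pair.getValue() ? plus : minus).add(pair.getKey());
        }
    }
}
```

Because S_- is subtracted last, a state that appears in both S_+ and S_- ends up inactive under this formula.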
Implementation: StateMemory.java, StateDetector.java, StateVisibilityManager.java
Concept: Process-based automation tracks the GUI's state implicitly through the current position in the action sequence (e.g., "if line 15 is executing, we must be on the Payment Details screen"). This method fails in complex scenarios because it lacks a formal, explicit representation of the GUI's state.
The State Management system (M) maintains an explicit set of active states (S_Ξ), enabling:
- Handling Multiple Active States: The system can reason holistically about combinations like {StateA, StateB} as a coherent "world view," finding transitions available from any of the currently active states.
- Recovering from Unexpected Events: When environmental stochasticity (Θ) causes unexpected state changes, M adjusts the active states using results from transitions and actions, giving the system an accurate understanding of its current context for recovery.
- Enabling Dynamic Pathfinding: After a failed transition, the Path Traversal Model can query M for the current set of active states (S_Ξ) and use that precise starting point to find a new, viable path to the target.
The State Management system acts as the reliable "You Are Here" marker on the map. Without M, the system cannot reliably reason about its current position in the GUI environment—a prerequisite for any complex or long-running automation task.
τ - The Transition Model
Formal Definition:
A transition is a tuple t = (A, S_t^def) comprising:
- A = (a¹, a², ..., aⁿ): a process or sequence of actions executed as part of the transition
- S_t^def ∈ P(S × {True, False}): the intended state information if the transition succeeds
The transition function is defined as:
f_τ: (A, S_t^def) → (Ξ', r_t) where:
- Ξ': the resulting GUI after the process is executed
- r_t: the transition result, containing:
- success or failure status
- state information S_t = S_t^def if transition succeeds, ∅ if it fails
The process function builds on the action function by iteratively applying it:
f_A: (A) → (Ξ')
Each action operates on the scene (Ξ) produced by the previous action. The final output (Ξ') represents the cumulative effect of applying all actions in sequence.
The transition relation δ ⊆ S × T defines which transitions are accessible from which states, where an element (s_i, t_j) ∈ δ indicates that transition t_j is accessible from state s_i.
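The relationship between the action sequence A, the intended state information S_t^def, and the result r_t can be sketched as follows. This is not TransitionExecutor.java: actions are reduced to boolean suppliers, and the sketch assumes that a transition succeeds only when every action succeeds and that execution stops at the first failure. `TransitionSketch`, `Result`, and `execute` are hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of f_τ: run the actions in order, report S_t only on success. */
class TransitionSketch {

    /** r_t: success flag plus the state information S_t. */
    static final class Result {
        final boolean success;
        final Map<String, Boolean> stateInfo;

        Result(boolean success, Map<String, Boolean> stateInfo) {
            this.success = success;
            this.stateInfo = stateInfo;
        }
    }

    /**
     * @param actions           A = (a¹, ..., aⁿ); each supplier returns true if its action succeeded
     * @param intendedStateInfo S_t^def, the state changes expected on success
     */
    static Result execute(List<BooleanSupplier> actions, Map<String, Boolean> intendedStateInfo) {
        for (BooleanSupplier action : actions) {
            if (!action.getAsBoolean()) {
                // Assumed policy: stop at the first failed action, so S_t = ∅.
                return new Result(false, Map.of());
            }
        }
        return new Result(true, intendedStateInfo); // S_t = S_t^def
    }
}
```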
Implementation: TransitionExecutor.java, StateTransition.java, StateTransitions.java
Concept: The Transition Model manages the execution of defined action sequences that move between states. Transitions encapsulate multi-step processes (like logging in: type username, type password, click submit) as reusable units.
The state information S_t^def provides explicit state change details that complement information inferred from actions. For example, if state A always disappears when state B is activated, S_t^def can include this rule directly, eliminating the need for additional actions to verify the absence of state A. This optimizes state management in GUI environments with well-defined state relationships.
§ - The Path Traversal Model
Formal Definition:
Within the Path Traversal Model, a path is defined as a tuple ρ = (S_ρ, T_ρ) comprising:
- S_ρ = [s₀, s₁, ..., sₙ]: the sequence of states, where:
  - s₀ ∈ S_Ξ (the first state is in the set of active states)
  - sₙ = s_target (the last state is the target state)
- T_ρ = [t₀, t₁, ..., tₙ₋₁]: the sequence of transitions, where:
  - t_i ∈ T (each transition is in the set of all transitions)
  - (s_i, t_i) ∈ δ (each transition is accessible from its begin state)
  - (s_{i+1}, True) ∈ S_{t_i}^def (the transition's end state is activated if the transition succeeds)
Path traversal is a tuple § = (Ω, S_Ξ, s_target, H) comprising:
- Ω: the State Structure
- S_Ξ: the set of active states
- s_target ∈ S: the state the automation aims to reach
- H : P → ℝ: the heuristic used for path selection, where P is the set of candidate paths and h_ρ is the heuristic value for path ρ
The pathfinding function delivers all valid paths from active states to the target state:
f_pathfind: (Ω, S_Ξ, s_target) → P(ρ)
where P(ρ) is the set of all possible paths from the active states to the target state.
The path traversal function is:
f_§: (Ω, S_Ξ, s_target, H) → (Ξ')
where Ξ' is the resulting GUI after path execution attempts.
The function f_§ internally uses f_pathfind to determine possible paths P(ρ), applies the heuristic H to select an optimal path, and then executes the selected path. During execution, each transition t_i for i > 0 executes only if t_{i-1} is successful.
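A minimal way to picture f_pathfind is a depth-first enumeration over the transition relation δ. The sketch below is an assumption-laden illustration, not PathFinder.java: it only returns the state sequence S_ρ of each path, and it restricts the search to cycle-free paths so that the result stays finite. `PathfindSketch`, `Transition`, and `findPaths` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of f_pathfind: depth-first enumeration of cycle-free paths. */
class PathfindSketch {

    /** A transition reduced to its name and the states it activates on success. */
    static final class Transition {
        final String name;
        final Set<String> activates;

        Transition(String name, Set<String> activates) {
            this.name = name;
            this.activates = activates;
        }
    }

    /** Returns every path as a state sequence [s₀, ..., sₙ] with s₀ active and sₙ = target. */
    static List<List<String>> findPaths(Map<String, List<Transition>> delta,
                                        Set<String> active, String target) {
        List<List<String>> paths = new ArrayList<>();
        for (String start : active) {
            List<String> visited = new ArrayList<>();
            visited.add(start);
            extend(delta, target, visited, paths);
        }
        return paths;
    }

    private static void extend(Map<String, List<Transition>> delta, String target,
                               List<String> visited, List<List<String>> paths) {
        String current = visited.get(visited.size() - 1);
        if (current.equals(target)) {
            paths.add(new ArrayList<>(visited)); // copy, because visited is reused
            return;
        }
        for (Transition t : delta.getOrDefault(current, List.of())) {
            for (String next : t.activates) {
                if (!visited.contains(next)) {   // skip states already on this path
                    visited.add(next);
                    extend(delta, target, visited, paths);
                    visited.remove(visited.size() - 1); // backtrack
                }
            }
        }
    }
}
```

Enumerating all paths is the simplest match for the definition above; a production pathfinder could instead search for the cheapest path directly.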
Path Cost depends on individual state and transition costs:
- State Cost Function: c_S : S → ℝ assigns a cost to each state
- Transition Cost Function: c_T : T → ℝ assigns a cost to each transition
- Path Cost: c(ρ) = Σ_{i=0}^{n-1}[c_T(t_i) + c_S(s_{i+1})]
- Path Selection: ρ* = arg min_{ρ ∈ P} [c(ρ)]
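The cost formula and the arg min selection translate directly into a loop and a minimum over candidates. The sketch below is illustrative only: paths are given as lists of state and transition names, unlisted states and transitions are assumed to cost 0, and `PathCostSketch`, `Path`, `cost`, and `cheapest` are hypothetical names.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of c(ρ) and of selecting ρ* = arg min c(ρ). */
class PathCostSketch {

    /** A path given as state names [s₀..sₙ] and transition names [t₀..tₙ₋₁]. */
    static final class Path {
        final List<String> states;
        final List<String> transitions;

        Path(List<String> states, List<String> transitions) {
            this.states = states;
            this.transitions = transitions;
        }
    }

    /** c(ρ) = Σ_{i=0}^{n-1} [c_T(t_i) + c_S(s_{i+1})]; the start state s₀ adds no cost. */
    static double cost(Path path, Map<String, Double> stateCost, Map<String, Double> transitionCost) {
        double total = 0.0;
        for (int i = 0; i < path.transitions.size(); i++) {
            total += transitionCost.getOrDefault(path.transitions.get(i), 0.0);
            total += stateCost.getOrDefault(path.states.get(i + 1), 0.0);
        }
        return total;
    }

    /** Returns the cheapest candidate, or an empty Optional when there are no candidates. */
    static Optional<Path> cheapest(List<Path> candidates,
                                   Map<String, Double> stateCost,
                                   Map<String, Double> transitionCost) {
        return candidates.stream()
                .min(Comparator.comparingDouble(p -> cost(p, stateCost, transitionCost)));
    }
}
```

As a worked case, take c_T(login) = 5, c_T(openHelp) = c_T(helpToDashboard) = 1, c_S(Help) = 2, and c_S(Dashboard) = 1. The direct path LoginPage → Dashboard costs 5 + 1 = 6, while LoginPage → Help → Dashboard costs (1 + 2) + (1 + 1) = 5, so the longer path is selected.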
Implementation: PathFinder.java (pathfinding), PathTraverser.java (execution), Path.java, PathManager.java
Concept: The Path Traversal Model is responsible for finding and executing a path through the state graph to reach a target state. This component eliminates the need to explicitly code every possible path, fundamentally reducing development complexity.
Process-based approaches require manually accounting for a number of paths that grows exponentially. In contrast, the model-based effort is polynomial: define n states and approximately n × m transitions (where m is the average transitions per state), reducing development complexity to O(nm).
For a GUI with 30 states and 5 transitions per state, this is the difference between coding on the order of trillions of paths and defining approximately 180 items (30 states plus 30 × 5 = 150 transitions), a shift from practically impossible to well-defined and scalable.
Model in Action: A Concrete Example
When you call stateNavigator.openState("Dashboard"):
- State Management (M) determines active states by checking which elements from the State Structure (Ω) are visible in the current GUI (Ξ). Suppose M determines we're currently in the "LoginPage" state.
- Path Traversal (§) calls the pathfinding function f_pathfind(Ω, S_Ξ, "Dashboard"), which queries the State Structure to find all valid paths from "LoginPage" to "Dashboard".
- The heuristic H evaluates each path using cost functions c_S and c_T, selecting the optimal path ρ* (e.g., LoginPage → Dashboard via login transition).
- Path Execution begins. For each transition t_i in the path:
  - Transition Model (τ) executes the action sequence A = (a¹, a², ..., aⁿ) defined in the transition
  - Each Action (a) operates on the visible GUI (Ξ) and returns results (r_a)
  - The State Information function S_a derives which states should be activated/deactivated from action results
- State Management (M) updates the active states using f_M(S_Ξ, S_a, S_t), adjusting S_Ξ based on what elements are now visible.
- If an unexpected popup appears (environmental stochasticity Θ):
  - Actions detect the popup through visual comparison with Ξ
  - S_a indicates the popup state is active
  - M updates S_Ξ to include the popup state
  - Path Traversal (§) checks if transitions exist from the updated active set (e.g., {LoginPage, Popup})
  - If a "close popup" transition exists, it's executed
  - Process resumes navigation to Dashboard
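The popup scenario can be replayed in a small simulation. Everything in the sketch below is hypothetical and stands in for the real components: `SimulatedGui` scripts a popup that interrupts the first login attempt, planning is reduced to a hard-coded rule (close an active popup first, otherwise log in) instead of real pathfinding, and the active set is updated from the returned (state, boolean) pairs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical simulation of the navigate, detect, and recover loop. */
class RecoveryLoopSketch {

    /** Scripted stand-in for Ξ and Θ: the first "login" attempt is interrupted by a popup. */
    static final class SimulatedGui {
        private boolean popupPending = true;
        private boolean popupOpen = false;

        /** Runs a transition by name and returns the observed (state, boolean) pairs. */
        Map<String, Boolean> run(String transition) {
            if (transition.equals("closePopup")) {
                popupOpen = false;
                return Map.of("Popup", false);
            }
            if (popupPending) {          // Θ: the popup appears before the click lands
                popupPending = false;
                popupOpen = true;
                return Map.of("Popup", true);
            }
            if (popupOpen) {
                return Map.of();         // still blocked, nothing changes
            }
            return Map.of("Dashboard", true, "LoginPage", false);
        }
    }

    /**
     * Navigates to "Dashboard" and returns the names of the executed transitions.
     * The active set must be mutable; it is updated in place.
     */
    static List<String> navigate(SimulatedGui gui, Set<String> active, int maxSteps) {
        List<String> executed = new ArrayList<>();
        for (int step = 0; step < maxSteps && !active.contains("Dashboard"); step++) {
            // Hard-coded planning rule: clear an active popup first, otherwise log in.
            String next = active.contains("Popup") ? "closePopup" : "login";
            executed.add(next);
            for (Map.Entry<String, Boolean> pair : gui.run(next).entrySet()) {
                if (pair.getValue()) {
                    active.add(pair.getKey());      // S_+
                } else {
                    active.remove(pair.getKey());   // S_-
                }
            }
        }
        return executed;
    }
}
```

Starting from the active set {LoginPage}, the simulation executes login, closePopup, and login again, and ends with {Dashboard} as the only active state.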
This continuous feedback loop between Ξ, M, and § enables robust error recovery that is difficult to achieve in process-based approaches, where each recovery path must be scripted by hand.
Architecture: Domain vs Strategic Knowledge
The architecture separates environment representation (Domain Knowledge) from interaction mechanisms (Strategic Knowledge), a fundamental design principle adapted from human problem-solving research.
Domain Knowledge (Ω)
The State Structure (Ω) is defined by the user and contains problem-specific information about the GUI environment:
- States (S): What screens/configurations exist in your application
- Elements (E): What visual features identify each state
- Transitions (T): What actions connect states
Your responsibility: Define these using Brobot's @State, @StateImage, and @Transition annotations.
Strategic Knowledge (a, M, τ, §)
The Framework (F) implements problem-agnostic automation capabilities that work with any GUI:
- Action Model (a): Atomic operation execution and result interpretation
- State Management (M): Tracking active states and maintaining GUI awareness
- Transition Model (τ): Executing action sequences that change states
- Path Traversal (§): Finding and executing paths to target states
Framework's responsibility: Brobot provides these components—you don't write pathfinding or state management code.
The Visual API
The combination of the State Structure (Ω) and Framework (F) creates a Visual API that abstracts GUI automation complexity:
// Domain Knowledge: Define your environment
@State(name = "LoginPage")
public class LoginState {
    @StateImage private Image loginButton;
}

// Strategic Knowledge: Framework handles navigation
stateNavigator.openState("Dashboard");
// Framework automatically: finds path, manages states, executes transitions, recovers from errors
Data Flow
- The Action Model (a) bridges both domains: It receives input from user-defined state elements, acts on the visible GUI (Ξ), and provides results that inform the State Management System (M).
- The State Management System (M) tracks the set of active states based on information from actions and transitions.
- The Path Traversal Model (§) takes the current active states from M and uses the map provided by the State Structure (Ω) to find and execute a path to a target state.
This architecture allows the framework to handle complex, reusable strategic logic while the user's application focuses only on domain knowledge and business logic.
Addressing the Robustness-Code Complexity Trade-off
This architecture directly resolves the fundamental robustness-code complexity trade-off identified in traditional GUI automation:
Robustness through Modularity
Because the State Structure (Ω) is decoupled from execution logic, GUI changes often require updating only a single, localized state or transition definition. The explicit state tracking provided by M combined with verification against the visible GUI (Ξ) allows the system to detect and recover from unexpected events, rather than failing when the GUI diverges from an implicit assumption.
Complexity Reduction through Abstraction
The Path Traversal component (§) eliminates the need to explicitly code every possible path. The developer defines n states and approximately n × m transitions, reducing development complexity to a manageable polynomial scale O(nm).
Example Impact: For a 30-state GUI with 5 transitions per state:
- Process-based approach: Must code on the order of trillions of paths (exponential)
- Model-based approach: Define approximately 180 states and transitions (polynomial)
This represents a shift from a practically impossible task to a well-defined and scalable one.
Related Documentation
Theoretical Foundations
- Introduction - Practical introduction with code examples and conceptual overview
- Academic Foundation - Research background, empirical evidence, and citations
- States - Detailed explanation of State Structure (Ω) and state management
- Transitions - Transition model (τ) and path traversal (§) in depth
- Testing the Automation - Novel testing capabilities enabled by the model
Practical Implementation
- Getting Started - Hands-on tutorials for building Brobot applications
- AI Brobot Project Creation - Complete API reference and implementation patterns
Mathematical Notation Reference
For quick reference, the key mathematical symbols used throughout this document:
- Ξ (Xi) = Visible GUI (screen pixels and visible elements)
- Ω = (E, S, T) = State Structure (Elements, States, Transitions)
- a = (o_a, E_a, ζ) = Action (options, elements, success function)
- M = (S_Ξ) = State Management (set of active states)
- τ = (A, S_t^def) = Transition (action sequence, state information)
- § = (Ω, S_Ξ, s_target, H) = Path Traversal (structure, active states, target, heuristic)
- f_a, f_M, f_τ, f_§ = Core transformation functions
- Θ (Theta) = Environmental stochasticity (unpredictable events)
- δ ⊆ S × T = Transition accessibility relation
These symbols provide precise mathematical definitions. For conceptual understanding, focus on the ideas and examples rather than mathematical rigor.