paper Link: https://arxiv.org/pdf/1205.2606.pdf

Exploring Compact Reinforcement-learning Representations with Linear Regression

  • KWIK Linear Regression
    • KWIK (Knows What It Knows) is a framework for studying supervised learning algorithms and was designed to unify the analysis of model-based reinforcement-learning algorithms.
    • Formally, a KWIK learner operates over an input space and an output space . At every time step , an input is chosen and presented to the learner.
    • If the learner can make an accurate prediction on this input, it can predict , otherwise it must admit it does not know by returning (“I don’t know”), allowing it to see the true or a noisy version .
      • and
    • An algorithm is said to be KWIK if and only if, with high probability, and the number of s returned over the agent’s lifetime is bounded by a polynomial function over the size of the input problem.
    • One of the first uses of the KWIK framework was in the analysis of an online linear regression algorithm used to learn linear transitions in continuous state MDPs.
      • This algorithm uses the least squares estimate of the weight vector for inputs where the output is known with high certainty.
        • Certainty is measured by two terms representing (1) the number and proximity of previous samples to the current point and (2) the appropriateness of the previous samples for making a least squares estimate.
      • When certainty is low for either measure, the algorithm reports .
    • Some notation
      • Let , and let be a linear function with slope , i.e. .
      • Fix a timestep .
      • For each , denote the stored samples by , their (unknown) expected values by , and their observed values by
        • where the noise is assumed to form a martingale, i.e. , and bounded:
      • Define the matrix and vectors and , and let be an identity matrix.
    • Suppose that a new query arrives. If we were able to solve the linear regression problem , then we could predict , where is the least-squares solution to the system.
      • However, solving this system directly is problematic because
        1. If is rank-deficient the least-squares solution may not be unique.
        2. Even if we have a solution, we have no information on its confidence.
      • We can avoid the first problem by regularization, i.e. by augmenting the system with Iθ = ~v, where ~v is some arbitrary vector. Regularization certainly distorts the solution, but this gives us a measure of confidence: if the distortion is large, the predictor should have low confidence and output ⊥. On the other hand, if the distortion is low, it has two important consequences. First, the choice of ~v has little effect, and second, the fluctuations caused by using ~zt instead of ~yt are also minor.

B) Related

C) References