paper Link: https://arxiv.org/pdf/1205.2606.pdf
Exploring Compact Reinforcement-learning Representations with Linear Regression
- KWIK Linear Regression
- KWIK (Knows What It Knows) is a framework for studying supervised learning algorithms and was designed to unify the analysis of model-based reinforcement-learning algorithms.
- Formally, a KWIK learner operates over an input space and an output space . At every time step , an input is chosen and presented to the learner.
- If the learner can make an accurate prediction on this input, it can predict , otherwise it must admit it does not know by returning (“I don’t know”), allowing it to see the true or a noisy version .
- and
- An algorithm is said to be KWIK if and only if, with high probability, and the number of s returned over the agent’s lifetime is bounded by a polynomial function over the size of the input problem.
- One of the first uses of the KWIK framework was in the analysis of an online linear regression algorithm used to learn linear transitions in continuous state MDPs.
- This algorithm uses the least squares estimate of the weight vector for inputs where the output is known with high certainty.
- Certainty is measured by two terms representing (1) the number and proximity of previous samples to the current point and (2) the appropriateness of the previous samples for making a least squares estimate.
- When certainty is low for either measure, the algorithm reports .
- This algorithm uses the least squares estimate of the weight vector for inputs where the output is known with high certainty.
- Some notation
- Let , and let be a linear function with slope , i.e. .
- Fix a timestep .
- For each , denote the stored samples by , their (unknown) expected values by , and their observed values by
- where the noise is assumed to form a martingale, i.e. , and bounded:
- Define the matrix and vectors and , and let be an identity matrix.
- Suppose that a new query arrives. If we were able to solve the linear regression problem , then we could predict , where is the least-squares solution to the system.
- However, solving this system directly is problematic because
- If is rank-deficient the least-squares solution may not be unique.
- Even if we have a solution, we have no information on its confidence.
- We can avoid the first problem by regularization, i.e. by augmenting the system with Iθ = ~v, where ~v is some arbitrary vector. Regularization certainly distorts the solution, but this gives us a measure of confidence: if the distortion is large, the predictor should have low confidence and output ⊥. On the other hand, if the distortion is low, it has two important consequences. First, the choice of ~v has little effect, and second, the fluctuations caused by using ~zt instead of ~yt are also minor.
- However, solving this system directly is problematic because