This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.Expand

A framework that is based on learning the confidence interval around the value function or the Q-function and eliminating actions that are not optimal (with high probability) is described and a model-based and model-free variants of the elimination method are provided.Expand

The bandit problem is revisited and considered under the PAC model, and it is shown that given n arms, it suffices to pull the arms O(n/?2 log 1/?) times to find an ?-optimal arm with probability of at least 1 - ?.Expand

This paper presents a new algorithm that, given only a generative model (a natural and common type of simulator) for an arbitrary MDP, performs on-line, near-optimal planning with a per-state running time that has no dependence on the number of states.Expand

A novel distance between distributions, discrepancy distance, is introduced that is tailored to adaptation problems with arbitrary loss functions, and Rademacher complexity bounds are given for estimating the discrepancy distance from finite samples for different loss functions.Expand

In this paper, Boolean functions in ,4C0 are studied using harmonic analysis on the cube. The main result is that an ACO Boolean function has almost all of its "power spectrum" on the low-order… Expand

This paper derives convergence rates for Q-learning from a polynomial learning rate and shows that for a linear learning rate, one which is 1/t at time t, the convergence rate has an exponential dependence on 1/(1-γ).Expand

This work analyzes the behavior of agents that incrementally adapt their strategy through gradient ascent on expected payoff, in the simple setting of two-player, two-action, iterated general-sum games, and shows that either the agents will converge to a Nash equilibrium, or if the strategies themselves do not converge, then their average payoffs will nevertheless converge to the payoffs of a Nashilibrium.Expand

It is proved that standard convex combinations of the source hypotheses may in fact perform very poorly and that, instead, combinations weighted by the source distributions benefit from favorable theoretical guarantees.Expand

The authors demonstrate that any functionf whose L -norm is polynomial can be approximated by a polynomially sparse function, and prove that boolean decision trees with linear operations are a subset of this class of functions.Expand