Tuesday, April 5, 2022

The Unsupervised Reinforcement Learning Benchmark – The Berkeley Artificial Intelligence Research Blog


Reinforcement Learning (RL) is a powerful paradigm for solving many problems of interest in AI, such as controlling autonomous vehicles, digital assistants, and resource allocation, to name a few. We’ve seen over the last 5 years that, when provided with an extrinsic reward function, RL agents can master very complex tasks like playing Go, Starcraft, and dextrous robotic manipulation. While large-scale RL agents can achieve stunning results, even the best RL agents today are narrow. Most RL algorithms today can only solve the single task they were trained on and do not exhibit cross-task or cross-domain generalization capabilities.

A side effect of the narrowness of today’s RL systems is that today’s RL agents are also very data inefficient. If we were to train AlphaGo-like agents on many tasks, each agent would likely require billions of training steps, because today’s RL agents cannot reuse prior knowledge to solve new tasks more efficiently. RL as we know it is supervised – agents overfit to a specific extrinsic reward, which limits their ability to generalize.



To date, the most promising path toward generalist AI systems in language and vision has been through unsupervised pre-training. Masked causal and bi-directional transformers have emerged as scalable methods for pre-training language models that have shown unprecedented generalization capabilities. Siamese architectures and, more recently, masked auto-encoders have also become state-of-the-art methods for achieving fast downstream task adaptation in vision.

If we believe that pre-training is a powerful approach toward developing generalist AI agents, then it is natural to ask whether there exist self-supervised objectives that would allow us to pre-train RL agents. Unlike vision and language models, which act on static data, RL algorithms actively influence their own data distribution. As in vision and language, representation learning is an important aspect of RL as well, but the unsupervised problem that is unique to RL is how agents can themselves generate interesting and diverse data through self-supervised objectives. This is the unsupervised RL problem – how do we learn useful behaviors without supervision and then adapt them to solve downstream tasks quickly?

Unsupervised RL is very similar to supervised RL. Both assume that the underlying environment is described by a Markov Decision Process (MDP) or a Partially Observed MDP, and both aim to maximize rewards. The main difference is that supervised RL assumes that supervision is provided by the environment through an extrinsic reward, while unsupervised RL defines an intrinsic reward through a self-supervised task. Like supervision in NLP and vision, supervised rewards are either engineered or provided as labels by human operators, which is hard to scale and limits the generalization of RL algorithms to specific tasks.


At the Robot Learning Lab (RLL), we’ve been taking steps toward making unsupervised RL a plausible approach toward developing RL agents capable of generalization. To this end, we developed and released a benchmark for unsupervised RL with open-sourced PyTorch code for 8 leading or popular baselines.

The Unsupervised Reinforcement Learning Benchmark (URLB)

While a variety of unsupervised RL algorithms have been proposed over the last few years, it has been impossible to compare them fairly due to differences in evaluation, environments, and optimization. For this reason, we built URLB, which provides standardized evaluation procedures, domains, downstream tasks, and optimization for unsupervised RL algorithms.

URLB splits training into two phases – a long unsupervised pre-training phase followed by a short supervised fine-tuning phase. The initial release includes three domains with four tasks each, for a total of twelve downstream tasks for evaluation.
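The two phases described above can be sketched as a single training loop in which only the reward source changes. This is a minimal illustration, not the URLB implementation: `ToyChainEnv`, `CountBasedAgent`, and `run_two_phase` are hypothetical names, the count-based bonus is just a stand-in intrinsic reward, and episode resets are omitted for brevity.

```python
import random
from collections import Counter


class ToyChainEnv:
    """Hypothetical 1-D chain: states 0..n-1, extrinsic reward 1.0 at the last state."""

    def __init__(self, n_states=10):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the state is clipped to the chain.
        self.state = min(max(self.state + action, 0), self.n_states - 1)
        extrinsic = 1.0 if self.state == self.n_states - 1 else 0.0
        return self.state, extrinsic


class CountBasedAgent:
    """Hypothetical placeholder agent: random policy plus a count-based novelty bonus."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.visits = Counter()
        self.reward_log = []  # (phase, reward) pairs the learner was trained on

    def act(self, state):
        return self.rng.choice((-1, 1))

    def intrinsic_reward(self, next_state):
        # Rarely visited states earn a larger bonus: 1 / sqrt(visit count).
        self.visits[next_state] += 1
        return 1.0 / self.visits[next_state] ** 0.5

    def update(self, phase, reward):
        # A real agent would take a gradient step here; this sketch only records the signal.
        self.reward_log.append((phase, reward))


def run_two_phase(agent, env, pretrain_steps, finetune_steps):
    state = env.reset()
    for step in range(pretrain_steps + finetune_steps):
        next_state, extrinsic = env.step(agent.act(state))
        if step < pretrain_steps:
            # Long unsupervised phase: the extrinsic reward is ignored.
            agent.update("pretrain", agent.intrinsic_reward(next_state))
        else:
            # Short supervised phase: the same agent now learns from the task reward.
            agent.update("finetune", extrinsic)
        state = next_state
    return agent


agent = run_two_phase(CountBasedAgent(), ToyChainEnv(), pretrain_steps=200, finetune_steps=20)
```

The point of the sketch is the ordering and the asymmetry in length: the agent is the same object in both phases, so whatever it learned from the intrinsic signal is what fine-tuning starts from.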


Most unsupervised RL algorithms known to date can be classified into three categories – knowledge-based, data-based, and competence-based. Knowledge-based methods maximize the prediction error or uncertainty of a predictive model (e.g. Curiosity, Disagreement, RND), data-based methods maximize the diversity of observed data (e.g. APT, ProtoRL), and competence-based methods maximize the mutual information between states and some latent vector often referred to as the “skill” or “task” vector (e.g. DIAYN, SMM, APS).
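To make the three categories concrete, here is a rough NumPy sketch of one intrinsic reward per family. These are simplified forms written for illustration, not the exact objectives of the methods named above: the function names are hypothetical, the data-based bonus uses a single k-th nearest-neighbour distance, and the competence-based bonus assumes a uniform prior over a discrete set of skills.

```python
import numpy as np


def knowledge_based_reward(predicted_next, actual_next):
    """Prediction error of a forward model: squared error per transition."""
    return np.sum((predicted_next - actual_next) ** 2, axis=-1)


def data_based_reward(embeddings, k=3):
    """Diversity bonus: log(1 + distance to the k-th nearest neighbour in the batch).

    Requires more than k points in the batch.
    """
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)   # (N, N) pairwise distances
    kth = np.sort(dists, axis=1)[:, k]       # column 0 is the point itself (distance 0)
    return np.log(1.0 + kth)


def competence_based_reward(skill_logits, skill_ids):
    """Skill-discovery bonus: log q(z|s) - log p(z), with p(z) uniform over skills."""
    shifted = skill_logits - skill_logits.max(axis=1, keepdims=True)
    log_q = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    n_skills = skill_logits.shape[1]
    # -log p(z) equals +log(n_skills) for a uniform prior.
    return log_q[np.arange(len(skill_ids)), skill_ids] + np.log(n_skills)


rng = np.random.default_rng(0)
states = rng.normal(size=(8, 4))
r_know = knowledge_based_reward(states + 0.1, states)
r_data = data_based_reward(states, k=3)
r_comp = competence_based_reward(rng.normal(size=(8, 5)), rng.integers(0, 5, size=8))
```

In each case the reward is computed from the agent's own data and models, with no task reward involved: a large model error, an isolated point in embedding space, or a state from which the skill is easy to identify all count as worth visiting.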

Previously, these algorithms were implemented using different optimization algorithms (Rainbow DQN, DDPG, PPO, SAC, etc.). As a result, unsupervised RL algorithms have been hard to compare. In our implementations, we standardize the optimization algorithm such that the only difference between the various baselines is the self-supervised objective.
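One way to picture this standardization is a single update rule that accepts any reward function, so that swapping baselines means swapping only that function. The sketch below is a toy under stated assumptions: tabular TD(0) stands in for the shared optimizer (it is not the optimizer used in URLB), and `make_count_bonus`, `make_change_bonus`, and `shared_td_update` are hypothetical names.

```python
from typing import Callable, Dict, List, Tuple

Transition = Tuple[int, int]  # (state, next_state) on a toy discrete problem
RewardFn = Callable[[Transition], float]


def make_count_bonus() -> RewardFn:
    """Baseline A (illustrative): reward decays with visits to the next state."""
    counts: Dict[int, int] = {}

    def reward(t: Transition) -> float:
        counts[t[1]] = counts.get(t[1], 0) + 1
        return counts[t[1]] ** -0.5

    return reward


def make_change_bonus() -> RewardFn:
    """Baseline B (illustrative): reward 1.0 whenever the state changes."""
    return lambda t: float(t[0] != t[1])


def shared_td_update(
    values: Dict[int, float],
    transitions: List[Transition],
    reward_fn: RewardFn,
    lr: float = 0.1,
    gamma: float = 0.99,
) -> Dict[int, float]:
    """The one optimizer every baseline shares: tabular TD(0) on whatever reward_fn returns."""
    for s, s_next in transitions:
        target = reward_fn((s, s_next)) + gamma * values.get(s_next, 0.0)
        values[s] = values.get(s, 0.0) + lr * (target - values.get(s, 0.0))
    return values


BASELINES = {"count_bonus": make_count_bonus, "change_bonus": make_change_bonus}
transitions = [(0, 1), (1, 1), (1, 2), (2, 0)]
results = {
    name: shared_td_update({}, transitions, factory())
    for name, factory in BASELINES.items()
}
```

Because the data, learning rate, discount, and update rule are identical across both runs, any difference in the resulting value tables can be attributed to the objective alone, which is the property the benchmark is after.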


We implemented and released code for eight leading algorithms supporting both state and pixel-based observations on domains based on the DeepMind Control Suite.


By standardizing domains, evaluation, and optimization across all implemented baselines in URLB, the result is a first direct and fair comparison between these three different types of algorithms.


Above, we show aggregate statistics of fine-tuning runs across all 12 downstream tasks with 10 seeds each after pre-training on the target domain for 2M steps. We find that currently data-based methods (APT, ProtoRL) and RND are the leading approaches on URLB.
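As a rough illustration of what aggregating over 12 tasks and 10 seeds can look like, the sketch below pools all runs and reports three summary statistics. The text here does not say which statistics are plotted, so the choice of mean, median, and interquartile mean is an assumption, as is the premise that scores have already been normalized per task; `aggregate_scores` is a hypothetical helper and the example data is random, not benchmark output.

```python
import numpy as np


def aggregate_scores(normalized_scores):
    """Pool runs from an array of shape (n_tasks, n_seeds), e.g. (12, 10), and summarize them."""
    flat = np.sort(np.asarray(normalized_scores, dtype=float).ravel())
    cut = len(flat) // 4  # number of runs dropped from each end for the interquartile mean
    return {
        "mean": float(flat.mean()),
        "median": float(np.median(flat)),
        "iqm": float(flat[cut:len(flat) - cut].mean()),
    }


rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 1.0, size=(12, 10))  # placeholder data, not benchmark results
stats = aggregate_scores(scores)
```

The interquartile mean is included because it is less sensitive than the plain mean to a handful of unusually good or bad seeds, while using more of the data than the median.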

We’ve also identified a number of promising directions for future research based on benchmarking existing methods. For example, competence-based exploration as a whole underperforms data-based and knowledge-based exploration. Understanding why this is the case is an interesting line for further research. For more insights and directions for future research in unsupervised RL, we refer the reader to the URLB paper.

Unsupervised RL is a promising path toward developing generalist RL agents. We’ve introduced a benchmark (URLB) for evaluating the performance of such agents. We’ve open-sourced the code for URLB and hope this enables other researchers to quickly prototype and evaluate unsupervised RL algorithms.

Paper: URLB: Unsupervised Reinforcement Learning Benchmark
Michael Laskin*, Denis Yarats*, Hao Liu, Kimin Lee, Albert Zhan, Kevin Lu, Catherine Cang, Lerrel Pinto, Pieter Abbeel. NeurIPS, 2021. (* these authors contributed equally)

Code: https://github.com/rll-research/url_benchmark


