“Watch No. 4 quickly break through the back line, dribble straight toward goal, shoot... the ball... the ball... it's in!”
Ladies and gentlemen, what you are watching is a Google AI football match, and the players in yellow jerseys are AI agents from Tsinghua University.
This Tsinghua AI may not look like much, but after hard training it has not only produced star players who shine on the world stage, but has also become one of the strongest teams in the world,
winning championships in a number of international competitions.
“Oh, now an assist arrives from teammate No. 7, and the ball goes in again!”
Joking aside, the scene above is in fact a showcase of Tsinghua University's powerful multi-agent football AI, TiKick.
“Winning championships in a number of international events” means that TiKick has achieved SOTA performance in both single-agent control and multi-agent control, and is the first to complete the full football game with all ten players under control.
How was this powerful AI team trained?
A multi-agent football AI evolved from a single-agent policy
Before diving in, let's briefly look at the reinforcement learning environment used for training, which is this football game: Google Research Football (GRF).
Released by Google in 2019, it provides a physics-based 3D football simulation, supports all major rules of the game, and lets agents control one or more players on the field while the rest are handled by built-in AI.
Over a 3,000-step match (first and second halves), the agents must keep choosing among 19 actions, such as moving, passing, shooting, dribbling, tackling, and sprinting, to score goals.
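To make the action space concrete, here is a sketch of GRF's 19-action discrete set. The names follow the gfootball documentation's default action set; treat the exact identifiers as illustrative rather than authoritative.

```python
# The 19 discrete actions a GRF agent chooses from at every step:
# 1 no-op, 8 movement directions, 4 ball actions, and 6 modifier actions.
GRF_ACTIONS = [
    "idle",
    "left", "top_left", "top", "top_right",
    "right", "bottom_right", "bottom", "bottom_left",
    "long_pass", "high_pass", "short_pass", "shot",
    "sprint", "release_direction", "release_sprint",
    "sliding", "dribble", "release_dribble",
]

assert len(GRF_ACTIONS) == 19
```

With 10 controllable players each picking one of these 19 actions per step, the joint action space has 19^10 combinations, which is why the article calls the search space "enormous".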
Training in such a football environment poses two difficulties.
First, in the multi-agent setting there are 10 controllable players (excluding the goalkeeper), so the algorithm must search for suitable action combinations in an enormous joint action space;
Second, as everyone knows, goals in a football match are rare, so the algorithm can seldom obtain rewards from the environment, which greatly increases the difficulty of training.
And the goal of this Tsinghua team is precisely to control multiple players to complete the full game.
They started from the self-play data of WeKick, the team that ultimately won the world championship in 2020, and learned from it with offline reinforcement learning methods.
That championship only required controlling a single player on the field to compete.
So how do you learn a multi-agent strategy from a single-agent dataset?
Directly learning the single-agent policy from the WeKick data and copying it to every player won't work, because then every player would simply chase the ball and rush at the goal, with no teamwork at all.
And there is simply no data at all for the non-active players off the ball. What to do?
They added a twentieth action to the action set: build-in, and assigned this label to all non-active players (if a player takes the build-in action during the game, that player acts according to the built-in rules).
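The relabeling step can be sketched as follows. The data layout (a per-step dict of player actions) is a hypothetical simplification for illustration, not the team's actual dataset format.

```python
# Extend the 19-action set with a 20th "build-in" action and relabel every
# non-active player's step with it, so the behavior-cloning model has a
# target even for players the single-agent dataset never controlled.
BUILD_IN = 19  # new 20th action id; ids 0-18 are the original GRF actions
NUM_PLAYERS = 10  # controllable players, goalkeeper excluded

def relabel_step(demo_action, active_player):
    """Return a full joint action for one timestep.

    demo_action: the action id recorded for the single active player.
    active_player: index of the player the dataset actually controlled.
    """
    joint = {}
    for pid in range(NUM_PLAYERS):
        if pid == active_player:
            joint[pid] = demo_action  # keep the demonstrated action
        else:
            joint[pid] = BUILD_IN     # non-active players: built-in rules
    return joint

# Example: player 3 shoots (action id 12); everyone else gets build-in.
step = relabel_step(demo_action=12, active_player=3)
```

After this pass, every timestep in the single-agent dataset carries a complete 10-player action label and can be fed to multi-agent behavior cloning.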
They then trained the model with a multi-agent behavior cloning (MABC) algorithm.
The core idea of offline reinforcement learning is to identify high-quality actions in the dataset and reinforce the learning of those actions.
Therefore, different labels must be given different weights when computing the objective function, to prevent players from tending to take only one action.
The weight assignment involves two points:
First, select the winning games from the dataset and train only on this high-quality data; since its rewards are denser, the model converges faster and performs better.
Second, train a critic network to score all actions, use the scores to compute the advantage function, then give samples with high advantage values higher weights and those with low values lower weights.
To avoid gradient explosion and vanishing, the advantage function is appropriately clipped.
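The advantage-weighted objective can be sketched like this. The exponential weighting and the specific clipping bounds are assumptions for illustration; the article only states that high-advantage samples get higher weight and that the advantage term is clipped.

```python
import math

# Advantage-weighted behavior cloning (sketch): each sample's negative
# log-likelihood is scaled by a weight derived from its advantage, and the
# weight is clipped to keep gradients from exploding or vanishing.
W_MIN, W_MAX = 0.5, 2.0  # illustrative clipping bounds

def sample_weight(advantage):
    """Higher advantage -> higher weight, clipped to [W_MIN, W_MAX]."""
    return min(max(math.exp(advantage), W_MIN), W_MAX)

def mabc_loss(samples):
    """samples: list of (pi_a, advantage) pairs, where pi_a is the
    policy's probability of the demonstrated action."""
    total = 0.0
    for pi_a, adv in samples:
        total += -sample_weight(adv) * math.log(pi_a)
    return total / len(samples)

# A confident, high-advantage sample and a low-advantage sample:
loss = mabc_loss([(0.9, 1.5), (0.6, -1.0)])
```

The effect is that the cloning objective imitates the dataset selectively: actions the critic judges advantageous are reinforced, while poor actions contribute little.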
The final distributed training framework consists of one Learner and multiple Workers.
The Learner is responsible for learning and updating the policy, while the Workers are responsible for collecting data; they exchange data and share network parameters via gRPC.
A Worker can interact with multiple game environments simultaneously through multi-processing, or read offline data via I/O.
This parallelization significantly speeds up data collection and thus training: 5 hours are enough to reach the performance that other distributed training algorithms need two days to achieve.
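The Learner/Worker split can be illustrated with a single-process sketch, with the gRPC channel replaced by a thread-safe queue for simplicity; everything here is a stand-in, not the team's actual framework code.

```python
import queue
import threading

# Workers push collected trajectories into a shared queue; the Learner
# consumes them and updates the policy (here, just a version counter).
data_q = queue.Queue()
params = {"version": 0}

def worker(worker_id, n_episodes):
    for ep in range(n_episodes):
        # stand-in for interacting with a game env or reading offline data
        data_q.put({"worker": worker_id, "episode": ep, "reward": 1.0})

def learner(n_updates):
    for _ in range(n_updates):
        batch = data_q.get()    # blocks until a worker delivers data
        params["version"] += 1  # stand-in for a gradient update

workers = [threading.Thread(target=worker, args=(i, 4)) for i in range(3)]
learn = threading.Thread(target=learner, args=(12,))
for t in workers:
    t.start()
learn.start()
for t in workers:
    t.join()
learn.join()
# 3 workers x 4 episodes = 12 trajectories consumed, 12 policy updates
```

Because data collection and learning only meet at the queue, either side can be scaled independently, which is the same property that lets the real framework swap a local debug setup for multi-node training.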
In addition, thanks to its modular design, the framework can switch between single-node debugging mode and multi-node distributed training mode without modifying any code, greatly reducing the difficulty of implementing and training algorithms.
94.4% win rate and an average goal difference of 3
In the algorithm comparison on the multi-agent GRF full game, TiKick's final algorithm (+AW) achieved the best performance, with the highest win rate (94.4%) and the largest goal difference.
Its TrueSkill score (a ranking system for matches between machine players) also ranked first.
Against the built-in AI, TiKick reached a 94.4% win rate and an average goal difference of 3.
Comparing TiKick with baseline algorithms in the GRF academic scenarios shows that TiKick achieves the best performance and the lowest sample complexity in all scenarios, with a clear gap.
Compared with the baseline MAPPO in particular, it reaches the highest score in four of the five scenarios.
About the authors
First author Huang Shiyu is a doctoral student at Tsinghua University, researching the intersection of computer vision, reinforcement learning, and deep learning. He has worked at Huawei Noah's Ark Lab, Tencent AI Lab, Carnegie Mellon University, SenseTime, and RealAI.
The co-first author is Chen Wenze, also from Tsinghua University.
The authors also include Longfei Zhang and several researchers from Tencent AI Lab.
The corresponding author is Zhu Jun, a professor at Tsinghua University.