modern backpropagation (1970), the principles of generative adversarial NNs and artificial curiosity (1990),[AC90,90b][AC20] unsupervised pre-training for deep NNs (1991),[UN1-2] vanishing gradients (1991)[VAN1] & Long Short-Term Memory or LSTM (Sec. A), NNs with over 100 layers (2015),[HW1-3][R5] and fast weight programmers (1991).[FWP0-2,6] Often LBH failed to cite this essential prior work, even in their later surveys.[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5, R7-R8] By the 2010s,[DEC] Long Short-Term Memory had become the first recurrent NN (RNN) to win international competitions. LSTM[MIR](Sec. 4) overcomes the vanishing gradient problem (Sec. 3, Sec. 4) through "forget gates" based on end-to-end-differentiable fast weights.[MIR](Sec. 8)[FWP,FWP0-1] Attention mechanisms, too, have their roots in my lab (1991).[FWP][FWP0-2,6] In the 2010s, DeepMind built Alphastar, whose brain has a deep LSTM core trained by PG;[DM3] Bill Gates called OpenAI's LSTM-based Dota 2 result a "huge milestone in advancing artificial intelligence".[OAI2a][MIR](Sec. 4)[LSTMPG] LSTM is also used in healthcare. D. Computer Vision was revolutionized in the 2010s by deep CNNs. In 1987, NNs with convolutions were combined by Waibel with weight sharing and backpropagation;[CNN1a] Waibel did not call this CNNs but TDNNs. Our fast GPU-based CNN of 2011,[GPUCNN1] known as DanNet,[DAN,DAN1][R6] showed that unsupervised pre-training is not necessary. Winning four contests in a row, DanNet blew away the competition at IJCNN 2011 in Silicon Valley and achieved the first superhuman visual pattern recognition[DAN1] in an international contest. DanNet was also the first deep CNN to win a Chinese handwriting contest (ICDAR 2011) and an image segmentation contest (ISBI, May 2012), further extending the work of 2011.[MIR](Sec. 19) Collecting ever more citations per year[MOST] is the Highway Net (May 2015).[HW1-3][R5] The Highway Net is actually the feedforward net version of vanilla LSTM.[LSTM2] It was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers).
Deep learning appeared long before the 1980s. Ivakhnenko's deep nets of 1965 already contained layers with the now popular multiplicative gates.[DEEP1-2][DL1-2] A paper of 1971[DEEP2] already described a deep net with 8 layers, trained by a highly cited method which was still popular in the new millennium,[DL2] especially in Eastern Europe, where much of Machine Learning was born. Ivakhnenko did not call it an NN, but that's what it was; Minsky was apparently unaware of this and failed to correct it later.[HIN](Sec. I) However, deep learning became really deep in 1991 in my lab,[UN-UN3] which produced the first very deep NNs, based on unsupervised pre-training (1991). (See below for more.) Later results drove the shift from unsupervised pre-training to purely supervised learning (1991-95; 2006-10).[HIN](Sec. II)[MIR](Sec. 19) LSTMs brought essentially unlimited depth to supervised recurrent NNs; Highway Nets[HW1-3] brought it to feedforward NNs.[MOST]
by others (see below): modern backpropagation, the principles of generative adversarial NNs and artificial curiosity (1990),[AC90,90b][AC20] unsupervised pre-training for deep NNs,[UN1-2] the vanishing gradient problem (1991)[VAN1] & solutions to it (Sec. A). Often LBH failed to cite essential prior work.[DLC][HIN][MIR](Sec. 21) We had this type of deep learning already in 1991;[UN][UN1-2] see Sec.
any type of computation-based AI.[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]
In 1936, Turing introduced the Turing machine (see also my reply to Hinton, who criticized my website on Turing). Likewise, Konrad Zuse (1910-1995) created the world's first working programmable general-purpose computer.
modern backpropagation, the principles of generative adversarial NNs and artificial curiosity (1990),[AC][AC90,90b][AC10][AC20] unsupervised pre-training for deep NNs (1991),[UN1-2][UN] vanishing gradients (1991)[VAN1] & solutions to it (Sec. A),[LSTM0-17][CTC] record-breaking deep supervised NNs and contest-winning deep CNNs (2011),[DAN][DAN1][GPUCNN5] NNs with over 100 layers (2015),[HW1-3][R5] and fast weight programmers (1991).[FWP0-2,6] Often LBH failed to cite this essential prior work.[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5,R7,R8,R11]
The cited "advances in natural language processing" and in speech came through supervised NNs and CNNs and through Highway Net-like NNs (2015),[HW1-3][R5] although the principles of CNNs were invented and developed by others since the 1970s.[CNN1-4] See Sec. D & XVIII & XIV as well as Sec. 4 & Sec. 19 of the overview.[MIR]
DanNet[DAN][DAN1][GPUCNN5] was the first NN to win a medical imaging contest through a deep learning approach.
who failed to cite them, even in later surveys.
Others built careers on deep learning long before LBH recognized this notion.[DEEP1-2][CNN1][HIN][R8][DL1][DLC] Even deep learning through unsupervised pre-training was introduced by others.[UN1-3][R4][HIN](Sec. II)
made GPU-based NNs fast and deep enough to show that unsupervised pre-training (pioneered by myself in 1991) is not necessary; our CNNs were deep and fast enough.[DAN][DAN1][GPUCNN5]
Furthermore, by the mid 2010s, speech recognition and machine translation had been revolutionized by our LSTM.
(and apparently even other award committees).[HIN](Sec. I) See the recent debate:[HIN] It is true that in 2018,
our fast deep CNN called DanNet. See also Sec. 19 of the overview.[MIR]
In the 2010s, what Bloomberg called "arguably the most commercial AI achievement"[AV1][MIR](Sec. 4) was actually based on the LSTM of our team.[LSTM0-6] See Sec. B. We introduced adaptive neural sequential attention: end-to-end-differentiable "soft" attention in the latent space of Fast Weight Programmers (FWPs),[FWP2][FWP] and "hard" attention in observation space in the context of RL (1990).[ATT][ATT0-1] The FWPs of 1991[FWP0-1] have become a popular alternative to RNNs. In the 2010s,[DEC] the attention terminology[FWP2] became widely used.
See[MIR](Sec. 9)[R4] for my related priority dispute on attention with Hinton.
GANs are a simple application[AC] of the adversarial curiosity (AC) principle of 1990, where one NN's gain is the other NN's loss; compare also Predictability Minimization (1991).[PM1-2][AC20][R2][MIR](Sec. 5) Regarding the vanishing gradient problem:[MIR](Sec. 3)[VAN1] Bengio published his own analysis,[VAN2] without citing Sepp or my publications on exactly this topic. Regarding attention-based Transformers:[TR1-6] Bengio[DL3a] cites his own team (2014) for "soft attention" without citing my much earlier original work of 1991-1993 on soft attention and linear Transformers.[FWP,FWP0-2,6] Similarly for unsupervised pre-training for deep NNs,[UN0-4][HIN](Sec. II)[MIR](Sec. 1) the first NNs shown to solve very deep problems, and for compressing or distilling one NN into another.[UN0-2][DIST1-2][MIR](Sec. 2) Similarly for fast weight programmers[FWP][FWP0-4a] through tensor-like outer products (1991-2016) and their motivation[FWP2][FWP4a][MIR](Sec. 8) (see also Sec. XVI above), and for learning sequential attention with NNs:[MIR](Sec. 9) our much earlier work on this[ATT1][ATT] went uncited.
However, the basic CNN architecture with convolutional and downsampling layers is actually due to Fukushima (1979).[CNN1] NNs with convolutions were later (1987) combined by Waibel with weight sharing and backpropagation.[CNN1a] Waibel called this TDNN. At IJCNN 2011 in Silicon Valley, our DanNet[DAN][GPUCNN1-3] won the traffic sign recognition contest with superhuman performance; at ICPR 2012, our DanNet[GPUCNN1-3] won the medical imaging contest. All of these fields were also heavily shaped in the 2010s by our non-CNN methods.[DL1][DL4][AM16][GSR][GSR15][GT16][WU][FB17]
As mentioned in Sec. XII, backpropagation was actually proposed earlier as a learning method for NNs by Werbos (1982)[BP2-4] (see also Amari).
Our adaptive subgoal generators (1991)[HRL0-2] were trained through end-to-end-differentiable chains of such modules.[MIR](Sec. 10) The same holds for planning and reinforcement learning with recurrent neural world models (1990),[PLAN][MIR](Sec. 11) and for my linear-transformer-like fast weight programmers[FWP0-2][FWP][ATT][MIR](Sec. 8) since 1991 (see Sec. XVI). Instead of arguments, there were ad hominem attacks.[AH2-3][HIN] As emphasized earlier:[DLC][HIN] GANs are variations of my artificial curiosity (1990). LBH, who called themselves the deep learning conspiracy,[DLC] did not cite the prior unsupervised pre-training for deep NNs (by myself), our deep and fast DanNet (2011),[GPUCNN1-3] the previously invented backpropagation, or our adversarial artificial curiosity.[UN0-2][UN] So virtually all the algorithms that have attracted massive attention build on our work: our LSTM brought essentially unlimited depth to supervised recurrent NNs in the 1990s; our Highway Nets[HW1-3] brought it to feedforward NNs in May 2015.[MOST] These enabled medical diagnosis (2012, see Sec. VII, XVIII) and many other applications.[DEC]
As mentioned earlier,[MIR](Sec. 21) it is not always clear[DLC] who deserves credit for ideas such as backpropagation; one day, AI scientists and AI historians equipped with artificial curiosity[SA17][AC90-AC20][PP-PP2] may settle such priority questions.
AI Blog
Traditionally this is done with recurrent NNs (RNNs)
published.[FWP0-1]
the attention terminology[FWP2] now used
famous vanishing gradient
additive neural activations of LSTMs / Highway Nets / ResNets[HW1-3] (Sec. 5)
Annus Mirabilis of deep learning.[MIR]
artificial neural network (NN)
started in 1965
fast weights.
attention[ATT] (Sec. 4)
on 26 March 1991,
attention[ATT]
That is, I separated storage and control like in traditional computers,
recurrent NNs (RNNs)
attention[ATT] (compare Sec. 4).
sequence-processing recurrent NNs (RNNs)
the computationally most powerful NNs of them all.[UN][MIR](Sec. 0)
of the same size: O(H²) instead of O(H), where H is the number of hidden units. This motivation and a variant of the method were republished over two decades later.[FWP4a][R4][MIR](Sec. 8)[T22](Sec. XVII, item H3)
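The outer-product mechanism behind this O(H²) capacity can be sketched in a few lines (a toy NumPy version; the `slow_net` below is an illustrative stand-in, not the original 1991 architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
H = 4                      # hidden units; the fast net exposes O(H^2) programmable weights
W_fast = np.zeros((H, H))  # fast weight matrix, (re)programmed at run time

def slow_net(x):
    # Illustrative stand-in for the trained "slow" net: from the current
    # input it derives a key, a value, and a query (hypothetical mappings).
    return np.tanh(x), np.tanh(x[::-1]), np.tanh(0.5 * x)

for t in range(3):
    x = rng.standard_normal(H)
    k, v, q = slow_net(x)
    W_fast += np.outer(v, k)  # program the fast weights via an outer product
    y = W_fast @ q            # retrieval: the linear-Transformer-like step

assert W_fast.shape == (H, H) and y.shape == (H,)
```

With H hidden units, the slow net thus controls H² fast weights, which is the quadratic capacity the text refers to.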
4. Attention terminology of 1993
NN-programmed fast weights (Sec. 5).[FWP0-1] See also Sec. 8 & Sec. 9 of [MIR] and Sec. XVII of [T22].
internal spotlights of attention
do not suffer during sequence learning from the famous vanishing gradient
and both of them dating back to 1991, our miraculous year of deep learning.[MIR]
Basic Long Short-Term Memory[LSTM1] solves the problem by adding, at every time step, new values to a memory cell instead of repeatedly multiplying them in, so error signals can flow back through the cell essentially unchanged.
Highway Network (May 2015),[HW1][HW1a][HW3] the first working really deep
Remarkably, both of these dual approaches of 1991 have become successful.
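The additive cell update behind the recurrent approach can be sketched as a scalar toy (gate weights are illustrative; the forget gate follows the later LSTM extension mentioned in the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(c, x, w_in, w_forget, w_cand):
    """One LSTM memory-cell step (scalar toy version, illustrative weights).

    The cell state c is updated *additively*, so backpropagated errors are
    scaled by the forget gate instead of shrinking multiplicatively at every
    step, which is what causes vanishing gradients in plain RNNs."""
    i = sigmoid(w_in * x)      # input gate
    f = sigmoid(w_forget * x)  # forget gate (the 1999 "forget gate" extension)
    g = np.tanh(w_cand * x)    # candidate value
    return f * c + i * g       # additive update: the constant error carousel

c = 0.0
for x in [0.5, -0.2, 1.0]:
    c = lstm_cell_step(c, x, w_in=1.0, w_forget=2.0, w_cand=1.0)
assert -1.5 < c < 1.5  # cell state stays bounded in this toy run
```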
the mid 2010s,[DEC]
major IT companies overwhelmingly used
unsupervised pre-training of deep NNs.[UN0-UN2][MIR](Sec. 1)
dates back to 1991[UN]
(Sec. 2).[FWP4a][R4][MIR](Sec. 8)[T22](Sec. XVII, item H3)
as shown in 2005
the first machine learning
Kolmogorov complexity or algorithmic information content of successful huge NNs may actually be rather small.
Compressed Network Search[CO2]
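The low-Kolmogorov-complexity idea can be illustrated by an indirect weight encoding in the spirit of Compressed Network Search:[CO2] a long weight vector is decoded from a handful of smooth, DCT-like coefficients (a toy sketch; all sizes and numbers are illustrative):

```python
import numpy as np

# Toy indirect encoding: 1000 network weights are generated from 8
# low-frequency cosine (DCT-like) coefficients, so the *description length*
# of the weights is far smaller than their count.
N_WEIGHTS, N_COEFFS = 1000, 8

def decode(coeffs):
    n = np.arange(N_WEIGHTS)
    basis = np.cos(np.pi * np.outer(np.arange(len(coeffs)), (n + 0.5) / N_WEIGHTS))
    return coeffs @ basis  # weights = weighted sum of a few smooth basis functions

genome = np.array([0.5, -0.3, 0.2, 0.1, 0.0, 0.05, -0.02, 0.01])
weights = decode(genome)   # 1000 weights described by just 8 numbers
assert weights.shape == (N_WEIGHTS,)
```

In Compressed Network Search, such coefficient vectors were evolved instead of the raw weights, directly exploiting the fact that well-generalizing weight matrices can be algorithmically simple.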
unsupervised pre-training.
My first work on metalearning machines that
learn to learn was published in 1987.[META][R3]
metalearning in a very general way.
used gradient descent in LSTM networks[LSTM1] instead of traditional
There is another version of this article
[AC] J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. The first paper on long-term planning with reinforcement learning recurrent neural networks (NNs) and on generative adversarial networks.
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. We had both hard attention (1990) and soft attention (1991-93).[FWP] Today, both types are very popular.
Precursor of modern backpropagation.[BP1-4]
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020).
Who invented backpropagation?
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, updated 2021, 2022). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s.
[DL4] J. Schmidhuber (AI Blog, 2017). Our impact on the world: LSTM.
J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.[DL6] Soon after its publication, everybody started talking about "deep learning." Causality or correlation?
[FWP] J. Schmidhuber (AI Blog, 26 March 2021, updated 2022). Fast weight programmers (1991) introduced the attention terminology[FWP2] now used in Transformers.
Highway Nets perform roughly as well as ResNets[HW2] on ImageNet.[HW3] Variants of highway gates are used for certain algorithmic tasks, where the simpler residual layers do not work as well.[NDR] More.
J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of
Searchable PDF scan (created by OCRmypdf, which uses LSTM).
Better GP methods through Meta-Evolution.
[MIR] J. Schmidhuber (AI Blog, 2019). Deep Learning: Our Miraculous Year 1990-1991. Preprint
[MOST] J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in my labs at TU Munich and IDSIA. Here I mention: (1) Long Short-Term Memory (LSTM), (2) ResNet (which is our earlier Highway Net with open gates), (3) AlexNet and VGG Net (both building on our similar earlier DanNet: the first deep convolutional NN to win image recognition competitions), (4) GANs (based on my earlier Adversarial Artificial Curiosity), and (5) variants of Transformers (Transformers with linearized self-attention are formally equivalent to my earlier Fast Weight Programmers). Most of this work stems from the Annus Mirabilis of 1990-1991.[MIR]
attention terminology in 1993.[ATT][FWP2][R4]
J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, and the GAN principle.
[T22] J. Schmidhuber (AI Blog, 2022).
Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21, IDSIA, Lugano, Switzerland, 2022.
[UN] J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised or self-supervised pre-training. Unsupervised approaches are now widely used.
is dominated by artificial neural networks (NNs) and
In 1676, Gottfried Wilhelm Leibniz
Footnote 1. In 1684, Leibniz was also the first to publish "modern" calculus;[L84][SON18][MAD05][LEI21,a,b] later Isaac Newton was also credited for his unpublished work.[SON18] Their priority dispute,[SON18] however, did not encompass the chain rule.[LEI07-10] Of course, both were building on earlier work: in the 3rd century B.C., Archimedes (perhaps the greatest scientist ever[ARC06]) paved the way for infinitesimals.
Footnote 3. Some claim that the backpropagation algorithm (discussed further down; now widely used to train deep NNs) is just the chain rule of Leibniz (1676) & L'Hopital (1696).
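The distinction the footnote draws can be made precise. The chain rule gives the derivative of a composition; backpropagation is the particular algorithm (the reverse mode of automatic differentiation) that evaluates the resulting product of Jacobians efficiently, from the output side backwards:

```latex
% Chain rule (Leibniz, 1676): for y = f(g(x)),
\frac{dy}{dx} \;=\; \frac{dy}{dg}\,\frac{dg}{dx}.
% Backpropagation applies it to a depth-L composition
% y = f_L(f_{L-1}(\cdots f_1(x)\cdots)):
\frac{\partial y}{\partial x}
  \;=\; \frac{\partial f_L}{\partial f_{L-1}}\,
        \frac{\partial f_{L-1}}{\partial f_{L-2}}\cdots
        \frac{\partial f_1}{\partial x},
% where the reverse mode evaluates this product from the output side
% inward, so the total cost is only a small constant multiple of the
% cost of one forward pass.
```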
LSTM became the most cited NN of the 20th century;[MOST] ResNet became the most cited NN of the 21st century.[MOST] Ivakhnenko's highly cited method was still popular in the new millennium,[DL2] especially in Eastern Europe, where much of Machine Learning was born.[MIR](Sec. 1)[R8]
It took 4 decades until the backpropagation method of 1970[BP1-2] got widely accepted as a training method for deep NNs. Before 2010, many thought that the training of NNs with many layers requires unsupervised pre-training, a methodology introduced by myself in 1991[UN][UN0-3] (see below), and later championed by others (2006).[UN4] In fact, it was claimed[VID1]
"wake-up call to the machine learning community." Computer Vision was revolutionized in the 2010s by a particular feedforward NN called the convolutional NN (CNN).[CNN1-4]
In 1987, NNs with convolutions were combined by Alex Waibel with weight sharing and backpropagation (see above),[BP1-2] and applied to speech.[CNN1a] Waibel did not call this CNNs but TDNNs.
CNN of 2011[GPUCNN1] known as DanNet[DAN,DAN1][R6]
given set.[AC20][AC][T22](Sec. XVII) Predictability Minimization created disentangled representations of partially redundant data, applied to images in 1996.[PM0-2][AC20][R2][MIR](Sec. 7) Recurrent NNs learned to generate sequences of subgoals.[HRL1-2][PHD][MIR](Sec. 10) Transformers with "linearized self-attention"[TR5-6] use attention terminology like the one I introduced in 1993.[ATT][FWP2][R4]
Annus Mirabilis of 1990-1991.[MIR][MOST]
The 1991 fast weight programmers and their 1992 extensions[FWPMETA1-9][HO1] built on my 1987 diploma thesis,[META1] which introduced algorithms not just for learning but also for meta-learning or learning to learn,[META] i.e., learning better learning algorithms through experience. This became very popular in the 2010s,[DEC] when computers were a million times faster. Before the 1990s, however, RNNs failed to learn deep problems in practice.[MIR](Sec. 0) The Neural History Compressor[UN1] overcame this, using my NN distillation procedure of 1991.[UN0-1][MIR] Transformers with linearized self-attention were also first published[FWP0-6] in the Annus Mirabilis of 1990-1991.[MIR][MOST] The Long Short-Term Memory (LSTM) recurrent neural network[LSTM1-6] overcomes the Fundamental Deep Learning Problem identified by Sepp in his above-mentioned 1991 thesis, and won three ICDAR 2009 Connected Handwriting Competitions (French, Farsi, Arabic).
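The distillation idea, compressing the knowledge of one NN (the teacher) into another (the student) by training the student on the teacher's outputs, can be sketched with linear toy models (a minimal sketch; names, sizes, and the linear stand-ins are illustrative, not the original 1991 architecture):

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in "teacher": its knowledge is to be compressed into the "student"
# by pure imitation of its outputs; a linear map keeps the sketch tiny.
W_teacher = rng.standard_normal((1, 4))
teacher = lambda x: W_teacher @ x

W_student = np.zeros((1, 4))
lr = 0.1
for _ in range(500):
    x = rng.standard_normal(4)
    err = (W_student @ x) - teacher(x)  # imitate the teacher's outputs
    W_student -= lr * np.outer(err, x)  # plain SGD on the imitation loss

# The student ends up reproducing the teacher's input-output behavior.
assert np.allclose(W_student, W_teacher, atol=1e-2)
```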
Highway Network.[HW1] The ResNet (winner of the ImageNet 2015 contest) is a version thereof.
Our LSTMs brought essentially unlimited depth to supervised recurrent NNs; in the 2010s, the LSTM-inspired Highway Nets brought it to feedforward NNs. See also LSTM trained by policy gradients (2007).[RPG07][RPG][LSTMPG]
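The Highway/ResNet relationship can be sketched as follows (a minimal NumPy toy; layer sizes and weight values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, W_t, b_t):
    """One highway layer: y = t * h(x) + (1 - t) * x.

    The text calls ResNet "a Highway Net with open gates": fixing both the
    transform and carry paths fully open recovers the residual form
    y = h(x) + x, with no learned gating."""
    h = np.tanh(W_h @ x)          # transformed path
    t = sigmoid(W_t @ x + b_t)    # transform gate; b_t < 0 biases toward carrying x
    return t * h + (1.0 - t) * x  # gated mix of transform and carry

rng = np.random.default_rng(1)
x = rng.standard_normal(8)
W_h = 0.1 * rng.standard_normal((8, 8))
W_t = 0.1 * rng.standard_normal((8, 8))
y = highway_layer(x, W_h, W_t, b_t=-2.0)  # mostly carries x through
assert y.shape == x.shape
```

Biasing the gate toward the carry path is what lets error signals pass through hundreds of such layers, the feedforward analogue of the LSTM cell.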
In the 2010s came Alphastar, whose brain has a deep LSTM core trained by PG.[DM3] Bill Gates called OpenAI's LSTM-based Dota 2 result a "huge milestone in advancing artificial intelligence".[OAI2a][MIR](Sec. 4)[LSTMPG] Neural history compressors[UN][UN0-3] learn to represent percepts at multiple levels of abstraction and multiple time scales (see above), while end-to-end differentiable NN-based subgoal generators[HRL3][MIR](Sec. 10) learn hierarchical action plans through gradient descent (see above). More sophisticated ways of learning to think in abstract ways were published in 2015.
Wilhelm Schickard built an early calculating machine (1623). In 1673, the already mentioned Gottfried Wilhelm Leibniz (called "the smartest man who ever lived"[SMO13]) designed the first machine that could multiply. Centuries later came Konrad Zuse: unlike Babbage, Zuse used the binary system of Leibniz rather than decimal arithmetic. See also Turing[TUR] (1936) and Post[POS] (1936).
raw computational power of all human brains combined.[RAW] any type of computation-based AI.[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]
Gottfried Wilhelm Leibniz[L86][WI48] (see above), then, in 1936, Alan M. Turing. In 1964, Ray Solomonoff combined Bayesian (actually Laplacian[STI83-85]) probabilistic reasoning and theoretical computer science;[GOD][CHU][TUR][POS] together with Andrej Kolmogorov, he founded the theory of Kolmogorov complexity or algorithmic information theory (AIT),[AIT1-22] going beyond traditional information theory.[SHA48][KUL]
In the early 2000s, Marcus Hutter (while working under my Swiss National Science Foundation grant[UNI]) augmented Solomonoff
first truly self-driving cars
robot cars were driving in highway traffic, up to 180 km/h).[AUT] Back then, I worked on my 1987 diploma thesis,[META1] which introduced algorithms not just for learning but also for meta-learning or learning to learn,[META] to learn better learning algorithms through experience (now a very popular topic[DEC]). And then came our Miraculous Year 1990-91[MIR] at TU Munich,
(take all of this with a grain of salt, though[OMG1]).
Some of the material above was taken from previous AI Blog posts.[MIR] [DEC] [GOD21] [ZUS21] [LEI21] [AUT] [HAB2] [ARC06] [AC] [ATT] [DAN] [DAN1] [DL4] [GPUCNN5,8] [DLC] [FDL] [FWP] [LEC] [META] [MLP2] [MOST] [PLAN] [UN] [LSTMPG] [BP4] [DL6a] [HIN] [T22]
In 2022, we are celebrating the following works from a quarter-century ago.
1. Journal paper on Long Short-Term Memory, the
(and basis of the most cited NN of the 21st).
all possible metaverses
3. Implementing artificial curiosity and creativity through generative adversarial agents that learn to design abstract, interesting computational experiments.
meta-reinforcement learning.
5. Journal paper on hierarchical Q-learning.
8. Journal paper on Low-Complexity Art, the Minimal Art of the Information Age.
[AC] J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. The first paper on online planning with reinforcement learning recurrent neural networks (NNs) and on generative adversarial networks.
general system
(More on artificial scientists and artists.)
[AIB] J. Schmidhuber. AI Blog.
Includes variants of chapters of the AI Book.
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. We had both hard attention (1990) and soft attention (1991-93).[FWP] Today, both types are very popular.
attentional component (the fixation controller)." See [MIR](Sec. 9)[R4].
J. Schmidhuber (AI Blog, 2005). Highlights of robot car history.
Precursor of modern backpropagation.[BP1-5]
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020).
Who invented backpropagation?
[DAN] J. Schmidhuber (AI Blog, 2021). 10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named after my postdoc Dan Ciresan.
J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition, by our artificial neural network called DanNet.
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, updated 2021, 2022). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. Covers the 1991 NN distillation procedure.[UN0-2][MIR](Sec. 2)
[DL4] J. Schmidhuber (AI Blog, 2017). Our impact on the world: LSTM.
J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.[DL6] Soon after its publication, everybody started talking about "deep learning." Causality or correlation?
This type of deep learning dates back to 1991.[UN1-2][UN]
[DLC] J. Schmidhuber (AI Blog, June 2015).
Critique of Paper by self-proclaimed[DLC2] "Deep Learning Conspiracy" (Nature 521 p 436).
J. Schmidhuber (AI Blog, 2022).
Annotated History of Modern AI and Deep Learning. Technical Report IDSIA-22-22, IDSIA, Lugano, Switzerland, 2022.
and PhDs in computer science. More.
Alphastar has a "deep LSTM core."
used LSTM
[FDL] J. Schmidhuber (AI Blog, 2013). My First Deep Learning System of 1991 + Deep Learning Timeline 1960-2013.
[FWP] J. Schmidhuber (AI Blog, 26 March 2021, updated 2022). Fast weight programmers (1991) introduced the attention terminology[FWP2] now used in Transformers.
OCR-based PDF scan of pages 94-135 (see pages 119-120).
Cognitive Computation 1(2):177-193, 2009. PDF.
win four important computer vision competitions 2011-2012 before others won any
[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More.
DanNet,[DAN,DAN1][R6]
J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet was the first CNN to win one, and won 4 of them in a row before the similar AlexNet/VGG Net and the Resnet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision.
[GPUCNN8] J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet).
first deep learner to win a medical imaging contest (2012).
J. Schmidhuber (Blog, 2000). Most influential persons of the 20th century (according to Nature, 1999). The Haber-Bosch process has often been called the most important invention of the 20th century[HAB1]
[HIN] J. Schmidhuber (AI Blog, 2020). Critique of Honda Prize for Dr. Hinton. Science must not allow corporate PR to distort the academic record. See also [T22].
previous related work.[BB2][NAN1-4][NHE][MIR](Sec. 15, Sec. 17)[FWPMETA6]
[LEC] J. Schmidhuber (AI Blog, 2022). LeCun
LeCun also listed the "5 best ideas 2012-2022" without mentioning that
[LEI21] J. Schmidhuber (AI Blog, 2021). 375th birthday of Leibniz, founder of computer science.
Frankfurter Allgemeine Zeitung (FAZ), 17/5/2021. FAZ online:
[LEI21b] J. Schmidhuber (AI Blog, 2021). 375. Geburtstag des Herrn Leibniz, dem Vater der Informatik.
J. Schmidhuber (AI Blog, Dec 2020). 10-year anniversary of our journal paper on deep reinforcement learning with policy gradients for LSTM (2007-2010). Recent
Minsky was apparently unaware of this and failed to correct it later.[HIN](Sec. I)[T22](Sec. XIII)
J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of
Searchable PDF scan (created by OCRmypdf, which uses LSTM).
Better GP methods through Meta-Evolution.
[MIR] J. Schmidhuber (AI Blog, Oct 2019, updated 2021, 2022). Deep Learning: Our Miraculous Year 1990-1991. Preprint
(AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training.
[MOST] J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in my labs at TU Munich and IDSIA: (1) Long Short-Term Memory (LSTM), (2) ResNet (which is our earlier Highway Net with open gates), (3) AlexNet and VGG Net (both building on our similar earlier DanNet: the first deep convolutional NN to win image recognition competitions), (4) GANs (based on the much earlier Adversarial Artificial Curiosity), and (5) variants of Transformers (Transformers with linearized self-attention are formally equivalent to the much earlier Fast Weight Programmers). Most of this work stems from the Annus Mirabilis of 1990-1991.[MIR]
[NASC1] J. Schmidhuber. First Pow(d)ered flight / plane truth. Correspondence, Nature, 421 p 689, Feb 2003.
[NASC3] J. Schmidhuber. The last inventor of the telephone. Letter, Science, 319, no. 5871, p. 1759, March 2008.
[NASC6] J. Schmidhuber. Colossus was the first electronic digital computer. Correspondence, Nature, 441 p 25, May 2006.
[NASC7] J. Schmidhuber. Turing
[NASC8] J. Schmidhuber. Prototype resilient, self-modeling robots. Correspondence, Science, 316, no. 5825 p 688, May 2007.
[NASC9] J. Schmidhuber. Comparing the legacies of Gauss, Pasteur, Darwin. Correspondence, Nature, vol 452, p 530, April 2008.
Compare Konrad Zuse
An LSTM with 84% of the model's total parameter count was the core of OpenAI Five (2018).
J. Schmidhuber (Blog, 2006).
Is History Converging? Again?
HTML.
HTML overview.
OOPS source code in crystalline format.
HTML.
J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, and
the GAN principle
More.
More.
More.
PDF. More.
PDF.
J. Schmidhuber (AI Blog, 2001). Raw Computing Power.
PDF.
This experimental analysis of backpropagation did not cite the origin of the method,[BP1-5] also known as the reverse mode of automatic differentiation.
PDF.
Local copy 1 (HTML only).
Local copy 2 (HTML only).
[T20a] J. Schmidhuber (AI Blog, 25 June 2020). Critique of 2018 Turing Award for Drs. Bengio & Hinton & LeCun. A precursor of [T22].
[T22] J. Schmidhuber (AI Blog, 2022).
Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21, IDSIA, Lugano, Switzerland, 2022.
attention terminology in 1993.[ATT][FWP2][R4]
[TUR21] J. Schmidhuber (AI Blog, Sep 2021). Turing Oversold. It
J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised or self-supervised pre-training. Unsupervised
PDF.
neural knowledge distillation procedure
The systems of 1991 allowed for much deeper learning than previous methods. More.
approaches are now widely used. More.
the first NNs shown to solve very deep problems.
Theory of Universal Learning Machines & Universal AI.
[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF. More on the Fundamental Deep Learning Problem.
unsupervised pre-training is not necessary
Local copy (plain HTML only).
Schmidhuber
a general, practical, program-controlled computer.
J. Schmidhuber (AI Blog, 2021). 80th anniversary celebrations: 1941: Konrad Zuse completes the first working general computer, based on his 1936 patent application.
J. Schmidhuber (AI Blog, 2021). 80. Jahrestag: 1941: Konrad Zuse baut ersten funktionalen Allzweckrechner, basierend auf der Patentanmeldung von 1936.
PDF.
AI Blog
modern backpropagation
principles of generative adversarial NNs and artificial curiosity (1990),[AC90,90b][AC20]
unsupervised pre-training for deep NNs (1991),[UN1-2]
vanishing gradients (1991)[VAN1] &
Long Short-Term Memory or LSTM (Sec. A),
NNs with over 100 layers (2015),[HW1-3][R5]
fast weight programmers (1991).[FWP0-2,6]
Often LBH failed to cite essential prior work, even in their later surveys.[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5, R7-R8]
The deep NNs
By the 2010s,[DEC] they were
Long Short-Term Memory
vanishing gradient problem
(Sec. 3, Sec. 4)
through "forget gates" based on end-to-end-differentiable fast weights.[MIR](Sec. 8)[FWP,FWP0-1]
became the first recurrent NN (RNN) to win international competitions.
LSTM[MIR](Sec. 4)
However, such attention mechanisms also
have their roots in my lab (1991);[FWP][FWP0-2,6]
In the 2010s,
Alphastar, whose brain has a deep LSTM core trained by policy gradients (PG).[DM3]
Bill Gates called this a "huge milestone in advancing artificial intelligence".[OAI2a][MIR](Sec. 4)[LSTMPG]
in healthcare,
D. Computer Vision was revolutionized in the 2010s by
In 1987, Waibel combined convolutional NNs with weight sharing and backpropagation.[CNN1a] He did not call this architecture a CNN but a TDNN.
unsupervised pre-training is not necessary
Our fast GPU-based CNN of 2011[GPUCNN1] known as DanNet[DAN,DAN1][R6]
winning four of them
at IJCNN 2011 in Silicon Valley, DanNet blew away the competition and achieved the first superhuman visual pattern recognition[DAN1] in an international contest (where LeCun
DanNet was also the first deep CNN to win:
a Chinese handwriting contest (ICDAR 2011),
an image segmentation contest (ISBI, May 2012),
further extended the work of 2011.[MIR](Sec. 19)
most cited neural network,[MOST] is a version (with open gates) of our earlier
Highway Net (May 2015).[HW1-3][R5] The Highway Net is actually the feedforward net version of vanilla LSTM.[LSTM2] It was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers).
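This relationship can be sketched in a few lines (a minimal illustration with made-up parameter names, not the original implementation): a Highway layer mixes a transformed signal with the identity input via learned gates, and a ResNet block is the special case with both gates fixed open.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(x, Wh, Wt, Wc, bh, bt, bc):
    """One Highway layer: y = t * h + c * x, where h is a candidate
    transformation and t, c are learned transform/carry gates
    (often coupled as c = 1 - t, as in vanilla LSTM's gating)."""
    h = np.tanh(Wh @ x + bh)   # candidate transformation
    t = sigmoid(Wt @ x + bt)   # transform gate in (0, 1)
    c = sigmoid(Wc @ x + bc)   # carry gate in (0, 1)
    return t * h + c * x

def residual_layer(x, Wh, bh):
    """A ResNet block is the open-gated special case t = c = 1: y = h + x."""
    return np.tanh(Wh @ x + bh) + x
```

Because the carry path passes x through unchanged, such layers can be stacked hundreds deep without the signal (or its gradient) being forced through a squashing nonlinearity at every layer.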
appeared long before the 1980s.
containing the now popular multiplicative gates).[DEEP1-2][DL1-2] A paper of 1971[DEEP2] already described a deep learning net with 8 layers, trained by their highly cited method which was still popular in the new millennium,[DL2] especially in Eastern Europe, where much of Machine Learning was born.[MIR](Sec. 1)[R8] LBH failed to cite this, just like they failed to cite Amari,[GD1] who in 1967 proposed stochastic gradient descent[STO51-52] (SGD) for MLPs and whose implementation[GD2,GD2a] (with Saito) learned internal representations at a time when compute was billions of times more expensive than today (see also Tsypkin
Minsky was apparently unaware of this and failed to correct it later.[HIN](Sec. I)
However, it became really deep in 1991 in my lab,[UN-UN3] which has
First Very Deep NNs, Based on Unsupervised Pre-Training (1991).
more.)
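The general flavor of such layer-wise unsupervised pre-training can be sketched as follows (a generic greedy autoencoder stack for illustration only; the 1991 method was a stack of predictive RNNs, the neural history compressor, and the gradient here is deliberately simplified):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(X, n_hidden, epochs=200, lr=0.1):
    """Train one tied-weight autoencoder layer; return its encoder weights."""
    W = rng.normal(0, 0.1, (n_hidden, X.shape[1]))
    for _ in range(epochs):
        H = np.tanh(X @ W.T)        # encode
        X_rec = H @ W               # decode with tied weights
        err = X_rec - X
        # simplified gradient (decoder path only) -- enough for a sketch
        W -= lr * (H.T @ err) / len(X)
    return W

def pretrain_stack(X, layer_sizes):
    """Greedy layer-wise pre-training: each layer learns to reconstruct
    the codes produced by the already-trained layers below it."""
    weights, codes = [], X
    for n_hidden in layer_sizes:
        W = train_autoencoder(codes, n_hidden)
        weights.append(W)
        codes = np.tanh(codes @ W.T)  # feed codes to the next layer
    return weights
```

After pre-training, the stacked weights initialize a deep net that is then fine-tuned by supervised learning.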
drove the shift
from unsupervised pre-training to purely supervised learning (1991-95; 2006-10).[HIN](Sec. II)[MIR](Sec. 19)
LSTMs
brought essentially unlimited depth to gradient-based supervised recurrent NNs; Highway Nets[HW1-3] brought it to feedforward NNs.[MOST]
by others (Sec.
modern backpropagation
principles of generative adversarial NNs and artificial curiosity (1990),[AC90,90b][AC20]
unsupervised pre-training for deep NNs,[UN1-2]
the vanishing gradient problem (1991)[VAN1] &
solutions to it (Sec. A),
Often LBH failed to cite essential prior work.[DLC][HIN][MIR](Sec. 21)
we had this type of deep learning already in 1991;[UN][UN1-2] see Sec.
any type of computation-based AI.[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]
In 1936, Turing
my reply to Hinton
who criticized my website on Turing
Likewise, Konrad Zuse (1910-1995)
created the world
modern backpropagation
principles of generative adversarial NNs and artificial curiosity (1990),[AC][AC90,90b][AC10][AC20]
unsupervised pre-training for deep NNs (1991),[UN1-2][UN]
vanishing gradients (1991)[VAN1] &
solutions to it (Sec. A),[LSTM0-17][CTC]
record-breaking deep supervised NNs
and contest-winning deep CNNs (2011),[DAN][DAN1][GPUCNN5]
NNs with over 100 layers (2015),[HW1-3][R5]
fast weight programmers (1991),[FWP0-2,6]
Often LBH failed to cite essential prior work.[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5,R7,R8,R11]
"advances in natural language processing" and in speech
supervised NNs and
CNNs
and through Highway Net-like NNs (2015),[HW1-3][R5] although the principles of CNNs were invented and developed by others since the 1970s.[CNN1-4] See Sec. D & XVIII & XIV
as well as Sec. 4 & Sec. 19 of the overview.[MIR]
DanNet[DAN][DAN1][GPUCNN5]
the first NN to win a medical imaging contest through deep learning
approach of
who failed to cite them, even in later
the first machine learning
others built careers on this notion long before LBH recognized this.[DEEP1-2][CNN1][HIN][R8][DL1][DLC] Even deep learning through unsupervised pre-training was introduced by others.[UN1-3][R4][HIN](Sec. II)
made GPU-based NNs fast and deep enough
unsupervised pre-training (pioneered by myself in 1991)
is not necessary
our CNNs were deep and fast enough[DAN][DAN1][GPUCNN5]
Furthermore, by the mid 2010s, speech recognition and machine translation
(and apparently even other award committees[HIN](Sec. I))
recent debate:[HIN] It is true that in 2018,
fast deep CNN called DanNet
as well as Sec. 19 of the overview.[MIR]
In the 2010s,
was actually the LSTM of our team,[LSTM0-6] which Bloomberg called "arguably the most commercial AI achievement."[AV1][MIR](Sec. 4) See Sec. B.
adaptive neural sequential attention: end-to-end-differentiable
"soft" attention in the latent space of Fast Weight Programmers (FWPs),[FWP2][FWP] and "hard" attention (in observation space) in the context of RL[ATT][ATT0-1] (1990).
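The distinction between the two can be sketched as follows (illustrative, simplified to dot-product scoring): soft attention is a differentiable weighted average trainable by backpropagation, while hard attention selects one item and is typically trained by RL.

```python
import numpy as np

def soft_attention(query, keys, values):
    """Differentiable "soft" attention: a softmax-weighted average of all
    values; gradients flow through every item."""
    scores = keys @ query
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ values

def hard_attention(query, keys, values, rng):
    """Non-differentiable "hard" attention: sample a single item from the
    same distribution; usually trained with reinforcement learning."""
    scores = keys @ query
    w = np.exp(scores - scores.max())
    w /= w.sum()
    idx = rng.choice(len(values), p=w)
    return values[idx]
```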
FWPs of 1991[FWP0-1]
which have become a popular alternative to RNNs.
the 2010s,[DEC]
the attention terminology[FWP2] now used
See[MIR](Sec. 9)[R4] for my related priority dispute on attention with Hinton.
a simple application[AC]
of the adversarial curiosity (AC) principle
the other
(1991).[PM1-2][AC20][R2][MIR](Sec. 5)
vanishing gradient problem,[MIR](Sec. 3)[VAN1] Bengio published his own,[VAN2] without citing Sepp.
my publications on exactly this topic
Regarding attention-based Transformers,[TR1-6] Bengio[DL3a] cites his own team (2014) for "soft attention" without citing my much earlier original work of 1991-1993 on soft attention and linear Transformers.[FWP,FWP0-2,6]
unsupervised pre-training
for deep NNs.[UN0-4][HIN](Sec. II)[MIR](Sec. 1)
the first NNs shown to solve very deep problems
compressing or distilling
one NN into another.[UN0-2][DIST1-2][MIR](Sec. 2)
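The distillation idea itself is simple: a student net is trained to imitate the outputs of a teacher net. A minimal sketch (linear student, arbitrary teacher function; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill(teacher_logits_fn, X, n_out, epochs=500, lr=0.5):
    """Train a linear student to imitate a teacher's soft outputs by
    gradient descent on the cross-entropy to the teacher's distribution."""
    W = np.zeros((X.shape[1], n_out))
    targets = softmax(teacher_logits_fn(X))      # teacher's soft labels
    for _ in range(epochs):
        probs = softmax(X @ W)                   # student's predictions
        W -= lr * X.T @ (probs - targets) / len(X)
    return W
```

The teacher can be any trained model; only its soft outputs on the transfer data are needed.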
fast weight programmers[FWP][FWP0-4a]
through tensor-like outer products (1991-2016) and their motivation[FWP2][FWP4a][MIR](Sec. 8) (see also Sec. XVI above).
learning sequential attention
with NNs.[MIR](Sec. 9)
our much earlier work on this[ATT1][ATT] although
However, the basic CNN architecture with convolutional and downsampling layers is actually due to Fukushima (1979).[CNN1] NNs with convolutions were later (1987) combined by Waibel with weight sharing and backpropagation.[CNN1a] Waibel called this TDNN and
at IJCNN 2011 in Silicon Valley, our DanNet[DAN][GPUCNN1-3] won the
superhuman performance
at ICPR 2012, our DanNet[GPUCNN1-3] won the
medical imaging contest
All of these fields were heavily shaped in the 2010s by our non-CNN methods.[DL1][DL4][AM16][GSR][GSR15][GT16][WU][FB17] See
As mentioned in Sec. XII, backpropagation was actually proposed earlier as a learning method for NNs by Werbos (1982)[BP2-4] (see also Amari
adaptive subgoal generators
(1991)[HRL0-2] were trained through end-to-end-differentiable chains of such modules.[MIR](Sec. 10)
planning and reinforcement learning with recurrent neural world models
(1990).[PLAN][MIR](Sec. 11) Same for my linear transformer-like
fast weight programmers[FWP0-2][FWP][ATT][MIR](Sec. 8) since 1991 (see Sec. XVI)
ad hominem attacks[AH2-3][HIN]
As emphasized earlier:[DLC][HIN]
GANs are variations
LBH, who called themselves the deep learning conspiracy,[DLC][DLC1-2]
unsupervised pre-training for deep NNs
by myself
our deep and fast DanNet (2011)[GPUCNN1-3] as
Backpropagation is a previously invented
artificial curiosity
by ours[UN0-2][UN]
So virtually all the algorithms that have attracted
our LSTM
brought essentially unlimited depth to gradient-based supervised recurrent NNs in the 1990s; our Highway Nets[HW1-3] brought it to feedforward NNs in May 2015.[MOST]
medical diagnosis (2012, see Sec. VII, XVIII), and many other applications.[DEC]
and many other applications.[DEC]
As mentioned earlier,[MIR](Sec. 21)
it is not always clear[DLC]
backpropagation
AI scientists and AI historians
equipped with artificial curiosity[SA17][AC90-AC20][PP-PP2][R1]
publication page and my
J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. Our
PDF.
The first paper on planning with reinforcement learning recurrent neural networks (NNs) (more) and on generative adversarial networks
(more).
More.
PDF.
PDF. (More on
artificial scientists and artists.)
PDF.
(more).
[AIB] J. Schmidhuber. AI Blog.
Includes variants of chapters of the AI Book.
PDF.
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. We had both hard attention (1990) and soft attention (1991-93).[FWP] Today, both types are very popular.
PDF.
More.
PS. (PDF.)
HTML.
PDF.
Precursor of modern backpropagation.[BP1-4]
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020).
Who invented backpropagation?
PDF.
PDF.
PDF.
J. Schmidhuber (AI Blog, 2021).
10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named
J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition.
our artificial neural network called DanNet
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, revised 2021). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The
More.
Local copy (HTML only).
[DL4] J. Schmidhuber (AI Blog, 2017).
Our impact on the world
LSTM.
J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.[DL6] Soon after its publication, everybody started talking about "deep learning." Causality or correlation?
this type of deep learning dates back to 1991.[UN1-2][UN]
[DLC] J. Schmidhuber (AI Blog, June 2015).
Critique of Paper by self-proclaimed[DLC1-2] "Deep Learning Conspiracy" (Nature 521 p 436).
Alphastar has a "deep LSTM core."
used LSTM
PDF.
J. Schmidhuber (AI Blog, 26 March 2021).
the attention terminology[FWP2] now used
PDF.
HTML.
Pictures (German).
HTML overview.
can be found here.
PDF.
OCR-based PDF scan of pages 94-135 (see pages 119-120).
win four important computer vision competitions 2011-2012 before others won any
PDF.
HTML overview.
[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More.
J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet won 4 of them in a row before the similar AlexNet/VGG Net and the ResNet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision.
PDF.
first deep learner to win a medical imaging contest (2012). HTML.
[HIN] J. Schmidhuber (AI Blog, 2020). Critique of Honda Prize for Dr. Hinton. Science must not allow corporate PR to distort the academic record.
PDF.
More.
(v1: 24 Sep 2021,
v2: 31 Dec 2021)
Versions since 2021 archived in the Internet Archive
deep learning survey,[DL1] and can also be seen as a short history of the deep learning revolution, at least as far as ACM
2015 survey of deep learning[DL1]
June 2020 article[T20a][R12]
version 1 of the present report.
expands material in my Critique of the 2019 Honda Prize[HIN] (~3,000 words).
The text contains numerous hyperlinks to relevant overview sites from the AI Blog.
first version
Most of the critiques are based on references to original papers and material from the AI Blog.[AIB][MIR][DEC][HIN]
this class of methods was pioneered in 1991[UN-UN2] (see Sec. II, III).
Highway Net,
were all driven by my lab:[MOST] In 1991, I had the
first very deep NNs based on unsupervised pre-training;[UN-UN2]
LSTMs
later our Highway Nets[HW1-3] brought it to feedforward NNs.
based on LSTM[LSTM0-6] (1990s-2005) and CTC (2006).[CTC]
our CTC-LSTM-based speech recognition (not that of Hinton) had been on most smartphones for years[GSR][GSR15-19][DL4] (see Sec. A, VI, XI, XV). Similarly for machine translation (see Sec. B).
called AlexNet,[GPUCNN4] without mentioning that our earlier groundbreaking deep GPU-based DanNet[GPUCNN1-3,5-8][DAN] did not need ReLUs at all to win 4 earlier object recognition competitions and to achieve superhuman results already in 2011[GPUCNN1-8][R5-6] (see Sec. XIV).
described in the 1991-93 papers on Fast Weight Programmers and linear Transformers[FWP0-1,6] (see Sec. XVI, XVII-2).
GANs are instances
of the Adversarial Curiosity Principle of 1990[AC90-20][MIR](Sec. 5) (see Sec. XVII).
it became really deep in 1991 in my lab,
unsupervised pre-training of NNs,
supervised LSTM.
combines two methods from my lab: LSTM (1990s-2005) and CTC (2006), which were
our revolutionary CTC-LSTM which was soon on most smartphones.
(soon used for several billions of
was also based on our LSTM.
most visible breakthroughs
deep NNs
superior computer vision in 2011,
winning 4 image recognition contests in a row
is an open-gated version of our earlier Highway Nets.
deep & fast CNN
(where LeCun participated),
deep GPU-NN of 2010
debunked unsupervised pre-training (introduced by myself in 1991 and later championed by Hinton),
and our GPU-CNN of 2011 (DanNet) was the first
first to win medical imaging competitions
backpropagation
CTC-LSTM
We started this in 1990-93
long before LBH
Artificial Curiosity
vanishing gradients (1991),
metalearning (1987),
unsupervised pre-training (1991),
compressing or distilling one NN into another (1991),
learning sequential attention with NNs (1990),
fast weight programmers using
Sec. IV is on Turing (1936) and his predecessors
In the recent decade of deep learning,
(speech recognition, language translation, etc.) on billions of devices (also healthcare applications)
AI Blog
Traditionally this is done with recurrent NNs (RNNs)
published.[FWP0-1]
the attention terminology[FWP2] now used
famous vanishing gradient
additive neural activations of LSTMs / Highway Nets / ResNets[HW1-3] (Sec. 5)
Annus Mirabilis of deep learning.[MIR]
artificial neural network (NN)
started in 1965
fast weights.
attention[ATT] (Sec. 4)
on 26 March 1991,
attention[ATT]
That is, I separated storage and control like in traditional computers,
recurrent NNs (RNNs)
attention[ATT] (compare Sec. 4).
sequence-processing recurrent NNs (RNNs)
the computationally most powerful NNs of them all.[UN][MIR](Sec. 0)
of the same size: O(H²) instead of O(H), where H is the number of hidden units. This motivation and a variant of the method were republished over two decades later.[FWP4a][R4][MIR](Sec. 8)[T22](Sec. XVII, item H3)
4. Attention terminology of 1993
NN-programmed fast weights (Sec. 5).[FWP0-1], Sec. 9 & Sec. 8 of [MIR], Sec. XVII of [T22]
internal spotlights of attention
do not suffer during sequence learning from the famous vanishing gradient
and both of them dating back to 1991, our miraculous year of deep learning.[MIR]
Basic Long Short-Term Memory[LSTM1] solves the problem by adding at every time step
Highway Network (May 2015),[HW1][HW1a][HW3] the first working really deep
Remarkably, both of these dual approaches of 1991 have become successful.
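The contrast between a plain recurrence and the additive carousel shared by LSTM cells, Highway Nets, and ResNets can be illustrated with a toy calculation (my own illustration; the constants are not from [VAN1]):

```python
T = 100   # depth / number of time steps
w = 0.5   # recurrent weight with magnitude below 1

# Plain multiplicative recurrence h_t = w * h_{t-1}:
# the gradient d h_T / d h_0 = w ** T vanishes exponentially with depth.
grad_multiplicative = w ** T

# Additive carousel h_t = h_{t-1} + f(x_t) -- the skip path shared by
# LSTM cells, Highway Nets, and ResNets:
# the same gradient is exactly 1, independent of depth.
grad_additive = 1.0

print(f"{grad_multiplicative:.1e}")  # 7.9e-31
print(grad_additive)                 # 1.0
```

Through 100 steps, the multiplicative gradient is below 10⁻³⁰ while the additive path carries it unchanged; this is the essence of why gates over an additive path enable very deep credit assignment.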
the mid 2010s,[DEC]
major IT companies overwhelmingly used
unsupervised pre-training of deep NNs.[UN0-UN2][MIR](Sec. 1)
dates back to 1991[UN]
(Sec. 2).[FWP4a][R4][MIR](Sec. 8)[T22](Sec. XVII, item H3)
as shown in 2005
the first machine learning
Kolmogorov complexity or algorithmic information content of successful huge NNs may actually be rather small.
Compressed Network Search[CO2]
unsupervised pre-training.
My first work on metalearning machines that
learn to learn was published in 1987.[META][R3]
metalearning in a very general way.
used gradient descent in LSTM networks[LSTM1] instead of traditional
There is another version of this article
J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. Our
PDF.
The first paper on long-term planning with reinforcement learning recurrent neural networks (NNs) (more) and on generative adversarial networks
(more).
PDF.
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. Schmidhuber
PDF.
More.
PS. (PDF.)
Precursor of modern backpropagation.[BP1-4]
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020).
Who invented backpropagation?
PDF.
PDF.
PDF.
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, updated 2021, 2022). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The
More.
[DL4] J. Schmidhuber (AI Blog, 2017).
Our impact on the world
LSTM.
J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.[DL6] Soon after its publication, everybody started talking about "deep learning." Causality or correlation?
J. Schmidhuber (AI Blog, 26 March 2021, updated 2022).
the attention terminology[FWP2] now used
PDF.
HTML.
Pictures (German).
HTML overview.
PDF.
can be found here.
Highway Nets perform roughly as well as ResNets[HW2] on ImageNet.[HW3] Variants of highway gates are used for certain algorithmic tasks, where the simpler residual layers do not work as well.[NDR] More.
More.
PDF.
More.
J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of
Searchable PDF scan (created by OCRmypdf which uses
LSTM).
HTML.
better GP methods through Meta-Evolution. More.
[MIR] J. Schmidhuber (AI Blog, 2019). Deep Learning: Our Miraculous Year 1990-1991. Preprint
J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in my labs at TU Munich and IDSIA. Here I mention: (1) Long Short-Term Memory (LSTM), (2) ResNet (which is our earlier Highway Net with open gates), (3) AlexNet and VGG Net (both building on our similar earlier DanNet: the first deep convolutional NN to win
image recognition competitions), (4) GANs (instances of my earlier Adversarial Artificial Curiosity), and (5) variants of Transformers (Transformers with linearized self-attention are formally equivalent to my earlier Fast Weight Programmers).
Annus Mirabilis of 1990-1991.[MIR]
PDF.
PDF.
attention terminology in 1993.[ATT][FWP2][R4]
J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, the GAN principle
[T22] J. Schmidhuber (AI Blog, 2022).
Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21, IDSIA, Lugano, Switzerland, 2022.
PDF.
J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised or self-supervised pre-training. Unsupervised
PDF.
approaches are now widely used. More.
https://people.idsia.ch/~juergen/deep-learning-history.html
is dominated by artificial neural networks (NNs) and
In 1676, Gottfried Wilhelm Leibniz
Footnote 1. In 1684, Leibniz was also the first to publish "modern" calculus;[L84][SON18][MAD05][LEI21,a,b] later Isaac Newton was also credited for his unpublished work.[SON18] Their priority dispute,[SON18] however, did not encompass the chain rule.[LEI07-10] Of course, both were building on earlier work: in the 3rd century B.C., Archimedes (perhaps the greatest scientist ever[ARC06]) paved the way for infinitesimals
Footnote 3. Some claim that the backpropagation algorithm (discussed further down; now widely used to train deep NNs) is just the chain rule of Leibniz (1676) & L
It took 4 decades until the backpropagation method of 1970[BP1-2] became widely accepted as a training method for deep NNs. Before 2010, many thought that the training of NNs with many layers required unsupervised pre-training, a methodology introduced by myself in 1991[UN][UN0-3] (see below), and later championed by others (2006).[UN4] In fact, it was claimed[VID1]
"wake-up call to the machine learning community." Computer Vision was revolutionized in the 2010s by a particular feedforward NN called the convolutional NN (CNN).[CNN1-4]
In 1987, NNs with convolutions were combined by Alex Waibel with weight sharing and backpropagation (see above),[BP1-2] and applied to speech.[CNN1a] Waibel did not call them CNNs but TDNNs.
CNN of 2011[GPUCNN1] known as DanNet[DAN,DAN1][R6]
given set.[AC20][AC][T22](Sec. XVII) Predictability Minimization for creating disentangled representations of partially redundant data, applied to images in 1996.[PM0-2][AC20][R2][MIR](Sec. 7) recurrent NNs that learn to generate sequences of subgoals.[HRL1-2][PHD][MIR](Sec. 10) Transformers with "linearized self-attention"[TR5-6] attention terminology like the one I introduced in 1993[ATT][FWP2][R4]).
Annus Mirabilis of 1990-1991.[MIR][MOST]
The fast weight programmers of 1991-92[FWPMETA1-9][HO1] extended my 1987 diploma thesis,[META1] which introduced algorithms not just for learning but also for meta-learning or learning to learn,[META] to learn better learning algorithms through experience. This became very popular in the 2010s[DEC] when computers were a million times faster. Before the 1990s, however, RNNs failed to learn deep problems in practice.[MIR](Sec. 0) Neural History Compressor.[UN1] using my NN distillation procedure of 1991.[UN0-1][MIR] Transformers with linearized self-attention were also first published[FWP0-6] in the Annus Mirabilis of 1990-1991.[MIR][MOST] Fundamental Deep Learning Problem. The Long Short-Term Memory (LSTM) recurrent neural network[LSTM1-6] overcomes the Fundamental Deep Learning Problem identified by Sepp in his above-mentioned 1991 thesis. Three ICDAR 2009 Connected Handwriting Competitions (French, Farsi, Arabic). LSTM.
Highway Network[HW1] ImageNet 2015 contest) is a version thereof
Deep learning LSTMs brought essentially unlimited depth to supervised recurrent NNs; the LSTM-inspired Highway Nets brought it to feedforward NNs. In the 2000s: LSTM trained by policy gradients (2007).[RPG07][RPG][LSTMPG]
Alphastar whose brain has a deep LSTM core trained by PG.[DM3] Bill Gates called this a "huge milestone in advancing artificial intelligence".[OAI2a][MIR](Sec. 4)[LSTMPG] neural history compressors[UN][UN0-3] learn to represent percepts at multiple levels of abstraction and multiple time scales (see above), while end-to-end differentiable NN-based subgoal generators[HRL3][MIR](Sec. 10) learn hierarchical action plans through gradient descent (see above). More sophisticated ways of learning to think in abstract ways were published in
Wilhelm Schickard, In 1673, the already mentioned Gottfried Wilhelm Leibniz (called "the smartest man who ever lived"[SMO13]) Konrad Zuse Unlike Babbage, Zuse used Leibniz Turing[TUR] (1936), and Post[POS] (1936).
raw computational power of all human brains combined.[RAW] any type of computation-based AI.[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]
Gottfried Wilhelm Leibniz[L86][WI48] (see above), In 1936, Alan M. Turing the world In 1964, Ray Solomonoff combined Bayesian (actually Laplacian[STI83-85]) probabilistic reasoning and theoretical computer science[GOD][CHU][TUR][POS] Andrej Kolmogorov, he founded the theory of Kolmogorov complexity or algorithmic information theory (AIT),[AIT1-22] going beyond traditional information theory[SHA48][KUL]
In the early 2000s, Marcus Hutter (while working under my Swiss National Science Foundation grant[UNI]) augmented Solomonoff
first truly self-driving cars
robot cars were driving in highway traffic, up to 180 km/h).[AUT] Back then, I worked on my 1987 diploma thesis,[META1] which introduced algorithms not just for learning but also for meta-learning or learning to learn,[META] to learn better learning algorithms through experience (now a very popular topic[DEC]). And then came our Miraculous Year 1990-91[MIR] at TU Munich,
(take all of this with a grain of salt, though[OMG1]).
Some of the material above was taken from previous AI Blog posts.[MIR] [DEC] [GOD21] [ZUS21] [LEI21] [AUT] [HAB2] [ARC06] [AC] [ATT] [DAN] [DAN1] [DL4] [GPUCNN5,8] [DLC] [FDL] [FWP] [LEC] [META] [MLP2] [MOST] [PLAN] [UN] [LSTMPG] [BP4] [DL6a] [HIN] [T22]
publication page and my
In 2022, we are celebrating the following works from a quarter-century ago.
1. Journal paper on Long Short-Term Memory, the most cited NN of the 20th century[MOST] (and basis of the most cited NN of the 21st).
all possible metaverses
3. Implementing artificial curiosity and creativity through generative adversarial agents that learn to design abstract, interesting computational experiments.
meta-reinforcement learning.
5. Journal paper on hierarchical Q-learning.
8. Journal paper on Low-Complexity Art, the Minimal Art of the Information Age.
J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity.
PDF.
The first paper on online planning with reinforcement learning recurrent neural networks (NNs) (more) and on generative adversarial networks
(more).
More.
PDF.
general system
PDF. (More on
artificial scientists and artists.)
PDF.
(more).
[AIB] J. Schmidhuber. AI Blog.
Includes variants of chapters of the AI Book.
PDF.
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. Schmidhuber
PDF.
More.
PS. (PDF.)
attentional component (the fixation controller)." See [MIR](Sec. 9)[R4].
J. Schmidhuber (AI Blog, 2005). Highlights of robot car history. Around
HTML.
PDF.
Precursor of modern backpropagation.[BP1-5]
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020).
Who invented backpropagation?
PDF.
PDF.
PDF.
J. Schmidhuber (AI Blog, 2021).
10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named
J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition.
the artificial neural network called DanNet
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, updated 2021, 2022). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The
1991 NN distillation procedure,[UN0-2][MIR](Sec. 2)
More.
Local copy (HTML only).
[DL4] J. Schmidhuber (AI Blog, 2017).
Our impact on the world
LSTM.
J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). The deep reinforcement learning & neuroevolution developed in Schmidhuber
this type of deep learning dates back to Schmidhuber
[DLC] J. Schmidhuber (AI Blog, June 2015).
Critique of Paper by self-proclaimed[DLC2] "Deep Learning Conspiracy" (Nature 521 p 436).
J. Schmidhuber (AI Blog, 2022).
Annotated History of Modern AI and Deep Learning. Technical Report IDSIA-22-22, IDSIA, Lugano, Switzerland, 2022.
and PhDs in computer science. More.
Alphastar has a "deep LSTM core."
used LSTM
[FDL] J. Schmidhuber (AI Blog, 2013). My First Deep Learning System of 1991 + Deep Learning Timeline 1960-2013.
PDF.
J. Schmidhuber (AI Blog, 26 March 2021, updated 2022).
the attention terminology[FWP2] now used
PDF.
HTML.
Pictures (German).
HTML overview.
can be found here.
PDF.
OCR-based PDF scan of pages 94-135 (see pages 119-120).
More.
Cognitive Computation 1(2):177-193, 2009. PDF.
More.
win four important computer vision competitions 2011-2012 before others won any
PDF.
HTML overview.
[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More.
DanNet,[DAN,DAN1][R6]
J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet was the first CNN to win one, and won 4 of them in a row before the similar AlexNet/VGG Net and the Resnet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision.
PDF.
[GPUCNN8] J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet).
first deep learner to win a medical imaging contest (2012). Link.
J. Schmidhuber (Blog, 2000). Most influential persons of the 20th century (according to Nature, 1999). The Haber-Bosch process has often been called the most important invention of the 20th century[HAB1]
Schmidhuber
[HIN] J. Schmidhuber (AI Blog, 2020). Critique of Honda Prize for Dr. Hinton. Science must not allow corporate PR to distort the academic record. See also [T22].
previous related work.[BB2][NAN1-4][NHE][MIR](Sec. 15, Sec. 17)[FWPMETA6]
PDF.
More.
More.
[LEC] J. Schmidhuber (AI Blog, 2022). LeCun
LeCun also listed the "5 best ideas 2012-2022" without mentioning that
[LEI21] J. Schmidhuber (AI Blog, 2021). 375th birthday of Leibniz, founder of computer science.
Frankfurter Allgemeine Zeitung (FAZ), 17/5/2021. FAZ online:
[LEI21b] J. Schmidhuber (AI Blog, 2021). 375. Geburtstag des Herrn Leibniz, dem Vater der Informatik (375th birthday of Leibniz, the father of computer science).
PDF.
PDF.
PDF.
PDF.
J. Schmidhuber (AI Blog, Dec 2020). 10-year anniversary of our journal paper on deep reinforcement learning with policy gradients for LSTM (2007-2010). Recent
Minsky was apparently unaware of this and failed to correct it later.[HIN](Sec. I)[T22](Sec. XIII)
J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of
Searchable PDF scan (created by OCRmypdf which uses
LSTM).
HTML.
better GP methods through Meta-Evolution. More.
[MIR] J. Schmidhuber (AI Blog, Oct 2019, updated 2021, 2022). Deep Learning: Our Miraculous Year 1990-1991. Preprint
(AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training.
J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in Schmidhuber
image recognition competitions), (4) GANs (instances of my earlier Adversarial Artificial Curiosity), and (5) variants of Transformers (Transformers with linearized self-attention are formally equivalent to the much earlier Fast Weight Programmers).
Annus Mirabilis of 1990-1991.[MIR]
PDF.
PDF.
[NASC1] J. Schmidhuber. First Pow(d)ered flight / plane truth. Correspondence, Nature, 421 p 689, Feb 2003.
[NASC3] J. Schmidhuber. The last inventor of the telephone. Letter, Science, 319, no. 5871, p. 1759, March 2008.
[NASC6] J. Schmidhuber. Colossus was the first electronic digital computer. Correspondence, Nature, 441 p 25, May 2006.
[NASC7] J. Schmidhuber. Turing
[NASC8] J. Schmidhuber. Prototype resilient, self-modeling robots. Correspondence, Science, 316, no. 5825 p 688, May 2007.
[NASC9] J. Schmidhuber. Comparing the legacies of Gauss, Pasteur, Darwin. Correspondence, Nature, vol 452, p 530, April 2008.
HTML.
Link.
Compare Konrad Zuse
An LSTM composes 84% of the model
2018. An LSTM with 84% of the model
J. Schmidhuber (Blog, 2006).
Is History Converging? Again?
HTML.
HTML overview.
OOPS source code in crystalline format.
HTML.
J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, and
the GAN principle
More.
More.
More.
PDF. More.
PDF.
J. Schmidhuber (AI Blog, 2001). Raw Computing Power.
PDF.
This experimental analysis of backpropagation did not cite the origin of the method,[BP1-5] also known as the reverse mode of automatic differentiation.
Minsky was apparently unaware of this and failed to correct it later.[HIN](Sec. I)[T22](Sec. XIII)
PDF.
Local copy 1 (HTML only).
Local copy 2 (HTML only).
[T20a] J. Schmidhuber (AI Blog, 25 June 2020). Critique of 2018 Turing Award for Drs. Bengio & Hinton & LeCun. A precursor of [T22].
[T22] J. Schmidhuber (AI Blog, 2022).
Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21, IDSIA, Lugano, Switzerland, 2022.
attention terminology in 1993.[ATT][FWP2][R4]
[TUR21] J. Schmidhuber (AI Blog, Sep 2021). Turing Oversold. It
J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised or self-supervised pre-training. Unsupervised
PDF.
neural knowledge distillation procedure
The systems of 1991 allowed for much deeper learning than previous methods. More.
approaches are now widely used. More.
the first NNs shown to solve very deep problems.
Theory of Universal Learning Machines & Universal AI.
[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF. More on the Fundamental Deep Learning Problem.
unsupervised pre-training is not necessary
Local copy (plain HTML only).
Schmidhuber
a general, practical, program-controlled computer.
J. Schmidhuber (AI Blog, 2021). 80th anniversary celebrations: 1941: Konrad Zuse completes the first working general computer, based on his 1936 patent application.
J. Schmidhuber (AI Blog, 2021). 80. Jahrestag: 1941: Konrad Zuse baut ersten funktionalen Allzweckrechner, basierend auf der Patentanmeldung von 1936 (80th anniversary: 1941: Konrad Zuse builds the first working general-purpose computer, based on the patent application of 1936).
PDF.
modern backpropagation
principles of generative adversarial NNs and artificial curiosity (1990),[AC90,90b][AC20]
unsupervised pre-training for deep NNs (1991),[UN1-2]
vanishing gradients (1991)[VAN1] &
Long Short-Term Memory or LSTM (Sec. A),
NNs with over 100 layers (2015),[HW1-3][R5]
fast weight programmers (1991).[FWP0-2,6]
Often LBH failed to cite essential prior work, even in their later surveys.[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5, R7-R8]
The deep NNs
By the 2010s,[DEC] they were
Long Short-Term Memory
vanishing gradient problem
(Sec. 3, Sec. 4)
through "forget gates" based on end-to-end-differentiable fast weights.[MIR](Sec. 8)[FWP,FWP0-1]
became the first recurrent NN (RNN) to win international competitions.
LSTM[MIR](Sec. 4)
However, such attention mechanisms also
have their roots in my lab (1991);[FWP][FWP0-2,6]
In the 2010s,
Alphastar whose brain has a deep LSTM core trained by PG.[DM3]
Bill Gates called this a "huge milestone in advancing artificial intelligence".[OAI2a][MIR](Sec. 4)[LSTMPG]
in healthcare,
D. Computer Vision was revolutionized in the 2010s by
In 1987, NNs with convolutions were combined by Waibel with weight sharing and backpropagation.[CNN1a] Waibel did not call them CNNs but TDNNs.
unsupervised pre-training is not necessary
Our fast GPU-based CNN of 2011[GPUCNN1] known as DanNet[DAN,DAN1][R6]
winning four of them
at IJCNN 2011 in Silicon Valley, DanNet blew away the competition and achieved the first superhuman visual pattern recognition[DAN1] in an international contest (where LeCun
DanNet was also the first deep CNN to win:
a Chinese handwriting contest (ICDAR 2011),
an image segmentation contest (ISBI, May 2012),
further extended the work of 2011.[MIR](Sec. 19)
most cited neural network,[MOST] is a version (with open gates) of our earlier
Highway Net (May 2015).[HW1-3][R5] The Highway Net is actually the feedforward net version of vanilla LSTM.[LSTM2] It was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers).
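This relationship can be sketched in a few lines (an illustrative toy in my own notation and with my own names, not code from [HW1] or [HW3]): a highway layer gates between a transformed signal and the unchanged input, and a layer with its gates fixed open reduces to the residual layer later popularized by ResNets.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, W_t, b_t):
    """y = t * h(x) + (1 - t) * x, with a learned transform gate t
    (the gating mechanism is borrowed from LSTM)."""
    h = np.tanh(W_h @ x)          # candidate transformation
    t = sigmoid(W_t @ x + b_t)    # transform gate in (0, 1)
    return t * h + (1.0 - t) * x  # gated mix of transform and carry

def residual_layer(x, W_h):
    """A highway layer whose transform and carry gates are both fixed
    open: y = h(x) + x, i.e. the ResNet layer."""
    return np.tanh(W_h @ x) + x

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)
W_h, W_t, b_t = rng.normal(size=(d, d)), rng.normal(size=(d, d)), np.zeros(d)
print(highway_layer(x, W_h, W_t, b_t).shape, residual_layer(x, W_h).shape)
```

The transform gate plays the same role as an LSTM gate: it decides, per unit, how much of the signal is transformed and how much is carried through unchanged.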
appeared long before the 1980s.
containing the now popular multiplicative gates).[DEEP1-2][DL1-2] A paper of 1971[DEEP2] already described a deep learning net with 8 layers, trained by their highly cited method which was still popular in the new millennium,[DL2] especially in Eastern Europe, where much of Machine Learning was born.[MIR](Sec. 1)[R8] LBH failed to cite this, just like they failed to cite Amari,[GD1] who in 1967 proposed stochastic gradient descent[STO51-52] (SGD) for MLPs and whose implementation[GD2,GD2a] (with Saito) learned internal representations at a time when compute was billions of times more expensive than today (see also Tsypkin
Minsky was apparently unaware of this and failed to correct it later.[HIN](Sec. I)
However, it became really deep in 1991 in my lab,[UN-UN3] which has
First Very Deep NNs, Based on Unsupervised Pre-Training (1991).
more.)
drove the shift
from unsupervised pre-training to purely supervised learning (1991-95; 2006-10).[HIN](Sec. II)[MIR]
(Sec. 19)
LSTMs
brought essentially unlimited depth to gradient-based supervised recurrent NNs; Highway Nets[HW1-3] brought it to feedforward NNs.[MOST]
by others (Sec.
modern backpropagation
principles of generative adversarial NNs and artificial curiosity (1990),[AC90,90b][AC20]
unsupervised pre-training for deep NNs,[UN1-2]
the vanishing gradient problem (1991)[VAN1] &
solutions to it (Sec. A),
Often LBH failed to cite essential prior work.[DLC][HIN][MIR](Sec. 21)
we had this type of deep learning already in 1991;[UN][UN1-2] see Sec.
any type of computation-based AI.[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]
In 1936, Turing
my reply to Hinton
who criticized my website on Turing
Likewise, Konrad Zuse (1910-1995)
created the world
modern backpropagation
principles of generative adversarial NNs and artificial curiosity (1990),[AC][AC90,90b][AC10][AC20]
unsupervised pre-training for deep NNs (1991),[UN1-2][UN]
vanishing gradients (1991)[VAN1] &
solutions to it (Sec. A),[LSTM0-17][CTC]
record-breaking deep supervised NNs
and contest-winning deep CNNs (2011),[DAN][DAN1][GPUCNN5]
NNs with over 100 layers (2015),[HW1-3][R5]
fast weight programmers (1991),[FWP0-2,6]
Often LBH failed to cite essential prior work.[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5,R7,R8,R11]
"advances in natural language processing" and in speech
supervised NNs and
CNNs
and through Highway Net-like NNs (2015),[HW1-3][R5] although the principles of CNNs were invented and developed by others since the 1970s.[CNN1-4] See Sec. D & XVIII & XIV
as well as Sec. 4 & Sec. 19 of the overview.[MIR]
DanNet[DAN][DAN1][GPUCNN5]
the first NN to win a medical imaging contest through deep learning
approach of
who failed to cite them, even in later
the first machine learning
others built careers on this notion long before LBH recognized this.[DEEP1-2][CNN1][HIN][R8][DL1][DLC] Even deep learning through unsupervised pre-training was introduced by others.[UN1-3][R4][HIN](Sec. II)
made GPU-based NNs fast and deep enough
unsupervised pre-training (pioneered by myself in 1991)
is not necessary
our CNNs were deep and fast enough[DAN][DAN1][GPUCNN5]
Furthermore, by the mid 2010s, speech recognition and machine translation
(and apparently even other award committees[HIN](Sec. I))
recent debate:[HIN] It is true that in 2018,
fast deep CNN called DanNet
as well as Sec. 19 of the overview.[MIR]
In the 2010s,
was actually the LSTM of our team,[LSTM0-6] which Bloomberg called "arguably the most commercial AI achievement."[AV1][MIR](Sec. 4) See Sec. B.
adaptive neural sequential attention: end-to-end-differentiable
"soft" attention in the latent space of Fast Weight Programmers (FWPs),[FWP2][FWP] and "hard" attention (in observation space) in the context of RL[ATT][ATT0-1] (1990).
FWPs of 1991[FWP0-1]
which have become a popular alternative to RNNs.
the 2010s,[DEC]
the attention terminology[FWP2] now used
See[MIR](Sec. 9)[R4] for my related priority dispute on attention with Hinton.
a simple application[AC]
of the adversarial curiosity (AC) principle
the other
(1991).[PM1-2][AC20][R2][MIR](Sec. 5)
vanishing gradient problem,[MIR](Sec. 3)[VAN1] Bengio published his own,[VAN2] without citing Sepp.
my publications on exactly this topic
Regarding attention-based Transformers,[TR1-6] Bengio[DL3a] cites his own team (2014) for "soft attention" without citing my much earlier original work of 1991-1993 on soft attention and linear Transformers.[FWP,FWP0-2,6]
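The formal equivalence is easy to state (a minimal sketch in my own notation, not code from the cited papers): a fast weight matrix programmed by additive outer products of value and key vectors, then applied to a query, yields exactly the output of unnormalized linear attention.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 6                      # feature dimension, sequence length
K = rng.normal(size=(T, d))      # keys
V = rng.normal(size=(T, d))      # values
Q = rng.normal(size=(T, d))      # queries

# Fast Weight Programmer view (1991): a slow net emits (k_t, v_t) pairs
# that program a fast weight matrix W via additive outer products.
W = np.zeros((d, d))
fwp_out = []
for t in range(T):
    W += np.outer(V[t], K[t])    # program the fast weights
    fwp_out.append(W @ Q[t])     # apply them to the current query

# Linear attention view: out_t = sum over s <= t of (k_s . q_t) * v_s
lin_out = [sum((K[s] @ Q[t]) * V[s] for s in range(t + 1)) for t in range(T)]

assert np.allclose(fwp_out, lin_out)  # the two formulations coincide
```

The fast weight matrix thus acts as the attention mechanism's associative memory: each step writes a key-value association into W, and the query reads it out.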
unsupervised pre-training
for deep NNs.[UN0-4][HIN](Sec. II)[MIR](Sec. 1)
the first NNs shown to solve very deep problems
compressing or distilling
one NN into another.[UN0-2][DIST1-2][MIR](Sec. 2)
fast weight programmers[FWP][FWP0-4a]
through tensor-like outer products (1991-2016) and their motivation[FWP2][FWP4a][MIR](Sec. 8) (see also Sec. XVI above).
learning sequential attention
with NNs.[MIR](Sec. 9)
our much earlier work on this[ATT1][ATT] although
However, the basic CNN architecture with convolutional and downsampling layers is actually due to Fukushima (1979).[CNN1] NNs with convolutions were later (1987) combined by Waibel with weight sharing and backpropagation.[CNN1a] Waibel called them TDNNs and
at IJCNN 2011 in Silicon Valley, our DanNet[DAN][GPUCNN1-3] won the
superhuman performance
at ICPR 2012, our DanNet[GPUCNN1-3] won the
medical imaging contest
All of these fields were heavily shaped in the 2010s by our non-CNN methods.[DL1][DL4][AM16][GSR][GSR15][GT16][WU][FB17] See
As mentioned in Sec. XII, backpropagation was actually proposed earlier as a learning method for NNs by Werbos (1982)[BP2-4] (see also Amari
adaptive subgoal generators
(1991)[HRL0-2] were trained through end-to-end-differentiable chains of such modules.[MIR](Sec. 10)
planning and reinforcement learning with recurrent neural world models
(1990).[PLAN][MIR](Sec. 11) Same for my linear transformer-like
fast weight programmers[FWP0-2][FWP][ATT][MIR](Sec. 8) since 1991 (see Sec. XVI)
ad hominem attacks[AH2-3][HIN]
As emphasized earlier:[DLC][HIN]
GANs are variations
LBH, who called themselves the deep learning conspiracy,[DLC][DLC1-2]
unsupervised pre-training for deep NNs
by myself
our deep and fast DanNet (2011)[GPUCNN1-3] as
Backpropagation is a previously invented
artificial curiosity
by ours[UN0-2][UN]
So virtually all the algorithms that have attracted
our LSTM
brought essentially unlimited depth to gradient-based supervised recurrent NNs in the 1990s; our Highway Nets[HW1-3] brought it to feedforward NNs in May 2015.[MOST]
medical diagnosis (2012, see Sec. VII, XVIII), and many other applications.[DEC]
and many other applications.[DEC]
As mentioned earlier,[MIR](Sec. 21)
it is not always clear[DLC]
backpropagation
AI scientists and AI historians
equipped with artificial curiosity[SA17][AC90-AC20][PP-PP2][R1]
publication page and my
J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. Our
PDF.
The first paper on planning with reinforcement learning recurrent neural networks (NNs) (more) and on generative adversarial networks
(more).
More.
PDF.
PDF. (More on
artificial scientists and artists.)
PDF.
(more).
[AIB] J. Schmidhuber. AI Blog.
Includes variants of chapters of the AI Book.
PDF.
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. We had both hard attention (1990) and soft attention (1991-93).[FWP] Today, both types are very popular.
PDF.
More.
PS. (PDF.)
HTML.
PDF.
Precursor of modern backpropagation.[BP1-4]
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020).
Who invented backpropagation?
PDF.
PDF.
PDF.
J. Schmidhuber (AI Blog, 2021).
10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named
J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition.
our artificial neural network called DanNet
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, revised 2021). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The
More.
Local copy (HTML only).
[DL4] J. Schmidhuber (AI Blog, 2017).
Our impact on the world
LSTM.
J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.[DL6] Soon after its publication, everybody started talking about "deep learning." Causality or correlation?
this type of deep learning dates back to 1991.[UN1-2][UN]
[DLC] J. Schmidhuber (AI Blog, June 2015).
Critique of Paper by self-proclaimed[DLC1-2] "Deep Learning Conspiracy" (Nature 521 p 436).
Alphastar has a "deep LSTM core."
used LSTM
PDF.
J. Schmidhuber (AI Blog, 26 March 2021).
the attention terminology[FWP2] now used
PDF.
HTML.
Pictures (German).
HTML overview.
can be found here.
PDF.
OCR-based PDF scan of pages 94-135 (see pages 119-120).
win four important computer vision competitions 2011-2012 before others won any
PDF.
HTML overview.
[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More.
J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet won 4 of them in a row before the similar AlexNet/VGG Net and the Resnet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision.
PDF.
first deep learner to win a medical imaging contest (2012). HTML.
[HIN] J. Schmidhuber (AI Blog, 2020). Critique of Honda Prize for Dr. Hinton. Science must not allow corporate PR to distort the academic record.
(v1: 24 Sep 2021, v2: 31 Dec 2021.) Versions since 2021 are archived in the Internet Archive.
deep learning survey,[DL1] and can also be seen as a short history of the deep learning revolution, at least as far as ACM
2015 survey of deep learning[DL1]
The June 2020 article[T20a][R12] was version 1 of the present report.
expands material in my Critique of the 2019 Honda Prize[HIN] (~3,000 words).
The text contains numerous hyperlinks to relevant overview sites from the AI Blog.
Most of the critiques are based on references to original papers and material from the AI Blog.[AIB][MIR][DEC][HIN]
this class of methods was pioneered in 1991[UN-UN2] (see Sec. II, III).
Highway Net,
were all driven by my lab:[MOST] In 1991, I had the first very deep NNs based on unsupervised pre-training;[UN-UN2] LSTMs brought essentially unlimited depth to supervised recurrent NNs; later our Highway Nets[HW1-3] brought it to feedforward NNs.
based on LSTM[LSTM0-6] (1990s-2005) and CTC (2006).[CTC]
our CTC-LSTM-based speech recognition (not that of Hinton) had been on most smartphones for years[GSR][GSR15-19][DL4] (see Sec. A, VI, XI, XV). Similarly for machine translation (see Sec. B).
called AlexNet,[GPUCNN4] without mentioning that our earlier groundbreaking deep GPU-based DanNet[GPUCNN1-3,5-8][DAN] did not need ReLUs at all to win 4 earlier object recognition competitions and to achieve superhuman results already in 2011[GPUCNN1-8][R5-6] (see Sec. XIV).
described in the 1991-93 papers on Fast Weight Programmers and linear Transformers[FWP0-1,6] (see Sec. XVI, XVII-2).
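The correspondence between Fast Weight Programmers and linear Transformers mentioned here can be sketched in a few lines. This is a minimal illustration of the additive outer-product fast weight update (function and variable names are mine; the slow network that would produce the key/value/query vectors is omitted):

```python
import numpy as np

def fast_weight_step(W_fast, k, v, q):
    # Program the fast weight matrix with an additive outer product,
    # then apply it to the query. This is the unnormalized
    # linear-attention form of the 1991-style fast weight update.
    W_fast = W_fast + np.outer(v, k)
    return W_fast, W_fast @ q

# Store two key/value pairs under orthonormal keys, then retrieve:
d = 3
W = np.zeros((d, d))
k1, k2 = np.eye(d)[0], np.eye(d)[1]
v1, v2 = np.array([1., 2., 3.]), np.array([4., 5., 6.])
W, _ = fast_weight_step(W, k1, v1, k1)
W, out = fast_weight_step(W, k2, v2, k1)   # query with k1
# out == v1: the fast weights retrieve the value stored under k1
```

Because the updates are additive outer products, querying the fast weight matrix is equivalent to unnormalized linear attention over all stored key/value pairs.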
GANs are instances
of the Adversarial Curiosity Principle of 1990[AC90-20][MIR](Sec. 5) (see Sec. XVII).
it became really deep in 1991 in my lab, through unsupervised pre-training of NNs and supervised LSTM.
combines two methods from my lab: LSTM (1990s-2005) and CTC (2006), which were combined into our revolutionary CTC-LSTM, soon on most smartphones.
(soon used for several billions of
was also based on our LSTM.
most visible breakthroughs
deep NNs
superior computer vision in 2011,
winning 4 image recognition contests in a row
is an open-gated version of our earlier Highway Nets.
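The relation between residual and highway layers noted here can be made concrete. Below is a minimal sketch (my own notation, single-weight layers for brevity) showing that a residual block is a highway layer with both gates held open:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, Wh, bh, Wt, bt, Wc, bc):
    # One highway layer: a transform gate t and a carry gate c blend the
    # nonlinear path h(x) with the identity path x.
    h = np.tanh(Wh @ x + bh)
    t = sigmoid(Wt @ x + bt)   # transform gate in (0, 1)
    c = sigmoid(Wc @ x + bc)   # carry gate (often coupled as c = 1 - t)
    return t * h + c * x

def residual_block(x, Wh, bh):
    # A single-weight residual block: the special case with both
    # gates held open (t = c = 1), so the identity is always added.
    return np.tanh(Wh @ x + bh) + x
```

With large positive gate biases, both sigmoids saturate near 1 and the highway layer's output coincides with the residual block's.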
deep & fast CNN
(where LeCun participated),
deep GPU-NN of 2010
debunked unsupervised pre-training (introduced by myself in 1991 and later championed by Hinton),
and our GPU-CNN of 2011 (DanNet) was the first
first to win medical imaging competitions
backpropagation
CTC-LSTM
We started this in 1990-93, long before LBH.
Artificial Curiosity
vanishing gradients (1991),
metalearning (1987),
unsupervised pre-training (1991),
compressing or distilling one NN into another (1991),
learning sequential attention with NNs (1990),
fast weight programmers using tensor-like outer products (1991),
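The vanishing gradient problem listed above can be demonstrated numerically. The sketch below (function name and setup are mine) backpropagates an error signal through a chain of sigmoid units; the sigmoid's derivative never exceeds 0.25, so the gradient shrinks exponentially with depth:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_through_depth(depth, seed=0):
    """Backpropagate an error signal of 1.0 through `depth` chained
    sigmoid units with unit weights; each step multiplies the gradient
    by sigmoid'(a) = sigmoid(a) * (1 - sigmoid(a)) <= 0.25."""
    rng = np.random.default_rng(seed)
    g = 1.0
    for _ in range(depth):
        a = rng.standard_normal()   # pre-activation of this unit
        s = sigmoid(a)
        g *= s * (1.0 - s)          # chain rule: multiply by local derivative
    return abs(g)

shallow = gradient_through_depth(5)    # still a usable gradient
deep = gradient_through_depth(50)      # vanishes: bounded by 0.25**50
```

This is why plain deep or recurrent nets were hard to train by gradient descent, and why architectures with near-identity paths (LSTM, Highway Nets) avoid the problem.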
Sec. IV is on Turing (1936) and his predecessors
In the recent decade of deep learning,
(speech recognition, language translation, etc.) on billions of devices (also healthcare applications)
by others (Sec. modern backpropagation principles of generative adversarial NNs and artificial curiosity (1990),[AC90,90b][AC20] unsupervised pre-training for deep NNs,[UN1-2] the vanishing gradient problem (1991)[VAN1] & solutions to it (Sec. A). Often LBH failed to cite essential prior work.[DLC][HIN][MIR](Sec. 21) we had this type of deep learning already in 1991;[UN][UN1-2] see Sec.
any type of computation-based AI.[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]
In 1936, Turing
my reply to Hinton, who criticized my website on Turing
Likewise, Konrad Zuse (1910-1995) created the world's first working general computer (1941).
modern backpropagation principles of generative adversarial NNs and artificial curiosity (1990),[AC][AC90,90b][AC10][AC20] unsupervised pre-training for deep NNs (1991),[UN1-2][UN] vanishing gradients (1991)[VAN1] & solutions to it (Sec. A),[LSTM0-17][CTC] record-breaking deep supervised NNs and contest-winning deep CNNs (2011),[DAN][DAN1][GPUCNN5] NNs with over 100 layers (2015),[HW1-3][R5] fast weight programmers (1991).[FWP0-2,6] Often LBH failed to cite essential prior work.[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5,R7,R8,R11]
"advances in natural language processing" and in speech supervised NNs and CNNs and through Highway Net-like NNs (2015),[HW1-3][R5] although the principles of CNNs were invented and developed by others since the 1970s.[CNN1-4] See Sec. D & XVIII & XIV as well as Sec. 4 & Sec. 19 of the overview.[MIR]
DanNet[DAN][DAN1][GPUCNN5] was the first NN to win a medical imaging contest through the deep learning approach of
who failed to cite them, even in later
the first machine learning
others built careers on this notion long before LBH recognized this.[DEEP1-2][CNN1][HIN][R8][DL1][DLC] Even deep learning through unsupervised pre-training was introduced by others.[UN1-3][R4][HIN](Sec. II)
made GPU-based NNs fast and deep enough
unsupervised pre-training (pioneered by myself in 1991) is not necessary
our CNNs were deep and fast enough[DAN][DAN1][GPUCNN5]
Furthermore, by the mid 2010s, speech recognition and machine translation
(and apparently even other award committees[HIN](Sec. I)
recent debate:[HIN] It is true that in 2018,
fast deep CNN called DanNet
as well as Sec. 19 of the overview.[MIR]
In the 2010s,
was actually the LSTM of our team,[LSTM0-6] which Bloomberg called "arguably the most commercial AI achievement."[AV1][MIR](Sec. 4) See Sec. B.
adaptive neural sequential attention: end-to-end-differentiable "soft" attention in the latent space of Fast Weight Programmers (FWPs),[FWP2][FWP] and "hard" attention (in observation space) in the context of RL[ATT][ATT0-1] (1990).
FWPs of 1991[FWP0-1] have become a popular alternative to RNNs.
the 2010s,[DEC]
the attention terminology[FWP2] now used
See[MIR](Sec. 9)[R4] for my related priority dispute on attention with Hinton.
a simple application[AC] of the adversarial curiosity (AC) principle
the other (1991).[PM1-2][AC20][R2][MIR](Sec. 5)
vanishing gradient problem,[MIR](Sec. 3)[VAN1]
Bengio published his own,[VAN2] without citing Sepp.
my publications on exactly this topic
Regarding attention-based Transformers,[TR1-6] Bengio[DL3a] cites his own team (2014) for "soft attention" without citing my much earlier original work of 1991-1993 on soft attention and linear Transformers.[FWP,FWP0-2,6]
unsupervised pre-training for deep NNs.[UN0-4][HIN](Sec. II)[MIR](Sec. 1)
the first NNs shown to solve very deep problems
compressing or distilling one NN into another.[UN0-2][DIST1-2][MIR](Sec. 2)
fast weight programmers[FWP][FWP0-4a] through tensor-like outer products (1991-2016) and their motivation[FWP2][FWP4a][MIR](Sec. 8) (see also Sec. XVI above).
learning sequential attention with NNs.[MIR](Sec. 9)
our much earlier work on this[ATT1][ATT] although
However, the basic CNN architecture with convolutional and downsampling layers is actually due to Fukushima (1979).[CNN1] NNs with convolutions were later (1987) combined by Waibel with weight sharing and backpropagation.[CNN1a] Waibel called this TDNN.
At IJCNN 2011 in Silicon Valley, our DanNet[DAN][GPUCNN1-3] won with superhuman performance;
at ICPR 2012, our DanNet[GPUCNN1-3] won the medical imaging contest.
All of these fields were heavily shaped in the 2010s by our non-CNN methods.[DL1][DL4][AM16][GSR][GSR15][GT16][WU][FB17] See
As mentioned in Sec. XII, backpropagation was actually proposed earlier as a learning method for NNs by Werbos (1982)[BP2-4] (see also Amari
adaptive subgoal generators (1991)[HRL0-2] were trained through end-to-end-differentiable chains of such modules.[MIR](Sec. 10)
planning and reinforcement learning with recurrent neural world models (1990).[PLAN][MIR](Sec. 11)
Same for my linear transformer-like fast weight programmers[FWP0-2][FWP][ATT][MIR](Sec. 8) since 1991 (see Sec. XVI).
ad hominem attacks[AH2-3][HIN]
As emphasized earlier:[DLC][HIN] GANs are variations
LBH, who called themselves the deep learning conspiracy,[DLC]
unsupervised pre-training for deep NNs by myself
our deep and fast DanNet (2011)[GPUCNN1-3]
Backpropagation is a previously invented
artificial curiosity by ours[UN0-2][UN]
So virtually all the algorithms that have attracted
our LSTM brought essentially unlimited depth to supervised recurrent NNs in the 1990s; our Highway Nets[HW1-3] brought it to feedforward NNs in May 2015.[MOST]
medical diagnosis (2012, see Sec. VII, XVIII), and many other applications.[DEC]
As mentioned earlier,[MIR](Sec. 21) it is not always clear[DLC]
backpropagation
AI scientists and AI historians equipped with artificial curiosity[SA17][AC90-AC20][PP-PP2]
publication page and my
J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity.
The first paper on planning with reinforcement learning recurrent neural networks (NNs) and on generative adversarial networks. (More on artificial scientists and artists.)
[AIB] J. Schmidhuber. AI Blog. Includes variants of chapters of the AI Book.
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. We had both hard attention (1990) and soft attention (1991-93).[FWP] Today, both types are very popular.
Precursor of modern backpropagation.[BP1-4]
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020). Who invented backpropagation?
J. Schmidhuber (AI Blog, 2021). 10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named
[LEI21] J. Schmidhuber (AI Blog, 2021). 375th birthday of Leibniz, founder of computer science. Frankfurter Allgemeine Zeitung (FAZ), 17/5/2021.
J. Schmidhuber (AI Blog, Dec 2020). 10-year anniversary of our journal paper on deep reinforcement learning with policy gradients for LSTM (2007-2010).
J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of
Searchable PDF scan (created by OCRmypdf, which uses LSTM).
better GP methods through Meta-Evolution.
[MIR] J. Schmidhuber (AI Blog, 2019). Deep Learning: Our Miraculous Year 1990-1991. Preprint
(AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training.
[MOST] J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in my labs at TU Munich and IDSIA. Here I mention: (1) Long Short-Term Memory (LSTM), (2) ResNet (which is our earlier Highway Net with open gates), (3) AlexNet and VGG Net (both citing our similar earlier DanNet: the first deep convolutional NN to win image recognition competitions), (4) GANs (instances of my earlier Adversarial Artificial Curiosity), and (5) variants of Transformers (linear Transformers are formally equivalent to my earlier Fast Weight Programmers). Annus Mirabilis of 1990-1991.[MIR]
[NASC1] J. Schmidhuber. First Pow(d)ered flight / plane truth. Correspondence, Nature, 421, p 689, Feb 2003.
[NASC3] J. Schmidhuber. The last inventor of the telephone. Letter, Science, 319, no. 5871, p 1759, March 2008.
[NASC6] J. Schmidhuber. Colossus was the first electronic digital computer. Correspondence, Nature, 441, p 25, May 2006.
[NASC7] J. Schmidhuber. Turing
[NASC8] J. Schmidhuber. Prototype resilient, self-modeling robots. Correspondence, Science, 316, no. 5825, p 688, May 2007.
[NASC9] J. Schmidhuber. Comparing the legacies of Gauss, Pasteur, Darwin. Correspondence, Nature, 452, p 530, April 2008.
An LSTM composes 84% of the model's total parameter count (2018).
J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, the GAN principle.
This experimental analysis of backpropagation did not cite the origin of the method,[BP1-4] also known as the reverse mode of automatic differentiation.
[T20a] J. Schmidhuber (AI Blog, 25 June 2020). Critique of 2018 Turing Award for Drs. Bengio & Hinton & LeCun. The first version of the present critique.
[TUR21] J. Schmidhuber (AI Blog, Sep 2021). Turing Oversold.
J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised pre-training. Unsupervised approaches are now widely used. More on the Fundamental Deep Learning Problem.
unsupervised pre-training is not necessary
a general, practical, program-controlled computer.
J. Schmidhuber (AI Blog, 2021). 80th anniversary celebrations: 1941: Konrad Zuse completes the first working general computer, based on his 1936 patent application.