modern backpropagation (1970), the principles of generative adversarial NNs and artificial curiosity (1990),[AC90,90b][AC20] unsupervised pre-training for deep NNs (1991),[UN1-2] vanishing gradients (1991)[VAN1] & Long Short-Term Memory or LSTM (Sec. A), NNs with over 100 layers (2015),[HW1-3][R5] and fast weight programmers (1991).[FWP0-2,6] Often LBH failed to cite this essential prior work, even in their later surveys.[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5, R7-R8] By the 2010s,[DEC] Long Short-Term Memory had become the first recurrent NN (RNN) to win international competitions. LSTM[MIR](Sec. 4) overcomes the vanishing gradient problem (Sec. 3, Sec. 4) through "forget gates" based on end-to-end-differentiable fast weights.[MIR](Sec. 8)[FWP,FWP0-1] Attention mechanisms, too, have their roots in my lab (1991).[FWP][FWP0-2,6] In the 2010s, DeepMind built Alphastar, whose brain has a deep LSTM core trained by PG;[DM3] Bill Gates called OpenAI's LSTM-based Dota 2 result a "huge milestone in advancing artificial intelligence".[OAI2a][MIR](Sec. 4)[LSTMPG] LSTM is also used in healthcare. D. Computer Vision was revolutionized in the 2010s by deep CNNs. In 1987, NNs with convolutions were combined by Waibel with weight sharing and backpropagation;[CNN1a] Waibel did not call this CNNs but TDNNs. Our fast GPU-based CNN of 2011,[GPUCNN1] known as DanNet,[DAN,DAN1][R6] showed that unsupervised pre-training is not necessary. Winning four contests in a row, DanNet blew away the competition at IJCNN 2011 in Silicon Valley and achieved the first superhuman visual pattern recognition[DAN1] in an international contest. DanNet was also the first deep CNN to win a Chinese handwriting contest (ICDAR 2011) and an image segmentation contest (ISBI, May 2012), further extending the work of 2011.[MIR](Sec. 19) Collecting ever more citations per year[MOST] is the Highway Net (May 2015).[HW1-3][R5] The Highway Net is actually the feedforward net version of vanilla LSTM.[LSTM2] It was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers).
Deep learning appeared long before the 1980s. Ivakhnenko's deep nets of 1965 already contained layers with the now popular multiplicative gates.[DEEP1-2][DL1-2] A paper of 1971[DEEP2] already described a deep net with 8 layers, trained by a highly cited method which was still popular in the new millennium,[DL2] especially in Eastern Europe, where much of Machine Learning was born. Ivakhnenko did not call it an NN, but that's what it was; Minsky was apparently unaware of this and failed to correct it later.[HIN](Sec. I) However, deep learning became really deep in 1991 in my lab,[UN-UN3] which produced the first very deep NNs, based on unsupervised pre-training (1991). (See below for more.) Later results drove the shift from unsupervised pre-training to purely supervised learning (1991-95; 2006-10).[HIN](Sec. II)[MIR](Sec. 19) LSTMs brought essentially unlimited depth to supervised recurrent NNs; Highway Nets[HW1-3] brought it to feedforward NNs.[MOST]
by others (see below): modern backpropagation, the principles of generative adversarial NNs and artificial curiosity (1990),[AC90,90b][AC20] unsupervised pre-training for deep NNs,[UN1-2] the vanishing gradient problem (1991)[VAN1] & solutions to it (Sec. A). Often LBH failed to cite essential prior work.[DLC][HIN][MIR](Sec. 21) We had this type of deep learning already in 1991;[UN][UN1-2] see Sec.
any type of computation-based AI.[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]
In 1936, Turing introduced the Turing machine (see also my reply to Hinton, who criticized my website on Turing). Likewise, Konrad Zuse (1910-1995) created the world's first working programmable general-purpose computer.
modern backpropagation, the principles of generative adversarial NNs and artificial curiosity (1990),[AC][AC90,90b][AC10][AC20] unsupervised pre-training for deep NNs (1991),[UN1-2][UN] vanishing gradients (1991)[VAN1] & solutions to it (Sec. A),[LSTM0-17][CTC] record-breaking deep supervised NNs and contest-winning deep CNNs (2011),[DAN][DAN1][GPUCNN5] NNs with over 100 layers (2015),[HW1-3][R5] and fast weight programmers (1991).[FWP0-2,6] Often LBH failed to cite this essential prior work.[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5,R7,R8,R11]
The cited "advances in natural language processing" and in speech came through supervised NNs and CNNs and through Highway Net-like NNs (2015),[HW1-3][R5] although the principles of CNNs were invented and developed by others since the 1970s.[CNN1-4] See Sec. D & XVIII & XIV as well as Sec. 4 & Sec. 19 of the overview.[MIR]
DanNet[DAN][DAN1][GPUCNN5] was the first NN to win a medical imaging contest through a deep learning approach.
who failed to cite them, even in later surveys.
Others built careers on deep learning long before LBH recognized this notion.[DEEP1-2][CNN1][HIN][R8][DL1][DLC] Even deep learning through unsupervised pre-training was introduced by others.[UN1-3][R4][HIN](Sec. II)
made GPU-based NNs fast and deep enough to show that unsupervised pre-training (pioneered by myself in 1991) is not necessary; our CNNs were deep and fast enough.[DAN][DAN1][GPUCNN5]
Furthermore, by the mid 2010s, speech recognition and machine translation had been revolutionized by our LSTM.
(and apparently even other award committees).[HIN](Sec. I) See the recent debate:[HIN] It is true that in 2018,
our fast deep CNN called DanNet. See also Sec. 19 of the overview.[MIR]
In the 2010s, what Bloomberg called "arguably the most commercial AI achievement"[AV1][MIR](Sec. 4) was actually based on the LSTM of our team.[LSTM0-6] See Sec. B. We introduced adaptive neural sequential attention: end-to-end-differentiable "soft" attention in the latent space of Fast Weight Programmers (FWPs),[FWP2][FWP] and "hard" attention in observation space in the context of RL (1990).[ATT][ATT0-1] The FWPs of 1991[FWP0-1] have become a popular alternative to RNNs. In the 2010s,[DEC] the attention terminology[FWP2] became widely used.
See[MIR](Sec. 9)[R4] for my related priority dispute on attention with Hinton.
GANs are a simple application[AC] of the adversarial curiosity (AC) principle of 1990, where one NN's gain is the other NN's loss; compare also Predictability Minimization (1991).[PM1-2][AC20][R2][MIR](Sec. 5) Regarding the vanishing gradient problem:[MIR](Sec. 3)[VAN1] Bengio published his own analysis,[VAN2] without citing Sepp or my publications on exactly this topic. Regarding attention-based Transformers:[TR1-6] Bengio[DL3a] cites his own team (2014) for "soft attention" without citing my much earlier original work of 1991-1993 on soft attention and linear Transformers.[FWP,FWP0-2,6] Similarly for unsupervised pre-training for deep NNs,[UN0-4][HIN](Sec. II)[MIR](Sec. 1) the first NNs shown to solve very deep problems, and for compressing or distilling one NN into another.[UN0-2][DIST1-2][MIR](Sec. 2) Similarly for fast weight programmers[FWP][FWP0-4a] through tensor-like outer products (1991-2016) and their motivation[FWP2][FWP4a][MIR](Sec. 8) (see also Sec. XVI above), and for learning sequential attention with NNs:[MIR](Sec. 9) our much earlier work on this[ATT1][ATT] went uncited.
However, the basic CNN architecture with convolutional and downsampling layers is actually due to Fukushima (1979).[CNN1] NNs with convolutions were later (1987) combined by Waibel with weight sharing and backpropagation.[CNN1a] Waibel called this TDNN. At IJCNN 2011 in Silicon Valley, our DanNet[DAN][GPUCNN1-3] won the traffic sign recognition contest with superhuman performance; at ICPR 2012, our DanNet[GPUCNN1-3] won the medical imaging contest. All of these fields were also heavily shaped in the 2010s by our non-CNN methods.[DL1][DL4][AM16][GSR][GSR15][GT16][WU][FB17]
As mentioned in Sec. XII, backpropagation was actually proposed earlier as a learning method for NNs by Werbos (1982)[BP2-4] (see also Amari).
Our adaptive subgoal generators (1991)[HRL0-2] were trained through end-to-end-differentiable chains of such modules.[MIR](Sec. 10) The same holds for planning and reinforcement learning with recurrent neural world models (1990),[PLAN][MIR](Sec. 11) and for my linear-transformer-like fast weight programmers[FWP0-2][FWP][ATT][MIR](Sec. 8) since 1991 (see Sec. XVI). Instead of arguments, there were ad hominem attacks.[AH2-3][HIN] As emphasized earlier:[DLC][HIN] GANs are variations of my artificial curiosity (1990). LBH, who called themselves the deep learning conspiracy,[DLC] did not cite the prior unsupervised pre-training for deep NNs (by myself), our deep and fast DanNet (2011),[GPUCNN1-3] the previously invented backpropagation, or our adversarial artificial curiosity.[UN0-2][UN] So virtually all the algorithms that have attracted massive attention build on our work: our LSTM brought essentially unlimited depth to supervised recurrent NNs in the 1990s; our Highway Nets[HW1-3] brought it to feedforward NNs in May 2015.[MOST] These enabled medical diagnosis (2012, see Sec. VII, XVIII) and many other applications.[DEC]
As mentioned earlier,[MIR](Sec. 21) it is not always clear[DLC] who deserves credit for ideas such as backpropagation; one day, AI scientists and AI historians equipped with artificial curiosity[SA17][AC90-AC20][PP-PP2] may settle such priority questions.
AI Blog
Traditionally this is done with recurrent NNs (RNNs)
published.[FWP0-1]
the attention terminology[FWP2] now used
famous vanishing gradient
additive neural activations of LSTMs / Highway Nets / ResNets[HW1-3] (Sec. 5)
Annus Mirabilis of deep learning.[MIR]
artificial neural network (NN)
started in 1965
fast weights.
attention[ATT] (Sec. 4)
on 26 March 1991,
attention[ATT]
That is, I separated storage and control like in traditional computers,
recurrent NNs (RNNs)
attention[ATT] (compare Sec. 4).
sequence-processing recurrent NNs (RNNs)
the computationally most powerful NNs of them all.[UN][MIR](Sec. 0)
of the same size: O(H²) instead of O(H), where H is the number of hidden units. This motivation and a variant of the method were republished over two decades later.[FWP4a][R4][MIR](Sec. 8)[T22](Sec. XVII, item H3)
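The outer-product mechanism behind this O(H²) capacity can be sketched in a few lines (a toy NumPy version; the `slow_net` below is an illustrative stand-in, not the original 1991 architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
H = 4                      # hidden units; the fast net exposes O(H^2) programmable weights
W_fast = np.zeros((H, H))  # fast weight matrix, (re)programmed at run time

def slow_net(x):
    # Illustrative stand-in for the trained "slow" net: from the current
    # input it derives a key, a value, and a query (hypothetical mappings).
    return np.tanh(x), np.tanh(x[::-1]), np.tanh(0.5 * x)

for t in range(3):
    x = rng.standard_normal(H)
    k, v, q = slow_net(x)
    W_fast += np.outer(v, k)  # program the fast weights via an outer product
    y = W_fast @ q            # retrieval: the linear-Transformer-like step

assert W_fast.shape == (H, H) and y.shape == (H,)
```

With H hidden units, the slow net thus controls H² fast weights, which is the quadratic capacity the text refers to.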
4. Attention terminology of 1993
NN-programmed fast weights (Sec. 5).[FWP0-1] See also Sec. 8 & Sec. 9 of [MIR] and Sec. XVII of [T22].
internal spotlights of attention
do not suffer during sequence learning from the famous vanishing gradient
and both of them dating back to 1991, our miraculous year of deep learning.[MIR]
Basic Long Short-Term Memory[LSTM1] solves the problem by adding, at every time step, new values to a memory cell instead of repeatedly multiplying them in, so error signals can flow back through the cell essentially unchanged.
Highway Network (May 2015),[HW1][HW1a][HW3] the first working really deep
Remarkably, both of these dual approaches of 1991 have become successful.
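The additive cell update behind the recurrent approach can be sketched as a scalar toy (gate weights are illustrative; the forget gate follows the later LSTM extension mentioned in the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(c, x, w_in, w_forget, w_cand):
    """One LSTM memory-cell step (scalar toy version, illustrative weights).

    The cell state c is updated *additively*, so backpropagated errors are
    scaled by the forget gate instead of shrinking multiplicatively at every
    step, which is what causes vanishing gradients in plain RNNs."""
    i = sigmoid(w_in * x)      # input gate
    f = sigmoid(w_forget * x)  # forget gate (the 1999 "forget gate" extension)
    g = np.tanh(w_cand * x)    # candidate value
    return f * c + i * g       # additive update: the constant error carousel

c = 0.0
for x in [0.5, -0.2, 1.0]:
    c = lstm_cell_step(c, x, w_in=1.0, w_forget=2.0, w_cand=1.0)
assert -1.5 < c < 1.5  # cell state stays bounded in this toy run
```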
the mid 2010s,[DEC]
major IT companies overwhelmingly used
unsupervised pre-training of deep NNs.[UN0-UN2][MIR](Sec. 1)
dates back to 1991[UN]
(Sec. 2).[FWP4a][R4][MIR](Sec. 8)[T22](Sec. XVII, item H3)
as shown in 2005
the first machine learning
Kolmogorov complexity or algorithmic information content of successful huge NNs may actually be rather small.
Compressed Network Search[CO2]
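The low-Kolmogorov-complexity idea can be illustrated by an indirect weight encoding in the spirit of Compressed Network Search:[CO2] a long weight vector is decoded from a handful of smooth, DCT-like coefficients (a toy sketch; all sizes and numbers are illustrative):

```python
import numpy as np

# Toy indirect encoding: 1000 network weights are generated from 8
# low-frequency cosine (DCT-like) coefficients, so the *description length*
# of the weights is far smaller than their count.
N_WEIGHTS, N_COEFFS = 1000, 8

def decode(coeffs):
    n = np.arange(N_WEIGHTS)
    basis = np.cos(np.pi * np.outer(np.arange(len(coeffs)), (n + 0.5) / N_WEIGHTS))
    return coeffs @ basis  # weights = weighted sum of a few smooth basis functions

genome = np.array([0.5, -0.3, 0.2, 0.1, 0.0, 0.05, -0.02, 0.01])
weights = decode(genome)   # 1000 weights described by just 8 numbers
assert weights.shape == (N_WEIGHTS,)
```

In Compressed Network Search, such coefficient vectors were evolved instead of the raw weights, directly exploiting the fact that well-generalizing weight matrices can be algorithmically simple.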
unsupervised pre-training.
My first work on metalearning machines that
learn to learn was published in 1987.[META][R3]
metalearning in a very general way.
used gradient descent in LSTM networks[LSTM1] instead of traditional
There is another version of this article
[AC] J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. The first paper on long-term planning with reinforcement learning recurrent neural networks (NNs) and on generative adversarial networks.
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. We had both hard attention (1990) and soft attention (1991-93).[FWP] Today, both types are very popular.
Precursor of modern backpropagation.[BP1-4]
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020).
Who invented backpropagation?
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, updated 2021, 2022). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s.
[DL4] J. Schmidhuber (AI Blog, 2017). Our impact on the world: LSTM.
J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.[DL6] Soon after its publication, everybody started talking about "deep learning." Causality or correlation?
[FWP] J. Schmidhuber (AI Blog, 26 March 2021, updated 2022). Fast weight programmers (1991) introduced the attention terminology[FWP2] now used in Transformers.
Highway Nets perform roughly as well as ResNets[HW2] on ImageNet.[HW3] Variants of highway gates are used for certain algorithmic tasks, where the simpler residual layers do not work as well.[NDR] More.
J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of
Searchable PDF scan (created by OCRmypdf, which uses LSTM).
Better GP methods through Meta-Evolution.
[MIR] J. Schmidhuber (AI Blog, 2019). Deep Learning: Our Miraculous Year 1990-1991. Preprint
[MOST] J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in my labs at TU Munich and IDSIA. Here I mention: (1) Long Short-Term Memory (LSTM), (2) ResNet (which is our earlier Highway Net with open gates), (3) AlexNet and VGG Net (both building on our similar earlier DanNet: the first deep convolutional NN to win image recognition competitions), (4) GANs (based on my earlier Adversarial Artificial Curiosity), and (5) variants of Transformers (Transformers with linearized self-attention are formally equivalent to my earlier Fast Weight Programmers). Most of this work stems from the Annus Mirabilis of 1990-1991.[MIR]
attention terminology in 1993.[ATT][FWP2][R4]
J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, and the GAN principle.
[T22] J. Schmidhuber (AI Blog, 2022).
Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21, IDSIA, Lugano, Switzerland, 2022.
[UN] J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised or self-supervised pre-training. Unsupervised approaches are now widely used.
is dominated by artificial neural networks (NNs) and
In 1676, Gottfried Wilhelm Leibniz
Footnote 1. In 1684, Leibniz was also the first to publish "modern" calculus;[L84][SON18][MAD05][LEI21,a,b] later Isaac Newton was also credited for his unpublished work.[SON18] Their priority dispute,[SON18] however, did not encompass the chain rule.[LEI07-10] Of course, both were building on earlier work: in the 3rd century B.C., Archimedes (perhaps the greatest scientist ever[ARC06]) paved the way for infinitesimals.
Footnote 3. Some claim that the backpropagation algorithm (discussed further down; now widely used to train deep NNs) is just the chain rule of Leibniz (1676) & L'Hopital (1696).
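The distinction the footnote draws can be made precise. The chain rule gives the derivative of a composition; backpropagation is the particular algorithm (the reverse mode of automatic differentiation) that evaluates the resulting product of Jacobians efficiently, from the output side backwards:

```latex
% Chain rule (Leibniz, 1676): for y = f(g(x)),
\frac{dy}{dx} \;=\; \frac{dy}{dg}\,\frac{dg}{dx}.
% Backpropagation applies it to a depth-L composition
% y = f_L(f_{L-1}(\cdots f_1(x)\cdots)):
\frac{\partial y}{\partial x}
  \;=\; \frac{\partial f_L}{\partial f_{L-1}}\,
        \frac{\partial f_{L-1}}{\partial f_{L-2}}\cdots
        \frac{\partial f_1}{\partial x},
% where the reverse mode evaluates this product from the output side
% inward, so the total cost is only a small constant multiple of the
% cost of one forward pass.
```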
LSTM became the most cited NN of the 20th century;[MOST] ResNet became the most cited NN of the 21st century.[MOST] Ivakhnenko's highly cited method was still popular in the new millennium,[DL2] especially in Eastern Europe, where much of Machine Learning was born.[MIR](Sec. 1)[R8]
It took 4 decades until the backpropagation method of 1970[BP1-2] got widely accepted as a training method for deep NNs. Before 2010, many thought that the training of NNs with many layers requires unsupervised pre-training, a methodology introduced by myself in 1991[UN][UN0-3] (see below), and later championed by others (2006).[UN4] In fact, it was claimed[VID1]
"wake-up call to the machine learning community." Computer Vision was revolutionized in the 2010s by a particular feedforward NN called the convolutional NN (CNN).[CNN1-4]
In 1987, NNs with convolutions were combined by Alex Waibel with weight sharing and backpropagation (see above),[BP1-2] and applied to speech.[CNN1a] Waibel did not call this CNNs but TDNNs.
CNN of 2011[GPUCNN1] known as DanNet[DAN,DAN1][R6]
given set.[AC20][AC][T22](Sec. XVII) Predictability Minimization created disentangled representations of partially redundant data, applied to images in 1996.[PM0-2][AC20][R2][MIR](Sec. 7) Recurrent NNs learned to generate sequences of subgoals.[HRL1-2][PHD][MIR](Sec. 10) Transformers with "linearized self-attention"[TR5-6] use attention terminology like the one I introduced in 1993.[ATT][FWP2][R4]
Annus Mirabilis of 1990-1991.[MIR][MOST]
The 1991 fast weight programmers and their 1992 extensions[FWPMETA1-9][HO1] built on my 1987 diploma thesis,[META1] which introduced algorithms not just for learning but also for meta-learning or learning to learn,[META] i.e., learning better learning algorithms through experience. This became very popular in the 2010s,[DEC] when computers were a million times faster. Before the 1990s, however, RNNs failed to learn deep problems in practice.[MIR](Sec. 0) The Neural History Compressor[UN1] overcame this, using my NN distillation procedure of 1991.[UN0-1][MIR] Transformers with linearized self-attention were also first published[FWP0-6] in the Annus Mirabilis of 1990-1991.[MIR][MOST] The Long Short-Term Memory (LSTM) recurrent neural network[LSTM1-6] overcomes the Fundamental Deep Learning Problem identified by Sepp in his above-mentioned 1991 thesis, and won three ICDAR 2009 Connected Handwriting Competitions (French, Farsi, Arabic).
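The distillation idea, compressing the knowledge of one NN (the teacher) into another (the student) by training the student on the teacher's outputs, can be sketched with linear toy models (a minimal sketch; names, sizes, and the linear stand-ins are illustrative, not the original 1991 architecture):

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in "teacher": its knowledge is to be compressed into the "student"
# by pure imitation of its outputs; a linear map keeps the sketch tiny.
W_teacher = rng.standard_normal((1, 4))
teacher = lambda x: W_teacher @ x

W_student = np.zeros((1, 4))
lr = 0.1
for _ in range(500):
    x = rng.standard_normal(4)
    err = (W_student @ x) - teacher(x)  # imitate the teacher's outputs
    W_student -= lr * np.outer(err, x)  # plain SGD on the imitation loss

# The student ends up reproducing the teacher's input-output behavior.
assert np.allclose(W_student, W_teacher, atol=1e-2)
```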
Highway Network.[HW1] The ResNet (winner of the ImageNet 2015 contest) is a version thereof.
Our LSTMs brought essentially unlimited depth to supervised recurrent NNs; in the 2010s, the LSTM-inspired Highway Nets brought it to feedforward NNs. See also LSTM trained by policy gradients (2007).[RPG07][RPG][LSTMPG]
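The Highway/ResNet relationship can be sketched as follows (a minimal NumPy toy; layer sizes and weight values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, W_t, b_t):
    """One highway layer: y = t * h(x) + (1 - t) * x.

    The text calls ResNet "a Highway Net with open gates": fixing both the
    transform and carry paths fully open recovers the residual form
    y = h(x) + x, with no learned gating."""
    h = np.tanh(W_h @ x)          # transformed path
    t = sigmoid(W_t @ x + b_t)    # transform gate; b_t < 0 biases toward carrying x
    return t * h + (1.0 - t) * x  # gated mix of transform and carry

rng = np.random.default_rng(1)
x = rng.standard_normal(8)
W_h = 0.1 * rng.standard_normal((8, 8))
W_t = 0.1 * rng.standard_normal((8, 8))
y = highway_layer(x, W_h, W_t, b_t=-2.0)  # mostly carries x through
assert y.shape == x.shape
```

Biasing the gate toward the carry path is what lets error signals pass through hundreds of such layers, the feedforward analogue of the LSTM cell.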
In the 2010s came Alphastar, whose brain has a deep LSTM core trained by PG.[DM3] Bill Gates called OpenAI's LSTM-based Dota 2 result a "huge milestone in advancing artificial intelligence".[OAI2a][MIR](Sec. 4)[LSTMPG] Neural history compressors[UN][UN0-3] learn to represent percepts at multiple levels of abstraction and multiple time scales (see above), while end-to-end differentiable NN-based subgoal generators[HRL3][MIR](Sec. 10) learn hierarchical action plans through gradient descent (see above). More sophisticated ways of learning to think in abstract ways were published in 2015.
Wilhelm Schickard built an early calculating machine (1623). In 1673, the already mentioned Gottfried Wilhelm Leibniz (called "the smartest man who ever lived"[SMO13]) designed the first machine that could multiply. Centuries later came Konrad Zuse: unlike Babbage, Zuse used the binary system of Leibniz rather than decimal arithmetic. See also Turing[TUR] (1936) and Post[POS] (1936).
raw computational power of all human brains combined.[RAW] any type of computation-based AI.[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]
Gottfried Wilhelm Leibniz[L86][WI48] (see above), then, in 1936, Alan M. Turing. In 1964, Ray Solomonoff combined Bayesian (actually Laplacian[STI83-85]) probabilistic reasoning and theoretical computer science;[GOD][CHU][TUR][POS] together with Andrej Kolmogorov, he founded the theory of Kolmogorov complexity or algorithmic information theory (AIT),[AIT1-22] going beyond traditional information theory.[SHA48][KUL]
In the early 2000s, Marcus Hutter (while working under my Swiss National Science Foundation grant[UNI]) augmented Solomonoff
first truly self-driving cars
robot cars were driving in highway traffic, up to 180 km/h).[AUT] Back then, I worked on my 1987 diploma thesis,[META1] which introduced algorithms not just for learning but also for meta-learning or learning to learn,[META] to learn better learning algorithms through experience (now a very popular topic[DEC]). And then came our Miraculous Year 1990-91[MIR] at TU Munich,
(take all of this with a grain of salt, though[OMG1]).
Some of the material above was taken from previous AI Blog posts.[MIR] [DEC] [GOD21] [ZUS21] [LEI21] [AUT] [HAB2] [ARC06] [AC] [ATT] [DAN] [DAN1] [DL4] [GPUCNN5,8] [DLC] [FDL] [FWP] [LEC] [META] [MLP2] [MOST] [PLAN] [UN] [LSTMPG] [BP4] [DL6a] [HIN] [T22]
In 2022, we are celebrating the following works from a quarter-century ago.
1. Journal paper on Long Short-Term Memory, the
(and basis of the most cited NN of the 21st).
all possible metaverses
3. Implementing artificial curiosity and creativity through generative adversarial agents that learn to design abstract, interesting computational experiments.
meta-reinforcement learning.
5. Journal paper on hierarchical Q-learning.
8. Journal paper on Low-Complexity Art, the Minimal Art of the Information Age.
[AC] J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. The first paper on online planning with reinforcement learning recurrent neural networks (NNs) and on generative adversarial networks.
general system
(More on artificial scientists and artists.)
[AIB] J. Schmidhuber. AI Blog.
Includes variants of chapters of the AI Book.
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. We had both hard attention (1990) and soft attention (1991-93).[FWP] Today, both types are very popular.
attentional component (the fixation controller)." See [MIR](Sec. 9)[R4].
J. Schmidhuber (AI Blog, 2005). Highlights of robot car history.
Precursor of modern backpropagation.[BP1-5]
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020).
Who invented backpropagation?
[DAN] J. Schmidhuber (AI Blog, 2021). 10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named after my postdoc Dan Ciresan.
J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition, by our artificial neural network called DanNet.
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, updated 2021, 2022). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. Covers the 1991 NN distillation procedure.[UN0-2][MIR](Sec. 2)
[DL4] J. Schmidhuber (AI Blog, 2017). Our impact on the world: LSTM.
J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.[DL6] Soon after its publication, everybody started talking about "deep learning." Causality or correlation?
This type of deep learning dates back to 1991.[UN1-2][UN]
[DLC] J. Schmidhuber (AI Blog, June 2015).
Critique of Paper by self-proclaimed[DLC2] "Deep Learning Conspiracy" (Nature 521 p 436).
J. Schmidhuber (AI Blog, 2022).
Annotated History of Modern AI and Deep Learning. Technical Report IDSIA-22-22, IDSIA, Lugano, Switzerland, 2022.
and PhDs in computer science. More.
Alphastar has a "deep LSTM core."
used LSTM
[FDL] J. Schmidhuber (AI Blog, 2013). My First Deep Learning System of 1991 + Deep Learning Timeline 1960-2013.
[FWP] J. Schmidhuber (AI Blog, 26 March 2021, updated 2022). Fast weight programmers (1991) introduced the attention terminology[FWP2] now used in Transformers.
OCR-based PDF scan of pages 94-135 (see pages 119-120).
Cognitive Computation 1(2):177-193, 2009. PDF.
win four important computer vision competitions 2011-2012 before others won any
[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More.
DanNet,[DAN,DAN1][R6]
J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet was the first CNN to win one, and won 4 of them in a row before the similar AlexNet/VGG Net and the Resnet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision.
[GPUCNN8] J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet).
first deep learner to win a medical imaging contest (2012).
J. Schmidhuber (Blog, 2000). Most influential persons of the 20th century (according to Nature, 1999). The Haber-Bosch process has often been called the most important invention of the 20th century[HAB1]
[HIN] J. Schmidhuber (AI Blog, 2020). Critique of Honda Prize for Dr. Hinton. Science must not allow corporate PR to distort the academic record. See also [T22].
previous related work.[BB2][NAN1-4][NHE][MIR](Sec. 15, Sec. 17)[FWPMETA6]
[LEC] J. Schmidhuber (AI Blog, 2022). LeCun
LeCun also listed the "5 best ideas 2012-2022" without mentioning that
[LEI21] J. Schmidhuber (AI Blog, 2021). 375th birthday of Leibniz, founder of computer science.
Frankfurter Allgemeine Zeitung (FAZ), 17/5/2021. FAZ online:
[LEI21b] J. Schmidhuber (AI Blog, 2021). 375. Geburtstag des Herrn Leibniz, dem Vater der Informatik.
J. Schmidhuber (AI Blog, Dec 2020). 10-year anniversary of our journal paper on deep reinforcement learning with policy gradients for LSTM (2007-2010). Recent
Minsky was apparently unaware of this and failed to correct it later.[HIN](Sec. I)[T22](Sec. XIII)
J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of
Searchable PDF scan (created by OCRmypdf, which uses LSTM).
Better GP methods through Meta-Evolution.
[MIR] J. Schmidhuber (AI Blog, Oct 2019, updated 2021, 2022). Deep Learning: Our Miraculous Year 1990-1991. Preprint
(AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training.
[MOST] J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in my labs at TU Munich and IDSIA: (1) Long Short-Term Memory (LSTM), (2) ResNet (which is our earlier Highway Net with open gates), (3) AlexNet and VGG Net (both building on our similar earlier DanNet: the first deep convolutional NN to win image recognition competitions), (4) GANs (based on the much earlier Adversarial Artificial Curiosity), and (5) variants of Transformers (Transformers with linearized self-attention are formally equivalent to the much earlier Fast Weight Programmers). Most of this work stems from the Annus Mirabilis of 1990-1991.[MIR]
[NASC1] J. Schmidhuber. First Pow(d)ered flight / plane truth. Correspondence, Nature, 421 p 689, Feb 2003.
[NASC3] J. Schmidhuber. The last inventor of the telephone. Letter, Science, 319, no. 5871, p. 1759, March 2008.
[NASC6] J. Schmidhuber. Colossus was the first electronic digital computer. Correspondence, Nature, 441 p 25, May 2006.
[NASC7] J. Schmidhuber. Turing
[NASC8] J. Schmidhuber. Prototype resilient, self-modeling robots. Correspondence, Science, 316, no. 5825 p 688, May 2007.
[NASC9] J. Schmidhuber. Comparing the legacies of Gauss, Pasteur, Darwin. Correspondence, Nature, vol 452, p 530, April 2008.
Compare Konrad Zuse
An LSTM with 84% of the model's total parameter count was the core of OpenAI Five (2018).
J. Schmidhuber (Blog, 2006).
Is History Converging? Again?
HTML.
HTML overview.
OOPS source code in crystalline format.
HTML.
J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, and
the GAN principle
More.
More.
More.
PDF. More.
PDF.
J. Schmidhuber (AI Blog, 2001). Raw Computing Power.
PDF.
This experimental analysis of backpropagation did not cite the origin of the method,[BP1-5] also known as the reverse mode of automatic differentiation.
PDF.
Local copy 1 (HTML only).
Local copy 2 (HTML only).
[T20a] J. Schmidhuber (AI Blog, 25 June 2020). Critique of 2018 Turing Award for Drs. Bengio & Hinton & LeCun. A precursor of [T22].
[T22] J. Schmidhuber (AI Blog, 2022).
Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21, IDSIA, Lugano, Switzerland, 2022.
attention terminology in 1993.[ATT][FWP2][R4]
[TUR21] J. Schmidhuber (AI Blog, Sep 2021). Turing Oversold. It
J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised or self-supervised pre-training. Unsupervised
PDF.
neural knowledge distillation procedure
The systems of 1991 allowed for much deeper learning than previous methods. More.
approaches are now widely used. More.
the first NNs shown to solve very deep problems.
Theory of Universal Learning Machines & Universal AI.
[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF. More on the Fundamental Deep Learning Problem.
unsupervised pre-training is not necessary
Local copy (plain HTML only).
Schmidhuber
a general, practical, program-controlled computer.
J. Schmidhuber (AI Blog, 2021). 80th anniversary celebrations: 1941: Konrad Zuse completes the first working general computer, based on his 1936 patent application.
J. Schmidhuber (AI Blog, 2021). 80. Jahrestag: 1941: Konrad Zuse baut ersten funktionalen Allzweckrechner, basierend auf der Patentanmeldung von 1936.
PDF.
AI Blog
modern backpropagation
principles of generative adversarial NNs and artificial curiosity (1990),[AC90,90b][AC20]
unsupervised pre-training for deep NNs (1991),[UN1-2]
vanishing gradients (1991)[VAN1] &
Long Short-Term Memory or LSTM (Sec. A),
NNs with over 100 layers (2015),[HW1-3][R5]
fast weight programmers (1991).[FWP0-2,6]
Often LBH failed to cite essential prior work, even in their later surveys.[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5, R7-R8]
The deep NNs
By the 2010s,[DEC] they were
Long Short-Term Memory
vanishing gradient problem
(Sec. 3, Sec. 4)
through "forget gates" based on end-to-end-differentiable fast weights.[MIR](Sec. 8)[FWP,FWP0-1]
became the first recurrent NN (RNN) to win international competitions.
LSTM[MIR](Sec. 4)
However, such attention mechanisms also
have their roots in my lab (1991);[FWP][FWP0-2,6]
In the 2010s,
Alphastar, whose brain has a deep LSTM core trained by policy gradients (PG).[DM3]
Bill Gates called this a "huge milestone in advancing artificial intelligence".[OAI2a][MIR](Sec. 4)[LSTMPG]
in healthcare,
D. Computer Vision was revolutionized in the 2010s by
In 1987, Waibel combined convolutional NNs with weight sharing and backpropagation.[CNN1a] He did not call this architecture a CNN but a TDNN.
unsupervised pre-training is not necessary
Our fast GPU-based CNN of 2011[GPUCNN1] known as DanNet[DAN,DAN1][R6]
winning four of them
at IJCNN 2011 in Silicon Valley, DanNet blew away the competition and achieved the first superhuman visual pattern recognition[DAN1] in an international contest (where LeCun
DanNet was also the first deep CNN to win:
a Chinese handwriting contest (ICDAR 2011),
an image segmentation contest (ISBI, May 2012),
further extended the work of 2011.[MIR](Sec. 19)
most cited neural network,[MOST] is a version (with open gates) of our earlier
Highway Net (May 2015).[HW1-3][R5] The Highway Net is actually the feedforward net version of vanilla LSTM.[LSTM2] It was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers).
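This relationship can be sketched in a few lines (a minimal illustration with made-up parameter names, not the original implementation): a Highway layer mixes a transformed signal with the identity input via learned gates, and a ResNet block is the special case with both gates fixed open.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(x, Wh, Wt, Wc, bh, bt, bc):
    """One Highway layer: y = t * h + c * x, where h is a candidate
    transformation and t, c are learned transform/carry gates
    (often coupled as c = 1 - t, as in vanilla LSTM's gating)."""
    h = np.tanh(Wh @ x + bh)   # candidate transformation
    t = sigmoid(Wt @ x + bt)   # transform gate in (0, 1)
    c = sigmoid(Wc @ x + bc)   # carry gate in (0, 1)
    return t * h + c * x

def residual_layer(x, Wh, bh):
    """A ResNet block is the open-gated special case t = c = 1: y = h + x."""
    return np.tanh(Wh @ x + bh) + x
```

Because the carry path passes x through unchanged, such layers can be stacked hundreds deep without the signal (or its gradient) being forced through a squashing nonlinearity at every layer.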
appeared long before the 1980s.
containing the now popular multiplicative gates).[DEEP1-2][DL1-2] A paper of 1971[DEEP2] already described a deep learning net with 8 layers, trained by their highly cited method which was still popular in the new millennium,[DL2] especially in Eastern Europe, where much of Machine Learning was born.[MIR](Sec. 1)[R8] LBH failed to cite this, just like they failed to cite Amari,[GD1] who in 1967 proposed stochastic gradient descent[STO51-52] (SGD) for MLPs and whose implementation[GD2,GD2a] (with Saito) learned internal representations at a time when compute was billions of times more expensive than today (see also Tsypkin
Minsky was apparently unaware of this and failed to correct it later.[HIN](Sec. I)
However, it became really deep in 1991 in my lab,[UN-UN3] which has
First Very Deep NNs, Based on Unsupervised Pre-Training (1991).
more.)
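The general flavor of such layer-wise unsupervised pre-training can be sketched as follows (a generic greedy autoencoder stack for illustration only; the 1991 method was a stack of predictive RNNs, the neural history compressor, and the gradient here is deliberately simplified):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(X, n_hidden, epochs=200, lr=0.1):
    """Train one tied-weight autoencoder layer; return its encoder weights."""
    W = rng.normal(0, 0.1, (n_hidden, X.shape[1]))
    for _ in range(epochs):
        H = np.tanh(X @ W.T)        # encode
        X_rec = H @ W               # decode with tied weights
        err = X_rec - X
        # simplified gradient (decoder path only) -- enough for a sketch
        W -= lr * (H.T @ err) / len(X)
    return W

def pretrain_stack(X, layer_sizes):
    """Greedy layer-wise pre-training: each layer learns to reconstruct
    the codes produced by the already-trained layers below it."""
    weights, codes = [], X
    for n_hidden in layer_sizes:
        W = train_autoencoder(codes, n_hidden)
        weights.append(W)
        codes = np.tanh(codes @ W.T)  # feed codes to the next layer
    return weights
```

After pre-training, the stacked weights initialize a deep net that is then fine-tuned by supervised learning.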
drove the shift
from unsupervised pre-training to purely supervised learning (1991-95; 2006-10).[HIN](Sec. II)[MIR](Sec. 19)
LSTMs
brought essentially unlimited depth to gradient-based supervised recurrent NNs; Highway Nets[HW1-3] brought it to feedforward NNs.[MOST]
by others (Sec.
modern backpropagation
principles of generative adversarial NNs and artificial curiosity (1990),[AC90,90b][AC20]
unsupervised pre-training for deep NNs,[UN1-2]
the vanishing gradient problem (1991)[VAN1] &
solutions to it (Sec. A),
Often LBH failed to cite essential prior work.[DLC][HIN][MIR](Sec. 21)
we had this type of deep learning already in 1991;[UN][UN1-2] see Sec.
any type of computation-based AI.[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]
In 1936, Turing
my reply to Hinton
who criticized my website on Turing
Likewise, Konrad Zuse (1910-1995)
created the world
modern backpropagation
principles of generative adversarial NNs and artificial curiosity (1990),[AC][AC90,90b][AC10][AC20]
unsupervised pre-training for deep NNs (1991),[UN1-2][UN]
vanishing gradients (1991)[VAN1] &
solutions to it (Sec. A),[LSTM0-17][CTC]
record-breaking deep supervised NNs
and contest-winning deep CNNs (2011),[DAN][DAN1][GPUCNN5]
NNs with over 100 layers (2015),[HW1-3][R5]
fast weight programmers (1991),[FWP0-2,6]
Often LBH failed to cite essential prior work.[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5,R7,R8,R11]
"advances in natural language processing" and in speech
supervised NNs and
CNNs
and through Highway Net-like NNs (2015),[HW1-3][R5] although the principles of CNNs were invented and developed by others since the 1970s.[CNN1-4] See Sec. D & XVIII & XIV
as well as Sec. 4 & Sec. 19 of the overview.[MIR]
DanNet[DAN][DAN1][GPUCNN5]
the first NN to win a medical imaging contest through deep learning
approach of
who failed to cite them, even in later
the first machine learning
others built careers on this notion long before LBH recognized this.[DEEP1-2][CNN1][HIN][R8][DL1][DLC] Even deep learning through unsupervised pre-training was introduced by others.[UN1-3][R4][HIN](Sec. II)
made GPU-based NNs fast and deep enough
unsupervised pre-training (pioneered by myself in 1991)
is not necessary
our CNNs were deep and fast enough[DAN][DAN1][GPUCNN5]
Furthermore, by the mid 2010s, speech recognition and machine translation
(and apparently even other award committees[HIN](Sec. I))
recent debate:[HIN] It is true that in 2018,
fast deep CNN called DanNet
as well as Sec. 19 of the overview.[MIR]
In the 2010s,
was actually the LSTM of our team,[LSTM0-6] which Bloomberg called "arguably the most commercial AI achievement."[AV1][MIR](Sec. 4) See Sec. B.
adaptive neural sequential attention: end-to-end-differentiable
"soft" attention in the latent space of Fast Weight Programmers (FWPs),[FWP2][FWP] and "hard" attention (in observation space) in the context of RL[ATT][ATT0-1] (1990).
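The distinction between the two can be sketched as follows (illustrative, simplified to dot-product scoring): soft attention is a differentiable weighted average trainable by backpropagation, while hard attention selects one item and is typically trained by RL.

```python
import numpy as np

def soft_attention(query, keys, values):
    """Differentiable "soft" attention: a softmax-weighted average of all
    values; gradients flow through every item."""
    scores = keys @ query
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ values

def hard_attention(query, keys, values, rng):
    """Non-differentiable "hard" attention: sample a single item from the
    same distribution; usually trained with reinforcement learning."""
    scores = keys @ query
    w = np.exp(scores - scores.max())
    w /= w.sum()
    idx = rng.choice(len(values), p=w)
    return values[idx]
```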
FWPs of 1991[FWP0-1]
which have become a popular alternative to RNNs.
the 2010s,[DEC]
the attention terminology[FWP2] now used
See[MIR](Sec. 9)[R4] for my related priority dispute on attention with Hinton.
a simple application[AC]
of the adversarial curiosity (AC) principle
the other
(1991).[PM1-2][AC20][R2][MIR](Sec. 5)
vanishing gradient problem,[MIR](Sec. 3)[VAN1] Bengio published his own,[VAN2] without citing Sepp.
my publications on exactly this topic
Regarding attention-based Transformers,[TR1-6] Bengio[DL3a] cites his own team (2014) for "soft attention" without citing my much earlier original work of 1991-1993 on soft attention and linear Transformers.[FWP,FWP0-2,6]
unsupervised pre-training
for deep NNs.[UN0-4][HIN](Sec. II)[MIR](Sec. 1)
the first NNs shown to solve very deep problems
compressing or distilling
one NN into another.[UN0-2][DIST1-2][MIR](Sec. 2)
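The distillation idea itself is simple: a student net is trained to imitate the outputs of a teacher net. A minimal sketch (linear student, arbitrary teacher function; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill(teacher_logits_fn, X, n_out, epochs=500, lr=0.5):
    """Train a linear student to imitate a teacher's soft outputs by
    gradient descent on the cross-entropy to the teacher's distribution."""
    W = np.zeros((X.shape[1], n_out))
    targets = softmax(teacher_logits_fn(X))      # teacher's soft labels
    for _ in range(epochs):
        probs = softmax(X @ W)                   # student's predictions
        W -= lr * X.T @ (probs - targets) / len(X)
    return W
```

The teacher can be any trained model; only its soft outputs on the transfer data are needed.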
fast weight programmers[FWP][FWP0-4a]
through tensor-like outer products (1991-2016) and their motivation[FWP2][FWP4a][MIR](Sec. 8) (see also Sec. XVI above).
learning sequential attention
with NNs.[MIR](Sec. 9)
our much earlier work on this[ATT1][ATT] although
However, the basic CNN architecture with convolutional and downsampling layers is actually due to Fukushima (1979).[CNN1] NNs with convolutions were later (1987) combined by Waibel with weight sharing and backpropagation.[CNN1a] Waibel called this TDNN and
at IJCNN 2011 in Silicon Valley, our DanNet[DAN][GPUCNN1-3] won the
superhuman performance
at ICPR 2012, our DanNet[GPUCNN1-3] won the
medical imaging contest
All of these fields were heavily shaped in the 2010s by our non-CNN methods.[DL1][DL4][AM16][GSR][GSR15][GT16][WU][FB17] See
As mentioned in Sec. XII, backpropagation was actually proposed earlier as a learning method for NNs by Werbos (1982)[BP2-4] (see also Amari
adaptive subgoal generators
(1991)[HRL0-2] were trained through end-to-end-differentiable chains of such modules.[MIR](Sec. 10)
planning and reinforcement learning with recurrent neural world models
(1990).[PLAN][MIR](Sec. 11) Same for my linear transformer-like
fast weight programmers[FWP0-2][FWP][ATT][MIR](Sec. 8) since 1991 (see Sec. XVI)
ad hominem attacks[AH2-3][HIN]
As emphasized earlier:[DLC][HIN]
GANs are variations
LBH, who called themselves the deep learning conspiracy,[DLC][DLC1-2]
unsupervised pre-training for deep NNs
by myself
our deep and fast DanNet (2011)[GPUCNN1-3] as
Backpropagation is a previously invented
artificial curiosity
by ours[UN0-2][UN]
So virtually all the algorithms that have attracted
our LSTM
brought essentially unlimited depth to gradient-based supervised recurrent NNs in the 1990s; our Highway Nets[HW1-3] brought it to feedforward NNs in May 2015.[MOST]
medical diagnosis (2012, see Sec. VII, XVIII), and many other applications.[DEC]
and many other applications.[DEC]
As mentioned earlier,[MIR](Sec. 21)
it is not always clear[DLC]
backpropagation
AI scientists and AI historians
equipped with artificial curiosity[SA17][AC90-AC20][PP-PP2][R1]
publication page and my
J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. Our
PDF.
The first paper on planning with reinforcement learning recurrent neural networks (NNs) (more) and on generative adversarial networks
(more).
More.
PDF.
PDF. (More on
artificial scientists and artists.)
PDF.
(more).
[AIB] J. Schmidhuber. AI Blog.
Includes variants of chapters of the AI Book.
PDF.
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. We had both hard attention (1990) and soft attention (1991-93).[FWP] Today, both types are very popular.
PDF.
More.
PS. (PDF.)
HTML.
PDF.
Precursor of modern backpropagation.[BP1-4]
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020).
Who invented backpropagation?
PDF.
PDF.
PDF.
J. Schmidhuber (AI Blog, 2021).
10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named
J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition.
our artificial neural network called DanNet
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, revised 2021). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The
More.
Local copy (HTML only).
[DL4] J. Schmidhuber (AI Blog, 2017).
Our impact on the world
LSTM.
J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.[DL6] Soon after its publication, everybody started talking about "deep learning." Causality or correlation?
this type of deep learning dates back to 1991.[UN1-2][UN]
[DLC] J. Schmidhuber (AI Blog, June 2015).
Critique of Paper by self-proclaimed[DLC1-2] "Deep Learning Conspiracy" (Nature 521 p 436).
Alphastar has a "deep LSTM core."
used LSTM
PDF.
J. Schmidhuber (AI Blog, 26 March 2021).
the attention terminology[FWP2] now used
PDF.
HTML.
Pictures (German).
HTML overview.
can be found here.
PDF.
OCR-based PDF scan of pages 94-135 (see pages 119-120).
win four important computer vision competitions 2011-2012 before others won any
PDF.
HTML overview.
[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More.
J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet won 4 of them in a row before the similar AlexNet/VGG Net and the ResNet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision.
PDF.
first deep learner to win a medical imaging contest (2012). HTML.
[HIN] J. Schmidhuber (AI Blog, 2020). Critique of Honda Prize for Dr. Hinton. Science must not allow corporate PR to distort the academic record.
PDF.
More.
(v1: 24 Sep 2021,
v2: 31 Dec 2021)
Versions since 2021 archived in the Internet Archive
deep learning survey,[DL1] and can also be seen as a short history of the deep learning revolution, at least as far as ACM
2015 survey of deep learning[DL1]
June 2020 article[T20a][R12]
version 1 of the present report.
expands material in my Critique of the 2019 Honda Prize[HIN] (~3,000 words).
The text contains numerous hyperlinks to relevant overview sites from the AI Blog.
first version
Most of the critiques are based on references to original papers and material from the AI Blog.[AIB][MIR][DEC][HIN]
this class of methods was pioneered in 1991[UN-UN2] (see Sec. II, III).
Highway Net,
were all driven by my lab:[MOST] In 1991, I had the
first very deep NNs based on unsupervised pre-training;[UN-UN2]
LSTMs
later our Highway Nets[HW1-3] brought it to feedforward NNs.
based on LSTM[LSTM0-6] (1990s-2005) and CTC (2006).[CTC]
our CTC-LSTM-based speech recognition (not that of Hinton) had been on most smartphones for years[GSR][GSR15-19][DL4] (see Sec. A, VI, XI, XV). Similarly for machine translation (see Sec. B).
called AlexNet,[GPUCNN4] without mentioning that our earlier groundbreaking deep GPU-based DanNet[GPUCNN1-3,5-8][DAN] did not need ReLUs at all to win 4 earlier object recognition competitions and to achieve superhuman results already in 2011[GPUCNN1-8][R5-6] (see Sec. XIV).
described in the 1991-93 papers on Fast Weight Programmers and linear Transformers[FWP0-1,6] (see Sec. XVI, XVII-2).
GANs are instances
of the Adversarial Curiosity Principle of 1990[AC90-20][MIR](Sec. 5) (see Sec. XVII).
it became really deep in 1991 in my lab,
unsupervised pre-training of NNs,
supervised LSTM.
combines two methods from my lab: LSTM (1990s-2005) and CTC (2006), which were
our revolutionary CTC-LSTM which was soon on most smartphones.
(soon used for several billions of
was also based on our LSTM.
most visible breakthroughs
deep NNs
superior computer vision in 2011,
winning 4 image recognition contests in a row
is an open-gated version of our earlier Highway Nets.
deep & fast CNN
(where LeCun participated),
deep GPU-NN of 2010
debunked unsupervised pre-training (introduced by myself in 1991 and later championed by Hinton),
and our GPU-CNN of 2011 (DanNet) was the first
first to win medical imaging competitions
backpropagation
CTC-LSTM
We started this in 1990-93
long before LBH
Artificial Curiosity
vanishing gradients (1991),
metalearning (1987),
unsupervised pre-training (1991),
compressing or distilling one NN into another (1991),
learning sequential attention with NNs (1990),
fast weight programmers using
Sec. IV is on Turing (1936) and his predecessors
In the recent decade of deep learning,
(speech recognition, language translation, etc.) on billions of devices (also healthcare applications)
AI Blog
Traditionally this is done with recurrent NNs (RNNs)
published.[FWP0-1]
the attention terminology[FWP2] now used
famous vanishing gradient
additive neural activations of LSTMs / Highway Nets / ResNets[HW1-3] (Sec. 5)
Annus Mirabilis of deep learning.[MIR]
artificial neural network (NN)
started in 1965
fast weights.
attention[ATT] (Sec. 4)
on 26 March 1991,
attention[ATT]
That is, I separated storage and control like in traditional computers,
recurrent NNs (RNNs)
attention[ATT] (compare Sec. 4).
sequence-processing recurrent NNs (RNNs)
the computationally most powerful NNs of them all.[UN][MIR](Sec. 0)
of the same size: O(H²) instead of O(H), where H is the number of hidden units. This motivation and a variant of the method were republished over two decades later.[FWP4a][R4][MIR](Sec. 8)[T22](Sec. XVII, item H3)
4. Attention terminology of 1993
NN-programmed fast weights (Sec. 5).[FWP0-1], Sec. 9 & Sec. 8 of [MIR], Sec. XVII of [T22]
internal spotlights of attention
do not suffer during sequence learning from the famous vanishing gradient
and both of them dating back to 1991, our miraculous year of deep learning.[MIR]
Basic Long Short-Term Memory[LSTM1] solves the problem by adding at every time step
Highway Network (May 2015),[HW1][HW1a][HW3] the first working really deep
Remarkably, both of these dual approaches of 1991 have become successful.
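The contrast between a plain recurrence and the additive carousel shared by LSTM cells, Highway Nets, and ResNets can be illustrated with a toy calculation (my own illustration; the constants are not from [VAN1]):

```python
T = 100   # depth / number of time steps
w = 0.5   # recurrent weight with magnitude below 1

# Plain multiplicative recurrence h_t = w * h_{t-1}:
# the gradient d h_T / d h_0 = w ** T vanishes exponentially with depth.
grad_multiplicative = w ** T

# Additive carousel h_t = h_{t-1} + f(x_t) -- the skip path shared by
# LSTM cells, Highway Nets, and ResNets:
# the same gradient is exactly 1, independent of depth.
grad_additive = 1.0

print(f"{grad_multiplicative:.1e}")  # 7.9e-31
print(grad_additive)                 # 1.0
```

Through 100 steps, the multiplicative gradient is below 10⁻³⁰ while the additive path carries it unchanged; this is the essence of why gates over an additive path enable very deep credit assignment.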
the mid 2010s,[DEC]
major IT companies overwhelmingly used
unsupervised pre-training of deep NNs.[UN0-UN2][MIR](Sec. 1)
dates back to 1991[UN]
(Sec. 2).[FWP4a][R4][MIR](Sec. 8)[T22](Sec. XVII, item H3)
as shown in 2005
the first machine learning
Kolmogorov complexity or algorithmic information content of successful huge NNs may actually be rather small.
Compressed Network Search[CO2]
unsupervised pre-training.
My first work on metalearning machines that
learn to learn was published in 1987.[META][R3]
metalearning in a very general way.
used gradient descent in LSTM networks[LSTM1] instead of traditional
There is another version of this article
J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. Our
PDF.
The first paper on long-term planning with reinforcement learning recurrent neural networks (NNs) (more) and on generative adversarial networks
(more).
PDF.
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. Schmidhuber
PDF.
More.
PS. (PDF.)
Precursor of modern backpropagation.[BP1-4]
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020).
Who invented backpropagation?
PDF.
PDF.
PDF.
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, updated 2021, 2022). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The
More.
[DL4] J. Schmidhuber (AI Blog, 2017).
Our impact on the world
LSTM.
J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.[DL6] Soon after its publication, everybody started talking about "deep learning." Causality or correlation?
J. Schmidhuber (AI Blog, 26 March 2021, updated 2022).
the attention terminology[FWP2] now used
PDF.
HTML.
Pictures (German).
HTML overview.
PDF.
can be found here.
Highway Nets perform roughly as well as ResNets[HW2] on ImageNet.[HW3] Variants of highway gates are used for certain algorithmic tasks, where the simpler residual layers do not work as well.[NDR] More.
More.
PDF.
More.
J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of
Searchable PDF scan (created by OCRmypdf which uses
LSTM).
HTML.
better GP methods through Meta-Evolution. More.
[MIR] J. Schmidhuber (AI Blog, 2019). Deep Learning: Our Miraculous Year 1990-1991. Preprint
J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in my labs at TU Munich and IDSIA. Here I mention: (1) Long Short-Term Memory (LSTM), (2) ResNet (which is our earlier Highway Net with open gates), (3) AlexNet and VGG Net (both building on our similar earlier DanNet: the first deep convolutional NN to win
image recognition competitions), (4) GANs (instances of my earlier Adversarial Artificial Curiosity), and (5) variants of Transformers (Transformers with linearized self-attention are formally equivalent to my earlier Fast Weight Programmers).
Annus Mirabilis of 1990-1991.[MIR]
PDF.
PDF.
attention terminology in 1993.[ATT][FWP2][R4]
J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, the GAN principle
[T22] J. Schmidhuber (AI Blog, 2022).
Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21, IDSIA, Lugano, Switzerland, 2022.
PDF.
J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised or self-supervised pre-training. Unsupervised
PDF.
approaches are now widely used. More.
https://people.idsia.ch/~juergen/deep-learning-history.html
is dominated by artificial neural networks (NNs) and
In 1676, Gottfried Wilhelm Leibniz
Footnote 1. In 1684, Leibniz was also the first to publish "modern" calculus;[L84][SON18][MAD05][LEI21,a,b] later Isaac Newton was also credited for his unpublished work.[SON18] Their priority dispute,[SON18] however, did not encompass the chain rule.[LEI07-10] Of course, both were building on earlier work: in the 3rd century B.C., Archimedes (perhaps the greatest scientist ever[ARC06]) paved the way for infinitesimals
Footnote 3. Some claim that the backpropagation algorithm (discussed further down; now widely used to train deep NNs) is just the chain rule of Leibniz (1676) & L
It took 4 decades until the backpropagation method of 1970[BP1-2] became widely accepted as a training method for deep NNs. Before 2010, many thought that the training of NNs with many layers required unsupervised pre-training, a methodology introduced by myself in 1991[UN][UN0-3] (see below), and later championed by others (2006).[UN4] In fact, it was claimed[VID1]
"wake-up call to the machine learning community." Computer Vision was revolutionized in the 2010s by a particular feedforward NN called the convolutional NN (CNN).[CNN1-4]
In 1987, NNs with convolutions were combined by Alex Waibel with weight sharing and backpropagation (see above),[BP1-2] and applied to speech.[CNN1a] Waibel did not call them CNNs but TDNNs.
CNN of 2011[GPUCNN1] known as DanNet[DAN,DAN1][R6]
given set.[AC20][AC][T22](Sec. XVII) Predictability Minimization for creating disentangled representations of partially redundant data, applied to images in 1996.[PM0-2][AC20][R2][MIR](Sec. 7) recurrent NNs that learn to generate sequences of subgoals.[HRL1-2][PHD][MIR](Sec. 10) Transformers with "linearized self-attention"[TR5-6] attention terminology like the one I introduced in 1993[ATT][FWP2][R4]).
Annus Mirabilis of 1990-1991.[MIR][MOST]
The fast weight programmers of 1991-92[FWPMETA1-9][HO1] extended my 1987 diploma thesis,[META1] which introduced algorithms not just for learning but also for meta-learning or learning to learn,[META] to learn better learning algorithms through experience. This became very popular in the 2010s[DEC] when computers were a million times faster. Before the 1990s, however, RNNs failed to learn deep problems in practice.[MIR](Sec. 0) Neural History Compressor.[UN1] using my NN distillation procedure of 1991.[UN0-1][MIR] Transformers with linearized self-attention were also first published[FWP0-6] in the Annus Mirabilis of 1990-1991.[MIR][MOST] Fundamental Deep Learning Problem. The Long Short-Term Memory (LSTM) recurrent neural network[LSTM1-6] overcomes the Fundamental Deep Learning Problem identified by Sepp in his above-mentioned 1991 thesis. Three ICDAR 2009 Connected Handwriting Competitions (French, Farsi, Arabic). LSTM.
Highway Network[HW1] ImageNet 2015 contest) is a version thereof
Deep learning LSTMs brought essentially unlimited depth to supervised recurrent NNs; the LSTM-inspired Highway Nets brought it to feedforward NNs. In the 2000s: LSTM trained by policy gradients (2007).[RPG07][RPG][LSTMPG]
Alphastar whose brain has a deep LSTM core trained by PG.[DM3] Bill Gates called this a "huge milestone in advancing artificial intelligence".[OAI2a][MIR](Sec. 4)[LSTMPG] neural history compressors[UN][UN0-3] learn to represent percepts at multiple levels of abstraction and multiple time scales (see above), while end-to-end differentiable NN-based subgoal generators[HRL3][MIR](Sec. 10) learn hierarchical action plans through gradient descent (see above). More sophisticated ways of learning to think in abstract ways were published in
Wilhelm Schickard, In 1673, the already mentioned Gottfried Wilhelm Leibniz (called "the smartest man who ever lived"[SMO13]) Konrad Zuse Unlike Babbage, Zuse used Leibniz Turing[TUR] (1936), and Post[POS] (1936).
raw computational power of all human brains combined.[RAW] any type of computation-based AI.[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]
Gottfried Wilhelm Leibniz[L86][WI48] (see above), In 1936, Alan M. Turing the world In 1964, Ray Solomonoff combined Bayesian (actually Laplacian[STI83-85]) probabilistic reasoning and theoretical computer science[GOD][CHU][TUR][POS] Andrej Kolmogorov, he founded the theory of Kolmogorov complexity or algorithmic information theory (AIT),[AIT1-22] going beyond traditional information theory[SHA48][KUL]
In the early 2000s, Marcus Hutter (while working under my Swiss National Science Foundation grant[UNI]) augmented Solomonoff
first truly self-driving cars
robot cars were driving in highway traffic, up to 180 km/h).[AUT] Back then, I worked on my 1987 diploma thesis,[META1] which introduced algorithms not just for learning but also for meta-learning or learning to learn,[META] to learn better learning algorithms through experience (now a very popular topic[DEC]). And then came our Miraculous Year 1990-91[MIR] at TU Munich,
(take all of this with a grain of salt, though[OMG1]).
Some of the material above was taken from previous AI Blog posts.[MIR] [DEC] [GOD21] [ZUS21] [LEI21] [AUT] [HAB2] [ARC06] [AC] [ATT] [DAN] [DAN1] [DL4] [GPUCNN5,8] [DLC] [FDL] [FWP] [LEC] [META] [MLP2] [MOST] [PLAN] [UN] [LSTMPG] [BP4] [DL6a] [HIN] [T22]
publication page and my
In 2022, we are celebrating the following works from a quarter-century ago.
1. Journal paper on Long Short-Term Memory, the most cited NN of the 20th century[MOST] (and basis of the most cited NN of the 21st).
all possible metaverses
3. Implementing artificial curiosity and creativity through generative adversarial agents that learn to design abstract, interesting computational experiments.
meta-reinforcement learning.
5. Journal paper on hierarchical Q-learning.
8. Journal paper on Low-Complexity Art, the Minimal Art of the Information Age.
J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity.
PDF.
The first paper on online planning with reinforcement learning recurrent neural networks (NNs) (more) and on generative adversarial networks
(more).
More.
PDF.
general system
PDF. (More on
artificial scientists and artists.)
PDF.
(more).
[AIB] J. Schmidhuber. AI Blog.
Includes variants of chapters of the AI Book.
PDF.
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. Schmidhuber
PDF.
More.
PS. (PDF.)
attentional component (the fixation controller)." See [MIR](Sec. 9)[R4].
J. Schmidhuber (AI Blog, 2005). Highlights of robot car history. Around
HTML.
PDF.
Precursor of modern backpropagation.[BP1-5]
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020).
Who invented backpropagation?
PDF.
PDF.
PDF.
J. Schmidhuber (AI Blog, 2021).
10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named
J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition.
the artificial neural network called DanNet
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, updated 2021, 2022). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The
1991 NN distillation procedure,[UN0-2][MIR](Sec. 2)
More.
Local copy (HTML only).
[DL4] J. Schmidhuber (AI Blog, 2017).
Our impact on the world
LSTM.
J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). The deep reinforcement learning & neuroevolution developed in Schmidhuber
this type of deep learning dates back to Schmidhuber
[DLC] J. Schmidhuber (AI Blog, June 2015).
Critique of Paper by self-proclaimed[DLC2] "Deep Learning Conspiracy" (Nature 521 p 436).
J. Schmidhuber (AI Blog, 2022).
Annotated History of Modern AI and Deep Learning. Technical Report IDSIA-22-22, IDSIA, Lugano, Switzerland, 2022.
and PhDs in computer science. More.
Alphastar has a "deep LSTM core."
used LSTM
[FDL] J. Schmidhuber (AI Blog, 2013). My First Deep Learning System of 1991 + Deep Learning Timeline 1960-2013.
PDF.
J. Schmidhuber (AI Blog, 26 March 2021, updated 2022).
the attention terminology[FWP2] now used
PDF.
HTML.
Pictures (German).
HTML overview.
can be found here.
PDF.
OCR-based PDF scan of pages 94-135 (see pages 119-120).
More.
Cognitive Computation 1(2):177-193, 2009. PDF.
More.
win four important computer vision competitions 2011-2012 before others won any
PDF.
HTML overview.
[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More.
DanNet,[DAN,DAN1][R6]
J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet was the first CNN to win one, and won 4 of them in a row before the similar AlexNet/VGG Net and the Resnet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision.
PDF.
[GPUCNN8] J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet).
first deep learner to win a medical imaging contest (2012). Link.
J. Schmidhuber (Blog, 2000). Most influential persons of the 20th century (according to Nature, 1999). The Haber-Bosch process has often been called the most important invention of the 20th century[HAB1]
Schmidhuber
[HIN] J. Schmidhuber (AI Blog, 2020). Critique of Honda Prize for Dr. Hinton. Science must not allow corporate PR to distort the academic record. See also [T22].
previous related work.[BB2][NAN1-4][NHE][MIR](Sec. 15, Sec. 17)[FWPMETA6]
PDF.
More.
More.
[LEC] J. Schmidhuber (AI Blog, 2022). LeCun
LeCun also listed the "5 best ideas 2012-2022" without mentioning that
[LEI21] J. Schmidhuber (AI Blog, 2021). 375th birthday of Leibniz, founder of computer science.
Frankfurter Allgemeine Zeitung (FAZ), 17/5/2021. FAZ online:
[LEI21b] J. Schmidhuber (AI Blog, 2021). 375. Geburtstag des Herrn Leibniz, dem Vater der Informatik (375th birthday of Leibniz, the father of computer science).
PDF.
PDF.
PDF.
PDF.
J. Schmidhuber (AI Blog, Dec 2020). 10-year anniversary of our journal paper on deep reinforcement learning with policy gradients for LSTM (2007-2010). Recent
Minsky was apparently unaware of this and failed to correct it later.[HIN](Sec. I)[T22](Sec. XIII)
J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of
Searchable PDF scan (created by OCRmypdf which uses
LSTM).
HTML.
better GP methods through Meta-Evolution. More.
[MIR] J. Schmidhuber (AI Blog, Oct 2019, updated 2021, 2022). Deep Learning: Our Miraculous Year 1990-1991. Preprint
(AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training.
J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in Schmidhuber
image recognition competitions), (4) GANs (instances of my earlier Adversarial Artificial Curiosity), and (5) variants of Transformers (Transformers with linearized self-attention are formally equivalent to the much earlier Fast Weight Programmers).
Annus Mirabilis of 1990-1991.[MIR]
PDF.
PDF.
[NASC1] J. Schmidhuber. First Pow(d)ered flight / plane truth. Correspondence, Nature, 421 p 689, Feb 2003.
[NASC3] J. Schmidhuber. The last inventor of the telephone. Letter, Science, 319, no. 5871, p. 1759, March 2008.
[NASC6] J. Schmidhuber. Colossus was the first electronic digital computer. Correspondence, Nature, 441 p 25, May 2006.
[NASC7] J. Schmidhuber. Turing
[NASC8] J. Schmidhuber. Prototype resilient, self-modeling robots. Correspondence, Science, 316, no. 5825 p 688, May 2007.
[NASC9] J. Schmidhuber. Comparing the legacies of Gauss, Pasteur, Darwin. Correspondence, Nature, vol 452, p 530, April 2008.
HTML.
Link.
Compare Konrad Zuse
An LSTM composes 84% of the model
2018. An LSTM with 84% of the model
J. Schmidhuber (Blog, 2006).
Is History Converging? Again?
HTML.
HTML overview.
OOPS source code in crystalline format.
HTML.
J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, and
the GAN principle
More.
More.
More.
PDF. More.
PDF.
J. Schmidhuber (AI Blog, 2001). Raw Computing Power.
PDF.
This experimental analysis of backpropagation did not cite the origin of the method,[BP1-5] also known as the reverse mode of automatic differentiation.
Minsky was apparently unaware of this and failed to correct it later.[HIN](Sec. I)[T22](Sec. XIII)
PDF.
Local copy 1 (HTML only).
Local copy 2 (HTML only).
[T20a] J. Schmidhuber (AI Blog, 25 June 2020). Critique of 2018 Turing Award for Drs. Bengio & Hinton & LeCun. A precursor of [T22].
[T22] J. Schmidhuber (AI Blog, 2022).
Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21, IDSIA, Lugano, Switzerland, 2022.
attention terminology in 1993.[ATT][FWP2][R4]
[TUR21] J. Schmidhuber (AI Blog, Sep 2021). Turing Oversold. It
J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised or self-supervised pre-training. Unsupervised
PDF.
neural knowledge distillation procedure
The systems of 1991 allowed for much deeper learning than previous methods. More.
approaches are now widely used. More.
the first NNs shown to solve very deep problems.
Theory of Universal Learning Machines & Universal AI.
[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF. More on the Fundamental Deep Learning Problem.
unsupervised pre-training is not necessary
Local copy (plain HTML only).
Schmidhuber
a general, practical, program-controlled computer.
J. Schmidhuber (AI Blog, 2021). 80th anniversary celebrations: 1941: Konrad Zuse completes the first working general computer, based on his 1936 patent application.
J. Schmidhuber (AI Blog, 2021). 80. Jahrestag: 1941: Konrad Zuse baut ersten funktionalen Allzweckrechner, basierend auf der Patentanmeldung von 1936 (80th anniversary: 1941: Konrad Zuse builds the first working general-purpose computer, based on the patent application of 1936).
PDF.
modern backpropagation
principles of generative adversarial NNs and artificial curiosity (1990),[AC90,90b][AC20]
unsupervised pre-training for deep NNs (1991),[UN1-2]
vanishing gradients (1991)[VAN1] &
Long Short-Term Memory or LSTM (Sec. A),
NNs with over 100 layers (2015),[HW1-3][R5]
fast weight programmers (1991).[FWP0-2,6]
Often LBH failed to cite essential prior work, even in their later surveys.[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5, R7-R8]
The deep NNs
By the 2010s,[DEC] they were
Long Short-Term Memory
vanishing gradient problem
(Sec. 3, Sec. 4)
through "forget gates" based on end-to-end-differentiable fast weights.[MIR](Sec. 8)[FWP,FWP0-1]
became the first recurrent NN (RNN) to win international competitions.
LSTM[MIR](Sec. 4)
However, such attention mechanisms also
have their roots in my lab (1991);[FWP][FWP0-2,6]
In the 2010s,
Alphastar whose brain has a deep LSTM core trained by PG.[DM3]
Bill Gates called this a "huge milestone in advancing artificial intelligence".[OAI2a][MIR](Sec. 4)[LSTMPG]
in healthcare,
D. Computer Vision was revolutionized in the 2010s by
In 1987, NNs with convolutions were combined by Waibel with weight sharing and backpropagation.[CNN1a] Waibel did not call them CNNs but TDNNs.
unsupervised pre-training is not necessary
Our fast GPU-based CNN of 2011[GPUCNN1] known as DanNet[DAN,DAN1][R6]
winning four of them
at IJCNN 2011 in Silicon Valley, DanNet blew away the competition and achieved the first superhuman visual pattern recognition[DAN1] in an international contest (where LeCun
DanNet was also the first deep CNN to win:
a Chinese handwriting contest (ICDAR 2011),
an image segmentation contest (ISBI, May 2012),
further extended the work of 2011.[MIR](Sec. 19)
most cited neural network,[MOST] is a version (with open gates) of our earlier
Highway Net (May 2015).[HW1-3][R5] The Highway Net is actually the feedforward net version of vanilla LSTM.[LSTM2] It was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers).
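This relationship can be sketched in a few lines (an illustrative toy in my own notation and with my own names, not code from [HW1] or [HW3]): a highway layer gates between a transformed signal and the unchanged input, and a layer with its gates fixed open reduces to the residual layer later popularized by ResNets.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, W_t, b_t):
    """y = t * h(x) + (1 - t) * x, with a learned transform gate t
    (the gating mechanism is borrowed from LSTM)."""
    h = np.tanh(W_h @ x)          # candidate transformation
    t = sigmoid(W_t @ x + b_t)    # transform gate in (0, 1)
    return t * h + (1.0 - t) * x  # gated mix of transform and carry

def residual_layer(x, W_h):
    """A highway layer whose transform and carry gates are both fixed
    open: y = h(x) + x, i.e. the ResNet layer."""
    return np.tanh(W_h @ x) + x

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)
W_h, W_t, b_t = rng.normal(size=(d, d)), rng.normal(size=(d, d)), np.zeros(d)
print(highway_layer(x, W_h, W_t, b_t).shape, residual_layer(x, W_h).shape)
```

The transform gate plays the same role as an LSTM gate: it decides, per unit, how much of the signal is transformed and how much is carried through unchanged.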
appeared long before the 1980s.
containing the now popular multiplicative gates).[DEEP1-2][DL1-2] A paper of 1971[DEEP2] already described a deep learning net with 8 layers, trained by their highly cited method which was still popular in the new millennium,[DL2] especially in Eastern Europe, where much of Machine Learning was born.[MIR](Sec. 1)[R8] LBH failed to cite this, just like they failed to cite Amari,[GD1] who in 1967 proposed stochastic gradient descent[STO51-52] (SGD) for MLPs and whose implementation[GD2,GD2a] (with Saito) learned internal representations at a time when compute was billions of times more expensive than today (see also Tsypkin
Minsky was apparently unaware of this and failed to correct it later.[HIN](Sec. I)
However, it became really deep in 1991 in my lab,[UN-UN3] which has
First Very Deep NNs, Based on Unsupervised Pre-Training (1991).
more.)
drove the shift
from unsupervised pre-training to purely supervised learning (1991-95; 2006-10).[HIN](Sec. II)[MIR]
(Sec. 19)
LSTMs
brought essentially unlimited depth to gradient-based supervised recurrent NNs; Highway Nets[HW1-3] brought it to feedforward NNs.[MOST]
by others (Sec.
modern backpropagation
principles of generative adversarial NNs and artificial curiosity (1990),[AC90,90b][AC20]
unsupervised pre-training for deep NNs,[UN1-2]
the vanishing gradient problem (1991)[VAN1] &
solutions to it (Sec. A),
Often LBH failed to cite essential prior work.[DLC][HIN][MIR](Sec. 21)
we had this type of deep learning already in 1991;[UN][UN1-2] see Sec.
any type of computation-based AI.[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]
In 1936, Turing
my reply to Hinton
who criticized my website on Turing
Likewise, Konrad Zuse (1910-1995)
created the world
modern backpropagation
principles of generative adversarial NNs and artificial curiosity (1990),[AC][AC90,90b][AC10][AC20]
unsupervised pre-training for deep NNs (1991),[UN1-2][UN]
vanishing gradients (1991)[VAN1] &
solutions to it (Sec. A),[LSTM0-17][CTC]
record-breaking deep supervised NNs
and contest-winning deep CNNs (2011),[DAN][DAN1][GPUCNN5]
NNs with over 100 layers (2015),[HW1-3][R5]
fast weight programmers (1991),[FWP0-2,6]
Often LBH failed to cite essential prior work.[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5,R7,R8,R11]
"advances in natural language processing" and in speech
supervised NNs and
CNNs
and through Highway Net-like NNs (2015),[HW1-3][R5] although the principles of CNNs were invented and developed by others since the 1970s.[CNN1-4] See Sec. D & XVIII & XIV
as well as Sec. 4 & Sec. 19 of the overview.[MIR]
DanNet[DAN][DAN1][GPUCNN5]
the first NN to win a medical imaging contest through deep learning
approach of
who failed to cite them, even in later
the first machine learning
others built careers on this notion long before LBH recognized this.[DEEP1-2][CNN1][HIN][R8][DL1][DLC] Even deep learning through unsupervised pre-training was introduced by others.[UN1-3][R4][HIN](Sec. II)
made GPU-based NNs fast and deep enough
unsupervised pre-training (pioneered by myself in 1991)
is not necessary
our CNNs were deep and fast enough[DAN][DAN1][GPUCNN5]
Furthermore, by the mid 2010s, speech recognition and machine translation
(and apparently even other award committees[HIN](Sec. I))
recent debate:[HIN] It is true that in 2018,
fast deep CNN called DanNet
as well as Sec. 19 of the overview.[MIR]
In the 2010s,
was actually the LSTM of our team,[LSTM0-6] which Bloomberg called "arguably the most commercial AI achievement."[AV1][MIR](Sec. 4) See Sec. B.
adaptive neural sequential attention: end-to-end-differentiable
"soft" attention in the latent space of Fast Weight Programmers (FWPs),[FWP2][FWP] and "hard" attention (in observation space) in the context of RL[ATT][ATT0-1] (1990).
FWPs of 1991[FWP0-1]
which have become a popular alternative to RNNs.
the 2010s,[DEC]
the attention terminology[FWP2] now used
See[MIR](Sec. 9)[R4] for my related priority dispute on attention with Hinton.
a simple application[AC]
of the adversarial curiosity (AC) principle
the other
(1991).[PM1-2][AC20][R2][MIR](Sec. 5)
vanishing gradient problem,[MIR](Sec. 3)[VAN1] Bengio published his own,[VAN2] without citing Sepp.
my publications on exactly this topic
Regarding attention-based Transformers,[TR1-6] Bengio[DL3a] cites his own team (2014) for "soft attention" without citing my much earlier original work of 1991-1993 on soft attention and linear Transformers.[FWP,FWP0-2,6]
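The formal equivalence is easy to state (a minimal sketch in my own notation, not code from the cited papers): a fast weight matrix programmed by additive outer products of value and key vectors, then applied to a query, yields exactly the output of unnormalized linear attention.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 6                      # feature dimension, sequence length
K = rng.normal(size=(T, d))      # keys
V = rng.normal(size=(T, d))      # values
Q = rng.normal(size=(T, d))      # queries

# Fast Weight Programmer view (1991): a slow net emits (k_t, v_t) pairs
# that program a fast weight matrix W via additive outer products.
W = np.zeros((d, d))
fwp_out = []
for t in range(T):
    W += np.outer(V[t], K[t])    # program the fast weights
    fwp_out.append(W @ Q[t])     # apply them to the current query

# Linear attention view: out_t = sum over s <= t of (k_s . q_t) * v_s
lin_out = [sum((K[s] @ Q[t]) * V[s] for s in range(t + 1)) for t in range(T)]

assert np.allclose(fwp_out, lin_out)  # the two formulations coincide
```

The fast weight matrix thus acts as the attention mechanism's associative memory: each step writes a key-value association into W, and the query reads it out.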
unsupervised pre-training
for deep NNs.[UN0-4][HIN](Sec. II)[MIR](Sec. 1)
the first NNs shown to solve very deep problems
compressing or distilling
one NN into another.[UN0-2][DIST1-2][MIR](Sec. 2)
fast weight programmers[FWP][FWP0-4a]
through tensor-like outer products (1991-2016) and their motivation[FWP2][FWP4a][MIR](Sec. 8) (see also Sec. XVI above).
learning sequential attention
with NNs.[MIR](Sec. 9)
our much earlier work on this[ATT1][ATT] although
However, the basic CNN architecture with convolutional and downsampling layers is actually due to Fukushima (1979).[CNN1] NNs with convolutions were later (1987) combined by Waibel with weight sharing and backpropagation.[CNN1a] Waibel called them TDNNs and
at IJCNN 2011 in Silicon Valley, our DanNet[DAN][GPUCNN1-3] won the
superhuman performance
at ICPR 2012, our DanNet[GPUCNN1-3] won the
medical imaging contest
All of these fields were heavily shaped in the 2010s by our non-CNN methods.[DL1][DL4][AM16][GSR][GSR15][GT16][WU][FB17] See
As mentioned in Sec. XII, backpropagation was actually proposed earlier as a learning method for NNs by Werbos (1982)[BP2-4] (see also Amari
adaptive subgoal generators
(1991)[HRL0-2] were trained through end-to-end-differentiable chains of such modules.[MIR](Sec. 10)
planning and reinforcement learning with recurrent neural world models
(1990).[PLAN][MIR](Sec. 11) Same for my linear transformer-like
fast weight programmers[FWP0-2][FWP][ATT][MIR](Sec. 8) since 1991 (see Sec. XVI)
ad hominem attacks[AH2-3][HIN]
As emphasized earlier:[DLC][HIN]
GANs are variations
LBH, who called themselves the deep learning conspiracy,[DLC][DLC1-2]
unsupervised pre-training for deep NNs
by myself
our deep and fast DanNet (2011)[GPUCNN1-3] as
Backpropagation is a previously invented
artificial curiosity
by ours[UN0-2][UN]
So virtually all the algorithms that have attracted
our LSTM
brought essentially unlimited depth to gradient-based supervised recurrent NNs in the 1990s; our Highway Nets[HW1-3] brought it to feedforward NNs in May 2015.[MOST]
medical diagnosis (2012, see Sec. VII, XVIII), and many other applications.[DEC]
and many other applications.[DEC]
As mentioned earlier,[MIR](Sec. 21)
it is not always clear[DLC]
backpropagation
AI scientists and AI historians
equipped with artificial curiosity[SA17][AC90-AC20][PP-PP2][R1]
publication page and my
J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. Our
PDF.
The first paper on planning with reinforcement learning recurrent neural networks (NNs) (more) and on generative adversarial networks
(more).
More.
PDF.
PDF. (More on
artificial scientists and artists.)
PDF.
(more).
[AIB] J. Schmidhuber. AI Blog.
Includes variants of chapters of the AI Book.
PDF.
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. We had both hard attention (1990) and soft attention (1991-93).[FWP] Today, both types are very popular.
PDF.
More.
PS. (PDF.)
HTML.
PDF.
Precursor of modern backpropagation.[BP1-4]
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020).
Who invented backpropagation?
PDF.
PDF.
PDF.
J. Schmidhuber (AI Blog, 2021).
10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named
J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition.
our artificial neural network called DanNet
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, revised 2021). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The
More.
Local copy (HTML only).
[DL4] J. Schmidhuber (AI Blog, 2017).
Our impact on the world
LSTM.
J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.[DL6] Soon after its publication, everybody started talking about "deep learning." Causality or correlation?
this type of deep learning dates back to 1991.[UN1-2][UN]
[DLC] J. Schmidhuber (AI Blog, June 2015).
Critique of Paper by self-proclaimed[DLC1-2] "Deep Learning Conspiracy" (Nature 521 p 436).
Alphastar has a "deep LSTM core."
used LSTM
PDF.
J. Schmidhuber (AI Blog, 26 March 2021).
the attention terminology[FWP2] now used
PDF.
HTML.
Pictures (German).
HTML overview.
can be found here.
PDF.
OCR-based PDF scan of pages 94-135 (see pages 119-120).
win four important computer vision competitions 2011-2012 before others won any
PDF.
HTML overview.
[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More.
J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet won 4 of them in a row before the similar AlexNet/VGG Net and the Resnet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision.
PDF.
first deep learner to win a medical imaging contest (2012). HTML.
[HIN] J. Schmidhuber (AI Blog, 2020). Critique of Honda Prize for Dr. Hinton. Science must not allow corporate PR to distort the academic record.
(v1: 24 Sep 2021, v2: 31 Dec 2021.) Versions since 2021 are archived in the Internet Archive.
deep learning survey,[DL1] and can also be seen as a short history of the deep learning revolution, at least as far as ACM
2015 survey of deep learning[DL1]
The June 2020 article[T20a][R12] was version 1 of the present report.
expands material in my Critique of the 2019 Honda Prize[HIN] (~3,000 words).
The text contains numerous hyperlinks to relevant overview sites from the AI Blog.
Most of the critiques are based on references to original papers and material from the AI Blog.[AIB][MIR][DEC][HIN]
this class of methods was pioneered in 1991[UN-UN2] (see Sec. II, III).
Highway Net,
were all driven by my lab:[MOST] In 1991, I had the first very deep NNs based on unsupervised pre-training;[UN-UN2] LSTMs brought essentially unlimited depth to supervised recurrent NNs; later our Highway Nets[HW1-3] brought it to feedforward NNs.
based on LSTM[LSTM0-6] (1990s-2005) and CTC (2006).[CTC]
our CTC-LSTM-based speech recognition (not that of Hinton) had been on most smartphones for years[GSR][GSR15-19][DL4] (see Sec. A, VI, XI, XV). Similarly for machine translation (see Sec. B).
called AlexNet,[GPUCNN4] without mentioning that our earlier groundbreaking deep GPU-based DanNet[GPUCNN1-3,5-8][DAN] did not need ReLUs at all to win 4 earlier object recognition competitions and to achieve superhuman results already in 2011[GPUCNN1-8][R5-6] (see Sec. XIV).
described in the 1991-93 papers on Fast Weight Programmers and linear Transformers[FWP0-1,6] (see Sec. XVI, XVII-2).
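The correspondence between Fast Weight Programmers and linear Transformers mentioned here can be sketched in a few lines. This is a minimal illustration of the additive outer-product fast weight update (function and variable names are mine; the slow network that would produce the key/value/query vectors is omitted):

```python
import numpy as np

def fast_weight_step(W_fast, k, v, q):
    # Program the fast weight matrix with an additive outer product,
    # then apply it to the query. This is the unnormalized
    # linear-attention form of the 1991-style fast weight update.
    W_fast = W_fast + np.outer(v, k)
    return W_fast, W_fast @ q

# Store two key/value pairs under orthonormal keys, then retrieve:
d = 3
W = np.zeros((d, d))
k1, k2 = np.eye(d)[0], np.eye(d)[1]
v1, v2 = np.array([1., 2., 3.]), np.array([4., 5., 6.])
W, _ = fast_weight_step(W, k1, v1, k1)
W, out = fast_weight_step(W, k2, v2, k1)   # query with k1
# out == v1: the fast weights retrieve the value stored under k1
```

Because the updates are additive outer products, querying the fast weight matrix is equivalent to unnormalized linear attention over all stored key/value pairs.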
GANs are instances
of the Adversarial Curiosity Principle of 1990[AC90-20][MIR](Sec. 5) (see Sec. XVII).
it became really deep in 1991 in my lab, through unsupervised pre-training of NNs and supervised LSTM.
combines two methods from my lab: LSTM (1990s-2005) and CTC (2006), which were combined into our revolutionary CTC-LSTM, soon on most smartphones.
(soon used for several billions of
was also based on our LSTM.
most visible breakthroughs
deep NNs
superior computer vision in 2011,
winning 4 image recognition contests in a row
is an open-gated version of our earlier Highway Nets.
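The relation between residual and highway layers noted here can be made concrete. Below is a minimal sketch (my own notation, single-weight layers for brevity) showing that a residual block is a highway layer with both gates held open:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, Wh, bh, Wt, bt, Wc, bc):
    # One highway layer: a transform gate t and a carry gate c blend the
    # nonlinear path h(x) with the identity path x.
    h = np.tanh(Wh @ x + bh)
    t = sigmoid(Wt @ x + bt)   # transform gate in (0, 1)
    c = sigmoid(Wc @ x + bc)   # carry gate (often coupled as c = 1 - t)
    return t * h + c * x

def residual_block(x, Wh, bh):
    # A single-weight residual block: the special case with both
    # gates held open (t = c = 1), so the identity is always added.
    return np.tanh(Wh @ x + bh) + x
```

With large positive gate biases, both sigmoids saturate near 1 and the highway layer's output coincides with the residual block's.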
deep & fast CNN
(where LeCun participated),
deep GPU-NN of 2010
debunked unsupervised pre-training (introduced by myself in 1991 and later championed by Hinton),
and our GPU-CNN of 2011 (DanNet) was the first
first to win medical imaging competitions
backpropagation
CTC-LSTM
We started this in 1990-93, long before LBH.
Artificial Curiosity
vanishing gradients (1991),
metalearning (1987),
unsupervised pre-training (1991),
compressing or distilling one NN into another (1991),
learning sequential attention with NNs (1990),
fast weight programmers using tensor-like outer products (1991),
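The vanishing gradient problem listed above can be demonstrated numerically. The sketch below (function name and setup are mine) backpropagates an error signal through a chain of sigmoid units; the sigmoid's derivative never exceeds 0.25, so the gradient shrinks exponentially with depth:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_through_depth(depth, seed=0):
    """Backpropagate an error signal of 1.0 through `depth` chained
    sigmoid units with unit weights; each step multiplies the gradient
    by sigmoid'(a) = sigmoid(a) * (1 - sigmoid(a)) <= 0.25."""
    rng = np.random.default_rng(seed)
    g = 1.0
    for _ in range(depth):
        a = rng.standard_normal()   # pre-activation of this unit
        s = sigmoid(a)
        g *= s * (1.0 - s)          # chain rule: multiply by local derivative
    return abs(g)

shallow = gradient_through_depth(5)    # still a usable gradient
deep = gradient_through_depth(50)      # vanishes: bounded by 0.25**50
```

This is why plain deep or recurrent nets were hard to train by gradient descent, and why architectures with near-identity paths (LSTM, Highway Nets) avoid the problem.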
Sec. IV is on Turing (1936) and his predecessors
In the recent decade of deep learning,
(speech recognition, language translation, etc.) on billions of devices (also healthcare applications)
by others (Sec. modern backpropagation principles of generative adversarial NNs and artificial curiosity (1990),[AC90,90b][AC20] unsupervised pre-training for deep NNs,[UN1-2] the vanishing gradient problem (1991)[VAN1] & solutions to it (Sec. A). Often LBH failed to cite essential prior work.[DLC][HIN][MIR](Sec. 21) we had this type of deep learning already in 1991;[UN][UN1-2] see Sec.
any type of computation-based AI.[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]
In 1936, Turing
my reply to Hinton, who criticized my website on Turing
Likewise, Konrad Zuse (1910-1995) created the world's first working general computer (1941).
modern backpropagation principles of generative adversarial NNs and artificial curiosity (1990),[AC][AC90,90b][AC10][AC20] unsupervised pre-training for deep NNs (1991),[UN1-2][UN] vanishing gradients (1991)[VAN1] & solutions to it (Sec. A),[LSTM0-17][CTC] record-breaking deep supervised NNs and contest-winning deep CNNs (2011),[DAN][DAN1][GPUCNN5] NNs with over 100 layers (2015),[HW1-3][R5] fast weight programmers (1991).[FWP0-2,6] Often LBH failed to cite essential prior work.[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5,R7,R8,R11]
"advances in natural language processing" and in speech supervised NNs and CNNs and through Highway Net-like NNs (2015),[HW1-3][R5] although the principles of CNNs were invented and developed by others since the 1970s.[CNN1-4] See Sec. D & XVIII & XIV as well as Sec. 4 & Sec. 19 of the overview.[MIR]
DanNet[DAN][DAN1][GPUCNN5] was the first NN to win a medical imaging contest through the deep learning approach of
who failed to cite them, even in later
the first machine learning
others built careers on this notion long before LBH recognized this.[DEEP1-2][CNN1][HIN][R8][DL1][DLC] Even deep learning through unsupervised pre-training was introduced by others.[UN1-3][R4][HIN](Sec. II)
made GPU-based NNs fast and deep enough
unsupervised pre-training (pioneered by myself in 1991) is not necessary
our CNNs were deep and fast enough[DAN][DAN1][GPUCNN5]
Furthermore, by the mid 2010s, speech recognition and machine translation
(and apparently even other award committees[HIN](Sec. I)
recent debate:[HIN] It is true that in 2018,
fast deep CNN called DanNet
as well as Sec. 19 of the overview.[MIR]
In the 2010s,
was actually the LSTM of our team,[LSTM0-6] which Bloomberg called "arguably the most commercial AI achievement."[AV1][MIR](Sec. 4) See Sec. B.
adaptive neural sequential attention: end-to-end-differentiable "soft" attention in the latent space of Fast Weight Programmers (FWPs),[FWP2][FWP] and "hard" attention (in observation space) in the context of RL[ATT][ATT0-1] (1990).
FWPs of 1991[FWP0-1] have become a popular alternative to RNNs.
the 2010s,[DEC]
the attention terminology[FWP2] now used
See[MIR](Sec. 9)[R4] for my related priority dispute on attention with Hinton.
a simple application[AC] of the adversarial curiosity (AC) principle
the other (1991).[PM1-2][AC20][R2][MIR](Sec. 5)
vanishing gradient problem,[MIR](Sec. 3)[VAN1]
Bengio published his own,[VAN2] without citing Sepp.
my publications on exactly this topic
Regarding attention-based Transformers,[TR1-6] Bengio[DL3a] cites his own team (2014) for "soft attention" without citing my much earlier original work of 1991-1993 on soft attention and linear Transformers.[FWP,FWP0-2,6]
unsupervised pre-training for deep NNs.[UN0-4][HIN](Sec. II)[MIR](Sec. 1)
the first NNs shown to solve very deep problems
compressing or distilling one NN into another.[UN0-2][DIST1-2][MIR](Sec. 2)
fast weight programmers[FWP][FWP0-4a] through tensor-like outer products (1991-2016) and their motivation[FWP2][FWP4a][MIR](Sec. 8) (see also Sec. XVI above).
learning sequential attention with NNs.[MIR](Sec. 9)
our much earlier work on this[ATT1][ATT] although
However, the basic CNN architecture with convolutional and downsampling layers is actually due to Fukushima (1979).[CNN1] NNs with convolutions were later (1987) combined by Waibel with weight sharing and backpropagation.[CNN1a] Waibel called this TDNN.
At IJCNN 2011 in Silicon Valley, our DanNet[DAN][GPUCNN1-3] won with superhuman performance;
at ICPR 2012, our DanNet[GPUCNN1-3] won the medical imaging contest.
All of these fields were heavily shaped in the 2010s by our non-CNN methods.[DL1][DL4][AM16][GSR][GSR15][GT16][WU][FB17] See
As mentioned in Sec. XII, backpropagation was actually proposed earlier as a learning method for NNs by Werbos (1982)[BP2-4] (see also Amari
adaptive subgoal generators (1991)[HRL0-2] were trained through end-to-end-differentiable chains of such modules.[MIR](Sec. 10)
planning and reinforcement learning with recurrent neural world models (1990).[PLAN][MIR](Sec. 11)
Same for my linear transformer-like fast weight programmers[FWP0-2][FWP][ATT][MIR](Sec. 8) since 1991 (see Sec. XVI).
ad hominem attacks[AH2-3][HIN]
As emphasized earlier:[DLC][HIN] GANs are variations
LBH, who called themselves the deep learning conspiracy,[DLC]
unsupervised pre-training for deep NNs by myself
our deep and fast DanNet (2011)[GPUCNN1-3]
Backpropagation is a previously invented
artificial curiosity by ours[UN0-2][UN]
So virtually all the algorithms that have attracted
our LSTM brought essentially unlimited depth to supervised recurrent NNs in the 1990s; our Highway Nets[HW1-3] brought it to feedforward NNs in May 2015.[MOST]
medical diagnosis (2012, see Sec. VII, XVIII), and many other applications.[DEC]
As mentioned earlier,[MIR](Sec. 21) it is not always clear[DLC]
backpropagation
AI scientists and AI historians equipped with artificial curiosity[SA17][AC90-AC20][PP-PP2]
publication page and my
J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity.
The first paper on planning with reinforcement learning recurrent neural networks (NNs) and on generative adversarial networks. (More on artificial scientists and artists.)
[AIB] J. Schmidhuber. AI Blog. Includes variants of chapters of the AI Book.
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. We had both hard attention (1990) and soft attention (1991-93).[FWP] Today, both types are very popular.
Precursor of modern backpropagation.[BP1-4]
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020). Who invented backpropagation?
J. Schmidhuber (AI Blog, 2021). 10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named
[LEI21] J. Schmidhuber (AI Blog, 2021). 375th birthday of Leibniz, founder of computer science. Frankfurter Allgemeine Zeitung (FAZ), 17/5/2021.
J. Schmidhuber (AI Blog, Dec 2020). 10-year anniversary of our journal paper on deep reinforcement learning with policy gradients for LSTM (2007-2010).
J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of
Searchable PDF scan (created by OCRmypdf, which uses LSTM).
better GP methods through Meta-Evolution.
[MIR] J. Schmidhuber (AI Blog, 2019). Deep Learning: Our Miraculous Year 1990-1991. Preprint
(AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training.
[MOST] J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in my labs at TU Munich and IDSIA. Here I mention: (1) Long Short-Term Memory (LSTM), (2) ResNet (which is our earlier Highway Net with open gates), (3) AlexNet and VGG Net (both citing our similar earlier DanNet: the first deep convolutional NN to win image recognition competitions), (4) GANs (instances of my earlier Adversarial Artificial Curiosity), and (5) variants of Transformers (linear Transformers are formally equivalent to my earlier Fast Weight Programmers). Annus Mirabilis of 1990-1991.[MIR]
[NASC1] J. Schmidhuber. First Pow(d)ered flight / plane truth. Correspondence, Nature, 421, p 689, Feb 2003.
[NASC3] J. Schmidhuber. The last inventor of the telephone. Letter, Science, 319, no. 5871, p 1759, March 2008.
[NASC6] J. Schmidhuber. Colossus was the first electronic digital computer. Correspondence, Nature, 441, p 25, May 2006.
[NASC7] J. Schmidhuber. Turing
[NASC8] J. Schmidhuber. Prototype resilient, self-modeling robots. Correspondence, Science, 316, no. 5825, p 688, May 2007.
[NASC9] J. Schmidhuber. Comparing the legacies of Gauss, Pasteur, Darwin. Correspondence, Nature, 452, p 530, April 2008.
An LSTM composes 84% of the model's total parameter count (2018).
J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, the GAN principle.
This experimental analysis of backpropagation did not cite the origin of the method,[BP1-4] also known as the reverse mode of automatic differentiation.
[T20a] J. Schmidhuber (AI Blog, 25 June 2020). Critique of 2018 Turing Award for Drs. Bengio & Hinton & LeCun. The first version of the present critique.
[TUR21] J. Schmidhuber (AI Blog, Sep 2021). Turing Oversold.
J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised pre-training. Unsupervised approaches are now widely used. More on the Fundamental Deep Learning Problem.
unsupervised pre-training is not necessary
a general, practical, program-controlled computer.
J. Schmidhuber (AI Blog, 2021). 80th anniversary celebrations: 1941: Konrad Zuse completes the first working general computer, based on his 1936 patent application.