• Overview: http://people.idsia.ch/~juergen/deep-learning-overview.html

Scientific Integrity, the 2021 Turing Lecture, and the 2018 Turing Award for Deep Learning (AI Blog; v1: 24 Sep 2021, v2: 31 Dec 2021; versions since 2021 archived in the Internet Archive). This report expands material in my Critique of the 2019 Honda Prize[HIN] (~3,000 words) and in my June 2020 article,[T20a][R12] and supplements my 2015 survey of deep learning.[DL1] It can also be read as a short history of the deep learning revolution, at least as far as the award is concerned. The text contains numerous hyperlinks to relevant overview sites from the AI Blog; most of the critiques are based on references to original papers and material from the AI Blog.[AIB][MIR][DEC][HIN]

In a nutshell: unsupervised pre-training of deep NNs, the class of methods credited to Hinton, was pioneered in 1991 in my lab[UN-UN2] (see Sec. II, III). The most popular NNs, including LSTM and the Highway Net, were all driven by my lab:[MOST] in 1991, I had the first very deep NNs based on unsupervised pre-training;[UN-UN2] LSTMs later brought essentially unlimited depth to supervised recurrent NNs, and our Highway Nets[HW1-3] brought it to feedforward NNs. Modern speech recognition combines two methods from my lab: LSTM[LSTM0-6] (1990s-2005) and CTC (2006);[CTC] our revolutionary CTC-LSTM-based speech recognition (not that of Hinton) had been on most smartphones for years[GSR][GSR15-19][DL4] (see Sec. A, VI, XI, XV). Similarly for machine translation (see Sec. B), which was also based on our LSTM (soon used for several billions of translations). The Turing Lecture highlights the CNN called AlexNet,[GPUCNN4] without mentioning that our earlier groundbreaking deep GPU-based DanNet[GPUCNN1-3,5-8][DAN] did not need ReLUs at all to win 4 earlier object recognition competitions and to achieve superhuman results already in 2011[GPUCNN1-8][R5-6] (see Sec. XIV). Transformer-like architectures were described in the 1991-93 papers on Fast Weight Programmers and linear Transformers[FWP0-1,6] (see Sec. XVI, XVII-2). GANs are instances of the Adversarial Curiosity Principle of 1990[AC90-20][MIR](Sec. 5) (see Sec. XVII). Deep learning became really deep in 1991 in my lab, through unsupervised pre-training of NNs and supervised LSTM. Among the most visible breakthroughs: deep NNs achieved superior computer vision in 2011, winning 4 image recognition contests in a row; the ResNet is an open-gated version of our earlier Highway Nets; our deep & fast CNN won a contest in which LeCun's team participated; our deep GPU-NN of 2010 debunked unsupervised pre-training (introduced by myself in 1991 and later championed by Hinton); and our GPU-CNN of 2011 (DanNet) was the first to win medical imaging competitions. We started this in 1990-93, long before LBH: Artificial Curiosity (1990), vanishing gradients (1991), metalearning (1987), unsupervised pre-training (1991), compressing or distilling one NN into another (1991), learning sequential attention with NNs (1990), and fast weight programmers (1991). Sec. IV is on Turing (1936) and his predecessors. In the recent decade of deep learning, these methods have run on billions of devices (speech recognition, language translation, etc., also healthcare applications).

Critique of 2018 Turing Award. Much of what the award celebrates was published long before LBH: modern backpropagation; the principles of generative adversarial NNs and artificial curiosity (1990);[AC90,90b][AC20] unsupervised pre-training for deep NNs (1991);[UN1-2] vanishing gradients (1991)[VAN1] & Long Short-Term Memory or LSTM (Sec. A); NNs with over 100 layers (2015);[HW1-3][R5] and fast weight programmers (1991).[FWP0-2,6] Often LBH failed to cite essential prior work, even in their later surveys.[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5, R7-R8]

A. By the 2010s,[DEC] the deep NNs dominating sequence processing were based on Long Short-Term Memory, which overcomes the vanishing gradient problem (Sec. 3, Sec. 4) through "forget gates" based on end-to-end-differentiable fast weights[MIR](Sec. 8)[FWP,FWP0-1] (the generic cell equations are given after this section). LSTM became the first recurrent NN (RNN) to win international competitions.[MIR](Sec. 4) Today's LSTM-based systems often also use attention mechanisms; however, such attention mechanisms also have their roots in my lab (1991).[FWP][FWP0-2,6] In the 2010s, famous applications included Alphastar, whose brain has a deep LSTM core trained by PG.[DM3] Bill Gates called another LSTM-based system a "huge milestone in advancing artificial intelligence."[OAI2a][MIR](Sec. 4)[LSTMPG] LSTM is also used in healthcare.

D. Computer Vision was revolutionized in the 2010s by deep CNNs. In 1987, NNs with convolutions were combined by Waibel with weight sharing and backpropagation.[CNN1a] Waibel did not call this CNNs but TDNNs. Our fast GPU-based CNN of 2011,[GPUCNN1] known as DanNet,[DAN,DAN1][R6] showed that unsupervised pre-training is not necessary, winning four object recognition competitions in a row. At IJCNN 2011 in Silicon Valley, DanNet blew away the competition and achieved the first superhuman visual pattern recognition[DAN1] in an international contest (in which LeCun's team participated). DanNet was also the first deep CNN to win a Chinese handwriting contest (ICDAR 2011) and an image segmentation contest (ISBI, May 2012). AlexNet and later nets further extended the work of 2011.[MIR](Sec. 19) The ResNet, the most cited NN,[MOST] is a version (with open gates) of our earlier Highway Net (May 2015).[HW1-3][R5] The Highway Net is actually the feedforward net version of vanilla LSTM.[LSTM2] It was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers).
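Returning to Sec. A above: to make the forget-gate mechanism concrete, here is the LSTM cell update in generic modern notation (a standard textbook formulation, not a quotation from the cited papers). The additive update of the cell state c_t is what lets error signals flow across many time steps, and a forget gate f_t close to 1 keeps that path open:

```latex
\begin{aligned}
f_t &= \sigma(W_f[x_t; h_{t-1}] + b_f), \qquad
i_t = \sigma(W_i[x_t; h_{t-1}] + b_i), \qquad
o_t = \sigma(W_o[x_t; h_{t-1}] + b_o), \\
\tilde{c}_t &= \tanh(W_c[x_t; h_{t-1}] + b_c), \qquad
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad
h_t = o_t \odot \tanh(c_t).
\end{aligned}
```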

Critique of 2018 Turing Award. Deep learning appeared long before the 1980s. Ivakhnenko and Lapa published working deep learning nets in 1965, with layers already containing the now popular multiplicative gates.[DEEP1-2][DL1-2] A paper of 1971[DEEP2] already described a deep net with 8 layers, trained by their highly cited method, which was still popular in the new millennium,[DL2] especially in Eastern Europe, where much of Machine Learning was born.[MIR](Sec. 1)[R8] Ivakhnenko did not call it an NN, but that is what it was. LBH failed to cite this, just like they failed to cite Amari,[GD1] who in 1967 proposed stochastic gradient descent[STO51-52] (SGD) for multilayer perceptrons and whose implementation[GD2,GD2a] (with Saito) learned internal representations at a time when compute was billions of times more expensive than today (see also Tsypkin's related early work). Minsky was apparently unaware of this and failed to correct it later.[HIN](Sec. I) However, deep learning became really deep in 1991 in my lab,[UN-UN3] which produced the First Very Deep NNs, Based on Unsupervised Pre-Training (1991). (Sec. II has more.) My lab also drove the shift from unsupervised pre-training to purely supervised learning (1991-95; 2006-10).[HIN](Sec. II)[MIR](Sec. 19) LSTMs brought essentially unlimited depth to gradient-based supervised recurrent NNs; Highway Nets[HW1-3] brought it to feedforward NNs.[MOST]

Critique of 2018 Turing Award. The credited foundations were laid by others (see below): modern backpropagation, the principles of generative adversarial NNs and artificial curiosity (1990),[AC90,90b][AC20] unsupervised pre-training for deep NNs,[UN1-2] and the vanishing gradient problem (1991)[VAN1] & solutions to it (Sec. A). Often LBH failed to cite essential prior work.[DLC][HIN][MIR](Sec. 21) We had this type of deep learning already in 1991;[UN][UN1-2] see the sections above.

Critique of 2018 Turing Award. In 1931, Goedel founded theoretical computer science and identified fundamental limits of any type of computation-based AI.[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]

In 1936, Turing published his famous work on the limits of computation (see also my reply to Hinton, who criticized my website on Turing). Likewise, Konrad Zuse (1910-1995) created the world's first working programmable general-purpose computer (1935-1941).

Critique of 2018 Turing Award. As summarized above, the key techniques predate LBH's celebrated work: modern backpropagation; the principles of generative adversarial NNs and artificial curiosity (1990);[AC][AC90,90b][AC10][AC20] unsupervised pre-training for deep NNs (1991);[UN1-2][UN] vanishing gradients (1991)[VAN1] & solutions to it (Sec. A);[LSTM0-17][CTC] record-breaking deep supervised NNs and contest-winning deep CNNs (2011);[DAN][DAN1][GPUCNN5] NNs with over 100 layers (2015);[HW1-3][R5] and fast weight programmers (1991).[FWP0-2,6] Often LBH failed to cite essential prior work.[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5,R7,R8,R11]

Critique of 2018 Turing Award. The cited "advances in natural language processing" and in speech were driven by our LSTM-based methods (Sec. A, B), and the advances in vision by supervised NNs and CNNs and through Highway Net-like NNs (2015),[HW1-3][R5] although the principles of CNNs were invented and developed by others since the 1970s.[CNN1-4] See Sec. D & XVIII & XIV as well as Sec. 4 & Sec. 19 of the overview.[MIR]

Critique of 2018 Turing Award. DanNet[DAN][DAN1][GPUCNN5] was the first NN to win a medical imaging contest through a deep learning approach (2012).


Critique of 2018 Turing Award. Many of these techniques were created by others, who were not cited by LBH, even in later surveys.

Critique of 2018 Turing Award. Others created the first machine learning with deep nets and built careers on this notion long before LBH recognized it.[DEEP1-2][CNN1][HIN][R8][DL1][DLC] Even deep learning through unsupervised pre-training was introduced by others.[UN1-3][R4][HIN](Sec. II)

Critique of 2018 Turing Award. In 2010, my team made GPU-based NNs fast and deep enough to show that unsupervised pre-training (pioneered by myself in 1991) is not necessary; by 2011, our CNNs were deep and fast enough to win computer vision competitions.[DAN][DAN1][GPUCNN5]

Furthermore, by the mid 2010s, speech recognition and machine translation on billions of devices were based on our LSTM (see Sec. A, B).

Critique of 2018 Turing Award. The award committee (and apparently even other award committees[HIN](Sec. I)) seems to have been misled; see the recent debate.[HIN] It is true that in 2018, the award went to LBH.

Critique of 2018 Turing Award. See Sec. X.[MIR](Sec. 1)[R8]

Critique of 2018 Turing Award. The relevant contests were first won by our fast deep CNN called DanNet; see also Sec. 19 of the overview.[MIR]


Critique of 2018 Turing Award. In the 2010s, the workhorse behind the cited commercial successes was actually the LSTM of our team,[LSTM0-6] which Bloomberg called "arguably the most commercial AI achievement."[AV1][MIR](Sec. 4) See Sec. B. We also introduced adaptive neural sequential attention: end-to-end-differentiable "soft" attention in the latent space of Fast Weight Programmers (FWPs),[FWP2][FWP] and "hard" attention (in observation space) in the context of RL[ATT][ATT0-1] (1990). Transformers with linearized self-attention are formally equivalent to the FWPs of 1991,[FWP0-1] which have become a popular alternative to RNNs. In the 2010s,[DEC] others adopted the attention terminology[FWP2] now used widely.

See [MIR](Sec. 9)[R4] for my related priority dispute on attention with Hinton.

Critique of 2018 Turing Award. GANs are a simple application[AC] of the adversarial curiosity (AC) principle of 1990, in which two NNs fight each other; the closely related Predictability Minimization followed in 1991.[PM1-2][AC20][R2][MIR](Sec. 5) Sepp Hochreiter identified and analyzed the vanishing gradient problem in 1991;[MIR](Sec. 3)[VAN1] Bengio later published his own analysis,[VAN2] without citing Sepp's thesis or my publications on exactly this topic. Regarding attention-based Transformers,[TR1-6] Bengio[DL3a] cites his own team (2014) for "soft attention" without citing my much earlier original work of 1991-1993 on soft attention and linear Transformers.[FWP,FWP0-2,6] Similar remarks apply to unsupervised pre-training for deep NNs,[UN0-4][HIN](Sec. II)[MIR](Sec. 1) the first NNs shown to solve very deep problems, compressing or distilling one NN into another,[UN0-2][DIST1-2][MIR](Sec. 2) fast weight programmers[FWP][FWP0-4a] through tensor-like outer products (1991-2016) and their motivation[FWP2][FWP4a][MIR](Sec. 8) (see also Sec. XVI above), and learning sequential attention with NNs,[MIR](Sec. 9) where our much earlier work on this[ATT1][ATT] was not cited.

Critique of 2018 Turing Award. However, the basic CNN architecture with convolutional and downsampling layers is actually due to Fukushima (1979)[CNN1] (a minimal sketch of these two operations follows below). NNs with convolutions were later (1987) combined by Waibel with weight sharing and backpropagation.[CNN1a] Waibel called this TDNN rather than CNN. At IJCNN 2011 in Silicon Valley, our DanNet[DAN][GPUCNN1-3] won the traffic sign contest with superhuman performance; at ICPR 2012, our DanNet[GPUCNN1-3] won the medical imaging contest. All of these application fields were also heavily shaped in the 2010s by our non-CNN methods.[DL1][DL4][AM16][GSR][GSR15][GT16][WU][FB17] See also Sec. D.
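For readers unfamiliar with the terms, a minimal NumPy sketch of the two ingredients at issue, a convolutional layer followed by a downsampling (pooling) layer; the filter size, the tanh nonlinearity, and the max-pooling choice are illustrative assumptions, not a description of any particular historical network:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (really cross-correlation, as in most NN code)."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def downsample(feature_map, k=2):
    """k x k max pooling: the downsampling step of the classic architecture."""
    H, W = feature_map.shape
    H, W = H - H % k, W - W % k
    fm = feature_map[:H, :W].reshape(H // k, k, W // k, k)
    return fm.max(axis=(1, 3))

image = np.random.default_rng(0).normal(size=(12, 12))
kernel = np.ones((3, 3)) / 9.0                 # a simple averaging filter
features = np.tanh(conv2d(image, kernel))      # convolution + squashing nonlinearity
pooled = downsample(features)                  # spatial resolution halved
print(features.shape, pooled.shape)            # (10, 10) (5, 5)
```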

Critique of 2018 Turing Award. As mentioned in Sec. XII, backpropagation was actually proposed earlier as a learning method for NNs by Werbos (1982)[BP2-4] (see also Amari's related earlier work).


Critique of 2018 Turing Award. My adaptive subgoal generators (1991)[HRL0-2] were trained through end-to-end-differentiable chains of such modules,[MIR](Sec. 10) as was planning and reinforcement learning with recurrent neural world models (1990).[PLAN][MIR](Sec. 11) Same for my linear-Transformer-like fast weight programmers[FWP0-2][FWP][ATT][MIR](Sec. 8) since 1991 (see Sec. XVI). Instead of addressing such facts, some replies consisted of ad hominem attacks.[AH2-3][HIN] As emphasized earlier:[DLC][HIN] GANs are variations of our earlier artificial curiosity; unsupervised pre-training for deep NNs was introduced by myself;[UN0-2][UN] our deep and fast DanNet (2011)[GPUCNN1-3] preceded the similar AlexNet; and backpropagation is a previously invented method. LBH called themselves the "deep learning conspiracy."[DLC] So virtually all the algorithms that have attracted great attention in the deep learning era build on work from my labs: our LSTM brought essentially unlimited depth to gradient-based supervised recurrent NNs in the 1990s; our Highway Nets[HW1-3] brought it to feedforward NNs in May 2015.[MOST] Our methods also enabled deep-learning-based medical diagnosis (2012, see Sec. VII, XVIII) and many other applications.[DEC]

As mentioned earlier,[MIR](Sec. 21) it is not always clear[DLC] who invented what first, e.g., in the case of backpropagation. In the end, the facts will have to be sorted out by AI scientists and AI historians equipped with artificial curiosity,[SA17][AC90-AC20][PP-PP2] aided by my publication page and my AI Blog.

References:

[AC] J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. Includes the first paper on planning with reinforcement learning recurrent neural networks (1990) and on the principle behind generative adversarial networks.
[AIB] J. Schmidhuber. AI Blog. Includes variants of chapters of the AI Book.
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. We had both hard attention (1990) and soft attention (1991-93).[FWP]
[AUT] J. Schmidhuber (AI Blog, 2005). Highlights of robot car history.
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020). Who invented backpropagation?
[DAN] J. Schmidhuber (AI Blog, 2021). 10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution.
[DAN1] J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet). First superhuman visual pattern recognition, achieved by our artificial neural network called DanNet.
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, updated 2021, 2022). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s.
[DL4] J. Schmidhuber (AI Blog, 2017). Our impact on the world's most valuable public companies, through LSTM and other methods.
J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.[DL6]
[DLC] J. Schmidhuber (AI Blog, June 2015). Critique of Paper by the self-proclaimed "Deep Learning Conspiracy" (Nature 521, p. 436).
[FDL] J. Schmidhuber (AI Blog, 2013). My First Deep Learning System of 1991 + Deep Learning Timeline 1960-2013.
[FWP] J. Schmidhuber (AI Blog, 26 March 2021, updated 2022). 26 March 1991: Neural nets learn to program neural nets with fast weights; origin of the attention terminology[FWP2] now widely used.
[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR 2012), p. 3642-3649, July 2012. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV].
[GPUCNN5] J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet). History of computer vision contests won by deep CNNs since 2011. DanNet was the first CNN to win one, and won 4 of them in a row before the similar AlexNet/VGG Net and the ResNet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision.
[GPUCNN8] J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet). First deep learner to win a medical imaging contest (2012).
[HAB2] J. Schmidhuber (Blog, 2000). Most influential persons of the 20th century (according to Nature, 1999). The Haber-Bosch process has often been called the most important invention of the 20th century.[HAB1]
[HIN] J. Schmidhuber (AI Blog, 2020). Critique of Honda Prize for Dr. Hinton. Science must not allow corporate PR to distort the academic record.
[LEC] J. Schmidhuber (AI Blog, 2022). Critique of LeCun's claims, including his list of the "5 best ideas 2012-2022."
[LEI21] J. Schmidhuber (AI Blog, 2021). 375th birthday of Leibniz, founder of computer science. Frankfurter Allgemeine Zeitung (FAZ), 17/5/2021.
[LEI21b] J. Schmidhuber (AI Blog, 2021). 375th birthday of Leibniz, the father of computer science (German version of [LEI21]).
[LSTMPG] J. Schmidhuber (AI Blog, Dec 2020). 10-year anniversary of our journal paper on deep reinforcement learning with policy gradients for LSTM (2007-2010).
[META] J. Schmidhuber (AI Blog). Metalearning or Learning to Learn Since 1987.
[MIR] J. Schmidhuber (AI Blog, Oct 2019, updated 2021, 2022). Deep Learning: Our Miraculous Year 1990-1991.
J. Schmidhuber (AI Blog, Sep 2020). 10-year anniversary of our supervised deep learning breakthrough (2010). No unsupervised pre-training.
[MOST] J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs at TU Munich and IDSIA: (1) Long Short-Term Memory (LSTM); (2) ResNet (our earlier Highway Net with open gates); (3) AlexNet and VGG Net (both building on our similar earlier DanNet, the first deep convolutional NN to win image recognition competitions); (4) GANs (instances of our Adversarial Artificial Curiosity); and (5) variants of Transformers (Transformers with linearized self-attention are formally equivalent to our earlier Fast Weight Programmers). Most of this originated in the Annus Mirabilis of 1990-1991.[MIR]
[NASC1] J. Schmidhuber. First Pow(d)ered flight / plane truth. Correspondence, Nature, 421, p. 689, Feb 2003.
[NASC3] J. Schmidhuber. The last inventor of the telephone. Letter, Science, 319, no. 5871, p. 1759, March 2008.
[NASC6] J. Schmidhuber. Colossus was the first electronic digital computer. Correspondence, Nature, 441, p. 25, May 2006.
[NASC7] J. Schmidhuber. Turing.
[NASC8] J. Schmidhuber. Prototype resilient, self-modeling robots. Correspondence, Science, 316, no. 5825, p. 688, May 2007.
[NASC9] J. Schmidhuber. Comparing the legacies of Gauss, Pasteur, Darwin. Correspondence, Nature, 452, p. 530, April 2008.
J. Schmidhuber (Blog, 2006). Is History Converging? Again?
[PLAN] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, and the GAN principle.
[RAW] J. Schmidhuber (AI Blog, 2001). Raw Computing Power.
[T20a] J. Schmidhuber (AI Blog, 25 June 2020). Critique of 2018 Turing Award for Drs. Bengio & Hinton & LeCun. The first version of the present critique.
[T22] J. Schmidhuber (AI Blog, 2022). Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21, IDSIA, Lugano, Switzerland, 2022.
[TUR21] J. Schmidhuber (AI Blog, Sep 2021). Turing Oversold.
[UN] J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised or self-supervised pre-training. Such approaches are now widely used.
[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen (Studies of dynamic neural networks). Diploma thesis, TU Munich, 1991 (advisor J. Schmidhuber).
[ZUS21] J. Schmidhuber (AI Blog, 2021). 80th anniversary celebrations: 1941: Konrad Zuse completes the first working general computer, based on his 1936 patent application.
J. Schmidhuber (AI Blog, 2022). Annotated History of Modern AI and Deep Learning. Technical Report IDSIA-22-22, IDSIA, Lugano, Switzerland, 2022.

Deep Learning: Our Miraculous Year 1990-1991 (AI Blog). Traditionally, sequence processing is done with recurrent NNs (RNNs). In March 1991, however, I published an alternative based on fast weights,[FWP0-1] which also introduced the attention terminology[FWP2] now used in Transformers, and which avoids the famous vanishing gradient through mechanisms related to the additive neural activations of LSTMs / Highway Nets / ResNets[HW1-3] (Sec. 5). All of this emerged from the Annus Mirabilis of deep learning.[MIR]

Deep learning with artificial neural networks (NNs) started in 1965 (see above). The present section is about fast weights and attention[ATT] (Sec. 4).

On 26 March 1991, I published an NN that learns to program the fast weights of another NN, with end-to-end-differentiable attention.[ATT] That is, I separated storage and control like in traditional computers, but in a fully neural way, instead of relying on a single monolithic recurrent NN (RNN).

Such attention[ATT] (compare Sec. 4) was realized for sequence-processing recurrent NNs (RNNs), the computationally most powerful NNs of them all.[UN][MIR](Sec. 0) A key motivation: fast weights give a network a short-term memory of the same order as its number of connections, O(H^2), instead of O(H), where H is the number of hidden units. This motivation and a variant of the method were republished over two decades later.[FWP4a][R4][MIR](Sec. 8)[T22](Sec. XVII, item H3) Attention terminology of 1993: I introduced it for such NN-programmed fast weights (Sec. 5);[FWP0-1] see Sec. 9 & Sec. 8 of [MIR] and Sec. XVII of [T22]. These internal spotlights of attention do not suffer during sequence learning from the famous vanishing gradient problem (a minimal sketch follows below).
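To make the fast-weight idea concrete, here is a minimal NumPy sketch of an outer-product fast weight write and an attention-like read, in the spirit of the 1991 Fast Weight Programmers and of Transformers with linearized self-attention; the dimensions, random keys/values, and variable names are illustrative assumptions, not the original formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                       # size of key/value vectors (H in the text)
W_fast = np.zeros((d, d))   # fast weight matrix: O(H^2) short-term storage

# In a full Fast Weight Programmer, a slow network would generate keys and values
# from the input stream; here we just draw them at random for illustration.
k1, v1 = rng.normal(size=d), rng.normal(size=d)

# Write: store the association key -> value as an additive outer product.
W_fast += np.outer(v1, k1)

# Read: retrieve by multiplying the fast weights with a query (attention-like lookup).
retrieved = W_fast @ k1

# The stored value comes back scaled by ||k1||^2 (plus interference if more pairs are stored).
print(np.allclose(retrieved, v1 * (k1 @ k1)))   # True
```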

LSTM and the fast weight approach both date back to 1991, our miraculous year of deep learning.[MIR] Basic Long Short-Term Memory[LSTM1] solves the vanishing gradient problem by adding, at every time step, new information to an internal cell state that is otherwise carried along essentially unchanged, so that error signals can flow back across many steps without shrinking.
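A tiny numerical illustration of this point (my toy example, not taken from the cited papers): along an additive cell path the backpropagated factor is just the product of forget-gate values, which can stay near 1, whereas a squashing recurrence multiplies many factors smaller than 1:

```python
import numpy as np

T = 100  # number of time steps to backpropagate through

# Vanilla recurrence h_t = tanh(w * h_{t-1}): each backprop factor |w * tanh'(.)| is < 1,
# so the product over T steps shrinks exponentially (vanishing gradient).
w, h, grad_rnn = 0.9, 0.5, 1.0
for _ in range(T):
    pre = w * h
    grad_rnn *= w * (1.0 - np.tanh(pre) ** 2)   # local derivative of tanh(w * h)
    h = np.tanh(pre)

# LSTM-style additive cell path c_t = f * c_{t-1} + (gated new input): the derivative
# along the cell path is just the product of the forget gates.
f, grad_lstm = 0.99, 1.0
for _ in range(T):
    grad_lstm *= f

print(f"vanilla RNN path gradient after {T} steps: {grad_rnn:.3e}")
print(f"LSTM cell path gradient after {T} steps:   {grad_lstm:.3e}")
```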

    Highway Networks:
The LSTM principle is mirrored in the LSTM-inspired Highway Network (May 2015),[HW1][HW1a][HW3] the first working really deep feedforward NN with hundreds of layers; the popular ResNet is a Highway Net whose gates are always open. Highway Nets perform roughly as well as ResNets[HW2] on ImageNet,[HW3] and variants of highway gates are used for certain algorithmic tasks, where the simpler residual layers do not work as well.[NDR] Remarkably, both of these dual approaches of 1991 have become successful; by the mid 2010s,[DEC] major IT companies were building on them, and unsupervised pre-training of deep NNs[UN0-UN2][MIR](Sec. 1) dates back to 1991[UN] (Sec. 2).[FWP4a][R4][MIR](Sec. 8)[T22](Sec. XVII, item H3)
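As a hedged illustration of the relation claimed here, a NumPy sketch of a single highway layer and of the residual layer obtained when the gating is dropped (gates "open"); layer sizes, weights, and the tanh nonlinearity are placeholder assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(x, W_h, b_h, W_t, b_t):
    """y = T(x) * H(x) + (1 - T(x)) * x, with transform gate T and nonlinearity H."""
    h = np.tanh(W_h @ x + b_h)        # candidate transformation H(x)
    t = sigmoid(W_t @ x + b_t)        # transform gate T(x) in (0, 1)
    return t * h + (1.0 - t) * x      # gated mix of transformation and identity

def residual_layer(x, W_h, b_h):
    """Ungated special case, y = H(x) + x, as in ResNet-style layers."""
    return np.tanh(W_h @ x + b_h) + x

n = 6
rng = np.random.default_rng(1)
W_h, b_h = rng.normal(scale=0.1, size=(n, n)), np.zeros(n)
W_t, b_t = rng.normal(scale=0.1, size=(n, n)), np.full(n, -2.0)  # bias gates toward carrying x
x = rng.normal(size=n)
print(highway_layer(x, W_h, b_h, W_t, b_t).shape, residual_layer(x, W_h, b_h).shape)
```

Initializing the transform-gate bias to a negative value (as above) makes very deep stacks behave nearly like the identity at the start of training, which is one reason such layers can be stacked hundreds of times.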

The first machine learning paper with "learn deep" in the title appeared in 2005 (see the references). Moreover, the Kolmogorov complexity or algorithmic information content of successful huge NNs may actually be rather small. Compressed Network Search[CO2] exploits this by searching in a space of compact weight encodings rather than in raw weight space, with no unsupervised pre-training (a sketch of the encoding idea follows below).
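A hedged sketch of searching in a compressed weight encoding: a short coefficient vector is expanded into a much larger weight matrix via a cosine basis. The specific basis and sizes below are my illustrative assumptions in the spirit of [CO2], not its exact scheme:

```python
import numpy as np

def cosine_expand(coeffs, n):
    """Expand len(coeffs) low-frequency cosine coefficients into n values."""
    k = np.arange(n)
    return sum(c * np.cos(np.pi * j * (k + 0.5) / n) for j, c in enumerate(coeffs))

def decode_weights(genome, shape):
    """Decode a full weight matrix from a short coefficient vector (the compressed genome)."""
    flat = cosine_expand(genome, shape[0] * shape[1])
    return flat.reshape(shape)

rng = np.random.default_rng(0)
genome = rng.normal(size=8)               # only 8 numbers ...
W = decode_weights(genome, (20, 50))      # ... define a 20 x 50 = 1000-weight matrix
print(W.shape)
# A search procedure (e.g., an evolution strategy) would mutate `genome` and evaluate
# the decoded network on the task, instead of searching all 1000 weights directly.
```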

My first work on metalearning machines that learn to learn was published in 1987,[META][R3] addressing metalearning in a very general way. Later extensions included a self-referential weight matrix (an NN that can inspect and modify its own weights) and metalearning through gradient descent in LSTM networks[LSTM1] instead of traditional approaches. (There is another version of this article; see also: Metalearning or Learning to Learn Since 1987, and the references at the end of this report.)
Annotated History of Modern AI and Deep Learning (AI Blog): https://people.idsia.ch/~juergen/deep-learning-history.html. Modern AI is dominated by artificial neural networks (NNs) and deep learning.[DL1-4] The text contains numerous hyperlinks to relevant overview sites from my AI Blog. It also debunks certain popular but misleading historic accounts of deep learning, and supplements my previous deep learning survey.[DL1] Today, the resulting NNs of modern AI help to minimize pain, maximize pleasure, drive cars, etc.[MIR](Sec. 0)[DL1-4]

    Leibniz, father of computer science circa 1670, publishes the chain rule in 1676

In 1676, Gottfried Wilhelm Leibniz published the chain rule, the basis of the "modern backpropagation" discussed below (see Footnote 3).

Footnote 1. In 1684, Leibniz was also the first to publish "modern" calculus;[L84][SON18][MAD05][LEI21,a,b] later Isaac Newton was also credited for his unpublished work.[SON18] Their priority dispute,[SON18] however, did not encompass the chain rule.[LEI07-10] Of course, both were building on earlier work: in the 3rd century B.C., Archimedes (perhaps the greatest scientist ever[ARC06]) paved the way for infinitesimals.

Footnote 3. Some claim that the backpropagation algorithm (discussed further down; now widely used to train deep NNs) is just the chain rule of Leibniz (1676) & L'Hopital (1696). No: it is the efficient way of applying the chain rule to big networks with differentiable nodes.
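For reference, the chain rule in question, and its repeated application through a deep composition of functions, which reverse-mode differentiation (backpropagation) evaluates efficiently from the output backwards:

```latex
\frac{dz}{dx} \;=\; \frac{dz}{dy}\,\frac{dy}{dx},
\qquad\text{and for } z = f_L\bigl(f_{L-1}(\cdots f_1(x)\cdots)\bigr):\qquad
\frac{\partial z}{\partial x} \;=\;
\frac{\partial f_L}{\partial f_{L-1}}\,
\frac{\partial f_{L-1}}{\partial f_{L-2}}\cdots
\frac{\partial f_1}{\partial x}.
```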

In 1795, Gauss used what is now called the method of least squares / linear regression, arguably an early form of shallow learning.
Our LSTM became the most cited NN of the 20th century,[MOST] and the ResNet (an open-gated version of our Highway Net) the most cited NN of the 21st century.[MOST] Ivakhnenko's highly cited deep learning method of 1965/1971 was still popular in the new millennium,[DL2] especially in Eastern Europe, where much of Machine Learning was born;[MIR](Sec. 1)[R8] it was arguably the first machine learning method for deep multilayer nets.

Who invented backpropagation?

It took 4 decades until the backpropagation method of 1970[BP1-2] got widely accepted as a training method for deep NNs. Before 2010, many thought that the training of NNs with many layers requires unsupervised pre-training, a methodology introduced by myself in 1991[UN][UN0-3] (see below), and later championed by others (2006).[UN4] In fact, it was claimed[VID1] that deep NNs cannot be trained well without it.

Our supervised deep learning breakthrough of 2010 (see its 10-year anniversary post) showed otherwise and was later called a "wake-up call to the machine learning community." Computer Vision was revolutionized in the 2010s by a particular feedforward NN called the convolutional NN (CNN).[CNN1-4]

    In 1987, NNs with convolutions were combined by Alex Waibel with weight sharing and backpropagation (see above),[BP1-2] and applied to speech.[CNN1a] Waibel did not call this CNNs but TDNNs.

History of computer vision contests won by deep CNNs since 2011: our GPU-based CNN of 2011,[GPUCNN1] known as DanNet,[DAN,DAN1][R6] won the ICDAR 2011 Chinese handwriting contest, the IJCNN 2011 traffic sign contest, the ISBI 2012 image segmentation contest, the ICPR 2012 medical imaging contest, and the MICCAI 2013 Grand Challenge. Later came the ResNet,[HW2] a Highway Net[HW1] with open gates.
DanNet won four of these contests in a row. At IJCNN 2011 in Silicon Valley, DanNet blew away the competition and achieved the first superhuman visual pattern recognition[DAN1] in an international contest. DanNet was also the first deep CNN to win a Chinese handwriting contest (ICDAR 2011) and an image segmentation contest (ISBI, May 2012). AlexNet and VGG Net further extended the DanNet of 2011.[MIR](Sec. 19)[MOST] The ResNet, today's most cited NN,[MOST] is a version (with open gates) of our earlier Highway Net (May 2015).[HW1-3][R5] The Highway Net (see below) is actually the feedforward net version of our vanilla LSTM (see below).[LSTM2] It was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers). Compare also our graph NN-like, Transformer-like Fast Weight Programmers of 1991,[FWP0-1][FWP6][FWP] which learn to continually rewrite mappings from inputs to outputs (addressed below); see the overviews.[MIR](Sec. 15, Sec. 17) Generative Adversarial Networks (GANs) have also become very popular.[MOST] They were first published in 1990 in Munich under the moniker Artificial Curiosity.[AC90-20][GAN1]

Artificial Curiosity & Creativity Since 1990-91. In the adversarial curiosity setting, a generator NN and a predictor NN challenge each other with self-invented experiments from a given set.[AC20][AC][T22](Sec. XVII) Predictability Minimization is a related unsupervised minimax game where one neural network minimizes the objective function maximized by another (formalized below); we used Predictability Minimization for creating disentangled representations of partially redundant data, applying it to images in 1996.[PM0-2][AC20][R2][MIR](Sec. 7) The same period produced recurrent NNs that learn to generate sequences of subgoals.[HRL1-2][PHD][MIR](Sec. 10) Transformers with "linearized self-attention"[TR5-6] use attention terminology like the one I introduced in 1993.[ATT][FWP2][R4]
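As a rough formalization in my own notation (not a quotation from the cited papers): for each code unit y_i produced by an encoder from input x, a predictor P_i tries to predict y_i from the remaining code units, while the encoder is trained to make its code units unpredictable; the two players optimize the same objective in opposite directions:

```latex
V(\theta_{\mathrm{enc}}, \theta_{\mathrm{pred}})
  \;=\; \mathbb{E}_{x}\sum_{i}\Bigl(P_i\bigl(y_{\neq i}(x);\,\theta_{\mathrm{pred}}\bigr)-y_i(x;\theta_{\mathrm{enc}})\Bigr)^{2},
\qquad
\min_{\theta_{\mathrm{pred}}}\;\max_{\theta_{\mathrm{enc}}}\;V .
```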

26 March 1991: Neural nets learn to program neural nets with fast weights—like today's Transformer variants.
All of this was part of the Annus Mirabilis of 1990-1991.[MIR][MOST]

The 1991 fast weight programmers and their 1992 successors[FWPMETA1-9][HO1] extended my 1987 diploma thesis,[META1] which introduced algorithms not just for learning but also for meta-learning or learning to learn,[META] to learn better learning algorithms through experience. This became very popular in the 2010s,[DEC] when computers were a million times faster. Before the 1990s, however, RNNs failed to learn deep problems in practice.[MIR](Sec. 0) This was overcome by the Neural History Compressor,[UN1] the First Very Deep Learner of 1991, whose knowledge can be collapsed into a single RNN using my NN distillation procedure of 1991[UN0-1][MIR] (a generic sketch follows below). Transformers with linearized self-attention were also first published[FWP0-6] in the Annus Mirabilis of 1990-1991.[MIR][MOST]
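To illustrate the general idea of collapsing or distilling one network into another (the networks, loss, and training loop below are illustrative assumptions, not the 1991 procedure itself), a student model can be trained to imitate a teacher's outputs on the same inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 4, 16

# "Teacher": a fixed network whose behaviour we want to compress into a smaller "student".
W1, W2 = rng.normal(size=(n_hid, n_in)), rng.normal(size=n_hid)
def teacher(x):
    return np.tanh(W1 @ x) @ W2

# "Student": a smaller linear model, trained by plain gradient descent
# to imitate the teacher's outputs (squared imitation error).
w = np.zeros(n_in)
lr = 0.05
for step in range(2000):
    x = rng.normal(size=n_in)
    err = w @ x - teacher(x)
    w -= lr * err * x            # gradient of 0.5 * err^2 with respect to w

x_test = rng.normal(size=(200, n_in))
mse = np.mean([(w @ x - teacher(x)) ** 2 for x in x_test])
print(f"imitation error of the distilled student: {mse:.3f}")
```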
Deep learning is hard because of the Fundamental Deep Learning Problem: in deep or recurrent nets, backpropagated error signals tend to vanish or explode. Sepp Hochreiter identified and analyzed it in 1991. The Long Short-Term Memory (LSTM) recurrent neural network[LSTM1-6] overcomes the Fundamental Deep Learning Problem identified by Sepp in his above-mentioned 1991 thesis. Recurrent Neural Networks, especially LSTM, went on to win, among others, the three ICDAR 2009 Connected Handwriting Competitions (French, Farsi, Arabic).

    Highway Networks:
Our Highway Network[HW1] was the first working really deep feedforward NN; the ResNet (winner of the ImageNet 2015 contest) is a version thereof with open gates.

Deep learning: LSTMs brought essentially unlimited depth to supervised recurrent NNs, and the LSTM-inspired Highway Nets later brought it to feedforward NNs. In the 2000s, we also introduced LSTM trained by policy gradients (2007).[RPG07][RPG][LSTMPG]

In the 2010s, Alphastar, whose brain has a deep LSTM core trained by PG, made headlines.[DM3] Bill Gates called an LSTM-based system a "huge milestone in advancing artificial intelligence."[OAI2a][MIR](Sec. 4)[LSTMPG] Our neural history compressors[UN][UN0-3] learn to represent percepts at multiple levels of abstraction and multiple time scales (see above; a toy sketch of the underlying chunking principle follows below), while our end-to-end differentiable NN-based subgoal generators[HRL3][MIR](Sec. 10) learn hierarchical action plans through gradient descent (see above). More sophisticated ways of learning to think in abstract ways were published later. See also: Highlights of over 2000 years of computing history (Juergen Schmidhuber).
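A toy illustration of the chunking principle behind the history compressor (my simplification, not the 1991 architecture): a low-level predictor learns to predict the next symbol, and only the symbols it fails to predict ("unexpected events") are passed upward, so the higher level sees a shorter, more abstract sequence:

```python
# Lower level: predict the next symbol as the most frequent successor seen so far.
class BigramPredictor:
    def __init__(self):
        self.counts = {}
    def predict(self, prev):
        succ = self.counts.get(prev)
        return max(succ, key=succ.get) if succ else None
    def update(self, prev, nxt):
        self.counts.setdefault(prev, {}).setdefault(nxt, 0)
        self.counts[prev][nxt] += 1

seq = list("abcabcabcabxabcabc")    # mostly predictable, with one surprise 'x'
pred = BigramPredictor()
higher_level_input = [seq[0]]       # the first symbol is always unexpected
for prev, nxt in zip(seq, seq[1:]):
    if pred.predict(prev) != nxt:   # prediction failure -> pass the symbol upward
        higher_level_input.append(nxt)
    pred.update(prev, nxt)

print("".join(seq))                 # full input sequence
print("".join(higher_level_input))  # much shorter: the compressed description
```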

2021: 375th birthday of Leibniz, father of computer science (Juergen Schmidhuber). Wilhelm Schickard built an early calculating machine in 1623. In 1673, the already mentioned Gottfried Wilhelm Leibniz (called "the smartest man who ever lived"[SMO13]) designed the first machine that could multiply. Centuries later, Konrad Zuse built the first working programmable general-purpose computer; unlike Babbage, Zuse used Leibniz's binary system rather than decimal. The theoretical foundations of computing were laid by Goedel (1931), Church (1935), Turing[TUR] (1936), and Post[POS] (1936).

1941: Konrad Zuse builds the first working general computer; patent application 1936 (Juergen Schmidhuber). Relatively cheap computers may soon match the raw computational power of all human brains combined.[RAW] Goedel's results of 1931, however, identified fundamental limits of any type of computation-based AI.[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]

1931: Theoretical Computer Science & AI Theory Founded by Goedel (Juergen Schmidhuber). Goedel built on the formal language of Gottfried Wilhelm Leibniz[L86][WI48] (see above). In 1936, Alan M. Turing introduced his famous abstract model of computation. In 1964, Ray Solomonoff combined Bayesian (actually Laplacian[STI83-85]) probabilistic reasoning and theoretical computer science[GOD][CHU][TUR][POS] to obtain a theoretically optimal way of predicting future data from past observations. Together with Andrej Kolmogorov, he founded the theory of Kolmogorov complexity or algorithmic information theory (AIT),[AIT1-22] going beyond traditional information theory.[SHA48][KUL]
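For reference, the central quantities of algorithmic information theory mentioned here, in the standard textbook formulation (U is a fixed universal prefix machine; this notation is mine, not a quotation from the cited works):

```latex
K_U(x) \;=\; \min\{\, \ell(p) \;:\; U(p) = x \,\},
\qquad
M_U(x) \;=\; \sum_{p\,:\,U(p)\ \text{outputs a string starting with}\ x} 2^{-\ell(p)},
```

where ℓ(p) is the length of program p in bits, K_U(x) is the Kolmogorov complexity of x, and M_U is Solomonoff's algorithmic prior used for prediction.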

In the early 2000s, Marcus Hutter (while working under my Swiss National Science Foundation grant[UNI]) augmented Solomonoff's theory of universal induction to obtain a theory of universal learning machines and universal AI. The most important events since the beginning of the universe seem to be neatly aligned on a timeline of exponential acceleration converging in an Omega point in the year 2040 or so (J. Schmidhuber, 2014). The first truly self-driving robot cars were already driving in highway traffic, at up to 180 km/h, decades ago.[AUT] Back then, I worked on my 1987 diploma thesis,[META1] which introduced algorithms not just for learning but also for meta-learning or learning to learn,[META] to learn better learning algorithms through experience (now a very popular topic[DEC]). And then came our Miraculous Year 1990-91[MIR] at TU Munich (take all of this with a grain of salt, though[OMG1]). In 1997, I also described the simplest and fastest way of computing all possible metaverses or computable universes.

Some of the material above was taken from previous AI Blog posts:[MIR] [DEC] [GOD21] [ZUS21] [LEI21] [AUT] [HAB2] [ARC06] [AC] [ATT] [DAN] [DAN1] [DL4] [GPUCNN5,8] [DLC] [FDL] [FWP] [LEC] [META] [MLP2] [MOST] [PLAN] [UN] [LSTMPG] [BP4] [DL6a] [HIN] [T22]. See also my publication page.

In 2022, we are celebrating the following works from a quarter-century ago: 1. the journal paper on Long Short-Term Memory (the basis of the most cited NN of the 21st century); the 1997 paper on all possible metaverses; 3. implementing artificial curiosity and creativity through generative adversarial agents that learn to design abstract, interesting computational experiments; meta-reinforcement learning; 5. the journal paper on hierarchical Q-learning; and 8. the journal paper on Low-Complexity Art, the Minimal Art of the Information Age. (References for these and the works above are listed earlier in this report.)

    Critique of 2018 Turing Award modern backpropagation principles of generative adversarial NNs and artificial curiosity (1990),[AC90,90b][AC20] unsupervised pre-training for deep NNs (1991),[UN1-2] vanishing gradients (1991)[VAN1] & Long Short-Term Memory or LSTM (Sec. A), NNs with over 100 layers (2015),[HW1-3][R5] fast weight programmers (1991).[FWP0-2,6] Often LBH failed to cite essential prior work, even in their later surveys.[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5, R7-R8] The deep NNs By the 2010s,[DEC] they were Long Short-Term Memory vanishing gradient problem (Sec. 3,Sec. 4) through "forget gates" based on end-to-end-differentiable fast weights.[MIR](Sec. 8)[FWP,FWP0-1] became the first recurrent NN (RNN) to win international competitions. LSTM[MIR](Sec. 4) However, such attention mechanisms also have their roots in my lab (1991);[FWP][FWP0-2,6] In the 2010s, Alphastar whose brain has a deep LSTM core trained by PG.[DM3] Bill Gates called this a "huge milestone in advancing artificial intelligence".[OAI2a][MIR](Sec. 4)[LSTMPG] in healthcare, D. Computer Vision was revolutionized in the 2010s by In 1987, NNs with convolutions were combined by Waibel with weight sharing and backpropagation.[CNN1a] Waibel did not call this CNNs but TDNNs. unsupervised pre-training is not necessary Our fast GPU-based CNN of 2011[GPUCNN1] known as DanNet[DAN,DAN1][R6] winning four of them at IJCNN 2011 in Silicon Valley, DanNet blew away the competition and achieved the first superhuman visual pattern recognition[DAN1] in an international contest (where LeCun DanNet was also the first deep CNN to win: a Chinese handwriting contest (ICDAR 2011), an image segmentation contest (ISBI, May 2012), further extended the work of 2011.[MIR](Sec. 19) more citations per year[MOST] Highway Net (May 2015).[HW1-3][R5] The Highway Net is actually the feedforward net version of vanilla LSTM.[LSTM2] It was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers).

    Deep learning NNs appeared long before the 1980s. Ivakhnenko's nets of the 1960s already had many layers (containing the now popular multiplicative gates).[DEEP1-2][DL1-2] A paper of 1971[DEEP2] already described a deep learning net with 8 layers, trained by a highly cited method that was still popular in the new millennium,[DL2] especially in Eastern Europe, where much of Machine Learning was born.[MIR](Sec. 1)[R8] Ivakhnenko did not call it an NN. LBH failed to cite this, just as they failed to cite Amari,[GD1] who in 1967 proposed stochastic gradient descent[STO51-52] (SGD) for multilayer perceptrons and whose implementation with Saito[GD2,GD2a] learned internal representations at a time when compute was billions of times more expensive than today. Minsky was apparently unaware of this and failed to correct it later.[HIN](Sec. I) However, it became really deep in 1991 in my lab,[UN-UN3] which produced the first very deep NNs based on unsupervised pre-training (1991) and later drove the shift from unsupervised pre-training to purely supervised learning (1991-95; 2006-10).[HIN](Sec. II)[MIR](Sec. 19) LSTMs brought essentially unlimited depth to supervised recurrent NNs; Highway Nets[HW1-3] brought it to feedforward NNs.[MOST]
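
    For concreteness, here is a minimal sketch of stochastic gradient descent for a small multilayer perceptron (a generic modern illustration with made-up data, sizes, and learning rate, not a reconstruction of any historical implementation cited above):

        # Minimal sketch of stochastic gradient descent (SGD) for a small MLP.
        # Illustrative only; data, layer sizes, and hyperparameters are made up.
        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 3))                    # toy inputs
        y = (X.sum(axis=1, keepdims=True) > 0) * 1.0     # toy binary targets

        W1 = rng.normal(scale=0.5, size=(3, 8))          # input -> hidden weights
        W2 = rng.normal(scale=0.5, size=(8, 1))          # hidden -> output weights
        lr = 0.1

        for epoch in range(50):
            for i in rng.permutation(len(X)):            # one example at a time: "stochastic"
                x = X[i:i+1]
                h = np.tanh(x @ W1)                      # hidden activations
                p = 1.0 / (1.0 + np.exp(-(h @ W2)))      # sigmoid output
                # gradients of the squared error via the chain rule (reverse mode)
                d_out = (p - y[i:i+1]) * p * (1 - p)
                dW2 = h.T @ d_out
                d_h = (d_out @ W2.T) * (1 - h**2)
                dW1 = x.T @ d_h
                W2 -= lr * dW2                           # SGD parameter updates
                W1 -= lr * dW1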

    The award laudation credits LBH for foundations laid by others (see the sections below): modern backpropagation, the principles of generative adversarial NNs and artificial curiosity (1990),[AC90,90b][AC20] unsupervised pre-training for deep NNs,[UN1-2] and the vanishing gradient problem (1991)[VAN1] & solutions to it (Sec. A). Often LBH failed to cite this essential prior work.[DLC][HIN][MIR](Sec. 21) We had this type of deep learning already in 1991.[UN][UN1-2]

    Already in 1931, Gödel founded theoretical computer science and identified fundamental limits of any type of computation-based AI.[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]

    In 1936, Turing published his famous paper on the foundations of computing[TUR] (compare my reply to Hinton, who criticized my website on Turing[TUR21]). Likewise, Konrad Zuse (1910-1995) created the world's first working program-controlled general computer (1941; patent application 1936).

    Again, the foundations were laid by others: modern backpropagation; the principles of generative adversarial NNs and artificial curiosity (1990);[AC][AC90,90b][AC10][AC20] unsupervised pre-training for deep NNs (1991);[UN1-2][UN] vanishing gradients (1991)[VAN1] & solutions to it (Sec. A);[LSTM0-17][CTC] record-breaking deep supervised NNs and contest-winning deep CNNs (2011);[DAN][DAN1][GPUCNN5] NNs with over 100 layers (2015);[HW1-3][R5] and fast weight programmers (1991).[FWP0-2,6] Often LBH failed to cite essential prior work.[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5,R7,R8,R11]

    Critique of 2018 Turing Award "advances in natural language processing" and in speech supervised NNs and CNNs and through Highway Net-like NNs (2015),[HW1-3][R5] although the principles of CNNs were invented and developed by others since the 1970s.[CNN1-4] See Sec. D & XVIII & XIV as well as Sec. 4 & Sec. 19 of the overview.[MIR]

    Our DanNet[DAN][DAN1][GPUCNN5] was the first NN to win a medical imaging contest through a deep learning approach (2012).

    LBH failed to cite the original work, even in later surveys.[DL3,DL3a]

    Others built careers on this notion long before LBH recognized its importance.[DEEP1-2][CNN1][HIN][R8][DL1][DLC] Even deep learning through unsupervised pre-training was introduced by others.[UN1-3][R4][HIN](Sec. II)

    Our work made GPU-based NNs fast and deep enough to show that unsupervised pre-training (pioneered by myself in 1991) is not necessary: our CNNs of 2011 were deep and fast enough to win computer vision competitions without it.[DAN][DAN1][GPUCNN5]

    Furthermore, by the mid 2010s, speech recognition and machine translation on billions of devices were based on our LSTM (see Sec. A, B).

    Apparently even other award committees were misled in similar ways;[HIN](Sec. I) compare the recent debate.[HIN]

    On our fast deep CNN called DanNet, see also Sec. 19 of the overview.[MIR]

    In the 2010s, the core of the most visible commercial AI applications was actually the LSTM of our team,[LSTM0-6] which Bloomberg called "arguably the most commercial AI achievement."[AV1][MIR](Sec. 4) See Sec. B. We also introduced adaptive neural sequential attention: end-to-end-differentiable "soft" attention in the latent space of Fast Weight Programmers (FWPs),[FWP2][FWP] and "hard" attention (in observation space) in the context of RL (1990).[ATT][ATT0-1] The FWPs of 1991[FWP0-1] are closely related to the Transformers that became a popular alternative to RNNs in the 2010s,[DEC] whose attention terminology[FWP2] was introduced back in 1993.
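
    To fix terminology (a generic sketch in my own notation, not the 1990-93 formulations): given a query q, keys k_i, and values v_i, "soft" attention computes the differentiable weighted average

        $$\mathrm{read} = \sum_i \frac{\exp(q^\top k_i)}{\sum_j \exp(q^\top k_j)} \, v_i,$$

    whereas "hard" attention selects a single index i (e.g., by sampling), which is not differentiable and is therefore typically trained with reinforcement learning, as in the 1990 work on attention in observation space mentioned above.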

    See[MIR](Sec. 9)[R4] for my related priority dispute on attention with Hinton.

    GANs are a simple application[AC] of the adversarial curiosity (AC) principle of 1990, where one NN plays a minimax game against the other; compare also Predictability Minimization (1991).[PM1-2][AC20][R2][MIR](Sec. 5) Regarding the vanishing gradient problem,[MIR](Sec. 3)[VAN1] Bengio published his own analysis,[VAN2] without citing Sepp Hochreiter's earlier work and my publications on exactly this topic. Regarding attention-based Transformers,[TR1-6] Bengio[DL3a] cites his own team (2014) for "soft attention" without citing my much earlier original work of 1991-1993 on soft attention and linear Transformers.[FWP,FWP0-2,6] The same pattern holds for unsupervised pre-training for deep NNs,[UN0-4][HIN](Sec. II)[MIR](Sec. 1) the first NNs shown to solve very deep problems, compressing or distilling one NN into another,[UN0-2][DIST1-2][MIR](Sec. 2) fast weight programmers[FWP][FWP0-4a] through tensor-like outer products (1991-2016) and their motivation[FWP2][FWP4a][MIR](Sec. 8) (see also Sec. XVI above), and learning sequential attention with NNs,[MIR](Sec. 9) where our much earlier work[ATT1][ATT] went uncited.
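
    For concreteness, a generic sketch of compressing or distilling one NN into another in its modern form (not the 1991 procedure; the linear student, the temperature, and the data are illustrative assumptions of mine):

        # Generic sketch of distilling a "teacher" net into a smaller "student".
        # Illustrative modern formulation, not the 1991 procedure.
        import numpy as np

        def softmax(z, T=1.0):
            z = z / T
            e = np.exp(z - z.max(axis=-1, keepdims=True))
            return e / e.sum(axis=-1, keepdims=True)

        rng = np.random.default_rng(0)
        X = rng.normal(size=(64, 5))
        W_teacher = rng.normal(size=(5, 3))      # frozen; pretend it was trained before
        W_student = np.zeros((5, 3))             # small student, linear for brevity
        T, lr = 2.0, 0.5

        for step in range(200):
            p_t = softmax(X @ W_teacher, T)      # soft targets produced by the teacher
            p_s = softmax(X @ W_student, T)
            grad = X.T @ (p_s - p_t) / len(X)    # gradient of cross-entropy(p_t, p_s)
            W_student -= lr * grad               # the student learns to imitate the teacher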

    However, the basic CNN architecture with convolutional and downsampling layers is actually due to Fukushima (1979).[CNN1] NNs with convolutions were later (1987) combined by Waibel with weight sharing and backpropagation.[CNN1a] Waibel called this TDNN and applied it to speech. At IJCNN 2011 in Silicon Valley, our DanNet[DAN][GPUCNN1-3] achieved the first superhuman performance in a visual pattern recognition contest; at ICPR 2012, our DanNet[GPUCNN1-3] won the medical imaging contest. All of these fields were also heavily shaped in the 2010s by our non-CNN methods.[DL1][DL4][AM16][GSR][GSR15][GT16][WU][FB17]
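
    A minimal sketch of the two ingredients just mentioned, convolution with shared weights followed by downsampling, purely for illustration (not the Neocognitron's, TDNN's, or DanNet's actual design):

        # Minimal sketch of a convolutional layer with shared weights plus
        # downsampling (max pooling). Illustrative only.
        import numpy as np

        def conv2d_valid(img, kernel):
            kh, kw = kernel.shape
            H, W = img.shape
            out = np.zeros((H - kh + 1, W - kw + 1))
            for i in range(out.shape[0]):
                for j in range(out.shape[1]):
                    # the same kernel (shared weights) is applied at every position
                    out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
            return out

        def maxpool2x2(x):
            H, W = x.shape
            x = x[:H - H % 2, :W - W % 2]
            return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

        img = np.random.default_rng(0).normal(size=(8, 8))
        kernel = np.array([[1., 0., -1.], [1., 0., -1.], [1., 0., -1.]])  # edge detector
        feature_map = np.tanh(conv2d_valid(img, kernel))   # convolution + nonlinearity
        pooled = maxpool2x2(feature_map)                   # downsampling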

    As mentioned in Sec. XII, backpropagation was actually proposed earlier as a learning method for NNs by Werbos (1982)[BP2-4] (see also Amari's even earlier work on SGD for NNs[GD1-2]).

    My adaptive subgoal generators (1991)[HRL0-2] were trained through end-to-end-differentiable chains of modules,[MIR](Sec. 10) as was planning and reinforcement learning with recurrent neural world models (1990).[PLAN][MIR](Sec. 11) The same holds for my linear-Transformer-like fast weight programmers since 1991[FWP0-2][FWP][ATT][MIR](Sec. 8) (see Sec. XVI); compare also the ad hominem attacks discussed elsewhere.[AH2-3][HIN] As emphasized earlier:[DLC][HIN] GANs are variations of my artificial curiosity principle of 1990; LBH, who called themselves the deep learning conspiracy,[DLC][DLC1-2] did not introduce unsupervised pre-training for deep NNs (that was done by myself in 1991); our deep and fast DanNet (2011)[GPUCNN1-3] came before the similar AlexNet; and backpropagation was previously invented by others. So virtually all the algorithms that have attracted massive attention build on work done in my labs:[UN0-2][UN][MOST] our LSTM brought essentially unlimited depth to gradient-based supervised recurrent NNs in the 1990s; our Highway Nets[HW1-3] brought it to feedforward NNs in May 2015.[MOST] These methods now power speech recognition, machine translation, medical diagnosis (2012, see Sec. VII, XVIII), and many other applications.[DEC]

    As mentioned earlier,[MIR](Sec. 21) it is not always clear[DLC] who should get credit for methods such as backpropagation; settling this is a task for AI scientists and AI historians equipped with artificial curiosity.[SA17][AC90-AC20][PP-PP2][R1] Compare my publication page and the AI Blog posts below.

    References:
    [AC] J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. Includes the first paper on planning with reinforcement learning recurrent neural networks (1990) and on generative adversarial networks.
    [AIB] J. Schmidhuber. AI Blog. Includes variants of chapters of the AI Book.
    [ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention, plus goal-conditional reinforcement learning. We had both hard attention (1990) and soft attention (1991-93).[FWP]
    [BP4] J. Schmidhuber (AI Blog, 2014; updated 2020). Who invented backpropagation?
    [DAN] J. Schmidhuber (AI Blog, 2021). 10-year anniversary: in 2011, DanNet triggered the deep convolutional neural network (CNN) revolution.
    [DAN1] J. Schmidhuber (AI Blog, 2011; updated 2021 for the 10th birthday of DanNet). First superhuman visual pattern recognition, by our artificial neural network called DanNet.
    [DEC] J. Schmidhuber (AI Blog, 02/20/2020, revised 2021). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s.
    [DL4] J. Schmidhuber (AI Blog, 2017). Our impact on the world (LSTM).
    J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.[DL6]
    [DLC] J. Schmidhuber (AI Blog, June 2015). Critique of paper by the self-proclaimed[DLC1-2] "Deep Learning Conspiracy" (Nature 521, p. 436).
    [FWP] J. Schmidhuber (AI Blog, 26 March 2021). 26 March 1991: Neural nets learn to program neural nets with fast weights, using the attention terminology[FWP2] now common.
    [GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column deep neural networks for image classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR 2012), pp. 3642-3649, July 2012. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV].
    [GPUCNN5] J. Schmidhuber (AI Blog, 2017; updated 2021 for the 10th birthday of DanNet). History of computer vision contests won by deep CNNs since 2011. DanNet won 4 of them in a row before the similar AlexNet/VGG Net and the ResNet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision.
    [GPUCNN8] J. Schmidhuber (AI Blog, 2017; updated 2021). First deep learner to win a medical imaging contest (2012).
    [HIN] J. Schmidhuber (AI Blog, 2020). Critique of Honda Prize for Dr. Hinton. Science must not allow corporate PR to distort the academic record.
    [LEI21] J. Schmidhuber (AI Blog, 2021). 375th birthday of Leibniz, founder of computer science. Frankfurter Allgemeine Zeitung (FAZ), 17/5/2021.
    [LSTMPG] J. Schmidhuber (AI Blog, Dec 2020). 10-year anniversary of our journal paper on deep reinforcement learning with policy gradients for LSTM (2007-2010).
    J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary (searchable PDF scan); better GP methods through Meta-Evolution.
    [MIR] J. Schmidhuber (AI Blog, 2019). Deep Learning: Our Miraculous Year 1990-1991.
    J. Schmidhuber (AI Blog, Sep 2020). 10-year anniversary of our supervised deep learning breakthrough (2010). No unsupervised pre-training.
    [MOST] J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs at TU Munich and IDSIA: (1) Long Short-Term Memory (LSTM); (2) ResNet (our earlier Highway Net with open gates); (3) AlexNet and VGG Net (both citing our similar earlier DanNet, the first deep convolutional NN to win image recognition competitions); (4) GANs (instances of our Adversarial Artificial Curiosity); and (5) variants of Transformers (linear Transformers are formally equivalent to my earlier Fast Weight Programmers).
    [NASC1] J. Schmidhuber. First Pow(d)ered flight / plane truth. Correspondence, Nature, 421, p. 689, Feb 2003.
    [NASC3] J. Schmidhuber. The last inventor of the telephone. Letter, Science, 319, no. 5871, p. 1759, March 2008.
    [NASC6] J. Schmidhuber. Colossus was the first electronic digital computer. Correspondence, Nature, 441, p. 25, May 2006.
    [NASC7] J. Schmidhuber. Turing.
    [NASC8] J. Schmidhuber. Prototype resilient, self-modeling robots. Correspondence, Science, 316, no. 5825, p. 688, May 2007.
    [NASC9] J. Schmidhuber. Comparing the legacies of Gauss, Pasteur, Darwin. Correspondence, Nature, 452, p. 530, April 2008.
    [PLAN] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, and the GAN principle.
    [T20a] J. Schmidhuber (AI Blog, 25 June 2020). Critique of 2018 Turing Award for Drs. Bengio & Hinton & LeCun. The first version of the present critique. Technical Report IDSIA-77-21 (v1), IDSIA, 24 Sep 2021.
    [TUR21] J. Schmidhuber (AI Blog, Sep 2021). Turing Oversold.
    [UN] J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised pre-training. Such approaches are now widely used.
    [ZUS21] J. Schmidhuber (AI Blog, 2021). 80th anniversary celebrations: 1941: Konrad Zuse completes the first working general computer, based on his 1936 patent application.

    Deep Learning: Our Miraculous Year 1990-1991 (AI Blog). Sequence processing is traditionally done with recurrent NNs (RNNs). In 1991, however, an alternative based on NN-programmed fast weights was published;[FWP0-1] it used the attention terminology[FWP2] now common. Its internal spotlights of attention do not suffer from the famous vanishing gradient problem, which is otherwise countered by the additive neural activations of LSTMs / Highway Nets / ResNets[HW1-3] (Sec. 5) of the Annus Mirabilis of deep learning.[MIR]

    Deep learning in artificial neural networks (NNs) started in 1965. The 1991 system discussed here is based on fast weights that implement learned internal attention[ATT] (Sec. 4).

    On 26 March 1991, I described NNs that learn to program the fast weights of other NNs, implementing learned internal attention.[ATT] That is, I separated storage and control like in traditional computers, in contrast to standard recurrent NNs (RNNs).

    Attention[ATT] (compare Sec. 4) can be implemented through such fast weights. Sequence-processing recurrent NNs (RNNs) are the computationally most powerful NNs of them all.[UN][MIR](Sec. 0) A fast weight memory, however, can store many more temporal variables than a standard RNN of the same size: O(H^2) instead of O(H), where H is the number of hidden units. This motivation and a variant of the method were republished over two decades later.[FWP4a][R4][MIR](Sec. 8)[T22](Sec. XVII, item H3) The attention terminology of 1993 refers to such NN-programmed fast weights (Sec. 5);[FWP0-1] see Sec. 9 & Sec. 8 of [MIR] and Sec. XVII of [T22]. These internal spotlights of attention do not suffer during sequence learning from the famous vanishing gradient problem.
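
    A minimal sketch in the spirit of the method described above (illustrative; the dimensions, nonlinearities, and slow net are assumptions of mine, not the exact 1991-93 formulation): a slow net emits key and value vectors whose outer products additively program a fast weight matrix, which is then queried; this is the update shared by Transformers with linearized self-attention.

        # Minimal sketch of a fast weight programmer / linear-attention update.
        import numpy as np

        rng = np.random.default_rng(0)
        d = 4                                        # hidden size (H)
        W_k, W_v, W_q = (rng.normal(scale=0.5, size=(d, d)) for _ in range(3))

        F = np.zeros((d, d))                         # fast weights: O(H^2) storage
        for x in rng.normal(size=(10, d)):           # process a toy input sequence
            k = np.tanh(W_k @ x)                     # slow net outputs a key ...
            v = np.tanh(W_v @ x)                     # ... and a value
            F += np.outer(v, k)                      # additive outer-product programming
            q = np.tanh(W_q @ x)                     # query the programmed fast net
            y = F @ q                                # retrieval, as in linear attention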

    LSTM and the fast weight programmers are dual approaches, both dating back to 1991, our miraculous year of deep learning.[MIR] Basic Long Short-Term Memory[LSTM1] solves the vanishing gradient problem by adding to its cell state at every time step rather than repeatedly multiplying it, so that error signals can flow across long time lags.
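
    In generic modern notation (mine, not that of the original papers; the forget gate f_t was a later addition), the additive update reads

        $$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t),$$

    so with open gates (f_t close to 1) we get \partial c_t / \partial c_{t-1} close to 1, and backpropagated errors neither vanish nor explode along the cell's constant error carousel.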

    Highway Networks:
    This principle is mirrored in the LSTM-inspired Highway Network (May 2015),[HW1][HW1a][HW3] the first working really deep feedforward NN. Remarkably, both of these dual approaches of 1991 have become highly successful. Highway Nets perform roughly as well as ResNets[HW2] on ImageNet,[HW3] and variants of highway gates are used for certain algorithmic tasks where the simpler residual layers do not work as well.[NDR] By the mid 2010s,[DEC] major IT companies overwhelmingly used our LSTM for speech recognition, machine translation, and many other applications on billions of devices.[DL4] Unsupervised pre-training of deep NNs[UN0-UN2][MIR](Sec. 1) also dates back to 1991[UN] (Sec. 2).[FWP4a][R4][MIR](Sec. 8)[T22](Sec. XVII, item H3)
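
    A minimal sketch of a single highway layer of the kind just described (illustrative only; sizes, initialization, and the weight sharing across layers are simplifications of mine, not the published architecture):

        # Minimal sketch of a highway layer: y = T(x) * H(x) + (1 - T(x)) * x.
        import numpy as np

        rng = np.random.default_rng(0)
        d = 16
        W_h = rng.normal(scale=0.1, size=(d, d))        # transform path weights
        W_t = rng.normal(scale=0.1, size=(d, d))        # gate weights
        b_t = -2.0 * np.ones(d)                         # negative bias: mostly "carry" at first

        def highway_layer(x):
            h = np.tanh(W_h @ x)                        # candidate transformation H(x)
            t = 1.0 / (1.0 + np.exp(-(W_t @ x + b_t)))  # transform gate T(x)
            return t * h + (1.0 - t) * x                # gated mix; with t -> 0 the layer is the identity

        x = rng.normal(size=d)
        y = x
        for _ in range(100):                            # many stacked layers remain trainable
            y = highway_layer(y)                        # (weights shared here only for brevity)

    A residual (ResNet) block corresponds to the open-gated special case y = H(x) + x.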

    The Kolmogorov complexity or algorithmic information content of successful huge NNs may actually be rather small; compare Compressed Network Search.[CO2]

    This work also led to a self-referential weight matrix. My first work on metalearning machines that learn to learn was published in 1987;[META][R3] it addressed metalearning in a very general way. Later work used gradient descent in LSTM networks[LSTM1] instead of traditional approaches. (There is another version of this article: Metalearning or Learning to Learn Since 1987, Juergen Schmidhuber, AI Blog.)
    https://people.idsia.ch/~juergen/deep-learning-history.html (AI Blog). Modern AI is dominated by artificial neural networks (NNs) and deep learning.[DL1-4] The text contains hyperlinks to relevant overview sites from my AI Blog. It also debunks certain popular but misleading historic accounts of deep learning and supplements my previous deep learning survey.[DL1] Modern AI systems learn to minimize pain, maximize pleasure, drive cars, etc.[MIR](Sec. 0)[DL1-4]

    Leibniz, father of computer science circa 1670, publishes the chain rule in 1676

    In 1676, Gottfried Wilhelm Leibniz published the chain rule, the basis of backpropagation in NNs (see below).

    Footnote 1. In 1684, Leibniz was also the first to publish "modern" calculus;[L84][SON18][MAD05][LEI21,a,b] later Isaac Newton was also credited for his unpublished work.[SON18] Their priority dispute,[SON18] however, did not encompass the chain rule.[LEI07-10] Of course, both were building on earlier work: in the 3rd century B.C., Archimedes (perhaps the greatest scientist ever[ARC06]) paved the way for infinitesimals.

    Footnote 3. Some claim that the backpropagation algorithm (discussed further down; now widely used to train deep NNs) is just the chain rule of Leibniz (1676) & L'Hopital (1696).
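
    For reference, a generic modern statement (notation mine) of the chain rule and of its repeated application in reverse-mode differentiation:

        $$\frac{d}{dx} f(g(x)) = f'(g(x))\,g'(x); \qquad y = f_L(f_{L-1}(\cdots f_1(x))) \;\Rightarrow\; \frac{\partial y}{\partial x} = J_{f_L} J_{f_{L-1}} \cdots J_{f_1},$$

    where the layer Jacobians are multiplied from the output backwards; evaluating this product in reverse order is what backpropagation does, at a cost comparable to one forward pass.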

    In 1795, Gauss used what is now called the method of least squares for linear regression, arguably the first widely used machine learning method.
    (See also the most cited NN of the 20th century and the most cited NN of the 21st century.[MOST])

    Who invented backpropagation?

    It took 4 decades until the backpropagation method of 1970[BP1-2] got widely accepted as a training method for deep NNs. Before 2010, many thought that the training of NNs with many layers requires unsupervised pre-training, a methodology introduced by myself in 1991[UN][UN0-3] (see below) and later championed by others (2006).[UN4][VID1]

    Our supervised deep learning breakthrough of 2010 (see the 10-year anniversary post) was called a "wake-up call to the machine learning community." Computer Vision was revolutionized in the 2010s by a particular feedforward NN called the convolutional NN (CNN).[CNN1-4]

    In 1987, NNs with convolutions were combined by Alex Waibel with weight sharing and backpropagation (see above),[BP1-2] and applied to speech.[CNN1a] Waibel did not call them CNNs but TDNNs.

    History of computer vision contests won by deep CNNs since 2011: our fast GPU-based CNN of 2011,[GPUCNN1] known as DanNet,[DAN,DAN1][R6] won the ICDAR 2011 Chinese handwriting contest, the IJCNN 2011 traffic sign contest, the ISBI 2012 image segmentation contest, the ICPR 2012 medical imaging contest, and the MICCAI 2013 Grand Challenge. DanNet won four of them in a row before the similar AlexNet/VGG Net and the ResNet[HW2] (a Highway Net[HW1] with open gates) joined the party.
    Winning four of them in a row, at IJCNN 2011 in Silicon Valley DanNet blew away the competition and achieved the first superhuman visual pattern recognition[DAN1] in an international contest. DanNet was also the first deep CNN to win a Chinese handwriting contest (ICDAR 2011) and an image segmentation contest (ISBI, May 2012). Later CNNs further extended the DanNet of 2011.[MIR](Sec. 19)[MOST] The ResNet, the most cited NN of the 21st century,[MOST] is a version (with open gates) of our earlier Highway Net (May 2015).[HW1-3][R5] The Highway Net (see below) is actually the feedforward net version of our vanilla LSTM (see below).[LSTM2] It was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers). Compare also our graph NN-like, Transformer-like Fast Weight Programmers of 1991,[FWP0-1][FWP6][FWP] which learn to continually rewrite mappings from inputs to outputs (addressed below); see the overviews.[MIR](Sec. 15, Sec. 17) Generative Adversarial Networks (GANs) have also become very popular.[MOST] They were first published in 1990 in Munich under the moniker Artificial Curiosity.[AC90-20][GAN1]

    Artificial Curiosity & Creativity since 1990-91.[AC20][AC][T22](Sec. XVII) A closely related idea is Predictability Minimization: an unsupervised minimax game where one neural network minimizes the objective function maximized by another. We used Predictability Minimization for creating disentangled representations of partially redundant data, applied to images in 1996.[PM0-2][AC20][R2][MIR](Sec. 7) The same period also produced recurrent NNs that learn to generate sequences of subgoals,[HRL1-2][PHD][MIR](Sec. 10) as well as the precursors of Transformers with "linearized self-attention"[TR5-6] (using attention terminology like the one I introduced in 1993[ATT][FWP2][R4]).
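
    In generic modern notation (mine, not that of the 1990-91 papers), such adversarial setups let two networks optimize one objective in opposite directions:

        $$\min_{\theta} \max_{\phi} \; V(\theta, \phi),$$

    e.g., in the curiosity setting a world model with parameters \theta minimizes its prediction error while a controller with parameters \phi is rewarded for generating experiments that maximize it; GANs instantiate the same minimax template with a generator and a discriminator.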

    26 March 1991: Neural nets learn to program neural nets with fast weights—like today's Transformer variants. This, too, was part of the Annus Mirabilis of 1990-1991.[MIR][MOST]

    The fast weight programmers of 1991-1992[FWPMETA1-9][HO1] extended my 1987 diploma thesis,[META1] which introduced algorithms not just for learning but also for meta-learning or learning to learn,[META] i.e., for learning better learning algorithms through experience. This became very popular in the 2010s,[DEC] when computers were a million times faster. Before the 1990s, however, RNNs failed to learn deep problems in practice.[MIR](Sec. 0) This motivated the Neural History Compressor,[UN1] the First Very Deep Learner of 1991, whose levels can be collapsed using my NN distillation procedure of 1991.[UN0-1][MIR] Transformers with linearized self-attention were also first published[FWP0-6] in the Annus Mirabilis of 1990-1991,[MIR][MOST] the year of Sepp Hochreiter's diploma thesis.
    Deep learning is hard because of the Fundamental Deep Learning Problem of vanishing or exploding gradients. The Long Short-Term Memory (LSTM) recurrent neural network[LSTM1-6] overcomes this problem, identified by Sepp in his above-mentioned 1991 thesis. Recurrent Neural Networks, especially LSTM, went on to win, among others, three ICDAR 2009 Connected Handwriting Competitions (French, Farsi, Arabic).

    Highway Networks:
    Our Highway Network[HW1] brought this to feedforward NNs; the ResNet (winner of the ImageNet 2015 contest) is a version thereof.

    Deep learning LSTMs brought essentially unlimited depth to supervised recurrent NNs; the LSTM-inspired Highway Nets later brought it to feedforward NNs. In the 2000s, we also introduced LSTM trained by policy gradients for reinforcement learning (2007).[RPG07][RPG][LSTMPG]
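
    In generic form (a textbook sketch, not the specific 2007 algorithm), policy gradient methods adjust the parameters \theta of a recurrent policy \pi_\theta by estimating

        $$\nabla_\theta J(\theta) = \mathbb{E}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid h_t)\, R_t\Big],$$

    where h_t is the recurrent state (e.g., of an LSTM) summarizing the interaction history and R_t the return; the recurrence is what lets such policies act under partial observability.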

    In the 2010s, Alphastar, whose brain has a deep LSTM core trained by PG,[DM3] became famous; Bill Gates called such an LSTM-based system a "huge milestone in advancing artificial intelligence".[OAI2a][MIR](Sec. 4)[LSTMPG] Our neural history compressors[UN][UN0-3] learn to represent percepts at multiple levels of abstraction and multiple time scales (see above), while our end-to-end differentiable NN-based subgoal generators[HRL3][MIR](Sec. 10) learn hierarchical action plans through gradient descent (see above). More sophisticated ways of learning to think in abstract ways were published later.

    Highlights of over 2000 years of computing history (Juergen Schmidhuber):

    2021: 375th birthday of Leibniz, father of computer science (Juergen Schmidhuber). Mechanical calculating machines go back to Wilhelm Schickard. In 1673, the already mentioned Gottfried Wilhelm Leibniz (called "the smartest man who ever lived"[SMO13]) designed the first machine that could perform all four arithmetic operations. Much later, Konrad Zuse built the first working program-controlled general computer; unlike Babbage, Zuse used Leibniz's binary principle. Compare Turing[TUR] (1936) and Post[POS] (1936).

    1941: Konrad Zuse builds the first working general computer; patent application 1936 (Juergen Schmidhuber). Cheap computers may soon match the raw computational power of all human brains combined.[RAW] Yet Gödel's results identified fundamental limits of any type of computation-based AI.[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]

    1931: Theoretical Computer Science & AI Theory founded by Gödel (Juergen Schmidhuber). His work built on that of Gottfried Wilhelm Leibniz[L86][WI48] (see above). In 1936, Alan M. Turing introduced the Turing machine. In 1964, Ray Solomonoff combined Bayesian (actually Laplacian[STI83-85]) probabilistic reasoning and theoretical computer science;[GOD][CHU][TUR][POS] together with Andrej Kolmogorov's independent work, this founded the theory of Kolmogorov complexity or algorithmic information theory (AIT),[AIT1-22] going beyond traditional information theory.[SHA48][KUL]

    In the early 2000s, Marcus Hutter (while working under my Swiss National Science Foundation grant[UNI]) augmented Solomonoff's theory of universal prediction. The most important events since the beginning of the universe seem to be neatly aligned on a timeline of exponential acceleration converging in an Omega point in the year 2040 or so (J. Schmidhuber, 2014); take all of this with a grain of salt, though.[OMG1] The first truly self-driving cars appeared in the 1980s, and soon robot cars were driving in highway traffic at up to 180 km/h.[AUT] Back then, I worked on my 1987 diploma thesis,[META1] which introduced algorithms not just for learning but also for meta-learning or learning to learn,[META] to learn better learning algorithms through experience (now a very popular topic[DEC]). And then came our Miraculous Year 1990-91[MIR] at TU Munich. In 1997, I also described the simplest and fastest way of computing all possible metaverses or computable universes.

    Some of the material above was taken from previous AI Blog posts.[MIR][DEC][GOD21][ZUS21][LEI21][AUT][HAB2][ARC06][AC][ATT][DAN][DAN1][DL4][GPUCNN5,8][DLC][FDL][FWP][LEC][META][MLP2][MOST][PLAN][UN][LSTMPG][BP4][DL6a][HIN][T22] See also my publication page.

    In 2022, we are celebrating the following works from a quarter-century ago:
    1. Journal paper on Long Short-Term Memory, the most cited NN of the 20th century (and basis of the most cited NN of the 21st).
    2. Paper on the simplest and fastest way of computing all possible metaverses or computable universes.
    3. Implementing artificial curiosity and creativity through generative adversarial agents that learn to design abstract, interesting computational experiments.
    4. Work on meta-reinforcement learning.
    5. Journal paper on hierarchical Q-learning.
    8. Journal paper on Low-Complexity Art, the Minimal Art of the Information Age.

    Additional references for this part:
    J. Schmidhuber (AI Blog, 2022). Annotated History of Modern AI and Deep Learning. Technical Report IDSIA-22-22, IDSIA, Lugano, Switzerland, 2022.
    J. Schmidhuber (AI Blog, 2005). Highlights of robot car history.
    [FDL] J. Schmidhuber (AI Blog, 2013). My First Deep Learning System of 1991 + Deep Learning Timeline 1960-2013.
    [HAB2] J. Schmidhuber (Blog, 2000). Most influential persons of the 20th century (according to Nature, 1999). The Haber-Bosch process has often been called the most important invention of the 20th century.[HAB1]
    [LEC] J. Schmidhuber (AI Blog, 2022). On LeCun, who also listed the "5 best ideas 2012-2022" without mentioning where they came from.
    [LEI21b] J. Schmidhuber (AI Blog, 2021). 375th birthday of Mr. Leibniz, the father of computer science (in German).
    J. Schmidhuber (Blog, 2006). Is History Converging? Again?
    [T22] J. Schmidhuber (AI Blog, 2022). Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21, IDSIA, Lugano, Switzerland, 2022.
    [VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen (Investigations of dynamic neural networks). Diploma thesis, TUM, 1991 (advisor J. Schmidhuber).
    J. Schmidhuber (AI Blog, 2021). 80th anniversary: 1941: Konrad Zuse builds the first working general-purpose computer, based on the 1936 patent application (in German; see [ZUS21]).
    (v1: 24 Sep 2021, v2: 31 Dec 2021) Versions since 2021 archived in the Internet Archive deep learning survey,[DL1] and can also be seen as a short history of the deep learning revolution, at least as far as ACM 2015 survey of deep learning[DL1] June 2020 article[T20a][R12] version 1 of the present report. expands material in my Critique of the 2019 Honda Prize[HIN] (~3,000 words). The text contains numerous hyperlinks to relevant overview sites from the AI Blog. first version Most of the critiques are based on references to original papers and material from the AI Blog.[AIB][MIR][DEC][HIN] this class of methods was pioneered in 1991[UN-UN2] (see Sec. II, III). Highway Net, were all driven by my lab:[MOST] In 1991, I had the first very deep NNs based on unsupervised pre-training;[UN-UN2] LSTMs later our Highway Nets[HW1-3] brought it to feedforward NNs. based on LSTM[LSTM0-6] (1990s-2005) and CTC (2006).[CTC] our CTC-LSTM-based speech recognition (not that of Hinton) had been on most smartphones for years[GSR][GSR15-19][DL4] (see Sec. A, VI, XI, XV). Similarly for machine translation (see Sec. B). called AlexNet,[GPUCNN4] without mentioning that our earlier groundbreaking deep GPU-based DanNet[GPUCNN1-3,5-8][DAN] did not need ReLUs at all to win 4 earlier object recognition competitions and to achieve superhuman results already in 2011[GPUCNN1-8][R5-6] (see Sec. XIV). described in the 1991-93 papers on Fast Weight Programmers and linear Transformers[FWP0-1,6] (see Sec. XVI, XVII-2). GANs are instances of the Adversarial Curiosity Principle of 1990[AC90-20][MIR](Sec. 5) (see Sec. XVII). it became really deep in 1991 in my lab, unsupervised pre-training of NNs, supervised LSTM. combines two methods from my lab: LSTM (1990s-2005) and CTC (2006), which were our revolutionary CTC-LSTM which was soon on most smartphones. (soon used for several billions of was also based on our LSTM. most visible breakthroughs deep NNs superior computer vision in 2011, winning 4 image recognition contests in a row is an open-gated version of our earlier Highway Nets. deep & fast CNN (where LeCun participated), deep GPU-NN of 2010 debunked unsupervised pre-training (introduced by myself in 1991 and later championed by Hinton), and our GPU-CNN of 2011 (DanNet) was the first first to win medical imaging competitions backpropagation CTC-LSTM We started this in 1990-93 long before LBH Artificial Curiosity vanishing gradients (1991), metalearning (1987), unsupervised pre-training (1991), compressing or distilling one NN into another (1991), learning sequential attention with NNs (1990), fast weight programmers using Sec. IV is on Turing (1936) and his predecessors In the recent decade of deep learning, (speech recognition, language translation, etc.) on billions of devices (also healthcare applications)

    Critique of 2018 Turing Award modern backpropagation principles of generative adversarial NNs and artificial curiosity (1990),[AC90,90b][AC20] unsupervised pre-training for deep NNs (1991),[UN1-2] vanishing gradients (1991)[VAN1] & Long Short-Term Memory or LSTM (Sec. A), NNs with over 100 layers (2015),[HW1-3][R5] fast weight programmers (1991).[FWP0-2,6] Often LBH failed to cite essential prior work, even in their later surveys.[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5, R7-R8] The deep NNs By the 2010s,[DEC] they were Long Short-Term Memory vanishing gradient problem (Sec. 3,Sec. 4) through "forget gates" based on end-to-end-differentiable fast weights.[MIR](Sec. 8)[FWP,FWP0-1] became the first recurrent NN (RNN) to win international competitions. LSTM[MIR](Sec. 4) However, such attention mechanisms also have their roots in my lab (1991);[FWP][FWP0-2,6] In the 2010s, Alphastar whose brain has a deep LSTM core trained by PG.[DM3] Bill Gates called this a "huge milestone in advancing artificial intelligence".[OAI2a][MIR](Sec. 4)[LSTMPG] in healthcare, D. Computer Vision was revolutionized in the 2010s by In 1987, NNs with convolutions were combined by Waibel with weight sharing and backpropagation.[CNN1a] Waibel did not call this CNNs but TDNNs. unsupervised pre-training is not necessary Our fast GPU-based CNN of 2011[GPUCNN1] known as DanNet[DAN,DAN1][R6] winning four of them at IJCNN 2011 in Silicon Valley, DanNet blew away the competition and achieved the first superhuman visual pattern recognition[DAN1] in an international contest (where LeCun DanNet was also the first deep CNN to win: a Chinese handwriting contest (ICDAR 2011), an image segmentation contest (ISBI, May 2012), further extended the work of 2011.[MIR](Sec. 19) most cited neural network,[MOST] is a version (with open gates) of our earlier Highway Net (May 2015).[HW1-3][R5] The Highway Net is actually the feedforward net version of vanilla LSTM.[LSTM2] It was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers).

    Critique of 2018 Turing Award appeared long before the 1980s. containing the now popular multiplicative gates).[DEEP1-2][DL1-2] A paper of 1971[DEEP2] already described a deep learning net with 8 layers, trained by their highly cited method which was still popular in the new millennium,[DL2] especially in Eastern Europe, where much of Machine Learning was born.[MIR](Sec. 1)[R8] LBH failed to cite this, just like they failed to cite Amari,[GD1] who in 1967 proposed stochastic gradient descent[STO51-52] (SGD) for MLPs and whose implementation[GD2,GD2a] (with Saito) learned internal representations at a time when compute was billions of times more expensive than today (see also Tsypkin Minsky was apparently unaware of this and failed to correct it later.[HIN](Sec. I) However, it became really deep in 1991 in my lab,[UN-UN3] which has First Very Deep NNs, Based on Unsupervised Pre-Training (1991). more.) drove the shift from unsupervised pre-training to purely supervised learning (1991-95; 2006-10).[HIN](Sec. II)[MIR] (Sec. 19) LSTMs brought essentially unlimited depth to gradient-based supervised recurrent NNs; Highway Nets[HW1-3] brought it to feedforward NNs.[MOST]

    Critique of 2018 Turing Award by others (Sec. modern backpropagation principles of generative adversarial NNs and artificial curiosity (1990),[AC90,90b][AC20] unsupervised pre-training for deep NNs,[UN1-2] the vanishing gradient problem (1991)[VAN1] & solutions to it (Sec. A), Often LBH failed to cite essential prior work.[DLC][HIN][MIR](Sec. 21) we had this type of deep learning already in 1991;[UN][UN1-2] see Sec.

    Critique of 2018 Turing Award any type of computation-based AI.[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]

    In 1936, Turing my reply to Hinton who criticized my website on Turing Likewise, Konrad Zuse (1910-1995) created the world

    Critique of 2018 Turing Award modern backpropagation principles of generative adversarial NNs and artificial curiosity (1990),[AC][AC90,90b][AC10][AC20] unsupervised pre-training for deep NNs (1991),[UN1-2][UN] vanishing gradients (1991)[VAN1] & solutions to it (Sec. A),[LSTM0-17][CTC] record-breaking deep supervised NNs and contest-winning deep CNNs (2011),[DAN][DAN1][GPUCNN5] NNs with over 100 layers (2015),[HW1-3][R5] fast weight programmers (1991),[FWP0-2,6] Often LBH failed to cite essential prior work.[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5,R7,R8,R11]

    Critique of 2018 Turing Award "advances in natural language processing" and in speech supervised NNs and CNNs and through Highway Net-like NNs (2015),[HW1-3][R5] although the principles of CNNs were invented and developed by others since the 1970s.[CNN1-4] See Sec. D & XVIII & XIV as well as Sec. 4 & Sec. 19 of the overview.[MIR]

    Critique of 2018 Turing Award DanNet[DAN][DAN1][GPUCNN5] the first NN to win a medical imaging contest through deep learning approach of

    Critique of 2018 Turing Award

    Critique of 2018 Turing Award who failed to cite them, even in later

    Critique of 2018 Turing Award the first machine learning others built careers on this notion long before LBH recognized this.[DEEP1-2][CNN1][HIN][R8][DL1][DLC] Even deep learning through unsupervised pre-training was introduced by others.[UN1-3][R4][HIN](Sec. II)

    Critique of 2018 Turing Award made GPU-based NNs fast and deep enough unsupervised pre-training (pioneered by myself in 1991) is not necessary our CNNs were deep and fast enough[DAN][DAN1][GPUCNN5]

    Furthermore, by the mid 2010s, speech recognition and machine translation

    Critique of 2018 Turing Award (and apparently even other award committees[HIN](Sec. I)) recent debate:[HIN] It is true that in 2018,

    Critique of 2018 Turing Award X.[MIR](Sec. 1)[R8]

    Critique of 2018 Turing Award fast deep CNN called DanNet as well as Sec. 19 of the overview.[MIR]

    Critique of 2018 Turing Award

    Critique of 2018 Turing Award In the 2010s, was actually the LSTM of our team,[LSTM0-6] which Bloomberg called the "arguably the most commercial AI achievement."[AV1][MIR](Sec. 4) See Sec. B. adaptive neural sequential attention: end-to-end-differentiable "soft" attention in the latent space of Fast Weight Programmers (FWPs),[FWP2][FWP] and "hard" attention (in observation space) in the context of RL[ATT][ATT0-1] (1990). FWPs of 1991[FWP0-1] which have become a popular alternative to RNNs. the 2010s,[DEC] the attention terminology[FWP2] now used

    See[MIR](Sec. 9)[R4] for my related priority dispute on attention with Hinton.

    Critique of 2018 Turing Award a simple application[AC] of the adversarial curiosity (AC) principle the other (1991).[PM1-2][AC20][R2][MIR](Sec. 5) vanishing gradient problem,[MIR](Sec. 3)[VAN1] Bengio published his own,[VAN2] without citing Sepp. my publications on exactly this topic Regarding attention-based Transformers,[TR1-6] Bengio[DL3a] cites his own team (2014) for "soft attention" without citing my much earlier original work of 1991-1993 on soft attention and linear Transformers.[FWP,FWP0-2,6] unsupervised pre-training for deep NNs.[UN0-4][HIN](Sec. II)[MIR](Sec. 1) the first NNs shown to solve very deep problems compressing or distilling one NN into another.[UN0-2][DIST1-2][MIR](Sec. 2) fast weight programmers[FWP][FWP0-4a] through tensor-like outer products (1991-2016) and their motivation[FWP2][FWP4a][MIR](Sec. 8) (see also Sec. XVI above). learning sequential attention with NNs.[MIR](Sec. 9) our much earlier work on this[ATT1][ATT] although

    Critique of 2018 Turing Award However, the basic CNN architecture with convolutional and downsampling layers is actually due to Fukushima (1979).[CNN1] NNs with convolutions were later (1987) combined by Waibel with weight sharing and backpropagation.[CNN1a] Waibel called this TDNN and at IJCNN 2011 in Silicon Valley, our DanNet[DAN][GPUCNN1-3] won the superhuman performance at ICPR 2012, our DanNet[GPUCNN1-3] won the medical imaging contest All of these fields were heavily shaped in the 2010s by our non-CNN methods.[DL1][DL4][AM16][GSR][GSR15][GT16][WU][FB17] See

    Critique of 2018 Turing Award As mentioned in Sec. XII, backpropagation was actually proposed earlier as a learning method for NNs by Werbos (1982)[BP2-4] (see also Amari

    Critique of 2018 Turing Award

    Critique of 2018 Turing Award adaptive subgoal generators (1991)[HRL0-2] were trained through end-to-end-differentiable chains of such modules.[MIR](Sec. 10) planning and reinforcement learning with recurrent neural world models (1990).[PLAN][MIR](Sec. 11) Same for my linear transformer-like fast weight programmers[FWP0-2][FWP][ATT][MIR](Sec. 8) since 1991 (see Sec. XVI) ad hominem attacks[AH2-3][HIN] As emphasized earlier:[DLC][HIN] GANs are variations LBH, who called themselves the deep learning conspiracy,[DLC][DLC1-2] unsupervised pre-training for deep NNs by myself our deep and fast DanNet (2011)[GPUCNN1-3] as Backpropagation is a previously invented artificial curiosity by ours[UN0-2][UN] So virtually all the algorithms that have attracted our LSTM brought essentially unlimited depth to gradient-based supervised recurrent NNs in the 1990s; our Highway Nets[HW1-3] brought it to feedforward NNs in May 2015.[MOST] medical diagnosis (2012, see Sec. VII, XVIII), and many other applications.[DEC] and many other applications.[DEC]

    As mentioned earlier,[MIR](Sec. 21) it is not always clear[DLC] who should get credit for methods such as backpropagation; settling such questions is a task for AI scientists and AI historians equipped with artificial curiosity.[SA17][AC90-AC20][PP-PP2][R1] See also my publication page and the references below.

    [AC] J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. Covers the first paper on planning with reinforcement learning recurrent neural networks (NNs) and on generative adversarial networks, plus more on artificial scientists and artists.
    [AIB] J. Schmidhuber. AI Blog. Includes variants of chapters of the AI Book.
    [ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. We had both hard attention (1990) and soft attention (1991-93).[FWP] Today, both types are very popular.
    [BP4] J. Schmidhuber (AI Blog, 2014; updated 2020). Who invented backpropagation?
    [DAN] J. Schmidhuber (AI Blog, 2021). 10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution.
    [DAN1] J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet). First superhuman visual pattern recognition, achieved by our artificial neural network called DanNet.
    [DEC] J. Schmidhuber (AI Blog, 02/20/2020, revised 2021). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s.
    [DL4] J. Schmidhuber (AI Blog, 2017). Our impact on the world's most valuable public companies, based on LSTM.
    J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.[DL6] Soon after its publication, everybody started talking about "deep learning." Causality or correlation? Either way, this type of deep learning dates back to 1991.[UN1-2][UN]
    [DLC] J. Schmidhuber (AI Blog, June 2015). Critique of Paper by self-proclaimed[DLC1-2] "Deep Learning Conspiracy" (Nature 521 p 436).
    [FWP] J. Schmidhuber (AI Blog, 26 March 2021). 26 March 1991: Neural nets learn to program neural nets with fast weights, introducing the attention terminology[FWP2] now used in Transformers.
    [GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR 2012), p. 3642-3649, July 2012. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV].
    [GPUCNN5] J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet). History of computer vision contests won by deep CNNs since 2011. DanNet won 4 of them in a row before the similar AlexNet/VGG Net and the ResNet (a Highway Net with open gates) joined the party; it was also the first deep learner to win a medical imaging contest (2012). Today, deep CNNs are standard in computer vision.
    [HIN] J. Schmidhuber (AI Blog, 2020). Critique of Honda Prize for Dr. Hinton. Science must not allow corporate PR to distort the academic record.
    [LEI21] J. Schmidhuber (AI Blog, 2021). 375th birthday of Leibniz, founder of computer science. Frankfurter Allgemeine Zeitung (FAZ), 17/5/2021.
    J. Schmidhuber (AI Blog, Dec 2020). 10-year anniversary of our journal paper on deep reinforcement learning with policy gradients for LSTM (2007-2010).
    J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of the 1987 work on metalearning that also introduced better GP methods through Meta-Evolution. Searchable PDF scan (created by OCRmypdf, which uses LSTM).
    [MIR] J. Schmidhuber (AI Blog, 2019). Deep Learning: Our Miraculous Year 1990-1991.
    J. Schmidhuber (AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training.
    [MOST] J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in my labs at TU Munich and IDSIA: (1) Long Short-Term Memory (LSTM), (2) ResNet (our earlier Highway Net with open gates), (3) AlexNet and VGG Net (both citing our similar earlier DanNet, the first deep convolutional NN to win image recognition competitions), (4) GANs (instances of our Adversarial Artificial Curiosity), and (5) variants of Transformers (linear Transformers are formally equivalent to my earlier Fast Weight Programmers). Most of this goes back to our Annus Mirabilis of 1990-1991.[MIR]
    [NASC1] J. Schmidhuber. First Pow(d)ered flight / plane truth. Correspondence, Nature, 421, p. 689, Feb 2003.
    [NASC3] J. Schmidhuber. The last inventor of the telephone. Letter, Science, 319, no. 5871, p. 1759, March 2008.
    [NASC6] J. Schmidhuber. Colossus was the first electronic digital computer. Correspondence, Nature, 441, p. 25, May 2006.
    [NASC7] J. Schmidhuber. Turing.
    [NASC8] J. Schmidhuber. Prototype resilient, self-modeling robots. Correspondence, Science, 316, no. 5825, p. 688, May 2007.
    [NASC9] J. Schmidhuber. Comparing the legacies of Gauss, Pasteur, Darwin. Correspondence, Nature, 452, p. 530, April 2008.
    [PLAN] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, and the GAN principle.
    A famous experimental analysis of backpropagation did not cite the origin of the method,[BP1-4] also known as the reverse mode of automatic differentiation.
    [T20a] J. Schmidhuber (AI Blog, 25 June 2020). Critique of 2018 Turing Award for Drs. Bengio & Hinton & LeCun. The first version of the present critique. Technical Report IDSIA-77-21 (v1), IDSIA, 24 Sep 2021.
    [TUR21] J. Schmidhuber (AI Blog, Sep 2021). Turing Oversold.
    [UN] J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised pre-training. Such unsupervised approaches are now widely used, although unsupervised pre-training is not necessary for today's supervised deep learners. More on the Fundamental Deep Learning Problem.
    J. Schmidhuber (AI Blog, 2021). 80th anniversary celebrations. 1941: Konrad Zuse completes the first working general computer (a general, practical, program-controlled computer), based on his 1936 patent application.

    Deep learning networks appeared long before the 1980s: Ivakhnenko's deep multilayer networks of the 1960s already had many layers (and already contained the now popular multiplicative gates).[DEEP1-2][DL1-2] A paper of 1971[DEEP2] already described a deep net trained by his highly cited method, which was still popular in the new millennium,[DL2] especially in Eastern Europe, where much of Machine Learning was born. Ivakhnenko did not call it an NN, but that is what it was; Minsky was apparently unaware of this and failed to correct it later.[HIN](Sec. I) However, deep learning became really deep in 1991 in my lab,[UN-UN3] which produced the first very deep NNs based on unsupervised pre-training (1991) and then drove the shift from unsupervised pre-training to purely supervised learning (1991-95; 2006-10).[HIN](Sec. II)[MIR](Sec. 19) LSTMs brought essentially unlimited depth to supervised recurrent NNs; Highway Nets[HW1-3] brought it to feedforward NNs.[MOST]
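
    To see why such gated nets can be very deep, here is a minimal sketch of a highway-style layer (a toy illustration, not the original experimental setup of [HW1-3]; the widths, random weights, and negative gate bias are illustrative assumptions): a transform gate blends a candidate transformation with the unchanged input, so information can be carried through many layers, and with the gate forced open the layer reduces to a residual block.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, b_h, W_t, b_t):
    """Gated mix of a candidate transform H(x) and the carried-through input x."""
    H = np.tanh(W_h @ x + b_h)       # candidate transformation
    T = sigmoid(W_t @ x + b_t)       # transform gate in (0, 1)
    return T * H + (1.0 - T) * x     # T = 1 everywhere would give a residual block

rng = np.random.default_rng(1)
d, depth = 16, 100                   # illustrative width and depth
x = rng.standard_normal(d)
for _ in range(depth):               # stack many layers; the carry path keeps signals alive
    x = highway_layer(x,
                      rng.standard_normal((d, d)) * 0.1, np.zeros(d),
                      rng.standard_normal((d, d)) * 0.1, -2.0 * np.ones(d))
print(np.round(x[:4], 3))
```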

    Again, the key contributions were made by others (see the sections above): modern backpropagation, the principles of generative adversarial NNs and artificial curiosity (1990),[AC90,90b][AC20] unsupervised pre-training for deep NNs,[UN1-2] and the vanishing gradient problem (1991)[VAN1] & solutions to it (Sec. A). Often LBH failed to cite essential prior work.[DLC][HIN][MIR](Sec. 21) In particular, we had this type of deep learning already in 1991 (see Sec. II).[UN][UN1-2]

    In 1931, Gödel founded theoretical computer science and identified fundamental limits of any type of computation-based AI.[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]

    In 1936, Turing published his famous paper on the limits of computation (see also my reply to Hinton, who criticized my website on Turing). Likewise, Konrad Zuse (1910-1995) created the world's first working general program-controlled computer (patent application 1936; working machine 1941).

    Once more, decisive prior work exists for each of the cited achievements: modern backpropagation; the principles of generative adversarial NNs and artificial curiosity (1990);[AC][AC90,90b][AC10][AC20] unsupervised pre-training for deep NNs (1991);[UN1-2][UN] vanishing gradients (1991)[VAN1] & solutions to it (Sec. A);[LSTM0-17][CTC] record-breaking deep supervised NNs and contest-winning deep CNNs (2011);[DAN][DAN1][GPUCNN5] NNs with over 100 layers (2015);[HW1-3][R5] and fast weight programmers (1991).[FWP0-2,6] Often LBH failed to cite this essential prior work.[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5,R7,R8,R11]
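
    The vanishing gradient problem behind several of these items can be illustrated with toy arithmetic (the numbers below are purely illustrative and are not taken from [VAN1]): backpropagating through many steps of a plain recurrent unit multiplies the error signal by a local factor at every step, so it shrinks exponentially, whereas a gated additive cell with a forget gate near 1 keeps the factor close to 1.

```python
import numpy as np

# Toy illustration of vanishing gradients vs. gated, nearly constant error flow.
T = 100
w, slope = 0.9, 0.5                        # plain unit: recurrent weight * typical tanh slope
plain_factor = np.prod(np.full(T, w * slope))
forget = 0.999                             # gated additive cell: forget gate close to 1
gated_factor = np.prod(np.full(T, forget))

print(f"plain recurrent unit, factor after {T} steps: {plain_factor:.3e}")  # ~1e-35
print(f"gated additive cell,  factor after {T} steps: {gated_factor:.3e}")  # ~0.9
```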

    Critique of 2018 Turing Award "advances in natural language processing" and in speech supervised NNs and CNNs and through Highway Net-like NNs (2015),[HW1-3][R5] although the principles of CNNs were invented and developed by others since the 1970s.[CNN1-4] See Sec. D & XVIII & XIV as well as Sec. 4 & Sec. 19 of the overview.[MIR]

    DanNet[DAN][DAN1][GPUCNN5] was the first NN to win a medical imaging contest through deep learning (2012).

    LBH failed to cite them, even in later surveys.

    The first machine learning methods for deep NNs came from others, who built careers on this notion long before LBH recognized it.[DEEP1-2][CNN1][HIN][R8][DL1][DLC] Even deep learning through unsupervised pre-training was introduced by others.[UN1-3][R4][HIN](Sec. II)

    Our work of 2010 made GPU-based NNs fast and deep enough to show that unsupervised pre-training (pioneered by myself in 1991) is not necessary, and our CNNs of 2011 were deep and fast enough to win computer vision competitions.[DAN][DAN1][GPUCNN5]

    Furthermore, by the mid 2010s, speech recognition and machine translation were largely based on our LSTM (see Sec. A, B).

    This was apparently ignored by the award committee (and apparently even by other award committees[HIN](Sec. I)) in the recent debate.[HIN]

    See the sections above on our fast deep CNN called DanNet, as well as Sec. 19 of the overview.[MIR]

    In the 2010s, many of the most visible AI applications were actually based on the LSTM of our team,[LSTM0-6] which Bloomberg called "arguably the most commercial AI achievement."[AV1][MIR](Sec. 4) See Sec. B. We also introduced adaptive neural sequential attention: end-to-end-differentiable "soft" attention in the latent space of Fast Weight Programmers (FWPs),[FWP2][FWP] and "hard" attention (in observation space) in the context of RL (1990).[ATT][ATT0-1] The FWPs of 1991[FWP0-1] are formally equivalent to linear Transformers, which have become a popular alternative to RNNs in the 2010s,[DEC] and the attention terminology[FWP2] now used in Transformers goes back to that work.

    See[MIR](Sec. 9)[R4] for my related priority dispute on attention with Hinton.

    GANs are a simple application[AC] of the adversarial curiosity (AC) principle of 1990 and of predictability minimization, where one network minimizes the objective function maximized by the other (1991).[PM1-2][AC20][R2][MIR](Sec. 5) After Sepp Hochreiter's analysis of the vanishing gradient problem,[MIR](Sec. 3)[VAN1] Bengio published his own,[VAN2] without citing Sepp and without citing my publications on exactly this topic. Regarding attention-based Transformers,[TR1-6] Bengio[DL3a] cites his own team (2014) for "soft attention" without citing my much earlier original work of 1991-1993 on soft attention and linear Transformers.[FWP,FWP0-2,6] The same holds for unsupervised pre-training for deep NNs,[UN0-4][HIN](Sec. II)[MIR](Sec. 1) the first NNs shown to solve very deep problems, compressing or distilling one NN into another,[UN0-2][DIST1-2][MIR](Sec. 2) fast weight programmers[FWP][FWP0-4a] through tensor-like outer products (1991-2016) and their motivation[FWP2][FWP4a][MIR](Sec. 8) (see also Sec. XVI above), and learning sequential attention with NNs,[MIR](Sec. 9) where our much earlier work[ATT1][ATT] went uncited.
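
    One of the items above, compressing or distilling one NN into another, is easy to sketch. The following toy example is an assumption-laden illustration rather than the 1991 procedure: a small linear "student" is fit to imitate the input-output behaviour of a fixed nonlinear "teacher" on unlabeled data.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_hidden, d_out = 10, 64, 3                     # illustrative sizes

# Stand-in for an already-trained "teacher" network (random weights here).
W1 = rng.standard_normal((d_hidden, d_in))
W2 = rng.standard_normal((d_out, d_hidden))
teacher = lambda X: np.tanh(X @ W1.T) @ W2.T

X = rng.standard_normal((2000, d_in))                 # unlabeled inputs
Y = teacher(X)                                        # teacher outputs become the targets

S = np.zeros((d_out, d_in))                           # tiny linear student
lr = 0.05
for _ in range(200):                                  # gradient descent on the imitation error
    pred = X @ S.T
    grad = (pred - Y).T @ X / len(X)                  # gradient of the mean squared error
    S -= lr * grad

# A linear student can only approximate the nonlinear teacher, but the error drops.
print("student's imitation MSE:", round(float(np.mean((X @ S.T - Y) ** 2)), 3))
```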

    However, the basic CNN architecture with convolutional and downsampling layers is actually due to Fukushima (1979).[CNN1] NNs with convolutions were later (1987) combined by Waibel with weight sharing and backpropagation;[CNN1a] Waibel called this a TDNN, not a CNN. At IJCNN 2011 in Silicon Valley, our DanNet[DAN][GPUCNN1-3] won the contest with superhuman performance; at ICPR 2012, our DanNet[GPUCNN1-3] won the medical imaging contest. All of these fields were also heavily shaped in the 2010s by our non-CNN methods.[DL1][DL4][AM16][GSR][GSR15][GT16][WU][FB17]
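
    For readers unfamiliar with the terminology, here is a minimal sketch of the two CNN ingredients mentioned above, convolution with weight sharing and downsampling (the sizes, random image, and pooling choice are illustrative; this is not Fukushima's or Waibel's original architecture):

```python
import numpy as np

rng = np.random.default_rng(3)
image = rng.standard_normal((12, 12))      # toy single-channel input
kernel = rng.standard_normal((3, 3))       # one shared kernel: the same weights everywhere

def conv2d_valid(img, k):
    """Slide the shared kernel over the image (weight sharing)."""
    kh, kw = k.shape
    out = np.empty((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

def downsample2x(fmap):
    """2x2 average pooling: the downsampling layer."""
    h, w = fmap.shape[0] // 2 * 2, fmap.shape[1] // 2 * 2
    return fmap[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

features = np.tanh(conv2d_valid(image, kernel))   # convolution + squashing nonlinearity
pooled = downsample2x(features)
print(features.shape, pooled.shape)               # (10, 10) -> (5, 5)
```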

    As mentioned in Sec. XII, backpropagation was actually proposed earlier as a learning method for NNs by Werbos (1982)[BP2-4] (see also Amari's even earlier related work).
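
    Backpropagation is the reverse mode of automatic differentiation applied to NN error functions. Below is a minimal hand-rolled sketch with a finite-difference check (the tiny two-layer net and all sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
W1, W2 = rng.standard_normal((5, 3)), rng.standard_normal((1, 5))
x, target = rng.standard_normal(3), np.array([0.7])

# forward pass, keeping intermediates
a = W1 @ x
h = np.tanh(a)
y = W2 @ h
loss = 0.5 * np.sum((y - target) ** 2)

# reverse sweep: propagate d(loss)/d(.) from the output back to the weights
dy = y - target                       # d loss / d y
dW2 = np.outer(dy, h)                 # d loss / d W2
dh = W2.T @ dy                        # d loss / d h
da = dh * (1.0 - np.tanh(a) ** 2)     # through the tanh nonlinearity
dW1 = np.outer(da, x)                 # d loss / d W1

# numerical check on one weight
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
loss_p = 0.5 * np.sum((W2 @ np.tanh(W1p @ x) - target) ** 2)
print(dW1[0, 0], (loss_p - loss) / eps)   # the two numbers should agree to several digits
```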
