LBH and their co-workers have contributed certain useful improvements of existing deep learning methods.^{[CNN2,4][CDI][LAN][RMSP][XAV][ATT14][CAPS]} However, the field's essential foundations were laid by others: deep learning multilayer perceptrons (1965),^{[DEEP1-2][R8]} modern backpropagation (1970),^{[BP1-2][R7]} architectures of recurrent NNs (1943-56)^{[MC43][K56]} and convolutional NNs (1979),^{[CNN1]} principles of generative adversarial NNs and artificial curiosity (1990),^{[AC90,90b][AC20]} unsupervised pre-training for deep NNs (1991),^{[UN1-2]} vanishing gradients (1991)^{[VAN1]} & Long Short-Term Memory or LSTM (Sec. A), GPU-accelerated NNs (2004),^{[GPUNN][DAN][DAN1][GPUCNN5]} NNs with over 100 layers (2015),^{[HW1-3][R5]} and transformer-like^{[TR1-6][FWP]} attention^{[FWP][ATT]} through fast weight programmers (1991).^{[FWP0-2,6]}^{[DL1-2][R2-R8]} Often LBH failed to cite essential prior work, even in their later surveys.^{[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5, R7-R8]} This may explain some of ACM's misattributions.^{[T19]} See Sec. II & III & V & XIII & X & XVII & XII & XVIII & XX.

By the 2010s,^{[DEC]} deep NNs were heavily used in both academia and industry.^{[DL4]} The main applications mentioned by ACM are labeled as A, B, C, D below.

A. Long Short-Term Memory or LSTM (1990s-2005)^{[LSTM0-6]} overcame the vanishing gradient problem, identified and analyzed in 1991 by my student Sepp Hochreiter.^{[VAN1]} This happened long before the similar work of Bengio (see Sec. XVII).^{[MIR](Sec. 3, Sec. 4)} LSTM was refined with my student Felix Gers^{[LSTM2]} through "forget gates" based on end-to-end-differentiable fast weights.^{[MIR](Sec. 8)[FWP,FWP0-1]} (A2) Connectionist Temporal Classification (CTC) was developed by my student Alex Graves et al. (2006).^{[CTC]} Our team successfully applied CTC-trained LSTM to speech in 2007^{[LSTM4]} (also with hierarchical LSTM stacks^{[LSTM14]}). This end-to-end approach was very different from, and superior to, hybrid methods combining NNs and hidden Markov models (HMMs)^{[BW][BRI][BOU]} (Sec. XV). Hinton et al. (2012) still used the old hybrid approach^{[HYB12]} and did not compare it to CTC-LSTM. CTC-trained LSTM became the first recurrent NN (RNN) to win international competitions.
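To make the mechanism concrete, here is a minimal single-step sketch of a vanilla LSTM cell with a forget gate, in plain Python. The gate names and scalar weights are illustrative assumptions for exposition, not code from any of the cited papers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, W):
    # One step of a 1-unit LSTM cell with a forget gate (illustrative).
    # W maps gate name -> (w_x, w_h, bias); all scalars for clarity.
    def gate(name, squash):
        wx, wh, b = W[name]
        return squash(wx * x + wh * h + b)
    i = gate("in", sigmoid)      # input gate
    f = gate("forget", sigmoid)  # forget gate
    o = gate("out", sigmoid)     # output gate
    g = gate("cell", math.tanh)  # candidate cell input
    c = f * c + i * g            # additive cell update
    h = o * math.tanh(c)         # new hidden state
    return h, c
```

The additive update `c = f * c + i * g` is what lets error signals flow across many time steps without vanishing, as long as the forget gate stays near 1.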
He later reused our end-to-end neural speech recognizer^{[LSTM4][LSTM14]} as a postdoc in Hinton's lab.^{[LSTM8]} CTC-LSTM dramatically improved Google's speech recognition.^{[GSR][GSR15][DL4]} By 2019, Google's on-device speech recognition^{[GSR19]} (no longer on the server) was based on LSTM^{[MIR](Sec. 4)} (see Sec. VI & XI & XV).

B. As early as 1995, we had an excellent neural probabilistic model of text^{[SNT]} (see Sec. XVI). In 2001, we showed that LSTM can learn languages unlearnable by traditional models such as HMMs.^{[LSTM13]} See also Sec. VI & XI & XV. Modern machine translation also profits from attention mechanisms tailored by Bengio's team.^{[ATT14][FWP]} However, such attention mechanisms have their roots in my lab (1991);^{[FWP][FWP0-2,6]} see Sec. XVI.

C. Robotics & RL etc. Since 2003, our team has used LSTM for Reinforcement Learning (RL) and robotics.^{[LSTM-RL][RPG][LSTMPG]} In the 2010s, this combination achieved remarkable results. For example, in 2018, an LSTM trained by policy gradients (PG) was the core of OpenAI's famous Dactyl which learned to control a dexterous robot hand without a teacher.^{[OAI1][OAI1a]} In 2019, DeepMind beat a pro player in the game of Starcraft, which is theoretically harder than Chess or Go^{[DM2]} in many ways, using Alphastar whose brain has a deep LSTM core trained by PG.^{[DM3]} A PG-trained LSTM was also the core of OpenAI Five which learned to defeat human experts in the Dota 2 video game (2018).^{[OAI2]} Bill Gates called this a "huge milestone in advancing artificial intelligence".^{[OAI2a][MIR](Sec. 4)[LSTMPG]}

Apart from A, B, C above, LSTM is used in healthcare, chemistry, molecular design, lip reading, speech synthesis,^{[AM16]} predicting what's going on in nuclear fusion reactors, and so on.^{[DEC][DL4]} A significant fraction of the inference compute in Google's datacenters was being used for LSTM (only 5% for the CNNs of Sec. D).^{[JOU17]} Apparently the first LSTM journal paper^{[LSTM1][R5]} is now among the most frequently cited NN papers.

D. Computer Vision was revolutionized in the 2010s by a particular feedforward NN called the convolutional NN (CNN).^{[CNN1-4]} The basic CNN architecture with convolutional and downsampling layers is due to Fukushima (1979).^{[CNN1]} The popular downsampling variant called max-pooling was introduced by Weng et al.
(1993).^{[CNN3]} In 1987, NNs with convolutions were combined by Waibel with weight sharing and backpropagation.^{[CNN1a]} Waibel did not call this CNNs but TDNNs. LeCun's team later contributed improvements of CNNs, especially for images^{[CNN2,4]} (see Sec. XVIII). Finally, my own team showed in 2010^{[MLP1]} that unsupervised pre-training is not necessary to train deep NNs, contrary to claims by Hinton^{[VID1]} who said that "nobody in their right mind would ever suggest" this. Then our fast GPU-based CNN of 2011^{[GPUCNN1]} known as DanNet^{[DAN,DAN1][R6]} went far beyond the earlier GPU-accelerated CNNs of 2006.^{[GPUCNN]} DanNet entered important computer vision competitions, winning four of them in a row (15 May 2011, 6 Aug 2011, 1 Mar 2012, 10 Sep 2012).^{[GPUCNN5]} At IJCNN 2011 in Silicon Valley, DanNet blew away the competition and achieved the first superhuman visual pattern recognition^{[DAN1]} in an international contest (where LeCun's team took a distant second place, with three times worse performance). DanNet was also the first deep CNN to win a Chinese handwriting contest (ICDAR 2011) and an image segmentation contest (ISBI, May 2012). Our CVPR paper on DanNet^{[GPUCNN3]} appeared before the similar AlexNet of Hinton's student Krizhevsky won the ImageNet^{[IM09]} 2012 contest^{[GPUCNN4-5][R6]} (now also without unsupervised pre-training, citing DanNet). Our CNN image scanners were 1000 times faster than previous methods.^{[SCAN]} The VGG network (ImageNet 2014 winner)^{[GPUCNN9]} and other highly cited CNNs^{[RCNN1-3]} further extended the work of 2011.^{[MIR](Sec. 19)} ResNet, the ImageNet 2015 winner^{[HW2]} (Dec 2015), which currently gets more citations per year than any other NN,^{[MOST]} is a version of our earlier Highway Net (May 2015).^{[HW1-3][R5]} The Highway Net is actually the feedforward net version of vanilla LSTM.^{[LSTM2]} It was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers). See also Sec. XVIII & XIV & XI & VI.
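The relation between gated layers and residual paths can be illustrated with a toy Highway layer in plain Python (a hypothetical minimal sketch; the parameter names `Wh`, `Wt`, `bt` are made up for exposition): the layer mixes a nonlinear transform H(x) with the unchanged input x via a sigmoid transform gate T(x).

```python
import math

def highway_layer(x, Wh, Wt, bt):
    # One Highway layer: y = T(x)*H(x) + (1 - T(x))*x, elementwise.
    # H: tanh transform; T: sigmoid "transform gate" with shared bias bt.
    # A strongly negative bt closes the gates, making the layer nearly
    # the identity, so error signals can pass through very many layers.
    n = len(x)
    H = [math.tanh(sum(Wh[i][j] * x[j] for j in range(n))) for i in range(n)]
    T = [1.0 / (1.0 + math.exp(-(sum(Wt[i][j] * x[j] for j in range(n)) + bt)))
         for i in range(n)]
    return [T[i] * H[i] + (1.0 - T[i]) * x[i] for i in range(n)]
```

With the gate saturated open, the transform path dominates; with it closed, the input is carried through unchanged, which is the property that makes nets with hundreds of such layers trainable.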
In fact, deep learning architectures appeared long before the 1980s. Basic NN architectures were proposed already in the 1940s/50s^{[MC43][K56]} (but don't forget prior work in physics since the 1920s^{[L20][I25][K41][W45]}). A deep convolutional NN architecture was proposed in the 1970s.^{[CNN1]} NNs without hidden layers learned in 1958^{[R58]} (such shallow learning goes back to regression and the method of least squares^{[DL1-2]}). There was also early thinking about deeper adaptive NNs.^{[R61,R62]} In 1965, Ivakhnenko & Lapa published the first general, working learning algorithms for deep multilayer perceptrons with arbitrarily many layers (already containing the now popular multiplicative gates).^{[DEEP1-2][DL1-2]} A paper of 1971^{[DEEP2]} already described a deep net with 8 layers, trained by their highly cited method which was still popular in the new millennium,^{[DL2]} especially in Eastern Europe, where much of Machine Learning was born. Ivakhnenko did not call it an NN, but that's what it was.^{[MIR](Sec. 1)[R8]} LBH failed to cite this. See Sec. XIII & III & V & VIII & IX & X. A misleading "history of deep learning" is propagated by LBH & co-authors, e.g., Sejnowski^{[S20]} (see Sec. XIII). It goes more or less like this: "In 1969, Minsky & Papert^{[M69]} [...] researchers took a fresh look at the problem in the 1980s."^{[S20]} However, as mentioned above, the 1969 book^{[M69]} addressed a "problem" of Gauss & Legendre's shallow learning (~1800)^{[DL1-2]} that had already been solved 4 years prior by Ivakhnenko & Lapa's popular deep learning method.^{[DEEP1-2][DL2]} Minsky was apparently unaware of this and failed to correct it later.^{[HIN](Sec. I)} Deep learning research continued in the 1980s (see a 1989 paper^{[MOZ]}). However, it became really deep in 1991 in my lab,^{[UN-UN3]} which drove the shift from unsupervised pre-training to purely supervised learning twice (1991-95; 2006-10).^{[HIN](Sec. II)[MIR](Sec. 19)} See Sec. 1 of the overview:^{[MIR]} First Very Deep NNs, Based on Unsupervised Pre-Training (1991). This made it possible to solve "Very Deep Learning" tasks of depth > 1000.^{[UN2][DL1][UN]} (By 2003, LSTM variants successfully dealt with language problems of depth up to 30,000^{[LSTM17]} and more.) See also Sec. III. Note that LSTMs brought essentially unlimited depth to supervised recurrent NNs; Highway Nets^{[HW1-3]} brought it to feedforward NNs.^{[MOST]}
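The "Very Deep Learning" problem can be illustrated numerically: in a long chain of units, the backpropagated error is a product of per-step factors, and factors below 1 shrink it exponentially with depth. A toy sketch (the factor 0.9 is an arbitrary illustrative value; for sigmoid units the derivative alone is at most 0.25):

```python
def grad_through_depth(depth, factor=0.9):
    # Backpropagated error through a chain of `depth` units, where each
    # step multiplies the error by a constant local factor |w * f'(z)|.
    # With factors below 1 the gradient vanishes exponentially in depth.
    g = 1.0
    for _ in range(depth):
        g *= factor
    return g
```

At depth 10 the signal is still usable, but at depth 1000 it is numerically negligible, which is why tasks of depth > 1000 required either unsupervised pre-training or architectures like LSTM and Highway Nets that keep the factor near 1.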
The essential foundations of deep learning, however, were laid by others (Sec. III):^{[DLC][DEEP1-2][BP1][DL1-2][R7-R8][R2-R4]} deep learning multilayer perceptrons (1965),^{[DEEP1-2][R8]} modern backpropagation (1970),^{[BP1,2][R7]} architectures of recurrent NNs (1943-56)^{[MC43][K56]} and convolutional NNs (1979),^{[CNN1]} principles of generative adversarial NNs and artificial curiosity (1990),^{[AC90,90b][AC20]} unsupervised pre-training for deep NNs,^{[UN1-2]} the vanishing gradient problem (1991)^{[VAN1]} & solutions to it (Sec. A), GPU-accelerated NNs (2004),^{[GPUNN][GPUCNN5]} and other foundations.^{[DL1-2][R2-R8]} Often LBH failed to cite essential prior work.^{[DLC][HIN][MIR](Sec. 21)} See Sec. II & V & XIII & IX & X & XVII & XII & XVIII & XX & I. Consider also deeplearning.net, which until 2019 advertised deep learning as "moving beyond shallow machine learning since 2006",^{[DL7]} referring to Hinton's^{[UN4]} and Bengio's^{[UN5]} unsupervised pre-training methods of 2006. However, we had this type of deep learning already in 1991;^{[UN][UN1-2]} see Sec. II & XVII (5). Not to mention Ivakhnenko's even earlier supervised layer-wise training of deep NNs,^{[DEEP1-2]} which Hinton,^{[UN4]} Bengio,^{[UN5]} and LBH^{[DL3,DL3a]} did not cite either. See Sec. X.
My comments below systematically track the sequential order of ACM's claims.^{[T19]}
ACM's statement on Turing is greatly misleading, like some of its other statements.^{[T19]} It was Gödel who, in 1931, laid the foundations of theoretical computer science and of any type of computation-based AI.^{[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]} Much of early AI in the 1940s-70s was actually about theorem proving.^{[ZU48][NS56]}
In 1936, Turing introduced the Turing Machine.^{[TUR]} He rederived the above-mentioned result.^{[CHU][TUR][HIN][GOD21,21a][TUR21][LEI21,21a]} In the same year of 1936, Emil Post published yet another independent universal model of computing.^{[POS]} (See also my reply to Hinton, who criticized my website on Turing without suggesting any fact-based corrections.^{[HIN]}) Gödel also essentially formulated the famous open problem "P=NP?" in his letter to John von Neumann (1956).^{[GOD56][URQ10]} Likewise, Konrad Zuse (1910-1995) created the world's first working programmable general-purpose computer 1935-41. His patent application of 1936^{[ZU36-38][Z36][RO98][ZUS21]} already described digital circuits, predating Claude Shannon's 1937 thesis on digital circuit design.^{[SHA37]} Zuse also created the first high-level programming language in the early 1940s.^{[BAU][KNU]} (His hardware lacked an explicit conditional jump instruction, yet can be shown to be universal in principle.^{[RO98]})
The foundations of these famous methods were laid by others: multilayer perceptrons that learn internal representations (1965),^{[DEEP1-2][R8]} modern backpropagation (1970),^{[BP1,2][R7]} architectures of recurrent NNs (1943-56)^{[MC43][K56]} and convolutional NNs (1979),^{[CNN1]} principles of generative adversarial NNs and artificial curiosity (1990),^{[AC][AC90,90b][AC10][AC20]} unsupervised pre-training for deep NNs (1991),^{[UN1-2][UN]} vanishing gradients (1991)^{[VAN1]} & solutions to it (Sec. A),^{[LSTM0-17][CTC]} GPU-accelerated NNs (2004),^{[GPUNN][GPUCNN5]} record-breaking deep supervised NNs (2010)^{[MLP1-2]} and contest-winning deep CNNs (2011),^{[DAN][DAN1][GPUCNN5]} NNs with over 100 layers (2015),^{[HW1-3][R5]} transformer-like^{[TR1-6][FWP]} attention^{[FWP][ATT]} through fast weight programmers (1991),^{[FWP0-2,6]} and more.^{[DL1-2][R2-R8]} Often LBH failed to cite essential prior work.^{[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5,R7,R8,R11]} See Sec. II & I & III & XIII & X & XVII & XII & XVIII & XX.
ACM mentions "advances in natural language processing" and in speech. These advances, however, were largely based on our LSTM (Sec. A-B) and on the fast deep supervised NNs and CNNs achieved by our group 2010-2011^{[MLP1-2][DAN][DAN1][GPUCNN5][R6]} and through Highway Net-like NNs (2015),^{[HW1-3][R5]} although the principles of CNNs were invented and developed by others since the 1970s.^{[CNN1-4]} See Sec. D & XVIII & XIV as well as Sec. 4 & Sec. 19 of the overview.^{[MIR]}
DanNet^{[DAN][DAN1][GPUCNN5]} was the first NN to win a medical imaging contest through deep learning (Sept 2012, on cancer detection).^{[GPUCNN5,8]} We also used it to greatly improve steel defect detection.^{[ST]} All of this happened before the similar GPU-accelerated AlexNet of Hinton's student Krizhevsky won ImageNet 2012.^{[GPUCNN5][R6]} Our work on mitosis detection^{[MGC][GPUCNN5,8]} built on the approach of Sec. D & XI.
LBH, however, used such prior methods without citing them.^{[DL1][DLC][HIN][R2-R4][R7-R8]} See Sec. V & XII & XIX & II & III & XIII & XVII & X & I.
These foundations were laid by others, then built upon by LBH, who failed to cite them, even in later work.^{[HIN][DLC][DL1-2][DEEP1-2][CMB][R7-R8]} See Sec. II & III & XIII & V & X & XIV & I.
The term "deep learning" was first introduced to Machine Learning by Dechter (1986), and to NNs by Aizenberg et al. (2000).^{[DL2]} To my knowledge, LBH have never cited them. (Margin note: our 2005 paper on deep RL^{[DL6,6a]} was apparently the first machine learning publication with the word combination "learn deep" in its title.) Only later did LBH start talking about "deep learning ... moving beyond shallow machine learning since 2006",^{[DL7]} referring to their unsupervised pre-training methods of 2006. See Sec. III. Others had built careers on deep learning long before LBH embraced the notion.^{[DEEP1-2][CNN1][HIN][R8][DL1][DLC]} Even deep learning through unsupervised pre-training was introduced by others.^{[UN1-3][R4][HIN](Sec. II)} See Sec. II & III & XIII & V & I.
Much of this prior work was ignored by LBH's papers^{[HIN][R7-R8][R2-R5]} (see Sec. V & II & III & I & XIII & XII & XIX & X & XVII).
ACM correctly mentions advancements through GPUs. The first to use GPUs for NNs were Jung & Oh (2004).^{[GPUNN][GPUCNN5]} We later made GPU-based NNs fast and deep enough to set an important benchmark record,^{[MLP1-2]} showing that unsupervised pre-training (pioneered by myself in 1991) is not necessary to train deep NNs, contrary to Hinton's claims.^{[VID1]} By 2011, our CNNs were deep and fast enough^{[DAN][DAN1][GPUCNN5]} to achieve superhuman computer vision (explicitly mentioned by ACM) for the first time^{[R6]} (see Sec. D).
Furthermore, by the mid 2010s, speech recognition and machine translation (explicitly mentioned by ACM) were actually dominated by the LSTM and CTC of our team.^{[LSTM1-4][CTC]} In particular, as mentioned in Sec. A, this end-to-end approach was superior to hybrid methods based on models such as HMMs.^{[BW][BOU][BRI][HYB12]} As mentioned in Sec. B and XVI, the first superior end-to-end neural machine translation was also based on LSTM.
ACM's statement is "less wrong" than Honda's^{[HIN](Sec. I)} but still misleading: ACM (and apparently even other award committees^{[HIN](Sec. I)}) credits backpropagation to Rumelhart et al. (1985-86),^{[RUM]} although Werbos applied it to NNs earlier (1982).^{[BP2]} And the article^{[RUM]} even failed to mention Linnainmaa, the inventor of this famous algorithm for credit assignment in networks (1970).^{[BP1]} By 1960, Kelley already had a precursor thereof in the field of control theory;^{[BPA]} see also later work of the early 1960s.^{[BPB][BPC]}^{[R7]} Rumelhart et al. demonstrated experimentally that backpropagation can yield useful internal representations in hidden layers of NNs.^{[RUM]} But this was essentially just an experimental analysis of a known method.^{[BP1-2]} A more complete history of backpropagation can be found at Scholarpedia^{[DL2]} and in my award-winning survey.^{[DL1]} Also see Sec. XIX, II.
Some claim that "backpropagation is just the chain rule of Leibniz (1676) & L'Hopital (1696)." No, it is the efficient way of applying the chain rule to big networks with differentiable nodes (there are also many inefficient ways of doing this). It was not published until 1970.^{[BP1]} See also the recent debate:^{[HIN]} It is true that in 2018, Hinton^{[AOI]} credited Rumelhart^{[RUM]} with the "invention" of backpropagation, yet he himself accepted credit for "creating" the method and for other things he didn't do.^{[HIN]} Neither in a popular book^{[AOI]} nor in other recent work^{[DL3,DL3a]} did he cite Linnainmaa (1970),^{[BP1]} the true creator.^{[BP4-5]} It is true that his 2015 survey^{[DL3]} does cite Werbos (1974), who however described the method correctly only later in 1982,^{[BP2]} and it also failed to cite Linnainmaa^{[BP1]} (compare Amari's work of 1977^{[BP6]}). Linnainmaa's method was well-known.^{[BP5][DL1-2][DLC]} It wasn't created by "lots of different people" as Hinton suggested,^{[AOI][HIN][R11]} but by one person who published first^{[BP1]} and therefore should get the credit.
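The distinction matters in practice: reverse-mode backpropagation computes all partial derivatives of one output in a single backward sweep, instead of re-applying the chain rule once per parameter. A minimal illustrative sketch in Python (the `Var` class is a made-up teaching device, not Linnainmaa's 1970 formulation):

```python
class Var:
    # Minimal reverse-mode automatic differentiation node (illustrative).
    # Each node remembers its parents and the local derivative towards each.
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # list of (parent_node, local_derivative)
        self.grad = 0.0

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def backward(self, seed=1.0):
        # Propagate d(output)/d(this node) backwards via the chain rule,
        # accumulating contributions from every path through the graph.
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)
```

For example, for y = x*x + x at x = 3, one backward pass accumulates dy/dx = 2x + 1 = 7, with no per-parameter re-evaluation of the network.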
ACM mentions Hinton's Boltzmann Machine (BM),^{[BM]} an approach to unsupervised NN learning.^{[HIN]} Recently, however, I learnt through a reader that even the BM paper^{[BM]} did not cite prior relevant work by Sherrington & Kirkpatrick^{[SK75]} and Glauber.^{[G63]} (Compare related work.^{[H86][H88][S93]}) Long before that, Ivakhnenko & Lapa had working learning methods for multilayer perceptrons with arbitrarily many layers.^{[DEEP1-2][HIN]} See Sec. II & V & X.^{[MIR](Sec. 1)[R8]}
As mentioned in Sec. II, Sejnowski's rather self-serving "history of deep learning" [S20] claims: "In 1969, Minsky & Papert^{[M69]} [...] took a fresh look at the problem in the 1980s."^{[S20]} However, the 1969 book^{[M69]} addressed a "deep learning problem" (a limitation of Gauss & Legendre's shallow learning around 1800^{[DL1-2]}) that had already been solved four years prior (see Sec. II). Deep learning research was also alive in the 1970s, especially outside of the Anglosphere.^{[DEEP2][BP6][CNN1][DL1-2]}
Dropout is actually a variant of Hanson's much earlier stochastic delta rule (1990).^{[Drop1-2]} Hinton's 2012 paper and his later patent did not cite this either. Furthermore, dropout was not needed for superhuman vision, as we showed already in 2011 in a contest where LeCun's team participated as well;^{[DAN1]} see Sec. D above. Back then, the really decisive factor was the great speedup of deep CNNs through GPUs.^{[GPUCNN1,3,5][R6]} Already before ImageNet 2012,^{[R6]} our fast deep CNN called DanNet had a monopoly on winning computer vision competitions.^{[GPUCNN5]} It more than "halved the error rate for object recognition" (ACM's wording) in a contest already in 2011,^{[GPUCNN2][DAN,DAN1][R6]} long before the similar system of Hinton's student. See Sec. D as well as Sec. 19 of the overview.^{[MIR]}
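For concreteness, here is what dropout-style multiplicative noise looks like as a tiny sketch in plain Python (an illustrative "inverted dropout" variant; Hanson's stochastic delta rule instead injects noise into the weights rather than the unit activations):

```python
import random

def dropout(activations, p_drop, rng=None, train=True):
    # Inverted dropout (a sketch): during training, zero each unit with
    # probability p_drop and rescale survivors by 1/(1-p_drop), so the
    # expected activation is unchanged; at test time, pass through.
    if not train or p_drop == 0.0:
        return list(activations)
    rng = rng or random.Random(0)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

The rescaling by `1/(1-p_drop)` is a common convenience so that no extra correction is needed at inference time.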
Hybrid NN-HMM speech recognition systems date back to the late 1980s.^{[BW][BRI][BOU]} The modern revolution, however, came through our LSTM (1990s-2005)^{[LSTM0-6]} and CTC^{[CTC]} (2006), which were applied to speech in 2007.^{[LSTM4][LSTM14]} CTC-LSTM is end-to-end-neural and thus very different from (and superior to) the hybrid methods of the late 1980s.^{[BW][BRI][BOU][HYB12]} See also Sec. A.
5 years earlier, in 1995, we already had a similar, excellent neural probabilistic text model.^{[SNT]} Bengio^{[NPM]} characterizes it only briefly as "related" (see also Pollack's earlier work on embeddings of words and other structures^{[PO87][PO90]}). The most visible NLP success of the 2010s was actually driven by the LSTM of our team,^{[LSTM0-6]} which Bloomberg called "arguably the most commercial AI achievement."^{[AV1][MIR](Sec. 4)} See Sec. B. The attention mechanism of Bengio's team^{[ATT14]} has indeed become important. For example, it helped to further improve Facebook's LSTM-based translation (see Sec. B). But we had both types of adaptive neural sequential attention much earlier: end-to-end-differentiable "soft" attention in the latent space of Fast Weight Programmers (FWPs),^{[FWP2][FWP]} and "hard" attention (in observation space) in the context of RL^{[ATT][ATT0-1]} (1990). Today's attention-based Transformers^{[TR1-6]} have become a popular alternative to RNNs. My FWP of 1991^{[FWP0-1]} computes fast weight changes through additive outer products of activation patterns (now often called keys and values for self-attention).^{[TR1-6][FWP]} In the 2010s,^{[DEC]} Transformers^{[TR1-2]} excelled in NLP, a traditional LSTM domain (see Sec. B), although there remain problems that LSTM can rapidly learn to solve but Transformers cannot.^{[LSTM13,17]} The linear Transformers or Performers^{[TR5-6]} are formally equivalent to my 1991 FWPs (apart from normalization).^{[FWP6][FWP]} In 1993, I introduced the attention terminology^{[FWP2]} now used in this context,^{[ATT]} and RNNs that program themselves.
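The 1991 fast weight mechanism can be sketched in a few lines of plain Python (a hypothetical minimal memory for exposition, not the original formulation, and with keys and values fed in directly rather than generated by a slow net): writing adds an outer product of a value and a key to a fast weight matrix, and reading multiplies a query into that matrix, i.e., unnormalized linear self-attention.

```python
def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

class FastWeightMemory:
    # Fast weight matrix updated by additive outer products (a sketch).
    # write(key, value): W += value (outer) key
    # read(query):       W @ query  -- retrieves stored associations
    def __init__(self, dim):
        self.W = [[0.0] * dim for _ in range(dim)]

    def write(self, key, value):
        for i in range(len(value)):
            for j in range(len(key)):
                self.W[i][j] += value[i] * key[j]

    def read(self, query):
        return matvec(self.W, query)
```

With mutually orthogonal keys, each stored value is retrieved exactly; this additive key/value update is the formal core shared with linear Transformers.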
See^{[MIR](Sec. 9)[R4]} for my related priority dispute on attention with Hinton. He was the reviewer of my 1990 paper,^{[ATT2]} yet did not cite it in his own closely related work.^{[ATT3]}
GANs^{[GAN0-1]} (2010-2014) are actually a simple application^{[AC]} of the adversarial curiosity (AC) principle from 1990^{[AC90,90b][AC20]} (see also surveys^{[AC09-10]}). This principle is now widely used for exploration in RL (e.g., Sec. C) and for image synthesis^{[GAN1]} (also mentioned by ACM in Sec. XVIII). In this setting, a predictor NN minimizes its error, while a generator NN tries to make outputs that maximize this error: one net's loss is the other net's gain. 4 years before the GAN paper,^{[GAN1]} a well-known 2010 survey^{[AC10]} summarised the generative adversarial NNs of 1990 as follows: a predictor NN learns to predict whether the controller's (or generator's) output is in a given set.^{[AC20][AC]} (The early adversarial machine learning settings^{[S59][H90]} neither involved unsupervised NNs nor were about modeling data nor used gradient descent.^{[AC20]}) Bengio et al. neither cited the original work^{[AC90,90b][AC20]} nor corrected their erroneous claims^{[GAN1]} about my Predictability Minimization (PM, 1991).^{[PM1-2][AC20][R2][MIR](Sec. 5)} According to Bloomberg,^{[AV1]} they kept defending their NIPS 2014 paper^{[GAN1]} and some of the erroneous claims it made about my prior work.^{[AC20]} Goodfellow eventually admitted that PM is adversarial (his paper^{[GAN1]} still claims the opposite), but emphasized that it's not generative. However, the even earlier AC^{[AC90,90b][AC10][AC20]} is both adversarial and generative (its generator contains probabilistic units^{[AC90]} like in StyleGANs^{[GAN2]}). When the authors^{[GAN1]} did not publish a correction, I published one myself in the hopes of correcting the annals of history.^{[AC20]} Others have also pointed out that GANs are instances of my earlier work.^{[R2][AC20]} Similar for the vanishing gradient problem:^{[MIR](Sec. 3)[VAN1]} after Sepp's 1991 analysis, Bengio published his own,^{[VAN2]} without citing Sepp.
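The zero-sum principle described above can be illustrated with a deliberately tiny numeric sketch (all names and constants here are made up for exposition; real adversarial curiosity uses NNs for both players): a scalar predictor is trained by gradient descent on its squared error, while the generator picks, among candidate outputs, the one where the predictor currently errs most. The predictor's loss is exactly the generator's reward.

```python
import random

def run(steps=200, lr=0.1, seed=0):
    # Toy adversarial loop: predictor w predicts the environment's
    # response f(x) = x to the generator's output x (so the true answer
    # corresponds to w = 1). One player's loss is the other's gain.
    rng = random.Random(seed)
    w = 0.0
    errors = []
    for _ in range(steps):
        candidates = [rng.uniform(-1.0, 1.0) for _ in range(5)]
        # Generator move: pick the output maximizing the predictor's error.
        x = max(candidates, key=lambda c: (w * c - c) ** 2)
        err = w * x - x
        errors.append(err ** 2)
        w -= lr * 2.0 * err * x  # predictor: gradient step on squared error
    return w, errors
```

The adversarial pressure makes the generator probe exactly where the predictor is still wrong, so the predictor's errors shrink over the run.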
The matter was settled in favor of Sepp.^{[VAN1]} However, even after a common publication,^{[VAN3]} Bengio published papers^{[VAN4][XAV]} on the topic without citing him. (Citation counts are poor indicators of truly pioneering work.^{[NAT1]}) (Margin note: Bengio states^{[YB20]} that in 2018 he clarified some of this; if one initially fails to cite relevant prior work, one must at least clarify it later.^{[DLC]}) Bengio also claims^{[YB20]} priority for work of 1995, although my publications on exactly this topic date back to 1991-93.^{[UN0-2][UN]} Similar for meta-learning, which I started in 1987^{[META1][META]} long before Bengio, who nevertheless suggests^{[YB20]} that he did it before me.^{[R3]} Regarding attention-based Transformers,^{[TR1-6]} Bengio^{[DL3a]} cites his own team (2014) for "soft attention" without citing my much earlier original work of 1991-1993 on soft attention and linear Transformers.^{[FWP,FWP0-2,6]} Bengio has also heavily used our LSTM (see Sec. A-C), proposing the name "gated recurrent units (GRU)"^{[LSTMGRU]} for a variant of our vanilla LSTM architecture^{[LSTM2]} (2000) which he did not cite, although our work^{[LSTM2]} was the one that introduced gated recurrent units. In addition, our team automatically evolved lots of additional LSTM variants and topologies already in 2009^{[LSTM7]} without changing the name of the basic method. (GRUs can neither learn to count^{[LSTMGRU2]} nor learn simple non-regular languages;^{[LSTMGRU2]} they also underperform LSTM according to Google Brain.^{[LSTMGRU3]}) Similar for unsupervised pre-training for deep NNs.^{[UN0-4][HIN](Sec. II)[MIR](Sec. 1)} Hinton's paper^{[UN4]} (2006) appeared long after my earlier work on this,^{[UN0-2]} which described the first NNs shown to solve very deep problems (see Sec. II above).^{[UN]} It was published in 1991-92,^{[UN1]} when compute was about 1000 times more expensive than in 2006. Hinton did not cite it, not even in his later survey (2015).^{[DL3][DLC]} See also Sec. II & III. Similar for compressing or distilling one NN into another.^{[UN0-2][DIST1-2][MIR](Sec. 2)} Hinton^{[DIST2]} (2006) did not cite my much earlier original work on this (1991),^{[UN1][UN]} not even in his later patent application. Similar for fast weight programmers^{[FWP][FWP0-4a]} through tensor-like outer products (1991-2016) and their motivation^{[FWP2][FWP4a][MIR](Sec. 8)} (see also Sec. XVI above). Similar for learning sequential attention with NNs.^{[MIR](Sec. 9)} Hinton^{[ATT3]} (2010) did not cite our much earlier work on this,^{[ATT1][ATT]} although he was both reviewer and editor of my summary^{[ATT2]} (1990; see Sec. XVI above).
The ten priority disputes mentioned in the present Sec. XVII are not the only ones.^{[R4]} Remarkably, three of them are related to the 1991 paper^{[UN1][UN]} which in many ways started what people now call deep learning. Most of the disputes go back to work of 1990-91.^{[MIR]} See Sec. I for additional related issues of credit assignment.
LeCun's team has made important contributions to CNNs since 1989.^{[CNN2,4]} However, the basic CNN architecture with convolutional and downsampling layers is actually due to Fukushima (1979).^{[CNN1]} NNs with convolutions were later (1987) combined by Waibel with weight sharing and backpropagation.^{[CNN1a]} Waibel called this TDNN, not CNN. All of this happened before LeCun's work on CNNs. See Sec. D above and Sec. 21 of the overview of our Annus Mirabilis 1990-1991.^{[MIR]} Furthermore, at IJCNN 2011 in Silicon Valley, our DanNet^{[DAN][GPUCNN1-3]} won the visual pattern recognition contest with superhuman performance (LeCun's team took a distant second place, with three times worse performance).^{[DAN1]} Again see Sec. D. And at ICPR 2012, our DanNet^{[GPUCNN1-3]} won the medical imaging contest (Sept 2012, on detection of mitosis/cancer)^{[GPUCNN5,7,8]} (before the similar AlexNet won ImageNet 2012^{[GPUCNN5][R6]} and the similar VGG network^{[GPUCNN9]} won ImageNet 2014). Our approach to mitosis detection^{[MGC][GPUCNN5,7,8]} is now in broad use; many major companies are using it. See Sec. D & VII. ACM also explicitly mentions speech recognition and speech synthesis.^{[AM16][DL1]} All of these fields were heavily shaped in the 2010s by our non-CNN methods.^{[DL1][DL4][AM16][GSR][GSR15][GT16][WU][FB17]} See Sec. A, B, VI, XI.
As mentioned in Sec. XII, backpropagation was actually proposed earlier as a learning method for NNs by Werbos (1982)^{[BP2-4]} (see also Amari's work of 1977^{[BP6]}), whom LeCun did not cite, not even in recent work.^{[DL3,DL3a][DLC]} In 1960, Kelley already had a precursor of the algorithm.^{[BPA]} Furthermore, many besides LeCun have worked "to speed up backpropagation algorithms"^{[DL1]} (ACM's wording). More on the history of backpropagation can be found at Scholarpedia^{[DL2]} and in my overview.^{[BP4]}
However, "hierarchical feature representation" in deep learning networks is what Ivakhnenko & Lapa (1965)^{[DEEP1-2]} (and also Fukushima^{[CNN1][DL2]}) had long before LeCun. See Sec. D & II & XIII & V.
LeCun et al. did not cite the origins^{[BP1]} (1970) of this widely used type of automatic differentiation for differentiable networks of modules,^{[DL2][BP4-5][DLC]} nor early compiler-based implementations for such systems.^{[S80]} See also Sec. XIX & XII. Others had such module-based systems before LeCun, who did not cite them; see also Pollack's even earlier relevant work.^{[PO87-90]}
(Furthermore, "complex networks of modules where backpropagation is performed" were the central theme of my much earlier habilitation thesis (1993).^{[UN2]} For example, our adaptive subgoal generators (1991)^{[HRL0-2]} were trained through end-to-end-differentiable chains of such modules,^{[MIR](Sec. 10)} as was our planning and reinforcement learning with recurrent neural world models (1990).^{[PLAN][MIR](Sec. 11)} Same for my linear transformer-like fast weight programmers^{[FWP0-2][FWP][ATT][MIR](Sec. 8)} since 1991; see Sec. XVI.) Some of the reactions to my critique were reminiscent of "100 Authors against Einstein":^{[AH1]} ad hominem attacks^{[AH2-3][HIN]} in the style of "If you cannot dispute a fact-based message, attack the messenger himself."^{[HIN]} No ad hominem attack and no award can ever change the facts.^{[HIN]} As mentioned above, LBH and their co-workers have contributed useful improvements of deep learning methods,^{[CNN2,4][CDI][LAN][RMSP][XAV][ATT14][CAPS]} but the central methods were created by others whom they did not cite (see Sec. II, V, XII, XIX, XXI, XIII, XIV, XI, and XX, as well as Sec. I, A, B, C, D, XVII, VI, and XVI). As emphasized earlier:^{[DLC][HIN]} science must be committed "to self-correction,"^{[SV20]} as is already the standard in other scientific fields. And what about claims made in popular science venues without peer review? For example, the narrator of a popular 2018 Bloomberg video^{[VID2]} attributed the modern speech recognition revolution to Hinton, although it was achieved in Germany and Switzerland (LSTM & CTC; see Sec. A) long before Hinton's methods. Similarly, in 2016, the NY Times published an article^{[NYT3]} that ignored LSTM, although Google's original 2016 paper on Google Translate^{[WU]} mentions LSTM over 50 times (see Sec. B). In ad hominem style,^{[AH2-3]} LeCun told the NY Times that I am "claiming credit he doesn't deserve for many, many things",^{[NYT1]} without backing this up by any facts. LeCun also praised the GANs of Bengio's team,^{[GAN1]} although GANs are variations of my work in 1990.^{[AC90,90b][AC20][R2]} According to Bloomberg,^{[AV2]} Bengio has simply "denied my claims" without backing up his denial by any facts; see Sec. XVII.
Scientists should denounce misleading claims "and forcefully contradict public figures who promote it."^{[FAKE]} LBH, who called themselves the deep learning conspiracy,^{[DLC]} heavily cite each other, but their most cited work builds on ours. Our LSTM paper^{[LSTM1]} has got more citations than any paper by Bengio or LeCun.^{[R5]} Hinton's most cited paper (2012) is the one on GPU-based CNNs.^{[GPUCNN4][R5]} It follows our earlier work on supervised deep NNs (2010),^{[MLP1]} which abandoned unsupervised pre-training for deep NNs (pioneered by myself^{[UN][UN0-3]} and later championed by Hinton;^{[UN4][VID1]} see Sec. D). Hinton (2012)^{[GPUCNN4]} cites our deep and fast DanNet (2011),^{[GPUCNN1-3]} which won four contests before AlexNet won one;^{[R6]} see Sec. D, XIV. The highly cited VGG network (2014)^{[GPUCNN9]} further extended this line of work. Hinton's 2nd most cited paper^{[RUM][R5]} is the one on backpropagation (some citation counts of Hinton's paper^{[RUM]} even add citations for a book by Rumelhart & McClelland^{[R5]}). Backpropagation is a previously invented method^{[BP1]} whose origins Hinton did not cite; similar for the deep learning of Ivakhnenko whom he has never cited;^{[DEEP1-2][R7-R8]} see Sec. II, XIII. Bengio's 2nd most cited research paper is the one on GANs (2014),^{[GAN1]} which are instances of my artificial curiosity (1990)^{[AC90,90b][AC20][R2]} which he did not cite; see Sec. XVII. Hinton's highly cited papers on unsupervised pre-training for deep NNs (2006-)^{[UN4]} were preceded by ours,^{[UN0-2][UN]} and his dropout papers were preceded by Hanson's work.^{[Drop1-2]} As recently as 2021, ACM published yet another misleading deep learning "survey" by LBH,^{[DL3a]} again heavily citing LBH without crediting the original sources. Consult the Executive Summary and Sec. I-XXI of this critique for more. So virtually all the deep learning algorithms that have attracted broad attention have their conceptual and technical roots in my labs in Munich and Lugano,^{[MOST]} apart from the foundations of deep learning MLPs since 1965^{[DEEP1-2]} (see Sec. II, XX), backpropagation (1960-70)^{[BPA][BP1]} (see Sec. XIX, XII), and convolutional NNs since 1979^{[CNN1-4]} (see Sec. XVIII, D). Our LSTM (1990s, see Sec. A, B; also for RL, 2003-, see Sec. C) → our Highway Net (May 2015) → ResNet (Dec 2015, see Sec. D).
Our adversarial Artificial Curiosity (1990) → GANs (2010s, see Sec. XVII). Our unsupervised pre-training of deep NNs (1991, see Sec. II & III): for recurrent NNs in the 1990s → our LSTM (see Sec. A-C); for feedforward NNs in 2010 → our DanNet (2011) → AlexNet (2012) and VGG Net (2014) (see Sec. D). Our LSTM brought essentially unlimited depth to supervised recurrent NNs in the 1990s; our Highway Nets^{[HW1-3]} brought it to feedforward NNs in May 2015.^{[MOST]} This enabled superior computer vision (2011, see Sec. D, XVIII), medical diagnosis (2012, see Sec. VII, XVIII), and many other applications.^{[DEC]} LSTM with our CTC enabled superior speech recognition (2007-15, see Sec. A), machine translation (2016, see Sec. B), robotics & video game players (2018-19, see Sec. C), and many other applications.^{[DEC]} Our Fast Weight Programmers (1991, see Sec. XVI) are formally equivalent to linear Transformers (now popular in NLP). See Sec. I, A, B, C, D, VII, XVIII.
As mentioned earlier,^{[MIR](Sec. 21)} it is not always easy to assign credit correctly.^{[DLC]} In 1965, Ivakhnenko & Lapa had the first multilayer perceptrons of arbitrary depth that really learned.^{[DEEP1-2][R8]} Five years later came modern backpropagation.
Yes, this critique is also an implicit critique of certain other awards to LBH.^{[HIN]} See also the many debates at reddit.com/r/MachineLearning^{[R1-R12]} (the largest machine learning forum, with back then over 800k subscribers), many of them influenced by my overview.^{[MIR]}
Dr. LeCun himself is well aware of the challenges to scientific integrity in our field:^{[LECP]} "... else cites."^{[LECP]}
Note that I am insisting on proper credit assignment not only in my own research field but also in quite disconnected areas,^{[HIN]} as demonstrated by my numerous letters in this regard published in Science and Nature, e.g., on the history of aviation,^{[NASC1-2]} the telephone,^{[NASC3]} the computer,^{[NASC4-7]} resilient robots,^{[NASC8]} and scientists of the 19th century.^{[NASC9]} One day, such work may even be done by AI scientists and AI historians equipped with artificial curiosity.^{[SA17][AC90-AC20][PP-PP2]}
Thanks to many expert reviewers for useful comments. Since science is about self-correction, let me know under juergen@idsia.ch if you can spot any remaining error. Many additional relevant publications can be found on my publication page and my arXiv page. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

References:

[AC] J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. Includes the first paper on planning with reinforcement learning recurrent neural networks (NNs) and on generative adversarial networks, with a brief summary of the generative adversarial neural networks of 1990^{[AC90,90b][AC20]} (more on artificial scientists and artists). Preprint arXiv/1906.04493.
[AIB] J. Schmidhuber. AI Blog. Includes variants of chapters of the AI Book.
[AM16] Blog of Werner Vogels, CTO of Amazon (Nov 2016).
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. We had both hard attention (1990) and soft attention (1991-93).^{[FWP]} Today, both types are very popular.
[ATT14] Preprint arXiv/1409.0473, 2014-16.
[AV1] Bloomberg, May 15, 2018.
[AV2] Bloomberg, May 17, 2018.
[BPA] Precursor of modern backpropagation.^{[BP1-4]}
[BP2] First application of backpropagation^{[BP1]} to NNs (concretizing thoughts in his 1974 thesis).
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020). Who invented backpropagation? More in Scholarpedia.^{[DL2]}
[CNN1a] A. Waibel. Phoneme Recognition Using Time-Delay Neural Networks. Meeting of IEICE, Tokyo, Japan, 1987. First application of backpropagation^{[BP1][BP2]} and weight-sharing to NNs with convolutions (compare spatial averaging^{[CNN1]}).
Beijing, 2014. Preprint arXiv:1402.3511 [cs.NE].
[DAN] J. Schmidhuber (AI Blog, 2021). 10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named 1st superhuman result in 2011.^{[DAN1]}
[DAN1] J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet). First superhuman visual pattern recognition, achieved by our artificial neural network called DanNet.
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, revised 2021). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s.
[DIST1] J. Schmidhuber, 1991.^{[UN-UN2]}
[DL3a] Y. Bengio, Y. LeCun, G. Hinton (2021). Turing Lecture: Deep Learning for AI. Communications of the ACM, July 2021.
[DL4] J. Schmidhuber (AI Blog, 2017). Our impact on the world's most valuable public companies: Apple, Google, Microsoft, Facebook, Amazon. Includes greatly improved (CTC-based) LSTM for on-device speech recognition (on the phone, not the server).
[DL6a] J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.^{[DL6]} Soon after its publication, everybody started talking about "deep learning." Causality or correlation?
[DL7] Web site deeplearning.net of Y. Bengio's MILA (2015, retrieved May 2020; compare the version in the Internet Archive), referring to Hinton's^{[UN4]} and Bengio's^{[UN5]} unsupervised pre-training for deep NNs (2006), although this type of deep learning dates back to 1991.^{[UN1-2][UN]} See Sec. II & XVII & III.
[DLC] J. Schmidhuber (AI Blog, June 2015). Critique of Paper by "Deep Learning Conspiracy" (Nature 521 p 436).
Preprint arxiv:1312.5602.
[DM3] Alphastar has a "deep LSTM core."
Preprint arXiv:1808.03578, 2018.
[FB17] Facebook used LSTM for over 4 billion automatic translations per day (The Verge, August 4, 2017); Facebook blog by J.M. Pino, A. Sidorov, N.F. Ayan (August 3, 2017).
[FWP] J. Schmidhuber (AI Blog, 26 March 2021). Fast weight programmers: an alternative^{[FWP0-1]} to recurrent NNs. Such Fast Weight Programmers^{[FWP0-6,FWPMETA1-7]} can learn to memorize past data, e.g., by computing fast weight^{[FAST,FASTa]} changes through additive outer products of self-invented activation patterns^{[FWP0-1]} (now often called keys and values for self-attention^{[TR1-6]}). The similar Transformers^{[TR1-2]} combine this with projections; compare the linear Transformers or Performers.^{[TR5-6]} In 1993, I introduced the attention terminology^{[FWP2]} now used in this context,^{[ATT]} and RNNs that program themselves.
Preprints arXiv:1811.12143; arXiv:2003.08165 (like [FWP0-2]).
[FWP6] Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021. Preprint arXiv:2102.11174.
Preprint arXiv:2106.06295 (June 2021).
[FWPMETA1] An introspective network that can learn to run its own weight change algorithm. In Proc. of the Intl. Conf. on Artificial Neural Networks.
[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993.
Preprint arXiv:2012.14905 [cs.LG], 2020. Report arXiv:2011.07831 [cs.AI], 2020.
[GSR15] Google Research Blog, Sep 2015 (see also Aug 2015): Google's speech recognition based on CTC and LSTM.
[GSR] Alphr Technology, Jul 2015; 9to5google, Jul 2015.
[GT16] WIRED, Sep 2016; siliconANGLE, Sep 2016.
[GAN0] Blog post, Internet Archive, 2010: describes the basic ideas^{[AC][AC90,AC90b][AC20]} of GANs.
Description of GANs that does not cite the original work of 1990^{[AC][AC90,AC90b][AC20][R2]} (also containing wrong claims about Predictability Minimization^{[PM0-2][AC20]}). This was number 1 on Hacker News.
Frankfurter Allgemeine Zeitung, 16/6/2021.
Preprint arXiv/2005.14165.
[GPUCNN1] Flexible, High Performance Convolutional Neural Networks for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011.
[GPUCNN5] Our deep CNNs won four important computer vision competitions 2011-2012 before any other competitor.^{[DAN1]} This led to massive interest from industry.
[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More. PDF. J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet won 4 of them in a row before the similar AlexNet/VGG Net and the Resnet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision. PDF. PDF. first deep learner to win a medical imaging contest (2012). HTML. [HIN] J. Schmidhuber (AI Blog, 2020). Critique of Honda Prize for Dr. Hinton. Science must not allow corporate PR to distort the academic record. PDF. North-Holland, 1991. PDF. Extending TR FKI-129-90, TUM, 1990. PDF. PDF. Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS 2015. The LSTM with forget gates^{[LSTM2]} for RNNs.) Resnets^{[HW2]} are a version of this where the gates are always open: g(x)=t(x)=const=1. Highway Nets perform roughly as well as ResNets^{[HW2]} on ImageNet.^{[HW3]} Highway layers are also often used for natural language processing, where the simpler residual layers do not work as well.^{[HW3]} More. Link. arXiv:1512.03385 (Dec 2015). Residual nets are a version of Highway Nets^{[HW1]} More. arxiv:1612.07771 (2016). Also at ICLR 2017. Preprint arXiv:1704.04760 PDF. PDF. arXiv:1607.06450, 2016. A New Publishing Model in Computer Science. Local copy (HTML only). [LEI21] J. Schmidhuber (AI Blog, 2021). 375th birthday of Leibniz, founder of computer science. Frankfurter Allgemeine Zeitung (FAZ), 17/5/2021. FAZ online: 19/5/2021. PDF. [LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF. Based on [LSTM0]. More. PDF. PDF. PDF. PDF. PDF. PDF. PDF. PDF. PDF. PDF. PDF. Preprint: arxiv:1506.07452. PDF. J. 
Schmidhuber (AI Blog, Dec 2020). 10-year anniversary of our journal paper on deep reinforcement learning with policy gradients for LSTM (2007-2010). Recent PDF. Preprint arXiv:1805.04908. Architectures. Preprint arXiv:1703.03906 J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of Searchable PDF scan (created by OCRmypdf which uses LSTM). HTML. better GP methods through Meta-Evolution. More. [MIR] J. Schmidhuber (AI Blog, 2019). Deep Learning: Our Miraculous Year 1990-1991. Preprint arXiv:2005.05744, 2020. Computation 22(12): 3207-3220, 2010. ArXiv Preprint. (AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training. By 2010, when compute was 100 times more expensive than today, both our feedforward NNs^{[MLP1]} J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in my labs at TU Munich and IDSIA. Here I mention: (1) Long Short-Term Memory (LSTM), (2) ResNet (which is our earlier Highway Net with open gates), (3) AlexNet and VGG Net (both citing our similar earlier DanNet: the first deep convolutional NN to win image recognition competitions), Adversarial Artificial Curiosity), and (5) variants of Transformers (linear Transformers are formally equivalent to my earlier Fast Weight Programmers). Annus Mirabilis of 1990-1991.^{[MIR]} Preprint arXiv:1611.01578 (PDF), 2017. [NASC1] J. Schmidhuber. First Pow(d)ered flight / plane truth. Correspondence, Nature, 421 p 689, Feb 2003. [NASC3] J. Schmidhuber. The last inventor of the telephone. Letter, Science, 319, no. 5871, p. 1759, March 2008. Correspondence, Nature, vol 483, p 541, March 2012, doi:10.1038/483541b. Letter, Science, vol 336, p 1639, June 2012. See also comment on response by A. Hodges (DOI:10.1126/science.336.6089.1639-a) [NASC6] J. Schmidhuber. Colossus was the first electronic digital computer. Correspondence, Nature, 441 p 25, May 2006. [NASC7] J. 
Schmidhuber. Turing's impact. Correspondence, Nature, 429 p 501, June 2004. [NASC8] J. Schmidhuber. Prototype resilient, self-modeling robots. Correspondence, Science, 316, no. 5825 p 688, May 2007. [NASC9] J. Schmidhuber. Comparing the legacies of Gauss, Pasteur, Darwin. Correspondence, Nature, vol 452, p 530, April 2008. HTML. Link. NY Times article. Learning Dexterous In-Hand Manipulation. arxiv:1312.5602 (PDF). arxiv:1912.06680. An LSTM with 84% of the model's total parameter count was the core of OpenAI Five (2018). PDF. HTML. J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, the GAN principle. Based on TR FKI-126-90 (1990).^{[AC90]} More. PDF. Partially based on TR FKI-126-90 (1990).^{[AC90]} Report arXiv:1210.0118 [cs.AI], 2015. One Big Net For Everything. Preprint arXiv:1802.08864 [cs.AI], Feb 2018. Preprint: arXiv:1809.01999. Github: World Models. minimization. TR CU-CS-565-91, Univ. Colorado at Boulder, 1991. PDF. More. 1991. PDF. More. PDF. More. arXiv:1112.5309 [cs.AI] First Experiments with PowerPlay. arXiv:1210.8385 [cs.AI]. [R1] Reddit/ML, 2019. Hinton, LeCun, Bengio receive ACM Turing Award. [R2] Reddit/ML, 2019. J. Schmidhuber really had GANs in 1990. [R3] Reddit/ML, 2019. NeurIPS 2019 Bengio Schmidhuber Meta-Learning Fiasco. [R4] Reddit/ML, 2019. Five major deep learning papers by G. Hinton did not cite similar earlier work by J. Schmidhuber. [R5] Reddit/ML, 2019. The 1997 LSTM paper by Hochreiter & Schmidhuber has become the most cited deep learning research paper of the 20th century. [R6] Reddit/ML, 2019. DanNet, the CUDA CNN of Dan Ciresan in J. Schmidhuber's team, won 4 image recognition challenges prior to AlexNet. [R7] Reddit/ML, 2019. J.
Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970. [R8] Reddit/ML, 2019. J. Schmidhuber on Alexey Ivakhnenko, godfather of deep learning 1965. [R9] Reddit/ML, 2019. We [R11] Reddit/ML, 2020. Schmidhuber: Critique of Honda Prize for Dr. Hinton [R12] Reddit/ML, 2020. J. Schmidhuber: Critique of Turing Award for Drs. Bengio & Hinton & LeCun [R15] Reddit/ML, 2021. J. Schmidhuber's work on fast weights from 1991 is similar to linearized variants of Transformers Preprint arXiv/1311.2524, Nov 2013. Preprint arXiv/1703.06870, 2017. PDF. This experimental analysis of backpropagation did not cite the origin of the method,^{[BP1-4]} also known as the reverse mode of automatic differentiation. Link. The Past, Present and Future of Artificial Intelligence. PDF. PDF. ACM's justification of the 2018 A.M. Turing Award (announced in 2019). WWW link. Local copy 1 (HTML only). Local copy 2 (HTML only). [T20a] J. Schmidhuber (AI Blog, 25 June 2020). Critique of 2018 Turing Award for Drs. Bengio & Hinton & LeCun. The first version of the present critique. Link. [TUR21] J. Schmidhuber (AI Blog, Sep 2021). Turing Oversold. It's not Turing's fault, though. J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised pre-training. Unsupervised PDF. 1992. Based on TR FKI-148-91, TUM, 1991.^{[UN0]} PDF. approaches are now widely used. More. [UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF. can be found here (depth > 1000). 2006. PDF. Link. [VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF. More on the Fundamental Deep Learning Problem. PDF. [VAN4] Y. Bengio. Neural net language models. Scholarpedia, 3(1):3881, 2008. Link. Link. Youtube video [see 28:16]. But in 2010, our team showed^{[MLP1-2]} unsupervised pre-training is not necessary Youtube video, 2018. Preprint arXiv:1609.08144 (PDF), 2016. Based on LSTM which it mentions at least 50 times. 
WWW link (retrieved 15 May 2020). Local copy (plain HTML only). a general, practical, program-controlled computer. PDF. J. Schmidhuber (AI Blog, 2021). 80th anniversary celebrations: 1941: Konrad Zuse completes the first working general computer, based on his 1936 patent application.
Traditionally this is done with recurrent NNs (RNNs). In 1991, however, an alternative was published:^{[FWP0-1]} a slow NN learns to program the fast weights of another NN (see Sec. 1) by computing fast weight changes through additive outer products of self-invented activation patterns (now often called keys and values for self-attention; Sec. 2). The very similar Transformers^{[TR1-2]} combine this with projections and softmax; Transformers with linearized self-attention^{[TR5-6]} are formally equivalent to the 1991 Fast Weight Programmers^{[MOST]} (see this tweet). In 1993, I also introduced the attention terminology^{[FWP2]} now used in this context^{[ATT]} (Sec. 4), and RNNs that program themselves (Sec. 3).
Fast Weight Programmers do not suffer from the famous vanishing gradient problem, aka the fundamental deep learning problem (analyzed a few months later in 1991^{[VAN1]}), because they favor additive fast weight changes (Sec. 5), just like the additive neural activations of LSTMs / Highway Nets / ResNets^{[HW1-3]} (Sec. 5) of the Annus Mirabilis of deep learning.^{[MIR]} I also discuss a brand new, improved version^{[FWP6]} of the 1991 fast weight update rule (Sec. 6), reinforcement learning through neuroevolution^{[FWP5]} (2005-, Sec. 7), goal-conditioned policy generators (2022),^{[GGP]} and metalearning machines that learn to learn^{[FWPMETA1-9]} (1992-2022, Sec. 8).
As I have frequently emphasized since 1990,^{[AC90][PLAN][META]} the weights of an artificial neural network (NN) should be viewed as its program. Inspired by universal self-referential formal systems,^{[GOD][GOD34]} I built NNs whose outputs are changes of programs or weight matrices of other NNs^{[FWP0-2]} (Sec. 1, 2, 3), and even NNs that can run and inspect their own weight change algorithms or learning algorithms^{[FWPMETA1-5]} (Sec. 8). A gradient descent procedure^{[BP1-4][BPA][R7]} can compute a direction in program space where one may find a better program,^{[AC90]} in particular, a better program-modifying program.^{[FWP0-2][FWPMETA1-5]} Deep learning started in 1965 with networks of arbitrarily many layers.^{[DEEP1-2]} Their activation functions were Kolmogorov-Gabor polynomials which include the now popular multiplicative gates,^{[DL1-2]} a building block of fast weights. However, von der Malsburg was the first to explicitly emphasize the importance of NNs with rapidly changing weights.^{[FAST]} The second paper on this was published by Feldman in 1982.^{[FASTa]} The weights of a 1987 NN were sums of weights with a large learning rate and weights with a small rate^{[FASTb][T22]} (but such weights have nothing to do with the NN-programming NNs discussed below). The first NN-programming NNs, the Fast Weight Programmers (FWPs), were published in 1991-93^{[FWP0-2]} (Sec. 1, 2, 3, 4). They are closely related to what's now called attention^{[ATT]} (Sec. 4) and Transformers^{[TR1-6]} (Sec. 2, 3, 4, 5).
The first FWP was published on 26 March 1991:^{[FWP0]} a slow NN that learns by backpropagation^{[BP1-4]} to rapidly modify the fast weights of another NN; it was essentially also published in Neural Computation.^{[FWP1]} This is related to what's now called attention^{[ATT]} (Sec. 4). That is, I separated storage and control like in traditional computers, but in a fully neural way (rather than in a hybrid fashion^{[PDA1][PDA2][DNC]}). Compare also the later Synthetic Gradients.^{[NAN1-5]} I offered this as an alternative to recurrent NNs (RNNs). One of the FWPs of 1991^{[FWP0-1]} is illustrated in the figure. A disadvantage addressed in Sec. 2 is that the slow net needs many output units if the fast net is large.
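As a concrete illustration, here is a minimal numpy sketch of this Sec. 1 scheme (toy sizes, random data, and the linear slow net are my own assumptions for illustration, not the exact 1991 architecture): the slow net emits one additive change per fast weight, and the fast net then processes the input under its current "program".

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical): the fast net maps 4 inputs to 3 outputs,
# so it has 3*4 = 12 fast weights; the slow net must emit all 12.
n_in, n_out = 4, 3
n_fast = n_out * n_in

# Slow net: a simple linear layer whose 12 outputs are *additive
# changes* to the fast weight matrix (one slow output unit per fast weight).
W_slow = rng.normal(scale=0.1, size=(n_fast, n_in))

W_fast = np.zeros((n_out, n_in))  # fast weights, reprogrammed at every step
for t in range(5):
    x = rng.normal(size=n_in)               # current input
    delta = (W_slow @ x).reshape(n_out, n_in)
    W_fast = W_fast + delta                 # additive fast weight change
    y = np.tanh(W_fast @ x)                 # fast net output under current program

# The slow net needs n_out*n_in output units -- the disadvantage
# addressed in Sec. 2 (outer products need only n_out + n_in of them).
print(y.shape)
```

The point of the sketch is the separation of storage (the fast weights) and control (the slow net that writes them).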
The Fast Weight Programmer^{[FWP0-1]} depicted in Sec. 1 has a slow net unit for each fast weight. However, Section 2 of the same 1991 paper^{[FWP0]} also introduced the more efficient principle behind what's now called linear^{[TR5-6]} Transformers^{[TR1-2]} or attention^{[ATT]} (compare Sec. 4): the slow net generates two activation patterns whose additive outer product is applied to the fast weights (which then may be normalized by a squashing function^{[FWP0]}). Such second order tensor products^{[FWP0-3a]} are exactly the ones used by Transformers with linearized self-attention.^{[FWP6][TR5-6]} The highly successful Transformers of 2017^{[TR1-2]} can be viewed as a combination of my additive outer product fast weight principle^{[FWP0-2]} and softmax: attention through NN-programmed fast weights (Sec. 5 & 1). Transformers with linearized self-attention (2020-21)^{[TR5-6]} abandoned the softmax, essentially resurrecting the original 1991 system.^{[FWP0-1]} Compare Sec. 6. Outer product-based associative memories go back at least to Hebb's informal rule (1949)^{[HE49]} and Steinbuch's Learning Matrix around 1960.^{[ST61-63][AMH1-2][KOH72][LIT74][PAL80][KOS88]} End-to-end differentiable NNs, however, have learned to control such memories since 1991.^{[FWP0-3a][TR5-6]} I offered the FWPs of 1991^{[FWP0-1]} as an alternative to sequence-processing recurrent NNs (RNNs) (Sec. 1), the computationally most powerful NNs of them all.^{[UN][MIR](Sec. 0)} Modern Transformers are also viewed as RNN alternatives, despite their limitations.^{[TR3-4]} The slow net and the fast net of the 1991 system^{[FWP0-1]} in Sec. 2 were feedforward NNs (FNNs), like most current Transformers.^{[TR1-6]} In 1993,^{[FWP2]} I collapsed all of this into a single RNN that could rapidly reprogram all of its own fast weights through additive outer product-based weight changes. One motivation reflected by the title of the paper^{[FWP2]} was to get many more temporal variables under end-to-end differentiable control than what's possible in standard RNNs of the same size: O(H^{2}) instead of O(H), where H is the number of hidden units. This motivation and a variant of the method was republished over two decades later.^{[FWP4a][R4][MIR](Sec. 8)[T22](Sec. XVII, item H3)} See also our more recent work on FWPs since 2017,^{[FWP3-3a][FWPMETA7][FWP6]} and compare a recent study.^{[RA21]} 4.
Attention terminology of 1993. Today, everybody is talking about attention when it comes to describing the principles of Transformers.^{[TR1-2]} The additive outer products^{[FWP0-1]} of the Fast Weight Programmers described in Sec. 2 and Sec. 3 correspond to the way in which the attention weights or self-attention weights (see also^{[FWP4b-d]}) are generated through NN-programmed fast weights (Sec. 5).^{[FWP0-1], Sec. 9 & Sec. 8 of [MIR], Sec. XVII of [T22]} My 1993 paper^{[FWP2]} introduced the attention terminology now used in this context, speaking of internal spotlights of attention controlled by Fast Weight Programmers.^{[FWP2][ATT]} Apart from possible normalization/squashing,^{[FWP0]} fast weight changes are additive (Sec. 1 & 2). Hence FWPs do not suffer during sequence learning from the famous vanishing gradient problem, first analyzed by my brilliant student Sepp Hochreiter a few months later in his 1991 diploma thesis.^{[VAN1]}
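The correspondence can be checked numerically. The following sketch (toy dimensions and random data are my own assumptions, not any particular published model) computes softmax attention and its linearized variant, and verifies that the latter equals a query applied to a fast weight matrix built from additive outer products of values and keys:

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 4, 6                       # toy key/value size and sequence length
K = rng.normal(size=(T, d))       # keys
V = rng.normal(size=(T, d))       # values
q = rng.normal(size=d)            # one query

# Softmax attention (Transformer-style): weight values by softmax(q . k_t).
scores = K @ q / np.sqrt(d)
a = np.exp(scores - scores.max())
a /= a.sum()
y_softmax = a @ V

# Linearized attention: drop the softmax. Then the output is just the
# query applied to a fast weight matrix built by additive outer products
# of values and keys -- the 1991 FWP update (Sec. 2).
W_fast = np.zeros((d, d))
for t in range(T):
    W_fast += np.outer(V[t], K[t])        # "program" the fast net
y_fwp = W_fast @ q

# Identical to unnormalized linear attention computed the "attention way":
y_linear = (K @ q) @ V
assert np.allclose(y_fwp, y_linear)
```

Dropping the softmax is exactly what makes the two views coincide; with the softmax present, the equivalence no longer holds term by term.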
Both of these approaches date back to 1991, our miraculous year of deep learning.^{[MIR]} Basic Long Short-Term Memory^{[LSTM1]} solves the vanishing gradient problem by changing its cell state at every time step only through addition. That is, the core of LSTM is operating in a linear additive activation space (ignoring LSTM's multiplicative gates).^{[LSTM1][VAN1][MIR](Sec. 4 & Sec. 8)} Additive FWPs^{[FWP0-2]} (Sec. 1 & 2), however, solve the problem through a dual approach, operating in a linear additive weight space. By favoring additive operations yielding non-vanishing first derivatives and error flow,^{[VAN1]} Transformers^{[TR1-6]} also follow the additive approach^{[FWP0-2]} (compare Sec. 2 and Sec. 4 on attention terminology since 1993).
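A tiny numerical illustration of the contrast (scalar toy recurrences of my own choosing, not actual LSTM equations):

```python
# Backpropagation through T time steps multiplies T Jacobians. With a
# multiplicative recurrence h_t = w * h_{t-1}, the gradient w.r.t. h_0
# scales like w**T: it vanishes for |w| < 1 and explodes for |w| > 1.
T, w = 50, 0.8
grad_multiplicative = w ** T          # about 1.4e-5: vanished

# An LSTM-style cell state is additive: c_t = c_{t-1} + g_t (multiplicative
# gates ignored here), so dc_t/dc_{t-1} = 1 and error flow is constant.
c = 0.0
for _ in range(T):
    g = 0.1                           # new information added at this step
    c = c + g                         # additive update
grad_additive = 1.0 ** T              # derivative of c_T w.r.t. c_0

print(grad_multiplicative, grad_additive)
```

The same first-derivative argument applies to additive fast weight changes: the stored information is not repeatedly squashed through a shrinking Jacobian.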
The success of LSTM^{[LSTM1-13]} is mirrored in the LSTM-inspired Highway Network (May 2015),^{[HW1][HW1a][HW3]} the first working really deep feedforward NN with hundreds of layers. It is essentially a feedforward version of LSTM^{[LSTM1]} with forget gates.^{[LSTM2]} The same holds for the Residual Net or ResNet^{[HW2]} (Dec 2015). Remarkably, both of these dual approaches of 1991 have become highly successful. By the mid 2010s,^{[DEC]} major IT companies overwhelmingly used LSTM, e.g., for speech recognition on billions of smartphones.^{[DL4]} There are also tasks that LSTM can rapidly learn to solve^{[LSTM13]} while plain Transformers can't yet.^{[TR4]} Furthermore, unsupervised pre-training of deep NNs^{[UN0-UN2][MIR](Sec. 1)} also dates back to 1991.^{[UN]} Recent work of February 2021^{[FWP6]} brought additional insights into the relation between linearized attention mechanisms^{[TR5-6]} and Fast Weight Programmers^{[FWP0-2]} (Sec. 2).^{[FWP4a][R4][MIR](Sec. 8)[T22](Sec. XVII, item H3)} It also improved such variants.^{[TR5-6]} Building on previous work^{[FWPMETA7]} on FWPs (Sec. 1, 2, 3, 8), we replaced the 1991 elementary programming instruction based on additive outer products^{[FWP0-2]} by a delta rule-like^{[WID]} update, improving results on language modeling tasks.^{[FWP6]} Our code is public. Follow-up work of June 2021^{[FWP7]} (also with Robert Csordas) points out that the original FWP formulation of 1991^{[FWP0-1]} is more general than the one of linear Transformers: a slow NN continually reprograms the weights of a fast NN that can be recurrent. Our code is public.
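For concreteness, here is a minimal sketch of the two layer types (toy dimensions and random, untrained weights are my own assumptions; real Highway Nets and ResNets use trained parameters and biases):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
W_h = rng.normal(scale=0.3, size=(d, d))   # transform weights
W_t = rng.normal(scale=0.3, size=(d, d))   # gate weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x):
    # Highway layer: y = t(x)*h(x) + (1-t(x))*x, with a learned,
    # LSTM-style multiplicative transform gate t(x) (coupled-gate variant).
    h = np.tanh(W_h @ x)
    t = sigmoid(W_t @ x)
    return t * h + (1.0 - t) * x

def residual_layer(x):
    # Residual layer: the special case with all gates fixed open
    # (g(x) = t(x) = 1), i.e., y = h(x) + x.
    return np.tanh(W_h @ x) + x

x = rng.normal(size=d)
y_hw = highway_layer(x)
y_res = residual_layer(x)
```

In both cases the input can pass through additively, which is what keeps error flow alive across hundreds of layers.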
Fast weights are also useful for reinforcement learning (RL) without a teacher, as shown in 2005 with my former postdoc Faustino Gomez^{[FWP5]} (now CEO of NNAISENSE), where the fast weight programmer was trained through neuroevolution. Our 2005 paper on deep RL^{[DL6,6a]} was actually the first machine learning publication with the phrase "learn deep" in the title. Related techniques encode the numerous weights of large NNs through very compact codes.^{[KO0-2][CO1-4]} Here we exploited that the Kolmogorov complexity or algorithmic information content of successful huge NNs may actually be rather small. Our Compressed Network Search^{[CO2]} evolved such compact codes to solve vision-based RL tasks without unsupervised pre-training.
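The compact-code idea can be illustrated in a few lines (the cosine basis and all sizes below are my own toy choices, not the actual encoding of the cited work): a handful of code numbers expands into a much larger weight matrix, so search can operate in the small code space.

```python
import numpy as np

# Toy illustration: describe a large weight matrix by a few coefficients
# in a low-frequency cosine (DCT-like) basis. An evolutionary search can
# then explore the tiny coefficient space instead of the huge weight space.
n_rows, n_cols = 20, 20        # 400 weights...
n_coeffs = 6                   # ...described by only 6 numbers

def decode(coeffs):
    """Map n_coeffs code numbers to an n_rows x n_cols weight matrix."""
    r = np.arange(n_rows)[:, None] / n_rows
    c = np.arange(n_cols)[None, :] / n_cols
    W = np.zeros((n_rows, n_cols))
    for i, amp in enumerate(coeffs):
        # each coefficient scales one smooth basis function
        W += amp * np.cos(np.pi * (i + 1) * (r + c))
    return W

code = np.array([0.5, -0.2, 0.1, 0.05, -0.03, 0.01])
W = decode(code)
print(W.shape)   # 400 weights from a 6-number code
```

If a good policy happens to have low algorithmic information content, a code like this can reach it with a search space orders of magnitude smaller than the weight space.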
Recent work of 2022^{[GGP]} with
My first work on metalearning machines that learn to learn was published in 1987.^{[META][R3]} It addressed metalearning in a very general way. In references^{[FWPMETA1-5]} since 1992, the slow NN and the fast NN (Sec. 1) are recurrent and identical. The RNN can see its own errors or reward signals called eval(t+1) in the image.^{[FWPMETA5]}
The 1993 FWP of Sec. 3^{[FWP2]} also was an RNN. Like the self-referential RNN above,^{[FWPMETA1-5]} it could manipulate its own weights, using outer products between key patterns and value patterns (Sec. 2) to do so. Later metalearning work used gradient descent in LSTM networks^{[LSTM1]} instead of traditional functions of two variables^{[HO1]} (more on LSTM and fast weights in Sec. 5). In 2020, Imanol Schlag et al. augmented an LSTM with an associative fast weight memory,^{[FWPMETA7]} which helps in partially observable environments.^{[FWPMETA7]} Our recent MetaGenRL (2020)^{[METARL10]} meta-learns learning algorithms. See the blog post of my PhD student Louis Kirsch. His VS-ML encodes outer-product-like fast weights in the activations of LSTMs,^{[FWPMETA6]} reminiscent of the many time-varying variables of the 1993 paper^{[FWP2]} (Sec. 3). VS-ML can also learn to implement the backpropagation learning algorithm^{[BP1-4]} purely in the end-to-end differentiable forward dynamics of RNNs.^{[FWPMETA6]}
In 2022, we also published at ICML a modern self-referential weight matrix (SRWM)^{[FWPMETA8]} based on the 1992 SRWM.^{[FWPMETA1-5]}
Such self-modifying NNs are a step towards self-improvement (compare this tweet).
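A toy sketch of the self-referential idea (dimensions, nonlinearities, and update details below are made up for illustration; the actual 1992 and 2022 formulations differ): a single matrix maps the input to an output plus the ingredients of its own rank-one update.

```python
import numpy as np

# Toy self-referential weight matrix: W's outputs include a value pattern,
# a key pattern, and a self-invented learning rate; W then modifies itself
# through an additive outer product built from its own outputs.
rng = np.random.default_rng(3)
d = 4
W = rng.normal(scale=0.1, size=(2 * d + 1, d))   # rows: value, key, beta

for t in range(6):
    x = rng.normal(size=d)
    out = W @ x
    v = np.tanh(out[:d])                 # self-generated value pattern
    k = np.tanh(out[d:2 * d])            # self-generated key pattern
    beta = 1.0 / (1.0 + np.exp(-out[-1]))  # self-invented learning rate
    # W modifies itself through an additive rank-one (outer product) update:
    W = W + beta * np.outer(np.concatenate([v, k, [0.0]]), k)

print(W.shape)
```

The essential point is that the same matrix both computes and rewrites; there is no separate, fixed meta-level.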
There is another version of this article
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. Our
PDF.
The first paper on long-term planning with reinforcement learning recurrent neural networks (NNs) (more) and on generative adversarial networks
(more).
PDF.
First publication of what was later sometimes called the Hopfield network^{[AMH2]} or Amari-Hopfield Network.
The Hopfield network or Amari-Hopfield Network was published in 1972 by Amari.^{[AMH1]}
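The principle behind such associative memories is easy to demonstrate (toy binary patterns of my own choosing):

```python
import numpy as np

# Toy Amari-Hopfield-style associative memory: store binary patterns in a
# symmetric weight matrix via Hebb-style outer products, then recall a
# stored pattern from a corrupted cue by iterated thresholding.
patterns = np.array([[1, -1, 1, -1, 1, -1],
                     [1, 1, 1, -1, -1, -1]])
n = patterns.shape[1]

W = np.zeros((n, n))
for p in patterns:
    W += np.outer(p, p)            # Hebbian storage
np.fill_diagonal(W, 0)             # no self-connections

cue = patterns[0].copy()
cue[0] = -cue[0]                   # corrupt one bit
s = cue
for _ in range(5):                 # synchronous recall dynamics
    s = np.where(W @ s >= 0, 1, -1)

print(s)                           # recovers the first stored pattern
```

The recall dynamics descend an energy function, so the corrupted cue falls into the basin of the nearest stored pattern.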
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. Schmidhuber
Transformers with linearized self-attention (1991-93).^{[FWP]} Today, both types are very popular.
PDF.
PDF.
More.
PS. (PDF.)
Precursor of modern backpropagation.^{[BP1-4]}
PDF.
Link.
PDF.
First application of backpropagation^{[BP1]} to NNs (concretizing thoughts in his 1974 thesis).
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020).
Who invented backpropagation?
More.^{[DL2]}
PDF.
PDF.
PDF.
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, updated 2021, 2022). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The
More.
Deep Learning.
[DL4] J. Schmidhuber (AI Blog, 2017).
Our impact on the world's most valuable public companies: Apple, Google, Microsoft, Facebook, Amazon... By
greatly improved (CTC-based)
on-device speech recognition
(on the phone, not the server)
LSTM.
PDF.
J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.^{[DL6]} Soon after its publication, everybody started talking about "deep learning." Causality or correlation?
neural networks learning to control dynamic external memories.^{[PDA1-2][FWP0-1]}
J. Schmidhuber (AI Blog, 26 March 2021, updated 2022).
alternative^{[FWP0-1]} to recurrent NNs.
the fast weights^{[FAST,FASTa]} of
Such Fast Weight Programmers^{[FWP0-6,FWPMETA1-8]} can learn to memorize past data, e.g.,
by computing fast weight changes through additive outer products of self-invented activation patterns^{[FWP0-1]}
(now often called keys and values for self-attention^{[TR1-6]}).
The very similar Transformers^{[TR1-2]} combine this with projections and softmax; Transformers with linearized self-attention^{[TR5-6]} are formally equivalent to the 1991 Fast Weight Programmers.
In 1993, he introduced
the attention terminology^{[FWP2]} now used
in this context,^{[ATT]} and
RNNs that program themselves.
See tweet of 2022.
PDF.
"Transformer with linearized self-attention."^{[FWP]}
PDF.
HTML.
Pictures (German).
See tweet of 2022 for 30-year anniversary.
PDF.
Preprint: arXiv:1811.12143. PDF.
PDF.
Preprint: arXiv:2003.08165.
PDF.
HTML overview.
Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021. Preprint: arXiv:2102.11174.
Preprint: arXiv:2106.06295 (June 2021).
PDF.
PDF.
An introspective network that can learn to run its own weight change algorithm. In Proc. of the Intl. Conf. on Artificial Neural Networks,
J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF.
can be found here.
Preprint arXiv:2012.14905 [cs.LG], 2020.
Report arXiv:2011.07831 [cs.AI], 2020.
Preprint: arXiv:2202.05780.
Preprint arXiv/2207.01570, 4 July 2022 (submitted in May 2022).
Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS 2015. The
LSTM with forget gates^{[LSTM2]} for RNNs.) Resnets^{[HW2]} are a version of this where the gates are always open: g(x)=t(x)=const=1.
Highway Nets perform roughly as well as ResNets^{[HW2]} on ImageNet.^{[HW3]} Variants of highway gates are used for certain algorithmic tasks, where the simpler residual layers do not work as well.^{[NDR]} More.
Link.
arXiv:1512.03385
(Dec 2015). Residual nets are a version of Highway Nets^{[HW1]}
More.
arxiv:1612.07771 (2016). Also at ICLR 2017.
PDF.
PDF.
PDF.
[LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF.
More.
PDF.
PDF.
J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of
Searchable PDF scan (created by OCRmypdf which uses
LSTM).
HTML.
better GP methods through Meta-Evolution. More.
[MIR] J. Schmidhuber (AI Blog, 2019). Deep Learning: Our Miraculous Year 1990-1991. Preprint
arXiv:2005.05744, 2020.
J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in my labs at TU Munich and IDSIA. Here I mention: (1) Long Short-Term Memory (LSTM), (2) ResNet (which is our earlier Highway Net with open gates), (3) AlexNet and VGG Net (both building on our similar earlier DanNet: the first deep convolutional NN to win
image recognition competitions),
Adversarial Artificial Curiosity), and (5) variants of Transformers (Transformers with linearized self-attention are formally equivalent to my earlier Fast Weight Programmers).
Annus Mirabilis of 1990-1991.^{[MIR]}
PDF.
PDF.
Preprint arXiv:1608.05343, 2016.
The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization. Proc. ICLR 2022. Preprint arXiv/2110.07732.
the 1991 publication on what's now called "Transformers with linearized self-attention."^{[FWP0-6][TR5-6]}
attention terminology in 1993.^{[ATT][FWP2][R4]}
See tweet of 2022 for 30-year anniversary.
J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, the GAN principle
[R3] Reddit/ML, 2019. NeurIPS 2019 Bengio Schmidhuber Meta-Learning Fiasco.
[R4] Reddit/ML, 2019. Five major deep learning papers by G. Hinton did not cite similar earlier work by J. Schmidhuber.
[R7] Reddit/ML, 2019. J. Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970.
[T22] J. Schmidhuber (AI Blog, 2022).
Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21, IDSIA, Lugano, Switzerland, 2022.
PDF.
J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised or self-supervised pre-training. Unsupervised
PDF.
1992. Based on TR FKI-148-91, TUM, 1991.^{[UN0]} PDF.
approaches are now widely used. More.
[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF.
can be found here (depth > 1000).
[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 15 June 1991 (advisor J. Schmidhuber). PDF.
https://people.idsia.ch/~juergen/deep-learning-history.html
arXiv:2212.11279
Modern AI is dominated by artificial neural networks (NNs) and deep learning.^{[DL1-4]} This piece contains hyperlinks to relevant overview sites from my AI Blog. It also debunks certain popular but misleading historic accounts of deep learning, and supplements my previous deep learning survey.^{[DL1]} I am mentioning my own team's work, because (as of 2022) the most cited NNs are based on it.^{[MOST]}
Sec. 1: Introduction
Sec. 2: 1676: The Chain Rule For Backward Credit Assignment
Sec. 3: Circa 1800: First Neural Net (NN) / Linear Regression / Shallow Learning
Sec. 4: 1920-1925: First Recurrent NN (RNN) Architecture. ~1972: First Learning RNNs
Sec. 5: 1958: Multilayer Feedforward NN (without Deep Learning)
Sec. 6: 1965: First Deep Learning
Sec. 7: 1967-68: Deep Learning by Stochastic Gradient Descent
Sec. 8: 1970: Backpropagation. 1982: For NNs. 1960: Precursor.
Sec. 9: 1979: First Deep Convolutional NN (1969: Rectified Linear Units)
Sec. 10: 1980s-90s: Graph NNs / Stochastic Delta Rule (Dropout) / More RNNs / Etc
Sec. 11: Feb 1990: Generative Adversarial Networks / Artificial Curiosity / NN Online Planners
Sec. 12: April 1990: NNs Learn to Generate Subgoals / Work on Command
Sec. 13: March 1991: NNs Learn to Program NNs. Transformers with Linearized Self-Attention
Sec. 14: April 1991: Deep Learning by Self-Supervised Pre-Training. Distilling NNs
Sec. 15: June 1991: Fundamental Deep Learning Problem: Vanishing/Exploding Gradients
Sec. 16: June 1991: Roots of Long Short-Term Memory / Highway Nets / ResNets
Sec. 17: 1980s-: NNs for Learning to Act Without a Teacher
Sec. 18: It's the Hardware, Stupid!
Sec. 19: But Don't Neglect the Theory of AI (Since 1931) and Computer Science
Sec. 20: The Broader Historic Context from Big Bang to Far Future
Sec. 21: Acknowledgments
Sec. 22: 555+ Partially Annotated References (many more in the award-winning survey^{[DL1]})
quite erroneous ideas about the origins of the universe (see the final section).
A history of AI written in the 1980s would have emphasized topics such as theorem proving,^{[GOD][GOD34][ZU48][NS56]} logic programming, expert systems, and heuristic search.^{[FEI63,83][LEN83]} Note that AI is an old area of research seeing renewed interest. Practical AI dates back at least to 1914, when Leonardo Torres y Quevedo (see below) built the first working chess end game player;^{[BRU1-4]} the theory of AI dates back at least to 1931-34, when Kurt Gödel (see below) identified fundamental limits of any type of computation-based AI.^{[GOD][BIB3][GOD21,a,b]} A history of AI written in the early 2000s would have put more emphasis on topics such as support vector machines and kernel methods,^{[SVM1-4]} Bayesian (actually Laplacian or possibly Saundersonian^{[STI83-85]}) reasoning^{[BAY1-8][FI22]} and other concepts of probability theory and statistics,^{[MM1-5][NIL98][RUS95]} decision trees,^{e.g.,[MIT97]} ensemble methods,^{[ENS1-4]} swarm intelligence,^{[SW1]} and evolutionary computation.^{[EVO1-7]([TUR1],unpublished)} Why? Because back then such techniques drove many successful AI applications.
A history of AI written in the 2020s must emphasize concepts such as the even older chain rule^{[LEI07]} and deep nonlinear artificial neural networks (NNs) trained by gradient descent,^{[GD']} in particular, feedback-based recurrent networks, which are general computers whose programs are weight matrices.^{[AC90]} Why? Because many of the most famous and most commercial recent AI applications depend on them.^{[DL4]} Such neural ideas were discussed as early as the MACY conferences (1946-1953)^{[MACY51]} and the 1951 Paris conference on calculating machines and human thought, now often viewed as the first conference on AI.^{[AI51][BRO21][BRU4]} Today, modern AI based on "deep learning" with NNs^{[DL1-2][DEC]} learns to recognize speech and images, minimize pain, maximize pleasure, drive cars, etc.^{[MIR](Sec. 0)[DL1-4]}
The present piece also debunks a frequently repeated, misleading "history of deep learning"^{[S20][DL3,3a]} which ignores most of the pioneering work mentioned below.^{[T22]} See Footnote 6. The title image of the present article is a reaction to an erroneous piece of common knowledge which says^{[T19]} that the use of NNs "as a tool to help computers recognize patterns and simulate human intelligence had been introduced in the 1980s," although such NNs appeared long before the 1980s.^{[T22]} Compare also my published letters on the history of aviation,^{[NASC1-2]} the telephone,^{[NASC3]} the computer,^{[NASC4-7]} resilient robots,^{[NASC8]} and scientists of the 19th century.^{[NASC9]}
In 1676, Gottfried Wilhelm Leibniz published the chain rule of differential calculus, later also covered in a textbook on Leibniz' differential calculus.^{[LEI07-10][L84]} The chain rule answers the question of how small changes of a deep NN's weights affect its output through long chains of intermediate computations.
This answer is exploited by the technique of gradient descent (GD), apparently first proposed by Augustin-Louis Cauchy in 1847^{[GD']} (and much later by Jacques Hadamard^{[GD'']}); the stochastic version called SGD is due to Herbert Robbins and Sutton Monro (1951).^{[STO51-52]}
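To make the idea concrete, here is a minimal sketch of Cauchy-style gradient descent and a Robbins-Monro-style stochastic variant on a toy quadratic loss. All names and step-size choices are illustrative, not taken from the historical papers:

```python
import numpy as np

def gradient_descent(grad, w, lr=0.1, steps=100):
    """Plain gradient descent (in the spirit of Cauchy, 1847):
    repeatedly follow the exact negative gradient."""
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def stochastic_gradient_descent(sample_grad, w, steps=1000, seed=0):
    """SGD (in the spirit of Robbins & Monro, 1951): follow noisy
    gradient estimates with a decaying step size a_t = 1/t."""
    rng = np.random.default_rng(seed)
    for t in range(1, steps + 1):
        w = w - (1.0 / t) * sample_grad(w, rng)
    return w

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_star = gradient_descent(lambda w: 2 * (w - 3.0), w=0.0)
w_sgd = stochastic_gradient_descent(
    lambda w, rng: 2 * (w - 3.0) + rng.normal(0, 0.1), w=0.0)
```

Both variants approach the minimizer w = 3; the stochastic one does so despite only seeing noisy gradients, which is the setting that matters for learning from data samples.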
Footnote 1. In 1684, Leibniz was also the first to publish "modern" calculus;^{[L84][SON18][MAD05][LEI21,a,b]} later Isaac Newton was also credited for his unpublished work.^{[SON18]} Their priority dispute,^{[SON18]} however, did not encompass the chain rule.^{[LEI07-10]} Of course, both were building on earlier work: in the 3rd century B.C., Archimedes (perhaps the greatest scientist ever^{[ARC06]}) paved the way for infinitesimals, and work on them was continued in the 14th century by Madhava of Sangamagrama and colleagues of the Indian Kerala school.^{[MAD86-05]} Leibniz (sometimes called "the world's first computer scientist"^{[LA14]}) also laid foundations of modern computer science. He designed the first machine that could perform all four arithmetic operations (1673), and the first with an internal memory.^{[BL16]} He described the principles of binary computers (1679).^{[L79][L03][LA14][HO66][LEI21,a,b]} His formal Algebra of Thought (1686)^{[L86][WI48]} was deductively equivalent^{[LE18]} to the much later Boolean Algebra (1847).^{[BOO]} Leibniz hoped to answer all possible questions through computation.^{[WI48]}
Footnote 3. Some claim that the backpropagation algorithm (discussed further down; now widely used to train deep NNs) is just the chain rule of Leibniz (1676) & L'Hopital (1696).^{[CONN21]} No, it is the efficient way of applying the chain rule to big networks with differentiable nodes (there are also many inefficient ways of doing this).^{[T22]} It was not published until 1970, as discussed below.^{[BP1,4,5]}
In 1805, Adrien-Marie Legendre published what's now often called a linear neural network (NN). Later Johann Carl Friedrich Gauss was also credited for earlier unpublished work on this done circa 1795.^{[STI81]}
Rosenblatt's perceptron (1958)^{[R58]} combined a linear NN as above with an output threshold function to obtain a pattern classifier (compare his more advanced work on multi-layer networks discussed below). See also Joseph's related work.^{[R61]} Widrow & Hoff's similar Adaline learned in 1962.^{[WID62]}
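A hedged sketch of such a classifier: a linear unit followed by an output threshold, trained with the classic perceptron error-correction rule. This is illustrative NumPy code, not Rosenblatt's original implementation:

```python
import numpy as np

def perceptron_train(X, y, epochs=20, lr=1.0):
    """Perceptron rule: on a mistake, nudge the weights toward the target."""
    w = np.zeros(X.shape[1] + 1)               # weights plus bias
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append constant input
    for _ in range(epochs):
        for x, target in zip(Xb, y):           # targets in {-1, +1}
            pred = 1 if x @ w > 0 else -1      # linear unit + threshold
            if pred != target:
                w += lr * target * x           # update only on errors
    return w

def perceptron_predict(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.where(Xb @ w > 0, 1, -1)

# Linearly separable toy data: class +1 iff x0 + x1 > 1.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.], [2., 2.]])
y = np.array([-1, -1, -1, 1, 1])
w = perceptron_train(X, y)
```

On linearly separable data like this, the perceptron convergence theorem guarantees the loop stops making mistakes after finitely many updates.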
The first non-learning recurrent NN (RNN) architecture, the Lenz-Ising model, was introduced and analyzed by physicists Ernst Ising and Wilhelm Lenz in the 1920s.^{[L20][I24,I25][K41][W45][T22]} It settles into an equilibrium state in response to input conditions, and is the foundation of the first learning RNNs (see below). Recurrent architectures with binary threshold neurons were also discussed in 1943 by neuroscientists Warren McCulloch and Walter Pitts^{[MC43]} and formally analyzed in 1956 by Stephen Cole Kleene.^{[K56]}
In 1972, Shun-Ichi Amari made the Lenz-Ising recurrent architecture adaptive such that it could learn to associate input patterns with output patterns by changing its connection weights.^{[AMH1]} See also Stephen Grossberg's work on biological networks,^{[GRO69]} David Marr's^{[MAR71]} and Teuvo Kohonen's^{[KOH72]} work, and Kaoru Nakano's learning RNN.^{[NAK72]}
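A minimal sketch of such a pattern-associating recurrent net in the later Amari-Hopfield style: binary units, Hebbian-like outer-product weights, and relaxation into an equilibrium state. The details are illustrative, not Amari's exact 1972 formulation:

```python
import numpy as np

def store(patterns):
    """Hebbian outer-product rule: strengthen weights between co-active units."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:            # each p has entries in {-1, +1}
        W += np.outer(p, p)
    np.fill_diagonal(W, 0)        # no self-connections
    return W / len(patterns)

def recall(W, x, steps=10):
    """Let the network settle into an equilibrium (attractor) state."""
    for _ in range(steps):
        x = np.sign(W @ x)
        x[x == 0] = 1             # break ties consistently
    return x

pattern = np.array([1, -1, 1, -1, 1, -1, 1, -1])
W = store(pattern[None, :])
noisy = pattern.copy()
noisy[0] = -noisy[0]              # corrupt one bit
restored = recall(W, noisy)
```

Starting from the corrupted input, the dynamics fall back into the stored pattern, which is exactly the associative-memory behavior described above.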
10 years later, the Amari network was republished (and its storage capacity analyzed).^{[AMH2]} Some called it the Hopfield Network (!) or Amari-Hopfield Network.^{[AMH3]} Amari also offered a sequence-processing generalization thereof.^{[AMH1]} Already in 1948, Alan Turing had written up ideas related to learning RNNs. This, however, was first published many decades later,^{[TUR1]} which explains the obscurity of his thoughts here.^{[TUR21]} (Margin note: it has been pointed out that the famous "Turing Test" should actually be called the "Descartes Test."^{[TUR3,a,b][TUR21]})
Today, the most popular RNN is the Long Short-Term Memory (LSTM) mentioned below, which has become the most cited NN of the 20th century.^{[MOST]}
In 1958, Frank Rosenblatt not only combined linear NNs and threshold functions (see the section on shallow learning since 1800), he also had more interesting, deeper multilayer perceptrons (MLPs).^{[R58]} Since only their last layer learned,^{[DL1]} Rosenblatt basically had what much later was rebranded as Extreme Learning Machines (ELMs), without proper attribution.^{[ELM1-2][CONN21][T22]}
MLPs were also discussed in 1961 by Karl Steinbuch^{[ST61-95]} and Roger David Joseph.^{[R61]} See also Oliver Selfridge's multilayer Pandemonium (1959).^{[SE59]} Rosenblatt (1962) even wrote about "back-propagating errors" in an MLP with a hidden layer,^{[R62]} although he did not yet have a general deep learning algorithm for deep MLPs. What's now called backpropagation is quite different and was first published in 1970, as discussed below.^{[BP1-BP5][BPA-C]}
Today, the most popular FNN is a version of the LSTM-based Highway Net (mentioned below) called ResNet,^{[HW1-3]} which has become the most cited NN of the 21st century.^{[MOST]}
In 1965, Alexey Ivakhnenko and Valentin Lapa published the first working learning algorithm for deep MLPs with arbitrarily many layers (of nodes that may contain multiplicative gates).^{[DEEP1-2][DL1-2][FDL]} A paper of 1971^{[DEEP2]} already described a deep network with 8 layers trained by their highly cited method, which was still popular in the new millennium,^{[DL2]} especially in Eastern Europe, where much of Machine Learning was born.^{[MIR](Sec. 1)[R8]} The term "deep learning" was first introduced to Machine Learning much later by Dechter (1986), and to NNs by Aizenberg et al. (2000).^{[DL2]} (Margin note: our 2005 paper on deep learning^{[DL6,6a]} was the first machine learning publication with the word combination "learn deep" in the title.^{[T22]})
Ivakhnenko and Lapa (1965, see above) trained their deep nets layer by layer. In 1967, however, Shun-Ichi Amari suggested to train MLPs with many layers in an end-to-end fashion from scratch by stochastic gradient descent (SGD),^{[GD1]} a method proposed in 1951 by Robbins & Monro.^{[STO51-52]}
Amari's implementation^{[GD2,GD2a]} (with his student Saito) learned internal representations in a five layer MLP with two modifiable layers, which was trained to classify non-linearly separable pattern classes.
See also Iakov Zalmanovich Tsypkin's even earlier work on gradient descent-based on-line learning for non-linear systems.^{[GDa-b]}
Remarkably, as mentioned above, Amari also published learning RNNs in 1972.^{[AMH1]}
In 1970, Seppo Linnainmaa was the first to publish what's now known as backpropagation, the famous algorithm for credit assignment in networks of differentiable nodes, also known as the reverse mode of automatic differentiation.^{[BP1,4,5]}
In 1982, Paul Werbos proposed to use the method to train NNs,^{[BP2]} extending ideas in his 1974 thesis.
In 1960, Henry J. Kelley already had a precursor of backpropagation in the field of control theory;^{[BPA]} see also later work of the early 1960s by Stuart Dreyfus and Arthur E. Bryson.^{[BPB][BPC]}^{[R7]} Unlike Linnainmaa's general method,^{[BP1]} the systems of the 1960s^{[BPA-C]} backpropagated derivative information through standard Jacobian matrix calculations from one "layer" to the previous one, neither addressing direct links across several layers nor potential additional efficiency gains due to network sparsity.
Backpropagation is essentially an efficient way of implementing Leibniz's chain rule^{[LEI07-10]} (1676) (see above) for deep networks. Cauchy's gradient descent^{[GD']} uses this to iteratively adjust the weights such that the NN behaves more and more like some teacher, which could be a human, or another NN,^{[UN-UN2]} or something else. By the mid-1980s, desktop computers had just become accessible in wealthier academic labs. An experimental analysis of the known method^{[BP1-2]} then demonstrated that backpropagation can yield useful internal representations in hidden layers of NNs.^{[RUM]} At least for supervised learning, backpropagation is generally more efficient than Amari's above-mentioned deep learning through the more general SGD method (1967), which learned useful internal representations in NNs about 2 decades earlier.^{[GD1-2a]}
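An illustrative sketch of this reverse-mode credit assignment in a tiny two-layer net: the forward pass caches intermediate values, and the backward pass applies the chain rule once per node, from output back to input. This is a didactic toy in plain NumPy, not any historical implementation; the gradient is verified against a finite difference:

```python
import numpy as np

def forward(x, W1, W2):
    h_pre = W1 @ x              # hidden pre-activation
    h = np.tanh(h_pre)          # hidden activation
    y = W2 @ h                  # linear output
    return y, (x, h_pre, h)

def backward(dy, cache, W1, W2):
    """Propagate the output error backward, reusing each local derivative."""
    x, h_pre, h = cache
    dW2 = np.outer(dy, h)                   # dL/dW2
    dh = W2.T @ dy                          # chain rule through W2
    dh_pre = dh * (1 - np.tanh(h_pre)**2)   # chain rule through tanh
    dW1 = np.outer(dh_pre, x)               # dL/dW1
    return dW1, dW2

# Numerical check of one weight's gradient for the loss L = 0.5 * y^2.
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(1, 4))
y, cache = forward(x, W1, W2)
dW1, dW2 = backward(y, cache, W1, W2)       # dL/dy = y for this loss
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
yp, _ = forward(x, W1p, W2)
num = (0.5 * yp @ yp - 0.5 * y @ y) / eps   # finite-difference estimate
```

The analytic entry `dW1[0, 0]` agrees with the numerical estimate `num`, which is the standard sanity check that the chain-rule bookkeeping is correct.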
It took 4 decades until the backpropagation method of 1970^{[BP1-2]} got widely accepted as a training method for deep NNs. Before 2010, many thought that the training of NNs with many layers requires unsupervised pre-training, a methodology introduced by myself in 1991^{[UN][UN0-3]} (see below), and later championed by others (2006).^{[UN4]} In fact, it was claimed^{[VID1]} that deep NNs cannot be trained well without such pre-training. However, in 2010, our team with my postdoc Dan Ciresan^{[MLP1-2]} showed that deep FNNs can be trained by plain backpropagation and do not at all require unsupervised pre-training for important applications.^{[MLP2]}
Our system set a new performance record^{[MLP1]} on the famous MNIST image recognition benchmark, by greatly accelerating traditional MLPs on graphics processing units (building on the GPU-based NNs of Jung & Oh in 2004^{[GPUNN]}). A reviewer called this a "wake-up call to the machine learning community." A frequently repeated account claims that Minsky & Papert's 1969 book^{[M69]} exposed the limitations of NNs until "researchers took a fresh look at the problem in the 1980s."^{[S20]} However, the 1969 book^{[M69]} addressed a "problem" of Gauss & Legendre's shallow learning (circa 1800)^{[DL1-2]} that had already been solved 4 years prior by Ivakhnenko & Lapa's popular deep learning method,^{[DEEP1-2][DL2]} and then also by Amari's SGD for MLPs.^{[GD1-2]} Minsky neither cited this work nor corrected his book later.^{[HIN](Sec. I)[T22]} Others later published their own variants of such methods (such as the Boltzmann machine^{[BM][HIN][SK75][G63][T22]}) without relating them to the original work,^{[DLC][S20][T22]} although the true history is well-known. Deep learning research was alive and kicking in the 1960s-70s, especially outside of the Anglosphere.^{[DEEP1-2][GD1-3][CNN1][DL1-2][T22]} Blatant misattribution and unintentional^{[PLAG1][CONN21]} or intentional^{[FAKE2]} plagiarism are still tainting the entire field of deep learning.^{[T22]} Scientific journals "need to make clearer and firmer commitments to self-correction,"^{[SV20]} as is already the standard in other scientific fields.
Computer Vision was revolutionized in the 2010s by a particular feedforward NN called the convolutional NN (CNN).^{[CNN1-4]} The basic CNN architecture with alternating convolutional and downsampling layers is due to Kunihiko Fukushima (1979), who called it the Neocognitron.^{[CNN1]} Fukushima also introduced rectified linear units (ReLUs) for NNs (1969).^{[RELU1]} They are now widely used in CNNs and other NNs.
In 1987, NNs with convolutions were combined by Alex Waibel with weight sharing and backpropagation (see above),^{[BP1-2]} and applied to speech.^{[CNN1a]} Waibel did not call this CNNs but TDNNs. The popular downsampling variant called max-pooling was introduced by Yamaguchi et al. for TDNNs in 1990^{[CNN3a]} and by Juyang Weng et al. for higher-dimensional CNNs in 1993.^{[CNN3]} Yann LeCun's team has contributed improvements of CNNs, especially for images.^{[CNN2,4][T22]} Baldi and Chauvin (1993) had the first application of CNNs with backpropagation to biomedical/biometric images.^{[BA93]}
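The three core CNN operations mentioned here, convolution with shared weights, ReLU, and max-pooling downsampling, can be sketched in a few lines (illustrative NumPy, stride-1 "valid" convolution; shapes and the toy kernel are assumptions for demonstration):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one shared weight kernel over the image (weight sharing)."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0)    # rectified linear unit

def max_pool(x, size=2):
    """Downsample by keeping the maximum of each size x size block."""
    H, W = x.shape
    H2, W2 = H // size, W // size
    return x[:H2*size, :W2*size].reshape(H2, size, W2, size).max(axis=(1, 3))

# A horizontal-gradient detector on a simple ramp image.
img = np.arange(16.0).reshape(4, 4)           # each row increases by 1
feat = max_pool(relu(conv2d(img, np.array([[-1.0, 1.0]]))))
```

On the ramp image, every horizontal difference is 1, so the feature map is constant; max-pooling then halves each spatial dimension, illustrating why pooled CNN features are robust to small translations.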
GPUs were later used to greatly accelerate CNNs (Dan Ciresan et al., 2011).^{[GPUCNN1,3,5]} Our fast GPU-based^{[GPUNN][GPUCNN5]} CNN of 2011,^{[GPUCNN1]} known as DanNet,^{[DAN,DAN1][R6]} built on earlier GPU-accelerated CNNs of 2006.^{[GPUCNN]} In 2011, DanNet became the first pure deep CNN to win computer vision contests.^{[GPUCNN2-3,5]}
ResNet, the ImageNet 2015 winner^{[HW2]} (Dec 2015) and currently the most cited NN,^{[MOST]} is a version (with open gates) of our earlier Highway Net (May 2015).^{[HW1-3][R5]} The Highway Net is actually the feedforward net version of our vanilla LSTM (see below).^{[LSTM2]} It was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers). NNs with rapidly changing "fast weights" were introduced by v.d. Malsburg (1981) and others.^{[FAST,a,b]} Deep learning architectures that can manipulate structured data such as graphs^{[T22]} include our graph NN-like, Transformer-like Fast Weight Programmers of 1991,^{[FWP0-1][FWP6][FWP]} which learn to continually rewrite mappings from inputs to outputs (addressed below), and the work of Baldi and colleagues.^{[BA96-03]} Today, graph NNs are used in numerous applications.
In the 1980s, Werbos,^{[BP2][BPTT1]} Williams,^{[BPTT2][CUB0-2]} and others^{[ROB87][BPTT3][DL1]} analyzed ways of implementing gradient descent^{[GD'][STO51-52][GDa-b][GD1-2a]} in RNNs. Kohonen's self-organising maps also became popular.^{[KOH82-89]} Researchers studied biologically more plausible alternatives for credit assignment in space and time.^{[BB2][NAN1-4][NHE][HEL]} See overviews^{[MIR](Sec. 15, Sec. 17)} and recent renewed interest in such methods.^{[NAN5][FWPMETA6][HIN22]} A variant of Hanson's 1990 stochastic delta rule became popular under the moniker "dropout."^{[Drop1-4][GPUCNN4]} Generative Adversarial Networks (GANs) have become very popular.^{[MOST]} They were first published in 1990 in Munich under the moniker Artificial Curiosity.^{[AC90-20][GAN1]} Two dueling NNs (a probabilistic generator and a predictor) try to maximize each other's loss in a minimax game^{[AC](Sec. 1)} (using stochastic units^{[AC90]} like in the much later StyleGANs^{[GAN2]}). The predictor NN minimizes its error, while the generator NN tries to make outputs that maximize this error: one net's loss is the other net's gain.^{[AC90]} (The predictor, a world model, can also be used for continual online action planning.^{[AC90][PLAN2-3][PLAN]})
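The "one net's loss is the other net's gain" principle can be written compactly as a minimax objective. The notation below is mine and purely illustrative: the generator \(G\) produces outputs, the environment \(\mathrm{Env}\) responds, and the predictor \(M\) tries to anticipate the response, descending the very error that \(G\) ascends:

```latex
\min_{M}\;\max_{G}\;\;
\mathbb{E}\Big[\, \big\lVert M\!\big(x_G\big) - \mathrm{Env}\!\big(x_G\big) \big\rVert^{2} \,\Big],
\qquad x_G \sim G .
```

In the 1990 curiosity setting this error is the generator's intrinsic reward for producing surprising experiments; in the later GAN setting the "environment" is replaced by the question of whether a sample is real or generated.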
4 years before a 2014 paper on GANs,^{[GAN1]} my well-known 2010 survey^{[AC10]} already summarised the generative adversarial NNs of 1990.^{[AC20][AC][T22](Sec. XVII)} In contrast, early adversarial machine learning settings^{[S59][H90]} neither involved unsupervised NNs nor were about modeling data nor used gradient descent.^{[AC20]} The adversarial principle has been widely used for exploration in Reinforcement Learning^{[SIN5][OUD13][PAT17][BUR18]} and for synthesis of realistic images,^{[GAN1,2]} although the latter domain was recently taken over by Rombach et al.'s Latent Diffusion, another method published in Munich,^{[DIF1]} building on Jarzynski's earlier work in physics from the previous millennium^{[DIF2]} and more recent papers.^{[DIF3-5]} In 1991, I also published adversarial Predictability Minimization for creating disentangled representations of partially redundant data, applied to images in 1996.^{[PM0-2][AC20][R2][MIR](Sec. 7)} Learning to decompose problems at multiple levels of abstraction is now considered a remaining grand challenge.^{[LEC]} The early 1990s, however, saw first exceptions: NNs that learn to decompose complex spatio-temporal observation sequences into compact but meaningful chunks^{[UN0-3]} (see further below), and NN-based planners of hierarchical action sequences for compositional learning,^{[HRL0]} as discussed next. This work injected concepts of traditional "symbolic" hierarchical AI^{[NS59][FU77]} into end-to-end differentiable "sub-symbolic" NNs. In 1990, I published end-to-end differentiable NN-based subgoal generators for Hierarchical Reinforcement Learning (HRL).^{[HRL0]} Soon afterwards, this was also done with recurrent NNs that learn to generate sequences of subgoals.^{[HRL1-2][PHD][MIR](Sec. 10)}
Compare other NNs that have "worked on command" since April 1990, in particular, for learning selective attention,^{[ATT0-3]} artificial curiosity and self-invented problems,^{[PP][PPa,1,2][AC]} upside-down reinforcement learning^{[UDRL1-2]} and its generalizations.^{[GGP]} Recently, Transformers^{[TR1]} have been all the rage, e.g., generating human-sounding texts.^{[GPT3]} Transformers with "linearized self-attention"^{[TR5-6]} were first published in March 1991.^{[FWP0-1][FWP6][FWP]} These so-called "Fast Weight Programmers" or "Fast Weight Controllers"^{[FWP0-1]} separated storage and control like in traditional computers, but in an end-to-end-differentiable, adaptive, fully neural way (rather than in a hybrid fashion^{[PDA1-2][DNC]}). The "self-attention" in standard Transformers^{[TR1-4]} combines this with a projection and softmax (using attention terminology like the one I introduced in 1993^{[ATT][FWP2][R4]}).
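A hedged sketch of linearized self-attention in this fast-weight view: at each step, an outer product of a value and a mapped key additively "programs" a fast weight matrix, which is then applied to the mapped query. Illustrative NumPy; the feature map `phi` is a placeholder choice, not prescribed by the original papers:

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0)):
    """Causal linearized self-attention as a fast weight programmer.
    Each step: write outer(value, phi(key)) into the fast weights W,
    then read by applying W to phi(query)."""
    d_k, d_v = K.shape[1], V.shape[1]
    W = np.zeros((d_v, d_k))            # the fast weight matrix
    out = []
    for q, k, v in zip(Q, K, V):
        W += np.outer(v, phi(k))        # program the fast weights
        out.append(W @ phi(q))          # apply them to the query
    return np.array(out)

rng = np.random.default_rng(0)
T, d = 5, 4
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
Y = linear_attention(Q, K, V)
```

Because the fast weights accumulate key-value outer products, each output equals the unnormalized attention sum over all previous steps, but with O(1) state per step instead of storing the whole history, which is the efficiency argument behind linearized attention.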
Today's Transformers heavily use unsupervised pre-training^{[UN0-3]} (see next section), another deep learning methodology first published in our Annus Mirabilis of 1990-1991.^{[MIR][MOST]}
The 1991 fast weight programmers also led to the self-referential meta-learning NNs of 1992,^{[FWPMETA1-9][HO1]} which extended my 1987 diploma thesis.^{[META1]} That thesis introduced algorithms not just for learning but also for meta-learning or learning to learn,^{[META]} to learn better learning algorithms through experience. This became very popular in the 2010s^{[DEC]} when computers were a million times faster. Deep learning is about credit assignment across many layers of neurons or many subsequent computational stages.^{[MIR]} In a sense, RNNs are the deepest ones^{[DL1-2]} (but see a 1989 paper^{[MOZ]}), since they can process sequences of arbitrary depth.^{[DL1]} Before the 1990s, however, RNNs failed to learn deep problems in practice.^{[MIR](Sec. 0)} To overcome this, I proposed in 1991 a hierarchy of RNNs learning to represent percepts at multiple levels of abstraction and multiple time scales:^{[LEC]} the Neural Sequence Chunker^{[UN0]} or Neural History Compressor.^{[UN1]} It solved previously unsolvable "very deep learning" tasks of depth > 1000^{[UN2]} (requiring more than 1,000 subsequent computational stages); there was also a continuous version of the Neural History Compressor.^{[UN3]} (See also recent work on unsupervised NN-based abstraction.^{[OBJ1-5]}) More than a decade after this work,^{[UN1]} a similar method for feedforward NNs was published, called Deep Belief Networks (DBNs).^{[UN4]} Its justification was essentially the one of the 1991 system: each higher level tries to reduce the description length (or negative log probability) of the data representation in the level below.^{[HIN][T22][MIR]} The knowledge of a higher NN in the 1991 hierarchy could be compressed into a lower one using my NN distillation procedure of 1991.^{[UN0-1][MIR]} NN distillation was also republished many years later,^{[DIST2][MIR][HIN][T22]} and is widely used today. Unsupervised/self-supervised pre-training of this kind is now heavily used by Transformers;^{[TR1-6]} Transformers with linearized self-attention were also first published^{[FWP0-6]} in the Annus Mirabilis of 1990-1991,^{[MIR][MOST]} together with unsupervised/self-supervised pre-training for deep learning.^{[UN0-3]} See the previous section. Deep learning is hard because of the Fundamental Deep Learning Problem of vanishing or exploding gradients, identified and analyzed in 1991 by my student Sepp Hochreiter in his diploma thesis, which I had the pleasure to supervise.^{[VAN1]} First he implemented the Neural History Compressor above, but then did much more: he showed that in deep or recurrent NNs, back-propagated error signals either shrink rapidly or grow out of bounds. In both cases, learning fails (compare^{[VAN2]}). This analysis led to basic principles of what's now called LSTM (see below).
The Long Short-Term Memory (LSTM) recurrent neural network^{[LSTM1-6]} overcomes the Fundamental Deep Learning Problem identified by Sepp in his above-mentioned 1991 diploma thesis,^{[VAN1]} which I consider one of the most important documents in the history of machine learning. It also provided essential insights for overcoming the problem, through basic principles (such as constant error flow) of what we called LSTM in a tech report of 1995.^{[LSTM0]} After the main peer-reviewed publication in 1997^{[LSTM1][25y97]} (now the most cited NN article of the 20th century^{[MOST]}), LSTM was further refined through the work of my students; one milestone was the first application of LSTM to speech (2004).^{[LSTM10]} 2005 saw the first publication of LSTM with full backpropagation through time and of bi-directional LSTM^{[LSTM3]} (now widely used). Another milestone of 2006 was the training method "Connectionist Temporal Classification" or CTC^{[CTC]} for simultaneous alignment and recognition of sequences. Our team successfully applied CTC-trained LSTM to speech in 2007^{[LSTM4]} (also with hierarchical LSTM stacks^{[LSTM14]}). This was very different from earlier hybrid approaches which, since the late 1980s, had combined NNs and traditional approaches such as Hidden Markov Models (HMMs).^{[BW][BRI][BOU][HYB12][T22]} In 2009, CTC-trained LSTM became the first RNN to win international competitions: three ICDAR 2009 Connected Handwriting Competitions (French, Farsi, Arabic). LSTM was soon used for everything that involves sequential data, such as speech^{[LSTM10-11][LSTM4][DL1]} and videos. In 2015, the CTC-LSTM combination dramatically improved Google's speech recognition on Android smartphones.^{[GSR15]} Many other companies adopted this.^{[DL4]} Google's on-device speech recognition of 2019 (now on your phone, not on the server) is still based on LSTM.
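A minimal sketch of a vanilla LSTM step with forget gate, showing the gated additive cell update behind "constant error flow". Weight shapes, initialization scale, and variable names are my assumptions for illustration, not the published architecture details:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W):
    """One LSTM time step. The cell state c is updated *additively*,
    which lets error signals flow across many steps without vanishing."""
    z = np.concatenate([x, h])       # current input plus previous hidden state
    i = sigmoid(W["i"] @ z)          # input gate
    f = sigmoid(W["f"] @ z)          # forget gate (Gers et al.)
    o = sigmoid(W["o"] @ z)          # output gate
    g = np.tanh(W["g"] @ z)          # candidate cell input
    c = f * c + i * g                # gated additive cell update
    h = o * np.tanh(c)               # gated output
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 5
W = {k: rng.normal(scale=0.1, size=(n_hid, n_in + n_hid)) for k in "ifog"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
for t in range(10):                  # run over a short random input sequence
    h, c = lstm_step(rng.normal(size=n_in), h, c, W)
```

The key design choice is the cell update `c = f * c + i * g`: when the forget gate stays near 1, the cell acts as a near-identity carousel through time, which is precisely the remedy for the vanishing-gradient analysis described above.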
In 1995, we already had an excellent neural probabilistic text model^{[SNT]} (compare Nakamura and Shikano's 1989 word category prediction model^{[NPMa]}). In 2001, we showed that LSTM can learn languages unlearnable by traditional models such as HMMs.^{[LSTM13]} By 2017, LSTM powered Facebook's machine translation (billions of translations per day),^{[FB17][DL4]} Apple's Quicktype on roughly 1 billion iPhones,^{[DL4]} the voice of Amazon's Alexa,^{[DL4]} image caption generation^{[DL4]} & automatic email answering,^{[DL4]} etc. Business Week called LSTM "arguably the most commercial AI achievement."^{[AV1]} Many recent publications have "LSTM" in their title.^{[DEC]}
Our Highway Network of May 2015^{[HW1]} was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers). Microsoft's ResNet^{[HW2]} (which won the ImageNet 2015 contest) is a version thereof. The earlier Highway Nets perform roughly as well as their ResNet versions on ImageNet.^{[HW3]} Variants of highway gates are also used for certain algorithmic tasks where the pure residual layers do not work as well.^{[NDR]}
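The relation between the two architectures can be sketched directly: a highway layer mixes a transform of the input with the input itself via learned gates, and driving the gates fully open recovers the residual (ResNet-style) computation. Illustrative NumPy; gate parameterization and sizes are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, Wh, Wt, Wc, bt=0.0, bc=0.0):
    """Highway layer: y = T(x)*H(x) + C(x)*x, with a transform gate T
    and a carry gate C deciding how much to transform vs. copy."""
    H = np.tanh(Wh @ x)           # the nonlinear transform
    T = sigmoid(Wt @ x + bt)      # transform gate in (0, 1)
    C = sigmoid(Wc @ x + bc)      # carry gate in (0, 1)
    return T * H + C * x

def residual_layer(x, Wh):
    """The open-gate special case: y = H(x) + x (ResNet-style)."""
    return np.tanh(Wh @ x) + x

rng = np.random.default_rng(0)
n = 4
x = rng.normal(size=n)
Wh, Wt, Wc = (rng.normal(size=(n, n)) for _ in range(3))
# Strongly positive gate biases saturate both gates at ~1,
# so the highway output reduces to the residual computation H(x) + x.
y_open = highway_layer(x, Wh, Wt, Wc, bt=50.0, bc=50.0)
```

The carry path `C(x)*x` plays the same role as the LSTM cell's forget-gated self-connection, which is why the Highway Net is described above as the feedforward version of vanilla LSTM.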
Deep learning is all about NN depth.^{[DL1]} LSTMs brought essentially unlimited depth to supervised recurrent NNs; in the 2010s, the LSTM-inspired Highway Nets brought it to feedforward NNs. LSTM has become the most cited NN of the 20th century; the Highway Net version called ResNet the most cited NN of the 21st.^{[MOST]} (Citations, however, are a highly questionable measure of true impact.^{[NAT1]}) Reinforcement Learning (RL)^{[KAE96][BER96][TD3][UNI][GM3][LSTMPG]} is about learning to maximize expected cumulative reward signals.^{[DL1]} Many problems of AI can be formulated in the general RL framework.^{[UNI]} Relevant techniques date back to the middle of the 20th century: Monte Carlo (tree) search (MC, 1949),^{[MOC1-5]} dynamic programming (DP, 1953),^{[BEL53]} artificial evolution (1954),^{[EVO1-7]([TUR1],unpublished)} alpha-beta-pruning (1959),^{[S59]} control theory and system identification (1950s),^{[KAL59][GLA85]} stochastic gradient descent (SGD, 1951),^{[STO51-52]} and universal search techniques (1973).^{[AIT7]} In the 1980s, NNs were combined with system identification,^{[WER87-89][MUN87][NGU89]} DP and its online variant called Temporal Differences (TD),^{[TD1-3]} artificial evolution,^{[EVONN1-3]} and policy gradients.^{[GD1][PG1-3]} Many additional references on this can be found in Sec. 6 of the 2015 survey.^{[DL1]}
When there is a Markovian interface^{[PLAN3]} to the environment, RL with DP/TD/MC-based FNNs can be very successful, as shown in 1994^{[TD2]} (master-level backgammon player) and the 2010s^{[DM1-2a]} (superhuman players for Go, chess, and other games). For more complex cases where the agent must remember the history of previous inputs, our combinations of RL algorithms and LSTM^{[LSTM-RL][RPG]} have become standard, in particular, our LSTM trained by policy gradients (2007).^{[RPG07][RPG][LSTMPG]}
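A hedged sketch of the policy-gradient idea itself, stripped down to a REINFORCE-style score-function update on a one-step bandit (illustrative toy, not the LSTM-based systems cited here; payouts, learning rate, and episode count are arbitrary choices):

```python
import numpy as np

def train_bandit_policy(payouts, episodes=5000, lr=0.1, seed=0):
    """Increase the log-probability of sampled actions in proportion
    to the reward they earned (REINFORCE without a baseline)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(payouts))              # one logit per action
    for _ in range(episodes):
        probs = np.exp(theta) / np.exp(theta).sum()   # softmax policy
        a = rng.choice(len(payouts), p=probs)         # sample an action
        r = payouts[a] + rng.normal(0, 0.1)           # noisy reward
        grad_log = -probs                             # d log pi(a) / d theta
        grad_log[a] += 1.0
        theta += lr * r * grad_log                    # policy-gradient step
    return theta

theta = train_bandit_policy(payouts=np.array([0.1, 0.9, 0.3]))
```

After training, the policy concentrates its probability mass on the highest-paying action. In the cited systems the softmax logits come from an LSTM conditioned on the observation history, but the gradient estimator is the same score-function trick.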
For example, in 2018, a PG-trained LSTM was the core of OpenAI's famous Dactyl, which learned to control a dextrous robot hand without a teacher.^{[OAI1][OAI1a]} In 2019, DeepMind beat a pro player in the game of Starcraft, which is theoretically harder than Chess or Go^{[DM2]} in many ways, using Alphastar, whose brain has a deep LSTM core trained by PG.^{[DM3]} A PG-trained LSTM was also the core of OpenAI Five, which learned to defeat human experts in the Dota 2 video game (2018).^{[OAI2]} Bill Gates called this a "huge milestone in advancing artificial intelligence."^{[OAI2a][MIR](Sec. 4)[LSTMPG]} The future of RL will be about learning/composing/planning with compact spatio-temporal abstractions of complex input streams, about commonsense reasoning^{[MAR15]} and learning to think.^{[PLAN4-5]} How can NNs learn to represent percepts and action plans hierarchically, at multiple levels of abstraction and multiple time scales?^{[LEC]} We published answers to these questions in 1990-91: self-supervised neural history compressors^{[UN][UN0-3]} learn to represent percepts at multiple levels of abstraction and multiple time scales (see above), while end-to-end differentiable NN-based subgoal generators^{[HRL3][MIR](Sec. 10)} learn hierarchical action plans through gradient descent (see above). More sophisticated ways of learning to think in abstract ways were published in 1997^{[AC97][AC99][AC02]} and 2015-18.^{[PLAN4-5]} The programmable automaton built in the 1st century^{[SHA7a][RAU1]} by Heron of Alexandria was perhaps the first machine with a stored program.^{[BAN][KOE1]} It used pins on a rotating cylinder to encode the program.
In 1623, Wilhelm Schickard built an early automatic gear-based calculating machine. In 1673, the already mentioned Gottfried Wilhelm Leibniz (called "the smartest man who ever lived"^{[SMO13]}) designed the first machine (the step reckoner) that could perform all four arithmetic operations, and the first with a memory.^{[BL16]} He also described the principles of binary computers operated by punch cards (1679),^{[L79][L03][LA14][HO66]} and published the chain rule^{[LEI07-10]} (see above), an essential ingredient of deep learning and modern AI.
Leonardo Torres y Quevedo (mentioned in the introduction) became the 20th century's first pioneer of practical AI with his chess end game player; the machine was later demonstrated at the 1951 Paris AI conference.^{[AI51][BRO21][BRU4]} Between 1935 and 1941, Konrad Zuse created the world's first working programmable general-purpose computer, the Z3. The corresponding patent of 1936^{[ZU36-38][RO98][ZUS21]} described digital circuits, predating Claude Shannon's 1937 thesis on digital circuit design.^{[SHA37]} Unlike Babbage, Zuse used Leibniz' principles of binary computation (1679),^{[L79][LA14][HO66][L03]} which greatly simplified the hardware.^{[LEI21,a,b]} Ignoring the inevitable storage limitations of any physical computer, the Z3 was universal in the sense of Church^{[CHU]} (1935), Turing^{[TUR]} (1936), and Post^{[POS]} (1936): simple tricks can compensate for its lack of an explicit conditional jump instruction.^{[RO98]}
Tube-based computing advanced in the 1940s through pioneers such as John Atanasoff (the "father of tube-based computing"^{[NASC6a]}), although the transistor principle had already been patented by Julius Edgar Lilienfeld in 1925.^{[LIL1-2]} The tube-based British Colossus was used to break the Nazi code.^{[NASC6]} The first general-purpose program-controlled computer by someone other than Zuse (1941)^{[RO98]} was Howard Aiken's decimal MARK I (US, 1944). Compare also the 1948 upgrade of ENIAC, which was reprogrammed by entering numerical instruction codes into read-only memory.^{[HAI14b]} In 1949, Werner Jacobi filed a patent for integrated circuits (ICs) with several transistors on a common substrate (granted in 1952).^{[IC49-14]} In 1959, Robert Noyce presented a monolithic IC.^{[IC14]} ICs/GPUs of today (2022) contain many billions of transistors (almost all of them of Lilienfeld's 1925 FET type^{[LIL1-2]}). If Moore's Law, which states that the number of transistors^{[LIL1-2]} per chip doubles every couple of years, keeps holding, computers will soon match the raw computational power of all human brains combined.^{[RAW]} According to Bremermann (1982),^{[BRE]} however, physics imposes ultimate limits on such growth, as previously noted back in 2004.^{[OOPS2][ZUS21]} Future computers may be partially photonic (some of today's connections are actually light beams).^{[DL2]} Energy-efficient hardware architectures are expected to become even much more important than they are today.^{[DL2]} And already in 1931, Kurt Gödel had identified fundamental limits of any type of computation-based AI.^{[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]}
He combined Georg Cantor's diagonalization trick^{[CAN]} with the foundational work by Gottlob Frege^{[FRE]} (who introduced the first formal language in 1879), Thoralf Skolem^{[SKO23]} (who introduced primitive recursive functions in 1923) and Jacques Herbrand.^{[GOD86]} Much of this, in turn, goes back to Gottfried Wilhelm Leibniz^{[L86][WI48]} (see above), whose formal Algebra of Thought was deductively equivalent^{[LE18]} to the later Boolean Algebra of 1847.^{[BOO]} In 1936, Alan M. Turing introduced another universal model of computation, the Turing Machine.^{[TUR]} He rederived the above-mentioned result.^{[CHU][TUR][HIN][GOD21,21a][TUR21][LEI21,21a]} In the same year of 1936, Emil Post published yet another independent universal model of computing.^{[POS]} As mentioned above, Konrad Zuse created the world's first working programmable general-purpose computer^{[ZU36-38][RO98][ZUS21]} and designed the first high-level programming language^{[BAU][KNU]} around 1945,^{[KNU]} applying it to theorem proving in 1948.^{[ZU48]} Compare Newell & Simon's later work on theorem proving (1956).^{[NS56]} In 1964, Ray Solomonoff combined Bayesian (actually Laplacian^{[STI83-85]}) probabilistic reasoning and theoretical computer science^{[GOD][CHU][TUR][POS]} to obtain a formal, optimal (though incomputable) theory of learning to predict future data from past observations.^{[AIT1][AIT10]} With Andrej Kolmogorov, he founded the theory of Kolmogorov complexity or algorithmic information theory (AIT),^{[AIT1-22]} going beyond traditional information theory.^{[SHA48][KUL]} Later work extended this concept,^{[AIT7][AIT5][AIT12-13][AIT16-17]} with applications to NNs.^{[KO2][CO1-3]}
In the early 2000s, Marcus Hutter (while working under my Swiss National Science Foundation grant^{[UNI]}) augmented Solomonoff's universal predictor^{[AIT1][AIT10]} such that it also works for agents acting in unknown environments.^{[AIT20,22]} He also derived the asymptotically fastest algorithm for all well-defined computational problems.^{[AIT21]} Looking at the timeline of the most important events since the Big Bang, one finds a beautiful pattern of exponential acceleration in it,^{[OMG]} which I have presented in many talks since then, and which also made it into Sibylle Berg's award-winning book "GRM: Brainfuck."^{[OMG2]} The most recent of these events are separated by remarkably short intervals: just a few decades or centuries or at most millennia.^{[OMG1]} Compare, e.g., early practical technology (such as the automata of Heron of Alexandria^{[RAU1]} in the 1st century), the telephone (e.g., Meucci 1857, Reis 1860, Bell 1876),^{[NASC3]} and the Haber-Bosch process for creating artificial fertilizer, without which the world could feed at most 4 billion people.^{[HAB1-2]} In the 1980s, Ernst Dickmanns built the first truly self-driving cars (by 1994, his robot cars were driving in highway traffic, up to 180 km/h).^{[AUT]} Back then, I worked on my 1987 diploma thesis,^{[META1]} which introduced algorithms not just for learning but also for meta-learning or learning to learn,^{[META]} to learn better learning algorithms through experience (now a very popular topic^{[DEC]}). And then came our Miraculous Year 1990-91^{[MIR]} at TU Munich, the root of today's most cited NNs^{[MOST]} and of modern deep learning: artificial curiosity and generative adversarial NNs for agents that invent their own problems (see above),^{[AC90-AC20][PP-PP2][SA17]} Transformers with linearized self-attention (see above),^{[FWP0-6][TR5-6]} distilling teacher NNs into student NNs (see above),^{[UN][UN0-3]} learning at multiple levels of abstraction and multiple time scales (see above),^{[HRL0-2][LEC]} and other exciting stuff. Much of this has become very popular, and improved the lives of billions of people.^{[DL4][DEC][MOST]} If the pattern of acceleration holds, even more dramatic changes lie ahead (take all of this with a grain of salt, though^{[OMG1]}).
Some expect that AIs driven by artificial curiosity (a topic pursued in my lab for decades^{[AC][AC90,AC90b]}) will quickly improve themselves, restricted only by the fundamental limits of computability and physics. Those who exploit this development^{[ACM16][FA15][SP16][SA17]} will make more and bigger AIs. Those who don't won't have an impact.^{[ACM16][FA15][SP16]}
Some of the material above was taken from previous AI Blog posts.^{[MIR] [DEC] [GOD21] [ZUS21] [LEI21] [AUT] [HAB2] [ARC06] [AC] [ATT] [DAN] [DAN1] [DL4] [GPUCNN5,8] [DLC] [FDL] [FWP] [LEC] [META] [MLP2] [MOST] [PLAN] [UN] [LSTMPG] [BP4] [DL6a] [HIN] [T22]}
More can be found on my publication page and my arXiv page. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
555+ References (and many more in the survey^{[DL1]})
In 2022, we are celebrating the following works from a quarter-century ago.
1. Journal paper on Long Short-Term Memory, the most cited NN of the 20th century (and basis of the most cited NN of the 21st).
2. Paper on all possible metaverses.
3. Implementing artificial curiosity and creativity through generative adversarial agents that learn to design abstract, interesting computational experiments.
4. Paper on meta-reinforcement learning.
5. Journal paper on hierarchical Q-learning.
8. Journal paper on Low-Complexity Art, the Minimal Art of the Information Age.
[AC] J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity.
The first paper on online planning with reinforcement learning recurrent neural networks (NNs) and on generative adversarial networks.
Describes general systems with intrinsic motivation.^{[AC90-AC95]} See later publications.^{[AC99][AC02]}
More on artificial scientists and artists.
With a brief summary of the generative adversarial neural networks of 1990^{[AC90,90b][AC20]}
Preprint arXiv/1906.04493.
[AIB] J. Schmidhuber. AI Blog.
Includes variants of chapters of the AI Book.
H. Bruderer^{[BRU4]} calls that the first conference on AI.
Blog of Werner Vogels, CTO of Amazon (Nov 2016).
First publication of what was later sometimes called the Hopfield network^{[AMH2]} or Amari-Hopfield Network,^{[AMH3]} based on the (uncited) Lenz-Ising recurrent architecture.^{[L20][I25][T22]}
Mentions the recurrent Ising model^{[L20][I25]} on which the (uncited) Amari network^{[AMH1,2]} is based.
The Hopfield network or Amari-Hopfield Network was first published in 1972 by Amari.^{[AMH1]} [AMH2] did not cite [AMH1].
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. Schmidhuber had both hard attention (1990) and soft attention in Transformers with linearized self-attention (1991-93).^{[FWP]} Today, both types are very popular.
H. Larochelle, G. E. Hinton. Learning to combine foveal glimpses with a third-order Boltzmann machine. NIPS 2010. This work is very similar to [ATT0-2] which the authors did not cite.
In fact, Hinton was the reviewer of a 1990 paper,^{[ATT2]} and later wrote about his own work^{[ATT3]} with its "attentional component (the fixation controller)." See [MIR](Sec. 9)[R4].
arXiv/1409.0473, 2014-16.
This work on soft "attention" did not cite Schmidhuber's much earlier original work of 1991-1993 on soft attention and Transformers with linearized self-attention.^{[FWP,FWP0-2,6][ATT]}
J. Schmidhuber (AI Blog, 2005). Highlights of robot car history.
Bloomberg, May 15, 2018.
This work did not cite the relevant prior art: neither the earlier related models by Sherrington & Kirkpatrick^{[SK75]} & Glauber^{[G63]} nor the first working algorithms for deep learning of internal representations (Ivakhnenko & Lapa, 1965)^{[DEEP1-2][HIN]} nor Amari's work (1967-68)^{[GD1-2]} on learning internal representations in deep nets through stochastic gradient descent. Even later surveys by the authors^{[S20][DLC]} failed to cite the prior art.^{[T22]}
Leibniz's formal Algebra of Thought (1686)^{[L86][WI48]} was deductively equivalent^{[LE18]} to the much later Boolean Algebra (1847).^{[BOO]}
Precursor of modern backpropagation.^{[BP1-5]}
First application of backpropagation^{[BP1]} to NNs (concretizing thoughts in Werbos' 1974 thesis).
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020). Who invented backpropagation? More in [DL2].
IEEE Spectrum, 2021.
English version: [CNN1+]. More in Scholarpedia.
[CNN1a] A. Waibel. Phoneme Recognition Using Time-Delay Neural Networks. Meeting of IEICE, Tokyo, Japan, 1987. First application of backpropagation^{[BP1-5]} and weight-sharing to speech recognition (TDNNs).
Uses max-pooling instead of Fukushima's Spatial Averaging.^{[CNN1]}
Uses max-pooling instead of Fukushima's Spatial Averaging.^{[CNN1]}
Inverse, 2016.
Since November 2021: Comments on version 1 of the report^{[T22]}
in the Connectionists Mailing List, perhaps the oldest mailing list on artificial neural networks. Link to the archive.
Beijing, 2014. Preprint arXiv:1402.3511 [cs.NE].
[DAN] J. Schmidhuber (AI Blog, 2021). 10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named after my postdoc Dan Ciresan, it achieved the 1st superhuman result in 2011.^{[DAN1]} Now everybody is using this approach.
[DAN1] J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition, achieved by the artificial neural network called DanNet.
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, updated 2021, 2022). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. Also mentions the 1991 NN distillation procedure.^{[UN0-2][MIR](Sec. 2)}
More.
Deep Learning.
HTML.
A "survey" of deep learning that does not mention the pioneering works of deep learning [T22].
[DL3a] Y. Bengio, Y. LeCun, G. Hinton (2021). Turing Lecture: Deep Learning for AI. Communications of the ACM, July 2021. HTML.
Local copy (HTML only).
Another "survey" of deep learning that does not mention the pioneering works of deep learning [T22].
[DL4] J. Schmidhuber (AI Blog, 2017).
Our impact on the world's most valuable public companies: Apple, Google, Microsoft, Facebook, Amazon... By
greatly improved (CTC-based)
on-device speech recognition
(on the phone, not the server)
LSTM.
PDF.
J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). The deep reinforcement learning & neuroevolution developed in Schmidhuber's lab solved problems of depth 1000 and more.^{[DL6]} Soon after its publication, everybody started talking about "deep learning." Causality or correlation?
Web site deeplearning.net of Y. Bengio's MILA (2015, retrieved May 2020; compare the version in the
Internet Archive),
referring to Hinton's^{[UN4]} and Bengio's^{[UN5]}
unsupervised pre-training for deep NNs^{[UN4]} (2006) although
this type of deep learning dates back to Schmidhuber's work of 1991.^{[UN1-2][UN]}
[DLC] J. Schmidhuber (AI Blog, June 2015).
Critique of Paper by self-proclaimed^{[DLC2]} "Deep Learning Conspiracy" (Nature 521 p 436).
it). More on this under [T22].
J. Schmidhuber (AI Blog, 2022).
Annotated History of Modern AI and Deep Learning. Technical Report IDSIA-22-22, IDSIA, Lugano, Switzerland, 2022.
Preprint arXiv:2212.11279.
Tweet of 2022.
arxiv:1312.5602.
Link.
the first sentence of the abstract of the earlier tech report version^{[DM1]}
was created earlier by Jan Koutnik et al. in Schmidhuber's lab.^{[CO2]}
and PhDs in computer science. More.
Alphastar has a "deep LSTM core."
Hochreiter et al.'s first successful application [HO07] of deep learning to protein folding (2007).
Preprint arXiv:2112.10752, LMU Munich, 2021.
neural networks learning to control dynamic external memories.^{[PDA1-2][FWP0-1]}
arXiv:1808.03578, 2018.
arXiv:1808.03578, 2018.
Conf. on Neural Networks, Vol. 2, 2004, pp. 985-990. This paper does not mention that the "ELM" concept goes back to Rosenblatt's work in the 1950s.^{[R62][T22]}
This overview does not mention that the "ELM" concept goes back to Rosenblatt's work in the 1950s.^{[R62][T22]}
Link.
used LSTM
over 4 billion automatic translations per day (The Verge, August 4, 2017);
Facebook blog by J.M. Pino, A. Sidorov, N.F. Ayan (August 3, 2017)
[FDL] J. Schmidhuber (AI Blog, 2013). My First Deep Learning System of 1991 + Deep Learning Timeline 1960-2013.
PDF.
J. Schmidhuber (AI Blog, 26 March 2021, updated 2022).
alternative^{[FWP0-1]} to recurrent NNs.
the fast weights^{[FAST,FASTa,b]} of
Such Fast Weight Programmers^{[FWP0-6,FWPMETA1-8]} can learn to memorize past data, e.g.,
by computing fast weight changes through additive outer products of self-invented activation patterns^{[FWP0-1]}
(now often called keys and values for self-attention^{[TR1-6]}).
The similar Transformers^{[TR1-2]} combine this with projections
Transformers with linearized self-attention^{[TR5-6]}
In 1993, he introduced
the attention terminology^{[FWP2]} now used
in this context,^{[ATT]} and
RNNs that program themselves.
See tweet of 2022.
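The annotation above describes the mechanism concretely: a slow net programs a fast weight matrix through additive outer products of activation patterns (today's keys and values), and the fast net is then applied to each query. A minimal NumPy sketch of this correspondence to linearized self-attention (function name and shapes are illustrative, not taken from the cited papers):

```python
import numpy as np

def fwp_linear_attention(keys, values, queries):
    """Fast Weight Programmer view of linearized self-attention:
    the fast weight matrix W is updated by additive outer products
    of (value, key) pairs, then queried at every step."""
    d_k, d_v = keys.shape[1], values.shape[1]
    W = np.zeros((d_v, d_k))              # fast weights, start at zero
    outputs = []
    for k, v, q in zip(keys, values, queries):
        W += np.outer(v, k)               # additive outer-product update
        outputs.append(W @ q)             # retrieval: fast net applied to query
    return np.array(outputs)
```

Unrolling the update shows the equivalence: the output at step t equals the sum over s <= t of values[s] weighted by the unnormalized dot products keys[s] . queries[t], i.e., attention with a linear kernel and no softmax.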
PDF.
normalization).^{[FWP]}
PDF.
HTML.
Pictures (German).
See tweet of 2022 for 30-year anniversary.
PDF.
Preprint: arXiv:1811.12143. PDF.
PDF. Very similar to [FWP0-2], in both motivation [FWP2] and execution.
This work on "attention" did not cite Schmidhuber's much earlier original work of 1991-1993 on soft attention and Transformers with linearized self-attention.^{[FWP,FWP0-2,6][ATT]}
Preprint: arXiv:2003.08165.
PDF.
HTML overview.
Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021. Preprint: arXiv:2102.11174.
Preprint: arXiv:2106.06295 (June 2021).
PDF.
An introspective network that can learn to run its own weight change algorithm. In Proc. of the Intl. Conf. on Artificial Neural Networks,
J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF.
can be found here.
Preprint arXiv:2012.14905 [cs.LG], 2020.
Report arXiv:2011.07831 [cs.AI], 2020.
Preprint: arXiv:2202.05780.
PDF.
Probably the first paper on using stochastic gradient descent^{[STO51-52]}
reverse mode of automatic differentiation or backpropagation^{[BP1]}).
OCR-based PDF scan of pages 94-135 (see pages 119-120).
Implementation of Amari's 1967 stochastic gradient descent method for multilayer perceptrons.^{[GD1]} (S. Amari, personal communication, 2021.)
Preprint arXiv/2207.01570, 4 July 2022 (submitted in May 2022).
arXiv:cs/0309048 (2003).
More.
PDF.
Cognitive Computation 1(2):177-193, 2009. PDF.
More.
Google Research Blog, Sep 2015, see also
Aug 2015 Google's speech recognition based on CTC and LSTM.
Alphr Technology, Jul 2015, or 9to5google, Jul 2015
WIRED, Sep 2016,
siliconANGLE, Sep 2016
Blog post, Internet Archive, 2010.
A blog post describing basic ideas^{[AC][AC90,AC90b][AC20]} of GANs.
A description of GANs that does not cite Schmidhuber's original GAN principle of 1990^{[AC][AC90,AC90b][AC20][R2][T22]} (also containing wrong claims about Schmidhuber's adversarial NNs for Predictability Minimization^{[PM0-2][AC20][T22]}).
Link.
This was number 1 on Hacker News.
Frankfurter Allgemeine Zeitung, 16/6/2021.
Preprint arXiv/2005.14165.
for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011. PDF. ArXiv preprint.
win four important computer vision competitions 2011-2012 before others won any
PDF.
HTML overview.
competitor.^{[DAN1]} This led to massive interest from industry.
[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More.
PDF.
DanNet,^{[DAN,DAN1][R6]}
to win computer vision contests in 2011^{[GPUCNN2-3,5]} (AlexNet and VGG Net^{[GPUCNN9]} followed in 2012-2014). [GPUCNN4] emphasizes benefits of Fukushima's ReLUs (1969)^{[RELU1]} and dropout (a variant of Hanson's 1990 stochastic delta rule)^{[Drop1-4]} but neither cites the original work^{[RELU1][Drop1]} nor the basic CNN architecture (Fukushima, 1979).^{[CNN1]}
J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet was the first CNN to win one, and won 4 of them in a row before the similar AlexNet/VGG Net and the Resnet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision.
PDF.
PDF.
[GPUCNN8] J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet).
first deep learner to win a medical imaging contest (2012). Link.
J. Schmidhuber (Blog, 2000). Most influential persons of the 20th century (according to Nature, 1999). The Haber-Bosch process has often been called the most important invention of the 20th century^{[HAB1]}
PDF.
PDF.
Bengio claimed^{[YB20]}
Schmidhuber's publications on exactly this topic
date back to 1991-93.^{[UN0-2][UN]}
An unsupervised learning algorithm related to Schmidhuber's supervised Neural Heat Exchanger.^{[NHE]}
[HIN] J. Schmidhuber (AI Blog, 2020). Critique of Honda Prize for Dr. Hinton. Science must not allow corporate PR to distort the academic record. See also [T22].
previous related work.^{[BB2][NAN1-4][NHE][MIR](Sec. 15, Sec. 17)[FWPMETA6]}
PDF.
what Y. LeCun called an "open problem" in 2022.^{[LEC]}
North-Holland, 1991. PDF. Extending TR FKI-129-90, TUM, 1990.
PDF.
This work did not cite Schmidhuber's gradient-based subgoal generators for hierarchical reinforcement learning (1990).^{[HRL0-2]}
PDF.
Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS 2015. The
LSTM with forget gates^{[LSTM2]} for RNNs.) Resnets^{[HW2]} are a version of this where the gates are always open: g(x)=t(x)=const=1.
Highway Nets perform roughly as well as ResNets^{[HW2]} on ImageNet.^{[HW3]} Variants of highway gates are also used for certain algorithmic tasks, where the simpler residual layers do not work as well.^{[NDR]}
More.
Link.
arXiv:1512.03385
(Dec 2015). Residual nets are a version of Highway Nets^{[HW1]}
More.
arxiv:1612.07771 (2016). Also at ICLR 2017.
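The gate relation stated above can be sketched in a few lines of NumPy (shapes and the sigmoid gate parameterization are illustrative): a residual layer is exactly a highway layer whose transform and carry gates are fixed open.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, W_g, W_t):
    """Highway layer: y = h(x)*g(x) + x*t(x) with learned gates g, t."""
    h = np.tanh(W_h @ x)      # candidate transform h(x)
    g = sigmoid(W_g @ x)      # transform gate g(x)
    t = sigmoid(W_t @ x)      # carry gate t(x)
    return h * g + x * t

def residual_layer(x, W_h):
    """ResNet case: gates always open, g(x) = t(x) = 1."""
    return np.tanh(W_h @ x) + x
```

With zero gate weights both gates sit at 0.5, blending transform and carry paths equally; driving both gates to 1 recovers the residual layer.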
This work did not cite the earlier LSTM^{[LSTM0-6]} trained by Connectionist Temporal Classification (CTC, 2006).^{[CTC]} CTC-LSTM was successfully applied to speech in 2007^{[LSTM4]} (also with hierarchical LSTM stacks^{[LSTM14]}) and became the first superior end-to-end neural speech recogniser that outperformed the
state of the art, dramatically improving Google's speech recognition.^{[GSR][GSR15][DL4]}
Markov models (HMMs).^{[BW][BRI][BOU]} [HYB12] still used the old hybrid approach and did not compare it to CTC-LSTM. Later, however, Hinton switched to LSTM, too.^{[LSTM8]}
Ernst Ising and Wilhelm Lenz in the 1920s.^{[L20][I25][K41][W45][T22]} It settles into an equilibrium state in response to input conditions, and is the foundation of the first well-known learning RNNs.^{[AMH1-2]}
Who Invented the IC?
Preprint arXiv:1704.04760
PDF.
PDF.
Mathematischen Schriften, ed. C. Gerhardt, Berlin 1879, vol.7, p.223. English link.
Link.
arXiv:1607.06450, 2016.
[LEC] J. Schmidhuber (AI Blog, 2022). LeCun's 2022 paper on autonomous machine intelligence rehashes but does not cite essential work of 1990-2015. Years
See tweet1.
LeCun also listed the "5 best ideas 2012-2022" without mentioning that
See tweet2.
[LEI21] J. Schmidhuber (AI Blog, 2021). 375th birthday of Leibniz, founder of computer science.
Frankfurter Allgemeine Zeitung (FAZ), 17/5/2021. FAZ online:
19/5/2021.
[LEI21b] J. Schmidhuber (AI Blog, 2021). 375. Geburtstag des Herrn Leibniz, dem Vater der Informatik.
PDF.
[LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF.
Based on [LSTM0]. More.
PDF.
PDF.
PDF.
PDF.
PDF.
PDF.
PDF.
PDF.
PDF.
PDF.
PDF.
Preprint: arxiv:1506.07452.
PDF.
J. Schmidhuber (AI Blog, Dec 2020). 10-year anniversary of our journal paper on deep reinforcement learning with policy gradients for LSTM (2007-2010). Recent
PDF.
are actually a variant of the vanilla LSTM architecture^{[LSTM2]} (2000) which the authors did not cite
although this work^{[LSTM2]} was the one that introduced gated recurrent units.
Furthermore, Schmidhuber's team automatically evolved lots of additional LSTM variants and topologies already in 2009^{[LSTM7]} without changing the name of the basic method.
learn to count^{[LSTMGRU2]} nor learn simple non-regular
languages;^{[LSTMGRU2]} they
according to Google Brain.^{[LSTMGRU3]})
Preprint arXiv:1805.04908.
Architectures. Preprint arXiv:1703.03906
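For reference, one step of a vanilla-LSTM-style cell with the forget gate introduced in [LSTM2] can be sketched as follows (packing all four gate pre-activations into a single weight matrix is a hypothetical simplification):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W):
    """One step of a vanilla-LSTM-style cell: input (i), forget (f),
    output (o) gates and candidate (g); W packs all four pre-activations."""
    z = W @ np.concatenate([x, h])
    i, f, o, g = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # forget gate rescales old cell state
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new
```

The forget gate multiplies the old cell state by a learned factor in (0,1), which is what lets the cell both retain and reset information over long sequences.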
A misleading "history of deep learning" goes more or less like this: "In 1969, Minsky & Papert^{[M69]}
researchers took a fresh look at the problem in the 1980s."^{[S20]} However, the 1969 book^{[M69]} addressed a "problem" of Gauss & Legendre's shallow learning (circa 1800)^{[DL1-2]} that had already been solved 4 years prior by Ivakhnenko & Lapa's popular deep learning method,^{[DEEP1-2][DL2]}
and then also by Amari's SGD for MLPs.^{[GD1-2]}
Minsky was apparently unaware of this and failed to correct it later.^{[HIN](Sec. I)[T22](Sec. XIII)}
J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of
Searchable PDF scan (created by OCRmypdf which uses
LSTM).
HTML.
better GP methods through Meta-Evolution. More.
[MIR] J. Schmidhuber (AI Blog, Oct 2019, updated 2021, 2022). Deep Learning: Our Miraculous Year 1990-1991. Preprint
arXiv:2005.05744, 2020. The
Computation 22(12): 3207-3220, 2010. ArXiv Preprint.
(AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training.
By 2010, when compute was 100 times more expensive than today, both the feedforward NNs^{[MLP1]}
J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in Schmidhuber's labs at TU Munich and IDSIA. (1) Long Short-Term Memory (LSTM), (2) ResNet (which is the earlier Highway Net with open gates), (3) AlexNet and VGG Net (both building on the similar earlier DanNet: the first deep convolutional NN to win
image recognition competitions),
Adversarial Artificial Curiosity), and (5) variants of Transformers (Transformers with linearized self-attention are formally equivalent to the much earlier Fast Weight Programmers).
Annus Mirabilis of 1990-1991.^{[MIR]}
PDF.
PDF.
Preprint arXiv:1608.05343, 2016.
Preprint arXiv:1611.01578 (PDF), 2017.
Compare the earlier Neural Architecture Search of Bayer et al. (2009) for LSTM-like topologies.^{[LSTM7]}
[NASC1] J. Schmidhuber. First Pow(d)ered flight / plane truth. Correspondence, Nature, 421 p 689, Feb 2003.
[NASC3] J. Schmidhuber. The last inventor of the telephone. Letter, Science, 319, no. 5871, p. 1759, March 2008.
Correspondence, Nature, vol 483, p 541, March 2012, doi:10.1038/483541b.
Letter, Science, vol 336, p 1639, June 2012.
See also comment on response by A. Hodges (DOI:10.1126/science.336.6089.1639-a)
[NASC6] J. Schmidhuber. Colossus was the first electronic digital computer. Correspondence, Nature, 441 p 25, May 2006.
[NASC6a] J. Schmidhuber. Comment on "Biography: The ABC of computing" by J. Gilbey, Nature 468 p 760-761 (2010). Link.
[NASC7] J. Schmidhuber. Turing's impact. Correspondence, Nature, 429 p 501, June 2004
[NASC8] J. Schmidhuber. Prototype resilient, self-modeling robots. Correspondence, Science, 316, no. 5825 p 688, May 2007.
[NASC9] J. Schmidhuber. Comparing the legacies of Gauss, Pasteur, Darwin. Correspondence, Nature, vol 452, p 530, April 2008.
HTML.
The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization. Proc. ICLR 2022. Preprint arXiv/2110.07732.
Link.
excellent 1995 neural probabilistic text model.^{[SNT]} See also Nakamura and Shikano's 1989 word category prediction model.^{[NPMa]}
Compare Konrad Zuse's much earlier 1948 work on
theorem proving^{[ZU48]}
the first high-level programming language.^{[BAU][KNU]}
NY Times article
Learning Dexterous In-Hand Manipulation. arXiv:1808.00177 (PDF).
arxiv:1912.06680.
An LSTM composes 84% of the model's total parameter count.
2018. An LSTM with 84% of the model's total parameter count was the core of OpenAI Five.
Link.
J. Schmidhuber (Blog, 2006).
Is History Converging? Again?
history's exponential acceleration since the Big Bang.^{[OMG]}
Preprint arXiv/1606.06724.
Preprint arXiv/1708.03498.
Preprint arXiv/1802.10353.
Preprint arXiv/2010.03635.
Preprint arXiv/2011.12930.
PDF.
HTML.
HTML overview.
OOPS source code in crystalline format.
PDF.
HTML.
Link.
J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, and
the GAN principle
Based on TR FKI-126-90 (1990).^{[AC90]}
More.
PDF.
Partially based on TR FKI-126-90 (1990).^{[AC90]}
Report arXiv:1210.0118 [cs.AI], 2015.
One Big Net For Everything. Preprint arXiv:1802.08864 [cs.AI], Feb 2018.
Preprint: arXiv:1809.01999.
Github: World Models.
minimization. TR CU-CS-565-91, Univ. Colorado at Boulder, 1991. PDF.
More.
1991. PDF.
More.
PDF. More.
Link.
arXiv:1112.5309 [cs.AI]
PDF.
First Experiments with PowerPlay.
arXiv:1210.8385 [cs.AI].
[R1] Reddit/ML, 2019. Hinton, LeCun, Bengio receive ACM Turing Award. This announcement contains more comments about Schmidhuber than about any of the awardees.
[R2] Reddit/ML, 2019. J. Schmidhuber really had GANs in 1990.
[R3] Reddit/ML, 2019. NeurIPS 2019 Bengio Schmidhuber Meta-Learning Fiasco.
in 1987^{[META1][META]} long before Bengio
[R4] Reddit/ML, 2019. Five major deep learning papers by G. Hinton did not cite similar earlier work by J. Schmidhuber.
[R5] Reddit/ML, 2019. The 1997 LSTM paper by Hochreiter & Schmidhuber has become the most cited deep learning research paper of the 20th century.
[R6] Reddit/ML, 2019. DanNet, the CUDA CNN of Dan Ciresan in J. Schmidhuber's team, won 4 image recognition challenges prior to AlexNet.
[R7] Reddit/ML, 2019. J. Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970.
[R8] Reddit/ML, 2019. J. Schmidhuber on Alexey Ivakhnenko, godfather of deep learning 1965.
[R9] Reddit/ML, 2019. We
[R11] Reddit/ML, 2020. Schmidhuber: Critique of Honda Prize for Dr. Hinton
[R12] Reddit/ML, 2020. J. Schmidhuber: Critique of Turing Award for Drs. Bengio & Hinton & LeCun
[R15] Reddit/ML, 2021. J. Schmidhuber's work on fast weights from 1991 is similar to linearized variants of Transformers
Although these MLPs did not yet have deep learning, because only the last layer learned,^{[DL1]}
Rosenblatt basically had what much later was rebranded as Extreme Learning Machines (ELMs) without proper attribution.^{[ELM1-2][CONN21][T22]}
J. Schmidhuber (AI Blog, 2001). Raw Computing Power.
Preprint arXiv/1311.2524, Nov 2013.
Preprint arXiv/1703.06870, 2017.
PDF.
The first paper on policy gradients for LSTM. This approach has become very important in reinforcement learning.^{[LSTMPG]}
This experimental analysis of backpropagation did not cite the origin of the method,^{[BP1-5]} also known as the reverse mode of automatic differentiation.
the first working algorithms for deep learning of internal representations (Ivakhnenko & Lapa, 1965)^{[DEEP1-2][HIN]} as well as
Amari's work (1967-68)^{[GD1-2]} on learning internal representations in deep nets through stochastic gradient descent.
Even later surveys by the authors^{[DL3,3a]} failed to cite the prior art.^{[T22]}
Link.
A misleading "history of deep learning" which goes more or less like this: "In 1969, Minsky & Papert^{[M69]}
researchers took a fresh look at the problem in the 1980s."^{[S20]} However, the 1969 book^{[M69]} addressed a "problem" of Gauss & Legendre's shallow learning (circa 1800)^{[DL1-2]} that had already been solved 4 years prior by Ivakhnenko & Lapa's popular deep learning method,^{[DEEP1-2][DL2]}
and then also by Amari's SGD for MLPs.^{[GD1-2]}
Minsky was apparently unaware of this and failed to correct it later.^{[HIN](Sec. I)[T22](Sec. XIII)}
in the 1960s-70s, especially outside of the Anglosphere.^{[DEEP1-2][GD1-3][CNN1][DL1-2][T22]}
The Past, Present and Future of Artificial Intelligence.
Link.
PDF.
Much later this was called a probabilistic language model.^{[T22]}
PDF.
Link.
ACM's justification of the 2018 A.M. Turing Award (announced in 2019). WWW link.
Local copy 1 (HTML only).
Local copy 2 (HTML only).
[T22] debunks this justification.
[T20a] J. Schmidhuber (AI Blog, 25 June 2020). Critique of 2018 Turing Award for Drs. Bengio & Hinton & LeCun. A precursor of [T22].
[T22] J. Schmidhuber (AI Blog, 2022).
Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21, IDSIA, Lugano, Switzerland, 2022.
Debunking [T19] and [DL3a].
the 1991 publication on what's now called "Transformers with linearized self-attention."^{[FWP0-6][TR5-6]}
attention terminology in 1993.^{[ATT][FWP2][R4]}
See tweet of 2022 for 30-year anniversary.
Link.
[TUR21] J. Schmidhuber (AI Blog, Sep 2021). Turing Oversold. It's not Turing's fault, though.
The Turing Test.
YouTube video, 2022.
Preprint arXiv/1912.02875, 5 Dec 2019.
Preprint arXiv/1912.02877, 5 Dec 2019.
J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised or self-supervised pre-training. Unsupervised
PDF.
By 1993, the approach solved problems of depth 1000 [UN2]
neural knowledge distillation procedure
The systems of 1991 allowed for much deeper learning than previous methods. More.
1992. Based on TR FKI-148-91, TUM, 1991.^{[UN0]} PDF.
approaches are now widely used. More.
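The distillation idea behind this entry — compressing a teacher net into a student by training the student on the teacher's input-output behavior — can be sketched in a few lines (the tiny one-layer nets, seed, and learning rate are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed "teacher" net whose input->output mapping we want to compress.
teacher_W = 0.5 * rng.normal(size=(2, 4))
def teacher(x):
    return np.tanh(teacher_W @ x)

# "Student" of the same shape, trained only to imitate the teacher's outputs.
student_W = np.zeros((2, 4))
lr = 0.1
for _ in range(3000):
    x = rng.normal(size=4)
    s = np.tanh(student_W @ x)
    err = s - teacher(x)                                # imitation error
    student_W -= lr * np.outer(err * (1 - s**2), x)     # SGD on 0.5*||err||^2
```

No labels from the original task are needed; the teacher's own outputs serve as targets.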
[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF.
can be found here (depth > 1000).
2006. PDF.
It did not cite the much earlier 1991 unsupervised pre-training of stacks of more general recurrent NNs (RNNs)^{[UN0-3]}
the first NNs shown to solve very deep problems.
(or negative log probability) of the data representation in the level below.^{[HIN][T22][MIR]}
This can greatly facilitate very deep downstream learning.^{[UN0-3]}
The comment under reference^{[UN4]} applies here as well.
Theory of Universal Learning Machines & Universal AI.
Link.
[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF. More on the Fundamental Deep Learning Problem.
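The Fundamental Deep Learning Problem analyzed in this thesis can be illustrated numerically: backpropagated error through many nonlinear stages is scaled by a product of per-step factors |w * f'|, which shrinks exponentially with depth when these factors stay below 1 (scalar toy model; values illustrative):

```python
import numpy as np

def backprop_gradient_magnitude(w, depth, x=0.5):
    """|d h_T / d h_0| for the scalar recurrence h_{t+1} = tanh(w * h_t):
    a product of per-step factors |w * tanh'(w * h_t)|."""
    g, h = 1.0, x
    for _ in range(depth):
        h = np.tanh(w * h)
        g *= abs(w) * (1.0 - h**2)   # |w| times tanh derivative at this step
    return g
```

With |w| < 1 every factor is below 1 and the gradient vanishes exponentially in the depth; with large |w| the product can instead explode.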
Results are essentially identical to those of Schmidhuber's diploma student Sepp Hochreiter (1991).^{[VAN1]} Even after a common publication,^{[VAN3]} the first author of [VAN2] published papers^{[VAN4]} that cited only their
own [VAN2] but not the original work.
PDF.
[VAN4] Y. Bengio. Neural net language models. Scholarpedia, 3(1):3881, 2008. Link.
Link.
Youtube video [see 28:16].
However, in 2010, Schmidhuber's team in Switzerland showed^{[MLP1-2]}
that unsupervised pre-training is not necessary
Preprint arXiv:1609.08144 (PDF), 2016. Based on LSTM which it mentions at least 50 times.
WWW link (retrieved 15 May 2020).
Local copy (plain HTML only).
Schmidhuber's publications on exactly this topic
date back to 1991-93.^{[UN0-2][UN]}
already in 1995.^{[SNT]}
a general, practical, program-controlled computer.
architecture [NEU45].
PDF.
J. Schmidhuber (AI Blog, 2021). 80th anniversary celebrations: 1941: Konrad Zuse completes the first working general computer, based on his 1936 patent application.
J. Schmidhuber (AI Blog, 2021). 80. Jahrestag: 1941: Konrad Zuse baut ersten funktionalen Allzweckrechner, basierend auf der Patentanmeldung von 1936.
Weltwoche, Nr. 33.21, 19 August 2021.
PDF.
(v1: 24 Sep 2021,
v2: 31 Dec 2021)
Versions since 2021 archived in the Internet Archive
This is a point-for-point critique of ACM's justification of the ACM A. M. Turing Award for deep learning, as well as a critique of the Turing Lecture given by the awardees (published by ACM in July 2021).
deep learning survey,^{[DL1]} and can also be seen as a short history of the deep learning revolution, at least as far as ACM's erroneous laudation and the Turing Lecture are concerned.
2015 survey of deep learning^{[DL1]}
June 2020 article^{[T20a][R12]}
version 1 of the present report.
(see Executive Summary
I,
V,
II,
XII,
XIX,
XXI,
XIII,
XIV,
XX,
XVII).
(A) speech recognition,
(B) natural language processing,
(C) robotics,
(D) computer vision,
(VII) medicine, astronomy, materials science.
A,
B,
C,
D,
VII,
XVII,
VI,
XVI).
II,
V,
XX,
XVIII)
with Dr. Bengio & Dr. Hinton (see Sec. XVII, I).
I respond to LBH's recent ACM article (July 2021).
expands material in my Critique of the 2019 Honda Prize^{[HIN]} (~3,000 words).
Abstract & Outline (~300 words),
Introduction (~300 words),
Critique of LBH's ACM article (Turing Lecture) of July 2021^{[DL3a]}
Executive summary of what's wrong with ACM's laudation (~1,000 words),
21 comments on 21 claims by ACM (~8,000 words),
Conclusion (~2,000 words).
All backed up by over 300 references (over 10,000 words).
The text contains numerous hyperlinks to relevant overview sites from the AI Blog.
science is self-correcting."^{[SV20]}
they are mine or other people's.^{[DL1-2][HIN][NASC1-9]} The present page is offered as a resource for all good computer scientists who share this inclination.
and to fight plagiarism,^{[FAKE2]}
collusion rings,^{[LIT21]} and systemic academic corruption in all of their more and less subtle forms.^{[FAKE]}
Sec. 2
LBH's 2021 ACM article^{[DL3a]} which necessitated an extension of the
first version
of this post.^{[T20a][R12]}
ACM's official justification^{[T19]} of the
2018 A.M. Turing Award^{[R1]}
After the Executive Summary in Sec. 3, Sec. 4 will split
ACM's full text^{[T19]}
into 21 parts
I,
II,
III,
IV,
V,
VI,
VII,
VIII,
IX,
X,
XI,
XII,
XIII,
XIV,
XV,
XVI,
XVII,
XVIII,
XIX,
XX,
XXI.
Most of the critiques are based on references to original papers and material from the AI Blog.^{[AIB][MIR][DEC][HIN]}
publishing yet another misleading overview of the field, this time based on LBH's Turing Lecture.^{[DL3a]}
LBH's well-known earlier omissions.^{[DLC][HIN][T20a]}
LBH claim to "briefly describe the origins of deep learning"^{[DL3a]} without even mentioning the world's first working deep learning nets by
Ivakhnenko and Lapa in 1965^{[DEEP1-2][R8]} (see Sec. II).
this class of methods was pioneered in 1991^{[UN-UN2]} (see Sec. II, III).
Highway Net,
the first really deep feedforward NN.^{[HW1-3]}
(see Sec. D, VI).
were all driven by my lab:^{[MOST]} In 1991, I had the
first very deep NNs based on unsupervised pre-training;^{[UN-UN2]}
LSTMs
brought essentially unlimited depth to gradient-based supervised recurrent NNs;^{[LSTM0-17]}
later our Highway Nets^{[HW1-3]} brought it to feedforward NNs.
from 2007^{[LSTM4,14]}
based on LSTM^{[LSTM0-6]} (1990s-2005) and CTC (2006).^{[CTC]}
our CTC-LSTM-based speech recognition (not that of Hinton) had been on most smartphones for years^{[GSR][GSR15-19][DL4]} (see Sec. A, VI, XI, XV). Similarly for machine translation (see Sec. B).
LBH cite Hinton (2012) for "dropout" without mentioning that dropout is just a variant of Hanson's 1990 stochastic delta rule^{[Drop1-3]} (see Sec. XIV).
perceptrons through stochastic gradient descent^{[GD1-3]} (without reverse mode backpropagation^{[BP1]}).
Fukushima who introduced ReLUs in 1969^{[RELU1-2]} (see Sec. XIV).
called AlexNet,^{[GPUCNN4]} without mentioning that our earlier groundbreaking deep GPU-based DanNet^{[GPUCNN1-3,5-8][DAN]} did not need ReLUs at all to win 4 earlier object recognition competitions and to achieve superhuman results already in 2011^{[GPUCNN1-8][R5-6]} (see Sec. XIV).
XVIII).
already in 1965^{[DEEP1-2][R8]} (see Sec. II).
earlier fast weights of von der Malsburg (1981) and Feldman (1982).^{[FAST,FASTa-b][FWP]}
described in the 1991-93 papers on Fast Weight Programmers and linear Transformers^{[FWP0-1,6]} (see Sec. XVI, XVII-2).
dedicate an extra section to attention-based Transformers,^{[TR1-6]} citing Bengio's team (2014) for "soft attention"^{[ATT14]} without citing the much earlier original work of 1991-1993 on soft attention and linear Transformers^{[FWP,FWP0-2,6][ATT]} (see Sec. XVII-1, XVI).
LBH claim that Bengio's team^{[NPM]}
of text compression^{[SNT]} (see Sec. XVI, XVII-1).
LBH cite Bengio's 2014 paper on Generative Adversarial Networks (GANs)^{[GAN0-1]} without mentioning that
GANs are instances
of the Adversarial Curiosity Principle of 1990^{[AC90-20][MIR](Sec. 5)} (see Sec. XVII).
In sum, LBH have repeatedly chosen to ignore the previous well-known critiques^{[DLC][HIN][T20a]} and deep learning surveys,^{[DL1-2]} and ACM's peer review process failed to catch this. ACM's Code of Ethics and Professional Conduct^{[ACM18]} states: "Computing
and deep learning (e.g., Sec. I), ACM lauds
Numerous references can be found under the relevant section links I-XXI
which adhere to the sequential order of ACM's text^{[T19]}
Sec. II:
it became really deep in 1991 in my lab,
unsupervised pre-training of NNs,
supervised LSTM.
Sec. I contains 4 subsections
A, B, C, D
A: Speech Recognition (see also Sec. VI & XI & XV): The first superior end-to-end neural speech recognition
combines two methods from my lab: LSTM (1990s-2005) and CTC (2006), which were
Hinton (2012) and Bengio (XV)
our revolutionary CTC-LSTM which was soon on most smartphones.
Sec. B: Natural Language Processing (see also Sec. VI & XI & XVI):
(soon used for several billions of
was also based on our LSTM.
Sec. C: Robotics.
most visible breakthroughs
Sec. D: Computer Vision
XVIII & XIV & XI & VI)
and applied to speech. All before LeCun's CNN work (XVIII).
deep NNs
pre-training (in contrast to Hinton's claims). Our DanNet was the first CNN fast & deep enough for
superior computer vision in 2011,
winning 4 image recognition contests in a row
is an open-gated version of our earlier Highway Nets.
Sec. XIV:
deep & fast CNN
(where LeCun participated),
Sec. XI: ACM mentions GPU-accelerated NNs
deep GPU-NN of 2010
debunked unsupervised pre-training (introduced by myself in 1991 and later championed by Hinton),
and our GPU-CNN of 2011 (DanNet) was the first
XVIII:
Fukushima and Waibel (see Sec. D).
The first application of CNNs with backpropagation to biomedical/biometric images is due to Baldi and Chauvin.^{[BA93]}
VII: ACM explicitly mentions medicine and
first to win medical imaging competitions
Sec. XII & XIX & XXI: Modern
backpropagation
XIII &
II &
V
III &
IX &
X &
XX):
Sec. XX: ACM credits LeCun for work on
Sec. XXI: ACM credits LeCun for work on
XV: ACM credits Bengio for hybrids of NNs and probabilistic models of sequences.
CTC-LSTM
A &
B).
XVI: ACM
We started this in 1990-93
long before LBH
Sec. XVII:
Artificial Curiosity
vanishing gradients (1991),
metalearning (1987),
unsupervised pre-training (1991),
compressing or distilling one NN into another (1991),
learning sequential attention with NNs (1990),
fast weight programmers using
and other topics.^{[R2-R6]}
Sec. IV is on Turing (1936) and his predecessors
Critique of LBH's ACM article (Turing Lecture) of July 2021.
Sec. Conclusion:
In the recent decade of deep learning,
(speech recognition, language translation, etc.) on billions of devices (also healthcare applications)
Sec. II &
III &
V &
XII &
XIII &
XVII &
XIV &
XIX &
XX &
XXI.
In what follows, ACM's full text [T19] is split into 21 parts
I,
II,
III,
IV,
V,
VI,
VII,
VIII,
IX,
X,
XI,
XII,
XIII,
XIV,
XV,
XVI,
XVII,
XVIII,
XIX,
XX,
XXI.
LBH and their co-workers have contributed certain useful improvements of existing deep learning methods.^{[CNN2,4][CDI][LAN][RMSP][XAV][ATT14][CAPS]} However, the field's foundations were laid by others: the deep learning networks of Ivakhnenko & Lapa (1965),^{[DEEP1-2][R8]} stochastic gradient descent for multilayer perceptrons (1967),^{[GD1-3]} modern backpropagation (1970),^{[BP1-2][R7]} architectures of recurrent NNs (1925-56)^{[I25][MC43][K56]} and convolutional NNs (1979),^{[CNN1]} principles of generative adversarial NNs and artificial curiosity (1990),^{[AC90,90b][AC20]} unsupervised pre-training for deep NNs (1991),^{[UN1-2]} vanishing gradients (1991)^{[VAN1]} & Long Short-Term Memory or LSTM (Sec. A), GPU-accelerated NNs (2004),^{[GPUNN][DAN][DAN1][GPUCNN5]} NNs with over 100 layers (2015),^{[HW1-3][R5]} and transformer-like^{[TR1-6][FWP]} attention^{[FWP][ATT]} through fast weight programmers (1991).^{[FWP0-2,6][DL1-2][R2-R8]} Often LBH failed to cite essential prior work, even in their later surveys.^{[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5, R7-R8]} This may explain some of ACM's misattributions.^{[T19]} See Sec. II & III & V & XIII & X & XVII & XII & XVIII & XX. By the 2010s,^{[DEC]} the deep NNs of our team were heavily used in academia and industry,^{[DL4]} driving the breakthroughs mentioned by ACM (labeled as A, B, C, D below): A. Speech Recognition. (A1) Long Short-Term Memory or LSTM (1990s-2005)^{[LSTM0-6]} overcomes the vanishing gradient problem, analyzed by my student Sepp Hochreiter in 1991.^{[VAN1]} This happened long before the similar work of Bengio (see Sec. XVII).^{[MIR](Sec. 3, Sec. 4)} LSTM was refined with my student Felix Gers^{[LSTM2]} through "forget gates" based on end-to-end-differentiable fast weights.^{[MIR](Sec. 8)[FWP,FWP0-1]} (A2) Connectionist Temporal Classification by my student Alex Graves et al. (2006).^{[CTC]} Our team successfully applied CTC-trained LSTM to speech in 2007^{[LSTM4]} (also with hierarchical LSTM stacks^{[LSTM14]}). This outperformed the traditional hybrids of NNs and Hidden Markov models (HMMs)^{[BW][BRI][BOU]} (Sec. XV). Hinton et al. (2012) still used the old hybrid approach^{[HYB12]} and did not compare it to CTC-LSTM.
(A3) In 2009, Alex Graves' CTC-trained LSTM became the first recurrent NN (RNN) to win international competitions. He later reused our end-to-end neural speech recognizer^{[LSTM4][LSTM14]} as a postdoc in Hinton's lab.^{[LSTM8]} CTC-LSTM dramatically improved Google's speech recognition.^{[GSR][GSR15][DL4]} Google's on-device speech recognition^{[GSR19]} (no longer on the server) is also based on LSTM^{[MIR](Sec. 4)} (see Sec. VI & XI & XV). B. Natural Language Processing. In 1995, we already had an excellent neural probabilistic model of text^{[SNT]} (see Sec. XVI). In 2001, we showed that LSTM can learn languages unlearnable by traditional models such as HMMs.^{[LSTM13]} See also Sec. VI & XI & XV. Soft attention mechanisms for NLP were later tailored by Bengio's team.^{[ATT14][FWP]} However, such attention mechanisms also have their roots in my lab (1991);^{[FWP][FWP0-2,6]} see Sec. XVI. C. Robotics & RL etc. Since 2003, our team has used LSTM for Reinforcement Learning (RL) and robotics.^{[LSTM-RL][RPG][LSTMPG]} In the 2010s, this led to highly visible applications. For example, in 2018, an LSTM trained by policy gradients (PG) was the core of OpenAI's famous Dactyl which learned to control a dexterous robot hand without a teacher.^{[OAI1][OAI1a]} In 2019, DeepMind beat a pro player in the game of Starcraft, which is theoretically harder than Chess or Go^{[DM2]} in many ways, using Alphastar whose brain has a deep LSTM core trained by PG.^{[DM3]} A PG-trained LSTM was also the core of OpenAI Five which learned to defeat human experts in the Dota 2 video game (2018).^{[OAI2]} Bill Gates called this a "huge milestone in advancing artificial intelligence".^{[OAI2a][MIR](Sec. 4)[LSTMPG]} Apart from A, B, C above, LSTM also found applications in healthcare, chemistry, molecular design, lip reading, speech synthesis,^{[AM16]} predicting what's going on in nuclear fusion reactors, and so on.^{[DEC][DL4]} By 2016, more than a quarter of the awesome computational power for inference in Google's datacenters was being used for LSTM (only 5% for the CNNs of Sec. D).^{[JOU17]} Apparently the first LSTM journal paper^{[LSTM1][R5]} is now the most cited deep learning research paper of the 20th century. D.
D. Computer Vision was revolutionized in the 2010s by a particular feedforward neural net (NN) called the convolutional NN (CNN).^{[CNN1-4]} The basic CNN architecture with convolutional and downsampling layers is due to Fukushima (1979),^{[CNN1]} who also introduced the now widely used rectified linear units (ReLUs) in 1969.^{[RELU1]} In 1987, NNs with convolutions were combined by Waibel with weight sharing and backpropagation.^{[CNN1a]} Waibel did not call this CNNs but TDNNs. The popular downsampling variant called max-pooling was introduced by Yamaguchi et al. for TDNNs in 1990^{[CNN3a]} and by Weng et al. for higher-dimensional CNNs in 1993.^{[CNN3]} Since 1989, LeCun's team has contributed improvements of CNNs, especially for images^{[CNN2,4]} (see Sec. XVIII). Finally, my own team showed in 2010^{[MLP1]} that unsupervised pre-training is not necessary to train deep NNs, contrary to claims by Hinton^{[VID1]} who said that "nobody in their right mind would ever suggest" this. Then we made CNNs both deep and fast. Our GPU-based CNN of 2011,^{[GPUCNN1]} known as DanNet,^{[DAN,DAN1][R6]} was much faster than the earlier GPU-accelerated CNNs of 2006.^{[GPUCNN]} DanNet became the first pure deep CNN to win computer vision contests, winning four of them in a row (15 May 2011, 6 Aug 2011, 1 Mar 2012, 10 Sep 2012).^{[GPUCNN5]} At IJCNN 2011 in Silicon Valley, DanNet blew away the competition and achieved the first superhuman visual pattern recognition^{[DAN1]} in an international contest (where LeCun's team took a distant second place). DanNet was also the first deep CNN to win a Chinese handwriting contest (ICDAR 2011) and an image segmentation contest (ISBI, May 2012). Our CVPR paper on DanNet^{[GPUCNN3]} appeared before the similar AlexNet of Hinton's student Krizhevsky won the ImageNet^{[IM09]} 2012 contest^{[GPUCNN4-5][R6]} (now also without unsupervised pre-training, citing DanNet). Our CNN image scanners were 1000 times faster than previous methods.^{[SCAN]} The VGG network (ImageNet 2014 winner)^{[GPUCNN9]} and other highly cited CNNs^{[RCNN1-3]} further extended the work of 2011.^{[MIR](Sec. 19)} ResNet, the ImageNet 2015 winner^{[HW2]} (Dec 2015) and currently the most cited neural network,^{[MOST]} is a version (with open gates) of our earlier Highway Net (May 2015).^{[HW1-3][R5]} The Highway Net is actually the feedforward net version of vanilla LSTM.^{[LSTM2]} It was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers). See also Sec. XVIII & XIV & XI & VI.
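The relation between LSTM-style gating, the Highway Net, and ResNet's "open gates" can be sketched in a few lines of numpy. This is a minimal illustrative sketch; the layer sizes, weight scales, and the strongly negative gate bias are my own assumptions, not values from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, W_t, b_t):
    """Highway layer with coupled gates: y = T*H(x) + (1 - T)*x,
    where the sigmoid gate T is the feedforward analogue of an LSTM gate."""
    H = np.tanh(W_h @ x)            # candidate transformation
    T = sigmoid(W_t @ x + b_t)      # transform gate in (0, 1)
    return T * H + (1.0 - T) * x

def residual_layer(x, W_h):
    """ResNet-style layer: y = H(x) + x. In the uncoupled highway form
    y = T*H(x) + C*x, this is the special case of open gates T = C = 1."""
    return np.tanh(W_h @ x) + x

# A strongly negative gate bias makes each layer start out close to the
# identity, so even a 100-layer stack initially passes its input through
# almost unchanged -- the property that makes very deep stacks trainable.
x = rng.normal(size=8)
W_h = 0.05 * rng.normal(size=(8, 8))
W_t = 0.05 * rng.normal(size=(8, 8))
y = x
for _ in range(100):
    y = highway_layer(y, W_h, W_t, b_t=-12.0)
print(np.max(np.abs(y - x)))  # small: the stack is near-identity at init
```

With coupled gates and a very negative gate bias, gradients can traverse hundreds of layers at initialization; fixing both gates open recovers the ResNet-style residual layer.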
The basic architectures of NNs appeared long before the 1980s. The first non-learning recurrent NN (RNN) architecture (the Lenz-Ising model) was analyzed by physicists in the 1920s.^{[L20][I25][K41][W45]} Neuron-like recurrent architectures were also discussed in 1943 by McCulloch and Pitts^{[MC43]} and formally analyzed in 1956 by Kleene.^{[K56]} In 1972, Amari reused the Lenz-Ising model to build a learning RNN, later sometimes called the Hopfield network or Amari-Hopfield Network.^{[AMH1-3]} Early learning machines go back further still: Turing wrote about artificial evolution,^{[TUR1]} and Rosenblatt's perceptron with a single adaptive layer learned in 1958^{[R58]} (compare Joseph^{[R61]}); Widrow & Hoff's similar Adaline learned in 1962.^{[WID62]} Such shallow learning is much older, dating back to linear regression and the method of least squares.^{[DL1-2]} Deeper multilayer perceptrons (MLPs) were discussed by Steinbuch^{[ST61-95]} (1961), Joseph^{[R61]} (1961), and Rosenblatt^{[R62]} (1962), who wrote about "back-propagating errors" in an MLP with a hidden layer,^{[R62]} but did not yet have a general deep learning algorithm for deep MLPs (what's now called backpropagation is quite different and was first published by Linnainmaa in 1970^{[BP1-BP5][BPA-C]}). Compare also Selfridge's multilayer Pandemonium^{[SE59]} (1959). In 1965, Ivakhnenko & Lapa published the first general, working learning algorithm for supervised deep feedforward MLPs with arbitrarily many layers (containing the now popular multiplicative gates).^{[DEEP1-2][DL1-2]} A paper of 1971^{[DEEP2]} already described a deep learning net with 8 layers, trained by their highly cited method which was still popular in the new millennium,^{[DL2]} especially in Eastern Europe, where much of Machine Learning was born.^{[MIR](Sec. 1)[R8]} LBH failed to cite this, just like they failed to cite Amari,^{[GD1]} who in 1967 proposed stochastic gradient descent^{[STO51-52]} (SGD) for MLPs and whose implementation^{[GD2,GD2a]} (with Saito) learned internal representations at a time when compute was billions of times more expensive than today (see also Tsypkin's work^{[GDa-b]}). Fukushima's deep convolutional NN architecture was first introduced in the 1970s;^{[CNN1]} his very popular ReLU already in 1969.^{[RELU1-2]} See also Sec. XIII, III, V, VIII, IX, and X, as well as the misleading "histories of deep learning" by LBH & co-authors, e.g., Sejnowski^{[S20]} (see Sec. XIII).
It goes more or less like this: "In 1969, Minsky & Papert^{[M69]} [showed the limitations of NNs, and the field lay dormant until] researchers took a fresh look at the problem in the 1980s."^{[S20]} However, as mentioned above, the 1969 book^{[M69]} addressed a "problem" of Gauss & Legendre's shallow learning (~1800)^{[DL1-2]} that had already been solved 4 years prior by Ivakhnenko & Lapa's popular deep learning method^{[DEEP1-2][DL2]} (and then also by Amari's SGD for MLPs^{[GD1-2]}). Minsky was apparently unaware of this and failed to correct it later.^{[HIN](Sec. I)} Deep learning research continued through the 1970s and 80s (see also a 1989 paper^{[MOZ]}). However, it became really deep in 1991 in my lab,^{[UN-UN3]} which has pursued it ever since. See Sec. 1 of the overview:^{[MIR]} First Very Deep NNs, Based on Unsupervised Pre-Training (1991). Our unsupervised pre-training made it possible to solve "Very Deep Learning" tasks of depth > 1000.^{[UN2][DL1][UN]} (By 2003, LSTM variants successfully dealt with language problems of depth up to 30,000^{[LSTM17]} and more.) I later drove the shift from unsupervised pre-training to purely supervised learning (1991-95; 2006-10).^{[HIN](Sec. II)[MIR](Sec. 19)} See also Sec. III. Note that LSTMs brought essentially unlimited depth to gradient-based supervised recurrent NNs; Highway Nets^{[HW1-3]} brought it to feedforward NNs.^{[MOST]}
The foundations were laid by others (Sec. III):^{[DLC][DEEP1-2][BP1][DL1-2][R7-R8][R2-R4]} deep learning multilayer perceptrons (1965),^{[DEEP1-2][R8]} stochastic gradient descent for multilayer perceptrons (1967),^{[GD1-3]} modern backpropagation (1970),^{[BP1,2][R7]} architectures of recurrent NNs (1925-56)^{[I25][MC43][K56]} and convolutional NNs (1979),^{[CNN1]} principles of generative adversarial NNs and artificial curiosity (1990),^{[AC90,90b][AC20]} unsupervised pre-training for deep NNs,^{[UN1-2]} the vanishing gradient problem (1991)^{[VAN1]} & solutions to it (Sec. A), GPU-accelerated NNs (2004),^{[GPUNN][GPUCNN5]} and other foundations.^{[DL1-2][R2-R8]} Often LBH failed to cite essential prior work.^{[DLC][HIN][MIR](Sec. 21)} See Sec. II & V & XIII & IX & X & XVII & XII & XVIII & XX & I. Compare the website deeplearning.net, which until 2019 advertised deep learning as "moving beyond shallow machine learning since 2006",^{[DL7]} referring to Hinton's^{[UN4]} and Bengio's^{[UN5]} unsupervised pre-training, although we had this type of deep learning already in 1991;^{[UN][UN1-2]} see Sec. II & XVII (5). Not to mention Ivakhnenko's even earlier supervised layer-wise training of deep NNs,^{[DEEP1-2]} which Hinton,^{[UN4]} Bengio,^{[UN5]} and LBH^{[DL3,DL3a]} did not cite either. See Sec. X.
In what follows, my comments systematically track the sequential order of ACM's claims.^{[T19]}
ACM's statement on Turing is greatly misleading, like some of its other statements.^{[T19]} Gödel (1931) identified fundamental limits of theorem proving, computing, and any type of computation-based AI.^{[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]} Much of early AI in the 1940s-70s was actually about theorem proving.^{[ZU48][NS56]}
In 1936, Turing introduced the Turing Machine.^{[TUR]} He rederived the above-mentioned result on the limits of computation.^{[CHU][TUR][HIN][GOD21,21a][TUR21][LEI21,21a]} In the same year of 1936, Emil Post published yet another independent universal model of computing.^{[POS]} (Compare my reply to Hinton, who criticized my website on Turing without suggesting any fact-based corrections.^{[HIN]}) Gödel also formulated what is now the famous open problem "P=NP?" in his letter to John von Neumann (1956).^{[GOD56][URQ10]} Likewise, Konrad Zuse (1910-1995) created the world's first working programmable general-purpose computer 1935-41. His patent application of 1936^{[ZU36-38][Z36][RO98][ZUS21]} already described digital circuits, predating Claude Shannon's 1937 thesis on digital circuit design.^{[SHA37]} Zuse also created the first high-level programming language in the early 1940s.^{[BAU][KNU]} (Compare the discussion of the conditional jump instruction.^{[RO98]})
The foundations were laid by others: deep learning multilayer perceptrons that learn internal representations (1965),^{[DEEP1-2][R8]} stochastic gradient descent for multilayer perceptrons (1967),^{[GD1-3]} modern backpropagation (1970),^{[BP1,2][R7]} architectures of recurrent NNs (1925-56)^{[I25][MC43][K56]} and convolutional NNs (1979),^{[CNN1]} principles of generative adversarial NNs and artificial curiosity (1990),^{[AC][AC90,90b][AC10][AC20]} unsupervised pre-training for deep NNs (1991),^{[UN1-2][UN]} vanishing gradients (1991)^{[VAN1]} & solutions to it (Sec. A),^{[LSTM0-17][CTC]} GPU-accelerated NNs (2004),^{[GPUNN][GPUCNN5]} record-breaking deep supervised NNs (2010)^{[MLP1-2]} and contest-winning deep CNNs (2011),^{[DAN][DAN1][GPUCNN5]} NNs with over 100 layers (2015),^{[HW1-3][R5]} transformer-like^{[TR1-6][FWP]} attention^{[FWP][ATT]} through fast weight programmers (1991),^{[FWP0-2,6]} and more.^{[DL1-2][R2-R8]} Often LBH failed to cite essential prior work.^{[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5,R7,R8,R11]} See Sec. II & I & III & XIII & X & XVII & XII & XVIII & XX.
The cited "advances in natural language processing" and in speech were largely based on our LSTM & CTC (see Sec. A & B), while breakthroughs in computer vision came through the fast deep supervised NNs and CNNs achieved by our group 2010-2011^{[MLP1-2][DAN][DAN1][GPUCNN5][R6]} and through Highway Net-like NNs (2015),^{[HW1-3][R5]} although the principles of CNNs were invented and developed by others since the 1970s.^{[CNN1-4]} See Sec. D & XVIII & XIV as well as Sec. 4 & Sec. 19 of the overview.^{[MIR]}
Baldi and Chauvin (1993) had the first application of CNNs with backpropagation to biomedical/biometric images.^{[BA93]} Our DanNet^{[DAN][DAN1][GPUCNN5]} was the first NN to win a medical imaging contest through deep learning (Sept 2012, on cancer detection).^{[GPUCNN5,8]} We were also able to greatly improve steel defect detection.^{[ST]} All of this happened before the similar GPU-accelerated AlexNet of Hinton's student Krizhevsky^{[GPUCNN4-5][R6]} and the VGG network.^{[GPUCNN9]} Such deep CNNs are now widely used for mitosis detection^{[MGC][GPUCNN5,8]} (compare the approach of Sec. D & XI).
Here, too, LBH built on essential prior work without citing it.^{[DL1][DLC][HIN][R2-R4][R7-R8]} See Sec. V & XII & XIX & II & III & XIII & XVII & X & I.
The relevant pioneers were ignored by LBH, who failed to cite them, even in later work.^{[HIN][DLC][DL1-2][DEEP1-2][RELU1-2][R7-R8]} See Sec. II & III & XIII & V & X & XIV & I.
The expression "deep learning" was first introduced to Machine Learning by Dechter (1986), and to NNs by Aizenberg et al (2000).^{[DL2]} To my knowledge, LBH have never cited them. (Margin note: our 2005 paper on deep RL^{[DL6,6a]} was apparently the first machine learning publication with the phrase "learn deep" in the title.) Only later did LBH start talking about "deep learning ... moving beyond shallow machine learning since 2006",^{[DL7]} referring to their unsupervised pre-training methods of 2006. See Sec. III. The importance of depth was recognized by others who built careers on this notion long before LBH.^{[DEEP1-2][CNN1][HIN][R8][DL1][DLC]} Even deep learning through unsupervised pre-training was introduced by others.^{[UN1-3][R4][HIN](Sec. II)} See Sec. II & III & XIII & V & I.
Much of this essential prior work was ignored by LBH's papers^{[HIN][R7-R8][R2-R5]} (see Sec. V & II & III & I & XIII & XII & XIX & X & XVII).
ACM correctly mentions advancements through GPUs. The first to use GPUs for NNs were Jung & Oh (2004).^{[GPUNN][GPUCNN5]} In 2010, our team made GPU-based NNs fast and deep enough to break an important benchmark record,^{[MLP1-2]} showing that unsupervised pre-training (pioneered by myself in 1991) is not necessary to train deep NNs, contrary to Hinton's claims.^{[VID1]} By 2011, our CNNs were deep and fast enough^{[DAN][DAN1][GPUCNN5]} to achieve superhuman performance in computer vision (explicitly mentioned by ACM) for the first time^{[R6]} (see Sec. D).
Furthermore, by the mid 2010s, speech recognition and machine translation (explicitly mentioned by ACM) were actually dominated by the LSTM and CTC of our team.^{[LSTM1-4][CTC]} In particular, as mentioned in Sec. A, this approach was very different from earlier hybrid methods based on models such as HMMs.^{[BW][BOU][BRI][HYB12]} As mentioned in Sec. B and XVI, the first superior end-to-end neural machine translation was also based on LSTM.
ACM's statement is "less wrong" than Honda's^{[HIN](Sec. I)} but still misleading. ACM (and apparently even other award committees^{[HIN](Sec. I)}) attribute backpropagation to Rumelhart et al. (1985-86),^{[RUM]} although it was applied to NNs by Werbos earlier (1982).^{[BP2]} And the article^{[RUM]} even failed to mention Linnainmaa, the inventor of this famous algorithm for credit assignment in networks (1970).^{[BP1]} By 1960, Kelley already had a precursor thereof in the field of control theory;^{[BPA]} see also later work of the early 1960s.^{[BPB][BPC]}^{[R7]} Rumelhart et al. demonstrated that backpropagation can learn useful internal representations in hidden layers of NNs.^{[RUM]} But this was essentially just an experimental analysis of a known method.^{[BP1-2]} More on the history of backpropagation can be found at Scholarpedia^{[DL2]} and in my award-winning survey.^{[DL1]} Also see Sec. XIX, II.
Some claim that "backpropagation is just the chain rule of Leibniz (1676) & L'Hopital (1696)." No, it is the efficient way of applying the chain rule to big networks with differentiable nodes (there are also many inefficient ways of doing this). It was not published until 1970.^{[BP1]} Compare a recent debate:^{[HIN]} It is true that in 2018, Hinton^{[AOI]} credited Rumelhart^{[RUM]} with the "invention" of backpropagation, after earlier accepting credit for "creating" the method and for other things he didn't do.^{[HIN]} But neither in a popular book^{[AOI]} nor in other recent work^{[DL3,DL3a]} did he cite Linnainmaa (1970),^{[BP1]} the true creator.^{[BP4-5]} It is true that his 2015 survey^{[DL3]} does cite Werbos (1974), who however described the method correctly only later in 1982^{[BP2]} and also failed to cite Linnainmaa.^{[BP1]} Compare the 1967-68 work of Amari:^{[GD1-3]} to my knowledge the first to propose and implement stochastic gradient descent^{[STO51-52]} for multilayer perceptrons (though not yet the efficient reverse mode gradient method now known as backpropagation^{[BP1]}); see also Tsypkin's work of 1966.^{[GDa-b]} By then, Linnainmaa's backpropagation method was well-known.^{[BP5][DL1-2][DLC]} It wasn't created by "lots of different people" as Hinton suggested,^{[AOI][HIN][R11]} but by one person who published first^{[BP1]} and therefore should get the credit.
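The distinction drawn above (the chain rule versus its efficient reverse-mode application) is easy to make concrete. Below is a minimal sketch of mine, not any historical implementation: a single backward sweep through a two-layer net yields all weight gradients at roughly the cost of one forward pass, and the result can be checked against a numerical derivative.

```python
import numpy as np

# Minimal reverse-mode chain rule on a 2-layer net (an illustrative
# sketch of the principle, not any historical implementation): one
# backward sweep yields the gradient of the loss w.r.t. every weight
# at roughly the cost of a single forward pass.
def forward(x, W1, W2):
    h = np.tanh(W1 @ x)             # hidden activations (stored)
    y = W2 @ h                      # linear output
    loss = 0.5 * np.sum(y ** 2)     # simple quadratic loss
    return h, y, loss

def backward(x, W1, W2, h, y):
    """Traverse the computation in reverse, reusing stored activations."""
    dy = y                          # d loss / d y
    dW2 = np.outer(dy, h)
    dh = W2.T @ dy
    dpre = dh * (1.0 - h ** 2)      # derivative of tanh
    dW1 = np.outer(dpre, x)
    return dW1, dW2

rng = np.random.default_rng(1)
x = rng.normal(size=3)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
h, y, loss = forward(x, W1, W2)
dW1, dW2 = backward(x, W1, W2, h, y)

# Sanity check of one entry against a numerical derivative.
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
numeric = (forward(x, W1p, W2)[2] - loss) / eps
print(abs(numeric - dW1[0, 0]))  # tiny: analytic and numeric agree
```

An "inefficient way of applying the chain rule" would be to repeat such a numerical or forward-mode pass once per weight; the reverse sweep gets all gradients in one pass, which is the whole point of the 1970 method.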
ACM mentions the Boltzmann Machine (BM)^{[BM]} as one of Hinton's contributions to learning.^{[HIN]} Recently, however, I learnt through a reader that even the BM paper^{[BM]} did not cite prior relevant work by Sherrington & Kirkpatrick^{[SK75]} and Glauber.^{[G63]} (Compare related work.^{[H86][H88][S93]}) And as mentioned in Sec. II, it was Ivakhnenko & Lapa who published the first working learning algorithm for multilayer perceptrons with arbitrarily many layers.^{[DEEP1-2][HIN]} See Sec. II & V & X.^{[MIR](Sec. 1)[R8]}
As mentioned in Sec. II, Sejnowski's rather self-serving "history of deep learning"^{[S20]} claims: "In 1969, Minsky & Papert^{[M69]} [showed the limitations of NNs, and the field lay dormant until researchers] took a fresh look at the problem in the 1980s."^{[S20]} However, the 1969 book^{[M69]} addressed a "deep learning problem" (a limitation of Gauss & Legendre's shallow learning around 1800^{[DL1-2]}) that had already been solved four years prior (see Sec. II). Deep learning research was also alive and kicking in the 1970s, especially outside of the Anglosphere.^{[DEEP2][GD1-3][CNN1][DL1-2]}
Dropout is actually a variant of Hanson's much earlier stochastic delta rule (1990).^{[Drop1-3]} Hinton's 2012 paper and his later patent did not cite this either. Moreover, dropout was not needed to win vision contests, as we showed already in 2011 in a contest where LeCun's team participated as well;^{[DAN1]} see Sec. D above. Back then, the really important thing was the speedup of deep CNNs through GPUs.^{[GPUCNN1,3,5][R6]} Already before ImageNet 2012,^{[R6]} our fast deep CNN called DanNet had a monopoly on winning computer vision competitions.^{[GPUCNN5]} It more than "halved the error rate for object recognition" (ACM's wording) in a contest already in 2011,^{[GPUCNN2][DAN,DAN1][R6]} long before the similar system of Hinton's student. See Sec. D as well as Sec. 19 of the overview.^{[MIR]}
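For readers unfamiliar with the technique under discussion: dropout, in its modern "inverted" form, just multiplies activations by a random binary mask and rescales the survivors. A minimal sketch of mine; the drop probability and layer size are illustrative assumptions:

```python
import numpy as np

# Sketch of dropout in its modern "inverted" form (drop probability and
# layer size are illustrative assumptions): zero each unit with
# probability p_drop and rescale survivors so the expected activation
# is unchanged, hence no extra rescaling is needed at test time.
def dropout(activations, p_drop, rng):
    mask = rng.random(activations.shape) >= p_drop  # keep with prob 1-p
    return activations * mask / (1.0 - p_drop)

rng = np.random.default_rng(4)
h = np.ones(100_000)                 # a layer of unit activations
out = dropout(h, p_drop=0.5, rng=rng)
print(out.mean())                    # ~1.0: expectation is preserved
```

The random per-unit perturbation of activations during training is the structural overlap with the earlier stochastic delta rule that the text refers to.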
Speech recognition had relied on hybrid NN-HMM approaches since the late 1980s.^{[BW][BRI][BOU]} The revolution of the 2010s, however, was based on our LSTM (1990s-2005)^{[LSTM0-6]} and CTC^{[CTC]} (2006), which were applied to speech in 2007.^{[LSTM4][LSTM14]} CTC-LSTM is end-to-end-neural and thus very different from (and superior to) the hybrid methods used since the late 1980s.^{[BW][BRI][BOU][HYB12]} See also Sec. A.
Regarding neural language models: 5 years earlier, in 1995, we already had a similar, excellent neural probabilistic text model,^{[SNT]} which Bengio^{[NPM]} characterizes only briefly as "related" (see also Pollack's earlier work on embeddings of words and other structures^{[PO87][PO90]}). In the 2010s, the workhorse of natural language processing was actually the LSTM of our team,^{[LSTM0-6]} which Bloomberg called "arguably the most commercial AI achievement."^{[AV1][MIR](Sec. 4)} See Sec. B. The attention mechanism of Bengio's team^{[ATT14]} has indeed become important. For example, it helped to further improve Facebook's LSTM-based translation (see Sec. B). However, we already had both variants of adaptive neural sequential attention decades ago: end-to-end-differentiable "soft" attention in the latent space of Fast Weight Programmers (FWPs),^{[FWP2][FWP]} and "hard" attention (in observation space) in the context of RL^{[ATT][ATT0-1]} (1990). Today's attention-based Transformers,^{[TR1-6]} which have become a popular alternative to RNNs, build on the same principle as my FWPs of 1991.^{[FWP0-1]} My FWP of 1991^{[FWP0-1]} computes fast weight changes through additive outer products of self-invented activation patterns (now often called keys and values for self-attention).^{[TR1-6][FWP]} In the 2010s,^{[DEC]} Transformers^{[TR1-2]} excelled at natural language processing, a traditional LSTM domain (see Sec. B). Still, there are tasks that Transformers cannot solve but LSTM can rapidly learn to solve.^{[LSTM13,17]} Compare the linear Transformers or Performers^{[TR5-6]} which are formally equivalent to my 1991 FWPs (apart from normalization).^{[FWP6][FWP]} In 1993, I introduced the attention terminology^{[FWP2]} now used in this context,^{[ATT]} and RNNs that program themselves.
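The formal correspondence claimed here between 1991-style Fast Weight Programmers and (unnormalized) linear Transformers can be sketched in a few lines. The toy numpy code below is my own illustration under simplifying assumptions (no softmax or normalization, keys/values/queries given externally): programming a fast weight matrix through additive outer products and then querying it computes exactly causal linear attention, just with the sums reordered.

```python
import numpy as np

# Toy sketch of the claimed equivalence (my own illustration, under
# simplifying assumptions: no softmax/normalization, keys/values/queries
# given externally). A Fast-Weight-Programmer-style memory is written
# via additive outer products and read by matrix-vector products; this
# computes exactly causal linear attention with the sums reordered.
def fwp_attend(keys, values, queries):
    d_v, d_k = values.shape[1], keys.shape[1]
    W_fast = np.zeros((d_v, d_k))          # the fast weight matrix
    outs = []
    for k, v, q in zip(keys, values, queries):
        W_fast += np.outer(v, k)           # "program" the fast weights
        outs.append(W_fast @ q)            # retrieve with current query
    return np.array(outs)

def linear_attention(keys, values, queries):
    # out_t = sum_{s <= t} (q_t . k_s) * v_s
    outs = []
    for t, q in enumerate(queries):
        scores = keys[: t + 1] @ q
        outs.append(scores @ values[: t + 1])
    return np.array(outs)

rng = np.random.default_rng(2)
K = rng.normal(size=(5, 4))   # keys
V = rng.normal(size=(5, 3))   # values
Q = rng.normal(size=(5, 4))   # queries
print(np.allclose(fwp_attend(K, V, Q), linear_attention(K, V, Q)))  # True
```

Both functions return identical outputs because sum_{s<=t} (v_s k_s^T) q_t = sum_{s<=t} (k_s . q_t) v_s; the fast-weight view is simply the recurrent, constant-memory formulation of the same attention sum.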
See^{[MIR](Sec. 9)[R4]} for my related priority dispute on attention with Hinton. He was the reviewer of my 1990 paper on attention,^{[ATT2]} yet two decades later published closely related work of his own^{[ATT3]} without citing it.
GANs^{[GAN0-1]} (2010-2014) are actually a simple application^{[AC]} of the adversarial curiosity (AC) principle from 1990^{[AC90,90b][AC20]} (see also surveys^{[AC09-10]}). This principle is now widely used for exploration in RL (e.g., Sec. C) and for image synthesis^{[GAN1]} (also mentioned by ACM in Sec. XVIII). In AC, a predictor NN minimizes its error, while the generator NN tries to make outputs that maximize this error: one net's loss is the other net's gain. 4 years before the GAN paper,^{[GAN1]} a well-known 2010 survey^{[AC10]} summarised the generative adversarial NNs of 1990 as follows: a neural network acting as a predictive world model is used to maximize the controller's intrinsic reward, which is proportional to the model's prediction errors. The GAN setting is a special case in which the environment simply returns whether the controller's (or generator's) output is in a given set.^{[AC20][AC]} (Other early adversarial machine learning settings^{[S59][H90]} neither involved unsupervised NNs nor were about modeling data nor used gradient descent.^{[AC20]}) Bengio et al. neither cited the original work^{[AC90,90b][AC20]} nor corrected their erroneous claims^{[GAN1]} about my other adversarial technique, predictability minimization (PM, 1991).^{[PM1-2][AC20][R2][MIR](Sec. 5)} Bloomberg^{[AV1]} reported on their NIPS 2014 paper^{[GAN1]} and some of the erroneous claims it made about my prior work.^{[AC20]} Goodfellow eventually admitted that PM is adversarial (his paper^{[GAN1]} still claims the opposite), but emphasized that it's not generative. However, the even earlier AC^{[AC90,90b][AC10][AC20]} is both adversarial and generative (its generator contains probabilistic units^{[AC90]} like in StyleGANs^{[GAN2]}). When the authors^{[GAN1]} did not publish a correction, I published one myself in the hopes of correcting the annals of history,^{[AC20]} showing that GANs are instances of my earlier work.^{[R2][AC20]} Similarly, after my student Sepp Hochreiter had analyzed the vanishing gradient problem,^{[MIR](Sec. 3)[VAN1]} Bengio published his own, later treatment,^{[VAN2]} without citing Sepp.
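The adversarial principle described above (the predictor minimizes the very error the generator maximizes) can be made concrete with a deliberately tiny numeric sketch. All particulars below (a linear predictor, a scalar "world" y = x², hand-picked step sizes) are illustrative assumptions of mine, not the 1990 setup:

```python
import numpy as np

# Deliberately tiny numeric sketch of the adversarial principle
# (illustrative assumptions: linear predictor, scalar "world" y = x**2,
# hand-picked step sizes). The predictor does gradient DESCENT on its
# squared error; the generator does gradient ASCENT on the very same
# quantity -- one net's loss is the other net's gain.
def predict(w, x):
    return w[0] * x + w[1]                   # linear world model

def sq_err(w, x):
    return (predict(w, x) - x ** 2) ** 2     # shared zero-sum objective

w = np.array([0.0, 0.0])                     # predictor parameters
x = 1.5                                      # generator's proposed input

# Generator step: seek inputs where the model is wrong (intrinsic reward).
err = predict(w, x) - x ** 2
x_new = x + 1e-3 * 2 * err * (w[0] - 2 * x)  # small ascent step on sq_err
assert sq_err(w, x_new) > sq_err(w, x)       # generator gained

# Predictor step: reduce the error at the point the generator chose.
feats = np.array([x_new, 1.0])
err = predict(w, x_new) - x_new ** 2
w_new = w - 0.01 * 2 * err * feats           # small descent step
assert sq_err(w_new, x_new) < sq_err(w, x_new)  # predictor recovered
```

Iterating these two opposing steps drives the generator toward regions the model has not yet learned, which is the exploration behavior the text describes; the GAN special case replaces the world model's prediction error with a real/fake discrimination error.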
That dispute was settled in favor of Sepp.^{[VAN1]} However, even after a common publication,^{[VAN3]} Bengio published papers^{[VAN4][XAV]} without citing him. Citation counts are poor indicators of truly pioneering work.^{[NAT1]} (Margin note: Bengio states^{[YB20]} that in 2018 he collected more new citations than any other computer scientist; but if one builds on someone else's idea, one must at least clarify this later.^{[DLC]}) Bengio also claims^{[YB20]} priority for deep learning with unsupervised pre-training, although my publications on exactly this topic date back to 1991-93.^{[UN0-2][UN]} The same holds for meta-learning, which I started in 1987,^{[META1][META]} long before Bengio, who nevertheless suggested that he did it before me.^{[R3]} Regarding attention-based Transformers,^{[TR1-6]} Bengio^{[DL3a]} cites his own team (2014) for "soft attention" without citing my much earlier original work of 1991-1993 on soft attention and linear Transformers.^{[FWP,FWP0-2,6]} Bengio has also heavily used our LSTM (see Sec. A-C), but introduced the name "gated recurrent units (GRU)"^{[LSTMGRU]} for a variant of our vanilla LSTM architecture^{[LSTM2]} (2000), which he did not cite, although our work^{[LSTM2]} was the one that introduced gated recurrent units. In addition, our team automatically evolved lots of additional LSTM variants and topologies already in 2009^{[LSTM7]} without changing the name of the basic method. (GRUs can neither learn to count^{[LSTMGRU2]} nor learn simple non-regular languages;^{[LSTMGRU2]} they also trail LSTM in translation, according to Google Brain.^{[LSTMGRU3]}) ACM also credits Bengio and Hinton for unsupervised pre-training for deep NNs.^{[UN0-4][HIN](Sec. II)[MIR](Sec. 1)} But Hinton's paper^{[UN4]} (2006) appeared long after my earlier work on this,^{[UN0-2]} which yielded the first NNs shown to solve very deep problems (see Sec. II above).^{[UN]} It was published in 1991-92,^{[UN1]} when compute was about 1000 times more expensive than in 2006. This went uncited even in the later survey (2015).^{[DL3][DLC]} See also Sec. II & III. Related priority issues concern compressing or distilling one NN into another:^{[UN0-2][DIST1-2][MIR](Sec. 2)} Hinton^{[DIST2]} did not cite my much earlier original work on this (1991),^{[UN1][UN]} not even in his later patent application. The same holds for fast weight programmers^{[FWP][FWP0-4a]} through tensor-like outer products (1991-2016) and their motivation^{[FWP2][FWP4a][MIR](Sec. 8)} (see also Sec. XVI above), and for learning sequential attention with NNs.^{[MIR](Sec. 9)} Hinton^{[ATT3]} (2010) did not cite our much earlier work on this,^{[ATT1][ATT]} although he was both reviewer and editor of my summary^{[ATT2]} (1990; see Sec. XVI above).
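For concreteness, the compressing/distilling of one NN into another mentioned above can be sketched as follows. This is a minimal illustration under my own assumptions (a frozen random "teacher" net, a linear "student" fit by least squares to the teacher's outputs), not the procedure of any cited paper:

```python
import numpy as np

# Minimal illustration of compressing/"distilling" one NN into another
# (my own assumptions: a frozen random teacher net, a linear student
# fit by least squares to the teacher's outputs on unlabeled inputs --
# not the procedure of any cited paper).
rng = np.random.default_rng(3)
W1 = rng.normal(size=(16, 2))
W2 = rng.normal(size=(1, 16))

def teacher(x):
    """The 'big' frozen network whose behavior we want to compress."""
    return W2 @ np.tanh(W1 @ x)

X = rng.normal(size=(200, 2))                 # unlabeled inputs
T = np.array([teacher(x)[0] for x in X])      # teacher outputs as targets

# Student: a small linear model (with intercept) imitating the teacher.
A = np.hstack([X, np.ones((200, 1))])
coef, *_ = np.linalg.lstsq(A, T, rcond=None)
imitation_mse = np.mean((A @ coef - T) ** 2)
print(imitation_mse)  # never worse than predicting the teacher's mean
```

The defining move, common to the 1991 and later formulations, is that the student is trained on the teacher's outputs rather than on original labels; here the student is deliberately much smaller than the teacher.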
The ten priority disputes mentioned in the present Sec. XVII are not the only ones.^{[R4]} Remarkably, three of them are related to the 1991 paper^{[UN1][UN]} which in many ways started what people now call deep learning, going beyond earlier work. Most of them go back to work of 1990-91.^{[MIR]} See Sec. I for additional related issues of credit assignment.
LeCun's team has made important contributions to CNNs since 1989.^{[CNN2,4]} However, the basic CNN architecture with convolutional and downsampling layers is actually due to Fukushima (1979).^{[CNN1]} NNs with convolutions were later (1987) combined by Waibel with weight sharing and backpropagation.^{[CNN1a]} Waibel called this TDNN rather than CNN. All of this happened before LeCun's work on CNNs. See Sec. D above and Sec. 21 of the overview of our Annus Mirabilis 1990-1991.^{[MIR]} As mentioned in Sec. D, at IJCNN 2011 in Silicon Valley, our DanNet^{[DAN][GPUCNN1-3]} won the vision contest with superhuman performance (LeCun's team took a distant second place, with three times worse performance).^{[DAN1]} Again see Sec. D. Baldi and Chauvin (1993) had the first application of CNNs with backpropagation to biomedical/biometric images.^{[BA93]} And at ICPR 2012, our DanNet^{[GPUCNN1-3]} won the medical imaging contest (Sept 2012, on detection of mitosis/cancer)^{[GPUCNN5,7,8]} (before the similar AlexNet won ImageNet 2012^{[GPUCNN5][R6]} and the similar VGG network^{[GPUCNN9]} won ImageNet 2014). Many major companies are now using such deep CNNs for mitosis detection.^{[MGC][GPUCNN5,7,8]} See Sec. D & VII. ACM also explicitly mentions speech recognition and speech synthesis.^{[AM16][DL1]} All of these fields were heavily shaped in the 2010s by our non-CNN methods.^{[DL1][DL4][AM16][GSR][GSR15][GT16][WU][FB17]} See Sec. A, B, VI, XI.
As mentioned in Sec. XII, backpropagation was actually proposed earlier as a learning method for NNs by Werbos (1982)^{[BP2-4]} (see also Amari's work on SGD for MLPs of 1967-68^{[GD1-2a]}), and first published by Linnainmaa (1970),^{[BP1]} whom LeCun did not cite, even in recent work.^{[DL3,DL3a][DLC]} In 1960, Kelley already had a precursor of the algorithm.^{[BPA]} Furthermore, many besides LeCun have worked "to speed up backpropagation algorithms"^{[DL1]} (ACM's wording). More on the history of backpropagation can be found at Scholarpedia^{[DL2]} and in my survey.^{[BP4]}
However, "hierarchical feature representation" in deep learning networks is what Ivakhnenko & Lapa (1965)^{[DEEP1-2]} and Amari^{[GD1-2]} (and also Fukushima^{[CNN1][DL2]}) had long before LeCun. See Sec. D & II & XIII & V.
LeCun et al. neither cited the origins^{[BP1]} (1970) of this widely used type of automatic differentiation for differentiable networks of modules^{[DL2][BP4-5][DLC]} nor the early related work on such systems.^{[S80]} See also Sec. XIX & XII. Others published relevant ideas on differentiable systems of modules before LeCun, who did not cite them. See also Pollack's even earlier relevant work;^{[PO87-90]} compare the important work of Baldi and colleagues.^{[BA96-03]}
(Furthermore, "complex networks of modules where backpropagation is performed" were the central theme of my much earlier habilitation thesis (1993).^{[UN2]} For example, our adaptive subgoal generators (1991)^{[HRL0-2]} were trained through end-to-end-differentiable chains of such modules,^{[MIR](Sec. 10)} and so was our planning and reinforcement learning with recurrent neural world models (1990).^{[PLAN][MIR](Sec. 11)} Same for my linear transformer-like fast weight programmers^{[FWP0-2][FWP][ATT][MIR](Sec. 8)} since 1991; see Sec. XVI.) Concerning attempts at discrediting inconvenient messengers: see "100 Authors against Einstein"^{[AH1]} and other ad hominem attacks^{[AH2-3][HIN]} of the type "If you cannot dispute a fact-based message, attack the messenger himself."^{[HIN]} Science has a well-established way of dealing with plagiarism (which may be unintentional^{[PLAG1][CONN21]} or not^{[FAKE2]}), and no award can ever change that.^{[HIN]} LBH and their co-workers have contributed useful improvements of deep learning methods.^{[CNN2,4][CDI][LAN][RMSP][XAV][ATT14][CAPS]} However, they built on the work of pioneers whom they did not cite, in contrast to ACM's Code of Ethics and Professional Conduct^{[ACM18]} (see Sec. II, V, XII, XIX, XXI, XIII, XIV, XI, and XX, as well as Sec. I, A, B, C, D, XVII, VI, and XVI). As emphasized earlier,^{[DLC][HIN]} our field should commit "to self-correction,"^{[SV20]} as is already the standard in other scientific fields. How can the public know who did what if the claims are presented in popular science venues without peer review? For example, the narrator of a popular 2018 Bloomberg video^{[VID2]} credited Hinton for modern speech recognition, although it was based on methods from our labs in Germany and Switzerland (LSTM & CTC; see Sec. A), developed long before Hinton's methods. Similarly, in 2016, the NY Times published a misleading article^{[NYT3]} on this topic, although Google's original 2016 paper on Google Translate^{[WU]} mentions LSTM over 50 times (see Sec. B). In ad hominem style,^{[AH2-3]} LeCun claimed that I keep "claiming credit he doesn't deserve for many, many things",^{[NYT1]} without backing this up by facts. LeCun also lavishly praised the GANs of Bengio's team,^{[GAN1]} although GANs are variations of my work in 1990.^{[AC90,90b][AC20][R2]} According to Bloomberg,^{[AV2]} Bengio has simply "denied my claims" without backing up his denial by any facts; see Sec. XVII.
One should correct misinformation "... and forcefully contradict public figures who promote it."^{[FAKE]} LBH, who called themselves the deep learning conspiracy,^{[DLC][DLC1-2]} are certainly highly cited. But our LSTM paper^{[LSTM1]} has got more citations than any paper by Bengio or LeCun,^{[R5]} and the most cited LBH papers build on uncredited prior work. Hinton's most cited paper (2012) is the one on GPU-based CNNs.^{[GPUCNN4][R5]} It follows our earlier work on supervised deep NNs (2010),^{[MLP1]} which ended the reign of unsupervised pre-training for deep NNs (introduced by myself^{[UN][UN0-3]} and later championed by Hinton;^{[UN4][VID1]} see Sec. D). Hinton (2012)^{[GPUCNN4]} characterizes our deep and fast DanNet (2011)^{[GPUCNN1-3]} as similar; DanNet had already won four contests before AlexNet won one;^{[R6]} see Sec. D, XIV. The highly cited VGG network (2014)^{[GPUCNN9]} also extended this line of work. Hinton's 2nd most cited paper^{[RUM][R5]} is the one on backpropagation (its count boosted by citations for a book by Rumelhart & McClelland^{[R5]}). But backpropagation is a previously invented method,^{[BP1]} and the underlying deep learning goes back to Ivakhnenko, whom Hinton has never cited;^{[DEEP1-2][R7-R8]} see Sec. II, XIII. Bengio's 2nd most cited research paper is the one on GANs (2014),^{[GAN1]} which are instances of my artificial curiosity (1990)^{[AC90,90b][AC20][R2]} which he did not cite; see Sec. XVII. Hinton's highly cited papers on unsupervised pre-training for deep NNs (2006-)^{[UN4]} were preceded by ours,^{[UN0-2][UN]} and his papers on dropout were preceded by Hanson's stochastic delta rule.^{[Drop1-3]} As recently as 2021, ACM published yet another misleading deep learning "survey" by LBH,^{[DL3a]} again heavily citing LBH without citing the pioneers. Consult the Executive Summary and Sec. I-XXI of this critique for more. So virtually all the algorithms that have attracted massive attention in modern deep learning have their conceptual and technical roots in my labs in Munich and Lugano,^{[MOST]} apart from the foundations of deep learning MLPs since 1965^{[DEEP1-2][GD1-2a]} (see Sec. II, XX), backpropagation (1960-70)^{[BPA][BP1]} (see Sec. XIX, XII), and convolutional NNs since 1979^{[CNN1-4]} (see Sec. XVIII, D). Our LSTM (1990s, see Sec. A, B; also for RL, 2003-, see Sec. C) → our Highway Net (May 2015) → ResNet (Dec 2015, see Sec. D).
Our adversarial Artificial Curiosity (1990) → GANs (2010s, see Sec. XVII). Our own unsupervised pre-training of deep NNs (1991, see Sec. II & III) was eventually supplanted by pure supervised learning: for recurrent NNs in the 1990s → our LSTM (see Sec. A-C); for feedforward NNs in 2010 → our DanNet (2011) → AlexNet (2012); VGG Net (2014) (see Sec. D). Our LSTM brought essentially unlimited depth to gradient-based supervised recurrent NNs in the 1990s; our Highway Nets^{[HW1-3]} brought it to feedforward NNs in May 2015.^{[MOST]} DanNet led to superior computer vision (2011, see Sec. D, XVIII), medical diagnosis (2012, see Sec. VII, XVIII), and many other applications.^{[DEC]} LSTM & CTC led to superior speech recognition (2007-15, see Sec. A), machine translation (2016, see Sec. B), robotics & video game players (2018-19, see Sec. C), and many other applications.^{[DEC]} Our Fast Weight Programmers (1991, see Sec. XVI) are formally equivalent to linear Transformers (now popular in NLP). See Sec. I, A, B, C, D, VII, XVIII.
As mentioned earlier,^{[MIR](Sec. 21)} it is not always clear^{[DLC]} whether later authors knew of the earlier work. The first multilayer perceptrons of arbitrary depth that really learned date back to 1965.^{[DEEP1-2][R8]} Soon afterwards, multilayer perceptrons learned internal representations through stochastic gradient descent in Japan.^{[GD1-2a]} A few years later, modern backpropagation was published (1970). Plagiarism may be unintentional^{[PLAG1][CONN21]} or intentional.^{[FAKE2]}
Yes, this critique is also an implicit critique of certain other awards to LBH.^{[HIN]} Numerous relevant threads can be found at reddit.com/r/MachineLearning^{[R1-R12]} (the largest machine learning forum, with back then over 800k subscribers), many of them influenced by my overview.^{[MIR]}
Dr. LeCun himself is well aware of the challenges to scientific integrity in our field:^{[LECP]} "... else cites."^{[LECP]} As mentioned in Sec. II, Rosenblatt (1962) already had MLPs whose first layer had fixed randomized weights and an adaptive output layer.^{[R62]} So Rosenblatt basically had what much later was rebranded as Extreme Learning Machines (ELMs),^{[ELM1]} without proper credit. The revisionist narrative of ELMs^{[ELM2][CONN21]} resembles that of the self-proclaimed "deep learning conspiracy".^{[DLC1-2]}
Note that I am insisting on proper credit assignment not only in my own research field but also in quite disconnected areas,^{[HIN]} as demonstrated by my numerous letters in this regard published in Science and Nature, e.g., on the history of aviation,^{[NASC1-2]} the telephone,^{[NASC3]} the computer,^{[NASC4-7]} resilient robots,^{[NASC8]} and scientists of the 19th century.^{[NASC9]} One day, AI scientists and AI historians equipped with artificial curiosity^{[SA17][AC90-AC20][PP-PP2][R1]} may help to set the record straight.
Many thanks for useful comments. Most of the references below can be found via my publication page and my arXiv page. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. The first paper on planning with reinforcement learning recurrent neural networks (NNs) and on generative adversarial networks; with a brief summary of the generative adversarial neural networks of 1990.^{[AC90,90b][AC20]} (More on artificial scientists and artists.) Preprint arXiv/1906.04493. ACM Code of Ethics and Professional Conduct. Association for Computing Machinery (ACM), 2018. [AIB] J. Schmidhuber. AI Blog. Includes variants of chapters of the AI Book. Blog of Werner Vogels, CTO of Amazon (Nov 2016). First publication (by Amari, 1972^{[AMH1]}) of what was later sometimes called the Hopfield network^{[AMH2]} or Amari-Hopfield Network.^{[AMH3]} [ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention, plus goal-conditional reinforcement learning. We had both hard attention (1990) and soft attention (1991-93).^{[FWP]} Today, both types are very popular. arXiv/1409.0473, 2014-16. Bloomberg, May 15, 2018. Bloomberg, May 17, 2018. Precursor of modern backpropagation.^{[BP1-4]} First application of backpropagation^{[BP1]} to NNs (concretizing thoughts in his 1974 thesis). [BP4] J. Schmidhuber (AI Blog, 2014; updated 2020). Who invented backpropagation? More in Scholarpedia.^{[DL2]} English version: [CNN1+]. [CNN1a] A. Waibel. Phoneme Recognition Using Time-Delay Neural Networks. Meeting of IEICE, Tokyo, Japan, 1987. First application of backpropagation^{[BP1][BP2]} and weight-sharing to a convolutional architecture.
Spatial Averaging.^{[CNN1]} PDF. PDF. PDF. Since November 2021: Comments on version 1 of the present report^{[T21v1]} in the Connectionists Mailing List, perhaps the oldest mailing list on artificial neural networks. Link to the archive. PDF. Beijing, 2014. Preprint arXiv:1402.3511 [cs.NE]. J. Schmidhuber (AI Blog, 2021). 10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named 1st superhuman result in 2011.^{[DAN1]} J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition. our artificial neural network called DanNet [DEC] J. Schmidhuber (AI Blog, 02/20/2020, revised 2021). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The [DIST1] J. Schmidhuber, 1991.^{[UN-UN2]} More. Deep Learning. HTML. [DL3a] Y. Bengio, Y. LeCun, G. Hinton (2021). Turing Lecture: Deep Learning for AI. Communications of the ACM, July 2021. HTML. Local copy (HTML only). [DL4] J. Schmidhuber (AI Blog, 2017). Our impact on the world's most valuable public companies: Apple, Google, Microsoft, Facebook, Amazon... By greatly improved (CTC-based) on-device speech recognition (on the phone, not the server) LSTM. PDF. J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.^{[DL6]} Soon after its publication, everybody started talking about "deep learning." Causality or correlation? Web site deeplearning.net of Y. Bengio's MILA (2015, retrieved May 2020; compare the version in the Internet Archive), referring to Hinton's^{[UN4]} and Bengio's^{[UN5]} unsupervised pre-training for deep NNs^{[UN4]} (2006) although this type of deep learning dates back to 1991.^{[UN1-2][UN]} II & XVII & III. [DLC] J. Schmidhuber (AI Blog, June 2015). 
Critique of Paper by self-proclaimed^{[DLC1-2]} "Deep Learning Conspiracy" (Nature 521 p 436). arxiv:1312.5602. Link. Alphastar has a "deep LSTM core." arXiv:1808.03578, 2018. In fact, the ELM concept goes back to Rosenblatt's work around 1960.^{[R62]} used LSTM over 4 billion automatic translations per day (The Verge, August 4, 2017); Facebook blog by J.M. Pino, A. Sidorov, N.F. Ayan (August 3, 2017) PDF. J. Schmidhuber (AI Blog, 26 March 2021). alternative^{[FWP0-1]} to recurrent NNs: a slow NN learns to program the fast weights^{[FAST,FASTa]} of another NN. Such Fast Weight Programmers^{[FWP0-6,FWPMETA1-7]} can learn to memorize past data, e.g., by computing fast weight changes through additive outer products of self-invented activation patterns^{[FWP0-1]} (now often called keys and values for self-attention^{[TR1-6]}). The similar Transformers^{[TR1-2]} combine this with projections and softmax, while linear Transformers or Performers^{[TR5-6]} are formally equivalent to the 1991 Fast Weight Programmers. In 1993, I introduced the attention terminology^{[FWP2]} now used in this context,^{[ATT]} and RNNs that program themselves. PDF. PDF. HTML. Pictures (German). PDF. Preprint: arXiv:1811.12143. PDF. PDF. Like [FWP0-2]. Preprint: arXiv:2003.08165. PDF. HTML overview. Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021. Preprint: arXiv:2102.11174. Preprint: arXiv:2106.06295 (June 2021). PDF. An introspective network that can learn to run its own weight change algorithm. In Proc. of the Intl. Conf. on Artificial Neural Networks, J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF. can be found here. Preprint arXiv:2012.14905 [cs.LG], 2020. Report arXiv:2011.07831 [cs.AI], 2020. PDF. Probably the first paper on using stochastic gradient descent^{[STO51-52]} reverse mode of automatic differentiation or backpropagation^{[BP1]}). OCR-based PDF scan of pages 94-135 (see pages 119-120). Implementation of Amari's 1967 stochastic gradient descent method for multilayer perceptrons.^{[GD1]} (S. Amari, personal communication, 2021.) 
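The additive outer-product update described above (key and value patterns programming a fast weight matrix, retrieval by matrix-vector product) can be sketched in a few lines. This is a minimal illustration with invented names (`FastWeightMemory`, `write`, `read`), not code from the 1991 papers:

```python
# Minimal sketch of the fast weight update: a "key" pattern and a
# "value" pattern program a fast weight matrix W via an additive outer
# product; retrieval is a plain matrix-vector product (cf. linear
# self-attention). Names are invented for illustration.

def outer(v, k):
    # outer product: v * k^T
    return [[vi * kj for kj in k] for vi in v]

def matvec(W, x):
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

class FastWeightMemory:
    def __init__(self, dim):
        self.W = [[0.0] * dim for _ in range(dim)]  # fast weights start at 0

    def write(self, key, value):
        # W <- W + value * key^T  (the additive outer-product rule)
        delta = outer(value, key)
        self.W = [[w + d for w, d in zip(rw, rd)]
                  for rw, rd in zip(self.W, delta)]

    def read(self, query):
        # retrieval: W @ query; a query aligned with a stored key
        # returns the associated value (scaled by key . query)
        return matvec(self.W, query)

mem = FastWeightMemory(3)
mem.write(key=[1.0, 0.0, 0.0], value=[0.0, 2.0, 0.0])
mem.write(key=[0.0, 1.0, 0.0], value=[5.0, 0.0, 0.0])
print(mem.read([1.0, 0.0, 0.0]))  # [0.0, 2.0, 0.0]
```

With orthonormal keys, each read returns exactly the stored value; Transformer-style attention adds learned query/key/value projections and a softmax on top of this additive rule.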
Google Research Blog, Sep 2015, see also Aug 2015 Google's speech recognition based on CTC and LSTM. Alphr Technology, Jul 2015, or 9to5google, Jul 2015 WIRED, Sep 2016, siliconANGLE, Sep 2016 Blog post, Internet Archive, 2010. A blog post describing the basic ideas^{[AC][AC90, AC90b][AC20]} of GANs. Description of GANs that does not cite the original work of 1990^{[AC][AC90,AC90b][AC20][R2]} (also containing wrong claims about Predictability Minimization^{[PM0-2][AC20]}). Link. This was number 1 on Hacker News. Frankfurter Allgemeine Zeitung, 16/6/2021. Preprint arXiv/2005.14165. for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011. PDF. ArXiv preprint. win four important computer vision competitions 2011-2012 before others won any PDF. HTML overview. competitor.^{[DAN1]} This led to massive interest from industry. [GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More. PDF. J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet won 4 of them in a row before the similar AlexNet/VGG Net and the Resnet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision. PDF. PDF. first deep learner to win a medical imaging contest (2012). HTML. [HIN] J. Schmidhuber (AI Blog, 2020). Critique of Honda Prize for Dr. Hinton. Science must not allow corporate PR to distort the academic record. PDF. North-Holland, 1991. PDF. Extending TR FKI-129-90, TUM, 1990. PDF. PDF. Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS 2015. The LSTM with forget gates^{[LSTM2]} for RNNs.) 
Resnets^{[HW2]} are a version of this where the gates are always open: g(x)=t(x)=const=1. Highway Nets perform roughly as well as ResNets^{[HW2]} on ImageNet.^{[HW3]} Highway layers are also often used for natural language processing, where the simpler residual layers do not work as well.^{[HW3]} More. Link. arXiv:1512.03385 (Dec 2015). Residual nets are a version of Highway Nets^{[HW1]} More. arxiv:1612.07771 (2016). Also at ICLR 2017. Preprint arXiv:1704.04760 PDF. PDF. arXiv:1607.06450, 2016. A New Publishing Model in Computer Science. Local copy (HTML only). [LEI21] J. Schmidhuber (AI Blog, 2021). 375th birthday of Leibniz, founder of computer science. Frankfurter Allgemeine Zeitung (FAZ), 17/5/2021. FAZ online: 19/5/2021. PDF. [LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF. Based on [LSTM0]. More. PDF. PDF. PDF. PDF. PDF. PDF. PDF. PDF. PDF. PDF. PDF. Preprint: arxiv:1506.07452. PDF. J. Schmidhuber (AI Blog, Dec 2020). 10-year anniversary of our journal paper on deep reinforcement learning with policy gradients for LSTM (2007-2010). Recent PDF. Preprint arXiv:1805.04908. Architectures. Preprint arXiv:1703.03906 J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of Searchable PDF scan (created by OCRmypdf which uses LSTM). HTML. better GP methods through Meta-Evolution. More. [MIR] J. Schmidhuber (AI Blog, 2019). Deep Learning: Our Miraculous Year 1990-1991. Preprint arXiv:2005.05744, 2020. Computation 22(12): 3207-3220, 2010. ArXiv Preprint. (AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training. By 2010, when compute was 100 times more expensive than today, both our feedforward NNs^{[MLP1]} J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in my labs at TU Munich and IDSIA. 
Here I mention: (1) Long Short-Term Memory (LSTM), (2) ResNet (which is our earlier Highway Net with open gates), (3) AlexNet and VGG Net (both citing our similar earlier DanNet: the first deep convolutional NN to win image recognition competitions), (4) GAN (an instance of our earlier Adversarial Artificial Curiosity), and (5) variants of Transformers (linear Transformers are formally equivalent to my earlier Fast Weight Programmers). Annus Mirabilis of 1990-1991.^{[MIR]} Preprint arXiv:1611.01578 (PDF), 2017. [NASC1] J. Schmidhuber. First Pow(d)ered flight / plane truth. Correspondence, Nature, 421 p 689, Feb 2003. [NASC3] J. Schmidhuber. The last inventor of the telephone. Letter, Science, 319, no. 5871, p. 1759, March 2008. Correspondence, Nature, vol 483, p 541, March 2012, doi:10.1038/483541b. Letter, Science, vol 336, p 1639, June 2012. See also comment on response by A. Hodges (DOI:10.1126/science.336.6089.1639-a) [NASC6] J. Schmidhuber. Colossus was the first electronic digital computer. Correspondence, Nature, 441 p 25, May 2006. [NASC7] J. Schmidhuber. Turing's impact. Correspondence, Nature, 429 p 501, June 2004 [NASC8] J. Schmidhuber. Prototype resilient, self-modeling robots. Correspondence, Science, 316, no. 5825 p 688, May 2007. [NASC9] J. Schmidhuber. Comparing the legacies of Gauss, Pasteur, Darwin. Correspondence, Nature, vol 452, p 530, April 2008. HTML. Link. NY Times article NY Times article Learning Dexterous In-Hand Manipulation. Preprint arXiv:1808.00177 (PDF). arxiv:1912.06680. An LSTM composes 84% of the model's total parameter count. 2018. An LSTM with 84% of the model's total parameter count was the core of OpenAI Five. PDF. HTML. Link. J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, the GAN principle. Based on TR FKI-126-90 (1990).^{[AC90]} More. PDF. 
Partially based on TR FKI-126-90 (1990).^{[AC90]} Report arXiv:1210.0118 [cs.AI], 2015. One Big Net For Everything. Preprint arXiv:1802.08864 [cs.AI], Feb 2018. Preprint: arXiv:1809.01999. Github: World Models. minimization. TR CU-CS-565-91, Univ. Colorado at Boulder, 1991. PDF. More. 1991. PDF. More. PDF. More. arXiv:1112.5309 [cs.AI] First Experiments with PowerPlay. arXiv:1210.8385 [cs.AI]. [R1] Reddit/ML, 2019. Hinton, LeCun, Bengio receive ACM Turing Award. [R2] Reddit/ML, 2019. J. Schmidhuber really had GANs in 1990. [R3] Reddit/ML, 2019. NeurIPS 2019 Bengio Schmidhuber Meta-Learning Fiasco. [R4] Reddit/ML, 2019. Five major deep learning papers by G. Hinton did not cite similar earlier work by J. Schmidhuber. [R5] Reddit/ML, 2019. The 1997 LSTM paper by Hochreiter & Schmidhuber has become the most cited deep learning research paper of the 20th century. [R6] Reddit/ML, 2019. DanNet, the CUDA CNN of Dan Ciresan in J. Schmidhuber's team, won 4 image recognition challenges prior to AlexNet. [R7] Reddit/ML, 2019. J. Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970. [R8] Reddit/ML, 2019. J. Schmidhuber on Alexey Ivakhnenko, godfather of deep learning 1965. [R9] Reddit/ML, 2019. We [R11] Reddit/ML, 2020. Schmidhuber: Critique of Honda Prize for Dr. Hinton [R12] Reddit/ML, 2020. J. Schmidhuber: Critique of Turing Award for Drs. Bengio & Hinton & LeCun [R15] Reddit/ML, 2021. J. Schmidhuber's work on fast weights from 1991 is similar to linearized variants of Transformers Preprint arXiv/1311.2524, Nov 2013. Preprint arXiv/1703.06870, 2017. PDF. This experimental analysis of backpropagation did not cite the origin of the method,^{[BP1-4]} also known as the reverse mode of automatic differentiation. Link. The Past, Present and Future of Artificial Intelligence. PDF. PDF. ACM's justification of the 2018 A.M. Turing Award (announced in 2019). WWW link. Local copy 1 (HTML only). Local copy 2 (HTML only). [T20a] J. Schmidhuber (AI Blog, 25 June 2020). 
Critique of 2018 Turing Award for Drs. Bengio & Hinton & LeCun. The first version of the present critique. Technical Report IDSIA-77-21 (v1), IDSIA, 24 Sep 2021. Link. Link. [TUR21] J. Schmidhuber (AI Blog, Sep 2021). Turing Oversold. It's not Turing's fault, though. J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised pre-training. Unsupervised PDF. 1992. Based on TR FKI-148-91, TUM, 1991.^{[UN0]} PDF. approaches are now widely used. More. [UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF. can be found here (depth > 1000). 2006. PDF. Link. [VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF. More on the Fundamental Deep Learning Problem. PDF. [VAN4] Y. Bengio. Neural net language models. Scholarpedia, 3(1):3881, 2008. Link. Link. Youtube video [see 28:16]. But in 2010, our team showed^{[MLP1-2]} unsupervised pre-training is not necessary Youtube video, 2018. Preprint arXiv:1609.08144 (PDF), 2016. Based on LSTM which it mentions at least 50 times. WWW link (retrieved 15 May 2020). Local copy (plain HTML only). a general, practical, program-controlled computer. PDF. J. Schmidhuber (AI Blog, 2021). 80th anniversary celebrations: 1941: Konrad Zuse completes the first working general computer, based on his 1936 patent application.
Menu
directory
status & updates
copyrights
AI Blog
@SchmidhuberAI
This is a point-for-point critique of ACM's justification of the ACM A. M. Turing Award for deep learning, as well as a critique of the Turing Lecture given by the awardees (published by ACM in July 2021).
2015 survey of deep learning^{[DL1]}
June 2020 article^{[T20a][R12]}
(see Executive Summary
I,
V,
II,
XII,
XIX,
XXI,
XIII,
XIV,
XX,
XVII).
(A) speech recognition,
(B) natural language processing,
(C) robotics,
(D) computer vision,
(VII) medicine, astronomy, materials science.
A,
B,
C,
D,
VII,
XVII,
VI,
XVI).
II,
V,
XX,
XVIII)
with Dr. Bengio & Dr. Hinton (see Sec. XVII, I).
I respond to LBH's recent ACM article (July 2021).
expands material in my Critique of the 2019 Honda Prize^{[HIN]} (~3,000 words).
Abstract & Outline (~300 words),
Introduction (~300 words),
Critique of LBH's ACM article (Turing Lecture) of July 2021^{[DL3a]}
Executive summary of what's wrong with ACM's laudation (~1,000 words),
21 comments on 21 claims by ACM (~8,000 words),
Conclusion and Acknowledgments (~2,000 words).
All backed up by over 250 references (~9,000 words).
The text contains numerous hyperlinks to relevant overview sites from the AI Blog.
science is self-correcting."^{[SV20]}
they are mine or other people's.^{[DL1-2][HIN][NASC1-9]} The present page is offered as a resource for all good computer scientists who share this inclination.
and to fight plagiarism, collusion rings,^{[LIT21]} and systemic academic corruption in all of their more and less subtle forms.^{[FAKE]}
Sec. 2 addresses LBH's 2021 ACM article^{[DL3a]} which necessitated an extension of the first version of this post.^{[T20a][R12]}
ACM's official justification^{[T19]} of the
2018 A.M. Turing Award^{[R1]}
After the Executive Summary in Sec. 3, Sec. 4 will split
ACM's full text^{[T19]}
into 21 parts
I,
II,
III,
IV,
V,
VI,
VII,
VIII,
IX,
X,
XI,
XII,
XIII,
XIV,
XV,
XVI,
XVII,
XVIII,
XIX,
XX,
XXI.
Most of the critiques are based on references to original papers and material from the AI Blog.^{[AIB][MIR][DEC][HIN]}
publishing yet another misleading overview of the field, this time based on LBH's Turing Lecture.^{[DL3a]}
LBH's well-known earlier omissions.^{[DLC][HIN][T20a]}
LBH claim to "briefly describe the origins of deep learning"^{[DL3a]} without even mentioning the world's first working deep learning nets by
Ivakhnenko and Lapa in 1965^{[DEEP1-2][R8]} (see Sec. II).
this class of methods was pioneered in 1991^{[UN-UN2]} (see Sec. II, III).
Highway Net,
the first really deep feedforward NN.^{[HW1-3]}
(see Sec. D, VI).
were all driven by my lab:^{[MOST]} In 1991, I had the
first very deep NNs based on unsupervised pre-training;^{[UN-UN2]}
LSTMs
brought essentially unlimited depth to gradient-based supervised recurrent NNs;^{[LSTM0-17]}
later our Highway Nets^{[HW1-3]} brought it to feedforward NNs.
from 2007^{[LSTM4,14]}
based on LSTM^{[LSTM0-6]} (1990s-2005) and CTC (2006).^{[CTC]}
our CTC-LSTM-based speech recognition (not that of Hinton) had been on most smartphones for years^{[GSR][GSR15-19][DL4]} (see Sec. A, VI, XI, XV). Similarly for machine translation (see Sec. B).
LBH cite Hinton (2012) for "dropout" without mentioning that dropout is just a variant of Hanson's 1990 stochastic delta rule^{[Drop1-2]} (see Sec. XIV).
von der Malsburg who introduced ReLUs in 1973^{[CMB]} (see Sec. XIV).
called AlexNet,^{[GPUCNN4]} without mentioning that our earlier groundbreaking deep GPU-based DanNet^{[GPUCNN1-3,5-8][DAN]} did not need ReLUs at all to win 4 earlier object recognition competitions and to achieve superhuman results already in 2011^{[GPUCNN1-8][R5-6]} (see Sec. XIV).
XVIII).
already in 1965^{[DEEP1-2][R8]} (see Sec. II).
earlier fast weights of von der Malsburg (1981) and Feldman (1982).^{[FAST,FASTa-b][FWP]}
described in the 1991-93 papers on Fast Weight Programmers and linear Transformers^{[FWP0-1,6]} (see Sec. XVI, XVII-2).
dedicate an extra section to attention-based Transformers,^{[TR1-6]} citing Bengio's team (2014) for "soft attention"^{[ATT14]} without citing the much earlier original work of 1991-1993 on soft attention and linear Transformers^{[FWP,FWP0-2,6][ATT]} (see Sec. XVII-1, XVI).
LBH claim that Bengio's team^{[NPM]}
of text compression^{[SNT]} (see Sec. XVI, XVII-1).
LBH cite Bengio's 2014 paper on Generative Adversarial Networks (GANs)^{[GAN0-1]} without mentioning that
GANs are instances
of the Adversarial Curiosity Principle of 1990^{[AC90-20][MIR](Sec. 5)} (see Sec. XVII).
In summation, LBH have repeatedly chosen to ignore the previous well-known critiques^{[DLC][HIN][T20a]} and deep learning surveys,^{[DL1-2]}
and deep learning (e.g., Sec. I), ACM lauds
Numerous references can be found under the relevant section links I-XXI
which adhere to the sequential order of ACM's text^{[T19]}
Sec. II:
it became really deep in 1991 in my lab,
unsupervised pre-training of NNs,
supervised LSTM.
Sec. I contains 4 subsections
A, B, C, D
A: Speech Recognition (see also Sec. VI & XI & XV): The first superior end-to-end neural speech recognition
combines two methods from my lab: LSTM (1990s-2005) and CTC (2006), which were
Hinton (2012) and Bengio (XV)
our revolutionary CTC-LSTM which was soon on most smartphones.
Sec. B: Natural Language Processing (see also Sec. VI & XI & XVI):
(soon used for several billions of
was also based on our LSTM.
Sec. C: Robotics.
most visible breakthroughs
Sec. D: Computer Vision
XVIII & XIV & XI & VI)
and applied to speech. All before LeCun's CNN work (XVIII).
deep NNs
pre-training (in contrast to Hinton's claims). Our DanNet was the first CNN fast & deep enough for
superior computer vision in 2011,
winning 4 image recognition contests in a row
is an open-gated version of our earlier Highway Nets.
Sec. XIV:
deep & fast CNN
(where LeCun participated),
Sec. XI: ACM mentions GPU-accelerated NNs
deep GPU-NN of 2010
debunked unsupervised pre-training (introduced by myself in 1991 and later championed by Hinton),
and our GPU-CNN of 2011 (DanNet) was the first
XVIII:
Fukushima and Waibel (see Sec. D).
VII: ACM explicitly mentions medicine and
first to win medical imaging competitions
Sec. XII & XIX & XXI: Modern
backpropagation
XIII &
II &
V
III &
IX &
X &
XX):
Sec. XX: ACM credits LeCun for work on
Sec. XXI: ACM credits LeCun for work on
XV: ACM credits Bengio for hybrids of NNs and probabilistic models of sequences.
CTC-LSTM
A &
B).
XVI: ACM
We started this in 1990-93
long before LBH
Sec. XVII:
Artificial Curiosity
vanishing gradients (1991),
metalearning (1987),
unsupervised pre-training (1991),
compressing or distilling one NN into another (1991),
learning sequential attention with NNs (1990),
fast weight programmers using
and other topics.^{[R2-R6]}
Sec. IV is on Turing (1936) and his predecessors
Critique of LBH's ACM article (Turing Lecture) of July 2021.
Sec. Conclusion:
In the recent decade of deep learning,
(speech recognition, language translation, etc.) on billions of devices (also healthcare applications)
Sec. II &
III &
V &
XII &
XIII &
XVII &
XIV &
XIX &
XX &
XXI.
In what follows, ACM's full text [T19] is split into 21 parts
I,
II,
III,
IV,
V,
VI,
VII,
VIII,
IX,
X,
XI,
XII,
XIII,
XIV,
XV,
XVI,
XVII,
XVIII,
XIX,
XX,
XXI.
LBH and their co-workers have contributed certain useful improvements of existing deep learning methods.^{[CNN2,4][CDI][LAN][RMSP][XAV][ATT14][CAPS]} However, the field's foundations were laid by others: deep learning multilayer perceptrons (1965),^{[DEEP1-2][R8]} modern backpropagation (1970),^{[BP1-2][R7]} architectures of recurrent NNs (1943-56)^{[MC43][K56]} and convolutional NNs (1979),^{[CNN1]} principles of generative adversarial NNs and artificial curiosity (1990),^{[AC90,90b][AC20]} unsupervised pre-training for deep NNs (1991),^{[UN1-2]} vanishing gradients (1991)^{[VAN1]} & Long Short-Term Memory or LSTM (Sec. A), GPU-accelerated NNs (2004),^{[GPUNN][DAN][DAN1][GPUCNN5]} NNs with over 100 layers (2015),^{[HW1-3][R5]} and transformer-like^{[TR1-6][FWP]} attention^{[FWP][ATT]} through fast weight programmers (1991).^{[FWP0-2,6]}^{[DL1-2][R2-R8]} Often LBH failed to cite essential prior work, even in their later surveys.^{[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5, R7-R8]} This may explain some of ACM's misattributions.^{[T19]} See Sec. II & III & V & XIII & X & XVII & XII & XVIII & XX. By the 2010s,^{[DEC]} deep NNs were heavily used in academia and industry,^{[DL4]} in the fields mentioned by ACM (labeled as A, B, C, D) below. A. Speech Recognition. The first superior end-to-end neural speech recognition combines two methods from my lab: (A1) Long Short-Term Memory or LSTM (1990s-2005),^{[LSTM0-6]} which overcame the vanishing gradient problem analyzed by my student Sepp Hochreiter in 1991.^{[VAN1]} This happened long before the similar work of Bengio (see Sec. XVII).^{[MIR](Sec. 3, Sec. 4)} LSTM was refined with my student Felix Gers^{[LSTM2]} through "forget gates" based on end-to-end-differentiable fast weights.^{[MIR](Sec. 8)[FWP,FWP0-1]} (A2) Connectionist Temporal Classification (CTC) by my student Alex Graves et al. (2006).^{[CTC]} Our team successfully applied CTC-trained LSTM to speech in 2007^{[LSTM4]} (also with hierarchical LSTM stacks^{[LSTM14]}). This was superior to the old hybrid approach combining NNs and Hidden Markov models (HMMs)^{[BW][BRI][BOU]} (Sec. XV). Hinton et al. (2012) still used the old hybrid approach^{[HYB12]} and did not compare it to CTC-LSTM. Our CTC-trained LSTM became the first recurrent NN (RNN) to win international competitions. 
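The gist of (A1) in code: the LSTM cell state follows c_t = f_t * c_{t-1} + i_t * g_t, so a forget gate saturated near 1 carries signals (and their gradients) across many steps instead of letting them vanish. A toy scalar sketch (illustration only; it omits the output gate and all learned weights):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy LSTM cell-state update: c_t = f_t * c_{t-1} + i_t * g_t
def lstm_cell_state(c_prev, f_in, i_in, g_in):
    f = sigmoid(f_in)        # forget gate (Gers et al.)
    i = sigmoid(i_in)        # input gate
    g = math.tanh(g_in)      # candidate value
    return f * c_prev + i * g

# carry a stored value across 100 steps: forget gate open (f ~ 1),
# input gate closed (i ~ 0)
c = 1.0
for _ in range(100):
    c = lstm_cell_state(c, f_in=10.0, i_in=-10.0, g_in=0.0)

# contrast: repeated squashing in a plain RNN-style recurrence
h = 1.0
for _ in range(100):
    h = math.tanh(0.5 * h)

print(round(c, 3), round(h, 3))  # c stays near 1, h collapses toward 0
```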
Graves later reused our end-to-end neural speech recognizer^{[LSTM4][LSTM14]} as a postdoc in Hinton's lab.^{[LSTM8]} CTC-LSTM dramatically improved Google's speech recognition.^{[GSR][GSR15][DL4]} By 2019, Google's on-device speech recognition^{[GSR19]} (no longer on the server) was still based on LSTM^{[MIR](Sec. 4)} (see Sec. VI & XI & XV). B. Natural Language Processing. The first superior end-to-end neural machine translation was also based on LSTM; compare our much earlier neural models of text^{[SNT]} (see Sec. XVI). In 2001, we showed that LSTM can learn languages unlearnable by traditional models such as HMMs.^{[LSTM13]} See also Sec. VI & XI & XV. Later machine translation additionally used attention mechanisms tailored by Bengio's team.^{[ATT14][FWP]} However, such attention mechanisms also have their roots in my lab (1991);^{[FWP][FWP0-2,6]} see Sec. XVI. C. Robotics & RL etc. Since 2003, our team has used LSTM for Reinforcement Learning (RL) and robotics.^{[LSTM-RL][RPG][LSTMPG]} In the 2010s, many of the most visible RL breakthroughs involved LSTM. For example, in 2018, a PG-trained LSTM was the core of OpenAI's famous Dactyl, which learned to control a dextrous robot hand without a teacher.^{[OAI1][OAI1a]} In 2019, DeepMind beat a pro player in the game of Starcraft, which is theoretically harder than Chess or Go^{[DM2]} in many ways, using Alphastar, whose brain has a deep LSTM core trained by PG.^{[DM3]} A PG-trained LSTM was also the core of OpenAI Five, which learned to defeat human experts in the Dota 2 video game (2018).^{[OAI2]} Bill Gates called this a "huge milestone in advancing artificial intelligence".^{[OAI2a][MIR](Sec. 4)[LSTMPG]} Apart from A, B, C above, LSTM has been applied in healthcare, chemistry, molecular design, lip reading, speech synthesis,^{[AM16]} predicting what's going on in nuclear fusion reactors, and so on.^{[DEC][DL4]} By 2017, a large fraction of the computational power in Google's datacenters was being used for LSTM (only 5% for the CNNs of Sec. D).^{[JOU17]} Apparently the first LSTM journal paper^{[LSTM1][R5]} is now the most cited deep learning research paper of the 20th century.^{[R5]} D. Computer Vision was revolutionized in the 2010s by a particular feedforward NN called the convolutional NN (CNN).^{[CNN1-4]} The basic CNN architecture with convolutional and downsampling layers is due to Fukushima (1979).^{[CNN1]} The popular downsampling variant called max-pooling was introduced by Weng et al. 
(1993).^{[CNN3]} In 1987, NNs with convolutions were combined by Waibel with weight sharing and backpropagation.^{[CNN1a]} Waibel did not call this CNNs but TDNNs. LeCun's team later contributed improvements of CNNs, especially for images^{[CNN2,4]} (see Sec. XVIII). Finally, my own team showed in 2010^{[MLP1]} that unsupervised pre-training is not necessary to train deep NNs, contrary to claims by Hinton,^{[VID1]} who said that "nobody in their right mind would ever suggest" this. Our fast GPU-based CNN of 2011,^{[GPUCNN1]} known as DanNet,^{[DAN,DAN1][R6]} was much deeper and faster than the GPU-accelerated CNNs of 2006.^{[GPUCNN]} DanNet entered a series of contests, winning four of them in a row (15 May 2011, 6 Aug 2011, 1 Mar 2012, 10 Sep 2012).^{[GPUCNN5]} At IJCNN 2011 in Silicon Valley, DanNet blew away the competition and achieved the first superhuman visual pattern recognition^{[DAN1]} in an international contest (where LeCun's team took a distant second place). DanNet was also the first deep CNN to win a Chinese handwriting contest (ICDAR 2011) and an image segmentation contest (ISBI, May 2012). Our CVPR paper on DanNet^{[GPUCNN3]} appeared in July 2012, before the similar GPU-accelerated AlexNet of Hinton's student Krizhevsky won the ImageNet^{[IM09]} 2012 contest^{[GPUCNN4-5][R6]} (now also without unsupervised pre-training, citing DanNet). Our CNN image scanners were 1000 times faster than previous methods.^{[SCAN]} The VGG network (ImageNet 2014 winner)^{[GPUCNN9]} and other highly cited CNNs^{[RCNN1-3]} further extended the work of 2011.^{[MIR](Sec. 19)} ResNet, the ImageNet 2015 winner^{[HW2]} (Dec 2015), which currently gets more citations per year than any other NN,^{[MOST]} is an open-gated version of our earlier Highway Net (May 2015).^{[HW1-3][R5]} The Highway Net is actually the feedforward net version of vanilla LSTM.^{[LSTM2]} It was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers). See also Sec. XVIII & XIV & XI & VI.
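For readers unfamiliar with the two ingredients of Sec. D, here is a minimal 1D sketch (illustration only): a convolution applies one shared kernel at every position (weight sharing), and max-pooling downsamples by keeping the strongest response per window:

```python
def conv1d(signal, kernel):
    # "valid" convolution/correlation with a single shared kernel
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def max_pool1d(signal, size):
    # non-overlapping max-pooling: keep the strongest response per window
    return [max(signal[i:i + size])
            for i in range(0, len(signal) - size + 1, size)]

x = [0.0, 1.0, 3.0, 1.0, 0.0, 2.0, 6.0, 2.0]
edges = conv1d(x, [-1.0, 1.0])   # a difference filter responding to edges
print(edges)                     # [1.0, 2.0, -2.0, -1.0, 2.0, 4.0, -4.0]
print(max_pool1d(edges, 2))      # [2.0, -1.0, 4.0]
```

In a 2D CNN the same idea applies to image patches; stacking such convolutional and downsampling layers yields the Fukushima-style architecture discussed above.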
The central ideas, however, appeared long before the 1980s. Architectures of recurrent NNs were proposed already in the 1940s/50s^{[MC43][K56]} (but don't forget prior work in physics since the 1920s^{[L20][I25][K41][W45]}). The basic deep convolutional NN architecture was proposed in the 1970s.^{[CNN1]} NNs without hidden layers learned in 1958^{[R58]} (essentially variants of linear regression and the method of least squares^{[DL1-2]}). Rosenblatt also thought about deeper adaptive NNs.^{[R61,R62]} In 1965, Ivakhnenko & Lapa published the first working learning algorithms for deep NNs with arbitrarily many layers (already containing the now popular multiplicative gates).^{[DEEP1-2][DL1-2]} A paper of 1971^{[DEEP2]} already described a deep net with 8 layers, trained by their highly cited method which was still popular in the new millennium,^{[DL2]} especially in Eastern Europe, where much of Machine Learning was born. Ivakhnenko did not call it an NN, but that's what it was.^{[MIR](Sec. 1)[R8]} LBH failed to cite this. See Sec. XIII & III & V & VIII & IX & X. A misleading narrative is propagated by LBH & co-authors, e.g., Sejnowski^{[S20]} (see Sec. XIII). It goes more or less like this: "In 1969, Minsky & Papert^{[M69]} [pointed out limitations of shallow NNs, and the field lay dormant until] researchers took a fresh look at the problem in the 1980s."^{[S20]} However, as mentioned above, the 1969 book^{[M69]} addressed a "problem" of Gauss & Legendre's shallow learning (~1800)^{[DL1-2]} that had already been solved 4 years prior by Ivakhnenko & Lapa's popular deep learning method.^{[DEEP1-2][DL2]} Minsky was apparently unaware of this and failed to correct it later.^{[HIN](Sec. I)} (But see a 1989 paper.^{[MOZ]}) However, it became really deep in 1991 in my lab.^{[UN-UN3]} See Sec. 1 of the overview:^{[MIR]} First Very Deep NNs, Based on Unsupervised Pre-Training (1991). This approach already solved "Very Deep Learning" tasks of depth > 1000.^{[UN2][DL1][UN]} (By 2003, LSTM variants successfully dealt with language problems of depth up to 30,000^{[LSTM17]} and more.) My lab also drove the shift from unsupervised pre-training to purely supervised learning (1991-95; 2006-10).^{[HIN](Sec. II)[MIR](Sec. 19)} See Sec. III. Note that LSTMs brought essentially unlimited depth to supervised recurrent NNs; Highway Nets^{[HW1-3]} brought it to feedforward NNs.^{[MOST]}
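The Highway/ResNet relation invoked throughout this critique (a ResNet layer is a Highway layer with its gates fixed open, g(x)=t(x)=1) can be written as a toy scalar sketch (illustration only; in real layers the gates t and c are learned, input-dependent functions, as in LSTM's gates):

```python
def highway_layer(x, h, t, c):
    # Highway Net layer: y = t(x) * h(x) + c(x) * x, gates in (0, 1)
    return t * h(x) + c * x

def residual_layer(x, h):
    # ResNet layer: the special case with both gates fixed open (t = c = 1)
    return h(x) + x

double = lambda x: 2.0 * x
print(highway_layer(3.0, double, t=1.0, c=1.0))  # 9.0
print(residual_layer(3.0, double))               # 9.0 -- identical
```

The carry term (c * x, or the bare + x in ResNets) is what lets signals and gradients pass through hundreds of layers, the feedforward analogue of LSTM's gated cell state.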
These foundations were created by others (Sec. III).^{[DLC][DEEP1-2][BP1][DL1-2][R7-R8][R2-R4]} They include: deep learning multilayer perceptrons (1965),^{[DEEP1-2][R8]} modern backpropagation (1970),^{[BP1,2][R7]} architectures of recurrent NNs (1943-56)^{[MC43][K56]} and convolutional NNs (1979),^{[CNN1]} principles of generative adversarial NNs and artificial curiosity (1990),^{[AC90,90b][AC20]} unsupervised pre-training for deep NNs,^{[UN1-2]} the vanishing gradient problem (1991)^{[VAN1]} & solutions to it (Sec. A), GPU-accelerated NNs (2004),^{[GPUNN][GPUCNN5]} and other foundations.^{[DL1-2][R2-R8]} Often LBH failed to cite essential prior work.^{[DLC][HIN][MIR](Sec. 21)} See Sec. II & V & XIII & IX & X & XVII & XII & XVIII & XX & I. Compare deeplearning.net, which until 2019 advertised deep learning as "moving beyond shallow machine learning since 2006",^{[DL7]} referring to Hinton's^{[UN4]} and Bengio's^{[UN5]} unsupervised pre-training (2006). However, we had this type of deep learning already in 1991;^{[UN][UN1-2]} see Sec. II & XVII (5). Not to mention Ivakhnenko's even earlier supervised layer-wise training of deep NNs,^{[DEEP1-2]} which Hinton,^{[UN4]} Bengio,^{[UN5]} and LBH^{[DL3,DL3a]} did not cite either. See Sec. X.
my comments systematically track the sequential order of ACM's claims.^{[T19]}
ACM's statement on Turing is greatly misleading, like some of its other statements.^{[T19]} The foundational results of Gödel and his contemporaries are relevant for any type of computation-based AI.^{[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]} Much of early AI in the 1940s-70s was actually about theorem proving.^{[ZU48][NS56]}
In 1936, Turing introduced the Turing Machine.^{[TUR]} He rederived the above-mentioned result.^{[CHU][TUR][HIN][GOD21,21a][TUR21][LEI21,21a]} In the same year of 1936, Emil Post published yet another independent universal model of computing.^{[POS]} (See also my reply to Hinton, who criticized my website on Turing without suggesting any fact-based corrections.^{[HIN]}) Gödel essentially formulated the famous open problem "P=NP?" in his letter to John von Neumann (1956).^{[GOD56][URQ10]} Likewise, Konrad Zuse (1910-1995) created the world's first working programmable general-purpose computer 1935-41. His patent application of 1936^{[ZU36-38][Z36][RO98][ZUS21]} predated Claude Shannon's 1937 thesis on digital circuit design.^{[SHA37]} Zuse also created the first high-level programming language in the early 1940s.^{[BAU][KNU]} Although his machine lacked an explicit conditional jump instruction, Rojas showed that it is nevertheless universal in principle.^{[RO98]}
Most of these achievements are due to researchers other than LBH: deep learning multilayer perceptrons that learn internal representations (1965),^{[DEEP1-2][R8]} modern backpropagation (1970),^{[BP1,2][R7]} architectures of recurrent NNs (1943-56)^{[MC43][K56]} and convolutional NNs (1979),^{[CNN1]} principles of generative adversarial NNs and artificial curiosity (1990),^{[AC][AC90,90b][AC10][AC20]} unsupervised pre-training for deep NNs (1991),^{[UN1-2][UN]} vanishing gradients (1991)^{[VAN1]} & solutions to it (Sec. A),^{[LSTM0-17][CTC]} GPU-accelerated NNs (2004),^{[GPUNN][GPUCNN5]} record-breaking deep supervised NNs (2010)^{[MLP1-2]} and contest-winning deep CNNs (2011),^{[DAN][DAN1][GPUCNN5]} NNs with over 100 layers (2015),^{[HW1-3][R5]} transformer-like^{[TR1-6][FWP]} attention^{[FWP][ATT]} through fast weight programmers (1991),^{[FWP0-2,6]} and more.^{[DL1-2][R2-R8]} Often LBH failed to cite essential prior work.^{[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5,R7,R8,R11]} See Sec. II & I & III & XIII & X & XVII & XII & XVIII & XX.
"advances in natural language processing" and in speech supervised NNs and CNNs achieved by our group 2010-2011^{[MLP1-2][DAN][DAN1][GPUCNN5][R6]} and through Highway Net-like NNs (2015),^{[HW1-3][R5]} although the principles of CNNs were invented and developed by others since the 1970s.^{[CNN1-4]} See Sec. D & XVIII & XIV as well as Sec. 4 & Sec. 19 of the overview.^{[MIR]}
DanNet^{[DAN][DAN1][GPUCNN5]} was the first NN to win a medical imaging contest through deep learning (Sept 2012, on cancer detection).^{[GPUCNN5,8]} We also used this approach to greatly improve steel defect detection.^{[ST]} All of this happened before the similar GPU-accelerated AlexNet of Hinton's student Krizhevsky won ImageNet 2012.^{[GPUCNN5][R6]} See also our work on mitosis detection^{[MGC][GPUCNN5,8]} (based on the approach of Sec. D & XI).
LBH built on such earlier works, often without citing them.^{[DL1][DLC][HIN][R2-R4][R7-R8]} See Sec. V & XII & XIX & II & III & XIII & XVII & X & I.
LBH failed to cite them, even in later work.^{[HIN][DLC][DL1-2][DEEP1-2][CMB][R7-R8]} See Sec. II & III & XIII & V & X & XIV & I.
The term "deep learning" was first introduced to Machine Learning by Dechter (1986), and to NNs by Aizenberg et al. (2000).^{[DL2]} To my knowledge, LBH have never cited them. (Margin note: our 2005 paper on deep RL^{[DL6,6a]} was apparently the first machine learning publication with the word combination "learn deep" in the title.) LBH started talking about "deep learning ... moving beyond shallow machine learning since 2006",^{[DL7]} referring to their unsupervised pre-training methods of 2006. See Sec. III. However, others built careers on this notion long before LBH recognized it.^{[DEEP1-2][CNN1][HIN][R8][DL1][DLC]} Even deep learning through unsupervised pre-training was introduced by others.^{[UN1-3][R4][HIN](Sec. II)} See Sec. II & III & XIII & V & I.
Much of this essential prior work was ignored by LBH's papers^{[HIN][R7-R8][R2-R5]} (see Sec. V & II & III & I & XIII & XII & XIX & X & XVII).
ACM correctly mentions advancements through GPUs. The first to use GPUs for NNs were Jung & Oh (2004).^{[GPUNN][GPUCNN5]} Our team later made GPU-based NNs fast and deep enough to set an important benchmark record (2010),^{[MLP1-2]} showing that unsupervised pre-training (pioneered by myself in 1991) is not necessary to train deep NNs, contrary to Hinton's claims.^{[VID1]} By 2011, our CNNs were deep and fast enough^{[DAN][DAN1][GPUCNN5]} to achieve superhuman performance in computer vision (explicitly mentioned by ACM) for the first time^{[R6]} (see Sec. D).
Furthermore, by the mid 2010s, speech recognition and machine translation (explicitly mentioned by ACM) were actually dominated by LSTM and CTC of our team.^{[LSTM1-4][CTC]} In particular, as mentioned in Sec. A, this end-to-end approach was superior to traditional hybrids based on hidden Markov models (HMMs).^{[BW][BOU][BRI][HYB12]} As mentioned in Sec. B and XVI, the first superior end-to-end neural machine translation was also based on LSTM.
ACM's statement is "less wrong" than Honda's^{[HIN](Sec. I)} but still misleading: ACM (and apparently even other award committees^{[HIN](Sec. I)}) seem to credit backpropagation to Rumelhart et al. (1985-86),^{[RUM]} although Werbos had already applied it to NNs (1982).^{[BP2]} And the article^{[RUM]} even failed to mention Linnainmaa, the inventor of this famous algorithm for credit assignment in networks (1970).^{[BP1]} By 1960, Kelley already had a precursor thereof in the field of control theory;^{[BPA]} see also later work of the early 1960s.^{[BPB][BPC]}^{[R7]} Rumelhart et al. showed experimentally that backpropagation can yield useful internal representations in hidden layers of NNs.^{[RUM]} But this was essentially just an experimental analysis of a known method.^{[BP1-2]} More on the history of backpropagation can be found at Scholarpedia^{[DL2]} and in my award-winning survey.^{[DL1]} Also see Sec. XIX, II.
Some claim that "backpropagation is just the chain rule of Leibniz (1676) & L'Hopital (1696)." No, it is the efficient way of applying the chain rule to big networks with differentiable nodes (there are also many inefficient ways of doing this). It was not published until 1970.^{[BP1]} Compare a recent debate:^{[HIN]} It is true that in 2018, Hinton^{[AOI]} credited Rumelhart^{[RUM]} with the "invention" of backpropagation, yet Hinton himself had accepted credit for "creating" the method and for other things he didn't do.^{[HIN]} Neither in a popular book^{[AOI]} nor in other recent work^{[DL3,DL3a]} did he cite Linnainmaa (1970),^{[BP1]} the true creator.^{[BP4-5]} It is true that his 2015 survey^{[DL3]} does cite Werbos (1974), who however described the method correctly only later in 1982,^{[BP2]} and it also failed to cite Linnainmaa^{[BP1]} (compare Amari's work of 1977^{[BP6]}). Linnainmaa's method was well-known.^{[BP5][DL1-2][DLC]} It wasn't created by "lots of different people" as Hinton suggested,^{[AOI][HIN][R11]} but by one person who published first^{[BP1]} and therefore should get the credit.
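To make the distinction concrete: for a chain of differentiable nodes, reverse mode propagates a single derivative signal backwards and reuses each local derivative once, so the gradients of all weights come out of one sweep. A minimal illustrative sketch in pure Python (the tanh chain and all names here are my own assumptions, not historical code):

```python
import math

def forward(x, ws):
    """Chain of tanh units a_{i+1} = tanh(w_i * a_i); records intermediates."""
    acts = [x]
    for w in ws:
        acts.append(math.tanh(w * acts[-1]))
    return acts

def backward(acts, ws):
    """One reverse sweep yields dy/dw_i for every weight at O(depth) cost."""
    grads = [0.0] * len(ws)
    delta = 1.0                               # dy/d(final activation)
    for i in reversed(range(len(ws))):
        d_tanh = 1.0 - acts[i + 1] ** 2       # tanh'(pre) = 1 - tanh(pre)^2
        grads[i] = delta * d_tanh * acts[i]   # dy/dw_i
        delta *= d_tanh * ws[i]               # propagate dy/da_i backwards
    return grads

ws = [0.5, -1.2, 0.8]
acts = forward(0.7, ws)
grads = backward(acts, ws)
```

An inefficient application of the same chain rule would recompute the product of local derivatives separately for each weight, at O(depth^2) cost; the reverse sweep above is the efficient variant the text refers to.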
ACM mentions Hinton's Boltzmann Machine (BM),^{[BM]} a method of unsupervised learning.^{[HIN]} Recently, however, I learnt through a reader that even the BM paper^{[BM]} did not cite prior relevant work by Sherrington & Kirkpatrick^{[SK75]} and Glauber.^{[G63]} (Compare related work.^{[H86][H88][S93]}) Note also that Ivakhnenko's nets of 1965 were already multilayer perceptrons with arbitrarily many layers.^{[DEEP1-2][HIN]} See Sec. II & V & X.^{[MIR](Sec. 1)[R8]}
As mentioned in Sec. II, Sejnowski's rather self-serving "history of deep learning" [S20] claims that the 1969 book of Minsky & Papert^{[M69]} led researchers to abandon NNs until a new generation took a fresh look "at the problem in the 1980s."^{[S20]} However, the 1969 book^{[M69]} addressed a "deep learning problem" (a limitation of Gauss & Legendre's shallow learning around 1800^{[DL1-2]}) that had already been solved four years prior (see Sec. II); work on deep learning also continued in the 1970s, especially outside of the Anglosphere.^{[DEEP2][BP6][CNN1][DL1-2]}
Dropout is actually a variant of Hanson's much earlier stochastic delta rule (1990).^{[Drop1-2]} Hinton's 2012 paper and his later patent did not cite this either. Dropout was not even needed to achieve superior computer vision, as we showed already in 2011 in a contest where LeCun's team participated as well^{[DAN1]} (see Sec. D above). Back then, the really important breakthrough was the speedup of deep CNNs through GPUs.^{[GPUCNN1,3,5][R6]} Already before ImageNet 2012,^{[R6]} our fast deep CNN called DanNet had a monopoly on winning computer vision competitions.^{[GPUCNN5]} It more than "halved the error rate for object recognition" (ACM's wording) in a contest already in 2011,^{[GPUCNN2][DAN,DAN1][R6]} long before the similar system of Hinton's student. See Sec. D as well as Sec. 19 of the overview.^{[MIR]}
Hybrid NN-HMM speech recognizers had existed since the late 1980s.^{[BW][BRI][BOU]} The revolution of the 2010s, however, was driven by our LSTM (1990s-2005)^{[LSTM0-6]} and CTC^{[CTC]} (2006), which were applied to speech in 2007.^{[LSTM4][LSTM14]} CTC-LSTM is end-to-end-neural and thus very different from (and superior to) the hybrid methods since the late 1980s.^{[BW][BRI][BOU][HYB12]} See also Sec. A.
ACM credits Bengio for the neural probabilistic language model.^{[NPM]} However, 5 years earlier, in 1995, we already had a similar, excellent neural probabilistic text model,^{[SNT]} which Bengio^{[NPM]} characterizes only briefly as "related" (see also Pollack's earlier work on embeddings of words and other structures^{[PO87][PO90]}). In the 2010s, the most important NN for natural language processing was actually the LSTM of our team,^{[LSTM0-6]} which Bloomberg called "arguably the most commercial AI achievement."^{[AV1][MIR](Sec. 4)} See Sec. B. The attention mechanism of Bengio's team^{[ATT14]} has indeed become important. For example, it helped to further improve Facebook's LSTM-based translation (see Sec. B). However, both types of adaptive neural sequential attention originated in my lab: end-to-end-differentiable "soft" attention in the latent space of Fast Weight Programmers (FWPs),^{[FWP2][FWP]} and "hard" attention (in observation space) in the context of RL^{[ATT][ATT0-1]} (1990). Today's attention-based Transformers,^{[TR1-6]} which have become a popular alternative to RNNs, build on the same principle as the FWPs of 1991.^{[FWP0-1]} My FWP of 1991^{[FWP0-1]} computes fast weight changes through outer products of activation patterns (now often called keys and values for self-attention).^{[TR1-6][FWP]} In the 2010s,^{[DEC]} Transformers^{[TR1-2]} excelled at natural language processing, a traditional LSTM domain (see Sec. B), although there are problems that LSTM can rapidly learn to solve quickly^{[LSTM13,17]} while Transformers cannot. Compare the linear Transformers or Performers^{[TR5-6]} which are formally equivalent to my 1991 FWPs (apart from normalization).^{[FWP6][FWP]} In 1993, I introduced the attention terminology^{[FWP2]} now used in this context,^{[ATT]} and RNNs that program themselves.
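The outer-product mechanism can be sketched in a few lines of numpy. This is a toy illustration, not the original 1991 system: the dimensions, the `write`/`read` helper names, and the random data are my own assumptions; the key/value terminology follows today's usage:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W_fast = np.zeros((d, d))        # fast weight matrix, initially blank

def write(key, value):
    """Programming step: an additive outer product stores an association."""
    global W_fast
    W_fast += np.outer(value, key)

def read(query):
    """Retrieval step: apply the programmed weights (unnormalized linear attention)."""
    return W_fast @ query

keys = rng.standard_normal((3, d))
values = rng.standard_normal((3, d))
for k, v in zip(keys, values):
    write(k, v)

q = keys[1]
out = read(q)
# same result in linear-attention form: sum_i value_i * (key_i . query)
ref = sum(v * (k @ q) for k, v in zip(keys, values))
assert np.allclose(out, ref)
```

Since the fast weight matrix is the sum of the outer products, applying it to a query is algebraically identical to attention with unnormalized linear (dot-product) scores, which is the formal equivalence mentioned above.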
See^{[MIR](Sec. 9)[R4]} for my related priority dispute on attention with Hinton. He was the reviewer of my 1990 paper,^{[ATT2]} yet later published closely related work of his own.^{[ATT3]}
GANs^{[GAN0-1]} (2010-2014) are actually a simple application^{[AC]} of the adversarial curiosity (AC) principle from 1990^{[AC90,90b][AC20]} (see also surveys^{[AC09-10]}). This principle is now widely used for exploration in RL (e.g., Sec. C) and for image synthesis^{[GAN1]} (also mentioned by ACM in Sec. XVIII). In AC, a predictor NN minimizes its error, while the generator NN tries to make outputs that maximize this error: one net's loss is the other net's gain. 4 years before the GAN paper,^{[GAN1]} a well-known 2010 survey^{[AC10]} summarised the generative adversarial NNs of 1990 as a setting where a predictive model learns whether the controller's (or generator's) output is in a given set.^{[AC20][AC]} (The early adversarial machine learning settings of others^{[S59][H90]} neither involved unsupervised NNs nor were about modeling data nor used gradient descent.^{[AC20]}) Bengio et al. neither cited the original work^{[AC90,90b][AC20]} nor corrected their erroneous claims^{[GAN1]} about my other adversarial technique, Predictability Minimization (1991).^{[PM1-2][AC20][R2][MIR](Sec. 5)} The dispute about their NIPS 2014 paper^{[GAN1]} and some of the erroneous claims it made about my prior work^{[AC20]} was even covered by Bloomberg.^{[AV1]} Goodfellow eventually admitted that PM is adversarial (his paper^{[GAN1]} still claims the opposite), but emphasized that it's not generative. However, the even earlier AC^{[AC90,90b][AC10][AC20]} is both adversarial and generative (its generator contains probabilistic units^{[AC90]} like in StyleGANs^{[GAN2]}). When the authors^{[GAN1]} did not publish a correction, I published one myself in the hopes of correcting the annals of history:^{[AC20]} GANs are instances of my earlier work.^{[R2][AC20]} Similarly, my student Sepp Hochreiter was the first to analyze the vanishing gradient problem;^{[MIR](Sec. 3)[VAN1]} Bengio later published his own analysis,^{[VAN2]} without citing Sepp.
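The zero-sum relation just described (one net's loss is the other net's gain) can be sketched as a toy gradient game. The sine "environment", the linear predictor, the learning rate, and the clipping below are all illustrative assumptions of mine, not the 1990 setup:

```python
import numpy as np

env = np.sin                # the mapping the predictor tries to model

a, b = 0.0, 0.0             # predictor p(x) = a*x + b
x = 0.5                     # generator's current output (one scalar "action")
lr = 0.02

for _ in range(300):
    err = (a * x + b) - env(x)        # prediction error e(x)
    # predictor: gradient DESCENT on e^2 w.r.t. its parameters (a, b)
    a -= lr * 2 * err * x
    b -= lr * 2 * err
    # generator: gradient ASCENT on the same e^2 w.r.t. its output x
    x += lr * 2 * err * (a - np.cos(x))
    x = float(np.clip(x, -3.0, 3.0))  # keep the toy game bounded
```

Both updates use the gradient of the same objective e^2, with opposite signs: the predictor shrinks the error wherever the generator currently probes, while the generator drifts toward outputs where the predictor still fails.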
The priority dispute was eventually settled in favor of Sepp.^{[VAN1]} However, even after a common publication,^{[VAN3]} Bengio published papers^{[VAN4][XAV]} that did not cite Sepp's original work. (Citation counts are poor indicators of truly pioneering work.^{[NAT1]}) (Margin note: Bengio states^{[YB20]} that in 2018 he was unaware of the relevant earlier work; but even if one unintentionally republishes prior results, one must at least clarify this later.^{[DLC]}) Bengio also claims^{[YB20]} credit for work of 1995, although my publications on exactly this topic date back to 1991-93.^{[UN0-2][UN]} The same holds for meta-learning, which I started in 1987,^{[META1][META]} long before Bengio, who nevertheless suggested that he did it before me.^{[R3]} Regarding attention-based Transformers,^{[TR1-6]} Bengio^{[DL3a]} cites his own team (2014) for "soft attention" without citing my much earlier original work of 1991-1993 on soft attention and linear Transformers.^{[FWP,FWP0-2,6]} Bengio has also heavily used our LSTM (see Sec. A-C), using the name "gated recurrent units (GRU)"^{[LSTMGRU]} for a variant of our vanilla LSTM architecture^{[LSTM2]} (2000) which he did not cite, although our work^{[LSTM2]} was the one that introduced gated recurrent units. In addition, our team automatically evolved lots of additional LSTM variants and topologies already in 2009^{[LSTM7]} without changing the name of the basic method. (GRUs can neither learn to count^{[LSTMGRU2]} nor learn simple non-regular languages;^{[LSTMGRU2]} they also perform worse according to Google Brain.^{[LSTMGRU3]}) ACM further mentions unsupervised pre-training for deep NNs.^{[UN0-4][HIN](Sec. II)[MIR](Sec. 1)} However, Hinton's paper^{[UN4]} (2006) appeared long after my earlier work on this,^{[UN0-2]} which yielded the first NNs shown to solve very deep problems (see Sec. II above).^{[UN]} It was published in 1991-92,^{[UN1]} when compute was about 1000 times more expensive than in 2006. LBH's survey (2015)^{[DL3][DLC]} did not clarify this either. See also Sec. II & III. The same holds for compressing or distilling one NN into another.^{[UN0-2][DIST1-2][MIR](Sec. 2)} Hinton^{[DIST2]} (2006) did not cite my much earlier original work on this (1991),^{[UN1][UN]} not even in his later patent application. The same is true for fast weight programmers^{[FWP][FWP0-4a]} through tensor-like outer products (1991-2016) and their motivation^{[FWP2][FWP4a][MIR](Sec. 8)} (see also Sec. XVI above), and for learning sequential attention with NNs.^{[MIR](Sec. 9)} Hinton^{[ATT3]} (2010) did not cite our much earlier work on this,^{[ATT1][ATT]} although he was both reviewer and editor of my summary^{[ATT2]} (1990; see Sec. XVI above).
The ten priority disputes mentioned in the present Sec. XVII are not the only ones.^{[R4]} Remarkably, three of them are related to the 1991 paper^{[UN1][UN]} which in many ways started what people now call deep learning, going beyond earlier work. Most of these disputes concern work of 1990-91.^{[MIR]} See Sec. I for additional related issues of credit assignment.
LeCun's team has made important contributions to CNNs since 1989.^{[CNN2,4]} However, the basic CNN architecture with convolutional and downsampling layers is actually due to Fukushima (1979).^{[CNN1]} NNs with convolutions were later (1987) combined by Waibel with weight sharing and backpropagation;^{[CNN1a]} Waibel called this architecture TDNN and applied it to speech. All of this happened before LeCun's work on CNNs. See Sec. D above and Sec. 21 of the overview of our Annus Mirabilis 1990-1991.^{[MIR]} Furthermore, at IJCNN 2011 in Silicon Valley, our DanNet^{[DAN][GPUCNN1-3]} won the vision contest through the first superhuman performance (the entry of LeCun's team showed three times worse performance).^{[DAN1]} Again see Sec. D. And at ICPR 2012, our DanNet^{[GPUCNN1-3]} won the medical imaging contest (Sept 2012, on detection of mitosis/cancer)^{[GPUCNN5,7,8]} (before the similar AlexNet won ImageNet 2012^{[GPUCNN5][R6]} and the similar VGG network^{[GPUCNN9]} won ImageNet 2014). DanNet's approach is now widely used for mitosis detection.^{[MGC][GPUCNN5,7,8]} Many major companies are using it now. See Sec. D & VII. ACM also explicitly mentions speech recognition and speech synthesis.^{[AM16][DL1]} All of these fields were heavily shaped in the 2010s by our non-CNN methods.^{[DL1][DL4][AM16][GSR][GSR15][GT16][WU][FB17]} See Sec. A, B, VI, XI.
As mentioned in Sec. XII, backpropagation was actually proposed earlier as a learning method for NNs by Werbos (1982)^{[BP2-4]} (see also Amari's work of 1977^{[BP6]}), yet LeCun did not clarify this even in recent work.^{[DL3,DL3a][DLC]} In 1960, Kelley already had a precursor of the algorithm.^{[BPA]} Furthermore, many besides LeCun have worked "to speed up backpropagation algorithms"^{[DL1]} (ACM's wording). More on the history of backpropagation can be found at Scholarpedia^{[DL2]} and in my overview.^{[BP4]}
However, "hierarchical feature representation" in deep learning networks is what Ivakhnenko & Lapa (1965)^{[DEEP1-2]} (and also Fukushima^{[CNN1][DL2]}) had long before LeCun. See Sec. D & II & XIII & V.
LeCun et al. neither cited the origins^{[BP1]} (1970) of this widely used type of automatic differentiation for differentiable networks of modules^{[DL2][BP4-5][DLC]} nor the early work on backpropagation for such systems.^{[S80]} See also Sec. XIX & XII. Others described such networks of modules before LeCun, who did not cite them. See also Pollack's even earlier relevant work.^{[PO87-90]}
(Furthermore, "complex networks of modules where backpropagation is performed" were the central theme of my much earlier habilitation thesis (1993).^{[UN2]} For example, our adaptive subgoal generators (1991)^{[HRL0-2]} were trained through end-to-end-differentiable chains of such modules.^{[MIR](Sec. 10)} planning and reinforcement learning with recurrent neural world models (1990).^{[PLAN][MIR](Sec. 11)} Same for my linear transformer-like fast weight programmers^{[FWP0-2][FWP][ATT][MIR](Sec. 8)} since 1991 (see Sec. XVI) see "100 Authors against Einstein."^{}[AH1] ad hominem attacks^{[AH2-3][HIN]} "If you cannot dispute a fact-based message, attack the messenger himself."^{[HIN]} award can ever change that.^{[HIN]} and their co-workers have contributed useful improvements of deep learning methods.^{[CNN2,4][CDI][LAN][RMSP][XAV][ATT14][CAPS]} whom they did not cite II, V, XII, XIX, XXI, XIII, XIV, XI, and XX, and 2). Sec. I, A, B, C, D, XVII, VI, and XVI). As emphasized earlier:^{[DLC][HIN]} to self-correction,"^{[SV20]} as is already the standard in other scientific fields. in popular science venues without peer review? For example, the narrator of a popular 2018 Bloomberg video^{[VID2]} Germany and Switzerland (LSTM & CTC; see Sec. A) long before Hinton's methods. Similarly, in 2016, the NY Times published an article^{[NYT3]} Google's original 2016 paper on Google Translate^{[WU]} mentions LSTM over 50 times (see Sec. B). In ad hominem style,^{[AH2-3]} claiming credit he doesn't deserve for many, many things",^{[NYT1]} without LeCun also called the GANs of Bengio's team^{[GAN1]} GANs are variations of my work in 1990.^{[AC90,90b][AC20][R2]} According to Bloomberg,^{[AV2]} Bengio has simply "denied my claims" without backing up his denial by any facts; see Sec. XVII. 
It is the duty of scientists to point out false claims "and forcefully contradict public figures who promote it."^{[FAKE]} This also applies to LBH, who called themselves the deep learning conspiracy.^{[DLC]} Note that our LSTM paper^{[LSTM1]} has got more citations than any paper by Bengio or LeCun,^{[R5]} and Hinton's most cited paper (2012) is the one on GPU-based CNNs.^{[GPUCNN4][R5]} It follows our earlier work on supervised deep NNs (2010)^{[MLP1]} (which showed that unsupervised pre-training for deep NNs, pioneered by myself^{[UN][UN0-3]} and later championed by Hinton,^{[UN4][VID1]} is not necessary; see Sec. D). Hinton (2012)^{[GPUCNN4]} cites our deep and fast DanNet (2011),^{[GPUCNN1-3]} which won four contests before AlexNet won one;^{[R6]} see Sec. D, XIV. The highly cited VGG network (2014)^{[GPUCNN9]} also followed this approach. Hinton's 2nd most cited paper is the one on backpropagation^{[RUM][R5]} (some count citations of Hinton's paper,^{[RUM]} adding citations for a book by Rumelhart & McClelland^{[R5]}). Backpropagation, however, is a previously invented method,^{[BP1]} and deep learning MLPs are due to Ivakhnenko, whom Hinton has never cited;^{[DEEP1-2][R7-R8]} see Sec. II, XIII. Bengio's 2nd most cited research paper is the one on GANs (2014),^{[GAN1]} which are instances of my artificial curiosity (1990)^{[AC90,90b][AC20][R2]} which he did not cite; see Sec. XVII. Hinton's highly cited papers on unsupervised pre-training for deep NNs (2006-)^{[UN4]} were preceded by ours,^{[UN0-2][UN]} and his dropout papers were preceded by Hanson's work.^{[Drop1-2]} As recently as of 2021, ACM published yet another misleading deep learning "survey" by LBH,^{[DL3a]} again heavily citing LBH without the inventors of the central methods. Consult the Executive Summary and Sec. I-XXI of this critique for more. So virtually all the algorithms that have attracted massive attention in the deep learning era have their conceptual and technical roots in my labs in Munich and Lugano,^{[MOST]} building on earlier foundations: deep learning MLPs since 1965^{[DEEP1-2]} (see Sec. II, XX), backpropagation (1960-70)^{[BPA][BP1]} (see Sec. XIX, XII), and convolutional NNs since 1979^{[CNN1-4]} (see Sec. XVIII, D). Our LSTM (1990s, see Sec. A, B; also for RL, 2003-, see Sec. C) → our Highway Net (May 2015) → ResNet (Dec 2015, see Sec. D).
Our adversarial Artificial Curiosity (1990) → GANs (2010s, see Sec. XVII). Our own unsupervised pre-training of deep NNs (1991, see Sec. II & III) was followed by purely supervised deep learning: for recurrent NNs in the 1990s → our LSTM (see Sec. A-C); for feedforward NNs in 2010 → our DanNet (2011) → AlexNet (2012); VGG Net (2014) (see Sec. D). Our LSTM brought essentially unlimited depth to supervised recurrent NNs in the 1990s; our Highway Nets^{[HW1-3]} brought it to feedforward NNs in May 2015.^{[MOST]} DanNet started superior computer vision (2011, see Sec. D, XVIII), medical diagnosis (2012, see Sec. VII, XVIII), and many other applications.^{[DEC]} LSTM started superior speech recognition (with our CTC, 2007-15, see Sec. A), machine translation (2016, see Sec. B), robotics & video game players (2018-19, see Sec. C), and many other applications.^{[DEC]} Our Fast Weight Programmers (1991, see Sec. XVI) are formally equivalent to linear Transformers (now popular in NLP). See Sec. I, A, B, C, D, VII, XVIII.
As mentioned earlier,^{[MIR](Sec. 21)} it is not always clear^{[DLC]} whether LBH were aware of the essential prior work. Ivakhnenko's nets of 1965 were the first with depth that really learned.^{[DEEP1-2][R8]} Five years later, modern backpropagation was published (1970).^{[BP1]}
Yes, this critique is also an implicit critique of certain other awards to LBH.^{[HIN]} Many of the points above were discussed in threads at reddit.com/r/MachineLearning^{[R1-R12]} (the largest machine learning forum, with back then over 800k subscribers), many of them influenced by my overview.^{[MIR]}
Dr. LeCun himself is well aware of the challenges to scientific integrity in our field:^{[LECP]} "... else cites."^{[LECP]}
Note that I am insisting on proper credit assignment not only in my own research field but also in quite disconnected areas,^{[HIN]} as demonstrated by my numerous letters in this regard published in Science and Nature, e.g., on the history of aviation,^{[NASC1-2]} the telephone,^{[NASC3]} the computer,^{[NASC4-7]} resilient robots,^{[NASC8]} and scientists of the 19th century.^{[NASC9]} One day, AI scientists and AI historians equipped with artificial curiosity^{[SA17][AC90-AC20][PP-PP2]} may help to set the record straight.
Thanks to many expert reviewers for useful comments. Since science is about self-correction, let me know under juergen@idsia.ch if you can spot any remaining error. Many additional relevant publications can be found in my publication page and my arXiv page. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. Our PDF. The first paper on planning with reinforcement learning recurrent neural networks (NNs) (more) and on generative adversarial networks (more). PDF. More. PDF. PDF. PDF. PDF. (More on artificial scientists and artists.) IEEE link. PDF. With a brief summary of the generative adversarial neural networks of 1990^{[AC90,90b][AC20]} (more). Preprint arXiv/1906.04493. Link. Link. [AIB] J. Schmidhuber. AI Blog. Includes variants of chapters of the AI Book. Blog of Werner Vogels, CTO of Amazon (Nov 2016): [ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. We had both hard attention (1990) and soft attention (1991-93).^{[FWP]} Today, both types are very popular. PDF. PDF. More. PS. (PDF.) arXiv/1409.0473, 2014-16. Bloomberg, May 15, 2018. Bloomberg, May 17, 2018. PDF. HTML. PDF. Precursor of modern backpropagation.^{[BP1-4]} PDF. Link. PDF. First application of backpropagation^{[BP1]} to NNs (concretizing thoughts in his 1974 thesis). [BP4] J. Schmidhuber (AI Blog, 2014; updated 2020). Who invented backpropagation? More.^{[DL2]} English version: [CNN1+]. More in Scholarpedia. Link. [CNN1a] A. Waibel. Phoneme Recognition Using Time-Delay Neural Networks. Meeting of IEICE, Tokyo, Japan, 1987. First application of backpropagation^{[BP1][BP2]} and weight-sharing PDF. Spatial Averaging.^{[CNN1]} PDF. PDF. PDF. PDF. Beijing, 2014. Preprint arXiv:1402.3511 [cs.NE]. J. Schmidhuber (AI Blog, 2021). 10-year anniversary. 
In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named 1st superhuman result in 2011.^{[DAN1]} J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition. our artificial neural network called DanNet [DEC] J. Schmidhuber (AI Blog, 02/20/2020, revised 2021). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The [DIST1] J. Schmidhuber, 1991.^{[UN-UN2]} More. Deep Learning. HTML. [DL3a] Y. Bengio, Y. LeCun, G. Hinton (2021). Turing Lecture: Deep Learning for AI. Communications of the ACM, July 2021. HTML. Local copy (HTML only). [DL4] J. Schmidhuber (AI Blog, 2017). Our impact on the world's most valuable public companies: Apple, Google, Microsoft, Facebook, Amazon... By greatly improved (CTC-based) on-device speech recognition (on the phone, not the server) LSTM. PDF. J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.^{[DL6]} Soon after its publication, everybody started talking about "deep learning." Causality or correlation? Web site deeplearning.net of Y. Bengio's MILA (2015, retrieved May 2020; compare the version in the Internet Archive), referring to Hinton's^{[UN4]} and Bengio's^{[UN5]} unsupervised pre-training for deep NNs^{[UN4]} (2006) although this type of deep learning dates back to 1991.^{[UN1-2][UN]} II & XVII & III. [DLC] J. Schmidhuber (AI Blog, June 2015). Critique of Paper by "Deep Learning Conspiracy" (Nature 521 p 436). arxiv:1312.5602. Link. Alphastar has a "deep LSTM core." arXiv:1808.03578, 2018. used LSTM over 4 billion automatic translations per day (The Verge, August 4, 2017); Facebook blog by J.M. Pino, A. Sidorov, N.F. Ayan (August 3, 2017) PDF. J. Schmidhuber (AI Blog, 26 March 2021). alternative^{[FWP0-1]} to recurrent NNs. 
the fast weights^{[FAST,FASTa]} of Such Fast Weight Programmers^{[FWP0-6,FWPMETA1-7]} can learn to memorize past data, e.g., by computing fast weight changes through additive outer products of self-invented activation patterns^{[FWP0-1]} (now often called keys and values for self-attention^{[TR1-6]}). The similar Transformers^{[TR1-2]} combine this with projections linear Transformers or Performers^{[TR5-6]} In 1993, I introduced the attention terminology^{[FWP2]} now used in this context,^{[ATT]} and RNNs that program themselves. PDF. PDF. HTML. Pictures (German). PDF. Preprint: arXiv:1811.12143. PDF. PDF. Like [FWP0-2]. Preprint: arXiv:2003.08165. PDF. HTML overview. Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021. Preprint: arXiv:2102.11174. Preprint: arXiv:2106.06295 (June 2021). PDF. An introspective network that can learn to run its own weight change algorithm. In Proc. of the Intl. Conf. on Artificial Neural Networks, J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF. can be found here. Preprint arXiv:2012.14905 [cs.LG], 2020. Report arXiv:2011.07831 [cs.AI], 2020. Google Research Blog, Sep 2015, see also Aug 2015 Google's speech recognition based on CTC and LSTM. Alphr Technology, Jul 2015, or 9to5google, Jul 2015 WIRED, Sep 2016, siliconANGLE, Sep 2016 Blog post, Internet Archive, 2010. A blog post describing the basic ideas^{[AC][AC90, AC90b][AC20]} of GANs. Description of GANs that does not cite the original work of 1990^{[AC][AC90,AC90b][AC20][R2]} (also containing wrong claims about Predictability Minimization^{[PM0-2][AC20]}). Link. This was number 1 on Hacker News. Frankfurter Allgemeine Zeitung, 16/6/2021. Preprint arXiv/2005.14165. for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011. PDF. ArXiv preprint. win four important computer vision competitions 2011-2012 before others won any PDF. HTML overview. competitor.^{[DAN1]} This led to massive interest from industry. 
[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More. PDF. J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet won 4 of them in a row before the similar AlexNet/VGG Net and the Resnet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision. PDF. PDF. First deep learner to win a medical imaging contest (2012). HTML. [HIN] J. Schmidhuber (AI Blog, 2020). Critique of Honda Prize for Dr. Hinton. Science must not allow corporate PR to distort the academic record. PDF. North-Holland, 1991. PDF. Extending TR FKI-129-90, TUM, 1990. PDF. PDF. Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS 2015. (Highway gates are derived from the LSTM with forget gates^{[LSTM2]} for RNNs.) Resnets^{[HW2]} are a version of this where the gates are always open: g(x)=t(x)=const=1. Highway Nets perform roughly as well as ResNets^{[HW2]} on ImageNet.^{[HW3]} Highway layers are also often used for natural language processing, where the simpler residual layers do not work as well.^{[HW3]} More. Link. arXiv:1512.03385 (Dec 2015). Residual nets are a version of Highway Nets.^{[HW1]} More. arxiv:1612.07771 (2016). Also at ICLR 2017. Preprint arXiv:1704.04760 PDF. PDF. arXiv:1607.06450, 2016. A New Publishing Model in Computer Science. Local copy (HTML only). [LEI21] J. Schmidhuber (AI Blog, 2021). 375th birthday of Leibniz, founder of computer science. Frankfurter Allgemeine Zeitung (FAZ), 17/5/2021. FAZ online: 19/5/2021. PDF. [LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF. Based on [LSTM0]. More. Preprint: arxiv:1506.07452. J.
Schmidhuber (AI Blog, Dec 2020). 10-year anniversary of our journal paper on deep reinforcement learning with policy gradients for LSTM (2007-2010). Recent PDF. Preprint arXiv:1805.04908. Architectures. Preprint arXiv:1703.03906 J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of Searchable PDF scan (created by OCRmypdf which uses LSTM). HTML. better GP methods through Meta-Evolution. More. [MIR] J. Schmidhuber (AI Blog, 2019). Deep Learning: Our Miraculous Year 1990-1991. Preprint arXiv:2005.05744, 2020. Computation 22(12): 3207-3220, 2010. ArXiv Preprint. (AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training. By 2010, when compute was 100 times more expensive than today, both our feedforward NNs^{[MLP1]} J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in my labs at TU Munich and IDSIA. Here I mention: (1) Long Short-Term Memory (LSTM), (2) ResNet (which is our earlier Highway Net with open gates), (3) AlexNet and VGG Net (both citing our similar earlier DanNet: the first deep convolutional NN to win image recognition competitions), Adversarial Artificial Curiosity), and (5) variants of Transformers (linear Transformers are formally equivalent to my earlier Fast Weight Programmers). Annus Mirabilis of 1990-1991.^{[MIR]} Preprint arXiv:1611.01578 (PDF), 2017. [NASC1] J. Schmidhuber. First Pow(d)ered flight / plane truth. Correspondence, Nature, 421 p 689, Feb 2003. [NASC3] J. Schmidhuber. The last inventor of the telephone. Letter, Science, 319, no. 5871, p. 1759, March 2008. Correspondence, Nature, vol 483, p 541, March 2012, doi:10.1038/483541b. Letter, Science, vol 336, p 1639, June 2012. See also comment on response by A. Hodges (DOI:10.1126/science.336.6089.1639-a) [NASC6] J. Schmidhuber. Colossus was the first electronic digital computer. Correspondence, Nature, 441 p 25, May 2006. [NASC7] J. 
Schmidhuber. Turing's impact. Correspondence, Nature, 429 p 501, June 2004 [NASC8] J. Schmidhuber. Prototype resilient, self-modeling robots. Correspondence, Science, 316, no. 5825 p 688, May 2007. [NASC9] J. Schmidhuber. Comparing the legacies of Gauss, Pasteur, Darwin. Correspondence, Nature, vol 452, p 530, April 2008. HTML. Link. NY Times article NY Times article Learning Dexterous In-Hand Manipulation. arxiv:1312.5602 (PDF). arxiv:1912.06680. An LSTM composes 84% of the model's total parameter count. 2018. An LSTM with 84% of the model's total parameter count was the core of OpenAI Five. PDF. HTML. J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, the GAN principle. Based on TR FKI-126-90 (1990).^{[AC90]} More. PDF. Partially based on TR FKI-126-90 (1990).^{[AC90]} Report arXiv:1210.0118 [cs.AI], 2015. One Big Net For Everything. Preprint arXiv:1802.08864 [cs.AI], Feb 2018. Preprint: arXiv:1809.01999. Github: World Models. minimization. TR CU-CS-565-91, Univ. Colorado at Boulder, 1991. PDF. More. 1991. PDF. More. PDF. More. arXiv:1112.5309 [cs.AI] First Experiments with PowerPlay. arXiv:1210.8385 [cs.AI]. [R1] Reddit/ML, 2019. Hinton, LeCun, Bengio receive ACM Turing Award. [R2] Reddit/ML, 2019. J. Schmidhuber really had GANs in 1990. [R3] Reddit/ML, 2019. NeurIPS 2019 Bengio Schmidhuber Meta-Learning Fiasco. [R4] Reddit/ML, 2019. Five major deep learning papers by G. Hinton did not cite similar earlier work by J. Schmidhuber. [R5] Reddit/ML, 2019. The 1997 LSTM paper by Hochreiter & Schmidhuber has become the most cited deep learning research paper of the 20th century. [R6] Reddit/ML, 2019. DanNet, the CUDA CNN of Dan Ciresan in J. Schmidhuber's team, won 4 image recognition challenges prior to AlexNet. [R7] Reddit/ML, 2019. J.
Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970. [R8] Reddit/ML, 2019. J. Schmidhuber on Alexey Ivakhnenko, godfather of deep learning 1965. [R9] Reddit/ML, 2019. We [R11] Reddit/ML, 2020. Schmidhuber: Critique of Honda Prize for Dr. Hinton [R12] Reddit/ML, 2020. J. Schmidhuber: Critique of Turing Award for Drs. Bengio & Hinton & LeCun [R15] Reddit/ML, 2021. J. Schmidhuber's work on fast weights from 1991 is similar to linearized variants of Transformers Preprint arXiv/1311.2524, Nov 2013. Preprint arXiv/1703.06870, 2017. PDF. This experimental analysis of backpropagation did not cite the origin of the method,^{[BP1-4]} also known as the reverse mode of automatic differentiation. Link. The Past, Present and Future of Artificial Intelligence. PDF. PDF. ACM's justification of the 2018 A.M. Turing Award (announced in 2019). WWW link. Local copy 1 (HTML only). Local copy 2 (HTML only). [T20a] J. Schmidhuber (AI Blog, 25 June 2020). Critique of 2018 Turing Award for Drs. Bengio & Hinton & LeCun. The first version of the present critique. Link. [TUR21] J. Schmidhuber (AI Blog, Sep 2021). Turing Oversold. It's not Turing's fault, though. J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised pre-training. Unsupervised PDF. 1992. Based on TR FKI-148-91, TUM, 1991.^{[UN0]} PDF. approaches are now widely used. More. [UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF. can be found here (depth > 1000). 2006. PDF. Link. [VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF. More on the Fundamental Deep Learning Problem. PDF. [VAN4] Y. Bengio. Neural net language models. Scholarpedia, 3(1):3881, 2008. Link. Link. Youtube video [see 28:16]. But in 2010, our team showed^{[MLP1-2]} unsupervised pre-training is not necessary Youtube video, 2018. Preprint arXiv:1609.08144 (PDF), 2016. Based on LSTM which it mentions at least 50 times. 
WWW link (retrieved 15 May 2020). Local copy (plain HTML only). a general, practical, program-controlled computer. PDF. J. Schmidhuber (AI Blog, 2021). 80th anniversary celebrations: 1941: Konrad Zuse completes the first working general computer, based on his 1936 patent application.
Traditionally this is done with recurrent NNs (RNNs)
published.^{[FWP0-1]}
the fast weights of
another NN (see Sec. 1).
In 1991, one of them^{[FWP0-1]}
(now often called keys and values for self-attention; Sec. 2).
The very similar Transformers^{[TR1-2]} combine this with projections
Transformers with linearized self-attention^{[TR5-6]}
to the 1991 Fast Weight Programmers^{[MOST]} (see this tweet).
In 1993, I also introduced
the attention terminology^{[FWP2]} now used
in this context^{[ATT]} (Sec. 4), and
RNNs that program themselves
(Sec. 3).
famous vanishing gradient problem, a.k.a. the fundamental deep learning problem (analyzed a few months later in 1991^{[VAN1]}),
through additive fast weight changes (Sec. 5).
additive neural activations of LSTMs / Highway Nets / ResNets^{[HW1-3]} (Sec. 5)
Annus Mirabilis of deep learning.^{[MIR]}
brand new, improved version^{[FWP6]} of
the 1991 fast weight update rule (Sec. 6).
reinforcement learning through neuroevolution^{[FWP5]} (2005-, Sec. 7),
goal-conditioned policy generators (2022),^{[GGP]}
metalearning machines that learn to learn^{[FWPMETA1-9]}
(1992-2022, Sec. 8).
As I have frequently emphasized since 1990,^{[AC90][PLAN][META]} inspired by universal self-referential formal systems,^{[GOD][GOD34]} I built NNs whose outputs are changes of programs or weight matrices of other NNs^{[FWP0-2]} (Sec. 1, 2, 3), and even of their own weight change algorithms or learning algorithms^{[FWPMETA1-5]} (Sec. 8). A gradient descent procedure^{[BP1-4][BPA][R7]} can compute a direction in program space where one may find a better program,^{[AC90]} or a better program-modifying program.^{[FWP0-2][FWPMETA1-5]}

Deep learning itself started in 1965 with networks of many layers.^{[DEEP1-2]} Their activation functions were Kolmogorov-Gabor polynomials which include the now popular multiplicative gates,^{[DL1-2]} a building block of fast weights.

von der Malsburg was the first to explicitly emphasize the importance of NNs with rapidly changing weights.^{[FAST]} The second paper on this was published by Feldman in 1982.^{[FASTa]} The weights of a 1987 NN were sums of weights with a large learning rate and weights with a small rate^{[FASTb][T22]} (but they have nothing to do with the NN-programming NNs discussed below). Fast Weight Programmers (FWPs) were published in 1991-93^{[FWP0-2]} (Sec. 1, 2, 3, 4). They anticipated today's attention^{[ATT]} (Sec. 4) and Transformers^{[TR1-6]} (Sec. 2, 3, 4, 5).
On 26 March 1991, I described a slow NN that learns by backpropagation^{[BP1-4]} to rapidly modify the fast weights of another NN,^{[FWP0]} a system later also published in Neural Computation.^{[FWP1]} This can be viewed as a form of attention^{[ATT]} (Sec. 4). That is, I separated storage and control like in traditional computers, but in a fully neural way (rather than in a hybrid fashion^{[PDA1][PDA2][DNC]}). Compare also Synthetic Gradients.^{[NAN1-5]} One of the FWPs of 1991^{[FWP0-1]} is illustrated in the figure. A disadvantage addressed in Sec. 2 is that the slow net needs many output units if the fast net is large.
The Fast Weight Programmer^{[FWP0-1]} depicted in Sec. 1 has a slow net unit for each fast weight. However, Section 2 of the same 1991 paper^{[FWP0]} describes a more efficient alternative, now viewed as linear^{[TR5-6]} Transformers^{[TR1-2]} or attention^{[ATT]} (compare Sec. 4): outer products of self-invented patterns are added to the fast weight (which then may be normalized by a squashing function^{[FWP0]}). These additive outer products are second order tensor products^{[FWP0-3a]} (compare linear Transformers).^{[FWP6][TR5-6]} The highly successful Transformers of 2017^{[TR1-2]} can be viewed as a combination of my additive outer product fast weight principle^{[FWP0-2]} and NN-programmed fast weights (Sec. 5 & 1). Linear Transformers (2020-21)^{[TR5-6]} abandoned the softmax, essentially resurrecting the original 1991 system.^{[FWP0-1]} Compare Sec. 6. Outer products themselves go back at least to Hebb's informal rule (1949)^{[HE49]} and Steinbuch's Learning Matrix around 1960.^{[ST61-63][AMH1-2][KOH72][LIT74][PAL80][KOS88]} However, they have programmed the weights of NNs only since 1991.^{[FWP0-3a][TR5-6]} I offered the FWPs of 1991^{[FWP0-1]} as an alternative to sequence-processing recurrent NNs (RNNs) (Sec. 1), the computationally most powerful NNs of them all.^{[UN][MIR](Sec. 0)} Modern Transformers are also viewed as RNN alternatives, despite their limitations.^{[TR3-4]} The slow net and the fast net of the 1991 system^{[FWP0-1]} in Sec. 2 were feedforward NNs (FNNs), like most current Transformers.^{[TR1-6]} In 1993, I collapsed all of this into a single RNN that could rapidly reprogram all of its own fast weights through additive outer product-based weight changes.^{[FWP2]} One motivation reflected by the title of the paper^{[FWP2]} was to get many more adaptive weights than an RNN of the same size: O(H^{2}) instead of O(H), where H is the number of hidden units. This motivation and a variant of the method was republished over two decades later.^{[FWP4a][R4][MIR](Sec. 8)[T22](Sec. XVII, item H3)} See also our more recent work on FWPs since 2017,^{[FWP3-3a][FWPMETA7][FWP6]} and compare a recent study.^{[RA21]} 4.
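The additive outer-product programming instruction described above can be sketched in a few lines. This is a toy illustration under my own assumptions (tiny dimensions, hard-coded key/value patterns standing in for learned slow-net outputs), not the 1991 system itself:

```python
# Minimal sketch of an additive outer-product fast weight update: a slow net
# emits a key pattern and a value pattern, their outer product is ADDED to the
# fast weight matrix, and a later query retrieves the stored association.
# All patterns below are illustrative assumptions, not learned outputs.

def outer(u, v):
    """Outer product of two vectors as a list-of-lists matrix."""
    return [[ui * vj for vj in v] for ui in u]

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_vec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

H = 3
F = [[0.0] * H for _ in range(H)]   # fast weight matrix starts at zero

value = [1.0, 0.0, 2.0]             # pattern to store
key = [0.0, 1.0, 0.0]               # pattern to store it under
F = mat_add(F, outer(value, key))   # additive weight change

query = [0.0, 1.0, 0.0]             # matches the key
print(mat_vec(F, query))            # recovers the value pattern
```

Applying the fast net to a query that matches the key returns the stored value pattern; queries orthogonal to all keys return zero.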
Attention terminology of 1993. Today, everybody is talking about attention when it comes to describing the principles of Transformers.^{[TR1-2]} The additive outer products^{[FWP0-1]} of the Fast Weight Programmers described in Sec. 2 and Sec. 3 play exactly this role. Similarly, the attention weights or self-attention weights (see also^{[FWP4b-d]}) correspond to NN-programmed fast weights (Sec. 5).^{[FWP0-1], Sec. 9 & Sec. 8 of [MIR], Sec. XVII of [T22]} It was my 1993 paper^{[FWP2]} which introduced the terminology of internal spotlights of attention for Fast Weight Programmers.^{[FWP2][ATT]} Apart from possible normalization/squashing,^{[FWP0]} the fast weight changes are additive (Sec. 1 & 2). Hence FWPs do not suffer during sequence learning from the famous vanishing gradient problem analyzed by my brilliant student Sepp Hochreiter a few months later in his 1991 diploma thesis.^{[VAN1]}
and both of them dating back to 1991, our miraculous year of deep learning.^{[MIR]} Basic Long Short-Term Memory^{[LSTM1]} solves the problem by adding new values to its cell state at every time step. That is, the core of LSTM is operating in a linear additive activation space (ignoring LSTM's multiplicative gates).^{[LSTM1][VAN1][MIR](Sec. 4 & Sec. 8)} Additive FWPs^{[FWP0-2]} (Sec. 1 & 2), however, solve the problem through a dual approach based on additive fast weight changes. By favoring additive operations yielding non-vanishing first derivatives and error flow,^{[VAN1]} Transformers^{[TR1-6]} also follow the additive approach^{[FWP0-2]} (compare Sec. 2 and Sec. 4 on attention terminology since 1993).
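Why additive operations help can be shown with a toy calculation of my own (not from the text), contrasting a purely multiplicative recurrence with an additive one:

```python
# Toy illustration of the vanishing gradient problem and its additive fix.
# Multiplicative recurrence: s_t = w * s_{t-1}. The derivative of the final
# state w.r.t. the initial state is w**T, which vanishes for |w| < 1.
# Additive recurrence: s_t = s_{t-1} + x_t (as in LSTM's cell state or
# additive fast weight changes). The derivative of the final state w.r.t.
# any earlier input is exactly 1: constant error flow.

T, w = 50, 0.5                 # 50 time steps, recurrent weight 0.5

grad_mult = w ** T             # multiplicative path: ~8.9e-16, vanished
grad_add = 1.0                 # additive path: non-vanishing by construction

print(grad_mult, grad_add)
```

With |w| > 1 the multiplicative path would instead explode exponentially; the additive path stays at 1 in both regimes.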
The additive approach of LSTM^{[LSTM1-13]} is mirrored in the LSTM-inspired Highway Network (May 2015),^{[HW1][HW1a][HW3]} the first working really deep feedforward NN. It is essentially a feedforward version of LSTM^{[LSTM1]} with forget gates.^{[LSTM2]} The same holds for the Residual Net or ResNet^{[HW2]} (Dec 2015). Remarkably, both of these dual approaches of 1991 have become successful. By the mid 2010s,^{[DEC]} major IT companies overwhelmingly used LSTM, e.g., for speech recognition on smartphones.^{[DL4]} LSTM networks also rapidly learn certain tasks^{[LSTM13]} that plain Transformers can't yet learn.^{[TR4]} Unsupervised pre-training of deep NNs^{[UN0-UN2][MIR](Sec. 1)} also dates back to 1991.^{[UN]} Recent work of February 2021^{[FWP6]} formally connects attention mechanisms^{[TR5-6]} and Fast Weight Programmers^{[FWP0-2]} (Sec. 2),^{[FWP4a][R4][MIR](Sec. 8)[T22](Sec. XVII, item H3)} including linearized variants.^{[TR5-6]} Building on previous work^{[FWPMETA7]} on FWPs (Sec. 1, 2, 3, 8), we replace the 1991 elementary programming instruction based on additive outer products^{[FWP0-2]} by a delta rule-like^{[WID]} programming instruction, improving performance on language modeling tasks.^{[FWP6]} Our code is public. Subsequent work of June 2021^{[FWP7]} (also with Robert Csordas) points out that the original FWP formulation of 1991^{[FWP0-1]} is more general than the one of linear Transformers: a slow NN continually reprograms the weights of a fast NN. Our code is public.
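The delta rule-like programming instruction mentioned above can be sketched as follows. This is my own toy illustration of the idea (fixed write strength beta, hand-set patterns), not the published implementation:

```python
# Sketch of a delta-rule-like fast weight update: before writing, retrieve
# the value currently associated with key k, and add only the (scaled)
# difference between the new value and the old one. Unlike the purely
# additive 1991 instruction, this lets the programmer cleanly OVERWRITE
# an association instead of accumulating interference.

def mat_vec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def delta_update(F, k, v, beta):
    v_old = mat_vec(F, k)                    # current association for key k
    return [[F[i][j] + beta * (v[i] - v_old[i]) * k[j]
             for j in range(len(k))] for i in range(len(v))]

F = [[0.0, 0.0], [0.0, 0.0]]
k = [1.0, 0.0]
F = delta_update(F, k, [3.0, 4.0], beta=1.0)   # write [3, 4] under k
F = delta_update(F, k, [5.0, 6.0], beta=1.0)   # overwrite with [5, 6]
print(mat_vec(F, k))                           # [5.0, 6.0], no residue of [3, 4]
```

With the plain additive rule, the second write would have yielded the sum [8.0, 10.0] instead of a clean replacement.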
as shown in 2005 with my former postdoc Faustino Gomez^{[FWP5]} (now CEO of NNAISENSE). Our 2005 paper on deep RL^{[DL6,6a]} was actually the first machine learning publication with the word combination "learn deep" in the title.^{[T22]} We also generated the numerous weights of large NNs through very compact codes.^{[KO0-2][CO1-4]} Here we exploited that the Kolmogorov complexity or algorithmic information content of successful huge NNs may actually be rather small. Our Compressed Network Search^{[CO2]} did not require unsupervised pre-training.
Recent work of 2022^{[GGP]} with
My first work on metalearning machines that learn to learn was published in 1987.^{[META][R3]} It addressed metalearning in a very general way. In references^{[FWPMETA1-5]} since 1992, the slow NN and the fast NN (Sec. 1) are recurrent and identical. The RNN can see its own errors or reward signals called eval(t+1) in the image.^{[FWPMETA5]}
The 1993 FWP of Sec. 3^{[FWP2]} also was an RNN. Unlike the self-referential RNN above,^{[FWPMETA1-5]} it used outer products between key patterns and value patterns (Sec. 2) to manipulate its fast weights. Other work used gradient descent in LSTM networks^{[LSTM1]} instead of traditional functions of two variables^{[HO1]} (more on LSTM and fast weights in Sec. 5). In 2020, Imanol Schlag et al. augmented an LSTM with an associative fast weight memory,^{[FWPMETA7]} which helps in partially observable environments.^{[FWPMETA7]} Our recent MetaGenRL (2020)^{[METARL10]} meta-learns learning algorithms; see the blog post of my PhD student Louis Kirsch. His VS-ML uses outer-product-like fast weights encoded in the activations of LSTMs,^{[FWPMETA6]} instead of traditional functions of two variables^{[FWP2]} (Sec. 3). VS-ML can also learn to implement the backpropagation learning algorithm^{[BP1-4]} purely in the end-to-end differentiable forward dynamics of RNNs.^{[FWPMETA6]}
In 2022, we also published at ICML a modern self-referential weight matrix (SRWM)^{[FWPMETA8]} based on the 1992 SRWM.^{[FWPMETA1-5]}
self-improvement (compare this tweet).
There is another version of this article
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. Our
PDF.
The first paper on long-term planning with reinforcement learning recurrent neural networks (NNs) (more) and on generative adversarial networks
(more).
PDF.
First publication of what was later sometimes called the Hopfield network^{[AMH2]} or Amari-Hopfield Network.
The Hopfield network or Amari-Hopfield Network was published in 1972 by Amari.^{[AMH1]}
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. Schmidhuber
Transformers with linearized self-attention (1991-93).^{[FWP]} Today, both types are very popular.
PDF.
PDF.
More.
PS. (PDF.)
Precursor of modern backpropagation.^{[BP1-4]}
PDF.
Link.
PDF.
First application of backpropagation^{[BP1]} to NNs (concretizing thoughts in his 1974 thesis).
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020).
Who invented backpropagation?
More.^{[DL2]}
PDF.
PDF.
PDF.
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, updated 2021, 2022). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The
More.
Deep Learning.
[DL4] J. Schmidhuber (AI Blog, 2017).
Our impact on the world's most valuable public companies: Apple, Google, Microsoft, Facebook, Amazon... By
greatly improved (CTC-based) on-device speech recognition (on the phone, not the server), based on LSTM.
PDF.
J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.^{[DL6]} Soon after its publication, everybody started talking about "deep learning." Causality or correlation?
neural networks learning to control dynamic external memories.^{[PDA1-2][FWP0-1]}
J. Schmidhuber (AI Blog, 26 March 2021, updated 2022).
alternative^{[FWP0-1]} to recurrent NNs.
the fast weights^{[FAST,FASTa]} of
Such Fast Weight Programmers^{[FWP0-6,FWPMETA1-8]} can learn to memorize past data, e.g.,
by computing fast weight changes through additive outer products of self-invented activation patterns^{[FWP0-1]}
(now often called keys and values for self-attention^{[TR1-6]}).
The similar Transformers^{[TR1-2]} combine this with projections
Transformers with linearized self-attention^{[TR5-6]}
In 1993, he introduced
the attention terminology^{[FWP2]} now used
in this context,^{[ATT]} and
RNNs that program themselves.
See tweet of 2022.
PDF.
"Transformer with linearized self-attention."^{[FWP]}
PDF.
HTML.
Pictures (German).
See tweet of 2022 for 30-year anniversary.
PDF.
Preprint: arXiv:1811.12143. PDF.
PDF.
Preprint: arXiv:2003.08165.
PDF.
HTML overview.
Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021. Preprint: arXiv:2102.11174.
Preprint: arXiv:2106.06295 (June 2021).
PDF.
PDF.
An introspective network that can learn to run its own weight change algorithm. In Proc. of the Intl. Conf. on Artificial Neural Networks,
J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF.
can be found here.
Preprint arXiv:2012.14905 [cs.LG], 2020.
Report arXiv:2011.07831 [cs.AI], 2020.
Preprint: arXiv:2202.05780.
Preprint arXiv/2207.01570, 4 July 2022 (submitted in May 2022).
Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS 2015. The
LSTM with forget gates^{[LSTM2]} for RNNs.) ResNets^{[HW2]} are a version of this where the gates are always open: g(x)=t(x)=const=1.
Highway Nets perform roughly as well as ResNets^{[HW2]} on ImageNet.^{[HW3]} Variants of highway gates are used for certain algorithmic tasks, where the simpler residual layers do not work as well.^{[NDR]} More.
Link.
arXiv:1512.03385
(Dec 2015). Residual nets are a version of Highway Nets^{[HW1]}
More.
arxiv:1612.07771 (2016). Also at ICLR 2017.
PDF.
PDF.
PDF.
[LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF.
More.
PDF.
PDF.
J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of
Searchable PDF scan (created by OCRmypdf which uses
LSTM).
HTML.
better GP methods through Meta-Evolution. More.
[MIR] J. Schmidhuber (AI Blog, 2019). Deep Learning: Our Miraculous Year 1990-1991. Preprint
arXiv:2005.05744, 2020.
J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in my labs at TU Munich and IDSIA. Here I mention: (1) Long Short-Term Memory (LSTM), (2) ResNet (which is our earlier Highway Net with open gates), (3) AlexNet and VGG Net (both building on our similar earlier DanNet: the first deep convolutional NN to win
image recognition competitions),
(4) GAN (an instance of my earlier Adversarial Artificial Curiosity), and (5) variants of Transformers (Transformers with linearized self-attention are formally equivalent to my earlier Fast Weight Programmers).
Annus Mirabilis of 1990-1991.^{[MIR]}
PDF.
PDF.
Preprint arXiv:1608.05343, 2016.
The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization. Proc. ICLR 2022. Preprint arXiv/2110.07732.
the 1991 publication on what's now called "Transformers with linearized self-attention."^{[FWP0-6][TR5-6]}
attention terminology in 1993.^{[ATT][FWP2][R4]}
See tweet of 2022 for 30-year anniversary.
J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, the GAN principle
[R3] Reddit/ML, 2019. NeurIPS 2019 Bengio Schmidhuber Meta-Learning Fiasco.
[R4] Reddit/ML, 2019. Five major deep learning papers by G. Hinton did not cite similar earlier work by J. Schmidhuber.
[R7] Reddit/ML, 2019. J. Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970.
[T22] J. Schmidhuber (AI Blog, 2022).
Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21, IDSIA, Lugano, Switzerland, 2022.
PDF.
J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised or self-supervised pre-training. Unsupervised
PDF.
1992. Based on TR FKI-148-91, TUM, 1991.^{[UN0]} PDF.
approaches are now widely used. More.
[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF.
can be found here (depth > 1000).
[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 15 June 1991 (advisor J. Schmidhuber). PDF.
https://people.idsia.ch/~juergen/deep-learning-history.html
arXiv:2212.11279
is dominated by artificial neural networks (NNs) and deep learning,^{[DL1-4]}
hyperlinks to relevant overview sites from my AI Blog. It also debunks certain popular but misleading historic accounts of deep learning, and supplements my previous
deep learning survey^{[DL1]}
mentioning my own team's work, because (as of 2022) the most cited NNs are based on it.^{[MOST]}
Sec. 1: Introduction
Sec. 2: 1676: The Chain Rule For Backward Credit Assignment
Sec. 3: Circa 1800: First Neural Net (NN) / Linear Regression / Shallow Learning
Sec. 4: 1920-1925: First Recurrent NN (RNN) Architecture. ~1972: First Learning RNNs
Sec. 5: 1958: Multilayer Feedforward NN (without Deep Learning)
Sec. 6: 1965: First Deep Learning
Sec. 7: 1967-68: Deep Learning by Stochastic Gradient Descent
Sec. 8: 1970: Backpropagation. 1982: For NNs. 1960: Precursor.
Sec. 9: 1979: First Deep Convolutional NN (1969: Rectified Linear Units)
Sec. 10: 1980s-90s: Graph NNs / Stochastic Delta Rule (Dropout) / More RNNs / Etc
Sec. 11: Feb 1990: Generative Adversarial Networks / Artificial Curiosity / NN Online Planners
Sec. 12: April 1990: NNs Learn to Generate Subgoals / Work on Command
Sec. 13: March 1991: NNs Learn to Program NNs. Transformers with Linearized Self-Attention
Sec. 14: April 1991: Deep Learning by Self-Supervised Pre-Training. Distilling NNs
Sec. 15: June 1991: Fundamental Deep Learning Problem: Vanishing/Exploding Gradients
Sec. 16: June 1991: Roots of Long Short-Term Memory / Highway Nets / ResNets
Sec. 17: 1980s-: NNs for Learning to Act Without a Teacher
Sec. 18: It's the Hardware, Stupid!
Sec. 19: But Don't Neglect the Theory of AI (Since 1931) and Computer Science
Sec. 20: The Broader Historic Context from Big Bang to Far Future
Sec. 21: Acknowledgments
Sec. 22: 555+ Partially Annotated References (many more in the award-winning survey^{[DL1]})
quite erroneous ideas about the origins of the universe (see the final section).
A history of AI written in the 1980s would have emphasized topics such as theorem proving,^{[GOD][GOD34][ZU48][NS56]} logic programming, expert systems, and heuristic search.^{[FEI63,83][LEN83]} AI is an old area of research seeing renewed interest. Practical AI dates back at least to 1914, when Leonardo Torres y Quevedo (see below) built the first working chess end game player,^{[BRU1-4]} and theoretical AI to the 1930s, when fundamental limits were identified for any type of computation-based AI.^{[GOD][BIB3][GOD21,a,b]}

A history of AI written in the early 2000s would have placed more emphasis on topics such as support vector machines and kernel methods,^{[SVM1-4]} Bayesian (actually Laplacian or possibly Saundersonian^{[STI83-85]}) reasoning^{[BAY1-8][FI22]} and other concepts of probability theory and statistics,^{[MM1-5][NIL98][RUS95]} decision trees,^{e.g.,[MIT97]} ensemble methods,^{[ENS1-4]} swarm intelligence,^{[SW1]} and evolutionary computation.^{[EVO1-7]([TUR1],unpublished)} Why? Because back then such techniques drove many successful AI applications.
A history of AI written in the 2020s must emphasize concepts such as the even older chain rule^{[LEI07]} and deep nonlinear artificial neural networks (NNs) trained by gradient descent,^{[GD']} in particular, feedback-based recurrent networks, which are general computers whose programs are weight matrices.^{[AC90]} Why? Because many of the most famous and most commercial recent AI applications depend on them.^{[DL4]}

Such NNs were already discussed at the MACY conferences (1946-1953)^{[MACY51]} and the 1951 Paris conference on calculating machines and human thought, now often viewed as the first conference on AI.^{[AI51][BRO21][BRU4]} Today, the term "AI" is largely associated with modern AI based on "deep learning" with NNs,^{[DL1-2][DEC]} which learn to recognize patterns, minimize pain, maximize pleasure, drive cars, etc.^{[MIR](Sec. 0)[DL1-4]}
The present piece also debunks a frequently repeated, misleading "history of deep learning"^{[S20][DL3,3a]} which ignores most of the pioneering work mentioned below.^{[T22]} See Footnote 6. The title image of the present article is a reaction to an erroneous piece of common knowledge which says^{[T19]} that the use of NNs "as a tool to help computers recognize patterns and simulate human intelligence had been introduced in the 1980s," although such NNs appeared long before the 1980s.^{[T22]} on the history of aviation,^{[NASC1-2]} the telephone,^{[NASC3]} the computer,^{[NASC4-7]} resilient robots,^{[NASC8]} and scientists of the 19th century.^{[NASC9]} Finally,
In 1676, Gottfried Wilhelm Leibniz published the chain rule, which also appeared in L'Hopital's 1696 textbook on Leibniz' differential calculus.^{[LEI07-10][L84]}
This answer is used by the technique of gradient descent (GD), apparently first proposed by Augustin-Louis Cauchy in 1847^{[GD']} (and much later by Jacques Hadamard^{[GD'']}; the stochastic version called SGD is due to Herbert Robbins and Sutton Monro (1951)^{[STO51-52]}).
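Gradient descent in Cauchy's sense can be sketched in a few lines; the one-dimensional quadratic, learning rate, and step count below are my own illustrative choices (the stochastic version of Robbins & Monro would use noisy gradient estimates instead):

```python
# Minimal gradient descent on f(x) = (x - 3)^2: repeatedly step against
# the gradient f'(x) = 2(x - 3) until the minimum at x = 3 is approached.

def grad(x):
    return 2.0 * (x - 3.0)   # derivative of f(x) = (x - 3)^2

x, lr = 0.0, 0.1             # start point and learning rate (illustrative)
for _ in range(100):
    x -= lr * grad(x)        # the descent step

print(x)                     # close to the minimizer x = 3
```

Each step contracts the distance to the minimum by the factor (1 - 2*lr), so convergence here is geometric.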
Footnote 1. In 1684, Leibniz was also the first to publish "modern" calculus;^{[L84][SON18][MAD05][LEI21,a,b]} later Isaac Newton was also credited for his unpublished work.^{[SON18]} Their priority dispute,^{[SON18]} however, did not encompass the chain rule.^{[LEI07-10]} Of course, both were building on earlier work: in the 2nd century B.C., Archimedes (perhaps the greatest scientist ever^{[ARC06]}) paved the way for infinitesimals Sangamagrama and colleagues of the Indian Kerala school.^{[MAD86-05]} "the world's first computer scientist"^{[LA14]}) also laid foundations of modern computer science. He and the first with an internal memory.^{[BL16]} He described the principles of binary computers (1679)^{[L79][L03][LA14][HO66][LEI21,a,b]} His formal Algebra of Thought (1686)^{[L86][WI48]} was deductively equivalent^{[LE18]} to the much later Boolean Algebra (1847).^{[BOO]} all possible questions through computation;^{[WI48]}
Footnote 3. Some claim that the backpropagation algorithm (discussed further down; now widely used to train deep NNs) is just the chain rule of Leibniz (1676) & L'Hopital (1696).^{[CONN21]} doing this).^{[T22]} It was not published until 1970, as discussed below.^{[BP1,4,5]}
In 1805, Adrien-Marie Legendre published what's now often called a linear neural network (NN). Later Johann Carl Friedrich Gauss was also credited for earlier unpublished work on this done circa 1795.^{[STI81]}
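In modern terms, Legendre's method fits the weights of a one-layer linear net by least squares. A minimal sketch on toy data of my own choosing:

```python
# Least squares fit of y = a*x + b, the "shallow learning" of circa 1800.
# The closed-form solution minimizes the sum of squared errors.

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]          # generated by y = 2x + 1 (toy data)

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n  # means of inputs and targets

# Closed-form least-squares slope and intercept:
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
b = my - a * mx

print(a, b)                        # recovers slope 2.0 and intercept 1.0
```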
Rosenblatt's perceptron (1958)^{[R58]} combined a linear NN as above with an output threshold function to obtain a pattern classifier (compare his more advanced work on multi-layer networks discussed below). See also Joseph's related work.^{[R61]} Widrow & Hoff's similar Adaline learned in 1962.^{[WID62]}
The first recurrent NN (RNN) architecture, the Lenz-Ising model, was introduced and analyzed by physicists Ernst Ising and Wilhelm Lenz in the 1920s.^{[L20][I24,I25][K41][W45][T22]} It settles into an equilibrium state in response to input conditions, and is the foundation of the first learning RNNs (see below). Recurrent architectures were also discussed in 1943 by neuroscientists Warren McCulloch and Walter Pitts^{[MC43]} and formally analyzed in 1956 by Stephen Cole Kleene.^{[K56]}
In 1972, Shun-Ichi Amari made the Lenz-Ising recurrent architecture adaptive such that it could learn to associate input patterns with output patterns by changing its connection weights.^{[AMH1]} See also Stephen Grossberg's work on biological networks,^{[GRO69]} David Marr's^{[MAR71]} and Teuvo Kohonen's^{[KOH72]} work, and Kaoru Nakano's learning RNN.^{[NAK72]}
10 years later, the Amari network was republished (and its storage capacity analyzed).^{[AMH2]} Some called it the Hopfield Network (!) or Amari-Hopfield Network.^{[AMH3]} Amari's paper also presented a sequence-processing generalization thereof.^{[AMH1]} Turing, too, had thoughts related to learning RNNs. This, however, was first published many decades later,^{[TUR1]} which explains the obscurity of his thoughts here.^{[TUR21]} (Margin note: it has been pointed out that the famous "Turing Test" should actually be called the "Descartes Test."^{[TUR3,a,b][TUR21]})
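The learning rule Amari gave the Lenz-Ising architecture can be sketched as a Hebbian associative memory. The single stored pattern, synchronous update, and toy size below are my own illustrative assumptions:

```python
# Toy Amari-Hopfield-style associative memory: weights are set by a Hebbian
# outer product of the stored pattern with itself (no self-connections),
# and a corrupted cue is driven back to the stored pattern by a threshold
# update of each unit.

def sign(x):
    return 1 if x >= 0 else -1

pattern = [1, -1, 1, -1, 1]
N = len(pattern)

# Hebbian weights: w_ij = p_i * p_j for i != j, w_ii = 0.
W = [[pattern[i] * pattern[j] if i != j else 0 for j in range(N)]
     for i in range(N)]

cue = [1, 1, 1, -1, 1]                    # stored pattern with one bit flipped
recalled = [sign(sum(W[i][j] * cue[j] for j in range(N)))
            for i in range(N)]
print(recalled == pattern)                # the network restores the memory
```

This is the "settling into an equilibrium state in response to input conditions" mentioned above, in its simplest possible form.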
Today, the most popular RNN is the Long Short-Term Memory (LSTM) mentioned below, which has become the most cited NN of the 20th century.^{[MOST]}
In 1958, Frank Rosenblatt not only combined linear NNs and threshold functions (see the section on shallow learning since 1800), he also had more interesting, deeper multilayer perceptrons (MLPs).^{[R58]} Because only the last layer learned,^{[DL1]} Rosenblatt basically had what much later was rebranded as Extreme Learning Machines (ELMs) without proper attribution.^{[ELM1-2][CONN21][T22]}
MLPs were also discussed in 1961 by Karl Steinbuch^{[ST61-95]} and Roger David Joseph^{[R61]} (1961). See also Oliver Selfridge's multilayer Pandemonium^{[SE59]} (1959). Rosenblatt (1962) even wrote about "back-propagating errors" in an MLP with a hidden layer,^{[R62]} although he did not yet have a general deep learning algorithm for deep MLPs. What's now called backpropagation is quite different and was first published in 1970, as discussed below.^{[BP1-BP5][BPA-C]}
Today, the most popular FNN is a version of the LSTM-based Highway Net (mentioned below) called ResNet,^{[HW1-3]} which has become the most cited NN of the 21st century.^{[MOST]}
multiplicative gates).^{[DEEP1-2][DL1-2][FDL]} A paper of 1971^{[DEEP2]} already described a deep learning net with 8 layers, trained by their highly cited method which was still popular in the new millennium,^{[DL2]} especially in Eastern Europe, where much of Machine Learning was born.^{[MIR](Sec. 1)[R8]} The term "deep learning" was first introduced to Machine Learning much later by Dechter (1986), and to NNs by Aizenberg et al (2000).^{[DL2]} (Margin note: our 2005 paper on deep learning^{[DL6,6a]} was the first machine learning publication with the word combination "learn deep" in the title.^{[T22]})
Unlike the layer-by-layer training of Ivakhnenko and Lapa (1965, see above), Amari's MLPs were trained in end-to-end fashion from scratch by stochastic gradient descent (SGD),^{[GD1]} a method proposed in 1951 by Robbins & Monro.^{[STO51-52]}
Amari's implementation^{[GD2,GD2a]} (with his student Saito) learned internal representations in a five layer MLP with two modifiable layers, which was trained to classify
See also Iakov Zalmanovich Tsypkin's even earlier work on gradient descent-based on-line learning for non-linear systems.^{[GDa-b]}
Remarkably, as mentioned above, Amari also published learning RNNs in 1972.^{[AMH1]}
In 1970, Seppo Linnainmaa was the first to publish what's now known as backpropagation, the famous algorithm for credit assignment in networks of differentiable nodes.^{[BP1,4,5]}
In 1982, Paul Werbos proposed to use the method to train NNs,^{[BP2]} extending ideas in his 1974 thesis.
In 1960, Henry J. Kelley already had a precursor of backpropagation in the field of control theory;^{[BPA]} see also later work of the early 1960s by Stuart Dreyfus and Arthur E. Bryson.^{[BPB][BPC]}^{[R7]} Unlike Linnainmaa's general method,^{[BP1]} the systems of the 1960s^{[BPA-C]}
Backpropagation is essentially an efficient way of implementing Leibniz's chain rule^{[LEI07-10]} (1676) (see above) for deep networks. Cauchy's gradient descent^{[GD']} uses this to incrementally adjust the weights such that the NN behaves more and more like some teacher, which could be a human, or another NN,^{[UN-UN2]} or something else. By the mid 1980s, sufficiently cheap computers had just become accessible in wealthier academic labs. An experimental analysis of the known method^{[BP1-2]} then demonstrated that backpropagation can yield useful internal representations in hidden layers of NNs.^{[RUM]} At least for supervised learning, backpropagation is generally more efficient than Amari's above-mentioned deep learning through the more general SGD method (1967), which learned useful internal representations in NNs about 2 decades earlier.^{[GD1-2a]}
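The reverse-mode chain rule can be shown on the smallest interesting case, a two-layer net with one hidden unit. The weights, input, and target below are toy values of my own choosing; the gradient is checked against a finite difference:

```python
# Backpropagation as the chain rule: y = w2 * tanh(w1 * x), squared error
# to a target. Gradients flow backward, one factor of the chain per layer.
import math

x, target = 1.0, 0.5
w1, w2 = 0.3, 0.7

# Forward pass.
h = math.tanh(w1 * x)
y = w2 * h
loss = 0.5 * (y - target) ** 2

# Backward pass (reverse-mode chain rule).
dy = y - target                  # dL/dy
dw2 = dy * h                     # dL/dw2
dh = dy * w2                     # dL/dh
dw1 = dh * (1 - h ** 2) * x      # dL/dw1, using tanh'(z) = 1 - tanh(z)^2

# Numerical check of dw1 by a finite difference.
eps = 1e-6
loss_p = 0.5 * (w2 * math.tanh((w1 + eps) * x) - target) ** 2
print(abs((loss_p - loss) / eps - dw1) < 1e-4)   # True
```

The same backward sweep scales to arbitrarily deep networks, which is exactly what makes reverse mode efficient: one forward pass plus one backward pass, regardless of the number of weights.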
It took 4 decades until the backpropagation method of 1970^{[BP1-2]} got widely accepted as a training method for deep NNs. Before 2010, many thought that the training of NNs with many layers requires unsupervised pre-training, a methodology introduced by myself in 1991^{[UN][UN0-3]} (see below), and later championed by others (2006).^{[UN4]} In fact, it was claimed^{[VID1]} that deep NNs cannot be trained well without it. But in 2010, our team with my postdoc Dan Ciresan^{[MLP1-2]} showed that deep NNs can be trained by plain backpropagation and do not at all require unsupervised pre-training for important applications.^{[MLP2]}
Our system set a new performance record^{[MLP1]} (using GPU-based NNs pioneered by Jung & Oh in 2004^{[GPUNN]}). A reviewer called this a "wake-up call to the machine learning community." It has been claimed that "researchers took a fresh look at the problem in the 1980s."^{[S20]} However, the 1969 book^{[M69]} addressed a "problem" of Gauss & Legendre's shallow learning (circa 1800)^{[DL1-2]} that had already been solved 4 years prior by Ivakhnenko & Lapa's popular deep learning method,^{[DEEP1-2][DL2]} and then also by Amari's SGD for MLPs.^{[GD1-2]} Minsky neither cited this work nor corrected his book later.^{[HIN](Sec. I)[T22]} Others republished key methods (such as the Boltzmann machine^{[BM][HIN][SK75][G63][T22]}) without relating them to the original work,^{[DLC][S20][T22]} although the true history is well-known: deep learning research was alive and kicking in the 1960s-70s, especially outside of the Anglosphere.^{[DEEP1-2][GD1-3][CNN1][DL1-2][T22]} Blatant misattribution and unintentional^{[PLAG1][CONN21]} or intentional^{[FAKE2]} plagiarism are still tainting the entire field of deep learning.^{[T22]} Scientific journals "need to make clearer and firmer commitments to self-correction,"^{[SV20]} as is already the standard in other scientific fields.
Computer Vision was revolutionized in the 2010s by a particular feedforward NN called the convolutional NN (CNN).^{[CNN1-4]} The basic CNN architecture with convolutional and downsampling layers is due to Fukushima (1979), who called it the Neocognitron.^{[CNN1]} Fukushima had also introduced rectified linear units (ReLUs) for NNs (1969).^{[RELU1]} They are now widely used in CNNs and other NNs.
In 1987, NNs with convolutions were combined by Alex Waibel with weight sharing and backpropagation (see above),^{[BP1-2]} and applied to speech.^{[CNN1a]} Waibel did not call this CNNs but TDNNs. The popular downsampling variant called max-pooling was introduced by Yamaguchi et al. for TDNNs in 1990^{[CNN3a]} and by Weng et al. for higher-dimensional CNNs in 1993.^{[CNN3]} Yann LeCun's team has contributed improvements of CNNs, especially for images.^{[CNN2,4][T22]} Baldi and Chauvin (1993) had the first application of CNNs with backpropagation to biomedical/biometric images.^{[BA93]}
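Max-pooling is simple enough to show directly; the 4x4 feature map and 2x2 window below are my own toy values:

```python
# Max-pooling: downsample a feature map by keeping only the maximum of each
# non-overlapping 2x2 window, halving the spatial resolution while preserving
# the strongest local responses.

def max_pool_2x2(fm):
    n = len(fm)   # assumes a square map with even side length
    return [[max(fm[i][j], fm[i][j+1], fm[i+1][j], fm[i+1][j+1])
             for j in range(0, n, 2)] for i in range(0, n, 2)]

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
print(max_pool_2x2(feature_map))   # [[4, 2], [2, 8]]
```

Besides reducing computation in the layers above, the max operation gives the network a degree of invariance to small translations of the input.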
CNNs (Dan Ciresan et al., 2011).^{[GPUCNN1,3,5]}
Our fast GPU-based^{[GPUNN][GPUCNN5]} CNN of 2011^{[GPUCNN1]} known as DanNet^{[DAN,DAN1][R6]} was much faster than the GPU-accelerated CNNs of 2006.^{[GPUCNN]} In 2011, DanNet became the first pure deep CNN to win computer vision contests.^{[GPUCNN2-3,5]}
Highway Net^{[HW1]}
with open gates
ResNet, the ImageNet 2015 winner^{[HW2]} (Dec 2015) and currently the most cited NN,^{[MOST]} is a version (with open gates) of our earlier Highway Net (May 2015).^{[HW1-3][R5]} The Highway Net (see below) is actually the feedforward net version of our vanilla LSTM (see below).^{[LSTM2]} It was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers).

NNs with rapidly changing "fast weights" were introduced by v.d. Malsburg (1981) and others.^{[FAST,a,b]} Deep learning architectures that can manipulate structured data such as graphs^{[T22]} were pioneered by our graph NN-like, Transformer-like Fast Weight Programmers of 1991,^{[FWP0-1][FWP6][FWP]} which learn to continually rewrite mappings from inputs to outputs (addressed below), and by the work of Baldi and colleagues.^{[BA96-03]} Today, graph NNs are used in numerous applications.
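The "version with open gates" relation between ResNet and the Highway Net can be sketched on a toy scalar layer. The transform function and gate values below are my own illustrative placeholders, not learned components:

```python
# A highway layer mixes a transform h(x) and the carried input x through a
# transform gate t and a carry gate g; a residual layer is the special case
# where both gates are fixed open (t = g = 1), so y = h(x) + x.

def highway(x, h, t, g):
    return h(x) * t(x) + x * g(x)

def residual(x, h):
    return h(x) + x

h = lambda x: 0.5 * x + 1.0        # placeholder for a learned transform
open_gate = lambda x: 1.0          # gate permanently open

x = 2.0
print(highway(x, h, open_gate, open_gate) == residual(x, h))   # True
```

With learned (input-dependent) gates, the highway layer can smoothly interpolate between copying its input and fully transforming it, which is what made training hundreds of such layers feasible.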
Werbos,^{[BP2][BPTT1]} Williams,^{[BPTT2][CUB0-2]} and others^{[ROB87][BPTT3][DL1]} analyzed ways of implementing gradient descent^{[GD'][STO51-52][GDa-b][GD1-2a]} in RNNs. Kohonen's self-organising maps became popular.^{[KOH82-89]} There were also alternative methods of credit assignment in space and time.^{[BB2][NAN1-4][NHE][HEL]} See overviews^{[MIR](Sec. 15, Sec. 17)} and recent renewed interest in such methods.^{[NAN5][FWPMETA6][HIN22]} A version of the earlier stochastic delta rule became popular under the moniker "dropout."^{[Drop1-4][GPUCNN4]}

Generative Adversarial Networks (GANs) have become very popular.^{[MOST]} They were first published in 1990 in Munich under the moniker Artificial Curiosity.^{[AC90-20][GAN1]} Two dueling NNs (a probabilistic generator and a predictor) are trying to maximize each other's loss in a minimax game^{[AC](Sec. 1)} (using stochastic units^{[AC90]} like in the much later StyleGANs^{[GAN2]}). The predictor NN minimizes its error, while the generator NN tries to make outputs that maximize this error: one net's loss is the other net's gain.^{[AC90]} (The world model can also be used for continual online action planning.^{[AC90][PLAN2-3][PLAN]})
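The minimax principle just described can be sketched with a deliberately trivial toy of my own (both "networks" are single parameters, the environment is a fixed linear map), not the 1990 system:

```python
# Toy adversarial-curiosity sketch: a predictor minimizes its prediction
# error on the environment's response to an action, while the generator
# (controller) would be rewarded for actions where that error stays high.
# One net's loss is the other net's gain.

world = lambda a: 2.0 * a          # unknown environment response (toy)
pred_w = 0.0                       # predictor's model: guess = pred_w * a
action = 1.0                       # a fixed action the generator keeps trying
lr = 0.1

for step in range(200):
    guess = pred_w * action
    err = (world(action) - guess) ** 2
    # The predictor descends on its squared error...
    pred_w += lr * 2 * (world(action) - guess) * action
    # ...while the generator's reward is err itself: once the predictor
    # masters this action, it becomes "boring" and new actions pay off.

print(err < 1e-6)                  # predictor has learned this regularity
```

The generator's side is only indicated in the comments here; in the full 1990 setting it is a second trained network whose reward is the predictor's error.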
4 years before a 2014 paper on GANs,^{[GAN1]} my well-known 2010 survey^{[AC10]} summarised the generative adversarial NNs of 1990.^{[AC20][AC][T22](Sec. XVII)} Earlier adversarial machine learning settings^{[S59][H90]} neither involved unsupervised NNs nor were about modeling data nor used gradient descent.^{[AC20]} The adversarial curiosity principle has been widely used for exploration in Reinforcement Learning^{[SIN5][OUD13][PAT17][BUR18]} and for synthesis of realistic images,^{[GAN1,2]} although the latter domain was recently taken over by Rombach et al.'s Latent Diffusion, another method published in Munich,^{[DIF1]} building on Jarzynski's earlier work in physics from the previous millennium^{[DIF2]} and more recent papers.^{[DIF3-5]} Another early adversarial technique was Predictability Minimization for creating disentangled representations of partially redundant data, applied to images in 1996.^{[PM0-2][AC20][R2][MIR](Sec. 7)} Such compositional learning of hierarchies is now considered a remaining grand challenge.^{[LEC]} The early 1990s, however, saw first exceptions: NNs that learn to decompose complex spatio-temporal observation sequences into compact but meaningful chunks^{[UN0-3]} (see further below), and NN-based planners of hierarchical action sequences for compositional learning,^{[HRL0]} as discussed next. This work injected concepts of traditional "symbolic" hierarchical AI^{[NS59][FU77]} into end-to-end differentiable "sub-symbolic" NNs. In 1990, I also introduced end-to-end differentiable NN-based subgoal generators for Hierarchical Reinforcement Learning (HRL).^{[HRL0]} Soon afterwards, this was also done with recurrent NNs that learn to generate sequences of subgoals.^{[HRL1-2][PHD][MIR](Sec. 10)} Compare what LeCun later called an "open problem."^{[LEC]}
Compare other NNs that have "worked on command" since April 1990, in particular, for learning selective attention,^{[ATT0-3]} artificial curiosity and self-invented problems,^{[PP][PPa,1,2][AC]} upside-down reinforcement learning^{[UDRL1-2]} and its generalizations.^{[GGP]} Recently, Transformers^{[TR1]} have been all the rage, e.g., generating human-sounding texts.^{[GPT3]} Transformers with "linearized self-attention"^{[TR5-6]} were first published in March 1991.^{[FWP0-1][FWP6][FWP]} These so-called "Fast Weight Programmers" or "Fast Weight Controllers"^{[FWP0-1]} separated storage and control like in traditional computers, but in an end-to-end-differentiable, adaptive, fully neural way (rather than in a hybrid fashion^{[PDA1-2][DNC]}). The "self-attention" in standard Transformers^{[TR1-4]} combines this with a projection and softmax (using attention terminology like the one I introduced in 1993^{[ATT][FWP2][R4]}).
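The formal correspondence between the two can be shown in a few lines. This is a didactic sketch under arbitrary assumptions (dimensions and random data are made up): programming a fast weight matrix with additive outer products of "value" and "key" patterns, then querying it, yields exactly the output of self-attention without softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 6
keys = rng.normal(size=(T, d))      # patterns emitted by the slow net
values = rng.normal(size=(T, d))    # patterns emitted by the slow net
query = rng.normal(size=d)

# Fast Weight Programmer (1991): the slow net rewrites the fast net's
# weight matrix W via additive outer products.
W = np.zeros((d, d))
for k, v in zip(keys, values):
    W += np.outer(v, k)             # program the fast net
out_fwp = W @ query                 # apply the fast net to a query

# Linearized self-attention (Transformer without softmax): same result.
out_attn = sum(values[t] * (keys[t] @ query) for t in range(T))

assert np.allclose(out_fwp, out_attn)
```

The equivalence is just the distributive law: summing outer products before applying the query equals summing key-weighted values afterwards.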
Today's Transformers heavily use unsupervised pre-training^{[UN0-3]} (see next section), another deep learning methodology from our Annus Mirabilis of 1990-1991.^{[MIR][MOST]}
The 1991 fast weight programmers led to the gradient-based meta-learning NNs of 1992,^{[FWPMETA1-9][HO1]} which extended my 1987 diploma thesis.^{[META1]} The thesis introduced algorithms not just for learning but also for meta-learning or learning to learn,^{[META]} to learn better learning algorithms through experience. This became very popular in the 2010s^{[DEC]} when computers were a million times faster. Deep learning is about NNs with many layers of neurons or many subsequent computational stages.^{[MIR]} Unlike shallower ones,^{[DL1-2]} recurrent NNs (but see a 1989 paper^{[MOZ]}) are of arbitrary depth.^{[DL1]} Before the 1990s, however, RNNs failed to learn deep problems in practice.^{[MIR](Sec. 0)} The first NN to overcome this problem at multiple time scales^{[LEC]} was the Neural Sequence Chunker^{[UN0]} or Neural History Compressor.^{[UN1]} It solved "very deep learning" tasks of depth > 1000^{[UN2]} (requiring more than 1,000 subsequent computational stages). There is also a continuous version of the Neural History Compressor.^{[UN3]} (See also recent work on unsupervised NN-based abstraction.^{[OBJ1-5]}) More than a decade after this work,^{[UN1]} a similar method for feedforward NNs was published, called Deep Belief Networks (DBNs).^{[UN4]} Its justification was essentially the same: each higher level tries to reduce the description length (or negative log probability) of the data representation in the level below.^{[HIN][T22][MIR]} The history compressor's knowledge can be collapsed into a single NN using my NN distillation procedure of 1991.^{[UN0-1][MIR]} NN distillation was also republished many years later,^{[DIST2][MIR][HIN][T22]} and is widely used today. Such unsupervised/self-supervised pre-training is heavily used by Transformers.^{[TR1-6]} Transformers with linearized self-attention were also first published^{[FWP0-6]} in the Annus Mirabilis of 1990-1991,^{[MIR][MOST]} together with unsupervised/self-supervised pre-training for deep learning.^{[UN0-3]} See the previous section. Deep learning is hard because of the Fundamental Deep Learning Problem identified and analysed in 1991 by my student Sepp Hochreiter in his diploma thesis, which I had the pleasure to supervise.^{[VAN1]} First he implemented the Neural History Compressor above, but then did much more: he showed that backpropagated error signals in deep or recurrent NNs either shrink rapidly or grow out of bounds. In both cases, learning fails (compare^{[VAN2]}). This analysis led to basic principles of what's now called LSTM (see below).
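The vanishing/exploding gradient analysis can be illustrated numerically. The sketch below uses a toy scalar RNN with made-up constants (an assumption for illustration, not the 1991 formalism): the backpropagated gradient through T steps is a product of T local derivatives, so it shrinks or grows exponentially in T.

```python
import numpy as np

def bptt_gradient(w, T, h0=0.5, linear=False):
    """|d h_T / d h_0| for a scalar RNN h_t = f(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(T):
        pre = w * h
        h = pre if linear else np.tanh(pre)
        # local derivative of this step; the chain rule multiplies them all
        grad *= w if linear else w * (1.0 - h * h)
    return abs(grad)
```

For the linear case the gradient is exactly `|w|**T`: with `w = 0.9` it vanishes (`0.9**100` is about `2.7e-5`), with `w = 1.1` it explodes (`1.1**100` is about `1.4e4`). The `tanh` case decays at least as fast, since each local derivative is further damped by `1 - h**2`.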
The Long Short-Term Memory (LSTM) recurrent neural network^{[LSTM1-6]} overcomes the Fundamental Deep Learning Problem identified by Sepp in his above-mentioned 1991 diploma thesis,^{[VAN1]} which I consider one of the most important documents in the history of machine learning. It also provided essential insights for overcoming the problem, through basic principles (such as constant error flow) of what we called LSTM in a tech report of 1995.^{[LSTM0]} After the main peer-reviewed publication in 1997^{[LSTM1][25y97]} (now the most cited NN article of the 20th century^{[MOST]}) came the first application of LSTM to speech (2004).^{[LSTM10]} 2005 saw the first publication of LSTM with full backpropagation through time and of bi-directional LSTM^{[LSTM3]} (now widely used). Another milestone of 2006 was the training method "Connectionist Temporal Classification" or CTC^{[CTC]} for simultaneous alignment and recognition of sequences. Our team successfully applied CTC-trained LSTM to speech in 2007^{[LSTM4]} (also with hierarchical LSTM stacks^{[LSTM14]}), overcoming the need for hybrids of NNs and traditional approaches such as Hidden Markov Models (HMMs).^{[BW][BRI][BOU][HYB12][T22]} CTC-LSTM also won three ICDAR 2009 Connected Handwriting Competitions (French, Farsi, Arabic). LSTM was soon used for everything that involves sequential data, such as speech^{[LSTM10-11][LSTM4][DL1]} and videos. In 2015, CTC-LSTM dramatically improved Google's speech recognition on Android smartphones.^{[GSR15]} Many other companies adopted this.^{[DL4]} Google's on-device speech recognition of 2019 (now on your phone, not on the server) is still based on LSTM.
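The constant-error-flow principle can be made concrete with a minimal sketch. This is a generic vanilla-LSTM-style cell with a forget gate, not the exact published equations; parameter names and the test weights below are illustrative assumptions. The key point is the additive cell state update: with the forget gate open and the input gate shut, the cell state (and hence backpropagated error) passes through unchanged.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, p):
    """One step of a vanilla-LSTM-style cell with a forget gate."""
    z = np.concatenate([x, h])
    f = sigmoid(p['Wf'] @ z + p['bf'])   # forget gate
    i = sigmoid(p['Wi'] @ z + p['bi'])   # input gate
    o = sigmoid(p['Wo'] @ z + p['bo'])   # output gate
    g = np.tanh(p['Wg'] @ z + p['bg'])   # candidate cell input
    c = f * c + i * g                    # additive update: constant error flow
    h = o * np.tanh(c)                   # gated output
    return h, c
```

For example, with zero weight matrices, a large positive forget bias and a large negative input bias, the cell state is preserved essentially unchanged over many steps, which is exactly the behavior that defeats the vanishing gradient.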
In 1995, we already had an excellent neural probabilistic text model^{[SNT]} (compare Nakamura and Shikano's 1989 word category prediction model^{[NPMa]}). In 2001, we showed that LSTM can learn languages unlearnable by traditional models such as HMMs.^{[LSTM13]} LSTM also powered Facebook's automatic translations (over 4 billion per day by 2017),^{[FB17][DL4]} Apple's Quicktype on roughly 1 billion iPhones,^{[DL4]} the voice of Amazon's Alexa,^{[DL4]} image caption generation^{[DL4]} & automatic email answering,^{[DL4]} etc. Business Week called LSTM "arguably the most commercial AI achievement."^{[AV1]} Numerous papers have "LSTM" in their title.^{[DEC]}
Our Highway Network^{[HW1]} (May 2015) was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers). Microsoft's ResNet^{[HW2]} (which won the ImageNet 2015 contest) is a version thereof. The earlier Highway Nets perform roughly as well as their ResNet versions on ImageNet.^{[HW3]} Variants of highway gates are also used for certain algorithmic tasks where the pure residual layers do not work as well.^{[NDR]}
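The gate relationship can be sketched schematically. This is a simplified illustration, not the published architecture (the published Highway Net uses learned weight matrices throughout and typically couples the carry gate as C = 1 - T): a highway layer computes y = H(x)·T(x) + x·C(x), and fixing both gates open, T = C = 1, yields the residual layer y = H(x) + x.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(x, Wh, Wt, bt, Wc, bc):
    """y = H(x)*T(x) + x*C(x), with transform gate T and carry gate C."""
    h = np.tanh(Wh @ x)             # nonlinear transform H(x)
    t = sigmoid(Wt @ x + bt)        # transform gate T(x)
    c = sigmoid(Wc @ x + bc)        # carry gate C(x)
    return h * t + x * c

def residual_layer(x, Wh):
    """ResNet layer: the special case with gates fixed open (T = C = 1)."""
    return np.tanh(Wh @ x) + x
```

Driving both gate biases strongly positive saturates the sigmoids at 1, and the highway layer numerically coincides with the residual layer.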
Deep learning is all about NN depth.^{[DL1]} LSTMs brought essentially unlimited depth to supervised recurrent NNs; the LSTM-inspired Highway Nets later brought it to feedforward NNs. LSTM has become the most cited NN of the 20th century; the Highway Net version called ResNet, the most cited NN of the 21st.^{[MOST]} (Citations, however, are a highly questionable measure of true impact.^{[NAT1]}) Reinforcement Learning (RL)^{[KAE96][BER96][TD3][UNI][GM3][LSTMPG]} is about agents that learn to maximize expected cumulative reward signals.^{[DL1]} Many problems of AI can be formulated in the general RL framework.^{[UNI]} Relevant techniques with deep roots include Monte Carlo (tree) search (MC, 1949),^{[MOC1-5]} dynamic programming (DP, 1953),^{[BEL53]} artificial evolution (1954),^{[EVO1-7]([TUR1],unpublished)} alpha-beta-pruning (1959),^{[S59]} control theory and system identification (1950s),^{[KAL59][GLA85]} stochastic gradient descent (SGD, 1951),^{[STO51-52]} and universal search techniques (1973).^{[AIT7]} NNs entered RL through system identification,^{[WER87-89][MUN87][NGU89]} DP and its online variant called Temporal Differences (TD),^{[TD1-3]} artificial evolution,^{[EVONN1-3]} and policy gradients.^{[GD1][PG1-3]} Many additional references on this can be found in Sec. 6 of the 2015 survey.^{[DL1]}
When there is a Markovian interface^{[PLAN3]} to the environment, RL with DP/TD/MC-based FNNs can be very successful, as shown in 1994^{[TD2]} (master-level backgammon player) and the 2010s^{[DM1-2a]} (superhuman players for Go, chess, and other games). For harder cases where success requires memories of the history of previous inputs, our combinations of RL algorithms and LSTM^{[LSTM-RL][RPG]} have become standard, in particular, our LSTM trained by policy gradients (2007).^{[RPG07][RPG][LSTMPG]}
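The policy gradient principle behind such systems can be sketched on a toy problem. The following is a minimal illustration under simplifying assumptions (a stateless softmax policy on a two-armed bandit; the memory-based LSTM part is omitted entirely): sample an action from the policy, then nudge the parameters along the gradient of log π(a) scaled by the observed reward.

```python
import numpy as np

def reinforce_bandit(p_reward=(0.2, 0.8), steps=3000, lr=0.1, seed=0):
    """Minimal REINFORCE: ascend reward * grad log pi(a | theta)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)                           # action preferences
    for _ in range(steps):
        probs = np.exp(theta) / np.exp(theta).sum()
        a = rng.choice(2, p=probs)                # sample an action
        r = float(rng.random() < p_reward[a])     # stochastic binary reward
        grad = -probs
        grad[a] += 1.0                            # d log pi(a) / d theta
        theta += lr * r * grad                    # policy gradient ascent
    return np.exp(theta) / np.exp(theta).sum()    # final action probabilities
```

After training, the policy should strongly prefer the arm with the higher reward probability; replacing the stateless preferences with a recurrent network is what allows the same principle to exploit memories of past inputs.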
For example, in 2018, a PG-trained LSTM was the core of OpenAI's famous Dactyl which learned to control a dextrous robot hand without a teacher.^{[OAI1][OAI1a]} In 2019, DeepMind beat a pro player in the game of Starcraft, which is theoretically harder than Chess or Go^{[DM2]} in many ways, using Alphastar whose brain has a deep LSTM core trained by PG.^{[DM3]} A PG-trained LSTM was also the core of OpenAI Five which learned to defeat human experts in the Dota 2 video game (2018).^{[OAI2]} Bill Gates called this a "huge milestone in advancing artificial intelligence".^{[OAI2a][MIR](Sec. 4)[LSTMPG]} What about commonsense reasoning^{[MAR15]} and learning to think?^{[PLAN4-5]} How can NNs learn to plan at multiple levels of abstraction and multiple time scales?^{[LEC]} We published answers to these questions in 1990-91: self-supervised neural history compressors^{[UN][UN0-3]} learn to represent percepts at multiple levels of abstraction and multiple time scales (see above), while end-to-end differentiable NN-based subgoal generators^{[HRL3][MIR](Sec. 10)} learn hierarchical action plans through gradient descent (see above). More sophisticated ways of learning to think in abstract ways were published in 1997^{[AC97][AC99][AC02]} and 2015-18.^{[PLAN4-5]} The automatic theatre built in the 1st century^{[SHA7a][RAU1]} by Heron of Alexandria was perhaps the first machine with a stored program.^{[BAN][KOE1]} It used pins on
In 1623, Wilhelm Schickard built one of the first automatic calculating machines. In 1673, the already mentioned Gottfried Wilhelm Leibniz (called "the smartest man who ever lived"^{[SMO13]}) designed the first machine (the step reckoner) that could perform all four arithmetic operations, and the first with a memory.^{[BL16]} He also described principles of binary computers governed by punch cards (1679),^{[L79][L03][LA14][HO66]} and published the chain rule^{[LEI07-10]} (see above), an essential ingredient of deep learning and modern AI.
Leonardo Torres y Quevedo (mentioned in the introduction) built the first working chess end game player; decades later, AI pioneer Norbert Wiener played against it at the 1951 Paris AI conference.^{[AI51][BRO21][BRU4]} Konrad Zuse created the first working programmable general-purpose computer (1941). The corresponding patent application of 1936^{[ZU36-38][RO98][ZUS21]} described digital circuits, predating Claude Shannon's 1937 thesis on digital circuit design.^{[SHA37]} Unlike Babbage, Zuse used Leibniz' principles of binary computation (1679),^{[L79][LA14][HO66][L03]} which greatly simplified the hardware.^{[LEI21,a,b]} Compare the theoretical work of Church^{[CHU]} (1935), Turing^{[TUR]} (1936), and Post^{[POS]} (1936). (Zuse's machine lacked an explicit conditional jump instruction.^{[RO98]})
John Atanasoff built an early tube-based special-purpose computer (he is sometimes called the "father of tube-based computing"^{[NASC6a]}); the transistor principle had already been patented by Julius Edgar Lilienfeld in 1925.^{[LIL1-2]} The electronic Colossus (1943-45) was used to break the Nazi code.^{[NASC6]} The first general-purpose programmable machine built by someone other than Zuse (1941)^{[RO98]} was Howard Aiken's decimal MARK I (US, 1944). Compare the 1948 upgrade of ENIAC, which was reprogrammed by entering numerical instruction codes into read-only memory.^{[HAI14b]} In 1949, Werner Jacobi filed a patent for an integrated circuit (IC) with several transistors on a common substrate (granted in 1952).^{[IC49-14]} In 1959, Robert Noyce presented a monolithic IC.^{[IC14]} ICs/GPUs of today (2022) contain many billions of transistors (almost all of them of Lilienfeld's 1925 FET type^{[LIL1-2]}). Compare Moore's Law, which states that the number of transistors^{[LIL1-2]} per microchip doubles roughly every 18 months; some expect that affordable hardware will eventually match the raw computational power of all human brains combined.^{[RAW]} According to Bremermann (1982),^{[BRE]} physics imposes ultimate limits on such growth, as previously noted back in 2004.^{[OOPS2][ZUS21]} (In some hardware, connections are actually light beams.^{[DL2]}) NN-specific hardware architectures are expected to become even much more important than they are today.^{[DL2]} Note also Gödel's fundamental results, which concern any type of computation-based AI.^{[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]}
He combined Georg Cantor's diagonalization trick^{[CAN]} with the foundational work by Gottlob Frege^{[FRE]} (who introduced the first formal language in 1879), Thoralf Skolem^{[SKO23]} (who introduced primitive recursive functions in 1923) and Jacques Herbrand.^{[GOD86]} Much of this goes back to Gottfried Wilhelm Leibniz^{[L86][WI48]} (see above), whose formal Algebra of Thought (1686) was deductively equivalent^{[LE18]} to the later Boolean Algebra of 1847.^{[BOO]} In 1936, Alan M. Turing introduced the Turing Machine.^{[TUR]} He rederived the above-mentioned result.^{[CHU][TUR][HIN][GOD21,21a][TUR21][LEI21,21a]} In the same year of 1936, Emil Post published yet another independent universal model of computing.^{[POS]} In 1941, Konrad Zuse completed the world's first working programmable general-purpose computer^{[ZU36-38][RO98][ZUS21]} and later designed the first high-level programming language^{[BAU][KNU]} (1945,^{[KNU]} published in 1948^{[ZU48]}). Compare Newell & Simon's later work on theorem proving (1956).^{[NS56]} In 1964, Ray Solomonoff combined Bayesian (actually Laplacian^{[STI83-85]}) probabilistic reasoning and theoretical computer science^{[GOD][CHU][TUR][POS]} to obtain a theoretically optimal way of learning to predict future data from past observations.^{[AIT1][AIT10]} With Andrej Kolmogorov, he founded the theory of Kolmogorov complexity or algorithmic information theory (AIT),^{[AIT1-22]} going beyond traditional information theory.^{[SHA48][KUL]} My own work addressed practical, computable variants of this concept,^{[AIT7][AIT5][AIT12-13][AIT16-17]} as well as applications to NNs.^{[KO2][CO1-3]}
In the early 2000s, Marcus Hutter (while working under my Swiss National Science Foundation grant^{[UNI]}) augmented Solomonoff's universal predictor^{[AIT1][AIT10]} such that it could act optimally in arbitrary unknown environments.^{[AIT20,22]} He also derived the asymptotically fastest algorithm for all well-defined computational problems.^{[AIT21]} Long ago I noticed a beautiful pattern of exponential acceleration in history,^{[OMG]} which I have presented in many talks since then, and which also made it into Sibylle Berg's award-winning book "GRM: Brainfuck."^{[OMG2]} The most important historical events seem to converge at ever shorter intervals: just a few decades or centuries or at most millennia.^{[OMG1]} (Compare the programmable automatic theatre of Heron of Alexandria^{[RAU1]} in the 1st century.) The telephone (e.g., Meucci 1857, Reis 1860, Bell 1876)^{[NASC3]} changed the world, as did the Haber-Bosch process for creating artificial fertilizer, without which the world could feed at most 4 billion people.^{[HAB1-2]} The first truly self-driving cars appeared in the 1980s (by 1994, robot cars were driving in highway traffic, up to 180 km/h).^{[AUT]} Back then, I worked on my 1987 diploma thesis,^{[META1]} which introduced algorithms not just for learning but also for meta-learning or learning to learn,^{[META]} to learn better learning algorithms through experience (now a very popular topic^{[DEC]}). And then came our Miraculous Year 1990-91^{[MIR]} at TU Munich, the root of today's most cited NNs^{[MOST]} and of modern deep learning: artificial curiosity and generative adversarial NNs for agents that invent their own problems (see above),^{[AC90-AC20][PP-PP2][SA17]} Transformers with linearized self-attention (see above),^{[FWP0-6][TR5-6]} distilling teacher NNs into student NNs (see above),^{[UN][UN0-3]} learning at multiple levels of abstraction and multiple time scales (see above),^{[HRL0-2][LEC]} and other exciting stuff. Much of this has become very popular, and improved the lives of billions of people.^{[DL4][DEC][MOST]} (Take all of this with a grain of salt, though.^{[OMG1]})
Curious AIs of the type studied in my lab for decades^{[AC][AC90,AC90b]} will quickly improve themselves, restricted only by the fundamental limits of computability and physics. I have frequently commented on this;^{[ACM16][FA15][SP16][SA17]} in the long run, those who make more and bigger AIs will shape the future, while those who don't won't have an impact.^{[ACM16][FA15][SP16]}
Some of the material above was taken from previous AI Blog posts.^{[MIR] [DEC] [GOD21] [ZUS21] [LEI21] [AUT] [HAB2] [ARC06] [AC] [ATT] [DAN] [DAN1] [DL4] [GPUCNN5,8] [DLC] [FDL] [FWP] [LEC] [META] [MLP2] [MOST] [PLAN] [UN] [LSTMPG] [BP4] [DL6a] [HIN] [T22]}
See also my publication page and my
arXiv page. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
555+ References (and many more in the survey^{[DL1]})
In 2022, we are celebrating the following works from a quarter-century ago.
1. Journal paper on Long Short-Term Memory, the most cited NN of the 20th century (and basis of the most cited NN of the 21st).
2. Paper on all possible metaverses.
3. Implementing artificial curiosity and creativity through generative adversarial agents that learn to design abstract, interesting computational experiments.
4. Journal paper on meta-reinforcement learning.
5. Journal paper on hierarchical Q-learning.
8. Journal paper on Low-Complexity Art, the Minimal Art of the Information Age.
J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity.
PDF.
The first paper on online planning with reinforcement learning recurrent neural networks (NNs) (more) and on generative adversarial networks
(more).
PDF.
More.
PDF.
PDF.
general system
systems with intrinsic motivation,^{[AC90-AC95]} the system also
See later publications.^{[AC99][AC02]}
PDF.
PDF.
PDF. (More on
artificial scientists and artists.)
IEEE link.
PDF.
With a brief summary of the generative adversarial neural networks of 1990^{[AC90,90b][AC20]}
(more).
Preprint arXiv/1906.04493.
Link.
[AIB] J. Schmidhuber. AI Blog.
Includes variants of chapters of the AI Book.
H. Bruderer^{[BRU4]} calls that the first conference on AI.
Blog of Werner Vogels, CTO of Amazon (Nov 2016):
PDF.
First publication of what was later sometimes called the Hopfield network^{[AMH2]} or Amari-Hopfield Network,^{[AMH3]} based on the (uncited) Lenz-Ising recurrent architecture.^{[L20][I25][T22]}
Mentions the recurrent Ising model^{[L20][I25]} on which the (uncited) Amari network^{[AMH1,2]} is based.
The Hopfield network or Amari-Hopfield Network was first published in 1972 by Amari.^{[AMH1]} [AMH2] did not cite [AMH1].
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. Schmidhuber
Transformers with linearized self-attention (1991-93).^{[FWP]} Today, both types are very popular.
PDF.
PDF.
More.
PS. (PDF.)
H. Larochelle, G. E. Hinton. Learning to combine foveal glimpses with a third-order Boltzmann machine. NIPS 2010. This work is very similar to [ATT0-2] which the authors did not cite.
In fact, Hinton was the reviewer of a 1990 paper^{[ATT2]}
his own work:^{[ATT3]}
attentional component (the fixation controller)." See [MIR](Sec. 9)[R4].
arXiv/1409.0473, 2014-16.
This work on soft "attention" did not cite Schmidhuber's much earlier original work of 1991-1993 on soft attention and Transformers with linearized self-attention.^{[FWP,FWP0-2,6][ATT]}
J. Schmidhuber (AI Blog, 2005). Highlights of robot car history. Around
Bloomberg, May 15, 2018.
PDF.
HTML.
PDF.
by Sherrington & Kirkpatrick^{[SK75]} & Glauber^{[G63]} nor the first working algorithms for deep learning of internal representations (Ivakhnenko & Lapa, 1965)^{[DEEP1-2][HIN]} nor
Amari's work (1967-68)^{[GD1-2]} on learning internal representations in deep nets through stochastic gradient descent.
Even later surveys by the authors^{[S20][DLC]} failed to cite the prior art.^{[T22]}
formal Algebra of Thought (1686)^{[L86][WI48]} was
deductively equivalent^{[LE18]} to the much later
Precursor of modern backpropagation.^{[BP1-5]}
PDF.
Link.
PDF.
First application of backpropagation^{[BP1]} to NNs (concretizing thoughts in Werbos' 1974 thesis).
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020).
Who invented backpropagation?
More.^{[DL2]}
Link.
IEEE Spectrum, 2021. Link.
English version: [CNN1+]. More in Scholarpedia.
Link.
[CNN1a] A. Waibel. Phoneme Recognition Using Time-Delay Neural Networks. Meeting of IEICE, Tokyo, Japan, 1987. First application of backpropagation^{[BP1-5]} and weight-sharing
PDF.
Spatial Averaging.^{[CNN1]}
Spatial Averaging.^{[CNN1]}
PDF.
PDF.
PDF.
Inverse, 2016. Link.
Since November 2021: Comments on version 1 of the report^{[T22]}
in the Connectionists Mailing List, perhaps the oldest mailing list on artificial neural networks. Link to the archive.
PDF.
PDF.
Beijing, 2014. Preprint arXiv:1402.3511 [cs.NE].
J. Schmidhuber (AI Blog, 2021).
10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named
1st superhuman result in 2011.^{[DAN1]} Now everybody is using this approach.
J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition.
the artificial neural network called DanNet
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, updated 2021, 2022). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The
1991 NN distillation procedure,^{[UN0-2][MIR](Sec. 2)}
More.
Deep Learning.
HTML.
A "survey" of deep learning that does not mention the pioneering works of deep learning [T22].
[DL3a] Y. Bengio, Y. LeCun, G. Hinton (2021). Turing Lecture: Deep Learning for AI. Communications of the ACM, July 2021. HTML.
Local copy (HTML only).
Another "survey" of deep learning that does not mention the pioneering works of deep learning [T22].
[DL4] J. Schmidhuber (AI Blog, 2017).
Our impact on the world's most valuable public companies: Apple, Google, Microsoft, Facebook, Amazon... By
greatly improved (CTC-based)
on-device speech recognition
(on the phone, not the server)
LSTM.
PDF.
J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). The deep reinforcement learning & neuroevolution developed in Schmidhuber's lab solved problems of depth 1000 and more.^{[DL6]} Soon after its publication, everybody started talking about "deep learning." Causality or correlation?
Web site deeplearning.net of Y. Bengio's MILA (2015, retrieved May 2020; compare the version in the
Internet Archive),
referring to Hinton's^{[UN4]} and Bengio's^{[UN5]}
unsupervised pre-training for deep NNs^{[UN4]} (2006) although
this type of deep learning dates back to Schmidhuber's work of 1991.^{[UN1-2][UN]}
[DLC] J. Schmidhuber (AI Blog, June 2015).
Critique of Paper by self-proclaimed^{[DLC2]} "Deep Learning Conspiracy" (Nature 521 p 436).
it). More on this under [T22].
J. Schmidhuber (AI Blog, 2022).
Annotated History of Modern AI and Deep Learning. Technical Report IDSIA-22-22, IDSIA, Lugano, Switzerland, 2022.
Preprint arXiv:2212.11279.
Tweet of 2022.
arxiv:1312.5602.
Link.
the first sentence of the abstract of the earlier tech report version^{[DM1]}
was created earlier by Jan Koutnik et al. in Schmidhuber's lab.^{[CO2]}
and PhDs in computer science. More.
Alphastar has a "deep LSTM core."
Hochreiter et al.'s first successful application [HO07] of deep learning to protein folding (2007).
Preprint arXiv:2112.10752, LMU Munich, 2021.
neural networks learning to control dynamic external memories.^{[PDA1-2][FWP0-1]}
arXiv:1808.03578, 2018.
arXiv:1808.03578, 2018.
Conf. on Neural Networks, Vol. 2, 2004, pp. 985-990. This paper does not mention that the "ELM" concept goes back to Rosenblatt's work in the 1950s.^{[R62][T22]}
This overview does not mention that the "ELM" concept goes back to Rosenblatt's work in the 1950s.^{[R62][T22]}
Link.
used LSTM
over 4 billion automatic translations per day (The Verge, August 4, 2017);
Facebook blog by J.M. Pino, A. Sidorov, N.F. Ayan (August 3, 2017)
[FDL] J. Schmidhuber (AI Blog, 2013). My First Deep Learning System of 1991 + Deep Learning Timeline 1960-2013.
PDF.
J. Schmidhuber (AI Blog, 26 March 2021, updated 2022).
alternative^{[FWP0-1]} to recurrent NNs.
the fast weights^{[FAST,FASTa,b]} of
Such Fast Weight Programmers^{[FWP0-6,FWPMETA1-8]} can learn to memorize past data, e.g.,
by computing fast weight changes through additive outer products of self-invented activation patterns^{[FWP0-1]}
(now often called keys and values for self-attention^{[TR1-6]}).
The similar Transformers^{[TR1-2]} combine this with projections
Transformers with linearized self-attention^{[TR5-6]}
In 1993, he introduced
the attention terminology^{[FWP2]} now used
in this context,^{[ATT]} and
RNNs that program themselves.
See tweet of 2022.
PDF.
normalization).^{[FWP]}
PDF.
HTML.
Pictures (German).
See tweet of 2022 for 30-year anniversary.
PDF.
Preprint: arXiv:1811.12143. PDF.
PDF. Very similar to [FWP0-2], in both motivation [FWP2] and execution.
This work on "attention" did not cite Schmidhuber's much earlier original work of 1991-1993 on soft attention and Transformers with linearized self-attention.^{[FWP,FWP0-2,6][ATT]}
Preprint: arXiv:2003.08165.
PDF.
HTML overview.
Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021. Preprint: arXiv:2102.11174.
Preprint: arXiv:2106.06295 (June 2021).
PDF.
An introspective network that can learn to run its own weight change algorithm. In Proc. of the Intl. Conf. on Artificial Neural Networks,
J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF.
can be found here.
Preprint arXiv:2012.14905 [cs.LG], 2020.
Report arXiv:2011.07831 [cs.AI], 2020.
Preprint: arXiv:2202.05780.
PDF.
Probably the first paper on using stochastic gradient descent^{[STO51-52]}
reverse mode of automatic differentiation or backpropagation^{[BP1]}).
OCR-based PDF scan of pages 94-135 (see pages 119-120).
Implementation of Amari's 1967 stochastic gradient descent method for multilayer perceptrons.^{[GD1]} (S. Amari, personal communication, 2021.)
Preprint arXiv/2207.01570, 4 July 2022 (submitted in May 2022).
arXiv:cs/0309048 (2003).
More.
PDF.
Cognitive Computation 1(2):177-193, 2009. PDF.
More.
Google Research Blog, Sep 2015, see also
Aug 2015 Google's speech recognition based on CTC and LSTM.
Alphr Technology, Jul 2015, or 9to5google, Jul 2015
WIRED, Sep 2016,
siliconANGLE, Sep 2016
Blog post, Internet Archive, 2010.
A blog post describing basic ideas^{[AC][AC90,AC90b][AC20]} of GANs.
A description of GANs that does not cite Schmidhuber's original GAN principle of 1990^{[AC][AC90,AC90b][AC20][R2][T22]} (also containing wrong claims about Schmidhuber's adversarial NNs for Predictability Minimization^{[PM0-2][AC20][T22]}).
Link.
This was number 1 on Hacker News.
Frankfurter Allgemeine Zeitung, 16/6/2021.
Preprint arXiv/2005.14165.
for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011. PDF. ArXiv preprint.
win four important computer vision competitions 2011-2012 before others won any
PDF.
HTML overview.
competitor.^{[DAN1]} This led to massive interest from industry.
[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More.
PDF.
DanNet,^{[DAN,DAN1][R6]}
to win computer vision contests in 2011^{[GPUCNN2-3,5]} (AlexNet and VGG Net^{[GPUCNN9]} followed in 2012-2014). [GPUCNN4] emphasizes benefits of Fukushima's ReLUs (1969)^{[RELU1]} and dropout (a variant of Hanson 1990 stochastic delta rule)^{[Drop1-4]} but neither cites the original work^{[RELU1][Drop1]} nor the basic CNN architecture (Fukushima, 1979).^{[CNN1]}
J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet was the first CNN to win one, and won 4 of them in a row before the similar AlexNet/VGG Net and the Resnet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision.
PDF.
PDF.
[GPUCNN8] J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet).
first deep learner to win a medical imaging contest (2012). Link.
J. Schmidhuber (Blog, 2000). Most influential persons of the 20th century (according to Nature, 1999). The Haber-Bosch process has often been called the most important invention of the 20th century^{[HAB1]}
PDF.
PDF.
Bengio claimed^{[YB20]}
Schmidhuber's publications on exactly this topic
date back to 1991-93.^{[UN0-2][UN]}
An unsupervised learning algorithm related to Schmidhuber's supervised Neural Heat Exchanger.^{[NHE]}
[HIN] J. Schmidhuber (AI Blog, 2020). Critique of Honda Prize for Dr. Hinton. Science must not allow corporate PR to distort the academic record. See also [T22].
previous related work.^{[BB2][NAN1-4][NHE][MIR](Sec. 15, Sec. 17)[FWPMETA6]}
PDF.
what Y. LeCun called an "open problem" in 2022.^{[LEC]}
North-Holland, 1991. PDF. Extending TR FKI-129-90, TUM, 1990.
PDF.
This work did not cite Schmidhuber's gradient-based subgoal generators for hierarchical reinforcement learning (1990).^{[HRL0-2]}
PDF.
Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS 2015. The
LSTM with forget gates^{[LSTM2]} for RNNs.) Resnets^{[HW2]} are a version of this where the gates are always open: g(x)=t(x)=const=1.
Highway Nets perform roughly as well as ResNets^{[HW2]} on ImageNet.^{[HW3]} Variants of highway gates are also used for certain algorithmic tasks, where the simpler residual layers do not work as well.^{[NDR]}
More.
Link.
arXiv:1512.03385
(Dec 2015). Residual nets are a version of Highway Nets^{[HW1]}
More.
arxiv:1612.07771 (2016). Also at ICLR 2017.
This work did not cite the earlier LSTM^{[LSTM0-6]} trained by Connectionist Temporal Classification (CTC, 2006).^{[CTC]} CTC-LSTM was successfully applied to speech in 2007^{[LSTM4]} (also with hierarchical LSTM stacks^{[LSTM14]}) and became the first superior end-to-end neural speech recogniser that outperformed the
state of the art, dramatically improving Google's speech recognition.^{[GSR][GSR15][DL4]}
Markov models (HMMs).^{[BW][BRI][BOU]} [HYB12] still used the old hybrid approach and did not compare it to CTC-LSTM. Later, however, Hinton switched to LSTM, too.^{[LSTM8]}
Ernst Ising and Wilhelm Lenz in the 1920s.^{[L20][I25][K41][W45][T22]} It settles into an equilibrium state in response to input conditions, and is the foundation of the first well-known learning RNNs.^{[AMH1-2]}
Who Invented the IC?
Preprint arXiv:1704.04760
PDF.
PDF.
Mathematischen Schriften, ed. C. Gerhardt, Berlin 1879, vol.7, p.223. English link.
Link.
arXiv:1607.06450, 2016.
[LEC] J. Schmidhuber (AI Blog, 2022). LeCun's 2022 paper on autonomous machine intelligence rehashes but does not cite essential work of 1990-2015. Years
See tweet1.
LeCun also listed the "5 best ideas 2012-2022" without mentioning that
See tweet2.
[LEI21] J. Schmidhuber (AI Blog, 2021). 375th birthday of Leibniz, founder of computer science.
Frankfurter Allgemeine Zeitung (FAZ), 17/5/2021. FAZ online:
19/5/2021.
[LEI21b] J. Schmidhuber (AI Blog, 2021). 375. Geburtstag des Herrn Leibniz, dem Vater der Informatik.
PDF.
[LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF.
Based on [LSTM0]. More.
PDF.
PDF.
PDF.
PDF.
PDF.
PDF.
PDF.
PDF.
PDF.
PDF.
PDF.
Preprint: arxiv:1506.07452.
PDF.
J. Schmidhuber (AI Blog, Dec 2020). 10-year anniversary of our journal paper on deep reinforcement learning with policy gradients for LSTM (2007-2010). Recent
PDF.
are actually a variant of the vanilla LSTM architecture^{[LSTM2]} (2000) which the authors did not cite
although this work^{[LSTM2]} was the one that introduced gated recurrent units.
Furthermore, Schmidhuber's team automatically evolved lots of additional LSTM variants and topologies already in 2009^{[LSTM7]} without changing the name of the basic method.
learn to count^{[LSTMGRU2]} nor learn simple non-regular
languages;^{[LSTMGRU2]} they
according to Google Brain.^{[LSTMGRU3]})
Preprint arXiv:1805.04908.
Architectures. Preprint arXiv:1703.03906
A misleading "history of deep learning" goes more or less like this: "In 1969, Minsky & Papert^{[M69]}
researchers took a fresh look at the problem in the 1980s."^{[S20]} However, the 1969 book^{[M69]} addressed a "problem" of Gauss & Legendre's shallow learning (circa 1800)^{[DL1-2]} that had already been solved 4 years prior by Ivakhnenko & Lapa's popular deep learning method,^{[DEEP1-2][DL2]}
and then also by Amari's SGD for MLPs.^{[GD1-2]}
Minsky was apparently unaware of this and failed to correct it later.^{[HIN](Sec. I)[T22](Sec. XIII)}
J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of
Searchable PDF scan (created by OCRmypdf which uses
LSTM).
HTML.
better GP methods through Meta-Evolution. More.
[MIR] J. Schmidhuber (AI Blog, Oct 2019, updated 2021, 2022). Deep Learning: Our Miraculous Year 1990-1991. Preprint
arXiv:2005.05744, 2020. The
Computation 22(12): 3207-3220, 2010. ArXiv Preprint.
(AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training.
By 2010, when compute was 100 times more expensive than today, both the feedforward NNs^{[MLP1]}
J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in Schmidhuber's labs at TU Munich and IDSIA. (1) Long Short-Term Memory (LSTM), (2) ResNet (which is the earlier Highway Net with open gates), (3) AlexNet and VGG Net (both building on the similar earlier DanNet: the first deep convolutional NN to win
image recognition competitions),
Adversarial Artificial Curiosity), and (5) variants of Transformers (Transformers with linearized self-attention are formally equivalent to the much earlier Fast Weight Programmers).
Annus Mirabilis of 1990-1991.^{[MIR]}
PDF.
PDF.
Preprint arXiv:1608.05343, 2016.
Preprint arXiv:1611.01578 (PDF), 2017.
Compare the earlier Neural Architecture Search of Bayer et al. (2009) for LSTM-like topologies.^{[LSTM7]}
[NASC1] J. Schmidhuber. First Pow(d)ered flight / plane truth. Correspondence, Nature, 421 p 689, Feb 2003.
[NASC3] J. Schmidhuber. The last inventor of the telephone. Letter, Science, 319, no. 5871, p. 1759, March 2008.
Correspondence, Nature, vol 483, p 541, March 2012, doi:10.1038/483541b.
Letter, Science, vol 336, p 1639, June 2012.
See also comment on response by A. Hodges (DOI:10.1126/science.336.6089.1639-a)
[NASC6] J. Schmidhuber. Colossus was the first electronic digital computer. Correspondence, Nature, 441 p 25, May 2006.
[NASC6a] J. Schmidhuber. Comment on "Biography: The ABC of computing" by J. Gilbey, Nature 468 p 760-761 (2010). Link.
[NASC7] J. Schmidhuber. Turing's impact. Correspondence, Nature, 429 p 501, June 2004
[NASC8] J. Schmidhuber. Prototype resilient, self-modeling robots. Correspondence, Science, 316, no. 5825 p 688, May 2007.
[NASC9] J. Schmidhuber. Comparing the legacies of Gauss, Pasteur, Darwin. Correspondence, Nature, vol 452, p 530, April 2008.
HTML.
The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization. Proc. ICLR 2022. Preprint arXiv/2110.07732.
Link.
excellent 1995 neural probabilistic text model.^{[SNT]} See also Nakamura and Shikano's 1989 word category prediction model.^{[NPMa]}
Compare Konrad Zuse's much earlier 1948 work on
theorem proving^{[ZU48]}
the first high-level programming language.^{[BAU][KNU]}
NY Times article
Learning Dexterous In-Hand Manipulation. Preprint arXiv:1808.00177 (PDF).
arxiv:1912.06680.
An LSTM composes 84% of the model's total parameter count.
2018. An LSTM with 84% of the model's total parameter count was the core of OpenAI Five.
Link.
J. Schmidhuber (Blog, 2006).
Is History Converging? Again?
history's exponential acceleration since the Big Bang.^{[OMG]}
Preprint arXiv/1606.06724.
Preprint arXiv/1708.03498.
Preprint arXiv/1802.10353.
Preprint arXiv/2010.03635.
Preprint arXiv/2011.12930.
PDF.
HTML.
HTML overview.
OOPS source code in crystalline format.
PDF.
HTML.
Link.
J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, and
the GAN principle
Based on TR FKI-126-90 (1990).^{[AC90]}
More.
PDF.
Partially based on TR FKI-126-90 (1990).^{[AC90]}
Report arXiv:1210.0118 [cs.AI], 2015.
One Big Net For Everything. Preprint arXiv:1802.08864 [cs.AI], Feb 2018.
Preprint: arXiv:1809.01999.
Github: World Models.
minimization. TR CU-CS-565-91, Univ. Colorado at Boulder, 1991. PDF.
More.
1991. PDF.
More.
PDF. More.
Link.
arXiv:1112.5309 [cs.AI]
PDF.
First Experiments with PowerPlay.
arXiv:1210.8385 [cs.AI].
[R1] Reddit/ML, 2019. Hinton, LeCun, Bengio receive ACM Turing Award. This announcement contains more comments about Schmidhuber than about any of the awardees.
[R2] Reddit/ML, 2019. J. Schmidhuber really had GANs in 1990.
[R3] Reddit/ML, 2019. NeurIPS 2019 Bengio Schmidhuber Meta-Learning Fiasco.
in 1987^{[META1][META]} long before Bengio
[R4] Reddit/ML, 2019. Five major deep learning papers by G. Hinton did not cite similar earlier work by J. Schmidhuber.
[R5] Reddit/ML, 2019. The 1997 LSTM paper by Hochreiter & Schmidhuber has become the most cited deep learning research paper of the 20th century.
[R6] Reddit/ML, 2019. DanNet, the CUDA CNN of Dan Ciresan in J. Schmidhuber's team, won 4 image recognition challenges prior to AlexNet.
[R7] Reddit/ML, 2019. J. Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970.
[R8] Reddit/ML, 2019. J. Schmidhuber on Alexey Ivakhnenko, godfather of deep learning 1965.
[R9] Reddit/ML, 2019. We
[R11] Reddit/ML, 2020. Schmidhuber: Critique of Honda Prize for Dr. Hinton
[R12] Reddit/ML, 2020. J. Schmidhuber: Critique of Turing Award for Drs. Bengio & Hinton & LeCun
[R15] Reddit/ML, 2021. J. Schmidhuber's work on fast weights from 1991 is similar to linearized variants of Transformers
Although these MLPs did not yet have deep learning, because only the last layer learned,^{[DL1]}
Rosenblatt basically had what much later was rebranded as Extreme Learning Machines (ELMs) without proper attribution.^{[ELM1-2][CONN21][T22]}
J. Schmidhuber (AI Blog, 2001). Raw Computing Power.
Preprint arXiv/1311.2524, Nov 2013.
Preprint arXiv/1703.06870, 2017.
PDF.
The first paper on policy gradients for LSTM. This approach has become very important in reinforcement learning.^{[LSTMPG]}
This experimental analysis of backpropagation did not cite the origin of the method,^{[BP1-5]} also known as the reverse mode of automatic differentiation.
the first working algorithms for deep learning of internal representations (Ivakhnenko & Lapa, 1965)^{[DEEP1-2][HIN]} as well as
Amari's work (1967-68)^{[GD1-2]} on learning internal representations in deep nets through stochastic gradient descent.
Even later surveys by the authors^{[DL3,3a]} failed to cite the prior art.^{[T22]}
Link.
in the 1960s-70s, especially outside of the Anglosphere.^{[DEEP1-2][GD1-3][CNN1][DL1-2][T22]}
The Past, Present and Future of Artificial Intelligence.
Link.
PDF.
Much later this was called a probabilistic language model.^{[T22]}
PDF.
Link.
ACM's justification of the 2018 A.M. Turing Award (announced in 2019). WWW link.
Local copy 1 (HTML only).
Local copy 2 (HTML only).
[T22] debunks this justification.
[T20a] J. Schmidhuber (AI Blog, 25 June 2020). Critique of 2018 Turing Award for Drs. Bengio & Hinton & LeCun. A precursor of [T22].
[T22] J. Schmidhuber (AI Blog, 2022).
Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21, IDSIA, Lugano, Switzerland, 2022.
Debunking [T19] and [DL3a].
the 1991 publication on what's now called "Transformers with linearized self-attention."^{[FWP0-6][TR5-6]}
attention terminology in 1993.^{[ATT][FWP2][R4]}
See tweet of 2022 for 30-year anniversary.
Link.
[TUR21] J. Schmidhuber (AI Blog, Sep 2021). Turing Oversold. It's not Turing's fault, though.
The Turing Test.
YouTube video, 2022.
Preprint arXiv/1912.02875, 5 Dec 2019.
Preprint arXiv/1912.02877, 5 Dec 2019.
J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised or self-supervised pre-training. Unsupervised
PDF.
By 1993, the approach solved problems of depth 1000 [UN2]
neural knowledge distillation procedure
The systems of 1991 allowed for much deeper learning than previous methods. More.
1992. Based on TR FKI-148-91, TUM, 1991.^{[UN0]} PDF.
approaches are now widely used. More.
[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF.
can be found here (depth > 1000).
2006. PDF.
It did not cite the much earlier 1991 unsupervised pre-training of stacks of more general recurrent NNs (RNNs)^{[UN0-3]}
the first NNs shown to solve very deep problems.
(or negative log probability) of the data representation in the level below.^{[HIN][T22][MIR]}
This can greatly facilitate very deep downstream learning.^{[UN0-3]}
The comment under reference^{[UN4]} applies here as well.
Theory of Universal Learning Machines & Universal AI.
Link.
[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF. More on the Fundamental Deep Learning Problem.
Results are essentially identical to those of Schmidhuber's diploma student Sepp Hochreiter (1991).^{[VAN1]} Even after a common publication,^{[VAN3]} the first author of [VAN2] published papers^{[VAN4]} that cited only their
own [VAN2] but not the original work.
PDF.
[VAN4] Y. Bengio. Neural net language models. Scholarpedia, 3(1):3881, 2008. Link.
Link.
Youtube video [see 28:16].
However, in 2010, Schmidhuber's team in Switzerland showed^{[MLP1-2]}
unsupervised pre-training is not necessary
Preprint arXiv:1609.08144 (PDF), 2016. Based on LSTM which it mentions at least 50 times.
WWW link (retrieved 15 May 2020).
Local copy (plain HTML only).
Schmidhuber's publications on exactly this topic
date back to 1991-93.^{[UN0-2][UN]}
already in 1995.^{[SNT]}
a general, practical, program-controlled computer.
architecture [NEU45].
PDF.
J. Schmidhuber (AI Blog, 2021). 80th anniversary celebrations: 1941: Konrad Zuse completes the first working general computer, based on his 1936 patent application.
J. Schmidhuber (AI Blog, 2021). 80. Jahrestag: 1941: Konrad Zuse baut ersten funktionalen Allzweckrechner, basierend auf der Patentanmeldung von 1936.
Weltwoche, Nr. 33.21, 19 August 2021.
PDF.
(v1: 24 Sep 2021, v2: 31 Dec 2021)
Versions since 2021 archived in the Internet Archive
This is a point-for-point critique of ACM's justification of the ACM A. M. Turing Award for deep learning, as well as a critique of the Turing Lecture given by the awardees (published by ACM in July 2021).
This critique draws on my 2015 survey of deep learning,^{[DL1]} my June 2020 article,^{[T20a][R12]} and version 1 of the present report; it can also be seen as a short history of the deep learning revolution, at least as far as ACM's erroneous laudation and the Turing Lecture are concerned.
(see Executive Summary and Sec. I, V, II, XII, XIX, XXI, XIII, XIV, XX, XVII).
(A) speech recognition,
(B) natural language processing,
(C) robotics,
(D) computer vision,
(VII) medicine, astronomy, materials science.
(see Sec. A, B, C, D, VII, XVII, VI, XVI).
(see Sec. II, V, XX, XVIII)
with Dr. Bengio & Dr. Hinton (see Sec. XVII, I).
I respond to LBH's recent ACM article (July 2021).
It expands material in my Critique of the 2019 Honda Prize^{[HIN]} (~3,000 words).
Abstract & Outline (~300 words),
Introduction (~300 words),
Critique of LBH's ACM article (Turing Lecture) of July 2021^{[DL3a]}
Executive summary of what's wrong with ACM's laudation (~1,000 words),
21 comments on 21 claims by ACM (~8,000 words),
Conclusion (~2,000 words).
All backed up by over 300 references (over 10,000 words).
The text contains numerous hyperlinks to relevant overview sites from the AI Blog.
"science is self-correcting."^{[SV20]} It is important to get the facts of the field's history right, no matter whether they are mine or other people's.^{[DL1-2][HIN][NASC1-9]} The present page is offered as a resource for all good computer scientists who share this inclination,
and to fight plagiarism,^{[FAKE2]}
collusion rings,^{[LIT21]} and systemic academic corruption in all of their more and less subtle forms.^{[FAKE]}
Sec. 2 addresses LBH's 2021 ACM article,^{[DL3a]} which necessitated an extension of the first version of this post.^{[T20a][R12]}
The main target is ACM's official justification^{[T19]} of the 2018 A.M. Turing Award.^{[R1]} After the Executive Summary in Sec. 3, Sec. 4 will split ACM's full text^{[T19]} into 21 parts
I, II, III, IV, V, VI, VII, VIII, IX, X, XI, XII, XIII, XIV, XV, XVI, XVII, XVIII, XIX, XX, XXI.
Most of the critiques are based on references to original papers and material from the AI Blog.^{[AIB][MIR][DEC][HIN]}
In July 2021, ACM published yet another misleading overview of the field, this time based on LBH's Turing Lecture,^{[DL3a]} extending LBH's well-known earlier omissions.^{[DLC][HIN][T20a]}
LBH claim to "briefly describe the origins of deep learning"^{[DL3a]} without even mentioning the world's first working deep learning nets by
Ivakhnenko and Lapa in 1965^{[DEEP1-2][R8]} (see Sec. II).
this class of methods was pioneered in 1991^{[UN-UN2]} (see Sec. II, III).
ResNet is a version of our Highway Net, the first really deep feedforward NN^{[HW1-3]} (see Sec. D, VI). These milestones of deep learning were all driven by my lab:^{[MOST]} In 1991, I had the first very deep NNs based on unsupervised pre-training;^{[UN-UN2]} LSTMs brought essentially unlimited depth to gradient-based supervised recurrent NNs;^{[LSTM0-17]} later our Highway Nets^{[HW1-3]} brought it to feedforward NNs.
Our superior end-to-end neural speech recognition from 2007^{[LSTM4,14]} was based on LSTM^{[LSTM0-6]} (1990s-2005) and CTC (2006).^{[CTC]} By the time of the Turing Lecture, our CTC-LSTM-based speech recognition (not that of Hinton) had been on most smartphones for years^{[GSR][GSR15-19][DL4]} (see Sec. A, VI, XI, XV). Similarly for machine translation (see Sec. B).
LBH cite Hinton (2012) for "dropout" without mentioning that dropout is just a variant of Hanson's 1990 stochastic delta rule^{[Drop1-3]} (see Sec. XIV).
perceptrons through stochastic gradient descent^{[GD1-3]} (without reverse mode backpropagation^{[BP1]}).
Fukushima who introduced ReLUs in 1969^{[RELU1-2]} (see Sec. XIV).
called AlexNet,^{[GPUCNN4]} without mentioning that our earlier groundbreaking deep GPU-based DanNet^{[GPUCNN1-3,5-8][DAN]} did not need ReLUs at all to win 4 earlier object recognition competitions and to achieve superhuman results already in 2011^{[GPUCNN1-8][R5-6]} (see Sec. XIV).
XVIII).
already in 1965^{[DEEP1-2][R8]} (see Sec. II).
earlier fast weights of von der Malsburg (1981) and Feldman (1982).^{[FAST,FASTa-b][FWP]}
described in the 1991-93 papers on Fast Weight Programmers and linear Transformers^{[FWP0-1,6]} (see Sec. XVI, XVII-2).
dedicate an extra section to attention-based Transformers,^{[TR1-6]} citing Bengio's team (2014) for "soft attention"^{[ATT14]} without citing the much earlier original work of 1991-1993 on soft attention and linear Transformers^{[FWP,FWP0-2,6][ATT]} (see Sec. XVII-1, XVI).
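The formal correspondence invoked here can be sketched in a few lines: a Transformer with linearized (softmax-free) self-attention maintains a fast weight matrix that is "programmed" by additive outer products of values and keys and queried by a matrix-vector product, as in a Fast Weight Programmer. A toy sketch; all names and numbers are mine and purely illustrative:

```python
# Sketch: linearized self-attention as fast weight programming.
# A "slow" net would normally produce the keys/values/queries from the
# input; here they are given directly. The fast weight matrix W
# accumulates outer products value x key; a query then retrieves
# W @ query -- no softmax, hence "linearized" attention.

def outer(v, k):
    return [[vi * kj for kj in k] for vi in v]

def matvec(W, q):
    return [sum(wij * qj for wij, qj in zip(row, q)) for row in W]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def linear_attention(keys, values, queries):
    d_v, d_k = len(values[0]), len(keys[0])
    W = [[0.0] * d_k for _ in range(d_v)]   # fast weights, start at zero
    out = []
    for k, v, q in zip(keys, values, queries):
        W = add(W, outer(v, k))             # "program" the fast weights
        out.append(matvec(W, q))            # retrieve with the query
    return out

# With one-hot keys, retrieval reproduces the stored values:
keys    = [[1.0, 0.0], [0.0, 1.0]]
values  = [[2.0, 3.0], [4.0, 5.0]]
queries = [[1.0, 0.0], [0.0, 1.0]]
print(linear_attention(keys, values, queries))
# first query retrieves [2.0, 3.0]; second retrieves [4.0, 5.0]
```

The same update-then-query loop, written with a softmax over key-query scores instead of the raw products, gives standard self-attention; dropping the softmax is what makes the attention weights collapse into a single additively updated weight matrix.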
LBH credit Bengio's team^{[NPM]} for neural language models without mentioning our much earlier 1995 work on neural probabilistic models of text compression^{[SNT]} (see Sec. XVI, XVII-1).
LBH cite Bengio's 2014 paper on Generative Adversarial Networks (GANs)^{[GAN0-1]} without mentioning that
GANs are instances
of the Adversarial Curiosity Principle of 1990^{[AC90-20][MIR](Sec. 5)} (see Sec. XVII).
In summation, LBH have repeatedly chosen to ignore the previous well-known critiques^{[DLC][HIN][T20a]} and deep learning surveys,^{[DL1-2]} and ACM's peer review process failed to catch this. ACM's Code of Ethics and Professional Conduct^{[ACM18]} states: "Computing
and deep learning (e.g., Sec. I), ACM lauds
Numerous references can be found under the relevant section links I-XXI, which adhere to the sequential order of ACM's text.^{[T19]}
Sec. II: Deep learning became really deep in 1991 in my lab, through unsupervised pre-training of NNs and supervised LSTM.
Sec. I contains 4 subsections A, B, C, D. A: Speech Recognition (see also Sec. VI & XI & XV): The first superior end-to-end neural speech recognition combines two methods from my lab: LSTM (1990s-2005) and CTC (2006), which were applied to speech in 2007. Hinton (2012) and Bengio (XV) still used the old hybrid approach and did not compare it to our revolutionary CTC-LSTM, which was soon on most smartphones.
Sec. B: Natural Language Processing (see also Sec. VI & XI & XVI): The first superior end-to-end neural machine translation (soon used for several billions of translations per day) was also based on our LSTM.
Sec. C: Robotics: Some of the most visible breakthroughs were also based on our LSTM.
Sec. D: Computer Vision (see also Sec. XVIII & XIV & XI & VI): The basic CNN principles were invented and developed by Fukushima and Waibel and applied to speech, all before LeCun's CNN work (XVIII). In 2010, we showed that deep NNs can be trained by plain backpropagation, without unsupervised pre-training (in contrast to Hinton's claims). Our DanNet was the first CNN fast & deep enough for superior computer vision in 2011, winning 4 image recognition contests in a row. ResNet, the most cited NN, is an open-gated version of our earlier Highway Nets.
Sec. XIV: Our deep & fast CNN achieved the first superhuman visual pattern recognition in an international contest (where LeCun participated). Sec. XI: ACM mentions GPU-accelerated NNs. Our deep GPU-NN of 2010 debunked unsupervised pre-training (introduced by myself in 1991 and later championed by Hinton), and our GPU-CNN of 2011 (DanNet) was the first to win computer vision contests. Sec. XVIII: The basic CNN architectures are due to Fukushima and Waibel (see Sec. D).
The first application of CNNs with backpropagation to biomedical/biometric images is due to Baldi and Chauvin.^{[BA93]}
Sec. VII: ACM explicitly mentions medicine; our DanNet was the first to win medical imaging competitions through deep learning.
Sec. XII & XIX & XXI: Modern backpropagation was first published by Linnainmaa in 1970 (see also Sec. XIII & II & V & III & IX & X & XX).
Sec. XX: ACM credits LeCun for work on
Sec. XXI: ACM credits LeCun for work on
Sec. XV: ACM credits Bengio for hybrids of NNs and probabilistic models of sequences. Our CTC-LSTM superseded such hybrids (see Sec. A & B).
Sec. XVI: ACM credits LBH here, but we started this in 1990-93, long before LBH.
Sec. XVII: My lab's contributions include Artificial Curiosity (1990), vanishing gradients (1991), metalearning (1987), unsupervised pre-training (1991), compressing or distilling one NN into another (1991), learning sequential attention with NNs (1990), fast weight programmers (1991), and other topics.^{[R2-R6]}
Sec. IV is on Turing (1936) and his predecessors.
Critique of LBH's ACM article (Turing Lecture) of July 2021.
Sec. Conclusion:
In the recent decade of deep learning, the most valuable applications (speech recognition, language translation, etc.) on billions of devices (also healthcare applications) were heavily based on results from my labs.
See Sec. II & III & V & XII & XIII & XVII & XIV & XIX & XX & XXI.
In what follows, ACM's full text [T19] is split into 21 parts I, II, III, IV, V, VI, VII, VIII, IX, X, XI, XII, XIII, XIV, XV, XVI, XVII, XVIII, XIX, XX, XXI.
LBH and their co-workers have contributed certain useful improvements of existing deep learning methods.^{[CNN2,4][CDI][LAN][RMSP][XAV][ATT14][CAPS]} However, the field's foundations were laid by others: deep learning multilayer perceptrons (1965),^{[DEEP1-2][R8]} stochastic gradient descent for multilayer perceptrons (1967),^{[GD1-3]} modern backpropagation (1970),^{[BP1-2][R7]} architectures of recurrent NNs (1925-56)^{[I25][MC43][K56]} and convolutional NNs (1979),^{[CNN1]} principles of generative adversarial NNs and artificial curiosity (1990),^{[AC90,90b][AC20]} unsupervised pre-training for deep NNs (1991),^{[UN1-2]} vanishing gradients (1991)^{[VAN1]} & Long Short-Term Memory or LSTM (Sec. A), GPU-accelerated NNs (2004),^{[GPUNN][DAN][DAN1][GPUCNN5]} NNs with over 100 layers (2015),^{[HW1-3][R5]} and transformer-like^{[TR1-6][FWP]} attention^{[FWP][ATT]} through fast weight programmers (1991).^{[FWP0-2,6]}^{[DL1-2][R2-R8]} Often LBH failed to cite essential prior work, even in their later surveys.^{[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5, R7-R8]} This may explain some of ACM's misattributions.^{[T19]} See Sec. II & III & V & XIII & X & XVII & XII & XVIII & XX. By the 2010s,^{[DEC]} deep NNs were heavily used in academia and industry,^{[DL4]} in the fields mentioned by ACM (labeled as A, B, C, D) below. A. Speech Recognition. (A1) Long Short-Term Memory or LSTM (1990s-2005)^{[LSTM0-6]} overcomes the vanishing gradient problem analyzed by my student Sepp Hochreiter in 1991.^{[VAN1]} This happened long before the similar work of Bengio (see Sec. XVII).^{[MIR](Sec. 3, Sec. 4)} LSTM was refined with my student Felix Gers^{[LSTM2]} through "forget gates" based on end-to-end-differentiable fast weights.^{[MIR](Sec. 8)[FWP,FWP0-1]} (A2) Connectionist Temporal Classification (CTC) by my student Alex Graves et al. (2006).^{[CTC]} Our team successfully applied CTC-trained LSTM to speech in 2007^{[LSTM4]} (also with hierarchical LSTM stacks^{[LSTM14]}). This was very different from earlier hybrids of NNs and Hidden Markov models (HMMs)^{[BW][BRI][BOU]} (Sec. XV). Hinton et al. (2012) still used the old hybrid approach^{[HYB12]} and did not compare it to CTC-LSTM.
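To make the mechanism behind these claims concrete, here is a single-unit sketch of one step of a vanilla LSTM cell with a forget gate; the weights and inputs are illustrative choices of mine, not taken from the cited papers:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, W):
    """One step of a single-unit vanilla LSTM cell with a forget gate.
    W maps gate name -> (input weight, recurrent weight, bias)."""
    def gate(name, squash):
        wx, wh, b = W[name]
        return squash(wx * x + wh * h_prev + b)
    f = gate("f", sigmoid)       # forget gate: how much old state to keep
    i = gate("i", sigmoid)       # input gate: how much new input to write
    g = gate("g", math.tanh)     # candidate value to write
    o = gate("o", sigmoid)       # output gate
    c = f * c_prev + i * g       # gated "constant error carousel"
    h = o * math.tanh(c)
    return h, c

# Illustrative weights: a wide-open forget gate (large positive bias)
# preserves the cell state over long spans, which is how LSTM sidesteps
# the vanishing gradient problem described above.
W = {"f": (0.0, 0.0, 10.0), "i": (1.0, 0.0, 0.0),
     "g": (1.0, 0.0, 0.0), "o": (0.0, 0.0, 10.0)}
h, c = 0.0, 1.0
for _ in range(100):             # 100 steps with zero input
    h, c = lstm_cell_step(0.0, h, c, W)
print(round(c, 3))               # the state is almost perfectly preserved
```

With the forget gate nearly saturated at 1, the cell state after 100 steps is still about 0.995 of its initial value; a plain recurrent unit squashed through tanh at every step would have lost it long before.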
Through the work of my student Alex Graves, CTC-trained LSTM became the first recurrent NN (RNN) to win international competitions. Graves later reused our end-to-end neural speech recognizer^{[LSTM4][LSTM14]} as a postdoc in Hinton's lab.^{[LSTM8]} CTC-LSTM dramatically improved Google's speech recognition.^{[GSR][GSR15][DL4]} By 2019, Google's on-device speech recognition^{[GSR19]} (no longer on the server) was still based on LSTM^{[MIR](Sec. 4)} (see Sec. VI & XI & XV). B. Natural Language Processing. Already in 1995, we had an excellent neural probabilistic model of text^{[SNT]} (see Sec. XVI). In 2001, we showed that LSTM can learn languages unlearnable by traditional models such as HMMs.^{[LSTM13]} See also Sec. VI & XI & XV. Machine translation later profited from attention mechanisms tailored by Bengio's team.^{[ATT14][FWP]} However, such attention mechanisms also have their roots in my lab (1991);^{[FWP][FWP0-2,6]} see Sec. XVI. C. Robotics & RL etc. Since 2003, our team has used LSTM for Reinforcement Learning (RL) and robotics.^{[LSTM-RL][RPG][LSTMPG]} In the 2010s, this bore fruit in highly visible applications. For example, in 2018, a PG-trained LSTM was the core of OpenAI's famous Dactyl which learned to control a dextrous robot hand without a teacher.^{[OAI1][OAI1a]} In 2019, DeepMind beat a pro player in the game of Starcraft, which is theoretically harder than Chess or Go^{[DM2]} in many ways, using Alphastar whose brain has a deep LSTM core trained by PG.^{[DM3]} A PG-trained LSTM was also the core of OpenAI Five which learned to defeat human experts in the Dota 2 video game (2018).^{[OAI2]} Bill Gates called this a "huge milestone in advancing artificial intelligence".^{[OAI2a][MIR](Sec. 4)[LSTMPG]} Apart from A, B, C above, LSTM is also used in healthcare, chemistry, molecular design, lip reading, speech synthesis,^{[AM16]} predicting what's going on in nuclear fusion reactors, and so on.^{[DEC][DL4]} By 2017, a large share of the compute for inference in Google's datacenters was being used for LSTM (only 5% for the CNNs of Sec. D).^{[JOU17]} Apparently the first LSTM journal paper^{[LSTM1][R5]} is now the most cited deep learning research paper of the 20th century.
D. Computer Vision was revolutionized in the 2010s by a particular feedforward neural net (NN) called the convolutional NN (CNN).^{[CNN1-4]} The basic CNN architecture with convolutional and downsampling layers is due to Fukushima (1979),^{[CNN1]} who also introduced the now widely used rectified linear units (ReLUs) in 1969.^{[RELU1]} In 1987, NNs with convolutions were combined by Waibel with weight sharing and backpropagation.^{[CNN1a]} Waibel did not call this CNNs but TDNNs. The popular downsampling variant called max-pooling was introduced by Yamaguchi et al. for TDNNs in 1990^{[CNN3a]} and by Weng et al. for higher-dimensional CNNs in 1993.^{[CNN3]} Since 1989, LeCun's team has contributed improvements of CNNs, especially for images^{[CNN2,4]} (see Sec. XVIII). Finally, my own team showed in 2010^{[MLP1]} that unsupervised pre-training is not necessary to train deep NNs, contrary to claims by Hinton^{[VID1]} who said that "nobody in their right mind would ever suggest" this. Our fast GPU-based CNN of 2011,^{[GPUCNN1]} known as DanNet,^{[DAN,DAN1][R6]} went far beyond the earlier GPU-accelerated CNNs of 2006.^{[GPUCNN]} DanNet won a series of computer vision competitions, winning four of them in a row (15 May 2011, 6 Aug 2011, 1 Mar 2012, 10 Sep 2012).^{[GPUCNN5]} At IJCNN 2011 in Silicon Valley, DanNet blew away the competition and achieved the first superhuman visual pattern recognition^{[DAN1]} in an international contest (where LeCun's team took a distant second place). DanNet was also the first deep CNN to win a Chinese handwriting contest (ICDAR 2011) and an image segmentation contest (ISBI, May 2012). Only after our CVPR paper on DanNet^{[GPUCNN3]} did the similar AlexNet of Hinton's student Krizhevsky win the ImageNet^{[IM09]} 2012 contest^{[GPUCNN4-5][R6]} (now also without unsupervised pre-training, citing DanNet). Our CNN image scanners were 1000 times faster than previous methods.^{[SCAN]} The VGG network (ImageNet 2014 winner)^{[GPUCNN9]} and other highly cited CNNs^{[RCNN1-3]} further extended the work of 2011.^{[MIR](Sec. 19)} ResNet, the ImageNet 2015 winner^{[HW2]} (Dec 2015) and currently the most cited neural network,^{[MOST]} is a version (with open gates) of our earlier Highway Net (May 2015).^{[HW1-3][R5]} The Highway Net is actually the feedforward net version of vanilla LSTM.^{[LSTM2]} It was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers). See also Sec. XVIII & XIV & XI & VI.
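The "open gates" relation between the two architectures can be checked in a few lines: a Highway layer computes y = H(x)·T(x) + x·C(x) with a transform gate T and a carry gate C, and fixing both gates open (saturated at 1) yields the residual layer y = H(x) + x of ResNet. A scalar sketch with my own toy parameters, purely illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def highway_layer(x, w_h, b_h, w_t, b_t, w_c, b_c):
    """Scalar highway layer: y = H(x)*T(x) + x*C(x).
    T is the transform gate, C the carry gate (often coupled as C = 1 - T)."""
    H = math.tanh(w_h * x + b_h)   # the nonlinear transform
    T = sigmoid(w_t * x + b_t)     # transform gate
    C = sigmoid(w_c * x + b_c)     # carry gate (the skip path)
    return H * T + x * C

def residual_layer(x, w_h, b_h):
    """Scalar ResNet-style layer: y = H(x) + x."""
    return math.tanh(w_h * x + b_h) + x

# With both gates saturated open (large positive gate biases),
# the highway layer reduces to the residual layer:
x, w_h, b_h = 0.3, 0.7, -0.1
open_gate = 100.0                  # sigmoid(100) is 1.0 in float arithmetic
y_highway = highway_layer(x, w_h, b_h, 0.0, open_gate, 0.0, open_gate)
y_resnet = residual_layer(x, w_h, b_h)
print(abs(y_highway - y_resnet) < 1e-9)   # True
```

The gates are what the Highway Net adds: they let the net learn, per unit, how much to transform and how much to simply carry the input through, rather than hard-wiring the identity skip.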
Deep learning architectures appeared long before the 1980s. The first non-learning recurrent NN (RNN) architecture (the Lenz-Ising model) was analyzed by physicists in the 1920s.^{[L20][I25][K41][W45]} Recurrent architectures were also discussed in 1943 by McCulloch and Pitts^{[MC43]} and formally analyzed in 1956 by Kleene.^{[K56]} In 1972, Amari reused the Lenz-Ising model to build a learning RNN, later sometimes called the Hopfield network or Amari-Hopfield Network.^{[AMH1-3]} Turing wrote about artificial evolution;^{[TUR1]} Rosenblatt's perceptron with a single adaptive layer learned in 1958^{[R58]} (compare Joseph^{[R61]}); Widrow & Hoff's similar Adaline learned in 1962.^{[WID62]} Such shallow learning goes back to regression and the method of least squares.^{[DL1-2]} Deeper multilayer perceptrons (MLPs) were discussed by Steinbuch^{[ST61-95]} (1961), Joseph^{[R61]} (1961), and Rosenblatt^{[R62]} (1962), who wrote about "back-propagating errors" in an MLP with a hidden layer,^{[R62]} but did not yet have a general deep learning algorithm for deep MLPs (what's now called backpropagation is quite different and was first published by Linnainmaa in 1970^{[BP1-BP5][BPA-C]}). Compare also Selfridge's multilayer Pandemonium^{[SE59]} (1959). The first working learning algorithms for deep MLPs were published by Ivakhnenko & Lapa in 1965 (their nets containing the now popular multiplicative gates).^{[DEEP1-2][DL1-2]} A paper of 1971^{[DEEP2]} already described a deep learning net with 8 layers, trained by their highly cited method which was still popular in the new millennium,^{[DL2]} especially in Eastern Europe, where much of Machine Learning was born.^{[MIR](Sec. 1)[R8]} LBH failed to cite this, just like they failed to cite Amari,^{[GD1]} who in 1967 proposed stochastic gradient descent^{[STO51-52]} (SGD) for MLPs and whose implementation^{[GD2,GD2a]} (with Saito) learned internal representations at a time when compute was billions of times more expensive than today (see also Tsypkin's work^{[GDa-b]}). Fukushima's deep convolutional NN architecture was first introduced in the 1970s;^{[CNN1]} he had presented his very popular ReLU already in 1969.^{[RELU1-2]} See Sec. XIII, III, V, VIII, IX, and X. A misleading "history of deep learning" has been propagated by LBH & co-authors, e.g., Sejnowski^{[S20]} (see Sec. XIII).
It goes more or less like this: "In 1969, Minsky & Papert^{[M69]} showed that shallow NNs without hidden layers are very limited, and the field was abandoned until researchers took a fresh look at the problem in the 1980s."^{[S20]} However, as mentioned above, the 1969 book^{[M69]} addressed a "problem" of Gauss & Legendre's shallow learning (~1800)^{[DL1-2]} that had already been solved 4 years prior by Ivakhnenko & Lapa's popular deep learning method^{[DEEP1-2][DL2]} (and then also by Amari's SGD for MLPs^{[GD1-2]}). Minsky was apparently unaware of this and failed to correct it later.^{[HIN](Sec. I)} Deep learning research was alive also in the 1980s (see, e.g., a 1989 paper^{[MOZ]}). However, it became really deep in 1991 in my lab;^{[UN-UN3]} see Sec. 1 of the overview:^{[MIR]} First Very Deep NNs, Based on Unsupervised Pre-Training (1991). This enabled "Very Deep Learning" tasks of depth > 1000.^{[UN2][DL1][UN]} (By 2003, LSTM variants successfully dealt with language problems of depth up to 30,000^{[LSTM17]} and more.) Our lab also drove the shift from unsupervised pre-training to purely supervised learning (1991-95; 2006-10).^{[HIN](Sec. II)[MIR](Sec. 19)} See Sec. III. Note that LSTMs brought essentially unlimited depth to gradient-based supervised recurrent NNs; Highway Nets^{[HW1-3]} brought it to feedforward NNs.^{[MOST]}
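Why these depth numbers matter can be illustrated numerically: backpropagating an error signal through a chain of sigmoid units multiplies it by a local derivative (at most 0.25) times a weight at every step, so it shrinks exponentially with depth — the vanishing gradient problem identified in 1991. A toy demonstration with illustrative parameters of my own choosing:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_factor(z, w):
    """Local factor picked up per layer: sigmoid'(z) * w."""
    s = sigmoid(z)
    return s * (1.0 - s) * w

def gradient_after_depth(depth, z=0.0, w=1.0):
    """Gradient scale after backpropagating through `depth` layers."""
    g = 1.0
    for _ in range(depth):
        g *= backprop_factor(z, w)   # chain rule: multiply per layer
    return g

for depth in (10, 100, 1000):
    print(depth, gradient_after_depth(depth))
# With z=0, w=1 the per-layer factor is exactly 0.25: the gradient
# falls to ~1e-6 at depth 10, ~6e-61 at depth 100, and underflows
# to 0.0 at depth 1000 -- learning across such depths stalls.
```

This is exactly why "depth > 1000" was out of reach for plain gradient descent, and why architectures that preserve the error signal (LSTM's gated cell state, Highway/ResNet skip paths) were needed.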
Much of this was created by others (Sec. III).^{[DLC][DEEP1-2][BP1][DL1-2][R7-R8][R2-R4]} The foundations include: deep learning multilayer perceptrons (1965),^{[DEEP1-2][R8]} stochastic gradient descent for multilayer perceptrons (1967),^{[GD1-3]} modern backpropagation (1970),^{[BP1,2][R7]} architectures of recurrent NNs (1925-56)^{[I25][MC43][K56]} and convolutional NNs (1979),^{[CNN1]} principles of generative adversarial NNs and artificial curiosity (1990),^{[AC90,90b][AC20]} unsupervised pre-training for deep NNs,^{[UN1-2]} the vanishing gradient problem (1991)^{[VAN1]} & solutions to it (Sec. A), GPU-accelerated NNs (2004),^{[GPUNN][GPUCNN5]} and other foundations.^{[DL1-2][R2-R8]} Often LBH failed to cite essential prior work.^{[DLC][HIN][MIR](Sec. 21)} See Sec. II & V & XIII & IX & X & XVII & XII & XVIII & XX & I. Compare deeplearning.net, which until 2019 advertised deep learning as "moving beyond shallow machine learning since 2006",^{[DL7]} referring to Hinton's^{[UN4]} and Bengio's^{[UN5]} unsupervised pre-training methods of 2006, although we had this type of deep learning already in 1991;^{[UN][UN1-2]} see Sec. II & XVII (5). Not to mention Ivakhnenko's even earlier supervised layer-wise training of deep NNs^{[DEEP1-2]} which Hinton,^{[UN4]} Bengio,^{[UN5]} and LBH^{[DL3,DL3a]} did not cite either. See Sec. X.
In what follows, my comments systematically track the sequential order of ACM's claims.^{[T19]}
ACM's statement on Turing is greatly misleading, like some of its other statements.^{[T19]} In 1931, Gödel identified fundamental limits of theorem proving, computing, and any type of computation-based AI.^{[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]} Much of early AI in the 1940s-70s was actually about theorem proving.^{[ZU48][NS56]}
In 1936, Turing introduced the Turing Machine.^{[TUR]} He rederived the above-mentioned result.^{[CHU][TUR][HIN][GOD21,21a][TUR21][LEI21,21a]} In the same year of 1936, Emil Post published yet another independent universal model of computing.^{[POS]} (See also my reply to Hinton, who criticized my website on Turing without suggesting any fact-based corrections.^{[HIN]}) Gödel also posed the open problem "P=NP?" in his famous letter to John von Neumann (1956).^{[GOD56][URQ10]} Likewise, Konrad Zuse (1910-1995) created the world's first working programmable general-purpose computer 1935-41. His patent application of 1936^{[ZU36-38][Z36][RO98][ZUS21]} already described digital circuits, predating Claude Shannon's 1937 thesis on digital circuit design.^{[SHA37]} Zuse also created the first high-level programming language in the early 1940s.^{[BAU][KNU]} (See also the discussion of the conditional jump instruction.^{[RO98]})
Again, the foundations were created by others: deep NNs that learn internal representations (1965),^{[DEEP1-2][R8]} stochastic gradient descent for multilayer perceptrons (1967),^{[GD1-3]} modern backpropagation (1970),^{[BP1,2][R7]} architectures of recurrent NNs (1925-56)^{[I25][MC43][K56]} and convolutional NNs (1979),^{[CNN1]} principles of generative adversarial NNs and artificial curiosity (1990),^{[AC][AC90,90b][AC10][AC20]} unsupervised pre-training for deep NNs (1991),^{[UN1-2][UN]} vanishing gradients (1991)^{[VAN1]} & solutions to it (Sec. A),^{[LSTM0-17][CTC]} GPU-accelerated NNs (2004),^{[GPUNN][GPUCNN5]} record-breaking deep supervised NNs (2010)^{[MLP1-2]} and contest-winning deep CNNs (2011),^{[DAN][DAN1][GPUCNN5]} NNs with over 100 layers (2015),^{[HW1-3][R5]} transformer-like^{[TR1-6][FWP]} attention^{[FWP][ATT]} through fast weight programmers (1991),^{[FWP0-2,6]} and more.^{[DL1-2][R2-R8]} Often LBH failed to cite essential prior work.^{[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5,R7,R8,R11]} See Sec. II & I & III & XIII & X & XVII & XII & XVIII & XX.
ACM mentions "advances in natural language processing" and in speech. These advances were driven by the fast supervised NNs and CNNs achieved by our group in 2010-2011^{[MLP1-2][DAN][DAN1][GPUCNN5][R6]} and through Highway Net-like NNs (2015),^{[HW1-3][R5]} although the principles of CNNs were invented and developed by others since the 1970s.^{[CNN1-4]} See Sec. D & XVIII & XIV as well as Sec. 4 & Sec. 19 of the overview.^{[MIR]}
Baldi and Chauvin (1993) had the first application of CNNs with backpropagation to biomedical/biometric images.^{[BA93]} Our DanNet^{[DAN][DAN1][GPUCNN5]} was the first NN to win a medical imaging contest through deep learning (Sept 2012, on cancer detection),^{[GPUCNN5,8]} and was able to greatly improve steel defect detection.^{[ST]} All of this happened before the similar GPU-accelerated AlexNet of Hinton's student Krizhevsky^{[GPUCNN4-5][R6]} and the VGG network.^{[GPUCNN9]} DanNet also won a contest on mitosis detection^{[MGC][GPUCNN5,8]} (using the approach of Sec. D & XI).
without citing them.^{[DL1][DLC][HIN][R2-R4][R7-R8]} See Sec. V & XII & XIX & II & III & XIII & XVII & X & I.
who failed to cite them, even in later work.^{[HIN][DLC][DL1-2][DEEP1-2][RELU1-2][R7-R8]} See Sec. II & III & XIII & V & X & XIV & I.
The term "deep learning" was first introduced to Machine Learning by Dechter (1986), and to NNs by Aizenberg et al (2000).^{[DL2]} To my knowledge, LBH have never cited them. (Margin note: our 2005 paper on deep RL^{[DL6,6a]} was apparently the first machine learning publication with the phrase "learn deep" in the title.) Only later did LBH start talking about "deep learning ... moving beyond shallow machine learning since 2006",^{[DL7]} referring to their unsupervised pre-training methods of 2006. See Sec. III. Others built careers on deep learning long before LBH recognized this.^{[DEEP1-2][CNN1][HIN][R8][DL1][DLC]} Even deep learning through unsupervised pre-training was introduced by others.^{[UN1-3][R4][HIN](Sec. II)} See Sec. II & III & XIII & V & I.
ignored by LBH's papers^{[HIN][R7-R8][R2-R5]} (see Sec. V & II & III & I & XIII & XII & XIX & X & XVII).
ACM correctly mentions advancements through GPUs. The first to use GPUs for NNs were Jung & Oh (2004),^{[GPUNN][GPUCNN5]} but it was our group that made GPU-based NNs fast and deep enough to set an important benchmark record in 2010,^{[MLP1-2]} showing that unsupervised pre-training (pioneered by myself in 1991) is not necessary to train deep NNs, contrary to Hinton's claims.^{[VID1]} By 2011, our CNNs were deep and fast enough^{[DAN][DAN1][GPUCNN5]} to revolutionize computer vision (explicitly mentioned by ACM) for the first time^{[R6]} (see Sec. D).
Furthermore, by the mid 2010s, speech recognition and machine translation (explicitly mentioned by ACM) were actually dominated by the LSTM and CTC of our team.^{[LSTM1-4][CTC]} In particular, as mentioned in Sec. A, this end-to-end approach was superior to traditional methods such as HMMs.^{[BW][BOU][BRI][HYB12]} As mentioned in Sec. B and XVI, the first superior end-to-end neural machine translation was also based on LSTM.
ACM's statement is "less wrong" than Honda's^{[HIN](Sec. I)} but still misleading: ACM (and apparently even other award committees^{[HIN](Sec. I)}) credits backpropagation to Rumelhart et al. (1985-86),^{[RUM]} although Werbos had already applied it to NNs (1982).^{[BP2]} And the article^{[RUM]} even failed to mention Linnainmaa, the inventor of this famous algorithm for credit assignment in networks (1970),^{[BP1]} although Kelley already had a precursor thereof in the field of control theory;^{[BPA]} see also later work of the early 1960s.^{[BPB][BPC]}^{[R7]} Rumelhart et al. showed experimentally that backpropagation can yield useful internal representations in hidden layers of NNs.^{[RUM]} But this was essentially just an experimental analysis of a known method.^{[BP1-2]} More on the history of backpropagation can be found at Scholarpedia^{[DL2]} and in my award-winning survey.^{[DL1]} Also see Sec. XIX, II.
Some claim that "backpropagation is just the chain rule of Leibniz (1676) & L'Hopital (1696)." No: it is the efficient way of applying the chain rule to big networks with differentiable nodes (there are also many inefficient ways of doing this), and it was not published until 1970.^{[BP1]} Regarding a recent debate:^{[HIN]} it is true that in 2018, Hinton^{[AOI]} credited Rumelhart^{[RUM]} with the "invention" of backpropagation, while Hinton himself has been credited for "creating" the method and for other things he didn't do.^{[HIN]} Neither in a popular book^{[AOI]} nor in other recent work^{[DL3,DL3a]} did he cite Linnainmaa (1970),^{[BP1]} the true creator.^{[BP4-5]} It is true that his 2015 survey^{[DL3]} does cite Werbos (1974), who however described the method correctly only later in 1982^{[BP2]} and also failed to cite Linnainmaa.^{[BP1]} Compare the 1967-68 work of Amari:^{[GD1-3]} to my knowledge the first to propose and implement stochastic gradient descent^{[STO51-52]} for multilayer perceptrons (though not yet the efficient reverse mode gradient descent method now known as backpropagation^{[BP1]}); see also Tsypkin's work of 1966.^{[GDa-b]} Linnainmaa's backpropagation method was well-known.^{[BP5][DL1-2][DLC]} It wasn't created by "lots of different people" as Hinton suggested;^{[AOI][HIN][R11]} there is one person who published first^{[BP1]} and therefore should get the credit.
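The point about efficiency can be made concrete: reverse mode applies the chain rule once per connection in a single backward sweep from the output, instead of re-deriving each partial derivative from scratch. A minimal illustrative sketch (the `Var` class and its methods are my own toy construction, not any historical formulation):

```python
import math

class Var:
    """Scalar node in a computation graph with reverse-mode differentiation."""
    def __init__(self, value, parents=()):
        self.value = value      # forward value
        self.parents = parents  # (parent_node, local_derivative) pairs
        self.grad = 0.0         # accumulated d(output)/d(self)

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def tanh(self):
        t = math.tanh(self.value)
        return Var(t, ((self, 1.0 - t * t),))

    def backward(self):
        # Topologically order the graph, then sweep once from the output
        # to the inputs: each node distributes its gradient to its
        # parents, weighted by the locally stored derivatives.
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad

# Tiny "network": y = tanh(w*x + b); one backward() fills in all gradients.
w, x, b = Var(0.5), Var(2.0), Var(0.1)
y = (w * x + b).tanh()
y.backward()
```

The cost of the backward sweep is proportional to that of the forward pass, regardless of how many inputs the network has; that property, not the chain rule itself, is what the 1970 method contributed.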
ACM credits Hinton for the Boltzmann Machine (BM),^{[BM]} an approach to learning.^{[HIN]} Recently, however, I learnt through a reader that even the BM paper^{[BM]} did not cite prior relevant work by Sherrington & Kirkpatrick^{[SK75]} and Glauber.^{[G63]} (Compare related work.^{[H86][H88][S93]}) Compare also the much earlier multilayer perceptrons with arbitrarily many layers.^{[DEEP1-2][HIN]} See Sec. II & V & X.^{[MIR](Sec. 1)[R8]}
As mentioned in Sec. II, Sejnowski's rather self-serving "history of deep learning" [S20] claims that after Minsky & Papert's 1969 book,^{[M69]} the field lay dormant until researchers looked anew "at the problem in the 1980s."^{[S20]} However, the 1969 book^{[M69]} addressed a "deep learning problem" (a limitation of Gauss & Legendre's shallow learning around 1800^{[DL1-2]}) that had already been solved four years prior (see Sec. II), and deep learning research continued in the 1970s, especially outside of the Anglosphere.^{[DEEP2][GD1-3][CNN1][DL1-2]}
Dropout is actually a variant of Hanson's much earlier stochastic delta rule (1990).^{[Drop1-3]} Hinton's 2012 paper and his later patent did not cite this either. Nor was dropout needed to win computer vision competitions, as we showed already in 2011 in a contest where LeCun's team participated as well;^{[DAN1]} see Sec. D above. Back then, the only really decisive ingredient was the speedup of deep CNNs through GPUs.^{[GPUCNN1,3,5][R6]} Already before ImageNet 2012,^{[R6]} our fast deep CNN called DanNet had a monopoly on winning computer vision competitions.^{[GPUCNN5]} It more than "halved the error rate for object recognition" (ACM's wording) in a contest already in 2011,^{[GPUCNN2][DAN,DAN1][R6]} long before the similar system of Hinton's student. See Sec. D as well as Sec. 19 of the overview.^{[MIR]}
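For readers unfamiliar with the method under dispute, here is a minimal sketch of modern "inverted" dropout. Note the hedge: Hanson's stochastic delta rule injects noise at the level of individual weights (treating each as a random variable), rather than applying Bernoulli masks to whole units, so this illustrates the general noise-injection mechanism, not the 1990 formulation:

```python
import numpy as np

def dropout(activations, p_drop, rng, train=True):
    """Inverted dropout: during training, zero each unit with probability
    p_drop and rescale the survivors by 1/(1 - p_drop), so the expected
    activation matches test time (where the layer is the identity)."""
    if not train:
        return activations
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
a = np.ones(100_000)
out = dropout(a, p_drop=0.5, rng=rng)
# Roughly half the units are zeroed, but the mean is preserved.
```

The rescaling is the design choice worth noting: without it, activations would shrink in expectation during training and the network would behave differently at test time.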
Speech recognition had been dominated by hybrid NN-HMM approaches since the late 1980s.^{[BW][BRI][BOU]} Our LSTM (1990s-2005)^{[LSTM0-6]} and CTC^{[CTC]} (2006) were applied to speech in 2007.^{[LSTM4][LSTM14]} CTC-LSTM is end-to-end-neural and thus very different from (and superior to) the hybrid methods in use since the late 1980s.^{[BW][BRI][BOU][HYB12]} See also Sec. A.
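The core of CTC is a many-to-one map from per-frame network outputs to label sequences: training sums over all frame-level alignments that map to the target, and greedy decoding simply applies the map to the per-frame argmaxes. A toy sketch of the collapse rule (the blank symbol `-` is a notational choice here, not a fixed part of the method):

```python
def ctc_collapse(frame_labels, blank="-"):
    """CTC's many-to-one alignment map: merge consecutive repeated
    labels, then drop blanks. Many frame-level alignments collapse to
    the same transcription, which is what lets CTC train without
    frame-level segmentation."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)

# Different frame sequences, same transcription:
print(ctc_collapse(list("--hh-ee-ll-ll-oo-")))  # hello
print(ctc_collapse(list("hheelll-llo")))        # hello
```

The blank is what allows repeated output labels (the double "l" above): only a blank between two identical frames separates them into two emissions.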
Five years earlier, in 1995, we already had a similar, excellent neural probabilistic text model.^{[SNT]} Bengio^{[NPM]} characterizes it only briefly as "related" (see also Pollack's earlier work on embeddings of words and other structures^{[PO87][PO90]}). In the 2010s, the workhorse of natural language processing was actually the LSTM of our team,^{[LSTM0-6]} which Bloomberg called "arguably the most commercial AI achievement."^{[AV1][MIR](Sec. 4)} See Sec. B. The attention mechanism of Bengio's team^{[ATT14]} has indeed become important. For example, it helped to further improve Facebook's LSTM-based translation (see Sec. B). However, such attention mechanisms also have their roots in my lab's adaptive neural sequential attention (1990-93): end-to-end-differentiable "soft" attention in the latent space of Fast Weight Programmers (FWPs),^{[FWP2][FWP]} and "hard" attention (in observation space) in the context of RL^{[ATT][ATT0-1]} (1990). Today's attention-based Transformers,^{[TR1-6]} which have become a popular alternative to RNNs, build on principles of my FWPs of 1991.^{[FWP0-1]} My FWP of 1991^{[FWP0-1]} computes fast weight changes through additive outer products of activation patterns (now often called keys and values for self-attention).^{[TR1-6][FWP]} By the end of the 2010s,^{[DEC]} Transformers^{[TR1-2]} excelled at natural language processing, a traditional LSTM domain (see Sec. B), although there remain problems that LSTM can rapidly learn to solve quickly.^{[LSTM13,17]} The linear Transformers or Performers^{[TR5-6]} are formally equivalent to my 1991 FWPs (apart from normalization).^{[FWP6][FWP]} In 1993, I introduced the attention terminology^{[FWP2]} now used in this context,^{[ATT]} and RNNs that program themselves.
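The claimed formal equivalence is easy to check numerically for the unnormalized case: self-attention without softmax can be rewritten as a query applied to a fast weight matrix that accumulates additive outer products of values and keys. A small numpy sketch (toy dimensions and variable names are mine, not from the cited papers):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 4  # sequence length, feature dimension
keys    = rng.standard_normal((T, d))
values  = rng.standard_normal((T, d))
queries = rng.standard_normal((T, d))

# Attention view (causal, no softmax): out_t = sum_{s<=t} (q_t . k_s) v_s
attn_out = np.zeros((T, d))
for t in range(T):
    for s in range(t + 1):
        attn_out[t] += (queries[t] @ keys[s]) * values[s]

# Fast weight view: a slow net emits (key, value) pairs; the fast
# weight matrix W is programmed by additive outer products, then
# queried. This is a recurrence with O(d*d) state instead of a
# growing attention window.
W = np.zeros((d, d))
fwp_out = np.zeros((T, d))
for t in range(T):
    W += np.outer(values[t], keys[t])  # program the fast net
    fwp_out[t] = W @ queries[t]        # query the fast net
```

Both loops produce the same outputs, since W @ q_t = sum over s <= t of (q_t . k_s) v_s by linearity; softmax-normalized attention breaks this identity, which is the "apart from normalization" caveat above.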
See^{[MIR](Sec. 9)[R4]} for my related priority dispute on attention with Hinton. He was the reviewer of my 1990 paper^{[ATT2]} and later published related work of his own.^{[ATT3]}
GANs^{[GAN0-1]} (2010-2014) are actually a simple application^{[AC]} of the adversarial curiosity (AC) principle from 1990^{[AC90,90b][AC20]} (see also surveys^{[AC09-10]}). This principle is now widely used for exploration in RL (e.g., Sec. C) and for image synthesis^{[GAN1]} (also mentioned by ACM in Sec. XVIII). The predictor NN minimizes its error, while the generator NN tries to make outputs that maximize this error: one net's loss is the other net's gain. Four years before the GAN paper,^{[GAN1]} a well-known 2010 survey^{[AC10]} summarised the generative adversarial NNs of 1990 as follows: a predictor NN learns to predict whether the controller's (or generator's) output is in a given set.^{[AC20][AC]} (The early adversarial machine learning settings^{[S59][H90]} neither involved unsupervised NNs nor were about modeling data nor used gradient descent.^{[AC20]}) Bengio et al. neither cited the original work^{[AC90,90b][AC20]} nor corrected their erroneous claims^{[GAN1]} about my other adversarial method, Predictability Minimization or PM (1991).^{[PM1-2][AC20][R2][MIR](Sec. 5)} According to Bloomberg,^{[AV1]} I challenged their NIPS 2014 paper^{[GAN1]} and some of the erroneous claims it made about my prior work.^{[AC20]} Goodfellow eventually admitted that PM is adversarial (his paper^{[GAN1]} still claims the opposite), but emphasized that it's not generative. However, the even earlier AC^{[AC90,90b][AC10][AC20]} is both adversarial and generative (its generator contains probabilistic units^{[AC90]} like in StyleGANs^{[GAN2]}). When the authors^{[GAN1]} did not publish a correction, I published one myself in the hopes of correcting the annals of history.^{[AC20]} They have yet to acknowledge that GANs are instances of my earlier work.^{[R2][AC20]} Soon after Sepp Hochreiter identified and analyzed the vanishing gradient problem,^{[MIR](Sec. 3)[VAN1]} Bengio published his own analysis,^{[VAN2]} without citing Sepp.
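The zero-sum structure described above can be shown in a few lines. This is a deliberately scalar toy, not the 1990 architecture or the 2014 one: real AC/GAN setups use neural networks and data distributions, but the "one net's loss is the other net's gain" dynamic is the same:

```python
# Toy adversarial game: a "generator" with parameter theta emits an
# output; a "predictor" with parameter p tries to predict it. The
# squared prediction error is the shared stake: the predictor does
# gradient descent on it, the generator gradient ascent. Payoffs are
# -error for the predictor and +error for the generator: zero-sum.
def error(p, theta):
    return (p - theta) ** 2

p, theta, lr = 0.0, 1.0, 0.1

e0 = error(p, theta)
p -= lr * 2 * (p - theta)       # predictor step: chases theta, error shrinks
e1 = error(p, theta)
theta += lr * 2 * (theta - p)   # generator step: runs away, error grows again
e2 = error(p, theta)
```

The alternating steps never settle into a static optimum here; the predictor is forced to keep improving, which is exactly why the 1990 formulation used prediction error as an intrinsic curiosity reward for exploration.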
The dispute was settled in favor of Sepp.^{[VAN1]} However, even after a common publication,^{[VAN3]} Bengio published papers^{[VAN4][XAV]} without citing him. Citation counts are poor indicators of truly pioneering work.^{[NAT1]} (Margin note: whatever Bengio states^{[YB20]} about 2018, one must at least clarify such priority issues later.^{[DLC]}) Bengio also claims^{[YB20]} priority based on a 1995 paper, although my publications on exactly this topic date back to 1991-93.^{[UN0-2][UN]} The same holds for meta-learning, which I started in 1987,^{[META1][META]} long before Bengio, who claims that he did it before me.^{[R3]} Regarding attention-based Transformers,^{[TR1-6]} Bengio^{[DL3a]} cites his own team (2014) for "soft attention" without citing my much earlier original work of 1991-1993 on soft attention and linear Transformers.^{[FWP,FWP0-2,6]} Bengio has also heavily used our LSTM (see Sec. A-C), introducing the name "gated recurrent units (GRU)"^{[LSTMGRU]} for a variant of our vanilla LSTM architecture^{[LSTM2]} (2000) which he did not cite, although our work^{[LSTM2]} was the one that introduced gated recurrent units. In addition, our team automatically evolved lots of additional LSTM variants and topologies already in 2009^{[LSTM7]} without changing the name of the basic method. (GRUs can neither learn to count^{[LSTMGRU2]} nor learn simple non-regular languages;^{[LSTMGRU2]} they also underperform LSTM according to Google Brain.^{[LSTMGRU3]}) Hinton, in turn, championed unsupervised pre-training for deep NNs.^{[UN0-4][HIN](Sec. II)[MIR](Sec. 1)} Hinton's paper^{[UN4]} (2006) appeared long after my earlier work on this,^{[UN0-2]} which yielded the first NNs shown to solve very deep problems (see Sec. II above).^{[UN]} It was published in 1991-92,^{[UN1]} when compute was about 1000 times more expensive than in 2006. He did not cite it, not even in his later survey (2015).^{[DL3][DLC]} See also Sec. II & III. The same holds for compressing or distilling one NN into another.^{[UN0-2][DIST1-2][MIR](Sec.
2)} Hinton^{[DIST2]} (2006) did not cite my much earlier original work on this (1991),^{[UN1][UN]} not even in his later patent application. The same holds for fast weight programmers^{[FWP][FWP0-4a]} with tensor-like outer products (1991-2016) and their motivation^{[FWP2][FWP4a][MIR](Sec. 8)} (see also Sec. XVI above), and for learning sequential attention with NNs.^{[MIR](Sec. 9)} Hinton^{[ATT3]} (2010) did not cite our much earlier work on this,^{[ATT1][ATT]} although he was both reviewer and editor of my summary^{[ATT2]} (1990; see Sec. XVI above).
The ten priority disputes mentioned in the present Sec. XVII are not the only ones.^{[R4]} Remarkably, three of them are related to the 1991 paper^{[UN1][UN]} which in many ways started what people now call deep learning. Most of them go back to work of 1990-91.^{[MIR]} See Sec. I for additional related issues of credit assignment.
LeCun's team has made important contributions to CNNs since 1989.^{[CNN2,4]} However, the basic CNN architecture with convolutional and downsampling layers is actually due to Fukushima (1979).^{[CNN1]} NNs with convolutions were later (1987) combined by Waibel with weight sharing and backpropagation.^{[CNN1a]} Waibel called this architecture TDNN and applied it to speech. All of this happened before LeCun's work on CNNs. See Sec. D above and Sec. 21 of the overview of our Annus Mirabilis 1990-1991.^{[MIR]} At IJCNN 2011 in Silicon Valley, our DanNet^{[DAN][GPUCNN1-3]} won the vision contest with superhuman performance (the runner-up showed three times worse performance).^{[DAN1]} Again see Sec. D. Baldi and Chauvin (1993) had the first application of CNNs with backpropagation to biomedical/biometric images.^{[BA93]} At ICPR 2012, our DanNet^{[GPUCNN1-3]} won the medical imaging contest (Sept 2012, on detection of mitosis/cancer)^{[GPUCNN5,7,8]} (before the similar AlexNet won ImageNet 2012^{[GPUCNN5][R6]} and the similar VGG network^{[GPUCNN9]} won ImageNet 2014). Such CNNs are now widely used for mitosis detection;^{[MGC][GPUCNN5,7,8]} many major companies are using them. See Sec. D & VII. ACM also explicitly mentions speech recognition and speech synthesis.^{[AM16][DL1]} All of these fields were heavily shaped in the 2010s by our non-CNN methods.^{[DL1][DL4][AM16][GSR][GSR15][GT16][WU][FB17]} See Sec. A, B, VI, XI.
As mentioned in Sec. XII, backpropagation was actually proposed earlier as a learning method for NNs by Werbos (1982)^{[BP2-4]} (see also Amari's work on SGD for MLPs of 1967-68^{[GD1-2a]}), yet LeCun et al. did not cite this in their recent work.^{[DL3,DL3a][DLC]} In 1960, Kelley already had a precursor of the algorithm.^{[BPA]} Furthermore, many besides LeCun have worked "to speed up backpropagation algorithms"^{[DL1]} (ACM's wording). More on the history of backpropagation can be found at Scholarpedia^{[DL2]} and in my overview.^{[BP4]}
However, "hierarchical feature representation" in deep learning networks is what Ivakhnenko & Lapa (1965)^{[DEEP1-2]} and Amari^{[GD1-2]} (and also Fukushima^{[CNN1][DL2]}) had long before LeCun. See Sec. D & II & XIII & V.
LeCun et al. neither cited the origins^{[BP1]} (1970) of this widely used type of automatic differentiation for differentiable networks of modules^{[DL2][BP4-5][DLC]} nor the earlier formal work on such systems.^{[S80]} See also Sec. XIX & XII. Others had such systems before LeCun, who did not cite them. See also Pollack's even earlier relevant work;^{[PO87-90]} compare the important work of Baldi and colleagues.^{[BA96-03]}
(Furthermore, "complex networks of modules where backpropagation is performed" were the central theme of my much earlier habilitation thesis (1993).^{[UN2]} For example, our adaptive subgoal generators (1991)^{[HRL0-2]} were trained through end-to-end-differentiable chains of such modules,^{[MIR](Sec. 10)} as were our systems for planning and reinforcement learning with recurrent neural world models (1990).^{[PLAN][MIR](Sec. 11)} Same for my linear transformer-like fast weight programmers^{[FWP0-2][FWP][ATT][MIR](Sec. 8)} since 1991; see Sec. XVI.) Regarding attempts to discredit the messenger rather than the message, see "100 Authors against Einstein."^{[AH1]} Some resort to ad hominem attacks:^{[AH2-3][HIN]} "If you cannot dispute a fact-based message, attack the messenger himself."^{[HIN]} Science has a well-established way of dealing with plagiarism (which may be unintentional^{[PLAG1][CONN21]} or not^{[FAKE2]}), and no award can ever change that.^{[HIN]} LBH and their co-workers have contributed useful improvements of deep learning methods,^{[CNN2,4][CDI][LAN][RMSP][XAV][ATT14][CAPS]} but built on the work of others whom they did not cite, in contrast to ACM's Code of Ethics and Professional Conduct^{[ACM18]} (see (1) Sec. II, V, XII, XIX, XXI, XIII, XIV, XI, XX, and (2) Sec. I, A, B, C, D, XVII, VI, XVI). As emphasized earlier,^{[DLC][HIN]} science remains committed "to self-correction,"^{[SV20]} as is already the standard in other scientific fields. Why announce breakthroughs in popular science venues without peer review? For example, the narrator of a popular 2018 Bloomberg video^{[VID2]} overlooked that the crucial speech recognition methods were developed in Germany and Switzerland (LSTM & CTC; see Sec. A) long before Hinton's methods. Similarly, in 2016, the NY Times published a misleading article,^{[NYT3]} although Google's original 2016 paper on Google Translate^{[WU]} mentions LSTM over 50 times (see Sec. B). In ad hominem style,^{[AH2-3]} Hinton accused me of "claiming credit he doesn't deserve for many, many things",^{[NYT1]} without backing this up by facts. LeCun also praised the GANs of Bengio's team,^{[GAN1]} although GANs are variations of my work of 1990.^{[AC90,90b][AC20][R2]} According to Bloomberg,^{[AV2]} Bengio has simply "denied my claims" without backing up his denial by any facts; see Sec. XVII.
and forcefully contradict public figures who promote it."^{[FAKE]} LBH called themselves the deep learning conspiracy.^{[DLC][DLC1-2]} Our LSTM paper^{[LSTM1]} has received more citations than any paper by Bengio or LeCun,^{[R5]} and Hinton's most cited paper (2012) is the one on GPU-based CNNs.^{[GPUCNN4][R5]} It follows our earlier work on supervised deep NNs (2010),^{[MLP1]} which showed that unsupervised pre-training for deep NNs (introduced by myself^{[UN][UN0-3]} and later championed by Hinton^{[UN4][VID1]}) is not necessary; see Sec. D. Hinton's 2012 paper^{[GPUCNN4]} came after our deep and fast DanNet (2011),^{[GPUCNN1-3]} which had already won contests before AlexNet won one;^{[R6]} see Sec. D, XIV. The highly cited VGG network (2014)^{[GPUCNN9]} followed a similar approach. Hinton's 2nd most cited paper^{[RUM][R5]} is the one on backpropagation (some count citations of Hinton's paper^{[RUM]} together with citations for a book by Rumelhart & McClelland^{[R5]}). Backpropagation, however, is a previously invented method,^{[BP1]} and deep learning MLPs go back to Ivakhnenko, whom he has never cited;^{[DEEP1-2][R7-R8]} see Sec. II, XIII. Bengio's 2nd most cited research paper is the one on GANs (2014),^{[GAN1]} which are instances of my artificial curiosity (1990)^{[AC90,90b][AC20][R2]} which he did not cite; see Sec. XVII. Hinton's highly cited papers on unsupervised pre-training for deep NNs (2006-)^{[UN4]} were preceded by ours,^{[UN0-2][UN]} and his dropout papers were preceded by Hanson's.^{[Drop1-3]} As recently as 2021, ACM published yet another misleading deep learning "survey" by LBH,^{[DL3a]} again heavily citing LBH without citing the essential prior work. Consult the Executive Summary and Sec. I-XXI of this critique for more. So virtually all the algorithms that have attracted great attention in the deep learning era have their conceptual and technical roots in my labs in Munich and Lugano,^{[MOST]} or in the earlier work on deep learning MLPs since 1965^{[DEEP1-2][GD1-2a]} (see Sec. II, XX), backpropagation (1960-70)^{[BPA][BP1]} (see Sec. XIX, XII), and convolutional NNs since 1979^{[CNN1-4]} (see Sec. XVIII, D). Our LSTM (1990s, see Sec. A, B; also for RL, 2003-, see Sec. C) → our Highway Net (May 2015) → ResNet (Dec 2015, see Sec. D).
Our adversarial Artificial Curiosity (1990) → GANs (2010s, see Sec. XVII). Our unsupervised pre-training of deep NNs (1991, see Sec. II & III): for recurrent NNs in the 1990s → our LSTM (see Sec. A-C); for feedforward NNs in 2010 → our DanNet (2011) → AlexNet (2012) and VGG Net (2014) (see Sec. D). Our LSTM brought essentially unlimited depth to gradient-based supervised recurrent NNs in the 1990s; our Highway Nets^{[HW1-3]} brought it to feedforward NNs in May 2015.^{[MOST]} DanNet led to superior computer vision (2011, see Sec. D, XVIII), medical diagnosis (2012, see Sec. VII, XVIII), and many other applications.^{[DEC]} LSTM led to superior speech recognition (with our CTC, 2007-15, see Sec. A), machine translation (2016, see Sec. B), robotics & video game players (2018-19, see Sec. C), and many other applications.^{[DEC]} Our Fast Weight Programmers (1991, see Sec. XVI) are formally equivalent to linear Transformers (now popular in NLP). See Sec. I, A, B, C, D, VII, XVIII.
As mentioned earlier,^{[MIR](Sec. 21)} it is not always clear^{[DLC]} who deserves credit for deep learning. Ivakhnenko & Lapa (1965) had the first deep networks with many layers of depth that really learned.^{[DEEP1-2][R8]} Soon afterwards, multilayer perceptrons learned internal representations through stochastic gradient descent in Japan.^{[GD1-2a]} A few years later, modern backpropagation was published (1970).^{[BP1]} Plagiarism may be unintentional^{[PLAG1][CONN21]} or intentional.^{[FAKE2]}
Yes, this critique is also an implicit critique of certain other awards to LBH.^{[HIN]} Many of the issues above were also discussed in threads at reddit.com/r/MachineLearning^{[R1-R12]} (the largest machine learning forum with back then over 800k subscribers), many of them influenced by my overview.^{[MIR]}
Dr. LeCun himself is well aware of the challenges to scientific integrity in our field:^{[LECP]} "... else cites."^{[LECP]} Rosenblatt's perceptrons already had hidden layers with fixed random weights and an adaptive output layer.^{[R62]} So Rosenblatt basically had what much later was rebranded as Extreme Learning Machines (ELMs);^{[ELM1]} compare the revisionist narrative of the ELM promoters^{[ELM2][CONN21]} with that of the self-proclaimed "deep learning conspiracy".^{[DLC1-2]}
Note that I am insisting on proper credit assignment not only in my own research field but also in quite disconnected areas,^{[HIN]} as demonstrated by my numerous letters in this regard published in Science and Nature, e.g., on the history of aviation,^{[NASC1-2]} the telephone,^{[NASC3]} the computer,^{[NASC4-7]} resilient robots,^{[NASC8]} and scientists of the 19th century.^{[NASC9]} AI scientists and AI historians equipped with artificial curiosity^{[SA17][AC90-AC20][PP-PP2][R1]}
Thanks to all who provided feedback. Many additional references can be found at my publication page and my arXiv page. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. The first paper on planning with reinforcement learning recurrent neural networks (NNs) and on generative adversarial networks. (More on artificial scientists and artists.)
With a brief summary of the generative adversarial neural networks of 1990.^{[AC90,90b][AC20]} Preprint arXiv/1906.04493.
ACM Code of Ethics and Professional Conduct. Association for Computing Machinery (ACM), 2018.
[AIB] J. Schmidhuber. AI Blog. Includes variants of chapters of the AI Book.
Blog of Werner Vogels, CTO of Amazon (Nov 2016).
First publication of what was later sometimes called the Hopfield network^{[AMH2]} or Amari-Hopfield Network:^{[AMH3]} published in 1972 by Amari.^{[AMH1]}
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. We had both hard attention (1990) and soft attention (1991-93).^{[FWP]} Today, both types are very popular.
arXiv/1409.0473, 2014-16.
Bloomberg, May 15, 2018. Bloomberg, May 17, 2018.
Precursor of modern backpropagation.^{[BP1-4]}
First application of backpropagation^{[BP1]} to NNs (concretizing thoughts in his 1974 thesis).
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020). Who invented backpropagation? More.^{[DL2]}
English version: [CNN1+]. More in Scholarpedia.
[CNN1a] A. Waibel. Phoneme Recognition Using Time-Delay Neural Networks. Meeting of IEICE, Tokyo, Japan, 1987. First application of backpropagation^{[BP1][BP2]} and weight-sharing.
Spatial Averaging.^{[CNN1]}
Since November 2021: Comments on version 1 of the present report^{[T21v1]} in the Connectionists Mailing List, perhaps the oldest mailing list on artificial neural networks.
Beijing, 2014. Preprint arXiv:1402.3511 [cs.NE].
J. Schmidhuber (AI Blog, 2021). 10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named 1st superhuman result in 2011.^{[DAN1]}
J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition, by our artificial neural network called DanNet.
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, revised 2021). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s.
[DIST1] J. Schmidhuber, 1991.^{[UN-UN2]}
[DL3a] Y. Bengio, Y. LeCun, G. Hinton (2021). Turing Lecture: Deep Learning for AI. Communications of the ACM, July 2021.
[DL4] J. Schmidhuber (AI Blog, 2017). Our impact on the world's most valuable public companies: Apple, Google, Microsoft, Facebook, Amazon... By greatly improved (CTC-based) on-device LSTM speech recognition (on the phone, not the server).
J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.^{[DL6]} Soon after its publication, everybody started talking about "deep learning." Causality or correlation?
Web site deeplearning.net of Y. Bengio's MILA (2015, retrieved May 2020; compare the version in the Internet Archive), referring to Hinton's^{[UN4]} and Bengio's^{[UN5]} unsupervised pre-training for deep NNs (2006), although this type of deep learning dates back to 1991.^{[UN1-2][UN]} See Sec. II & XVII & III.
[DLC] J. Schmidhuber (AI Blog, June 2015).
Critique of Paper by self-proclaimed^{[DLC1-2]} "Deep Learning Conspiracy" (Nature 521 p 436).
arxiv:1312.5602.
Alphastar has a "deep LSTM core." arXiv:1808.03578, 2018.
In fact, the ELM concept goes back to Rosenblatt's work around 1960.^{[R62]}
Facebook used LSTM for over 4 billion automatic translations per day (The Verge, August 4, 2017); Facebook blog by J.M. Pino, A. Sidorov, N.F. Ayan (August 3, 2017).
J. Schmidhuber (AI Blog, 26 March 2021). Fast weight programmers: an alternative^{[FWP0-1]} to recurrent NNs, based on the fast weights^{[FAST,FASTa]} of another net. Such Fast Weight Programmers^{[FWP0-6,FWPMETA1-7]} can learn to memorize past data, e.g., by computing fast weight changes through additive outer products of self-invented activation patterns^{[FWP0-1]} (now often called keys and values for self-attention^{[TR1-6]}). The similar Transformers^{[TR1-2]} combine this with projections and softmax, while linear Transformers or Performers^{[TR5-6]} do without the softmax. In 1993, I introduced the attention terminology^{[FWP2]} now used in this context,^{[ATT]} and RNNs that program themselves.