LBH and their co-workers have contributed certain useful improvements of existing deep learning methods.^{[CNN2,4][CDI][LAN][RMSP][XAV][ATT14][CAPS]} However, the field's essential foundations were laid by others: deep learning multilayer perceptrons (1965),^{[DEEP1-2][R8]} modern backpropagation (1970),^{[BP1-2][R7]} architectures of recurrent NNs (1943-56)^{[MC43][K56]} and convolutional NNs (1979),^{[CNN1]} principles of generative adversarial NNs and artificial curiosity (1990),^{[AC90,90b][AC20]} unsupervised pre-training for deep NNs (1991),^{[UN1-2]} vanishing gradients (1991)^{[VAN1]} & Long Short-Term Memory or LSTM (Sec. A), GPU-accelerated NNs (2004),^{[GPUNN][DAN][DAN1][GPUCNN5]} NNs with over 100 layers (2015),^{[HW1-3][R5]} and transformer-like^{[TR1-6][FWP]} attention^{[FWP][ATT]} through fast weight programmers (1991).^{[FWP0-2,6]}^{[DL1-2][R2-R8]} Often LBH failed to cite essential prior work, even in their later surveys.^{[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5, R7-R8]} This may explain some of ACM's misattributions.^{[T19]} See Sec. II & III & V & XIII & X & XVII & XII & XVIII & XX.

By the 2010s,^{[DEC]} deep NNs were heavily used in both academia and industry.^{[DL4]} The main applications mentioned by ACM are labeled as A, B, C, D below.

A. Long Short-Term Memory or LSTM (1990s-2005)^{[LSTM0-6]} overcame the vanishing gradient problem, identified and analyzed in 1991 by my student Sepp Hochreiter.^{[VAN1]} This happened long before the similar work of Bengio (see Sec. XVII).^{[MIR](Sec. 3, Sec. 4)} LSTM was refined with my student Felix Gers^{[LSTM2]} through "forget gates" based on end-to-end-differentiable fast weights.^{[MIR](Sec. 8)[FWP,FWP0-1]} (A2) Connectionist Temporal Classification (CTC) was developed by my student Alex Graves et al. (2006).^{[CTC]} Our team successfully applied CTC-trained LSTM to speech in 2007^{[LSTM4]} (also with hierarchical LSTM stacks^{[LSTM14]}). This end-to-end approach was very different from, and superior to, hybrid methods combining NNs and hidden Markov models (HMMs)^{[BW][BRI][BOU]} (Sec. XV). Hinton et al. (2012) still used the old hybrid approach^{[HYB12]} and did not compare it to CTC-LSTM. CTC-trained LSTM became the first recurrent NN (RNN) to win international competitions.
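To make the mechanism concrete, here is a minimal single-step sketch of a vanilla LSTM cell with a forget gate, in plain Python. The gate names and scalar weights are illustrative assumptions for exposition, not code from any of the cited papers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, W):
    # One step of a 1-unit LSTM cell with a forget gate (illustrative).
    # W maps gate name -> (w_x, w_h, bias); all scalars for clarity.
    def gate(name, squash):
        wx, wh, b = W[name]
        return squash(wx * x + wh * h + b)
    i = gate("in", sigmoid)      # input gate
    f = gate("forget", sigmoid)  # forget gate
    o = gate("out", sigmoid)     # output gate
    g = gate("cell", math.tanh)  # candidate cell input
    c = f * c + i * g            # additive cell update
    h = o * math.tanh(c)         # new hidden state
    return h, c
```

The additive update `c = f * c + i * g` is what lets error signals flow across many time steps without vanishing, as long as the forget gate stays near 1.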
He later reused our end-to-end neural speech recognizer^{[LSTM4][LSTM14]} as a postdoc in Hinton's lab.^{[LSTM8]} CTC-LSTM dramatically improved Google's speech recognition.^{[GSR][GSR15][DL4]} By 2019, Google's on-device speech recognition^{[GSR19]} (no longer on the server) was based on LSTM^{[MIR](Sec. 4)} (see Sec. VI & XI & XV).

B. As early as 1995, we had an excellent neural probabilistic model of text^{[SNT]} (see Sec. XVI). In 2001, we showed that LSTM can learn languages unlearnable by traditional models such as HMMs.^{[LSTM13]} See also Sec. VI & XI & XV. Modern machine translation also profits from attention mechanisms tailored by Bengio's team.^{[ATT14][FWP]} However, such attention mechanisms have their roots in my lab (1991);^{[FWP][FWP0-2,6]} see Sec. XVI.

C. Robotics & RL etc. Since 2003, our team has used LSTM for Reinforcement Learning (RL) and robotics.^{[LSTM-RL][RPG][LSTMPG]} In the 2010s, this combination achieved remarkable results. For example, in 2018, an LSTM trained by policy gradients (PG) was the core of OpenAI's famous Dactyl which learned to control a dexterous robot hand without a teacher.^{[OAI1][OAI1a]} In 2019, DeepMind beat a pro player in the game of Starcraft, which is theoretically harder than Chess or Go^{[DM2]} in many ways, using Alphastar whose brain has a deep LSTM core trained by PG.^{[DM3]} A PG-trained LSTM was also the core of OpenAI Five which learned to defeat human experts in the Dota 2 video game (2018).^{[OAI2]} Bill Gates called this a "huge milestone in advancing artificial intelligence".^{[OAI2a][MIR](Sec. 4)[LSTMPG]}

Apart from A, B, C above, LSTM is used in healthcare, chemistry, molecular design, lip reading, speech synthesis,^{[AM16]} predicting what's going on in nuclear fusion reactors, and so on.^{[DEC][DL4]} A significant fraction of the inference compute in Google's datacenters was being used for LSTM (only 5% for the CNNs of Sec. D).^{[JOU17]} Apparently the first LSTM journal paper^{[LSTM1][R5]} is now among the most frequently cited NN papers.

D. Computer Vision was revolutionized in the 2010s by a particular feedforward NN called the convolutional NN (CNN).^{[CNN1-4]} The basic CNN architecture with convolutional and downsampling layers is due to Fukushima (1979).^{[CNN1]} The popular downsampling variant called max-pooling was introduced by Weng et al.
(1993).^{[CNN3]} In 1987, NNs with convolutions were combined by Waibel with weight sharing and backpropagation.^{[CNN1a]} Waibel did not call this CNNs but TDNNs. LeCun's team later contributed improvements of CNNs, especially for images^{[CNN2,4]} (see Sec. XVIII). Finally, my own team showed in 2010^{[MLP1]} that unsupervised pre-training is not necessary to train deep NNs, contrary to claims by Hinton^{[VID1]} who said that "nobody in their right mind would ever suggest" this. Then our fast GPU-based CNN of 2011^{[GPUCNN1]} known as DanNet^{[DAN,DAN1][R6]} went far beyond the earlier GPU-accelerated CNNs of 2006.^{[GPUCNN]} DanNet entered important computer vision competitions, winning four of them in a row (15 May 2011, 6 Aug 2011, 1 Mar 2012, 10 Sep 2012).^{[GPUCNN5]} At IJCNN 2011 in Silicon Valley, DanNet blew away the competition and achieved the first superhuman visual pattern recognition^{[DAN1]} in an international contest (where LeCun's team took a distant second place, with three times worse performance). DanNet was also the first deep CNN to win a Chinese handwriting contest (ICDAR 2011) and an image segmentation contest (ISBI, May 2012). Our CVPR paper on DanNet^{[GPUCNN3]} appeared before the similar AlexNet of Hinton's student Krizhevsky won the ImageNet^{[IM09]} 2012 contest^{[GPUCNN4-5][R6]} (now also without unsupervised pre-training, citing DanNet). Our CNN image scanners were 1000 times faster than previous methods.^{[SCAN]} The VGG network (ImageNet 2014 winner)^{[GPUCNN9]} and other highly cited CNNs^{[RCNN1-3]} further extended the work of 2011.^{[MIR](Sec. 19)} ResNet, the ImageNet 2015 winner^{[HW2]} (Dec 2015), which currently gets more citations per year than any other NN,^{[MOST]} is a version of our earlier Highway Net (May 2015).^{[HW1-3][R5]} The Highway Net is actually the feedforward net version of vanilla LSTM.^{[LSTM2]} It was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers). See also Sec. XVIII & XIV & XI & VI.
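The relation between gated layers and residual paths can be illustrated with a toy Highway layer in plain Python (a hypothetical minimal sketch; the parameter names `Wh`, `Wt`, `bt` are made up for exposition): the layer mixes a nonlinear transform H(x) with the unchanged input x via a sigmoid transform gate T(x).

```python
import math

def highway_layer(x, Wh, Wt, bt):
    # One Highway layer: y = T(x)*H(x) + (1 - T(x))*x, elementwise.
    # H: tanh transform; T: sigmoid "transform gate" with shared bias bt.
    # A strongly negative bt closes the gates, making the layer nearly
    # the identity, so error signals can pass through very many layers.
    n = len(x)
    H = [math.tanh(sum(Wh[i][j] * x[j] for j in range(n))) for i in range(n)]
    T = [1.0 / (1.0 + math.exp(-(sum(Wt[i][j] * x[j] for j in range(n)) + bt)))
         for i in range(n)]
    return [T[i] * H[i] + (1.0 - T[i]) * x[i] for i in range(n)]
```

With the gate saturated open, the transform path dominates; with it closed, the input is carried through unchanged, which is the property that makes nets with hundreds of such layers trainable.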
In fact, deep learning architectures appeared long before the 1980s. Basic NN architectures were proposed already in the 1940s/50s^{[MC43][K56]} (but don't forget prior work in physics since the 1920s^{[L20][I25][K41][W45]}). A deep convolutional NN architecture was proposed in the 1970s.^{[CNN1]} NNs without hidden layers learned in 1958^{[R58]} (such shallow learning goes back to regression and the method of least squares^{[DL1-2]}). There was also early thinking about deeper adaptive NNs.^{[R61,R62]} In 1965, Ivakhnenko & Lapa published the first general, working learning algorithms for deep multilayer perceptrons with arbitrarily many layers (already containing the now popular multiplicative gates).^{[DEEP1-2][DL1-2]} A paper of 1971^{[DEEP2]} already described a deep net with 8 layers, trained by their highly cited method which was still popular in the new millennium,^{[DL2]} especially in Eastern Europe, where much of Machine Learning was born. Ivakhnenko did not call it an NN, but that's what it was.^{[MIR](Sec. 1)[R8]} LBH failed to cite this. See Sec. XIII & III & V & VIII & IX & X. A misleading "history of deep learning" is propagated by LBH & co-authors, e.g., Sejnowski^{[S20]} (see Sec. XIII). It goes more or less like this: "In 1969, Minsky & Papert^{[M69]} [...] researchers took a fresh look at the problem in the 1980s."^{[S20]} However, as mentioned above, the 1969 book^{[M69]} addressed a "problem" of Gauss & Legendre's shallow learning (~1800)^{[DL1-2]} that had already been solved 4 years prior by Ivakhnenko & Lapa's popular deep learning method.^{[DEEP1-2][DL2]} Minsky was apparently unaware of this and failed to correct it later.^{[HIN](Sec. I)} Deep learning research continued in the 1980s (see a 1989 paper^{[MOZ]}). However, it became really deep in 1991 in my lab,^{[UN-UN3]} which drove the shift from unsupervised pre-training to purely supervised learning twice (1991-95; 2006-10).^{[HIN](Sec. II)[MIR](Sec. 19)} See Sec. 1 of the overview:^{[MIR]} First Very Deep NNs, Based on Unsupervised Pre-Training (1991). This made it possible to solve "Very Deep Learning" tasks of depth > 1000.^{[UN2][DL1][UN]} (By 2003, LSTM variants successfully dealt with language problems of depth up to 30,000^{[LSTM17]} and more.) See also Sec. III. Note that LSTMs brought essentially unlimited depth to supervised recurrent NNs; Highway Nets^{[HW1-3]} brought it to feedforward NNs.^{[MOST]}
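The "Very Deep Learning" problem can be illustrated numerically: in a long chain of units, the backpropagated error is a product of per-step factors, and factors below 1 shrink it exponentially with depth. A toy sketch (the factor 0.9 is an arbitrary illustrative value; for sigmoid units the derivative alone is at most 0.25):

```python
def grad_through_depth(depth, factor=0.9):
    # Backpropagated error through a chain of `depth` units, where each
    # step multiplies the error by a constant local factor |w * f'(z)|.
    # With factors below 1 the gradient vanishes exponentially in depth.
    g = 1.0
    for _ in range(depth):
        g *= factor
    return g
```

At depth 10 the signal is still usable, but at depth 1000 it is numerically negligible, which is why tasks of depth > 1000 required either unsupervised pre-training or architectures like LSTM and Highway Nets that keep the factor near 1.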
The essential foundations of deep learning, however, were laid by others (Sec. III):^{[DLC][DEEP1-2][BP1][DL1-2][R7-R8][R2-R4]} deep learning multilayer perceptrons (1965),^{[DEEP1-2][R8]} modern backpropagation (1970),^{[BP1,2][R7]} architectures of recurrent NNs (1943-56)^{[MC43][K56]} and convolutional NNs (1979),^{[CNN1]} principles of generative adversarial NNs and artificial curiosity (1990),^{[AC90,90b][AC20]} unsupervised pre-training for deep NNs,^{[UN1-2]} the vanishing gradient problem (1991)^{[VAN1]} & solutions to it (Sec. A), GPU-accelerated NNs (2004),^{[GPUNN][GPUCNN5]} and other foundations.^{[DL1-2][R2-R8]} Often LBH failed to cite essential prior work.^{[DLC][HIN][MIR](Sec. 21)} See Sec. II & V & XIII & IX & X & XVII & XII & XVIII & XX & I. Consider also deeplearning.net, which until 2019 advertised deep learning as "moving beyond shallow machine learning since 2006",^{[DL7]} referring to Hinton's^{[UN4]} and Bengio's^{[UN5]} unsupervised pre-training methods of 2006. However, we had this type of deep learning already in 1991;^{[UN][UN1-2]} see Sec. II & XVII (5). Not to mention Ivakhnenko's even earlier supervised layer-wise training of deep NNs,^{[DEEP1-2]} which Hinton,^{[UN4]} Bengio,^{[UN5]} and LBH^{[DL3,DL3a]} did not cite either. See Sec. X.
My comments below systematically track the sequential order of ACM's claims.^{[T19]}
ACM's statement on Turing is greatly misleading, like some of its other statements.^{[T19]} It was Gödel who, in 1931, laid the foundations of theoretical computer science and of any type of computation-based AI.^{[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]} Much of early AI in the 1940s-70s was actually about theorem proving.^{[ZU48][NS56]}
In 1936, Turing introduced the Turing Machine.^{[TUR]} He rederived the above-mentioned result.^{[CHU][TUR][HIN][GOD21,21a][TUR21][LEI21,21a]} In the same year of 1936, Emil Post published yet another independent universal model of computing.^{[POS]} (See also my reply to Hinton, who criticized my website on Turing without suggesting any fact-based corrections.^{[HIN]}) Gödel also essentially formulated the famous open problem "P=NP?" in his letter to John von Neumann (1956).^{[GOD56][URQ10]} Likewise, Konrad Zuse (1910-1995) created the world's first working programmable general-purpose computer 1935-41. His patent application of 1936^{[ZU36-38][Z36][RO98][ZUS21]} already described digital circuits, predating Claude Shannon's 1937 thesis on digital circuit design.^{[SHA37]} Zuse also created the first high-level programming language in the early 1940s.^{[BAU][KNU]} (His hardware lacked an explicit conditional jump instruction, yet can be shown to be universal in principle.^{[RO98]})
The foundations of these famous methods were laid by others: multilayer perceptrons that learn internal representations (1965),^{[DEEP1-2][R8]} modern backpropagation (1970),^{[BP1,2][R7]} architectures of recurrent NNs (1943-56)^{[MC43][K56]} and convolutional NNs (1979),^{[CNN1]} principles of generative adversarial NNs and artificial curiosity (1990),^{[AC][AC90,90b][AC10][AC20]} unsupervised pre-training for deep NNs (1991),^{[UN1-2][UN]} vanishing gradients (1991)^{[VAN1]} & solutions to it (Sec. A),^{[LSTM0-17][CTC]} GPU-accelerated NNs (2004),^{[GPUNN][GPUCNN5]} record-breaking deep supervised NNs (2010)^{[MLP1-2]} and contest-winning deep CNNs (2011),^{[DAN][DAN1][GPUCNN5]} NNs with over 100 layers (2015),^{[HW1-3][R5]} transformer-like^{[TR1-6][FWP]} attention^{[FWP][ATT]} through fast weight programmers (1991),^{[FWP0-2,6]} and more.^{[DL1-2][R2-R8]} Often LBH failed to cite essential prior work.^{[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5,R7,R8,R11]} See Sec. II & I & III & XIII & X & XVII & XII & XVIII & XX.
ACM mentions "advances in natural language processing" and in speech. These advances, however, were largely based on our LSTM (Sec. A-B) and on the fast deep supervised NNs and CNNs achieved by our group 2010-2011^{[MLP1-2][DAN][DAN1][GPUCNN5][R6]} and through Highway Net-like NNs (2015),^{[HW1-3][R5]} although the principles of CNNs were invented and developed by others since the 1970s.^{[CNN1-4]} See Sec. D & XVIII & XIV as well as Sec. 4 & Sec. 19 of the overview.^{[MIR]}
DanNet^{[DAN][DAN1][GPUCNN5]} was the first NN to win a medical imaging contest through deep learning (Sept 2012, on cancer detection).^{[GPUCNN5,8]} We also used it to greatly improve steel defect detection.^{[ST]} All of this happened before the similar GPU-accelerated AlexNet of Hinton's student Krizhevsky won ImageNet 2012.^{[GPUCNN5][R6]} Our work on mitosis detection^{[MGC][GPUCNN5,8]} built on the approach of Sec. D & XI.
LBH, however, used such prior methods without citing them.^{[DL1][DLC][HIN][R2-R4][R7-R8]} See Sec. V & XII & XIX & II & III & XIII & XVII & X & I.
These foundations were laid by others, then built upon by LBH, who failed to cite them, even in later work.^{[HIN][DLC][DL1-2][DEEP1-2][CMB][R7-R8]} See Sec. II & III & XIII & V & X & XIV & I.
The term "deep learning" was first introduced to Machine Learning by Dechter (1986), and to NNs by Aizenberg et al. (2000).^{[DL2]} To my knowledge, LBH have never cited them. (Margin note: our 2005 paper on deep RL^{[DL6,6a]} was apparently the first machine learning publication with the word combination "learn deep" in its title.) Only later did LBH start talking about "deep learning ... moving beyond shallow machine learning since 2006",^{[DL7]} referring to their unsupervised pre-training methods of 2006. See Sec. III. Others had built careers on deep learning long before LBH embraced the notion.^{[DEEP1-2][CNN1][HIN][R8][DL1][DLC]} Even deep learning through unsupervised pre-training was introduced by others.^{[UN1-3][R4][HIN](Sec. II)} See Sec. II & III & XIII & V & I.
Much of this prior work was ignored by LBH's papers^{[HIN][R7-R8][R2-R5]} (see Sec. V & II & III & I & XIII & XII & XIX & X & XVII).
ACM correctly mentions advancements through GPUs. The first to use GPUs for NNs were Jung & Oh (2004).^{[GPUNN][GPUCNN5]} We later made GPU-based NNs fast and deep enough to set an important benchmark record,^{[MLP1-2]} showing that unsupervised pre-training (pioneered by myself in 1991) is not necessary to train deep NNs, contrary to Hinton's claims.^{[VID1]} By 2011, our CNNs were deep and fast enough^{[DAN][DAN1][GPUCNN5]} to achieve superhuman computer vision (explicitly mentioned by ACM) for the first time^{[R6]} (see Sec. D).
Furthermore, by the mid 2010s, speech recognition and machine translation (explicitly mentioned by ACM) were actually dominated by the LSTM and CTC of our team.^{[LSTM1-4][CTC]} In particular, as mentioned in Sec. A, this end-to-end approach was superior to hybrid methods based on models such as HMMs.^{[BW][BOU][BRI][HYB12]} As mentioned in Sec. B and XVI, the first superior end-to-end neural machine translation was also based on LSTM.
ACM's statement is "less wrong" than Honda's^{[HIN](Sec. I)} but still misleading: ACM (and apparently even other award committees^{[HIN](Sec. I)}) credits backpropagation to Rumelhart et al. (1985-86),^{[RUM]} although Werbos applied it to NNs earlier (1982).^{[BP2]} And the article^{[RUM]} even failed to mention Linnainmaa, the inventor of this famous algorithm for credit assignment in networks (1970).^{[BP1]} By 1960, Kelley already had a precursor thereof in the field of control theory;^{[BPA]} see also later work of the early 1960s.^{[BPB][BPC]}^{[R7]} Rumelhart et al. demonstrated experimentally that backpropagation can yield useful internal representations in hidden layers of NNs.^{[RUM]} But this was essentially just an experimental analysis of a known method.^{[BP1-2]} A more complete history of backpropagation can be found at Scholarpedia^{[DL2]} and in my award-winning survey.^{[DL1]} Also see Sec. XIX, II.
Some claim that "backpropagation is just the chain rule of Leibniz (1676) & L'Hopital (1696)." No, it is the efficient way of applying the chain rule to big networks with differentiable nodes (there are also many inefficient ways of doing this). It was not published until 1970.^{[BP1]} See also the recent debate:^{[HIN]} It is true that in 2018, Hinton^{[AOI]} credited Rumelhart^{[RUM]} with the "invention" of backpropagation, yet he himself accepted credit for "creating" the method and for other things he didn't do.^{[HIN]} Neither in a popular book^{[AOI]} nor in other recent work^{[DL3,DL3a]} did he cite Linnainmaa (1970),^{[BP1]} the true creator.^{[BP4-5]} It is true that his 2015 survey^{[DL3]} does cite Werbos (1974), who however described the method correctly only later in 1982,^{[BP2]} and it also failed to cite Linnainmaa^{[BP1]} (compare Amari's work of 1977^{[BP6]}). Linnainmaa's method was well-known.^{[BP5][DL1-2][DLC]} It wasn't created by "lots of different people" as Hinton suggested,^{[AOI][HIN][R11]} but by one person who published first^{[BP1]} and therefore should get the credit.
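The distinction matters in practice: reverse-mode backpropagation computes all partial derivatives of one output in a single backward sweep, instead of re-applying the chain rule once per parameter. A minimal illustrative sketch in Python (the `Var` class is a made-up teaching device, not Linnainmaa's 1970 formulation):

```python
class Var:
    # Minimal reverse-mode automatic differentiation node (illustrative).
    # Each node remembers its parents and the local derivative towards each.
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # list of (parent_node, local_derivative)
        self.grad = 0.0

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def backward(self, seed=1.0):
        # Propagate d(output)/d(this node) backwards via the chain rule,
        # accumulating contributions from every path through the graph.
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)
```

For example, for y = x*x + x at x = 3, one backward pass accumulates dy/dx = 2x + 1 = 7, with no per-parameter re-evaluation of the network.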
ACM mentions Hinton's Boltzmann Machine (BM),^{[BM]} an approach to unsupervised NN learning.^{[HIN]} Recently, however, I learnt through a reader that even the BM paper^{[BM]} did not cite prior relevant work by Sherrington & Kirkpatrick^{[SK75]} and Glauber.^{[G63]} (Compare related work.^{[H86][H88][S93]}) Long before that, Ivakhnenko & Lapa had working learning methods for multilayer perceptrons with arbitrarily many layers.^{[DEEP1-2][HIN]} See Sec. II & V & X.^{[MIR](Sec. 1)[R8]}
As mentioned in Sec. II, Sejnowski's rather self-serving "history of deep learning" [S20] claims: "In 1969, Minsky & Papert^{[M69]} [...] took a fresh look at the problem in the 1980s."^{[S20]} However, the 1969 book^{[M69]} addressed a "deep learning problem" (a limitation of Gauss & Legendre's shallow learning around 1800^{[DL1-2]}) that had already been solved four years prior (see Sec. II). Deep learning research was also alive in the 1970s, especially outside of the Anglosphere.^{[DEEP2][BP6][CNN1][DL1-2]}
Dropout is actually a variant of Hanson's much earlier stochastic delta rule (1990).^{[Drop1-2]} Hinton's 2012 paper and his later patent did not cite this either. Furthermore, dropout was not needed for superhuman vision, as we showed already in 2011 in a contest where LeCun's team participated as well;^{[DAN1]} see Sec. D above. Back then, the really decisive factor was the great speedup of deep CNNs through GPUs.^{[GPUCNN1,3,5][R6]} Already before ImageNet 2012,^{[R6]} our fast deep CNN called DanNet had a monopoly on winning computer vision competitions.^{[GPUCNN5]} It more than "halved the error rate for object recognition" (ACM's wording) in a contest already in 2011,^{[GPUCNN2][DAN,DAN1][R6]} long before the similar system of Hinton's student. See Sec. D as well as Sec. 19 of the overview.^{[MIR]}
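For concreteness, here is what dropout-style multiplicative noise looks like as a tiny sketch in plain Python (an illustrative "inverted dropout" variant; Hanson's stochastic delta rule instead injects noise into the weights rather than the unit activations):

```python
import random

def dropout(activations, p_drop, rng=None, train=True):
    # Inverted dropout (a sketch): during training, zero each unit with
    # probability p_drop and rescale survivors by 1/(1-p_drop), so the
    # expected activation is unchanged; at test time, pass through.
    if not train or p_drop == 0.0:
        return list(activations)
    rng = rng or random.Random(0)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

The rescaling by `1/(1-p_drop)` is a common convenience so that no extra correction is needed at inference time.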
Hybrid NN-HMM speech recognition systems date back to the late 1980s.^{[BW][BRI][BOU]} The modern revolution, however, came through our LSTM (1990s-2005)^{[LSTM0-6]} and CTC^{[CTC]} (2006), which were applied to speech in 2007.^{[LSTM4][LSTM14]} CTC-LSTM is end-to-end-neural and thus very different from (and superior to) the hybrid methods of the late 1980s.^{[BW][BRI][BOU][HYB12]} See also Sec. A.
5 years earlier, in 1995, we already had a similar, excellent neural probabilistic text model.^{[SNT]} Bengio^{[NPM]} characterizes it only briefly as "related" (see also Pollack's earlier work on embeddings of words and other structures^{[PO87][PO90]}). The most visible NLP success of the 2010s was actually driven by the LSTM of our team,^{[LSTM0-6]} which Bloomberg called "arguably the most commercial AI achievement."^{[AV1][MIR](Sec. 4)} See Sec. B. The attention mechanism of Bengio's team^{[ATT14]} has indeed become important. For example, it helped to further improve Facebook's LSTM-based translation (see Sec. B). But we had both types of adaptive neural sequential attention much earlier: end-to-end-differentiable "soft" attention in the latent space of Fast Weight Programmers (FWPs),^{[FWP2][FWP]} and "hard" attention (in observation space) in the context of RL^{[ATT][ATT0-1]} (1990). Today's attention-based Transformers^{[TR1-6]} have become a popular alternative to RNNs. My FWP of 1991^{[FWP0-1]} computes fast weight changes through additive outer products of activation patterns (now often called keys and values for self-attention).^{[TR1-6][FWP]} In the 2010s,^{[DEC]} Transformers^{[TR1-2]} excelled in NLP, a traditional LSTM domain (see Sec. B), although there remain problems that LSTM can rapidly learn to solve but Transformers cannot.^{[LSTM13,17]} The linear Transformers or Performers^{[TR5-6]} are formally equivalent to my 1991 FWPs (apart from normalization).^{[FWP6][FWP]} In 1993, I introduced the attention terminology^{[FWP2]} now used in this context,^{[ATT]} and RNNs that program themselves.
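The 1991 fast weight mechanism can be sketched in a few lines of plain Python (a hypothetical minimal memory for exposition, not the original formulation, and with keys and values fed in directly rather than generated by a slow net): writing adds an outer product of a value and a key to a fast weight matrix, and reading multiplies a query into that matrix, i.e., unnormalized linear self-attention.

```python
def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

class FastWeightMemory:
    # Fast weight matrix updated by additive outer products (a sketch).
    # write(key, value): W += value (outer) key
    # read(query):       W @ query  -- retrieves stored associations
    def __init__(self, dim):
        self.W = [[0.0] * dim for _ in range(dim)]

    def write(self, key, value):
        for i in range(len(value)):
            for j in range(len(key)):
                self.W[i][j] += value[i] * key[j]

    def read(self, query):
        return matvec(self.W, query)
```

With mutually orthogonal keys, each stored value is retrieved exactly; this additive key/value update is the formal core shared with linear Transformers.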
See^{[MIR](Sec. 9)[R4]} for my related priority dispute on attention with Hinton. He was the reviewer of my 1990 paper,^{[ATT2]} yet did not cite it in his own closely related work.^{[ATT3]}
GANs^{[GAN0-1]} (2010-2014) are actually a simple application^{[AC]} of the adversarial curiosity (AC) principle from 1990^{[AC90,90b][AC20]} (see also surveys^{[AC09-10]}). This principle is now widely used for exploration in RL (e.g., Sec. C) and for image synthesis^{[GAN1]} (also mentioned by ACM in Sec. XVIII). In this setting, a predictor NN minimizes its error, while a generator NN tries to make outputs that maximize this error: one net's loss is the other net's gain. 4 years before the GAN paper,^{[GAN1]} a well-known 2010 survey^{[AC10]} summarised the generative adversarial NNs of 1990 as follows: a predictor NN learns to predict whether the controller's (or generator's) output is in a given set.^{[AC20][AC]} (The early adversarial machine learning settings^{[S59][H90]} neither involved unsupervised NNs nor were about modeling data nor used gradient descent.^{[AC20]}) Bengio et al. neither cited the original work^{[AC90,90b][AC20]} nor corrected their erroneous claims^{[GAN1]} about my Predictability Minimization (PM, 1991).^{[PM1-2][AC20][R2][MIR](Sec. 5)} According to Bloomberg,^{[AV1]} they kept defending their NIPS 2014 paper^{[GAN1]} and some of the erroneous claims it made about my prior work.^{[AC20]} Goodfellow eventually admitted that PM is adversarial (his paper^{[GAN1]} still claims the opposite), but emphasized that it's not generative. However, the even earlier AC^{[AC90,90b][AC10][AC20]} is both adversarial and generative (its generator contains probabilistic units^{[AC90]} like in StyleGANs^{[GAN2]}). When the authors^{[GAN1]} did not publish a correction, I published one myself in the hopes of correcting the annals of history.^{[AC20]} Others have also pointed out that GANs are instances of my earlier work.^{[R2][AC20]} Similar for the vanishing gradient problem:^{[MIR](Sec. 3)[VAN1]} after Sepp's 1991 analysis, Bengio published his own,^{[VAN2]} without citing Sepp.
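The zero-sum principle described above can be illustrated with a deliberately tiny numeric sketch (all names and constants here are made up for exposition; real adversarial curiosity uses NNs for both players): a scalar predictor is trained by gradient descent on its squared error, while the generator picks, among candidate outputs, the one where the predictor currently errs most. The predictor's loss is exactly the generator's reward.

```python
import random

def run(steps=200, lr=0.1, seed=0):
    # Toy adversarial loop: predictor w predicts the environment's
    # response f(x) = x to the generator's output x (so the true answer
    # corresponds to w = 1). One player's loss is the other's gain.
    rng = random.Random(seed)
    w = 0.0
    errors = []
    for _ in range(steps):
        candidates = [rng.uniform(-1.0, 1.0) for _ in range(5)]
        # Generator move: pick the output maximizing the predictor's error.
        x = max(candidates, key=lambda c: (w * c - c) ** 2)
        err = w * x - x
        errors.append(err ** 2)
        w -= lr * 2.0 * err * x  # predictor: gradient step on squared error
    return w, errors
```

The adversarial pressure makes the generator probe exactly where the predictor is still wrong, so the predictor's errors shrink over the run.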
The matter was settled in favor of Sepp.^{[VAN1]} However, even after a common publication,^{[VAN3]} Bengio published papers^{[VAN4][XAV]} on the topic without citing him. (Citation counts are poor indicators of truly pioneering work.^{[NAT1]}) (Margin note: Bengio states^{[YB20]} that in 2018 he clarified some of this; if one initially fails to cite relevant prior work, one must at least clarify it later.^{[DLC]}) Bengio also claims^{[YB20]} priority for work of 1995, although my publications on exactly this topic date back to 1991-93.^{[UN0-2][UN]} Similar for meta-learning, which I started in 1987^{[META1][META]} long before Bengio, who nevertheless suggests^{[YB20]} that he did it before me.^{[R3]} Regarding attention-based Transformers,^{[TR1-6]} Bengio^{[DL3a]} cites his own team (2014) for "soft attention" without citing my much earlier original work of 1991-1993 on soft attention and linear Transformers.^{[FWP,FWP0-2,6]} Bengio has also heavily used our LSTM (see Sec. A-C), proposing the name "gated recurrent units (GRU)"^{[LSTMGRU]} for a variant of our vanilla LSTM architecture^{[LSTM2]} (2000) which he did not cite, although our work^{[LSTM2]} was the one that introduced gated recurrent units. In addition, our team automatically evolved lots of additional LSTM variants and topologies already in 2009^{[LSTM7]} without changing the name of the basic method. (GRUs can neither learn to count^{[LSTMGRU2]} nor learn simple non-regular languages;^{[LSTMGRU2]} they also underperform LSTM according to Google Brain.^{[LSTMGRU3]}) Similar for unsupervised pre-training for deep NNs.^{[UN0-4][HIN](Sec. II)[MIR](Sec. 1)} Hinton's paper^{[UN4]} (2006) appeared long after my earlier work on this,^{[UN0-2]} which described the first NNs shown to solve very deep problems (see Sec. II above).^{[UN]} It was published in 1991-92,^{[UN1]} when compute was about 1000 times more expensive than in 2006. Hinton did not cite it, not even in his later survey (2015).^{[DL3][DLC]} See also Sec. II & III. Similar for compressing or distilling one NN into another.^{[UN0-2][DIST1-2][MIR](Sec. 2)} Hinton^{[DIST2]} (2006) did not cite my much earlier original work on this (1991),^{[UN1][UN]} not even in his later patent application. Similar for fast weight programmers^{[FWP][FWP0-4a]} through tensor-like outer products (1991-2016) and their motivation^{[FWP2][FWP4a][MIR](Sec. 8)} (see also Sec. XVI above). Similar for learning sequential attention with NNs.^{[MIR](Sec. 9)} Hinton^{[ATT3]} (2010) did not cite our much earlier work on this,^{[ATT1][ATT]} although he was both reviewer and editor of my summary^{[ATT2]} (1990; see Sec. XVI above).
The ten priority disputes mentioned in the present Sec. XVII are not the only ones.^{[R4]} Remarkably, three of them are related to the 1991 paper^{[UN1][UN]} which in many ways started what people now call deep learning. Most of the disputes go back to work of 1990-91.^{[MIR]} See Sec. I for additional related issues of credit assignment.
LeCun's team has made important contributions to CNNs since 1989.^{[CNN2,4]} However, the basic CNN architecture with convolutional and downsampling layers is actually due to Fukushima (1979).^{[CNN1]} NNs with convolutions were later (1987) combined by Waibel with weight sharing and backpropagation.^{[CNN1a]} Waibel called this TDNN, not CNN. All of this happened before LeCun's work on CNNs. See Sec. D above and Sec. 21 of the overview of our Annus Mirabilis 1990-1991.^{[MIR]} Furthermore, at IJCNN 2011 in Silicon Valley, our DanNet^{[DAN][GPUCNN1-3]} won the visual pattern recognition contest with superhuman performance (LeCun's team took a distant second place, with three times worse performance).^{[DAN1]} Again see Sec. D. And at ICPR 2012, our DanNet^{[GPUCNN1-3]} won the medical imaging contest (Sept 2012, on detection of mitosis/cancer)^{[GPUCNN5,7,8]} (before the similar AlexNet won ImageNet 2012^{[GPUCNN5][R6]} and the similar VGG network^{[GPUCNN9]} won ImageNet 2014). Our approach to mitosis detection^{[MGC][GPUCNN5,7,8]} is now in broad use; many major companies are using it. See Sec. D & VII. ACM also explicitly mentions speech recognition and speech synthesis.^{[AM16][DL1]} All of these fields were heavily shaped in the 2010s by our non-CNN methods.^{[DL1][DL4][AM16][GSR][GSR15][GT16][WU][FB17]} See Sec. A, B, VI, XI.
As mentioned in Sec. XII, backpropagation was actually proposed earlier as a learning method for NNs by Werbos (1982)^{[BP2-4]} (see also Amari's work of 1977^{[BP6]}), whom LeCun did not cite, not even in recent work.^{[DL3,DL3a][DLC]} In 1960, Kelley already had a precursor of the algorithm.^{[BPA]} Furthermore, many besides LeCun have worked "to speed up backpropagation algorithms"^{[DL1]} (ACM's wording). More on the history of backpropagation can be found at Scholarpedia^{[DL2]} and in my overview.^{[BP4]}
However, "hierarchical feature representation" in deep learning networks is what Ivakhnenko & Lapa (1965)^{[DEEP1-2]} (and also Fukushima^{[CNN1][DL2]}) had long before LeCun. See Sec. D & II & XIII & V.
LeCun et al. did not cite the origins^{[BP1]} (1970) of this widely used type of automatic differentiation for differentiable networks of modules,^{[DL2][BP4-5][DLC]} nor early compiler-based implementations for such systems.^{[S80]} See also Sec. XIX & XII. Others had such module-based systems before LeCun, who did not cite them; see also Pollack's even earlier relevant work.^{[PO87-90]}
(Furthermore, "complex networks of modules where backpropagation is performed" were the central theme of my much earlier habilitation thesis (1993).^{[UN2]} For example, our adaptive subgoal generators (1991)^{[HRL0-2]} were trained through end-to-end-differentiable chains of such modules,^{[MIR](Sec. 10)} as was our planning and reinforcement learning with recurrent neural world models (1990).^{[PLAN][MIR](Sec. 11)} Same for my linear transformer-like fast weight programmers^{[FWP0-2][FWP][ATT][MIR](Sec. 8)} since 1991; see Sec. XVI.) Some of the reactions to my critique were reminiscent of "100 Authors against Einstein":^{[AH1]} ad hominem attacks^{[AH2-3][HIN]} in the style of "If you cannot dispute a fact-based message, attack the messenger himself."^{[HIN]} No ad hominem attack and no award can ever change the facts.^{[HIN]} As mentioned above, LBH and their co-workers have contributed useful improvements of deep learning methods,^{[CNN2,4][CDI][LAN][RMSP][XAV][ATT14][CAPS]} but the central methods were created by others whom they did not cite (see Sec. II, V, XII, XIX, XXI, XIII, XIV, XI, and XX, as well as Sec. I, A, B, C, D, XVII, VI, and XVI). As emphasized earlier:^{[DLC][HIN]} science must be committed "to self-correction,"^{[SV20]} as is already the standard in other scientific fields. And what about claims made in popular science venues without peer review? For example, the narrator of a popular 2018 Bloomberg video^{[VID2]} attributed the modern speech recognition revolution to Hinton, although it was achieved in Germany and Switzerland (LSTM & CTC; see Sec. A) long before Hinton's methods. Similarly, in 2016, the NY Times published an article^{[NYT3]} that ignored LSTM, although Google's original 2016 paper on Google Translate^{[WU]} mentions LSTM over 50 times (see Sec. B). In ad hominem style,^{[AH2-3]} LeCun told the NY Times that I am "claiming credit he doesn't deserve for many, many things",^{[NYT1]} without backing this up by any facts. LeCun also praised the GANs of Bengio's team,^{[GAN1]} although GANs are variations of my work in 1990.^{[AC90,90b][AC20][R2]} According to Bloomberg,^{[AV2]} Bengio has simply "denied my claims" without backing up his denial by any facts; see Sec. XVII.
Scientists should denounce misleading claims "and forcefully contradict public figures who promote it."^{[FAKE]} LBH, who called themselves the deep learning conspiracy,^{[DLC]} heavily cite each other, but their most cited work builds on ours. Our LSTM paper^{[LSTM1]} has got more citations than any paper by Bengio or LeCun.^{[R5]} Hinton's most cited paper (2012) is the one on GPU-based CNNs.^{[GPUCNN4][R5]} It follows our earlier work on supervised deep NNs (2010),^{[MLP1]} which abandoned unsupervised pre-training for deep NNs (pioneered by myself^{[UN][UN0-3]} and later championed by Hinton;^{[UN4][VID1]} see Sec. D). Hinton (2012)^{[GPUCNN4]} cites our deep and fast DanNet (2011),^{[GPUCNN1-3]} which won four contests before AlexNet won one;^{[R6]} see Sec. D, XIV. The highly cited VGG network (2014)^{[GPUCNN9]} further extended this line of work. Hinton's 2nd most cited paper^{[RUM][R5]} is the one on backpropagation (some citation counts of Hinton's paper^{[RUM]} even add citations for a book by Rumelhart & McClelland^{[R5]}). Backpropagation is a previously invented method^{[BP1]} whose origins Hinton did not cite; similar for the deep learning of Ivakhnenko whom he has never cited;^{[DEEP1-2][R7-R8]} see Sec. II, XIII. Bengio's 2nd most cited research paper is the one on GANs (2014),^{[GAN1]} which are instances of my artificial curiosity (1990)^{[AC90,90b][AC20][R2]} which he did not cite; see Sec. XVII. Hinton's highly cited papers on unsupervised pre-training for deep NNs (2006-)^{[UN4]} were preceded by ours,^{[UN0-2][UN]} and his dropout papers were preceded by Hanson's work.^{[Drop1-2]} As recently as 2021, ACM published yet another misleading deep learning "survey" by LBH,^{[DL3a]} again heavily citing LBH without crediting the original sources. Consult the Executive Summary and Sec. I-XXI of this critique for more. So virtually all the deep learning algorithms that have attracted broad attention have their conceptual and technical roots in my labs in Munich and Lugano,^{[MOST]} apart from the foundations of deep learning MLPs since 1965^{[DEEP1-2]} (see Sec. II, XX), backpropagation (1960-70)^{[BPA][BP1]} (see Sec. XIX, XII), and convolutional NNs since 1979^{[CNN1-4]} (see Sec. XVIII, D). Our LSTM (1990s, see Sec. A, B; also for RL, 2003-, see Sec. C) → our Highway Net (May 2015) → ResNet (Dec 2015, see Sec. D).
Our adversarial Artificial Curiosity (1990) → GANs (2010s, see Sec. XVII). Our unsupervised pre-training of deep NNs (1991, see Sec. II & III): for recurrent NNs in the 1990s → our LSTM (see Sec. A-C); for feedforward NNs in 2010 → our DanNet (2011) → AlexNet (2012) and VGG Net (2014) (see Sec. D). Our LSTM brought essentially unlimited depth to supervised recurrent NNs in the 1990s; our Highway Nets^{[HW1-3]} brought it to feedforward NNs in May 2015.^{[MOST]} This enabled superior computer vision (2011, see Sec. D, XVIII), medical diagnosis (2012, see Sec. VII, XVIII), and many other applications.^{[DEC]} LSTM with our CTC enabled superior speech recognition (2007-15, see Sec. A), machine translation (2016, see Sec. B), robotics & video game players (2018-19, see Sec. C), and many other applications.^{[DEC]} Our Fast Weight Programmers (1991, see Sec. XVI) are formally equivalent to linear Transformers (now popular in NLP). See Sec. I, A, B, C, D, VII, XVIII.
As mentioned earlier,^{[MIR](Sec. 21)} it is not always easy to assign credit correctly.^{[DLC]} In 1965, Ivakhnenko & Lapa had the first multilayer perceptrons of arbitrary depth that really learned.^{[DEEP1-2][R8]} Five years later came modern backpropagation.
Yes, this critique is also an implicit critique of certain other awards to LBH.^{[HIN]} See also the many debates at reddit.com/r/MachineLearning^{[R1-R12]} (the largest machine learning forum, with back then over 800k subscribers), many of them influenced by my overview.^{[MIR]}
Dr. LeCun himself is well aware of the challenges to scientific integrity in our field:^{[LECP]} "... else cites."^{[LECP]}
Note that I am insisting on proper credit assignment not only in my own research field but also in quite disconnected areas,^{[HIN]} as demonstrated by my numerous letters in this regard published in Science and Nature, e.g., on the history of aviation,^{[NASC1-2]} the telephone,^{[NASC3]} the computer,^{[NASC4-7]} resilient robots,^{[NASC8]} and scientists of the 19th century.^{[NASC9]} One day, such work may even be done by AI scientists and AI historians equipped with artificial curiosity.^{[SA17][AC90-AC20][PP-PP2]}
Thanks to many expert reviewers for useful comments. Since science is about self-correction, let me know under juergen@idsia.ch if you can spot any remaining error. Many additional relevant publications can be found on my publication page and my arXiv page. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

References:

[AC] J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. Includes the first paper on planning with reinforcement learning recurrent neural networks (NNs) and on generative adversarial networks, with a brief summary of the generative adversarial neural networks of 1990^{[AC90,90b][AC20]} (more on artificial scientists and artists). Preprint arXiv/1906.04493.
[AIB] J. Schmidhuber. AI Blog. Includes variants of chapters of the AI Book.
[AM16] Blog of Werner Vogels, CTO of Amazon (Nov 2016).
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. We had both hard attention (1990) and soft attention (1991-93).^{[FWP]} Today, both types are very popular.
[ATT14] Preprint arXiv/1409.0473, 2014-16.
[AV1] Bloomberg, May 15, 2018.
[AV2] Bloomberg, May 17, 2018.
[BPA] Precursor of modern backpropagation.^{[BP1-4]}
[BP2] First application of backpropagation^{[BP1]} to NNs (concretizing thoughts in his 1974 thesis).
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020). Who invented backpropagation? More in Scholarpedia.^{[DL2]}
[CNN1a] A. Waibel. Phoneme Recognition Using Time-Delay Neural Networks. Meeting of IEICE, Tokyo, Japan, 1987. First application of backpropagation^{[BP1][BP2]} and weight-sharing to NNs with convolutions (compare spatial averaging^{[CNN1]}).
Beijing, 2014. Preprint arXiv:1402.3511 [cs.NE].
[DAN] J. Schmidhuber (AI Blog, 2021). 10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named 1st superhuman result in 2011.^{[DAN1]}
[DAN1] J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet). First superhuman visual pattern recognition, achieved by our artificial neural network called DanNet.
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, revised 2021). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s.
[DIST1] J. Schmidhuber, 1991.^{[UN-UN2]}
[DL3a] Y. Bengio, Y. LeCun, G. Hinton (2021). Turing Lecture: Deep Learning for AI. Communications of the ACM, July 2021.
[DL4] J. Schmidhuber (AI Blog, 2017). Our impact on the world's most valuable public companies: Apple, Google, Microsoft, Facebook, Amazon. Includes greatly improved (CTC-based) LSTM for on-device speech recognition (on the phone, not the server).
[DL6a] J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.^{[DL6]} Soon after its publication, everybody started talking about "deep learning." Causality or correlation?
[DL7] Web site deeplearning.net of Y. Bengio's MILA (2015, retrieved May 2020; compare the version in the Internet Archive), referring to Hinton's^{[UN4]} and Bengio's^{[UN5]} unsupervised pre-training for deep NNs (2006), although this type of deep learning dates back to 1991.^{[UN1-2][UN]} See Sec. II & XVII & III.
[DLC] J. Schmidhuber (AI Blog, June 2015). Critique of Paper by "Deep Learning Conspiracy" (Nature 521 p 436).
Preprint arxiv:1312.5602.
[DM3] Alphastar has a "deep LSTM core."
Preprint arXiv:1808.03578, 2018.
[FB17] Facebook used LSTM for over 4 billion automatic translations per day (The Verge, August 4, 2017); Facebook blog by J.M. Pino, A. Sidorov, N.F. Ayan (August 3, 2017).
[FWP] J. Schmidhuber (AI Blog, 26 March 2021). Fast weight programmers: an alternative^{[FWP0-1]} to recurrent NNs. Such Fast Weight Programmers^{[FWP0-6,FWPMETA1-7]} can learn to memorize past data, e.g., by computing fast weight^{[FAST,FASTa]} changes through additive outer products of self-invented activation patterns^{[FWP0-1]} (now often called keys and values for self-attention^{[TR1-6]}). The similar Transformers^{[TR1-2]} combine this with projections; compare the linear Transformers or Performers.^{[TR5-6]} In 1993, I introduced the attention terminology^{[FWP2]} now used in this context,^{[ATT]} and RNNs that program themselves.
Preprints arXiv:1811.12143; arXiv:2003.08165 (like [FWP0-2]).
[FWP6] Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021. Preprint arXiv:2102.11174.
Preprint arXiv:2106.06295 (June 2021).
[FWPMETA1] An introspective network that can learn to run its own weight change algorithm. In Proc. of the Intl. Conf. on Artificial Neural Networks.
[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993.
Preprint arXiv:2012.14905 [cs.LG], 2020. Report arXiv:2011.07831 [cs.AI], 2020.
[GSR15] Google Research Blog, Sep 2015 (see also Aug 2015): Google's speech recognition based on CTC and LSTM.
[GSR] Alphr Technology, Jul 2015; 9to5google, Jul 2015.
[GT16] WIRED, Sep 2016; siliconANGLE, Sep 2016.
[GAN0] Blog post, Internet Archive, 2010: describes the basic ideas^{[AC][AC90,AC90b][AC20]} of GANs.
Description of GANs that does not cite the original work of 1990^{[AC][AC90,AC90b][AC20][R2]} (also containing wrong claims about Predictability Minimization^{[PM0-2][AC20]}). This was number 1 on Hacker News.
Frankfurter Allgemeine Zeitung, 16/6/2021.
Preprint arXiv/2005.14165.
[GPUCNN1] Flexible, High Performance Convolutional Neural Networks for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011.
[GPUCNN5] Our deep CNNs won four important computer vision competitions 2011-2012 before any other competitor.^{[DAN1]} This led to massive interest from industry.
[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More. PDF. J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet won 4 of them in a row before the similar AlexNet/VGG Net and the Resnet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision. PDF. PDF. first deep learner to win a medical imaging contest (2012). HTML. [HIN] J. Schmidhuber (AI Blog, 2020). Critique of Honda Prize for Dr. Hinton. Science must not allow corporate PR to distort the academic record. PDF. North-Holland, 1991. PDF. Extending TR FKI-129-90, TUM, 1990. PDF. PDF. Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS 2015. The LSTM with forget gates^{[LSTM2]} for RNNs.) Resnets^{[HW2]} are a version of this where the gates are always open: g(x)=t(x)=const=1. Highway Nets perform roughly as well as ResNets^{[HW2]} on ImageNet.^{[HW3]} Highway layers are also often used for natural language processing, where the simpler residual layers do not work as well.^{[HW3]} More. Link. arXiv:1512.03385 (Dec 2015). Residual nets are a version of Highway Nets^{[HW1]} More. arxiv:1612.07771 (2016). Also at ICLR 2017. Preprint arXiv:1704.04760 PDF. PDF. arXiv:1607.06450, 2016. A New Publishing Model in Computer Science. Local copy (HTML only). [LEI21] J. Schmidhuber (AI Blog, 2021). 375th birthday of Leibniz, founder of computer science. Frankfurter Allgemeine Zeitung (FAZ), 17/5/2021. FAZ online: 19/5/2021. PDF. [LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF. Based on [LSTM0]. More. PDF. PDF. PDF. PDF. PDF. PDF. PDF. PDF. PDF. PDF. PDF. Preprint: arxiv:1506.07452. PDF. J. 
Schmidhuber (AI Blog, Dec 2020). 10-year anniversary of our journal paper on deep reinforcement learning with policy gradients for LSTM (2007-2010). Recent PDF. Preprint arXiv:1805.04908. Architectures. Preprint arXiv:1703.03906 J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of Searchable PDF scan (created by OCRmypdf which uses LSTM). HTML. better GP methods through Meta-Evolution. More. [MIR] J. Schmidhuber (AI Blog, 2019). Deep Learning: Our Miraculous Year 1990-1991. Preprint arXiv:2005.05744, 2020. Computation 22(12): 3207-3220, 2010. ArXiv Preprint. (AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training. By 2010, when compute was 100 times more expensive than today, both our feedforward NNs^{[MLP1]} J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in my labs at TU Munich and IDSIA. Here I mention: (1) Long Short-Term Memory (LSTM), (2) ResNet (which is our earlier Highway Net with open gates), (3) AlexNet and VGG Net (both citing our similar earlier DanNet: the first deep convolutional NN to win image recognition competitions), Adversarial Artificial Curiosity), and (5) variants of Transformers (linear Transformers are formally equivalent to my earlier Fast Weight Programmers). Annus Mirabilis of 1990-1991.^{[MIR]} Preprint arXiv:1611.01578 (PDF), 2017. [NASC1] J. Schmidhuber. First Pow(d)ered flight / plane truth. Correspondence, Nature, 421 p 689, Feb 2003. [NASC3] J. Schmidhuber. The last inventor of the telephone. Letter, Science, 319, no. 5871, p. 1759, March 2008. Correspondence, Nature, vol 483, p 541, March 2012, doi:10.1038/483541b. Letter, Science, vol 336, p 1639, June 2012. See also comment on response by A. Hodges (DOI:10.1126/science.336.6089.1639-a) [NASC6] J. Schmidhuber. Colossus was the first electronic digital computer. Correspondence, Nature, 441 p 25, May 2006. [NASC7] J. 
Schmidhuber. Turing's impact. Correspondence, Nature, 429 p 501, June 2004. [NASC8] J. Schmidhuber. Prototype resilient, self-modeling robots. Correspondence, Science, 316, no. 5825 p 688, May 2007. [NASC9] J. Schmidhuber. Comparing the legacies of Gauss, Pasteur, Darwin. Correspondence, Nature, vol 452, p 530, April 2008. HTML. Link. NY Times article. Learning Dexterous In-Hand Manipulation. arxiv:1312.5602 (PDF). arxiv:1912.06680. An LSTM with 84% of the model's total parameter count was the core of OpenAI Five (2018). PDF. HTML. J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, the GAN principle. Based on TR FKI-126-90 (1990).^{[AC90]} More. PDF. Partially based on TR FKI-126-90 (1990).^{[AC90]} Report arXiv:1210.0118 [cs.AI], 2015. One Big Net For Everything. Preprint arXiv:1802.08864 [cs.AI], Feb 2018. Preprint: arXiv:1809.01999. Github: World Models. minimization. TR CU-CS-565-91, Univ. Colorado at Boulder, 1991. PDF. More. 1991. PDF. More. PDF. More. arXiv:1112.5309 [cs.AI] First Experiments with PowerPlay. arXiv:1210.8385 [cs.AI]. [R1] Reddit/ML, 2019. Hinton, LeCun, Bengio receive ACM Turing Award. [R2] Reddit/ML, 2019. J. Schmidhuber really had GANs in 1990. [R3] Reddit/ML, 2019. NeurIPS 2019 Bengio Schmidhuber Meta-Learning Fiasco. [R4] Reddit/ML, 2019. Five major deep learning papers by G. Hinton did not cite similar earlier work by J. Schmidhuber. [R5] Reddit/ML, 2019. The 1997 LSTM paper by Hochreiter & Schmidhuber has become the most cited deep learning research paper of the 20th century. [R6] Reddit/ML, 2019. DanNet, the CUDA CNN of Dan Ciresan in J. Schmidhuber's team, won 4 image recognition challenges prior to AlexNet. [R7] Reddit/ML, 2019. J.
Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970. [R8] Reddit/ML, 2019. J. Schmidhuber on Alexey Ivakhnenko, godfather of deep learning 1965. [R9] Reddit/ML, 2019. We [R11] Reddit/ML, 2020. Schmidhuber: Critique of Honda Prize for Dr. Hinton [R12] Reddit/ML, 2020. J. Schmidhuber: Critique of Turing Award for Drs. Bengio & Hinton & LeCun [R15] Reddit/ML, 2021. J. Schmidhuber's work on fast weights from 1991 is similar to linearized variants of Transformers Preprint arXiv/1311.2524, Nov 2013. Preprint arXiv/1703.06870, 2017. PDF. This experimental analysis of backpropagation did not cite the origin of the method,^{[BP1-4]} also known as the reverse mode of automatic differentiation. Link. The Past, Present and Future of Artificial Intelligence. PDF. PDF. ACM's justification of the 2018 A.M. Turing Award (announced in 2019). WWW link. Local copy 1 (HTML only). Local copy 2 (HTML only). [T20a] J. Schmidhuber (AI Blog, 25 June 2020). Critique of 2018 Turing Award for Drs. Bengio & Hinton & LeCun. The first version of the present critique. Link. [TUR21] J. Schmidhuber (AI Blog, Sep 2021). Turing Oversold. It's not Turing's fault, though. J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised pre-training. Unsupervised PDF. 1992. Based on TR FKI-148-91, TUM, 1991.^{[UN0]} PDF. approaches are now widely used. More. [UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF. can be found here (depth > 1000). 2006. PDF. Link. [VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF. More on the Fundamental Deep Learning Problem. PDF. [VAN4] Y. Bengio. Neural net language models. Scholarpedia, 3(1):3881, 2008. Link. Link. Youtube video [see 28:16]. But in 2010, our team showed^{[MLP1-2]} unsupervised pre-training is not necessary Youtube video, 2018. Preprint arXiv:1609.08144 (PDF), 2016. Based on LSTM which it mentions at least 50 times. 
WWW link (retrieved 15 May 2020). Local copy (plain HTML only). a general, practical, program-controlled computer. PDF. J. Schmidhuber (AI Blog, 2021). 80th anniversary celebrations: 1941: Konrad Zuse completes the first working general computer, based on his 1936 patent application.
Traditionally this is done with recurrent NNs (RNNs). In 1991, however, an alternative was published:^{[FWP0-1]} a slow NN learns to program the fast weights of another NN (see Sec. 1) by computing fast weight changes through additive outer products of self-invented activation patterns (now often called keys and values for self-attention; Sec. 2). The very similar Transformers^{[TR1-2]} combine this with projections and softmax; Transformers with linearized self-attention^{[TR5-6]} are formally equivalent to the 1991 Fast Weight Programmers^{[MOST]} (see this tweet). In 1993, I also introduced the attention terminology^{[FWP2]} now used in this context^{[ATT]} (Sec. 4), and RNNs that program themselves (Sec. 3).
Fast Weight Programmers do not suffer from the famous vanishing gradient problem, aka the fundamental deep learning problem (analyzed a few months later in 1991^{[VAN1]}), because they favor additive fast weight changes (Sec. 5), just like the additive neural activations of LSTMs / Highway Nets / ResNets^{[HW1-3]} (Sec. 5) of the Annus Mirabilis of deep learning.^{[MIR]} I also discuss a brand new, improved version^{[FWP6]} of the 1991 fast weight update rule (Sec. 6), reinforcement learning through neuroevolution^{[FWP5]} (2005-, Sec. 7), goal-conditioned policy generators (2022),^{[GGP]} and metalearning machines that learn to learn^{[FWPMETA1-9]} (1992-2022, Sec. 8).
As I have frequently emphasized since 1990,^{[AC90][PLAN][META]} the weights of an artificial neural network (NN) should be viewed as its program. Inspired by universal self-referential formal systems,^{[GOD][GOD34]} I built NNs whose outputs are changes of programs or weight matrices of other NNs^{[FWP0-2]} (Sec. 1, 2, 3), and even NNs that can run and inspect their own weight change algorithms or learning algorithms^{[FWPMETA1-5]} (Sec. 8). A gradient descent procedure^{[BP1-4][BPA][R7]} can compute a direction in program space where one may find a better program,^{[AC90]} in particular, a better program-modifying program.^{[FWP0-2][FWPMETA1-5]} Deep learning started in 1965 with networks of arbitrarily many layers.^{[DEEP1-2]} Their activation functions were Kolmogorov-Gabor polynomials which include the now popular multiplicative gates,^{[DL1-2]} a building block of fast weights. However, von der Malsburg was the first to explicitly emphasize the importance of NNs with rapidly changing weights.^{[FAST]} The second paper on this was published by Feldman in 1982.^{[FASTa]} The weights of a 1987 NN were sums of weights with a large learning rate and weights with a small rate^{[FASTb][T22]} (but such weights have nothing to do with the NN-programming NNs discussed below). The first NN-programming NNs, the Fast Weight Programmers (FWPs), were published in 1991-93^{[FWP0-2]} (Sec. 1, 2, 3, 4). They are closely related to what's now called attention^{[ATT]} (Sec. 4) and Transformers^{[TR1-6]} (Sec. 2, 3, 4, 5).
The first FWP was published on 26 March 1991:^{[FWP0]} a slow NN that learns by backpropagation^{[BP1-4]} to rapidly modify the fast weights of another NN; it was essentially also published in Neural Computation.^{[FWP1]} This is related to what's now called attention^{[ATT]} (Sec. 4). That is, I separated storage and control like in traditional computers, but in a fully neural way (rather than in a hybrid fashion^{[PDA1][PDA2][DNC]}). Compare also the later Synthetic Gradients.^{[NAN1-5]} I offered this as an alternative to recurrent NNs (RNNs). One of the FWPs of 1991^{[FWP0-1]} is illustrated in the figure. A disadvantage addressed in Sec. 2 is that the slow net needs many output units if the fast net is large.
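As a concrete illustration, here is a minimal numpy sketch of this Sec. 1 scheme (toy sizes, random data, and the linear slow net are my own assumptions for illustration, not the exact 1991 architecture): the slow net emits one additive change per fast weight, and the fast net then processes the input under its current "program".

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical): the fast net maps 4 inputs to 3 outputs,
# so it has 3*4 = 12 fast weights; the slow net must emit all 12.
n_in, n_out = 4, 3
n_fast = n_out * n_in

# Slow net: a simple linear layer whose 12 outputs are *additive
# changes* to the fast weight matrix (one slow output unit per fast weight).
W_slow = rng.normal(scale=0.1, size=(n_fast, n_in))

W_fast = np.zeros((n_out, n_in))  # fast weights, reprogrammed at every step
for t in range(5):
    x = rng.normal(size=n_in)               # current input
    delta = (W_slow @ x).reshape(n_out, n_in)
    W_fast = W_fast + delta                 # additive fast weight change
    y = np.tanh(W_fast @ x)                 # fast net output under current program

# The slow net needs n_out*n_in output units -- the disadvantage
# addressed in Sec. 2 (outer products need only n_out + n_in of them).
print(y.shape)
```

The point of the sketch is the separation of storage (the fast weights) and control (the slow net that writes them).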
The Fast Weight Programmer^{[FWP0-1]} depicted in Sec. 1 has a slow net unit for each fast weight. However, Section 2 of the same 1991 paper^{[FWP0]} also introduced the more efficient principle behind what's now called linear^{[TR5-6]} Transformers^{[TR1-2]} or attention^{[ATT]} (compare Sec. 4): the slow net generates two activation patterns whose additive outer product is applied to the fast weights (which then may be normalized by a squashing function^{[FWP0]}). Such second order tensor products^{[FWP0-3a]} are exactly the ones used by Transformers with linearized self-attention.^{[FWP6][TR5-6]} The highly successful Transformers of 2017^{[TR1-2]} can be viewed as a combination of my additive outer product fast weight principle^{[FWP0-2]} and softmax: attention through NN-programmed fast weights (Sec. 5 & 1). Transformers with linearized self-attention (2020-21)^{[TR5-6]} abandoned the softmax, essentially resurrecting the original 1991 system.^{[FWP0-1]} Compare Sec. 6. Outer product-based associative memories go back at least to Hebb's informal rule (1949)^{[HE49]} and Steinbuch's Learning Matrix around 1960.^{[ST61-63][AMH1-2][KOH72][LIT74][PAL80][KOS88]} End-to-end differentiable NNs, however, have learned to control such memories since 1991.^{[FWP0-3a][TR5-6]} I offered the FWPs of 1991^{[FWP0-1]} as an alternative to sequence-processing recurrent NNs (RNNs) (Sec. 1), the computationally most powerful NNs of them all.^{[UN][MIR](Sec. 0)} Modern Transformers are also viewed as RNN alternatives, despite their limitations.^{[TR3-4]} The slow net and the fast net of the 1991 system^{[FWP0-1]} in Sec. 2 were feedforward NNs (FNNs), like most current Transformers.^{[TR1-6]} In 1993,^{[FWP2]} I collapsed all of this into a single RNN that could rapidly reprogram all of its own fast weights through additive outer product-based weight changes. One motivation reflected by the title of the paper^{[FWP2]} was to get many more temporal variables under end-to-end differentiable control than what's possible in standard RNNs of the same size: O(H^{2}) instead of O(H), where H is the number of hidden units. This motivation and a variant of the method was republished over two decades later.^{[FWP4a][R4][MIR](Sec. 8)[T22](Sec. XVII, item H3)} See also our more recent work on FWPs since 2017,^{[FWP3-3a][FWPMETA7][FWP6]} and compare a recent study.^{[RA21]} 4.
Attention terminology of 1993. Today, everybody is talking about attention when it comes to describing the principles of Transformers.^{[TR1-2]} The additive outer products^{[FWP0-1]} of the Fast Weight Programmers described in Sec. 2 and Sec. 3 correspond to the way in which the attention weights or self-attention weights (see also^{[FWP4b-d]}) are generated through NN-programmed fast weights (Sec. 5).^{[FWP0-1], Sec. 9 & Sec. 8 of [MIR], Sec. XVII of [T22]} My 1993 paper^{[FWP2]} introduced the attention terminology now used in this context, speaking of internal spotlights of attention controlled by Fast Weight Programmers.^{[FWP2][ATT]} Apart from possible normalization/squashing,^{[FWP0]} fast weight changes are additive (Sec. 1 & 2). Hence FWPs do not suffer during sequence learning from the famous vanishing gradient problem, first analyzed by my brilliant student Sepp Hochreiter a few months later in his 1991 diploma thesis.^{[VAN1]}
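The correspondence can be checked numerically. The following sketch (toy dimensions and random data are my own assumptions, not any particular published model) computes softmax attention and its linearized variant, and verifies that the latter equals a query applied to a fast weight matrix built from additive outer products of values and keys:

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 4, 6                       # toy key/value size and sequence length
K = rng.normal(size=(T, d))       # keys
V = rng.normal(size=(T, d))       # values
q = rng.normal(size=d)            # one query

# Softmax attention (Transformer-style): weight values by softmax(q . k_t).
scores = K @ q / np.sqrt(d)
a = np.exp(scores - scores.max())
a /= a.sum()
y_softmax = a @ V

# Linearized attention: drop the softmax. Then the output is just the
# query applied to a fast weight matrix built by additive outer products
# of values and keys -- the 1991 FWP update (Sec. 2).
W_fast = np.zeros((d, d))
for t in range(T):
    W_fast += np.outer(V[t], K[t])        # "program" the fast net
y_fwp = W_fast @ q

# Identical to unnormalized linear attention computed the "attention way":
y_linear = (K @ q) @ V
assert np.allclose(y_fwp, y_linear)
```

Dropping the softmax is exactly what makes the two views coincide; with the softmax present, the equivalence no longer holds term by term.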
Both of these approaches date back to 1991, our miraculous year of deep learning.^{[MIR]} Basic Long Short-Term Memory^{[LSTM1]} solves the vanishing gradient problem by changing its cell state at every time step only through addition. That is, the core of LSTM is operating in a linear additive activation space (ignoring LSTM's multiplicative gates).^{[LSTM1][VAN1][MIR](Sec. 4 & Sec. 8)} Additive FWPs^{[FWP0-2]} (Sec. 1 & 2), however, solve the problem through a dual approach, operating in a linear additive weight space. By favoring additive operations yielding non-vanishing first derivatives and error flow,^{[VAN1]} Transformers^{[TR1-6]} also follow the additive approach^{[FWP0-2]} (compare Sec. 2 and Sec. 4 on attention terminology since 1993).
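A tiny numerical illustration of the contrast (scalar toy recurrences of my own choosing, not actual LSTM equations):

```python
# Backpropagation through T time steps multiplies T Jacobians. With a
# multiplicative recurrence h_t = w * h_{t-1}, the gradient w.r.t. h_0
# scales like w**T: it vanishes for |w| < 1 and explodes for |w| > 1.
T, w = 50, 0.8
grad_multiplicative = w ** T          # about 1.4e-5: vanished

# An LSTM-style cell state is additive: c_t = c_{t-1} + g_t (multiplicative
# gates ignored here), so dc_t/dc_{t-1} = 1 and error flow is constant.
c = 0.0
for _ in range(T):
    g = 0.1                           # new information added at this step
    c = c + g                         # additive update
grad_additive = 1.0 ** T              # derivative of c_T w.r.t. c_0

print(grad_multiplicative, grad_additive)
```

The same first-derivative argument applies to additive fast weight changes: the stored information is not repeatedly squashed through a shrinking Jacobian.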
The success of LSTM^{[LSTM1-13]} is mirrored in the LSTM-inspired Highway Network (May 2015),^{[HW1][HW1a][HW3]} the first working really deep feedforward NN with hundreds of layers. It is essentially a feedforward version of LSTM^{[LSTM1]} with forget gates.^{[LSTM2]} The same holds for the Residual Net or ResNet^{[HW2]} (Dec 2015). Remarkably, both of these dual approaches of 1991 have become highly successful. By the mid 2010s,^{[DEC]} major IT companies overwhelmingly used LSTM, e.g., for speech recognition on billions of smartphones.^{[DL4]} There are also tasks that LSTM can rapidly learn to solve^{[LSTM13]} while plain Transformers can't yet.^{[TR4]} Furthermore, unsupervised pre-training of deep NNs^{[UN0-UN2][MIR](Sec. 1)} also dates back to 1991.^{[UN]} Recent work of February 2021^{[FWP6]} brought additional insights into the relation between linearized attention mechanisms^{[TR5-6]} and Fast Weight Programmers^{[FWP0-2]} (Sec. 2).^{[FWP4a][R4][MIR](Sec. 8)[T22](Sec. XVII, item H3)} It also improved such variants.^{[TR5-6]} Building on previous work^{[FWPMETA7]} on FWPs (Sec. 1, 2, 3, 8), we replaced the 1991 elementary programming instruction based on additive outer products^{[FWP0-2]} by a delta rule-like^{[WID]} update, improving results on language modeling tasks.^{[FWP6]} Our code is public. Follow-up work of June 2021^{[FWP7]} (also with Robert Csordas) points out that the original FWP formulation of 1991^{[FWP0-1]} is more general than the one of linear Transformers: a slow NN continually reprograms the weights of a fast NN that can be recurrent. Our code is public.
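For concreteness, here is a minimal sketch of the two layer types (toy dimensions and random, untrained weights are my own assumptions; real Highway Nets and ResNets use trained parameters and biases):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
W_h = rng.normal(scale=0.3, size=(d, d))   # transform weights
W_t = rng.normal(scale=0.3, size=(d, d))   # gate weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x):
    # Highway layer: y = t(x)*h(x) + (1-t(x))*x, with a learned,
    # LSTM-style multiplicative transform gate t(x) (coupled-gate variant).
    h = np.tanh(W_h @ x)
    t = sigmoid(W_t @ x)
    return t * h + (1.0 - t) * x

def residual_layer(x):
    # Residual layer: the special case with all gates fixed open
    # (g(x) = t(x) = 1), i.e., y = h(x) + x.
    return np.tanh(W_h @ x) + x

x = rng.normal(size=d)
y_hw = highway_layer(x)
y_res = residual_layer(x)
```

In both cases the input can pass through additively, which is what keeps error flow alive across hundreds of layers.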
Fast weights are also useful for reinforcement learning (RL) without a teacher, as shown in 2005 with my former postdoc Faustino Gomez^{[FWP5]} (now CEO of NNAISENSE), where the fast weight programmer was trained through neuroevolution. Our 2005 paper on deep RL^{[DL6,6a]} was actually the first machine learning publication with the phrase "learn deep" in the title. Related techniques encode the numerous weights of large NNs through very compact codes.^{[KO0-2][CO1-4]} Here we exploited that the Kolmogorov complexity or algorithmic information content of successful huge NNs may actually be rather small. Our Compressed Network Search^{[CO2]} evolved such compact codes to solve vision-based RL tasks without unsupervised pre-training.
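The compact-code idea can be illustrated in a few lines (the cosine basis and all sizes below are my own toy choices, not the actual encoding of the cited work): a handful of code numbers expands into a much larger weight matrix, so search can operate in the small code space.

```python
import numpy as np

# Toy illustration: describe a large weight matrix by a few coefficients
# in a low-frequency cosine (DCT-like) basis. An evolutionary search can
# then explore the tiny coefficient space instead of the huge weight space.
n_rows, n_cols = 20, 20        # 400 weights...
n_coeffs = 6                   # ...described by only 6 numbers

def decode(coeffs):
    """Map n_coeffs code numbers to an n_rows x n_cols weight matrix."""
    r = np.arange(n_rows)[:, None] / n_rows
    c = np.arange(n_cols)[None, :] / n_cols
    W = np.zeros((n_rows, n_cols))
    for i, amp in enumerate(coeffs):
        # each coefficient scales one smooth basis function
        W += amp * np.cos(np.pi * (i + 1) * (r + c))
    return W

code = np.array([0.5, -0.2, 0.1, 0.05, -0.03, 0.01])
W = decode(code)
print(W.shape)   # 400 weights from a 6-number code
```

If a good policy happens to have low algorithmic information content, a code like this can reach it with a search space orders of magnitude smaller than the weight space.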
Recent work of 2022^{[GGP]} with
My first work on metalearning machines that learn to learn was published in 1987.^{[META][R3]} It addressed metalearning in a very general way. In references^{[FWPMETA1-5]} since 1992, the slow NN and the fast NN (Sec. 1) are recurrent and identical. The RNN can see its own errors or reward signals called eval(t+1) in the image.^{[FWPMETA5]}
The 1993 FWP of Sec. 3^{[FWP2]} also was an RNN. Like the self-referential RNN above,^{[FWPMETA1-5]} it could manipulate its own weights, using outer products between key patterns and value patterns (Sec. 2) to do so. Later metalearning work used gradient descent in LSTM networks^{[LSTM1]} instead of traditional functions of two variables^{[HO1]} (more on LSTM and fast weights in Sec. 5). In 2020, Imanol Schlag et al. augmented an LSTM with an associative fast weight memory,^{[FWPMETA7]} which helps in partially observable environments.^{[FWPMETA7]} Our recent MetaGenRL (2020)^{[METARL10]} meta-learns learning algorithms. See the blog post of my PhD student Louis Kirsch. His VS-ML encodes outer-product-like fast weights in the activations of LSTMs,^{[FWPMETA6]} reminiscent of the many time-varying variables of the 1993 paper^{[FWP2]} (Sec. 3). VS-ML can also learn to implement the backpropagation learning algorithm^{[BP1-4]} purely in the end-to-end differentiable forward dynamics of RNNs.^{[FWPMETA6]}
In 2022, we also published at ICML a modern self-referential weight matrix (SRWM)^{[FWPMETA8]} based on the 1992 SRWM.^{[FWPMETA1-5]}
Such self-modifying NNs are a step towards self-improvement (compare this tweet).
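A toy sketch of the self-referential idea (dimensions, nonlinearities, and update details below are made up for illustration; the actual 1992 and 2022 formulations differ): a single matrix maps the input to an output plus the ingredients of its own rank-one update.

```python
import numpy as np

# Toy self-referential weight matrix: W's outputs include a value pattern,
# a key pattern, and a self-invented learning rate; W then modifies itself
# through an additive outer product built from its own outputs.
rng = np.random.default_rng(3)
d = 4
W = rng.normal(scale=0.1, size=(2 * d + 1, d))   # rows: value, key, beta

for t in range(6):
    x = rng.normal(size=d)
    out = W @ x
    v = np.tanh(out[:d])                 # self-generated value pattern
    k = np.tanh(out[d:2 * d])            # self-generated key pattern
    beta = 1.0 / (1.0 + np.exp(-out[-1]))  # self-invented learning rate
    # W modifies itself through an additive rank-one (outer product) update:
    W = W + beta * np.outer(np.concatenate([v, k, [0.0]]), k)

print(W.shape)
```

The essential point is that the same matrix both computes and rewrites; there is no separate, fixed meta-level.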
There is another version of this article
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. Our
PDF.
The first paper on long-term planning with reinforcement learning recurrent neural networks (NNs) (more) and on generative adversarial networks
(more).
PDF.
First publication of what was later sometimes called the Hopfield network^{[AMH2]} or Amari-Hopfield Network.
The Hopfield network or Amari-Hopfield Network was published in 1972 by Amari.^{[AMH1]}
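The principle behind such associative memories is easy to demonstrate (toy binary patterns of my own choosing):

```python
import numpy as np

# Toy Amari-Hopfield-style associative memory: store binary patterns in a
# symmetric weight matrix via Hebb-style outer products, then recall a
# stored pattern from a corrupted cue by iterated thresholding.
patterns = np.array([[1, -1, 1, -1, 1, -1],
                     [1, 1, 1, -1, -1, -1]])
n = patterns.shape[1]

W = np.zeros((n, n))
for p in patterns:
    W += np.outer(p, p)            # Hebbian storage
np.fill_diagonal(W, 0)             # no self-connections

cue = patterns[0].copy()
cue[0] = -cue[0]                   # corrupt one bit
s = cue
for _ in range(5):                 # synchronous recall dynamics
    s = np.where(W @ s >= 0, 1, -1)

print(s)                           # recovers the first stored pattern
```

The recall dynamics descend an energy function, so the corrupted cue falls into the basin of the nearest stored pattern.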
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. Schmidhuber
Transformers with linearized self-attention (1991-93).^{[FWP]} Today, both types are very popular.
PDF.
PDF.
More.
PS. (PDF.)
Precursor of modern backpropagation.^{[BP1-4]}
PDF.
Link.
PDF.
First application of backpropagation^{[BP1]} to NNs (concretizing thoughts in his 1974 thesis).
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020).
Who invented backpropagation?
More.^{[DL2]}
PDF.
PDF.
PDF.
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, updated 2021, 2022). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The
More.
Deep Learning.
[DL4] J. Schmidhuber (AI Blog, 2017).
Our impact on the world's most valuable public companies: Apple, Google, Microsoft, Facebook, Amazon... By
greatly improved (CTC-based)
on-device speech recognition
(on the phone, not the server)
LSTM.
PDF.
J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.^{[DL6]} Soon after its publication, everybody started talking about "deep learning." Causality or correlation?
neural networks learning to control dynamic external memories.^{[PDA1-2][FWP0-1]}
J. Schmidhuber (AI Blog, 26 March 2021, updated 2022).
alternative^{[FWP0-1]} to recurrent NNs.
the fast weights^{[FAST,FASTa]} of
Such Fast Weight Programmers^{[FWP0-6,FWPMETA1-8]} can learn to memorize past data, e.g.,
by computing fast weight changes through additive outer products of self-invented activation patterns^{[FWP0-1]}
(now often called keys and values for self-attention^{[TR1-6]}).
The very similar Transformers^{[TR1-2]} combine this with projections and softmax; Transformers with linearized self-attention^{[TR5-6]} are formally equivalent to the 1991 Fast Weight Programmers.
In 1993, he introduced
the attention terminology^{[FWP2]} now used
in this context,^{[ATT]} and
RNNs that program themselves.
See tweet of 2022.
PDF.
"Transformer with linearized self-attention."^{[FWP]}
PDF.
HTML.
Pictures (German).
See tweet of 2022 for 30-year anniversary.
PDF.
Preprint: arXiv:1811.12143. PDF.
PDF.
Preprint: arXiv:2003.08165.
PDF.
HTML overview.
Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021. Preprint: arXiv:2102.11174.
Preprint: arXiv:2106.06295 (June 2021).
PDF.
PDF.
An introspective network that can learn to run its own weight change algorithm. In Proc. of the Intl. Conf. on Artificial Neural Networks,
J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF.
can be found here.
Preprint arXiv:2012.14905 [cs.LG], 2020.
Report arXiv:2011.07831 [cs.AI], 2020.
Preprint: arXiv:2202.05780.
Preprint arXiv/2207.01570, 4 July 2022 (submitted in May 2022).
Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS 2015. The
LSTM with forget gates^{[LSTM2]} for RNNs.) Resnets^{[HW2]} are a version of this where the gates are always open: g(x)=t(x)=const=1.
Highway Nets perform roughly as well as ResNets^{[HW2]} on ImageNet.^{[HW3]} Variants of highway gates are used for certain algorithmic tasks, where the simpler residual layers do not work as well.^{[NDR]} More.
Link.
arXiv:1512.03385
(Dec 2015). Residual nets are a version of Highway Nets^{[HW1]}
More.
arxiv:1612.07771 (2016). Also at ICLR 2017.
PDF.
PDF.
PDF.
[LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF.
More.
PDF.
PDF.
J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of
Searchable PDF scan (created by OCRmypdf which uses
LSTM).
HTML.
better GP methods through Meta-Evolution. More.
[MIR] J. Schmidhuber (AI Blog, 2019). Deep Learning: Our Miraculous Year 1990-1991. Preprint
arXiv:2005.05744, 2020.
J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in my labs at TU Munich and IDSIA. Here I mention: (1) Long Short-Term Memory (LSTM), (2) ResNet (which is our earlier Highway Net with open gates), (3) AlexNet and VGG Net (both building on our similar earlier DanNet: the first deep convolutional NN to win
image recognition competitions),
Adversarial Artificial Curiosity), and (5) variants of Transformers (Transformers with linearized self-attention are formally equivalent to my earlier Fast Weight Programmers).
Annus Mirabilis of 1990-1991.^{[MIR]}
PDF.
PDF.
Preprint arXiv:1608.05343, 2016.
The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization. Proc. ICLR 2022. Preprint arXiv/2110.07732.
the 1991 publication on what's now called "Transformers with linearized self-attention."^{[FWP0-6][TR5-6]}
attention terminology in 1993.^{[ATT][FWP2][R4]}
See tweet of 2022 for 30-year anniversary.
J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, the GAN principle
[R3] Reddit/ML, 2019. NeurIPS 2019 Bengio Schmidhuber Meta-Learning Fiasco.
[R4] Reddit/ML, 2019. Five major deep learning papers by G. Hinton did not cite similar earlier work by J. Schmidhuber.
[R7] Reddit/ML, 2019. J. Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970.
[T22] J. Schmidhuber (AI Blog, 2022).
Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21, IDSIA, Lugano, Switzerland, 2022.
PDF.
J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised or self-supervised pre-training. Unsupervised
PDF.
1992. Based on TR FKI-148-91, TUM, 1991.^{[UN0]} PDF.
approaches are now widely used. More.
[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF.
can be found here (depth > 1000).
[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 15 June 1991 (advisor J. Schmidhuber). PDF.
https://people.idsia.ch/~juergen/deep-learning-history.html
arXiv:2212.11279
Modern AI is dominated by artificial neural networks (NNs) and deep learning.^{[DL1-4]} This piece contains hyperlinks to relevant overview sites from my AI Blog. It also debunks certain popular but misleading historic accounts of deep learning, and supplements my previous deep learning survey.^{[DL1]} I am mentioning my own team's work, because (as of 2022) the most cited NNs are based on it.^{[MOST]}
Sec. 1: Introduction
Sec. 2: 1676: The Chain Rule For Backward Credit Assignment
Sec. 3: Circa 1800: First Neural Net (NN) / Linear Regression / Shallow Learning
Sec. 4: 1920-1925: First Recurrent NN (RNN) Architecture. ~1972: First Learning RNNs
Sec. 5: 1958: Multilayer Feedforward NN (without Deep Learning)
Sec. 6: 1965: First Deep Learning
Sec. 7: 1967-68: Deep Learning by Stochastic Gradient Descent
Sec. 8: 1970: Backpropagation. 1982: For NNs. 1960: Precursor.
Sec. 9: 1979: First Deep Convolutional NN (1969: Rectified Linear Units)
Sec. 10: 1980s-90s: Graph NNs / Stochastic Delta Rule (Dropout) / More RNNs / Etc
Sec. 11: Feb 1990: Generative Adversarial Networks / Artificial Curiosity / NN Online Planners
Sec. 12: April 1990: NNs Learn to Generate Subgoals / Work on Command
Sec. 13: March 1991: NNs Learn to Program NNs. Transformers with Linearized Self-Attention
Sec. 14: April 1991: Deep Learning by Self-Supervised Pre-Training. Distilling NNs
Sec. 15: June 1991: Fundamental Deep Learning Problem: Vanishing/Exploding Gradients
Sec. 16: June 1991: Roots of Long Short-Term Memory / Highway Nets / ResNets
Sec. 17: 1980s-: NNs for Learning to Act Without a Teacher
Sec. 18: It's the Hardware, Stupid!
Sec. 19: But Don't Neglect the Theory of AI (Since 1931) and Computer Science
Sec. 20: The Broader Historic Context from Big Bang to Far Future
Sec. 21: Acknowledgments
Sec. 22: 555+ Partially Annotated References (many more in the award-winning survey^{[DL1]})
quite erroneous ideas about the origins of the universe (see the final section).
A history of AI written in the 1980s would have emphasized topics such as theorem proving,^{[GOD][GOD34][ZU48][NS56]} logic programming, expert systems, and heuristic search.^{[FEI63,83][LEN83]} Note that AI is an old area of research seeing renewed interest. Practical AI dates back at least to 1914, when Leonardo Torres y Quevedo (see below) built the first working chess end game player;^{[BRU1-4]} the theory of AI dates back at least to 1931-34, when Kurt Gödel (see below) identified fundamental limits of any type of computation-based AI.^{[GOD][BIB3][GOD21,a,b]} A history of AI written in the early 2000s would have put more emphasis on topics such as support vector machines and kernel methods,^{[SVM1-4]} Bayesian (actually Laplacian or possibly Saundersonian^{[STI83-85]}) reasoning^{[BAY1-8][FI22]} and other concepts of probability theory and statistics,^{[MM1-5][NIL98][RUS95]} decision trees,^{e.g.,[MIT97]} ensemble methods,^{[ENS1-4]} swarm intelligence,^{[SW1]} and evolutionary computation.^{[EVO1-7]([TUR1],unpublished)} Why? Because back then such techniques drove many successful AI applications.
A history of AI written in the 2020s must emphasize concepts such as the even older chain rule^{[LEI07]} and deep nonlinear artificial neural networks (NNs) trained by gradient descent,^{[GD']} in particular, feedback-based recurrent networks, which are general computers whose programs are weight matrices.^{[AC90]} Why? Because many of the most famous and most commercial recent AI applications depend on them.^{[DL4]} Such neural ideas were discussed as early as the MACY conferences (1946-1953)^{[MACY51]} and the 1951 Paris conference on calculating machines and human thought, now often viewed as the first conference on AI.^{[AI51][BRO21][BRU4]} Today, modern AI based on "deep learning" with NNs^{[DL1-2][DEC]} learns to recognize speech and images, minimize pain, maximize pleasure, drive cars, etc.^{[MIR](Sec. 0)[DL1-4]}
The present piece also debunks a frequently repeated, misleading "history of deep learning"^{[S20][DL3,3a]} which ignores most of the pioneering work mentioned below.^{[T22]} See Footnote 6. The title image of the present article is a reaction to an erroneous piece of common knowledge which says^{[T19]} that the use of NNs "as a tool to help computers recognize patterns and simulate human intelligence had been introduced in the 1980s," although such NNs appeared long before the 1980s.^{[T22]} Compare also my published letters on the history of aviation,^{[NASC1-2]} the telephone,^{[NASC3]} the computer,^{[NASC4-7]} resilient robots,^{[NASC8]} and scientists of the 19th century.^{[NASC9]}
In 1676, Gottfried Wilhelm Leibniz published the chain rule of differential calculus, later also covered in a textbook on Leibniz' differential calculus.^{[LEI07-10][L84]} The chain rule answers the question of how small changes of a deep NN's weights affect its output through long chains of intermediate computations.
This answer is exploited by the technique of gradient descent (GD), apparently first proposed by Augustin-Louis Cauchy in 1847^{[GD']} (and much later by Jacques Hadamard^{[GD'']}); the stochastic version called SGD is due to Herbert Robbins and Sutton Monro (1951).^{[STO51-52]}
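To make the idea concrete, here is a minimal sketch of Cauchy-style gradient descent and a Robbins-Monro-style stochastic variant on a toy quadratic loss. All names and step-size choices are illustrative, not taken from the historical papers:

```python
import numpy as np

def gradient_descent(grad, w, lr=0.1, steps=100):
    """Plain gradient descent (in the spirit of Cauchy, 1847):
    repeatedly follow the exact negative gradient."""
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def stochastic_gradient_descent(sample_grad, w, steps=1000, seed=0):
    """SGD (in the spirit of Robbins & Monro, 1951): follow noisy
    gradient estimates with a decaying step size a_t = 1/t."""
    rng = np.random.default_rng(seed)
    for t in range(1, steps + 1):
        w = w - (1.0 / t) * sample_grad(w, rng)
    return w

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_star = gradient_descent(lambda w: 2 * (w - 3.0), w=0.0)
w_sgd = stochastic_gradient_descent(
    lambda w, rng: 2 * (w - 3.0) + rng.normal(0, 0.1), w=0.0)
```

Both variants approach the minimizer w = 3; the stochastic one does so despite only seeing noisy gradients, which is the setting that matters for learning from data samples.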
Footnote 1. In 1684, Leibniz was also the first to publish "modern" calculus;^{[L84][SON18][MAD05][LEI21,a,b]} later Isaac Newton was also credited for his unpublished work.^{[SON18]} Their priority dispute,^{[SON18]} however, did not encompass the chain rule.^{[LEI07-10]} Of course, both were building on earlier work: in the 3rd century B.C., Archimedes (perhaps the greatest scientist ever^{[ARC06]}) paved the way for infinitesimals, and work on them was continued in the 14th century by Madhava of Sangamagrama and colleagues of the Indian Kerala school.^{[MAD86-05]} Leibniz (sometimes called "the world's first computer scientist"^{[LA14]}) also laid foundations of modern computer science. He designed the first machine that could perform all four arithmetic operations (1673), and the first with an internal memory.^{[BL16]} He described the principles of binary computers (1679).^{[L79][L03][LA14][HO66][LEI21,a,b]} His formal Algebra of Thought (1686)^{[L86][WI48]} was deductively equivalent^{[LE18]} to the much later Boolean Algebra (1847).^{[BOO]} Leibniz hoped to answer all possible questions through computation.^{[WI48]}
Footnote 3. Some claim that the backpropagation algorithm (discussed further down; now widely used to train deep NNs) is just the chain rule of Leibniz (1676) & L'Hopital (1696).^{[CONN21]} No, it is the efficient way of applying the chain rule to big networks with differentiable nodes (there are also many inefficient ways of doing this).^{[T22]} It was not published until 1970, as discussed below.^{[BP1,4,5]}
In 1805, Adrien-Marie Legendre published what's now often called a linear neural network (NN). Later Johann Carl Friedrich Gauss was also credited for earlier unpublished work on this done circa 1795.^{[STI81]}
Rosenblatt's perceptron (1958)^{[R58]} combined a linear NN as above with an output threshold function to obtain a pattern classifier (compare his more advanced work on multi-layer networks discussed below). See also Joseph's related work.^{[R61]} Widrow & Hoff's similar Adaline learned in 1962.^{[WID62]}
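A hedged sketch of such a classifier: a linear unit followed by an output threshold, trained with the classic perceptron error-correction rule. This is illustrative NumPy code, not Rosenblatt's original implementation:

```python
import numpy as np

def perceptron_train(X, y, epochs=20, lr=1.0):
    """Perceptron rule: on a mistake, nudge the weights toward the target."""
    w = np.zeros(X.shape[1] + 1)               # weights plus bias
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append constant input
    for _ in range(epochs):
        for x, target in zip(Xb, y):           # targets in {-1, +1}
            pred = 1 if x @ w > 0 else -1      # linear unit + threshold
            if pred != target:
                w += lr * target * x           # update only on errors
    return w

def perceptron_predict(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.where(Xb @ w > 0, 1, -1)

# Linearly separable toy data: class +1 iff x0 + x1 > 1.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.], [2., 2.]])
y = np.array([-1, -1, -1, 1, 1])
w = perceptron_train(X, y)
```

On linearly separable data like this, the perceptron convergence theorem guarantees the loop stops making mistakes after finitely many updates.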
The first non-learning recurrent NN (RNN) architecture, the Lenz-Ising model, was introduced and analyzed by physicists Ernst Ising and Wilhelm Lenz in the 1920s.^{[L20][I24,I25][K41][W45][T22]} It settles into an equilibrium state in response to input conditions, and is the foundation of the first learning RNNs (see below). Recurrent architectures with binary threshold neurons were also discussed in 1943 by neuroscientists Warren McCulloch and Walter Pitts^{[MC43]} and formally analyzed in 1956 by Stephen Cole Kleene.^{[K56]}
In 1972, Shun-Ichi Amari made the Lenz-Ising recurrent architecture adaptive such that it could learn to associate input patterns with output patterns by changing its connection weights.^{[AMH1]} See also Stephen Grossberg's work on biological networks,^{[GRO69]} David Marr's^{[MAR71]} and Teuvo Kohonen's^{[KOH72]} work, and Kaoru Nakano's learning RNN.^{[NAK72]}
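A minimal sketch of such a pattern-associating recurrent net in the later Amari-Hopfield style: binary units, Hebbian-like outer-product weights, and relaxation into an equilibrium state. The details are illustrative, not Amari's exact 1972 formulation:

```python
import numpy as np

def store(patterns):
    """Hebbian outer-product rule: strengthen weights between co-active units."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:            # each p has entries in {-1, +1}
        W += np.outer(p, p)
    np.fill_diagonal(W, 0)        # no self-connections
    return W / len(patterns)

def recall(W, x, steps=10):
    """Let the network settle into an equilibrium (attractor) state."""
    for _ in range(steps):
        x = np.sign(W @ x)
        x[x == 0] = 1             # break ties consistently
    return x

pattern = np.array([1, -1, 1, -1, 1, -1, 1, -1])
W = store(pattern[None, :])
noisy = pattern.copy()
noisy[0] = -noisy[0]              # corrupt one bit
restored = recall(W, noisy)
```

Starting from the corrupted input, the dynamics fall back into the stored pattern, which is exactly the associative-memory behavior described above.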
10 years later, the Amari network was republished (and its storage capacity analyzed).^{[AMH2]} Some called it the Hopfield Network (!) or Amari-Hopfield Network.^{[AMH3]} Amari also offered a sequence-processing generalization thereof.^{[AMH1]} Already in 1948, Alan Turing had written up ideas related to learning RNNs. This, however, was first published many decades later,^{[TUR1]} which explains the obscurity of his thoughts here.^{[TUR21]} (Margin note: it has been pointed out that the famous "Turing Test" should actually be called the "Descartes Test."^{[TUR3,a,b][TUR21]})
Today, the most popular RNN is the Long Short-Term Memory (LSTM) mentioned below, which has become the most cited NN of the 20th century.^{[MOST]}
In 1958, Frank Rosenblatt not only combined linear NNs and threshold functions (see the section on shallow learning since 1800), he also had more interesting, deeper multilayer perceptrons (MLPs).^{[R58]} Since only their last layer learned,^{[DL1]} Rosenblatt basically had what much later was rebranded as Extreme Learning Machines (ELMs), without proper attribution.^{[ELM1-2][CONN21][T22]}
MLPs were also discussed in 1961 by Karl Steinbuch^{[ST61-95]} and Roger David Joseph.^{[R61]} See also Oliver Selfridge's multilayer Pandemonium (1959).^{[SE59]} Rosenblatt (1962) even wrote about "back-propagating errors" in an MLP with a hidden layer,^{[R62]} although he did not yet have a general deep learning algorithm for deep MLPs. What's now called backpropagation is quite different and was first published in 1970, as discussed below.^{[BP1-BP5][BPA-C]}
Today, the most popular FNN is a version of the LSTM-based Highway Net (mentioned below) called ResNet,^{[HW1-3]} which has become the most cited NN of the 21st century.^{[MOST]}
In 1965, Alexey Ivakhnenko and Valentin Lapa published the first working learning algorithm for deep MLPs with arbitrarily many layers (of nodes that may contain multiplicative gates).^{[DEEP1-2][DL1-2][FDL]} A paper of 1971^{[DEEP2]} already described a deep network with 8 layers trained by their highly cited method, which was still popular in the new millennium,^{[DL2]} especially in Eastern Europe, where much of Machine Learning was born.^{[MIR](Sec. 1)[R8]} The term "deep learning" was first introduced to Machine Learning much later by Dechter (1986), and to NNs by Aizenberg et al. (2000).^{[DL2]} (Margin note: our 2005 paper on deep learning^{[DL6,6a]} was the first machine learning publication with the word combination "learn deep" in the title.^{[T22]})
Ivakhnenko and Lapa (1965, see above) trained their deep nets layer by layer. In 1967, however, Shun-Ichi Amari suggested to train MLPs with many layers in an end-to-end fashion from scratch by stochastic gradient descent (SGD),^{[GD1]} a method proposed in 1951 by Robbins & Monro.^{[STO51-52]}
Amari's implementation^{[GD2,GD2a]} (with his student Saito) learned internal representations in a five layer MLP with two modifiable layers, which was trained to classify non-linearly separable pattern classes.
See also Iakov Zalmanovich Tsypkin's even earlier work on gradient descent-based on-line learning for non-linear systems.^{[GDa-b]}
Remarkably, as mentioned above, Amari also published learning RNNs in 1972.^{[AMH1]}
In 1970, Seppo Linnainmaa was the first to publish what's now known as backpropagation, the famous algorithm for credit assignment in networks of differentiable nodes, also known as the reverse mode of automatic differentiation.^{[BP1,4,5]}
In 1982, Paul Werbos proposed to use the method to train NNs,^{[BP2]} extending ideas in his 1974 thesis.
In 1960, Henry J. Kelley already had a precursor of backpropagation in the field of control theory;^{[BPA]} see also later work of the early 1960s by Stuart Dreyfus and Arthur E. Bryson.^{[BPB][BPC]}^{[R7]} Unlike Linnainmaa's general method,^{[BP1]} the systems of the 1960s^{[BPA-C]} backpropagated derivative information through standard Jacobian matrix calculations from one "layer" to the previous one, neither addressing direct links across several layers nor potential additional efficiency gains due to network sparsity.
Backpropagation is essentially an efficient way of implementing Leibniz's chain rule^{[LEI07-10]} (1676) (see above) for deep networks. Cauchy's gradient descent^{[GD']} uses this to iteratively adjust the weights such that the NN behaves more and more like some teacher, which could be a human, or another NN,^{[UN-UN2]} or something else. By the mid-1980s, desktop computers had just become accessible in wealthier academic labs. An experimental analysis of the known method^{[BP1-2]} then demonstrated that backpropagation can yield useful internal representations in hidden layers of NNs.^{[RUM]} At least for supervised learning, backpropagation is generally more efficient than Amari's above-mentioned deep learning through the more general SGD method (1967), which learned useful internal representations in NNs about 2 decades earlier.^{[GD1-2a]}
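An illustrative sketch of this reverse-mode credit assignment in a tiny two-layer net: the forward pass caches intermediate values, and the backward pass applies the chain rule once per node, from output back to input. This is a didactic toy in plain NumPy, not any historical implementation; the gradient is verified against a finite difference:

```python
import numpy as np

def forward(x, W1, W2):
    h_pre = W1 @ x              # hidden pre-activation
    h = np.tanh(h_pre)          # hidden activation
    y = W2 @ h                  # linear output
    return y, (x, h_pre, h)

def backward(dy, cache, W1, W2):
    """Propagate the output error backward, reusing each local derivative."""
    x, h_pre, h = cache
    dW2 = np.outer(dy, h)                   # dL/dW2
    dh = W2.T @ dy                          # chain rule through W2
    dh_pre = dh * (1 - np.tanh(h_pre)**2)   # chain rule through tanh
    dW1 = np.outer(dh_pre, x)               # dL/dW1
    return dW1, dW2

# Numerical check of one weight's gradient for the loss L = 0.5 * y^2.
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(1, 4))
y, cache = forward(x, W1, W2)
dW1, dW2 = backward(y, cache, W1, W2)       # dL/dy = y for this loss
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
yp, _ = forward(x, W1p, W2)
num = (0.5 * yp @ yp - 0.5 * y @ y) / eps   # finite-difference estimate
```

The analytic entry `dW1[0, 0]` agrees with the numerical estimate `num`, which is the standard sanity check that the chain-rule bookkeeping is correct.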
It took 4 decades until the backpropagation method of 1970^{[BP1-2]} got widely accepted as a training method for deep NNs. Before 2010, many thought that the training of NNs with many layers requires unsupervised pre-training, a methodology introduced by myself in 1991^{[UN][UN0-3]} (see below), and later championed by others (2006).^{[UN4]} In fact, it was claimed^{[VID1]} that deep NNs cannot be trained well without such pre-training. However, in 2010, our team with my postdoc Dan Ciresan^{[MLP1-2]} showed that deep FNNs can be trained by plain backpropagation and do not at all require unsupervised pre-training for important applications.^{[MLP2]}
Our system set a new performance record^{[MLP1]} on the famous MNIST image recognition benchmark, by greatly accelerating traditional MLPs on graphics processing units (building on the GPU-based NNs of Jung & Oh in 2004^{[GPUNN]}). A reviewer called this a "wake-up call to the machine learning community." A frequently repeated account claims that Minsky & Papert's 1969 book^{[M69]} exposed the limitations of NNs until "researchers took a fresh look at the problem in the 1980s."^{[S20]} However, the 1969 book^{[M69]} addressed a "problem" of Gauss & Legendre's shallow learning (circa 1800)^{[DL1-2]} that had already been solved 4 years prior by Ivakhnenko & Lapa's popular deep learning method,^{[DEEP1-2][DL2]} and then also by Amari's SGD for MLPs.^{[GD1-2]} Minsky neither cited this work nor corrected his book later.^{[HIN](Sec. I)[T22]} Others later published their own variants of such methods (such as the Boltzmann machine^{[BM][HIN][SK75][G63][T22]}) without relating them to the original work,^{[DLC][S20][T22]} although the true history is well-known. Deep learning research was alive and kicking in the 1960s-70s, especially outside of the Anglosphere.^{[DEEP1-2][GD1-3][CNN1][DL1-2][T22]} Blatant misattribution and unintentional^{[PLAG1][CONN21]} or intentional^{[FAKE2]} plagiarism are still tainting the entire field of deep learning.^{[T22]} Scientific journals "need to make clearer and firmer commitments to self-correction,"^{[SV20]} as is already the standard in other scientific fields.
Computer Vision was revolutionized in the 2010s by a particular feedforward NN called the convolutional NN (CNN).^{[CNN1-4]} The basic CNN architecture with alternating convolutional and downsampling layers is due to Kunihiko Fukushima (1979), who called it the Neocognitron.^{[CNN1]} Fukushima also introduced rectified linear units (ReLUs) for NNs (1969).^{[RELU1]} They are now widely used in CNNs and other NNs.
In 1987, NNs with convolutions were combined by Alex Waibel with weight sharing and backpropagation (see above),^{[BP1-2]} and applied to speech.^{[CNN1a]} Waibel did not call this CNNs but TDNNs. The popular downsampling variant called max-pooling was introduced by Yamaguchi et al. for TDNNs in 1990^{[CNN3a]} and by Juyang Weng et al. for higher-dimensional CNNs in 1993.^{[CNN3]} Yann LeCun's team has contributed improvements of CNNs, especially for images.^{[CNN2,4][T22]} Baldi and Chauvin (1993) had the first application of CNNs with backpropagation to biomedical/biometric images.^{[BA93]}
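The three core CNN operations mentioned here, convolution with shared weights, ReLU, and max-pooling downsampling, can be sketched in a few lines (illustrative NumPy, stride-1 "valid" convolution; shapes and the toy kernel are assumptions for demonstration):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one shared weight kernel over the image (weight sharing)."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0)    # rectified linear unit

def max_pool(x, size=2):
    """Downsample by keeping the maximum of each size x size block."""
    H, W = x.shape
    H2, W2 = H // size, W // size
    return x[:H2*size, :W2*size].reshape(H2, size, W2, size).max(axis=(1, 3))

# A horizontal-gradient detector on a simple ramp image.
img = np.arange(16.0).reshape(4, 4)           # each row increases by 1
feat = max_pool(relu(conv2d(img, np.array([[-1.0, 1.0]]))))
```

On the ramp image, every horizontal difference is 1, so the feature map is constant; max-pooling then halves each spatial dimension, illustrating why pooled CNN features are robust to small translations.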
GPUs were later used to greatly accelerate CNNs (Dan Ciresan et al., 2011).^{[GPUCNN1,3,5]} Our fast GPU-based^{[GPUNN][GPUCNN5]} CNN of 2011,^{[GPUCNN1]} known as DanNet,^{[DAN,DAN1][R6]} built on earlier GPU-accelerated CNNs of 2006.^{[GPUCNN]} In 2011, DanNet became the first pure deep CNN to win computer vision contests.^{[GPUCNN2-3,5]}
ResNet, the ImageNet 2015 winner^{[HW2]} (Dec 2015) and currently the most cited NN,^{[MOST]} is a version (with open gates) of our earlier Highway Net (May 2015).^{[HW1-3][R5]} The Highway Net is actually the feedforward net version of our vanilla LSTM (see below).^{[LSTM2]} It was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers). NNs with rapidly changing "fast weights" were introduced by v.d. Malsburg (1981) and others.^{[FAST,a,b]} Deep learning architectures that can manipulate structured data such as graphs^{[T22]} include our graph NN-like, Transformer-like Fast Weight Programmers of 1991,^{[FWP0-1][FWP6][FWP]} which learn to continually rewrite mappings from inputs to outputs (addressed below), and the work of Baldi and colleagues.^{[BA96-03]} Today, graph NNs are used in numerous applications.
In the 1980s, Werbos,^{[BP2][BPTT1]} Williams,^{[BPTT2][CUB0-2]} and others^{[ROB87][BPTT3][DL1]} analyzed ways of implementing gradient descent^{[GD'][STO51-52][GDa-b][GD1-2a]} in RNNs. Kohonen's self-organising maps also became popular.^{[KOH82-89]} Researchers studied biologically more plausible alternatives for credit assignment in space and time.^{[BB2][NAN1-4][NHE][HEL]} See overviews^{[MIR](Sec. 15, Sec. 17)} and recent renewed interest in such methods.^{[NAN5][FWPMETA6][HIN22]} A variant of Hanson's 1990 stochastic delta rule became popular under the moniker "dropout."^{[Drop1-4][GPUCNN4]} Generative Adversarial Networks (GANs) have become very popular.^{[MOST]} They were first published in 1990 in Munich under the moniker Artificial Curiosity.^{[AC90-20][GAN1]} Two dueling NNs (a probabilistic generator and a predictor) try to maximize each other's loss in a minimax game^{[AC](Sec. 1)} (using stochastic units^{[AC90]} like in the much later StyleGANs^{[GAN2]}). The predictor NN minimizes its error, while the generator NN tries to make outputs that maximize this error: one net's loss is the other net's gain.^{[AC90]} (The predictor, a world model, can also be used for continual online action planning.^{[AC90][PLAN2-3][PLAN]})
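The "one net's loss is the other net's gain" principle can be written compactly as a minimax objective. The notation below is mine and purely illustrative: the generator \(G\) produces outputs, the environment \(\mathrm{Env}\) responds, and the predictor \(M\) tries to anticipate the response, descending the very error that \(G\) ascends:

```latex
\min_{M}\;\max_{G}\;\;
\mathbb{E}\Big[\, \big\lVert M\!\big(x_G\big) - \mathrm{Env}\!\big(x_G\big) \big\rVert^{2} \,\Big],
\qquad x_G \sim G .
```

In the 1990 curiosity setting this error is the generator's intrinsic reward for producing surprising experiments; in the later GAN setting the "environment" is replaced by the question of whether a sample is real or generated.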
4 years before a 2014 paper on GANs,^{[GAN1]} my well-known 2010 survey^{[AC10]} already summarised the generative adversarial NNs of 1990.^{[AC20][AC][T22](Sec. XVII)} In contrast, early adversarial machine learning settings^{[S59][H90]} neither involved unsupervised NNs nor were about modeling data nor used gradient descent.^{[AC20]} The adversarial principle has been widely used for exploration in Reinforcement Learning^{[SIN5][OUD13][PAT17][BUR18]} and for synthesis of realistic images,^{[GAN1,2]} although the latter domain was recently taken over by Rombach et al.'s Latent Diffusion, another method published in Munich,^{[DIF1]} building on Jarzynski's earlier work in physics from the previous millennium^{[DIF2]} and more recent papers.^{[DIF3-5]} In 1991, I also published adversarial Predictability Minimization for creating disentangled representations of partially redundant data, applied to images in 1996.^{[PM0-2][AC20][R2][MIR](Sec. 7)} Learning to decompose problems at multiple levels of abstraction is now considered a remaining grand challenge.^{[LEC]} The early 1990s, however, saw first exceptions: NNs that learn to decompose complex spatio-temporal observation sequences into compact but meaningful chunks^{[UN0-3]} (see further below), and NN-based planners of hierarchical action sequences for compositional learning,^{[HRL0]} as discussed next. This work injected concepts of traditional "symbolic" hierarchical AI^{[NS59][FU77]} into end-to-end differentiable "sub-symbolic" NNs. In 1990, I published end-to-end differentiable NN-based subgoal generators for Hierarchical Reinforcement Learning (HRL).^{[HRL0]} Soon afterwards, this was also done with recurrent NNs that learn to generate sequences of subgoals.^{[HRL1-2][PHD][MIR](Sec. 10)}
Compare other NNs that have "worked on command" since April 1990, in particular, for learning selective attention,^{[ATT0-3]} artificial curiosity and self-invented problems,^{[PP][PPa,1,2][AC]} upside-down reinforcement learning^{[UDRL1-2]} and its generalizations.^{[GGP]} Recently, Transformers^{[TR1]} have been all the rage, e.g., generating human-sounding texts.^{[GPT3]} Transformers with "linearized self-attention"^{[TR5-6]} were first published in March 1991.^{[FWP0-1][FWP6][FWP]} These so-called "Fast Weight Programmers" or "Fast Weight Controllers"^{[FWP0-1]} separated storage and control like in traditional computers, but in an end-to-end-differentiable, adaptive, fully neural way (rather than in a hybrid fashion^{[PDA1-2][DNC]}). The "self-attention" in standard Transformers^{[TR1-4]} combines this with a projection and softmax (using attention terminology like the one I introduced in 1993^{[ATT][FWP2][R4]}).
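A hedged sketch of linearized self-attention in this fast-weight view: at each step, an outer product of a value and a mapped key additively "programs" a fast weight matrix, which is then applied to the mapped query. Illustrative NumPy; the feature map `phi` is a placeholder choice, not prescribed by the original papers:

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0)):
    """Causal linearized self-attention as a fast weight programmer.
    Each step: write outer(value, phi(key)) into the fast weights W,
    then read by applying W to phi(query)."""
    d_k, d_v = K.shape[1], V.shape[1]
    W = np.zeros((d_v, d_k))            # the fast weight matrix
    out = []
    for q, k, v in zip(Q, K, V):
        W += np.outer(v, phi(k))        # program the fast weights
        out.append(W @ phi(q))          # apply them to the query
    return np.array(out)

rng = np.random.default_rng(0)
T, d = 5, 4
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
Y = linear_attention(Q, K, V)
```

Because the fast weights accumulate key-value outer products, each output equals the unnormalized attention sum over all previous steps, but with O(1) state per step instead of storing the whole history, which is the efficiency argument behind linearized attention.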
Today's Transformers heavily use unsupervised pre-training^{[UN0-3]} (see next section), another deep learning methodology first published in our Annus Mirabilis of 1990-1991.^{[MIR][MOST]}
The 1991 fast weight programmers also led to the self-referential meta-learning NNs of 1992,^{[FWPMETA1-9][HO1]} which extended my 1987 diploma thesis.^{[META1]} That thesis introduced algorithms not just for learning but also for meta-learning or learning to learn,^{[META]} to learn better learning algorithms through experience. This became very popular in the 2010s^{[DEC]} when computers were a million times faster. Deep learning is about credit assignment across many layers of neurons or many subsequent computational stages.^{[MIR]} In a sense, RNNs are the deepest ones^{[DL1-2]} (but see a 1989 paper^{[MOZ]}), since they can process sequences of arbitrary depth.^{[DL1]} Before the 1990s, however, RNNs failed to learn deep problems in practice.^{[MIR](Sec. 0)} To overcome this, I proposed in 1991 a hierarchy of RNNs learning to represent percepts at multiple levels of abstraction and multiple time scales:^{[LEC]} the Neural Sequence Chunker^{[UN0]} or Neural History Compressor.^{[UN1]} It solved previously unsolvable "very deep learning" tasks of depth > 1000^{[UN2]} (requiring more than 1,000 subsequent computational stages); there was also a continuous version of the Neural History Compressor.^{[UN3]} (See also recent work on unsupervised NN-based abstraction.^{[OBJ1-5]}) More than a decade after this work,^{[UN1]} a similar method for feedforward NNs was published, called Deep Belief Networks (DBNs).^{[UN4]} Its justification was essentially the one of the 1991 system: each higher level tries to reduce the description length (or negative log probability) of the data representation in the level below.^{[HIN][T22][MIR]} The knowledge of a higher NN in the 1991 hierarchy could be compressed into a lower one using my NN distillation procedure of 1991.^{[UN0-1][MIR]} NN distillation was also republished many years later,^{[DIST2][MIR][HIN][T22]} and is widely used today. Unsupervised/self-supervised pre-training of this kind is now heavily used by Transformers;^{[TR1-6]} Transformers with linearized self-attention were also first published^{[FWP0-6]} in the Annus Mirabilis of 1990-1991,^{[MIR][MOST]} together with unsupervised/self-supervised pre-training for deep learning.^{[UN0-3]} See the previous section. Deep learning is hard because of the Fundamental Deep Learning Problem of vanishing or exploding gradients, identified and analyzed in 1991 by my student Sepp Hochreiter in his diploma thesis, which I had the pleasure to supervise.^{[VAN1]} First he implemented the Neural History Compressor above, but then did much more: he showed that in deep or recurrent NNs, back-propagated error signals either shrink rapidly or grow out of bounds. In both cases, learning fails (compare^{[VAN2]}). This analysis led to basic principles of what's now called LSTM (see below).
The Long Short-Term Memory (LSTM) recurrent neural network^{[LSTM1-6]} overcomes the Fundamental Deep Learning Problem identified by Sepp in his above-mentioned 1991 diploma thesis,^{[VAN1]} which I consider one of the most important documents in the history of machine learning. It also provided essential insights for overcoming the problem, through basic principles (such as constant error flow) of what we called LSTM in a tech report of 1995.^{[LSTM0]} After the main peer-reviewed publication in 1997^{[LSTM1][25y97]} (now the most cited NN article of the 20th century^{[MOST]}), LSTM was further refined through the work of my students; one milestone was the first application of LSTM to speech (2004).^{[LSTM10]} 2005 saw the first publication of LSTM with full backpropagation through time and of bi-directional LSTM^{[LSTM3]} (now widely used). Another milestone of 2006 was the training method "Connectionist Temporal Classification" or CTC^{[CTC]} for simultaneous alignment and recognition of sequences. Our team successfully applied CTC-trained LSTM to speech in 2007^{[LSTM4]} (also with hierarchical LSTM stacks^{[LSTM14]}). This was very different from earlier hybrid approaches which, since the late 1980s, had combined NNs and traditional approaches such as Hidden Markov Models (HMMs).^{[BW][BRI][BOU][HYB12][T22]} In 2009, CTC-trained LSTM became the first RNN to win international competitions: three ICDAR 2009 Connected Handwriting Competitions (French, Farsi, Arabic). LSTM was soon used for everything that involves sequential data, such as speech^{[LSTM10-11][LSTM4][DL1]} and videos. In 2015, the CTC-LSTM combination dramatically improved Google's speech recognition on Android smartphones.^{[GSR15]} Many other companies adopted this.^{[DL4]} Google's on-device speech recognition of 2019 (now on your phone, not on the server) is still based on LSTM.
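A minimal sketch of a vanilla LSTM step with forget gate, showing the gated additive cell update behind "constant error flow". Weight shapes, initialization scale, and variable names are my assumptions for illustration, not the published architecture details:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W):
    """One LSTM time step. The cell state c is updated *additively*,
    which lets error signals flow across many steps without vanishing."""
    z = np.concatenate([x, h])       # current input plus previous hidden state
    i = sigmoid(W["i"] @ z)          # input gate
    f = sigmoid(W["f"] @ z)          # forget gate (Gers et al.)
    o = sigmoid(W["o"] @ z)          # output gate
    g = np.tanh(W["g"] @ z)          # candidate cell input
    c = f * c + i * g                # gated additive cell update
    h = o * np.tanh(c)               # gated output
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 5
W = {k: rng.normal(scale=0.1, size=(n_hid, n_in + n_hid)) for k in "ifog"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
for t in range(10):                  # run over a short random input sequence
    h, c = lstm_step(rng.normal(size=n_in), h, c, W)
```

The key design choice is the cell update `c = f * c + i * g`: when the forget gate stays near 1, the cell acts as a near-identity carousel through time, which is precisely the remedy for the vanishing-gradient analysis described above.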
In 1995, we already had an excellent neural probabilistic text model^{[SNT]} (compare Nakamura and Shikano's 1989 word category prediction model^{[NPMa]}). In 2001, we showed that LSTM can learn languages unlearnable by traditional models such as HMMs.^{[LSTM13]} By 2017, LSTM powered Facebook's machine translation (billions of translations per day),^{[FB17][DL4]} Apple's Quicktype on roughly 1 billion iPhones,^{[DL4]} the voice of Amazon's Alexa,^{[DL4]} image caption generation^{[DL4]} & automatic email answering,^{[DL4]} etc. Business Week called LSTM "arguably the most commercial AI achievement."^{[AV1]} Many recent publications have "LSTM" in their title.^{[DEC]}
Our Highway Network of May 2015^{[HW1]} was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers). Microsoft's ResNet^{[HW2]} (which won the ImageNet 2015 contest) is a version thereof. The earlier Highway Nets perform roughly as well as their ResNet versions on ImageNet.^{[HW3]} Variants of highway gates are also used for certain algorithmic tasks where the pure residual layers do not work as well.^{[NDR]}
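The relation between the two architectures can be sketched directly: a highway layer mixes a transform of the input with the input itself via learned gates, and driving the gates fully open recovers the residual (ResNet-style) computation. Illustrative NumPy; gate parameterization and sizes are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, Wh, Wt, Wc, bt=0.0, bc=0.0):
    """Highway layer: y = T(x)*H(x) + C(x)*x, with a transform gate T
    and a carry gate C deciding how much to transform vs. copy."""
    H = np.tanh(Wh @ x)           # the nonlinear transform
    T = sigmoid(Wt @ x + bt)      # transform gate in (0, 1)
    C = sigmoid(Wc @ x + bc)      # carry gate in (0, 1)
    return T * H + C * x

def residual_layer(x, Wh):
    """The open-gate special case: y = H(x) + x (ResNet-style)."""
    return np.tanh(Wh @ x) + x

rng = np.random.default_rng(0)
n = 4
x = rng.normal(size=n)
Wh, Wt, Wc = (rng.normal(size=(n, n)) for _ in range(3))
# Strongly positive gate biases saturate both gates at ~1,
# so the highway output reduces to the residual computation H(x) + x.
y_open = highway_layer(x, Wh, Wt, Wc, bt=50.0, bc=50.0)
```

The carry path `C(x)*x` plays the same role as the LSTM cell's forget-gated self-connection, which is why the Highway Net is described above as the feedforward version of vanilla LSTM.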
Deep learning is all about NN depth.^{[DL1]} LSTMs brought essentially unlimited depth to supervised recurrent NNs; in the 2010s, the LSTM-inspired Highway Nets brought it to feedforward NNs. LSTM has become the most cited NN of the 20th century; the Highway Net version called ResNet the most cited NN of the 21st.^{[MOST]} (Citations, however, are a highly questionable measure of true impact.^{[NAT1]}) Reinforcement Learning (RL)^{[KAE96][BER96][TD3][UNI][GM3][LSTMPG]} is about learning to maximize expected cumulative reward signals.^{[DL1]} Many problems of AI can be formulated in the general RL framework.^{[UNI]} Relevant techniques date back to the middle of the 20th century: Monte Carlo (tree) search (MC, 1949),^{[MOC1-5]} dynamic programming (DP, 1953),^{[BEL53]} artificial evolution (1954),^{[EVO1-7]([TUR1],unpublished)} alpha-beta-pruning (1959),^{[S59]} control theory and system identification (1950s),^{[KAL59][GLA85]} stochastic gradient descent (SGD, 1951),^{[STO51-52]} and universal search techniques (1973).^{[AIT7]} In the 1980s, NNs were combined with system identification,^{[WER87-89][MUN87][NGU89]} DP and its online variant called Temporal Differences (TD),^{[TD1-3]} artificial evolution,^{[EVONN1-3]} and policy gradients.^{[GD1][PG1-3]} Many additional references on this can be found in Sec. 6 of the 2015 survey.^{[DL1]}
When there is a Markovian interface^{[PLAN3]} to the environment, RL with DP/TD/MC-based FNNs can be very successful, as shown in 1994^{[TD2]} (master-level backgammon player) and the 2010s^{[DM1-2a]} (superhuman players for Go, chess, and other games). For more complex cases where the agent must remember the history of previous inputs, our combinations of RL algorithms and LSTM^{[LSTM-RL][RPG]} have become standard, in particular, our LSTM trained by policy gradients (2007).^{[RPG07][RPG][LSTMPG]}
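A hedged sketch of the policy-gradient idea itself, stripped down to a REINFORCE-style score-function update on a one-step bandit (illustrative toy, not the LSTM-based systems cited here; payouts, learning rate, and episode count are arbitrary choices):

```python
import numpy as np

def train_bandit_policy(payouts, episodes=5000, lr=0.1, seed=0):
    """Increase the log-probability of sampled actions in proportion
    to the reward they earned (REINFORCE without a baseline)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(payouts))              # one logit per action
    for _ in range(episodes):
        probs = np.exp(theta) / np.exp(theta).sum()   # softmax policy
        a = rng.choice(len(payouts), p=probs)         # sample an action
        r = payouts[a] + rng.normal(0, 0.1)           # noisy reward
        grad_log = -probs                             # d log pi(a) / d theta
        grad_log[a] += 1.0
        theta += lr * r * grad_log                    # policy-gradient step
    return theta

theta = train_bandit_policy(payouts=np.array([0.1, 0.9, 0.3]))
```

After training, the policy concentrates its probability mass on the highest-paying action. In the cited systems the softmax logits come from an LSTM conditioned on the observation history, but the gradient estimator is the same score-function trick.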
For example, in 2018, a PG-trained LSTM was the core of OpenAI's famous Dactyl, which learned to control a dextrous robot hand without a teacher.^{[OAI1][OAI1a]} In 2019, DeepMind beat a pro player in the game of Starcraft, which is theoretically harder than Chess or Go^{[DM2]} in many ways, using Alphastar, whose brain has a deep LSTM core trained by PG.^{[DM3]} A PG-trained LSTM was also the core of OpenAI Five, which learned to defeat human experts in the Dota 2 video game (2018).^{[OAI2]} Bill Gates called this a "huge milestone in advancing artificial intelligence."^{[OAI2a][MIR](Sec. 4)[LSTMPG]} The future of RL will be about learning/composing/planning with compact spatio-temporal abstractions of complex input streams, about commonsense reasoning^{[MAR15]} and learning to think.^{[PLAN4-5]} How can NNs learn to represent percepts and action plans hierarchically, at multiple levels of abstraction and multiple time scales?^{[LEC]} We published answers to these questions in 1990-91: self-supervised neural history compressors^{[UN][UN0-3]} learn to represent percepts at multiple levels of abstraction and multiple time scales (see above), while end-to-end differentiable NN-based subgoal generators^{[HRL3][MIR](Sec. 10)} learn hierarchical action plans through gradient descent (see above). More sophisticated ways of learning to think in abstract ways were published in 1997^{[AC97][AC99][AC02]} and 2015-18.^{[PLAN4-5]} The programmable automaton built in the 1st century^{[SHA7a][RAU1]} by Heron of Alexandria was perhaps the first machine with a stored program.^{[BAN][KOE1]} It used pins on a rotating cylinder to encode the program.
In 1623, Wilhelm Schickard built an early automatic gear-based calculating machine. In 1673, the already mentioned Gottfried Wilhelm Leibniz (called "the smartest man who ever lived"^{[SMO13]}) designed the first machine (the step reckoner) that could perform all four arithmetic operations, and the first with a memory.^{[BL16]} He also described the principles of binary computers operated by punch cards (1679),^{[L79][L03][LA14][HO66]} and published the chain rule^{[LEI07-10]} (see above), an essential ingredient of deep learning and modern AI.
Leonardo Torres y Quevedo (mentioned in the introduction) became the 20th century's first pioneer of practical AI with his chess end game player; the machine was later demonstrated at the 1951 Paris AI conference.^{[AI51][BRO21][BRU4]} Between 1935 and 1941, Konrad Zuse created the world's first working programmable general-purpose computer, the Z3. The corresponding patent of 1936^{[ZU36-38][RO98][ZUS21]} described digital circuits, predating Claude Shannon's 1937 thesis on digital circuit design.^{[SHA37]} Unlike Babbage, Zuse used Leibniz' principles of binary computation (1679),^{[L79][LA14][HO66][L03]} which greatly simplified the hardware.^{[LEI21,a,b]} Ignoring the inevitable storage limitations of any physical computer, the Z3 was universal in the sense of Church^{[CHU]} (1935), Turing^{[TUR]} (1936), and Post^{[POS]} (1936): simple tricks can compensate for its lack of an explicit conditional jump instruction.^{[RO98]}
Tube-based computing advanced in the 1940s through pioneers such as John Atanasoff (the "father of tube-based computing"^{[NASC6a]}), although the transistor principle had already been patented by Julius Edgar Lilienfeld in 1925.^{[LIL1-2]} The tube-based British Colossus was used to break the Nazi code.^{[NASC6]} The first general-purpose program-controlled computer by someone other than Zuse (1941)^{[RO98]} was Howard Aiken's decimal MARK I (US, 1944). Compare also the 1948 upgrade of ENIAC, which was reprogrammed by entering numerical instruction codes into read-only memory.^{[HAI14b]} In 1949, Werner Jacobi filed a patent for integrated circuits (ICs) with several transistors on a common substrate (granted in 1952).^{[IC49-14]} In 1959, Robert Noyce presented a monolithic IC.^{[IC14]} ICs/GPUs of today (2022) contain many billions of transistors (almost all of them of Lilienfeld's 1925 FET type^{[LIL1-2]}). If Moore's Law, which states that the number of transistors^{[LIL1-2]} per chip doubles every couple of years, keeps holding, computers will soon match the raw computational power of all human brains combined.^{[RAW]} According to Bremermann (1982),^{[BRE]} however, physics imposes ultimate limits on such growth, as previously noted back in 2004.^{[OOPS2][ZUS21]} Future computers may be partially photonic (some of today's connections are actually light beams).^{[DL2]} Energy-efficient hardware architectures are expected to become even much more important than they are today.^{[DL2]} And already in 1931, Kurt Gödel had identified fundamental limits of any type of computation-based AI.^{[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]}
He combined Georg Cantor's diagonalization trick^{[CAN]} with the foundational work by Gottlob Frege^{[FRE]} (who introduced the first formal language in 1879), Thoralf Skolem^{[SKO23]} (who introduced primitive recursive functions in 1923) and Jacques Herbrand.^{[GOD86]} Much of this, in turn, goes back to Gottfried Wilhelm Leibniz^{[L86][WI48]} (see above), whose formal Algebra of Thought was deductively equivalent^{[LE18]} to the later Boolean Algebra of 1847.^{[BOO]} In 1936, Alan M. Turing introduced another universal model of computation, the Turing Machine.^{[TUR]} He rederived the above-mentioned result.^{[CHU][TUR][HIN][GOD21,21a][TUR21][LEI21,21a]} In the same year of 1936, Emil Post published yet another independent universal model of computing.^{[POS]} As mentioned above, Konrad Zuse created the world's first working programmable general-purpose computer^{[ZU36-38][RO98][ZUS21]} and designed the first high-level programming language^{[BAU][KNU]} around 1945,^{[KNU]} applying it to theorem proving in 1948.^{[ZU48]} Compare Newell & Simon's later work on theorem proving (1956).^{[NS56]} In 1964, Ray Solomonoff combined Bayesian (actually Laplacian^{[STI83-85]}) probabilistic reasoning and theoretical computer science^{[GOD][CHU][TUR][POS]} to obtain a formal, optimal (though incomputable) theory of learning to predict future data from past observations.^{[AIT1][AIT10]} With Andrej Kolmogorov, he founded the theory of Kolmogorov complexity or algorithmic information theory (AIT),^{[AIT1-22]} going beyond traditional information theory.^{[SHA48][KUL]} Later work extended this concept,^{[AIT7][AIT5][AIT12-13][AIT16-17]} with applications to NNs.^{[KO2][CO1-3]}
In the early 2000s, Marcus Hutter (while working under my Swiss National Science Foundation grant^{[UNI]}) augmented Solomonoff's universal predictor^{[AIT1][AIT10]} such that it also works for agents acting in unknown environments.^{[AIT20,22]} He also derived the asymptotically fastest algorithm for all well-defined computational problems.^{[AIT21]} Looking at the timeline of the most important events since the Big Bang, one finds a beautiful pattern of exponential acceleration in it,^{[OMG]} which I have presented in many talks since then, and which also made it into Sibylle Berg's award-winning book "GRM: Brainfuck."^{[OMG2]} The most recent of these events are separated by remarkably short intervals: just a few decades or centuries or at most millennia.^{[OMG1]} Compare, e.g., early practical technology (such as the automata of Heron of Alexandria^{[RAU1]} in the 1st century), the telephone (e.g., Meucci 1857, Reis 1860, Bell 1876),^{[NASC3]} and the Haber-Bosch process for creating artificial fertilizer, without which the world could feed at most 4 billion people.^{[HAB1-2]} In the 1980s, Ernst Dickmanns built the first truly self-driving cars (by 1994, his robot cars were driving in highway traffic, up to 180 km/h).^{[AUT]} Back then, I worked on my 1987 diploma thesis,^{[META1]} which introduced algorithms not just for learning but also for meta-learning or learning to learn,^{[META]} to learn better learning algorithms through experience (now a very popular topic^{[DEC]}). And then came our Miraculous Year 1990-91^{[MIR]} at TU Munich, the root of today's most cited NNs^{[MOST]} and of modern deep learning: artificial curiosity and generative adversarial NNs for agents that invent their own problems (see above),^{[AC90-AC20][PP-PP2][SA17]} Transformers with linearized self-attention (see above),^{[FWP0-6][TR5-6]} distilling teacher NNs into student NNs (see above),^{[UN][UN0-3]} learning at multiple levels of abstraction and multiple time scales (see above),^{[HRL0-2][LEC]} and other exciting stuff. Much of this has become very popular, and improved the lives of billions of people.^{[DL4][DEC][MOST]} If the pattern of acceleration holds, even more dramatic changes lie ahead (take all of this with a grain of salt, though^{[OMG1]}).
Some expect that AIs driven by artificial curiosity (a topic pursued in my lab for decades^{[AC][AC90,AC90b]}) will quickly improve themselves, restricted only by the fundamental limits of computability and physics. Those who exploit this development^{[ACM16][FA15][SP16][SA17]} will make more and bigger AIs. Those who don't won't have an impact.^{[ACM16][FA15][SP16]}
Some of the material above was taken from previous AI Blog posts.^{[MIR] [DEC] [GOD21] [ZUS21] [LEI21] [AUT] [HAB2] [ARC06] [AC] [ATT] [DAN] [DAN1] [DL4] [GPUCNN5,8] [DLC] [FDL] [FWP] [LEC] [META] [MLP2] [MOST] [PLAN] [UN] [LSTMPG] [BP4] [DL6a] [HIN] [T22]}
More can be found on my publication page and my arXiv page. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
555+ References (and many more in the survey^{[DL1]})
In 2022, we are celebrating the following works from a quarter-century ago.
1. Journal paper on Long Short-Term Memory, the most cited NN of the 20th century (and basis of the most cited NN of the 21st).
2. Paper on all possible metaverses.
3. Implementing artificial curiosity and creativity through generative adversarial agents that learn to design abstract, interesting computational experiments.
4. Paper on meta-reinforcement learning.
5. Journal paper on hierarchical Q-learning.
8. Journal paper on Low-Complexity Art, the Minimal Art of the Information Age.
[AC] J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity.
The first paper on online planning with reinforcement learning recurrent neural networks (NNs) and on generative adversarial networks.
Describes general systems with intrinsic motivation.^{[AC90-AC95]} See later publications.^{[AC99][AC02]}
More on artificial scientists and artists.
With a brief summary of the generative adversarial neural networks of 1990^{[AC90,90b][AC20]}
Preprint arXiv/1906.04493.
[AIB] J. Schmidhuber. AI Blog.
Includes variants of chapters of the AI Book.
H. Bruderer^{[BRU4]} calls that the first conference on AI.
Blog of Werner Vogels, CTO of Amazon (Nov 2016).
First publication of what was later sometimes called the Hopfield network^{[AMH2]} or Amari-Hopfield Network,^{[AMH3]} based on the (uncited) Lenz-Ising recurrent architecture.^{[L20][I25][T22]}
Mentions the recurrent Ising model^{[L20][I25]} on which the (uncited) Amari network^{[AMH1,2]} is based.
The Hopfield network or Amari-Hopfield Network was first published in 1972 by Amari.^{[AMH1]} [AMH2] did not cite [AMH1].
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. Schmidhuber had both hard attention (1990) and soft attention in Transformers with linearized self-attention (1991-93).^{[FWP]} Today, both types are very popular.
H. Larochelle, G. E. Hinton. Learning to combine foveal glimpses with a third-order Boltzmann machine. NIPS 2010. This work is very similar to [ATT0-2] which the authors did not cite.
In fact, Hinton was the reviewer of a 1990 paper,^{[ATT2]} and later wrote about his own work^{[ATT3]} with its "attentional component (the fixation controller)." See [MIR](Sec. 9)[R4].
arXiv/1409.0473, 2014-16.
This work on soft "attention" did not cite Schmidhuber's much earlier original work of 1991-1993 on soft attention and Transformers with linearized self-attention.^{[FWP,FWP0-2,6][ATT]}
J. Schmidhuber (AI Blog, 2005). Highlights of robot car history.
Bloomberg, May 15, 2018.
This work did not cite the relevant prior art: neither the earlier related models by Sherrington & Kirkpatrick^{[SK75]} & Glauber^{[G63]} nor the first working algorithms for deep learning of internal representations (Ivakhnenko & Lapa, 1965)^{[DEEP1-2][HIN]} nor Amari's work (1967-68)^{[GD1-2]} on learning internal representations in deep nets through stochastic gradient descent. Even later surveys by the authors^{[S20][DLC]} failed to cite the prior art.^{[T22]}
Leibniz's formal Algebra of Thought (1686)^{[L86][WI48]} was deductively equivalent^{[LE18]} to the much later Boolean Algebra (1847).^{[BOO]}
Precursor of modern backpropagation.^{[BP1-5]}
First application of backpropagation^{[BP1]} to NNs (concretizing thoughts in Werbos' 1974 thesis).
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020). Who invented backpropagation? More in [DL2].
IEEE Spectrum, 2021.
English version: [CNN1+]. More in Scholarpedia.
[CNN1a] A. Waibel. Phoneme Recognition Using Time-Delay Neural Networks. Meeting of IEICE, Tokyo, Japan, 1987. First application of backpropagation^{[BP1-5]} and weight-sharing to speech recognition (TDNNs).
Uses max-pooling instead of Fukushima's Spatial Averaging.^{[CNN1]}
Uses max-pooling instead of Fukushima's Spatial Averaging.^{[CNN1]}
Inverse, 2016.
Since November 2021: Comments on version 1 of the report^{[T22]}
in the Connectionists Mailing List, perhaps the oldest mailing list on artificial neural networks. Link to the archive.
Beijing, 2014. Preprint arXiv:1402.3511 [cs.NE].
[DAN] J. Schmidhuber (AI Blog, 2021). 10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named after my postdoc Dan Ciresan, it achieved the 1st superhuman result in 2011.^{[DAN1]} Now everybody is using this approach.
[DAN1] J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition, achieved by the artificial neural network called DanNet.
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, updated 2021, 2022). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. Also mentions the 1991 NN distillation procedure.^{[UN0-2][MIR](Sec. 2)}
More.
Deep Learning.
HTML.
A "survey" of deep learning that does not mention the pioneering works of deep learning [T22].
[DL3a] Y. Bengio, Y. LeCun, G. Hinton (2021). Turing Lecture: Deep Learning for AI. Communications of the ACM, July 2021. HTML.
Local copy (HTML only).
Another "survey" of deep learning that does not mention the pioneering works of deep learning [T22].
[DL4] J. Schmidhuber (AI Blog, 2017).
Our impact on the world's most valuable public companies: Apple, Google, Microsoft, Facebook, Amazon... By
greatly improved (CTC-based)
on-device speech recognition
(on the phone, not the server)
LSTM.
PDF.
J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). The deep reinforcement learning & neuroevolution developed in Schmidhuber's lab solved problems of depth 1000 and more.^{[DL6]} Soon after its publication, everybody started talking about "deep learning." Causality or correlation?
Web site deeplearning.net of Y. Bengio's MILA (2015, retrieved May 2020; compare the version in the
Internet Archive),
referring to Hinton's^{[UN4]} and Bengio's^{[UN5]}
unsupervised pre-training for deep NNs^{[UN4]} (2006) although
this type of deep learning dates back to Schmidhuber's work of 1991.^{[UN1-2][UN]}
[DLC] J. Schmidhuber (AI Blog, June 2015).
Critique of Paper by self-proclaimed^{[DLC2]} "Deep Learning Conspiracy" (Nature 521 p 436).
it). More on this under [T22].
J. Schmidhuber (AI Blog, 2022).
Annotated History of Modern AI and Deep Learning. Technical Report IDSIA-22-22, IDSIA, Lugano, Switzerland, 2022.
Preprint arXiv:2212.11279.
Tweet of 2022.
arxiv:1312.5602.
Link.
the first sentence of the abstract of the earlier tech report version^{[DM1]}
was created earlier by Jan Koutnik et al. in Schmidhuber's lab.^{[CO2]}
and PhDs in computer science. More.
Alphastar has a "deep LSTM core."
Hochreiter et al.'s first successful application [HO07] of deep learning to protein folding (2007).
Preprint arXiv:2112.10752, LMU Munich, 2021.
neural networks learning to control dynamic external memories.^{[PDA1-2][FWP0-1]}
arXiv:1808.03578, 2018.
arXiv:1808.03578, 2018.
Conf. on Neural Networks, Vol. 2, 2004, pp. 985-990. This paper does not mention that the "ELM" concept goes back to Rosenblatt's work in the 1950s.^{[R62][T22]}
This overview does not mention that the "ELM" concept goes back to Rosenblatt's work in the 1950s.^{[R62][T22]}
Link.
used LSTM
over 4 billion automatic translations per day (The Verge, August 4, 2017);
Facebook blog by J.M. Pino, A. Sidorov, N.F. Ayan (August 3, 2017)
[FDL] J. Schmidhuber (AI Blog, 2013). My First Deep Learning System of 1991 + Deep Learning Timeline 1960-2013.
PDF.
J. Schmidhuber (AI Blog, 26 March 2021, updated 2022).
alternative^{[FWP0-1]} to recurrent NNs.
the fast weights^{[FAST,FASTa,b]} of
Such Fast Weight Programmers^{[FWP0-6,FWPMETA1-8]} can learn to memorize past data, e.g.,
by computing fast weight changes through additive outer products of self-invented activation patterns^{[FWP0-1]}
(now often called keys and values for self-attention^{[TR1-6]}).
The similar Transformers^{[TR1-2]} combine this with projections
Transformers with linearized self-attention^{[TR5-6]}
In 1993, he introduced
the attention terminology^{[FWP2]} now used
in this context,^{[ATT]} and
RNNs that program themselves.
See tweet of 2022.
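The annotation above describes the mechanism concretely: a slow net programs a fast weight matrix through additive outer products of activation patterns (today's keys and values), and the fast net is then applied to each query. A minimal NumPy sketch of this correspondence to linearized self-attention (function name and shapes are illustrative, not taken from the cited papers):

```python
import numpy as np

def fwp_linear_attention(keys, values, queries):
    """Fast Weight Programmer view of linearized self-attention:
    the fast weight matrix W is updated by additive outer products
    of (value, key) pairs, then queried at every step."""
    d_k, d_v = keys.shape[1], values.shape[1]
    W = np.zeros((d_v, d_k))              # fast weights, start at zero
    outputs = []
    for k, v, q in zip(keys, values, queries):
        W += np.outer(v, k)               # additive outer-product update
        outputs.append(W @ q)             # retrieval: fast net applied to query
    return np.array(outputs)
```

Unrolling the update shows the equivalence: the output at step t equals the sum over s <= t of values[s] weighted by the unnormalized dot products keys[s] . queries[t], i.e., attention with a linear kernel and no softmax.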
PDF.
normalization).^{[FWP]}
PDF.
HTML.
Pictures (German).
See tweet of 2022 for 30-year anniversary.
PDF.
Preprint: arXiv:1811.12143. PDF.
PDF. Very similar to [FWP0-2], in both motivation [FWP2] and execution.
This work on "attention" did not cite Schmidhuber's much earlier original work of 1991-1993 on soft attention and Transformers with linearized self-attention.^{[FWP,FWP0-2,6][ATT]}
Preprint: arXiv:2003.08165.
PDF.
HTML overview.
Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021. Preprint: arXiv:2102.11174.
Preprint: arXiv:2106.06295 (June 2021).
PDF.
An introspective network that can learn to run its own weight change algorithm. In Proc. of the Intl. Conf. on Artificial Neural Networks,
J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF.
can be found here.
Preprint arXiv:2012.14905 [cs.LG], 2020.
Report arXiv:2011.07831 [cs.AI], 2020.
Preprint: arXiv:2202.05780.
PDF.
Probably the first paper on using stochastic gradient descent^{[STO51-52]}
reverse mode of automatic differentiation or backpropagation^{[BP1]}).
OCR-based PDF scan of pages 94-135 (see pages 119-120).
Implementation of Amari's 1967 stochastic gradient descent method for multilayer perceptrons.^{[GD1]} (S. Amari, personal communication, 2021.)
Preprint arXiv/2207.01570, 4 July 2022 (submitted in May 2022).
arXiv:cs/0309048 (2003).
More.
PDF.
Cognitive Computation 1(2):177-193, 2009. PDF.
More.
Google Research Blog, Sep 2015, see also
Aug 2015 Google's speech recognition based on CTC and LSTM.
Alphr Technology, Jul 2015, or 9to5google, Jul 2015
WIRED, Sep 2016,
siliconANGLE, Sep 2016
Blog post, Internet Archive, 2010.
A blog post describing basic ideas^{[AC][AC90,AC90b][AC20]} of GANs.
A description of GANs that does not cite Schmidhuber's original GAN principle of 1990^{[AC][AC90,AC90b][AC20][R2][T22]} (also containing wrong claims about Schmidhuber's adversarial NNs for Predictability Minimization^{[PM0-2][AC20][T22]}).
Link.
This was number 1 on Hacker News.
Frankfurter Allgemeine Zeitung, 16/6/2021.
Preprint arXiv/2005.14165.
for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011. PDF. ArXiv preprint.
win four important computer vision competitions 2011-2012 before others won any
PDF.
HTML overview.
competitor.^{[DAN1]} This led to massive interest from industry.
[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More.
PDF.
DanNet,^{[DAN,DAN1][R6]}
to win computer vision contests in 2011^{[GPUCNN2-3,5]} (AlexNet and VGG Net^{[GPUCNN9]} followed in 2012-2014). [GPUCNN4] emphasizes benefits of Fukushima's ReLUs (1969)^{[RELU1]} and dropout (a variant of Hanson's 1990 stochastic delta rule)^{[Drop1-4]} but neither cites the original work^{[RELU1][Drop1]} nor the basic CNN architecture (Fukushima, 1979).^{[CNN1]}
J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet was the first CNN to win one, and won 4 of them in a row before the similar AlexNet/VGG Net and the Resnet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision.
PDF.
PDF.
[GPUCNN8] J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet).
first deep learner to win a medical imaging contest (2012). Link.
J. Schmidhuber (Blog, 2000). Most influential persons of the 20th century (according to Nature, 1999). The Haber-Bosch process has often been called the most important invention of the 20th century^{[HAB1]}
PDF.
PDF.
Bengio claimed^{[YB20]}
Schmidhuber's publications on exactly this topic
date back to 1991-93.^{[UN0-2][UN]}
An unsupervised learning algorithm related to Schmidhuber's supervised Neural Heat Exchanger.^{[NHE]}
[HIN] J. Schmidhuber (AI Blog, 2020). Critique of Honda Prize for Dr. Hinton. Science must not allow corporate PR to distort the academic record. See also [T22].
previous related work.^{[BB2][NAN1-4][NHE][MIR](Sec. 15, Sec. 17)[FWPMETA6]}
PDF.
what Y. LeCun called an "open problem" in 2022.^{[LEC]}
North-Holland, 1991. PDF. Extending TR FKI-129-90, TUM, 1990.
PDF.
This work did not cite Schmidhuber's gradient-based subgoal generators for hierarchical reinforcement learning (1990).^{[HRL0-2]}
PDF.
Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS 2015. The
LSTM with forget gates^{[LSTM2]} for RNNs.) Resnets^{[HW2]} are a version of this where the gates are always open: g(x)=t(x)=const=1.
Highway Nets perform roughly as well as ResNets^{[HW2]} on ImageNet.^{[HW3]} Variants of highway gates are also used for certain algorithmic tasks, where the simpler residual layers do not work as well.^{[NDR]}
More.
Link.
arXiv:1512.03385
(Dec 2015). Residual nets are a version of Highway Nets^{[HW1]}
More.
arxiv:1612.07771 (2016). Also at ICLR 2017.
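The gate relation stated above can be sketched in a few lines of NumPy (shapes and the sigmoid gate parameterization are illustrative): a residual layer is exactly a highway layer whose transform and carry gates are fixed open.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, W_g, W_t):
    """Highway layer: y = h(x)*g(x) + x*t(x) with learned gates g, t."""
    h = np.tanh(W_h @ x)      # candidate transform h(x)
    g = sigmoid(W_g @ x)      # transform gate g(x)
    t = sigmoid(W_t @ x)      # carry gate t(x)
    return h * g + x * t

def residual_layer(x, W_h):
    """ResNet case: gates always open, g(x) = t(x) = 1."""
    return np.tanh(W_h @ x) + x
```

With zero gate weights both gates sit at 0.5, blending transform and carry paths equally; driving both gates to 1 recovers the residual layer.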
This work did not cite the earlier LSTM^{[LSTM0-6]} trained by Connectionist Temporal Classification (CTC, 2006).^{[CTC]} CTC-LSTM was successfully applied to speech in 2007^{[LSTM4]} (also with hierarchical LSTM stacks^{[LSTM14]}) and became the first superior end-to-end neural speech recogniser that outperformed the
state of the art, dramatically improving Google's speech recognition.^{[GSR][GSR15][DL4]}
Markov models (HMMs).^{[BW][BRI][BOU]} [HYB12] still used the old hybrid approach and did not compare it to CTC-LSTM. Later, however, Hinton switched to LSTM, too.^{[LSTM8]}
Ernst Ising and Wilhelm Lenz in the 1920s.^{[L20][I25][K41][W45][T22]} It settles into an equilibrium state in response to input conditions, and is the foundation of the first well-known learning RNNs.^{[AMH1-2]}
Who Invented the IC?
Preprint arXiv:1704.04760
PDF.
PDF.
Mathematischen Schriften, ed. C. Gerhardt, Berlin 1879, vol.7, p.223. English link.
Link.
arXiv:1607.06450, 2016.
[LEC] J. Schmidhuber (AI Blog, 2022). LeCun's 2022 paper on autonomous machine intelligence rehashes but does not cite essential work of 1990-2015. Years
See tweet1.
LeCun also listed the "5 best ideas 2012-2022" without mentioning that
See tweet2.
[LEI21] J. Schmidhuber (AI Blog, 2021). 375th birthday of Leibniz, founder of computer science.
Frankfurter Allgemeine Zeitung (FAZ), 17/5/2021. FAZ online:
19/5/2021.
[LEI21b] J. Schmidhuber (AI Blog, 2021). 375. Geburtstag des Herrn Leibniz, dem Vater der Informatik.
PDF.
[LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF.
Based on [LSTM0]. More.
PDF.
PDF.
PDF.
PDF.
PDF.
PDF.
PDF.
PDF.
PDF.
PDF.
PDF.
Preprint: arxiv:1506.07452.
PDF.
J. Schmidhuber (AI Blog, Dec 2020). 10-year anniversary of our journal paper on deep reinforcement learning with policy gradients for LSTM (2007-2010). Recent
PDF.
are actually a variant of the vanilla LSTM architecture^{[LSTM2]} (2000) which the authors did not cite
although this work^{[LSTM2]} was the one that introduced gated recurrent units.
Furthermore, Schmidhuber's team automatically evolved lots of additional LSTM variants and topologies already in 2009^{[LSTM7]} without changing the name of the basic method.
learn to count^{[LSTMGRU2]} nor learn simple non-regular
languages;^{[LSTMGRU2]} they
according to Google Brain.^{[LSTMGRU3]})
Preprint arXiv:1805.04908.
Architectures. Preprint arXiv:1703.03906
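For reference, one step of a vanilla-LSTM-style cell with the forget gate introduced in [LSTM2] can be sketched as follows (packing all four gate pre-activations into a single weight matrix is a hypothetical simplification):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W):
    """One step of a vanilla-LSTM-style cell: input (i), forget (f),
    output (o) gates and candidate (g); W packs all four pre-activations."""
    z = W @ np.concatenate([x, h])
    i, f, o, g = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # forget gate rescales old cell state
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new
```

The forget gate multiplies the old cell state by a learned factor in (0,1), which is what lets the cell both retain and reset information over long sequences.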
A misleading "history of deep learning" goes more or less like this: "In 1969, Minsky & Papert^{[M69]}
researchers took a fresh look at the problem in the 1980s."^{[S20]} However, the 1969 book^{[M69]} addressed a "problem" of Gauss & Legendre's shallow learning (circa 1800)^{[DL1-2]} that had already been solved 4 years prior by Ivakhnenko & Lapa's popular deep learning method,^{[DEEP1-2][DL2]}
and then also by Amari's SGD for MLPs.^{[GD1-2]}
Minsky was apparently unaware of this and failed to correct it later.^{[HIN](Sec. I)[T22](Sec. XIII)}
J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of
Searchable PDF scan (created by OCRmypdf which uses
LSTM).
HTML.
better GP methods through Meta-Evolution. More.
[MIR] J. Schmidhuber (AI Blog, Oct 2019, updated 2021, 2022). Deep Learning: Our Miraculous Year 1990-1991. Preprint
arXiv:2005.05744, 2020. The
Computation 22(12): 3207-3220, 2010. ArXiv Preprint.
(AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training.
By 2010, when compute was 100 times more expensive than today, both the feedforward NNs^{[MLP1]}
J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in Schmidhuber's labs at TU Munich and IDSIA. (1) Long Short-Term Memory (LSTM), (2) ResNet (which is the earlier Highway Net with open gates), (3) AlexNet and VGG Net (both building on the similar earlier DanNet: the first deep convolutional NN to win
image recognition competitions),
Adversarial Artificial Curiosity), and (5) variants of Transformers (Transformers with linearized self-attention are formally equivalent to the much earlier Fast Weight Programmers).
Annus Mirabilis of 1990-1991.^{[MIR]}
PDF.
PDF.
Preprint arXiv:1608.05343, 2016.
Preprint arXiv:1611.01578 (PDF), 2017.
Compare the earlier Neural Architecture Search of Bayer et al. (2009) for LSTM-like topologies.^{[LSTM7]}
[NASC1] J. Schmidhuber. First Pow(d)ered flight / plane truth. Correspondence, Nature, 421 p 689, Feb 2003.
[NASC3] J. Schmidhuber. The last inventor of the telephone. Letter, Science, 319, no. 5871, p. 1759, March 2008.
Correspondence, Nature, vol 483, p 541, March 2012, doi:10.1038/483541b.
Letter, Science, vol 336, p 1639, June 2012.
See also comment on response by A. Hodges (DOI:10.1126/science.336.6089.1639-a)
[NASC6] J. Schmidhuber. Colossus was the first electronic digital computer. Correspondence, Nature, 441 p 25, May 2006.
[NASC6a] J. Schmidhuber. Comment on "Biography: The ABC of computing" by J. Gilbey, Nature 468 p 760-761 (2010). Link.
[NASC7] J. Schmidhuber. Turing's impact. Correspondence, Nature, 429 p 501, June 2004
[NASC8] J. Schmidhuber. Prototype resilient, self-modeling robots. Correspondence, Science, 316, no. 5825 p 688, May 2007.
[NASC9] J. Schmidhuber. Comparing the legacies of Gauss, Pasteur, Darwin. Correspondence, Nature, vol 452, p 530, April 2008.
HTML.
The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization. Proc. ICLR 2022. Preprint arXiv/2110.07732.
Link.
excellent 1995 neural probabilistic text model.^{[SNT]} See also Nakamura and Shikano's 1989 word category prediction model.^{[NPMa]}
Compare Konrad Zuse's much earlier 1948 work on
theorem proving^{[ZU48]}
the first high-level programming language.^{[BAU][KNU]}
NY Times article
Learning Dexterous In-Hand Manipulation. arXiv:1808.00177 (PDF).
arxiv:1912.06680.
An LSTM composes 84% of the model's total parameter count.
2018. An LSTM with 84% of the model's total parameter count was the core of OpenAI Five.
Link.
J. Schmidhuber (Blog, 2006).
Is History Converging? Again?
history's exponential acceleration since the Big Bang.^{[OMG]}
Preprint arXiv/1606.06724.
Preprint arXiv/1708.03498.
Preprint arXiv/1802.10353.
Preprint arXiv/2010.03635.
Preprint arXiv/2011.12930.
PDF.
HTML.
HTML overview.
OOPS source code in crystalline format.
PDF.
HTML.
Link.
J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, and
the GAN principle
Based on TR FKI-126-90 (1990).^{[AC90]}
More.
PDF.
Partially based on TR FKI-126-90 (1990).^{[AC90]}
Report arXiv:1210.0118 [cs.AI], 2015.
One Big Net For Everything. Preprint arXiv:1802.08864 [cs.AI], Feb 2018.
Preprint: arXiv:1809.01999.
Github: World Models.
minimization. TR CU-CS-565-91, Univ. Colorado at Boulder, 1991. PDF.
More.
1991. PDF.
More.
PDF. More.
Link.
arXiv:1112.5309 [cs.AI]
PDF.
First Experiments with PowerPlay.
arXiv:1210.8385 [cs.AI].
[R1] Reddit/ML, 2019. Hinton, LeCun, Bengio receive ACM Turing Award. This announcement contains more comments about Schmidhuber than about any of the awardees.
[R2] Reddit/ML, 2019. J. Schmidhuber really had GANs in 1990.
[R3] Reddit/ML, 2019. NeurIPS 2019 Bengio Schmidhuber Meta-Learning Fiasco.
in 1987^{[META1][META]} long before Bengio
[R4] Reddit/ML, 2019. Five major deep learning papers by G. Hinton did not cite similar earlier work by J. Schmidhuber.
[R5] Reddit/ML, 2019. The 1997 LSTM paper by Hochreiter & Schmidhuber has become the most cited deep learning research paper of the 20th century.
[R6] Reddit/ML, 2019. DanNet, the CUDA CNN of Dan Ciresan in J. Schmidhuber's team, won 4 image recognition challenges prior to AlexNet.
[R7] Reddit/ML, 2019. J. Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970.
[R8] Reddit/ML, 2019. J. Schmidhuber on Alexey Ivakhnenko, godfather of deep learning 1965.
[R9] Reddit/ML, 2019. We
[R11] Reddit/ML, 2020. Schmidhuber: Critique of Honda Prize for Dr. Hinton
[R12] Reddit/ML, 2020. J. Schmidhuber: Critique of Turing Award for Drs. Bengio & Hinton & LeCun
[R15] Reddit/ML, 2021. J. Schmidhuber's work on fast weights from 1991 is similar to linearized variants of Transformers
Although these MLPs did not yet have deep learning, because only the last layer learned,^{[DL1]}
Rosenblatt basically had what much later was rebranded as Extreme Learning Machines (ELMs) without proper attribution.^{[ELM1-2][CONN21][T22]}
J. Schmidhuber (AI Blog, 2001). Raw Computing Power.
Preprint arXiv/1311.2524, Nov 2013.
Preprint arXiv/1703.06870, 2017.
PDF.
The first paper on policy gradients for LSTM. This approach has become very important in reinforcement learning.^{[LSTMPG]}
This experimental analysis of backpropagation did not cite the origin of the method,^{[BP1-5]} also known as the reverse mode of automatic differentiation.
the first working algorithms for deep learning of internal representations (Ivakhnenko & Lapa, 1965)^{[DEEP1-2][HIN]} as well as
Amari's work (1967-68)^{[GD1-2]} on learning internal representations in deep nets through stochastic gradient descent.
Even later surveys by the authors^{[DL3,3a]} failed to cite the prior art.^{[T22]}
Link.
A misleading "history of deep learning" which goes more or less like this: "In 1969, Minsky & Papert^{[M69]}
researchers took a fresh look at the problem in the 1980s."^{[S20]} However, the 1969 book^{[M69]} addressed a "problem" of Gauss & Legendre's shallow learning (circa 1800)^{[DL1-2]} that had already been solved 4 years prior by Ivakhnenko & Lapa's popular deep learning method,^{[DEEP1-2][DL2]}
and then also by Amari's SGD for MLPs.^{[GD1-2]}
Minsky was apparently unaware of this and failed to correct it later.^{[HIN](Sec. I)[T22](Sec. XIII)}
in the 1960s-70s, especially outside of the Anglosphere.^{[DEEP1-2][GD1-3][CNN1][DL1-2][T22]}
The Past, Present and Future of Artificial Intelligence.
Link.
PDF.
Much later this was called a probabilistic language model.^{[T22]}
PDF.
Link.
ACM's justification of the 2018 A.M. Turing Award (announced in 2019). WWW link.
Local copy 1 (HTML only).
Local copy 2 (HTML only).
[T22] debunks this justification.
[T20a] J. Schmidhuber (AI Blog, 25 June 2020). Critique of 2018 Turing Award for Drs. Bengio & Hinton & LeCun. A precursor of [T22].
[T22] J. Schmidhuber (AI Blog, 2022).
Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21, IDSIA, Lugano, Switzerland, 2022.
Debunking [T19] and [DL3a].
the 1991 publication on what's now called "Transformers with linearized self-attention."^{[FWP0-6][TR5-6]}
attention terminology in 1993.^{[ATT][FWP2][R4]}
See tweet of 2022 for 30-year anniversary.
Link.
[TUR21] J. Schmidhuber (AI Blog, Sep 2021). Turing Oversold. It's not Turing's fault, though.
The Turing Test.
YouTube video, 2022.
Preprint arXiv/1912.02875, 5 Dec 2019.
Preprint arXiv/1912.02877, 5 Dec 2019.
J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised or self-supervised pre-training. Unsupervised
PDF.
By 1993, the approach solved problems of depth 1000 [UN2]
neural knowledge distillation procedure
The systems of 1991 allowed for much deeper learning than previous methods. More.
1992. Based on TR FKI-148-91, TUM, 1991.^{[UN0]} PDF.
approaches are now widely used. More.
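The distillation idea behind this entry — compressing a teacher net into a student by training the student on the teacher's input-output behavior — can be sketched in a few lines (the tiny one-layer nets, seed, and learning rate are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed "teacher" net whose input->output mapping we want to compress.
teacher_W = 0.5 * rng.normal(size=(2, 4))
def teacher(x):
    return np.tanh(teacher_W @ x)

# "Student" of the same shape, trained only to imitate the teacher's outputs.
student_W = np.zeros((2, 4))
lr = 0.1
for _ in range(3000):
    x = rng.normal(size=4)
    s = np.tanh(student_W @ x)
    err = s - teacher(x)                                # imitation error
    student_W -= lr * np.outer(err * (1 - s**2), x)     # SGD on 0.5*||err||^2
```

No labels from the original task are needed; the teacher's own outputs serve as targets.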
[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF.
can be found here (depth > 1000).
2006. PDF.
It did not cite the much earlier 1991 unsupervised pre-training of stacks of more general recurrent NNs (RNNs)^{[UN0-3]}
the first NNs shown to solve very deep problems.
(or negative log probability) of the data representation in the level below.^{[HIN][T22][MIR]}
This can greatly facilitate very deep downstream learning.^{[UN0-3]}
The comment under reference^{[UN4]} applies here as well.
Theory of Universal Learning Machines & Universal AI.
Link.
[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF. More on the Fundamental Deep Learning Problem.
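The Fundamental Deep Learning Problem analyzed in this thesis can be illustrated numerically: backpropagated error through many nonlinear stages is scaled by a product of per-step factors |w * f'|, which shrinks exponentially with depth when these factors stay below 1 (scalar toy model; values illustrative):

```python
import numpy as np

def backprop_gradient_magnitude(w, depth, x=0.5):
    """|d h_T / d h_0| for the scalar recurrence h_{t+1} = tanh(w * h_t):
    a product of per-step factors |w * tanh'(w * h_t)|."""
    g, h = 1.0, x
    for _ in range(depth):
        h = np.tanh(w * h)
        g *= abs(w) * (1.0 - h**2)   # |w| times tanh derivative at this step
    return g
```

With |w| < 1 every factor is below 1 and the gradient vanishes exponentially in the depth; with large |w| the product can instead explode.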
Results are essentially identical to those of Schmidhuber's diploma student Sepp Hochreiter (1991).^{[VAN1]} Even after a common publication,^{[VAN3]} the first author of [VAN2] published papers^{[VAN4]} that cited only their
own [VAN2] but not the original work.
PDF.
[VAN4] Y. Bengio. Neural net language models. Scholarpedia, 3(1):3881, 2008. Link.
Link.
Youtube video [see 28:16].
However, in 2010, Schmidhuber's team in Switzerland showed^{[MLP1-2]}
that unsupervised pre-training is not necessary
Preprint arXiv:1609.08144 (PDF), 2016. Based on LSTM which it mentions at least 50 times.
WWW link (retrieved 15 May 2020).
Local copy (plain HTML only).
Schmidhuber's publications on exactly this topic
date back to 1991-93.^{[UN0-2][UN]}
already in 1995.^{[SNT]}
a general, practical, program-controlled computer.
architecture [NEU45].
PDF.
J. Schmidhuber (AI Blog, 2021). 80th anniversary celebrations: 1941: Konrad Zuse completes the first working general computer, based on his 1936 patent application.
J. Schmidhuber (AI Blog, 2021). 80. Jahrestag: 1941: Konrad Zuse baut ersten funktionalen Allzweckrechner, basierend auf der Patentanmeldung von 1936.
Weltwoche, Nr. 33.21, 19 August 2021.
PDF.
(v1: 24 Sep 2021,
v2: 31 Dec 2021)
Versions since 2021 archived in the Internet Archive
This is a point-for-point critique of ACM's justification of the ACM A. M. Turing Award for deep learning, as well as a critique of the Turing Lecture given by the awardees (published by ACM in July 2021).
deep learning survey,^{[DL1]} and can also be seen as a short history of the deep learning revolution, at least as far as ACM's erroneous laudation and the Turing Lecture are concerned.
2015 survey of deep learning^{[DL1]}
June 2020 article^{[T20a][R12]}
version 1 of the present report.
(see Executive Summary
I,
V,
II,
XII,
XIX,
XXI,
XIII,
XIV,
XX,
XVII).
(A) speech recognition,
(B) natural language processing,
(C) robotics,
(D) computer vision,
(VII) medicine, astronomy, materials science.
A,
B,
C,
D,
VII,
XVII,
VI,
XVI).
II,
V,
XX,
XVIII)
with Dr. Bengio & Dr. Hinton (see Sec. XVII, I).
I respond to LBH's recent ACM article (July 2021).
expands material in my Critique of the 2019 Honda Prize^{[HIN]} (~3,000 words).
Abstract & Outline (~300 words),
Introduction (~300 words),
Critique of LBH's ACM article (Turing Lecture) of July 2021^{[DL3a]}
Executive summary of what's wrong with ACM's laudation (~1,000 words),
21 comments on 21 claims by ACM (~8,000 words),
Conclusion (~2,000 words).
All backed up by over 300 references (over 10,000 words).
The text contains numerous hyperlinks to relevant overview sites from the AI Blog.
science is self-correcting."^{[SV20]}
they are mine or other people's.^{[DL1-2][HIN][NASC1-9]} The present page is offered as a resource for all good computer scientists who share this inclination.
and to fight plagiarism,^{[FAKE2]}
collusion rings,^{[LIT21]} and systemic academic corruption in all of their more and less subtle forms.^{[FAKE]}
Sec. 2
LBH's 2021 ACM article^{[DL3a]} which necessitated an extension of the
first version
of this post.^{[T20a][R12]}
ACM's official justification^{[T19]} of the
2018 A.M. Turing Award^{[R1]}
After the Executive Summary in Sec. 3, Sec. 4 will split
ACM's full text^{[T19]}
into 21 parts
I,
II,
III,
IV,
V,
VI,
VII,
VIII,
IX,
X,
XI,
XII,
XIII,
XIV,
XV,
XVI,
XVII,
XVIII,
XIX,
XX,
XXI.
Most of the critiques are based on references to original papers and material from the AI Blog.^{[AIB][MIR][DEC][HIN]}
publishing yet another misleading overview of the field, this time based on LBH's Turing Lecture.^{[DL3a]}
LBH's well-known earlier omissions.^{[DLC][HIN][T20a]}
LBH claim to "briefly describe the origins of deep learning"^{[DL3a]} without even mentioning the world's first working deep learning nets by
Ivakhnenko and Lapa in 1965^{[DEEP1-2][R8]} (see Sec. II).
this class of methods was pioneered in 1991^{[UN-UN2]} (see Sec. II, III).
Highway Net,
the first really deep feedforward NN.^{[HW1-3]}
(see Sec. D, VI).
were all driven by my lab:^{[MOST]} In 1991, I had the
first very deep NNs based on unsupervised pre-training;^{[UN-UN2]}
LSTMs
brought essentially unlimited depth to gradient-based supervised recurrent NNs;^{[LSTM0-17]}
later our Highway Nets^{[HW1-3]} brought it to feedforward NNs.
from 2007^{[LSTM4,14]}
based on LSTM^{[LSTM0-6]} (1990s-2005) and CTC (2006).^{[CTC]}
our CTC-LSTM-based speech recognition (not that of Hinton) had been on most smartphones for years^{[GSR][GSR15-19][DL4]} (see Sec. A, VI, XI, XV). Similarly for machine translation (see Sec. B).
LBH cite Hinton (2012) for "dropout" without mentioning that dropout is just a variant of Hanson's 1990 stochastic delta rule^{[Drop1-3]} (see Sec. XIV).
perceptrons through stochastic gradient descent^{[GD1-3]} (without reverse mode backpropagation^{[BP1]}).
Fukushima who introduced ReLUs in 1969^{[RELU1-2]} (see Sec. XIV).
called AlexNet,^{[GPUCNN4]} without mentioning that our earlier groundbreaking deep GPU-based DanNet^{[GPUCNN1-3,5-8][DAN]} did not need ReLUs at all to win 4 earlier object recognition competitions and to achieve superhuman results already in 2011^{[GPUCNN1-8][R5-6]} (see Sec. XIV).
XVIII).
already in 1965^{[DEEP1-2][R8]} (see Sec. II).
earlier fast weights of von der Malsburg (1981) and Feldman (1982).^{[FAST,FASTa-b][FWP]}
described in the 1991-93 papers on Fast Weight Programmers and linear Transformers^{[FWP0-1,6]} (see Sec. XVI, XVII-2).
dedicate an extra section to attention-based Transformers,^{[TR1-6]} citing Bengio's team (2014) for "soft attention"^{[ATT14]} without citing the much earlier original work of 1991-1993 on soft attention and linear Transformers^{[FWP,FWP0-2,6][ATT]} (see Sec. XVII-1, XVI).
LBH claim that Bengio's team^{[NPM]}
of text compression^{[SNT]} (see Sec. XVI, XVII-1).
LBH cite Bengio's 2014 paper on Generative Adversarial Networks (GANs)^{[GAN0-1]} without mentioning that
GANs are instances
of the Adversarial Curiosity Principle of 1990^{[AC90-20][MIR](Sec. 5)} (see Sec. XVII).
In sum, LBH have repeatedly chosen to ignore the previous well-known critiques^{[DLC][HIN][T20a]} and deep learning surveys,^{[DL1-2]} and ACM's peer review process failed to catch this. ACM's Code of Ethics and Professional Conduct^{[ACM18]} states: "Computing
and deep learning (e.g., Sec. I), ACM lauds
Numerous references can be found under the relevant section links I-XXI
which adhere to the sequential order of ACM's text^{[T19]}
Sec. II:
it became really deep in 1991 in my lab,
unsupervised pre-training of NNs,
supervised LSTM.
Sec. I contains 4 subsections
A, B, C, D
A: Speech Recognition (see also Sec. VI & XI & XV): The first superior end-to-end neural speech recognition
combines two methods from my lab: LSTM (1990s-2005) and CTC (2006), which were
Hinton (2012) and Bengio (XV)
our revolutionary CTC-LSTM which was soon on most smartphones.
Sec. B: Natural Language Processing (see also Sec. VI & XI & XVI):
(soon used for several billions of
was also based on our LSTM.
Sec. C: Robotics.
most visible breakthroughs
Sec. D: Computer Vision
XVIII & XIV & XI & VI)
and applied to speech. All before LeCun's CNN work (XVIII).
deep NNs
pre-training (in contrast to Hinton's claims). Our DanNet was the first CNN fast & deep enough for
superior computer vision in 2011,
winning 4 image recognition contests in a row
is an open-gated version of our earlier Highway Nets.
Sec. XIV:
deep & fast CNN
(where LeCun participated),
Sec. XI: ACM mentions GPU-accelerated NNs
deep GPU-NN of 2010
debunked unsupervised pre-training (introduced by myself in 1991 and later championed by Hinton),
and our GPU-CNN of 2011 (DanNet) was the first
XVIII:
Fukushima and Waibel (see Sec. D).
The first application of CNNs with backpropagation to biomedical/biometric images is due to Baldi and Chauvin.^{[BA93]}
VII: ACM explicitly mentions medicine and
first to win medical imaging competitions
Sec. XII & XIX & XXI: Modern
backpropagation
XIII &
II &
V
III &
IX &
X &
XX):
Sec. XX: ACM credits LeCun for work on
Sec. XXI: ACM credits LeCun for work on
XV: ACM credits Bengio for hybrids of NNs and probabilistic models of sequences.
CTC-LSTM
A &
B).
XVI: ACM
We started this in 1990-93
long before LBH
Sec. XVII:
Artificial Curiosity
vanishing gradients (1991),
metalearning (1987),
unsupervised pre-training (1991),
compressing or distilling one NN into another (1991),
learning sequential attention with NNs (1990),
fast weight programmers using
and other topics.^{[R2-R6]}
Sec. IV is on Turing (1936) and his predecessors
Critique of LBH's ACM article (Turing Lecture) of July 2021.
Sec. Conclusion:
In the recent decade of deep learning,
(speech recognition, language translation, etc.) on billions of devices (also healthcare applications)
Sec. II &
III &
V &
XII &
XIII &
XVII &
XIV &
XIX &
XX &
XXI.
In what follows, ACM's full text [T19] is split into 21 parts
I,
II,
III,
IV,
V,
VI,
VII,
VIII,
IX,
X,
XI,
XII,
XIII,
XIV,
XV,
XVI,
XVII,
XVIII,
XIX,
XX,
XXI.
LBH and their co-workers have contributed certain useful improvements of existing deep learning methods.^{[CNN2,4][CDI][LAN][RMSP][XAV][ATT14][CAPS]} However, the field's foundations were laid by others: the deep learning networks of Ivakhnenko & Lapa (1965),^{[DEEP1-2][R8]} stochastic gradient descent for multilayer perceptrons (1967),^{[GD1-3]} modern backpropagation (1970),^{[BP1-2][R7]} architectures of recurrent NNs (1925-56)^{[I25][MC43][K56]} and convolutional NNs (1979),^{[CNN1]} principles of generative adversarial NNs and artificial curiosity (1990),^{[AC90,90b][AC20]} unsupervised pre-training for deep NNs (1991),^{[UN1-2]} vanishing gradients (1991)^{[VAN1]} & Long Short-Term Memory or LSTM (Sec. A), GPU-accelerated NNs (2004),^{[GPUNN][DAN][DAN1][GPUCNN5]} NNs with over 100 layers (2015),^{[HW1-3][R5]} and transformer-like^{[TR1-6][FWP]} attention^{[FWP][ATT]} through fast weight programmers (1991).^{[FWP0-2,6][DL1-2][R2-R8]} Often LBH failed to cite essential prior work, even in their later surveys.^{[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5, R7-R8]} This may explain some of ACM's misattributions.^{[T19]} See Sec. II & III & V & XIII & X & XVII & XII & XVIII & XX. By the 2010s,^{[DEC]} the deep NNs of our team were heavily used in academia and industry,^{[DL4]} driving the breakthroughs mentioned by ACM (labeled as A, B, C, D below): A. Speech Recognition. (A1) Long Short-Term Memory or LSTM (1990s-2005)^{[LSTM0-6]} overcomes the vanishing gradient problem, analyzed by my student Sepp Hochreiter in 1991.^{[VAN1]} This happened long before the similar work of Bengio (see Sec. XVII).^{[MIR](Sec. 3, Sec. 4)} LSTM was refined with my student Felix Gers^{[LSTM2]} through "forget gates" based on end-to-end-differentiable fast weights.^{[MIR](Sec. 8)[FWP,FWP0-1]} (A2) Connectionist Temporal Classification by my student Alex Graves et al. (2006).^{[CTC]} Our team successfully applied CTC-trained LSTM to speech in 2007^{[LSTM4]} (also with hierarchical LSTM stacks^{[LSTM14]}). This outperformed the traditional hybrids of NNs and Hidden Markov models (HMMs)^{[BW][BRI][BOU]} (Sec. XV). Hinton et al. (2012) still used the old hybrid approach^{[HYB12]} and did not compare it to CTC-LSTM.
(A3) In 2009, Alex Graves' CTC-trained LSTM became the first recurrent NN (RNN) to win international competitions. He later reused our end-to-end neural speech recognizer^{[LSTM4][LSTM14]} as a postdoc in Hinton's lab.^{[LSTM8]} CTC-LSTM dramatically improved Google's speech recognition.^{[GSR][GSR15][DL4]} Google's on-device speech recognition^{[GSR19]} (no longer on the server) is also based on LSTM^{[MIR](Sec. 4)} (see Sec. VI & XI & XV). B. Natural Language Processing. In 1995, we already had an excellent neural probabilistic model of text^{[SNT]} (see Sec. XVI). In 2001, we showed that LSTM can learn languages unlearnable by traditional models such as HMMs.^{[LSTM13]} See also Sec. VI & XI & XV. Soft attention mechanisms for NLP were later tailored by Bengio's team.^{[ATT14][FWP]} However, such attention mechanisms also have their roots in my lab (1991);^{[FWP][FWP0-2,6]} see Sec. XVI. C. Robotics & RL etc. Since 2003, our team has used LSTM for Reinforcement Learning (RL) and robotics.^{[LSTM-RL][RPG][LSTMPG]} In the 2010s, this led to highly visible applications. For example, in 2018, an LSTM trained by policy gradients (PG) was the core of OpenAI's famous Dactyl which learned to control a dexterous robot hand without a teacher.^{[OAI1][OAI1a]} In 2019, DeepMind beat a pro player in the game of Starcraft, which is theoretically harder than Chess or Go^{[DM2]} in many ways, using Alphastar whose brain has a deep LSTM core trained by PG.^{[DM3]} A PG-trained LSTM was also the core of OpenAI Five which learned to defeat human experts in the Dota 2 video game (2018).^{[OAI2]} Bill Gates called this a "huge milestone in advancing artificial intelligence".^{[OAI2a][MIR](Sec. 4)[LSTMPG]} Apart from A, B, C above, LSTM also found applications in healthcare, chemistry, molecular design, lip reading, speech synthesis,^{[AM16]} predicting what's going on in nuclear fusion reactors, and so on.^{[DEC][DL4]} By 2016, more than a quarter of the awesome computational power for inference in Google's datacenters was being used for LSTM (only 5% for the CNNs of Sec. D).^{[JOU17]} Apparently the first LSTM journal paper^{[LSTM1][R5]} is now the most cited deep learning research paper of the 20th century. D.
D. Computer Vision was revolutionized in the 2010s by a particular feedforward neural net (NN) called the convolutional NN (CNN).^{[CNN1-4]} The basic CNN architecture with convolutional and downsampling layers is due to Fukushima (1979),^{[CNN1]} who also introduced the now widely used rectified linear units (ReLUs) in 1969.^{[RELU1]} In 1987, NNs with convolutions were combined by Waibel with weight sharing and backpropagation.^{[CNN1a]} Waibel did not call this CNNs but TDNNs. The popular downsampling variant called max-pooling was introduced by Yamaguchi et al. for TDNNs in 1990^{[CNN3a]} and by Weng et al. for higher-dimensional CNNs in 1993.^{[CNN3]} Since 1989, LeCun's team has contributed improvements of CNNs, especially for images^{[CNN2,4]} (see Sec. XVIII). Finally, my own team showed in 2010^{[MLP1]} that unsupervised pre-training is not necessary to train deep NNs, contrary to claims by Hinton^{[VID1]} who said that "nobody in their right mind would ever suggest" this. Then we made CNNs both deep and fast. Our GPU-based CNN of 2011,^{[GPUCNN1]} known as DanNet,^{[DAN,DAN1][R6]} was much faster than the earlier GPU-accelerated CNNs of 2006.^{[GPUCNN]} DanNet became the first pure deep CNN to win computer vision contests, winning four of them in a row (15 May 2011, 6 Aug 2011, 1 Mar 2012, 10 Sep 2012).^{[GPUCNN5]} At IJCNN 2011 in Silicon Valley, DanNet blew away the competition and achieved the first superhuman visual pattern recognition^{[DAN1]} in an international contest (where LeCun's team took a distant second place). DanNet was also the first deep CNN to win a Chinese handwriting contest (ICDAR 2011) and an image segmentation contest (ISBI, May 2012). Our CVPR paper on DanNet^{[GPUCNN3]} appeared before the similar AlexNet of Hinton's student Krizhevsky won the ImageNet^{[IM09]} 2012 contest^{[GPUCNN4-5][R6]} (now also without unsupervised pre-training, citing DanNet). Our CNN image scanners were 1000 times faster than previous methods.^{[SCAN]} The VGG network (ImageNet 2014 winner)^{[GPUCNN9]} and other highly cited CNNs^{[RCNN1-3]} further extended the work of 2011.^{[MIR](Sec. 19)} ResNet, the ImageNet 2015 winner^{[HW2]} (Dec 2015) and currently the most cited neural network,^{[MOST]} is a version (with open gates) of our earlier Highway Net (May 2015).^{[HW1-3][R5]} The Highway Net is actually the feedforward net version of vanilla LSTM.^{[LSTM2]} It was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers). See also Sec. XVIII & XIV & XI & VI.
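The relation between LSTM-style gating, the Highway Net, and ResNet's "open gates" can be sketched in a few lines of numpy. This is a minimal illustrative sketch; the layer sizes, weight scales, and the strongly negative gate bias are my own assumptions, not values from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, W_t, b_t):
    """Highway layer with coupled gates: y = T*H(x) + (1 - T)*x,
    where the sigmoid gate T is the feedforward analogue of an LSTM gate."""
    H = np.tanh(W_h @ x)            # candidate transformation
    T = sigmoid(W_t @ x + b_t)      # transform gate in (0, 1)
    return T * H + (1.0 - T) * x

def residual_layer(x, W_h):
    """ResNet-style layer: y = H(x) + x. In the uncoupled highway form
    y = T*H(x) + C*x, this is the special case of open gates T = C = 1."""
    return np.tanh(W_h @ x) + x

# A strongly negative gate bias makes each layer start out close to the
# identity, so even a 100-layer stack initially passes its input through
# almost unchanged -- the property that makes very deep stacks trainable.
x = rng.normal(size=8)
W_h = 0.05 * rng.normal(size=(8, 8))
W_t = 0.05 * rng.normal(size=(8, 8))
y = x
for _ in range(100):
    y = highway_layer(y, W_h, W_t, b_t=-12.0)
print(np.max(np.abs(y - x)))  # small: the stack is near-identity at init
```

With coupled gates and a very negative gate bias, gradients can traverse hundreds of layers at initialization; fixing both gates open recovers the ResNet-style residual layer.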
The basic architectures of NNs appeared long before the 1980s. The first non-learning recurrent NN (RNN) architecture (the Lenz-Ising model) was analyzed by physicists in the 1920s.^{[L20][I25][K41][W45]} Neuron-like recurrent architectures were also discussed in 1943 by McCulloch and Pitts^{[MC43]} and formally analyzed in 1956 by Kleene.^{[K56]} In 1972, Amari reused the Lenz-Ising model to build a learning RNN, later sometimes called the Hopfield network or Amari-Hopfield Network.^{[AMH1-3]} Early learning machines go back further still: Turing wrote about artificial evolution,^{[TUR1]} and Rosenblatt's perceptron with a single adaptive layer learned in 1958^{[R58]} (compare Joseph^{[R61]}); Widrow & Hoff's similar Adaline learned in 1962.^{[WID62]} Such shallow learning is much older, dating back to linear regression and the method of least squares.^{[DL1-2]} Deeper multilayer perceptrons (MLPs) were discussed by Steinbuch^{[ST61-95]} (1961), Joseph^{[R61]} (1961), and Rosenblatt^{[R62]} (1962), who wrote about "back-propagating errors" in an MLP with a hidden layer,^{[R62]} but did not yet have a general deep learning algorithm for deep MLPs (what's now called backpropagation is quite different and was first published by Linnainmaa in 1970^{[BP1-BP5][BPA-C]}). Compare also Selfridge's multilayer Pandemonium^{[SE59]} (1959). In 1965, Ivakhnenko & Lapa published the first general, working learning algorithm for supervised deep feedforward MLPs with arbitrarily many layers (containing the now popular multiplicative gates).^{[DEEP1-2][DL1-2]} A paper of 1971^{[DEEP2]} already described a deep learning net with 8 layers, trained by their highly cited method which was still popular in the new millennium,^{[DL2]} especially in Eastern Europe, where much of Machine Learning was born.^{[MIR](Sec. 1)[R8]} LBH failed to cite this, just like they failed to cite Amari,^{[GD1]} who in 1967 proposed stochastic gradient descent^{[STO51-52]} (SGD) for MLPs and whose implementation^{[GD2,GD2a]} (with Saito) learned internal representations at a time when compute was billions of times more expensive than today (see also Tsypkin's work^{[GDa-b]}). Fukushima's deep convolutional NN architecture was first introduced in the 1970s;^{[CNN1]} his very popular ReLU already in 1969.^{[RELU1-2]} See also Sec. XIII, III, V, VIII, IX, and X, as well as the misleading "histories of deep learning" by LBH & co-authors, e.g., Sejnowski^{[S20]} (see Sec. XIII).
It goes more or less like this: "In 1969, Minsky & Papert^{[M69]} [showed the limitations of NNs, and the field lay dormant until] researchers took a fresh look at the problem in the 1980s."^{[S20]} However, as mentioned above, the 1969 book^{[M69]} addressed a "problem" of Gauss & Legendre's shallow learning (~1800)^{[DL1-2]} that had already been solved 4 years prior by Ivakhnenko & Lapa's popular deep learning method^{[DEEP1-2][DL2]} (and then also by Amari's SGD for MLPs^{[GD1-2]}). Minsky was apparently unaware of this and failed to correct it later.^{[HIN](Sec. I)} Deep learning research continued through the 1970s and 80s (see also a 1989 paper^{[MOZ]}). However, it became really deep in 1991 in my lab,^{[UN-UN3]} which has pursued it ever since. See Sec. 1 of the overview:^{[MIR]} First Very Deep NNs, Based on Unsupervised Pre-Training (1991). Our unsupervised pre-training made it possible to solve "Very Deep Learning" tasks of depth > 1000.^{[UN2][DL1][UN]} (By 2003, LSTM variants successfully dealt with language problems of depth up to 30,000^{[LSTM17]} and more.) I later drove the shift from unsupervised pre-training to purely supervised learning (1991-95; 2006-10).^{[HIN](Sec. II)[MIR](Sec. 19)} See also Sec. III. Note that LSTMs brought essentially unlimited depth to gradient-based supervised recurrent NNs; Highway Nets^{[HW1-3]} brought it to feedforward NNs.^{[MOST]}
The foundations were laid by others (Sec. III):^{[DLC][DEEP1-2][BP1][DL1-2][R7-R8][R2-R4]} deep learning multilayer perceptrons (1965),^{[DEEP1-2][R8]} stochastic gradient descent for multilayer perceptrons (1967),^{[GD1-3]} modern backpropagation (1970),^{[BP1,2][R7]} architectures of recurrent NNs (1925-56)^{[I25][MC43][K56]} and convolutional NNs (1979),^{[CNN1]} principles of generative adversarial NNs and artificial curiosity (1990),^{[AC90,90b][AC20]} unsupervised pre-training for deep NNs,^{[UN1-2]} the vanishing gradient problem (1991)^{[VAN1]} & solutions to it (Sec. A), GPU-accelerated NNs (2004),^{[GPUNN][GPUCNN5]} and other foundations.^{[DL1-2][R2-R8]} Often LBH failed to cite essential prior work.^{[DLC][HIN][MIR](Sec. 21)} See Sec. II & V & XIII & IX & X & XVII & XII & XVIII & XX & I. Compare the website deeplearning.net, which until 2019 advertised deep learning as "moving beyond shallow machine learning since 2006",^{[DL7]} referring to Hinton's^{[UN4]} and Bengio's^{[UN5]} unsupervised pre-training, although we had this type of deep learning already in 1991;^{[UN][UN1-2]} see Sec. II & XVII (5). Not to mention Ivakhnenko's even earlier supervised layer-wise training of deep NNs,^{[DEEP1-2]} which Hinton,^{[UN4]} Bengio,^{[UN5]} and LBH^{[DL3,DL3a]} did not cite either. See Sec. X.
In what follows, my comments systematically track the sequential order of ACM's claims.^{[T19]}
ACM's statement on Turing is greatly misleading, like some of its other statements.^{[T19]} Gödel (1931) identified fundamental limits of theorem proving, computing, and any type of computation-based AI.^{[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]} Much of early AI in the 1940s-70s was actually about theorem proving.^{[ZU48][NS56]}
In 1936, Turing introduced the Turing Machine.^{[TUR]} He rederived the above-mentioned result on the limits of computation.^{[CHU][TUR][HIN][GOD21,21a][TUR21][LEI21,21a]} In the same year of 1936, Emil Post published yet another independent universal model of computing.^{[POS]} (Compare my reply to Hinton, who criticized my website on Turing without suggesting any fact-based corrections.^{[HIN]}) Gödel also formulated what is now the famous open problem "P=NP?" in his letter to John von Neumann (1956).^{[GOD56][URQ10]} Likewise, Konrad Zuse (1910-1995) created the world's first working programmable general-purpose computer 1935-41. His patent application of 1936^{[ZU36-38][Z36][RO98][ZUS21]} already described digital circuits, predating Claude Shannon's 1937 thesis on digital circuit design.^{[SHA37]} Zuse also created the first high-level programming language in the early 1940s.^{[BAU][KNU]} (Compare the discussion of the conditional jump instruction.^{[RO98]})
The foundations were laid by others: deep learning multilayer perceptrons that learn internal representations (1965),^{[DEEP1-2][R8]} stochastic gradient descent for multilayer perceptrons (1967),^{[GD1-3]} modern backpropagation (1970),^{[BP1,2][R7]} architectures of recurrent NNs (1925-56)^{[I25][MC43][K56]} and convolutional NNs (1979),^{[CNN1]} principles of generative adversarial NNs and artificial curiosity (1990),^{[AC][AC90,90b][AC10][AC20]} unsupervised pre-training for deep NNs (1991),^{[UN1-2][UN]} vanishing gradients (1991)^{[VAN1]} & solutions to it (Sec. A),^{[LSTM0-17][CTC]} GPU-accelerated NNs (2004),^{[GPUNN][GPUCNN5]} record-breaking deep supervised NNs (2010)^{[MLP1-2]} and contest-winning deep CNNs (2011),^{[DAN][DAN1][GPUCNN5]} NNs with over 100 layers (2015),^{[HW1-3][R5]} transformer-like^{[TR1-6][FWP]} attention^{[FWP][ATT]} through fast weight programmers (1991),^{[FWP0-2,6]} and more.^{[DL1-2][R2-R8]} Often LBH failed to cite essential prior work.^{[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5,R7,R8,R11]} See Sec. II & I & III & XIII & X & XVII & XII & XVIII & XX.
The cited "advances in natural language processing" and in speech were largely based on our LSTM & CTC (see Sec. A & B), while breakthroughs in computer vision came through the fast deep supervised NNs and CNNs achieved by our group 2010-2011^{[MLP1-2][DAN][DAN1][GPUCNN5][R6]} and through Highway Net-like NNs (2015),^{[HW1-3][R5]} although the principles of CNNs were invented and developed by others since the 1970s.^{[CNN1-4]} See Sec. D & XVIII & XIV as well as Sec. 4 & Sec. 19 of the overview.^{[MIR]}
Baldi and Chauvin (1993) had the first application of CNNs with backpropagation to biomedical/biometric images.^{[BA93]} Our DanNet^{[DAN][DAN1][GPUCNN5]} was the first NN to win a medical imaging contest through deep learning (Sept 2012, on cancer detection).^{[GPUCNN5,8]} We were also able to greatly improve steel defect detection.^{[ST]} All of this happened before the similar GPU-accelerated AlexNet of Hinton's student Krizhevsky^{[GPUCNN4-5][R6]} and the VGG network.^{[GPUCNN9]} Such deep CNNs are now widely used for mitosis detection^{[MGC][GPUCNN5,8]} (compare the approach of Sec. D & XI).
Here, too, LBH built on essential prior work without citing it.^{[DL1][DLC][HIN][R2-R4][R7-R8]} See Sec. V & XII & XIX & II & III & XIII & XVII & X & I.
The relevant pioneers were ignored by LBH, who failed to cite them, even in later work.^{[HIN][DLC][DL1-2][DEEP1-2][RELU1-2][R7-R8]} See Sec. II & III & XIII & V & X & XIV & I.
The expression "deep learning" was first introduced to Machine Learning by Dechter (1986), and to NNs by Aizenberg et al (2000).^{[DL2]} To my knowledge, LBH have never cited them. (Margin note: our 2005 paper on deep RL^{[DL6,6a]} was apparently the first machine learning publication with the phrase "learn deep" in the title.) Only later did LBH start talking about "deep learning ... moving beyond shallow machine learning since 2006",^{[DL7]} referring to their unsupervised pre-training methods of 2006. See Sec. III. The importance of depth was recognized by others who built careers on this notion long before LBH.^{[DEEP1-2][CNN1][HIN][R8][DL1][DLC]} Even deep learning through unsupervised pre-training was introduced by others.^{[UN1-3][R4][HIN](Sec. II)} See Sec. II & III & XIII & V & I.
Much of this essential prior work was ignored by LBH's papers^{[HIN][R7-R8][R2-R5]} (see Sec. V & II & III & I & XIII & XII & XIX & X & XVII).
ACM correctly mentions advancements through GPUs. The first to use GPUs for NNs were Jung & Oh (2004).^{[GPUNN][GPUCNN5]} In 2010, our team made GPU-based NNs fast and deep enough to break an important benchmark record,^{[MLP1-2]} showing that unsupervised pre-training (pioneered by myself in 1991) is not necessary to train deep NNs, contrary to Hinton's claims.^{[VID1]} By 2011, our CNNs were deep and fast enough^{[DAN][DAN1][GPUCNN5]} to achieve superhuman performance in computer vision (explicitly mentioned by ACM) for the first time^{[R6]} (see Sec. D).
Furthermore, by the mid 2010s, speech recognition and machine translation (explicitly mentioned by ACM) were actually dominated by the LSTM and CTC of our team.^{[LSTM1-4][CTC]} In particular, as mentioned in Sec. A, this approach was very different from earlier hybrid methods based on models such as HMMs.^{[BW][BOU][BRI][HYB12]} As mentioned in Sec. B and XVI, the first superior end-to-end neural machine translation was also based on LSTM.
ACM's statement is "less wrong" than Honda's^{[HIN](Sec. I)} but still misleading. ACM (and apparently even other award committees^{[HIN](Sec. I)}) attribute backpropagation to Rumelhart et al. (1985-86),^{[RUM]} although it was applied to NNs by Werbos earlier (1982).^{[BP2]} And the article^{[RUM]} even failed to mention Linnainmaa, the inventor of this famous algorithm for credit assignment in networks (1970).^{[BP1]} By 1960, Kelley already had a precursor thereof in the field of control theory;^{[BPA]} see also later work of the early 1960s.^{[BPB][BPC]}^{[R7]} Rumelhart et al. demonstrated that backpropagation can learn useful internal representations in hidden layers of NNs.^{[RUM]} But this was essentially just an experimental analysis of a known method.^{[BP1-2]} More on the history of backpropagation can be found at Scholarpedia^{[DL2]} and in my award-winning survey.^{[DL1]} Also see Sec. XIX, II.
Some claim that "backpropagation is just the chain rule of Leibniz (1676) & L'Hopital (1696)." No, it is the efficient way of applying the chain rule to big networks with differentiable nodes (there are also many inefficient ways of doing this). It was not published until 1970.^{[BP1]} Compare a recent debate:^{[HIN]} It is true that in 2018, Hinton^{[AOI]} credited Rumelhart^{[RUM]} with the "invention" of backpropagation, after earlier accepting credit for "creating" the method and for other things he didn't do.^{[HIN]} But neither in a popular book^{[AOI]} nor in other recent work^{[DL3,DL3a]} did he cite Linnainmaa (1970),^{[BP1]} the true creator.^{[BP4-5]} It is true that his 2015 survey^{[DL3]} does cite Werbos (1974), who however described the method correctly only later in 1982^{[BP2]} and also failed to cite Linnainmaa.^{[BP1]} Compare the 1967-68 work of Amari:^{[GD1-3]} to my knowledge the first to propose and implement stochastic gradient descent^{[STO51-52]} for multilayer perceptrons (though not yet the efficient reverse mode gradient method now known as backpropagation^{[BP1]}); see also Tsypkin's work of 1966.^{[GDa-b]} By then, Linnainmaa's backpropagation method was well-known.^{[BP5][DL1-2][DLC]} It wasn't created by "lots of different people" as Hinton suggested,^{[AOI][HIN][R11]} but by one person who published first^{[BP1]} and therefore should get the credit.
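The distinction drawn above (the chain rule versus its efficient reverse-mode application) is easy to make concrete. Below is a minimal sketch of mine, not any historical implementation: a single backward sweep through a two-layer net yields all weight gradients at roughly the cost of one forward pass, and the result can be checked against a numerical derivative.

```python
import numpy as np

# Minimal reverse-mode chain rule on a 2-layer net (an illustrative
# sketch of the principle, not any historical implementation): one
# backward sweep yields the gradient of the loss w.r.t. every weight
# at roughly the cost of a single forward pass.
def forward(x, W1, W2):
    h = np.tanh(W1 @ x)             # hidden activations (stored)
    y = W2 @ h                      # linear output
    loss = 0.5 * np.sum(y ** 2)     # simple quadratic loss
    return h, y, loss

def backward(x, W1, W2, h, y):
    """Traverse the computation in reverse, reusing stored activations."""
    dy = y                          # d loss / d y
    dW2 = np.outer(dy, h)
    dh = W2.T @ dy
    dpre = dh * (1.0 - h ** 2)      # derivative of tanh
    dW1 = np.outer(dpre, x)
    return dW1, dW2

rng = np.random.default_rng(1)
x = rng.normal(size=3)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
h, y, loss = forward(x, W1, W2)
dW1, dW2 = backward(x, W1, W2, h, y)

# Sanity check of one entry against a numerical derivative.
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
numeric = (forward(x, W1p, W2)[2] - loss) / eps
print(abs(numeric - dW1[0, 0]))  # tiny: analytic and numeric agree
```

An "inefficient way of applying the chain rule" would be to repeat such a numerical or forward-mode pass once per weight; the reverse sweep gets all gradients in one pass, which is the whole point of the 1970 method.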
ACM mentions the Boltzmann Machine (BM)^{[BM]} as one of Hinton's contributions to learning.^{[HIN]} Recently, however, I learnt through a reader that even the BM paper^{[BM]} did not cite prior relevant work by Sherrington & Kirkpatrick^{[SK75]} and Glauber.^{[G63]} (Compare related work.^{[H86][H88][S93]}) And as mentioned in Sec. II, it was Ivakhnenko & Lapa who published the first working learning algorithm for multilayer perceptrons with arbitrarily many layers.^{[DEEP1-2][HIN]} See Sec. II & V & X.^{[MIR](Sec. 1)[R8]}
As mentioned in Sec. II, Sejnowski's rather self-serving "history of deep learning"^{[S20]} claims: "In 1969, Minsky & Papert^{[M69]} [showed the limitations of NNs, and the field lay dormant until researchers] took a fresh look at the problem in the 1980s."^{[S20]} However, the 1969 book^{[M69]} addressed a "deep learning problem" (a limitation of Gauss & Legendre's shallow learning around 1800^{[DL1-2]}) that had already been solved four years prior (see Sec. II). Deep learning research was also alive and kicking in the 1970s, especially outside of the Anglosphere.^{[DEEP2][GD1-3][CNN1][DL1-2]}
Dropout is actually a variant of Hanson's much earlier stochastic delta rule (1990).^{[Drop1-3]} Hinton's 2012 paper and his later patent did not cite this either. Moreover, dropout was not needed to win vision contests, as we showed already in 2011 in a contest where LeCun's team participated as well;^{[DAN1]} see Sec. D above. Back then, the really important thing was the speedup of deep CNNs through GPUs.^{[GPUCNN1,3,5][R6]} Already before ImageNet 2012,^{[R6]} our fast deep CNN called DanNet had a monopoly on winning computer vision competitions.^{[GPUCNN5]} It more than "halved the error rate for object recognition" (ACM's wording) in a contest already in 2011,^{[GPUCNN2][DAN,DAN1][R6]} long before the similar system of Hinton's student. See Sec. D as well as Sec. 19 of the overview.^{[MIR]}
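For readers unfamiliar with the technique under discussion: dropout, in its modern "inverted" form, just multiplies activations by a random binary mask and rescales the survivors. A minimal sketch of mine; the drop probability and layer size are illustrative assumptions:

```python
import numpy as np

# Sketch of dropout in its modern "inverted" form (drop probability and
# layer size are illustrative assumptions): zero each unit with
# probability p_drop and rescale survivors so the expected activation
# is unchanged, hence no extra rescaling is needed at test time.
def dropout(activations, p_drop, rng):
    mask = rng.random(activations.shape) >= p_drop  # keep with prob 1-p
    return activations * mask / (1.0 - p_drop)

rng = np.random.default_rng(4)
h = np.ones(100_000)                 # a layer of unit activations
out = dropout(h, p_drop=0.5, rng=rng)
print(out.mean())                    # ~1.0: expectation is preserved
```

The random per-unit perturbation of activations during training is the structural overlap with the earlier stochastic delta rule that the text refers to.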
Speech recognition had relied on hybrid NN-HMM approaches since the late 1980s.^{[BW][BRI][BOU]} The revolution of the 2010s, however, was based on our LSTM (1990s-2005)^{[LSTM0-6]} and CTC^{[CTC]} (2006), which were applied to speech in 2007.^{[LSTM4][LSTM14]} CTC-LSTM is end-to-end-neural and thus very different from (and superior to) the hybrid methods used since the late 1980s.^{[BW][BRI][BOU][HYB12]} See also Sec. A.
Regarding neural language models: 5 years earlier, in 1995, we already had a similar, excellent neural probabilistic text model,^{[SNT]} which Bengio^{[NPM]} characterizes only briefly as "related" (see also Pollack's earlier work on embeddings of words and other structures^{[PO87][PO90]}). In the 2010s, the workhorse of natural language processing was actually the LSTM of our team,^{[LSTM0-6]} which Bloomberg called "arguably the most commercial AI achievement."^{[AV1][MIR](Sec. 4)} See Sec. B. The attention mechanism of Bengio's team^{[ATT14]} has indeed become important. For example, it helped to further improve Facebook's LSTM-based translation (see Sec. B). However, we already had both variants of adaptive neural sequential attention decades ago: end-to-end-differentiable "soft" attention in the latent space of Fast Weight Programmers (FWPs),^{[FWP2][FWP]} and "hard" attention (in observation space) in the context of RL^{[ATT][ATT0-1]} (1990). Today's attention-based Transformers,^{[TR1-6]} which have become a popular alternative to RNNs, build on the same principle as my FWPs of 1991.^{[FWP0-1]} My FWP of 1991^{[FWP0-1]} computes fast weight changes through additive outer products of self-invented activation patterns (now often called keys and values for self-attention).^{[TR1-6][FWP]} In the 2010s,^{[DEC]} Transformers^{[TR1-2]} excelled at natural language processing, a traditional LSTM domain (see Sec. B). Still, there are tasks that Transformers cannot solve but LSTM can rapidly learn to solve.^{[LSTM13,17]} Compare the linear Transformers or Performers^{[TR5-6]} which are formally equivalent to my 1991 FWPs (apart from normalization).^{[FWP6][FWP]} In 1993, I introduced the attention terminology^{[FWP2]} now used in this context,^{[ATT]} and RNNs that program themselves.
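The formal correspondence claimed here between 1991-style Fast Weight Programmers and (unnormalized) linear Transformers can be sketched in a few lines. The toy numpy code below is my own illustration under simplifying assumptions (no softmax or normalization, keys/values/queries given externally): programming a fast weight matrix through additive outer products and then querying it computes exactly causal linear attention, just with the sums reordered.

```python
import numpy as np

# Toy sketch of the claimed equivalence (my own illustration, under
# simplifying assumptions: no softmax/normalization, keys/values/queries
# given externally). A Fast-Weight-Programmer-style memory is written
# via additive outer products and read by matrix-vector products; this
# computes exactly causal linear attention with the sums reordered.
def fwp_attend(keys, values, queries):
    d_v, d_k = values.shape[1], keys.shape[1]
    W_fast = np.zeros((d_v, d_k))          # the fast weight matrix
    outs = []
    for k, v, q in zip(keys, values, queries):
        W_fast += np.outer(v, k)           # "program" the fast weights
        outs.append(W_fast @ q)            # retrieve with current query
    return np.array(outs)

def linear_attention(keys, values, queries):
    # out_t = sum_{s <= t} (q_t . k_s) * v_s
    outs = []
    for t, q in enumerate(queries):
        scores = keys[: t + 1] @ q
        outs.append(scores @ values[: t + 1])
    return np.array(outs)

rng = np.random.default_rng(2)
K = rng.normal(size=(5, 4))   # keys
V = rng.normal(size=(5, 3))   # values
Q = rng.normal(size=(5, 4))   # queries
print(np.allclose(fwp_attend(K, V, Q), linear_attention(K, V, Q)))  # True
```

Both functions return identical outputs because sum_{s<=t} (v_s k_s^T) q_t = sum_{s<=t} (k_s . q_t) v_s; the fast-weight view is simply the recurrent, constant-memory formulation of the same attention sum.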
See^{[MIR](Sec. 9)[R4]} for my related priority dispute on attention with Hinton. He was the reviewer of my 1990 paper on attention,^{[ATT2]} yet two decades later published closely related work of his own^{[ATT3]} without citing it.
GANs^{[GAN0-1]} (2010-2014) are actually a simple application^{[AC]} of the adversarial curiosity (AC) principle from 1990^{[AC90,90b][AC20]} (see also surveys^{[AC09-10]}). This principle is now widely used for exploration in RL (e.g., Sec. C) and for image synthesis^{[GAN1]} (also mentioned by ACM in Sec. XVIII). In AC, a predictor NN minimizes its error, while the generator NN tries to make outputs that maximize this error: one net's loss is the other net's gain. 4 years before the GAN paper,^{[GAN1]} a well-known 2010 survey^{[AC10]} summarised the generative adversarial NNs of 1990 as follows: a neural network acting as a predictive world model is used to maximize the controller's intrinsic reward, which is proportional to the model's prediction errors. The GAN setting is a special case in which the environment simply returns whether the controller's (or generator's) output is in a given set.^{[AC20][AC]} (Other early adversarial machine learning settings^{[S59][H90]} neither involved unsupervised NNs nor were about modeling data nor used gradient descent.^{[AC20]}) Bengio et al. neither cited the original work^{[AC90,90b][AC20]} nor corrected their erroneous claims^{[GAN1]} about my other adversarial technique, predictability minimization (PM, 1991).^{[PM1-2][AC20][R2][MIR](Sec. 5)} Bloomberg^{[AV1]} reported on their NIPS 2014 paper^{[GAN1]} and some of the erroneous claims it made about my prior work.^{[AC20]} Goodfellow eventually admitted that PM is adversarial (his paper^{[GAN1]} still claims the opposite), but emphasized that it's not generative. However, the even earlier AC^{[AC90,90b][AC10][AC20]} is both adversarial and generative (its generator contains probabilistic units^{[AC90]} like in StyleGANs^{[GAN2]}). When the authors^{[GAN1]} did not publish a correction, I published one myself in the hopes of correcting the annals of history,^{[AC20]} showing that GANs are instances of my earlier work.^{[R2][AC20]} Similarly, after my student Sepp Hochreiter had analyzed the vanishing gradient problem,^{[MIR](Sec. 3)[VAN1]} Bengio published his own, later treatment,^{[VAN2]} without citing Sepp.
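The adversarial principle described above (the predictor minimizes the very error the generator maximizes) can be made concrete with a deliberately tiny numeric sketch. All particulars below (a linear predictor, a scalar "world" y = x², hand-picked step sizes) are illustrative assumptions of mine, not the 1990 setup:

```python
import numpy as np

# Deliberately tiny numeric sketch of the adversarial principle
# (illustrative assumptions: linear predictor, scalar "world" y = x**2,
# hand-picked step sizes). The predictor does gradient DESCENT on its
# squared error; the generator does gradient ASCENT on the very same
# quantity -- one net's loss is the other net's gain.
def predict(w, x):
    return w[0] * x + w[1]                   # linear world model

def sq_err(w, x):
    return (predict(w, x) - x ** 2) ** 2     # shared zero-sum objective

w = np.array([0.0, 0.0])                     # predictor parameters
x = 1.5                                      # generator's proposed input

# Generator step: seek inputs where the model is wrong (intrinsic reward).
err = predict(w, x) - x ** 2
x_new = x + 1e-3 * 2 * err * (w[0] - 2 * x)  # small ascent step on sq_err
assert sq_err(w, x_new) > sq_err(w, x)       # generator gained

# Predictor step: reduce the error at the point the generator chose.
feats = np.array([x_new, 1.0])
err = predict(w, x_new) - x_new ** 2
w_new = w - 0.01 * 2 * err * feats           # small descent step
assert sq_err(w_new, x_new) < sq_err(w, x_new)  # predictor recovered
```

Iterating these two opposing steps drives the generator toward regions the model has not yet learned, which is the exploration behavior the text describes; the GAN special case replaces the world model's prediction error with a real/fake discrimination error.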
That dispute was settled in favor of Sepp.^{[VAN1]} However, even after a common publication,^{[VAN3]} Bengio published papers^{[VAN4][XAV]} without citing him. Citation counts are poor indicators of truly pioneering work.^{[NAT1]} (Margin note: Bengio states^{[YB20]} that in 2018 he collected more new citations than any other computer scientist; but if one builds on someone else's idea, one must at least clarify this later.^{[DLC]}) Bengio also claims^{[YB20]} priority for deep learning with unsupervised pre-training, although my publications on exactly this topic date back to 1991-93.^{[UN0-2][UN]} The same holds for meta-learning, which I started in 1987,^{[META1][META]} long before Bengio, who nevertheless suggested that he did it before me.^{[R3]} Regarding attention-based Transformers,^{[TR1-6]} Bengio^{[DL3a]} cites his own team (2014) for "soft attention" without citing my much earlier original work of 1991-1993 on soft attention and linear Transformers.^{[FWP,FWP0-2,6]} Bengio has also heavily used our LSTM (see Sec. A-C), but introduced the name "gated recurrent units (GRU)"^{[LSTMGRU]} for a variant of our vanilla LSTM architecture^{[LSTM2]} (2000), which he did not cite, although our work^{[LSTM2]} was the one that introduced gated recurrent units. In addition, our team automatically evolved lots of additional LSTM variants and topologies already in 2009^{[LSTM7]} without changing the name of the basic method. (GRUs can neither learn to count^{[LSTMGRU2]} nor learn simple non-regular languages;^{[LSTMGRU2]} they also trail LSTM in translation, according to Google Brain.^{[LSTMGRU3]}) ACM also credits Bengio and Hinton for unsupervised pre-training for deep NNs.^{[UN0-4][HIN](Sec. II)[MIR](Sec. 1)} But Hinton's paper^{[UN4]} (2006) appeared long after my earlier work on this,^{[UN0-2]} which yielded the first NNs shown to solve very deep problems (see Sec. II above).^{[UN]} It was published in 1991-92,^{[UN1]} when compute was about 1000 times more expensive than in 2006. This went uncited even in the later survey (2015).^{[DL3][DLC]} See also Sec. II & III. Related priority issues concern compressing or distilling one NN into another:^{[UN0-2][DIST1-2][MIR](Sec. 2)} Hinton^{[DIST2]} did not cite my much earlier original work on this (1991),^{[UN1][UN]} not even in his later patent application. The same holds for fast weight programmers^{[FWP][FWP0-4a]} through tensor-like outer products (1991-2016) and their motivation^{[FWP2][FWP4a][MIR](Sec. 8)} (see also Sec. XVI above), and for learning sequential attention with NNs.^{[MIR](Sec. 9)} Hinton^{[ATT3]} (2010) did not cite our much earlier work on this,^{[ATT1][ATT]} although he was both reviewer and editor of my summary^{[ATT2]} (1990; see Sec. XVI above).
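For concreteness, the compressing/distilling of one NN into another mentioned above can be sketched as follows. This is a minimal illustration under my own assumptions (a frozen random "teacher" net, a linear "student" fit by least squares to the teacher's outputs), not the procedure of any cited paper:

```python
import numpy as np

# Minimal illustration of compressing/"distilling" one NN into another
# (my own assumptions: a frozen random teacher net, a linear student
# fit by least squares to the teacher's outputs on unlabeled inputs --
# not the procedure of any cited paper).
rng = np.random.default_rng(3)
W1 = rng.normal(size=(16, 2))
W2 = rng.normal(size=(1, 16))

def teacher(x):
    """The 'big' frozen network whose behavior we want to compress."""
    return W2 @ np.tanh(W1 @ x)

X = rng.normal(size=(200, 2))                 # unlabeled inputs
T = np.array([teacher(x)[0] for x in X])      # teacher outputs as targets

# Student: a small linear model (with intercept) imitating the teacher.
A = np.hstack([X, np.ones((200, 1))])
coef, *_ = np.linalg.lstsq(A, T, rcond=None)
imitation_mse = np.mean((A @ coef - T) ** 2)
print(imitation_mse)  # never worse than predicting the teacher's mean
```

The defining move, common to the 1991 and later formulations, is that the student is trained on the teacher's outputs rather than on original labels; here the student is deliberately much smaller than the teacher.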
The ten priority disputes mentioned in the present Sec. XVII are not the only ones.^{[R4]} Remarkably, three of them are related to the 1991 paper^{[UN1][UN]} which in many ways started what people now call deep learning, going beyond earlier work. Most of them go back to work of 1990-91.^{[MIR]} See Sec. I for additional related issues of credit assignment.
LeCun's team has made important contributions to CNNs since 1989.^{[CNN2,4]} However, the basic CNN architecture with convolutional and downsampling layers is actually due to Fukushima (1979).^{[CNN1]} NNs with convolutions were later (1987) combined by Waibel with weight sharing and backpropagation.^{[CNN1a]} Waibel called this TDNN rather than CNN. All of this happened before LeCun's work on CNNs. See Sec. D above and Sec. 21 of the overview of our Annus Mirabilis 1990-1991.^{[MIR]} As mentioned in Sec. D, at IJCNN 2011 in Silicon Valley, our DanNet^{[DAN][GPUCNN1-3]} won the vision contest with superhuman performance (LeCun's team took a distant second place, with three times worse performance).^{[DAN1]} Again see Sec. D. Baldi and Chauvin (1993) had the first application of CNNs with backpropagation to biomedical/biometric images.^{[BA93]} And at ICPR 2012, our DanNet^{[GPUCNN1-3]} won the medical imaging contest (Sept 2012, on detection of mitosis/cancer)^{[GPUCNN5,7,8]} (before the similar AlexNet won ImageNet 2012^{[GPUCNN5][R6]} and the similar VGG network^{[GPUCNN9]} won ImageNet 2014). Many major companies are now using such deep CNNs for mitosis detection.^{[MGC][GPUCNN5,7,8]} See Sec. D & VII. ACM also explicitly mentions speech recognition and speech synthesis.^{[AM16][DL1]} All of these fields were heavily shaped in the 2010s by our non-CNN methods.^{[DL1][DL4][AM16][GSR][GSR15][GT16][WU][FB17]} See Sec. A, B, VI, XI.
As mentioned in Sec. XII, backpropagation was actually proposed earlier as a learning method for NNs by Werbos (1982)^{[BP2-4]} (see also Amari's work on SGD for MLPs of 1967-68^{[GD1-2a]}), and first published by Linnainmaa (1970),^{[BP1]} whom LeCun did not cite, even in recent work.^{[DL3,DL3a][DLC]} In 1960, Kelley already had a precursor of the algorithm.^{[BPA]} Furthermore, many besides LeCun have worked "to speed up backpropagation algorithms"^{[DL1]} (ACM's wording). More on the history of backpropagation can be found at Scholarpedia^{[DL2]} and in my survey.^{[BP4]}
However, "hierarchical feature representation" in deep learning networks is what Ivakhnenko & Lapa (1965)^{[DEEP1-2]} and Amari^{[GD1-2]} (and also Fukushima^{[CNN1][DL2]}) had long before LeCun. See Sec. D & II & XIII & V.
LeCun et al. neither cited the origins^{[BP1]} (1970) of this widely used type of automatic differentiation for differentiable networks of modules^{[DL2][BP4-5][DLC]} nor the early related work on such systems.^{[S80]} See also Sec. XIX & XII. Others published relevant ideas on differentiable systems of modules before LeCun, who did not cite them. See also Pollack's even earlier relevant work;^{[PO87-90]} compare the important work of Baldi and colleagues.^{[BA96-03]}
(Furthermore, "complex networks of modules where backpropagation is performed" were the central theme of my much earlier habilitation thesis (1993).^{[UN2]} For example, our adaptive subgoal generators (1991)^{[HRL0-2]} were trained through end-to-end-differentiable chains of such modules,^{[MIR](Sec. 10)} and so was our planning and reinforcement learning with recurrent neural world models (1990).^{[PLAN][MIR](Sec. 11)} Same for my linear transformer-like fast weight programmers^{[FWP0-2][FWP][ATT][MIR](Sec. 8)} since 1991; see Sec. XVI.) Concerning attempts at discrediting inconvenient messengers: see "100 Authors against Einstein"^{[AH1]} and other ad hominem attacks^{[AH2-3][HIN]} of the type "If you cannot dispute a fact-based message, attack the messenger himself."^{[HIN]} Science has a well-established way of dealing with plagiarism (which may be unintentional^{[PLAG1][CONN21]} or not^{[FAKE2]}), and no award can ever change that.^{[HIN]} LBH and their co-workers have contributed useful improvements of deep learning methods.^{[CNN2,4][CDI][LAN][RMSP][XAV][ATT14][CAPS]} However, they built on the work of pioneers whom they did not cite, in contrast to ACM's Code of Ethics and Professional Conduct^{[ACM18]} (see Sec. II, V, XII, XIX, XXI, XIII, XIV, XI, and XX, as well as Sec. I, A, B, C, D, XVII, VI, and XVI). As emphasized earlier,^{[DLC][HIN]} our field should commit "to self-correction,"^{[SV20]} as is already the standard in other scientific fields. How can the public know who did what if the claims are presented in popular science venues without peer review? For example, the narrator of a popular 2018 Bloomberg video^{[VID2]} credited Hinton for modern speech recognition, although it was based on methods from our labs in Germany and Switzerland (LSTM & CTC; see Sec. A), developed long before Hinton's methods. Similarly, in 2016, the NY Times published a misleading article^{[NYT3]} on this topic, although Google's original 2016 paper on Google Translate^{[WU]} mentions LSTM over 50 times (see Sec. B). In ad hominem style,^{[AH2-3]} LeCun claimed that I keep "claiming credit he doesn't deserve for many, many things",^{[NYT1]} without backing this up by facts. LeCun also lavishly praised the GANs of Bengio's team,^{[GAN1]} although GANs are variations of my work in 1990.^{[AC90,90b][AC20][R2]} According to Bloomberg,^{[AV2]} Bengio has simply "denied my claims" without backing up his denial by any facts; see Sec. XVII.
One should correct misinformation "... and forcefully contradict public figures who promote it."^{[FAKE]} LBH, who called themselves the deep learning conspiracy,^{[DLC][DLC1-2]} are certainly highly cited. But our LSTM paper^{[LSTM1]} has got more citations than any paper by Bengio or LeCun,^{[R5]} and the most cited LBH papers build on uncredited prior work. Hinton's most cited paper (2012) is the one on GPU-based CNNs.^{[GPUCNN4][R5]} It follows our earlier work on supervised deep NNs (2010),^{[MLP1]} which ended the reign of unsupervised pre-training for deep NNs (introduced by myself^{[UN][UN0-3]} and later championed by Hinton;^{[UN4][VID1]} see Sec. D). Hinton (2012)^{[GPUCNN4]} characterizes our deep and fast DanNet (2011)^{[GPUCNN1-3]} as similar; DanNet had already won four contests before AlexNet won one;^{[R6]} see Sec. D, XIV. The highly cited VGG network (2014)^{[GPUCNN9]} also extended this line of work. Hinton's 2nd most cited paper^{[RUM][R5]} is the one on backpropagation (its count boosted by citations for a book by Rumelhart & McClelland^{[R5]}). But backpropagation is a previously invented method,^{[BP1]} and the underlying deep learning goes back to Ivakhnenko, whom Hinton has never cited;^{[DEEP1-2][R7-R8]} see Sec. II, XIII. Bengio's 2nd most cited research paper is the one on GANs (2014),^{[GAN1]} which are instances of my artificial curiosity (1990)^{[AC90,90b][AC20][R2]} which he did not cite; see Sec. XVII. Hinton's highly cited papers on unsupervised pre-training for deep NNs (2006-)^{[UN4]} were preceded by ours,^{[UN0-2][UN]} and his papers on dropout were preceded by Hanson's stochastic delta rule.^{[Drop1-3]} As recently as 2021, ACM published yet another misleading deep learning "survey" by LBH,^{[DL3a]} again heavily citing LBH without citing the pioneers. Consult the Executive Summary and Sec. I-XXI of this critique for more. So virtually all the algorithms that have attracted massive attention in modern deep learning have their conceptual and technical roots in my labs in Munich and Lugano,^{[MOST]} apart from the foundations of deep learning MLPs since 1965^{[DEEP1-2][GD1-2a]} (see Sec. II, XX), backpropagation (1960-70)^{[BPA][BP1]} (see Sec. XIX, XII), and convolutional NNs since 1979^{[CNN1-4]} (see Sec. XVIII, D). Our LSTM (1990s, see Sec. A, B; also for RL, 2003-, see Sec. C) → our Highway Net (May 2015) → ResNet (Dec 2015, see Sec. D).
Our adversarial Artificial Curiosity (1990) → GANs (2010s, see Sec. XVII). Our own unsupervised pre-training of deep NNs (1991, see Sec. II & III) was eventually supplanted by pure supervised learning: for recurrent NNs in the 1990s → our LSTM (see Sec. A-C); for feedforward NNs in 2010 → our DanNet (2011) → AlexNet (2012); VGG Net (2014) (see Sec. D). Our LSTM brought essentially unlimited depth to gradient-based supervised recurrent NNs in the 1990s; our Highway Nets^{[HW1-3]} brought it to feedforward NNs in May 2015.^{[MOST]} DanNet led to superior computer vision (2011, see Sec. D, XVIII), medical diagnosis (2012, see Sec. VII, XVIII), and many other applications.^{[DEC]} LSTM & CTC led to superior speech recognition (2007-15, see Sec. A), machine translation (2016, see Sec. B), robotics & video game players (2018-19, see Sec. C), and many other applications.^{[DEC]} Our Fast Weight Programmers (1991, see Sec. XVI) are formally equivalent to linear Transformers (now popular in NLP). See Sec. I, A, B, C, D, VII, XVIII.
As mentioned earlier,^{[MIR](Sec. 21)} it is not always clear^{[DLC]} whether later authors knew of the earlier work. The first multilayer perceptrons of arbitrary depth that really learned date back to 1965.^{[DEEP1-2][R8]} Soon afterwards, multilayer perceptrons learned internal representations through stochastic gradient descent in Japan.^{[GD1-2a]} A few years later, modern backpropagation was published (1970). Plagiarism may be unintentional^{[PLAG1][CONN21]} or intentional.^{[FAKE2]}
Yes, this critique is also an implicit critique of certain other awards to LBH.^{[HIN]} Numerous relevant threads can be found at reddit.com/r/MachineLearning^{[R1-R12]} (the largest machine learning forum, with back then over 800k subscribers), many of them influenced by my overview.^{[MIR]}
Dr. LeCun himself is well aware of the challenges to scientific integrity in our field:^{[LECP]} "... else cites."^{[LECP]} As mentioned in Sec. II, Rosenblatt (1962) already had MLPs whose first layer had fixed randomized weights and an adaptive output layer.^{[R62]} So Rosenblatt basically had what much later was rebranded as Extreme Learning Machines (ELMs),^{[ELM1]} without proper credit. The revisionist narrative of ELMs^{[ELM2][CONN21]} resembles that of the self-proclaimed "deep learning conspiracy".^{[DLC1-2]}
Note that I am insisting on proper credit assignment not only in my own research field but also in quite disconnected areas,^{[HIN]} as demonstrated by my numerous letters in this regard published in Science and Nature, e.g., on the history of aviation,^{[NASC1-2]} the telephone,^{[NASC3]} the computer,^{[NASC4-7]} resilient robots,^{[NASC8]} and scientists of the 19th century.^{[NASC9]} One day, AI scientists and AI historians equipped with artificial curiosity^{[SA17][AC90-AC20][PP-PP2][R1]} may help to set the record straight.
Many thanks for useful comments. Most of the references below can be found via my publication page and my arXiv page. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. The first paper on planning with reinforcement learning recurrent neural networks (NNs) and on generative adversarial networks; with a brief summary of the generative adversarial neural networks of 1990.^{[AC90,90b][AC20]} (More on artificial scientists and artists.) Preprint arXiv/1906.04493. ACM Code of Ethics and Professional Conduct. Association for Computing Machinery (ACM), 2018. [AIB] J. Schmidhuber. AI Blog. Includes variants of chapters of the AI Book. Blog of Werner Vogels, CTO of Amazon (Nov 2016). First publication (by Amari, 1972^{[AMH1]}) of what was later sometimes called the Hopfield network^{[AMH2]} or Amari-Hopfield Network.^{[AMH3]} [ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention, plus goal-conditional reinforcement learning. We had both hard attention (1990) and soft attention (1991-93).^{[FWP]} Today, both types are very popular. arXiv/1409.0473, 2014-16. Bloomberg, May 15, 2018. Bloomberg, May 17, 2018. Precursor of modern backpropagation.^{[BP1-4]} First application of backpropagation^{[BP1]} to NNs (concretizing thoughts in his 1974 thesis). [BP4] J. Schmidhuber (AI Blog, 2014; updated 2020). Who invented backpropagation? More in Scholarpedia.^{[DL2]} English version: [CNN1+]. [CNN1a] A. Waibel. Phoneme Recognition Using Time-Delay Neural Networks. Meeting of IEICE, Tokyo, Japan, 1987. First application of backpropagation^{[BP1][BP2]} and weight-sharing to a convolutional architecture.
Spatial Averaging.^{[CNN1]} PDF. PDF. PDF. Since November 2021: Comments on version 1 of the present report^{[T21v1]} in the Connectionists Mailing List, perhaps the oldest mailing list on artificial neural networks. Link to the archive. PDF. Beijing, 2014. Preprint arXiv:1402.3511 [cs.NE]. J. Schmidhuber (AI Blog, 2021). 10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named 1st superhuman result in 2011.^{[DAN1]} J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition. our artificial neural network called DanNet [DEC] J. Schmidhuber (AI Blog, 02/20/2020, revised 2021). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The [DIST1] J. Schmidhuber, 1991.^{[UN-UN2]} More. Deep Learning. HTML. [DL3a] Y. Bengio, Y. LeCun, G. Hinton (2021). Turing Lecture: Deep Learning for AI. Communications of the ACM, July 2021. HTML. Local copy (HTML only). [DL4] J. Schmidhuber (AI Blog, 2017). Our impact on the world's most valuable public companies: Apple, Google, Microsoft, Facebook, Amazon... By greatly improved (CTC-based) on-device speech recognition (on the phone, not the server) LSTM. PDF. J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.^{[DL6]} Soon after its publication, everybody started talking about "deep learning." Causality or correlation? Web site deeplearning.net of Y. Bengio's MILA (2015, retrieved May 2020; compare the version in the Internet Archive), referring to Hinton's^{[UN4]} and Bengio's^{[UN5]} unsupervised pre-training for deep NNs^{[UN4]} (2006) although this type of deep learning dates back to 1991.^{[UN1-2][UN]} II & XVII & III. [DLC] J. Schmidhuber (AI Blog, June 2015). 
Critique of Paper by self-proclaimed^{[DLC1-2]} "Deep Learning Conspiracy" (Nature 521 p 436). arxiv:1312.5602. Link. Alphastar has a "deep LSTM core." arXiv:1808.03578, 2018. In fact, the ELM concept goes back to Rosenblatt's work around 1960.^{[R62]} used LSTM over 4 billion automatic translations per day (The Verge, August 4, 2017); Facebook blog by J.M. Pino, A. Sidorov, N.F. Ayan (August 3, 2017) PDF. J. Schmidhuber (AI Blog, 26 March 2021). alternative^{[FWP0-1]} to recurrent NNs: a slow NN learns to program the fast weights^{[FAST,FASTa]} of another NN. Such Fast Weight Programmers^{[FWP0-6,FWPMETA1-7]} can learn to memorize past data, e.g., by computing fast weight changes through additive outer products of self-invented activation patterns^{[FWP0-1]} (now often called keys and values for self-attention^{[TR1-6]}). The similar Transformers^{[TR1-2]} combine this with projections and softmax, while linear Transformers or Performers^{[TR5-6]} are formally equivalent to the 1991 Fast Weight Programmers. In 1993, I introduced the attention terminology^{[FWP2]} now used in this context,^{[ATT]} and RNNs that program themselves. PDF. PDF. HTML. Pictures (German). PDF. Preprint: arXiv:1811.12143. PDF. PDF. Like [FWP0-2]. Preprint: arXiv:2003.08165. PDF. HTML overview. Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021. Preprint: arXiv:2102.11174. Preprint: arXiv:2106.06295 (June 2021). PDF. An introspective network that can learn to run its own weight change algorithm. In Proc. of the Intl. Conf. on Artificial Neural Networks, J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF. can be found here. Preprint arXiv:2012.14905 [cs.LG], 2020. Report arXiv:2011.07831 [cs.AI], 2020. PDF. Probably the first paper on using stochastic gradient descent^{[STO51-52]} reverse mode of automatic differentiation or backpropagation^{[BP1]}). OCR-based PDF scan of pages 94-135 (see pages 119-120). Implementation of Amari's 1967 stochastic gradient descent method for multilayer perceptrons.^{[GD1]} (S. Amari, personal communication, 2021.) 
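The additive outer-product update described above (key and value patterns programming a fast weight matrix, retrieval by matrix-vector product) can be sketched in a few lines. This is a minimal illustration with invented names (`FastWeightMemory`, `write`, `read`), not code from the 1991 papers:

```python
# Minimal sketch of the fast weight update: a "key" pattern and a
# "value" pattern program a fast weight matrix W via an additive outer
# product; retrieval is a plain matrix-vector product (cf. linear
# self-attention). Names are invented for illustration.

def outer(v, k):
    # outer product: v * k^T
    return [[vi * kj for kj in k] for vi in v]

def matvec(W, x):
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

class FastWeightMemory:
    def __init__(self, dim):
        self.W = [[0.0] * dim for _ in range(dim)]  # fast weights start at 0

    def write(self, key, value):
        # W <- W + value * key^T  (the additive outer-product rule)
        delta = outer(value, key)
        self.W = [[w + d for w, d in zip(rw, rd)]
                  for rw, rd in zip(self.W, delta)]

    def read(self, query):
        # retrieval: W @ query; a query aligned with a stored key
        # returns the associated value (scaled by key . query)
        return matvec(self.W, query)

mem = FastWeightMemory(3)
mem.write(key=[1.0, 0.0, 0.0], value=[0.0, 2.0, 0.0])
mem.write(key=[0.0, 1.0, 0.0], value=[5.0, 0.0, 0.0])
print(mem.read([1.0, 0.0, 0.0]))  # [0.0, 2.0, 0.0]
```

With orthonormal keys, each read returns exactly the stored value; Transformer-style attention adds learned query/key/value projections and a softmax on top of this additive rule.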
Google Research Blog, Sep 2015, see also Aug 2015 Google's speech recognition based on CTC and LSTM. Alphr Technology, Jul 2015, or 9to5google, Jul 2015 WIRED, Sep 2016, siliconANGLE, Sep 2016 Blog post, Internet Archive, 2010. A blog post describing the basic ideas^{[AC][AC90, AC90b][AC20]} of GANs. Description of GANs that does not cite the original work of 1990^{[AC][AC90,AC90b][AC20][R2]} (also containing wrong claims about Predictability Minimization^{[PM0-2][AC20]}). Link. This was number 1 on Hacker News. Frankfurter Allgemeine Zeitung, 16/6/2021. Preprint arXiv/2005.14165. for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011. PDF. ArXiv preprint. win four important computer vision competitions 2011-2012 before others won any PDF. HTML overview. competitor.^{[DAN1]} This led to massive interest from industry. [GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More. PDF. J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet won 4 of them in a row before the similar AlexNet/VGG Net and the Resnet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision. PDF. PDF. first deep learner to win a medical imaging contest (2012). HTML. [HIN] J. Schmidhuber (AI Blog, 2020). Critique of Honda Prize for Dr. Hinton. Science must not allow corporate PR to distort the academic record. PDF. North-Holland, 1991. PDF. Extending TR FKI-129-90, TUM, 1990. PDF. PDF. Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS 2015. The LSTM with forget gates^{[LSTM2]} for RNNs.) 
Resnets^{[HW2]} are a version of this where the gates are always open: g(x)=t(x)=const=1. Highway Nets perform roughly as well as ResNets^{[HW2]} on ImageNet.^{[HW3]} Highway layers are also often used for natural language processing, where the simpler residual layers do not work as well.^{[HW3]} More. Link. arXiv:1512.03385 (Dec 2015). Residual nets are a version of Highway Nets^{[HW1]} More. arxiv:1612.07771 (2016). Also at ICLR 2017. Preprint arXiv:1704.04760 PDF. PDF. arXiv:1607.06450, 2016. A New Publishing Model in Computer Science. Local copy (HTML only). [LEI21] J. Schmidhuber (AI Blog, 2021). 375th birthday of Leibniz, founder of computer science. Frankfurter Allgemeine Zeitung (FAZ), 17/5/2021. FAZ online: 19/5/2021. PDF. [LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF. Based on [LSTM0]. More. PDF. PDF. PDF. PDF. PDF. PDF. PDF. PDF. PDF. PDF. PDF. Preprint: arxiv:1506.07452. PDF. J. Schmidhuber (AI Blog, Dec 2020). 10-year anniversary of our journal paper on deep reinforcement learning with policy gradients for LSTM (2007-2010). Recent PDF. Preprint arXiv:1805.04908. Architectures. Preprint arXiv:1703.03906 J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of Searchable PDF scan (created by OCRmypdf which uses LSTM). HTML. better GP methods through Meta-Evolution. More. [MIR] J. Schmidhuber (AI Blog, 2019). Deep Learning: Our Miraculous Year 1990-1991. Preprint arXiv:2005.05744, 2020. Computation 22(12): 3207-3220, 2010. ArXiv Preprint. (AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training. By 2010, when compute was 100 times more expensive than today, both our feedforward NNs^{[MLP1]} J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in my labs at TU Munich and IDSIA. 
Here I mention: (1) Long Short-Term Memory (LSTM), (2) ResNet (which is our earlier Highway Net with open gates), (3) AlexNet and VGG Net (both citing our similar earlier DanNet: the first deep convolutional NN to win image recognition competitions), (4) GAN (an instance of our earlier Adversarial Artificial Curiosity), and (5) variants of Transformers (linear Transformers are formally equivalent to my earlier Fast Weight Programmers). Annus Mirabilis of 1990-1991.^{[MIR]} Preprint arXiv:1611.01578 (PDF), 2017. [NASC1] J. Schmidhuber. First Pow(d)ered flight / plane truth. Correspondence, Nature, 421 p 689, Feb 2003. [NASC3] J. Schmidhuber. The last inventor of the telephone. Letter, Science, 319, no. 5871, p. 1759, March 2008. Correspondence, Nature, vol 483, p 541, March 2012, doi:10.1038/483541b. Letter, Science, vol 336, p 1639, June 2012. See also comment on response by A. Hodges (DOI:10.1126/science.336.6089.1639-a) [NASC6] J. Schmidhuber. Colossus was the first electronic digital computer. Correspondence, Nature, 441 p 25, May 2006. [NASC7] J. Schmidhuber. Turing's impact. Correspondence, Nature, 429 p 501, June 2004 [NASC8] J. Schmidhuber. Prototype resilient, self-modeling robots. Correspondence, Science, 316, no. 5825 p 688, May 2007. [NASC9] J. Schmidhuber. Comparing the legacies of Gauss, Pasteur, Darwin. Correspondence, Nature, vol 452, p 530, April 2008. HTML. Link. NY Times article NY Times article Learning Dexterous In-Hand Manipulation. Preprint arXiv:1808.00177 (PDF). arxiv:1912.06680. An LSTM composes 84% of the model's total parameter count. 2018. An LSTM with 84% of the model's total parameter count was the core of OpenAI Five. PDF. HTML. Link. J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, the GAN principle. Based on TR FKI-126-90 (1990).^{[AC90]} More. PDF. 
Partially based on TR FKI-126-90 (1990).^{[AC90]} Report arXiv:1210.0118 [cs.AI], 2015. One Big Net For Everything. Preprint arXiv:1802.08864 [cs.AI], Feb 2018. Preprint: arXiv:1809.01999. Github: World Models. minimization. TR CU-CS-565-91, Univ. Colorado at Boulder, 1991. PDF. More. 1991. PDF. More. PDF. More. arXiv:1112.5309 [cs.AI] First Experiments with PowerPlay. arXiv:1210.8385 [cs.AI]. [R1] Reddit/ML, 2019. Hinton, LeCun, Bengio receive ACM Turing Award. [R2] Reddit/ML, 2019. J. Schmidhuber really had GANs in 1990. [R3] Reddit/ML, 2019. NeurIPS 2019 Bengio Schmidhuber Meta-Learning Fiasco. [R4] Reddit/ML, 2019. Five major deep learning papers by G. Hinton did not cite similar earlier work by J. Schmidhuber. [R5] Reddit/ML, 2019. The 1997 LSTM paper by Hochreiter & Schmidhuber has become the most cited deep learning research paper of the 20th century. [R6] Reddit/ML, 2019. DanNet, the CUDA CNN of Dan Ciresan in J. Schmidhuber's team, won 4 image recognition challenges prior to AlexNet. [R7] Reddit/ML, 2019. J. Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970. [R8] Reddit/ML, 2019. J. Schmidhuber on Alexey Ivakhnenko, godfather of deep learning 1965. [R9] Reddit/ML, 2019. We [R11] Reddit/ML, 2020. Schmidhuber: Critique of Honda Prize for Dr. Hinton [R12] Reddit/ML, 2020. J. Schmidhuber: Critique of Turing Award for Drs. Bengio & Hinton & LeCun [R15] Reddit/ML, 2021. J. Schmidhuber's work on fast weights from 1991 is similar to linearized variants of Transformers Preprint arXiv/1311.2524, Nov 2013. Preprint arXiv/1703.06870, 2017. PDF. This experimental analysis of backpropagation did not cite the origin of the method,^{[BP1-4]} also known as the reverse mode of automatic differentiation. Link. The Past, Present and Future of Artificial Intelligence. PDF. PDF. ACM's justification of the 2018 A.M. Turing Award (announced in 2019). WWW link. Local copy 1 (HTML only). Local copy 2 (HTML only). [T20a] J. Schmidhuber (AI Blog, 25 June 2020). 
Critique of 2018 Turing Award for Drs. Bengio & Hinton & LeCun. The first version of the present critique. Technical Report IDSIA-77-21 (v1), IDSIA, 24 Sep 2021. Link. Link. [TUR21] J. Schmidhuber (AI Blog, Sep 2021). Turing Oversold. It's not Turing's fault, though. J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised pre-training. Unsupervised PDF. 1992. Based on TR FKI-148-91, TUM, 1991.^{[UN0]} PDF. approaches are now widely used. More. [UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF. can be found here (depth > 1000). 2006. PDF. Link. [VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF. More on the Fundamental Deep Learning Problem. PDF. [VAN4] Y. Bengio. Neural net language models. Scholarpedia, 3(1):3881, 2008. Link. Link. Youtube video [see 28:16]. But in 2010, our team showed^{[MLP1-2]} unsupervised pre-training is not necessary Youtube video, 2018. Preprint arXiv:1609.08144 (PDF), 2016. Based on LSTM which it mentions at least 50 times. WWW link (retrieved 15 May 2020). Local copy (plain HTML only). a general, practical, program-controlled computer. PDF. J. Schmidhuber (AI Blog, 2021). 80th anniversary celebrations: 1941: Konrad Zuse completes the first working general computer, based on his 1936 patent application.
Menu
directory
status & updates
copyrights
AI Blog
@SchmidhuberAI
This is a point-for-point critique of ACM's justification of the ACM A. M. Turing Award for deep learning, as well as a critique of the Turing Lecture given by the awardees (published by ACM in July 2021).
2015 survey of deep learning^{[DL1]}
June 2020 article^{[T20a][R12]}
(see Executive Summary
I,
V,
II,
XII,
XIX,
XXI,
XIII,
XIV,
XX,
XVII).
(A) speech recognition,
(B) natural language processing,
(C) robotics,
(D) computer vision,
(VII) medicine, astronomy, materials science.
A,
B,
C,
D,
VII,
XVII,
VI,
XVI).
II,
V,
XX,
XVIII)
with Dr. Bengio & Dr. Hinton (see Sec. XVII, I).
I respond to LBH's recent ACM article (July 2021).
expands material in my Critique of the 2019 Honda Prize^{[HIN]} (~3,000 words).
Abstract & Outline (~300 words),
Introduction (~300 words),
Critique of LBH's ACM article (Turing Lecture) of July 2021^{[DL3a]}
Executive summary of what's wrong with ACM's laudation (~1,000 words),
21 comments on 21 claims by ACM (~8,000 words),
Conclusion and Acknowledgments (~2,000 words).
All backed up by over 250 references (~9,000 words).
The text contains numerous hyperlinks to relevant overview sites from the AI Blog.
science is self-correcting."^{[SV20]}
they are mine or other people's.^{[DL1-2][HIN][NASC1-9]} The present page is offered as a resource for all good computer scientists who share this inclination.
and to fight plagiarism, collusion rings,^{[LIT21]} and systemic academic corruption in all of their more and less subtle forms.^{[FAKE]}
Sec. 2 addresses LBH's 2021 ACM article^{[DL3a]} which necessitated an extension of the first version of this post.^{[T20a][R12]}
ACM's official justification^{[T19]} of the
2018 A.M. Turing Award^{[R1]}
After the Executive Summary in Sec. 3, Sec. 4 will split
ACM's full text^{[T19]}
into 21 parts
I,
II,
III,
IV,
V,
VI,
VII,
VIII,
IX,
X,
XI,
XII,
XIII,
XIV,
XV,
XVI,
XVII,
XVIII,
XIX,
XX,
XXI.
Most of the critiques are based on references to original papers and material from the AI Blog.^{[AIB][MIR][DEC][HIN]}
publishing yet another misleading overview of the field, this time based on LBH's Turing Lecture.^{[DL3a]}
LBH's well-known earlier omissions.^{[DLC][HIN][T20a]}
LBH claim to "briefly describe the origins of deep learning"^{[DL3a]} without even mentioning the world's first working deep learning nets by
Ivakhnenko and Lapa in 1965^{[DEEP1-2][R8]} (see Sec. II).
this class of methods was pioneered in 1991^{[UN-UN2]} (see Sec. II, III).
Highway Net,
the first really deep feedforward NN.^{[HW1-3]}
(see Sec. D, VI).
were all driven by my lab:^{[MOST]} In 1991, I had the
first very deep NNs based on unsupervised pre-training;^{[UN-UN2]}
LSTMs
brought essentially unlimited depth to gradient-based supervised recurrent NNs;^{[LSTM0-17]}
later our Highway Nets^{[HW1-3]} brought it to feedforward NNs.
from 2007^{[LSTM4,14]}
based on LSTM^{[LSTM0-6]} (1990s-2005) and CTC (2006).^{[CTC]}
our CTC-LSTM-based speech recognition (not that of Hinton) had been on most smartphones for years^{[GSR][GSR15-19][DL4]} (see Sec. A, VI, XI, XV). Similarly for machine translation (see Sec. B).
LBH cite Hinton (2012) for "dropout" without mentioning that dropout is just a variant of Hanson's 1990 stochastic delta rule^{[Drop1-2]} (see Sec. XIV).
von der Malsburg who introduced ReLUs in 1973^{[CMB]} (see Sec. XIV).
called AlexNet,^{[GPUCNN4]} without mentioning that our earlier groundbreaking deep GPU-based DanNet^{[GPUCNN1-3,5-8][DAN]} did not need ReLUs at all to win 4 earlier object recognition competitions and to achieve superhuman results already in 2011^{[GPUCNN1-8][R5-6]} (see Sec. XIV).
XVIII).
already in 1965^{[DEEP1-2][R8]} (see Sec. II).
earlier fast weights of von der Malsburg (1981) and Feldman (1982).^{[FAST,FASTa-b][FWP]}
described in the 1991-93 papers on Fast Weight Programmers and linear Transformers^{[FWP0-1,6]} (see Sec. XVI, XVII-2).
dedicate an extra section to attention-based Transformers,^{[TR1-6]} citing Bengio's team (2014) for "soft attention"^{[ATT14]} without citing the much earlier original work of 1991-1993 on soft attention and linear Transformers^{[FWP,FWP0-2,6][ATT]} (see Sec. XVII-1, XVI).
LBH claim that Bengio's team^{[NPM]}
of text compression^{[SNT]} (see Sec. XVI, XVII-1).
LBH cite Bengio's 2014 paper on Generative Adversarial Networks (GANs)^{[GAN0-1]} without mentioning that
GANs are instances
of the Adversarial Curiosity Principle of 1990^{[AC90-20][MIR](Sec. 5)} (see Sec. XVII).
In summation, LBH have repeatedly chosen to ignore the previous well-known critiques^{[DLC][HIN][T20a]} and deep learning surveys,^{[DL1-2]}
and deep learning (e.g., Sec. I), ACM lauds
Numerous references can be found under the relevant section links I-XXI
which adhere to the sequential order of ACM's text^{[T19]}
Sec. II:
it became really deep in 1991 in my lab,
unsupervised pre-training of NNs,
supervised LSTM.
Sec. I contains 4 subsections
A, B, C, D
A: Speech Recognition (see also Sec. VI & XI & XV): The first superior end-to-end neural speech recognition
combines two methods from my lab: LSTM (1990s-2005) and CTC (2006), which were
Hinton (2012) and Bengio (XV)
our revolutionary CTC-LSTM which was soon on most smartphones.
Sec. B: Natural Language Processing (see also Sec. VI & XI & XVI):
(soon used for several billions of
was also based on our LSTM.
Sec. C: Robotics.
most visible breakthroughs
Sec. D: Computer Vision
XVIII & XIV & XI & VI)
and applied to speech. All before LeCun's CNN work (XVIII).
deep NNs
pre-training (in contrast to Hinton's claims). Our DanNet was the first CNN fast & deep enough for
superior computer vision in 2011,
winning 4 image recognition contests in a row
is an open-gated version of our earlier Highway Nets.
Sec. XIV:
deep & fast CNN
(where LeCun participated),
Sec. XI: ACM mentions GPU-accelerated NNs
deep GPU-NN of 2010
debunked unsupervised pre-training (introduced by myself in 1991 and later championed by Hinton),
and our GPU-CNN of 2011 (DanNet) was the first
XVIII:
Fukushima and Waibel (see Sec. D).
VII: ACM explicitly mentions medicine and
first to win medical imaging competitions
Sec. XII & XIX & XXI: Modern
backpropagation
XIII &
II &
V
III &
IX &
X &
XX):
Sec. XX: ACM credits LeCun for work on
Sec. XXI: ACM credits LeCun for work on
XV: ACM credits Bengio for hybrids of NNs and probabilistic models of sequences.
CTC-LSTM
A &
B).
XVI: ACM
We started this in 1990-93
long before LBH
Sec. XVII:
Artificial Curiosity
vanishing gradients (1991),
metalearning (1987),
unsupervised pre-training (1991),
compressing or distilling one NN into another (1991),
learning sequential attention with NNs (1990),
fast weight programmers using
and other topics.^{[R2-R6]}
Sec. IV is on Turing (1936) and his predecessors
Critique of LBH's ACM article (Turing Lecture) of July 2021.
Sec. Conclusion:
In the recent decade of deep learning,
(speech recognition, language translation, etc.) on billions of devices (also healthcare applications)
Sec. II &
III &
V &
XII &
XIII &
XVII &
XIV &
XIX &
XX &
XXI.
In what follows, ACM's full text [T19] is split into 21 parts
I,
II,
III,
IV,
V,
VI,
VII,
VIII,
IX,
X,
XI,
XII,
XIII,
XIV,
XV,
XVI,
XVII,
XVIII,
XIX,
XX,
XXI.
LBH and their co-workers have contributed certain useful improvements of existing deep learning methods.^{[CNN2,4][CDI][LAN][RMSP][XAV][ATT14][CAPS]} However, the field's foundations were laid by others: deep learning multilayer perceptrons (1965),^{[DEEP1-2][R8]} modern backpropagation (1970),^{[BP1-2][R7]} architectures of recurrent NNs (1943-56)^{[MC43][K56]} and convolutional NNs (1979),^{[CNN1]} principles of generative adversarial NNs and artificial curiosity (1990),^{[AC90,90b][AC20]} unsupervised pre-training for deep NNs (1991),^{[UN1-2]} vanishing gradients (1991)^{[VAN1]} & Long Short-Term Memory or LSTM (Sec. A), GPU-accelerated NNs (2004),^{[GPUNN][DAN][DAN1][GPUCNN5]} NNs with over 100 layers (2015),^{[HW1-3][R5]} and transformer-like^{[TR1-6][FWP]} attention^{[FWP][ATT]} through fast weight programmers (1991).^{[FWP0-2,6]}^{[DL1-2][R2-R8]} Often LBH failed to cite essential prior work, even in their later surveys.^{[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5, R7-R8]} This may explain some of ACM's misattributions.^{[T19]} See Sec. II & III & V & XIII & X & XVII & XII & XVIII & XX. By the 2010s,^{[DEC]} deep NNs were heavily used in academia and industry,^{[DL4]} in the fields mentioned by ACM (labeled as A, B, C, D) below. A. Speech Recognition. The first superior end-to-end neural speech recognition combines two methods from my lab: (A1) Long Short-Term Memory or LSTM (1990s-2005),^{[LSTM0-6]} which overcame the vanishing gradient problem analyzed by my student Sepp Hochreiter in 1991.^{[VAN1]} This happened long before the similar work of Bengio (see Sec. XVII).^{[MIR](Sec. 3, Sec. 4)} LSTM was refined with my student Felix Gers^{[LSTM2]} through "forget gates" based on end-to-end-differentiable fast weights.^{[MIR](Sec. 8)[FWP,FWP0-1]} (A2) Connectionist Temporal Classification (CTC) by my student Alex Graves et al. (2006).^{[CTC]} Our team successfully applied CTC-trained LSTM to speech in 2007^{[LSTM4]} (also with hierarchical LSTM stacks^{[LSTM14]}). This was superior to the old hybrid approach combining NNs and Hidden Markov models (HMMs)^{[BW][BRI][BOU]} (Sec. XV). Hinton et al. (2012) still used the old hybrid approach^{[HYB12]} and did not compare it to CTC-LSTM. Our CTC-trained LSTM became the first recurrent NN (RNN) to win international competitions. 
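The gist of (A1) in code: the LSTM cell state follows c_t = f_t * c_{t-1} + i_t * g_t, so a forget gate saturated near 1 carries signals (and their gradients) across many steps instead of letting them vanish. A toy scalar sketch (illustration only; it omits the output gate and all learned weights):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy LSTM cell-state update: c_t = f_t * c_{t-1} + i_t * g_t
def lstm_cell_state(c_prev, f_in, i_in, g_in):
    f = sigmoid(f_in)        # forget gate (Gers et al.)
    i = sigmoid(i_in)        # input gate
    g = math.tanh(g_in)      # candidate value
    return f * c_prev + i * g

# carry a stored value across 100 steps: forget gate open (f ~ 1),
# input gate closed (i ~ 0)
c = 1.0
for _ in range(100):
    c = lstm_cell_state(c, f_in=10.0, i_in=-10.0, g_in=0.0)

# contrast: repeated squashing in a plain RNN-style recurrence
h = 1.0
for _ in range(100):
    h = math.tanh(0.5 * h)

print(round(c, 3), round(h, 3))  # c stays near 1, h collapses toward 0
```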
Graves later reused our end-to-end neural speech recognizer^{[LSTM4][LSTM14]} as a postdoc in Hinton's lab.^{[LSTM8]} CTC-LSTM dramatically improved Google's speech recognition.^{[GSR][GSR15][DL4]} By 2019, Google's on-device speech recognition^{[GSR19]} (no longer on the server) was still based on LSTM^{[MIR](Sec. 4)} (see Sec. VI & XI & XV). B. Natural Language Processing. The first superior end-to-end neural machine translation was also based on LSTM; compare our much earlier neural models of text^{[SNT]} (see Sec. XVI). In 2001, we showed that LSTM can learn languages unlearnable by traditional models such as HMMs.^{[LSTM13]} See also Sec. VI & XI & XV. Later machine translation additionally used attention mechanisms tailored by Bengio's team.^{[ATT14][FWP]} However, such attention mechanisms also have their roots in my lab (1991);^{[FWP][FWP0-2,6]} see Sec. XVI. C. Robotics & RL etc. Since 2003, our team has used LSTM for Reinforcement Learning (RL) and robotics.^{[LSTM-RL][RPG][LSTMPG]} In the 2010s, many of the most visible RL breakthroughs involved LSTM. For example, in 2018, a PG-trained LSTM was the core of OpenAI's famous Dactyl, which learned to control a dextrous robot hand without a teacher.^{[OAI1][OAI1a]} In 2019, DeepMind beat a pro player in the game of Starcraft, which is theoretically harder than Chess or Go^{[DM2]} in many ways, using Alphastar, whose brain has a deep LSTM core trained by PG.^{[DM3]} A PG-trained LSTM was also the core of OpenAI Five, which learned to defeat human experts in the Dota 2 video game (2018).^{[OAI2]} Bill Gates called this a "huge milestone in advancing artificial intelligence".^{[OAI2a][MIR](Sec. 4)[LSTMPG]} Apart from A, B, C above, LSTM has been applied in healthcare, chemistry, molecular design, lip reading, speech synthesis,^{[AM16]} predicting what's going on in nuclear fusion reactors, and so on.^{[DEC][DL4]} By 2017, a large fraction of the computational power in Google's datacenters was being used for LSTM (only 5% for the CNNs of Sec. D).^{[JOU17]} Apparently the first LSTM journal paper^{[LSTM1][R5]} is now the most cited deep learning research paper of the 20th century.^{[R5]} D. Computer Vision was revolutionized in the 2010s by a particular feedforward NN called the convolutional NN (CNN).^{[CNN1-4]} The basic CNN architecture with convolutional and downsampling layers is due to Fukushima (1979).^{[CNN1]} The popular downsampling variant called max-pooling was introduced by Weng et al. 
(1993).^{[CNN3]} In 1987, NNs with convolutions were combined by Waibel with weight sharing and backpropagation.^{[CNN1a]} Waibel did not call this CNNs but TDNNs. LeCun's team later contributed improvements of CNNs, especially for images^{[CNN2,4]} (see Sec. XVIII). Finally, my own team showed in 2010^{[MLP1]} that unsupervised pre-training is not necessary to train deep NNs, contrary to claims by Hinton,^{[VID1]} who said that "nobody in their right mind would ever suggest" this. Our fast GPU-based CNN of 2011,^{[GPUCNN1]} known as DanNet,^{[DAN,DAN1][R6]} was much deeper and faster than the GPU-accelerated CNNs of 2006.^{[GPUCNN]} DanNet entered a series of contests, winning four of them in a row (15 May 2011, 6 Aug 2011, 1 Mar 2012, 10 Sep 2012).^{[GPUCNN5]} At IJCNN 2011 in Silicon Valley, DanNet blew away the competition and achieved the first superhuman visual pattern recognition^{[DAN1]} in an international contest (where LeCun's team took a distant second place). DanNet was also the first deep CNN to win a Chinese handwriting contest (ICDAR 2011) and an image segmentation contest (ISBI, May 2012). Our CVPR paper on DanNet^{[GPUCNN3]} appeared in July 2012, before the similar GPU-accelerated AlexNet of Hinton's student Krizhevsky won the ImageNet^{[IM09]} 2012 contest^{[GPUCNN4-5][R6]} (now also without unsupervised pre-training, citing DanNet). Our CNN image scanners were 1000 times faster than previous methods.^{[SCAN]} The VGG network (ImageNet 2014 winner)^{[GPUCNN9]} and other highly cited CNNs^{[RCNN1-3]} further extended the work of 2011.^{[MIR](Sec. 19)} ResNet, the ImageNet 2015 winner^{[HW2]} (Dec 2015), which currently gets more citations per year than any other NN,^{[MOST]} is an open-gated version of our earlier Highway Net (May 2015).^{[HW1-3][R5]} The Highway Net is actually the feedforward net version of vanilla LSTM.^{[LSTM2]} It was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers). See also Sec. XVIII & XIV & XI & VI.
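For readers unfamiliar with the two ingredients of Sec. D, here is a minimal 1D sketch (illustration only): a convolution applies one shared kernel at every position (weight sharing), and max-pooling downsamples by keeping the strongest response per window:

```python
def conv1d(signal, kernel):
    # "valid" convolution/correlation with a single shared kernel
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def max_pool1d(signal, size):
    # non-overlapping max-pooling: keep the strongest response per window
    return [max(signal[i:i + size])
            for i in range(0, len(signal) - size + 1, size)]

x = [0.0, 1.0, 3.0, 1.0, 0.0, 2.0, 6.0, 2.0]
edges = conv1d(x, [-1.0, 1.0])   # a difference filter responding to edges
print(edges)                     # [1.0, 2.0, -2.0, -1.0, 2.0, 4.0, -4.0]
print(max_pool1d(edges, 2))      # [2.0, -1.0, 4.0]
```

In a 2D CNN the same idea applies to image patches; stacking such convolutional and downsampling layers yields the Fukushima-style architecture discussed above.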
The central ideas, however, appeared long before the 1980s. Architectures of recurrent NNs were proposed already in the 1940s/50s^{[MC43][K56]} (but don't forget prior work in physics since the 1920s^{[L20][I25][K41][W45]}). The basic deep convolutional NN architecture was proposed in the 1970s.^{[CNN1]} NNs without hidden layers learned in 1958^{[R58]} (essentially variants of linear regression and the method of least squares^{[DL1-2]}). Rosenblatt also thought about deeper adaptive NNs.^{[R61,R62]} In 1965, Ivakhnenko & Lapa published the first working learning algorithms for deep NNs with arbitrarily many layers (already containing the now popular multiplicative gates).^{[DEEP1-2][DL1-2]} A paper of 1971^{[DEEP2]} already described a deep net with 8 layers, trained by their highly cited method which was still popular in the new millennium,^{[DL2]} especially in Eastern Europe, where much of Machine Learning was born. Ivakhnenko did not call it an NN, but that's what it was.^{[MIR](Sec. 1)[R8]} LBH failed to cite this. See Sec. XIII & III & V & VIII & IX & X. A misleading narrative is propagated by LBH & co-authors, e.g., Sejnowski^{[S20]} (see Sec. XIII). It goes more or less like this: "In 1969, Minsky & Papert^{[M69]} [pointed out limitations of shallow NNs, and the field lay dormant until] researchers took a fresh look at the problem in the 1980s."^{[S20]} However, as mentioned above, the 1969 book^{[M69]} addressed a "problem" of Gauss & Legendre's shallow learning (~1800)^{[DL1-2]} that had already been solved 4 years prior by Ivakhnenko & Lapa's popular deep learning method.^{[DEEP1-2][DL2]} Minsky was apparently unaware of this and failed to correct it later.^{[HIN](Sec. I)} (But see a 1989 paper.^{[MOZ]}) However, it became really deep in 1991 in my lab.^{[UN-UN3]} See Sec. 1 of the overview:^{[MIR]} First Very Deep NNs, Based on Unsupervised Pre-Training (1991). This approach already solved "Very Deep Learning" tasks of depth > 1000.^{[UN2][DL1][UN]} (By 2003, LSTM variants successfully dealt with language problems of depth up to 30,000^{[LSTM17]} and more.) My lab also drove the shift from unsupervised pre-training to purely supervised learning (1991-95; 2006-10).^{[HIN](Sec. II)[MIR](Sec. 19)} See Sec. III. Note that LSTMs brought essentially unlimited depth to supervised recurrent NNs; Highway Nets^{[HW1-3]} brought it to feedforward NNs.^{[MOST]}
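The Highway/ResNet relation invoked throughout this critique (a ResNet layer is a Highway layer with its gates fixed open, g(x)=t(x)=1) can be written as a toy scalar sketch (illustration only; in real layers the gates t and c are learned, input-dependent functions, as in LSTM's gates):

```python
def highway_layer(x, h, t, c):
    # Highway Net layer: y = t(x) * h(x) + c(x) * x, gates in (0, 1)
    return t * h(x) + c * x

def residual_layer(x, h):
    # ResNet layer: the special case with both gates fixed open (t = c = 1)
    return h(x) + x

double = lambda x: 2.0 * x
print(highway_layer(3.0, double, t=1.0, c=1.0))  # 9.0
print(residual_layer(3.0, double))               # 9.0 -- identical
```

The carry term (c * x, or the bare + x in ResNets) is what lets signals and gradients pass through hundreds of layers, the feedforward analogue of LSTM's gated cell state.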
These foundations were created by others (Sec. III).^{[DLC][DEEP1-2][BP1][DL1-2][R7-R8][R2-R4]} They include: deep learning multilayer perceptrons (1965),^{[DEEP1-2][R8]} modern backpropagation (1970),^{[BP1,2][R7]} architectures of recurrent NNs (1943-56)^{[MC43][K56]} and convolutional NNs (1979),^{[CNN1]} principles of generative adversarial NNs and artificial curiosity (1990),^{[AC90,90b][AC20]} unsupervised pre-training for deep NNs,^{[UN1-2]} the vanishing gradient problem (1991)^{[VAN1]} & solutions to it (Sec. A), GPU-accelerated NNs (2004),^{[GPUNN][GPUCNN5]} and other foundations.^{[DL1-2][R2-R8]} Often LBH failed to cite essential prior work.^{[DLC][HIN][MIR](Sec. 21)} See Sec. II & V & XIII & IX & X & XVII & XII & XVIII & XX & I. Compare deeplearning.net, which until 2019 advertised deep learning as "moving beyond shallow machine learning since 2006",^{[DL7]} referring to Hinton's^{[UN4]} and Bengio's^{[UN5]} unsupervised pre-training (2006). However, we had this type of deep learning already in 1991;^{[UN][UN1-2]} see Sec. II & XVII (5). Not to mention Ivakhnenko's even earlier supervised layer-wise training of deep NNs,^{[DEEP1-2]} which Hinton,^{[UN4]} Bengio,^{[UN5]} and LBH^{[DL3,DL3a]} did not cite either. See Sec. X.
my comments systematically track the sequential order of ACM's claims.^{[T19]}
ACM's statement on Turing is greatly misleading, like some of its other statements.^{[T19]} The foundational results of Gödel and his contemporaries are relevant for any type of computation-based AI.^{[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]} Much of early AI in the 1940s-70s was actually about theorem proving.^{[ZU48][NS56]}
In 1936, Turing introduced the Turing Machine.^{[TUR]} He rederived the above-mentioned result.^{[CHU][TUR][HIN][GOD21,21a][TUR21][LEI21,21a]} In the same year of 1936, Emil Post published yet another independent universal model of computing.^{[POS]} (See also my reply to Hinton, who criticized my website on Turing without suggesting any fact-based corrections.^{[HIN]}) Gödel essentially formulated the famous open problem "P=NP?" in his letter to John von Neumann (1956).^{[GOD56][URQ10]} Likewise, Konrad Zuse (1910-1995) created the world's first working programmable general-purpose computer 1935-41. His patent application of 1936^{[ZU36-38][Z36][RO98][ZUS21]} predated Claude Shannon's 1937 thesis on digital circuit design.^{[SHA37]} Zuse also created the first high-level programming language in the early 1940s.^{[BAU][KNU]} Although his machine lacked an explicit conditional jump instruction, Rojas showed that it is nevertheless universal in principle.^{[RO98]}
Most of these achievements are due to researchers other than LBH: deep learning multilayer perceptrons that learn internal representations (1965),^{[DEEP1-2][R8]} modern backpropagation (1970),^{[BP1,2][R7]} architectures of recurrent NNs (1943-56)^{[MC43][K56]} and convolutional NNs (1979),^{[CNN1]} principles of generative adversarial NNs and artificial curiosity (1990),^{[AC][AC90,90b][AC10][AC20]} unsupervised pre-training for deep NNs (1991),^{[UN1-2][UN]} vanishing gradients (1991)^{[VAN1]} & solutions to it (Sec. A),^{[LSTM0-17][CTC]} GPU-accelerated NNs (2004),^{[GPUNN][GPUCNN5]} record-breaking deep supervised NNs (2010)^{[MLP1-2]} and contest-winning deep CNNs (2011),^{[DAN][DAN1][GPUCNN5]} NNs with over 100 layers (2015),^{[HW1-3][R5]} transformer-like^{[TR1-6][FWP]} attention^{[FWP][ATT]} through fast weight programmers (1991),^{[FWP0-2,6]} and more.^{[DL1-2][R2-R8]} Often LBH failed to cite essential prior work.^{[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5,R7,R8,R11]} See Sec. II & I & III & XIII & X & XVII & XII & XVIII & XX.
"advances in natural language processing" and in speech supervised NNs and CNNs achieved by our group 2010-2011^{[MLP1-2][DAN][DAN1][GPUCNN5][R6]} and through Highway Net-like NNs (2015),^{[HW1-3][R5]} although the principles of CNNs were invented and developed by others since the 1970s.^{[CNN1-4]} See Sec. D & XVIII & XIV as well as Sec. 4 & Sec. 19 of the overview.^{[MIR]}
DanNet^{[DAN][DAN1][GPUCNN5]} was the first NN to win a medical imaging contest through deep learning (Sept 2012, on cancer detection).^{[GPUCNN5,8]} We also used this approach to greatly improve steel defect detection.^{[ST]} All of this happened before the similar GPU-accelerated AlexNet of Hinton's student Krizhevsky won ImageNet 2012.^{[GPUCNN5][R6]} See also our work on mitosis detection^{[MGC][GPUCNN5,8]} (based on the approach of Sec. D & XI).
LBH built on such earlier works, often without citing them.^{[DL1][DLC][HIN][R2-R4][R7-R8]} See Sec. V & XII & XIX & II & III & XIII & XVII & X & I.
LBH failed to cite them, even in later work.^{[HIN][DLC][DL1-2][DEEP1-2][CMB][R7-R8]} See Sec. II & III & XIII & V & X & XIV & I.
The term "deep learning" was first introduced to Machine Learning by Dechter (1986), and to NNs by Aizenberg et al. (2000).^{[DL2]} To my knowledge, LBH have never cited them. (Margin note: our 2005 paper on deep RL^{[DL6,6a]} was apparently the first machine learning publication with the word combination "learn deep" in the title.) LBH started talking about "deep learning ... moving beyond shallow machine learning since 2006",^{[DL7]} referring to their unsupervised pre-training methods of 2006. See Sec. III. However, others built careers on this notion long before LBH recognized it.^{[DEEP1-2][CNN1][HIN][R8][DL1][DLC]} Even deep learning through unsupervised pre-training was introduced by others.^{[UN1-3][R4][HIN](Sec. II)} See Sec. II & III & XIII & V & I.
Much of this essential prior work was ignored by LBH's papers^{[HIN][R7-R8][R2-R5]} (see Sec. V & II & III & I & XIII & XII & XIX & X & XVII).
ACM correctly mentions advancements through GPUs. The first to use GPUs for NNs were Jung & Oh (2004).^{[GPUNN][GPUCNN5]} Our team later made GPU-based NNs fast and deep enough to set an important benchmark record (2010),^{[MLP1-2]} showing that unsupervised pre-training (pioneered by myself in 1991) is not necessary to train deep NNs, contrary to Hinton's claims.^{[VID1]} By 2011, our CNNs were deep and fast enough^{[DAN][DAN1][GPUCNN5]} to achieve superhuman performance in computer vision (explicitly mentioned by ACM) for the first time^{[R6]} (see Sec. D).
Furthermore, by the mid 2010s, speech recognition and machine translation (explicitly mentioned by ACM) were actually dominated by LSTM and CTC of our team.^{[LSTM1-4][CTC]} In particular, as mentioned in Sec. A, this end-to-end approach was superior to traditional hybrids based on hidden Markov models (HMMs).^{[BW][BOU][BRI][HYB12]} As mentioned in Sec. B and XVI, the first superior end-to-end neural machine translation was also based on LSTM.
ACM's statement is "less wrong" than Honda's^{[HIN](Sec. I)} but still misleading: ACM (and apparently even other award committees^{[HIN](Sec. I)}) seem to credit backpropagation to Rumelhart et al. (1985-86),^{[RUM]} although Werbos had already applied it to NNs (1982).^{[BP2]} And the article^{[RUM]} even failed to mention Linnainmaa, the inventor of this famous algorithm for credit assignment in networks (1970).^{[BP1]} By 1960, Kelley already had a precursor thereof in the field of control theory;^{[BPA]} see also later work of the early 1960s.^{[BPB][BPC]}^{[R7]} Rumelhart et al. showed experimentally that backpropagation can yield useful internal representations in hidden layers of NNs.^{[RUM]} But this was essentially just an experimental analysis of a known method.^{[BP1-2]} More on the history of backpropagation can be found at Scholarpedia^{[DL2]} and in my award-winning survey.^{[DL1]} Also see Sec. XIX, II.
Some claim that "backpropagation is just the chain rule of Leibniz (1676) & L'Hopital (1696)." No, it is the efficient way of applying the chain rule to big networks with differentiable nodes (there are also many inefficient ways of doing this). It was not published until 1970.^{[BP1]} Compare a recent debate:^{[HIN]} It is true that in 2018, Hinton^{[AOI]} credited Rumelhart^{[RUM]} with the "invention" of backpropagation, yet Hinton himself had accepted credit for "creating" the method and for other things he didn't do.^{[HIN]} Neither in a popular book^{[AOI]} nor in other recent work^{[DL3,DL3a]} did he cite Linnainmaa (1970),^{[BP1]} the true creator.^{[BP4-5]} It is true that his 2015 survey^{[DL3]} does cite Werbos (1974), who however described the method correctly only later in 1982,^{[BP2]} and it also failed to cite Linnainmaa^{[BP1]} (compare Amari's work of 1977^{[BP6]}). Linnainmaa's method was well-known.^{[BP5][DL1-2][DLC]} It wasn't created by "lots of different people" as Hinton suggested,^{[AOI][HIN][R11]} but by one person who published first^{[BP1]} and therefore should get the credit.
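To make the distinction concrete: for a chain of differentiable nodes, reverse mode propagates a single derivative signal backwards and reuses each local derivative once, so the gradients of all weights come out of one sweep. A minimal illustrative sketch in pure Python (the tanh chain and all names here are my own assumptions, not historical code):

```python
import math

def forward(x, ws):
    """Chain of tanh units a_{i+1} = tanh(w_i * a_i); records intermediates."""
    acts = [x]
    for w in ws:
        acts.append(math.tanh(w * acts[-1]))
    return acts

def backward(acts, ws):
    """One reverse sweep yields dy/dw_i for every weight at O(depth) cost."""
    grads = [0.0] * len(ws)
    delta = 1.0                               # dy/d(final activation)
    for i in reversed(range(len(ws))):
        d_tanh = 1.0 - acts[i + 1] ** 2       # tanh'(pre) = 1 - tanh(pre)^2
        grads[i] = delta * d_tanh * acts[i]   # dy/dw_i
        delta *= d_tanh * ws[i]               # propagate dy/da_i backwards
    return grads

ws = [0.5, -1.2, 0.8]
acts = forward(0.7, ws)
grads = backward(acts, ws)
```

An inefficient application of the same chain rule would recompute the product of local derivatives separately for each weight, at O(depth^2) cost; the reverse sweep above is the efficient variant the text refers to.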
ACM mentions Hinton's Boltzmann Machine (BM),^{[BM]} a method of unsupervised learning.^{[HIN]} Recently, however, I learnt through a reader that even the BM paper^{[BM]} did not cite prior relevant work by Sherrington & Kirkpatrick^{[SK75]} and Glauber.^{[G63]} (Compare related work.^{[H86][H88][S93]}) Note also that Ivakhnenko's nets of 1965 were already multilayer perceptrons with arbitrarily many layers.^{[DEEP1-2][HIN]} See Sec. II & V & X.^{[MIR](Sec. 1)[R8]}
As mentioned in Sec. II, Sejnowski's rather self-serving "history of deep learning" [S20] claims that the 1969 book of Minsky & Papert^{[M69]} led researchers to abandon NNs until a new generation took a fresh look "at the problem in the 1980s."^{[S20]} However, the 1969 book^{[M69]} addressed a "deep learning problem" (a limitation of Gauss & Legendre's shallow learning around 1800^{[DL1-2]}) that had already been solved four years prior (see Sec. II); work on deep learning also continued in the 1970s, especially outside of the Anglosphere.^{[DEEP2][BP6][CNN1][DL1-2]}
Dropout is actually a variant of Hanson's much earlier stochastic delta rule (1990).^{[Drop1-2]} Hinton's 2012 paper and his later patent did not cite this either. Dropout was not even needed to achieve superior computer vision, as we showed already in 2011 in a contest where LeCun's team participated as well^{[DAN1]} (see Sec. D above). Back then, the really important breakthrough was the speedup of deep CNNs through GPUs.^{[GPUCNN1,3,5][R6]} Already before ImageNet 2012,^{[R6]} our fast deep CNN called DanNet had a monopoly on winning computer vision competitions.^{[GPUCNN5]} It more than "halved the error rate for object recognition" (ACM's wording) in a contest already in 2011,^{[GPUCNN2][DAN,DAN1][R6]} long before the similar system of Hinton's student. See Sec. D as well as Sec. 19 of the overview.^{[MIR]}
Hybrid NN-HMM speech recognizers had existed since the late 1980s.^{[BW][BRI][BOU]} The revolution of the 2010s, however, was driven by our LSTM (1990s-2005)^{[LSTM0-6]} and CTC^{[CTC]} (2006), which were applied to speech in 2007.^{[LSTM4][LSTM14]} CTC-LSTM is end-to-end-neural and thus very different from (and superior to) the hybrid methods since the late 1980s.^{[BW][BRI][BOU][HYB12]} See also Sec. A.
ACM credits Bengio for the neural probabilistic language model.^{[NPM]} However, 5 years earlier, in 1995, we already had a similar, excellent neural probabilistic text model,^{[SNT]} which Bengio^{[NPM]} characterizes only briefly as "related" (see also Pollack's earlier work on embeddings of words and other structures^{[PO87][PO90]}). In the 2010s, the most important NN for natural language processing was actually the LSTM of our team,^{[LSTM0-6]} which Bloomberg called "arguably the most commercial AI achievement."^{[AV1][MIR](Sec. 4)} See Sec. B. The attention mechanism of Bengio's team^{[ATT14]} has indeed become important. For example, it helped to further improve Facebook's LSTM-based translation (see Sec. B). However, both types of adaptive neural sequential attention originated in my lab: end-to-end-differentiable "soft" attention in the latent space of Fast Weight Programmers (FWPs),^{[FWP2][FWP]} and "hard" attention (in observation space) in the context of RL^{[ATT][ATT0-1]} (1990). Today's attention-based Transformers,^{[TR1-6]} which have become a popular alternative to RNNs, build on the same principle as the FWPs of 1991.^{[FWP0-1]} My FWP of 1991^{[FWP0-1]} computes fast weight changes through outer products of activation patterns (now often called keys and values for self-attention).^{[TR1-6][FWP]} In the 2010s,^{[DEC]} Transformers^{[TR1-2]} excelled at natural language processing, a traditional LSTM domain (see Sec. B), although there are problems that LSTM can rapidly learn to solve quickly^{[LSTM13,17]} while Transformers cannot. Compare the linear Transformers or Performers^{[TR5-6]} which are formally equivalent to my 1991 FWPs (apart from normalization).^{[FWP6][FWP]} In 1993, I introduced the attention terminology^{[FWP2]} now used in this context,^{[ATT]} and RNNs that program themselves.
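The outer-product mechanism can be sketched in a few lines of numpy. This is a toy illustration, not the original 1991 system: the dimensions, the `write`/`read` helper names, and the random data are my own assumptions; the key/value terminology follows today's usage:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W_fast = np.zeros((d, d))        # fast weight matrix, initially blank

def write(key, value):
    """Programming step: an additive outer product stores an association."""
    global W_fast
    W_fast += np.outer(value, key)

def read(query):
    """Retrieval step: apply the programmed weights (unnormalized linear attention)."""
    return W_fast @ query

keys = rng.standard_normal((3, d))
values = rng.standard_normal((3, d))
for k, v in zip(keys, values):
    write(k, v)

q = keys[1]
out = read(q)
# same result in linear-attention form: sum_i value_i * (key_i . query)
ref = sum(v * (k @ q) for k, v in zip(keys, values))
assert np.allclose(out, ref)
```

Since the fast weight matrix is the sum of the outer products, applying it to a query is algebraically identical to attention with unnormalized linear (dot-product) scores, which is the formal equivalence mentioned above.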
See^{[MIR](Sec. 9)[R4]} for my related priority dispute on attention with Hinton. He was the reviewer of my 1990 paper,^{[ATT2]} yet later published closely related work of his own.^{[ATT3]}
GANs^{[GAN0-1]} (2010-2014) are actually a simple application^{[AC]} of the adversarial curiosity (AC) principle from 1990^{[AC90,90b][AC20]} (see also surveys^{[AC09-10]}). This principle is now widely used for exploration in RL (e.g., Sec. C) and for image synthesis^{[GAN1]} (also mentioned by ACM in Sec. XVIII). In AC, a predictor NN minimizes its error, while the generator NN tries to make outputs that maximize this error: one net's loss is the other net's gain. 4 years before the GAN paper,^{[GAN1]} a well-known 2010 survey^{[AC10]} summarised the generative adversarial NNs of 1990 as a setting where a predictive model learns whether the controller's (or generator's) output is in a given set.^{[AC20][AC]} (The early adversarial machine learning settings of others^{[S59][H90]} neither involved unsupervised NNs nor were about modeling data nor used gradient descent.^{[AC20]}) Bengio et al. neither cited the original work^{[AC90,90b][AC20]} nor corrected their erroneous claims^{[GAN1]} about my other adversarial technique, Predictability Minimization (1991).^{[PM1-2][AC20][R2][MIR](Sec. 5)} The dispute about their NIPS 2014 paper^{[GAN1]} and some of the erroneous claims it made about my prior work^{[AC20]} was even covered by Bloomberg.^{[AV1]} Goodfellow eventually admitted that PM is adversarial (his paper^{[GAN1]} still claims the opposite), but emphasized that it's not generative. However, the even earlier AC^{[AC90,90b][AC10][AC20]} is both adversarial and generative (its generator contains probabilistic units^{[AC90]} like in StyleGANs^{[GAN2]}). When the authors^{[GAN1]} did not publish a correction, I published one myself in the hopes of correcting the annals of history:^{[AC20]} GANs are instances of my earlier work.^{[R2][AC20]} Similarly, my student Sepp Hochreiter was the first to analyze the vanishing gradient problem;^{[MIR](Sec. 3)[VAN1]} Bengio later published his own analysis,^{[VAN2]} without citing Sepp.
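The zero-sum relation just described (one net's loss is the other net's gain) can be sketched as a toy gradient game. The sine "environment", the linear predictor, the learning rate, and the clipping below are all illustrative assumptions of mine, not the 1990 setup:

```python
import numpy as np

env = np.sin                # the mapping the predictor tries to model

a, b = 0.0, 0.0             # predictor p(x) = a*x + b
x = 0.5                     # generator's current output (one scalar "action")
lr = 0.02

for _ in range(300):
    err = (a * x + b) - env(x)        # prediction error e(x)
    # predictor: gradient DESCENT on e^2 w.r.t. its parameters (a, b)
    a -= lr * 2 * err * x
    b -= lr * 2 * err
    # generator: gradient ASCENT on the same e^2 w.r.t. its output x
    x += lr * 2 * err * (a - np.cos(x))
    x = float(np.clip(x, -3.0, 3.0))  # keep the toy game bounded
```

Both updates use the gradient of the same objective e^2, with opposite signs: the predictor shrinks the error wherever the generator currently probes, while the generator drifts toward outputs where the predictor still fails.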
The priority dispute was eventually settled in favor of Sepp.^{[VAN1]} However, even after a common publication,^{[VAN3]} Bengio published papers^{[VAN4][XAV]} that did not cite Sepp's original work. (Citation counts are poor indicators of truly pioneering work.^{[NAT1]}) (Margin note: Bengio states^{[YB20]} that in 2018 he was unaware of the relevant earlier work; but even if one unintentionally republishes prior results, one must at least clarify this later.^{[DLC]}) Bengio also claims^{[YB20]} credit for work of 1995, although my publications on exactly this topic date back to 1991-93.^{[UN0-2][UN]} The same holds for meta-learning, which I started in 1987,^{[META1][META]} long before Bengio, who nevertheless suggested that he did it before me.^{[R3]} Regarding attention-based Transformers,^{[TR1-6]} Bengio^{[DL3a]} cites his own team (2014) for "soft attention" without citing my much earlier original work of 1991-1993 on soft attention and linear Transformers.^{[FWP,FWP0-2,6]} Bengio has also heavily used our LSTM (see Sec. A-C), using the name "gated recurrent units (GRU)"^{[LSTMGRU]} for a variant of our vanilla LSTM architecture^{[LSTM2]} (2000) which he did not cite, although our work^{[LSTM2]} was the one that introduced gated recurrent units. In addition, our team automatically evolved lots of additional LSTM variants and topologies already in 2009^{[LSTM7]} without changing the name of the basic method. (GRUs can neither learn to count^{[LSTMGRU2]} nor learn simple non-regular languages;^{[LSTMGRU2]} they also perform worse according to Google Brain.^{[LSTMGRU3]}) ACM further mentions unsupervised pre-training for deep NNs.^{[UN0-4][HIN](Sec. II)[MIR](Sec. 1)} However, Hinton's paper^{[UN4]} (2006) appeared long after my earlier work on this,^{[UN0-2]} which yielded the first NNs shown to solve very deep problems (see Sec. II above).^{[UN]} It was published in 1991-92,^{[UN1]} when compute was about 1000 times more expensive than in 2006. LBH's survey (2015)^{[DL3][DLC]} did not clarify this either. See also Sec. II & III. The same holds for compressing or distilling one NN into another.^{[UN0-2][DIST1-2][MIR](Sec. 2)} Hinton^{[DIST2]} (2006) did not cite my much earlier original work on this (1991),^{[UN1][UN]} not even in his later patent application. The same is true for fast weight programmers^{[FWP][FWP0-4a]} through tensor-like outer products (1991-2016) and their motivation^{[FWP2][FWP4a][MIR](Sec. 8)} (see also Sec. XVI above), and for learning sequential attention with NNs.^{[MIR](Sec. 9)} Hinton^{[ATT3]} (2010) did not cite our much earlier work on this,^{[ATT1][ATT]} although he was both reviewer and editor of my summary^{[ATT2]} (1990; see Sec. XVI above).
The ten priority disputes mentioned in the present Sec. XVII are not the only ones.^{[R4]} Remarkably, three of them are related to the 1991 paper^{[UN1][UN]} which in many ways started what people now call deep learning, going beyond earlier work. Most of these disputes concern work of 1990-91.^{[MIR]} See Sec. I for additional related issues of credit assignment.
LeCun's team has made important contributions to CNNs since 1989.^{[CNN2,4]} However, the basic CNN architecture with convolutional and downsampling layers is actually due to Fukushima (1979).^{[CNN1]} NNs with convolutions were later (1987) combined by Waibel with weight sharing and backpropagation;^{[CNN1a]} Waibel called this architecture TDNN and applied it to speech. All of this happened before LeCun's work on CNNs. See Sec. D above and Sec. 21 of the overview of our Annus Mirabilis 1990-1991.^{[MIR]} Furthermore, at IJCNN 2011 in Silicon Valley, our DanNet^{[DAN][GPUCNN1-3]} won the vision contest through the first superhuman performance (the entry of LeCun's team showed three times worse performance).^{[DAN1]} Again see Sec. D. And at ICPR 2012, our DanNet^{[GPUCNN1-3]} won the medical imaging contest (Sept 2012, on detection of mitosis/cancer)^{[GPUCNN5,7,8]} (before the similar AlexNet won ImageNet 2012^{[GPUCNN5][R6]} and the similar VGG network^{[GPUCNN9]} won ImageNet 2014). DanNet's approach is now widely used for mitosis detection.^{[MGC][GPUCNN5,7,8]} Many major companies are using it now. See Sec. D & VII. ACM also explicitly mentions speech recognition and speech synthesis.^{[AM16][DL1]} All of these fields were heavily shaped in the 2010s by our non-CNN methods.^{[DL1][DL4][AM16][GSR][GSR15][GT16][WU][FB17]} See Sec. A, B, VI, XI.
As mentioned in Sec. XII, backpropagation was actually proposed earlier as a learning method for NNs by Werbos (1982)^{[BP2-4]} (see also Amari's work of 1977^{[BP6]}), yet LeCun did not clarify this even in recent work.^{[DL3,DL3a][DLC]} In 1960, Kelley already had a precursor of the algorithm.^{[BPA]} Furthermore, many besides LeCun have worked "to speed up backpropagation algorithms"^{[DL1]} (ACM's wording). More on the history of backpropagation can be found at Scholarpedia^{[DL2]} and in my overview.^{[BP4]}
However, "hierarchical feature representation" in deep learning networks is what Ivakhnenko & Lapa (1965)^{[DEEP1-2]} (and also Fukushima^{[CNN1][DL2]}) had long before LeCun. See Sec. D & II & XIII & V.
LeCun et al. neither cited the origins^{[BP1]} (1970) of this widely used type of automatic differentiation for differentiable networks of modules^{[DL2][BP4-5][DLC]} nor the early work on backpropagation for such systems.^{[S80]} See also Sec. XIX & XII. Others described such networks of modules before LeCun, who did not cite them. See also Pollack's even earlier relevant work.^{[PO87-90]}
(Furthermore, "complex networks of modules where backpropagation is performed" were the central theme of my much earlier habilitation thesis (1993).^{[UN2]} For example, our adaptive subgoal generators (1991)^{[HRL0-2]} were trained through end-to-end-differentiable chains of such modules.^{[MIR](Sec. 10)} planning and reinforcement learning with recurrent neural world models (1990).^{[PLAN][MIR](Sec. 11)} Same for my linear transformer-like fast weight programmers^{[FWP0-2][FWP][ATT][MIR](Sec. 8)} since 1991 (see Sec. XVI) see "100 Authors against Einstein."^{}[AH1] ad hominem attacks^{[AH2-3][HIN]} "If you cannot dispute a fact-based message, attack the messenger himself."^{[HIN]} award can ever change that.^{[HIN]} and their co-workers have contributed useful improvements of deep learning methods.^{[CNN2,4][CDI][LAN][RMSP][XAV][ATT14][CAPS]} whom they did not cite II, V, XII, XIX, XXI, XIII, XIV, XI, and XX, and 2). Sec. I, A, B, C, D, XVII, VI, and XVI). As emphasized earlier:^{[DLC][HIN]} to self-correction,"^{[SV20]} as is already the standard in other scientific fields. in popular science venues without peer review? For example, the narrator of a popular 2018 Bloomberg video^{[VID2]} Germany and Switzerland (LSTM & CTC; see Sec. A) long before Hinton's methods. Similarly, in 2016, the NY Times published an article^{[NYT3]} Google's original 2016 paper on Google Translate^{[WU]} mentions LSTM over 50 times (see Sec. B). In ad hominem style,^{[AH2-3]} claiming credit he doesn't deserve for many, many things",^{[NYT1]} without LeCun also called the GANs of Bengio's team^{[GAN1]} GANs are variations of my work in 1990.^{[AC90,90b][AC20][R2]} According to Bloomberg,^{[AV2]} Bengio has simply "denied my claims" without backing up his denial by any facts; see Sec. XVII. 
It is the duty of scientists to point out false claims "and forcefully contradict public figures who promote it."^{[FAKE]} This also applies to LBH, who called themselves the deep learning conspiracy.^{[DLC]} Note that our LSTM paper^{[LSTM1]} has got more citations than any paper by Bengio or LeCun,^{[R5]} and Hinton's most cited paper (2012) is the one on GPU-based CNNs.^{[GPUCNN4][R5]} It follows our earlier work on supervised deep NNs (2010)^{[MLP1]} (which showed that unsupervised pre-training for deep NNs, pioneered by myself^{[UN][UN0-3]} and later championed by Hinton,^{[UN4][VID1]} is not necessary; see Sec. D). Hinton (2012)^{[GPUCNN4]} cites our deep and fast DanNet (2011),^{[GPUCNN1-3]} which won four contests before AlexNet won one;^{[R6]} see Sec. D, XIV. The highly cited VGG network (2014)^{[GPUCNN9]} also followed this approach. Hinton's 2nd most cited paper is the one on backpropagation^{[RUM][R5]} (some count citations of Hinton's paper,^{[RUM]} adding citations for a book by Rumelhart & McClelland^{[R5]}). Backpropagation, however, is a previously invented method,^{[BP1]} and deep learning MLPs are due to Ivakhnenko, whom Hinton has never cited;^{[DEEP1-2][R7-R8]} see Sec. II, XIII. Bengio's 2nd most cited research paper is the one on GANs (2014),^{[GAN1]} which are instances of my artificial curiosity (1990)^{[AC90,90b][AC20][R2]} which he did not cite; see Sec. XVII. Hinton's highly cited papers on unsupervised pre-training for deep NNs (2006-)^{[UN4]} were preceded by ours,^{[UN0-2][UN]} and his dropout papers were preceded by Hanson's work.^{[Drop1-2]} As recently as of 2021, ACM published yet another misleading deep learning "survey" by LBH,^{[DL3a]} again heavily citing LBH without the inventors of the central methods. Consult the Executive Summary and Sec. I-XXI of this critique for more. So virtually all the algorithms that have attracted massive attention in the deep learning era have their conceptual and technical roots in my labs in Munich and Lugano,^{[MOST]} building on earlier foundations: deep learning MLPs since 1965^{[DEEP1-2]} (see Sec. II, XX), backpropagation (1960-70)^{[BPA][BP1]} (see Sec. XIX, XII), and convolutional NNs since 1979^{[CNN1-4]} (see Sec. XVIII, D). Our LSTM (1990s, see Sec. A, B; also for RL, 2003-, see Sec. C) → our Highway Net (May 2015) → ResNet (Dec 2015, see Sec. D).
Our adversarial Artificial Curiosity (1990) → GANs (2010s, see Sec. XVII). Our own unsupervised pre-training of deep NNs (1991, see Sec. II & III) was followed by purely supervised deep learning: for recurrent NNs in the 1990s → our LSTM (see Sec. A-C); for feedforward NNs in 2010 → our DanNet (2011) → AlexNet (2012); VGG Net (2014) (see Sec. D). Our LSTM brought essentially unlimited depth to supervised recurrent NNs in the 1990s; our Highway Nets^{[HW1-3]} brought it to feedforward NNs in May 2015.^{[MOST]} DanNet started superior computer vision (2011, see Sec. D, XVIII), medical diagnosis (2012, see Sec. VII, XVIII), and many other applications.^{[DEC]} LSTM started superior speech recognition (with our CTC, 2007-15, see Sec. A), machine translation (2016, see Sec. B), robotics & video game players (2018-19, see Sec. C), and many other applications.^{[DEC]} Our Fast Weight Programmers (1991, see Sec. XVI) are formally equivalent to linear Transformers (now popular in NLP). See Sec. I, A, B, C, D, VII, XVIII.
As mentioned earlier,^{[MIR](Sec. 21)} it is not always clear^{[DLC]} whether LBH were aware of the essential prior work. Ivakhnenko's nets of 1965 were the first with depth that really learned.^{[DEEP1-2][R8]} Five years later, modern backpropagation was published (1970).^{[BP1]}
Yes, this critique is also an implicit critique of certain other awards to LBH.^{[HIN]} Many of the points above were discussed in threads at reddit.com/r/MachineLearning^{[R1-R12]} (the largest machine learning forum, with back then over 800k subscribers), many of them influenced by my overview.^{[MIR]}
Dr. LeCun himself is well aware of the challenges to scientific integrity in our field:^{[LECP]} "... else cites."^{[LECP]}
Note that I am insisting on proper credit assignment not only in my own research field but also in quite disconnected areas,^{[HIN]} as demonstrated by my numerous letters in this regard published in Science and Nature, e.g., on the history of aviation,^{[NASC1-2]} the telephone,^{[NASC3]} the computer,^{[NASC4-7]} resilient robots,^{[NASC8]} and scientists of the 19th century.^{[NASC9]} One day, AI scientists and AI historians equipped with artificial curiosity^{[SA17][AC90-AC20][PP-PP2]} may help to set the record straight.
Thanks to many expert reviewers for useful comments. Since science is about self-correction, let me know under juergen@idsia.ch if you can spot any remaining error. Many additional relevant publications can be found in my publication page and my arXiv page. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. Our PDF. The first paper on planning with reinforcement learning recurrent neural networks (NNs) (more) and on generative adversarial networks (more). PDF. More. PDF. PDF. PDF. PDF. (More on artificial scientists and artists.) IEEE link. PDF. With a brief summary of the generative adversarial neural networks of 1990^{[AC90,90b][AC20]} (more). Preprint arXiv/1906.04493. Link. Link. [AIB] J. Schmidhuber. AI Blog. Includes variants of chapters of the AI Book. Blog of Werner Vogels, CTO of Amazon (Nov 2016): [ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. We had both hard attention (1990) and soft attention (1991-93).^{[FWP]} Today, both types are very popular. PDF. PDF. More. PS. (PDF.) arXiv/1409.0473, 2014-16. Bloomberg, May 15, 2018. Bloomberg, May 17, 2018. PDF. HTML. PDF. Precursor of modern backpropagation.^{[BP1-4]} PDF. Link. PDF. First application of backpropagation^{[BP1]} to NNs (concretizing thoughts in his 1974 thesis). [BP4] J. Schmidhuber (AI Blog, 2014; updated 2020). Who invented backpropagation? More.^{[DL2]} English version: [CNN1+]. More in Scholarpedia. Link. [CNN1a] A. Waibel. Phoneme Recognition Using Time-Delay Neural Networks. Meeting of IEICE, Tokyo, Japan, 1987. First application of backpropagation^{[BP1][BP2]} and weight-sharing PDF. Spatial Averaging.^{[CNN1]} PDF. PDF. PDF. PDF. Beijing, 2014. Preprint arXiv:1402.3511 [cs.NE]. J. Schmidhuber (AI Blog, 2021). 10-year anniversary. 
In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named 1st superhuman result in 2011.^{[DAN1]} J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition. our artificial neural network called DanNet [DEC] J. Schmidhuber (AI Blog, 02/20/2020, revised 2021). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The [DIST1] J. Schmidhuber, 1991.^{[UN-UN2]} More. Deep Learning. HTML. [DL3a] Y. Bengio, Y. LeCun, G. Hinton (2021). Turing Lecture: Deep Learning for AI. Communications of the ACM, July 2021. HTML. Local copy (HTML only). [DL4] J. Schmidhuber (AI Blog, 2017). Our impact on the world's most valuable public companies: Apple, Google, Microsoft, Facebook, Amazon... By greatly improved (CTC-based) on-device speech recognition (on the phone, not the server) LSTM. PDF. J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.^{[DL6]} Soon after its publication, everybody started talking about "deep learning." Causality or correlation? Web site deeplearning.net of Y. Bengio's MILA (2015, retrieved May 2020; compare the version in the Internet Archive), referring to Hinton's^{[UN4]} and Bengio's^{[UN5]} unsupervised pre-training for deep NNs^{[UN4]} (2006) although this type of deep learning dates back to 1991.^{[UN1-2][UN]} II & XVII & III. [DLC] J. Schmidhuber (AI Blog, June 2015). Critique of Paper by "Deep Learning Conspiracy" (Nature 521 p 436). arxiv:1312.5602. Link. Alphastar has a "deep LSTM core." arXiv:1808.03578, 2018. used LSTM over 4 billion automatic translations per day (The Verge, August 4, 2017); Facebook blog by J.M. Pino, A. Sidorov, N.F. Ayan (August 3, 2017) PDF. J. Schmidhuber (AI Blog, 26 March 2021). alternative^{[FWP0-1]} to recurrent NNs. 
the fast weights^{[FAST,FASTa]} of Such Fast Weight Programmers^{[FWP0-6,FWPMETA1-7]} can learn to memorize past data, e.g., by computing fast weight changes through additive outer products of self-invented activation patterns^{[FWP0-1]} (now often called keys and values for self-attention^{[TR1-6]}). The similar Transformers^{[TR1-2]} combine this with projections linear Transformers or Performers^{[TR5-6]} In 1993, I introduced the attention terminology^{[FWP2]} now used in this context,^{[ATT]} and RNNs that program themselves. PDF. PDF. HTML. Pictures (German). PDF. Preprint: arXiv:1811.12143. PDF. PDF. Like [FWP0-2]. Preprint: arXiv:2003.08165. PDF. HTML overview. Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021. Preprint: arXiv:2102.11174. Preprint: arXiv:2106.06295 (June 2021). PDF. An introspective network that can learn to run its own weight change algorithm. In Proc. of the Intl. Conf. on Artificial Neural Networks, J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF. can be found here. Preprint arXiv:2012.14905 [cs.LG], 2020. Report arXiv:2011.07831 [cs.AI], 2020. Google Research Blog, Sep 2015, see also Aug 2015 Google's speech recognition based on CTC and LSTM. Alphr Technology, Jul 2015, or 9to5google, Jul 2015 WIRED, Sep 2016, siliconANGLE, Sep 2016 Blog post, Internet Archive, 2010. A blog post describing the basic ideas^{[AC][AC90, AC90b][AC20]} of GANs. Description of GANs that does not cite the original work of 1990^{[AC][AC90,AC90b][AC20][R2]} (also containing wrong claims about Predictability Minimization^{[PM0-2][AC20]}). Link. This was number 1 on Hacker News. Frankfurter Allgemeine Zeitung, 16/6/2021. Preprint arXiv/2005.14165. for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011. PDF. ArXiv preprint. win four important computer vision competitions 2011-2012 before others won any PDF. HTML overview. competitor.^{[DAN1]} This led to massive interest from industry. 
[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More. PDF. J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet won 4 of them in a row before the similar AlexNet/VGG Net and the Resnet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision. PDF. PDF. First deep learner to win a medical imaging contest (2012). HTML. [HIN] J. Schmidhuber (AI Blog, 2020). Critique of Honda Prize for Dr. Hinton. Science must not allow corporate PR to distort the academic record. PDF. North-Holland, 1991. PDF. Extending TR FKI-129-90, TUM, 1990. PDF. PDF. Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS 2015. (Highway gates are derived from the LSTM with forget gates^{[LSTM2]} for RNNs.) Resnets^{[HW2]} are a version of this where the gates are always open: g(x)=t(x)=const=1. Highway Nets perform roughly as well as ResNets^{[HW2]} on ImageNet.^{[HW3]} Highway layers are also often used for natural language processing, where the simpler residual layers do not work as well.^{[HW3]} More. Link. arXiv:1512.03385 (Dec 2015). Residual nets are a version of Highway Nets.^{[HW1]} More. arxiv:1612.07771 (2016). Also at ICLR 2017. Preprint arXiv:1704.04760 PDF. PDF. arXiv:1607.06450, 2016. A New Publishing Model in Computer Science. Local copy (HTML only). [LEI21] J. Schmidhuber (AI Blog, 2021). 375th birthday of Leibniz, founder of computer science. Frankfurter Allgemeine Zeitung (FAZ), 17/5/2021. FAZ online: 19/5/2021. PDF. [LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF. Based on [LSTM0]. More. Preprint: arxiv:1506.07452. J.
Schmidhuber (AI Blog, Dec 2020). 10-year anniversary of our journal paper on deep reinforcement learning with policy gradients for LSTM (2007-2010). Recent PDF. Preprint arXiv:1805.04908. Architectures. Preprint arXiv:1703.03906 J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of Searchable PDF scan (created by OCRmypdf which uses LSTM). HTML. better GP methods through Meta-Evolution. More. [MIR] J. Schmidhuber (AI Blog, 2019). Deep Learning: Our Miraculous Year 1990-1991. Preprint arXiv:2005.05744, 2020. Computation 22(12): 3207-3220, 2010. ArXiv Preprint. (AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training. By 2010, when compute was 100 times more expensive than today, both our feedforward NNs^{[MLP1]} J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in my labs at TU Munich and IDSIA. Here I mention: (1) Long Short-Term Memory (LSTM), (2) ResNet (which is our earlier Highway Net with open gates), (3) AlexNet and VGG Net (both citing our similar earlier DanNet: the first deep convolutional NN to win image recognition competitions), Adversarial Artificial Curiosity), and (5) variants of Transformers (linear Transformers are formally equivalent to my earlier Fast Weight Programmers). Annus Mirabilis of 1990-1991.^{[MIR]} Preprint arXiv:1611.01578 (PDF), 2017. [NASC1] J. Schmidhuber. First Pow(d)ered flight / plane truth. Correspondence, Nature, 421 p 689, Feb 2003. [NASC3] J. Schmidhuber. The last inventor of the telephone. Letter, Science, 319, no. 5871, p. 1759, March 2008. Correspondence, Nature, vol 483, p 541, March 2012, doi:10.1038/483541b. Letter, Science, vol 336, p 1639, June 2012. See also comment on response by A. Hodges (DOI:10.1126/science.336.6089.1639-a) [NASC6] J. Schmidhuber. Colossus was the first electronic digital computer. Correspondence, Nature, 441 p 25, May 2006. [NASC7] J. 
Schmidhuber. Turing's impact. Correspondence, Nature, 429 p 501, June 2004 [NASC8] J. Schmidhuber. Prototype resilient, self-modeling robots. Correspondence, Science, 316, no. 5825 p 688, May 2007. [NASC9] J. Schmidhuber. Comparing the legacies of Gauss, Pasteur, Darwin. Correspondence, Nature, vol 452, p 530, April 2008. HTML. Link. NY Times article NY Times article Learning Dexterous In-Hand Manipulation. arxiv:1312.5602 (PDF). arxiv:1912.06680. An LSTM composes 84% of the model's total parameter count. 2018. An LSTM with 84% of the model's total parameter count was the core of OpenAI Five. PDF. HTML. J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, the GAN principle. Based on TR FKI-126-90 (1990).^{[AC90]} More. PDF. Partially based on TR FKI-126-90 (1990).^{[AC90]} Report arXiv:1210.0118 [cs.AI], 2015. One Big Net For Everything. Preprint arXiv:1802.08864 [cs.AI], Feb 2018. Preprint: arXiv:1809.01999. Github: World Models. minimization. TR CU-CS-565-91, Univ. Colorado at Boulder, 1991. PDF. More. 1991. PDF. More. PDF. More. arXiv:1112.5309 [cs.AI] First Experiments with PowerPlay. arXiv:1210.8385 [cs.AI]. [R1] Reddit/ML, 2019. Hinton, LeCun, Bengio receive ACM Turing Award. [R2] Reddit/ML, 2019. J. Schmidhuber really had GANs in 1990. [R3] Reddit/ML, 2019. NeurIPS 2019 Bengio Schmidhuber Meta-Learning Fiasco. [R4] Reddit/ML, 2019. Five major deep learning papers by G. Hinton did not cite similar earlier work by J. Schmidhuber. [R5] Reddit/ML, 2019. The 1997 LSTM paper by Hochreiter & Schmidhuber has become the most cited deep learning research paper of the 20th century. [R6] Reddit/ML, 2019. DanNet, the CUDA CNN of Dan Ciresan in J. Schmidhuber's team, won 4 image recognition challenges prior to AlexNet. [R7] Reddit/ML, 2019. J.
Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970. [R8] Reddit/ML, 2019. J. Schmidhuber on Alexey Ivakhnenko, godfather of deep learning 1965. [R9] Reddit/ML, 2019. We [R11] Reddit/ML, 2020. Schmidhuber: Critique of Honda Prize for Dr. Hinton [R12] Reddit/ML, 2020. J. Schmidhuber: Critique of Turing Award for Drs. Bengio & Hinton & LeCun [R15] Reddit/ML, 2021. J. Schmidhuber's work on fast weights from 1991 is similar to linearized variants of Transformers Preprint arXiv/1311.2524, Nov 2013. Preprint arXiv/1703.06870, 2017. PDF. This experimental analysis of backpropagation did not cite the origin of the method,^{[BP1-4]} also known as the reverse mode of automatic differentiation. Link. The Past, Present and Future of Artificial Intelligence. PDF. PDF. ACM's justification of the 2018 A.M. Turing Award (announced in 2019). WWW link. Local copy 1 (HTML only). Local copy 2 (HTML only). [T20a] J. Schmidhuber (AI Blog, 25 June 2020). Critique of 2018 Turing Award for Drs. Bengio & Hinton & LeCun. The first version of the present critique. Link. [TUR21] J. Schmidhuber (AI Blog, Sep 2021). Turing Oversold. It's not Turing's fault, though. J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised pre-training. Unsupervised PDF. 1992. Based on TR FKI-148-91, TUM, 1991.^{[UN0]} PDF. approaches are now widely used. More. [UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF. can be found here (depth > 1000). 2006. PDF. Link. [VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF. More on the Fundamental Deep Learning Problem. PDF. [VAN4] Y. Bengio. Neural net language models. Scholarpedia, 3(1):3881, 2008. Link. Link. Youtube video [see 28:16]. But in 2010, our team showed^{[MLP1-2]} unsupervised pre-training is not necessary Youtube video, 2018. Preprint arXiv:1609.08144 (PDF), 2016. Based on LSTM which it mentions at least 50 times. 
WWW link (retrieved 15 May 2020). Local copy (plain HTML only). a general, practical, program-controlled computer. PDF. J. Schmidhuber (AI Blog, 2021). 80th anniversary celebrations: 1941: Konrad Zuse completes the first working general computer, based on his 1936 patent application.
Traditionally this is done with recurrent NNs (RNNs)
published.^{[FWP0-1]}
the fast weights of
another NN (see Sec. 1).
In 1991, one of them^{[FWP0-1]}
(now often called keys and values for self-attention; Sec. 2).
The very similar Transformers^{[TR1-2]} combine this with projections
Transformers with linearized self-attention^{[TR5-6]}
to the 1991 Fast Weight Programmers^{[MOST]} (see this tweet).
In 1993, I also introduced
the attention terminology^{[FWP2]} now used
in this context^{[ATT]} (Sec. 4), and
RNNs that program themselves
(Sec. 3).
famous vanishing gradient problem, a.k.a. the fundamental deep learning problem (analyzed a few months later in 1991^{[VAN1]}),
through additive fast weight changes (Sec. 5).
additive neural activations of LSTMs / Highway Nets / ResNets^{[HW1-3]} (Sec. 5)
Annus Mirabilis of deep learning.^{[MIR]}
brand new, improved version^{[FWP6]} of
the 1991 fast weight update rule (Sec. 6).
reinforcement learning through neuroevolution^{[FWP5]} (2005-, Sec. 7),
goal-conditioned policy generators (2022),^{[GGP]}
metalearning machines that learn to learn^{[FWPMETA1-9]}
(1992-2022, Sec. 8).
As I have frequently emphasized since 1990,^{[AC90][PLAN][META]} inspired by universal self-referential formal systems,^{[GOD][GOD34]} I built NNs whose outputs are changes of programs or weight matrices of other NNs^{[FWP0-2]} (Sec. 1, 2, 3), and even of their own weight change algorithms or learning algorithms^{[FWPMETA1-5]} (Sec. 8). A gradient descent procedure^{[BP1-4][BPA][R7]} can compute a direction in program space where one may find a better program,^{[AC90]} or a better program-modifying program.^{[FWP0-2][FWPMETA1-5]}

Deep learning itself started in 1965 with networks of many layers.^{[DEEP1-2]} Their activation functions were Kolmogorov-Gabor polynomials which include the now popular multiplicative gates,^{[DL1-2]} a building block of fast weights.

von der Malsburg was the first to explicitly emphasize the importance of NNs with rapidly changing weights.^{[FAST]} The second paper on this was published by Feldman in 1982.^{[FASTa]} The weights of a 1987 NN were sums of weights with a large learning rate and weights with a small rate^{[FASTb][T22]} (but they have nothing to do with the NN-programming NNs discussed below). Fast Weight Programmers (FWPs) were published in 1991-93^{[FWP0-2]} (Sec. 1, 2, 3, 4). They anticipated today's attention^{[ATT]} (Sec. 4) and Transformers^{[TR1-6]} (Sec. 2, 3, 4, 5).
On 26 March 1991, I described a slow NN that learns by backpropagation^{[BP1-4]} to rapidly modify the fast weights of another NN,^{[FWP0]} a system later also published in Neural Computation.^{[FWP1]} This can be viewed as a form of attention^{[ATT]} (Sec. 4). That is, I separated storage and control like in traditional computers, but in a fully neural way (rather than in a hybrid fashion^{[PDA1][PDA2][DNC]}). Compare also Synthetic Gradients.^{[NAN1-5]} One of the FWPs of 1991^{[FWP0-1]} is illustrated in the figure. A disadvantage addressed in Sec. 2 is that the slow net needs many output units if the fast net is large.
The Fast Weight Programmer^{[FWP0-1]} depicted in Sec. 1 has a slow net unit for each fast weight. However, Section 2 of the same 1991 paper^{[FWP0]} describes a more efficient alternative, now viewed as linear^{[TR5-6]} Transformers^{[TR1-2]} or attention^{[ATT]} (compare Sec. 4): outer products of self-invented patterns are added to the fast weight (which then may be normalized by a squashing function^{[FWP0]}). These additive outer products are second order tensor products^{[FWP0-3a]} (compare linear Transformers).^{[FWP6][TR5-6]} The highly successful Transformers of 2017^{[TR1-2]} can be viewed as a combination of my additive outer product fast weight principle^{[FWP0-2]} and NN-programmed fast weights (Sec. 5 & 1). Linear Transformers (2020-21)^{[TR5-6]} abandoned the softmax, essentially resurrecting the original 1991 system.^{[FWP0-1]} Compare Sec. 6. Outer products themselves go back at least to Hebb's informal rule (1949)^{[HE49]} and Steinbuch's Learning Matrix around 1960.^{[ST61-63][AMH1-2][KOH72][LIT74][PAL80][KOS88]} However, they have programmed the weights of NNs only since 1991.^{[FWP0-3a][TR5-6]} I offered the FWPs of 1991^{[FWP0-1]} as an alternative to sequence-processing recurrent NNs (RNNs) (Sec. 1), the computationally most powerful NNs of them all.^{[UN][MIR](Sec. 0)} Modern Transformers are also viewed as RNN alternatives, despite their limitations.^{[TR3-4]} The slow net and the fast net of the 1991 system^{[FWP0-1]} in Sec. 2 were feedforward NNs (FNNs), like most current Transformers.^{[TR1-6]} In 1993, I collapsed all of this into a single RNN that could rapidly reprogram all of its own fast weights through additive outer product-based weight changes.^{[FWP2]} One motivation reflected by the title of the paper^{[FWP2]} was to get many more adaptive weights than an RNN of the same size: O(H^{2}) instead of O(H), where H is the number of hidden units. This motivation and a variant of the method was republished over two decades later.^{[FWP4a][R4][MIR](Sec. 8)[T22](Sec. XVII, item H3)} See also our more recent work on FWPs since 2017,^{[FWP3-3a][FWPMETA7][FWP6]} and compare a recent study.^{[RA21]} 4.
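The additive outer-product programming instruction described above can be sketched in a few lines. This is a toy illustration under my own assumptions (tiny dimensions, hard-coded key/value patterns standing in for learned slow-net outputs), not the 1991 system itself:

```python
# Minimal sketch of an additive outer-product fast weight update: a slow net
# emits a key pattern and a value pattern, their outer product is ADDED to the
# fast weight matrix, and a later query retrieves the stored association.
# All patterns below are illustrative assumptions, not learned outputs.

def outer(u, v):
    """Outer product of two vectors as a list-of-lists matrix."""
    return [[ui * vj for vj in v] for ui in u]

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_vec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

H = 3
F = [[0.0] * H for _ in range(H)]   # fast weight matrix starts at zero

value = [1.0, 0.0, 2.0]             # pattern to store
key = [0.0, 1.0, 0.0]               # pattern to store it under
F = mat_add(F, outer(value, key))   # additive weight change

query = [0.0, 1.0, 0.0]             # matches the key
print(mat_vec(F, query))            # recovers the value pattern
```

Applying the fast net to a query that matches the key returns the stored value pattern; queries orthogonal to all keys return zero.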
Attention terminology of 1993. Today, everybody is talking about attention when it comes to describing the principles of Transformers.^{[TR1-2]} The additive outer products^{[FWP0-1]} of the Fast Weight Programmers described in Sec. 2 and Sec. 3 play exactly this role. Similarly, the attention weights or self-attention weights (see also^{[FWP4b-d]}) correspond to NN-programmed fast weights (Sec. 5).^{[FWP0-1], Sec. 9 & Sec. 8 of [MIR], Sec. XVII of [T22]} It was my 1993 paper^{[FWP2]} which introduced the terminology of internal spotlights of attention for Fast Weight Programmers.^{[FWP2][ATT]} Apart from possible normalization/squashing,^{[FWP0]} the fast weight changes are additive (Sec. 1 & 2). Hence FWPs do not suffer during sequence learning from the famous vanishing gradient problem analyzed by my brilliant student Sepp Hochreiter a few months later in his 1991 diploma thesis.^{[VAN1]}
and both of them dating back to 1991, our miraculous year of deep learning.^{[MIR]} Basic Long Short-Term Memory^{[LSTM1]} solves the problem by adding new values to its cell state at every time step. That is, the core of LSTM is operating in a linear additive activation space (ignoring LSTM's multiplicative gates).^{[LSTM1][VAN1][MIR](Sec. 4 & Sec. 8)} Additive FWPs^{[FWP0-2]} (Sec. 1 & 2), however, solve the problem through a dual approach based on additive fast weight changes. By favoring additive operations yielding non-vanishing first derivatives and error flow,^{[VAN1]} Transformers^{[TR1-6]} also follow the additive approach^{[FWP0-2]} (compare Sec. 2 and Sec. 4 on attention terminology since 1993).
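Why additive operations help can be shown with a toy calculation of my own (not from the text), contrasting a purely multiplicative recurrence with an additive one:

```python
# Toy illustration of the vanishing gradient problem and its additive fix.
# Multiplicative recurrence: s_t = w * s_{t-1}. The derivative of the final
# state w.r.t. the initial state is w**T, which vanishes for |w| < 1.
# Additive recurrence: s_t = s_{t-1} + x_t (as in LSTM's cell state or
# additive fast weight changes). The derivative of the final state w.r.t.
# any earlier input is exactly 1: constant error flow.

T, w = 50, 0.5                 # 50 time steps, recurrent weight 0.5

grad_mult = w ** T             # multiplicative path: ~8.9e-16, vanished
grad_add = 1.0                 # additive path: non-vanishing by construction

print(grad_mult, grad_add)
```

With |w| > 1 the multiplicative path would instead explode exponentially; the additive path stays at 1 in both regimes.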
The additive approach of LSTM^{[LSTM1-13]} is mirrored in the LSTM-inspired Highway Network (May 2015),^{[HW1][HW1a][HW3]} the first working really deep feedforward NN. It is essentially a feedforward version of LSTM^{[LSTM1]} with forget gates.^{[LSTM2]} The same holds for the Residual Net or ResNet^{[HW2]} (Dec 2015). Remarkably, both of these dual approaches of 1991 have become successful. By the mid 2010s,^{[DEC]} major IT companies overwhelmingly used LSTM, e.g., for speech recognition on smartphones.^{[DL4]} LSTM networks also rapidly learn certain tasks^{[LSTM13]} that plain Transformers can't yet learn.^{[TR4]} Unsupervised pre-training of deep NNs^{[UN0-UN2][MIR](Sec. 1)} also dates back to 1991.^{[UN]} Recent work of February 2021^{[FWP6]} formally connects attention mechanisms^{[TR5-6]} and Fast Weight Programmers^{[FWP0-2]} (Sec. 2),^{[FWP4a][R4][MIR](Sec. 8)[T22](Sec. XVII, item H3)} including linearized variants.^{[TR5-6]} Building on previous work^{[FWPMETA7]} on FWPs (Sec. 1, 2, 3, 8), we replace the 1991 elementary programming instruction based on additive outer products^{[FWP0-2]} by a delta rule-like^{[WID]} programming instruction, improving performance on language modeling tasks.^{[FWP6]} Our code is public. Subsequent work of June 2021^{[FWP7]} (also with Robert Csordas) points out that the original FWP formulation of 1991^{[FWP0-1]} is more general than the one of linear Transformers: a slow NN continually reprograms the weights of a fast NN. Our code is public.
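The delta rule-like programming instruction mentioned above can be sketched as follows. This is my own toy illustration of the idea (fixed write strength beta, hand-set patterns), not the published implementation:

```python
# Sketch of a delta-rule-like fast weight update: before writing, retrieve
# the value currently associated with key k, and add only the (scaled)
# difference between the new value and the old one. Unlike the purely
# additive 1991 instruction, this lets the programmer cleanly OVERWRITE
# an association instead of accumulating interference.

def mat_vec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def delta_update(F, k, v, beta):
    v_old = mat_vec(F, k)                    # current association for key k
    return [[F[i][j] + beta * (v[i] - v_old[i]) * k[j]
             for j in range(len(k))] for i in range(len(v))]

F = [[0.0, 0.0], [0.0, 0.0]]
k = [1.0, 0.0]
F = delta_update(F, k, [3.0, 4.0], beta=1.0)   # write [3, 4] under k
F = delta_update(F, k, [5.0, 6.0], beta=1.0)   # overwrite with [5, 6]
print(mat_vec(F, k))                           # [5.0, 6.0], no residue of [3, 4]
```

With the plain additive rule, the second write would have yielded the sum [8.0, 10.0] instead of a clean replacement.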
as shown in 2005 with my former postdoc Faustino Gomez^{[FWP5]} (now CEO of NNAISENSE). Our 2005 paper on deep RL^{[DL6,6a]} was actually the first machine learning publication with the word combination "learn deep" in the title.^{[T22]} We also generated the numerous weights of large NNs through very compact codes.^{[KO0-2][CO1-4]} Here we exploited that the Kolmogorov complexity or algorithmic information content of successful huge NNs may actually be rather small. Our Compressed Network Search^{[CO2]} did not require unsupervised pre-training.
Recent work of 2022^{[GGP]} with
My first work on metalearning machines that learn to learn was published in 1987.^{[META][R3]} It addressed metalearning in a very general way. In references^{[FWPMETA1-5]} since 1992, the slow NN and the fast NN (Sec. 1) are recurrent and identical. The RNN can see its own errors or reward signals called eval(t+1) in the image.^{[FWPMETA5]}
The 1993 FWP of Sec. 3^{[FWP2]} also was an RNN. Unlike the self-referential RNN above,^{[FWPMETA1-5]} it used outer products between key patterns and value patterns (Sec. 2) to manipulate its fast weights. Other work used gradient descent in LSTM networks^{[LSTM1]} instead of traditional functions of two variables^{[HO1]} (more on LSTM and fast weights in Sec. 5). In 2020, Imanol Schlag et al. augmented an LSTM with an associative fast weight memory,^{[FWPMETA7]} which helps in partially observable environments.^{[FWPMETA7]} Our recent MetaGenRL (2020)^{[METARL10]} meta-learns learning algorithms; see the blog post of my PhD student Louis Kirsch. His VS-ML uses outer-product-like fast weights encoded in the activations of LSTMs,^{[FWPMETA6]} instead of traditional functions of two variables^{[FWP2]} (Sec. 3). VS-ML can also learn to implement the backpropagation learning algorithm^{[BP1-4]} purely in the end-to-end differentiable forward dynamics of RNNs.^{[FWPMETA6]}
In 2022, we also published at ICML a modern self-referential weight matrix (SRWM)^{[FWPMETA8]} based on the 1992 SRWM.^{[FWPMETA1-5]}
self-improvement (compare this tweet).
There is another version of this article
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. Our
PDF.
The first paper on long-term planning with reinforcement learning recurrent neural networks (NNs) (more) and on generative adversarial networks
(more).
PDF.
First publication of what was later sometimes called the Hopfield network^{[AMH2]} or Amari-Hopfield Network.
The Hopfield network or Amari-Hopfield Network was published in 1972 by Amari.^{[AMH1]}
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. Schmidhuber
Transformers with linearized self-attention (1991-93).^{[FWP]} Today, both types are very popular.
PDF.
PDF.
More.
PS. (PDF.)
Precursor of modern backpropagation.^{[BP1-4]}
PDF.
Link.
PDF.
First application of backpropagation^{[BP1]} to NNs (concretizing thoughts in his 1974 thesis).
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020).
Who invented backpropagation?
More.^{[DL2]}
PDF.
PDF.
PDF.
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, updated 2021, 2022). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The
More.
Deep Learning.
[DL4] J. Schmidhuber (AI Blog, 2017).
Our impact on the world's most valuable public companies: Apple, Google, Microsoft, Facebook, Amazon... By
greatly improved (CTC-based) on-device speech recognition (on the phone, not the server), based on LSTM.
PDF.
J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.^{[DL6]} Soon after its publication, everybody started talking about "deep learning." Causality or correlation?
neural networks learning to control dynamic external memories.^{[PDA1-2][FWP0-1]}
J. Schmidhuber (AI Blog, 26 March 2021, updated 2022).
alternative^{[FWP0-1]} to recurrent NNs.
the fast weights^{[FAST,FASTa]} of
Such Fast Weight Programmers^{[FWP0-6,FWPMETA1-8]} can learn to memorize past data, e.g.,
by computing fast weight changes through additive outer products of self-invented activation patterns^{[FWP0-1]}
(now often called keys and values for self-attention^{[TR1-6]}).
The similar Transformers^{[TR1-2]} combine this with projections
Transformers with linearized self-attention^{[TR5-6]}
In 1993, he introduced
the attention terminology^{[FWP2]} now used
in this context,^{[ATT]} and
RNNs that program themselves.
See tweet of 2022.
PDF.
"Transformer with linearized self-attention."^{[FWP]}
PDF.
HTML.
Pictures (German).
See tweet of 2022 for 30-year anniversary.
PDF.
Preprint: arXiv:1811.12143. PDF.
PDF.
Preprint: arXiv:2003.08165.
PDF.
HTML overview.
Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021. Preprint: arXiv:2102.11174.
Preprint: arXiv:2106.06295 (June 2021).
PDF.
PDF.
An introspective network that can learn to run its own weight change algorithm. In Proc. of the Intl. Conf. on Artificial Neural Networks,
J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF.
can be found here.
Preprint arXiv:2012.14905 [cs.LG], 2020.
Report arXiv:2011.07831 [cs.AI], 2020.
Preprint: arXiv:2202.05780.
Preprint arXiv/2207.01570, 4 July 2022 (submitted in May 2022).
Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS 2015. The
LSTM with forget gates^{[LSTM2]} for RNNs.) ResNets^{[HW2]} are a version of this where the gates are always open: g(x)=t(x)=const=1.
Highway Nets perform roughly as well as ResNets^{[HW2]} on ImageNet.^{[HW3]} Variants of highway gates are used for certain algorithmic tasks, where the simpler residual layers do not work as well.^{[NDR]} More.
Link.
arXiv:1512.03385
(Dec 2015). Residual nets are a version of Highway Nets^{[HW1]}
More.
arxiv:1612.07771 (2016). Also at ICLR 2017.
PDF.
PDF.
PDF.
[LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF.
More.
PDF.
PDF.
J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of
Searchable PDF scan (created by OCRmypdf which uses
LSTM).
HTML.
better GP methods through Meta-Evolution. More.
[MIR] J. Schmidhuber (AI Blog, 2019). Deep Learning: Our Miraculous Year 1990-1991. Preprint
arXiv:2005.05744, 2020.
J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in my labs at TU Munich and IDSIA. Here I mention: (1) Long Short-Term Memory (LSTM), (2) ResNet (which is our earlier Highway Net with open gates), (3) AlexNet and VGG Net (both building on our similar earlier DanNet: the first deep convolutional NN to win
image recognition competitions),
(4) GAN (an instance of my earlier Adversarial Artificial Curiosity), and (5) variants of Transformers (Transformers with linearized self-attention are formally equivalent to my earlier Fast Weight Programmers).
Annus Mirabilis of 1990-1991.^{[MIR]}
PDF.
PDF.
Preprint arXiv:1608.05343, 2016.
The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization. Proc. ICLR 2022. Preprint arXiv/2110.07732.
the 1991 publication on what's now called "Transformers with linearized self-attention."^{[FWP0-6][TR5-6]}
attention terminology in 1993.^{[ATT][FWP2][R4]}
See tweet of 2022 for 30-year anniversary.
J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, the GAN principle
[R3] Reddit/ML, 2019. NeurIPS 2019 Bengio Schmidhuber Meta-Learning Fiasco.
[R4] Reddit/ML, 2019. Five major deep learning papers by G. Hinton did not cite similar earlier work by J. Schmidhuber.
[R7] Reddit/ML, 2019. J. Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970.
[T22] J. Schmidhuber (AI Blog, 2022).
Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21, IDSIA, Lugano, Switzerland, 2022.
PDF.
J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised or self-supervised pre-training. Unsupervised
PDF.
1992. Based on TR FKI-148-91, TUM, 1991.^{[UN0]} PDF.
approaches are now widely used. More.
[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF.
can be found here (depth > 1000).
[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 15 June 1991 (advisor J. Schmidhuber). PDF.
https://people.idsia.ch/~juergen/deep-learning-history.html
arXiv:2212.11279
is dominated by artificial neural networks (NNs) and deep learning,^{[DL1-4]}
hyperlinks to relevant overview sites from my AI Blog. It also debunks certain popular but misleading historic accounts of deep learning, and supplements my previous
deep learning survey^{[DL1]}
mentioning my own team's work, because (as of 2022) the most cited NNs are based on it.^{[MOST]}
Sec. 1: Introduction
Sec. 2: 1676: The Chain Rule For Backward Credit Assignment
Sec. 3: Circa 1800: First Neural Net (NN) / Linear Regression / Shallow Learning
Sec. 4: 1920-1925: First Recurrent NN (RNN) Architecture. ~1972: First Learning RNNs
Sec. 5: 1958: Multilayer Feedforward NN (without Deep Learning)
Sec. 6: 1965: First Deep Learning
Sec. 7: 1967-68: Deep Learning by Stochastic Gradient Descent
Sec. 8: 1970: Backpropagation. 1982: For NNs. 1960: Precursor.
Sec. 9: 1979: First Deep Convolutional NN (1969: Rectified Linear Units)
Sec. 10: 1980s-90s: Graph NNs / Stochastic Delta Rule (Dropout) / More RNNs / Etc
Sec. 11: Feb 1990: Generative Adversarial Networks / Artificial Curiosity / NN Online Planners
Sec. 12: April 1990: NNs Learn to Generate Subgoals / Work on Command
Sec. 13: March 1991: NNs Learn to Program NNs. Transformers with Linearized Self-Attention
Sec. 14: April 1991: Deep Learning by Self-Supervised Pre-Training. Distilling NNs
Sec. 15: June 1991: Fundamental Deep Learning Problem: Vanishing/Exploding Gradients
Sec. 16: June 1991: Roots of Long Short-Term Memory / Highway Nets / ResNets
Sec. 17: 1980s-: NNs for Learning to Act Without a Teacher
Sec. 18: It's the Hardware, Stupid!
Sec. 19: But Don't Neglect the Theory of AI (Since 1931) and Computer Science
Sec. 20: The Broader Historic Context from Big Bang to Far Future
Sec. 21: Acknowledgments
Sec. 22: 555+ Partially Annotated References (many more in the award-winning survey^{[DL1]})
quite erroneous ideas about the origins of the universe (see the final section).
A history of AI written in the 1980s would have emphasized topics such as theorem proving,^{[GOD][GOD34][ZU48][NS56]} logic programming, expert systems, and heuristic search.^{[FEI63,83][LEN83]} AI is an old area of research seeing renewed interest. Practical AI dates back at least to 1914, when Leonardo Torres y Quevedo (see below) built the first working chess end game player,^{[BRU1-4]} and theoretical AI to the 1930s, when fundamental limits were identified for any type of computation-based AI.^{[GOD][BIB3][GOD21,a,b]}

A history of AI written in the early 2000s would have placed more emphasis on topics such as support vector machines and kernel methods,^{[SVM1-4]} Bayesian (actually Laplacian or possibly Saundersonian^{[STI83-85]}) reasoning^{[BAY1-8][FI22]} and other concepts of probability theory and statistics,^{[MM1-5][NIL98][RUS95]} decision trees,^{e.g.,[MIT97]} ensemble methods,^{[ENS1-4]} swarm intelligence,^{[SW1]} and evolutionary computation.^{[EVO1-7]([TUR1],unpublished)} Why? Because back then such techniques drove many successful AI applications.
A history of AI written in the 2020s must emphasize concepts such as the even older chain rule^{[LEI07]} and deep nonlinear artificial neural networks (NNs) trained by gradient descent,^{[GD']} in particular, feedback-based recurrent networks, which are general computers whose programs are weight matrices.^{[AC90]} Why? Because many of the most famous and most commercial recent AI applications depend on them.^{[DL4]}

Such NNs were already discussed at the MACY conferences (1946-1953)^{[MACY51]} and the 1951 Paris conference on calculating machines and human thought, now often viewed as the first conference on AI.^{[AI51][BRO21][BRU4]} Today, the term "AI" is largely associated with modern AI based on "deep learning" with NNs,^{[DL1-2][DEC]} which learn to recognize patterns, minimize pain, maximize pleasure, drive cars, etc.^{[MIR](Sec. 0)[DL1-4]}
The present piece also debunks a frequently repeated, misleading "history of deep learning"^{[S20][DL3,3a]} which ignores most of the pioneering work mentioned below.^{[T22]} See Footnote 6. The title image of the present article is a reaction to an erroneous piece of common knowledge which says^{[T19]} that the use of NNs "as a tool to help computers recognize patterns and simulate human intelligence had been introduced in the 1980s," although such NNs appeared long before the 1980s.^{[T22]} on the history of aviation,^{[NASC1-2]} the telephone,^{[NASC3]} the computer,^{[NASC4-7]} resilient robots,^{[NASC8]} and scientists of the 19th century.^{[NASC9]} Finally,
In 1676, Gottfried Wilhelm Leibniz published the chain rule, which also appeared in L'Hopital's 1696 textbook on Leibniz' differential calculus.^{[LEI07-10][L84]}
This answer is used by the technique of gradient descent (GD), apparently first proposed by Augustin-Louis Cauchy in 1847^{[GD']} (and much later by Jacques Hadamard^{[GD'']}; the stochastic version called SGD is due to Herbert Robbins and Sutton Monro (1951)^{[STO51-52]}).
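Gradient descent in Cauchy's sense can be sketched in a few lines; the one-dimensional quadratic, learning rate, and step count below are my own illustrative choices (the stochastic version of Robbins & Monro would use noisy gradient estimates instead):

```python
# Minimal gradient descent on f(x) = (x - 3)^2: repeatedly step against
# the gradient f'(x) = 2(x - 3) until the minimum at x = 3 is approached.

def grad(x):
    return 2.0 * (x - 3.0)   # derivative of f(x) = (x - 3)^2

x, lr = 0.0, 0.1             # start point and learning rate (illustrative)
for _ in range(100):
    x -= lr * grad(x)        # the descent step

print(x)                     # close to the minimizer x = 3
```

Each step contracts the distance to the minimum by the factor (1 - 2*lr), so convergence here is geometric.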
Footnote 1. In 1684, Leibniz was also the first to publish "modern" calculus;^{[L84][SON18][MAD05][LEI21,a,b]} later Isaac Newton was also credited for his unpublished work.^{[SON18]} Their priority dispute,^{[SON18]} however, did not encompass the chain rule.^{[LEI07-10]} Of course, both were building on earlier work: in the 2nd century B.C., Archimedes (perhaps the greatest scientist ever^{[ARC06]}) paved the way for infinitesimals Sangamagrama and colleagues of the Indian Kerala school.^{[MAD86-05]} "the world's first computer scientist"^{[LA14]}) also laid foundations of modern computer science. He and the first with an internal memory.^{[BL16]} He described the principles of binary computers (1679)^{[L79][L03][LA14][HO66][LEI21,a,b]} His formal Algebra of Thought (1686)^{[L86][WI48]} was deductively equivalent^{[LE18]} to the much later Boolean Algebra (1847).^{[BOO]} all possible questions through computation;^{[WI48]}
Footnote 3. Some claim that the backpropagation algorithm (discussed further down; now widely used to train deep NNs) is just the chain rule of Leibniz (1676) & L'Hopital (1696).^{[CONN21]} doing this).^{[T22]} It was not published until 1970, as discussed below.^{[BP1,4,5]}
In 1805, Adrien-Marie Legendre published what's now often called a linear neural network (NN). Later Johann Carl Friedrich Gauss was also credited for earlier unpublished work on this done circa 1795.^{[STI81]}
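In modern terms, Legendre's method fits the weights of a one-layer linear net by least squares. A minimal sketch on toy data of my own choosing:

```python
# Least squares fit of y = a*x + b, the "shallow learning" of circa 1800.
# The closed-form solution minimizes the sum of squared errors.

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]          # generated by y = 2x + 1 (toy data)

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n  # means of inputs and targets

# Closed-form least-squares slope and intercept:
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
b = my - a * mx

print(a, b)                        # recovers slope 2.0 and intercept 1.0
```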
Rosenblatt's perceptron (1958)^{[R58]} combined a linear NN as above with an output threshold function to obtain a pattern classifier (compare his more advanced work on multi-layer networks discussed below). See also Joseph's related work.^{[R61]} Widrow & Hoff's similar Adaline learned in 1962.^{[WID62]}
The first recurrent NN (RNN) architecture, the Lenz-Ising model, was introduced and analyzed by physicists Ernst Ising and Wilhelm Lenz in the 1920s.^{[L20][I24,I25][K41][W45][T22]} It settles into an equilibrium state in response to input conditions, and is the foundation of the first learning RNNs (see below). Recurrent architectures were also discussed in 1943 by neuroscientists Warren McCulloch and Walter Pitts^{[MC43]} and formally analyzed in 1956 by Stephen Cole Kleene.^{[K56]}
In 1972, Shun-Ichi Amari made the Lenz-Ising recurrent architecture adaptive such that it could learn to associate input patterns with output patterns by changing its connection weights.^{[AMH1]} See also Stephen Grossberg's work on biological networks,^{[GRO69]} David Marr's^{[MAR71]} and Teuvo Kohonen's^{[KOH72]} work, and Kaoru Nakano's learning RNN.^{[NAK72]}
10 years later, the Amari network was republished (and its storage capacity analyzed).^{[AMH2]} Some called it the Hopfield Network (!) or Amari-Hopfield Network.^{[AMH3]} Amari's paper also presented a sequence-processing generalization thereof.^{[AMH1]} Turing, too, had thoughts related to learning RNNs. This, however, was first published many decades later,^{[TUR1]} which explains the obscurity of his thoughts here.^{[TUR21]} (Margin note: it has been pointed out that the famous "Turing Test" should actually be called the "Descartes Test."^{[TUR3,a,b][TUR21]})
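The learning rule Amari gave the Lenz-Ising architecture can be sketched as a Hebbian associative memory. The single stored pattern, synchronous update, and toy size below are my own illustrative assumptions:

```python
# Toy Amari-Hopfield-style associative memory: weights are set by a Hebbian
# outer product of the stored pattern with itself (no self-connections),
# and a corrupted cue is driven back to the stored pattern by a threshold
# update of each unit.

def sign(x):
    return 1 if x >= 0 else -1

pattern = [1, -1, 1, -1, 1]
N = len(pattern)

# Hebbian weights: w_ij = p_i * p_j for i != j, w_ii = 0.
W = [[pattern[i] * pattern[j] if i != j else 0 for j in range(N)]
     for i in range(N)]

cue = [1, 1, 1, -1, 1]                    # stored pattern with one bit flipped
recalled = [sign(sum(W[i][j] * cue[j] for j in range(N)))
            for i in range(N)]
print(recalled == pattern)                # the network restores the memory
```

This is the "settling into an equilibrium state in response to input conditions" mentioned above, in its simplest possible form.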
Today, the most popular RNN is the Long Short-Term Memory (LSTM) mentioned below, which has become the most cited NN of the 20th century.^{[MOST]}
In 1958, Frank Rosenblatt not only combined linear NNs and threshold functions (see the section on shallow learning since 1800), he also had more interesting, deeper multilayer perceptrons (MLPs).^{[R58]} Because only the last layer learned,^{[DL1]} Rosenblatt basically had what much later was rebranded as Extreme Learning Machines (ELMs) without proper attribution.^{[ELM1-2][CONN21][T22]}
MLPs were also discussed in 1961 by Karl Steinbuch^{[ST61-95]} and Roger David Joseph^{[R61]} (1961). See also Oliver Selfridge's multilayer Pandemonium^{[SE59]} (1959). Rosenblatt (1962) even wrote about "back-propagating errors" in an MLP with a hidden layer,^{[R62]} although he did not yet have a general deep learning algorithm for deep MLPs. What's now called backpropagation is quite different and was first published in 1970, as discussed below.^{[BP1-BP5][BPA-C]}
Today, the most popular FNN is a version of the LSTM-based Highway Net (mentioned below) called ResNet,^{[HW1-3]} which has become the most cited NN of the 21st century.^{[MOST]}
multiplicative gates).^{[DEEP1-2][DL1-2][FDL]} A paper of 1971^{[DEEP2]} already described a deep learning net with 8 layers, trained by their highly cited method which was still popular in the new millennium,^{[DL2]} especially in Eastern Europe, where much of Machine Learning was born.^{[MIR](Sec. 1)[R8]} The term "deep learning" was first introduced to Machine Learning much later by Dechter (1986), and to NNs by Aizenberg et al (2000).^{[DL2]} (Margin note: our 2005 paper on deep learning^{[DL6,6a]} was the first machine learning publication with the word combination "learn deep" in the title.^{[T22]})
Unlike the layer-by-layer training of Ivakhnenko and Lapa (1965, see above), Amari's MLPs were trained in end-to-end fashion from scratch by stochastic gradient descent (SGD),^{[GD1]} a method proposed in 1951 by Robbins & Monro.^{[STO51-52]}
Amari's implementation^{[GD2,GD2a]} (with his student Saito) learned internal representations in a five layer MLP with two modifiable layers, which was trained to classify
See also Iakov Zalmanovich Tsypkin's even earlier work on gradient descent-based on-line learning for non-linear systems.^{[GDa-b]}
Remarkably, as mentioned above, Amari also published learning RNNs in 1972.^{[AMH1]}
In 1970, Seppo Linnainmaa was the first to publish what's now known as backpropagation, the famous algorithm for credit assignment in networks of differentiable nodes.^{[BP1,4,5]}
In 1982, Paul Werbos proposed to use the method to train NNs,^{[BP2]} extending ideas in his 1974 thesis.
In 1960, Henry J. Kelley already had a precursor of backpropagation in the field of control theory;^{[BPA]} see also later work of the early 1960s by Stuart Dreyfus and Arthur E. Bryson.^{[BPB][BPC]}^{[R7]} Unlike Linnainmaa's general method,^{[BP1]} the systems of the 1960s^{[BPA-C]}
Backpropagation is essentially an efficient way of implementing Leibniz's chain rule^{[LEI07-10]} (1676) (see above) for deep networks. Cauchy's gradient descent^{[GD']} uses this to incrementally adjust the weights such that the NN behaves more and more like some teacher, which could be a human, or another NN,^{[UN-UN2]} or something else. By the mid 1980s, sufficiently cheap computers had just become accessible in wealthier academic labs. An experimental analysis of the known method^{[BP1-2]} then demonstrated that backpropagation can yield useful internal representations in hidden layers of NNs.^{[RUM]} At least for supervised learning, backpropagation is generally more efficient than Amari's above-mentioned deep learning through the more general SGD method (1967), which learned useful internal representations in NNs about 2 decades earlier.^{[GD1-2a]}
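The reverse-mode chain rule can be shown on the smallest interesting case, a two-layer net with one hidden unit. The weights, input, and target below are toy values of my own choosing; the gradient is checked against a finite difference:

```python
# Backpropagation as the chain rule: y = w2 * tanh(w1 * x), squared error
# to a target. Gradients flow backward, one factor of the chain per layer.
import math

x, target = 1.0, 0.5
w1, w2 = 0.3, 0.7

# Forward pass.
h = math.tanh(w1 * x)
y = w2 * h
loss = 0.5 * (y - target) ** 2

# Backward pass (reverse-mode chain rule).
dy = y - target                  # dL/dy
dw2 = dy * h                     # dL/dw2
dh = dy * w2                     # dL/dh
dw1 = dh * (1 - h ** 2) * x      # dL/dw1, using tanh'(z) = 1 - tanh(z)^2

# Numerical check of dw1 by a finite difference.
eps = 1e-6
loss_p = 0.5 * (w2 * math.tanh((w1 + eps) * x) - target) ** 2
print(abs((loss_p - loss) / eps - dw1) < 1e-4)   # True
```

The same backward sweep scales to arbitrarily deep networks, which is exactly what makes reverse mode efficient: one forward pass plus one backward pass, regardless of the number of weights.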
It took 4 decades until the backpropagation method of 1970^{[BP1-2]} got widely accepted as a training method for deep NNs. Before 2010, many thought that the training of NNs with many layers requires unsupervised pre-training, a methodology introduced by myself in 1991^{[UN][UN0-3]} (see below), and later championed by others (2006).^{[UN4]} In fact, it was claimed^{[VID1]} that deep NNs cannot be trained well without it. But in 2010, our team with my postdoc Dan Ciresan^{[MLP1-2]} showed that deep NNs can be trained by plain backpropagation and do not at all require unsupervised pre-training for important applications.^{[MLP2]}
Our system set a new performance record^{[MLP1]} (using GPU-based NNs pioneered by Jung & Oh in 2004^{[GPUNN]}). A reviewer called this a "wake-up call to the machine learning community." It has been claimed that "researchers took a fresh look at the problem in the 1980s."^{[S20]} However, the 1969 book^{[M69]} addressed a "problem" of Gauss & Legendre's shallow learning (circa 1800)^{[DL1-2]} that had already been solved 4 years prior by Ivakhnenko & Lapa's popular deep learning method,^{[DEEP1-2][DL2]} and then also by Amari's SGD for MLPs.^{[GD1-2]} Minsky neither cited this work nor corrected his book later.^{[HIN](Sec. I)[T22]} Others republished key methods (such as the Boltzmann machine^{[BM][HIN][SK75][G63][T22]}) without relating them to the original work,^{[DLC][S20][T22]} although the true history is well-known: deep learning research was alive and kicking in the 1960s-70s, especially outside of the Anglosphere.^{[DEEP1-2][GD1-3][CNN1][DL1-2][T22]} Blatant misattribution and unintentional^{[PLAG1][CONN21]} or intentional^{[FAKE2]} plagiarism are still tainting the entire field of deep learning.^{[T22]} Scientific journals "need to make clearer and firmer commitments to self-correction,"^{[SV20]} as is already the standard in other scientific fields.
Computer Vision was revolutionized in the 2010s by a particular feedforward NN called the convolutional NN (CNN).^{[CNN1-4]} The basic CNN architecture with convolutional and downsampling layers is due to Fukushima (1979), who called it the Neocognitron.^{[CNN1]} Fukushima had also introduced rectified linear units (ReLUs) for NNs (1969).^{[RELU1]} They are now widely used in CNNs and other NNs.
In 1987, NNs with convolutions were combined by Alex Waibel with weight sharing and backpropagation (see above),^{[BP1-2]} and applied to speech.^{[CNN1a]} Waibel did not call this CNNs but TDNNs. The popular downsampling variant called max-pooling was introduced by Yamaguchi et al. for TDNNs in 1990^{[CNN3a]} and by Weng et al. for higher-dimensional CNNs in 1993.^{[CNN3]} Yann LeCun's team has contributed improvements of CNNs, especially for images.^{[CNN2,4][T22]} Baldi and Chauvin (1993) had the first application of CNNs with backpropagation to biomedical/biometric images.^{[BA93]}
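Max-pooling is simple enough to show directly; the 4x4 feature map and 2x2 window below are my own toy values:

```python
# Max-pooling: downsample a feature map by keeping only the maximum of each
# non-overlapping 2x2 window, halving the spatial resolution while preserving
# the strongest local responses.

def max_pool_2x2(fm):
    n = len(fm)   # assumes a square map with even side length
    return [[max(fm[i][j], fm[i][j+1], fm[i+1][j], fm[i+1][j+1])
             for j in range(0, n, 2)] for i in range(0, n, 2)]

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
print(max_pool_2x2(feature_map))   # [[4, 2], [2, 8]]
```

Besides reducing computation in the layers above, the max operation gives the network a degree of invariance to small translations of the input.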
CNNs (Dan Ciresan et al., 2011).^{[GPUCNN1,3,5]}
Our fast GPU-based^{[GPUNN][GPUCNN5]} CNN of 2011^{[GPUCNN1]} known as DanNet^{[DAN,DAN1][R6]} was much faster than the GPU-accelerated CNNs of 2006.^{[GPUCNN]} In 2011, DanNet became the first pure deep CNN to win computer vision contests.^{[GPUCNN2-3,5]}
Highway Net^{[HW1]}
with open gates
ResNet, the ImageNet 2015 winner^{[HW2]} (Dec 2015) and currently the most cited NN,^{[MOST]} is a version (with open gates) of our earlier Highway Net (May 2015).^{[HW1-3][R5]} The Highway Net (see below) is actually the feedforward net version of our vanilla LSTM (see below).^{[LSTM2]} It was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers).

NNs with rapidly changing "fast weights" were introduced by v.d. Malsburg (1981) and others.^{[FAST,a,b]} Deep learning architectures that can manipulate structured data such as graphs^{[T22]} were pioneered by our graph NN-like, Transformer-like Fast Weight Programmers of 1991,^{[FWP0-1][FWP6][FWP]} which learn to continually rewrite mappings from inputs to outputs (addressed below), and by the work of Baldi and colleagues.^{[BA96-03]} Today, graph NNs are used in numerous applications.
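The "version with open gates" relation between ResNet and the Highway Net can be sketched on a toy scalar layer. The transform function and gate values below are my own illustrative placeholders, not learned components:

```python
# A highway layer mixes a transform h(x) and the carried input x through a
# transform gate t and a carry gate g; a residual layer is the special case
# where both gates are fixed open (t = g = 1), so y = h(x) + x.

def highway(x, h, t, g):
    return h(x) * t(x) + x * g(x)

def residual(x, h):
    return h(x) + x

h = lambda x: 0.5 * x + 1.0        # placeholder for a learned transform
open_gate = lambda x: 1.0          # gate permanently open

x = 2.0
print(highway(x, h, open_gate, open_gate) == residual(x, h))   # True
```

With learned (input-dependent) gates, the highway layer can smoothly interpolate between copying its input and fully transforming it, which is what made training hundreds of such layers feasible.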
Werbos,^{[BP2][BPTT1]} Williams,^{[BPTT2][CUB0-2]} and others^{[ROB87][BPTT3][DL1]} analyzed ways of implementing gradient descent^{[GD'][STO51-52][GDa-b][GD1-2a]} in RNNs. Kohonen's self-organising maps became popular.^{[KOH82-89]} There were also alternative methods of credit assignment in space and time.^{[BB2][NAN1-4][NHE][HEL]} See overviews^{[MIR](Sec. 15, Sec. 17)} and recent renewed interest in such methods.^{[NAN5][FWPMETA6][HIN22]} A version of the earlier stochastic delta rule became popular under the moniker "dropout."^{[Drop1-4][GPUCNN4]}

Generative Adversarial Networks (GANs) have become very popular.^{[MOST]} They were first published in 1990 in Munich under the moniker Artificial Curiosity.^{[AC90-20][GAN1]} Two dueling NNs (a probabilistic generator and a predictor) are trying to maximize each other's loss in a minimax game^{[AC](Sec. 1)} (using stochastic units^{[AC90]} like in the much later StyleGANs^{[GAN2]}). The predictor NN minimizes its error, while the generator NN tries to make outputs that maximize this error: one net's loss is the other net's gain.^{[AC90]} (The world model can also be used for continual online action planning.^{[AC90][PLAN2-3][PLAN]})
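The minimax principle just described can be sketched with a deliberately trivial toy of my own (both "networks" are single parameters, the environment is a fixed linear map), not the 1990 system:

```python
# Toy adversarial-curiosity sketch: a predictor minimizes its prediction
# error on the environment's response to an action, while the generator
# (controller) would be rewarded for actions where that error stays high.
# One net's loss is the other net's gain.

world = lambda a: 2.0 * a          # unknown environment response (toy)
pred_w = 0.0                       # predictor's model: guess = pred_w * a
action = 1.0                       # a fixed action the generator keeps trying
lr = 0.1

for step in range(200):
    guess = pred_w * action
    err = (world(action) - guess) ** 2
    # The predictor descends on its squared error...
    pred_w += lr * 2 * (world(action) - guess) * action
    # ...while the generator's reward is err itself: once the predictor
    # masters this action, it becomes "boring" and new actions pay off.

print(err < 1e-6)                  # predictor has learned this regularity
```

The generator's side is only indicated in the comments here; in the full 1990 setting it is a second trained network whose reward is the predictor's error.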
4 years before a 2014 paper on GANs,^{[GAN1]} my well-known 2010 survey^{[AC10]} summarised the generative adversarial NNs of 1990.^{[AC20][AC][T22](Sec. XVII)} Earlier adversarial machine learning settings^{[S59][H90]} neither involved unsupervised NNs nor were about modeling data nor used gradient descent.^{[AC20]} The adversarial curiosity principle has been widely used for exploration in Reinforcement Learning^{[SIN5][OUD13][PAT17][BUR18]} and for synthesis of realistic images,^{[GAN1,2]} although the latter domain was recently taken over by Rombach et al.'s Latent Diffusion, another method published in Munich,^{[DIF1]} building on Jarzynski's earlier work in physics from the previous millennium^{[DIF2]} and more recent papers.^{[DIF3-5]} Another early adversarial technique was Predictability Minimization for creating disentangled representations of partially redundant data, applied to images in 1996.^{[PM0-2][AC20][R2][MIR](Sec. 7)} Such compositional learning of hierarchies is now considered a remaining grand challenge.^{[LEC]} The early 1990s, however, saw first exceptions: NNs that learn to decompose complex spatio-temporal observation sequences into compact but meaningful chunks^{[UN0-3]} (see further below), and NN-based planners of hierarchical action sequences for compositional learning,^{[HRL0]} as discussed next. This work injected concepts of traditional "symbolic" hierarchical AI^{[NS59][FU77]} into end-to-end differentiable "sub-symbolic" NNs. In 1990, I also introduced end-to-end differentiable NN-based subgoal generators for Hierarchical Reinforcement Learning (HRL).^{[HRL0]} Soon afterwards, this was also done with recurrent NNs that learn to generate sequences of subgoals.^{[HRL1-2][PHD][MIR](Sec. 10)} Compare what LeCun later called an "open problem."^{[LEC]}
Compare other NNs that have "worked on command" since April 1990, in particular, for learning selective attention,^{[ATT0-3]} artificial curiosity and self-invented problems,^{[PP][PPa,1,2][AC]} upside-down reinforcement learning^{[UDRL1-2]} and its generalizations.^{[GGP]} Recently, Transformers^{[TR1]} have been all the rage, e.g., generating human-sounding texts.^{[GPT3]} Transformers with "linearized self-attention"^{[TR5-6]} were first published in March 1991.^{[FWP0-1][FWP6][FWP]} These so-called "Fast Weight Programmers" or "Fast Weight Controllers"^{[FWP0-1]} separated storage and control like in traditional computers, but in an end-to-end-differentiable, adaptive, fully neural way (rather than in a hybrid fashion^{[PDA1-2][DNC]}). The "self-attention" in standard Transformers^{[TR1-4]} combines this with a projection and softmax (using attention terminology like the one I introduced in 1993^{[ATT][FWP2][R4]}).
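The formal correspondence between the two can be shown in a few lines. This is a didactic sketch under arbitrary assumptions (dimensions and random data are made up): programming a fast weight matrix with additive outer products of "value" and "key" patterns, then querying it, yields exactly the output of self-attention without softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 6
keys = rng.normal(size=(T, d))      # patterns emitted by the slow net
values = rng.normal(size=(T, d))    # patterns emitted by the slow net
query = rng.normal(size=d)

# Fast Weight Programmer (1991): the slow net rewrites the fast net's
# weight matrix W via additive outer products.
W = np.zeros((d, d))
for k, v in zip(keys, values):
    W += np.outer(v, k)             # program the fast net
out_fwp = W @ query                 # apply the fast net to a query

# Linearized self-attention (Transformer without softmax): same result.
out_attn = sum(values[t] * (keys[t] @ query) for t in range(T))

assert np.allclose(out_fwp, out_attn)
```

The equivalence is just the distributive law: summing outer products before applying the query equals summing key-weighted values afterwards.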
Today's Transformers heavily use unsupervised pre-training^{[UN0-3]} (see next section), another deep learning methodology from our Annus Mirabilis of 1990-1991.^{[MIR][MOST]}
The 1991 fast weight programmers led to the gradient-based meta-learning NNs of 1992,^{[FWPMETA1-9][HO1]} which extended my 1987 diploma thesis.^{[META1]} The thesis introduced algorithms not just for learning but also for meta-learning or learning to learn,^{[META]} to learn better learning algorithms through experience. This became very popular in the 2010s^{[DEC]} when computers were a million times faster. Deep learning is about NNs with many layers of neurons or many subsequent computational stages.^{[MIR]} Unlike shallower ones,^{[DL1-2]} recurrent NNs (but see a 1989 paper^{[MOZ]}) are of arbitrary depth.^{[DL1]} Before the 1990s, however, RNNs failed to learn deep problems in practice.^{[MIR](Sec. 0)} The first NN to overcome this problem at multiple time scales^{[LEC]} was the Neural Sequence Chunker^{[UN0]} or Neural History Compressor.^{[UN1]} It solved "very deep learning" tasks of depth > 1000^{[UN2]} (requiring more than 1,000 subsequent computational stages). There is also a continuous version of the Neural History Compressor.^{[UN3]} (See also recent work on unsupervised NN-based abstraction.^{[OBJ1-5]}) More than a decade after this work,^{[UN1]} a similar method for feedforward NNs was published, called Deep Belief Networks (DBNs).^{[UN4]} Its justification was essentially the same: each higher level tries to reduce the description length (or negative log probability) of the data representation in the level below.^{[HIN][T22][MIR]} The history compressor's knowledge can be collapsed into a single NN using my NN distillation procedure of 1991.^{[UN0-1][MIR]} NN distillation was also republished many years later,^{[DIST2][MIR][HIN][T22]} and is widely used today. Such unsupervised/self-supervised pre-training is heavily used by Transformers.^{[TR1-6]} Transformers with linearized self-attention were also first published^{[FWP0-6]} in the Annus Mirabilis of 1990-1991,^{[MIR][MOST]} together with unsupervised/self-supervised pre-training for deep learning.^{[UN0-3]} See the previous section. Deep learning is hard because of the Fundamental Deep Learning Problem identified and analysed in 1991 by my student Sepp Hochreiter in his diploma thesis, which I had the pleasure to supervise.^{[VAN1]} First he implemented the Neural History Compressor above, but then did much more: he showed that backpropagated error signals in deep or recurrent NNs either shrink rapidly or grow out of bounds. In both cases, learning fails (compare^{[VAN2]}). This analysis led to basic principles of what's now called LSTM (see below).
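The vanishing/exploding gradient analysis can be illustrated numerically. The sketch below uses a toy scalar RNN with made-up constants (an assumption for illustration, not the 1991 formalism): the backpropagated gradient through T steps is a product of T local derivatives, so it shrinks or grows exponentially in T.

```python
import numpy as np

def bptt_gradient(w, T, h0=0.5, linear=False):
    """|d h_T / d h_0| for a scalar RNN h_t = f(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(T):
        pre = w * h
        h = pre if linear else np.tanh(pre)
        # local derivative of this step; the chain rule multiplies them all
        grad *= w if linear else w * (1.0 - h * h)
    return abs(grad)
```

For the linear case the gradient is exactly `|w|**T`: with `w = 0.9` it vanishes (`0.9**100` is about `2.7e-5`), with `w = 1.1` it explodes (`1.1**100` is about `1.4e4`). The `tanh` case decays at least as fast, since each local derivative is further damped by `1 - h**2`.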
The Long Short-Term Memory (LSTM) recurrent neural network^{[LSTM1-6]} overcomes the Fundamental Deep Learning Problem identified by Sepp in his above-mentioned 1991 diploma thesis,^{[VAN1]} which I consider one of the most important documents in the history of machine learning. It also provided essential insights for overcoming the problem, through basic principles (such as constant error flow) of what we called LSTM in a tech report of 1995.^{[LSTM0]} After the main peer-reviewed publication in 1997^{[LSTM1][25y97]} (now the most cited NN article of the 20th century^{[MOST]}) came the first application of LSTM to speech (2004).^{[LSTM10]} 2005 saw the first publication of LSTM with full backpropagation through time and of bi-directional LSTM^{[LSTM3]} (now widely used). Another milestone of 2006 was the training method "Connectionist Temporal Classification" or CTC^{[CTC]} for simultaneous alignment and recognition of sequences. Our team successfully applied CTC-trained LSTM to speech in 2007^{[LSTM4]} (also with hierarchical LSTM stacks^{[LSTM14]}), overcoming the need for hybrids of NNs and traditional approaches such as Hidden Markov Models (HMMs).^{[BW][BRI][BOU][HYB12][T22]} CTC-LSTM also won three ICDAR 2009 Connected Handwriting Competitions (French, Farsi, Arabic). LSTM was soon used for everything that involves sequential data, such as speech^{[LSTM10-11][LSTM4][DL1]} and videos. In 2015, CTC-LSTM dramatically improved Google's speech recognition on Android smartphones.^{[GSR15]} Many other companies adopted this.^{[DL4]} Google's on-device speech recognition of 2019 (now on your phone, not on the server) is still based on LSTM.
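The constant-error-flow principle can be made concrete with a minimal sketch. This is a generic vanilla-LSTM-style cell with a forget gate, not the exact published equations; parameter names and the test weights below are illustrative assumptions. The key point is the additive cell state update: with the forget gate open and the input gate shut, the cell state (and hence backpropagated error) passes through unchanged.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, p):
    """One step of a vanilla-LSTM-style cell with a forget gate."""
    z = np.concatenate([x, h])
    f = sigmoid(p['Wf'] @ z + p['bf'])   # forget gate
    i = sigmoid(p['Wi'] @ z + p['bi'])   # input gate
    o = sigmoid(p['Wo'] @ z + p['bo'])   # output gate
    g = np.tanh(p['Wg'] @ z + p['bg'])   # candidate cell input
    c = f * c + i * g                    # additive update: constant error flow
    h = o * np.tanh(c)                   # gated output
    return h, c
```

For example, with zero weight matrices, a large positive forget bias and a large negative input bias, the cell state is preserved essentially unchanged over many steps, which is exactly the behavior that defeats the vanishing gradient.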
In 1995, we already had an excellent neural probabilistic text model^{[SNT]} (compare Nakamura and Shikano's 1989 word category prediction model^{[NPMa]}). In 2001, we showed that LSTM can learn languages unlearnable by traditional models such as HMMs.^{[LSTM13]} LSTM also powered Facebook's automatic translations (over 4 billion per day by 2017),^{[FB17][DL4]} Apple's Quicktype on roughly 1 billion iPhones,^{[DL4]} the voice of Amazon's Alexa,^{[DL4]} image caption generation^{[DL4]} & automatic email answering,^{[DL4]} etc. Business Week called LSTM "arguably the most commercial AI achievement."^{[AV1]} Numerous papers have "LSTM" in their title.^{[DEC]}
Our Highway Network^{[HW1]} (May 2015) was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers). Microsoft's ResNet^{[HW2]} (which won the ImageNet 2015 contest) is a version thereof. The earlier Highway Nets perform roughly as well as their ResNet versions on ImageNet.^{[HW3]} Variants of highway gates are also used for certain algorithmic tasks where the pure residual layers do not work as well.^{[NDR]}
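The gate relationship can be sketched schematically. This is a simplified illustration, not the published architecture (the published Highway Net uses learned weight matrices throughout and typically couples the carry gate as C = 1 - T): a highway layer computes y = H(x)·T(x) + x·C(x), and fixing both gates open, T = C = 1, yields the residual layer y = H(x) + x.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(x, Wh, Wt, bt, Wc, bc):
    """y = H(x)*T(x) + x*C(x), with transform gate T and carry gate C."""
    h = np.tanh(Wh @ x)             # nonlinear transform H(x)
    t = sigmoid(Wt @ x + bt)        # transform gate T(x)
    c = sigmoid(Wc @ x + bc)        # carry gate C(x)
    return h * t + x * c

def residual_layer(x, Wh):
    """ResNet layer: the special case with gates fixed open (T = C = 1)."""
    return np.tanh(Wh @ x) + x
```

Driving both gate biases strongly positive saturates the sigmoids at 1, and the highway layer numerically coincides with the residual layer.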
Deep learning is all about NN depth.^{[DL1]} LSTMs brought essentially unlimited depth to supervised recurrent NNs; the LSTM-inspired Highway Nets later brought it to feedforward NNs. LSTM has become the most cited NN of the 20th century; the Highway Net version called ResNet, the most cited NN of the 21st.^{[MOST]} (Citations, however, are a highly questionable measure of true impact.^{[NAT1]}) Reinforcement Learning (RL)^{[KAE96][BER96][TD3][UNI][GM3][LSTMPG]} is about agents that learn to maximize expected cumulative reward signals.^{[DL1]} Many problems of AI can be formulated in the general RL framework.^{[UNI]} Relevant techniques with deep roots include Monte Carlo (tree) search (MC, 1949),^{[MOC1-5]} dynamic programming (DP, 1953),^{[BEL53]} artificial evolution (1954),^{[EVO1-7]([TUR1],unpublished)} alpha-beta-pruning (1959),^{[S59]} control theory and system identification (1950s),^{[KAL59][GLA85]} stochastic gradient descent (SGD, 1951),^{[STO51-52]} and universal search techniques (1973).^{[AIT7]} NNs entered RL through system identification,^{[WER87-89][MUN87][NGU89]} DP and its online variant called Temporal Differences (TD),^{[TD1-3]} artificial evolution,^{[EVONN1-3]} and policy gradients.^{[GD1][PG1-3]} Many additional references on this can be found in Sec. 6 of the 2015 survey.^{[DL1]}
When there is a Markovian interface^{[PLAN3]} to the environment, RL with DP/TD/MC-based FNNs can be very successful, as shown in 1994^{[TD2]} (master-level backgammon player) and the 2010s^{[DM1-2a]} (superhuman players for Go, chess, and other games). For harder cases where success requires memories of the history of previous inputs, our combinations of RL algorithms and LSTM^{[LSTM-RL][RPG]} have become standard, in particular, our LSTM trained by policy gradients (2007).^{[RPG07][RPG][LSTMPG]}
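The policy gradient principle behind such systems can be sketched on a toy problem. The following is a minimal illustration under simplifying assumptions (a stateless softmax policy on a two-armed bandit; the memory-based LSTM part is omitted entirely): sample an action from the policy, then nudge the parameters along the gradient of log π(a) scaled by the observed reward.

```python
import numpy as np

def reinforce_bandit(p_reward=(0.2, 0.8), steps=3000, lr=0.1, seed=0):
    """Minimal REINFORCE: ascend reward * grad log pi(a | theta)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)                           # action preferences
    for _ in range(steps):
        probs = np.exp(theta) / np.exp(theta).sum()
        a = rng.choice(2, p=probs)                # sample an action
        r = float(rng.random() < p_reward[a])     # stochastic binary reward
        grad = -probs
        grad[a] += 1.0                            # d log pi(a) / d theta
        theta += lr * r * grad                    # policy gradient ascent
    return np.exp(theta) / np.exp(theta).sum()    # final action probabilities
```

After training, the policy should strongly prefer the arm with the higher reward probability; replacing the stateless preferences with a recurrent network is what allows the same principle to exploit memories of past inputs.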
For example, in 2018, a PG-trained LSTM was the core of OpenAI's famous Dactyl which learned to control a dextrous robot hand without a teacher.^{[OAI1][OAI1a]} In 2019, DeepMind beat a pro player in the game of Starcraft, which is theoretically harder than Chess or Go^{[DM2]} in many ways, using Alphastar whose brain has a deep LSTM core trained by PG.^{[DM3]} A PG-trained LSTM was also the core of OpenAI Five which learned to defeat human experts in the Dota 2 video game (2018).^{[OAI2]} Bill Gates called this a "huge milestone in advancing artificial intelligence".^{[OAI2a][MIR](Sec. 4)[LSTMPG]} What about commonsense reasoning^{[MAR15]} and learning to think?^{[PLAN4-5]} How can NNs learn to plan at multiple levels of abstraction and multiple time scales?^{[LEC]} We published answers to these questions in 1990-91: self-supervised neural history compressors^{[UN][UN0-3]} learn to represent percepts at multiple levels of abstraction and multiple time scales (see above), while end-to-end differentiable NN-based subgoal generators^{[HRL3][MIR](Sec. 10)} learn hierarchical action plans through gradient descent (see above). More sophisticated ways of learning to think in abstract ways were published in 1997^{[AC97][AC99][AC02]} and 2015-18.^{[PLAN4-5]} The automatic theatre built in the 1st century^{[SHA7a][RAU1]} by Heron of Alexandria was perhaps the first machine with a stored program.^{[BAN][KOE1]} It used pins on
In 1623, Wilhelm Schickard built one of the first automatic calculating machines. In 1673, the already mentioned Gottfried Wilhelm Leibniz (called "the smartest man who ever lived"^{[SMO13]}) designed the first machine (the step reckoner) that could perform all four arithmetic operations, and the first with a memory.^{[BL16]} He also described principles of binary computers governed by punch cards (1679),^{[L79][L03][LA14][HO66]} and published the chain rule^{[LEI07-10]} (see above), an essential ingredient of deep learning and modern AI.
Leonardo Torres y Quevedo (mentioned in the introduction) built the first working chess end game player; decades later, AI pioneer Norbert Wiener played against it at the 1951 Paris AI conference.^{[AI51][BRO21][BRU4]} Konrad Zuse created the first working programmable general-purpose computer (1941). The corresponding patent application of 1936^{[ZU36-38][RO98][ZUS21]} described digital circuits, predating Claude Shannon's 1937 thesis on digital circuit design.^{[SHA37]} Unlike Babbage, Zuse used Leibniz' principles of binary computation (1679),^{[L79][LA14][HO66][L03]} which greatly simplified the hardware.^{[LEI21,a,b]} Compare the theoretical work of Church^{[CHU]} (1935), Turing^{[TUR]} (1936), and Post^{[POS]} (1936). (Zuse's machine lacked an explicit conditional jump instruction.^{[RO98]})
John Atanasoff built an early tube-based special-purpose computer (he is sometimes called the "father of tube-based computing"^{[NASC6a]}); the transistor principle had already been patented by Julius Edgar Lilienfeld in 1925.^{[LIL1-2]} The electronic Colossus (1943-45) was used to break the Nazi code.^{[NASC6]} The first general-purpose programmable machine built by someone other than Zuse (1941)^{[RO98]} was Howard Aiken's decimal MARK I (US, 1944). Compare the 1948 upgrade of ENIAC, which was reprogrammed by entering numerical instruction codes into read-only memory.^{[HAI14b]} In 1949, Werner Jacobi filed a patent for an integrated circuit (IC) with several transistors on a common substrate (granted in 1952).^{[IC49-14]} In 1959, Robert Noyce presented a monolithic IC.^{[IC14]} ICs/GPUs of today (2022) contain many billions of transistors (almost all of them of Lilienfeld's 1925 FET type^{[LIL1-2]}). Compare Moore's Law, which states that the number of transistors^{[LIL1-2]} per microchip doubles roughly every 18 months; some expect that affordable hardware will eventually match the raw computational power of all human brains combined.^{[RAW]} According to Bremermann (1982),^{[BRE]} physics imposes ultimate limits on such growth, as previously noted back in 2004.^{[OOPS2][ZUS21]} (In some hardware, connections are actually light beams.^{[DL2]}) NN-specific hardware architectures are expected to become even much more important than they are today.^{[DL2]} Note also Gödel's fundamental results, which concern any type of computation-based AI.^{[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]}
He combined Georg Cantor's diagonalization trick^{[CAN]} with the foundational work by Gottlob Frege^{[FRE]} (who introduced the first formal language in 1879), Thoralf Skolem^{[SKO23]} (who introduced primitive recursive functions in 1923) and Jacques Herbrand.^{[GOD86]} Much of this goes back to Gottfried Wilhelm Leibniz^{[L86][WI48]} (see above), whose formal Algebra of Thought (1686) was deductively equivalent^{[LE18]} to the later Boolean Algebra of 1847.^{[BOO]} In 1936, Alan M. Turing introduced the Turing Machine.^{[TUR]} He rederived the above-mentioned result.^{[CHU][TUR][HIN][GOD21,21a][TUR21][LEI21,21a]} In the same year of 1936, Emil Post published yet another independent universal model of computing.^{[POS]} In 1941, Konrad Zuse completed the world's first working programmable general-purpose computer^{[ZU36-38][RO98][ZUS21]} and later designed the first high-level programming language^{[BAU][KNU]} (1945,^{[KNU]} published in 1948^{[ZU48]}). Compare Newell & Simon's later work on theorem proving (1956).^{[NS56]} In 1964, Ray Solomonoff combined Bayesian (actually Laplacian^{[STI83-85]}) probabilistic reasoning and theoretical computer science^{[GOD][CHU][TUR][POS]} to obtain a theoretically optimal way of learning to predict future data from past observations.^{[AIT1][AIT10]} With Andrej Kolmogorov, he founded the theory of Kolmogorov complexity or algorithmic information theory (AIT),^{[AIT1-22]} going beyond traditional information theory.^{[SHA48][KUL]} My own work addressed practical, computable variants of this concept,^{[AIT7][AIT5][AIT12-13][AIT16-17]} as well as applications to NNs.^{[KO2][CO1-3]}
In the early 2000s, Marcus Hutter (while working under my Swiss National Science Foundation grant^{[UNI]}) augmented Solomonoff's universal predictor^{[AIT1][AIT10]} such that it could act optimally in arbitrary unknown environments.^{[AIT20,22]} He also derived the asymptotically fastest algorithm for all well-defined computational problems.^{[AIT21]} Long ago I noticed a beautiful pattern of exponential acceleration in history,^{[OMG]} which I have presented in many talks since then, and which also made it into Sibylle Berg's award-winning book "GRM: Brainfuck."^{[OMG2]} The most important historical events seem to converge at ever shorter intervals: just a few decades or centuries or at most millennia.^{[OMG1]} (Compare the programmable automatic theatre of Heron of Alexandria^{[RAU1]} in the 1st century.) The telephone (e.g., Meucci 1857, Reis 1860, Bell 1876)^{[NASC3]} changed the world, as did the Haber-Bosch process for creating artificial fertilizer, without which the world could feed at most 4 billion people.^{[HAB1-2]} The first truly self-driving cars appeared in the 1980s (by 1994, robot cars were driving in highway traffic, up to 180 km/h).^{[AUT]} Back then, I worked on my 1987 diploma thesis,^{[META1]} which introduced algorithms not just for learning but also for meta-learning or learning to learn,^{[META]} to learn better learning algorithms through experience (now a very popular topic^{[DEC]}). And then came our Miraculous Year 1990-91^{[MIR]} at TU Munich, the root of today's most cited NNs^{[MOST]} and of modern deep learning: artificial curiosity and generative adversarial NNs for agents that invent their own problems (see above),^{[AC90-AC20][PP-PP2][SA17]} Transformers with linearized self-attention (see above),^{[FWP0-6][TR5-6]} distilling teacher NNs into student NNs (see above),^{[UN][UN0-3]} learning at multiple levels of abstraction and multiple time scales (see above),^{[HRL0-2][LEC]} and other exciting stuff. Much of this has become very popular, and improved the lives of billions of people.^{[DL4][DEC][MOST]} (Take all of this with a grain of salt, though.^{[OMG1]})
Curious AIs of the type studied in my lab for decades^{[AC][AC90,AC90b]} will quickly improve themselves, restricted only by the fundamental limits of computability and physics. I have frequently commented on this;^{[ACM16][FA15][SP16][SA17]} in the long run, those who make more and bigger AIs will shape the future, while those who don't won't have an impact.^{[ACM16][FA15][SP16]}
Some of the material above was taken from previous AI Blog posts.^{[MIR] [DEC] [GOD21] [ZUS21] [LEI21] [AUT] [HAB2] [ARC06] [AC] [ATT] [DAN] [DAN1] [DL4] [GPUCNN5,8] [DLC] [FDL] [FWP] [LEC] [META] [MLP2] [MOST] [PLAN] [UN] [LSTMPG] [BP4] [DL6a] [HIN] [T22]}
See also my publication page and my
arXiv page. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
555+ References (and many more in the survey^{[DL1]})
In 2022, we are celebrating the following works from a quarter-century ago.
1. Journal paper on Long Short-Term Memory, the most cited NN of the 20th century (and basis of the most cited NN of the 21st).
2. Paper on all possible metaverses.
3. Implementing artificial curiosity and creativity through generative adversarial agents that learn to design abstract, interesting computational experiments.
4. Journal paper on meta-reinforcement learning.
5. Journal paper on hierarchical Q-learning.
8. Journal paper on Low-Complexity Art, the Minimal Art of the Information Age.
J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity.
PDF.
The first paper on online planning with reinforcement learning recurrent neural networks (NNs) (more) and on generative adversarial networks
(more).
PDF.
More.
PDF.
PDF.
general system
systems with intrinsic motivation,^{[AC90-AC95]} the system also
See later publications.^{[AC99][AC02]}
PDF.
PDF.
PDF. (More on
artificial scientists and artists.)
IEEE link.
PDF.
With a brief summary of the generative adversarial neural networks of 1990^{[AC90,90b][AC20]}
(more).
Preprint arXiv/1906.04493.
Link.
[AIB] J. Schmidhuber. AI Blog.
Includes variants of chapters of the AI Book.
H. Bruderer^{[BRU4]} calls that the first conference on AI.
Blog of Werner Vogels, CTO of Amazon (Nov 2016):
PDF.
First publication of what was later sometimes called the Hopfield network^{[AMH2]} or Amari-Hopfield Network,^{[AMH3]} based on the (uncited) Lenz-Ising recurrent architecture.^{[L20][I25][T22]}
Mentions the recurrent Ising model^{[L20][I25]} on which the (uncited) Amari network^{[AMH1,2]} is based.
The Hopfield network or Amari-Hopfield Network was first published in 1972 by Amari.^{[AMH1]} [AMH2] did not cite [AMH1].
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. Schmidhuber
Transformers with linearized self-attention (1991-93).^{[FWP]} Today, both types are very popular.
PDF.
PDF.
More.
PS. (PDF.)
H. Larochelle, G. E. Hinton. Learning to combine foveal glimpses with a third-order Boltzmann machine. NIPS 2010. This work is very similar to [ATT0-2] which the authors did not cite.
In fact, Hinton was the reviewer of a 1990 paper^{[ATT2]}
his own work:^{[ATT3]}
attentional component (the fixation controller)." See [MIR](Sec. 9)[R4].
arXiv/1409.0473, 2014-16.
This work on soft "attention" did not cite Schmidhuber's much earlier original work of 1991-1993 on soft attention and Transformers with linearized self-attention.^{[FWP,FWP0-2,6][ATT]}
J. Schmidhuber (AI Blog, 2005). Highlights of robot car history. Around
Bloomberg, May 15, 2018.
PDF.
HTML.
PDF.
by Sherrington & Kirkpatrick^{[SK75]} & Glauber^{[G63]} nor the first working algorithms for deep learning of internal representations (Ivakhnenko & Lapa, 1965)^{[DEEP1-2][HIN]} nor
Amari's work (1967-68)^{[GD1-2]} on learning internal representations in deep nets through stochastic gradient descent.
Even later surveys by the authors^{[S20][DLC]} failed to cite the prior art.^{[T22]}
formal Algebra of Thought (1686)^{[L86][WI48]} was
deductively equivalent^{[LE18]} to the much later
Precursor of modern backpropagation.^{[BP1-5]}
PDF.
Link.
PDF.
First application of backpropagation^{[BP1]} to NNs (concretizing thoughts in Werbos' 1974 thesis).
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020).
Who invented backpropagation?
More.^{[DL2]}
Link.
IEEE Spectrum, 2021. Link.
English version: [CNN1+]. More in Scholarpedia.
Link.
[CNN1a] A. Waibel. Phoneme Recognition Using Time-Delay Neural Networks. Meeting of IEICE, Tokyo, Japan, 1987. First application of backpropagation^{[BP1-5]} and weight-sharing
PDF.
Spatial Averaging.^{[CNN1]}
Spatial Averaging.^{[CNN1]}
PDF.
PDF.
PDF.
Inverse, 2016. Link.
Since November 2021: Comments on version 1 of the report^{[T22]}
in the Connectionists Mailing List, perhaps the oldest mailing list on artificial neural networks. Link to the archive.
PDF.
PDF.
Beijing, 2014. Preprint arXiv:1402.3511 [cs.NE].
J. Schmidhuber (AI Blog, 2021).
10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named
1st superhuman result in 2011.^{[DAN1]} Now everybody is using this approach.
J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition.
the artificial neural network called DanNet
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, updated 2021, 2022). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The
1991 NN distillation procedure,^{[UN0-2][MIR](Sec. 2)}
More.
Deep Learning.
HTML.
A "survey" of deep learning that does not mention the pioneering works of deep learning [T22].
[DL3a] Y. Bengio, Y. LeCun, G. Hinton (2021). Turing Lecture: Deep Learning for AI. Communications of the ACM, July 2021. HTML.
Local copy (HTML only).
Another "survey" of deep learning that does not mention the pioneering works of deep learning [T22].
[DL4] J. Schmidhuber (AI Blog, 2017).
Our impact on the world's most valuable public companies: Apple, Google, Microsoft, Facebook, Amazon... By
greatly improved (CTC-based)
on-device speech recognition
(on the phone, not the server)
LSTM.
PDF.
J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). The deep reinforcement learning & neuroevolution developed in Schmidhuber's lab solved problems of depth 1000 and more.^{[DL6]} Soon after its publication, everybody started talking about "deep learning." Causality or correlation?
Web site deeplearning.net of Y. Bengio's MILA (2015, retrieved May 2020; compare the version in the
Internet Archive),
referring to Hinton's^{[UN4]} and Bengio's^{[UN5]}
unsupervised pre-training for deep NNs^{[UN4]} (2006) although
this type of deep learning dates back to Schmidhuber's work of 1991.^{[UN1-2][UN]}
[DLC] J. Schmidhuber (AI Blog, June 2015).
Critique of Paper by self-proclaimed^{[DLC2]} "Deep Learning Conspiracy" (Nature 521 p 436).
it). More on this under [T22].
J. Schmidhuber (AI Blog, 2022).
Annotated History of Modern AI and Deep Learning. Technical Report IDSIA-22-22, IDSIA, Lugano, Switzerland, 2022.
Preprint arXiv:2212.11279.
Tweet of 2022.
arxiv:1312.5602.
Link.
the first sentence of the abstract of the earlier tech report version^{[DM1]}
was created earlier by Jan Koutnik et al. in Schmidhuber's lab.^{[CO2]}
and PhDs in computer science. More.
Alphastar has a "deep LSTM core."
Hochreiter et al.'s first successful application [HO07] of deep learning to protein folding (2007).
Preprint arXiv:2112.10752, LMU Munich, 2021.
neural networks learning to control dynamic external memories.^{[PDA1-2][FWP0-1]}
arXiv:1808.03578, 2018.
arXiv:1808.03578, 2018.
Conf. on Neural Networks, Vol. 2, 2004, pp. 985-990. This paper does not mention that the "ELM" concept goes back to Rosenblatt's work in the 1950s.^{[R62][T22]}
This overview does not mention that the "ELM" concept goes back to Rosenblatt's work in the 1950s.^{[R62][T22]}
Link.
used LSTM
over 4 billion automatic translations per day (The Verge, August 4, 2017);
Facebook blog by J.M. Pino, A. Sidorov, N.F. Ayan (August 3, 2017)
[FDL] J. Schmidhuber (AI Blog, 2013). My First Deep Learning System of 1991 + Deep Learning Timeline 1960-2013.
PDF.
J. Schmidhuber (AI Blog, 26 March 2021, updated 2022).
alternative^{[FWP0-1]} to recurrent NNs.
the fast weights^{[FAST,FASTa,b]} of
Such Fast Weight Programmers^{[FWP0-6,FWPMETA1-8]} can learn to memorize past data, e.g.,
by computing fast weight changes through additive outer products of self-invented activation patterns^{[FWP0-1]}
(now often called keys and values for self-attention^{[TR1-6]}).
The similar Transformers^{[TR1-2]} combine this with projections
Transformers with linearized self-attention^{[TR5-6]}
In 1993, he introduced
the attention terminology^{[FWP2]} now used
in this context,^{[ATT]} and
RNNs that program themselves.
See tweet of 2022.
PDF.
normalization).^{[FWP]}
PDF.
HTML.
Pictures (German).
See tweet of 2022 for 30-year anniversary.
PDF.
Preprint: arXiv:1811.12143. PDF.
PDF. Very similar to [FWP0-2], in both motivation [FWP2] and execution.
This work on "attention" did not cite Schmidhuber's much earlier original work of 1991-1993 on soft attention and Transformers with linearized self-attention.^{[FWP,FWP0-2,6][ATT]}
Preprint: arXiv:2003.08165.
PDF.
HTML overview.
Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021. Preprint: arXiv:2102.11174.
Preprint: arXiv:2106.06295 (June 2021).
PDF.
An introspective network that can learn to run its own weight change algorithm. In Proc. of the Intl. Conf. on Artificial Neural Networks,
J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF.
can be found here.
Preprint arXiv:2012.14905 [cs.LG], 2020.
Report arXiv:2011.07831 [cs.AI], 2020.
Preprint: arXiv:2202.05780.
PDF.
Probably the first paper on using stochastic gradient descent^{[STO51-52]}
reverse mode of automatic differentiation or backpropagation^{[BP1]}).
OCR-based PDF scan of pages 94-135 (see pages 119-120).
Implementation of Amari's 1967 stochastic gradient descent method for multilayer perceptrons.^{[GD1]} (S. Amari, personal communication, 2021.)
Preprint arXiv/2207.01570, 4 July 2022 (submitted in May 2022).
arXiv:cs/0309048 (2003).
More.
PDF.
Cognitive Computation 1(2):177-193, 2009. PDF.
More.
Google Research Blog, Sep 2015, see also
Aug 2015 Google's speech recognition based on CTC and LSTM.
Alphr Technology, Jul 2015, or 9to5google, Jul 2015
WIRED, Sep 2016,
siliconANGLE, Sep 2016
Blog post, Internet Archive, 2010.
A blog post describing basic ideas^{[AC][AC90,AC90b][AC20]} of GANs.
A description of GANs that does not cite Schmidhuber's original GAN principle of 1990^{[AC][AC90,AC90b][AC20][R2][T22]} (also containing wrong claims about Schmidhuber's adversarial NNs for Predictability Minimization^{[PM0-2][AC20][T22]}).
Link.
This was number 1 on Hacker News.
Frankfurter Allgemeine Zeitung, 16/6/2021.
Preprint arXiv/2005.14165.
for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011. PDF. ArXiv preprint.
win four important computer vision competitions 2011-2012 before others won any
PDF.
HTML overview.
competitor.^{[DAN1]} This led to massive interest from industry.
[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More.
PDF.
DanNet,^{[DAN,DAN1][R6]}
to win computer vision contests in 2011^{[GPUCNN2-3,5]} (AlexNet and VGG Net^{[GPUCNN9]} followed in 2012-2014). [GPUCNN4] emphasizes benefits of Fukushima's ReLUs (1969)^{[RELU1]} and dropout (a variant of Hanson 1990 stochastic delta rule)^{[Drop1-4]} but neither cites the original work^{[RELU1][Drop1]} nor the basic CNN architecture (Fukushima, 1979).^{[CNN1]}
J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet was the first CNN to win one, and won 4 of them in a row before the similar AlexNet/VGG Net and the Resnet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision.
PDF.
PDF.
[GPUCNN8] J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet).
first deep learner to win a medical imaging contest (2012). Link.
J. Schmidhuber (Blog, 2000). Most influential persons of the 20th century (according to Nature, 1999). The Haber-Bosch process has often been called the most important invention of the 20th century^{[HAB1]}
PDF.
PDF.
Bengio claimed^{[YB20]}
Schmidhuber's publications on exactly this topic
date back to 1991-93.^{[UN0-2][UN]}
An unsupervised learning algorithm related to Schmidhuber's supervised Neural Heat Exchanger.^{[NHE]}
[HIN] J. Schmidhuber (AI Blog, 2020). Critique of Honda Prize for Dr. Hinton. Science must not allow corporate PR to distort the academic record. See also [T22].
previous related work.^{[BB2][NAN1-4][NHE][MIR](Sec. 15, Sec. 17)[FWPMETA6]}
PDF.
what Y. LeCun called an "open problem" in 2022.^{[LEC]}
North-Holland, 1991. PDF. Extending TR FKI-129-90, TUM, 1990.
PDF.
This work did not cite Schmidhuber's gradient-based subgoal generators for hierarchical reinforcement learning (1990).^{[HRL0-2]}
PDF.
Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS 2015. The
LSTM with forget gates^{[LSTM2]} for RNNs.) Resnets^{[HW2]} are a version of this where the gates are always open: g(x)=t(x)=const=1.
Highway Nets perform roughly as well as ResNets^{[HW2]} on ImageNet.^{[HW3]} Variants of highway gates are also used for certain algorithmic tasks, where the simpler residual layers do not work as well.^{[NDR]}
More.
Link.
arXiv:1512.03385
(Dec 2015). Residual nets are a version of Highway Nets^{[HW1]}
More.
arxiv:1612.07771 (2016). Also at ICLR 2017.
This work did not cite the earlier LSTM^{[LSTM0-6]} trained by Connectionist Temporal Classification (CTC, 2006).^{[CTC]} CTC-LSTM was successfully applied to speech in 2007^{[LSTM4]} (also with hierarchical LSTM stacks^{[LSTM14]}) and became the first superior end-to-end neural speech recogniser that outperformed the
state of the art, dramatically improving Google's speech recognition.^{[GSR][GSR15][DL4]}
Markov models (HMMs).^{[BW][BRI][BOU]} [HYB12] still used the old hybrid approach and did not compare it to CTC-LSTM. Later, however, Hinton switched to LSTM, too.^{[LSTM8]}
Ernst Ising and Wilhelm Lenz in the 1920s.^{[L20][I25][K41][W45][T22]} It settles into an equilibrium state in response to input conditions, and is the foundation of the first well-known learning RNNs.^{[AMH1-2]}
Who Invented the IC?
Preprint arXiv:1704.04760
PDF.
PDF.
Mathematischen Schriften, ed. C. Gerhardt, Berlin 1879, vol.7, p.223. English link.
Link.
arXiv:1607.06450, 2016.
[LEC] J. Schmidhuber (AI Blog, 2022). LeCun's 2022 paper on autonomous machine intelligence rehashes but does not cite essential work of 1990-2015. Years
See tweet1.
LeCun also listed the "5 best ideas 2012-2022" without mentioning that
See tweet2.
[LEI21] J. Schmidhuber (AI Blog, 2021). 375th birthday of Leibniz, founder of computer science.
Frankfurter Allgemeine Zeitung (FAZ), 17/5/2021. FAZ online:
19/5/2021.
[LEI21b] J. Schmidhuber (AI Blog, 2021). 375. Geburtstag des Herrn Leibniz, dem Vater der Informatik.
PDF.
[LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF.
Based on [LSTM0]. More.
PDF.
PDF.
PDF.
PDF.
PDF.
PDF.
PDF.
PDF.
PDF.
PDF.
PDF.
Preprint: arxiv:1506.07452.
PDF.
J. Schmidhuber (AI Blog, Dec 2020). 10-year anniversary of our journal paper on deep reinforcement learning with policy gradients for LSTM (2007-2010). Recent
PDF.
are actually a variant of the vanilla LSTM architecture^{[LSTM2]} (2000) which the authors did not cite
although this work^{[LSTM2]} was the one that introduced gated recurrent units.
Furthermore, Schmidhuber's team automatically evolved lots of additional LSTM variants and topologies already in 2009^{[LSTM7]} without changing the name of the basic method.
learn to count^{[LSTMGRU2]} nor learn simple non-regular
languages;^{[LSTMGRU2]} they
according to Google Brain.^{[LSTMGRU3]})
Preprint arXiv:1805.04908.
Architectures. Preprint arXiv:1703.03906
A misleading "history of deep learning" goes more or less like this: "In 1969, Minsky & Papert^{[M69]}
researchers took a fresh look at the problem in the 1980s."^{[S20]} However, the 1969 book^{[M69]} addressed a "problem" of Gauss & Legendre's shallow learning (circa 1800)^{[DL1-2]} that had already been solved 4 years prior by Ivakhnenko & Lapa's popular deep learning method,^{[DEEP1-2][DL2]}
and then also by Amari's SGD for MLPs.^{[GD1-2]}
Minsky was apparently unaware of this and failed to correct it later.^{[HIN](Sec. I)[T22](Sec. XIII)}
J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of
Searchable PDF scan (created by OCRmypdf which uses
LSTM).
HTML.
better GP methods through Meta-Evolution. More.
[MIR] J. Schmidhuber (AI Blog, Oct 2019, updated 2021, 2022). Deep Learning: Our Miraculous Year 1990-1991. Preprint
arXiv:2005.05744, 2020. The
Computation 22(12): 3207-3220, 2010. ArXiv Preprint.
(AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training.
By 2010, when compute was 100 times more expensive than today, both the feedforward NNs^{[MLP1]}
J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in Schmidhuber's labs at TU Munich and IDSIA. (1) Long Short-Term Memory (LSTM), (2) ResNet (which is the earlier Highway Net with open gates), (3) AlexNet and VGG Net (both building on the similar earlier DanNet: the first deep convolutional NN to win
image recognition competitions),
Adversarial Artificial Curiosity), and (5) variants of Transformers (Transformers with linearized self-attention are formally equivalent to the much earlier Fast Weight Programmers).
Annus Mirabilis of 1990-1991.^{[MIR]}
PDF.
PDF.
Preprint arXiv:1608.05343, 2016.
Preprint arXiv:1611.01578 (PDF), 2017.
Compare the earlier Neural Architecture Search of Bayer et al. (2009) for LSTM-like topologies.^{[LSTM7]}
[NASC1] J. Schmidhuber. First Pow(d)ered flight / plane truth. Correspondence, Nature, 421 p 689, Feb 2003.
[NASC3] J. Schmidhuber. The last inventor of the telephone. Letter, Science, 319, no. 5871, p. 1759, March 2008.
Correspondence, Nature, vol 483, p 541, March 2012, doi:10.1038/483541b.
Letter, Science, vol 336, p 1639, June 2012.
See also comment on response by A. Hodges (DOI:10.1126/science.336.6089.1639-a)
[NASC6] J. Schmidhuber. Colossus was the first electronic digital computer. Correspondence, Nature, 441 p 25, May 2006.
[NASC6a] J. Schmidhuber. Comment on "Biography: The ABC of computing" by J. Gilbey, Nature 468 p 760-761 (2010). Link.
[NASC7] J. Schmidhuber. Turing's impact. Correspondence, Nature, 429 p 501, June 2004
[NASC8] J. Schmidhuber. Prototype resilient, self-modeling robots. Correspondence, Science, 316, no. 5825 p 688, May 2007.
[NASC9] J. Schmidhuber. Comparing the legacies of Gauss, Pasteur, Darwin. Correspondence, Nature, vol 452, p 530, April 2008.
HTML.
The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization. Proc. ICLR 2022. Preprint arXiv/2110.07732.
Link.
excellent 1995 neural probabilistic text model.^{[SNT]} See also Nakamura and Shikano's 1989 word category prediction model.^{[NPMa]}
Compare Konrad Zuse's much earlier 1948 work on
theorem proving^{[ZU48]}
the first high-level programming language.^{[BAU][KNU]}
NY Times article
Learning Dexterous In-Hand Manipulation. Preprint arXiv:1808.00177 (PDF).
arxiv:1912.06680.
An LSTM composes 84% of the model's total parameter count.
2018. An LSTM with 84% of the model's total parameter count was the core of OpenAI Five.
Link.
J. Schmidhuber (Blog, 2006).
Is History Converging? Again?
history's exponential acceleration since the Big Bang.^{[OMG]}
Preprint arXiv/1606.06724.
Preprint arXiv/1708.03498.
Preprint arXiv/1802.10353.
Preprint arXiv/2010.03635.
Preprint arXiv/2011.12930.
PDF.
HTML.
HTML overview.
OOPS source code in crystalline format.
PDF.
HTML.
Link.
J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, and
the GAN principle
Based on TR FKI-126-90 (1990).^{[AC90]}
More.
PDF.
Partially based on TR FKI-126-90 (1990).^{[AC90]}
Report arXiv:1210.0118 [cs.AI], 2015.
One Big Net For Everything. Preprint arXiv:1802.08864 [cs.AI], Feb 2018.
Preprint: arXiv:1809.01999.
Github: World Models.
minimization. TR CU-CS-565-91, Univ. Colorado at Boulder, 1991. PDF.
More.
1991. PDF.
More.
PDF. More.
Link.
arXiv:1112.5309 [cs.AI]
PDF.
First Experiments with PowerPlay.
arXiv:1210.8385 [cs.AI].
[R1] Reddit/ML, 2019. Hinton, LeCun, Bengio receive ACM Turing Award. This announcement contains more comments about Schmidhuber than about any of the awardees.
[R2] Reddit/ML, 2019. J. Schmidhuber really had GANs in 1990.
[R3] Reddit/ML, 2019. NeurIPS 2019 Bengio Schmidhuber Meta-Learning Fiasco.
in 1987^{[META1][META]} long before Bengio
[R4] Reddit/ML, 2019. Five major deep learning papers by G. Hinton did not cite similar earlier work by J. Schmidhuber.
[R5] Reddit/ML, 2019. The 1997 LSTM paper by Hochreiter & Schmidhuber has become the most cited deep learning research paper of the 20th century.
[R6] Reddit/ML, 2019. DanNet, the CUDA CNN of Dan Ciresan in J. Schmidhuber's team, won 4 image recognition challenges prior to AlexNet.
[R7] Reddit/ML, 2019. J. Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970.
[R8] Reddit/ML, 2019. J. Schmidhuber on Alexey Ivakhnenko, godfather of deep learning 1965.
[R9] Reddit/ML, 2019. We
[R11] Reddit/ML, 2020. Schmidhuber: Critique of Honda Prize for Dr. Hinton
[R12] Reddit/ML, 2020. J. Schmidhuber: Critique of Turing Award for Drs. Bengio & Hinton & LeCun
[R15] Reddit/ML, 2021. J. Schmidhuber's work on fast weights from 1991 is similar to linearized variants of Transformers
Although these MLPs did not yet have deep learning, because only the last layer learned,^{[DL1]}
Rosenblatt basically had what much later was rebranded as Extreme Learning Machines (ELMs) without proper attribution.^{[ELM1-2][CONN21][T22]}
J. Schmidhuber (AI Blog, 2001). Raw Computing Power.
Preprint arXiv/1311.2524, Nov 2013.
Preprint arXiv/1703.06870, 2017.
PDF.
The first paper on policy gradients for LSTM. This approach has become very important in reinforcement learning.^{[LSTMPG]}
This experimental analysis of backpropagation did not cite the origin of the method,^{[BP1-5]} also known as the reverse mode of automatic differentiation.
the first working algorithms for deep learning of internal representations (Ivakhnenko & Lapa, 1965)^{[DEEP1-2][HIN]} as well as
Amari's work (1967-68)^{[GD1-2]} on learning internal representations in deep nets through stochastic gradient descent.
Even later surveys by the authors^{[DL3,3a]} failed to cite the prior art.^{[T22]}
Link.
in the 1960s-70s, especially outside of the Anglosphere.^{[DEEP1-2][GD1-3][CNN1][DL1-2][T22]}
The Past, Present and Future of Artificial Intelligence.
Link.
PDF.
Much later this was called a probabilistic language model.^{[T22]}
PDF.
Link.
ACM's justification of the 2018 A.M. Turing Award (announced in 2019). WWW link.
Local copy 1 (HTML only).
Local copy 2 (HTML only).
[T22] debunks this justification.
[T20a] J. Schmidhuber (AI Blog, 25 June 2020). Critique of 2018 Turing Award for Drs. Bengio & Hinton & LeCun. A precursor of [T22].
[T22] J. Schmidhuber (AI Blog, 2022).
Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21, IDSIA, Lugano, Switzerland, 2022.
Debunking [T19] and [DL3a].
the 1991 publication on what's now called "Transformers with linearized self-attention."^{[FWP0-6][TR5-6]}
attention terminology in 1993.^{[ATT][FWP2][R4]}
See tweet of 2022 for 30-year anniversary.
Link.
[TUR21] J. Schmidhuber (AI Blog, Sep 2021). Turing Oversold. It's not Turing's fault, though.
The Turing Test.
YouTube video, 2022.
Preprint arXiv/1912.02875, 5 Dec 2019.
Preprint arXiv/1912.02877, 5 Dec 2019.
J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised or self-supervised pre-training. Unsupervised
PDF.
By 1993, the approach solved problems of depth 1000 [UN2]
neural knowledge distillation procedure
The systems of 1991 allowed for much deeper learning than previous methods. More.
1992. Based on TR FKI-148-91, TUM, 1991.^{[UN0]} PDF.
approaches are now widely used. More.
[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF.
can be found here (depth > 1000).
2006. PDF.
It did not cite the much earlier 1991 unsupervised pre-training of stacks of more general recurrent NNs (RNNs)^{[UN0-3]}
the first NNs shown to solve very deep problems.
(or negative log probability) of the data representation in the level below.^{[HIN][T22][MIR]}
This can greatly facilitate very deep downstream learning.^{[UN0-3]}
The comment under reference^{[UN4]} applies here as well.
Theory of Universal Learning Machines & Universal AI.
Link.
[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF. More on the Fundamental Deep Learning Problem.
Results are essentially identical to those of Schmidhuber's diploma student Sepp Hochreiter (1991).^{[VAN1]} Even after a common publication,^{[VAN3]} the first author of [VAN2] published papers^{[VAN4]} that cited only their
own [VAN2] but not the original work.
PDF.
[VAN4] Y. Bengio. Neural net language models. Scholarpedia, 3(1):3881, 2008. Link.
Link.
Youtube video [see 28:16].
However, in 2010, Schmidhuber's team in Switzerland showed^{[MLP1-2]}
unsupervised pre-training is not necessary
Preprint arXiv:1609.08144 (PDF), 2016. Based on LSTM which it mentions at least 50 times.
WWW link (retrieved 15 May 2020).
Local copy (plain HTML only).
Schmidhuber's publications on exactly this topic
date back to 1991-93.^{[UN0-2][UN]}
already in 1995.^{[SNT]}
a general, practical, program-controlled computer.
architecture [NEU45].
PDF.
J. Schmidhuber (AI Blog, 2021). 80th anniversary celebrations: 1941: Konrad Zuse completes the first working general computer, based on his 1936 patent application.
J. Schmidhuber (AI Blog, 2021). 80. Jahrestag: 1941: Konrad Zuse baut ersten funktionalen Allzweckrechner, basierend auf der Patentanmeldung von 1936.
Weltwoche, Nr. 33.21, 19 August 2021.
PDF.
(v1: 24 Sep 2021, v2: 31 Dec 2021)
Versions since 2021 archived in the Internet Archive
This is a point-for-point critique of ACM's justification of the ACM A. M. Turing Award for deep learning, as well as a critique of the Turing Lecture given by the awardees (published by ACM in July 2021).
This critique draws on my 2015 survey of deep learning,^{[DL1]} my June 2020 article,^{[T20a][R12]} and version 1 of the present report; it can also be seen as a short history of the deep learning revolution, at least as far as ACM's erroneous laudation and the Turing Lecture are concerned.
(see Executive Summary and Sec. I, V, II, XII, XIX, XXI, XIII, XIV, XX, XVII).
(A) speech recognition,
(B) natural language processing,
(C) robotics,
(D) computer vision,
(VII) medicine, astronomy, materials science.
(see Sec. A, B, C, D, VII, XVII, VI, XVI).
(see Sec. II, V, XX, XVIII)
with Dr. Bengio & Dr. Hinton (see Sec. XVII, I).
I respond to LBH's recent ACM article (July 2021).
It expands material in my Critique of the 2019 Honda Prize^{[HIN]} (~3,000 words).
Abstract & Outline (~300 words),
Introduction (~300 words),
Critique of LBH's ACM article (Turing Lecture) of July 2021^{[DL3a]}
Executive summary of what's wrong with ACM's laudation (~1,000 words),
21 comments on 21 claims by ACM (~8,000 words),
Conclusion (~2,000 words).
All backed up by over 300 references (over 10,000 words).
The text contains numerous hyperlinks to relevant overview sites from the AI Blog.
"science is self-correcting."^{[SV20]} It is important to get the facts of the field's history right, no matter whether they are mine or other people's.^{[DL1-2][HIN][NASC1-9]} The present page is offered as a resource for all good computer scientists who share this inclination,
and to fight plagiarism,^{[FAKE2]}
collusion rings,^{[LIT21]} and systemic academic corruption in all of their more and less subtle forms.^{[FAKE]}
Sec. 2 addresses LBH's 2021 ACM article,^{[DL3a]} which necessitated an extension of the first version of this post.^{[T20a][R12]}
The main target is ACM's official justification^{[T19]} of the 2018 A.M. Turing Award.^{[R1]} After the Executive Summary in Sec. 3, Sec. 4 will split ACM's full text^{[T19]} into 21 parts
I, II, III, IV, V, VI, VII, VIII, IX, X, XI, XII, XIII, XIV, XV, XVI, XVII, XVIII, XIX, XX, XXI.
Most of the critiques are based on references to original papers and material from the AI Blog.^{[AIB][MIR][DEC][HIN]}
In July 2021, ACM published yet another misleading overview of the field, this time based on LBH's Turing Lecture,^{[DL3a]} extending LBH's well-known earlier omissions.^{[DLC][HIN][T20a]}
LBH claim to "briefly describe the origins of deep learning"^{[DL3a]} without even mentioning the world's first working deep learning nets by
Ivakhnenko and Lapa in 1965^{[DEEP1-2][R8]} (see Sec. II).
this class of methods was pioneered in 1991^{[UN-UN2]} (see Sec. II, III).
ResNet is a version of our Highway Net, the first really deep feedforward NN^{[HW1-3]} (see Sec. D, VI). These milestones of deep learning were all driven by my lab:^{[MOST]} In 1991, I had the first very deep NNs based on unsupervised pre-training;^{[UN-UN2]} LSTMs brought essentially unlimited depth to gradient-based supervised recurrent NNs;^{[LSTM0-17]} later our Highway Nets^{[HW1-3]} brought it to feedforward NNs.
Our superior end-to-end neural speech recognition from 2007^{[LSTM4,14]} was based on LSTM^{[LSTM0-6]} (1990s-2005) and CTC (2006).^{[CTC]} By the time of the Turing Lecture, our CTC-LSTM-based speech recognition (not that of Hinton) had been on most smartphones for years^{[GSR][GSR15-19][DL4]} (see Sec. A, VI, XI, XV). Similarly for machine translation (see Sec. B).
LBH cite Hinton (2012) for "dropout" without mentioning that dropout is just a variant of Hanson's 1990 stochastic delta rule^{[Drop1-3]} (see Sec. XIV).
perceptrons through stochastic gradient descent^{[GD1-3]} (without reverse mode backpropagation^{[BP1]}).
Fukushima who introduced ReLUs in 1969^{[RELU1-2]} (see Sec. XIV).
called AlexNet,^{[GPUCNN4]} without mentioning that our earlier groundbreaking deep GPU-based DanNet^{[GPUCNN1-3,5-8][DAN]} did not need ReLUs at all to win 4 earlier object recognition competitions and to achieve superhuman results already in 2011^{[GPUCNN1-8][R5-6]} (see Sec. XIV).
XVIII).
already in 1965^{[DEEP1-2][R8]} (see Sec. II).
earlier fast weights of von der Malsburg (1981) and Feldman (1982).^{[FAST,FASTa-b][FWP]}
described in the 1991-93 papers on Fast Weight Programmers and linear Transformers^{[FWP0-1,6]} (see Sec. XVI, XVII-2).
dedicate an extra section to attention-based Transformers,^{[TR1-6]} citing Bengio's team (2014) for "soft attention"^{[ATT14]} without citing the much earlier original work of 1991-1993 on soft attention and linear Transformers^{[FWP,FWP0-2,6][ATT]} (see Sec. XVII-1, XVI).
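The formal correspondence invoked here can be sketched in a few lines: a Transformer with linearized (softmax-free) self-attention maintains a fast weight matrix that is "programmed" by additive outer products of values and keys and queried by a matrix-vector product, as in a Fast Weight Programmer. A toy sketch; all names and numbers are mine and purely illustrative:

```python
# Sketch: linearized self-attention as fast weight programming.
# A "slow" net would normally produce the keys/values/queries from the
# input; here they are given directly. The fast weight matrix W
# accumulates outer products value x key; a query then retrieves
# W @ query -- no softmax, hence "linearized" attention.

def outer(v, k):
    return [[vi * kj for kj in k] for vi in v]

def matvec(W, q):
    return [sum(wij * qj for wij, qj in zip(row, q)) for row in W]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def linear_attention(keys, values, queries):
    d_v, d_k = len(values[0]), len(keys[0])
    W = [[0.0] * d_k for _ in range(d_v)]   # fast weights, start at zero
    out = []
    for k, v, q in zip(keys, values, queries):
        W = add(W, outer(v, k))             # "program" the fast weights
        out.append(matvec(W, q))            # retrieve with the query
    return out

# With one-hot keys, retrieval reproduces the stored values:
keys    = [[1.0, 0.0], [0.0, 1.0]]
values  = [[2.0, 3.0], [4.0, 5.0]]
queries = [[1.0, 0.0], [0.0, 1.0]]
print(linear_attention(keys, values, queries))
# first query retrieves [2.0, 3.0]; second retrieves [4.0, 5.0]
```

The same update-then-query loop, written with a softmax over key-query scores instead of the raw products, gives standard self-attention; dropping the softmax is what makes the attention weights collapse into a single additively updated weight matrix.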
LBH credit Bengio's team^{[NPM]} for neural language models without mentioning our much earlier 1995 work on neural probabilistic models of text compression^{[SNT]} (see Sec. XVI, XVII-1).
LBH cite Bengio's 2014 paper on Generative Adversarial Networks (GANs)^{[GAN0-1]} without mentioning that
GANs are instances
of the Adversarial Curiosity Principle of 1990^{[AC90-20][MIR](Sec. 5)} (see Sec. XVII).
In summation, LBH have repeatedly chosen to ignore the previous well-known critiques^{[DLC][HIN][T20a]} and deep learning surveys,^{[DL1-2]} and ACM's peer review process failed to catch this. ACM's Code of Ethics and Professional Conduct^{[ACM18]} states: "Computing
and deep learning (e.g., Sec. I), ACM lauds
Numerous references can be found under the relevant section links I-XXI, which adhere to the sequential order of ACM's text.^{[T19]}
Sec. II: Deep learning became really deep in 1991 in my lab, through unsupervised pre-training of NNs and supervised LSTM.
Sec. I contains 4 subsections A, B, C, D. A: Speech Recognition (see also Sec. VI & XI & XV): The first superior end-to-end neural speech recognition combines two methods from my lab: LSTM (1990s-2005) and CTC (2006), which were applied to speech in 2007. Hinton (2012) and Bengio (XV) still used the old hybrid approach and did not compare it to our revolutionary CTC-LSTM, which was soon on most smartphones.
Sec. B: Natural Language Processing (see also Sec. VI & XI & XVI): The first superior end-to-end neural machine translation (soon used for several billions of translations per day) was also based on our LSTM.
Sec. C: Robotics: Some of the most visible breakthroughs were also based on our LSTM.
Sec. D: Computer Vision (see also Sec. XVIII & XIV & XI & VI): The basic CNN principles were invented and developed by Fukushima and Waibel and applied to speech, all before LeCun's CNN work (XVIII). In 2010, we showed that deep NNs can be trained by plain backpropagation, without unsupervised pre-training (in contrast to Hinton's claims). Our DanNet was the first CNN fast & deep enough for superior computer vision in 2011, winning 4 image recognition contests in a row. ResNet, the most cited NN, is an open-gated version of our earlier Highway Nets.
Sec. XIV: Our deep & fast CNN achieved the first superhuman visual pattern recognition in an international contest (where LeCun participated). Sec. XI: ACM mentions GPU-accelerated NNs. Our deep GPU-NN of 2010 debunked unsupervised pre-training (introduced by myself in 1991 and later championed by Hinton), and our GPU-CNN of 2011 (DanNet) was the first to win computer vision contests. Sec. XVIII: The basic CNN architectures are due to Fukushima and Waibel (see Sec. D).
The first application of CNNs with backpropagation to biomedical/biometric images is due to Baldi and Chauvin.^{[BA93]}
Sec. VII: ACM explicitly mentions medicine; our DanNet was the first to win medical imaging competitions through deep learning.
Sec. XII & XIX & XXI: Modern backpropagation was first published by Linnainmaa in 1970 (see also Sec. XIII & II & V & III & IX & X & XX).
Sec. XX: ACM credits LeCun for work on
Sec. XXI: ACM credits LeCun for work on
Sec. XV: ACM credits Bengio for hybrids of NNs and probabilistic models of sequences. Our CTC-LSTM superseded such hybrids (see Sec. A & B).
Sec. XVI: ACM credits LBH here, but we started this in 1990-93, long before LBH.
Sec. XVII: My lab's contributions include Artificial Curiosity (1990), vanishing gradients (1991), metalearning (1987), unsupervised pre-training (1991), compressing or distilling one NN into another (1991), learning sequential attention with NNs (1990), fast weight programmers (1991), and other topics.^{[R2-R6]}
Sec. IV is on Turing (1936) and his predecessors.
Critique of LBH's ACM article (Turing Lecture) of July 2021.
Sec. Conclusion:
In the recent decade of deep learning, the most valuable applications (speech recognition, language translation, etc.) on billions of devices (also healthcare applications) were heavily based on results from my labs.
See Sec. II & III & V & XII & XIII & XVII & XIV & XIX & XX & XXI.
In what follows, ACM's full text [T19] is split into 21 parts I, II, III, IV, V, VI, VII, VIII, IX, X, XI, XII, XIII, XIV, XV, XVI, XVII, XVIII, XIX, XX, XXI.
LBH and their co-workers have contributed certain useful improvements of existing deep learning methods.^{[CNN2,4][CDI][LAN][RMSP][XAV][ATT14][CAPS]} However, the field's foundations were laid by others: deep learning multilayer perceptrons (1965),^{[DEEP1-2][R8]} stochastic gradient descent for multilayer perceptrons (1967),^{[GD1-3]} modern backpropagation (1970),^{[BP1-2][R7]} architectures of recurrent NNs (1925-56)^{[I25][MC43][K56]} and convolutional NNs (1979),^{[CNN1]} principles of generative adversarial NNs and artificial curiosity (1990),^{[AC90,90b][AC20]} unsupervised pre-training for deep NNs (1991),^{[UN1-2]} vanishing gradients (1991)^{[VAN1]} & Long Short-Term Memory or LSTM (Sec. A), GPU-accelerated NNs (2004),^{[GPUNN][DAN][DAN1][GPUCNN5]} NNs with over 100 layers (2015),^{[HW1-3][R5]} and transformer-like^{[TR1-6][FWP]} attention^{[FWP][ATT]} through fast weight programmers (1991).^{[FWP0-2,6]}^{[DL1-2][R2-R8]} Often LBH failed to cite essential prior work, even in their later surveys.^{[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5, R7-R8]} This may explain some of ACM's misattributions.^{[T19]} See Sec. II & III & V & XIII & X & XVII & XII & XVIII & XX. By the 2010s,^{[DEC]} deep NNs were heavily used in academia and industry,^{[DL4]} in the fields mentioned by ACM (labeled as A, B, C, D) below. A. Speech Recognition. (A1) Long Short-Term Memory or LSTM (1990s-2005)^{[LSTM0-6]} overcomes the vanishing gradient problem analyzed by my student Sepp Hochreiter in 1991.^{[VAN1]} This happened long before the similar work of Bengio (see Sec. XVII).^{[MIR](Sec. 3, Sec. 4)} LSTM was refined with my student Felix Gers^{[LSTM2]} through "forget gates" based on end-to-end-differentiable fast weights.^{[MIR](Sec. 8)[FWP,FWP0-1]} (A2) Connectionist Temporal Classification (CTC) by my student Alex Graves et al. (2006).^{[CTC]} Our team successfully applied CTC-trained LSTM to speech in 2007^{[LSTM4]} (also with hierarchical LSTM stacks^{[LSTM14]}). This was very different from earlier hybrids of NNs and Hidden Markov models (HMMs)^{[BW][BRI][BOU]} (Sec. XV). Hinton et al. (2012) still used the old hybrid approach^{[HYB12]} and did not compare it to CTC-LSTM.
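To make the mechanism behind these claims concrete, here is a single-unit sketch of one step of a vanilla LSTM cell with a forget gate; the weights and inputs are illustrative choices of mine, not taken from the cited papers:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, W):
    """One step of a single-unit vanilla LSTM cell with a forget gate.
    W maps gate name -> (input weight, recurrent weight, bias)."""
    def gate(name, squash):
        wx, wh, b = W[name]
        return squash(wx * x + wh * h_prev + b)
    f = gate("f", sigmoid)       # forget gate: how much old state to keep
    i = gate("i", sigmoid)       # input gate: how much new input to write
    g = gate("g", math.tanh)     # candidate value to write
    o = gate("o", sigmoid)       # output gate
    c = f * c_prev + i * g       # gated "constant error carousel"
    h = o * math.tanh(c)
    return h, c

# Illustrative weights: a wide-open forget gate (large positive bias)
# preserves the cell state over long spans, which is how LSTM sidesteps
# the vanishing gradient problem described above.
W = {"f": (0.0, 0.0, 10.0), "i": (1.0, 0.0, 0.0),
     "g": (1.0, 0.0, 0.0), "o": (0.0, 0.0, 10.0)}
h, c = 0.0, 1.0
for _ in range(100):             # 100 steps with zero input
    h, c = lstm_cell_step(0.0, h, c, W)
print(round(c, 3))               # the state is almost perfectly preserved
```

With the forget gate nearly saturated at 1, the cell state after 100 steps is still about 0.995 of its initial value; a plain recurrent unit squashed through tanh at every step would have lost it long before.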
Through the work of my student Alex Graves, CTC-trained LSTM became the first recurrent NN (RNN) to win international competitions. Graves later reused our end-to-end neural speech recognizer^{[LSTM4][LSTM14]} as a postdoc in Hinton's lab.^{[LSTM8]} CTC-LSTM dramatically improved Google's speech recognition.^{[GSR][GSR15][DL4]} By 2019, Google's on-device speech recognition^{[GSR19]} (no longer on the server) was still based on LSTM^{[MIR](Sec. 4)} (see Sec. VI & XI & XV). B. Natural Language Processing. Already in 1995, we had an excellent neural probabilistic model of text^{[SNT]} (see Sec. XVI). In 2001, we showed that LSTM can learn languages unlearnable by traditional models such as HMMs.^{[LSTM13]} See also Sec. VI & XI & XV. Machine translation later profited from attention mechanisms tailored by Bengio's team.^{[ATT14][FWP]} However, such attention mechanisms also have their roots in my lab (1991);^{[FWP][FWP0-2,6]} see Sec. XVI. C. Robotics & RL etc. Since 2003, our team has used LSTM for Reinforcement Learning (RL) and robotics.^{[LSTM-RL][RPG][LSTMPG]} In the 2010s, this bore fruit in highly visible applications. For example, in 2018, a PG-trained LSTM was the core of OpenAI's famous Dactyl which learned to control a dextrous robot hand without a teacher.^{[OAI1][OAI1a]} In 2019, DeepMind beat a pro player in the game of Starcraft, which is theoretically harder than Chess or Go^{[DM2]} in many ways, using Alphastar whose brain has a deep LSTM core trained by PG.^{[DM3]} A PG-trained LSTM was also the core of OpenAI Five which learned to defeat human experts in the Dota 2 video game (2018).^{[OAI2]} Bill Gates called this a "huge milestone in advancing artificial intelligence".^{[OAI2a][MIR](Sec. 4)[LSTMPG]} Apart from A, B, C above, LSTM is also used in healthcare, chemistry, molecular design, lip reading, speech synthesis,^{[AM16]} predicting what's going on in nuclear fusion reactors, and so on.^{[DEC][DL4]} By 2017, a large share of the compute for inference in Google's datacenters was being used for LSTM (only 5% for the CNNs of Sec. D).^{[JOU17]} Apparently the first LSTM journal paper^{[LSTM1][R5]} is now the most cited deep learning research paper of the 20th century.
D. Computer Vision was revolutionized in the 2010s by a particular feedforward neural net (NN) called the convolutional NN (CNN).^{[CNN1-4]} The basic CNN architecture with convolutional and downsampling layers is due to Fukushima (1979),^{[CNN1]} who also introduced the now widely used rectified linear units (ReLUs) in 1969.^{[RELU1]} In 1987, NNs with convolutions were combined by Waibel with weight sharing and backpropagation.^{[CNN1a]} Waibel did not call this CNNs but TDNNs. The popular downsampling variant called max-pooling was introduced by Yamaguchi et al. for TDNNs in 1990^{[CNN3a]} and by Weng et al. for higher-dimensional CNNs in 1993.^{[CNN3]} Since 1989, LeCun's team has contributed improvements of CNNs, especially for images^{[CNN2,4]} (see Sec. XVIII). Finally, my own team showed in 2010^{[MLP1]} that unsupervised pre-training is not necessary to train deep NNs, contrary to claims by Hinton^{[VID1]} who said that "nobody in their right mind would ever suggest" this. Our fast GPU-based CNN of 2011,^{[GPUCNN1]} known as DanNet,^{[DAN,DAN1][R6]} went far beyond the earlier GPU-accelerated CNNs of 2006.^{[GPUCNN]} DanNet won a series of computer vision competitions, winning four of them in a row (15 May 2011, 6 Aug 2011, 1 Mar 2012, 10 Sep 2012).^{[GPUCNN5]} At IJCNN 2011 in Silicon Valley, DanNet blew away the competition and achieved the first superhuman visual pattern recognition^{[DAN1]} in an international contest (where LeCun's team took a distant second place). DanNet was also the first deep CNN to win a Chinese handwriting contest (ICDAR 2011) and an image segmentation contest (ISBI, May 2012). Only after our CVPR paper on DanNet^{[GPUCNN3]} did the similar AlexNet of Hinton's student Krizhevsky win the ImageNet^{[IM09]} 2012 contest^{[GPUCNN4-5][R6]} (now also without unsupervised pre-training, citing DanNet). Our CNN image scanners were 1000 times faster than previous methods.^{[SCAN]} The VGG network (ImageNet 2014 winner)^{[GPUCNN9]} and other highly cited CNNs^{[RCNN1-3]} further extended the work of 2011.^{[MIR](Sec. 19)} ResNet, the ImageNet 2015 winner^{[HW2]} (Dec 2015) and currently the most cited neural network,^{[MOST]} is a version (with open gates) of our earlier Highway Net (May 2015).^{[HW1-3][R5]} The Highway Net is actually the feedforward net version of vanilla LSTM.^{[LSTM2]} It was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers). See also Sec. XVIII & XIV & XI & VI.
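The "open gates" relation between the two architectures can be checked in a few lines: a Highway layer computes y = H(x)·T(x) + x·C(x) with a transform gate T and a carry gate C, and fixing both gates open (saturated at 1) yields the residual layer y = H(x) + x of ResNet. A scalar sketch with my own toy parameters, purely illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def highway_layer(x, w_h, b_h, w_t, b_t, w_c, b_c):
    """Scalar highway layer: y = H(x)*T(x) + x*C(x).
    T is the transform gate, C the carry gate (often coupled as C = 1 - T)."""
    H = math.tanh(w_h * x + b_h)   # the nonlinear transform
    T = sigmoid(w_t * x + b_t)     # transform gate
    C = sigmoid(w_c * x + b_c)     # carry gate (the skip path)
    return H * T + x * C

def residual_layer(x, w_h, b_h):
    """Scalar ResNet-style layer: y = H(x) + x."""
    return math.tanh(w_h * x + b_h) + x

# With both gates saturated open (large positive gate biases),
# the highway layer reduces to the residual layer:
x, w_h, b_h = 0.3, 0.7, -0.1
open_gate = 100.0                  # sigmoid(100) is 1.0 in float arithmetic
y_highway = highway_layer(x, w_h, b_h, 0.0, open_gate, 0.0, open_gate)
y_resnet = residual_layer(x, w_h, b_h)
print(abs(y_highway - y_resnet) < 1e-9)   # True
```

The gates are what the Highway Net adds: they let the net learn, per unit, how much to transform and how much to simply carry the input through, rather than hard-wiring the identity skip.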
Deep learning architectures appeared long before the 1980s. The first non-learning recurrent NN (RNN) architecture (the Lenz-Ising model) was analyzed by physicists in the 1920s.^{[L20][I25][K41][W45]} Recurrent architectures were also discussed in 1943 by McCulloch and Pitts^{[MC43]} and formally analyzed in 1956 by Kleene.^{[K56]} In 1972, Amari reused the Lenz-Ising model to build a learning RNN, later sometimes called the Hopfield network or Amari-Hopfield Network.^{[AMH1-3]} Turing wrote about artificial evolution;^{[TUR1]} Rosenblatt's perceptron with a single adaptive layer learned in 1958^{[R58]} (compare Joseph^{[R61]}); Widrow & Hoff's similar Adaline learned in 1962.^{[WID62]} Such shallow learning goes back to regression and the method of least squares.^{[DL1-2]} Deeper multilayer perceptrons (MLPs) were discussed by Steinbuch^{[ST61-95]} (1961), Joseph^{[R61]} (1961), and Rosenblatt^{[R62]} (1962), who wrote about "back-propagating errors" in an MLP with a hidden layer,^{[R62]} but did not yet have a general deep learning algorithm for deep MLPs (what's now called backpropagation is quite different and was first published by Linnainmaa in 1970^{[BP1-BP5][BPA-C]}). Compare also Selfridge's multilayer Pandemonium^{[SE59]} (1959). The first working learning algorithms for deep MLPs were published by Ivakhnenko & Lapa in 1965 (their nets containing the now popular multiplicative gates).^{[DEEP1-2][DL1-2]} A paper of 1971^{[DEEP2]} already described a deep learning net with 8 layers, trained by their highly cited method which was still popular in the new millennium,^{[DL2]} especially in Eastern Europe, where much of Machine Learning was born.^{[MIR](Sec. 1)[R8]} LBH failed to cite this, just like they failed to cite Amari,^{[GD1]} who in 1967 proposed stochastic gradient descent^{[STO51-52]} (SGD) for MLPs and whose implementation^{[GD2,GD2a]} (with Saito) learned internal representations at a time when compute was billions of times more expensive than today (see also Tsypkin's work^{[GDa-b]}). Fukushima's deep convolutional NN architecture was first introduced in the 1970s;^{[CNN1]} he had presented his very popular ReLU already in 1969.^{[RELU1-2]} See Sec. XIII, III, V, VIII, IX, and X. A misleading "history of deep learning" has been propagated by LBH & co-authors, e.g., Sejnowski^{[S20]} (see Sec. XIII).
It goes more or less like this: "In 1969, Minsky & Papert^{[M69]} showed that shallow NNs without hidden layers are very limited, and the field was abandoned until researchers took a fresh look at the problem in the 1980s."^{[S20]} However, as mentioned above, the 1969 book^{[M69]} addressed a "problem" of Gauss & Legendre's shallow learning (~1800)^{[DL1-2]} that had already been solved 4 years prior by Ivakhnenko & Lapa's popular deep learning method^{[DEEP1-2][DL2]} (and then also by Amari's SGD for MLPs^{[GD1-2]}). Minsky was apparently unaware of this and failed to correct it later.^{[HIN](Sec. I)} Deep learning research was alive also in the 1980s (see, e.g., a 1989 paper^{[MOZ]}). However, it became really deep in 1991 in my lab;^{[UN-UN3]} see Sec. 1 of the overview:^{[MIR]} First Very Deep NNs, Based on Unsupervised Pre-Training (1991). This enabled "Very Deep Learning" tasks of depth > 1000.^{[UN2][DL1][UN]} (By 2003, LSTM variants successfully dealt with language problems of depth up to 30,000^{[LSTM17]} and more.) Our lab also drove the shift from unsupervised pre-training to purely supervised learning (1991-95; 2006-10).^{[HIN](Sec. II)[MIR](Sec. 19)} See Sec. III. Note that LSTMs brought essentially unlimited depth to gradient-based supervised recurrent NNs; Highway Nets^{[HW1-3]} brought it to feedforward NNs.^{[MOST]}
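Why these depth numbers matter can be illustrated numerically: backpropagating an error signal through a chain of sigmoid units multiplies it by a local derivative (at most 0.25) times a weight at every step, so it shrinks exponentially with depth — the vanishing gradient problem identified in 1991. A toy demonstration with illustrative parameters of my own choosing:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_factor(z, w):
    """Local factor picked up per layer: sigmoid'(z) * w."""
    s = sigmoid(z)
    return s * (1.0 - s) * w

def gradient_after_depth(depth, z=0.0, w=1.0):
    """Gradient scale after backpropagating through `depth` layers."""
    g = 1.0
    for _ in range(depth):
        g *= backprop_factor(z, w)   # chain rule: multiply per layer
    return g

for depth in (10, 100, 1000):
    print(depth, gradient_after_depth(depth))
# With z=0, w=1 the per-layer factor is exactly 0.25: the gradient
# falls to ~1e-6 at depth 10, ~6e-61 at depth 100, and underflows
# to 0.0 at depth 1000 -- learning across such depths stalls.
```

This is exactly why "depth > 1000" was out of reach for plain gradient descent, and why architectures that preserve the error signal (LSTM's gated cell state, Highway/ResNet skip paths) were needed.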
Much of this was created by others (Sec. III).^{[DLC][DEEP1-2][BP1][DL1-2][R7-R8][R2-R4]} The foundations include: deep learning multilayer perceptrons (1965),^{[DEEP1-2][R8]} stochastic gradient descent for multilayer perceptrons (1967),^{[GD1-3]} modern backpropagation (1970),^{[BP1,2][R7]} architectures of recurrent NNs (1925-56)^{[I25][MC43][K56]} and convolutional NNs (1979),^{[CNN1]} principles of generative adversarial NNs and artificial curiosity (1990),^{[AC90,90b][AC20]} unsupervised pre-training for deep NNs,^{[UN1-2]} the vanishing gradient problem (1991)^{[VAN1]} & solutions to it (Sec. A), GPU-accelerated NNs (2004),^{[GPUNN][GPUCNN5]} and other foundations.^{[DL1-2][R2-R8]} Often LBH failed to cite essential prior work.^{[DLC][HIN][MIR](Sec. 21)} See Sec. II & V & XIII & IX & X & XVII & XII & XVIII & XX & I. Compare deeplearning.net, which until 2019 advertised deep learning as "moving beyond shallow machine learning since 2006",^{[DL7]} referring to Hinton's^{[UN4]} and Bengio's^{[UN5]} unsupervised pre-training methods of 2006, although we had this type of deep learning already in 1991;^{[UN][UN1-2]} see Sec. II & XVII (5). Not to mention Ivakhnenko's even earlier supervised layer-wise training of deep NNs^{[DEEP1-2]} which Hinton,^{[UN4]} Bengio,^{[UN5]} and LBH^{[DL3,DL3a]} did not cite either. See Sec. X.
In what follows, my comments systematically track the sequential order of ACM's claims.^{[T19]}
ACM's statement on Turing is greatly misleading, like some of its other statements.^{[T19]} In 1931, Gödel identified fundamental limits of theorem proving, computing, and any type of computation-based AI.^{[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]} Much of early AI in the 1940s-70s was actually about theorem proving.^{[ZU48][NS56]}
In 1936, Turing introduced the Turing Machine.^{[TUR]} He rederived the above-mentioned result.^{[CHU][TUR][HIN][GOD21,21a][TUR21][LEI21,21a]} In the same year of 1936, Emil Post published yet another independent universal model of computing.^{[POS]} (See also my reply to Hinton, who criticized my website on Turing without suggesting any fact-based corrections.^{[HIN]}) Gödel also posed the open problem "P=NP?" in his famous letter to John von Neumann (1956).^{[GOD56][URQ10]} Likewise, Konrad Zuse (1910-1995) created the world's first working programmable general-purpose computer 1935-41. His patent application of 1936^{[ZU36-38][Z36][RO98][ZUS21]} already described digital circuits, predating Claude Shannon's 1937 thesis on digital circuit design.^{[SHA37]} Zuse also created the first high-level programming language in the early 1940s.^{[BAU][KNU]} (See also the discussion of the conditional jump instruction.^{[RO98]})
Again, the foundations were created by others: deep NNs that learn internal representations (1965),^{[DEEP1-2][R8]} stochastic gradient descent for multilayer perceptrons (1967),^{[GD1-3]} modern backpropagation (1970),^{[BP1,2][R7]} architectures of recurrent NNs (1925-56)^{[I25][MC43][K56]} and convolutional NNs (1979),^{[CNN1]} principles of generative adversarial NNs and artificial curiosity (1990),^{[AC][AC90,90b][AC10][AC20]} unsupervised pre-training for deep NNs (1991),^{[UN1-2][UN]} vanishing gradients (1991)^{[VAN1]} & solutions to it (Sec. A),^{[LSTM0-17][CTC]} GPU-accelerated NNs (2004),^{[GPUNN][GPUCNN5]} record-breaking deep supervised NNs (2010)^{[MLP1-2]} and contest-winning deep CNNs (2011),^{[DAN][DAN1][GPUCNN5]} NNs with over 100 layers (2015),^{[HW1-3][R5]} transformer-like^{[TR1-6][FWP]} attention^{[FWP][ATT]} through fast weight programmers (1991),^{[FWP0-2,6]} and more.^{[DL1-2][R2-R8]} Often LBH failed to cite essential prior work.^{[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5,R7,R8,R11]} See Sec. II & I & III & XIII & X & XVII & XII & XVIII & XX.
ACM mentions "advances in natural language processing" and in speech. These advances were driven by the fast supervised NNs and CNNs achieved by our group in 2010-2011^{[MLP1-2][DAN][DAN1][GPUCNN5][R6]} and through Highway Net-like NNs (2015),^{[HW1-3][R5]} although the principles of CNNs were invented and developed by others since the 1970s.^{[CNN1-4]} See Sec. D & XVIII & XIV as well as Sec. 4 & Sec. 19 of the overview.^{[MIR]}
Baldi and Chauvin (1993) had the first application of CNNs with backpropagation to biomedical/biometric images.^{[BA93]} Our DanNet^{[DAN][DAN1][GPUCNN5]} was the first NN to win a medical imaging contest through deep learning (Sept 2012, on cancer detection),^{[GPUCNN5,8]} and was able to greatly improve steel defect detection.^{[ST]} All of this happened before the similar GPU-accelerated AlexNet of Hinton's student Krizhevsky^{[GPUCNN4-5][R6]} and the VGG network.^{[GPUCNN9]} DanNet also won a contest on mitosis detection^{[MGC][GPUCNN5,8]} (using the approach of Sec. D & XI).
without citing them.^{[DL1][DLC][HIN][R2-R4][R7-R8]} See Sec. V & XII & XIX & II & III & XIII & XVII & X & I.
who failed to cite them, even in later work.^{[HIN][DLC][DL1-2][DEEP1-2][RELU1-2][R7-R8]} See Sec. II & III & XIII & V & X & XIV & I.
The term "deep learning" was first introduced to Machine Learning by Dechter (1986), and to NNs by Aizenberg et al (2000).^{[DL2]} To my knowledge, LBH have never cited them. (Margin note: our 2005 paper on deep RL^{[DL6,6a]} was apparently the first machine learning publication with the phrase "learn deep" in the title.) Only later did LBH start talking about "deep learning ... moving beyond shallow machine learning since 2006",^{[DL7]} referring to their unsupervised pre-training methods of 2006. See Sec. III. Others built careers on deep learning long before LBH recognized this.^{[DEEP1-2][CNN1][HIN][R8][DL1][DLC]} Even deep learning through unsupervised pre-training was introduced by others.^{[UN1-3][R4][HIN](Sec. II)} See Sec. II & III & XIII & V & I.
ignored by LBH's papers^{[HIN][R7-R8][R2-R5]} (see Sec. V & II & III & I & XIII & XII & XIX & X & XVII).
ACM correctly mentions advancements through GPUs. The first to use GPUs for NNs were Jung & Oh (2004),^{[GPUNN][GPUCNN5]} but it was our group that made GPU-based NNs fast and deep enough to set an important benchmark record in 2010,^{[MLP1-2]} showing that unsupervised pre-training (pioneered by myself in 1991) is not necessary to train deep NNs, contrary to Hinton's claims.^{[VID1]} By 2011, our CNNs were deep and fast enough^{[DAN][DAN1][GPUCNN5]} to revolutionize computer vision (explicitly mentioned by ACM) for the first time^{[R6]} (see Sec. D).
Furthermore, by the mid 2010s, speech recognition and machine translation (explicitly mentioned by ACM) were actually dominated by the LSTM and CTC of our team.^{[LSTM1-4][CTC]} In particular, as mentioned in Sec. A, this end-to-end approach was superior to traditional methods such as HMMs.^{[BW][BOU][BRI][HYB12]} As mentioned in Sec. B and XVI, the first superior end-to-end neural machine translation was also based on LSTM.
ACM's statement is "less wrong" than Honda's^{[HIN](Sec. I)} but still misleading: ACM (and apparently even other award committees^{[HIN](Sec. I)}) credits backpropagation to Rumelhart et al. (1985-86),^{[RUM]} although Werbos had already applied it to NNs (1982).^{[BP2]} And the article^{[RUM]} even failed to mention Linnainmaa, the inventor of this famous algorithm for credit assignment in networks (1970),^{[BP1]} although Kelley already had a precursor thereof in the field of control theory;^{[BPA]} see also later work of the early 1960s.^{[BPB][BPC]}^{[R7]} Rumelhart et al. showed experimentally that backpropagation can yield useful internal representations in hidden layers of NNs.^{[RUM]} But this was essentially just an experimental analysis of a known method.^{[BP1-2]} More on the history of backpropagation can be found at Scholarpedia^{[DL2]} and in my award-winning survey.^{[DL1]} Also see Sec. XIX, II.
Some claim that "backpropagation is just the chain rule of Leibniz (1676) & L'Hopital (1696)." No: it is the efficient way of applying the chain rule to big networks with differentiable nodes (there are also many inefficient ways of doing this), and it was not published until 1970.^{[BP1]} Regarding a recent debate:^{[HIN]} it is true that in 2018, Hinton^{[AOI]} credited Rumelhart^{[RUM]} with the "invention" of backpropagation, while Hinton himself has been credited for "creating" the method and for other things he didn't do.^{[HIN]} Neither in a popular book^{[AOI]} nor in other recent work^{[DL3,DL3a]} did he cite Linnainmaa (1970),^{[BP1]} the true creator.^{[BP4-5]} It is true that his 2015 survey^{[DL3]} does cite Werbos (1974), who however described the method correctly only later in 1982^{[BP2]} and also failed to cite Linnainmaa.^{[BP1]} Compare the 1967-68 work of Amari:^{[GD1-3]} to my knowledge the first to propose and implement stochastic gradient descent^{[STO51-52]} for multilayer perceptrons (though not yet the efficient reverse mode gradient descent method now known as backpropagation^{[BP1]}); see also Tsypkin's work of 1966.^{[GDa-b]} Linnainmaa's backpropagation method was well-known.^{[BP5][DL1-2][DLC]} It wasn't created by "lots of different people" as Hinton suggested;^{[AOI][HIN][R11]} there is one person who published first^{[BP1]} and therefore should get the credit.
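The point about efficiency can be made concrete: reverse mode applies the chain rule once per connection in a single backward sweep from the output, instead of re-deriving each partial derivative from scratch. A minimal illustrative sketch (the `Var` class and its methods are my own toy construction, not any historical formulation):

```python
import math

class Var:
    """Scalar node in a computation graph with reverse-mode differentiation."""
    def __init__(self, value, parents=()):
        self.value = value      # forward value
        self.parents = parents  # (parent_node, local_derivative) pairs
        self.grad = 0.0         # accumulated d(output)/d(self)

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def tanh(self):
        t = math.tanh(self.value)
        return Var(t, ((self, 1.0 - t * t),))

    def backward(self):
        # Topologically order the graph, then sweep once from the output
        # to the inputs: each node distributes its gradient to its
        # parents, weighted by the locally stored derivatives.
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad

# Tiny "network": y = tanh(w*x + b); one backward() fills in all gradients.
w, x, b = Var(0.5), Var(2.0), Var(0.1)
y = (w * x + b).tanh()
y.backward()
```

The cost of the backward sweep is proportional to that of the forward pass, regardless of how many inputs the network has; that property, not the chain rule itself, is what the 1970 method contributed.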
ACM credits Hinton for the Boltzmann Machine (BM),^{[BM]} an approach to learning.^{[HIN]} Recently, however, I learnt through a reader that even the BM paper^{[BM]} did not cite prior relevant work by Sherrington & Kirkpatrick^{[SK75]} and Glauber.^{[G63]} (Compare related work.^{[H86][H88][S93]}) Compare also the much earlier multilayer perceptrons with arbitrarily many layers.^{[DEEP1-2][HIN]} See Sec. II & V & X.^{[MIR](Sec. 1)[R8]}
As mentioned in Sec. II, Sejnowski's rather self-serving "history of deep learning" [S20] claims that after Minsky & Papert's 1969 book,^{[M69]} the field lay dormant until researchers looked anew "at the problem in the 1980s."^{[S20]} However, the 1969 book^{[M69]} addressed a "deep learning problem" (a limitation of Gauss & Legendre's shallow learning around 1800^{[DL1-2]}) that had already been solved four years prior (see Sec. II), and deep learning research continued in the 1970s, especially outside of the Anglosphere.^{[DEEP2][GD1-3][CNN1][DL1-2]}
Dropout is actually a variant of Hanson's much earlier stochastic delta rule (1990).^{[Drop1-3]} Hinton's 2012 paper and his later patent did not cite this either. Nor was dropout needed to win computer vision competitions, as we showed already in 2011 in a contest where LeCun's team participated as well;^{[DAN1]} see Sec. D above. Back then, the only really decisive ingredient was the speedup of deep CNNs through GPUs.^{[GPUCNN1,3,5][R6]} Already before ImageNet 2012,^{[R6]} our fast deep CNN called DanNet had a monopoly on winning computer vision competitions.^{[GPUCNN5]} It more than "halved the error rate for object recognition" (ACM's wording) in a contest already in 2011,^{[GPUCNN2][DAN,DAN1][R6]} long before the similar system of Hinton's student. See Sec. D as well as Sec. 19 of the overview.^{[MIR]}
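For readers unfamiliar with the method under dispute, here is a minimal sketch of modern "inverted" dropout. Note the hedge: Hanson's stochastic delta rule injects noise at the level of individual weights (treating each as a random variable), rather than applying Bernoulli masks to whole units, so this illustrates the general noise-injection mechanism, not the 1990 formulation:

```python
import numpy as np

def dropout(activations, p_drop, rng, train=True):
    """Inverted dropout: during training, zero each unit with probability
    p_drop and rescale the survivors by 1/(1 - p_drop), so the expected
    activation matches test time (where the layer is the identity)."""
    if not train:
        return activations
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
a = np.ones(100_000)
out = dropout(a, p_drop=0.5, rng=rng)
# Roughly half the units are zeroed, but the mean is preserved.
```

The rescaling is the design choice worth noting: without it, activations would shrink in expectation during training and the network would behave differently at test time.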
Speech recognition had been dominated by hybrid NN-HMM approaches since the late 1980s.^{[BW][BRI][BOU]} Our LSTM (1990s-2005)^{[LSTM0-6]} and CTC^{[CTC]} (2006) were applied to speech in 2007.^{[LSTM4][LSTM14]} CTC-LSTM is end-to-end-neural and thus very different from (and superior to) the hybrid methods in use since the late 1980s.^{[BW][BRI][BOU][HYB12]} See also Sec. A.
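The core of CTC is a many-to-one map from per-frame network outputs to label sequences: training sums over all frame-level alignments that map to the target, and greedy decoding simply applies the map to the per-frame argmaxes. A toy sketch of the collapse rule (the blank symbol `-` is a notational choice here, not a fixed part of the method):

```python
def ctc_collapse(frame_labels, blank="-"):
    """CTC's many-to-one alignment map: merge consecutive repeated
    labels, then drop blanks. Many frame-level alignments collapse to
    the same transcription, which is what lets CTC train without
    frame-level segmentation."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)

# Different frame sequences, same transcription:
print(ctc_collapse(list("--hh-ee-ll-ll-oo-")))  # hello
print(ctc_collapse(list("hheelll-llo")))        # hello
```

The blank is what allows repeated output labels (the double "l" above): only a blank between two identical frames separates them into two emissions.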
Five years earlier, in 1995, we already had a similar, excellent neural probabilistic text model.^{[SNT]} Bengio^{[NPM]} characterizes it only briefly as "related" (see also Pollack's earlier work on embeddings of words and other structures^{[PO87][PO90]}). In the 2010s, the workhorse of natural language processing was actually the LSTM of our team,^{[LSTM0-6]} which Bloomberg called "arguably the most commercial AI achievement."^{[AV1][MIR](Sec. 4)} See Sec. B. The attention mechanism of Bengio's team^{[ATT14]} has indeed become important. For example, it helped to further improve Facebook's LSTM-based translation (see Sec. B). However, such attention mechanisms also have their roots in my lab's adaptive neural sequential attention (1990-93): end-to-end-differentiable "soft" attention in the latent space of Fast Weight Programmers (FWPs),^{[FWP2][FWP]} and "hard" attention (in observation space) in the context of RL^{[ATT][ATT0-1]} (1990). Today's attention-based Transformers,^{[TR1-6]} which have become a popular alternative to RNNs, build on principles of my FWPs of 1991.^{[FWP0-1]} My FWP of 1991^{[FWP0-1]} computes fast weight changes through additive outer products of activation patterns (now often called keys and values for self-attention).^{[TR1-6][FWP]} By the end of the 2010s,^{[DEC]} Transformers^{[TR1-2]} excelled at natural language processing, a traditional LSTM domain (see Sec. B), although there remain problems that LSTM can rapidly learn to solve quickly.^{[LSTM13,17]} The linear Transformers or Performers^{[TR5-6]} are formally equivalent to my 1991 FWPs (apart from normalization).^{[FWP6][FWP]} In 1993, I introduced the attention terminology^{[FWP2]} now used in this context,^{[ATT]} and RNNs that program themselves.
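The claimed formal equivalence is easy to check numerically for the unnormalized case: self-attention without softmax can be rewritten as a query applied to a fast weight matrix that accumulates additive outer products of values and keys. A small numpy sketch (toy dimensions and variable names are mine, not from the cited papers):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 4  # sequence length, feature dimension
keys    = rng.standard_normal((T, d))
values  = rng.standard_normal((T, d))
queries = rng.standard_normal((T, d))

# Attention view (causal, no softmax): out_t = sum_{s<=t} (q_t . k_s) v_s
attn_out = np.zeros((T, d))
for t in range(T):
    for s in range(t + 1):
        attn_out[t] += (queries[t] @ keys[s]) * values[s]

# Fast weight view: a slow net emits (key, value) pairs; the fast
# weight matrix W is programmed by additive outer products, then
# queried. This is a recurrence with O(d*d) state instead of a
# growing attention window.
W = np.zeros((d, d))
fwp_out = np.zeros((T, d))
for t in range(T):
    W += np.outer(values[t], keys[t])  # program the fast net
    fwp_out[t] = W @ queries[t]        # query the fast net
```

Both loops produce the same outputs, since W @ q_t = sum over s <= t of (q_t . k_s) v_s by linearity; softmax-normalized attention breaks this identity, which is the "apart from normalization" caveat above.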
See^{[MIR](Sec. 9)[R4]} for my related priority dispute on attention with Hinton. He was the reviewer of my 1990 paper^{[ATT2]} and later published related work of his own.^{[ATT3]}
GANs^{[GAN0-1]} (2010-2014) are actually a simple application^{[AC]} of the adversarial curiosity (AC) principle from 1990^{[AC90,90b][AC20]} (see also surveys^{[AC09-10]}). This principle is now widely used for exploration in RL (e.g., Sec. C) and for image synthesis^{[GAN1]} (also mentioned by ACM in Sec. XVIII). The predictor NN minimizes its error, while the generator NN tries to make outputs that maximize this error: one net's loss is the other net's gain. Four years before the GAN paper,^{[GAN1]} a well-known 2010 survey^{[AC10]} summarised the generative adversarial NNs of 1990 as follows: a predictor NN learns to predict whether the controller's (or generator's) output is in a given set.^{[AC20][AC]} (The early adversarial machine learning settings^{[S59][H90]} neither involved unsupervised NNs nor were about modeling data nor used gradient descent.^{[AC20]}) Bengio et al. neither cited the original work^{[AC90,90b][AC20]} nor corrected their erroneous claims^{[GAN1]} about my other adversarial method, Predictability Minimization or PM (1991).^{[PM1-2][AC20][R2][MIR](Sec. 5)} According to Bloomberg,^{[AV1]} I challenged their NIPS 2014 paper^{[GAN1]} and some of the erroneous claims it made about my prior work.^{[AC20]} Goodfellow eventually admitted that PM is adversarial (his paper^{[GAN1]} still claims the opposite), but emphasized that it's not generative. However, the even earlier AC^{[AC90,90b][AC10][AC20]} is both adversarial and generative (its generator contains probabilistic units^{[AC90]} like in StyleGANs^{[GAN2]}). When the authors^{[GAN1]} did not publish a correction, I published one myself in the hopes of correcting the annals of history.^{[AC20]} They have yet to acknowledge that GANs are instances of my earlier work.^{[R2][AC20]} Soon after Sepp Hochreiter identified and analyzed the vanishing gradient problem,^{[MIR](Sec. 3)[VAN1]} Bengio published his own analysis,^{[VAN2]} without citing Sepp.
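The zero-sum structure described above can be shown in a few lines. This is a deliberately scalar toy, not the 1990 architecture or the 2014 one: real AC/GAN setups use neural networks and data distributions, but the "one net's loss is the other net's gain" dynamic is the same:

```python
# Toy adversarial game: a "generator" with parameter theta emits an
# output; a "predictor" with parameter p tries to predict it. The
# squared prediction error is the shared stake: the predictor does
# gradient descent on it, the generator gradient ascent. Payoffs are
# -error for the predictor and +error for the generator: zero-sum.
def error(p, theta):
    return (p - theta) ** 2

p, theta, lr = 0.0, 1.0, 0.1

e0 = error(p, theta)
p -= lr * 2 * (p - theta)       # predictor step: chases theta, error shrinks
e1 = error(p, theta)
theta += lr * 2 * (theta - p)   # generator step: runs away, error grows again
e2 = error(p, theta)
```

The alternating steps never settle into a static optimum here; the predictor is forced to keep improving, which is exactly why the 1990 formulation used prediction error as an intrinsic curiosity reward for exploration.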
The dispute was settled in favor of Sepp.^{[VAN1]} However, even after a common publication,^{[VAN3]} Bengio published papers^{[VAN4][XAV]} without citing him. Citation counts are poor indicators of truly pioneering work.^{[NAT1]} (Margin note: whatever Bengio states^{[YB20]} about 2018, one must at least clarify such priority issues later.^{[DLC]}) Bengio also claims^{[YB20]} priority based on a 1995 paper, although my publications on exactly this topic date back to 1991-93.^{[UN0-2][UN]} The same holds for meta-learning, which I started in 1987,^{[META1][META]} long before Bengio, who claims that he did it before me.^{[R3]} Regarding attention-based Transformers,^{[TR1-6]} Bengio^{[DL3a]} cites his own team (2014) for "soft attention" without citing my much earlier original work of 1991-1993 on soft attention and linear Transformers.^{[FWP,FWP0-2,6]} Bengio has also heavily used our LSTM (see Sec. A-C), introducing the name "gated recurrent units (GRU)"^{[LSTMGRU]} for a variant of our vanilla LSTM architecture^{[LSTM2]} (2000) which he did not cite, although our work^{[LSTM2]} was the one that introduced gated recurrent units. In addition, our team automatically evolved lots of additional LSTM variants and topologies already in 2009^{[LSTM7]} without changing the name of the basic method. (GRUs can neither learn to count^{[LSTMGRU2]} nor learn simple non-regular languages;^{[LSTMGRU2]} they also underperform LSTM according to Google Brain.^{[LSTMGRU3]}) Hinton, in turn, championed unsupervised pre-training for deep NNs.^{[UN0-4][HIN](Sec. II)[MIR](Sec. 1)} Hinton's paper^{[UN4]} (2006) appeared long after my earlier work on this,^{[UN0-2]} which yielded the first NNs shown to solve very deep problems (see Sec. II above).^{[UN]} It was published in 1991-92,^{[UN1]} when compute was about 1000 times more expensive than in 2006. He did not cite it, not even in his later survey (2015).^{[DL3][DLC]} See also Sec. II & III. The same holds for compressing or distilling one NN into another.^{[UN0-2][DIST1-2][MIR](Sec.
2)} Hinton^{[DIST2]} (2006) did not cite my much earlier original work on this (1991),^{[UN1][UN]} not even in his later patent application. The same holds for fast weight programmers^{[FWP][FWP0-4a]} with tensor-like outer products (1991-2016) and their motivation^{[FWP2][FWP4a][MIR](Sec. 8)} (see also Sec. XVI above), and for learning sequential attention with NNs.^{[MIR](Sec. 9)} Hinton^{[ATT3]} (2010) did not cite our much earlier work on this,^{[ATT1][ATT]} although he was both reviewer and editor of my summary^{[ATT2]} (1990; see Sec. XVI above).
The ten priority disputes mentioned in the present Sec. XVII are not the only ones.^{[R4]} Remarkably, three of them are related to the 1991 paper^{[UN1][UN]} which in many ways started what people now call deep learning. Most of them go back to work of 1990-91.^{[MIR]} See Sec. I for additional related issues of credit assignment.
LeCun's team has made important contributions to CNNs since 1989.^{[CNN2,4]} However, the basic CNN architecture with convolutional and downsampling layers is actually due to Fukushima (1979).^{[CNN1]} NNs with convolutions were later (1987) combined by Waibel with weight sharing and backpropagation.^{[CNN1a]} Waibel called this architecture TDNN and applied it to speech. All of this happened before LeCun's work on CNNs. See Sec. D above and Sec. 21 of the overview of our Annus Mirabilis 1990-1991.^{[MIR]} At IJCNN 2011 in Silicon Valley, our DanNet^{[DAN][GPUCNN1-3]} won the vision contest with superhuman performance (the runner-up showed three times worse performance).^{[DAN1]} Again see Sec. D. Baldi and Chauvin (1993) had the first application of CNNs with backpropagation to biomedical/biometric images.^{[BA93]} At ICPR 2012, our DanNet^{[GPUCNN1-3]} won the medical imaging contest (Sept 2012, on detection of mitosis/cancer)^{[GPUCNN5,7,8]} (before the similar AlexNet won ImageNet 2012^{[GPUCNN5][R6]} and the similar VGG network^{[GPUCNN9]} won ImageNet 2014). Such CNNs are now widely used for mitosis detection;^{[MGC][GPUCNN5,7,8]} many major companies are using them. See Sec. D & VII. ACM also explicitly mentions speech recognition and speech synthesis.^{[AM16][DL1]} All of these fields were heavily shaped in the 2010s by our non-CNN methods.^{[DL1][DL4][AM16][GSR][GSR15][GT16][WU][FB17]} See Sec. A, B, VI, XI.
As mentioned in Sec. XII, backpropagation was actually proposed earlier as a learning method for NNs by Werbos (1982)^{[BP2-4]} (see also Amari's work on SGD for MLPs of 1967-68^{[GD1-2a]}), yet LeCun et al. did not cite this in their recent work.^{[DL3,DL3a][DLC]} In 1960, Kelley already had a precursor of the algorithm.^{[BPA]} Furthermore, many besides LeCun have worked "to speed up backpropagation algorithms"^{[DL1]} (ACM's wording). More on the history of backpropagation can be found at Scholarpedia^{[DL2]} and in my overview.^{[BP4]}
However, "hierarchical feature representation" in deep learning networks is what Ivakhnenko & Lapa (1965)^{[DEEP1-2]} and Amari^{[GD1-2]} (and also Fukushima^{[CNN1][DL2]}) had long before LeCun. See Sec. D & II & XIII & V.
LeCun et al. neither cited the origins^{[BP1]} (1970) of this widely used type of automatic differentiation for differentiable networks of modules^{[DL2][BP4-5][DLC]} nor the earlier formal work on such systems.^{[S80]} See also Sec. XIX & XII. Others had such systems before LeCun, who did not cite them. See also Pollack's even earlier relevant work;^{[PO87-90]} compare the important work of Baldi and colleagues.^{[BA96-03]}
(Furthermore, "complex networks of modules where backpropagation is performed" were the central theme of my much earlier habilitation thesis (1993).^{[UN2]} For example, our adaptive subgoal generators (1991)^{[HRL0-2]} were trained through end-to-end-differentiable chains of such modules,^{[MIR](Sec. 10)} as were our systems for planning and reinforcement learning with recurrent neural world models (1990).^{[PLAN][MIR](Sec. 11)} Same for my linear transformer-like fast weight programmers^{[FWP0-2][FWP][ATT][MIR](Sec. 8)} since 1991; see Sec. XVI.) Regarding attempts to discredit the messenger rather than the message, see "100 Authors against Einstein."^{[AH1]} Some resort to ad hominem attacks:^{[AH2-3][HIN]} "If you cannot dispute a fact-based message, attack the messenger himself."^{[HIN]} Science has a well-established way of dealing with plagiarism (which may be unintentional^{[PLAG1][CONN21]} or not^{[FAKE2]}), and no award can ever change that.^{[HIN]} LBH and their co-workers have contributed useful improvements of deep learning methods,^{[CNN2,4][CDI][LAN][RMSP][XAV][ATT14][CAPS]} but built on the work of others whom they did not cite, in contrast to ACM's Code of Ethics and Professional Conduct^{[ACM18]} (see (1) Sec. II, V, XII, XIX, XXI, XIII, XIV, XI, XX, and (2) Sec. I, A, B, C, D, XVII, VI, XVI). As emphasized earlier,^{[DLC][HIN]} science remains committed "to self-correction,"^{[SV20]} as is already the standard in other scientific fields. Why announce breakthroughs in popular science venues without peer review? For example, the narrator of a popular 2018 Bloomberg video^{[VID2]} overlooked that the crucial speech recognition methods were developed in Germany and Switzerland (LSTM & CTC; see Sec. A) long before Hinton's methods. Similarly, in 2016, the NY Times published a misleading article,^{[NYT3]} although Google's original 2016 paper on Google Translate^{[WU]} mentions LSTM over 50 times (see Sec. B). In ad hominem style,^{[AH2-3]} Hinton accused me of "claiming credit he doesn't deserve for many, many things",^{[NYT1]} without backing this up by facts. LeCun also praised the GANs of Bengio's team,^{[GAN1]} although GANs are variations of my work of 1990.^{[AC90,90b][AC20][R2]} According to Bloomberg,^{[AV2]} Bengio has simply "denied my claims" without backing up his denial by any facts; see Sec. XVII.
and forcefully contradict public figures who promote it."^{[FAKE]} LBH called themselves the deep learning conspiracy.^{[DLC][DLC1-2]} Our LSTM paper^{[LSTM1]} has received more citations than any paper by Bengio or LeCun,^{[R5]} and Hinton's most cited paper (2012) is the one on GPU-based CNNs.^{[GPUCNN4][R5]} It follows our earlier work on supervised deep NNs (2010),^{[MLP1]} which showed that unsupervised pre-training for deep NNs (introduced by myself^{[UN][UN0-3]} and later championed by Hinton^{[UN4][VID1]}) is not necessary; see Sec. D. Hinton's 2012 paper^{[GPUCNN4]} came after our deep and fast DanNet (2011),^{[GPUCNN1-3]} which had already won contests before AlexNet won one;^{[R6]} see Sec. D, XIV. The highly cited VGG network (2014)^{[GPUCNN9]} followed a similar approach. Hinton's 2nd most cited paper^{[RUM][R5]} is the one on backpropagation (some count citations of Hinton's paper^{[RUM]} together with citations for a book by Rumelhart & McClelland^{[R5]}). Backpropagation, however, is a previously invented method,^{[BP1]} and deep learning MLPs go back to Ivakhnenko, whom he has never cited;^{[DEEP1-2][R7-R8]} see Sec. II, XIII. Bengio's 2nd most cited research paper is the one on GANs (2014),^{[GAN1]} which are instances of my artificial curiosity (1990)^{[AC90,90b][AC20][R2]} which he did not cite; see Sec. XVII. Hinton's highly cited papers on unsupervised pre-training for deep NNs (2006-)^{[UN4]} were preceded by ours,^{[UN0-2][UN]} and his dropout papers were preceded by Hanson's.^{[Drop1-3]} As recently as 2021, ACM published yet another misleading deep learning "survey" by LBH,^{[DL3a]} again heavily citing LBH without citing the essential prior work. Consult the Executive Summary and Sec. I-XXI of this critique for more. So virtually all the algorithms that have attracted great attention in the deep learning era have their conceptual and technical roots in my labs in Munich and Lugano,^{[MOST]} or in the earlier work on deep learning MLPs since 1965^{[DEEP1-2][GD1-2a]} (see Sec. II, XX), backpropagation (1960-70)^{[BPA][BP1]} (see Sec. XIX, XII), and convolutional NNs since 1979^{[CNN1-4]} (see Sec. XVIII, D). Our LSTM (1990s, see Sec. A, B; also for RL, 2003-, see Sec. C) → our Highway Net (May 2015) → ResNet (Dec 2015, see Sec. D).
Our adversarial Artificial Curiosity (1990) → GANs (2010s, see Sec. XVII). Our unsupervised pre-training of deep NNs (1991, see Sec. II & III): for recurrent NNs in the 1990s → our LSTM (see Sec. A-C); for feedforward NNs in 2010 → our DanNet (2011) → AlexNet (2012) and VGG Net (2014) (see Sec. D). Our LSTM brought essentially unlimited depth to gradient-based supervised recurrent NNs in the 1990s; our Highway Nets^{[HW1-3]} brought it to feedforward NNs in May 2015.^{[MOST]} DanNet led to superior computer vision (2011, see Sec. D, XVIII), medical diagnosis (2012, see Sec. VII, XVIII), and many other applications.^{[DEC]} LSTM led to superior speech recognition (with our CTC, 2007-15, see Sec. A), machine translation (2016, see Sec. B), robotics & video game players (2018-19, see Sec. C), and many other applications.^{[DEC]} Our Fast Weight Programmers (1991, see Sec. XVI) are formally equivalent to linear Transformers (now popular in NLP). See Sec. I, A, B, C, D, VII, XVIII.
As mentioned earlier,^{[MIR](Sec. 21)} it is not always clear^{[DLC]} who deserves credit for deep learning. Ivakhnenko & Lapa (1965) had the first deep networks with many layers of depth that really learned.^{[DEEP1-2][R8]} Soon afterwards, multilayer perceptrons learned internal representations through stochastic gradient descent in Japan.^{[GD1-2a]} A few years later, modern backpropagation was published (1970).^{[BP1]} Plagiarism may be unintentional^{[PLAG1][CONN21]} or intentional.^{[FAKE2]}
Yes, this critique is also an implicit critique of certain other awards to LBH.^{[HIN]} Many of the issues above were also discussed in threads at reddit.com/r/MachineLearning^{[R1-R12]} (the largest machine learning forum with back then over 800k subscribers), many of them influenced by my overview.^{[MIR]}
Dr. LeCun himself is well aware of the challenges to scientific integrity in our field:^{[LECP]} "... else cites."^{[LECP]} Rosenblatt's perceptrons already had hidden layers with fixed random weights and an adaptive output layer.^{[R62]} So Rosenblatt basically had what much later was rebranded as Extreme Learning Machines (ELMs);^{[ELM1]} compare the revisionist narrative of the ELM promoters^{[ELM2][CONN21]} with that of the self-proclaimed "deep learning conspiracy".^{[DLC1-2]}
Note that I am insisting on proper credit assignment not only in my own research field but also in quite disconnected areas,^{[HIN]} as demonstrated by my numerous letters in this regard published in Science and Nature, e.g., on the history of aviation,^{[NASC1-2]} the telephone,^{[NASC3]} the computer,^{[NASC4-7]} resilient robots,^{[NASC8]} and scientists of the 19th century.^{[NASC9]} AI scientists and AI historians equipped with artificial curiosity^{[SA17][AC90-AC20][PP-PP2][R1]}
Thanks to all who provided feedback. Many additional references can be found at my publication page and my arXiv page. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. The first paper on planning with reinforcement learning recurrent neural networks (NNs) and on generative adversarial networks. (More on artificial scientists and artists.)
With a brief summary of the generative adversarial neural networks of 1990.^{[AC90,90b][AC20]} Preprint arXiv/1906.04493.
ACM Code of Ethics and Professional Conduct. Association for Computing Machinery (ACM), 2018.
[AIB] J. Schmidhuber. AI Blog. Includes variants of chapters of the AI Book.
Blog of Werner Vogels, CTO of Amazon (Nov 2016).
First publication of what was later sometimes called the Hopfield network^{[AMH2]} or Amari-Hopfield Network:^{[AMH3]} published in 1972 by Amari.^{[AMH1]}
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. We had both hard attention (1990) and soft attention (1991-93).^{[FWP]} Today, both types are very popular.
arXiv/1409.0473, 2014-16.
Bloomberg, May 15, 2018. Bloomberg, May 17, 2018.
Precursor of modern backpropagation.^{[BP1-4]}
First application of backpropagation^{[BP1]} to NNs (concretizing thoughts in his 1974 thesis).
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020). Who invented backpropagation? More.^{[DL2]}
English version: [CNN1+]. More in Scholarpedia.
[CNN1a] A. Waibel. Phoneme Recognition Using Time-Delay Neural Networks. Meeting of IEICE, Tokyo, Japan, 1987. First application of backpropagation^{[BP1][BP2]} and weight-sharing.
Spatial Averaging.^{[CNN1]}
Since November 2021: Comments on version 1 of the present report^{[T21v1]} in the Connectionists Mailing List, perhaps the oldest mailing list on artificial neural networks.
Beijing, 2014. Preprint arXiv:1402.3511 [cs.NE].
J. Schmidhuber (AI Blog, 2021). 10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named 1st superhuman result in 2011.^{[DAN1]}
J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition, by our artificial neural network called DanNet.
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, revised 2021). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s.
[DIST1] J. Schmidhuber, 1991.^{[UN-UN2]}
[DL3a] Y. Bengio, Y. LeCun, G. Hinton (2021). Turing Lecture: Deep Learning for AI. Communications of the ACM, July 2021.
[DL4] J. Schmidhuber (AI Blog, 2017). Our impact on the world's most valuable public companies: Apple, Google, Microsoft, Facebook, Amazon... By greatly improved (CTC-based) on-device LSTM speech recognition (on the phone, not the server).
J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.^{[DL6]} Soon after its publication, everybody started talking about "deep learning." Causality or correlation?
Web site deeplearning.net of Y. Bengio's MILA (2015, retrieved May 2020; compare the version in the Internet Archive), referring to Hinton's^{[UN4]} and Bengio's^{[UN5]} unsupervised pre-training for deep NNs (2006), although this type of deep learning dates back to 1991.^{[UN1-2][UN]} See Sec. II & XVII & III.
[DLC] J. Schmidhuber (AI Blog, June 2015).
Critique of Paper by self-proclaimed^{[DLC1-2]} "Deep Learning Conspiracy" (Nature 521 p 436).
arxiv:1312.5602.
Alphastar has a "deep LSTM core." arXiv:1808.03578, 2018.
In fact, the ELM concept goes back to Rosenblatt's work around 1960.^{[R62]}
Facebook used LSTM for over 4 billion automatic translations per day (The Verge, August 4, 2017); Facebook blog by J.M. Pino, A. Sidorov, N.F. Ayan (August 3, 2017).
J. Schmidhuber (AI Blog, 26 March 2021). Fast weight programmers: an alternative^{[FWP0-1]} to recurrent NNs, based on the fast weights^{[FAST,FASTa]} of another net. Such Fast Weight Programmers^{[FWP0-6,FWPMETA1-7]} can learn to memorize past data, e.g., by computing fast weight changes through additive outer products of self-invented activation patterns^{[FWP0-1]} (now often called keys and values for self-attention^{[TR1-6]}). The similar Transformers^{[TR1-2]} combine this with projections and softmax, while linear Transformers or Performers^{[TR5-6]} do without the softmax. In 1993, I introduced the attention terminology^{[FWP2]} now used in this context,^{[ATT]} and RNNs that program themselves.