Aberdeen, D. (2003). Policy-Gradient Algorithms for Partially Observable Markov Decision Processes. PhD thesis, Australian National University. Abounadi, J., Bertsekas, D.,Borkar, V. S. (2002). Learning algorithms for Markov decision processes with average cost. SIAM Journal on Control,Optimization, 40(3):681–698. Akaike, H. (1970). Statistical predictor identification. Ann. Inst. Statist. Math., 22:203–217. Allender, A. (1992). Application of time-bounded Kolmogorov complexity in complexity theory. InWatanabe,O., editor, Kolmogorov complexity,computational complexity, pages 6–22. EATCSMonographson Theoretical Computer Science, Springer. Almeida, L. B. (1987). A learning rule for asynchronous perceptrons with feedback in a combinatorialenvironment. In IEEE 1st International Conference on Neural Networks, San Diego, volume 2, pages609–618. Almeida, L. B., Almeida, L. B., Langlois, T., Amaral, J. D., Redol, R. A. (1997). On-line step sizeadaptation. Technical report, INESC, 9 Rua Alves Redol, 1000. Amari, S. (1967). A theory of adaptive pattern classifiers. IEEE Trans. EC, 16(3):299–307. Amari, S., Cichocki, A., Yang, H. (1996). A new learning algorithm for blind signal separation. In Touretzky, D. S., Mozer, M. C., Hasselmo, M. E., editors, Advances in Neural InformationProcessing Systems, volume 8. The MIT Press. Amari, S.,Murata, N. (1993). Statistical theory of learning curves under entropic loss criterion. NeuralComputation, 5(1):140–153. Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276. An, G. (1996). The effects of adding noise during backpropagation training on a generalization performance. Neural Computation, 8(3):643–674. Andrade, M., Chacon, P., Merelo, J., Moran, F. (1993). Evaluation of secondary structure of proteinsfrom uv circular dichroism spectra using an unsupervised learning neural network. Protein Engineering,6(4):383–390. Ash, T. (1989). Dynamic node creation in backpropagation neural networks. Connection Science,1(4):365–375. Atick, J. J., Li, Z., Redlich, A. N. (1992). Understanding retinal color coding from first principles. Neural Computation, 4:559–572. Baird, H. (1990). Document Image Defect Models. In Proceddings, IAPR Workshop on Syntactic andStructural Pattern Recognition, Murray Hill, NJ. Baird, L.,Moore, A. W. (1999). Gradient descent for general reinforcement learning. In Advances inneural information processing systems 12 (NIPS), pages 968–974. MIT Press. Baird, L. C. (1994). Reinforcement learning in continuous time: Advantage updating. In IEEE WorldCongress on Computational Intelligence, volume 4, pages 2448–2453. IEEE. Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In InternationalConference on Machine Learning, pages 30–37. Bakker, B. (2002). Reinforcement learning with Long Short-TermMemory. In Dietterich, T. G., Becker, S.,and Ghahramani, Z., editors, Advances in Neural Information Processing Systems 14, pages 1475–1482. MIT Press, Cambridge, MA. Bakker, B.,Schmidhuber, J. (2004). Hierarchical reinforcement learning based on subgoal discoveryand subpolicy specialization. In et al., F. G., editor, Proc. 8th Conference on Intelligent AutonomousSystems IAS-8, pages 438–445, Amsterdam, NL. IOS Press. Bakker, B., Zhumatiy, V., Gruener, G., Schmidhuber, J. (2003). A robot that reinforcement-learns toidentify,memorize important previous observations. 
In Proceedings of the 2003 IEEE/RSJ InternationalConference on Intelligent Robots,Systems, IROS 2003, pages 430–435. Baldi, P., Brunak, S., Frasconi, P., Pollastri, G., Soda, G. (2001). Bidirectional dynamics for proteinsecondary structure prediction. Lecture Notes in Computer Science, 1828:80–104. Baldi, P.,Hornik, K. (1989). Neural networks,principal component analysis: Learning from exampleswithout local minima. Neural Networks, 2:53–58. Baldi, P.,Pollastri, G. (2003). The principled design of large-scale recursive neural networkarchitectures–DAG-RNNs,the protein structure prediction problem. J. Mach. Learn. Res., 4:575–602. Baldi, P.,Sadowski, P. (2013). Understanding dropout. In Burges, C. J. C., Bottou, L., Welling, M.,Ghahramani, Z., Weinberger, K. Q., editors, Advances in Neural Information Processing Systems(NIPS), volume 26, pages 2814–2822. Ballard, D. H. (1987). Modular learning in neural networks. In Proc. AAAI, pages 279–284. Baluja, S. (1994). Population-based incremental learning: A method for integrating genetic search basedfunction optimization,competitive learning. Technical Report CMU-CS-94-163, Carnegie MellonUniversity. Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1(3):295–311. Barlow, H. B., Kaushal, T. P., Mitchison, G. J. (1989). Finding minimum entropy codes. NeuralComputation, 1(3):412–423. Barrow, H. G. (1987). Learning receptive fields. In Proceedings of the IEEE 1st Annual Conference onNeural Networks, volume IV, pages 115–121. IEEE. Bartlett, P. L.,Baxter, J. (2011). Infinite-horizon policy-gradient estimation. arXiv preprintarXiv:1106.0665. Barto, A. G.,Mahadevan, S. (2003). Recent advances in hierarchical reinforcement learning. DiscreteEvent Dynamic Systems, 13(4):341–379. Barto, A. G., Singh, S., Chentanez, N. (2004). Intrinsically motivated learning of hierarchical collectionsof skills. In Proceedings of International Conference on Developmental Learning (ICDL), pages112–119. MIT Press, Cambridge, MA. Barto, A. G., Sutton, R. S., Anderson, C. W. (1983). Neuronlike adaptive elements that can solvedifficult learning control problems. IEEE Transactions on Systems,Man, Cybernetics, SMC-13:834–846. Battiti, R. (1989). Accelerated backpropagation learning: two optimization methods. Complex Systems,3(4):331–342. Battiti, T. (1992). First-,second-order methods for learning: Between steepest descent,Newton’smethod. Neural Computation, 4(2):141–166. Baxter, J.,Bartlett, P. (1999). Direct Gradient-Based Reinforcement Learning. Technical report, ResearchSchool of Information Sciences,Engineering, Australian National University. Bayer, J., Osendorfer, C., Chen, N., Urban, S., van der Smagt, P. (2013). On fast dropout,itsapplicability to recurrent networks. arXiv preprint arXiv:1311.0701. Bayer, J., Wierstra, D., Togelius, J., Schmidhuber, J. (2009). Evolving memory cell structures forsequence learning. In Proc. ICANN (2), pages 755–764. Becker, S. (1990). Unsupervised learning procedures for neural networks. Technical report, Departmentof Computer Science, University of Toronto, Ontario. Becker, S. (1991). Unsupervised learning procedures for neural networks. International Journal of NeuralSystems, 2(1 & 2):17–33. Becker, S.,Le Cun, Y. (1989). Improving the convergence of back-propagation learning with secondorder methods. In Touretzky, D., Hinton, G., Sejnowski, T., editors, Proc. 1988 ConnectionistModels Summer School, pages 29–37, Pittsburg 1988. Morgan Kaufmann, San Mateo. Behnke, S. (2003). 
Hierarchical Neural Networks for Image Interpretation, volume LNCS 2766 of LectureNotes in Computer Science. Springer. Bell, A. J.,Sejnowski, T. J. (1995). An information-maximization approach to blind separation andblind deconvolution. Neural Computation, 7(6):1129–1159. Bellman, R. (1957). Dynamic Programming. Princeton University Press, Princeton, NJ, USA, 1st edition. Belouchrani, A., Abed-Meraim, K., Cardoso, J.-F., Moulines, E. (1997). A blind source separationtechnique using second-order statistics. Signal Processing, IEEE Transactions on, 45(2):434–444. Bengio, Y. (1991). Artificial Neural Networks,their Application to Sequence Recognition. PhD thesis,McGill University, (Computer Science), Montreal, Qc., Canada. Bengio, Y. (2009). Learning Deep Architectures for AI. Foundations,Trends in Machine Learning,V2(1). Now Publishers. Bengio, Y., Courville, A., Vincent, P. (2013). Representation learning: A review,new perspectives. Pattern Analysis,Machine Intelligence, IEEE Transactions on, 35(8):1798–1828. Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H. (2007). Greedy layer-wise training of deepnetworks. In Cowan, J. D., Tesauro, G., Alspector, J., editors, Advances in Neural InformationProcessing Systems 19 (NIPS), pages 153–160. MIT Press. Bengio, Y., Simard, P., Frasconi, P. (1994). Learning long-term dependencies with gradient descent isdifficult. IEEE Transactions on Neural Networks, 5(2):157–166. Beringer, N., Graves, A., Schiel, F., Schmidhuber, J. (2005). Classifying unprompted speech byretraining LSTM nets. In Duch, W., Kacprzyk, J., Oja, E., Zadrozny, S., editors, Artificial NeuralNetworks: Biological Inspirations - ICANN 2005, LNCS 3696, pages 575–581. Springer-Verlag BerlinHeidelberg. Bertsekas, D. P. (2001). Dynamic Programming,Optimal Control. Athena Scientific. Bertsekas, D. P.,Tsitsiklis, J. N. (1996). Neuro-dynamic Programming. Athena Scientific, Belmont,MA. Biegler-K¨onig, F.,B¨armann, F. (1993). A learning algorithm for multilayered neural networks basedon linear least squares problems. Neural Networks, 6(1):127–131. Bishop, C.M. (1993). Curvature-driven smoothing: A learning algorithm for feed-forward networks. IEEETransactions on Neural Networks, 4(5):882–884. Bishop, C. M. (2006). Pattern Recognition,Machine Learning. Springer. Blair, A. D.,Pollack, J. B. (1997). Analysis of dynamical recognizers. Neural Computation, 9(5):1127–1142. Bluche, T., Louradour, J., Knibbe, M., Moysset, B., Benzeghiba, F., Kermorvant., C. (2014). TheA2iA Arabic Handwritten Text Recognition System at the OpenHaRT2013 Evaluation. In InternationalWorkshop on Document Analysis Systems. Bobrowski, L. (1978). Learning processes in multilayer threshold nets. Biological Cybernetics, 31:1–6. Bod´en, M., Wiles, J. (2000). Context-free,context-sensitive dynamics in recurrent neural networks. Connection Science, 12(3-4):197–210. Bodenhausen, U.,Waibel, A. (1991). The tempo 2 algorithm: Adjusting time-delays by supervisedlearning. In Lippman, D. S., Moody, J. E., Touretzky, D. S., editors, Advances in Neural InformationProcessing Systems 3, pages 155–161. Morgan Kaufmann. Bottou, L. (1991). Une approche th´eorique de l’apprentissage connexioniste; applications `a la reconnaissancede la parole. PhD thesis, Universit´e de Paris XI. Bourlard, H.,Morgan, N. (1994). Connnectionist Speech Recognition: A Hybrid Approach. KluwerAcademic Publishers. Boutilier, C.,Poole, D. (1996). Computing optimal policies for partially observable Markov decisionprocesses using compact representations. 
In Proceedings of the AAAI, Portland, OR. Bradtke, S. J., Barto, A. G., Kaelbling, P. (1996). Linear least-squares algorithms for temporal differencelearning. In Machine Learning, pages 22–33. Brafman, R. I.,Tennenholtz, M. (2002). R-MAX—a general polynomial time algorithm for nearoptimalreinforcement learning. Journal of Machine Learning Research, 3:213–231. Breiman, L. (1996). Bagging predictors. Machine Learning, 24:123–140. Breuel, T. M., Ul-Hasan, A., Al-Azawi, M. A., Shafait, F. (2013). High-performance OCR for printedEnglish,Fraktur using LSTM networks. In Document Analysis,Recognition (ICDAR), 2013 12thInternational Conference on, pages 683–687. IEEE. Broyden, C. G. et al. (1965). A class of methods for solving nonlinear simultaneous equations. Math. Comp, 19(92):577–593. Bryson, A.,Ho, Y. (1969). Applied optimal control: optimization, estimation, control. BlaisdellPub. Co. Bryson, A. E. (1961). A gradient method for optimizing multi-stage allocation processes. In Proc. HarvardUniv. Symposium on digital computers,their applications. Bryson, Jr., A. E.,Denham, W. F. (1961). A steepest-ascent method for solving optimum programmingproblems. Technical Report BR-1303, Raytheon Company, Missle,Space Division. Buntine, W. L.,Weigend, A. S. (1991). Bayesian back-propagation. Complex Systems, 5:603–643. Cardoso, J.-F. (1994). On the performance of orthogonal source separation algorithms. In Proc. EUSIPCO,pages 776–779. Carter, M. J., Rudolph, F. J., Nucci, A. J. (1990). Operational fault tolerance of CMAC networks. In Touretzky, D. S., editor, Advances in Neural Information Processing Systems 2, pages 340–347. SanMateo, CA: Morgan Kaufmann. Casey, M. P. (1996). The dynamics of discrete-time computation, with application to recurrent neuralnetworks,finite state machine extraction. Neural Computation, 8(6):1135–1178. Cauwenberghs, G. (1993). A fast stochastic error-descent algorithm for supervised learning,optimization. In Lippman, D. S., Moody, J. E., Touretzky, D. S., editors, Advances in Neural InformationProcessing Systems 5, pages 244–244. Morgan Kaufmann. Chaitin, G. J. (1966). On the length of programs for computing finite binary sequences. Journal of theACM, 13:547–569. Chalup, S. K.,Blair, A. D. (2003). Incremental training of first order recurrent neural networks topredict a context-sensitive language. Neural Networks, 16(7):955–972. Chellapilla, K., Puri, S., Simard, P. (2006). High performance convolutional neural networks fordocument processing. In International Workshop on Frontiers in Handwriting Recognition. Church, A. (1936). An unsolvable problem of elementary number theory. American Journal of Mathematics,58:345–363. Ciresan, D. C., Giusti, A., Gambardella, L. M., Schmidhuber, J. (2012a). Deep neural networkssegment neuronal membranes in electron microscopy images. In Advances in Neural Information ProcessingSystems NIPS, pages 2852–2860. Ciresan, D. C., Giusti, A., Gambardella, L. M., Schmidhuber, J. (2013). Mitosis detection in breastcancer histology images with deep neural networks. In MICCAI, volume 2, pages 411–418. Ciresan, D. C., Meier, U., Gambardella, L. M., Schmidhuber, J. (2010). Deep big simple neural netsfor handwritten digit recogntion. Neural Computation, 22(12):3207–3220. Ciresan, D. C., Meier, U., Masci, J., Gambardella, L. M., Schmidhuber, J. (2011a). Flexible, high performanceconvolutional neural networks for image classification. In Intl. Joint Conference on ArtificialIntelligence IJCAI, pages 1237–1242. Ciresan, D. 
C., Meier, U., Masci, J., Schmidhuber, J. (2011b). A committee of neural networks fortraffic sign classification. In International Joint Conference on Neural Networks, pages 1918–1921. Ciresan, D. C., Meier, U., Masci, J., Schmidhuber, J. (2012b). Multi-column deep neural network fortraffic sign classification. Neural Networks, 32:333–338. Ciresan, D. C., Meier, U., Schmidhuber, J. (2012c). Multi-column deep neural networks for imageclassification. In IEEE Conference on Computer Vision,Pattern Recognition CVPR 2012. Longpreprint arXiv:1202.2745v1 [cs.CV]. Ciresan, D. C., Meier, U., Schmidhuber, J. (2012d). Transfer learning for Latin,Chinese characterswith deep neural networks. In International Joint Conference on Neural Networks, pages 1301–1306. Ciresan, D. C.,Schmidhuber, J. (2013). Multi-column deep neural networks for offline handwrittenChinese character classification. Technical report, IDSIA. arXiv:1309.0261. Clune, J., Stanley, K. O., Pennock, R. T., Ofria, C. (2011). On the performance of indirect encodingacross the continuum of regularity. Trans. Evol. Comp, 15(3):346–367. Coates, A., Huval, B., Wang, T.,Wu, D. J., Ng, A. Y., Catanzaro, B. (2013). Deep learning with COTSHPC systems. In Proc. International Conference on Machine learning (ICML’13). Cochocki, A.,Unbehauen, R. (1993). Neural networks for optimization,signal processing. JohnWiley & Sons, Inc. Comon, P. (1994). Independent component analysis – a new concept? Signal Processing, 36(3):287–314. Connor, J., Martin, D. R., Atlas, L. E. (1994). Recurrent neural networks,robust time seriesprediction. IEEE Transactions on Neural Networks, 5(2):240–254. Cook, S. A. (1971). The complexity of theorem-proving procedures. In Proceedings of the 3rd AnnualACM Symposium on the Theory of Computing (STOC’71)., page 151158. ACM, New York. Cramer, N. L. (1985). A representation for the adaptive generation of simple sequential programs. InGrefenstette, J., editor, Proceedings of an International Conference on Genetic Algorithms,TheirApplications, Carnegie-Mellon University, July 24-26, 1985, Hillsdale NJ. Lawrence Erlbaum Associates. Craven, P.,Wahba, G. (1979). Smoothing noisy data with spline functions: Estimating the correctdegree of smoothing by the method of generalized cross-validation. Numer. Math., 31:377–403. Cuccu, G., Luciw, M., Schmidhuber, J., Gomez, F. (2011). Intrinsically motivated evolutionary searchfor vision-based reinforcement learning. In Proceedings of the 2011 IEEE Conference on Developmentand Learning,Epigenetic Robotics IEEE-ICDL-EPIROB, volume 2, pages 1–7. IEEE. Dahl, G., Yu, D., Deng, L., Acero, A. (2012). Context-dependent pre-trained deep neural networks forlarge-vocabulary speech recognition. Audio, Speech, Language Processing, IEEE Transactions on,20(1):30–42. Dahl, G. E., Sainath, T. N., Hinton, G. E. (2013). Improving Deep Neural Networks for LVCSR usingRectified Linear Units,Dropout. In Acoustics, Speech,Signal Processing (ICASSP), 2013 IEEEInternational Conference on, pages 8609–8613. IEEE. D’Ambrosio, D. B.,Stanley, K. O. (2007). A novel generative encoding for exploiting neural networksensor,output geometry. In Proceedings of the Conference on Genetic,Evolutionary Computation(GECCO), pages 974–981. Dayan, P.,Hinton, G. (1993). Feudal reinforcement learning. In Lippman, D. S., Moody, J. E., andTouretzky, D. S., editors, Advances in Neural Information Processing Systems 5, pages 271–278.MorganKaufmann. Dayan, P.,Hinton, G. E. (1996). Varieties of Helmholtz machine. Neural Networks, 9(8):1385–1403. 
Dayan, P., Hinton, G. E., Neal, R. M., Zemel, R. S. (1995). The Helmholtz machine. Neural Computation,7:889–904. Dayan, P.,Zemel, R. (1995). Competition,multiple cause models. Neural Computation, 7:565–579. de Vries, B.,Principe, J. C. (1991). A theory for neural networks with time delays. In Lippmann, R. P.,Moody, J. E., Touretzky, D. S., editors, Advances in Neural Information Processing Systems 3, pages162–168. Morgan Kaufmann. Deco, G.,Parra, L. (1997). Non-linear feature extraction by redundancy reduction in an unsupervisedstochastic neural network. Neural Networks, 10(4):683–691. DeMers, D.,Cottrell, G. (1993). Non-linear dimensionality reduction. In Hanson, S. J., Cowan, J. D.,and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 580–587.MorganKaufmann. Dempster, A. P., Laird, N. M., Rubin, D. B. (1977). Maximum likelihood from incomplete data via theEM algorithm. Journal of the Royal Statistical Society, B, 39. Deng, L.,Yu, D. (2014). Deep Learning: Methods,Applications. NOW Publishers. Dickmanns, E. D., Behringer, R., Dickmanns, D., Hildebrandt, T., Maurer, M., Thomanek, F., Schiehlen, J. (1994). The seeing passenger car ’VaMoRs-P’. In Proc. Int. Symp. on Intelligent Vehicles’94, Paris, pages 68–73. Dietterich, T. G. (2000a). Ensemble methods in machine learning. In Multiple classifier systems, pages1–15. Springer. Dietterich, T. G. (2000b). Hierarchical reinforcement learning with the MAXQ value function decomposition. J. Artif. Intell. Res. (JAIR), 13:227–303. Director, S. W.,Rohrer, R. A. (1969). Automated network design - the frequency-domain case. IEEETrans. Circuit Theory, CT-16:330–337. Doya, K., Samejima, K., ichi Katagiri, K., Kawato, M. (2002). Multiple model-based reinforcementlearning. Neural Computation, 14(6):1347–1369. Dreyfus, S. E. (1962). The numerical solution of variational problems. Journal of Mathematical Analysisand Applications, 5(1):30–45. Dreyfus, S. E. (1973). The computational solution of optimal control problems with time lag. IEEETransactions on Automatic Control, 18(4):383–385. Duchi, J., Hazan, E., Singer, Y. (2011). Adaptive subgradient methods for online learning,stochasticoptimization. The Journal of Machine Learning, 12:2121–2159. Egorova, A., Gloye, A., G¨oktekin, C., Liers, A., Luft, M., Rojas, R., Simon, M., Tenchio, O., Wiesel,F. (2004). FU-Fighters Small Size 2004, Team Description. RoboCup 2004 Symposium: Papers andTeam Description Papers. CD edition. Elman, J. L. (1988). Finding structure in time. Technical Report CRL 8801, Center for Research inLanguage, University of California, San Diego. Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., Bengio, S. (2010). Why doesunsupervised pre-training help deep learning? J. Mach. Learn. Res., 11:625–660. Eubank, R. L. (1988). Spline smoothing,nonparametric regression. In Farlow, S., editor, Self-Organizing Methods in Modeling. Marcel Dekker, New York. Euler, L. (1744). Methodus inveniendi. Faggin, F. (1992). Neural network hardware. In Neural Networks, 1992. IJCNN., International JointConference on, volume 1, page 153. Fahlman, S. E. (1988). An empirical study of learning speed in back-propagation networks. TechnicalReport CMU-CS-88-162, Carnegie-Mellon Univ. Fahlman, S. E. (1991). The recurrent cascade-correlation learning algorithm. In Lippmann, R. P., Moody,J. E., Touretzky, D. S., editors, Advances in Neural Information Processing Systems 3, pages 190–196. Morgan Kaufmann. Farabet, C., Couprie, C., Najman, L., LeCun, Y. (2013). 
Learning hierarchical features for scenelabeling. Pattern Analysis,Machine Intelligence, IEEE Transactions on, 35(8):1915–1929. Fern´andez, S., Graves, A., Schmidhuber, J. (2007). An application of recurrent neural networks todiscriminative keyword spotting. In Proc. ICANN (2), pages 220–229. Fernandez, S., Graves, A., Schmidhuber, J. (2007). Sequence labelling in structured domains withhierarchical recurrent neural networks. In Proceedings of the 20th International Joint Conference onArtificial Intelligence (IJCAI). Field, D. J. (1987). Relations between the statistics of natural images,the response properties of corticalcells. Journal of the Optical Society of America, 4:2379–2394. Field, D. J. (1994). What is the goal of sensory coding? Neural Computation, 6:559–601. Fletcher, R.,Powell, M. J. (1963). A rapidly convergent descent method for minimization. The ComputerJournal, 6(2):163–168. Fogel, L., Owens, A., Walsh, M. (1966). Artificial Intelligence through Simulated Evolution. Willey,New York. F¨oldi´ak, P. (1990). Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics,64:165–170. F¨oldi´ak, P.,Young, M. P. (1995). Sparse coding in the primate cortex. In Arbib, M. A., editor, TheHandbook of Brain Theory,Neural Networks, pages 895–898. The MIT Press. F¨orster, A., Graves, A., Schmidhuber, J. (2007). RNN-based Learning of Compact Maps for EfficientRobot Localization. In 15th European Symposium on Artificial Neural Networks, ESANN, pages 537–542, Bruges, Belgium. Franzius, M., Sprekeler, H., Wiskott, L. (2007). Slowness,sparseness lead to place, head-direction,and spatial-view cells. PLoS Computational Biology, 3(8):166. Frinken, V., Zamora-Martinez, F., Espana-Boquera, S., Castro-Bleda, M. J., Fischer, A., Bunke, H. (2012). Long-short term memory neural networks language modeling for handwriting recognition. InPattern Recognition (ICPR), 2012 21st International Conference on, pages 701–704. IEEE. Fritzke, B. (1994). A growing neural gas network learns topologies. In Tesauro, G., Touretzky, D. S., andLeen, T. K., editors, NIPS, pages 625–632. MIT Press. Fu, K. S. (1977). Syntactic Pattern Recognition,Applications. Berlin, Springer. Fukada, T., Schuster, M., Sagisaka, Y. (1999). Phoneme boundary estimation using bidirectionalrecurrent neural networks,its applications. Systems,Computers in Japan, 30(4):20–30. Fukushima, K. (1979). Neural network model for a mechanism of pattern recognition unaffected by shiftin position - Neocognitron. Trans. IECE, J62-A(10):658–665. Fukushima, K. (1980). Neocognitron: A self-organizing neural network for a mechanism of pattern recognitionunaffected by shift in position. Biological Cybernetics, 36(4):193–202. Fukushima, K. (2011). Increasing robustness against background noise: visual pattern recognition by aNeocognitron. Neural Networks, 24(7):767–778. Fukushima, K. (2013a). Artificial vision by multi-layered neural networks: Neocognitron,its advances. Neural Networks, 37:103–119. Fukushima, K. (2013b). Training multi-layered neural network Neocognitron. Neural Networks, 40:18–31. Gauss, C. F. (1821). Theoria combinationis observationum erroribus minimis obnoxiae (Theory of thecombination of observations least subject to error). Geman, S., Bienenstock, E., Doursat, R. (1992). Neural networks,the bias/variance dilemma. Neural Computation, 4:1–58. Gers, F. A.,Schmidhuber, J. (2000). Recurrent nets that time,count. In Neural Networks, 2000. 
IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on, volume 3, pages189–194. IEEE. Gers, F. A.,Schmidhuber, J. (2001). LSTM recurrent networks learn simple context free,contextsensitive languages. IEEE Transactions on Neural Networks, 12(6):1333–1340. Gers, F. A., Schmidhuber, J., Cummins, F. (2000). Learning to forget: Continual prediction withLSTM. Neural Computation, 12(10):2451–2471. Gers, F. A., Schraudolph, N., Schmidhuber, J. (2002). Learning precise timing with LSTM recurrentnetworks. Journal of Machine Learning Research, 3:115–143. Ghavamzadeh, M.,Mahadevan, S. (2003). Hierarchical policy gradient algorithms. In Proceedings ofthe Twentieth Conference on Machine Learning (ICML-2003), pages 226–233. Gherrity, M. (1989). A learning algorithm for analog fully recurrent neural networks. In IEEE/INNSInternational Joint Conference on Neural Networks, San Diego, volume 1, pages 643–644. Girshick, R., Donahue, J., Darrell, T., Malik, J. (2013). Rich feature hierarchies for accurate objectdetection,semantic segmentation. Technical Report arxiv.org/abs/1311.2524, UC Berkeley,ICSI. Gisslen, L., Luciw, M., Graziano, V., Schmidhuber, J. (2011). Sequential constant size compressor forreinforcement learning. In Proc. Fourth Conference on Artificial General Intelligence (AGI), Google,Mountain View, CA, pages 31–40. Springer. Glasmachers, T., Schaul, T., Sun, Y., Wierstra, D., Schmidhuber, J. (2010). Exponential Natural EvolutionStrategies. In Proceedings of the Genetic,Evolutionary Computation Conference (GECCO),pages 393–400. ACM. Glorot, X., Bordes, A., Bengio, Y. (2011). Deep sparse rectifier networks. In AISTATS, volume 15,pages 315–323. Gloye, A.,Wiesel, F., Tenchio, O., Simon,M. (2005). Reinforcing the driving quality of soccer playingrobots by anticipation. IT - Information Technology, 47(5). G¨odel, K. (1931). ¨Uber formal unentscheidbare S¨atze der Principia Mathematica und verwandter SystemeI. Monatshefte f¨ur Mathematik und Physik, 38:173–198. Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization,Machine Learning. Addison-Wesley, Reading, MA. Goldfarb, D. (1970). A family of variable-metric methods derived by variational means. Mathematics ofcomputation, 24(109):23–26. Golub, G., Heath, H., Wahba, G. (1979). Generalized cross-validation as a method for choosing a goodridge parameter. Technometrics, 21:215–224. Gomez, F. J. (2003). Robust Nonlinear Control through Neuroevolution. PhD thesis, Department of ComputerSciences, University of Texas at Austin. Gomez, F. J.,Miikkulainen, R. (2003). Active guidance for a finless rocket using neuroevolution. InProc. GECCO 2003, Chicago. Gomez, F. J.,Schmidhuber, J. (2005). Co-evolving recurrent neurons learn deep memory POMDPs. In Proc. of the 2005 conference on genetic,evolutionary computation (GECCO), Washington, D. C. ACM Press, New York, NY, USA. Gomez, F. J., Schmidhuber, J., Miikkulainen, R. (2008). Accelerated neural evolution through cooperativelycoevolved synapses. Journal of Machine Learning Research, 9(May):937–965. Goodfellow, I., Mirza, M., Da, X., Courville, A., Bengio, Y. (2014). An Empirical Investigation ofCatastrophic Forgetting in Gradient-Based Neural Networks. TR arXiv:1312.6211v2. Goodfellow, I. J., Courville, A., Bengio, Y. (2011). Spike-and-slab sparse coding for unsupervisedfeature discovery. In NIPS Workshop on Challenges in Learning Hierarchical Models. Goodfellow, I. J., Courville, A. C., Bengio, Y. (2012). Large-scale feature learning with spike-and-slabsparse coding. 
In Proceedings of the 29th International Conference on Machine Learning. Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y. (2013). Maxout networks. InInternational Conference on Machine Learning (ICML). Graves, A. (2011). Practical variational inference for neural networks. In Advances in Neural InformationProcessing Systems (NIPS), pages 2348–2356. Graves, A., Eck, D., Beringer, N., Schmidhuber, J. (2003). Isolated digit recognition with LSTMrecurrent networks. In First International Workshop on Biologically Inspired Approaches to AdvancedInformation Technology, Lausanne. Graves, A., Fernandez, S., Gomez, F. J., Schmidhuber, J. (2006). Connectionist temporal classification:Labelling unsegmented sequence data with recurrent neural nets. In ICML’06: Proceedings of the 23rdInternational Conference on Machine Learning, pages 369–376. Graves, A., Fernandez, S., Liwicki, M., Bunke, H., Schmidhuber, J. (2008). Unconstrained on-linehandwriting recognition with recurrent neural networks. In Platt, J., Koller, D., Singer, Y., Roweis,S., editors, Advances in Neural Information Processing Systems 20, pages 577–584. MIT Press, Cambridge,MA. Graves, A., Liwicki, M., Fernandez, S., Bertolami, R., Bunke, H., Schmidhuber, J. (2009). A novelconnectionist system for improved unconstrained handwriting recognition. IEEE Transactions on PatternAnalysis,Machine Intelligence, 31(5). Graves, A., Mohamed, A.-r., Hinton, G. E. (2013). Speech recognition with deep recurrent neuralnetworks. In Acoustics, Speech,Signal Processing (ICASSP), 2013 IEEE International Conferenceon, pages 6645–6649. IEEE. Graves, A.,Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM andother neural network architectures. Neural Networks, 18(5-6):602–610. Graves, A.,Schmidhuber, J. (2009). Offline handwriting recognition with multidimensional recurrentneural networks. In Advances in Neural Information Processing Systems 21, pages 545–552. MIT Press,Cambridge, MA. Graziano, M. (2009). The Intelligent Movement Machine: An Ethological Perspective on the PrimateMotor System. Oxford University Press, USA. Griewank, A. (2012). Documenta Mathematica - Extra Volume ISMP, pages 389–400. Grondman, I., Busoniu, L., Lopes, G. A. D., Babuska, R. (2012). A survey of actor-critic reinforcementlearning: Standard,natural policy gradients. Systems, Man, Cybernetics, Part C: Applicationsand Reviews, IEEE Transactions on, 42(6):1291–1307. Grossberg, S. (1969). Some networks that can learn, remember, reproduce any number of complicatedspace-time patterns, I. Journal of Mathematics,Mechanics, 19:53–91. Grossberg, S. (1976a). Adaptive pattern classification,universal recoding, 1: Parallel development andcoding of neural feature detectors. Biological Cybernetics, 23:187–202. Grossberg, S. (1976b). Adaptive pattern classification,universal recoding, 2: Feedback, expectation,olfaction, illusions. Biological Cybernetics, 23. Gruau, F., Whitley, D., Pyeatt, L. (1996). A comparison between cellular encoding,direct encodingfor genetic neural networks. NeuroCOLT Technical Report NC-TR-96-048, ESPRIT Working Group inNeural,Computational Learning, NeuroCOLT 8556. Gr ¨uttner, M., Sehnke, F., Schaul, T., Schmidhuber, J. (2010). Multi-Dimensional Deep Memory Atari-Go Players for Parameter Exploring Policy Gradients. In Proceedings of the International Conferenceon Artificial Neural Networks ICANN, pages 114–123. Springer. Guyon, I., Vapnik, V., Boser, B., Bottou, L., Solla, S. A. (1992). 
Structural risk minimization forcharacter recognition. In Lippman, D. S., Moody, J. E., Touretzky, D. S., editors, Advances inNeural Information Processing Systems 4, pages 471–479. Morgan Kaufmann. Hadamard, J. (1908). M´emoire sur le probl`eme d’analyse relatif `a l’´equilibre des plaques ´elastiques encastr´ees. M´emoires pr´esent´es par divers savants `a l’Acad´emie des sciences de l’Institut de France:´Extrait. Imprimerie nationale. Hansen, N., M¨uller, S. D., Koumoutsakos, P. (2003). Reducing the time complexity of the derandomizedevolution strategy with covariance matrix adaptation (cma-es). Evolutionary Computation,11(1):1–18. Hansen, N.,Ostermeier, A. (2001). Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195. Hanson, S. J.,Pratt, L. Y. (1989). Comparing biases for minimal network construction with backpropagation. In Touretzky, D. S., editor, Advances in Neural Information Processing Systems 1, pages177–185. San Mateo, CA: Morgan Kaufmann. Hashem, S.,Schmeiser, B. (1992). Improving model accuracy using optimal linear combinations oftrained neural networks. IEEE Transactions on Neural Networks, 6:792–794. Hassibi, B.,Stork, D. G. (1993). Second order derivatives for network pruning: Optimal brain surgeon. In Lippman, D. S., Moody, J. E., Touretzky, D. S., editors, Advances in Neural InformationProcessing Systems 5, pages 164–171. Morgan Kaufmann. Hastie, T. J.,Tibshirani, R. J. (1990). Generalized additive models. Monographs on Statisics andApplied Probability, 43. Hawkins, J.,George, D. (2006). Hierarchical Temporal Memory - Concepts, Theory, Terminology. Numenta Inc. Hebb, D. O. (1949). The Organization of Behavior. Wiley, New York. Heemskerk, J. N. (1995). Overview of neural hardware. Neurocomputers for Brain-Style Processing. Design, Implementation,Application. Hertz, J., Krogh, A., Palmer, R. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City. Hestenes, M. R.,Stiefel, E. (1952). Methods of conjugate gradients for solving linear systems. Journalof research of the National Bureau of Standards, 49:409–436. Hihi, S. E.,Bengio, Y. (1996). Hierarchical recurrent neural networks for long-term dependencies. In Touretzky, D. S., Mozer, M. C., Hasselmo, M. E., editors, Advances in Neural InformationProcessing Systems 8, pages 493–499. MIT Press. Hinton, G.,Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507. Hinton, G. E. (1989). Connectionist learning procedures. Artificial intelligence, 40(1):185–234. Hinton, G. E., Dayan, P., Frey, B. J., Neal, R. M. (1995). The wake-sleep algorithm for unsupervisedneural networks. Science, 268:1158–1160. Hinton, G. E., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen,P., Sainath, T. N., Kingsbury, B. (2012a). Deep neural networks for acoustic modeling in speechrecognition: The shared views of four research groups. IEEE Signal Process. Mag., 29(6):82–97. Hinton, G. E.,Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society B, 352:1177–1190. Hinton, G. E., Osindero, S., Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. NeuralComputation, 18(7):1527–1554. Hinton, G. E.,Sejnowski, T. E. (1986). Learning,relearning in Boltzmann machines. In ParallelDistributed Processing, volume 1, pages 282–317. MIT Press. Hinton, G. 
E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R. R. (2012b). Improvingneural networks by preventing co-adaptation of feature detectors. Technical Report arXiv:1207.0580. Hinton, G. E.,van Camp, D. (1993). Keeping neural networks simple. In Proceedings of the InternationalConference on Artificial Neural Networks, Amsterdam, pages 11–18. Springer. Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut f ¨urInformatik, Lehrstuhl Prof. Brauer, Technische Universit¨at M¨unchen. Advisor: J. Schmidhuber. Hochreiter, S., Bengio, Y., Frasconi, P., Schmidhuber, J. (2001a). Gradient flow in recurrent nets: thedifficulty of learning long-term dependencies. In Kremer, S. C.,Kolen, J. F., editors, A Field Guideto Dynamical Recurrent Neural Networks. IEEE Press. Hochreiter, S.,Obermayer, K. (2005). Sequence classification for protein analysis. In Snowbird Workshop,Snowbird, Utah. Computational,Biological Learning Society. Hochreiter, S.,Schmidhuber, J. (1997a). Flat minima. Neural Computation, 9(1):1–42. Hochreiter, S.,Schmidhuber, J. (1997b). Long Short-Term Memory. Neural Computation, 9(8):1735–1780. Based on TR FKI-207-95, TUM (1995). Hochreiter, S.,Schmidhuber, J. (1999). Feature extraction through LOCOCODE. Neural Computation,11(3):679–714. Hochreiter, S., Younger, A. S., Conwell, P. R. (2001b). Learning to learn using gradient descent. InLecture Notes on Comp. Sci. 2130, Proc. Intl. Conf. on Artificial Neural Networks (ICANN-2001), pages87–94. Springer: Berlin, Heidelberg. Holden, S. B. (1994). On the Theory of Generalization,Self-Structuring in Linearly Weighted ConnectionistNetworks. PhD thesis, Cambridge University, Engineering Department. Holland, J. H. (1975). Adaptation in Natural,Artificial Systems. University of Michigan Press, AnnArbor. Hopfield, J. J. (1982). Neural networks,physical systems with emergent collective computationalabilities. Proc. of the National Academy of Sciences, 79:2554–2558. Hubel, D. H.,Wiesel, T. (1962). Receptive fields, binocular interaction, functional architecture inthe cat’s visual cortex. Journal of Physiology (London), 160:106–154. Huffman, D. A. (1952). A method for construction of minimum-redundancy codes. Proceedings IRE,40:1098–1101. Hutter, M. (2002). The fastest,shortest algorithm for all well-defined problems. International Journalof Foundations of Computer Science, 13(3):431–443. (On J. Schmidhuber’s SNF grant 20-61847). Hutter, M. (2005). Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin. (On J. Schmidhuber’s SNF grant 20-61847). Hyvarinen, A., Hoyer, P., Oja, E. (1999). Sparse code shrinkage: Denoising by maximum likelihoodestimation. In Kearns, M., Solla, S. A., Cohn, D., editors, Advances in Neural InformationProcessing Systems 12. MIT Press. Hyv¨arinen, A.,Oja, E. (2000). Independent component analysis: algorithms,applications. Neuralnetworks, 13(4):411–430. ICPR 2012 Contest on Mitosis Detection in Breast Cancer Histological Images. (2012). IPAL Laboratoryand TRIBVN Company,Pitie-Salpetriere Hospital,CIALAB of Ohio State Univ.,http://ipal.cnrs.fr/ICPR2012/. Igel, C. (2003). Neuroevolution for reinforcement learning using evolution strategies. In Reynolds, R.,Abbass, H., Tan, K. C., Mckay, B., Essam, D., Gedeon, T., editors, Congress on EvolutionaryComputation (CEC 2003), volume 4, pages 2588–2595. IEEE. Indermuhle, E., Frinken, V., Bunke, H. (2012). Mode detection in online handwritten documentsusing BLSTM neural networks. 
In Frontiers in Handwriting Recognition (ICFHR), 2012 InternationalConference on, pages 302–307. IEEE. Indermuhle, E., Frinken, V., Fischer, A., Bunke, H. (2011). Keyword spotting in online handwrittendocuments containing text,non-text using BLSTM neural networks. In Document Analysis andRecognition (ICDAR), 2011 International Conference on, pages 73–77. IEEE. Jaakkola, T., Singh, S. P., Jordan, M. I. (1995). Reinforcement learning algorithm for partially observableMarkov decision problems. In Tesauro, G., Touretzky, D. S., Leen, T. K., editors, Advances inNeural Information Processing Systems 7, pages 345–352. MIT Press. Jackel, L., Boser, B., Graf, H.-P., Denker, J., LeCun, Y., Henderson, D., Matan, O., Howard, R., Baird,H. (1990). VLSI implementation of electronic neural networks:,example in character recognition. In IEEE, editor, IEEE International Conference on Systems, Man, Cybernetics, pages 320–322, LosAngeles, CA. Jacob, C., Lindenmayer, A., Rozenberg, G. (1994). Genetic L-System Programming. In ParallelProblem Solving from Nature III, Lecture Notes in Computer Science. Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks,1(4):295–307. Jaeger, H. (2001). The ”echo state” approach to analysing,training recurrent neural networks. TechnicalReport GMD Report 148, German National Research Center for Information Technology. Jaeger, H. (2002). Short term memory in echo state networks. GMD-Report 152, GMD - German NationalResearch Institute for Computer Science. Jaeger, H. (2004). Harnessing nonlinearity: Predicting chaotic systems,saving energy in wirelesscommunication. Science, 304:78–80. Jim, K., Giles, C. L., Horne, B. G. (1995). Effects of noise on convergence,generalization inrecurrent networks. In Tesauro, G., Touretzky, D., Leen, T., editors, Advances in Neural InformationProcessing Systems (NIPS) 7, page 649. San Mateo, CA: Morgan Kaufmann. Jodogne, S. R.,Piater, J. H. (2007). Closed-loop learning of visual control policies. J. ArtificialIntelligence Research, 28:349–391. Jordan,M. I. (1986). Serial order: A parallel distributed processing approach. Technical Report ICS Report8604, Institute for Cognitive Science, University of California, San Diego. Jordan, M. I. (1988). Supervised learning,systems with excess degrees of freedom. Technical ReportCOINS TR 88-27, Massachusetts Institute of Technology. Jordan, M. I.,Rumelhart, D. E. (1990). Supervised learning with a distal teacher. Technical ReportOccasional Paper #40, Center for Cog. Sci., Massachusetts Institute of Technology. Jordan, M. I.,Sejnowski, T. J. (2001). Graphical models: Foundations of neural computation. MITpress. Jutten, C.,Herault, J. (1991). Blind separation of sources, part I: An adaptive algorithm based onneuromimetic architecture. Signal Processing, 24(1):1–10. Kaelbling, L. P., Littman, M. L., Cassandra, A. R. (1995). Planning,acting in partially observablestochastic domains. Technical report, Brown University, Providence RI. Kaelbling, L. P., Littman, M. L., Moore, A. W. (1996). Reinforcement learning: a survey. Journal ofAI research, 4:237–285. Kalinke, Y.,Lehmann, H. (1998). Computation in recurrent neural networks: From counters to iteratedfunction systems. In Antoniou, G.,Slaney, J., editors, Advanced Topics in Artificial Intelligence,Proceedings of the 11th Australian Joint Conference on Artificial Intelligence, volume 1502 of LNAI,Berlin, Heidelberg. Springer. Kelley, H. J. (1960). Gradient theory of optimal flight paths. ARS Journal, 30(10):947–954. 
Kerlirzin, P.,Vallet, F. (1993). Robustness in multilayer perceptrons. Neural Computation, 5(1):473–482. Kimura, H., Miyazaki, K., Kobayashi, S. (1997). Reinforcement learning in POMDPs with functionapproximation. In ICML, volume 97, pages 152–160. Kitano, H. (1990). Designing neural networks using genetic algorithms with graph generation system. Complex Systems, 4:461–476. Klapper-Rybicka, M., Schraudolph, N. N., Schmidhuber, J. (2001). Unsupervised learning in LSTMrecurrent neural networks. In Lecture Notes on Comp. Sci. 2130, Proc. Intl. Conf. on Artificial NeuralNetworks (ICANN-2001), pages 684–691. Springer: Berlin, Heidelberg. Kohl, N.,Stone, P. (2004). Policy gradient reinforcement learning for fast quadrupedal locomotion. In Robotics,Automation, 2004. Proceedings. ICRA’04. 2004 IEEE International Conference on,volume 3, pages 2619–2624. IEEE. Kohonen, T. (1972). Correlation matrix memories. Computers, IEEE Transactions on, 100(4):353–359. Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics,43(1):59–69. Kohonen, T. (1988). Self-Organization,Associative Memory. Springer, second edition. Kolmogorov, A. N. (1965a). On the representation of continuous functions of several variables by superpositionof continuous functions of one variable,addition. Doklady Akademii. Nauk USSR, 114:679–681. Kolmogorov, A. N. (1965b). Three approaches to the quantitative definition of information. Problems ofInformation Transmission, 1:1–11. Kompella, V. R., Luciw, M. D., Schmidhuber, J. (2012). Incremental slow feature analysis: Adaptivelow-complexity slow feature updating from high-dimensional input streams. Neural Computation,24(11):2994–3024. Korkin, M., de Garis, H., Gers, F., Hemmi, H. (1997). CBM (CAM-Brain Machine) - a hardware toolwhich evolves a neural net module in a fraction of a second,runs a million neuron artificial brain inreal time. Koutn´ık, J., Cuccu, G., Schmidhuber, J., Gomez, F. (July 2013). Evolving large-scale neural networksfor vision-based reinforcement learning. In Proceedings of the Genetic,Evolutionary ComputationConference (GECCO), pages 1061–1068, Amsterdam. ACM. Koutn´ık, J., Gomez, F., Schmidhuber, J. (2010). Evolving neural networks in compressed weight space. In Proceedings of the 12th annual conference on Genetic,evolutionary computation, pages 619–626. Koutn´ık, J., Greff, K., Gomez, F., Schmidhuber, J. (2014). A Clockwork RNN. Technical ReportarXiv:1402.3511 [cs.NE], The Swiss AI Lab IDSIA. Kramer, M. (1991). Nonlinear principal component analysis using autoassociative neural networks. AIChEJournal, 37:233–243. Krizhevsky, A., Sutskever, I., Hinton, G. E. (2012). Imagenet classification with deep convolutionalneural networks. In Advances in Neural Information Processing Systems (NIPS 2012), page 4. Krogh, A.,Hertz, J. A. (1992). A simple weight decay can improve generalization. In Lippman, D. S.,Moody, J. E., Touretzky, D. S., editors, Advances in Neural Information Processing Systems 4, pages950–957. Morgan Kaufmann. Kurzweil, R. (2012). How to Create a Mind: The Secret of Human Thought Revealed. Lagoudakis, M. G.,Parr, R. (2003). Least-squares policy iteration. JMLR, 4:1107–1149. Lang, K.,Waibel, A., Hinton, G. E. (1990). A time-delay neural network architecture for isolated wordrecognition. Neural Networks, 3:23–43. Lange, S.,Riedmiller, M. (2010). Deep auto-encoder neural networks in reinforcement learning. InNeural Networks (IJCNN), The 2010 International Joint Conference on, pages 1–8. Lapedes, A.,Farber, R. 
(1986). A self-optimizing, nonsymmetrical neural net for content addressablememory,pattern recognition. Physica D, 22:247–259. Larraanaga, P.,Lozano, J. A. (2001). Estimation of Distribution Algorithms: A New Tool for EvolutionaryComputation. Kluwer Academic Publishers, Norwell, MA, USA. Le, Q. V., Ranzato, M., Monga, R., Devin, M., Corrado, G., Chen, K., Dean, J., Ng, A. Y. (2012). Building high-level features using large scale unsupervised learning. In Proc. ICML’12. LeCun, Y. (1985). Une proc´edure d’apprentissage pour r´eseau `a seuil asym´etrique. Proceedings of Cognitiva85, Paris, pages 599–604. LeCun, Y. (1988). A theoretical framework for back-propagation. In Touretzky, D., Hinton, G., andSejnowski, T., editors, Proceedings of the 1988 Connectionist Models Summer School, pages 21–28,CMU, Pittsburgh, Pa. Morgan Kaufmann. LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., Jackel, L. D. (1989). Back-propagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551. LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard,W., Jackel, L. D. (1990a). Handwritten digit recognition with a back-propagation network. In Touretzky, D. S., editor, Advances inNeural Information Processing Systems 2, pages 396–404. Morgan Kaufmann. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P. (1998). Gradient-based learning applied to documentrecognition. Proceedings of the IEEE, 86(11):2278–2324. LeCun, Y., Denker, J. S., Solla, S. A. (1990b). Optimal brain damage. In Touretzky, D. S., editor,Advances in Neural Information Processing Systems 2, pages 598–605. Morgan Kaufmann. LeCun, Y., Muller, U., Cosatto, E., Flepp, B. (2006). Off-road obstacle avoidance through end-to-endlearning. In Advances in Neural Information Processing Systems (NIPS 2005). LeCun, Y., Simard, P., Pearlmutter, B. (1993). Automatic learning rate maximization by on-lineestimation of the Hessian’s eigenvectors. In Hanson, S., Cowan, J., Giles, L., editors, Advancesin Neural Information Processing Systems (NIPS 1992), volume 5. Morgan Kaufmann Publishers, SanMateo, CA. Lee, H., Battle, A., Raina, R., Ng, A. Y. (2007a). Efficient sparse coding algorithms. In Advances inNeural Information Processing Systems 19, pages 801–808. Lee, H., Ekanadham, C., Ng, A. Y. (2007b). Sparse deep belief net model for visual area V2. InAdvances in Neural Information Processing Systems (NIPS), volume 7, pages 873–880. Lee, L. (1996). Learning of context-free languages: A survey of the literature. Technical Report TR-12-96,Center for Research in Computing Technology, Harvard University, Cambridge, Massachusetts. Legenstein, R., Wilbert, N., Wiskott, L. (2010). Reinforcement learning on slow features of highdimensionalinput streams. PLoS Computational Biology, 6(8). Legenstein, R. A.,Maass, W. (2002). Neural circuits for pattern recognition with small total wirelength. Theor. Comput. Sci., 287(1):239–249. Leibniz, G. W. (1676). Memoir using the chain rule (cited in TMME 7:2&3 p 321-332, 2010). Lenat, D. B. (1983). Theory formation by heuristic search. Machine Learning, 21. Lenat, D. B.,Brown, J. S. (1984). Why AM an EURISKO appear to work. Artificial Intelligence,23(3):269–294. Levenberg, K. (1944). A method for the solution of certain problems in least squares. Quarterly of appliedmathematics, 2:164–168. Levin, A. U., Leen, T. K., Moody, J. E. (1994). Fast pruning using principal components. In Advancesin Neural Information Processing Systems 6, page 35. Morgan Kaufmann. Levin, A. 
U.,Narendra, K. S. (1995). Control of nonlinear dynamical systems using neural networks. ii. observability, identification, control. IEEE transactions on neural networks/a publication of theIEEE Neural Networks Council, 7(1):30–42. Levin, L. A. (1973a). On the notion of a random sequence. Soviet Math. Dokl., 14(5):1413–1416. Levin, L. A. (1973b). Universal sequential search problems. Problems of Information Transmission,9(3):265–266. Lewicki, M. S.,Olshausen, B. A. (1998). Inferring sparse, overcomplete image codes using an efficientcoding framework. In Jordan, M. I., Kearns, M. J., Solla, S. A., editors, Advances in NeuralInformation Processing Systems 10, pages 815–821. L’Hˆopital, G. F. A. (1696). Analyse des infiniment petits, pour l’intelligence des lignes courbes. Paris:L’Imprimerie Royale. Li, M.,Vit´anyi, P. M. B. (1997). An Introduction to Kolmogorov Complexity,its Applications (2ndedition). Springer. Lin, L. (1993). Reinforcement Learning for Robots Using Neural Networks. PhD thesis, Carnegie MellonUniversity, Pittsburgh. Lin, T., Horne, B., Tino, P., Giles, C. (1996). Learning long-term dependencies in NARX recurrentneural networks. IEEE Transactions on Neural Networks, 7(6):1329–1338. Lin, T., Horne, B. G., Tino, P., Giles, C. L. (1995). Learning long-term dependencies is not as difficultwith NARX recurrent neural networks. Technical Report UMIACS-TR-95-78,CS-TR-3500, Institutefor Advanced Computer Studies, University of Maryland, College Park, MD 20742. Lindenmayer, A. (1968). Mathematical models for cellular interaction in development. J. Theoret. Biology,18:280–315. Linnainmaa, S. (1970). The representation of the cumulative rounding error of an algorithm as a Taylorexpansion of the local rounding errors. Master’s thesis, Univ. Helsinki. Linnainmaa, S. (1976). Taylor expansion of the accumulated rounding error. BIT Numerical Mathematics,16(2):146–160. Linsker, R. (1988). Self-organization in a perceptual network. IEEE Computer, 21:105–117. Littman, M. L. (1996). Algorithms for Sequential Decision Making. PhD thesis, Brown University. Littman, M. L., Cassandra, A. R., Kaelbling, L. P. (1995). Learning policies for partially observableenvironments: Scaling up. In Prieditis, A.,Russell, S., editors, Machine Learning: Proceedingsof the Twelfth International Conference, pages 362–370. Morgan Kaufmann Publishers, San Francisco,CA. Ljung, L. (1998). System identification. Springer. Loiacono, D., Cardamone, L., Lanzi, P. L. (2011). Simulated car racing championship competitionsoftware manual. Technical report, Dipartimento di Elettronica e Informazione, Politecnico di Milano,Italy. Loiacono, D., Lanzi, P. L., Togelius, J., Onieva, E., Pelta, D. A., Butz, M. V., L¨onneker, T. D., Cardamone,L., Perez, D., S´aez, Y., Preuss, M., Quadflieg, J. (2009). The 2009 simulated car racingchampionship. Luciw, M., Kompella, V. R., Kazerounian, S., Schmidhuber, J. (2013). An intrinsic value system fordeveloping multiple invariant representations with incremental slowness learning. Frontiers in Neurorobotics,7(9). Maas, A. L., Hannun, A. Y., Ng, A. Y. (2013). Rectifier nonlinearities improve neural network acousticmodels. In International Conference on Machine Learning (ICML). Maass, W. (2000). On the computational power of winner-take-all. Neural Computation, 12:2519–2535. Maass, W., Natschl¨ager, T., Markram, H. (2002). Real-time computing without stable states: A newframework for neural computation based on perturbations. Neural Computation, 14(11):2531–2560. MacKay, D. J. C. (1992). 
A practical Bayesian framework for backprop networks. Neural Computation,4:448–472. MacKay, D. J. C.,Miller, K. D. (1990). Analysis of Linsker’s simulation of Hebbian rules. NeuralComputation, 2:173–187. Maei, H. R.,Sutton, R. S. (2010). GQ(): A general gradient algorithm for temporal-difference predictionlearning with eligibility traces. In Proceedings of the Third Conference on Artificial GeneralIntelligence, volume 1, pages 91–96. Mahadevan, S. (1996). Average reward reinforcement learning: Foundations, algorithms, empiricalresults. Machine Learning, 22:159. Manolios, P.,Fanelli, R. (1994). First-order recurrent neural networks,deterministic finite stateautomata. Neural Computation, 6:1155–1173. Marquardt, D. W. (1963). An algorithm for least-squares estimation of nonlinear parameters. Journal ofthe Society for Industrial & Applied Mathematics, 11(2):431–441. Martens, J. (2010). Deep learning via Hessian-free optimization. In F¨urnkranz, J.,Joachims, T., editors,Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 735–742,Haifa, Israel. Omnipress. Martens, J.,Sutskever, I. (2011). Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning, pages 1033–1040. Martinetz, T. M., Ritter, H. J., Schulten, K. J. (1990). Three-dimensional neural net for learningvisuomotor coordination of a robot arm. IEEE Transactions on Neural Networks, 1(1):131–136. Masci, J., Ciresan, D. C., Giusti, A., Gambardella, L.M., Schmidhuber, J. (2013a). Fast image scanningwith deep max-pooling convolutional neural networks. In Proc. ICIP. Masci, J., Giusti, A., Ciresan, D. C., Fricout, G., Schmidhuber, J. (2013b). A fast learning algorithm forimage segmentation with max-pooling convolutional networks. In International Conference on ImageProcessing (ICIP13), pages 2713–2717. Matsuoka, K. (1992). Noise injection into inputs in back-propagation learning. IEEE Transactions onSystems, Man, Cybernetics, 22(3):436–440. Mayer, H., Gomez, F., Wierstra, D., Nagy, I., Knoll, A., Schmidhuber, J. (2008). A system forrobotic heart surgery that learns to tie knots using recurrent neural networks. Advanced Robotics, 22(13-14):1521–1537. McCallum, R. A. (1996). Learning to use selective attention,short-term memory in sequential tasks. In Maes, P., Mataric, M., Meyer, J.-A., Pollack, J., Wilson, S. W., editors, From Animals to Animats4: Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior, Cambridge,MA, pages 315–324. MIT Press, Bradford Books. McCulloch, W.,Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletinof Mathematical Biophysics, 7:115–133. Melnik, O., Levy, S. D., Pollack, J. B. (2000). RAAM for infinite context-free languages. In Proc. IJCNN (5), pages 585–590. Menache, I., Mannor, S., Shimkin, N. (2002). Q-cut - dynamic discovery of sub-goals in reinforcementlearning. In Proc. ECML’02, pages 295–306. Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I., Lavoie, E., Muller, X., Desjardins,G., Warde-Farley, D., Vincent, P., Courville, A., Bergstra, J. (2011). Unsupervised andtransfer learning challenge: a deep learning approach. In JMLR W&CP: Proc. Unsupervised,TransferLearning, volume 7. Meuleau, N., Peshkin, L., Kim, K. E., Kaelbling, L. P. (1999). Learning finite state controllers forpartially observable environments. In 15th International Conference of Uncertainty in AI, pages 427–436. Miglino, O., Lund, H., Nolfi, S. 
(1995). Evolving mobile robots in simulated,real environments. Artificial Life, 2(4):417–434. Miller, G., Todd, P., Hedge, S. (1989). Designing neural networks using genetic algorithms. In Proceedingsof the 3rd International Conference on Genetic Algorithms, pages 379–384.Morgan Kauffman. Miller, K. D. (1994). A model for the development of simple cell receptive fields,the ordered arrangementof orientation columns through activity-dependent competition between on-,off-center inputs. Journal of Neuroscience, 14(1):409–441. Minai, A. A. , Williams, R. D. (1994). Perturbation response in feedforward networks. Neural Networks,7(5):783–796. Minsky, M. (1963). Steps toward artificial intelligence. In Feigenbaum, E.,Feldman, J., editors,Computers,Thought, pages 406–450. McGraw-Hill, New York. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M. (Dec2013). Playing Atari with deep reinforcement learning. Technical Report arXiv:1312.5602 [cs.LG],Deepmind Technologies. Mohamed, A., Dahl, G. E., Hinton, G. E. (2009). Deep belief networks for phone recognition. InNIPS’22 workshop on deep learning for speech recognition. Mohamed, A.,Hinton, G. E. (2010). Phone recognition using restricted Boltzmann machines. In AcousticsSpeech,Signal Processing (ICASSP), 2010 IEEE International Conference on, pages 4354–4357. Molgedey, L.,Schuster, H. G. (1994). Separation of independent signals using time-delayed correlations. Phys. Reviews Letters, 72(23):3634–3637. Møller, M. F. (1993). Exact calculation of the product of the Hessian matrix of feed-forward network errorfunctions,a vector in O(N) time. Technical Report PB-432, Computer Science Department, AarhusUniversity, Denmark. Montavon, G., Orr, G., M¨uller, K. (2012). Neural Networks: Tricks of the Trade. Number LNCS 7700in Lecture Notes in Computer Science Series. Springer Verlag. Moody, J. E. (1989). Fast learning in multi-resolution hierarchies. In Touretzky, D. S., editor, Advances inNeural Information Processing Systems 1, pages 29–39. Morgan Kaufmann. Moody, J. E. (1992). The effective number of parameters: An analysis of generalization,regularizationin nonlinear learning systems. In Lippman, D. S., Moody, J. E., Touretzky, D. S., editors, Advancesin Neural Information Processing Systems 4, pages 847–854. Morgan Kaufmann. Moody, J. E.,Utans, J. (1994). Architecture selection strategies for neural networks: Application tocorporate bond rating prediction. In Refenes, A. N., editor, Neural Networks in the Capital Markets. John Wiley & Sons. Moore, A.,Atkeson, C. (1995). The parti-game algorithm for variable resolution reinforcement learningin multidimensional state-spaces. Machine Learning, 21(3):199–233. Moore, A.,Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data andless time. Machine Learning, 13:103–130. Moriarty, D. E. (1997). Symbiotic Evolution of Neural Networks in Sequential Decision Tasks. PhD thesis,Department of Computer Sciences, The University of Texas at Austin. Moriarty, D. E., Miikkulainen, R. (1996). Efficient reinforcement learning through symbiotic evolution. Machine Learning, 22:11–32. Morimoto, J.,Doya, K. (2000). Robust reinforcement learning. In Leen, T. K., Dietterich, T. G.,and Tresp, V., editors, Advances in Neural Information Processing Systems 13, Papers from NeuralInformation Processing Systems (NIPS) 2000, Denver, CO, USA, pages 1061–1067. MIT Press. Mosteller, F.,Tukey, J. W. (1968). Data analysis, including statistics. 
Mozer, M. C. (1989). A focused back-propagation algorithm for temporal sequence recognition. Complex Systems, 3:349–381.
Mozer, M. C. (1991). Discovering discrete distributed representations with iterative competitive learning. In Lippmann, R. P., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems 3, pages 627–634. Morgan Kaufmann.
Mozer, M. C. (1992). Induction of multiscale temporal structure. In Lippman, D. S., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems 4, pages 275–282. Morgan Kaufmann.
Mozer, M. C., and Smolensky, P. (1989). Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Touretzky, D. S., editor, Advances in Neural Information Processing Systems 1, pages 107–115. Morgan Kaufmann.
Munro, P. W. (1987). A dual back-propagation scheme for scalar reinforcement learning. Proceedings of the Ninth Annual Conference of the Cognitive Science Society, Seattle, WA, pages 165–176.
Murray, A. F., and Edwards, P. J. (1993). Synaptic weight noise during MLP learning enhances fault tolerance, generalisation and learning trajectory. In S. J. Hanson, J. D. C., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 491–498. San Mateo, CA: Morgan Kaufmann.
Nadal, J.-P., and Parga, N. (1994). Non-linear neurons in the low noise limit: a factorial code maximises information transfer. Network, 5:565–581.
Nair, V., and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning (ICML).
Narendra, K. S., and Parthasarathy, K. (1990). Identification and control of dynamical systems using neural networks. Neural Networks, IEEE Transactions on, 1(1):4–27.
Narendra, K. S., and Thathatchar, M. A. L. (1974). Learning automata - a survey. IEEE Transactions on Systems, Man, and Cybernetics, 4:323–334.
Neal, R. M. (2006). Classification with Bayesian neural networks. In Quinonero-Candela, J., Magnini, B., Dagan, I., and D'Alche-Buc, F., editors, Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, volume 3944 of Lecture Notes in Computer Science, pages 28–32. Springer.
Neal, R. M., and Zhang, J. (2006). High dimensional classification with Bayesian neural networks and Dirichlet diffusion trees. In Guyon, I., Gunn, S., Nikravesh, M., and Zadeh, L. A., editors, Feature Extraction: Foundations and Applications, Studies in Fuzziness and Soft Computing, pages 265–295. Springer.
Neti, C., Schneider, M. H., and Young, E. D. (1992). Maximally fault tolerant neural networks. In IEEE Transactions on Neural Networks, volume 3, pages 14–23.
Neuneier, R., and Zimmermann, H.-G. (1996). How to train neural networks. In Orr, G. B., and Müller, K.-R., editors, Neural Networks: Tricks of the Trade, volume 1524 of Lecture Notes in Computer Science, pages 373–423. Springer.
Nguyen, N., and Widrow, B. (1989). The truck backer-upper: An example of self learning in neural networks. In Proceedings of the International Joint Conference on Neural Networks, pages 357–363. IEEE Press.
Nicholas Refenes, A., Zapranis, A., and Francis, G. (1994). Stock performance modeling using neural networks: a comparative study with regression models. Neural Networks, 7(2):375–388.
Nilsson, N. J. (1980). Principles of artificial intelligence. Morgan Kaufmann, San Francisco, CA, USA.
Nolfi, S., Floreano, D., Miglino, O., and Mondada, F. (1994). How to evolve autonomous robots: Different approaches in evolutionary robotics. In Brooks, R. A., and Maes, P., editors, Fourth International Workshop on the Synthesis and Simulation of Living Systems (Artificial Life IV), pages 190–197. MIT.
Nowlan, S. J., and Hinton, G. E. (1992). Simplifying neural networks by soft weight sharing. Neural Computation, 4:173–193.
Oh, K.-S., and Jung, K. (2004). GPU implementation of neural networks. Pattern Recognition, 37(6):1311–1314.
Oja, E. (1989). Neural networks, principal components, and subspaces. International Journal of Neural Systems, 1(1):61–68.
Oja, E. (1991). Data compression, feature extraction, and autoassociation in feedforward neural networks. In Kohonen, T., Mäkisara, K., Simula, O., and Kangas, J., editors, Artificial Neural Networks, volume 1, pages 737–745. Elsevier Science Publishers B.V., North-Holland.
Olshausen, B. A., and Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609.
Omlin, C., and Giles, C. L. (1996). Extraction of rules from discrete-time recurrent neural networks. Neural Networks, 9(1):41–52.
O'Reilly, R. (2003). Making working memory work: A computational model of learning in the prefrontal cortex and basal ganglia. Technical Report ICS-03-03, ICS.
Orr, G., and Müller, K. (1998). Neural Networks: Tricks of the Trade. Number LNCS 1524 in Lecture Notes in Computer Science Series. Springer Verlag.
Ostrovskii, G. M., Volin, Y. M., and Borisov, W. W. (1971). Über die Berechnung von Ableitungen. (On the Computation of Derivatives.) Wiss. Z. Tech. Hochschule für Chemie, 13:382–384.
Otte, S., Krechel, D., Liwicki, M., and Dengel, A. (2012). Local feature based online mode detection with recurrent neural networks. In Proceedings of the 2012 International Conference on Frontiers in Handwriting Recognition, pages 533–537. IEEE Computer Society.
Oudeyer, P.-Y., Baranes, A., and Kaplan, F. (2013). Intrinsically motivated learning of real world sensorimotor skills with developmental constraints. In Baldassarre, G., and Mirolli, M., editors, Intrinsically Motivated Learning in Natural and Artificial Systems. Springer.
Pachitariu, M., and Sahani, M. (2013). Regularization and nonlinearities for neural language models: when are they needed? arXiv preprint arXiv:1301.5650.
Palm, G. (1980). On associative memory. Biological Cybernetics, 36.
Palm, G. (1992). On the information storage capacity of local learning rules. Neural Computation, 4(2):703–711.
Parker, D. B. (1985). Learning-logic. Technical Report TR-47, Center for Comp. Research in Economics and Management Sci., MIT.
Pascanu, R., Gulcehre, C., Cho, K., and Bengio, Y. (2013a). How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026.
Pascanu, R., Mikolov, T., and Bengio, Y. (2013b). On the difficulty of training recurrent neural networks. In ICML'13: JMLR: W&CP volume 28.
Pasemann, F., Steinmetz, U., and Dieckman, U. (1999). Evolving structure and function of neurocontrollers. In Angeline, P. J., Michalewicz, Z., Schoenauer, M., Yao, X., and Zalzala, A., editors, Proceedings of the Congress on Evolutionary Computation, volume 3, pages 1973–1978, Mayflower Hotel, Washington D.C., USA. IEEE Press.
Pearlmutter, B. A. (1989). Learning state space trajectories in recurrent neural networks. Neural Computation, 1(2):263–269.
Pearlmutter, B. A. (1994). Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160.
Pearlmutter, B. A. (1995). Gradient calculations for dynamic recurrent neural networks: A survey. IEEE Transactions on Neural Networks, 6(5):1212–1228.
Pearlmutter, B. A., and Hinton, G. E. (1986). G-maximization: An unsupervised learning procedure for discovering regularities. In Denker, J. S., editor, Neural Networks for Computing: American Institute of Physics Conference Proceedings 151, volume 2, pages 333–338.
Peng, J., and Williams, R. J. (1996). Incremental multi-step Q-learning. Machine Learning, 22:283–290.
Pérez-Ortiz, J. A., Gers, F. A., Eck, D., and Schmidhuber, J. (2003). Kalman filters improve LSTM network performance in problems unsolvable by traditional recurrent nets. Neural Networks, (16):241–250.
Peters, J. (2010). Policy gradient methods. Scholarpedia, 5(11):3698.
Peters, J., and Schaal, S. (2008a). Natural actor-critic. Neurocomputing, 71:1180–1190.
Peters, J., and Schaal, S. (2008b). Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697.
Pham, V., Kermorvant, C., and Louradour, J. (2013). Dropout Improves Recurrent Neural Networks for Handwriting Recognition. arXiv preprint arXiv:1312.4569.
Pineda, F. J. (1987). Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 19(59):2229–2232.
Plate, T. A. (1993). Holographic recurrent networks. In S. J. Hanson, J. D. C., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 34–41. Morgan Kaufmann.
Plumbley, M. D. (1991). On information theory and unsupervised neural networks. Dissertation, published as technical report CUED/F-INFENG/TR.78, Engineering Department, Cambridge University.
Pollack, J. B. (1988). Implications of recursive distributed representations. In Proc. NIPS, pages 527–536.
Pollack, J. B. (1990). Recursive distributed representation. Artificial Intelligence, 46:77–105.
Pontryagin, L. S., Boltyanskii, V. G., Gamrelidze, R. V., and Mishchenko, E. F. (1961). The Mathematical Theory of Optimal Processes.
Post, E. L. (1936). Finite combinatory processes-formulation 1. The Journal of Symbolic Logic, 1(3):103–105.
Precup, D., Sutton, R. S., and Singh, S. (1998). Multi-time models for temporally abstract planning. pages 1050–1056. Morgan Kaufmann.
Raina, R., Madhavan, A., and Ng, A. (2009). Large-scale deep unsupervised learning using graphics processors. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pages 873–880. ACM.
Ramacher, U., Raab, W., Anlauf, J., Hachmann, U., Beichter, J., Bruels, N., Wesseling, M., Sicheneder, E., Maenner, R., Glaess, J., and Wurz, A. (1993). Multiprocessor and memory architecture of the neurocomputer SYNAPSE-1. International Journal of Neural Systems, 04(04):333–336.
Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2006). Efficient learning of sparse representations with an energy-based model. In et al., J. P., editor, Advances in Neural Information Processing Systems (NIPS 2006). MIT Press.
Ranzato, M. A., Huang, F., Boureau, Y., and LeCun, Y. (2007). Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Proc. Computer Vision and Pattern Recognition Conference (CVPR'07), pages 1–8. IEEE Press.
Rechenberg, I. (1971). Evolutionsstrategie - Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. (Evolution Strategy: Optimization of Technical Systems According to Principles of Biological Evolution.) Dissertation. Published 1973 by Fromman-Holzboog.
Redlich, A. N. (1993). Redundancy reduction as a strategy for unsupervised learning. Neural Computation, 5:289–304.
Riedmiller, M. (2005). Neural fitted Q iteration—first experiences with a data efficient neural reinforcement learning method. In Proc. ECML-2005, pages 317–328. Springer-Verlag Berlin Heidelberg.
Riedmiller, M., and Braun, H. (1993). A direct adaptive method for faster backpropagation learning: The Rprop algorithm. In Proc. IJCNN, pages 586–591. IEEE Press.
Riedmiller, M., Lange, S., and Voigtlaender, A. (2012). Autonomous reinforcement learning on raw visual input data in a real world application. In International Joint Conference on Neural Networks (IJCNN), pages 1–8, Brisbane, Australia.
Riesenhuber, M., and Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nat. Neurosci., 2(11):1019–1025.
Ring, M., Schaul, T., and Schmidhuber, J. (2011). The two-dimensional organization of behavior. In Proceedings of the First Joint Conference on Development Learning and on Epigenetic Robotics ICDL-EPIROB, Frankfurt.
Ring, M. B. (1991). Incremental development of complex behaviors through automatic construction of sensory-motor hierarchies. In Birnbaum, L., and Collins, G., editors, Machine Learning: Proceedings of the Eighth International Workshop, pages 343–347. Morgan Kaufmann.
Ring, M. B. (1993). Learning sequential tasks by incrementally adding higher orders. In S. J. Hanson, J. D. C., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 115–122. Morgan Kaufmann.
Ring, M. B. (1994). Continual Learning in Reinforcement Environments. PhD thesis, University of Texas at Austin, Austin, Texas 78712.
Rissanen, J. (1986). Stochastic complexity and modeling. The Annals of Statistics, 14(3):1080–1100.
Ritter, H., and Kohonen, T. (1989). Self-organizing semantic maps. Biological Cybernetics, 61(4):241–254.
Robinson, A. J., and Fallside, F. (1987). The utility driven dynamic error propagation network. Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department.
Robinson, T., and Fallside, F. (1989). Dynamic reinforcement driven error propagation networks with application to game playing. In Proceedings of the 11th Conference of the Cognitive Science Society, Ann Arbor, pages 836–843.
Rodriguez, P., and Wiles, J. (1998). Recurrent neural networks can learn to implement symbol-sensitive counting. In Advances in Neural Information Processing Systems, volume 10, pages 87–93. The MIT Press.
Rodriguez, P., Wiles, J., and Elman, J. (1999). A recurrent neural network that learns to count. Connection Science, 11(1):5–40.
Rohwer, R. (1989). The 'moving targets' training method. In Kindermann, J., and Linden, A., editors, Proceedings of 'Distributed Adaptive Neural Information Processing', St. Augustin, 24.–25.5. Oldenbourg.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386.
Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan, New York.
Roux, L., Racoceanu, D., Lomenie, N., Kulikova, M., Irshad, H., Klossa, J., Capron, F., Genestie, C., Naour, G. L., and Gurcan, M. N. (2013). Mitosis detection in breast cancer histological images - an ICPR 2012 contest. J. Pathol. Inform., 4:8.
Rubner, J., and Schulten, K. (1990). Development of feature detectors by self-organization: A network model. Biological Cybernetics, 62:193–199.
Rubner, J., and Tavan, P. (1989). A self-organization network for principal-component analysis. Europhysics Letters, 10:693–698.
Rückstieß, T., Felder, M., and Schmidhuber, J. (2008). State-Dependent Exploration for policy gradient methods. In et al., W. D., editor, European Conference on Machine Learning (ECML) and Principles and Practice of Knowledge Discovery in Databases 2008, Part II, LNAI 5212, pages 234–249.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning internal representations by error propagation. In Rumelhart, D. E., and McClelland, J. L., editors, Parallel Distributed Processing, volume 1, pages 318–362. MIT Press.
Rummery, G., and Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG-TR 166, Cambridge University, UK.
Russell, S., and Norvig, P. (1994). Artificial Intelligence: A Modern Approach. Prentice Hall, Englewood Cliffs, NJ.
Saito, K., and Nakano, R. (1997). Partial BFGS update and efficient step-length calculation for three-layer neural networks. Neural Computation, 9(1):123–141.
Sałustowicz, R. P., and Schmidhuber, J. (1997). Probabilistic incremental program evolution. Evolutionary Computation, 5(2):123–141.
Samejima, K., Doya, K., and Kawato, M. (2003). Inter-module credit assignment in modular reinforcement learning. Neural Networks, 16(7):985–994.
Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal on Research and Development, 3:210–229.
Sanger, T. D. (1989). An optimality principle for unsupervised learning. In Touretzky, D. S., editor, Advances in Neural Information Processing Systems 1, pages 11–19. Morgan Kaufmann.
Santamaría, J. C., Sutton, R. S., and Ram, A. (1997). Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior, 6(2):163–217.
Saravanan, N., and Fogel, D. B. (1995). Evolving neural control systems. IEEE Expert, pages 23–27.
Saund, E. (1994). Unsupervised learning of mixtures of multiple causes in binary data. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6, pages 27–34. Morgan Kaufmann.
Schäfer, A. M., Udluft, S., and Zimmermann, H.-G. (2006). Learning long term dependencies with recurrent neural networks. In Kollias, S. D., Stafylopatis, A., Duch, W., and Oja, E., editors, ICANN (1), volume 4131 of Lecture Notes in Computer Science, pages 71–80. Springer.
Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5:197–227.
Schaul, T., and Schmidhuber, J. (2010). Metalearning. Scholarpedia, 6(5):4650.
Schaul, T., Zhang, S., and LeCun, Y. (2013). No more pesky learning rates. In Proc. 30th International Conference on Machine Learning (ICML).
Scherer, D., Müller, A., and Behnke, S. (2010). Evaluation of pooling operations in convolutional architectures for object recognition. In Proc. International Conference on Artificial Neural Networks (ICANN), pages 92–101.
Schmidhuber, J. (1987). Evolutionary principles in self-referential learning. Diploma thesis, Institut für Informatik, Technische Universität München. http://www.idsia.ch/~juergen/diploma.html.
Schmidhuber, J. (1989a). Accelerated learning in back-propagation nets. In Pfeifer, R., Schreter, Z., Fogelman, Z., and Steels, L., editors, Connectionism in Perspective, pages 429–438. Amsterdam: Elsevier, North-Holland.
Schmidhuber, J. (1989b). A local learning algorithm for dynamic feedforward and recurrent networks. Connection Science, 1(4):403–412.
Schmidhuber, J. (1990a). Dynamische neuronale Netze und das fundamentale raumzeitliche Lernproblem. (Dynamic Neural Nets and the Fundamental Spatio-Temporal Learning Problem.) Dissertation, Institut für Informatik, Technische Universität München.
Schmidhuber, J. (1990b). Learning algorithms for networks with internal and external feedback. In Touretzky, D. S., Elman, J. L., Sejnowski, T. J., and Hinton, G. E., editors, Proc. of the 1990 Connectionist Models Summer School, pages 52–61. Morgan Kaufmann.
Schmidhuber, J. (1990c). The Neural Heat Exchanger. Talks at TU Munich (1990), University of Colorado at Boulder (1992), and Z. Li's NIPS*94 workshop on unsupervised learning. Also published at the Intl. Conference on Neural Information Processing (ICONIP'96), vol. 1, pages 194–197, 1996.
Schmidhuber, J. (1990d). An on-line algorithm for dynamic reinforcement learning and planning in reactive environments. In Proc. IEEE/INNS International Joint Conference on Neural Networks, San Diego, volume 2, pages 253–258.
Schmidhuber, J. (1991a). Curious model-building control systems. In Proceedings of the International Joint Conference on Neural Networks, Singapore, volume 2, pages 1458–1463. IEEE Press.
Schmidhuber, J. (1991b). Learning to generate sub-goals for action sequences. In Kohonen, T., Mäkisara, K., Simula, O., and Kangas, J., editors, Artificial Neural Networks, pages 967–972. Elsevier Science Publishers B.V., North-Holland.
Schmidhuber, J. (1991c). Reinforcement learning in Markovian and non-Markovian environments. In Lippman, D. S., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems 3 (NIPS 3), pages 500–506. Morgan Kaufmann.
Schmidhuber, J. (1992a). A fixed size storage O(n³) time complexity learning algorithm for fully recurrent continually running networks. Neural Computation, 4(2):243–248.
Schmidhuber, J. (1992b). Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234–242. (Based on TR FKI-148-91, TUM, 1991).
Schmidhuber, J. (1992c). Learning factorial codes by predictability minimization. Neural Computation, 4(6):863–879.
Schmidhuber, J. (1993a). An introspective network that can learn to run its own weight change algorithm. In Proc. of the Intl. Conf. on Artificial Neural Networks, Brighton, pages 191–195. IEE.
Schmidhuber, J. (1993b). Netzwerkarchitekturen, Zielfunktionen und Kettenregel. (Network Architectures, Objective Functions, and Chain Rule.) Habilitationsschrift (Habilitation Thesis), Institut für Informatik, Technische Universität München.
Schmidhuber, J. (1995). Discovering solutions with low Kolmogorov complexity and high generalization capability. In Prieditis, A., and Russell, S., editors, Machine Learning: Proceedings of the Twelfth International Conference, pages 488–496. Morgan Kaufmann Publishers, San Francisco, CA.
Schmidhuber, J. (1997). Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Networks, 10(5):857–873.
Schmidhuber, J. (2002). The Speed Prior: a new simplicity measure yielding near-optimal computable predictions. In Kivinen, J., and Sloan, R. H., editors, Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002), Lecture Notes in Artificial Intelligence, pages 216–228. Springer, Sydney, Australia.
Schmidhuber, J. (2004). Optimal ordered problem solver. Machine Learning, 54:211–254.
Schmidhuber, J. (2006a). Developmental robotics, optimal artificial curiosity, creativity, music, and the fine arts. Connection Science, 18(2):173–187.
Schmidhuber, J. (2006b). Gödel machines: Fully self-referential optimal universal self-improvers. In Goertzel, B., and Pennachin, C., editors, Artificial General Intelligence, pages 199–226. Springer Verlag. Variant available as arXiv:cs.LO/0309048.
Schmidhuber, J. (2007). Prototype resilient, self-modeling robots. Science, 316(5825):688.
Schmidhuber, J. (2012). Self-delimiting neural networks. Technical Report IDSIA-08-12, arXiv:1210.0118v1 [cs.NE], The Swiss AI Lab IDSIA.
Schmidhuber, J. (2013a). My first Deep Learning system of 1991 + Deep Learning timeline 1962-2013. Technical Report arXiv:1312.5548v1 [cs.NE], The Swiss AI Lab IDSIA.
Schmidhuber, J. (2013b). POWERPLAY: Training an Increasingly General Problem Solver by Continually Searching for the Simplest Still Unsolvable Problem. Frontiers in Psychology.
Schmidhuber, J., Ciresan, D., Meier, U., Masci, J., and Graves, A. (2011). On fast deep nets for AGI vision. In Proc. Fourth Conference on Artificial General Intelligence (AGI), Google, Mountain View, CA, pages 243–246.
Schmidhuber, J., Eldracher, M., and Foltin, B. (1996). Semilinear predictability minimization produces well-known feature detectors. Neural Computation, 8(4):773–786.
Schmidhuber, J., and Hochreiter, S. (1996). Guessing can outperform many long time lag algorithms. Technical Report IDSIA-19-96, The Swiss AI Lab IDSIA.
Schmidhuber, J., Hochreiter, S., and Bengio, Y. (2001). Evaluating benchmark problems by random guessing. In Kremer, S. C., and Kolen, J. F., editors, A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press.
Schmidhuber, J., and Huber, R. (1991). Learning to generate artificial fovea trajectories for target detection. International Journal of Neural Systems, 2(1 & 2):135–141.
Schmidhuber, J., Mozer, M. C., and Prelinger, D. (1993). Continuous history compression. In Hüning, H., Neuhauser, S., Raus, M., and Ritschel, W., editors, Proc. of Intl. Workshop on Neural Networks, RWTH Aachen, pages 87–95. Augustinus.
Schmidhuber, J., and Prelinger, D. (1993). Discovering predictable classifications. Neural Computation, 5(4):625–635.
Schmidhuber, J., and Wahnsiedler, R. (1992). Planning simple trajectories using neural subgoal generators. In Meyer, J. A., Roitblat, H. L., and Wilson, S. W., editors, Proc. of the 2nd International Conference on Simulation of Adaptive Behavior, pages 196–202. MIT Press.
Schmidhuber, J., Wierstra, D., Gagliolo, M., and Gomez, F. J. (2007). Training recurrent networks by Evolino. Neural Computation, 19(3):757–779.
Schmidhuber, J., Zhao, J., and Schraudolph, N. (1997a). Reinforcement learning with self-modifying policies. In Thrun, S., and Pratt, L., editors, Learning to learn, pages 293–309. Kluwer.
Schmidhuber, J., Zhao, J., and Wiering, M. (1997b). Shifting inductive bias with success-story algorithm, adaptive Levin search, and incremental self-improvement. Machine Learning, 28:105–130.
Schölkopf, B., Burges, C. J. C., and Smola, A. J., editors (1998). Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA.
Schraudolph, N., and Sejnowski, T. J. (1993). Unsupervised discrimination of clustered data via optimization of binary information gain. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems, volume 5, pages 499–506. Morgan Kaufmann, San Mateo.
Schraudolph, N. N. (2002). Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7):1723–1738.
Schraudolph, N. N., and Sejnowski, T. J. (1996). Tempering backpropagation networks: Not all weights are created equal. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems (NIPS), volume 8, pages 563–569. The MIT Press, Cambridge, MA.
Schuster, H. G. (1992). Learning by maximizing the information transfer through nonlinear noisy neurons and "noise breakdown". Phys. Rev. A, 46(4):2131–2138.
Schuster, M. (1999). On supervised learning from sequential data with applications for speech recognition. PhD thesis, Nara Institute of Science and Technology, Kyoto, Japan.
Schuster, M., and Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45:2673–2681.
Schwartz, A. (1993). A reinforcement learning method for maximizing undiscounted rewards. In Proc. ICML, pages 298–305.
Schwefel, H. P. (1974). Numerische Optimierung von Computer-Modellen. (Numerical Optimization of Computer Models.) Dissertation. Published 1977 by Birkhäuser, Basel.
Segmentation of Neuronal Structures in EM Stacks Challenge (2012). IEEE International Symposium on Biomedical Imaging (ISBI), http://tinyurl.com/d2fgh7g.
Sehnke, F., Osendorfer, C., Rückstieß, T., Graves, A., Peters, J., and Schmidhuber, J. (2010). Parameter-exploring policy gradients. Neural Networks, 23(4):551–559.
Sermanet, P., and LeCun, Y. (2011). Traffic sign recognition with multi-scale convolutional networks. In Proceedings of International Joint Conference on Neural Networks (IJCNN'11), pages 2809–2813.
Shanno, D. F. (1970). Conditioning of quasi-Newton methods for function minimization. Mathematics of Computation, 24(111):647–656.
Shannon, C. E. (1948). A mathematical theory of communication (parts I and II). Bell System Technical Journal, XXVII:379–423.
Siegelmann, H. (1992). Theoretical Foundations of Recurrent Neural Networks. PhD thesis, Rutgers, New Brunswick Rutgers, The State of New Jersey.
Siegelmann, H. T., and Sontag, E. D. (1991). Turing computability with neural nets. Applied Mathematics Letters, 4(6):77–80.
Silva, F. M., and Almeida, L. B. (1990). Speeding up back-propagation. In Eckmiller, R., editor, Advanced Neural Computers, pages 151–158, Amsterdam. Elsevier.
Simard, P., Steinkraus, D., and Platt, J. (2003). Best practices for convolutional neural networks applied to visual document analysis. In Seventh International Conference on Document Analysis and Recognition, pages 958–963.
Sims, K. (1994). Evolving virtual creatures. In Glassner, A., editor, Proceedings of SIGGRAPH '94 (Orlando, Florida, July 1994), Computer Graphics Proceedings, Annual Conference, pages 15–22. ACM SIGGRAPH, ACM Press. ISBN 0-89791-667-0.
Şimşek, Ö., and Barto, A. G. (2008). Skill characterization based on betweenness. In NIPS'08, pages 1497–1504.
Singh, S., Barto, A. G., and Chentanez, N. (2005). Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems 17 (NIPS). MIT Press, Cambridge, MA.
Singh, S. P. (1994). Reinforcement learning algorithms for average-payoff Markovian decision processes. In National Conference on Artificial Intelligence, pages 700–705.
Smith, S. F. (1980). A Learning System Based on Genetic Adaptive Algorithms. PhD thesis, Univ. Pittsburgh.
Smolensky, P. (1986). Parallel distributed processing: Explorations in the microstructure of cognition, vol. 1. chapter Information Processing in Dynamical Systems: Foundations of Harmony Theory, pages 194–281. MIT Press, Cambridge, MA, USA.
Solla, S. A. (1988). Accelerated learning in layered neural networks. Complex Systems, 2:625–640.
Solomonoff, R. J. (1964). A formal theory of inductive inference. Part I. Information and Control, 7:1–22.
Speelpenning, B. (1980). Compiling Fast Partial Derivatives of Functions Given by Algorithms. PhD thesis, Department of Computer Science, University of Illinois, Urbana-Champaign.
Srivastava, R. K., Masci, J., Kazerounian, S., Gomez, F., and Schmidhuber, J. (2013). Compete to compute. In Advances in Neural Information Processing Systems, pages 2310–2318.
Stallkamp, J., Schlipsing, M., Salmen, J., and Igel, C. (2011). INI German Traffic Sign Recognition Benchmark for the IJCNN'11 Competition.
Stanley, K. O., D'Ambrosio, D. B., and Gauci, J. (2009). A hypercube-based encoding for evolving large-scale neural networks. Artificial Life, 15(2):185–212.
Stanley, K. O., and Miikkulainen, R. (2002). Evolving neural networks through augmenting topologies. Evolutionary Computation, 10:99–127.
Steijvers, M., and Grunwald, P. (1996). A recurrent network that performs a context-sensitive prediction task. In Proceedings of the 18th Annual Conference of the Cognitive Science Society. Erlbaum.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Roy. Stat. Soc., 36:111–147.
Sun, G., Chen, H., and Lee, Y. (1993a). Time warping invariant neural networks. In S. J. Hanson, J. D. C., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 180–187. Morgan Kaufmann.
Sun, G. Z., Giles, C. L., Chen, H. H., and Lee, Y. C. (1993b). The neural network pushdown automaton: Model, stack and learning simulations. Technical Report CS-TR-3118, University of Maryland, College Park.
Sun, Y., Gomez, F., Schaul, T., and Schmidhuber, J. (2013). A Linear Time Natural Evolution Strategy for Non-Separable Functions. In Proceedings of the Genetic and Evolutionary Computation Conference, page 61, Amsterdam, NL. ACM.
Sun, Y., Wierstra, D., Schaul, T., and Schmidhuber, J. (2009). Efficient natural evolution strategies. In Proc. 11th Genetic and Evolutionary Computation Conference (GECCO), pages 539–546.
Sutton, R., and Barto, A. (1998). Reinforcement learning: An introduction. Cambridge, MA, MIT Press.
Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (1999a). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12 (NIPS), pages 1057–1063.
Sutton, R. S., Precup, D., and Singh, S. P. (1999b). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artif. Intell., 112(1-2):181–211.
Sutton, R. S., Szepesvári, C., and Maei, H. R. (2008). A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In Advances in Neural Information Processing Systems (NIPS'08), volume 21, pages 1609–1616.
Szegedy, C., Toshev, A., and Erhan, D. (2013). Deep neural networks for object detection. pages 2553–2561.
Teller, A. (1994). The evolution of mental models. In Kenneth E. Kinnear, J., editor, Advances in Genetic Programming, pages 199–219. MIT Press.
Tenenberg, J., Karlsson, J., and Whitehead, S. (1993). Learning via task decomposition. In Meyer, J. A., Roitblat, H., and Wilson, S., editors, From Animals to Animats 2: Proceedings of the Second International Conference on Simulation of Adaptive Behavior, pages 337–343. MIT Press.
Tesauro, G. (1994). TD-gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2):215–219.
Tieleman, T., and Hinton, G. (2012). Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
Tiňo, P., and Hammer, B. (2004). Architectural bias in recurrent neural networks: Fractal analysis. Neural Computation, 15(8):1931–1957.
Tonkes, B., and Wiles, J. (1997). Learning a context-free task with a recurrent neural network: An analysis of stability. In Proceedings of the Fourth Biennial Conference of the Australasian Cognitive Science Society.
Tsitsiklis, J. N., and van Roy, B. (1996). Feature-based methods for large scale dynamic programming. Machine Learning, 22(1-3):59–94.
Turing, A. M. (1936). On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, Series 2, 41:230–267.
Ueda, N. (2000). Optimal linear combination of neural networks for improving classification performance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(2):207–215.
Urlbe, A. P. (1999). Structure-adaptable digital neural networks. PhD thesis, Universidad del Valle.
Vahed, A., and Omlin, C. W. (2004). A machine learning method for extracting symbolic knowledge from recurrent neural networks. Neural Computation, 16(1):59–71.
Vapnik, V. (1992). Principles of risk minimization for learning theory. In Lippman, D. S., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems 4, pages 831–838. Morgan Kaufmann.
Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer, New York.
Veta, M., Viergever, M., Pluim, J., Stathonikos, N., and van Diest, P. J. (2013). MICCAI 2013 Grand Challenge on Mitosis Detection.
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, ICML '08, pages 1096–1103, New York, NY, USA. ACM.
Vogl, T., Mangis, J., Rigler, A., Zink, W., and Alkon, D. (1988). Accelerating the convergence of the back-propagation method. Biological Cybernetics, 59:257–263.
von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14(2):85–100.
Wallace, C. S., and Boulton, D. M. (1968). An information theoretic measure for classification. Computer Journal, 11(2):185–194.
Wan, E. A. (1994). Time series prediction by using a connectionist network with internal delay lines. In Weigend, A. S., and Gershenfeld, N. A., editors, Time series prediction: Forecasting the future and understanding the past, pages 265–295. Addison-Wesley.
Wang, C., Venkatesh, S. S., and Judd, J. S. (1994). Optimal stopping and effective machine complexity in learning. In Advances in Neural Information Processing Systems (NIPS'6), pages 303–310. Morgan Kaufmann.
Wang, S., and Manning, C. (2013). Fast dropout training. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 118–126.
Watanabe, O. (1992). Kolmogorov complexity and computational complexity. EATCS Monographs on Theoretical Computer Science, Springer.
Watanabe, S. (1985). Pattern Recognition: Human and Mechanical. Wiley, New York.
Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, King's College, Cambridge.
Watkins, C. J. C. H., and Dayan, P. (1992). Q-learning. Machine Learning, 8:279–292.
Weigend, A. S., and Gershenfeld, N. A. (1993). Results of the time series prediction competition at the Santa Fe Institute. In Neural Networks, 1993., IEEE International Conference on, pages 1786–1793. IEEE.
Weigend, A. S., Rumelhart, D. E., and Huberman, B. A. (1991). Generalization by weight-elimination with application to forecasting. In Lippmann, R. P., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems 3, pages 875–882. San Mateo, CA: Morgan Kaufmann.
Werbos, P. J. (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University.
Werbos, P. J. (1981). Applications of advances in nonlinear sensitivity analysis. In Proceedings of the 10th IFIP Conference, 31.8 - 4.9, NYC, pages 762–770.
Werbos, P. J. (1987). Building and understanding adaptive systems: A statistical/numerical approach to factory automation and brain research. IEEE Transactions on Systems, Man, and Cybernetics, 17.
Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1.
Werbos, P. J. (1989a). Backpropagation and neurocontrol: A review and prospectus. In IEEE/INNS International Joint Conference on Neural Networks, Washington, D.C., volume 1, pages 209–216.
Werbos, P. J. (1989b). Neural networks for control and system identification. In Proceedings of IEEE/CDC Tampa, Florida.
Werbos, P. J. (1992). Neural networks, system identification, and control in the chemical industries. In D. A. White, D. A. S., editor, Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pages 283–356. Thomson Learning.
Werbos, P. J. (2006). Backwards differentiation in AD and neural nets: Past links and new opportunities. In Automatic Differentiation: Applications, Theory, and Implementations, pages 15–34. Springer.
West, A. H. L., and Saad, D. (1995). Adaptive back-propagation in on-line learning of multilayer networks. In Touretzky, D. S., Mozer, M., and Hasselmo, M. E., editors, NIPS, pages 323–329. MIT Press.
White, H. (1989). Learning in artificial neural networks: A statistical perspective. Neural Computation, 1(4):425–464.
Whiteson, S., Kohl, N., Miikkulainen, R., and Stone, P. (2005). Evolving keepaway soccer players through task decomposition. Machine Learning, 59(1):5–30.
Widrow, B., and Hoff, M. (1962). Associative storage and retrieval of digital information in networks of adaptive neurons. Biological Prototypes and Synthetic Systems, 1:160.
Widrow, B., Rumelhart, D. E., and Lehr, M. A. (1994). Neural networks: Applications in industry, business and science. Commun. ACM, 37(3):93–105.
Wiering, M., and Schmidhuber, J. (1996). Solving POMDPs with Levin search and EIRA. In Saitta, L., editor, Machine Learning: Proceedings of the Thirteenth International Conference, pages 534–542. Morgan Kaufmann Publishers, San Francisco, CA.
Wiering, M., and Schmidhuber, J. (1998a). HQ-learning. Adaptive Behavior, 6(2):219–246.
Wiering, M. A., and Schmidhuber, J. (1998b). Fast online Q(λ). Machine Learning, 33(1):105–116.
Wierstra, D., Foerster, A., Peters, J., and Schmidhuber, J. (2007). Solving deep memory POMDPs with recurrent policy gradients. In ICANN (1), volume 4668 of Lecture Notes in Computer Science, pages 697–706. Springer.
Wierstra, D., Foerster, A., Peters, J., and Schmidhuber, J. (2010). Recurrent policy gradients. Logic Journal of IGPL, 18(2):620–634.
Wierstra, D., Schaul, T., Peters, J., and Schmidhuber, J. (2008). Natural evolution strategies. In Congress of Evolutionary Computation (CEC 2008).
Wiesel, T. N., and Hubel, D. H. (1959). Receptive fields of single neurones in the cat's striate cortex. J. Physiol., 148:574–591.
Wiles, J., and Elman, J. (1995). Learning to count without a counter: A case study of dynamics and activation landscapes in recurrent networks. In Proceedings of the Seventeenth Annual Conference of the Cognitive Science Society, pages 482–487, Cambridge, MA. MIT Press.
Wilkinson, J. H., editor (1965). The Algebraic Eigenvalue Problem. Oxford University Press, Inc., New York, NY, USA.
Williams, R. J. (1986). Reinforcement-learning in connectionist networks: A mathematical analysis. Technical Report 8605, Institute for Cognitive Science, University of California, San Diego.
Williams, R. J. (1988). Toward a theory of reinforcement-learning connectionist systems. Technical Report NU-CCS-88-3, College of Comp. Sci., Northeastern University, Boston, MA.
Williams, R. J. (1989). Complexity of exact gradient computation algorithms for recurrent neural networks. Technical Report NU-CCS-89-27, Boston: Northeastern University, College of Computer Science.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256.
Williams, R. J., and Peng, J. (1990). An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 4:491–501.
Williams, R. J., and Zipser, D. (1988). A learning algorithm for continually running fully recurrent networks. Technical Report ICS Report 8805, Univ. of California, San Diego, La Jolla.
Williams, R. J., and Zipser, D. (1989a). Experimental analysis of the real-time recurrent learning algorithm. Connection Science, 1(1):87–111.
Williams, R. J., and Zipser, D. (1989b). A learning algorithm for continually running fully recurrent networks. Neural Computation, 1(2):270–280.
Willshaw, D. J., and von der Malsburg, C. (1976). How patterned neural connections can be set up by self-organization. Proc. R. Soc. London B, 194:431–445.
Wiskott, L., and Sejnowski, T. (2002). Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770.
Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5(2):241–259.
Wolpert, D. H. (1994). Bayesian backpropagation over i-o functions rather than weights. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6, pages 200–207. Morgan Kaufmann.
Yamauchi, B. M., and Beer, R. D. (1994). Sequential behavior and learning in evolved dynamical neural networks. Adaptive Behavior, 2(3):219–246.
Yamins, D., Hong, H., Cadieu, C., and DiCarlo, J. J. (2013). Hierarchical Modular Optimization of Convolutional Networks Achieves Representations Similar to Macaque IT and Human Ventral Stream. Advances in Neural Information Processing Systems (NIPS), pages 1–9.
Yao, X. (1993). A review of evolutionary artificial neural networks. International Journal of Intelligent Systems, 4:203–222.
Yu, X.-H., Chen, G.-A., and Cheng, S.-X. (1995). Dynamic learning rate optimization of the backpropagation algorithm. IEEE Transactions on Neural Networks, 6(3):669–677.
Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. CoRR, abs/1212.5701.
Zeiler, M. D., and Fergus, R. (2013). Visualizing and understanding convolutional networks. Technical Report arXiv:1311.2901 [cs.CV], NYU.
Zemel, R. S. (1993). A minimum description length framework for unsupervised learning. PhD thesis, University of Toronto.
Zemel, R. S., and Hinton, G. E. (1994). Developing population codes by minimizing description length. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6, pages 11–18. Morgan Kaufmann.
Zeng, Z., Goodman, R., and Smyth, P. (1994). Discrete recurrent neural networks for grammatical inference. IEEE Transactions on Neural Networks, 5(2).
Zimmermann, H.-G., Tietz, C., and Grothmann, R. (2012). Forecasting with recurrent neural networks: 12 tricks. In Montavon, G., Orr, G. B., and Müller, K.-R., editors, Neural Networks: Tricks of the Trade (2nd ed.), volume 7700 of Lecture Notes in Computer Science, pages 687–707. Springer.