Surveying the sky using machine learning (SkyML)

The SkyML project has ended; see the publication list below for our achievements during the three-year project period. The review article

Jan Kremer, Kristoffer Stensbo-Smidt, Fabian Gieseke, Kim Steenstrup Pedersen, and Christian Igel. Big Universe, Big Data: Machine Learning and Image Analysis for Astronomy. IEEE Intelligent Systems 32(2), pp. 16-22, 2017.

summarizes some of our findings.  

Original project description

Astrophysics and cosmology are rich with data. The advent of wide-area digital cameras on large aperture telescopes has led to ever more ambitious surveys of the sky. The data volume of an entire survey of a decade ago can now be acquired in a single night and real-time analysis is often desired.

Our goal is to advance astrophysics research by developing efficient and specialized machine learning (ML) and image analysis techniques for these large-scale survey data. We will work on the wealth of data already available and will prepare for planned missions such as Gaia, the Large Synoptic Survey Telescope (LSST), and the Euclid satellite. These missions will collect an even larger data volume consisting of hundreds of data points for each of more than a billion objects. The magnitude of these surveys makes manual examination impossible. Advanced ML systems can solve this problem by automating the analysis. They are able to uncover the relation between input data (e.g., galaxy images) and outputs (e.g., galaxy physical properties) based on input-output samples. However, there are no ready-made solutions; data analysis in astronomy and cosmology poses scientific challenges to ML research and we will develop novel algorithms to address them. Because of the large amounts of data and time constraints when observing time-variable targets, we need highly efficient methods. Furthermore, the learning algorithms must cope with theoretical and practical problems due to sample selection bias: In astronomy the distributions of training and testing data (the data for building and applying models respectively) are often substantially different. This mismatch is due to only having training sets from old surveys while upcoming missions will probe never-before-seen regions in the astrophysical parameter space. Such systematic differences between samples in the training and testing data have to be addressed by the learning system.
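A common way to handle the sample selection bias described above is to reweight each training example by an estimate of the ratio between the test and the training density at that point. The following Python sketch illustrates one simple nearest-neighbour estimator of this ratio. It is an illustration under stated assumptions, not the project's implementation: the function name `knn_density_ratio`, the parameter `k`, and the synthetic Gaussian data are hypothetical choices made for this example, and the brute-force distance computation is only suitable for small data sets.

```python
import numpy as np


def knn_density_ratio(train, test, k=5):
    """Estimate p_test(x) / p_train(x) at every training point.

    For each training point, take the distance to its k-th nearest
    training neighbour as a radius and count the test points that fall
    inside it. The ratio of the two counts, rescaled by the sample
    sizes, approximates the density ratio at that point.
    """
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    if not 1 <= k < len(train):
        raise ValueError("k must satisfy 1 <= k < number of training points")

    # Brute-force pairwise Euclidean distances (illustrative only).
    d_train = np.linalg.norm(train[:, None, :] - train[None, :, :], axis=-1)
    d_test = np.linalg.norm(train[:, None, :] - test[None, :, :], axis=-1)

    # After sorting, column 0 is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest neighbour.
    radius = np.sort(d_train, axis=1)[:, k]

    # Number of test points inside each training point's radius.
    counts = (d_test <= radius[:, None]).sum(axis=1)

    return (len(train) / len(test)) * counts / k


# Hypothetical data: the test sample is shifted relative to the training sample.
rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=(200, 1))
test = rng.normal(loc=1.0, scale=1.0, size=(200, 1))

weights = knn_density_ratio(train, test, k=10)
print(weights[:5])
```

Training points lying where the test sample is dense receive larger weights. Such weights could, for instance, serve as per-example costs in a cost-sensitive learner.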

[image analysis for galaxies]

We will consider both transient event detection and galaxy classification. Transient events are unpredictable, short-lived changes (lasting between microseconds and weeks) in astrophysical objects, for instance a supernova or a variable star. We strive for new methods that detect them more reliably and quickly, enabling time-critical follow-up observations.

Understanding galaxies and their evolution has been a prime concern for astrophysicists since the time of Edwin Hubble. Large surveys now collect images of millions of galaxies. We will develop image analysis and ML techniques to improve the classification of galaxy morphology and the estimation of other physical parameters. Methodologically, we will focus on (multi-class) support vector machines (SVMs), which are well understood theoretically and provide excellent classification performance. To apply them to large-scale survey data, we will develop efficient online learning algorithms for consistent multi-class SVMs. Both the learning and the evaluation will be scaled up by exploiting multi-core hardware architectures. We will develop methods to tame sample selection bias for SVMs based on cost-sensitive learning and new variants of active learning, which has been shown to increase the accuracy of photometric variable star classification. For image analysis, we will employ tailored local image features capturing both image structure and texture. The methodological improvements we aim for are driven by our applications, but should advance the field of ML in general. Data for the project is available in the form of existing surveys; however, our long-term goal is to prepare for upcoming missions, in particular Euclid.

Core team

Peer-reviewed Publications

Fabian Gieseke, Cosmin Eugen Oancea, Ashish Mahabal, Christian Igel, and Tom Heskes. Bigger Buffer k-d Trees on Multi-Many-Core Systems. In High Performance Computing for Computational Science (VECPAR 2018), pp. 202–214, 2019
Jan Kremer, Kristoffer Stensbo-Smidt, Fabian Gieseke, Kim Steenstrup Pedersen, and Christian Igel. Big Universe, Big Data: Machine Learning and Image Analysis for Astronomy. IEEE Intelligent Systems 32(2), pp. 16-22, 2017
Kristoffer Stensbo-Smidt, Fabian Gieseke, Christian Igel, Andrew Zirm, and Kim Steenstrup Pedersen. Sacrificing information for the greater good: how to select photometric bands for optimal accuracy. Monthly Notices of the Royal Astronomical Society 464(3), pp. 2577–2596, 2017
Kai Lars Polsterer, Fabian Gieseke, Christian Igel, Bernd Doser, and Nikos Gianniotis. Parallelized rotation and flipping INvariant Kohonen maps (PINK) on GPUs. 24th European Symposium on Artificial Neural Networks (ESANN 2016), 2016
Ürün Dogan, Tobias Glasmachers, and Christian Igel. A Unified View on Multi-class Support Vector Classification. Journal of Machine Learning Research 17(45), pp. 1-32, 2016
Jan Kremer, Fabian Gieseke, Kim Steenstrup Pedersen, and Christian Igel. Nearest Neighbor Density Ratio Estimation for Large-Scale Applications in Astronomy. Astronomy and Computing 12, pp. 67-72, 2015
Kai Lars Polsterer, Fabian Gieseke, and Christian Igel. Automatic classification of galaxies via machine learning techniques: Parallelized Rotation/Flipping INvariant Kohonen Maps (PINK). In A. R. Taylor and E. Rosolowsky, eds.: Astronomical Data Analysis Software and Systems (ADASS XXIV), pp. 81-86. Astronomical Society of the Pacific, ASP Conference Series 495, 2015. Best Poster Award
Fabian Gieseke, Kai Lars Polsterer, Cosmin E. Oancea, and Christian Igel. Speedy Greedy Feature Selection: Better Redshift Estimation via Massive Parallelism. In M. Verleysen, ed.: 22nd European Symposium on Artificial Neural Networks (ESANN 2014), pp. 87-92, Belgium: i6doc.com, 2014
Fabian Gieseke, Justin Heinermann, Cosmin Oancea, and Christian Igel. Buffer k-d Trees: Processing Massive Nearest Neighbor Queries on GPUs. JMLR W&CP 32 (ICML), pp. 172-180, 2014
Jan Kremer, Kim Steenstrup Pedersen, and Christian Igel. Active Learning with Support Vector Machines. WIREs Data Mining and Knowledge Discovery 4(4), pp. 313-326, 2014
Kai Lars Polsterer, Fabian Gieseke, Christian Igel, and Tomotsugu Goto. Improving the Performance of Photometric Regression Models via Massive Parallel Feature Selection. In N. Manset and P. Forshay, eds.: 23rd Annual Astronomical Data Analysis Software and Systems Conference (ADASS XXIII), ASP Conference Series 485, 2014
Fabian Gieseke, Christian Igel, and Tapio Pahikkala. Polynomial runtime bounds for fixed-rank unsupervised least-squares classification. JMLR W&CP 29 (ACML), pp. 62-71, 2013
Kim Steenstrup Pedersen, Kristoffer Stensbo-Smidt, Andrew Zirm, and Christian Igel. Shape Index Descriptors Applied to Texture-Based Galaxy Analysis. International Conference on Computer Vision (ICCV), pp. 2440-2447, IEEE Press, 2013
Kristoffer Stensbo-Smidt, Christian Igel, Andrew Zirm, and Kim Steenstrup Pedersen. Nearest Neighbour Regression Outperforms Model-based Prediction of Specific Star Formation Rate. IEEE International Conference on Big Data 2013, pp. 141-144, IEEE Press, 2013

Popular Science Publications

Kristoffer Stensbo-Smidt. Modellering i gymnasiet. In Naturen i Computeren, pp. 3-4, Faculty of Science, University of Copenhagen, 2015
Kristoffer Stensbo-Smidt. Universets big data. In Naturen i Computeren, pp. 14-15, Faculty of Science, University of Copenhagen, 2015

MSc Projects and Theses

Dolores Messer. Shear estimation based on large-scale image structure within the framework of the GREAT3 challenge - Image analysis, Machine learning, Gravitational lensing. MSc in Mathematical Modelling and Computation, Technical University of Denmark and University of Zurich, 2015
Aske Dörge. Massively Parallel Convolutional Neural Networks in SHARK. MSc in Computer Science, University of Copenhagen, 2015
Aske Dörge and Christian Halfdan Gath. Galaxy Classification. MSc project in Computer Science, University of Copenhagen, 2015
Joachim Mortensen. Transient-Event Exploration in SDSS Stripe 82. MSc in Physics, University of Copenhagen, 2014
Jens Patrick Raaby. An objective categorization of auroral substorms - Exploring large scale morphology. MSc in Computer Science, University of Copenhagen, 2014

Contact

For information regarding SkyML, please contact Christian Igel or Kim Steenstrup Pedersen.

Funding

The project is funded by the Danish Council for Independent Research, Natural Sciences.