Summer of Haskell

GSoC 2021 Ideas

This is a list of ideas for students who are considering applying to Google Summer of Code 2021 for Haskell.org.

For project maintainers

Are you working on a Haskell project that could use the help of a student during the summer? Consider contributing it as an idea here! You can contribute ideas by sending a pull request to our GitHub repository. If you just want to discuss a possible idea, please contact us.

What is a good idea? Anything that improves the Haskell ecosystem is valid. The GSoC rules state that it must involve writing code primarily (as opposed to docs).

Projects should be concrete and small enough in scope that a student can finish them in three months. Past experience has shown that keeping projects “small” is almost always a good idea.

Important change for 2021: In the past, GSoC projects were expected to take up the equivalent of full-time employment for a student. In 2021, however, this has been reduced to half-time positions: students are expected to work around 175 hours in a 10-week period.

Projects should benefit as many people as possible – e.g. an improvement to GHC will benefit more people than an update to a specific library or tool, but both are acceptable. New libraries and applications written in Haskell, rather than improvements to existing ones, are also welcome.

For students

Please be aware that:

Table of Contents

  1. Dhall bindings to TOML
  2. ghc-debug
  3. Haskell support in CodeMirror 6
  4. Practical Machine Learning with Hasktorch
  5. Haskell Language Server
  6. Restore ihaskell-widgets
  7. Pandoc Figures
  8. Stack
  9. Live coding algorithmic patterns with TidalCycles

Dhall bindings to TOML

Dhall is an interpreted programmable configuration language that you can think of as: JSON + functions + types + imports. Almost all of the language’s supporting tooling is implemented in Haskell, including tools to convert between Dhall and other configuration file formats (like JSON or YAML).

Dhall does not currently support a TOML binding, though, and the scope of this project is to add support for converting bidirectionally between Dhall configuration files and TOML files. Specifically, this project would entail creating a new dhall-toml package that would provide dhall-to-toml and toml-to-dhall executables.

This project is suitable for an intermediate Haskell programmer, and no prior knowledge of or familiarity with Dhall is required. The student would be able to consult the existing dhall-yaml and dhall-json packages, so even though they would be scaffolding a new package they wouldn’t be starting from scratch.
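To give a feel for the Dhall-to-TOML direction, here is a minimal sketch. It assumes a hypothetical `Value` type standing in for a normalised Dhall expression; it is not the real Dhall AST, and `renderToml` is an illustrative name rather than part of any existing package. Top-level scalar fields become `key = value` lines and each nested record becomes a `[table]`.

```haskell
module Main where

import Data.List (intercalate)

-- Hypothetical stand-in for a normalised Dhall expression (NOT the Dhall AST).
data Value
  = VText String
  | VNatural Integer
  | VBool Bool
  | VList [Value]
  | VRecord [(String, Value)]
  deriving (Show, Eq)

-- Render a value in inline position (right-hand side of `key = ...`).
renderInline :: Value -> String
renderInline (VText s)    = show s -- Haskell escaping only approximates TOML strings
renderInline (VNatural n) = show n
renderInline (VBool b)    = if b then "true" else "false"
renderInline (VList vs)   = "[" ++ intercalate ", " (map renderInline vs) ++ "]"
renderInline (VRecord fs) =
  "{ " ++ intercalate ", " [k ++ " = " ++ renderInline v | (k, v) <- fs] ++ " }"

-- Render a top-level record: plain keys first, then one [table] per nested record.
renderToml :: [(String, Value)] -> String
renderToml fields = unlines (plain ++ concatMap table nested)
  where
    plain  = [k ++ " = " ++ renderInline v | (k, v) <- fields, not (isRecord v)]
    nested = [(k, fs) | (k, VRecord fs) <- fields]
    table (k, fs) =
      ["", "[" ++ k ++ "]"] ++ [k' ++ " = " ++ renderInline v | (k', v) <- fs]
    isRecord (VRecord _) = True
    isRecord _           = False

example :: [(String, Value)]
example =
  [ ("name", VText "demo")
  , ("server", VRecord [("port", VNatural 8080), ("tls", VBool True)])
  ]

main :: IO ()
main = putStr (renderToml example)
```

A real dhall-toml package would work on Dhall's own expression type and would also need the reverse direction, which requires a Dhall type to guide the conversion of TOML values; the sketch leaves both out.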

Mentors:

Difficulty: Intermediate

ghc-debug

ghc-debug is a new heap profiling tool which can be used to answer precise questions about the memory usage of Haskell programs. It works, and has already been used to analyse problems in big codebases such as GHC. It would be great if a capable student could help take it to the next level and inject some fresh ideas into the project.

There are several possible avenues which could be explored:

This is an advanced project because it requires an understanding of the runtime representation of Haskell programs. Deep knowledge of the RTS is not necessary.

Anyone interested in this project should make sure to contact me before writing their proposal. I would also expect a successful applicant to have completed at least one merge request before a proposal is submitted.

Mentors: Matthew Pickering

Difficulty: Advanced

Haskell support in CodeMirror 6

CodeMirror is a popular web-based code editor, with support for many languages. CodeMirror has a first-class Haskell mode up through version 5. In version 6, though, language support has been redesigned from the ground up. In particular:

Haskell no longer has first-class language support in CodeMirror 6. Instead, there is only a compatibility shim around the version 5 mode. This shim lacks any of the advantages of the new model: the shim doesn’t produce a true abstract syntax tree, doesn’t recover state well when an error exists in the source, etc.

An interesting project could be to implement a first-class Haskell language mode for CodeMirror 6. This would be a basis on which a wide variety of web-based Haskell tooling could be built. In particular, the CodeWorld project which provides an online Haskell playground and education tool is based on CodeMirror, and would eventually adopt such a mode.

This project is best suited for a student who has significant understanding of JavaScript, but wants to work with something in the Haskell tooling space. It also requires some understanding of parsing, and the Haskell grammar.

Mentors: Chris Smith

Difficulty: Intermediate

Practical Machine Learning with Hasktorch

Hasktorch is a library for neural networks and tensor math in Haskell. It leverages the C++ backend of PyTorch for fast numerical computation with GPU support. Our goal with Hasktorch is to provide a platform for machine learning using typed functional programming.

This summer, we have selected three exciting projects for GSoC contributors:

Integration between Hasktorch and Huggingface

Make state-of-the-art pre-trained neural network language models available in Haskell.

The Huggingface open-source Python libraries have become the de facto standard for deep-learning-based Natural Language Processing (NLP) for researchers, practitioners, and educators.

We aim to unlock a number of NLP features and capabilities for Haskellers that Huggingface provides:

The goal is not only to make this functionality as frictionless as it is in Python, but also to add type safety and idiomatic functional abstractions. For instance, generating natural language from a pre-trained T5 model in Haskell can look like this:

λ> type GPU = 'Device ('CUDA 0)
λ> model <-
    initialize
      @(T5Large 'WithLMHead GPU)
      "t5-large.pt"
λ> g <- mkGenerator @GPU 0
λ> type BatchSize = 1
λ> type MaxSeqSize = 64
λ> input <-
    mkT5Input
      @GPU @BatchSize @MaxSeqSize
      ["translate English to German: Monads are monoids in the category of endofunctors."]
λ> runBeamSearch 1 model input g
["Monaden sind Monoide in der Kategorie der Endofunktoren."]

A proof of concept for NLG from a T5 model already exists, but it lacks an essential component: a tokenizer. A natural language tokenizer encodes input text like the above as lists of integers that the model has learned to interpret. To this end, Huggingface provides a tokenization library written in Rust. One potential GSoC project is to create Haskell bindings for this library. For this project, the student should be proficient in both Haskell and Rust. Of course, other projects in service of the Huggingface integration agenda can be pursued, too.
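The role of a tokenizer can be illustrated with a toy example. The sketch below assumes a tiny hypothetical vocabulary (`toyVocab`) and splits on whitespace only; real tokenizers such as those in the Huggingface library use learned subword vocabularies, so this shows the interface (text in, list of integer ids out), not the actual algorithm.

```haskell
module Main where

import Data.Char (toLower)
import qualified Data.Map.Strict as Map

type Vocab = Map.Map String Int

-- Hypothetical toy vocabulary; ids start at 1, id 0 is reserved for unknown words.
toyVocab :: Vocab
toyVocab =
  Map.fromList
    (zip ["translate", "english", "to", "german", "monads", "are", "monoids"] [1 ..])

unknownId :: Int
unknownId = 0

-- Lower-case the text, split on whitespace, and look each word up in the vocabulary.
encode :: Vocab -> String -> [Int]
encode vocab = map lookupId . words . map toLower
  where
    lookupId w = Map.findWithDefault unknownId w vocab

main :: IO ()
main = print (encode toyVocab "Translate English to German monads are burritos")
-- prints [1,2,3,4,5,6,0]
```

Bindings to the Rust library would replace `encode` with a foreign call, but the Haskell-facing type could stay close to this shape.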

Potential mentors: Austin Huang, Torsten Scholak

Difficulty: Intermediate

Model Monitoring And Data Version Control

Haskell support for DVC (data version control https://github.com/iterative/dvc) - a library that defines cross-language protocols supporting versioning datasets for machine learning and tracking/persistence of ML experiments.

Wandb: A central dashboard to keep track of your hyperparameters, system metrics, and predictions so you can compare models live, and share your findings.

Potential mentors: Austin Huang, Torsten Scholak

Difficulty: Intermediate

Gradually Typed Hasktorch

Gradually typed Hasktorch, Torch.GraduallyTyped, is a new API for tensors and neural networks that interpolates between the already existing unchecked (untyped) and checked (typed) Hasktorch APIs, Torch and Torch.Typed, respectively. Thus far, users have to choose whether they want to commit fully to either typed or untyped tensors and models. The new gradually typed API relaxes this black-and-white tradeoff and makes the decision more granular. In Torch.GraduallyTyped, users can choose whether or not they want type checking for every individual type variable, like a tensor’s compute device (e.g. the CPU or a GPU), its precision (Bool, Int64, Float, etc.), or its shape (the names and sizes of its dimensions). Thus, users can enjoy the flexibility of an unchecked API for rapid prototyping, while they can also add as much type checking as they want later on. Alternatively, users can start with fully checked tensor and model types and relax them when and where they get in the way. Thus, Torch.GraduallyTyped combines the best of both worlds, of checked and of unchecked Hasktorch.

More concretely, consider the existing unchecked and checked APIs. Torch.Tensor is an untyped wrapper around a reference to a libtorch tensor. None of its properties are tracked by Haskell’s type system. A Torch.Typed.Tensor, on the other hand, has three type annotations: a static device (of kind (DeviceType, Nat)), a precision (of kind DType), and a shape (of kind [Nat]). By contrast, in the gradually typed API, all these types become optional:

The existing unchecked and checked APIs are thus special cases of the new gradually typed API. Indeed, one could define:

type UncheckedTensor = Tensor 'UncheckedDevice 'UncheckedDataType 'UncheckedShape
type CheckedTensor deviceType dtype dims = Tensor ('Device deviceType) ('DataType dtype) ('Shape (ToGradualDims dims))

(Here, ToGradualDims is a helper type family that converts types of kind [(Symbol, Nat)] to those of kind [Dim (Name Symbol) (Size Nat)].) UncheckedTensor is equivalent to the fully unchecked Torch.Tensor, and CheckedTensor is equivalent to the fully checked Torch.Typed.Tensor. Beyond these two limit cases, one can see that there are many more configurations of partial checking. For instance, a compute device could be statically known but the shape could be statically unknown. We can even represent the case in which the number of dimensions is statically known but only some dimensions have a statically known size or name.
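The idea of optional type annotations can be sketched in a few lines. The definitions below are a simplified assumption for illustration, not the actual Torch.GraduallyTyped code: the shape is a plain list of sizes rather than named dimensions, and the tensor payload is just a list of doubles. Each annotation kind gains an extra "unchecked" constructor, so one tensor type covers fully checked, fully unchecked, and mixed configurations.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE KindSignatures #-}
module Main where

import GHC.TypeLits (Nat)

-- Each wrapper adds an "unchecked" alternative to a static annotation.
data Device deviceType = UncheckedDevice | Device deviceType
data DataType dtype = UncheckedDataType | DataType dtype
data Shape dims = UncheckedShape | Shape dims

data DeviceType = CPU | CUDA Nat
data DType = Bool | Int64 | Float

-- Simplified tensor: phantom annotations over a flat list of values.
newtype Tensor
  (device :: Device DeviceType)
  (dataType :: DataType DType)
  (shape :: Shape [Nat])
  = Tensor { unTensor :: [Double] }

-- Fully unchecked, and a mixed configuration: device and shape are static,
-- the data type is left unchecked.
type UncheckedTensor = Tensor 'UncheckedDevice 'UncheckedDataType 'UncheckedShape
type PartlyChecked = Tensor ('Device 'CPU) 'UncheckedDataType ('Shape '[2, 3])

-- Element-wise addition; both arguments must carry identical annotations.
add :: Tensor device dataType shape
    -> Tensor device dataType shape
    -> Tensor device dataType shape
add (Tensor xs) (Tensor ys) = Tensor (zipWith (+) xs ys)

main :: IO ()
main = do
  let a = Tensor [1, 2, 3, 4, 5, 6] :: PartlyChecked
      b = Tensor [10, 20, 30, 40, 50, 60] :: PartlyChecked
  print (unTensor (add a b))
```

With these definitions, passing an `UncheckedTensor` and a `PartlyChecked` tensor to `add` is rejected at compile time because the annotations differ. How checked and unchecked annotations should be reconciled in such cases is exactly the kind of design question the real API has to answer.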

Extending the tensor types in the above way has some interesting consequences. Signatures of functions that operate on gradually typed tensors now depend on the information that is statically available, and they have to propagate and process the information in a way that is compatible with their function. For example, the nonzero function returns a tensor where each row contains a list of indices of all non-zero elements of the input tensor. Since the number of non-zero elements is not known at compile time, the output tensor has an unknown number of rows. It cannot be checked, and hence the output shape is 'Shape '[ 'Dim ('Name "*") UncheckedSize, 'Dim ('Name "*") inputDimNum], where inputDimNum is the (potentially unknown) number of dimensions of the input tensor.
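A value-level model makes the data dependence of nonzero easy to see. The sketch below is an assumption for illustration only: it works on a 2-D input given as a list of rows instead of a real tensor. The number of columns in the result always equals the number of input dimensions (two here), while the number of rows depends on the values.

```haskell
module Main where

-- Value-level model of `nonzero` for a 2-D input given as a list of rows.
-- Each result row holds the [row, column] indices of one non-zero element.
nonzero :: [[Double]] -> [[Int]]
nonzero rows =
  [ [i, j]
  | (i, row) <- zip [0 ..] rows
  , (j, x) <- zip [0 ..] row
  , x /= 0
  ]

main :: IO ()
main = do
  print (nonzero [[0, 1], [2, 0]]) -- prints [[0,1],[1,0]]
  print (nonzero [[0, 0], [0, 0]]) -- prints []
```

Both inputs have the same static shape, yet the outputs have two rows and zero rows respectively, which is why the first output dimension has to stay unchecked.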

Gradually typed Hasktorch has been developed steadily in its own branch, https://github.com/hasktorch/hasktorch/tree/gradually-typed-hasktorch. Our goal is to bring it to maturity this year. For GSoC, we are looking for individuals who are interested in developing the gradually typed API further, adding missing functionality, and testing out new ideas.

Potential mentor: Torsten Scholak

Difficulty: Intermediate

Haskell Language Server

Haskell Language Server is a full-featured Haskell IDE that recently reached version 1.0. Since it is a large project, there are many possible ideas for students.

This thread has discussions around projects, and there are a number of issues tagged as Eligible for GSoC 2021.

Mentors: the Haskell Language Server team

Restore ihaskell-widgets

IHaskell is a kernel that allows you to create interactive Jupyter notebooks. ihaskell-widgets is a library for making the IHaskell kernel work with ipywidgets.

ipywidgets provide GUI controls for individual named variables in a Jupyter notebook. They are used, for example, to parameterize a rendered graph, plot, or chart with a GUI slider which instantly updates the rendering with the new parameter. ihaskell-widgets stopped working with ipywidgets version 7 in 2018.

This GitHub issue has some of the particulars: https://github.com/gibiansky/IHaskell/issues/870.

Potential mentors: Vaibhav Sagar, James Brock, Sumit Sahrawat, Rehno Lindeque

Difficulty: Beginner

Pandoc Figures

Pandoc, the universal document converter written in Haskell, is not only a versatile conversion tool but has also become a central part of scholarly publishing pipelines. It is an integral part of R Markdown, which is used by many scientists. It is also used in the production of academic journals, e.g. JOSS and kommunikation@gesellschaft.

Figures are an integral part of scientific communication, and of documents in general. The goal of this project is to extend pandoc’s basic handling of figures to satisfy the demands of modern single-source publishing.

In the scope of this project, pandoc’s central document data type will be modified such that it can capture the necessary information. This will also require adjustments to multiple parts of pandoc. Full figure support should be implemented for at least one input and output format, e.g. HTML.

The nature of the issue makes it a good candidate for an iterative approach, i.e., designing and refining the pandoc AST in close contact with the mentors. The project can build on prior work, but it should be continuously re-evaluated and updated during the course of this project.

Further goals might be the design and implementation of a figure interface usable by Lua filters, and experimental extensions to the Markdown syntax to allow authors to make the best use of the new features.

Helpful skills:

The project will require a basic familiarity with Haskell. Driving the design with the help of algebraic data types can be a useful skill for rebuilding the relevant central data structures. Some experience with markup and typesetting formats like HTML and LaTeX would be ideal.

The project might also involve Lua API usage and programming. This, as well as other details, can be picked up during the project with the guidance of the mentors.

Potential mentors:

Difficulty: Beginner, Intermediate

Stack

Stack is a build system for building, installing, testing and benchmarking Haskell applications. There are several suggestions for students:

Mentors: the Stack team

Live coding algorithmic patterns with TidalCycles

TidalCycles (or Tidal for short) is a library for live pattern-making, usually musical patterns and often for an audience. It represents patterns using techniques based on Functional Reactive Programming, where both continuous and discrete events may be represented as a function of time. Tidal does not produce sound itself, and is co-developed with the SuperDirt framework for sound sampling and synthesis implemented in SuperCollider. It has a thriving community worldwide, tending towards musicians and artists without a formal background in computer science. A recent streamed event featured 84 talks and performances.
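The notion of a pattern as a function of time can be sketched as follows. This is a simplified model under stated assumptions, not Tidal's actual types: `Event`, `Pattern`, `cycleSeq` and `fast` are illustrative definitions written for this example. A pattern is queried with a time span and answers with the events that overlap it.

```haskell
module Main where

type Time = Rational

-- A discrete event: a value that is active from `start` up to `stop`.
data Event a = Event { start :: Time, stop :: Time, value :: a }
  deriving (Show, Eq)

-- A pattern is a function from a query span to the events overlapping it.
newtype Pattern a = Pattern { query :: (Time, Time) -> [Event a] }

-- Divide every cycle (one unit of time) evenly among the given values.
cycleSeq :: [a] -> Pattern a
cycleSeq [] = Pattern (const [])
cycleSeq xs = Pattern $ \(s, e) ->
  [ Event b (b + step) x
  | c <- ([floor s .. ceiling e - 1] :: [Integer])
  , (i, x) <- zip [0 ..] xs
  , let b = fromIntegral c + fromIntegral (i :: Integer) * step
  , b < e
  , b + step > s
  ]
  where
    step = 1 / fromIntegral (length xs)

-- Speed a pattern up by querying a stretched span and compressing the results.
fast :: Time -> Pattern a -> Pattern a
fast k (Pattern q) = Pattern $ \(s, e) ->
  [Event (b / k) (f / k) x | Event b f x <- q (s * k, e * k)]

main :: IO ()
main = print (map value (query (fast 2 (cycleSeq ["bd", "sn"])) (0, 1)))
-- prints ["bd","sn","bd","sn"]
```

Because a pattern is only a function, transformations such as `fast` never build the whole sequence; they rewrite the query and the resulting events, which is what makes this representation convenient for live coding.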

The Tidal GitHub repository has a range of active issues, among other possibilities. As the primary maintainer I’m particularly interested in supporting projects which make Tidal more accessible. Currently the main way to interact with Tidal is via an editor plugin that acts as an intermediary with the ghci REPL. Developing an API approach could both support new UI approaches to Tidal and allow binary distributions and drag-and-drop installers, replacing the current error-prone installation method, which turns many beginners away.

Potential mentor: Alex McLean

Difficulty: Beginner/intermediate