GSoC 2021 Ideas
This is a list of ideas for students who are considering applying to Google Summer of Code 2021 for Haskell.org.
For project maintainers
Are you working on a Haskell project that could use the help of a student during the summer? Consider contributing it as an idea here! You can contribute ideas by sending a pull request to our GitHub repository. If you just want to discuss a possible idea, please contact us.
What is a good idea? Anything that improves the Haskell ecosystem is valid. The GSoC rules state that projects must primarily involve writing code (as opposed to documentation).
Projects should be concrete and small enough in scope that a student can finish them in three months. Past experience has shown that keeping projects “small” is almost always a good idea.
Important change for 2021: In the past, GSoC projects were expected to take up the equivalent of full time employment for a student. However, in 2021, this has been reduced to half time positions: students are expected to work around 175 hours in a 10 week period.
Projects should benefit as many people as possible – e.g. an improvement to GHC will benefit more people than an update to a specific library or tool, but both are acceptable. New libraries and applications written in Haskell, rather than improvements to existing ones, are also welcome.
Please be aware that:
- This is not an all-inclusive list, so you can apply for projects not in this list and we will try our best to match you with a mentor.
- You can apply for as many ideas as you want (but only one can be accepted).
- Some general tips on writing a proposal are discussed here.
Table of Contents
- Dhall bindings to TOML
- ghc-debug
- Haskell support in CodeMirror 6
- Practical Machine Learning with Hasktorch
- Haskell Language Server
- Restore ihaskell-widgets
- Pandoc Figures
- Stack
- Live coding algorithmic patterns with TidalCycles
Dhall bindings to TOML🔗
Dhall is an interpreted programmable configuration language that you can think of as: JSON + functions + types + imports. Almost all of the language’s supporting tooling is implemented in Haskell, including tools to convert between Dhall and other configuration file formats (like JSON or YAML).
Dhall does not currently have a TOML binding, though, and the scope of this project is to add support for converting bidirectionally between Dhall configuration files and TOML files. Specifically, this project would entail creating a new dhall-toml package providing that conversion tooling.
This project is suitable for an intermediate Haskell programmer, and no prior knowledge of or familiarity with Dhall is required. The student would be able to consult the existing dhall-json packages, so even though they would be scaffolding a new package they wouldn't be starting from scratch.
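As a rough illustration of the shape of the task (using toy ASTs, not the real Dhall expression type or an actual TOML library), a bidirectional conversion on a small common fragment might look like this:

```haskell
-- Toy stand-ins for the real Dhall and TOML syntax trees; dhall-toml
-- would work with the actual Dhall expression type and a TOML library.
data Dhall
  = DBool Bool
  | DNatural Integer
  | DText String
  | DRecord [(String, Dhall)]
  deriving (Eq, Show)

data Toml
  = TBool Bool
  | TInteger Integer
  | TString String
  | TTable [(String, Toml)]
  deriving (Eq, Show)

-- Dhall-to-TOML on this fragment is a straightforward fold ...
dhallToToml :: Dhall -> Toml
dhallToToml (DBool b)     = TBool b
dhallToToml (DNatural n)  = TInteger n
dhallToToml (DText s)     = TString s
dhallToToml (DRecord kvs) = TTable [ (k, dhallToToml v) | (k, v) <- kvs ]

-- ... and TOML-to-Dhall inverts it, making the conversion bidirectional.
tomlToDhall :: Toml -> Dhall
tomlToDhall (TBool b)     = DBool b
tomlToDhall (TInteger n)  = DNatural n
tomlToDhall (TString s)   = DText s
tomlToDhall (TTable kvs)  = DRecord [ (k, tomlToDhall v) | (k, v) <- kvs ]
```

The real project is harder than this sketch suggests: Dhall has functions, unions, and imports with no TOML counterpart, so deciding which fragment of Dhall can round-trip is part of the design work.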
ghc-debug🔗
ghc-debug is a new heap profiling tool which can be used to answer precise questions about the memory usage of Haskell programs. It works, and has already been used to analyse problems in big codebases such as GHC. It would be great if a capable student could help take it to the next level and inject some fresh ideas into the project.
There are several possible avenues which could be explored:
- Applying ghc-debug to existing open source projects such as ghcide and developing reproducible tests to prevent memory problems being reintroduced.
- Implementing an existing memory analysis from the literature (such as BLeak).
- Improving visualisations of existing analysis modes, for example by adding convenient functions to output large graphs in a format suitable for consumption by an external tool.
- Anything else you can think of! The best projects are your own ideas.
This is an advanced project because it requires an understanding of the runtime representation of Haskell programs. Deep knowledge of the RTS is not necessary.
Anyone interested in this project should make sure to contact me before writing their proposal. I would also expect a successful applicant to have completed at least one merge request before a proposal is submitted.
Mentors: Matthew Pickering
Haskell support in CodeMirror 6🔗
CodeMirror is a popular web-based code editor, with support for many languages. CodeMirror has a first-class Haskell mode up through version 5. In version 6, though, language support has been redesigned from the ground up. In particular:
- Language support in CodeMirror 6 is designed around an incremental error-correcting parser built using Lezer, rather than ad hoc pseudo-parsing with regular expressions.
- CodeMirror 6 provides language modes access to a constantly updated abstract syntax tree that it can use to inform editor behavior.
Haskell no longer has first-class language support in CodeMirror 6. Instead, there is only a compatibility shim around the version 5 mode. This shim lacks any of the advantages of the new model: the shim doesn’t produce a true abstract syntax tree, doesn’t recover state well when an error exists in the source, etc.
An interesting project could be to implement a first-class Haskell language mode for CodeMirror 6. This would be a basis on which a wide variety of web-based Haskell tooling could be built. In particular, the CodeWorld project which provides an online Haskell playground and education tool is based on CodeMirror, and would eventually adopt such a mode.
Mentors: Chris Smith
Difficulty: Intermediate
Practical Machine Learning with Hasktorch🔗
Hasktorch is a library for neural networks and tensor math in Haskell. It leverages the C++ backend of PyTorch for fast numerical computation with GPU support. Our goal with Hasktorch is to provide a platform for machine learning using typed functional programming.
This summer, we have selected three exciting projects for GSoC contributors:
Integration between Hasktorch and Huggingface
Make state-of-the-art pre-trained neural network language models available in Haskell.
The Huggingface Open-Source Python libraries have become the de-facto standard for deep-learning based Natural Language Processing (NLP) for researchers, practitioners, and educators.
We aim to unlock a number of NLP features and capabilities for Haskellers that Huggingface provides:
- Access to general-purpose, Transformer-based reference implementations (BERT, GPT-2, RoBERTa, T5, etc.) for Natural Language Understanding (NLU) and Natural Language Generation (NLG).
- Fine-tuning of Transformer models.
- Sharing of pretrained and/or fine-tuned state-of-the-art models for the aforementioned model architectures. Huggingface maintains a model hub where users can share and download models that have been trained or fine-tuned on new data.
- Deployment of models.
The goal is not only to make this functionality as frictionless as it is in Python, but also to add type safety and idiomatic functional abstractions. For instance, generating natural language from a pre-trained T5 model in Haskell can look like this:
```
λ> type GPU = 'Device ('CUDA 0)
λ> model <- initialize @(T5Large 'WithLMHead GPU) "t5-large.pt"
λ> g <- mkGenerator @GPU 0
λ> type BatchSize = 1
λ> type MaxSeqSize = 64
λ> input <- mkT5Input @GPU @BatchSize @MaxSeqSize
       ["translate English to German: Monads are monoids in the category of endofunctors."]
λ> runBeamSearch 1 model input g
["Monaden sind Monoide in der Kategorie der Endofunktoren."]
```
A proof of concept for NLG from a T5 model already exists, but it misses an essential component: a tokenizer. A natural language tokenizer encodes input text like the above as lists of integers that the model has learned to interpret. To this end, Huggingface provides a tokenization library written in Rust. One potential GSoC project is to create Haskell bindings for this library. For this project, the student should be proficient in both Haskell and Rust. Of course, other projects in service of the Huggingface integration agenda can be pursued, too.
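To make the tokenizer's role concrete, here is a toy whitespace tokenizer over a made-up vocabulary (the actual project would bind Huggingface's Rust tokenizers library over the FFI rather than reimplement tokenization in Haskell):

```haskell
import qualified Data.Map.Strict as Map

-- A vocabulary maps known (sub)words to the integer ids a model understands.
type Vocab = Map.Map String Int

-- Encode whitespace-separated words to ids; unknown words fall back to the
-- id of an "<unk>" token (id 0 in this toy vocabulary).
encode :: Vocab -> String -> [Int]
encode vocab = map lookupId . words
  where
    lookupId w = Map.findWithDefault 0 w vocab

-- A tiny vocabulary for illustration only.
exampleVocab :: Vocab
exampleVocab = Map.fromList (zip ["<unk>", "monads", "are", "monoids"] [0 ..])
```

Real subword tokenizers (BPE, WordPiece, unigram) split unknown words into known fragments instead of collapsing them to an unknown token, which is one reason binding the battle-tested Rust implementation is preferable to a reimplementation.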
Potential mentors: Austin Huang, Torsten Scholak
Model Monitoring And Data Version Control
Haskell support for DVC (Data Version Control, https://github.com/iterative/dvc): a library that defines cross-language protocols supporting versioning of datasets for machine learning and tracking/persistence of ML experiments.
Wandb: a central dashboard to keep track of your hyperparameters, system metrics, and predictions, so you can compare models live and share your findings.
Potential mentors: Austin Huang, Torsten Scholak
Gradually Typed Hasktorch
Gradually typed Hasktorch, Torch.GraduallyTyped, is a new API for tensors and neural networks that interpolates between the existing unchecked (untyped) and checked (typed) Hasktorch APIs, the latter being Torch.Typed. Thus far, users have had to commit fully to either typed or untyped tensors and models. The new gradually typed API relaxes this black-and-white tradeoff and makes the decision more granular. In Torch.GraduallyTyped, users can choose whether or not they want type checking for every individual type variable, such as a tensor's compute device (e.g. the CPU or a GPU), its precision (Float, etc.), or its shape (the names and sizes of its dimensions). Thus, users can enjoy the flexibility of an unchecked API for rapid prototyping, while adding as much type checking as they want later on. Alternatively, users can start with fully checked tensor and model types and relax them when and where they get in the way. In this way, Torch.GraduallyTyped combines the best of both worlds, checked and unchecked Hasktorch.
More concretely, consider the existing unchecked and checked APIs. Torch.Tensor is an untyped wrapper around a reference to a libtorch tensor; none of its properties are tracked by Haskell's type system. A Torch.Typed.Tensor, on the other hand, has three type annotations: a static device (of kind (DeviceType, Nat)), a precision (of kind DType), and a shape (of kind [Nat]). By contrast, in the gradually typed API all of these types become optional:
- The device has kind Device (DeviceType Nat), where Device a ~ Maybe a (i.e. data Device a = UncheckedDevice | Device a).
- The precision likewise satisfies DataType a ~ Maybe a.
- The shape has kind Shape [Dim (Name Symbol) (Size Nat)], where Shape a ~ Maybe a, Name a ~ Maybe a, and Size a ~ Maybe a.
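The Maybe-like structure of these kinds can be sketched in a few lines of self-contained Haskell (illustrative names only, not the actual Torch.GraduallyTyped definitions):

```haskell
{-# LANGUAGE DataKinds, KindSignatures #-}

import GHC.TypeLits (Nat)

data DeviceType n = CPU | CUDA n

-- Maybe at the type level: a device is either statically unknown or a
-- statically known DeviceType.
data Device a = UncheckedDevice | Device a

-- A stand-in tensor indexed by an optional static device.
newtype Tensor (device :: Device (DeviceType Nat)) = Tensor [Double]

-- Fully unchecked and fully checked specialisations coexist:
type UncheckedT = Tensor 'UncheckedDevice
type Gpu0T      = Tensor ('Device ('CUDA 0))

-- Functions indifferent to the device stay polymorphic in it.
sumT :: Tensor device -> Double
sumT (Tensor xs) = sum xs
```

Code that does care about the device can then demand a statically known one in its signature, while prototyping code keeps using the unchecked index.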
The existing unchecked and checked APIs are thus special cases of the new gradually typed API. Indeed, one could define:

```haskell
type UncheckedTensor = Tensor 'UncheckedDevice 'UncheckedDataType 'UncheckedShape

type CheckedTensor deviceType dtype dims =
  Tensor ('Device deviceType) ('DataType dtype) ('Shape (ToGradualDims dims))
```

(ToGradualDims is a helper type family that converts types of kind [(Symbol, Nat)] to those of kind [Dim (Name Symbol) (Size Nat)].)
UncheckedTensor is equivalent to the fully unchecked Torch.Tensor, and CheckedTensor is equivalent to the fully checked Torch.Typed.Tensor. Beyond these two limiting cases, there are many more configurations of partial checking. For instance, a compute device could be statically known while the shape is statically unknown. We can even represent the case in which the number of dimensions is statically known but only some dimensions have a statically known size or name.
Extending the tensor types in this way has some interesting consequences. Signatures of functions that operate on gradually typed tensors now depend on the information that is statically available, and they have to propagate and process that information in a way that is compatible with their purpose. For example, the nonzero function returns a tensor in which each row holds the indices of one non-zero element of the input tensor. Since the number of non-zero elements is not known at compile time, the output tensor has an unknown number of rows. That number cannot be checked, and hence the output shape is 'Shape '[ 'Dim ('Name "*") UncheckedSize, 'Dim ('Name "*") inputDimNum], where inputDimNum is the (potentially unknown) number of dimensions of the input tensor.
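The reason the row count cannot be statically checked is already visible at the term level. A list-based analogue of nonzero for one-dimensional input (illustrative only) has a result length that depends on runtime values rather than on the input's shape:

```haskell
-- List-based analogue of nonzero for a single dimension: the indices of
-- the non-zero entries. The length of the result depends on the values,
-- not the length, of the input, so it cannot be known at compile time.
nonzeroIndices :: [Double] -> [Int]
nonzeroIndices xs = [ i | (i, x) <- zip [0 ..] xs, x /= 0 ]
```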
Gradually typed Hasktorch has been developed steadily in its own branch, https://github.com/hasktorch/hasktorch/tree/gradually-typed-hasktorch. Our goal is to bring it to maturity this year. For GSoC, we are looking for individuals who are interested in developing the gradually typed API further, adding missing functionality, and testing out new ideas.
Potential mentor: Torsten Scholak
Haskell Language Server🔗
Haskell Language Server is a full-featured Haskell IDE that recently reached version v1.0. Since it is a large project, there are many possible ideas for students.
This thread has discussions around possible projects, and there are a number of issues tagged as Eligible for GSoC 2021.
Mentors: the Haskell Language Server team
Restore ihaskell-widgets🔗
IHaskell is a kernel that allows you to create interactive Haskell notebooks in Jupyter. ihaskell-widgets is a library for making the IHaskell kernel work with ipywidgets.
ipywidgets provide GUI controls for individual named variables in a Jupyter notebook. They are used, for example, to parameterize a rendered graph, plot, or chart with a GUI slider which instantly updates the rendering with the new parameter. ihaskell-widgets stopped working with ipywidgets version 7 in 2018.
This Github issue has some of the particulars: https://github.com/gibiansky/IHaskell/issues/870.
Potential mentors: Vaibhav Sagar, James Brock, Sumit Sahrawat, Rehno Lindeque
Pandoc Figures🔗
Pandoc, the universal document converter written in Haskell, is not only a versatile conversion tool, but has also become a central part of scholarly publishing pipelines. It is an integral part of R Markdown, used by many scientists. It is also being used for the production of academic journals, e.g. JOSS and kommunikation@gesellschaft.
Figures are an integral part of scientific communication, and of documents in general. The goal of this project is to extend pandoc’s basic handling of figures to satisfy the demands of modern single-source publishing.
In the scope of this project, pandoc’s central document data type will be modified such that it can capture the necessary information. This will also require adjustments to multiple parts of pandoc. Full figure support should be implemented for at least one input and output format, e.g. HTML.
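As a hypothetical sketch of what such a modification could look like (toy types, not pandoc's actual AST), a first-class figure constructor might carry attributes, a caption, and block content, which each writer then renders natively:

```haskell
-- Toy document AST with a first-class figure node (illustrative only).
type Attr = (String, [String])          -- (identifier, classes)

data Inline = Str String
  deriving (Eq, Show)

newtype Caption = Caption [Inline]
  deriving (Eq, Show)

data Block
  = Para [Inline]
  | Figure Attr Caption [Block]         -- the new, richer figure node
  deriving (Eq, Show)

-- One possible HTML rendering of the new node.
blockToHtml :: Block -> String
blockToHtml (Para inlines) =
  "<p>" ++ concatMap inlineToHtml inlines ++ "</p>"
blockToHtml (Figure (ident, _classes) (Caption cap) body) =
  "<figure id=\"" ++ ident ++ "\">"
    ++ concatMap blockToHtml body
    ++ "<figcaption>" ++ concatMap inlineToHtml cap ++ "</figcaption>"
    ++ "</figure>"

inlineToHtml :: Inline -> String
inlineToHtml (Str s) = s
```

The design questions the project must answer live exactly here: what the figure node should contain, and how each reader and writer maps its format's figure conventions onto it.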
The nature of the issue makes it a good candidate for an iterative approach, i.e., designing and refining the pandoc AST in close contact with the mentors. The project can build on prior work, but it should be continuously re-evaluated and updated during the course of this project.
Further goals might be the design and implementation of a figure interface usable by Lua filters, and experimental extensions to the Markdown syntax that allow authors to make the best use of the new features.
The project will require basic familiarity with Haskell. The ability to drive a design with algebraic data types will be useful when rebuilding the relevant central data structures. Some experience with markup and typesetting formats like HTML and LaTeX would be ideal.
The project might also involve Lua API usage and programming. This, as well as other details, can be picked up during the project and with guidance of the mentors.
Potential mentors:
- Albert Krewinkel
- Alison Hill
- Christophe Dervieux
Difficulty: Beginner, Intermediate
Stack🔗
Stack is a build system for building, installing, testing and benchmarking Haskell applications. There are several suggestions for students:
- Garbage collection: Stack downloads, builds, and then stores large files such as programs and artifacts. These files take up disk space and can often be removed after a period of time, once they are no longer in use. A new stack gc command could be added to clean up unused files automatically.
- Configuration improvements: Stack uses YAML as a configuration language. Using newer libraries would allow Stack to drastically improve error messages and even automatically generate documentation.
- .hi files: Stack relies on information in .hi files in order to determine recompilation needs without running GHC. Importantly, this tracks Template Haskell dependent files. This is mediated via the hi-file-parser package, which needs to be updated for newer GHC versions. Ideally, a longer-term solution would work with upstream GHC features to avoid maintaining a separate binary parser.
- Internal libraries: add support for internal libraries to Stack.
Mentors: the Stack team
Live coding algorithmic patterns with TidalCycles🔗
TidalCycles (or Tidal for short) is a library for live pattern-making, usually of musical patterns and often for an audience. It represents patterns using techniques based on Functional Reactive Programming, where both continuous and discrete events may be represented as functions of time. Tidal does not produce sound itself, and is co-developed with the SuperDirt framework for sound sampling and synthesis implemented in SuperCollider. It has a thriving community worldwide, tending towards musicians and artists without a formal background in computer science. A recent streamed event featured 84 talks and performances.
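The pattern-as-a-function-of-time idea can be sketched in a few lines (a toy model with illustrative names; Tidal's real Pattern type is considerably richer):

```haskell
-- A toy model of Tidal-style patterns: a pattern is a function from a
-- queried time span to the events active within that span.
type Time    = Rational
type Span    = (Time, Time)   -- (begin, end) of a query, in cycles
type Event a = (Span, a)      -- a value active over a span

newtype Pattern a = Pattern { query :: Span -> [Event a] }

-- A pattern with one event per cycle, filling the whole cycle.
cyclePat :: a -> Pattern a
cyclePat x = Pattern $ \(s, e) ->
  [ ((fromIntegral c, fromIntegral c + 1), x)
  | c <- [floor s .. ceiling e - 1] ]

-- Speed a pattern up by a factor k by squeezing time.
fast :: Time -> Pattern a -> Pattern a
fast k (Pattern q) = Pattern $ \(s, e) ->
  [ ((b / k, en / k), x) | ((b, en), x) <- q (s * k, e * k) ]
```

Because a pattern is just a function, combinators like fast compose freely and can represent both discrete events and (with a different query result) continuous signals, which is the property Tidal's live-coding interface builds on.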
The Tidal github repository has a range of active issues, among other possibilities. As the primary maintainer I'm particularly interested in supporting projects which make Tidal more accessible. Currently the main way to interact with Tidal is via an editor plugin acting as an intermediary with the ghci REPL. Developing an API-based approach could both support new UI approaches to Tidal and allow binary distributions with drag-and-drop installers, rather than the current error-prone installation method that turns many beginners away.
Potential mentor: Alex McLean