GSoC 2021 Ideas
This is a list of ideas for students who are considering applying to Google Summer of Code 2021 with Haskell.org.
For project maintainers
Are you working on a Haskell project and could use the help of a student during the summer? Consider contributing it as an idea here! You can contribute ideas by sending a pull request to our GitHub repository. If you just want to discuss a possible idea, please contact us.
What is a good idea? Anything that improves the Haskell ecosystem is valid. The GSoC rules state that a project must primarily involve writing code (as opposed to documentation).
Projects should be concrete and small enough in scope such that they can be finished by a student in three months. Past experience has shown that keeping projects “small” is almost always a good idea.
Important change for 2021: In the past, GSoC projects were expected to take up the equivalent of full time employment for a student. However, in 2021, this has been reduced to half time positions: students are expected to work around 175 hours in a 10 week period.
Projects should benefit as many people as possible – e.g. an improvement to GHC will benefit more people than an update to a specific library or tool, but both are acceptable. New libraries and applications written in Haskell, rather than improvements to existing ones, are also welcome.
Please be aware that:
- This is not an all-inclusive list, so you can apply for projects not in this list and we will try our best to match you with a mentor.
- You can apply for as many ideas as you want (but only one can be accepted).
- Some general tips on writing a proposal are discussed here.
Table of Contents
- Dhall bindings to TOML
- Practical Machine Learning with Hasktorch
- Restore ihaskell-widgets
- Pandoc Figures
Dhall bindings to TOML🔗
Dhall is an interpreted programmable configuration language that you can think of as: JSON + functions + types + imports. Almost all of the language’s supporting tooling is implemented in Haskell, including tools to convert between Dhall and other configuration file formats (like JSON or YAML).
Dhall does not currently support a TOML binding, though, and the scope of this project is to add support for converting bidirectionally between Dhall configuration files and TOML files. Specifically, this project would entail creating a new
dhall-toml package that would provide this bidirectional conversion.
This project is suitable for an intermediate Haskell programmer, and no prior familiarity with Dhall is required. The student would be able to consult the existing
dhall-json package, so even though they would be scaffolding a new package they wouldn't be starting from scratch.
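To make the shape of the task concrete, here is a heavily simplified sketch of the translation such a package would perform. The DhallValue type and toToml function are invented for illustration and are not the real dhall API; nested tables, lists, and most Dhall types are deliberately omitted.

```haskell
-- Invented, minimal stand-in for Dhall's expression type.
data DhallValue
  = DBool Bool
  | DNatural Integer
  | DText String
  | DRecord [(String, DhallValue)]

-- Render a top-level record as flat TOML key/value lines.
toToml :: DhallValue -> Maybe String
toToml (DRecord fields) = Just (concatMap entry fields)
  where
    entry (k, v) = k ++ " = " ++ scalar v ++ "\n"
    scalar (DBool b)    = if b then "true" else "false"
    scalar (DNatural n) = show n
    scalar (DText t)    = show t
    scalar (DRecord _)  = "{}"  -- placeholder: real code would emit a TOML table
toToml _ = Nothing  -- a TOML document must be a table at the top level

main :: IO ()
main = mapM_ putStr (toToml (DRecord [("name", DText "example"), ("port", DNatural 8080)]))
```

The real package would also need the reverse direction (TOML to Dhall), where the interesting design question is how to infer or require Dhall types for the untyped TOML input.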
Practical Machine Learning with Hasktorch🔗
Hasktorch is a library for neural networks and tensor math in Haskell. It leverages the C++ backend of PyTorch for fast numerical computation with GPU support. Our goal with Hasktorch is to provide a platform for machine learning using typed functional programming.
This summer, we have selected three exciting projects for GSoC contributors:
Integration between Hasktorch and Huggingface
Make state-of-the-art pre-trained neural network language models available in Haskell.
The Huggingface Open-Source Python libraries have become the de-facto standard for deep-learning based Natural Language Processing (NLP) for researchers, practitioners, and educators.
We aim to unlock a number of NLP features and capabilities for Haskellers that Huggingface provides:
- Access to general-purpose, Transformer-based reference implementations (BERT, GPT-2, RoBERTa, T5, etc.) for Natural Language Understanding (NLU) and Natural Language Generation (NLG).
- Fine-tuning of Transformer models.
- Sharing of pretrained and/or fine-tuned state-of-the-art models for the aforementioned model architectures. Huggingface maintains a model hub where users can share and download models that have been trained or fine-tuned on new data.
- Deployment of models.
The goal is not only to make this functionality as frictionless as it is in Python, but also to add type safety and idiomatic functional abstractions. For instance, generating natural language from a pre-trained T5 model in Haskell can look like this:
```haskell
λ> type GPU = 'Device ('CUDA 0)
λ> model <- initialize @(T5Large 'WithLMHead GPU) "t5-large.pt"
λ> g <- mkGenerator @GPU 0
λ> type BatchSize = 1
λ> type MaxSeqSize = 64
λ> input <- mkT5Input @GPU @BatchSize @MaxSeqSize ["translate English to German: Monads are monoids in the category of endofunctors."]
λ> runBeamSearch 1 model input g
["Monaden sind Monoide in der Kategorie der Endofunktoren."]
```
A proof of concept for NLG from a T5 model already exists, but it misses an essential component: a tokenizer. A natural language tokenizer encodes input text like the above as lists of integers that the model has learned to interpret. To this end, Huggingface provides a tokenization library written in Rust. One potential GSoC project is to create Haskell bindings for this library. For this project, the student should be proficient in both Haskell and Rust. Of course, other projects in service of the Huggingface integration agenda can be pursued, too.
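To illustrate what a tokenizer does, here is a toy word-level version. This is for intuition only: the vocabulary and encode function are invented, and real Huggingface tokenizers use learned subword algorithms (such as BPE or WordPiece) implemented in Rust, which the proposed bindings would expose.

```haskell
import qualified Data.Map.Strict as Map

-- Invented toy vocabulary mapping words to integer ids.
vocab :: Map.Map String Int
vocab = Map.fromList (zip (words "monads are monoids in the category of endofunctors") [1 ..])

-- Encode a sentence as ids; 0 stands in for unknown tokens.
encode :: String -> [Int]
encode = map (\w -> Map.findWithDefault 0 w vocab) . words

main :: IO ()
main = print (encode "monads are monoids in space")
```

A subword tokenizer avoids the unknown-token problem above by splitting rare words into smaller learned units, which is one reason binding the existing Rust implementation is preferable to reimplementing it.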
Potential mentors: Austin Huang, Torsten Scholak
Model Monitoring And Data Version Control
This project would add Haskell support for tools in this space, for example:
- DVC (Data Version Control, https://github.com/iterative/dvc): a library that defines cross-language protocols for versioning datasets for machine learning and for tracking and persisting ML experiments.
- Wandb: a central dashboard to keep track of your hyperparameters, system metrics, and predictions, so you can compare models live and share your findings.
Potential mentors: Austin Huang, Torsten Scholak
Gradually Typed Hasktorch
Gradually typed Hasktorch,
Torch.GraduallyTyped, is a new API for tensors and neural networks that interpolates between the already existing unchecked (untyped) and checked (typed) Hasktorch APIs, Torch and
Torch.Typed, respectively. Thus far, users have had to commit fully to either typed or untyped tensors and models. The new gradually typed API relaxes this black-and-white tradeoff and makes the decision more granular. In
Torch.GraduallyTyped, users can choose whether or not they want type checking for every individual type variable, like a tensor’s compute device (e.g. the CPU or a GPU), its precision (
Float, etc.), or its shape (the names and sizes of its dimensions). Thus, users can enjoy the flexibility of an unchecked API for rapid prototyping, while they can also add as much type checking as they want later on. Alternatively, users can start with fully checked tensor and model types and relax them when and where they get in the way. Thus,
Torch.GraduallyTyped combines the best of both worlds, of checked and of unchecked Hasktorch.
More concretely, consider the existing unchecked and checked APIs.
Torch.Tensor is an untyped wrapper around a reference to a libtorch tensor. None of its properties are tracked by Haskell’s type system. A
Torch.Typed.Tensor, on the other hand, has three type annotations: a static device (of kind
(DeviceType, Nat)), a precision (of kind
DType), and a shape (of kind
[Nat]). By contrast, in the gradually typed API, all these types become optional:
- the device type becomes Device (DeviceType Nat), where Device a ~ Maybe a (i.e. data Device a = UncheckedDevice | Device a);
- the data type becomes DataType DType, where DataType a ~ Maybe a;
- the shape becomes Shape [Dim (Name Symbol) (Size Nat)], where Shape a ~ Maybe a, Name a ~ Maybe a, and Size a ~ Maybe a.
The existing unchecked and checked APIs are thus special cases of the new gradually typed API. Indeed, one could define:
```haskell
type UncheckedTensor = Tensor 'UncheckedDevice 'UncheckedDataType 'UncheckedShape
type CheckedTensor deviceType dtype dims = Tensor ('Device deviceType) ('DataType dtype) ('Shape (ToGradualDims dims))
```
(ToGradualDims is a helper type family that converts types of kind
[(Symbol, Nat)] to those of kind
[Dim (Name Symbol) (Size Nat)].)
UncheckedTensor is equivalent to the fully unchecked Torch.Tensor, and
CheckedTensor is equivalent to the fully checked
Torch.Typed.Tensor. Beyond these two limit cases, one can see that there are many more configurations of partial checking. For instance, a compute device could be statically known but the shape could be statically unknown. We can even represent the case in which the number of dimensions is statically known but only some dimensions have a statically known size or name.
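The Maybe-like promotion described above can be illustrated with a minimal toy model. These types are invented for this sketch and are much simpler than the real Torch.GraduallyTyped definitions; only the device is tracked here, and the tensor payload is a plain list.

```haskell
{-# LANGUAGE DataKinds, KindSignatures #-}

import GHC.TypeLits (Nat)

data DeviceType = CPU | CUDA Nat

-- Maybe-shaped: 'UncheckedDevice plays the role of 'Nothing,
-- 'CheckedDevice d the role of 'Just d.
data Device = UncheckedDevice | CheckedDevice DeviceType

-- A toy tensor indexed by an optionally checked device.
newtype Tensor (device :: Device) = Tensor [Double]

-- This value's device is pinned at the type level ...
onCpu :: Tensor ('CheckedDevice 'CPU)
onCpu = Tensor [1, 2, 3]

-- ... while this one defers the device to runtime bookkeeping.
flexible :: Tensor 'UncheckedDevice
flexible = Tensor [4, 5]

main :: IO ()
main = do
  let Tensor xs = onCpu
      Tensor ys = flexible
  print (sum xs + sum ys)
```

In the real API the same pattern is applied independently to the device, the data type, and every dimension of the shape, which is what makes the checking granular.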
Extending the tensor types in the above way has some interesting consequences. Signatures of functions that operate on gradually typed tensors now depend on the information that is statically available, and they have to propagate and process the information in a way that is compatible with their function. For example, the
nonzero function returns a tensor where each row contains a list of indices of all non-zero elements of the input tensor. Since the number of non-zero elements is not known at compile time, the output tensor has an unknown number of rows. It cannot be checked, and hence the output shape is
'Shape '[ 'Dim ('Name "*") UncheckedSize, 'Dim ('Name "*") inputDimNum], where
inputDimNum is the (potentially unknown) number of dimensions of the input tensor.
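A plain-list version of the same operation shows why the row count cannot be checked statically. For a 1-D input, nonzero amounts to collecting the indices of the non-zero entries, and how many there are depends on the data:

```haskell
-- Toy 1-D analogue of nonzero (illustration only, not Hasktorch code).
nonzeroIndices :: [Double] -> [Int]
nonzeroIndices xs = [i | (i, x) <- zip [0 ..] xs, x /= 0]

main :: IO ()
main = print (nonzeroIndices [0, 3, 0, 5])
```

Two inputs of the same length can produce results of different lengths, so the corresponding output dimension must be left unchecked even when everything about the input is known at compile time.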
Gradually typed Hasktorch has been developed steadily in its own branch, https://github.com/hasktorch/hasktorch/tree/gradually-typed-hasktorch. Our goal is to bring it to maturity this year. For GSoC, we are looking for individuals who are interested in developing the gradually typed API further, adding missing functionality, and testing out new ideas.
Potential mentor: Torsten Scholak
Restore ihaskell-widgets🔗
IHaskell is a kernel that allows you to create interactive Jupyter notebooks. ihaskell-widgets is a library for making the IHaskell kernel work with ipywidgets.
ipywidgets provide GUI controls for individual named variables in a Jupyter notebook. They are used, for example, to parameterize a rendered graph, plot, or chart with a GUI slider which instantly updates the rendering with the new parameter. ihaskell-widgets stopped working with ipywidgets version 7 in 2018.
This Github issue has some of the particulars: https://github.com/gibiansky/IHaskell/issues/870.
Potential mentors: Vaibhav Sagar, James Brock, Sumit Sahrawat, Rehno Lindeque
Pandoc Figures🔗
Pandoc, the universal document converter written in Haskell, is not only a versatile conversion tool but has also become a central part of scholarly publishing pipelines. It is an integral part of R Markdown, used by many scientists. It is also being used in the production of academic journals, e.g. JOSS and kommunikation@gesellschaft.
Figures are an integral part of scientific communication, and of documents in general. The goal of this project is to extend pandoc’s basic handling of figures to satisfy the demands of modern single-source publishing.
In the scope of this project, pandoc’s central document data type will be modified such that it can capture the necessary information. This will also require adjustments to multiple parts of pandoc. Full figure support should be implemented for at least one input and output format, e.g. HTML.
The nature of the issue makes it a good candidate for an iterative approach, i.e., designing and refining the pandoc AST in close contact with the mentors. The project can build on prior work, but it should be continuously re-evaluated and updated during the course of this project.
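One possible shape of such an AST extension, sketched here with invented simplified types rather than pandoc's actual definitions (the real ones live in the pandoc-types package), is a dedicated figure constructor that writers can pattern-match on:

```haskell
-- Invented, heavily simplified stand-ins for pandoc's AST types.
data Inline = Str String deriving (Eq, Show)

data Block
  = Para [Inline]
  | Plain [Inline]
  | Figure [Inline] [Block]  -- caption inlines plus figure content
  deriving (Eq, Show)

-- A writer can then give figures format-specific markup; HTML sketch:
toHtml :: Block -> String
toHtml (Para is)       = "<p>" ++ concatMap inline is ++ "</p>"
toHtml (Plain is)      = concatMap inline is
toHtml (Figure cap bs) =
  "<figure>" ++ concatMap toHtml bs
    ++ "<figcaption>" ++ concatMap inline cap ++ "</figcaption></figure>"

inline :: Inline -> String
inline (Str s) = s

main :: IO ()
main = putStrLn (toHtml (Figure [Str "A plot"] [Plain [Str "img"]]))
```

The actual design would also need attributes, captions with richer structure, and round-tripping through every reader and writer, which is where the iterative refinement with the mentors comes in.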
Further goals might be the design and implementation of a figure interface usable by Lua filters, and experimental extensions to the Markdown syntax to allow authors to make best use of the new features.
The project will require a basic familiarity with Haskell. Driving the design with the help of algebraic data types can be a useful skill for rebuilding the relevant central data structures. Some experience with markup and typesetting formats like HTML and LaTeX would be ideal.
The project might also involve Lua API usage and programming. This, as well as other details, can be picked up during the project and with guidance of the mentors.
Potential mentors:
- Albert Krewinkel
- Alison Hill
- Christophe Dervieux
Difficulty: Beginner, Intermediate
Stack🔗
Stack is a build system for building, installing, testing, and benchmarking Haskell applications. There are several suggestions for students:
Garbage collection: Stack downloads, builds, and then stores large files such as programs and artifacts. These large files take up disk space and can often be removed after a period of time, once they are no longer in use. A new
stack gc command could be added to clean up unused files automatically.
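A hypothetical sketch of the selection policy such a command might use: given each cached file's last-use time, pick the ones older than a cutoff. The function name, the age-based policy, and the file names are all illustrative assumptions, not Stack's actual design.

```haskell
import Data.Time.Calendar (fromGregorian)
import Data.Time.Clock (NominalDiffTime, UTCTime (..), diffUTCTime)

-- Files not used within maxAge seconds of `now` are deletion candidates.
staleFiles :: NominalDiffTime -> UTCTime -> [(FilePath, UTCTime)] -> [FilePath]
staleFiles maxAge now entries =
  [path | (path, lastUsed) <- entries, now `diffUTCTime` lastUsed > maxAge]

main :: IO ()
main = do
  let day m = UTCTime (fromGregorian 2021 m 1) 0
      now = day 6
      cache = [("old-snapshot.tar", day 1), ("fresh.hi", day 6)]
  -- keep anything used in the last ~30 days (30 * 86400 seconds)
  print (staleFiles (30 * 86400) now cache)
```

The harder parts of the real project would be deciding what "in use" means (e.g. files referenced by any current project) and recording the bookkeeping needed to answer that safely.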
Configuration improvements: Stack uses YAML as a configuration language. Using newer libraries would allow Stack to drastically improve error messages and even automatically generate documentation.
.hi files: Stack relies on information in
.hi files in order to determine recompilation needs without running GHC. Importantly, this tracks Template Haskell dependent files. This is mediated via the hi-file-parser package. This package needs to be updated for newer GHC versions, and ideally a more long-term solution would work with upstream GHC features to avoid needing to maintain a separate binary parser.
Internal libraries: add support for internal libraries to Stack.
Mentors: the Stack team