Member-only story

Classic Computer Science Problems in ̶P̶y̶t̶h̶o̶n̶ Scala — Trivial Compression

Brian Schlining
2 min readMar 19, 2019

I recently purchased the book Classic Computer Science Problems in Python to help me brush up on my python. As I’m working through the Python code in the book, I’m taking some time to also port the Python to Scala. Below is an example the illustrates trivial compression for efficiently storing genes. A little science refresher, genes are typically made of Five nucleobases — adenine (A), cytosine ©, guanine (G), thymine (T), and uracil (U). Uracil (U) and thymine (T) are very similar except uracil is found in RNA, thymine (T) replaces uracil in DNA.

Below I’ll show the example code from the book for compressing the in-memory representation of a DNA sequence, which can have bases of A, C, G, or T. For those new to computer science, compression techniques allow you to use less memory to store information. Since we only have four possible nucleotide values we can encode them using 2 bits per character: 00, 01, 10, or 11. This will compress the same DNA information into at least 1/4 of the original memory, depending on the character set used for your String encoding.

Python

In this code, each nucleotide is converted to its corresponding 2-bit representation and prepended to bit_string . To decode it, we reverse the mapping; we decode each 2-bit chunk of the bit_string back to the corresponding character. Python makes this possible as its integer type just keeps expanding and can be as large as you need it to be. We can…

--

--

Brian Schlining
Brian Schlining

Written by Brian Schlining

Polyglot coder. Deep-sea Researcher. Zazen aficionado. I think squids are pretty cool.

Responses (1)