Perl5 as a Data Science Language

Last updated by Joe Schaefer on Sun, 29 Mar 2026

Data Science

Series introduction — Post 0 of N
This post is the first in a series documenting the co-development of a vector-database engine (VDBE) written entirely in Perl5 + PDL. Later posts walk through every component of that engine; this one sets the stage. The main impetus for this series is NOT to persuade you to dump your existing VDBE (I make no performance claims), but to show how one may use Perl to achieve pretty much anything you can achieve with any other language, and do it smarter!


Table of Contents


1. Why Perl5 for Data Science?

When data scientists discuss language choices, the conversation quickly converges on
Python, R, or Julia. Perl5 rarely gets a seat at the table — yet it carries a compelling set of traits that deserve a second look. These traits have not materially changed over the years (Perl5 has always been this way!), but unless you have been exposed to the language, learned to appreciate its terseness, rationality, flexibility, and expressiveness, and actually used it to drive your work forward, you would not know that these features not only come for free with Perl5, but can help you drive your projects forward.

Ubiquity and zero-install deployment

Perl5 ships as a default component of virtually every UNIX-like operating system —
Linux distributions, macOS, BSDs, and many embedded Linux environments all include
a working perl binary out of the box. Python has been making inroads here, but
it is still common to find headless servers, network appliances, or HPC login nodes
where Perl is present and a full Python stack is not. A data pipeline written in
Perl can run on day one without a conda environment, a venv, or a container.

Portability from the data centre to the edge

The same script that analyses a terabyte dataset on a 256-core HPC node can, with
minor configuration changes, run on a Raspberry Pi, an IoT gateway, or an embedded
controller. Perl’s single-binary deployment model and low runtime overhead make it
a genuine “write once, run anywhere” language in environments where Python’s
interpreter overhead or Julia’s JIT warm-up time would be unacceptable.
If you are planning to deploy anywhere and everywhere, Perl5 is your obvious choice.

A heritage built on text and data munging

Perl was designed from the ground up for text processing, regular expressions, and
“glue” work between system components. In practice, scientific data pipelines are
dominated not by numerical computation but by data wrangling: reading heterogeneous
file formats, cleaning messy records, joining datasets from different sources, and
routing results to downstream consumers. Perl’s regex engine remains among the most
powerful available, and one-liners can accomplish data-cleaning tasks that would
require helper libraries in other languages. If you work in the domain of scientific
computing, you may have come across the notions of workflow management systems and
reproducible research. Both rely on executing end-to-end data transformations to
eliminate the manual, error-prone, and tedious point-and-click work that analysts
and scientists must perform to turn their data into insights and inferences. In this
brave new world, Perl5’s rich history allows it to shine both as a component of such
workflows and as the application language that implements them.
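As a small taste of that munging heritage, here is a minimal core-Perl sketch (the messy records and the missing-value markers are invented for the example) that trims stray whitespace and unifies inconsistent NA markers using nothing but regexes — no modules required:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical messy CSV records: stray whitespace, inconsistent NA markers
my @raw = (
    "  alice , 42 , 3.14 ",
    "bob,NA,2.72",
    "carol,17,N/A",
);

my @clean;
for my $line (@raw) {
    my @fields = split /,/, $line;
    s/^\s+|\s+$//g for @fields;                              # trim each field
    @fields = map { /^(?:NA|N\/A|)$/i ? undef : $_ } @fields; # unify missing values
    push @clean, \@fields;
}

# The string "NA" in record 1 is now a proper undef
print defined $clean[1][1] ? $clean[1][1] : "undef", "\n";
```

The same cleanup compresses naturally into a one-liner with `perl -lane`, which is how much of this work gets done in practice.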

CPAN: a battle-tested module ecosystem

The Comprehensive Perl Archive Network (CPAN) hosts over 200,000 modules across
every domain imaginable. While the data-science offerings are not nearly as extensive as Python’s,
the basic components for dedicated builders ARE there:

Modern Perl is not your grandfather’s Perl

The features below are drawn directly from the official release notes
(perl5360delta, perl5380delta, perl5400delta) and organised by the
release in which they reached stable status or were first introduced.
Only features relevant to data-science and scientific-computing workloads are
highlighted.

Perl 5.36 — May 2022

Perl 5.38 — July 2023

Perl 5.40 — June 2024

Longstanding features (pre-5.36)

Combined with perlbrew or plenv for version management and carton for
reproducible dependency snapshots, a modern Perl project looks and feels like a
first-class software engineering effort.

Honest limitations

No case for Perl is complete without honesty about where it falls short:


2. The Perl Data-Type System — Strengths and Cache-Era Limits

Core Perl types

Perl’s fundamental data model centres on three constructs:

Construct Sigil What it holds
Scalar $ A single value: number, string, reference, or undef
Array @ An ordered list of scalars, indexed by integer
Hash % An unordered collection of scalar values keyed by string

Everything else — objects, closures, complex data structures — is built from these
three primitives via references (\@array, \%hash, sub { ... }).

This model is extraordinarily flexible. A single array can hold integers, floating-
point numbers, strings, and nested references simultaneously. That flexibility is
exactly what made Perl the dominant system-administration and web-scripting language
for two decades.
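A short sketch of how those three primitives compose via references (the weather-station record here is a made-up example):

```perl
use strict;
use warnings;

# The three primitives...
my $count   = 3;                            # scalar
my @temps   = (20.5, 21.0, 19.8);           # array of scalars
my %station = (id => 'ST-7', lat => 51.5);  # hash keyed by string

# ...composed into an arbitrary nested structure via references
my %record = (
    station => \%station,   # hash reference
    temps   => \@temps,     # array reference
    # code reference: a closure over @temps computing the mean
    summary => sub { my $s = 0; $s += $_ for @temps; $s / @temps },
);

printf "station %s mean temp %.2f\n",
    $record{station}{id}, $record{summary}->();
```

Note that the hash freely mixes a hash ref, an array ref, and a code ref as values — exactly the flexibility described above.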

The cache-hierarchy problem

Modern CPUs achieve peak throughput only when data flows through L1/L2/L3 cache in
large, contiguous blocks — a property called spatial locality. Perl arrays do not
provide this. Under the hood, a Perl array is a C array of pointers to heap-
allocated scalar (SV) structs. Each scalar carries a reference count, a type tag,
and padding — typically 24–56 bytes per scalar on a 64-bit build. Iterating over a
million-element Perl array therefore involves a million pointer dereferences scattered
across the heap, producing a cache-miss pattern that completely negates the speed
advantage of modern SIMD pipelines.

A concrete consequence: a dot product of two 1 000-element vectors written in pure
Perl is roughly 100–1000× slower than the equivalent operation on a pair of PDL
float ndarrays, which occupy two flat, 4 000-byte memory regions that fit comfortably
in L1 cache.
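To make that access pattern concrete, here is what the pure-Perl side of the comparison looks like. Every element access below dereferences a pointer to a separately heap-allocated scalar, which is exactly the scattered, prefetcher-defeating pattern described above (the PDL version replaces the whole loop with one call into a C kernel):

```perl
use strict;
use warnings;

# Pure-Perl dot product: each $u->[$i] and $v->[$i] chases a pointer to
# its own heap-allocated SV struct, so the loop walks scattered memory
# rather than one contiguous buffer.
sub dot {
    my ($u, $v) = @_;
    my $sum = 0;
    $sum += $u->[$_] * $v->[$_] for 0 .. $#$u;
    return $sum;
}

my @x = map { $_ / 1000 } 1 .. 1000;   # 0.001 .. 1.000
my @y = (1.0) x 1000;
printf "%.4f\n", dot(\@x, \@y);
```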

Contrast with R

R occupies a curious middle ground. Like Perl, it is a dynamic, interpreted
language — variables are untyped containers, functions are first-class values, and
the interactive REPL is the primary development environment. R even has direct
analogues to Perl’s three core types:

Perl concept R analogue
$scalar length-1 atomic vector or scalar-in-list
@array list()
%hash named list()
Reference (\@arr) R does not use explicit references; copy-on-modify semantics instead

But R’s workhorse type, the atomic vector, has no straightforward Perl counterpart.
An R atomic vector is a contiguous, homogeneously typed block of memory — exactly the
layout that a CPU cache rewards. Every built-in scalar in R is actually a length-1
atomic vector; there is no “bare scalar” outside of atomic vectors. This design
choice means that R code naturally operates on vectors of millions of doubles with
BLAS-level throughput, without the user writing a single loop or allocating a special
“array” object.

R’s atomic types are:

R atomic type Storage C equivalent
logical 4 bytes/element int (with NA sentinel)
integer 4 bytes/element int32_t
double 8 bytes/element double
complex 16 bytes/element _Complex double
character pointer to CHARSXP char * (interned)
raw 1 byte/element uint8_t

R also defines higher-level structures built on atomic vectors:

The lesson: R’s computing performance in statistical and data-science applications
flows directly from its contiguous atomic vectors. Perl’s equivalent path to performance is
an extension (which is also a standalone MATLAB-like environment): the Perl Data Language, PDL.


3. Enter PDL: Strongly Typed N-Dimensional Arrays

The Perl Data Language (PDL, pdl.perl.org) extends Perl with ndarrays
(N-dimensional arrays): contiguous, strongly typed memory buffers that look and feel
like first-class Perl objects.

use PDL;

# A 1-D float ndarray — 4 bytes × 5 elements in one contiguous block
my $v = float( 1.0, 2.0, 3.0, 4.0, 5.0 );

# A database of 1000 random 128-dimensional vectors — all in cache-friendly memory
my $db = random( 128, 1000 );   # double by default

# A single 128-dimensional query vector
my $query = random( 128, 1 );

# Dot product of every DB vector against the query — a single BLAS call
my $scores = $db x $query->transpose;   # 1000 scores

PDL primitive types

PDL exposes the full palette of C numeric types as first-class constructors:

PDL type Bytes C type Constructor
byte 1 uint8_t byte(...)
short 2 int16_t short(...)
ushort 2 uint16_t ushort(...)
long 4 int32_t long(...)
indx 4 or 8 ssize_t indx(...)
longlong 8 int64_t longlong(...)
float 4 float float(...)
double 8 double double(...)
cfloat 8 _Complex float cfloat(...)
cdouble 16 _Complex double cdouble(...)

Threading and SIMD

One of PDL’s most distinctive features is implicit threading: operations broadcast
automatically over extra dimensions, eliminating explicit loops in user code and
delegating inner loops to optimised C or Fortran kernels. Combined with
set_autopthread_targ(N), PDL will automatically parallelise independent slices
across N OS threads — without the user writing a single fork or Thread::Queue
call.

Bad values

PDL has a built-in concept of bad values (PDL::Bad), directly analogous to R’s
NA. An ndarray can be flagged as “bad-value aware”, and PDL operations propagate
badness correctly through arithmetic, statistics, and I/O.


4. Type Comparison: Perl, PDL, and R Side-by-Side

The table below maps every commonly used R type to its closest Perl and PDL
counterparts, highlighting where the three languages agree, differ, or complement
each other.

R type Perl equivalent PDL equivalent Notes
double (length-1) $x = 3.14 (scalar) double(3.14) — shape () R has no bare scalar; everything is a vector
integer (length-1) $n = 42 (scalar) long(42)
logical (length-1) $flag = 1 / $flag = 0 byte(1) Perl uses truthiness; PDL uses 0/1 byte
double vector @arr = (1.1, 2.2, 3.3) double(1.1, 2.2, 3.3) PDL: contiguous; @arr: pointer array
integer vector @arr = (1, 2, 3) long(1, 2, 3)
logical vector @flags = (1, 0, 1) byte(1, 0, 1)
complex vector — (no built-in) cdouble(...) Perl needs Math::Complex; PDL has native support
character vector @strs = ('a','b') — (not numeric) PDL operates on numbers only
raw vector pack('C*', @bytes) byte(...)
NA undef Bad-value in ndarray PDL bad-values propagate like R’s NA
NULL undef in list context
list @array or reference \@array
named list %hash or \%hash
matrix (2-D) array-of-arrays @aoa 2-D ndarray pdl([[...],[...]]) Both PDL and R are column-major (first index varies fastest)
array (N-D) nested references N-D ndarray $x->reshape(...)
data.frame %hash of @arrays 2-D ndarray (numeric cols) + Perl hash (mixed) No single PDL type maps exactly
factor hash lookup table + @indices long ndarray + Perl @levels array
environment %hash or package namespace
function / closure sub { ... } / closure PDL PP defines compiled kernels
S3 / S4 object blessed reference + method dispatch PDL object (blessed ndarray) PDL objects are first-class Perl objects
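As a concrete illustration of the data.frame row in the table above, the closest core-Perl analogue is a hash of equal-length column arrays (the column names and values are invented for the example):

```perl
use strict;
use warnings;

# data.frame analogue: a hash of equal-length column arrays
my %df = (
    name  => [ 'alice', 'bob', 'carol' ],
    score => [ 88, 92, 79 ],
);

# Row-wise access means indexing every column at the same position
my $i = 1;
printf "%s scored %d\n", $df{name}[$i], $df{score}[$i];

# Column-wise operations are plain array operations
my ($max) = sort { $b <=> $a } @{ $df{score} };
print "max score: $max\n";
```

This captures the column-oriented layout of a data.frame, but — as the table notes — there is no single type that also enforces equal column lengths or carries row names; in a real pipeline the numeric columns would live in PDL ndarrays instead.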

Key takeaways


5. Road Map: What the Rest of This Series Covers

This series documents the construction of a vector database engine built in
Perl5 + PDL from scratch. Vector databases underpin modern retrieval-augmented
generation (RAG) pipelines, semantic search, and nearest-neighbour recommendation
systems. Implementing one from first principles is an excellent vehicle for
demonstrating PDL’s numerical capabilities alongside Perl’s systems-programming
strengths.

The directory co-developed alongside these posts contains the following components,
each of which will be the subject of one or more dedicated posts referencing files
in a dedicated repository.

Post 1 — Serialisation and I/O: the VectorIO module

File: VectorIO.pm

The engine stores vectors as packed binary blobs inside
MessagePack payloads. This post covers:
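As a core-Perl preview of the packed-blob idea (the actual module and its MessagePack framing are the subject of the post itself), `pack` and `unpack` round-trip a float vector through one contiguous binary string:

```perl
use strict;
use warnings;

# A vector of single-precision floats becomes one contiguous binary
# string via pack, and round-trips back via unpack. (The engine wraps
# such blobs inside MessagePack payloads.)
my @vec  = (1.5, -2.25, 3.0, 0.5);      # exactly representable in float
my $blob = pack 'f*', @vec;             # 4 bytes per element

printf "blob is %d bytes\n", length $blob;   # 16 bytes for 4 floats
my @back = unpack 'f*', $blob;
printf "round-trip ok: %s\n",
    (join(',', @back) eq join(',', @vec)) ? 'yes' : 'no';
```

Because the blob is a single flat buffer, it can be written to disk, sent over a socket, or handed directly to PDL without any per-element boxing.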

Post 2 — Simulating a Vector Database

File: simulate_vectorDB.pl

Before we can search a database we need one. This post shows:

Post 3 — Benchmarking: the timing_DB Module

File: timing_DB.pm

Performance claims require measurement. This post introduces:

Post 4 — K-Means Clustering with PDL::Stats::Kmeans

File: kmeans.pl

K-means clustering is the backbone of the inverted-file index (IVF) approach to
approximate nearest-neighbour search. This post covers:

Post 5 — Mini-Batch K-Means: Scaling to Large Datasets

File: compare_kmeans_centroids.pl

Full k-means requires all data in memory for every iteration. Mini-batch k-means
trades a small amount of centroid accuracy for a large reduction in memory and
compute. This post explores:

Post 6 — Inverted File Index (IVF) Search

File: compare_ivf_search.pl

With centroids in hand we can partition the database and perform sub-linear
approximate nearest-neighbour search. This post covers:

Post 7 — Validating Against R: Numerical Correctness and Cross-Language Pipelines

Files: compare_kmeans_centroids.R, compare_kmeans_centroids_pure.R,
plot_centroid_coordinates.R

The final post in the foundation series closes the loop between Perl and R:


Next up — Post 1: Serialisation and I/O with VectorIO.pm


Modern CPUs have multiple levels of fast, on-chip memory called caches (L1, L2, L3)
that sit between the processor cores and main RAM. L1 is the smallest (typically 32–64 KB per
core) and fastest (1–4 clock cycles latency); L2 is larger (256 KB–1 MB) and slightly slower;
L3 is shared across cores (4–64 MB) with higher latency still. Main RAM sits further away at
60–100 ns latency — roughly 200× slower than L1. When a computation touches memory in a
predictable, contiguous pattern the hardware prefetcher can load upcoming data into L1/L2
before it is needed, achieving near-peak throughput. Scattered pointer-chasing (such as
traversing a Perl array of heap-allocated scalars) defeats prefetching, stalling the CPU while
it waits for each cache miss to be resolved from RAM.