Perl5 as a Data Science Language
Series introduction — Post 0 of N
This post is the first in a series documenting the co-development of a vector-database engine (VDBE) written entirely in Perl5 + PDL. Later posts walk through every component of that engine; this one sets the stage. The main impetus for this series is NOT to persuade you to dump your current vector database, as I make no performance claims, but to show how one may use Perl to achieve pretty much anything you can achieve with any other language, and often more elegantly.
Table of Contents
- Table of Contents
- 1. Why Perl5 for Data Science?
- 2. The Perl Data-Type System — Strengths and Cache-Era Limits
- 3. Enter PDL: Strongly Typed N-Dimensional Arrays
- 4. Type Comparison: Perl, PDL, and R Side-by-Side
- 5. Road Map: What the Rest of This Series Covers
  - Post 1 — Serialisation and I/O: the `VectorIO` module
  - Post 2 — Simulating a Vector Database
  - Post 3 — Benchmarking: the `timing_DB` module
  - Post 4 — K-Means Clustering with `PDL::Stats::Kmeans`
  - Post 5 — Mini-Batch K-Means: Scaling to Large Datasets
  - Post 6 — Inverted File Index (IVF) Search
  - Post 7 — Validating Against R: Numerical Correctness and Cross-Language Pipelines
1. Why Perl5 for Data Science?
When data scientists discuss language choices the conversation quickly converges on
Python, R, or Julia. Perl5 rarely gets a seat at the table — yet it carries a compelling set of traits that deserve a second look. These traits have not materially changed over the years (Perl5 has always been this way!), but unless you have been exposed to the language, learned to appreciate its terseness, rationality, flexibility, and expressiveness, and actually used it to drive your work forward, you would not know that these features not only come for free with Perl5 but can help you drive your projects forward.
Ubiquity and zero-install deployment
Perl5 ships as a default component of virtually every UNIX-like operating system —
Linux distributions, macOS, BSDs, and many embedded Linux environments all include
a working perl binary out of the box. Python has been making inroads here, but
it is still common to find headless servers, network appliances, or HPC login nodes
where Perl is present and a full Python stack is not. A data pipeline written in
Perl can run on day one without a conda environment, a venv, or a container.
Portability from the data centre to the edge
The same script that analyses a terabyte dataset on a 256-core HPC node can, with
minor configuration changes, run on a Raspberry Pi, an IoT gateway, or an embedded
controller. Perl’s single-binary deployment model and low runtime overhead make it
a genuine “write once, run anywhere” language in environments where Python’s
interpreter overhead or Julia’s JIT warm-up time would be unacceptable.
If you are planning to deploy anywhere and everywhere, Perl5 is an obvious choice.
A heritage built on text and data munging
Perl was designed from the ground up for text processing, regular expressions, and
“glue” work between system components. In practice, scientific data pipelines are
dominated not by numerical computation but by data wrangling: reading heterogeneous
file formats, cleaning messy records, joining datasets from different sources, and
routing results to downstream consuming components. Perl’s regex engine remains among the most
powerful available, and one-liners can accomplish data-cleaning tasks that would
require helper libraries in other languages. If you are in the domain of scientific computing,
you may have come across the notions of workflow management systems and reproducible research.
Both rely on executing end-to-end data transformations and workflows to eliminate the
manual, error-prone, and tedious point-and-click activities that analysts and scientists must perform to morph
their data into insights and inferences. In this brave new world, Perl5’s rich
history allows it to shine both as a component of workflows and as the application language that
implements these workflows.
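To make the data-munging claim concrete, here is a sketch of the kind of one-pass clean-up Perl's regex engine makes routine. The record format (a padded identifier, a semicolon separator, a European decimal comma) is invented for illustration:

```perl
use v5.36;

# Hypothetical messy record: stray whitespace, an inconsistent separator,
# and a decimal comma in the numeric field.
my $line = "  sample_042 ;\t 3,14 \n";

$line =~ s/^\s+|\s+$//g;          # trim leading/trailing whitespace (incl. newline)
$line =~ s/\s*;\s*/;/g;           # normalise the field separator
my ($id, $value) = split /;/, $line;
$value =~ tr/,/./;                # decimal comma -> decimal point

say "$id => $value";              # sample_042 => 3.14
```

The same transformation fits on a command line as a `perl -pe` one-liner, which is where Perl's text-processing heritage shows most clearly.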
CPAN: a battle-tested module ecosystem
The Comprehensive Perl Archive Network (CPAN) hosts over 200,000 modules across
every domain imaginable. While the data-science offerings are not nearly as extensive as Python’s,
the basic components for dedicated builders ARE there:
- PDL (Perl Data Language) — vectorised numerical computing with strongly typed N-dimensional arrays (covered in depth below).
- PDL::Stats — descriptive statistics, regression, clustering (k-means, mini-batch k-means), and more, built on top of PDL ndarrays.
- AI::MXNet, AI::TensorFlow — deep learning bindings.
- Statistics::Regression, Statistics::Descriptive — classical statistics without the PDL dependency.
- Text::CSV, Spreadsheet::XLSX, Data::MessagePack, Sereal — high-performance serialisation and I/O.
- DBI + dozens of database drivers — SQL access to every major RDBMS.
- MCE (Many-Core Engine) — structured parallelism for shared- and distributed-memory workloads.
- Inline::C, Inline::CPP — embed C or C++ code directly inside a Perl source file; the compiler is invoked transparently the first time the script runs, making it trivial to drop performance-critical kernels into an otherwise pure-Perl program without a full XS build system.
- FFI::Platypus — call functions in any shared library (`.so`/`.dylib`/`.dll`) from Perl without writing a single line of XS or C glue code. Platypus supports all C-equivalent types, structs, callbacks, and closures, and is the modern way to bind Perl to BLAS, LAPACK, HDF5, or any other native library.
Modern Perl is not your grandfather’s Perl
The features below are drawn directly from the official release notes
(perl5360delta, perl5380delta, perl5400delta) and organised by the
release in which they reached stable status or were first introduced.
Only features relevant to data-science and scientific-computing workloads are
highlighted.
Perl 5.36 — May 2022

- `use v5.36` — the feature bundle now automatically enables `use warnings` in addition to `use strict`. It also disables the `indirect` method-call syntax and `multidimensional` hash-key simulation, eliminating two common sources of subtle bugs.
- Named subroutine signatures (stable since 5.36; experimental since 5.20) — function parameters are now declared by name, with optional defaults. The `//=` and `||=` default-value operators were further added to signatures in 5.38, allowing defaults that trigger on `undef` or falseness respectively:

```perl
use v5.36;
sub clamp ($val, $lo = 0, $hi //= 1) {
    $val < $lo ? $lo : $val > $hi ? $hi : $val;
}
```

- `isa` class-instance operator (stable since 5.36; introduced in 5.32) — `$obj isa "ClassName"` returns a boolean; cleaner than `ref($obj) eq "ClassName"`.
- `builtin` module (stable since 5.40; experimental since 5.36) — lexically importable functions built directly into the interpreter. The stable 5.40 bundle includes, among others:
  - `ceil`, `floor` — integer rounding without `use POSIX`.
  - `trim` — strip leading/trailing whitespace from a string.
  - `indexed` — pairs each element with its index; the idiomatic companion to multi-value `for` loops (see below).
  - `true`, `false`, `is_bool` — typed boolean sentinels; serialisers can now emit JSON `true`/`false` rather than `1`/`0`.
  - `weaken`, `unweaken`, `is_weak` — reference-count control for building bidirectional data structures without memory leaks.
  - `blessed`, `reftype`, `refaddr` — reference introspection.
- Stable boolean tracking (5.36) — scalars created as booleans (e.g., `!!1`) now retain their boolean nature through assignment, enabling reliable type-aware serialisation to JSON and MessagePack.
- Multi-value `for` loops (stable since 5.40; experimental since 5.36) — iterate over pairs or N-tuples without manual index arithmetic:

```perl
use v5.40;
use builtin 'indexed';
for my ($i, $val) (indexed @scores) { ... }   # index and value

# Or grab multiple values at the same time
for my ($val1, $val2, $val3) (@scores) { ... }
```

- `defer` blocks (experimental since 5.36) — a scope-exit guard that runs cleanup code unconditionally when a block exits, whether normally or via exception — a natural replacement for destructor-based scope-guard objects and an important pattern for resource management in data pipelines.
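The `defer` pattern can be sketched as follows. This is an illustrative example of mine (requires perl ≥ 5.36 with the experimental feature enabled); a simple counter stands in for a real resource such as a file handle or database connection:

```perl
use v5.36;
use feature 'defer';
no warnings 'experimental::defer';

my $open_handles = 0;

sub risky_step ($should_die) {
    $open_handles++;                  # "acquire" the resource
    defer { $open_handles-- };        # "release" on ANY scope exit
    die "boom\n" if $should_die;
    return 'ok';
}

say risky_step(0);                    # ok
eval { risky_step(1) };               # dies inside, but...
say $open_handles;                    # 0 — the deferred cleanup still ran
```

The same guarantee previously required a destructor-bearing guard object or careful `eval`/`local` gymnastics.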
Perl 5.38 — July 2023

- `PERL_RAND_SEED` environment variable (5.38) — setting this variable before a run makes every `rand` call (without an explicit `srand`) produce the same sequence, enabling reproducible stochastic algorithms — simulations, random sampling, Monte Carlo methods — without modifying source code.
- `class`/`field`/`method` syntax (experimental since 5.38) — a purpose-built, lexically-scoped object system requiring neither `bless` nor `@ISA` nor any CPAN module. Useful for defining typed value objects such as dataset rows, model parameters, or pipeline stages:

```perl
use feature 'class';
no warnings 'experimental::class';

class Vector2D {
    field $x :param;
    field $y :param;
    method magnitude { sqrt($x**2 + $y**2) }
}

my $v = Vector2D->new(x => 3, y => 4);
say $v->magnitude;   # 5
```
Perl 5.40 — June 2024

- `try`/`catch` exception handling (stable since 5.40; experimental since 5.34; `finally` block added in 5.36) — structured exception handling is now a core language feature; no CPAN module required:

```perl
use v5.40;
try {
    my $result = load_and_process($file);
}
catch ($e) {
    warn "Pipeline error: $e";
}
finally {
    close_resources();   # runs whether or not an exception was thrown
}
```

  (`Try::Tiny`/`Feature::Compat::Try` are only needed when targeting perls older than 5.34.)

- Multi-value `for` loops (stable since 5.40) — see the 5.36 entry above; they graduated from experimental to stable in this release.
- `builtin::inf` and `builtin::nan` (experimental since 5.40) — typed floating-point infinity and Not-a-Number constants, eliminating `9**9**9` or POSIX hacks in numerical code.
- `^^` logical XOR operator (5.40) — completes the medium-precedence logical operator set (`&&`, `||`, `^^`); handy for boolean mask operations.
- `use v5.40` imports builtin functions — beyond enabling the feature bundle, `use v5.40` also imports the corresponding `builtin` version bundle, making all stable `builtin::` functions available as short names without a separate `use builtin` statement.
Longstanding features (pre-5.36)

- `say` and `state` (since 5.10) — `say` is `print` with an implicit newline; `state` declares a lexical that persists across invocations of its enclosing sub (a lightweight memoisation primitive).
- First-class references and closures — anonymous subs, closures, and reference construction are fundamental and have been stable since Perl 5.
- `use constant` or the CPAN `Readonly` module for named constants; `Readonly` enforces deep immutability that `use constant` does not.
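As a small illustration of `state` as a memoisation primitive (my example, not from the series code), the cache hash below persists across calls without any package-level global:

```perl
use v5.36;

# Memoised Fibonacci: %seen survives between invocations of fib().
sub fib ($n) {
    state %seen;
    return $n if $n < 2;
    $seen{$n} //= fib($n - 1) + fib($n - 2);
}

say fib(30);   # 832040 — instant, where the naive recursion crawls
```

Without `state`, the conventional alternatives are a closure over a lexical hash or the CPAN `Memoize` module.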
Combined with perlbrew or plenv for version management and carton for
reproducible dependency snapshots, a modern Perl project looks and feels like a
first-class software engineering effort.
Honest limitations
No case for Perl is complete without honesty about where it falls short:
- Visualisation — Perl has no equivalent to `ggplot2` or `matplotlib`. Plots typically require an external call to R, gnuplot, or a web library. At times this weakness can become an actual strength, allowing one to use Perl5 as the application language that orchestrates and enhances the other actors.
- Community momentum — the data-science community has converged on Python and R. Finding ready-made tutorials, Stack Overflow answers, and co-authors is harder.
- Object orientation — without Moose/Moo the OOP model is verbose; with them it adds a dependency. The new `class` feature may solve some of these problems.
- Type safety at scale — the core language’s dynamic scalars make large, collaborative numerical codebases harder to reason about (see next section).
2. The Perl Data-Type System — Strengths and Cache-Era Limits
Core Perl types
Perl’s fundamental data model centres on three constructs:
| Construct | Sigil | What it holds |
|---|---|---|
| Scalar | `$` | A single value: number, string, reference, or `undef` |
| Array | `@` | An ordered list of scalars, indexed by integer |
| Hash | `%` | An unordered collection of scalar values keyed by string |
Everything else — objects, closures, complex data structures — is built from these
three primitives via references (\@array, \%hash, sub { ... }).
This model is extraordinarily flexible. A single array can hold integers, floating-
point numbers, strings, and nested references simultaneously. That flexibility is
exactly what made Perl the dominant system-administration and web-scripting language
for two decades.
The cache-hierarchy problem
Modern CPUs achieve peak throughput only when data flows through L1/L2/L3 cache† in
large, contiguous blocks — a property called spatial locality. Perl arrays do not
provide this. Under the hood, a Perl array is a C array of pointers to heap-
allocated scalar (SV) structs. Each scalar carries a reference count, a type tag,
and padding — typically 24–56 bytes per scalar on a 64-bit build. Iterating over a
million-element Perl array therefore involves a million pointer dereferences scattered
across the heap, producing a cache-miss pattern that completely negates the speed
advantage of modern SIMD pipelines.
A concrete consequence: a dot product of two 1 000-element vectors written in pure
Perl is roughly 100–1000× slower than the equivalent operation on a pair of PDL
float ndarrays, which occupy two flat, 4 000-byte memory regions that fit comfortably
in L1 cache.
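A minimal sketch of that pure-Perl dot product makes the pointer-chasing visible: every `$u->[$_]` access dereferences a separate heap-allocated SV. The PDL equivalent, noted in the trailing comment, is a single call over two contiguous buffers (sizes here are illustrative):

```perl
use v5.36;

# Pure-Perl dot product: arithmetic is cheap, the SV dereferences are not.
sub dot ($u, $v) {
    my $sum = 0;
    $sum += $u->[$_] * $v->[$_] for 0 .. $#$u;
    return $sum;
}

my @x = map { $_ * 0.5 } 1 .. 1000;   # 0.5, 1.0, ..., 500.0
my @y = (2) x 1000;
say dot(\@x, \@y);                    # 500500

# The PDL version operates on two flat float buffers instead:
#   use PDL;  my $s = inner(float(@x), float(@y));
```

Benchmarking this loop against `inner` on the same data is exactly the kind of measurement Post 3's timing harness is built for.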
Contrast with R
R occupies a curious middle ground. Like Perl, it is a dynamic, interpreted
language — variables are untyped containers, functions are first-class values, and
the interactive REPL is the primary development environment. R even has direct
analogues to Perl’s three core types:
| Perl concept | R analogue |
|---|---|
| `$scalar` | length-1 atomic vector or scalar-in-list |
| `@array` | `list()` |
| `%hash` | named `list()` |
| Reference (`\@arr`) | R does not use explicit references; copy-on-modify semantics instead |
But R’s workhorse type, the atomic vector, has no straightforward Perl counterpart.
An R atomic vector is a contiguous, homogeneously typed block of memory — exactly the
layout that a CPU cache rewards. Every built-in scalar in R is actually a length-1
atomic vector; there is no “bare scalar” outside of atomic vectors. This design
choice means that R code naturally operates on vectors of millions of doubles with
BLAS-level throughput, without the user writing a single loop or allocating a special
“array” object.
R’s atomic types are:
| R atomic type | Storage | C equivalent |
|---|---|---|
| `logical` | 4 bytes/element | `int` (with NA sentinel) |
| `integer` | 4 bytes/element | `int32_t` |
| `double` | 8 bytes/element | `double` |
| `complex` | 16 bytes/element | `_Complex double` |
| `character` | pointer to CHARSXP | `char *` (interned) |
| `raw` | 1 byte/element | `uint8_t` |
R also defines higher-level structures built on atomic vectors:
- matrix — a 2-D atomic vector with a `dim` attribute.
- array — an N-D atomic vector with a `dim` attribute.
- data.frame — a named list of equal-length atomic vectors; the lingua franca of tabular data in R.
- factor — an integer vector with a `levels` attribute; encodes categorical data.
The lesson: R’s performance in statistical and data-science applications
flows directly from its contiguous atomic vectors. Perl’s equivalent path to performance is
an extension (which also doubles as a standalone, MATLAB-like environment): the Perl Data Language, PDL.
3. Enter PDL: Strongly Typed N-Dimensional Arrays
The Perl Data Language (PDL, pdl.perl.org) extends Perl with ndarrays
(N-dimensional arrays): contiguous, strongly typed memory buffers that look and feel
like first-class Perl objects.
```perl
use PDL;

# A 1-D float ndarray — 4 bytes × 5 elements in one contiguous block
my $v = float( 1.0, 2.0, 3.0, 4.0, 5.0 );

# A 128-dimensional random database of 1000 vectors — all in cache-friendly memory
my $db = random( 128, 1000 );   # double by default

# Dot product of every DB vector against a query — a single BLAS call
my $scores = $db x $query->transpose;
```
PDL primitive types
PDL exposes the full palette of C numeric types as first-class constructors:
| PDL type | Bytes | C type | Constructor |
|---|---|---|---|
| `byte` | 1 | `uint8_t` | `byte(...)` |
| `short` | 2 | `int16_t` | `short(...)` |
| `ushort` | 2 | `uint16_t` | `ushort(...)` |
| `long` | 4 | `int32_t` | `long(...)` |
| `indx` | 4 or 8 | `ssize_t` | `indx(...)` |
| `longlong` | 8 | `int64_t` | `longlong(...)` |
| `float` | 4 | `float` | `float(...)` |
| `double` | 8 | `double` | `double(...)` |
| `cfloat` | 8 | `_Complex float` | `cfloat(...)` |
| `cdouble` | 16 | `_Complex double` | `cdouble(...)` |
Threading and SIMD
One of PDL’s most distinctive features is implicit threading: operations broadcast
automatically over extra dimensions, eliminating explicit loops in user code and
delegating inner loops to optimised C or Fortran kernels. Combined with
`set_autopthread_targ(N)`, PDL will automatically parallelise independent slices
across N OS threads — without the user writing a single `fork` or `Thread::Queue`
call.
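A short sketch of what enabling auto-pthreading looks like in practice; the thread target and minimum problem size below are illustrative settings, not tuned recommendations:

```perl
use v5.36;
use PDL;

set_autopthread_targ(4);    # aim for up to 4 POSIX threads per operation
set_autopthread_size(1);    # only parallelise ndarrays larger than ~1 MB

my $m        = random(1000, 1000);   # 1e6 doubles in one contiguous buffer
my $col_sums = $m->sumover;          # inner C loop; PDL may split it across threads
say $col_sums->nelem;                # 1000 — one sum per row of the first dim
```

No user-visible code changes are needed when the thread target changes: the same `sumover` call runs serially or in parallel depending on these two knobs.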
Bad values
PDL has a built-in concept of bad values (`PDL::Bad`), directly analogous to R’s `NA`. An ndarray can be flagged as “bad-value aware”, and PDL operations propagate
badness correctly through arithmetic, statistics, and I/O.
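A minimal bad-value sketch (assuming a PDL build with bad-value support, which is the default in modern PDL):

```perl
use v5.36;
use PDL;

my $x = pdl(1, 2, 3, 4);
$x = $x->setbadif($x == 3);   # mark matching elements as bad (R's NA)

my $y = $x * 10;              # badness propagates through arithmetic
say $y->nbad;                 # 1 — still exactly one bad element
say $y->sum;                  # 70 — bad-aware reductions skip bad elements
```

This mirrors how R silently carries `NA` through vectorised arithmetic, while `sum` here behaves like R's `sum(..., na.rm = TRUE)`.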
4. Type Comparison: Perl, PDL, and R Side-by-Side
The table below maps every commonly used R type to its closest Perl and PDL
counterparts, highlighting where the three languages agree, differ, or complement
each other.
| R type | Perl equivalent | PDL equivalent | Notes |
|---|---|---|---|
| `double` (length-1) | `$x = 3.14` (scalar) | `double(3.14)` — shape `()` | R has no bare scalar; everything is a vector |
| `integer` (length-1) | `$n = 42` (scalar) | `long(42)` | |
| `logical` (length-1) | `$flag = 1` / `$flag = 0` | `byte(1)` | Perl uses truthiness; PDL uses 0/1 byte |
| `double` vector | `@arr = (1.1, 2.2, 3.3)` | `double(1.1, 2.2, 3.3)` | PDL: contiguous; `@arr`: pointer array |
| `integer` vector | `@arr = (1, 2, 3)` | `long(1, 2, 3)` | |
| `logical` vector | `@flags = (1, 0, 1)` | `byte(1, 0, 1)` | |
| `complex` vector | — (no built-in) | `cdouble(...)` | Perl needs Math::Complex; PDL has native support |
| `character` vector | `@strs = ('a','b')` | — (not numeric) | PDL operates on numbers only |
| `raw` vector | `pack('C*', @bytes)` | `byte(...)` | |
| `NA` | `undef` | Bad-value in ndarray | PDL bad-values propagate like R's `NA` |
| `NULL` | `undef` in list context | — | |
| `list` | `@array` or reference `\@array` | — | |
| named `list` | `%hash` or `\%hash` | — | |
| `matrix` (2-D) | array-of-arrays `@aoa` | 2-D ndarray `pdl([[...],[...]])` | Both PDL and R are column-major |
| `array` (N-D) | nested references | N-D ndarray `$x->reshape(...)` | |
| `data.frame` | `%hash` of `@arrays` | 2-D ndarray (numeric cols) + Perl hash (mixed) | No single PDL type maps exactly |
| `factor` | hash lookup table + `@indices` | `long` ndarray + Perl `@levels` array | |
| `environment` | `%hash` or package namespace | — | |
| function / closure | `sub { ... }` / closure | — | PDL PP defines compiled kernels |
| S3 / S4 object | blessed reference + method dispatch | PDL object (blessed ndarray) | PDL objects are first-class Perl objects |
Key takeaways

- For pure numeric, homogeneous data (vectors, matrices, tensors), PDL ndarrays and R atomic vectors are functionally equivalent and comparably efficient.
- For heterogeneous tabular data (mixed types, string columns, factors), R's `data.frame` is more ergonomic; Perl typically uses a hash of arrays or a dedicated module such as `Data::Frame` or `PDL::IO::CSV`.
- For text, irregular structures, and system glue, Perl's native types are superior to both R and Python.
- The Perl+PDL combination therefore provides the union of what R offers as a statistical language and what Perl offers as a systems language — at the cost of a steeper learning curve and, frankly, limited out-of-the-box statistical tooling. However, the combination of Perl+PDL+R (with the latter used as a component, or instrumentalised via Perl) can close much of that gap.
5. Road Map: What the Rest of This Series Covers
This series documents the construction of a vector database engine built in
Perl5 + PDL from scratch. Vector databases underpin modern retrieval-augmented
generation (RAG) pipelines, semantic search, and nearest-neighbour recommendation
systems. Implementing one from first principles is an excellent vehicle for
demonstrating PDL’s numerical capabilities alongside Perl’s systems-programming
strengths.
The directory co-developed alongside these posts contains the following components,
each of which will be the subject of one or more dedicated posts that reference files
in a dedicated repository.
Post 1 — Serialisation and I/O: the VectorIO module
File: VectorIO.pm
The engine stores vectors as packed binary blobs inside
MessagePack payloads. This post covers:
- Designing a module with a clean `Exporter`-based public API under `use v5.40`.
- Validation helpers that enforce schema correctness at system boundaries.
Post 2 — Simulating a Vector Database
File: simulate_vectorDB.pl
Before we can search a database we need one. This post shows:
- Generating reproducible random float vectors with PDL's `random`.
- Using `Getopt::Long` for ergonomic CLI option parsing.
- Writing a `--seed`-controlled simulation that produces identical databases across runs — essential for benchmarking.
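As a taste of what such a seeded simulator might look like, here is a sketch of mine; the option names and defaults are illustrative, not necessarily those used in `simulate_vectorDB.pl`:

```perl
use v5.36;
use PDL;
use Getopt::Long;

GetOptions(
    'seed=i' => \(my $seed = 42),
    'dim=i'  => \(my $dim  = 128),
    'n=i'    => \(my $n    = 1000),
) or die "usage: $0 [--seed S] [--dim D] [--n N]\n";

srandom($seed);                       # seed PDL's generator for reproducibility
my $db = random($dim, $n)->float;     # same seed => bit-identical database
say join ' x ', $db->dims;            # e.g. "128 x 1000"
```

Re-running with the same `--seed` regenerates the identical database, so later benchmark posts can compare implementations on exactly the same data.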
Post 3 — Benchmarking: the timing_DB Module
File: timing_DB.pm
Performance claims require measurement. This post introduces:
- A reusable Perl benchmarking harness built on `Time::HiRes`.
- Methodology for fair wall-clock comparisons between Perl/PDL and R implementations.
- Interpreting throughput (vectors/second) vs. latency (ms/query) for different workload sizes.
Post 4 — K-Means Clustering with PDL::Stats::Kmeans
File: kmeans.pl
K-means clustering is the backbone of the inverted-file index (IVF) approach to
approximate nearest-neighbour search. This post covers:
- The `PDL::Stats::Kmeans` interface and its return contract (`centroid`, `cluster`, `n`, `R2`, `ss`).
- Interpreting the `[obs × clusters]` membership mask returned by `run_kmeans`.
- Comparing Perl/PDL k-means centroids against R's `kmeans()` and `ClusterR::MiniBatchKmeans()` to validate numerical correctness.
Post 5 — Mini-Batch K-Means: Scaling to Large Datasets
File: compare_kmeans_centroids.pl
Full k-means requires all data in memory for every iteration. Mini-batch k-means
trades a small amount of centroid accuracy for a large reduction in memory and
compute. This post explores:
- Implementing a true re-sampled mini-batch loop in PDL.
- Quantifying centroid drift between full and mini-batch variants.
- Side-by-side output with R's `MiniBatchKmeans` from the `ClusterR` package.
Post 6 — Inverted File Index (IVF) Search
File: compare_ivf_search.pl
With centroids in hand we can partition the database and perform sub-linear
approximate nearest-neighbour search. This post covers:
- Building the inverted lists: mapping each database vector to its nearest centroid.
- The `unpack_inverted_lists` helper in `VectorIO`.
- Querying: finding the top-K nearest centroids, then searching only those lists.
- Accuracy vs. speed trade-offs as the number of probed lists varies.
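The first bullet above, mapping each vector to its nearest centroid, can be sketched with PDL broadcasting. The shapes and variable names below are illustrative and are not taken from `compare_ivf_search.pl`:

```perl
use v5.36;
use PDL;

my $db        = pdl([[0, 0], [10, 10], [9, 8]]);   # (dim=2, n=3) database
my $centroids = pdl([[0, 0], [9, 9]]);             # (dim=2, k=2) centroids

# (dim, n, 1) - (dim, 1, k) broadcasts to (dim, n, k)
my $diff   = $db->dummy(2) - $centroids->dummy(1);
my $d2     = ($diff ** 2)->sumover;                # squared distances, (n, k)
my $assign = $d2->xchg(0, 1)->minimum_ind;         # argmin over k => (n)
say $assign;                                       # [0 1 1]
```

No Perl-level loop appears anywhere; the three broadcast operations replace the doubly nested distance loop a scalar-language implementation would need.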
Post 7 — Validating Against R: Numerical Correctness and Cross-Language Pipelines
Files: compare_kmeans_centroids.R, compare_kmeans_centroids_pure.R, plot_centroid_coordinates.R
The final post in the foundation series closes the loop between Perl and R:
- Exporting PDL results to CSV and reading them in R for independent validation.
- Using ggplot2 to visualise centroid coordinates from both languages simultaneously.
- A workflow pattern for “compute in Perl, visualise in R” that leverages the strengths of both ecosystems.
Next up — Post 1: Serialisation and I/O with `VectorIO.pm`
† Modern CPUs have multiple levels of fast, on-chip memory called caches (L1, L2, L3)
that sit between the processor cores and main RAM. L1 is the smallest (typically 32–64 KB per
core) and fastest (1–4 clock cycles latency); L2 is larger (256 KB–1 MB) and slightly slower;
L3 is shared across cores (4–64 MB) with higher latency still. Main RAM sits further away at
60–100 ns latency — roughly 200× slower than L1. When a computation touches memory in a
predictable, contiguous pattern the hardware prefetcher can load upcoming data into L1/L2
before it is needed, achieving near-peak throughput. Scattered pointer-chasing (such as
traversing a Perl array of heap-allocated scalars) defeats prefetching, stalling the CPU while
it waits for each cache miss to be resolved from RAM.
