Perl5 as a Data Science Language

[FINISHED] Last updated by Joe Schaefer on Thu, 16 Apr 2026    Source
 


Series introduction — Post 0 of N
This post is the first in a series documenting the co-development of a vector-database engine (VDBE) written entirely in Perl5 + PDL. Later posts walk through every component of that engine; this one sets the stage. The main impetus for this series is NOT to have you dump your existing VDBE (I make no performance claims) but to show how one may use Perl to achieve pretty much anything you can achieve with any other language, but smarter!


1. Why Perl5 for Data Science?

When data scientists discuss language choices the conversation quickly converges on Python, R, or Julia. Perl5 rarely gets a seat at the table — yet it carries a compelling set of traits that deserve a second look. These traits have not materially changed over the years (Perl5 has always been this way!). But unless you have been exposed to the language, learned to appreciate its terseness, rationality, flexibility and expressiveness, and actually used it to drive your work forward, you would not know that these features not only come for free with Perl5 but can also help drive your projects forward.

Ubiquity and zero-install deployment

Perl5 ships as a default component of virtually every UNIX-like operating system — Linux distributions, macOS, BSDs, and many embedded Linux environments all include a working perl binary out of the box. Python has been making inroads here, but it is still common to find headless servers, network appliances, or HPC login nodes
where Perl is present and a full Python stack is not. A data pipeline written in Perl can run on day one without a conda environment, a venv, or a container.

Portability from the data centre to the edge

The same script that analyses a terabyte dataset on a 256-core HPC node can, with minor configuration changes, run on a Raspberry Pi, an IoT gateway, or an embedded controller. Perl’s single-binary deployment model and low runtime overhead make it a genuine “write once, run anywhere” language in environments where Python’s interpreter overhead or Julia’s JIT warm-up time would be unacceptable.

If you are planning to deploy anywhere and everywhere, Perl5 is your obvious choice.

A heritage built on text and data munging

Perl was designed from the ground up for text processing, regular expressions, and “glue” work between system components. In practice, scientific data pipelines are dominated not by numerical computation but by data wrangling: reading heterogeneous file formats, cleaning messy records, joining datasets from different sources, and routing results to downstream consumers.

Perl’s regex engine remains among the most powerful available, and one-liners can accomplish data-cleaning tasks that would require helper libraries in other languages.

If you work in the domain of scientific computing, you may have come across the notions of workflow-management systems and reproducible research. Both rely on executing end-to-end data transformations as workflows, eliminating the manual, error-prone and tedious point-and-click activities that analysts and scientists must otherwise perform to morph their data into insights and inferences respectively.

In this brave new world, Perl5’s rich history allows it to shine both as a component of these workflows and as the application language that implements them.

CPAN: a battle-tested module ecosystem

The Comprehensive Perl Archive Network (CPAN) hosts over 200,000 modules across every domain imaginable. While the data-science offerings are not nearly as extensive as Python’s, the basic components for dedicated builders ARE there.
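For example, the handful of CPAN distributions this series leans on can be pinned in a cpanfile (the module list is illustrative and the version bound is an assumption, not a requirement of the series):

```perl
# cpanfile — consumed by `carton install` or `cpanm --installdeps .`
requires 'PDL',               '0';   # ndarrays, broadcasting, BLAS glue
requires 'PDL::Stats',        '0';   # k-means and other statistics on ndarrays
requires 'Data::MessagePack', '0';   # compact binary serialisation
requires 'Text::CSV_XS',      '0';   # fast, correct CSV parsing
```

A cpanfile is itself a tiny Perl DSL, which keeps dependency management inside the language rather than in yet another configuration format.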

Modern Perl is not your grandfather’s Perl

The features below are drawn directly from the official release notes (perl5360delta, perl5380delta, perl5400delta) and organised by the release in which they reached stable status or were first introduced. Only features relevant to data-science and scientific-computing workloads are highlighted.

Perl 5.36 — May 2022

use v5.36 turns on strict, warnings, and the now-stable subroutine signatures in one line; the isa operator is no longer experimental; the builtin:: namespace (true, false, ceil, floor, trim) and iteration over multiple loop values (for my ($k, $v) (%h)) arrive as experiments.

Perl 5.38 — July 2023

The experimental class feature lands, bringing native object-oriented syntax (class, method, field) to core Perl.

Perl 5.40 — June 2024

try/catch and multi-value foreach are declared stable; the class feature gains the :reader field attribute and the __CLASS__ keyword; a new logical-xor operator ^^ is added.

Longstanding features (pre-5.36)

Combined with perlbrew or plenv for version management and carton for reproducible dependency snapshots, a modern Perl project looks and feels like a first-class software engineering effort.

Honest limitations

No case for Perl is complete without honesty about where it falls short: the data-science ecosystem is a fraction of Python’s (there is no native deep-learning stack comparable to PyTorch or TensorFlow), the smaller pool of practitioners means fewer tutorials and worked examples, and nothing in core Perl or PDL maps one-to-one onto a data.frame.


2. The Perl Data-Type System — Strengths and Cache-Era Limits

Core Perl types

Perl’s fundamental data model centres on three constructs:

Construct Sigil What it holds
Scalar $ A single value: number, string, reference, or undef
Array @ An ordered list of scalars, indexed by integer
Hash % An unordered collection of scalar values keyed by string

Everything else — objects, closures, complex data structures — is built from these three primitives via references (\@array, \%hash, sub { ... }).

This model is extraordinarily flexible. A single array can hold integers, floating-point numbers, strings, and nested references simultaneously. That flexibility is exactly what made Perl the dominant system-administration and web-scripting language for two decades.

The cache-hierarchy problem

Modern CPUs achieve peak throughput only when data flows through L1/L2/L3 cache in large, contiguous blocks — a property called spatial locality. Perl arrays do not provide this. Under the hood, a Perl array is a C array of pointers to heap-allocated scalar (SV) structs. Each scalar carries a reference count, a type tag, and padding — typically 24–56 bytes per scalar on a 64-bit build. Iterating over a million-element Perl array therefore involves a million pointer dereferences scattered across the heap, producing a cache-miss pattern that completely negates the speed advantage of modern SIMD pipelines.

A concrete consequence: a dot product of two 1,000-element vectors written in pure Perl is roughly 100–1000× slower than the equivalent operation on a pair of PDL float ndarrays, which occupy two flat, 4,000-byte memory regions that fit comfortably in L1 cache.

Contrast with R

R occupies a curious middle ground. Like Perl, it is a dynamic, interpreted language — variables are untyped containers, functions are first-class values, and the interactive REPL is the primary development environment. R even has direct analogues to Perl’s three core types:

Perl concept R analogue
$scalar length-1 atomic vector or scalar-in-list
@array list()
%hash named list()
Reference (\@arr) R does not use explicit references; copy-on-modify semantics instead

But R’s workhorse type, the atomic vector, has no straightforward Perl counterpart. An R atomic vector is a contiguous, homogeneously typed block of memory — exactly the layout that a CPU cache rewards. Every built-in scalar in R is actually a length-1 atomic vector; there is no “bare scalar” outside of atomic vectors.

This design choice means that R code naturally operates on vectors of millions of doubles with BLAS-level throughput, without the user writing a single loop or allocating a special “array” object.

R’s atomic types are:

R atomic type Storage C equivalent
logical 4 bytes/element int (with NA sentinel)
integer 4 bytes/element int32_t
double 8 bytes/element double
complex 16 bytes/element _Complex double
character pointer to CHARSXP char * (interned)
raw 1 byte/element uint8_t

R also defines higher-level structures built on atomic vectors:

The lesson: R’s performance in statistical and data-science applications flows directly from its contiguous atomic vectors. Perl’s equivalent path to performance is an extension (which is also a stand-alone, MATLAB-like environment): the Perl Data Language, PDL.


3. Enter PDL: Strongly Typed N-Dimensional Arrays

The Perl Data Language (PDL, pdl.perl.org) extends Perl with ndarrays (N-dimensional arrays): contiguous, strongly typed memory buffers that look and feel like first-class Perl objects.

use PDL;

# A 1-D float ndarray — 4 bytes × 5 elements in one contiguous block
my $v = float( 1.0, 2.0, 3.0, 4.0, 5.0 );

# A 128-dimensional random database of 1000 vectors — all in cache-friendly memory
my $db = random( 128, 1000 );   # double by default

# A 128-dimensional query vector to score against the database
my $query = random( 128 );

# Dot product of every DB vector against the query — a single BLAS call
my $scores = $db x $query->transpose;

PDL primitive types

PDL exposes the full palette of C numeric types as first-class constructors:

PDL type Bytes C type Constructor
byte 1 uint8_t byte(...)
short 2 int16_t short(...)
ushort 2 uint16_t ushort(...)
long 4 int32_t long(...)
indx 4 or 8 ssize_t indx(...)
longlong 8 int64_t longlong(...)
float 4 float float(...)
double 8 double double(...)
cfloat 8 _Complex float cfloat(...)
cdouble 16 _Complex double cdouble(...)

Threading and SIMD

One of PDL’s most distinctive features is implicit threading: operations broadcast automatically over extra dimensions, eliminating explicit loops in user code and delegating inner loops to optimised C or Fortran kernels. Combined with set_autopthread_targ(N), PDL will automatically parallelise independent slices across N OS threads — without the user writing a single fork or Thread::Queue call.

Bad values

PDL has a built-in concept of bad values (PDL::Bad), directly analogous to R’s NA. An ndarray can be flagged as “bad-value aware”, and PDL operations propagate badness correctly through arithmetic, statistics, and I/O.


4. Type Comparison: Perl, PDL, and R Side-by-Side

The table below maps every commonly used R type to its closest Perl and PDL counterparts, highlighting where the three languages agree, differ, or complement each other.

R type Perl equivalent PDL equivalent Notes
double (length-1) $x = 3.14 (scalar) double(3.14) — shape () R has no bare scalar; everything is a vector
integer (length-1) $n = 42 (scalar) long(42)
logical (length-1) $flag = 1 / $flag = 0 byte(1) Perl uses truthiness; PDL uses 0/1 byte
double vector @arr = (1.1, 2.2, 3.3) double(1.1, 2.2, 3.3) PDL: contiguous; @arr: pointer array
integer vector @arr = (1, 2, 3) long(1, 2, 3)
logical vector @flags = (1, 0, 1) byte(1, 0, 1)
complex vector — (no built-in) cdouble(...) Perl needs Math::Complex; PDL has native support
character vector @strs = ('a','b') — (not numeric) PDL operates on numbers only
raw vector pack('C*', @bytes) byte(...)
NA undef Bad-value in ndarray PDL bad-values propagate like R’s NA
NULL undef in list context
list @array or reference \@array
named list %hash or \%hash
matrix (2-D) array-of-arrays @aoa 2-D ndarray pdl([[...],[...]]) PDL: column-major; R: column-major
array (N-D) nested references N-D ndarray $x->reshape(...)
data.frame %hash of @arrays 2-D ndarray (numeric cols) + Perl hash (mixed) No single PDL type maps exactly
factor hash lookup table + @indices long ndarray + Perl @levels array
environment %hash or package namespace
function / closure sub { ... } / closure PDL PP defines compiled kernels
S3 / S4 object blessed reference + method dispatch PDL object (blessed ndarray) PDL objects are first-class Perl objects

Key takeaways

However, the combination of Perl + PDL + R (with the latter used as a component, or instrumentalized via Perl) covers every row of the table above.


5. Road Map: What the Rest of This Series Covers

This series documents the construction of a vector database engine built in Perl5 + PDL from scratch. Vector databases underpin modern retrieval-augmented generation (RAG) pipelines, semantic search, and nearest-neighbour recommendation systems. Implementing one from first principles is an excellent vehicle for demonstrating PDL’s numerical capabilities alongside Perl’s systems-programming strengths.

The directory co-developed alongside these posts contains the following components, each of which will be the subject of one or more dedicated posts that will reference files in a dedicated repository.

Post 1 — Serialisation and I/O: the VectorIO module

File: VectorIO.pm

The engine stores vectors as packed binary blobs inside MessagePack payloads. This post covers:

Post 2 — Simulating a Vector Database

File: simulate_vectorDB.pl

Before we can search a database we need one. This post shows:

Post 3 — Benchmarking: the timing_DB Module

File: timing_DB.pm

Performance claims require measurement. This post introduces:

Post 4 — K-Means Clustering with PDL::Stats::Kmeans

File: kmeans.pl

K-means clustering is the backbone of the inverted-file index (IVF) approach to approximate nearest-neighbour search. This post covers:

Post 5 — Mini-Batch K-Means: Scaling to Large Datasets

File: compare_kmeans_centroids.pl

Full k-means requires all data in memory for every iteration. Mini-batch k-means trades a small amount of centroid accuracy for a large reduction in memory and compute. This post explores:

Post 6 — Inverted File Index (IVF) Search

File: compare_ivf_search.pl

With centroids in hand we can partition the database and perform sub-linear approximate nearest-neighbour search. This post covers:

Post 7 — Validating Against R: Numerical Correctness and Cross-Language Pipelines

Files: compare_kmeans_centroids.R, compare_kmeans_centroids_pure.R, plot_centroid_coordinates.R

The final post in the foundation series closes the loop between Perl and R:


Next up — Post 1: Serialisation and I/O with VectorIO.pm


Appendix: a one-minute primer on CPU caches

Modern CPUs have multiple levels of fast, on-chip memory called caches (L1, L2, L3) that sit between the processor cores and main RAM. L1 is the smallest (typically 32–64 KB per core) and fastest (1–4 clock cycles latency); L2 is larger (256 KB–1 MB) and slightly slower; L3 is shared across cores (4–64 MB) with higher latency still. Main RAM sits further away at 60–100 ns latency — roughly 200× slower than L1.

When a computation touches memory in a predictable, contiguous pattern the hardware prefetcher can load upcoming data into L1/L2 before it is needed, achieving near-peak throughput. Scattered pointer-chasing (such as traversing a Perl array of heap-allocated scalars) defeats prefetching, stalling the CPU while it waits for each cache miss to be resolved from RAM.