Package Stats and Its Data Pipeline – Mr. Joel Kemp


For the last year (on and off), I’ve been working on PackageStats.io: an analytics service for NPM package authors. As an author of 60+ NPM packages, I found it extremely frustrating to gather a holistic understanding of the reach/impact of my packages – resorting to clicking through the 60+ package pages on NPM’s site. I built Package Stats to solve that problem – giving authors rich insights into their package ecosystems.

Package Stats surfaces stats (about authors and their packages) that would be impossible to gather manually. Some examples: a ranking of your packages by downloads, so you get a bird’s-eye view of your most popular packages; notifications when new packages start depending on yours; insights into which packages are on hockey-stick growth; and other usage trends.

Very soon, Package Stats will also integrate a package’s GitHub data to give a holistic view of a module’s npm and GitHub activity.

If you’re interested in learning more, sign up at PackageStats.io.

Where Others Fail

One of the primary tenets of Package Stats was that authors should be able to quickly see their stats on the go. I wanted to be able to pull out my phone, hit the site, and immediately see new insights. This fundamentally differentiated Package Stats from competitors like npm-stats and npm-stat, which fetch and aggregate an author’s stats client-side. For heavy-hitting npm authors (authors of hundreds of packages) like substack or sindresorhus, you’d have to wait minutes for the entire process to finish. That’s neither a predictable nor a performant experience for desktop users, and it’s even worse for mobile web users.

As a result of the client-side architecture, these sites can’t (without further increasing the wait) offer much insight outside of historical download charts.

(Screenshots: stats from npm-stats and stats from npm-stat)

Due to the aforementioned limitations of a client-side approach, I knew that Package Stats had to fetch and aggregate stats on the server. What follows is an overview of the various iterations of the application and data pipeline architectures.

Architectural Iteration 1: my-top-npm-packages

The first version of Package Stats was powered by a nodejs-based tool I wrote called my-top-npm-packages. The initial version of that tool scraped npm’s website for the download stats of all of an author’s packages and aggregated them on the fly. This was horribly slow.

I then found out about the npm-registry library, which has been a godsend. I ditched the scraper and used npm-registry to talk to the actual npm registry, fetching an author’s packages and the download counts for each one. This was (obviously) a significant speedup. This first version of Package Stats would ask which author you’d like to see stats for, then fetch and aggregate those stats on the fly.

Unfortunately, this was just a client-side solution done on a server – subject to the same performance obstacles. Stats for heavy-hitting package authors still took forever; worse, since the work now happened on the server, the client’s HTTP request would time out :). Now that the embarrassing approach is out of the way…

Architectural Iteration 1.5: Optimization

One of the bottlenecks of the embarrassing approach was the number of HTTP requests needed to fetch the stats. Thankfully, I could request stats from NPM’s download-counts service in batches, but my batch size was naively hardcoded to 50 packages. One optimization was to be smarter about batch sizes and squeeze as many packages as possible into a single request.

So I upped my batch size to 1000 packages… and quickly blew through the maximum length of an HTTP GET request. After some searching, it seemed there wasn’t a standard maximum length for a GET request, but ~2000 characters seemed to be acceptable. So the problem became packing as many packages (whose names are of varying lengths) as possible into a single GET request while keeping the requested URL around 2000 characters. I built string-packer to solve this problem; for practicality, it uses a first-fit bin-packing algorithm.
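
The core idea looks roughly like the sketch below: a first-fit pass that appends each name to the first batch whose joined, comma-separated string still fits within a length budget. This is an illustrative sketch, not string-packer’s actual implementation; packNames, the example names, and the 1800-character budget are made up for the example.

```js
// First-fit packing sketch: append each name to the first batch whose joined,
// comma-separated string still fits within the length budget.
function packNames(names, maxLength) {
  const batches = [];

  for (const name of names) {
    const batch = batches.find(
      (b) => b.join(',').length + name.length + 1 <= maxLength
    );

    if (batch) {
      batch.push(name);
    } else {
      batches.push([name]);
    }
  }

  return batches;
}

// Example package names (placeholders). Leave headroom for the rest of the URL
// so the full request stays near ~2000 characters.
const packageNames = ['dependency-tree', 'precinct', 'module-definition'];
const batches = packNames(packageNames, 1800);
```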

For a user like substack (600+ packages), fetching data on the fly went from around 12 HTTP requests down to 5. This sped up the site a bit, but requests were still timing out.

On top of the requests timing out, crunching stats on the fly was CPU-intensive and prevented the nodejs-based web server from handling other requests. Package Stats got a little bit of press on Twitter and the app fell over immediately. As a bandaid, I clustered/parallelized the web server processes, but that blew through the Heroku dyno memory limits, causing the processes to get forcefully shut down.
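
For context, that clustering bandaid amounts to something like the sketch below, using Node’s built-in cluster module – one worker per CPU core, each running its own copy of the web server. The actual Heroku setup may have differed; the trivial HTTP handler here is just a stand-in.

```js
const cluster = require('cluster');
const http = require('http');
const os = require('os');

if (cluster.isMaster) {
  // Fork one worker per CPU core.
  os.cpus().forEach(() => cluster.fork());

  // Replace workers that die (e.g. when the platform kills them for exceeding memory).
  cluster.on('exit', (worker) => {
    console.log(`worker ${worker.process.pid} died; forking a replacement`);
    cluster.fork();
  });
} else {
  // Each worker runs its own copy of the web server.
  http
    .createServer((req, res) => res.end('ok'))
    .listen(process.env.PORT || 3000);
}
```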

I could have upped the specs on my Heroku dynos. I could have map-reduced/parallelized the fetching of the stats (per server process) by spawning workers. However, those felt like more bandaids that wouldn’t yield consistent and fast response times.

I also thought about caching the fetched stats, but that wouldn’t fix the initial timeout – it would only prevent timeouts on subsequent fetches of that data. Since I also had no idea who would visit the site, I couldn’t pre-cache stats without doing so for every package in the registry. Sigh.

Architectural Iteration 2: Bring the data home

One big benefit of crunching stats on the fly was that any user could come to Package Stats and eventually (barring a timeout) get their stats without my intervention (more on that in a bit). I also wouldn’t have to house any data – reducing up-front costs. But this flexibility was double-edged: because I never stored any data, I always had to fetch it, and so I would always be subject to the HTTP-request and on-the-fly-crunching bottlenecks.

I had to bring the data home and avoid real-time fetching/crunching. I thought about cloning the NPM registry and then spawning a daily batch-processing job that would fetch/aggregate/store data from NPM’s download-counts API. That seemed like overkill (costly and complex) for an app that was still unproven – it would have meant handling the worst-case situation from the very beginning.

The middle ground I reached let me control costs, load, and userbase growth: users sign up to see their stats, and I grant access when I’m sure the pipeline can handle it.

One major benefit for Package Stats is that NPM only crunches download data once a day (at midnight UTC), which means my batch-processing pipeline doesn’t have to account for real-time download data. I could ultimately pre-crunch everything (all stats) and cache the results for a day (a classic speed vs. space trade-off). The web servers would be dumb: only concerned with rendering the data from the cache/db. This would scale really well and yield fast, predictable response times.
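
In other words, the web tier reduces to something like the sketch below: a route that reads already-crunched rows and renders them, never fetching or aggregating on the request path. The Express/mysql setup, table, and column names here are assumptions for illustration, not the actual schema.

```js
const express = require('express');
const mysql = require('mysql');

const app = express();
const db = mysql.createPool({ host: 'localhost', user: 'stats', database: 'packagestats' });

app.get('/authors/:name/stats', (req, res) => {
  // Hypothetical table populated by the nightly pipeline; the web tier only reads.
  db.query(
    'SELECT package, downloads, pct_change_day, pct_change_week FROM daily_crunched_stats WHERE author = ?',
    [req.params.name],
    (err, rows) => {
      if (err) return res.sendStatus(500);
      res.json(rows); // already crunched; nothing computed per request
    }
  );
});

app.listen(3000);
```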

Pipeline Iteration 1

The batch pipeline (through its various iterations) has roughly followed this algorithm (sketched in code after the list):

  • Fetch all packages for all registered authors
  • Fetch all dependents for all of those packages
  • Fetch a year’s worth of daily download stats for all of the packages and dependents
  • Store the daily download stats and an aggregated/crunched daily version of the stats (like percentage change over the past day/week/month)
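
A high-level sketch of that nightly job might look like the following – the function names are hypothetical stand-ins, not the pipeline’s actual code:

```js
// Nightly batch job, at a glance (hypothetical helper functions).
async function runNightlyPipeline(registeredAuthors) {
  // 1. All packages for all registered authors
  const packages = await fetchPackagesForAuthors(registeredAuthors);

  // 2. All dependents of those packages
  const dependents = await fetchDependents(packages);

  // 3. A year of daily download stats for packages + dependents
  const allPackages = packages.concat(dependents);
  const dailyDownloads = await fetchDailyDownloads(allPackages);

  // 4. Store the raw daily stats plus crunched aggregates
  //    (e.g. percentage change over the past day/week/month)
  await storeDailyStats(dailyDownloads);
  await storeCrunchedStats(crunchStats(dailyDownloads));
}
```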

The first version fetched a year’s worth of data for all packages in the system (even existing ones). This made sense for new packages (where I needed to backfill data), but existing packages (ones whose yearly data was already in the system) only needed the last day’s worth of data – an optimization that became necessary later on.

To fetch the daily download data, I used NPM’s download-counts API (which they’ve generously made open-source and free to use).
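
For reference, a batched range request against that API looks roughly like this. The example packages and date range are placeholders, and the note about the multi-package response being keyed by package name reflects my understanding of the API rather than anything stated in this post.

```js
const https = require('https');

// One string-packed batch of package names (placeholders for illustration).
const packed = ['dependency-tree', 'precinct', 'module-definition'];
const url = `https://api.npmjs.org/downloads/range/2015-01-01:2015-01-31/${packed.join(',')}`;

https.get(url, (res) => {
  let body = '';
  res.on('data', (chunk) => (body += chunk));
  res.on('end', () => {
    const stats = JSON.parse(body); // keyed by package name for multi-package requests
    console.log(Object.keys(stats));
  });
});
```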

It turns out that when you only have a small number of packages, fetching a year’s worth of data for a packed batch of packages works. I soon found out, however, that there’s a breaking point where npm’s API will return an incomplete/truncated response if the payload is too big (I’m not sure what the max size is, though).

On top of that, you’re actually only supposed to be able to request up to 31 days’ worth of data, but due to a bug, you can bypass that with a custom range.

Both issues combined led me to split the processing of existing packages from new packages.

Pipeline Iteration 2

As I said, for existing packages, I only needed the last day’s worth of data. With the string-packing solution discussed previously (to maximize the number of packages in a single GET request), I could efficiently and safely avoid incomplete responses and stay within the 31-day data limit.

For new packages, I still needed a year’s worth of data. The natural solution there was to request a month of data at a time per package. The algorithm, sketched in code after the list, looks like the following:

  1. For all new packages, generate string-packed batches (a list of lists of package names, each filling a single GET request)
  2. For each batch, generate 12 requests – one for each month of the past year
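
Putting those two steps together (and reusing the packNames sketch from earlier), the request generation might look like this. generateBackfillUrls and toYMD are hypothetical helpers, and the month-boundary handling is simplified for the example:

```js
function generateBackfillUrls(newPackageNames) {
  const batches = packNames(newPackageNames, 1800);
  const urls = [];

  for (const batch of batches) {
    // One request per month of the past year keeps each response within the
    // 31-day limit and safely under the truncation threshold.
    for (let monthsAgo = 0; monthsAgo < 12; monthsAgo++) {
      const end = new Date();
      end.setMonth(end.getMonth() - monthsAgo);
      const start = new Date(end);
      start.setMonth(start.getMonth() - 1);

      const range = `${toYMD(start)}:${toYMD(end)}`;
      urls.push(`https://api.npmjs.org/downloads/range/${range}/${batch.join(',')}`);
    }
  }

  return urls;
}

function toYMD(date) {
  return date.toISOString().slice(0, 10); // YYYY-MM-DD
}
```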

Once I had all of the data for the past year, packing it into a MySQL database became the next constraint. I didn’t want to overwhelm the database with an insert per day of data per package. At the other extreme, batching all daily download data (across all packages) into a single, massive INSERT query would eventually blow through the MySQL packet size limit. The middle ground that works well so far (for 6k packages) is one INSERT per monthly response for each batch generated in step #1 above.
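
That per-monthly-batch insert can lean on the mysql module’s bulk-insert syntax, roughly as sketched below. The pool handle and the daily_downloads table/columns are assumptions for illustration, not the real schema.

```js
// monthlyResponse: parsed JSON for one string-packed batch and one month,
// shaped like { [packageName]: { downloads: [{ day, downloads }, ...] } }.
function insertMonthlyBatch(db, monthlyResponse, callback) {
  const rows = [];

  for (const [pkg, data] of Object.entries(monthlyResponse)) {
    for (const { day, downloads } of data.downloads) {
      rows.push([pkg, day, downloads]);
    }
  }

  // The mysql module expands the nested array into (?, ?, ?), (?, ?, ?), ...
  // so the whole month lands in one bulk INSERT.
  db.query(
    'INSERT INTO daily_downloads (package, day, downloads) VALUES ?',
    [rows],
    callback
  );
}
```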

All in all, the batch processing that powers Package Stats comes down to a minimization problem: avoid overwhelming NPM’s registry and download-counts API, my MySQL database, and the web server’s CPU and memory – all while the number of users and packages continues to grow.

I look forward to reporting back soon with the next couple of iterations. Wish me luck, and check out Package Stats if you’re an NPM user. I’m always happy to chat about it on Twitter: @mrjoelkemp.