May 13, 2026: Data parsed!
May 14, 2026: Site launch!
May 22, 2026: We're in the papers!
June 1, 2026: Data cleaned!
After I managed to salvage the unstructured data from the Wayback Machine (see About for that tale), I was left with a pile of HTML that was distinctly un-data-like. The headers were in place, so I knew there must be a way to add some structure back to the data -- but that feels like a Python problem, and alas, my coding knowledge is strictly CSS and HTML. I reached out on bsky asking if any coders could help me out, and Jayme Howard came to the rescue! He parsed the HTML into JSON, removing all the copyrighted material and leaving just the crowd-sourced data. You can see the full dataset.
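For the curious, here is a rough sketch of the general idea of using headers to turn HTML back into structured data. This is not Jayme's actual script. The `<h3>` tag, the class name `HeaderValueParser`, the helper `html_to_record`, and the sample values are all assumptions made up for illustration; the real pages may have been laid out quite differently.

```python
import json
from html.parser import HTMLParser


class HeaderValueParser(HTMLParser):
    """Collect the text under each header into a {header: text} dict.

    Assumption: field names sit in <h3> tags, and any text that follows
    a header (until the next header) belongs to that field.
    """

    def __init__(self):
        super().__init__()
        self.record = {}
        self.current_header = None  # name of the field being filled
        self.in_header = False      # True while inside an <h3>

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_header = True
            self.current_header = ""

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_header = False
            self.current_header = self.current_header.strip()
            self.record.setdefault(self.current_header, "")

    def handle_data(self, data):
        if self.in_header:
            # Still reading the header text itself.
            self.current_header += data
        elif self.current_header is not None:
            # Body text: append it to the most recent header's value.
            text = data.strip()
            if text:
                existing = self.record[self.current_header]
                self.record[self.current_header] = (existing + " " + text).strip()


def html_to_record(html):
    """Parse one entry's HTML into a plain dict (hypothetical helper)."""
    parser = HeaderValueParser()
    parser.feed(html)
    parser.close()
    return parser.record


if __name__ == "__main__":
    sample = (
        "<h3>Title</h3><p>20-sided-stories</p>"
        "<h3>Cast &amp; Player Characters</h3><p>Example cast list</p>"
    )
    print(json.dumps(html_to_record(sample), indent=2))
```

Text that appears before the first header is ignored in this sketch, which is one simple way to leave out anything that does not belong to a named field.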
The day I actually launched this site! And archived it with the Wayback Machine right away.
From Stefan:
Specific changes:
- Title
  - Separated out the computer-formatted title from the human-readable title (i.e. "20-sided-stories" vs "20 Sided Stories"). Kept the computer-formatted version as the Title field, added new Title (Clean) field for the human-readable version
  - Actual Title (Clean) data was taken from the Type & Channels field as it was more consistent there
- Episode Frequency & Length
  - Separated into two fields called Episode Frequency and Episode Length
- Type & Channels
  - Separated into multiple fields - Media Type, Distribution Channels, Title (Clean), and Release Year
  - This also had a duplicate of the Description that I just dropped
- Rules & Sources
  - Separated into new fields - Rules, Sources, and Tags
  - Not all series had data for all three fields so I did some brute force overriding to get the data into the right place
- Format, Setting & Vulgarity
  - Separated into new fields - Format, Setting, and Vulgarity
  - Similar to Rules & Sources, not all series had data for all three fields so I did some extra logic to get it working. Some entries listed one of the fields as "Not Applicable" - I ended up dropping these because it was not possible to tell whether it applied to the Format field or the Setting field and really isn't all that informational anyway
- Cast & Player Characters
  - Separated into two fields called Cast and Player Characters
- Audio Quality & Equipment
  - Separated into two fields called Audio Quality & Equipment and Audio Tags
Anything not listed here should be preserved exactly as it was.
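To give a feel for what splitting one combined field into several looks like, here is a minimal sketch. It is not Stefan's actual code: the function name `split_combined_field`, the `;` separator, and the sample values are assumptions for illustration only, and the real data almost certainly needed messier, per-field handling.

```python
def split_combined_field(record, combined_key, new_keys, sep=";"):
    """Return a copy of `record` with `combined_key` split into `new_keys`.

    Assumption: the combined value uses `sep` between its parts.
    Parts are assigned left to right; any field without a part is set
    to None instead of being guessed.
    """
    # Copy everything except the combined field, leaving the input untouched.
    cleaned = {k: v for k, v in record.items() if k != combined_key}

    raw = record.get(combined_key, "")
    # Split into at most len(new_keys) parts so extra separators stay
    # inside the last field rather than being silently dropped.
    parts = [p.strip() for p in raw.split(sep, len(new_keys) - 1)] if raw else []

    for i, key in enumerate(new_keys):
        cleaned[key] = parts[i] if i < len(parts) and parts[i] else None
    return cleaned


if __name__ == "__main__":
    # Sample values are invented for the example.
    entry = {
        "Title": "20-sided-stories",
        "Episode Frequency & Length": "Weekly; 60 minutes",
    }
    print(split_combined_field(
        entry,
        "Episode Frequency & Length",
        ["Episode Frequency", "Episode Length"],
    ))
```

Note that filling fields left to right does not solve the harder problem described above, where an entry has only one or two of three values and it is unclear which field each belongs to. That part needs case-by-case rules.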