Speaker Notes: - Kia ora koutou. - My name is Josh McArthur. - Today I'm going to talk about a project called Plain Language Law, where we're using AI to tackle a really interesting problem: making law understandable for everyone.
Speaker Notes: - Start with a relatable hook. Ask the audience to raise their hands. - It's dense, it's complex, it's intimidating. - It's written by lawyers, for lawyers. But it affects all of us. - This complexity isn't just an inconvenience; it's an access to justice issue.
Speaker Notes: - This is what we built. - We take that complex source material and, using AI, we translate it into something people can actually understand. - It's not about "dumbing it down," it's about making it clear.
Speaker Notes: - We didn't just pick this problem at random. It has the perfect set of characteristics for an AI project. - It's a large-scale data problem, which is where AI shines. - The structure gives us handholds to work with the data programmatically. - And the complexity means that simple rule-based systems would fail. It requires the nuance of LLMs.
Speaker Notes: - So, how did we actually build it? - Here's a high-level view of our pipeline. - It's a multi-stage process that takes raw XML from the source and ends with a fully built, searchable, and accessible static website. - I'm going to walk through each of these stages.
Speaker Notes: - The first stage is the "boring" but critical foundation. - We use standard, reliable tools like XSLT to transform the legislative XML into clean HTML. - Crucially, we create a YAML frontmatter block in each file. This acts as a structured data container that our AI processes will populate later.
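To make the "structured data container" idea concrete, here's a minimal sketch of what that frontmatter could hold, expressed as a TypeScript interface. The field names are assumptions for illustration, not the project's actual schema.

```typescript
// Illustrative only: the XSLT stage creates this block from the source XML;
// the later AI stages fill in the optional fields.
interface LegislationFrontmatter {
  title: string;          // taken from the legislative XML
  sourceUrl: string;      // link back to the official text
  plainLanguage?: string; // populated later by the plain language translation
  keyIdeas?: string[];    // populated later by metadata enrichment
  taxonomy?: string[];    // populated later by classification
}
```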
Speaker Notes: - Once we have clean, structured files, we bring in the AI. - We have several distinct jobs that the LLMs perform. - The core one is the plain language translation. - But we also use it for metadata enrichment - extracting key ideas, and classifying the document against our taxonomy. - All of this AI-generated data is written back into the YAML frontmatter.
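A rough sketch of one enrichment pass over a single file, assuming a gray-matter-style frontmatter parser and a generic LLM client passed in by the caller. The prompts and field names are illustrative stand-ins, not the project's actual code.

```typescript
import matter from "gray-matter";
import { readFile, writeFile } from "node:fs/promises";

// `callModel` is whatever LLM client is in use (hosted Claude or local Llama).
async function enrich(
  path: string,
  callModel: (prompt: string) => Promise<string>
): Promise<void> {
  const { data, content } = matter(await readFile(path, "utf8"));

  // Core job: the plain language translation of the section body.
  data.plainLanguage = await callModel(
    `Rewrite the following legislation in plain language:\n\n${content}`
  );

  // Metadata enrichment: key ideas, extracted from the same body.
  const ideas = await callModel(
    `List the key ideas in this section, one per line:\n\n${content}`
  );
  data.keyIdeas = ideas.split("\n").filter(Boolean);

  // Everything the AI produced goes back into the YAML frontmatter.
  await writeFile(path, matter.stringify(content, data));
}
```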
Speaker Notes: - The final stage is the assembly and deployment. - We use Astro as our static site generator. I'll talk more about why in a moment. - We pre-build a search index so the site is fast and searchable on the client side. - The whole thing is automated with GitHub Actions, which monitors the source for changes and rebuilds the site automatically.
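The pre-built search index can be as simple as a JSON file generated at build time that the client queries directly. The sketch below hand-rolls a tiny term index; a library such as MiniSearch or Pagefind could fill the same role, and the field names are illustrative.

```typescript
interface IndexedDoc {
  slug: string;
  title: string;
  keyIdeas: string[];
}

// Map each token to the slugs of documents that contain it.
function buildIndex(docs: IndexedDoc[]) {
  const terms: Record<string, string[]> = {};
  for (const doc of docs) {
    const tokens = `${doc.title} ${doc.keyIdeas.join(" ")}`
      .toLowerCase()
      .split(/\W+/)
      .filter(Boolean);
    for (const token of new Set(tokens)) {
      (terms[token] ??= []).push(doc.slug);
    }
  }
  return { docs, terms };
}

// At build time the index is serialised next to the static pages, e.g.:
//   await writeFile("public/search-index.json", JSON.stringify(buildIndex(docs)));
```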
- **RSS Monitoring:** GitHub Actions monitors the Legislation site for changes. - **Pull request workflow:** Updated legislation automatically creates a pull request with the changes. - **mdev environments:** We review new legislation in situ and decide whether or not to generate plain language for it. - **Labeling workflow:** If we decide to go ahead, we apply a "generate-plain-language" label, which updates the PR and the mdev environment. Speaker Notes: - Automation is an important part of making the project sustainable long-term. - We've automated the entire update process, from monitoring the source for changes through to opening a pull request with those changes. - We set up mdev environments early on to make it easy for us to review new legislation. - We use labels to trigger different parts of the pipeline, so we can move individual PRs through the process.
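As a rough illustration of the monitoring step, something like this could run inside a scheduled GitHub Actions job. The feed URL, the state file, and the naive XML matching are all assumptions for the sketch, not the project's real configuration.

```typescript
import { readFile, writeFile } from "node:fs/promises";

const FEED_URL = "https://example.org/legislation/updates.rss"; // hypothetical

// Compare the feed against what we've already seen; new links feed the PR step.
async function findNewItems(): Promise<string[]> {
  const xml = await (await fetch(FEED_URL)).text();
  const links = [...xml.matchAll(/<link>(.*?)<\/link>/g)].map((m) => m[1]);

  const seenFile = await readFile("seen-items.txt", "utf8").catch(() => "");
  const seen = new Set(seenFile.split("\n").filter(Boolean));
  const fresh = links.filter((link) => !seen.has(link));

  // Record what we've processed; a later step opens a pull request for `fresh`.
  await writeFile("seen-items.txt", [...seen, ...fresh].join("\n"));
  return fresh;
}
```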
Speaker Notes: - Of course, it wasn't all smooth sailing. - This is often the most interesting part of any project: what went wrong and what we learned. - I want to cover three main challenges we faced.
- `Jekyll` build times were difficult to work with at scale (>30 minutes). - We tried `11ty`, `Next.js`, and `Bridgetown`. Many struggled with the sheer number of files and memory usage. - `Astro` was flexible enough to handle it, but required custom file loading logic. Speaker Notes: - As we added more and more documents, our Jekyll build times exploded. - We went on a tour of modern SSGs. We were surprised at how many struggled with our dataset, often crashing with out-of-memory errors. - Astro was the winner. Its island architecture was a good fit, but the key was its flexibility: we could write our own data loading logic to handle the scale, which wasn't as easy in other frameworks.
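The gist of that custom loading logic, sketched in plain Node rather than Astro's actual loader API: list the content files up front, then read and parse them one at a time so the whole corpus never sits in memory together.

```typescript
import { readdir, readFile } from "node:fs/promises";
import { join } from "node:path";
import matter from "gray-matter";

// Async generator: listing file names is cheap, but bodies are parsed one at a
// time and can be discarded as soon as the consumer has used them.
// (Recursive readdir needs Node 20.1+.)
async function* loadDocuments(dir: string) {
  const files = (await readdir(dir, { recursive: true })).filter((file) =>
    file.endsWith(".html")
  );
  for (const relative of files) {
    const { data } = matter(await readFile(join(dir, relative), "utf8"));
    yield { relative, frontmatter: data };
  }
}
```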
Speaker Notes: - This is the classic engineering trade-off. - We started with Claude via an API, and the quality was fantastic out of the box. But the costs scaled linearly with our content volume. - We moved to running Llama on our own infrastructure. It's drastically cheaper, but it's not a free lunch: we've had to invest a lot more time in prompt engineering, and we're now looking at fine-tuning to close the quality gap.
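One way to picture the trade-off is a small abstraction that lets the hosted and self-hosted models sit behind the same interface, so cost and quality can be compared on the same content. The local endpoint below assumes an OpenAI-compatible server (for example llama.cpp or vLLM); that setup detail is an assumption, not something confirmed in the talk.

```typescript
interface TextModel {
  plainLanguage(text: string): Promise<string>;
}

// Self-hosted Llama: much cheaper per document, but needs more prompt work.
const llama: TextModel = {
  async plainLanguage(text) {
    const res = await fetch("http://localhost:8000/v1/chat/completions", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model: "llama",
        messages: [
          { role: "user", content: `Rewrite in plain language:\n\n${text}` },
        ],
      }),
    });
    const body = await res.json();
    return body.choices[0].message.content;
  },
};

// A hosted Claude client would implement the same interface, so switching is a
// configuration change rather than a rewrite.
```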
Speaker Notes: - This is the hardest problem we face, and I don't have a perfect answer. - How do we know the AI isn't misinterpreting a critical legal detail? - We can't have lawyers review thousands of pages a day. - This has completely blocked some features, like providing translations into languages our team doesn't speak. - We're exploring solutions like using a second, different LLM as a "reviewer," but it's a major challenge.
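One shape the "second LLM as reviewer" idea could take, sketched with the reviewer passed in as a generic client. The PASS/FAIL protocol here is an illustration, not something we've settled on.

```typescript
// Ask a second, different model to check the rewrite against the source and
// route anything it flags to a human reviewer.
async function reviewTranslation(
  source: string,
  plain: string,
  reviewer: (prompt: string) => Promise<string>
): Promise<boolean> {
  const verdict = await reviewer(
    [
      "You are checking a plain-language rewrite of legislation.",
      "Answer PASS if it preserves the legal meaning, otherwise FAIL with a reason.",
      `SOURCE:\n${source}`,
      `REWRITE:\n${plain}`,
    ].join("\n\n")
  );
  return verdict.trim().startsWith("PASS"); // FAIL sends the section to a human
}
```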
Speaker Notes: - So that's where we are today. But what's next? - The next major phase of the project is about moving from simple text generation to deeper semantic understanding, and the key to that is embeddings.
Speaker Notes: - For those who aren't familiar, embeddings are a way of representing text as a series of numbers—a vector—in a high-dimensional space. - Documents with similar meanings will have vectors that are "close" to each other. - This unlocks a whole new set of capabilities beyond what we're doing now. I'll give you two examples.
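Concretely, "close to each other" usually means cosine similarity between the vectors; a small helper makes the idea explicit.

```typescript
// Cosine similarity: values near 1 mean the vectors point the same way (very
// similar meaning), values near 0 mean unrelated. Assumes equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```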
Speaker Notes: - Our Llama model sometimes struggles to produce summaries as good as Claude's. - With embeddings, we can implement a RAG pattern. - When we want to summarize a new document, we first do a vector search to find a few existing documents that are conceptually similar and that already have high-quality summaries we generated with Claude earlier in the project. - We then put those examples directly into the prompt for Llama. This "few-shot" prompting dramatically improves the quality of the output.
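A sketch of that retrieval step, under a couple of assumptions: an `embed` client that returns a vector for a piece of text, and a corpus that stores each document's text, its Claude-era summary, and its embedding together. Names are illustrative.

```typescript
// Same cosine similarity as the earlier sketch, in compact form.
const dot = (a: number[], b: number[]) => a.reduce((s, v, i) => s + v * b[i], 0);
const cosine = (a: number[], b: number[]) =>
  dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));

interface StoredDoc {
  text: string;
  summary: string; // high-quality summary generated with Claude
  embedding: number[];
}

// Find the k most similar documents and build a few-shot prompt for Llama.
async function buildFewShotPrompt(
  newText: string,
  corpus: StoredDoc[],
  embed: (text: string) => Promise<number[]>,
  k = 3
): Promise<string> {
  const queryVec = await embed(newText);
  const examples = corpus
    .map((doc) => ({ doc, score: cosine(queryVec, doc.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);

  return [
    "Summarize legislation in plain language. Here are examples:",
    ...examples.map(
      ({ doc }) => `SOURCE:\n${doc.text}\n\nPLAIN LANGUAGE:\n${doc.summary}`
    ),
    `SOURCE:\n${newText}\n\nPLAIN LANGUAGE:`,
  ].join("\n\n");
}
```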
Speaker Notes: - The other big area is content discovery. - Embeddings will let us build much smarter features. - We can find clusters of related ideas within the law and automatically generate FAQ sections. - We can build "related content" features that are far more powerful than what you could do with simple tags or keywords. - It's really about building a semantic web of legal knowledge.
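The "related content" feature is the same nearest-neighbour machinery applied per page: at build time, attach the slugs of the top-N most similar documents to each page. The shapes and names below are illustrative.

```typescript
const dot = (a: number[], b: number[]) => a.reduce((s, v, i) => s + v * b[i], 0);
const cosine = (a: number[], b: number[]) =>
  dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));

interface Page {
  slug: string;
  embedding: number[];
  related?: string[];
}

// Brute-force nearest neighbours, run once per build.
function attachRelated(pages: Page[], n = 5): void {
  for (const page of pages) {
    page.related = pages
      .filter((other) => other.slug !== page.slug)
      .map((other) => ({
        slug: other.slug,
        score: cosine(page.embedding, other.embedding),
      }))
      .sort((a, b) => b.score - a.score)
      .slice(0, n)
      .map((r) => r.slug);
  }
}
```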
Speaker Notes: - So, if you take away three things from this talk, I hope it's these. - First, don't neglect the data engineering. It's not glamorous, but it's essential. - Second, be ready to make hard choices about cost vs. quality. There's no magic bullet. - And finally, if you're working with large text corpora, start thinking about embeddings. It's the key to unlocking the next level of features.
Speaker Notes: - That's all I have. Thank you for your time. - I'd be happy to take any questions.