Technical Deep Dive
The developer's approach is a masterclass in adaptive data extraction. Each of the 241 planning portals is a distinct technical artifact from a different era of government IT procurement: some run legacy ASP.NET Web Forms, others modern React SPAs, and a few custom-built PHP backends. The core challenge was not scraping per se but schema mapping: each portal uses its own field names, date formats, and decision categories. One portal might label 'Application Type' as 'app_type', another as 'planning_type', and a third as 'category'. The developer employed a multi-stage pipeline:
1. Discovery Phase: Automated scanning to identify portal endpoints, authentication requirements, and rate-limiting mechanisms.
2. Adaptive Parsing: Using a combination of regex patterns and lightweight NLP models to extract structured data from HTML tables, JSON APIs, and even PDF documents embedded in pages.
3. Anti-Bot Evasion: Rotating user-agent strings, using residential proxy networks, and implementing randomized delays to avoid triggering AWS WAF or Cloudflare protections. Some portals required session-cookie management and CAPTCHA solving, the latter handled via a third-party service.
4. Schema Normalization: A custom Python library mapped each portal's schema to a unified data model, handling date formats (DD/MM/YYYY vs YYYY-MM-DD), address parsing, and decision codes.
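Step 2's adaptive parsing can be sketched with nothing but the standard library. The HTML shape and the application-reference regex below are illustrative assumptions, not the developer's actual code:

```python
import re
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect cell text from HTML tables, row by row."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.rows, self.current = [], []

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False
        elif tag == "tr" and self.current:
            self.rows.append(self.current)
            self.current = []

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.current.append(data.strip())

html = """<table>
  <tr><th>Reference</th><th>Decision Date</th></tr>
  <tr><td>24/00123/FUL</td><td>12/03/2024</td></tr>
</table>"""

parser = TableExtractor()
parser.feed(html)
header, *records = parser.rows

# Regex fallback for portals that emit malformed markup
# (reference format here is a hypothetical example).
refs = re.findall(r"\d{2}/\d{5}/[A-Z]{3}", html)
```

In practice a pipeline like this needs a fallback chain, since the same logical record may arrive as an HTML table, a JSON payload, or text extracted from a PDF.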
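Step 4's normalization layer can be sketched as a field-alias map plus date coercion. The aliases 'app_type', 'planning_type', and 'category' come from the article's example; the unified field names and everything else are assumptions, not the developer's library:

```python
from datetime import datetime

# Aliases observed across portals for the same logical field
# (unified names on the right are hypothetical).
FIELD_ALIASES = {
    "app_type": "application_type",
    "planning_type": "application_type",
    "category": "application_type",
    "decision_date": "decided_on",
    "date_of_decision": "decided_on",
}

DATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d")  # DD/MM/YYYY vs YYYY-MM-DD

def normalize_date(raw: str) -> str:
    """Coerce any known portal date format to ISO 8601."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")

def normalize_record(raw: dict) -> dict:
    """Map one portal's record onto the unified data model."""
    out = {}
    for key, value in raw.items():
        unified = FIELD_ALIASES.get(key, key)
        if unified == "decided_on":
            value = normalize_date(value)
        out[unified] = value
    return out

record = normalize_record({"planning_type": "Full", "decision_date": "12/03/2024"})
# record == {"application_type": "Full", "decided_on": "2024-03-12"}
```

Multiplied across roughly 22 mapped fields and 241 portals, an alias table like this is where most of the project's manual effort concentrates.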
The GitHub repository (repo name: `uk-planning-scraper`, currently 1,200+ stars) includes detailed documentation of the scraping methodology and the resulting SQLite database. The developer noted that approximately 15% of portals required manual intervention due to non-standard interfaces or broken search functionality.
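One appeal of the SQLite artifact is that it can be queried with the standard library alone. The table and column names below are hypothetical (the repository documents the actual schema); an in-memory database stands in for the published file:

```python
import sqlite3

# Stand-in for the published database; the real file would be opened with
# sqlite3.connect("planning.sqlite"). Schema names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE applications (
        council TEXT,
        reference TEXT,
        application_type TEXT,
        decided_on TEXT,   -- ISO 8601 after normalization
        decision TEXT
    )
""")
conn.executemany(
    "INSERT INTO applications VALUES (?, ?, ?, ?, ?)",
    [
        ("Camden", "24/00123/FUL", "Full", "2024-03-12", "Granted"),
        ("Camden", "24/00456/HSE", "Householder", "2024-02-01", "Refused"),
    ],
)

# Example analytical query: approval rate per council.
rows = conn.execute("""
    SELECT council,
           AVG(decision = 'Granted') AS approval_rate
    FROM applications
    GROUP BY council
""").fetchall()
```

This is the payoff of normalization: a single GROUP BY across 2.6 million records replaces 241 separate portal searches.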
Performance Data Table:
| Metric | Value |
|---|---|
| Total Portals Scraped | 241 |
| Total Records Collected | 2,600,000 |
| Average Records per Portal | 10,788 |
| Total Time Elapsed | 4 months |
| Estimated Requests Made | 15 million+ |
| Portals Requiring CAPTCHA | 38 (15.8%) |
| Portals with Broken Search | 12 (5.0%) |
| Average Schema Fields Mapped | 22 per portal |
Data Takeaway: The 15% manual intervention rate and 5% broken portal rate are telling: even after two decades of digital transformation, a significant minority of government systems are fundamentally non-functional for automated access, undermining the very concept of 'public data'.
Key Players & Case Studies
This project is not happening in a vacuum. Several organizations and tools are relevant:
- OpenDataSoft: A French company that provides a unified data platform for cities. Their platform is used by several UK councils, but adoption is uneven. The developer's work effectively creates a competitor to such platforms, albeit an unofficial one.
- Scrapy & Playwright: The developer used Scrapy for initial scraping and Playwright for JavaScript-heavy portals. Playwright's ability to handle modern SPAs was critical for about 30% of portals.
- UK Planning Inspectorate: The national body that oversees planning appeals. They maintain a separate database, but it does not include local-level decisions. This project fills that gap.
- LocalGov Digital: A network of UK council digital officers. Their efforts to standardize planning data have been slow, with only 40% of councils using a common schema as of 2023.
Comparison Table: Data Access Solutions
| Solution | Coverage | Update Frequency | Cost | Data Quality |
|---|---|---|---|---|
| UK Planning Portal (Official) | 241 councils (partial) | Weekly | Free (limited) | Inconsistent |
| This Developer's Dataset | 241 councils (full) | One-time (2024) | Free | High (normalized) |
| OpenDataSoft (Commercial) | 50 councils | Daily | Paid | High |
| LocalGov Digital Schema | 96 councils | Varies | Free | Medium |
Data Takeaway: The developer's dataset, despite being a one-time snapshot, offers broader coverage and higher normalization than official or commercial alternatives, highlighting the gap between what governments promise and what they deliver.
Industry Impact & Market Dynamics
The implications extend far beyond planning data. This project is a proof-of-concept for a new category of service: Data Liberation as a Service (DLaaS). As AI models require ever-larger and more diverse training datasets, the demand for structured public data is exploding. However, government IT systems are not designed for machine consumption. This creates a market opportunity for companies that can scrape, normalize, and sell access to public data.
Market Data Table:
| Segment | 2023 Market Size | 2028 Projected Size | CAGR |
|---|---|---|---|
| Public Data Scraping Services | $1.2B | $3.8B | 25.8% |
| Government IT Modernization | $45B | $78B | 11.6% |
| AI Training Data Market | $2.5B | $8.7B | 28.3% |
Data Takeaway: The public data scraping market is growing faster than government IT modernization, suggesting that third-party solutions will increasingly fill the gap left by slow-moving bureaucracies.
Several startups are already moving in this direction. Bright Data offers residential proxy networks that enable large-scale scraping. Apify provides a platform for building and running scrapers. Common Crawl maintains a free, open repository of web crawl data, though it lacks the specificity needed for domain-specific datasets like planning records. The developer's work could serve as a template for similar projects in other domains—property records, court filings, environmental permits—each of which suffers from similar fragmentation.
Risks, Limitations & Open Questions
While the project is impressive, it raises several concerns:
1. Legal Gray Areas: The UK's Computer Misuse Act 1990 and the UK GDPR create legal risk for scraping, even when the data is ostensibly public. Some councils may argue that scraping violates their terms of service, even if the data itself is not copyrightable. No council has contacted the developer so far, but the threat of legal action remains.
2. Data Freshness: The dataset is a snapshot from early 2024. Planning decisions are made daily, so the data becomes stale quickly. Maintaining a live feed would require continuous scraping, which increases costs and legal exposure.
3. Bias in Coverage: The developer focused on English councils. Scotland, Wales, and Northern Ireland have separate planning systems, and their portals may be even more fragmented. The dataset is therefore incomplete for UK-wide analysis.
4. Quality Assurance: Despite normalization efforts, some records may contain errors due to OCR mistakes from PDF parsing or misaligned schema mapping. The developer has not released a formal accuracy audit.
5. Ethical Concerns: Scraping can strain government servers, especially at smaller councils with limited IT resources. The developer used polite scraping techniques (delays, off-peak hours), but there is no guarantee that future projects will be as careful.
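The 'polite scraping' discipline from point 5 (and the randomized delays from the anti-bot stage) can be sketched as follows; the delay bounds, off-peak window, and user-agent strings are illustrative assumptions, not the developer's actual settings:

```python
import random
import time

# Illustrative politeness parameters.
MIN_DELAY, MAX_DELAY = 2.0, 8.0              # seconds between requests
OFF_PEAK = (range(22, 24), range(0, 6))      # 22:00-06:00 local time

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

def is_off_peak(hour: int) -> bool:
    """True if the given hour falls in the off-peak window."""
    return any(hour in window for window in OFF_PEAK)

def polite_pause() -> float:
    """Sleep a randomized interval so small council servers aren't hammered."""
    delay = random.uniform(MIN_DELAY, MAX_DELAY)
    time.sleep(delay)
    return delay

def next_headers() -> dict:
    """Rotate user-agent strings between requests."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

Note the tension: the same delay-and-rotation logic serves both courtesy (point 5) and evasion (pipeline step 3), which is exactly why the ethics here are contested.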
AINews Verdict & Predictions
This project is a watershed moment for government transparency. It proves that one motivated individual can outperform entire government IT departments in making public data accessible. The implications are profound:
Prediction 1: The Rise of 'Shadow APIs'
Within two years, we will see a proliferation of third-party datasets that scrape government portals and offer them as APIs. These 'shadow APIs' will become the de facto standard for accessing public data, especially for AI training. Governments will either have to build their own unified APIs or watch their data be commoditized by outsiders.
Prediction 2: Legal Backlash and Reform
The UK government will face pressure to either legitimize scraping or provide official APIs. The most likely outcome is a compromise: the government will launch a pilot program for a unified planning data API, but it will take 3-5 years to roll out fully. In the meantime, scraping will continue in a legal gray zone.
Prediction 3: AI-Driven Data Liberation
As AI agents become more capable of navigating heterogeneous systems, we will see automated 'data liberation' bots that can scrape, normalize, and publish datasets with minimal human intervention. This will democratize access to public data but also create new challenges around data quality and legal compliance.
What to Watch Next:
- The developer's GitHub repository for updates on data refresh cycles.
- Any legal actions from UK councils against scrapers.
- Announcements from the UK Ministry of Housing, Communities and Local Government regarding a unified planning data platform.
- The emergence of startups offering 'public data as a service' using similar scraping techniques.
The era of 'public data' being locked in digital silos is ending. The question is not whether it will be liberated, but who will do it first—and who will profit.