What We Learned Running 10,000 Property-Based Tests Against a Calendar Engine

The Truth Engine is the calendar computation library inside Temporal Cortex. It handles RRULE expansion, timezone conversion, availability merging, and duration calculation. We ship it as a Rust crate, a WASM package, and a Python library — all from the same codebase, all expected to produce identical results.

Unit tests caught the obvious bugs. Property-based tests caught the ones that would have shipped.

Why property-based testing for calendar math

Unit tests verify specific examples: “Given this RRULE and this start date, produce these 5 dates.” They’re necessary but bounded by the developer’s imagination. If you don’t think of a particular edge case, you don’t write a test for it.

Property-based testing inverts this. Instead of specifying examples, you specify invariants — properties that must hold for all valid inputs. The testing framework generates thousands of random inputs and checks every one. You stop testing what you thought of and start testing what’s universally true.

Calendar math is uniquely suited to this approach because it has strong invariants:

Every expanded RRULE instance must fall within the DTSTART-UNTIL range
Converting a timestamp to another timezone and back must return the original
Availability intersection must be a subset of each input’s availability
Duration between two timestamps must be strictly positive whenever end > start
COUNT-limited RRULEs must never produce more instances than COUNT

These aren’t aspirational properties. They’re mathematical facts about how time works. If any generated input violates them, the implementation has a bug.
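To make the idea concrete without any framework, here is a minimal hand-rolled sketch of a property check: generate many inputs, test one invariant, and return the first counterexample. Everything in it is hypothetical and for illustration only (`XorShift64`, `apply_offset`, `check_offset_roundtrip`); it uses fixed UTC offsets rather than real timezones and is not how proptest or the Truth Engine work internally.

```rust
/// Tiny deterministic PRNG (xorshift64), used so the sketch needs no external crates.
/// The seed must be non-zero.
struct XorShift64(u64);

impl XorShift64 {
    /// Advance the generator and return the next pseudo-random value.
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Pseudo-random integer in the half-open range [lo, hi). Requires lo < hi.
    fn in_range(&mut self, lo: i64, hi: i64) -> i64 {
        let span = (hi - lo) as u64;
        lo + (self.next_u64() % span) as i64
    }
}

/// Hypothetical stand-in for a conversion: shift a timestamp by a fixed offset.
fn apply_offset(secs: i64, offset_secs: i64) -> i64 {
    secs + offset_secs
}

/// Check the round-trip invariant on `cases` generated inputs.
/// Returns the first counterexample `(timestamp, offset)` if the invariant fails.
fn check_offset_roundtrip(cases: u32, seed: u64) -> Result<(), (i64, i64)> {
    let mut rng = XorShift64(seed);
    for _ in 0..cases {
        let ts = rng.in_range(0, 4_102_444_800); // seconds since the Unix epoch
        let offset = rng.in_range(-12 * 3600, 14 * 3600 + 1); // UTC-12:00 ..= UTC+14:00
        let back = apply_offset(apply_offset(ts, offset), -offset);
        if back != ts {
            return Err((ts, offset));
        }
    }
    Ok(())
}
```

Calling `check_offset_roundtrip(256, 42)` runs 256 generated cases against the invariant. A framework like proptest adds what this sketch lacks: composable generators, failure persistence, and shrinking.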

The proptest setup

We use proptest for Rust. The key design decision was building custom strategies that generate realistic calendar data — not arbitrary bytes, but structurally valid RRULEs, timezone-aware datetimes, and sensible date ranges.

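use proptest::prelude::*;
// `Frequency`, `Weekday`, and `RRule` are Truth Engine types; their import path is omitted in this excerpt.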

/// Generate a valid RRULE frequency
fn arb_frequency() -> impl Strategy<Value = Frequency> {
    prop_oneof![
        Just(Frequency::Daily),
        Just(Frequency::Weekly),
        Just(Frequency::Monthly),
        Just(Frequency::Yearly),
    ]
}

/// Generate a valid BYDAY value
fn arb_weekday() -> impl Strategy<Value = Weekday> {
    prop_oneof![
        Just(Weekday::Mon),
        Just(Weekday::Tue),
        Just(Weekday::Wed),
        Just(Weekday::Thu),
        Just(Weekday::Fri),
        Just(Weekday::Sat),
        Just(Weekday::Sun),
    ]
}

/// Generate a realistic RRULE with constrained parameters
fn arb_rrule() -> impl Strategy<Value = RRule> {
    (
        arb_frequency(),
        1..=4u32,                          // INTERVAL: 1-4
        prop::collection::vec(arb_weekday(), 0..=3),  // BYDAY: 0-3 days
        prop::option::of(1..=366i32),      // BYSETPOS: optional, 1-366 (positive positions only)
        prop::option::of(1..=100u32),      // COUNT: optional, 1-100
    )
        .prop_map(|(freq, interval, by_day, by_set_pos, count)| {
            RRule::new(freq)
                .interval(interval)
                .by_day(by_day)
                .by_set_pos(by_set_pos.into_iter().collect())
                .count(count)
        })
}

The strategies constrain inputs to a realistic subset of the valid parameter space. INTERVAL stays between 1 and 4 because higher values are legal but rare and slow to expand. COUNT caps at 100 to keep test execution under a second. BYDAY includes 0 to 3 days because the interaction between multiple BYDAY values and INTERVAL is where most bugs hide.

Properties that caught real bugs

Property 1: Timezone conversion round-trip

Invariant: For any timestamp t and timezones tz_a and tz_b, converting t from tz_a to tz_b and back to tz_a must return the original timestamp.

proptest! {
    #[test]
    fn timezone_roundtrip(
        ts in arb_timestamp(),
        tz_a in arb_timezone(),
        tz_b in arb_timezone(),
    ) {
        let converted = convert_timezone(&ts, &tz_a, &tz_b);
        let roundtripped = convert_timezone(&converted, &tz_b, &tz_a);
        prop_assert_eq!(ts.to_utc(), roundtripped.to_utc());
    }
}

Bug caught: DST double-counting. When converting from America/New_York (UTC-5) to Europe/London (UTC+0) during the overlap hour of fall-back, our initial implementation applied the DST offset twice — once when resolving the source timezone’s ambiguity and once when converting to the target. The round-trip returned a timestamp off by one hour.

This bug only manifested during the 60-minute fall-back window, in specific timezone pairs where both zones observe DST but transition on different dates. No unit test covered this combination. proptest found it in under 200 iterations.
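The fall-back ambiguity can be sketched with a toy model. The `ToyZone` type below is hypothetical: it models a single transition with two fixed offsets and is not the Truth Engine's API or a real tz database lookup. The point it illustrates is that a local wall-clock reading in the overlap maps to two instants, so the conversion needs an explicit disambiguation flag and must apply the chosen offset exactly once.

```rust
/// Hypothetical toy zone with one fall-back transition (not a real tz database entry).
/// Offsets are seconds east of UTC; `offset_before` applies to instants earlier than
/// `transition_utc`, `offset_after` from that instant on.
#[derive(Clone, Copy)]
struct ToyZone {
    transition_utc: i64,
    offset_before: i64,
    offset_after: i64,
}

impl ToyZone {
    /// UTC instant -> local wall-clock seconds. Always unambiguous.
    fn to_local(&self, utc: i64) -> i64 {
        let offset = if utc < self.transition_utc {
            self.offset_before
        } else {
            self.offset_after
        };
        utc + offset
    }

    /// Local wall-clock seconds -> UTC instant. During the fall-back overlap both
    /// offsets yield a valid instant, so `second_pass` picks which one is meant.
    /// The chosen offset is applied exactly once.
    fn to_utc(&self, local: i64, second_pass: bool) -> i64 {
        let early = local - self.offset_before;
        let late = local - self.offset_after;
        let early_ok = early < self.transition_utc;
        let late_ok = late >= self.transition_utc;
        match (early_ok, late_ok) {
            (true, true) => if second_pass { late } else { early },
            (true, false) => early,
            // Unambiguous post-transition readings land here. A spring-forward gap
            // (neither candidate valid) is outside this sketch and also falls through.
            _ => late,
        }
    }
}

/// Round-trip an instant through local time and back, carrying the overlap flag.
fn roundtrip(zone: &ToyZone, utc: i64) -> i64 {
    let second_pass = utc >= zone.transition_utc;
    zone.to_utc(zone.to_local(utc), second_pass)
}
```

With `offset_before = -4 * 3600` and `offset_after = -5 * 3600` (a New York style fall-back), the instants 30 minutes before and 30 minutes after the transition share one wall-clock reading, and `roundtrip` still returns each original instant because the flag travels with the local time.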

Property 2: RRULE expansion monotonicity

Invariant: Expanded RRULE instances must be in strictly ascending chronological order.

proptest! {
    #[test]
    fn rrule_instances_monotonic(
        rrule in arb_rrule(),
        dtstart in arb_datetime_in_range(2020, 2030),
        tz in arb_timezone(),
    ) {
        let instances = expand_rrule(&rrule, &dtstart, &tz, Some(50));
        for window in instances.windows(2) {
            prop_assert!(window[0] < window[1],
                "Non-monotonic: {:?} >= {:?}", window[0], window[1]);
        }
    }
}

Bug caught: BYSETPOS with INTERVAL > 1 produced duplicate instances. When a monthly RRULE with INTERVAL=2 and BYSETPOS=1 expanded, the first occurrence appeared twice — once from the natural expansion and once from the BYSETPOS selection. The instances were equal, not ascending, violating the monotonicity property.

The root cause: our expansion pipeline deduplicated before applying BYSETPOS, so duplicates introduced by the BYSETPOS selection were never removed. The fix was a single line: dedup after BYSETPOS filtering. But the bug only appeared with specific INTERVAL and BYSETPOS combinations that we hadn’t manually enumerated.
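As a rough illustration of "dedup after BYSETPOS filtering", the sketch below selects positions from one period's candidate set and only then sorts and deduplicates. The function names and the use of plain `i64` timestamps are assumptions made for this example; the real pipeline is not shown in this post.

```rust
/// Select candidates by BYSETPOS-style positions, then sort and dedup the result.
/// Positions are 1-based; negative values count from the end (-1 is the last
/// candidate). Out-of-range positions and 0 are ignored.
fn select_by_set_pos(candidates: &[i64], positions: &[i32]) -> Vec<i64> {
    let n = candidates.len() as i64;
    let mut picked: Vec<i64> = positions
        .iter()
        .filter_map(|&pos| {
            let idx = match pos {
                p if p > 0 => i64::from(p) - 1,
                p if p < 0 => n + i64::from(p),
                _ => return None, // 0 is not a valid BYSETPOS value
            };
            if (0..n).contains(&idx) {
                Some(candidates[idx as usize])
            } else {
                None
            }
        })
        .collect();
    // Two positions can name the same candidate (for example 1 and -3 in a
    // three-element set), so deduplication has to run after selection.
    picked.sort_unstable();
    picked.dedup();
    picked
}

/// The monotonicity property from the test above, as a plain predicate.
fn is_strictly_ascending(instances: &[i64]) -> bool {
    instances.windows(2).all(|w| w[0] < w[1])
}
```

For candidates `[10, 20, 30]`, positions `[1, -3]` both refer to the first element, and the function returns `[10]` rather than `[10, 10]`.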

Property 3: Availability intersection correctness

Invariant: The intersection of two availability windows must be a subset of each input. Every minute in the intersection must be free in both input schedules.

proptest! {
    #[test]
    fn availability_intersection_subset(
        slots_a in arb_time_slots(1..=10),
        slots_b in arb_time_slots(1..=10),
    ) {
        let intersection = merge_availability(&slots_a, &slots_b);
        for slot in &intersection {
            // `slot` is already a reference here, so it is passed through directly.
            prop_assert!(is_within_any(slot, &slots_a),
                "Intersection slot {:?} not in schedule A", slot);
            prop_assert!(is_within_any(slot, &slots_b),
                "Intersection slot {:?} not in schedule B", slot);
        }
    }
}

Bug caught: Adjacent slots with touching boundaries (one ends at 2:00 PM, another starts at 2:00 PM) were incorrectly merged into a single slot in the intersection, even when only one input had the gap bridged. The intersection contained a continuous block from 1:00 PM to 3:00 PM when it should have been two separate blocks: 1:00-2:00 PM and 2:00-3:00 PM.

The fix required distinguishing between “adjacent” (touching endpoints, separate slots) and “overlapping” (shared time range, merged) in the interval merge algorithm.
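One way to keep adjacent slots separate is to intersect pairwise over half-open intervals and never coalesce the outputs. The sketch below is a generic two-pointer intersection, not the Truth Engine's merge algorithm; the `Slot` type (minutes since midnight) and the function name are assumptions for illustration, and each input is assumed sorted with no overlaps inside a single schedule.

```rust
/// Half-open time slot [start, end), in minutes since midnight (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Slot {
    start: u32,
    end: u32,
}

/// Intersect two sorted, internally non-overlapping slot lists.
/// Each output slot lies inside exactly one slot of `a` and one slot of `b`,
/// so touching slots such as 13:00-14:00 and 14:00-15:00 stay separate.
fn intersect_slots(a: &[Slot], b: &[Slot]) -> Vec<Slot> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        let start = a[i].start.max(b[j].start);
        let end = a[i].end.min(b[j].end);
        // Strict comparison: slots that only touch share no time and yield nothing.
        if start < end {
            out.push(Slot { start, end });
        }
        // Advance whichever slot finishes first.
        if a[i].end <= b[j].end {
            i += 1;
        } else {
            j += 1;
        }
    }
    out
}
```

With schedule A holding 13:00-14:00 and 14:00-15:00 as two slots and schedule B holding a single 13:00-15:00 slot, the result is the two original one-hour slots, each of which sits inside a single slot of both inputs.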

Property 4: COUNT ceiling

Invariant: An RRULE with COUNT=n must produce at most n instances. Combined with EXDATE exclusions, it may produce fewer.

proptest! {
    #[test]
    fn count_never_exceeded(
        rrule in arb_rrule_with_count(),
        dtstart in arb_datetime_in_range(2020, 2030),
        exdates in prop::collection::vec(arb_datetime_in_range(2020, 2030), 0..=5),
        tz in arb_timezone(),
    ) {
        let count = rrule.count().unwrap();
        let instances = expand_rrule_with_exdates(
            &rrule, &dtstart, &tz, &exdates
        );
        prop_assert!(instances.len() <= count as usize,
            "Got {} instances for COUNT={}", instances.len(), count);
    }
}

Bug caught: A leap-day rule, FREQ=YEARLY;BYMONTH=2;BYMONTHDAY=29;COUNT=3 starting in 2024, should produce instances in 2024, 2028, and 2032. Our implementation initially counted years with no February 29 (2025, 2026, 2027) toward COUNT, so expansion stopped after 2026 with only one instance produced. COUNT should count produced instances, not evaluated periods.
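The distinction between produced instances and evaluated periods can be shown with a small standalone sketch. It is specialized to a yearly February 29 rule, returns years instead of full timestamps, and uses hypothetical names (`expand_feb29`, `max_years`), so it should be read as an illustration rather than the library's expansion code. One caveat: under-production like the case above passes an upper-bound assertion, so the sketch is best paired with an exact expectation for a rule that has no UNTIL and no exclusions.

```rust
/// Gregorian leap-year rule.
fn is_leap_year(year: i32) -> bool {
    (year % 4 == 0 && year % 100 != 0) || year % 400 == 0
}

/// Expand a yearly "February 29" rule starting at `start_year`.
/// `count` limits produced instances; years without a February 29 are
/// evaluated but never counted. `max_years` is a safety bound on the search.
fn expand_feb29(start_year: i32, count: usize, max_years: i32) -> Vec<i32> {
    let mut produced = Vec::with_capacity(count);
    let mut year = start_year;
    while produced.len() < count && year < start_year + max_years {
        if is_leap_year(year) {
            produced.push(year); // only a produced instance advances toward COUNT
        }
        year += 1;
    }
    produced
}
```

Here `expand_feb29(2024, 3, 100)` yields 2024, 2028, and 2032. Starting in 2096 with a count of 2 yields 2096 and 2104, since 2100 is not a leap year.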

What property-based testing cannot catch

Honesty about limitations matters. Three categories of bugs largely escape property-based testing.

Semantic correctness of natural language. resolve_datetime("next Tuesday") needs to return the correct Tuesday. But “correct” depends on cultural convention (does “next Tuesday” mean “the coming Tuesday” or “Tuesday of next week”?), and the property test can’t generate a natural-language expression and its expected interpretation. We use unit tests for the 60+ supported expressions, with the property tests only verifying structural properties (output is a valid RFC 3339 timestamp, output is in the future for “next X” expressions).

Provider API behavior. The Truth Engine is pure computation — no network calls. But Temporal Cortex also interacts with Google Calendar, Outlook, and CalDAV APIs. Their behavior (rate limits, eventual consistency, undocumented quirks) can’t be property-tested because the property would be “the API does what it documents,” which is not always true.

Cross-platform consistency. We verify that Rust native, WASM, and Python produce identical output for identical inputs. But the inputs arrive through different serialization paths (JSON for WASM, PyO3 for Python), and edge cases in serialization (floating-point precision, Unicode normalization) are hard to express as properties.

Running the suite

The full property-based test suite generates over 9,000 test cases across all properties and runs in under 30 seconds on CI. Each property uses proptest’s default configuration of 256 cases, and we have roughly 36 property tests across the Truth Engine, TOON parser, and availability merger (36 × 256 = 9,216 cases).

# Run only property-based tests
cargo test --package truth-engine proptest

# Run with more cases (slower, useful for pre-release)
PROPTEST_CASES=1024 cargo test --package truth-engine proptest

When a property test fails, proptest automatically shrinks the failing input to a minimal reproduction case. The DST double-counting bug was originally found with a complex multi-field timestamp, but proptest shrank it to the simplest input that still triggered the failure: a single conversion during the fall-back hour. This makes debugging fast, because you get the smallest input that demonstrates the problem.
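The idea behind shrinking can be sketched in a few lines. The function below is a deliberately simplified greedy shrinker for a single integer and is not proptest's algorithm, which works through the value trees produced by its strategies. `shrink_u32` and its candidate list are assumptions made for this example.

```rust
/// Greedily shrink a failing `u32` input toward a smaller failing value.
/// `fails` returns true when the property is violated for the given input.
/// The result is a local minimum: none of the tried smaller candidates fail.
fn shrink_u32<F: Fn(u32) -> bool>(failing: u32, fails: F) -> u32 {
    let mut current = failing;
    loop {
        let mut improved = false;
        // Try the most aggressive simplification first, then gentler ones.
        for candidate in [0, current / 2, current.saturating_sub(1)] {
            if candidate < current && fails(candidate) {
                current = candidate;
                improved = true;
                break;
            }
        }
        if !improved {
            return current;
        }
    }
}
```

If a property fails for every input of 1,000 or more, `shrink_u32(987_654, |x| x >= 1_000)` reduces the original counterexample to 1,000, the smallest value that still fails.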

The confidence argument

We can’t prove the Truth Engine is correct. But we can demonstrate that it satisfies 36 independently specified invariants across 9,000+ randomly generated inputs, on three platforms, with automatic shrinking that isolates failures to minimal cases.

This gives us a qualitatively different kind of confidence than unit tests alone. Unit tests verify that the examples we thought of work. Property-based tests verify that the mathematical relationships we specified hold across the space of valid inputs — including inputs we never would have written by hand.

For a library that handles other people’s calendars, that difference matters. A timezone conversion bug doesn’t crash the program. It silently books a meeting at the wrong time. Property-based testing is how we catch the silent bugs before they ship.

The Truth Engine is open source under MIT/Apache-2.0. Browse the source, the property tests, and the strategies on GitHub. It’s available on crates.io, npm, and PyPI.