Documentation

Reference for the CSV schema, data requirements, and sample audit reports.

📊 CSV Schema

Your uploaded file must meet the following requirements:

CSV format only (comma-separated, with a header row)
Maximum file size: 500 MB
Required columns: candidate_id, sex, race_ethnicity
Either model_score (0–1) or model_pred (0/1) is required; if both exist, model_score will be used
No personal identifiers — candidate IDs must be anonymized
Unknown or missing demographic values must be left blank (not “NA”, “?”, “unknown”, etc.)
At least 1,000 rows
Datasets covering less than 6 months need at least 5,000 rows; datasets covering less than 3 months need at least 10,000 rows
Rows missing sex or race_ethnicity must make up less than 20% of the file
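
The column and blank-value rules above can be checked locally before uploading. The following is a minimal sketch using only Python's standard library. The function name `check_header_and_blanks` and the `PLACEHOLDERS` list are assumptions made for illustration and are not part of BiasBeacon.

```python
import csv
import io

REQUIRED = {"candidate_id", "sex", "race_ethnicity"}
OUTCOMES = {"model_score", "model_pred"}
# Hypothetical placeholder strings that should have been left blank; extend as needed.
PLACEHOLDERS = {"na", "n/a", "?", "unknown", "null", "none"}

def check_header_and_blanks(csv_text):
    """Return a list of human-readable problems found in the CSV text."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    header = set(reader.fieldnames or [])

    missing = REQUIRED - header
    if missing:
        problems.append(f"missing required columns: {sorted(missing)}")
    if not (OUTCOMES & header):
        problems.append("need a model_score or model_pred column")

    # Line numbers assume no quoted field spans several lines; the header is line 1.
    for line_no, row in enumerate(reader, start=2):
        for col in ("sex", "race_ethnicity"):
            value = (row.get(col) or "").strip()
            if value.lower() in PLACEHOLDERS:
                problems.append(
                    f"line {line_no}: {col} is {value!r}; leave it blank instead"
                )
    return problems
```

An empty list means the header and demographic columns passed these particular checks. It does not cover file size, row counts, or anonymization.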

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "BiasBeacon CSV Row Schema",
  "description": "Each row represents one candidate's model evaluation record. The file must contain a header row and one or more candidate rows.",
  "type": "array",
  "minItems": 1,
  "items": {
    "type": "object",
    "properties": {
      "candidate_id": {
        "type": "string",
        "description": "Unique anonymized candidate identifier (must not include PII)."
      },
      "sex": {
        "type": "string",
        "description": "Candidate sex (M/F or equivalent coding). Leave empty if unknown."
      },
      "race_ethnicity": {
        "type": "string",
        "description": "Candidate race or ethnicity. Leave empty if unknown. Consistent labeling is required."
      },
      "model_score": {
        "type": "number",
        "minimum": 0,
        "maximum": 1,
        "description": "Predicted probability or continuous model score between 0 and 1. Optional if model_pred is provided."
      },
      "model_pred": {
        "type": "integer",
        "enum": [
          0,
          1
        ],
        "description": "Binary model output (0 = not selected, 1 = selected). Optional if model_score is provided."
      }
    },
    "required": [
      "candidate_id",
      "sex",
      "race_ethnicity"
    ],
    "anyOf": [
      {
        "required": [
          "model_score"
        ]
      },
      {
        "required": [
          "model_pred"
        ]
      }
    ],
    "additionalProperties": false
  }
}
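
CSV cells arrive as strings, while the schema above expects a number for model_score and an integer for model_pred. The sketch below shows one way to convert a single row into a typed record and apply the same constraints by hand. The helper name `parse_row` is hypothetical, and treating an empty cell as "not provided" is an assumption rather than documented BiasBeacon behavior.

```python
ALLOWED = {"candidate_id", "sex", "race_ethnicity", "model_score", "model_pred"}

def parse_row(row):
    """Convert one CSV row (a dict of strings) into a typed record.

    Raises ValueError if the row breaks a constraint from the schema.
    """
    extra = set(row) - ALLOWED
    if extra:  # mirrors "additionalProperties": false
        raise ValueError(f"unexpected columns: {sorted(extra)}")
    for col in ("candidate_id", "sex", "race_ethnicity"):
        if col not in row:  # mirrors the "required" list
            raise ValueError(f"missing column: {col}")

    record = {
        "candidate_id": row["candidate_id"],
        "sex": row["sex"],                        # empty string means unknown
        "race_ethnicity": row["race_ethnicity"],  # empty string means unknown
    }

    score = row.get("model_score", "")
    if score != "":
        value = float(score)  # raises ValueError on non-numeric text
        if not 0 <= value <= 1:
            raise ValueError(f"model_score out of range: {score}")
        record["model_score"] = value

    pred = row.get("model_pred", "")
    if pred != "":
        if pred not in ("0", "1"):
            raise ValueError(f"model_pred must be 0 or 1, got: {pred}")
        record["model_pred"] = int(pred)

    # mirrors the "anyOf" clause: at least one model output is needed
    if "model_score" not in record and "model_pred" not in record:
        raise ValueError("row needs model_score or model_pred")
    return record
```

Rows produced by `csv.DictReader` can be passed to this helper directly.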

📄 Example CSVs

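Each file needs either a model_score column or a model_pred column. The examples below show one of each, with blank cells marking unknown demographic values.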

Example 1: Using `model_score`

candidate_id,sex,race_ethnicity,model_score
1001,M,white,0.87
1002,F,black,0.43
1003,F,,0.66
1004,M,hispanic,0.65

Example 2: Using `model_pred`

candidate_id,sex,race_ethnicity,model_pred
2001,F,white,1
2002,M,,0
2003,F,asian,1
2004,,hispanic,1

📈 Data Quality & Warnings

If your CSV doesn’t fully meet quality requirements, BiasBeacon will still generate the audit but include warning banners in the final PDF report. These indicate reduced statistical confidence or representativeness.

⚠️ Low Row Count

The dataset contains fewer than 1,000 rows. Bias estimates may be unstable or unreliable with such limited data.

⚠️ High Unknown Rate

More than 20% of rows are missing at least one protected attribute (sex or race_ethnicity), which reduces the accuracy of group comparisons.

⚠️ Unreliable — Less Than 3 Months

The dataset covers less than 3 months of hiring data and includes fewer than 10,000 rows. Temporal coverage may be too short for stable, representative metrics.

⚠️ Unreliable — Less Than 6 Months

The dataset covers less than 6 months and includes fewer than 5,000 rows. The sample may not fully capture historical or seasonal hiring patterns.

These warnings appear automatically in your audit PDF if the upload violates any of the thresholds. They do not block report generation, but they signal reduced audit reliability.
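
The warning conditions above can be expressed as a small function, which may help when estimating in advance which banners a dataset is likely to trigger. This is a sketch under stated assumptions and not BiasBeacon's actual implementation: the function name `audit_warnings` and the warning keys are hypothetical, the coverage period is passed in by the caller because the CSV schema has no date column, and each condition is applied independently exactly as worded, so a short dataset can trigger both coverage warnings.

```python
def audit_warnings(row_count, rows_missing_attr, coverage_months):
    """Return the hypothetical warning keys that the documented thresholds imply.

    row_count         -- number of candidate rows in the file
    rows_missing_attr -- rows with a blank sex or race_ethnicity
    coverage_months   -- period the data covers, supplied by the caller
    """
    warnings = []
    if row_count < 1000:
        warnings.append("low_row_count")
    # Strictly more than 20% of rows missing a protected attribute.
    if row_count > 0 and rows_missing_attr / row_count > 0.20:
        warnings.append("high_unknown_rate")
    if coverage_months < 3 and row_count < 10000:
        warnings.append("unreliable_lt_3_months")
    if coverage_months < 6 and row_count < 5000:
        warnings.append("unreliable_lt_6_months")
    return warnings
```

For example, `audit_warnings(500, 0, 12)` yields only the low row count key, since a 12-month dataset is outside both coverage conditions.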