Data Validation

Graphizy provides comprehensive input validation to help you debug data format issues and ensure optimal performance. This guide covers the validation function and common data problems you might encounter.

Quick Validation

Use the built-in validation function to check your data before creating graphs:

from graphizy import validate_graphizy_input
import numpy as np

# Your data
data = np.array([
    [0, 100, 200],
    [1, 300, 400],
    [2, 500, 600]
])

# Validate your input
result = validate_graphizy_input(
    data,
    aspect="array",           # or "dict"
    dimension=(800, 800),     # your image dimensions
    proximity_thresh=50.0,    # if using proximity graphs
    verbose=True              # print detailed results
)

if result["valid"]:
    print("✅ Data is ready!")
else:
    print("❌ Issues found:")
    for error in result["errors"]:
        print(f"  - {error}")

Validation Function Reference

Function Signature:

validate_graphizy_input(
    data_points,                    # Your data (array or dict)
    aspect="array",                 # "array" or "dict"
    data_shape=None,               # Expected data structure
    dimension=(1200, 1200),        # Image dimensions (width, height)
    proximity_thresh=None,         # Proximity threshold if applicable
    verbose=True                   # Print detailed results
)

Return Value:

The function returns a dictionary with:

{
    "valid": True/False,           # Overall validity
    "errors": [],                  # List of error messages
    "warnings": [],                # List of warning messages
    "info": {},                    # Data information (shape, ranges, etc.)
    "suggestions": []              # Performance and usage suggestions
}

Data Format Requirements

Array Format (aspect=”array”):

Your data should be a 2D NumPy array with at least 3 columns:

# ✅ Correct format: [id, x, y, additional_columns...]
data = np.array([
    [0, 100, 200],      # object 0 at (100, 200)
    [1, 300, 400],      # object 1 at (300, 400)
    [2, 500, 600]       # object 2 at (500, 600)
])

# ✅ With additional columns is fine
data = np.array([
    [0, 100, 200, 1.5, True],   # [id, x, y, speed, active]
    [1, 300, 400, 2.0, False],
    [2, 500, 600, 1.8, True]
])

Dictionary Format (aspect=”dict”):

Your data should be a dictionary with required keys:

# ✅ Correct format
data = {
    "id": [0, 1, 2],
    "x": [100, 300, 500],
    "y": [200, 400, 600]
}

# ✅ Additional keys are fine
data = {
    "id": [0, 1, 2],
    "x": [100, 300, 500],
    "y": [200, 400, 600],
    "speed": [1.5, 2.0, 1.8],
    "color": ["red", "blue", "green"]
}

Common Data Issues and Solutions

1. String IDs (Most Common Issue)

Problem: Using string identifiers instead of numeric ones.

# ❌ WRONG - This will cause errors
bad_data = np.array([
    ["particle_1", 100, 200],
    ["particle_2", 300, 400]
])

# Error: "Object IDs must be numeric, not string type"

Solution: Use numeric IDs:

# ✅ CORRECT
good_data = np.array([
    [0, 100, 200],      # Use 0, 1, 2... or any numeric IDs
    [1, 300, 400]
], dtype=int)

# Or convert strings to numbers
string_ids = ["particle_1", "particle_2", "particle_3"]
numeric_ids = list(range(len(string_ids)))  # [0, 1, 2]

2. Wrong Array Dimensions

Problem: 1D arrays or wrong shapes.

# ❌ WRONG - 1D array
bad_data = np.array([1, 2, 3, 4, 5, 6])

# ❌ WRONG - 3D array
bad_data = np.array([[[1, 2, 3]]])

Solution: Use 2D arrays:

# ✅ CORRECT - Reshape if needed
data_1d = np.array([0, 100, 200, 1, 300, 400])
good_data = data_1d.reshape(-1, 3)  # Reshape to 2D

# Result: [[0, 100, 200], [1, 300, 400]]

3. Insufficient Columns

Problem: Less than 3 columns (need at least id, x, y).

# ❌ WRONG - Only 2 columns
bad_data = np.array([[0, 100], [1, 200]])

Solution: Add the missing coordinate:

# ✅ CORRECT - Add missing y coordinates
x_coords = np.array([[0, 100], [1, 200]])
y_coords = np.random.randint(0, 400, (len(x_coords), 1))
good_data = np.column_stack([x_coords, y_coords])

# Or create from scratch
good_data = np.array([[0, 100, 150], [1, 200, 250]])

4. Coordinates Outside Bounds

Problem: Points outside the defined image dimensions.

# ❌ PROBLEMATIC - x=1300 exceeds dimension width of 1200
data = np.array([[0, 1300, 200]])

# Warning: "X coordinates outside dimension bounds [0, 1200)"

Solutions:

# Option 1: Clip coordinates to bounds
data[:, 1] = np.clip(data[:, 1], 0, 1199)  # x coordinates
data[:, 2] = np.clip(data[:, 2], 0, 1199)  # y coordinates

# Option 2: Scale coordinates to fit
def scale_to_fit(data, dimension):
    x_min, x_max = data[:, 1].min(), data[:, 1].max()
    y_min, y_max = data[:, 2].min(), data[:, 2].max()

    # Scale x coordinates
    if x_max > x_min:
        data[:, 1] = (data[:, 1] - x_min) / (x_max - x_min) * (dimension[0] - 1)

    # Scale y coordinates
    if y_max > y_min:
        data[:, 2] = (data[:, 2] - y_min) / (y_max - y_min) * (dimension[1] - 1)

    return data

scaled_data = scale_to_fit(data, (1200, 1200))

# Option 3: Increase image dimensions
larger_dimension = (1500, 1500)  # Make room for all points

5. Dictionary Format Issues

Problem: Missing keys or mismatched array lengths.

# ❌ WRONG - Missing 'y' key
bad_data = {"id": [0, 1], "x": [100, 300]}

# ❌ WRONG - Mismatched lengths
bad_data = {
    "id": [0, 1, 2],        # 3 elements
    "x": [100, 300],        # 2 elements
    "y": [200, 400, 600]    # 3 elements
}

Solution:

# ✅ CORRECT - All required keys with matching lengths
good_data = {
    "id": [0, 1, 2],
    "x": [100, 300, 500],
    "y": [200, 400, 600]
}

# Fix mismatched lengths by trimming or padding
def fix_dict_lengths(data_dict):
    min_length = min(len(v) for v in data_dict.values())
    return {k: v[:min_length] for k, v in data_dict.items()}

fixed_data = fix_dict_lengths(bad_data)

Best Practices

Performance Tips:

Use appropriate data types:

# Use int32 for coordinates if possible (saves memory)
data = np.array(coordinates, dtype=np.int32)

# Use float32 instead of float64 for large datasets
data = data.astype(np.float32)

Validate early and often:

# Validate immediately after loading data
def load_and_validate(filename):
    data = np.loadtxt(filename)  # or your loading method
    result = validate_graphizy_input(data, verbose=False)

    if not result["valid"]:
        raise ValueError(f"Invalid data: {result['errors']}")

    return data

Handle large datasets efficiently:

# For very large datasets, validate a sample first
def validate_large_dataset(data, sample_size=1000):
    if len(data) > sample_size:
        sample_indices = np.random.choice(len(data), sample_size, replace=False)
        sample_data = data[sample_indices]
        result = validate_graphizy_input(sample_data, verbose=False)

        if not result["valid"]:
            print("❌ Sample validation failed - likely issues in full dataset")
            return result

        print(f"✅ Sample of {sample_size} points validated successfully")

    return validate_graphizy_input(data, verbose=True)

Integration with Graphizy Workflow:

def safe_graphizy_workflow(data, graph_type="delaunay"):
    """Complete safe workflow with validation"""

    # Step 1: Validate
    result = validate_graphizy_input(data, verbose=False)
    if not result["valid"]:
        print("❌ Validation failed:")
        for error in result["errors"]:
            print(f"  - {error}")
        return None

    # Step 2: Create grapher
    dimension = result["info"].get("dimension", (1200, 1200))
    grapher = Graphing(dimension=dimension)

    # Step 3: Create graph
    try:
        if graph_type == "delaunay":
            graph = grapher.make_delaunay(data)
        elif graph_type == "proximity":
            graph = grapher.make_proximity(data, proximity_thresh=50.0)
        elif graph_type == "mst":
            graph = grapher.make_mst(data)
        else:
            raise ValueError(f"Unknown graph type: {graph_type}")

        print(f"✅ Successfully created {graph_type} graph with {graph.vcount()} vertices")
        return graph, grapher

    except Exception as e:
        print(f"❌ Graph creation failed: {e}")
        return None

# Use the safe workflow
result = safe_graphizy_workflow(my_data, "delaunay")
if result:
    graph, grapher = result
    image = grapher.draw_graph(graph)
    grapher.show_graph(image)

This validation system helps ensure your data works perfectly with Graphizy and provides clear guidance when issues arise. Always validate your data first - it will save you time debugging later!