Data Validation
Graphizy provides an input validation function that helps you debug data format issues before you build a graph and flags potential performance problems. This guide covers that function and the common data problems you might encounter.
Quick Validation
Use the built-in validation function to check your data before creating graphs:
from graphizy import validate_graphizy_input
import numpy as np
# Your data
data = np.array([
    [0, 100, 200],
    [1, 300, 400],
    [2, 500, 600]
])
# Validate your input
result = validate_graphizy_input(
    data,
    aspect="array",  # or "dict"
    dimension=(800, 800),  # your image dimensions
    proximity_thresh=50.0,  # if using proximity graphs
    verbose=True  # print detailed results
)
if result["valid"]:
    print("✅ Data is ready!")
else:
    print("❌ Issues found:")
    for error in result["errors"]:
        print(f" - {error}")
Validation Function Reference
Function Signature:
validate_graphizy_input(
    data_points,  # Your data (array or dict)
    aspect="array",  # "array" or "dict"
    data_shape=None,  # Expected data structure
    dimension=(1200, 1200),  # Image dimensions (width, height)
    proximity_thresh=None,  # Proximity threshold if applicable
    verbose=True  # Print detailed results
)
Return Value:
The function returns a dictionary with:
{
    "valid": True/False,  # Overall validity
    "errors": [],  # List of error messages
    "warnings": [],  # List of warning messages
    "info": {},  # Data information (shape, ranges, etc.)
    "suggestions": []  # Performance and usage suggestions
}
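Because the return value is a plain dictionary, it is easy to turn into a log line or a short report. The sketch below assumes only the five keys listed above. `summarize_validation` is a hypothetical helper, not part of Graphizy, and the sample dictionary is hand-written for illustration rather than real validator output.

```python
def summarize_validation(result):
    """Format a validation result dict as a short multi-line report.

    Assumes the keys documented above: "valid", "errors", "warnings",
    "info", and "suggestions". Missing keys are treated as empty.
    """
    status = "valid" if result.get("valid", False) else "invalid"
    lines = [f"Validation: {status}"]
    for key in ("errors", "warnings", "suggestions"):
        for message in result.get(key, []):
            # Drop the trailing "s" so each line reads "error: ...", etc.
            lines.append(f"  {key[:-1]}: {message}")
    return "\n".join(lines)


# Hand-written example dict (illustrative, not real validator output)
example_result = {
    "valid": False,
    "errors": ["Object IDs must be numeric, not string type"],
    "warnings": [],
    "info": {},
    "suggestions": [],
}
report = summarize_validation(example_result)
print(report)
```

A helper like this pairs naturally with `verbose=False`, so the validator stays quiet and your own code decides how to report problems.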
Data Format Requirements
Array Format (aspect="array"):
Your data should be a 2D NumPy array with at least 3 columns:
# ✅ Correct format: [id, x, y, additional_columns...]
data = np.array([
    [0, 100, 200],  # object 0 at (100, 200)
    [1, 300, 400],  # object 1 at (300, 400)
    [2, 500, 600]  # object 2 at (500, 600)
])
# ✅ With additional columns is fine
# (a NumPy array has a single dtype, so True/False are stored as 1.0/0.0 here)
data = np.array([
    [0, 100, 200, 1.5, True],  # [id, x, y, speed, active]
    [1, 300, 400, 2.0, False],
    [2, 500, 600, 1.8, True]
])
Dictionary Format (aspect="dict"):
Your data should be a dictionary with required keys:
# ✅ Correct format
data = {
    "id": [0, 1, 2],
    "x": [100, 300, 500],
    "y": [200, 400, 600]
}
# ✅ Additional keys are fine
data = {
    "id": [0, 1, 2],
    "x": [100, 300, 500],
    "y": [200, 400, 600],
    "speed": [1.5, 2.0, 1.8],
    "color": ["red", "blue", "green"]
}
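The two formats carry the same information, so it can be handy to convert between them. The following is a minimal sketch using plain NumPy; `dict_to_array` and `array_to_dict` are hypothetical helper names, not Graphizy functions, and they handle only the three required fields.

```python
import numpy as np


def dict_to_array(data):
    """Stack the "id", "x", and "y" lists into an (n, 3) array.

    Extra keys such as "speed" or "color" are ignored here.
    """
    return np.column_stack([data["id"], data["x"], data["y"]])


def array_to_dict(array):
    """Split the first three columns of a 2D array into a dict of lists."""
    return {
        "id": array[:, 0].tolist(),
        "x": array[:, 1].tolist(),
        "y": array[:, 2].tolist(),
    }


data = {"id": [0, 1, 2], "x": [100, 300, 500], "y": [200, 400, 600]}
as_array = dict_to_array(data)       # one row per object: [id, x, y]
round_trip = array_to_dict(as_array)  # back to the dictionary layout
```

Non-numeric extras such as `"color"` are left out on purpose, since mixing strings into the array would turn every column into strings.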
Common Data Issues and Solutions
1. String IDs (Most Common Issue)
Problem: Using string identifiers instead of numeric ones.
# ❌ WRONG - This will cause errors
bad_data = np.array([
    ["particle_1", 100, 200],
    ["particle_2", 300, 400]
])
# Error: "Object IDs must be numeric, not string type"
Solution: Use numeric IDs:
# ✅ CORRECT
good_data = np.array([
    [0, 100, 200],  # Use 0, 1, 2... or any numeric IDs
    [1, 300, 400]
], dtype=int)
# Or convert strings to numbers
string_ids = ["particle_1", "particle_2", "particle_3"]
numeric_ids = list(range(len(string_ids))) # [0, 1, 2]
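If the original labels matter later (for example, to name vertices in a report), it helps to keep a lookup table alongside the numeric IDs. This is a sketch in plain Python and NumPy; the variable names and coordinate values are illustrative, and nothing here is a Graphizy API.

```python
import numpy as np

string_ids = ["particle_1", "particle_2", "particle_3"]
x = [100, 300, 500]
y = [200, 400, 600]

# Assign each distinct label a stable integer, in order of first appearance
label_to_id = {}
for label in string_ids:
    label_to_id.setdefault(label, len(label_to_id))

numeric_ids = [label_to_id[label] for label in string_ids]
data = np.column_stack([numeric_ids, x, y])  # numeric [id, x, y] rows

# Reverse lookup, to translate numeric IDs back to the original labels
id_to_label = {num: label for label, num in label_to_id.items()}
```

Unlike `range(len(string_ids))`, this mapping gives repeated labels the same numeric ID, which matters if the same object appears in more than one row.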
2. Wrong Array Dimensions
Problem: 1D arrays or wrong shapes.
# ❌ WRONG - 1D array
bad_data = np.array([1, 2, 3, 4, 5, 6])
# ❌ WRONG - 3D array
bad_data = np.array([[[1, 2, 3]]])
Solution: Use 2D arrays:
# ✅ CORRECT - Reshape if needed
data_1d = np.array([0, 100, 200, 1, 300, 400])
good_data = data_1d.reshape(-1, 3) # Reshape to 2D
# Result: [[0, 100, 200], [1, 300, 400]]
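`reshape(-1, 3)` raises an error when the number of values is not a multiple of 3, so a small guard with a clearer message can make loading code easier to debug. A minimal sketch, where `to_rows` is a hypothetical helper name:

```python
import numpy as np


def to_rows(flat, n_columns=3):
    """Reshape a flat sequence into rows of n_columns values each."""
    flat = np.asarray(flat)
    if flat.ndim != 1:
        raise ValueError(f"expected a 1D array, got {flat.ndim}D")
    if flat.size % n_columns != 0:
        raise ValueError(
            f"{flat.size} values cannot be split into rows of {n_columns}"
        )
    return flat.reshape(-1, n_columns)


rows = to_rows([0, 100, 200, 1, 300, 400])
```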
3. Insufficient Columns
Problem: Fewer than 3 columns (need at least id, x, y).
# ❌ WRONG - Only 2 columns
bad_data = np.array([[0, 100], [1, 200]])
Solution: Add the missing coordinate column:
# ✅ CORRECT - Append the missing y column
ids_and_x = np.array([[0, 100], [1, 200]])  # columns: id, x
# Random y values are placeholders for illustration; use your real data
y_coords = np.random.randint(0, 400, (len(ids_and_x), 1))
good_data = np.column_stack([ids_and_x, y_coords])
# Or create from scratch
good_data = np.array([[0, 100, 150], [1, 200, 250]])
4. Coordinates Outside Bounds
Problem: Points outside the defined image dimensions.
# ❌ PROBLEMATIC - x=1300 exceeds dimension width of 1200
data = np.array([[0, 1300, 200]])
# Warning: "X coordinates outside dimension bounds [0, 1200)"
Solutions:
# Option 1: Clip coordinates to bounds
data[:, 1] = np.clip(data[:, 1], 0, 1199) # x coordinates
data[:, 2] = np.clip(data[:, 2], 0, 1199) # y coordinates
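# Option 2: Scale coordinates to fit
def scale_to_fit(data, dimension):
    # Work on a float copy so the caller's array is untouched and
    # scaled values are not truncated by an integer dtype
    scaled = data.astype(float)
    x_min, x_max = scaled[:, 1].min(), scaled[:, 1].max()
    y_min, y_max = scaled[:, 2].min(), scaled[:, 2].max()
    # Scale x coordinates (skipped when all x values are equal)
    if x_max > x_min:
        scaled[:, 1] = (scaled[:, 1] - x_min) / (x_max - x_min) * (dimension[0] - 1)
    # Scale y coordinates (skipped when all y values are equal)
    if y_max > y_min:
        scaled[:, 2] = (scaled[:, 2] - y_min) / (y_max - y_min) * (dimension[1] - 1)
    return scaled
scaled_data = scale_to_fit(data, (1200, 1200))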
# Option 3: Increase image dimensions
larger_dimension = (1500, 1500) # Make room for all points
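Before choosing among these options, it can help to see exactly which points are out of bounds. The sketch below uses a boolean mask and assumes the half-open range `[0, width)` and `[0, height)` suggested by the warning message above; `out_of_bounds_mask` is a hypothetical helper, not a Graphizy function.

```python
import numpy as np


def out_of_bounds_mask(data, dimension):
    """Return a boolean mask marking rows whose (x, y) is outside the image.

    Assumes columns are [id, x, y, ...] and that valid coordinates
    satisfy 0 <= x < width and 0 <= y < height.
    """
    width, height = dimension
    x, y = data[:, 1], data[:, 2]
    return (x < 0) | (x >= width) | (y < 0) | (y >= height)


data = np.array([
    [0, 100, 200],
    [1, 1300, 200],  # x is past the right edge
    [2, 500, -5],    # y is negative
])
mask = out_of_bounds_mask(data, (1200, 1200))
offending_ids = data[mask, 0].tolist()  # IDs of the out-of-bounds points
inside = data[~mask]                    # rows that are safe to keep
```

Dropping rows with `data[~mask]` is a fourth option when a few stray points are measurement noise rather than real data.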
5. Dictionary Format Issues
Problem: Missing keys or mismatched array lengths.
# ❌ WRONG - Missing 'y' key
bad_data = {"id": [0, 1], "x": [100, 300]}
# ❌ WRONG - Mismatched lengths
bad_data = {
    "id": [0, 1, 2],  # 3 elements
    "x": [100, 300],  # 2 elements
    "y": [200, 400, 600]  # 3 elements
}
Solution:
# ✅ CORRECT - All required keys with matching lengths
good_data = {
    "id": [0, 1, 2],
    "x": [100, 300, 500],
    "y": [200, 400, 600]
}
# Fix mismatched lengths by trimming every list to the shortest one
# (this drops trailing entries, so check that the rows still line up)
def fix_dict_lengths(data_dict):
    min_length = min(len(v) for v in data_dict.values())
    return {k: v[:min_length] for k, v in data_dict.items()}
fixed_data = fix_dict_lengths(bad_data)
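Trimming hides the mismatch rather than explaining it, so it is often worth reporting the problem first. Here is a small pure-Python pre-check under the assumption that `"id"`, `"x"`, and `"y"` are the required keys. `find_dict_problems` is a hypothetical helper and does not reproduce Graphizy's own checks or messages.

```python
def find_dict_problems(data, required=("id", "x", "y")):
    """Return a list of human-readable problems found in a data dict."""
    problems = []
    missing = [key for key in required if key not in data]
    if missing:
        problems.append(f"missing keys: {missing}")
    lengths = {key: len(values) for key, values in data.items()}
    if len(set(lengths.values())) > 1:
        problems.append(f"mismatched lengths: {lengths}")
    return problems


missing_key = {"id": [0, 1], "x": [100, 300]}
mismatched = {"id": [0, 1, 2], "x": [100, 300], "y": [200, 400, 600]}
print(find_dict_problems(missing_key))
print(find_dict_problems(mismatched))
```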
Best Practices
Performance Tips:
Use appropriate data types:
# Use int32 for coordinates if possible (saves memory)
data = np.array(coordinates, dtype=np.int32)
# Use float32 instead of float64 for large datasets
data = data.astype(np.float32)
Validate early and often:
# Validate immediately after loading data
def load_and_validate(filename):
    data = np.loadtxt(filename)  # or your loading method
    result = validate_graphizy_input(data, verbose=False)
    if not result["valid"]:
        raise ValueError(f"Invalid data: {result['errors']}")
    return data
Handle large datasets efficiently:
# For very large datasets, validate a sample first
def validate_large_dataset(data, sample_size=1000):
    if len(data) > sample_size:
        sample_indices = np.random.choice(len(data), sample_size, replace=False)
        sample_data = data[sample_indices]
        result = validate_graphizy_input(sample_data, verbose=False)
        if not result["valid"]:
            print("❌ Sample validation failed - likely issues in full dataset")
            return result
        print(f"✅ Sample of {sample_size} points validated successfully")
    return validate_graphizy_input(data, verbose=True)
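To make the data-type tip concrete, the following sketch compares the memory footprint of the same coordinates stored as `int64` and `int32`, and checks that the values fit before downcasting. It uses only NumPy; the sample coordinates are illustrative.

```python
import numpy as np

coordinates = [[0, 100, 200], [1, 300, 400], [2, 500, 600]]
as_int64 = np.array(coordinates, dtype=np.int64)

# Check that every value fits in int32 before downcasting,
# since out-of-range values would silently wrap around
limits = np.iinfo(np.int32)
fits = limits.min <= as_int64.min() and as_int64.max() <= limits.max

as_int32 = as_int64.astype(np.int32)

# nbytes = number of elements * bytes per element
print(as_int64.nbytes, as_int32.nbytes)
```

For this 3x3 array the footprint halves, from 72 to 36 bytes; the same ratio holds for arrays of any size.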
Integration with Graphizy Workflow:
from graphizy import Graphing, validate_graphizy_input

def safe_graphizy_workflow(data, graph_type="delaunay"):
    """Complete safe workflow with validation.

    Returns (graph, grapher) on success, or None on failure.
    """
    # Step 1: Validate
    result = validate_graphizy_input(data, verbose=False)
    if not result["valid"]:
        print("❌ Validation failed:")
        for error in result["errors"]:
            print(f" - {error}")
        return None
    # Step 2: Create grapher
    dimension = result["info"].get("dimension", (1200, 1200))
    grapher = Graphing(dimension=dimension)
    # Step 3: Create graph
    try:
        if graph_type == "delaunay":
            graph = grapher.make_delaunay(data)
        elif graph_type == "proximity":
            graph = grapher.make_proximity(data, proximity_thresh=50.0)
        elif graph_type == "mst":
            graph = grapher.make_mst(data)
        else:
            raise ValueError(f"Unknown graph type: {graph_type}")
        print(f"✅ Successfully created {graph_type} graph with {graph.vcount()} vertices")
        return graph, grapher
    except Exception as e:
        print(f"❌ Graph creation failed: {e}")
        return None
# Use the safe workflow (my_data is your own array or dict)
result = safe_graphizy_workflow(my_data, "delaunay")
if result:
    graph, grapher = result
    image = grapher.draw_graph(graph)
    grapher.show_graph(image)
Validating your data before you build graphs catches format problems early and gives clear guidance when issues arise, which saves debugging time later.