Data Validation
===============

Graphizy provides an input validation function, ``validate_graphizy_input``, that checks your data format before you build a graph and reports errors, warnings, and suggestions. This guide covers that function and the common data problems it reports.

Quick Validation
----------------

Use the built-in validation function to check your data before creating graphs:

.. code-block:: python

   from graphizy import validate_graphizy_input
   import numpy as np
   
   # Your data
   data = np.array([
       [0, 100, 200],
       [1, 300, 400],
       [2, 500, 600]
   ])
   
   # Validate your input
   result = validate_graphizy_input(
       data, 
       aspect="array",           # or "dict"
       dimension=(800, 800),     # your image dimensions
       proximity_thresh=50.0,    # if using proximity graphs
       verbose=True              # print detailed results
   )
   
   if result["valid"]:
       print("✅ Data is ready!")
   else:
       print("❌ Issues found:")
       for error in result["errors"]:
           print(f"  - {error}")

Validation Function Reference
-----------------------------

**Function Signature:**

.. code-block:: python

   validate_graphizy_input(
       data_points,                    # Your data (array or dict)
       aspect="array",                 # "array" or "dict"
       data_shape=None,               # Expected data structure
       dimension=(1200, 1200),        # Image dimensions (width, height)
       proximity_thresh=None,         # Proximity threshold if applicable
       verbose=True                   # Print detailed results
   )

**Return Value:**

The function returns a dictionary with the following keys:

.. code-block:: python

   {
       "valid": True/False,           # Overall validity
       "errors": [],                  # List of error messages
       "warnings": [],                # List of warning messages
       "info": {},                    # Data information (shape, ranges, etc.)
       "suggestions": []              # Performance and usage suggestions
   }
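
The returned dictionary can be turned into a short report for logs. The sketch below assumes only the keys listed above; ``format_report`` is a hypothetical helper, not part of Graphizy, and the sample dictionary is hand-written for illustration rather than real library output.

.. code-block:: python

   def format_report(result):
       """Build a plain-text summary from a validation result dictionary."""
       status = "valid" if result.get("valid") else "invalid"
       lines = [f"Validation: {status}"]
       # Report each message category in a fixed order, skipping empty ones.
       for key in ("errors", "warnings", "suggestions"):
           for message in result.get(key, []):
               # key[:-1] turns "errors" into "error", and so on.
               lines.append(f"  [{key[:-1]}] {message}")
       return "\n".join(lines)


   # Hand-written example with the documented keys (not real library output).
   sample = {
       "valid": False,
       "errors": ["Object IDs must be numeric, not string type"],
       "warnings": [],
       "info": {},
       "suggestions": [],
   }
   print(format_report(sample))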

Data Format Requirements
------------------------

**Array Format (aspect="array"):**

Your data should be a 2D NumPy array with at least 3 columns:

.. code-block:: python

   # ✅ Correct format: [id, x, y, additional_columns...]
   data = np.array([
       [0, 100, 200],      # object 0 at (100, 200)
       [1, 300, 400],      # object 1 at (300, 400)
       [2, 500, 600]       # object 2 at (500, 600)
   ])
   
   # ✅ Additional columns are fine
   # Note: NumPy stores the whole array in one dtype, so True/False become 1.0/0.0
   data = np.array([
       [0, 100, 200, 1.5, True],   # [id, x, y, speed, active]
       [1, 300, 400, 2.0, False],
       [2, 500, 600, 1.8, True]
   ])

**Dictionary Format (aspect="dict"):**

Your data should be a dictionary with required keys:

.. code-block:: python

   # ✅ Correct format
   data = {
       "id": [0, 1, 2],
       "x": [100, 300, 500],
       "y": [200, 400, 600]
   }
   
   # ✅ Additional keys are fine
   data = {
       "id": [0, 1, 2],
       "x": [100, 300, 500],
       "y": [200, 400, 600],
       "speed": [1.5, 2.0, 1.8],
       "color": ["red", "blue", "green"]
   }
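
The two formats carry the same information, so one can be converted into the other. The following is a minimal sketch, assuming numeric columns only; ``dict_to_array`` and ``array_to_dict`` are hypothetical helpers, not Graphizy functions.

.. code-block:: python

   import numpy as np


   def dict_to_array(data):
       """Stack the required keys into an (n, 3) array of [id, x, y] rows."""
       return np.column_stack([data["id"], data["x"], data["y"]])


   def array_to_dict(array):
       """Split the first three columns of an array into id/x/y lists."""
       return {
           "id": array[:, 0].tolist(),
           "x": array[:, 1].tolist(),
           "y": array[:, 2].tolist(),
       }


   points = {"id": [0, 1, 2], "x": [100, 300, 500], "y": [200, 400, 600]}
   as_array = dict_to_array(points)
   print(as_array.shape)  # (3, 3)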

Common Data Issues and Solutions
--------------------------------

**1. String IDs (Most Common Issue)**

**Problem:** Using string identifiers instead of numeric ones.

.. code-block:: python

   # ❌ WRONG - This will cause errors
   bad_data = np.array([
       ["particle_1", 100, 200],
       ["particle_2", 300, 400]
   ])
   
   # Error: "Object IDs must be numeric, not string type"

**Solution:** Use numeric IDs:

.. code-block:: python

   # ✅ CORRECT
   good_data = np.array([
       [0, 100, 200],      # Use 0, 1, 2... or any numeric IDs
       [1, 300, 400]
   ], dtype=int)
   
   # Or convert strings to numbers
   string_ids = ["particle_1", "particle_2", "particle_3"]
   numeric_ids = list(range(len(string_ids)))  # [0, 1, 2]
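
When the original labels are needed later, it helps to keep a lookup table alongside the numeric IDs. The sketch below assigns IDs in first-seen order and builds the reverse mapping; all variable names are illustrative.

.. code-block:: python

   import numpy as np

   labels = ["particle_1", "particle_2", "particle_3"]
   x = [100, 300, 500]
   y = [200, 400, 600]

   # Assign IDs in first-seen order; dict.fromkeys drops repeated labels.
   label_to_id = {label: i for i, label in enumerate(dict.fromkeys(labels))}
   id_to_label = {i: label for label, i in label_to_id.items()}

   ids = [label_to_id[label] for label in labels]
   data = np.column_stack([ids, x, y])

   # Translate a numeric ID from the array back to its original label.
   print(id_to_label[int(data[1, 0])])  # particle_2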

**2. Wrong Array Dimensions**

**Problem:** 1D arrays or wrong shapes.

.. code-block:: python

   # ❌ WRONG - 1D array
   bad_data = np.array([1, 2, 3, 4, 5, 6])
   
   # ❌ WRONG - 3D array
   bad_data = np.array([[[1, 2, 3]]])

**Solution:** Use 2D arrays:

.. code-block:: python

   # ✅ CORRECT - Reshape if needed
   data_1d = np.array([0, 100, 200, 1, 300, 400])
   good_data = data_1d.reshape(-1, 3)  # Reshape to 2D
   
   # Result: [[0, 100, 200], [1, 300, 400]]
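
Note that ``reshape(-1, 3)`` raises a ``ValueError`` when the number of values is not a multiple of three. The sketch below wraps the reshape with a clearer error message; ``as_points`` is a hypothetical helper, and the three-column default assumes the ``[id, x, y]`` layout described above.

.. code-block:: python

   import numpy as np


   def as_points(values, n_columns=3):
       """Return a 2D array with n_columns per row, or raise a clear error."""
       array = np.asarray(values)
       if array.ndim == 2:
           # Already 2D: return unchanged.
           return array
       flat = array.ravel()
       if flat.size % n_columns != 0:
           raise ValueError(
               f"{flat.size} values cannot be split into rows of {n_columns}"
           )
       return flat.reshape(-1, n_columns)


   print(as_points([0, 100, 200]).shape)  # (1, 3)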

**3. Insufficient Columns**

**Problem:** Fewer than 3 columns (need at least id, x, y).

.. code-block:: python

   # ❌ WRONG - Only 2 columns
   bad_data = np.array([[0, 100], [1, 200]])

**Solution:** Add the missing coordinate:

.. code-block:: python

   # ✅ CORRECT - Append the missing y column
   id_and_x = np.array([[0, 100], [1, 200]])   # columns: id, x
   # Random y values are placeholders for illustration; use your real coordinates
   y_coords = np.random.randint(0, 400, (len(id_and_x), 1))
   good_data = np.column_stack([id_and_x, y_coords])
   
   # Or create from scratch
   good_data = np.array([[0, 100, 150], [1, 200, 250]])

**4. Coordinates Outside Bounds**

**Problem:** Points outside the defined image dimensions.

.. code-block:: python

   # ❌ PROBLEMATIC - x=1300 exceeds dimension width of 1200
   data = np.array([[0, 1300, 200]])
   
   # Warning: "X coordinates outside dimension bounds [0, 1200)"

**Solutions:**

.. code-block:: python

   # Option 1: Clip coordinates to bounds
   data[:, 1] = np.clip(data[:, 1], 0, 1199)  # x coordinates
   data[:, 2] = np.clip(data[:, 2], 0, 1199)  # y coordinates
   
   # Option 2: Scale coordinates to fit
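   def scale_to_fit(data, dimension):
       # Work on a float copy: the caller's array stays unchanged and
       # scaled values are not truncated by an integer dtype
       data = data.astype(float)
       x_min, x_max = data[:, 1].min(), data[:, 1].max()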
       y_min, y_max = data[:, 2].min(), data[:, 2].max()
       
       # Scale x coordinates
       if x_max > x_min:
           data[:, 1] = (data[:, 1] - x_min) / (x_max - x_min) * (dimension[0] - 1)
       
       # Scale y coordinates  
       if y_max > y_min:
           data[:, 2] = (data[:, 2] - y_min) / (y_max - y_min) * (dimension[1] - 1)
       
       return data
   
   scaled_data = scale_to_fit(data, (1200, 1200))
   
   # Option 3: Increase image dimensions
   larger_dimension = (1500, 1500)  # Make room for all points
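
Before choosing one of these options, it can help to see which rows are actually out of bounds. The sketch below uses a boolean mask and assumes the half-open bounds ``[0, width)`` and ``[0, height)`` quoted in the warning message; ``out_of_bounds`` is a hypothetical helper, not a Graphizy function.

.. code-block:: python

   import numpy as np


   def out_of_bounds(data, dimension):
       """Return a boolean mask marking rows whose x or y lies outside the image."""
       width, height = dimension
       x, y = data[:, 1], data[:, 2]
       return (x < 0) | (x >= width) | (y < 0) | (y >= height)


   data = np.array([[0, 1300, 200], [1, 300, 400], [2, 500, -5]])
   mask = out_of_bounds(data, (1200, 1200))
   print(data[mask, 0])  # IDs of the offending rows: [0 2]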

**5. Dictionary Format Issues**

**Problem:** Missing keys or mismatched array lengths.

.. code-block:: python

   # ❌ WRONG - Missing 'y' key
   bad_data = {"id": [0, 1], "x": [100, 300]}
   
   # ❌ WRONG - Mismatched lengths
   bad_data = {
       "id": [0, 1, 2],        # 3 elements
       "x": [100, 300],        # 2 elements
       "y": [200, 400, 600]    # 3 elements
   }

**Solution:**

.. code-block:: python

   # ✅ CORRECT - All required keys with matching lengths
   good_data = {
       "id": [0, 1, 2],
       "x": [100, 300, 500],
       "y": [200, 400, 600]
   }
   
   # Fix mismatched lengths by trimming every list to the shortest one
   # (entries beyond that length are dropped)
   def fix_dict_lengths(data_dict):
       min_length = min(len(v) for v in data_dict.values())
       return {k: v[:min_length] for k, v in data_dict.items()}
   
   fixed_data = fix_dict_lengths(bad_data)
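
Trimming silently discards data, so it is worth checking which keys disagree first. The sketch below uses only the standard library and treats the length of ``"id"`` as the reference; ``length_report`` is a hypothetical helper.

.. code-block:: python

   def length_report(data_dict):
       """Map each key to its length and list the keys that differ from 'id'."""
       lengths = {key: len(values) for key, values in data_dict.items()}
       expected = lengths.get("id")
       mismatched = [key for key, n in lengths.items() if n != expected]
       return lengths, mismatched


   lengths, mismatched = length_report(
       {"id": [0, 1, 2], "x": [100, 300], "y": [200, 400, 600]}
   )
   print(mismatched)  # ['x']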

Best Practices
--------------

**Performance Tips:**

1. **Use appropriate data types:**
   
   .. code-block:: python
   
      # Use int32 for integer coordinates (typically half the memory of int64)
      data = np.array(coordinates, dtype=np.int32)
      
      # For fractional coordinates in large datasets, prefer float32 over float64
      data = np.array(coordinates, dtype=np.float32)

2. **Validate early and often:**
   
   .. code-block:: python
   
      # Validate immediately after loading data
      def load_and_validate(filename):
          data = np.loadtxt(filename)  # or your loading method
          result = validate_graphizy_input(data, verbose=False)
          
          if not result["valid"]:
              raise ValueError(f"Invalid data: {result['errors']}")
          
          return data

3. **Handle large datasets efficiently:**
   
   .. code-block:: python
   
      # For very large datasets, validate a sample first
      def validate_large_dataset(data, sample_size=1000):
          if len(data) > sample_size:
              sample_indices = np.random.choice(len(data), sample_size, replace=False)
              sample_data = data[sample_indices]
              result = validate_graphizy_input(sample_data, verbose=False)
              
              if not result["valid"]:
                  print("❌ Sample validation failed - likely issues in full dataset")
                  return result
              
              print(f"✅ Sample of {sample_size} points validated successfully")
          
          return validate_graphizy_input(data, verbose=True)
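
The memory effect of the first tip can be checked directly with ``nbytes``. This is a rough illustration on synthetic data; actual savings depend on your dataset, and ``float32`` keeps only about seven significant decimal digits.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(0)
   # 100,000 synthetic [id, x, y] rows stored as 64-bit floats.
   data64 = np.column_stack([
       np.arange(100_000),
       rng.uniform(0, 1200, 100_000),
       rng.uniform(0, 1200, 100_000),
   ])
   data32 = data64.astype(np.float32)

   print(data64.nbytes)  # 2400000
   print(data32.nbytes)  # 1200000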

**Integration with Graphizy Workflow:**

.. code-block:: python

   from graphizy import Graphing, validate_graphizy_input

   def safe_graphizy_workflow(data, graph_type="delaunay"):
       """Complete safe workflow with validation"""
       
       # Step 1: Validate
       result = validate_graphizy_input(data, verbose=False)
       if not result["valid"]:
           print("❌ Validation failed:")
           for error in result["errors"]:
               print(f"  - {error}")
           return None
       
       # Step 2: Create grapher
       dimension = result["info"].get("dimension", (1200, 1200))
       grapher = Graphing(dimension=dimension)
       
       # Step 3: Create graph
       try:
           if graph_type == "delaunay":
               graph = grapher.make_delaunay(data)
           elif graph_type == "proximity":
               graph = grapher.make_proximity(data, proximity_thresh=50.0)
           elif graph_type == "mst":
               graph = grapher.make_mst(data)
           else:
               raise ValueError(f"Unknown graph type: {graph_type}")
           
           print(f"✅ Successfully created {graph_type} graph with {graph.vcount()} vertices")
           return graph, grapher
           
       except Exception as e:
           print(f"❌ Graph creation failed: {e}")
           return None

   # Use the safe workflow (my_data is your own data in [id, x, y] format)
   result = safe_graphizy_workflow(my_data, "delaunay")
   if result:
       graph, grapher = result
       image = grapher.draw_graph(graph)
       grapher.show_graph(image)

Validating your data before building graphs catches format problems early and gives specific messages about what to fix. Run the validation first; it is usually faster than debugging a failed graph construction.