Data Formats
============

This guide explains the data formats that Graphizy accepts and how to structure your data for optimal performance.

Overview
--------

Graphizy accepts data in two primary formats:

1. **Array Format** (``aspect="array"``) - NumPy arrays with structured columns
2. **Dictionary Format** (``aspect="dict"``) - Python dictionaries with named keys

Both formats represent the same information: a collection of objects with IDs and 2D coordinates.

Array Format (aspect="array")
-----------------------------

**Structure:**
The array format uses a 2D NumPy array where each row represents one object and columns contain the object's attributes.

**Required Columns:**

- Column 0: Object ID (numeric)
- Column 1: X coordinate
- Column 2: Y coordinate
- Columns 3+: Additional attributes (optional)

**Basic Example:**

.. code-block:: python

   import numpy as np
   from graphizy import Graphing

   # Basic format: [id, x, y]
   data = np.array([
       [0, 100, 200],    # Object 0 at position (100, 200)
       [1, 300, 400],    # Object 1 at position (300, 400)
       [2, 500, 600],    # Object 2 at position (500, 600)
   ])

   # Create grapher and use the data
   grapher = Graphing(dimension=(800, 800), aspect="array")
   graph = grapher.make_delaunay(data)

**Extended Example with Additional Attributes:**

.. code-block:: python

   # Extended format: [id, x, y, speed, active, type]
   data = np.array([
       [0, 100, 200, 1.5, 1, 0],    # Object 0: speed=1.5, active=True, type=0
       [1, 300, 400, 2.3, 1, 1],    # Object 1: speed=2.3, active=True, type=1
       [2, 500, 600, 0.8, 0, 0],    # Object 2: speed=0.8, active=False, type=0
   ])

   # Graphizy will use columns 0, 1, 2 for id, x, y
   # Additional columns are preserved but not used for graph creation
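
A NumPy array holds a single dtype, so mixing integer IDs with a float attribute such as ``speed`` upcasts every column, including the IDs, to ``float64``. The sketch below uses only NumPy to show the effect and one way to recover typed columns. Whether Graphizy requires integer-typed IDs is not stated in this guide, so treat the casts as optional.

.. code-block:: python

   import numpy as np

   # Same extended layout as above: [id, x, y, speed, active, type]
   data = np.array([
       [0, 100, 200, 1.5, 1, 0],
       [1, 300, 400, 2.3, 1, 1],
       [2, 500, 600, 0.8, 0, 0],
   ])

   print(data.dtype)  # float64, because of the fractional speed values

   # Recover typed views of individual columns when needed
   ids = data[:, 0].astype(np.int64)
   active = data[:, 4].astype(bool)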

Dictionary Format (aspect="dict")
---------------------------------

**Structure:**
The dictionary format uses a Python dictionary with three required keys, each containing a list of values.

**Required Keys:**

- ``"id"``: List of object IDs (numeric)
- ``"x"``: List of X coordinates
- ``"y"``: List of Y coordinates

**Basic Example:**

.. code-block:: python

   # Dictionary format
   data = {
       "id": [0, 1, 2],
       "x": [100, 300, 500],
       "y": [200, 400, 600]
   }

   # Create grapher and use the data
   grapher = Graphing(dimension=(800, 800), aspect="dict")
   graph = grapher.make_delaunay(data)

**Extended Example with Additional Attributes:**

.. code-block:: python

   # Dictionary with additional attributes
   data = {
       "id": [0, 1, 2, 3],
       "x": [100, 300, 500, 700],
       "y": [200, 400, 600, 800],
       "speed": [1.5, 2.3, 0.8, 1.9],
       "color": ["red", "blue", "green", "yellow"],
       "active": [True, True, False, True],
       "category": ["A", "B", "A", "C"]
   }

   # Graphizy will use id, x, y for graph creation
   # Additional keys are preserved for your use
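
Each key describes the same set of objects, so all lists should have the same length. A minimal sketch of that check follows. ``check_equal_lengths`` is a hypothetical helper written for this guide, not part of Graphizy.

.. code-block:: python

   def check_equal_lengths(data):
       """Return the common list length, or raise ValueError if keys disagree."""
       lengths = {key: len(values) for key, values in data.items()}
       if len(set(lengths.values())) > 1:
           raise ValueError(f"Inconsistent list lengths: {lengths}")
       # An empty dictionary has no objects
       return next(iter(lengths.values()), 0)

   data = {
       "id": [0, 1, 2, 3],
       "x": [100, 300, 500, 700],
       "y": [200, 400, 600, 800],
       "speed": [1.5, 2.3, 0.8, 1.9],
   }
   n_objects = check_equal_lengths(data)  # 4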

Converting Between Formats
--------------------------

**Array to Dictionary:**

.. code-block:: python

   def array_to_dict(data_array):
       """Convert array format to dictionary format (id, x, y only)."""
       # Columns 3+ are dropped; add more keys here to keep them
       return {
           "id": data_array[:, 0].tolist(),
           "x": data_array[:, 1].tolist(),
           "y": data_array[:, 2].tolist()
       }

   # Example usage
   array_data = np.array([[0, 100, 200], [1, 300, 400]])
   dict_data = array_to_dict(array_data)

**Dictionary to Array:**

.. code-block:: python

   def dict_to_array(data_dict):
       """Convert dictionary format to array format (id, x, y only)."""
       # Additional keys are not copied into the array
       return np.column_stack([
           data_dict["id"],
           data_dict["x"],
           data_dict["y"]
       ])

   # Example usage
   dict_data = {"id": [0, 1], "x": [100, 300], "y": [200, 400]}
   array_data = dict_to_array(dict_data)
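
The converters above keep only ``id``, ``x`` and ``y``. If the extra columns matter, one option is to pass their names explicitly. The sketch below assumes the caller knows the column order; ``array_to_dict_named`` and its ``extra_names`` parameter are hypothetical and not part of Graphizy.

.. code-block:: python

   import numpy as np

   def array_to_dict_named(data_array, extra_names=()):
       """Convert array format to a dictionary, naming columns 3+ via extra_names."""
       names = ["id", "x", "y", *extra_names]
       if len(names) != data_array.shape[1]:
           raise ValueError(
               f"Expected {data_array.shape[1]} column names, got {len(names)}"
           )
       return {name: data_array[:, i].tolist() for i, name in enumerate(names)}

   # Example usage: [id, x, y, speed]
   array_data = np.array([[0, 100, 200, 1.5], [1, 300, 400, 2.3]])
   dict_data = array_to_dict_named(array_data, extra_names=["speed"])

Because the array is ``float64``, the IDs come back as floats (``0.0``, ``1.0``); cast them if integer IDs are needed.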

Common Data Sources
-------------------

**From CSV Files:**

.. code-block:: python

   import pandas as pd

   # Read CSV file
   df = pd.read_csv("objects.csv")  # columns: object_id, pos_x, pos_y

   # Convert to array format (to_numpy() is the recommended replacement for .values)
   data_array = df[["object_id", "pos_x", "pos_y"]].to_numpy()

   # Or convert to dictionary format
   data_dict = {
       "id": df["object_id"].tolist(),
       "x": df["pos_x"].tolist(),
       "y": df["pos_y"].tolist()
   }
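
If pandas is not available, the same conversion can be sketched with the standard-library ``csv`` module. The example reads from an in-memory string that stands in for ``objects.csv``; the column names are the ones assumed above.

.. code-block:: python

   import csv
   import io

   import numpy as np

   # In-memory stand-in for objects.csv so the example is self-contained
   csv_text = "object_id,pos_x,pos_y\n0,100,200\n1,300,400\n2,500,600\n"

   rows = []
   for record in csv.DictReader(io.StringIO(csv_text)):
       # csv yields strings, so convert each field to a number
       rows.append([
           float(record["object_id"]),
           float(record["pos_x"]),
           float(record["pos_y"]),
       ])

   data_array = np.array(rows)  # shape (3, 3): [id, x, y] per row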

**From Object Detection:**

.. code-block:: python

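   # From YOLO or similar detection systems
   def detections_to_graphizy(detections):
       """Convert detection results to graphizy array format.

       Assumes each detection starts with (x_center, y_center).
       """
       data = []
       for i, detection in enumerate(detections):
           x_center, y_center = detection[0], detection[1]
           data.append([i, x_center, y_center])
       # reshape keeps the (n, 3) shape even when there are no detections
       return np.array(data).reshape(-1, 3)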
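
Some detectors report corner coordinates rather than centers. The sketch below assumes boxes in ``(x_min, y_min, x_max, y_max)`` order, which is an assumption about the detector output, and computes the centers with vectorized NumPy operations. ``boxes_to_graphizy`` is a hypothetical helper.

.. code-block:: python

   import numpy as np

   def boxes_to_graphizy(boxes):
       """Convert (x_min, y_min, x_max, y_max) boxes to [id, x_center, y_center] rows."""
       boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
       ids = np.arange(len(boxes))  # sequential IDs, one per box
       x_center = (boxes[:, 0] + boxes[:, 2]) / 2
       y_center = (boxes[:, 1] + boxes[:, 3]) / 2
       return np.column_stack([ids, x_center, y_center])

   boxes = [[90, 180, 110, 220], [280, 390, 320, 410]]
   data = boxes_to_graphizy(boxes)
   # Rows: [0, 100, 200] and [1, 300, 400] (stored as floats)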

**From Simulation Systems:**

.. code-block:: python

   # From particle simulation
   def particles_to_graphizy(particles, include_velocity=False):
       """Convert particle objects to graphizy format"""
       if include_velocity:
           return np.array([
               [p.id, p.x, p.y, p.vx, p.vy] for p in particles
           ])
       else:
           return np.array([
               [p.id, p.x, p.y] for p in particles
           ])

Data Validation
---------------

Always validate your data before creating graphs:

.. code-block:: python

   from graphizy import validate_graphizy_input

   # Validate your data
   result = validate_graphizy_input(
       data, 
       aspect="array",           # or "dict"
       dimension=(800, 800),
       verbose=True
   )

   if not result["valid"]:
       print("Data issues found:")
       for error in result["errors"]:
           print(f"  - {error}")

For complete validation details, see the :doc:`data_validation` guide.

Best Practices
--------------

1. **Use numeric IDs only** - String IDs will cause errors
2. **Ensure coordinates fit within dimensions** - Points outside bounds will generate warnings
3. **Choose array format for large datasets** - A NumPy array stores its values contiguously with a single dtype, which takes less memory than Python lists
4. **Choose dictionary format for mixed data types** - Each key can hold its own type (strings, booleans, floats), whereas a NumPy array coerces every column to one dtype
5. **Always validate data before graph creation** - Catch issues early
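
The first two practices can also be checked by hand with plain NumPy. The sketch below is not Graphizy's validator; ``precheck_array`` is a hypothetical helper, and it assumes that ``dimension`` is ``(width, height)``, that valid coordinates lie in ``[0, width)`` and ``[0, height)``, and that IDs should be unique.

.. code-block:: python

   import numpy as np

   def precheck_array(data, dimension):
       """Return a list of problems found in an [id, x, y, ...] array."""
       data = np.asarray(data)
       if data.ndim != 2 or data.shape[1] < 3:
           return ["expected a 2D array with at least 3 columns"]
       if not np.issubdtype(data.dtype, np.number):
           return ["array must be numeric (string IDs are not supported)"]

       problems = []
       ids, x, y = data[:, 0], data[:, 1], data[:, 2]
       if len(np.unique(ids)) != len(ids):
           problems.append("duplicate IDs")  # uniqueness is an assumption
       width, height = dimension
       if ((x < 0) | (x >= width) | (y < 0) | (y >= height)).any():
           problems.append("coordinates outside dimension")
       return problems

   good = np.array([[0, 100, 200], [1, 300, 400]])
   bad = np.array([[0, 100, 200], [0, 900, 400]])
   print(precheck_array(good, (800, 800)))  # []
   print(precheck_array(bad, (800, 800)))   # ['duplicate IDs', 'coordinates outside dimension']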

Performance Tips
----------------

.. code-block:: python

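   # For large datasets, use compact data types and keep IDs unique
   n_objects = 10000
   ids = np.arange(n_objects, dtype=np.int32)
   coords = np.random.randint(0, 1000, (n_objects, 2), dtype=np.int32)
   large_data = np.column_stack([ids, coords])

   # Array format is generally faster for large datasets
   grapher = Graphing(aspect="array", dimension=(1000, 1000))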
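
To see what the choice of data type costs in memory, compare ``nbytes`` for the same array stored with two dtypes. This sketch uses only NumPy and measures the array itself, not anything Graphizy allocates internally.

.. code-block:: python

   import numpy as np

   n_objects = 10000
   ids = np.arange(n_objects)
   coords = np.random.randint(0, 1000, (n_objects, 2))

   data_int32 = np.column_stack([ids, coords]).astype(np.int32)
   data_float64 = data_int32.astype(np.float64)

   print(data_int32.nbytes)    # 120000 bytes: 10000 rows * 3 columns * 4 bytes
   print(data_float64.nbytes)  # 240000 bytes: 10000 rows * 3 columns * 8 bytes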