VertitimeX Technologies

Pandas Categorical Data

  • Categoricals are a pandas data type that corresponds to categorical variables in statistics. Such variables take on a fixed, limited number of possible values.
  • Examples include grades, gender, and blood group.
  • pandas provides the categorical data type (pd.Categorical, dtype 'category') to reduce memory usage and improve performance when working with repetitive text-based data.
Why Use Categorical Data?
  • Saves memory by storing each distinct value once and representing the data as integer codes instead of repeated strings.
  • Often speeds up operations such as sorting, filtering, and grouping compared with the object dtype.
  • Lets you define a logical order for the categories.
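
The memory claim is easy to check for yourself. The sketch below is only an illustration: the data is made up, and the exact byte counts will vary with your pandas version and platform, so no specific numbers are shown.

```python
import pandas as pd

# Hypothetical data: about 100,000 rows drawn from only three distinct strings.
values = ['small', 'medium', 'large'] * 33_334
as_object = pd.Series(values, dtype='object')
as_category = as_object.astype('category')

# deep=True counts the string payloads, not just the pointers to them.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The saving comes from repetition. On a column where almost every value is unique, the category dtype may not help and can even cost more.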

Creating Categorical Data
  1. Converting an Existing Column
    import pandas as pd
    df = pd.DataFrame({'Category': ['A', 'B', 'A', 'C', 'B', 'A']})
    df['Category'] = df['Category'].astype('category')
    print(df.dtypes)
    ✅ The Category column is now of type category, which typically reduces memory usage when values repeat often.
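
If you want to control which categories are allowed and in what order, one option is to convert with an explicit CategoricalDtype instead of the 'category' string. This is a minimal sketch; the Grade column and the grade_dtype name are hypothetical and chosen only for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'Grade': ['B', 'A', 'C', 'A', 'D']})

# Declare the allowed categories and their order up front.
# 'D' is deliberately left out to show what happens to unknown values.
grade_dtype = CategoricalDtype(categories=['C', 'B', 'A'], ordered=True)
df['Grade'] = df['Grade'].astype(grade_dtype)

print(df['Grade'].cat.categories)
# Values that are not in the declared categories are converted to NaN.
print(df['Grade'].isna().sum())
```

Because values outside the declared categories silently become NaN, it is worth checking for missing values after a conversion like this.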
  2. Creating from Scratch
    colors = pd.Categorical(['red', 'blue', 'green', 'red', 'blue'])  # a Categorical array, not a Series
    print(colors)
                
  3. Categorical Data with Defined Categories
    categories = ['small', 'medium', 'large']
    sizes = pd.Categorical(['small', 'large', 'medium', 'small'], categories=categories, ordered=True)
    print(sizes)
                
    ✅ Using ordered=True allows comparison (small < medium < large).
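
To see what ordered=True changes in practice, the sketch below compares an ordered categorical with an unordered one built from the same values. The variable names are illustrative only.

```python
import pandas as pd

ordered = pd.Categorical(['small', 'large', 'medium', 'small'],
                         categories=['small', 'medium', 'large'], ordered=True)
unordered = pd.Categorical(['small', 'large', 'medium', 'small'])

# Element-wise comparison follows the declared order, not alphabetical order.
is_below_large = ordered < 'large'
print(is_below_large)
print(ordered.min(), ordered.max())

# The same comparison on an unordered categorical is rejected with a TypeError.
try:
    unordered < 'large'
    comparison_failed = False
except TypeError:
    comparison_failed = True
print(comparison_failed)
```

Equality checks (== and !=) work on both kinds; only the ordering comparisons need ordered=True.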
  4. Operations on Categorical Data
    1. Accessing Categories & Codes
      print(sizes.categories)  # Index holding 'small', 'medium', 'large'
      print(sizes.codes)       # [0 2 1 0] -> internal integer representation
                          
    2. Sorting
      sorted_sizes = sizes.sort_values()
      print(sorted_sizes)            
                  
    3. Filtering
      filtered_sizes = sizes[sizes > 'small']  # Keeps 'medium' and 'large'
      print(filtered_sizes)
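
The same operations are available when the categorical data lives in a Series or DataFrame column, where codes and categories are reached through the .cat accessor. A short sketch, with illustrative variable names, that also contrasts alphabetical and category-order sorting:

```python
import pandas as pd

s = pd.Series(['small', 'large', 'medium', 'small'])
as_ordered = s.astype(pd.CategoricalDtype(['small', 'medium', 'large'], ordered=True))

# Plain strings sort alphabetically; the categorical sorts by category order.
alphabetical = s.sort_values().tolist()
logical = as_ordered.sort_values().tolist()
print(alphabetical)
print(logical)

# On a Series, codes and categories live behind the .cat accessor.
print(as_ordered.cat.codes.tolist())
```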
                  
  5. Changing Categories
    sizes = sizes.rename_categories(['S', 'M', 'L'])
    print(sizes)
                
    ✅ Renames 'small' → 'S', 'medium' → 'M', and 'large' → 'L'.
  6. Adding & Removing Categories
    sizes = sizes.add_categories(['XL'])    # adds a new, unused category
    sizes = sizes.remove_categories(['S'])  # 'small' was renamed to 'S' in step 5; its values become NaN
    print(sizes)
                
  7. Use Case: Grouping & Aggregation
    df = pd.DataFrame({
        'Size': pd.Categorical(['small', 'large', 'medium', 'small', 'large'],
                               categories=['small', 'medium', 'large'], ordered=True),
        'Price': [10, 30, 20, 15, 35]
    })

    # Pass observed explicitly; its default has differed between pandas versions.
    grouped = df.groupby('Size', observed=False).mean()
    print(grouped)
                
    ✅ Groups appear in category order (small, medium, large) rather than alphabetical order.
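
One detail worth knowing when grouping by a categorical column is the observed parameter of groupby, which decides whether categories with no rows still appear in the result. The sketch below uses made-up data in which 'medium' never occurs. The default for observed has differed between pandas releases, so it is set explicitly here.

```python
import pandas as pd

df = pd.DataFrame({
    'Size': pd.Categorical(['small', 'large', 'small'],
                           categories=['small', 'medium', 'large'], ordered=True),
    'Price': [10, 30, 15],
})

# observed=False keeps every declared category, even 'medium', which has no rows.
all_categories = df.groupby('Size', observed=False)['Price'].mean()
# observed=True keeps only categories that actually occur in the data.
seen_only = df.groupby('Size', observed=True)['Price'].mean()

print(all_categories)
print(seen_only)
```

With observed=False the empty 'medium' group shows up with a missing mean, which can be useful for reports that must list every category.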
When to Use?
Use categorical data when:
  • The column contains a fixed number of possible values (e.g., gender, product sizes, regions).
  • You need ordered categories (e.g., low < medium < high).
  • Memory efficiency and performance improvements matter.