RandomizedSearchCV: random search

A GridSearchCV over 36 combinations with 5-fold CV means 180 training runs. With 10 hyperparameters and 10 values each, the grid grows to 10 billion combinations, which is clearly infeasible. RandomizedSearchCV draws a fixed number of them at random instead (50, for example), and the result is often nearly as good. It is the method of choice in production when the search space is large.
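To make the arithmetic concrete, here is a small sketch that counts grid sizes. The parameter values in the 3 x 4 x 3 grid are hypothetical placeholders (only the sizes come from the text), and ParameterGrid is used purely as a convenient way to count combinations.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical 3 x 4 x 3 grid: only the sizes are taken from the text,
# the parameter names and values are illustrative assumptions.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
}

n_combinations = len(ParameterGrid(param_grid))  # size of the full grid
n_fits = n_combinations * 5  # 5-fold CV: one fit per fold per combination

# 10 hyperparameters with 10 candidate values each
n_large_grid = 10 ** 10

print(n_combinations, n_fits, n_large_grid)
```

The fit count ignores the single extra refit on the full dataset that scikit-learn performs by default once the search is finished.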

The difference: instead of testing all 36 combinations (3x4x3), you can test only 20 chosen at random. In practice, random search finds very good parameters without exploring the whole grid.
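One way to see why a small random sample can be enough is the rough calculation below. It assumes each draw is independent and uniform over the search space, which is a simplification (with list-only parameters, RandomizedSearchCV samples without replacement). The function name prob_hit_top_fraction is made up for this illustration.

```python
def prob_hit_top_fraction(n_iter: int, top_fraction: float = 0.05) -> float:
    """Probability that at least one of n_iter independent uniform draws
    lands in the best `top_fraction` of the search space."""
    # Each draw misses the top region with probability (1 - top_fraction),
    # so all n_iter draws miss with probability (1 - top_fraction) ** n_iter.
    return 1.0 - (1.0 - top_fraction) ** n_iter

for n in (20, 50, 60):
    print(n, round(prob_hit_top_fraction(n), 3))
```

Under these assumptions, 20 draws already give roughly a 64% chance of landing in the best 5% of combinations, and the chance keeps rising as n_iter grows.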

You can also use continuous distributions (such as scipy.stats.uniform) instead of discrete lists.

from sklearn.model_selection import RandomizedSearchCV
# modèle: any scikit-learn estimator; param_distributions: dict of lists or distributions
random_search = RandomizedSearchCV(
    modèle, param_distributions, n_iter=20, cv=5, random_state=42
)
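As a sketch of what a search over distributions might look like, the example below mixes scipy.stats distributions with a plain list. The ranges, the choice of max_features, and the small n_iter are illustrative assumptions, not values from the exercise.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=150, n_features=5, random_state=42)

# Assumed ranges, chosen only for illustration.
param_distributions = {
    "n_estimators": randint(50, 151),      # integers in [50, 150]
    "max_depth": [3, 5, 7, 10, 15, None],  # discrete lists still work
    "max_features": uniform(0.1, 0.9),     # floats in [0.1, 1.0] (loc, scale)
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5,          # number of sampled combinations
    cv=3,
    random_state=42,   # makes the sampled combinations reproducible
)
search.fit(X, y)
print(search.best_params_)
```

Note that scipy.stats.uniform takes (loc, scale), so uniform(0.1, 0.9) covers the interval from 0.1 to 1.0 rather than from 0.1 to 0.9.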

Write a function randomized_search_rf(X, y, n_iter=20, cv=5) that runs a random search on a RandomForestClassifier with:
n_estimators: [50, 100, 150, 200, 250, 300]
max_depth: [3, 5, 7, 10, 15, None]
min_samples_split: [2, 5, 10, 15, 20]
min_samples_leaf: [1, 2, 4, 8]

Return the same keys as in the previous exercise, plus 'n_iter_effectif' (the n_iter that was used).

Tests (4/4)

Correct params returned

import numpy as np
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=150, n_features=5, random_state=42)
result = randomized_search_rf(X, y, n_iter=10, cv=3)
assert 'n_estimators' in result['meilleurs_params']
assert 'max_depth' in result['meilleurs_params']

Number of combinations = n_iter

import numpy as np
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=150, n_features=5, random_state=42)
result = randomized_search_rf(X, y, n_iter=10, cv=3)
assert result['nb_combinaisons'] == 10

Score between 0 and 1

import numpy as np
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=150, n_features=5, random_state=42)
result = randomized_search_rf(X, y, n_iter=10, cv=3)
assert 0 <= result['meilleur_score'] <= 1

n_iter_effectif is correct

import numpy as np
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=150, n_features=5, random_state=42)
result = randomized_search_rf(X, y, n_iter=15, cv=3)
assert result['n_iter_effectif'] == 15

Official solution

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

def randomized_search_rf(X, y, n_iter=20, cv=5):
    # Discrete search space: 6 * 6 * 5 * 4 = 720 possible combinations
    param_distributions = {
        'n_estimators': [50, 100, 150, 200, 250, 300],
        'max_depth': [3, 5, 7, 10, 15, None],
        'min_samples_split': [2, 5, 10, 15, 20],
        'min_samples_leaf': [1, 2, 4, 8],
    }
    # Sample n_iter combinations and score each one with cv-fold cross-validation
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=42),
        param_distributions,
        n_iter=n_iter,
        cv=cv,
        scoring='accuracy',
        random_state=42,
    )
    search.fit(X, y)
    return {
        'meilleurs_params': search.best_params_,
        'meilleur_score': float(search.best_score_),
        # One entry in cv_results_['params'] per sampled combination
        'nb_combinaisons': len(search.cv_results_['params']),
        'n_iter_effectif': n_iter,
    }
