Basketball technique action recognition using 3D convolutional neural networks

Experimental data and experimental settings

To evaluate the recognition performance of the proposed 3D CNN-based basketball action recognition technique, a comparative analysis is conducted against two conventional sports action recognition models: the method based on inter-frame difference (MIFD) and the method based on optical flow (MOF). The assessment is carried out on several publicly available basketball action datasets, namely NTURGB + D34, Basketball-Action-Dataset35, and B3D Dataset36, using simulation and validation methodologies. The NTU RGB + D dataset is accessible at https://www.kaggle.com/datasets/hungkhoi/skeleton-data-of-ntu-rgbd-60-dataset, the B3D Dataset at https://github.com/b3d-project/b3d, and the Basketball-Action-Dataset at https://www.kaggle.com/datasets/mathchi/euroleague-basketball-20212022. This comparison enables an objective assessment of the 3D CNN-based basketball action recognition technique. Before the experiments, each dataset is partitioned into training, validation, and test sets: 70% of the samples are allocated to the training set, 10% to the validation set, and the remaining 20% to the test set.
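For concreteness, the following is a minimal sketch of such a 70/10/20 split, assuming the clips and labels are available as a list of (path, label) pairs; the actual loading pipeline is not described in the paper, so the file paths and helper names below are hypothetical.

```python
# Minimal sketch of the 70/10/20 split described above; the clip paths and
# labels are placeholders, not the authors' actual data pipeline.
import random

def split_dataset(samples, train_ratio=0.7, val_ratio=0.1, seed=42):
    """Shuffle and partition a list of (clip_path, label) pairs."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_ratio)
    n_val = int(n * val_ratio)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]          # remaining ~20%
    return train, val, test

# Example usage with hypothetical annotations:
# samples = [("clips/shot_001.mp4", "jump_shot"), ...]
# train_set, val_set, test_set = split_dataset(samples)
```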

In this research, each epoch consists of 600 iterations, that is, 600 parameter updates over the training dataset per epoch, and a total of 20 epochs are conducted during training. This setup ensures that the model adequately learns from the training data and progressively optimizes its performance as parameters are updated at each iteration.
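The schedule of 600 iterations per epoch over 20 epochs could be realised along the lines of the sketch below; the framework, optimizer, and loss function are assumptions for illustration and are not reported details of the authors' implementation.

```python
# Illustrative training schedule only: 20 epochs of 600 iterations each,
# matching the settings reported above. The model, data loader, and
# hyperparameters are placeholders, not the authors' exact configuration.
import torch

def train(model, train_loader, num_epochs=20, iters_per_epoch=600, lr=1e-3):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()

    for epoch in range(num_epochs):
        data_iter = iter(train_loader)
        for _ in range(iters_per_epoch):
            try:
                clips, labels = next(data_iter)
            except StopIteration:              # restart the loader if exhausted
                data_iter = iter(train_loader)
                clips, labels = next(data_iter)
            clips, labels = clips.to(device), labels.to(device)

            optimizer.zero_grad()
            loss = criterion(model(clips), labels)
            loss.backward()
            optimizer.step()                   # one parameter update per iteration
        print(f"epoch {epoch + 1}/{num_epochs}, last loss {loss.item():.4f}")
```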

Experimental evaluation metrics

In this experiment, four evaluation metrics, namely Accuracy, Recall, Precision, and F1 score, are employed to assess the performance of the basketball action recognition technique. Accuracy is calculated as shown in Eq. (1):

$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$

(1)

In Eq. (1), TP represents the number of samples correctly identified as the positive class, TN represents the number of samples correctly identified as the negative class, FP represents the number of negative class samples incorrectly identified as the positive class, and FN represents the number of positive class samples incorrectly identified as the negative class.

Recall is given by Eq. (2):

$$Recall=\frac{TP}{TP+FN}$$

(2)

Precision is expressed as Eq. (3):

$$Precision=\frac{TP}{TP+FP}$$

(3)

F1 score is given by Eq. (4):

$$F1=2\times \frac{Precision\times Recall}{Precision+Recall}$$

(4)

These evaluation metrics provide a holistic assessment of basketball action recognition techniques. Accuracy offers an overall measure of classification correctness, while recall and precision capture the coverage of positive samples and the exactness of positive predictions, respectively. The F1 score balances the two, which is especially important when positive and negative samples are imbalanced, and thus offers a more nuanced evaluation of model performance. Throughout the experiment, these metrics are pivotal in comparing different models and determining the optimal basketball action recognition model.
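A direct implementation of Eqs. (1)–(4) from confusion-matrix counts is shown below for reference; the counts in the usage example are illustrative and are not values taken from the experiments.

```python
# Direct implementation of Eqs. (1)-(4) from confusion-matrix counts;
# the counts in the example call are illustrative only.
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}

print(classification_metrics(tp=85, tn=90, fp=10, fn=15))
# accuracy = 0.875, recall = 0.85, precision ≈ 0.895, f1 ≈ 0.872
```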

Evaluation of recognition accuracy of different models across different datasets

This section assesses the performance of the 3D CNN-based basketball action recognition technique across the publicly accessible basketball action datasets NTURGB + D, Basketball-Action-Dataset, and B3D Dataset. The models MIFD-1 and MIFD-2 are both variants of MIFD for sports action recognition: they detect action dynamics by computing differences between consecutive video frames and use this information to discern various basketball techniques, and the disparity between them likely stems from nuanced variations in their frame differencing or feature extraction. The models MOF-1 and MOF-2 are derived from MOF, a computer vision technique that measures object motion in images: they discern actions by analyzing pixel motion between video frames, enabling them to capture subtle action nuances, and, as with MIFD, the discrepancies between them may arise from differences in optical flow calculation or feature representation. The recognition accuracy of the different models is compared across these datasets, and the evaluation curves for each model are depicted in Figs. 3, 4 and 5, respectively.
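To make the two baseline families concrete, the sketch below computes the motion cues they rely on, inter-frame differencing for MIFD-style models and dense optical flow for MOF-style models, using OpenCV; the baselines' actual preprocessing and classifiers are not specified in the paper, and the clip path is hypothetical.

```python
# Sketch of the two classical motion cues the baselines rely on:
# inter-frame differencing (MIFD-style) and dense optical flow (MOF-style).
import cv2

def frame_difference(prev_gray, curr_gray, threshold=25):
    """Binary motion mask from the absolute difference of two gray frames."""
    diff = cv2.absdiff(prev_gray, curr_gray)
    _, mask = cv2.threshold(diff, threshold, 255, cv2.THRESH_BINARY)
    return mask

def dense_optical_flow(prev_gray, curr_gray):
    """Per-pixel (dx, dy) motion field via Farneback optical flow."""
    # args: prev, next, flow, pyr_scale, levels, winsize, iterations,
    #       poly_n, poly_sigma, flags
    return cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

# Example on two consecutive frames of a clip (path is hypothetical):
cap = cv2.VideoCapture("clips/jump_shot_001.mp4")
ok1, f1 = cap.read()
ok2, f2 = cap.read()
if ok1 and ok2:
    g1 = cv2.cvtColor(f1, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(f2, cv2.COLOR_BGR2GRAY)
    motion_mask = frame_difference(g1, g2)     # MIFD-style cue
    flow = dense_optical_flow(g1, g2)          # MOF-style cue, shape (H, W, 2)
```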

Figure 3

Recognition accuracy performance evaluation of several different models under the NTURGB + D dataset.

Figure 4

Recognition accuracy performance evaluation of several different models under the Basketball-Action-Dataset.

Figure 5

Recognition accuracy performance evaluation of several different models under the B3D Dataset.

From Fig. 3, it is evident that on the NTURGB + D dataset, the recognition accuracy of all models improves as the number of iterations increases. Initially, MIFD-1 and MIFD-2 show relatively low accuracy, but their performance improves gradually with training. In contrast, MOF-1 and MOF-2 exhibit higher accuracy in the early iterations and maintain stable growth throughout the training process. CNN-1 and CNN-2 demonstrate the highest accuracy across all iterations, with CNN-2 reaching an accuracy of 67.9% after 600 iterations, notably surpassing the other models.

In Fig. 4, it is evident that the accuracy of all models improves as the number of iterations increases. MIFD-1 and MIFD-2 show relatively slower growth in accuracy compared to MOF-1 and MOF-2, which demonstrate faster improvement over the iteration process. CNN-1 and CNN-2 consistently exhibit outstanding performance, with CNN-2 maintaining its leading position, achieving an accuracy of 70.48% after 600 iterations.

In Fig. 5, the accuracy of MIFD-1 and MIFD-2 starts relatively low in the early iterations but shows improvement as training progresses. MOF-1 and MOF-2 exhibit more significant improvement during the iteration process, especially after 200 iterations. However, CNN-1 and CNN-2 demonstrate superior performance on this dataset, particularly after 600 iterations, where CNN-2 achieves an accuracy of 79.88%, showcasing its superiority in basketball technical action recognition tasks.

In summary, models based on three-dimensional convolutional neural networks (CNN-1 and CNN-2) display higher recognition accuracy and better performance improvement trends across all test datasets. This validates the effectiveness and potential of three-dimensional convolutional neural networks in the field of basketball technical action recognition. Moreover, these results indicate that DL methods can accurately automate the recognition and analysis of basketball technical actions, providing powerful technical support for coaches and athletes.

Recognition recall evaluation of different models across different datasets

This section evaluates the recall performance of the 3D CNN-based basketball technical action recognition technology across various datasets. The experimental results on the NTURGB + D dataset are depicted in Fig. 6, while Figs. 7 and 8 portray the data variation curves for the Basketball-Action-Dataset and B3D Dataset, respectively.

Figure 6

Recognition recall performance evaluation of several different models under the NTURGB + D dataset.

Figure 7

Recognition recall performance evaluation of several different models under the Basketball-Action-Dataset.

Figure 8

Recognition recall performance evaluation of several different models under the B3D Dataset.

Figure 6 illustrates that during the initial iterations, the recall rates of all models are relatively low; however, as the number of iterations progresses, the recall rates of each model exhibit an upward trend. While the recall enhancement of MIFD-1 and MIFD-2 is relatively gradual, MOF-1 and MOF-2 demonstrate a comparatively faster growth rate. Notably, CNN-1 and CNN-2 consistently display the highest recall rates across all iterations. Particularly noteworthy is CNN-2, which achieves a recall rate of 80.87% after 600 iterations, significantly surpassing other models.

Figure 7 illustrates the comparison of recall rates on the Basketball-Action-Dataset. In this dataset, the recall rates of MIFD-1 and MIFD-2 exhibit a relatively gradual growth during the iterations. Conversely, MOF-1 and MOF-2 show a more significant improvement in recall. Similarly, CNN-1 and CNN-2 display remarkable recall rates, particularly notable after 600 iterations, where CNN-2 achieves a recall rate of 89.41%, underscoring its superiority in basketball action recognition tasks.

Figure 8 illustrates the variation in recall rates on the B3D dataset. In this dataset, the recall rates of MIFD-1 and MIFD-2 start relatively low but show improvement as the number of iterations increases. MOF-1 and MOF-2 exhibit a more pronounced increase in recall rates during the iterations, particularly evident after 200 iterations. CNN-1 and CNN-2 display superior performance on this dataset, notably after 600 iterations, where CNN-2 achieves a recall rate of 85.77%, maintaining its leading position.

Combining the results of the three experiments, models based on three-dimensional convolutional neural networks (CNN-1 and CNN-2) demonstrate higher recall rates and better performance trends on all test datasets. This validates the effectiveness and potential of three-dimensional convolutional neural networks in the field of basketball action recognition. Additionally, these results indicate that DL methods can automate the recognition and analysis of basketball technical actions more accurately, providing powerful technical support for coaches and athletes.

Evaluation of precision and F1 score of different models in basketball action recognition

This subsection aims to assess the precision and F1 score performance of the basketball action recognition technique based on 3D CNNs on various datasets. Precision represents the ratio of correctly predicted positive samples by the classifier to the total number of samples predicted as positive. The F1 score, combining precision and recall, serves as a crucial metric for evaluating classifier performance. The precision and F1 score evaluations are conducted on the NTURGB + D, Basketball-Action-Dataset, and B3D Dataset to compare different models’ performance on these datasets. The experimental results are presented in Table 1.

Table 1 Precision evaluation results of different models in recognizing basketball technique actions.

From Tables 1 and 2, it is evident that different models exhibit variations in precision and F1 scores for basketball action recognition. Overall, the models based on 3D CNNs (CNN-1 and CNN-2) achieve higher precision and F1 scores across all datasets. Notably, on the NTURGB + D dataset, the CNN-2 model demonstrates the best performance with a precision of 0.92 and an F1 score of 0.91. In contrast, the traditional methods (MIFD-1, MIFD-2, MOF-1, and MOF-2) exhibit relatively lower precision and F1 scores; while some of these models show performance improvements on certain datasets, they still lag behind the 3D CNN-based models in both metrics.

Table 2 F1 score evaluation results of different models in recognizing basketball technique actions.

Cross-validation results

To assess the generalization ability of the model and the robustness of the results, this research employs K-fold cross-validation. The dataset is partitioned into K mutually exclusive subsets; in each round, K − 1 subsets are used for training and the remaining subset is reserved for validating the model’s performance. This procedure is repeated K times, each time with a different validation subset, and the average of the K evaluations is taken as the final performance estimate. Cross-validation thus yields a more precise assessment of the model’s performance on different data subsets and a more comprehensive picture of its generalization ability and robustness.

The cross-validation results of different models across the datasets are detailed in Table 3. On the NTURGB + D dataset, the CNN-1 and CNN-2 models exhibit the highest accuracy and F1 scores, reaching 0.92 and 0.91, respectively, whereas the MIFD-1, MIFD-2, MOF-1, and MOF-2 models achieve accuracy and F1 scores between 0.70 and 0.79. On the Basketball-Action-Dataset, the CNN-1 model attains the highest accuracy and F1 score of 0.87 and 0.86, respectively, followed closely by the CNN-2 model, while the MIFD-1, MIFD-2, MOF-1, and MOF-2 models again perform comparatively worse. On the B3D Dataset, the CNN-1 and CNN-2 models still perform best, although the overall performance of all models declines slightly relative to the other datasets; notably, the CNN-2 model’s performance is relatively poorer on this dataset. In summary, the CNN models consistently demonstrate better performance across the different datasets, while the models based on traditional methods generally exhibit lower performance.

Table 3 Cross-validation results of different models under different data sets.
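A generic version of this K-fold procedure is sketched below using scikit-learn's KFold; the fold count, the metric, and the train/evaluate callables are assumptions for illustration rather than the exact protocol used in the experiments.

```python
# Generic K-fold evaluation sketch (scikit-learn KFold); the fold count,
# metric, and train/evaluate callables are illustrative assumptions.
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(samples, labels, build_model, train_fn, eval_fn, k=5):
    """Average a score over K folds, each fold held out once for validation."""
    samples, labels = np.asarray(samples), np.asarray(labels)
    scores = []
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    for train_idx, val_idx in kf.split(samples):
        model = build_model()                              # fresh model per fold
        train_fn(model, samples[train_idx], labels[train_idx])
        scores.append(eval_fn(model, samples[val_idx], labels[val_idx]))
    return float(np.mean(scores)), scores                  # mean of K evaluations
```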

Sensitivity analysis design

In this research, a series of ablation experiments is designed to further validate the contribution of each module in the model to basketball action recognition. These experiments involve gradually removing or disabling specific modules in the model and evaluating the performance change of the model on different datasets to quantify the impact of each module. Initially, a baseline model encompassing all modules of 3D CNN and LSTM networks is defined for subsequent comparisons. In the first round of experiments, all 3D convolutional layers are removed from the model to assess the importance of these layers in extracting spatiotemporal features. In the second round, the LSTM layers are disabled to examine the role of temporal feature modeling in action recognition. Subsequently, the pooling layers are removed from the model to observe the contribution of these layers in reducing feature dimensionality and extracting representative features. Finally, the fully connected layers are removed to evaluate their role in mapping features to specific action categories. The performance change of the model on different datasets after removing different modules is summarized in Table 4.

Table 4 Performance changes of the model on different data sets after removing different modules.

Table 4 illustrates that when the 3D convolutional layers are removed, the accuracy of the model decreases on all datasets, underscoring the importance of the 3D convolutional layers in extracting spatiotemporal features. Similarly, disabling the LSTM layers also reduces accuracy, indicating the crucial role of LSTM layers in temporal feature modeling. After the pooling layers are removed, the performance of the model improves, possibly because pooling occasionally discards useful feature information. Lastly, removing the fully connected layers leads to a significant decrease in accuracy, emphasizing their crucial role in action classification.
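One way to express these ablation variants is a single model whose blocks can be switched off by flags, as in the sketch below; the layer sizes are illustrative, not the authors' exact architecture, and when the pooling layer is disabled a plain spatial mean stands in so that the remaining layers still receive compatible shapes.

```python
# Sketch of an ablation-configurable 3D-CNN + LSTM classifier; layer sizes
# are illustrative and do not reproduce the authors' exact architecture.
import torch
import torch.nn as nn

class Ablation3DCNNLSTM(nn.Module):
    def __init__(self, num_classes, use_conv3d=True, use_pool=True,
                 use_lstm=True, use_fc=True):
        super().__init__()
        self.use_conv3d, self.use_pool = use_conv3d, use_pool
        self.use_lstm, self.use_fc = use_lstm, use_fc
        feat = 32 if use_conv3d else 3          # channels fed to the LSTM
        self.conv3d = nn.Conv3d(3, 32, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))   # collapse H, W only
        self.lstm = nn.LSTM(input_size=feat, hidden_size=64, batch_first=True)
        out_dim = 64 if use_lstm else feat
        self.fc = nn.Linear(out_dim, num_classes)

    def forward(self, x):                       # x: (B, 3, T, H, W)
        if self.use_conv3d:
            x = torch.relu(self.conv3d(x))      # spatiotemporal features
        # collapse spatial dims with the pooling layer, or a plain mean
        # when the pooling layer is "removed" in the ablation
        x = self.pool(x) if self.use_pool else x.mean(dim=(3, 4), keepdim=True)
        x = x.flatten(2).transpose(1, 2)        # (B, T, C)
        if self.use_lstm:
            x, _ = self.lstm(x)                 # temporal modelling
        x = x.mean(dim=1)                       # average over time
        return self.fc(x) if self.use_fc else x

# Example ablation variant: baseline with the LSTM layers disabled.
model = Ablation3DCNNLSTM(num_classes=10, use_lstm=False)
```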

Comparative analysis experiment design

In this section, the basketball action recognition model based on 3D CNN proposed in this research is further evaluated through a comparative analysis with various existing methods; the performance comparison on the different datasets is summarized in Table 5. Frame difference and optical flow methods, representing traditional action recognition approaches, show comparatively lower performance on all three datasets than the DL methods, underscoring the substantial advantage of DL methods in capturing and processing the intricacies of basketball actions.

Among the DL baselines, transformer models stand out: they achieve an accuracy of 90.1% on the NTURGB + D dataset, 87.3% on the Basketball-Action-Dataset, and 89.2% on the B3D dataset, demonstrating their efficacy in handling large-scale datasets and intricate action patterns. The method proposed in reference37, which integrates YOLO and deep fuzzy LSTM networks, achieves high accuracy across all datasets, notably 82.1% on the Basketball-Action-Dataset. The 3D pose fuzzy neural network method proposed in reference38 is proficient at recognizing basketball player jumping actions, albeit with slightly lower accuracy on the B3D dataset. The intelligent trajectory analysis method based on spectral imaging technology proposed in reference39 shows slightly lower accuracy on the NTURGB + D dataset than the preceding DL methods but achieves higher accuracy on the Basketball-Action-Dataset and B3D dataset.

The basketball action recognition model based on 3D CNN proposed in this paper achieves the highest accuracy across all datasets: 93.1%, 88.5%, and 90.0% on the NTURGB + D dataset, Basketball-Action-Dataset, and B3D dataset, respectively. This outcome underscores the superiority of the method over both traditional and existing DL methods in basketball action recognition tasks. The advantage may stem from the model’s architectural design, which effectively integrates 3D CNN and LSTM networks to capture the spatiotemporal features inherent in basketball actions, as well as from the data processing techniques and optimization algorithms employed.

Table 5 Performance comparison of different methods on different datasets.

Discussion

This section discusses the proposed basketball action recognition method based on 3D CNN and compares it with existing skeleton-based action recognition methods, which primarily focus on analyzing human skeleton data. For instance, Yu et al. proposed a Convolutional 3D LSTM method that integrates attention mechanisms into LSTM networks through three-dimensional convolutions, enabling the memory blocks to effectively learn both short-term frame dependencies and long-term relationships40. Duan et al. introduced PoseConv3D, a skeleton-based action recognition method that could enhance the performance and generalization ability of basketball action recognition models, particularly in multi-person scenarios41. Lee et al. proposed a Hierarchically Decomposed Graph Convolutional Network architecture with promising implications for basketball action recognition, especially in scenarios dealing with skeleton data42.

In comparison, the proposed basketball action recognition method based on 3D CNN exhibits several innovative aspects. Rather than focusing only on the spatiotemporal features of skeleton data, it extracts rich spatiotemporal information directly from video sequences using 3D CNN, eliminating the need to preprocess skeleton joint data. The incorporation of LSTM networks enables the model to capture the long-term dependencies of basketball actions, which is crucial for comprehending the coherence and complexity of such actions. Through adaptive learning rate adjustment and regularization techniques, the method enhances the generalization ability and robustness of the model during training. Extensive testing on multiple publicly available basketball action datasets validates its recognition performance across various scenarios and conditions. Although 3D CNN models may have a higher parameter count and computational complexity than some skeleton-based methods, the proposed approach achieves effective resource utilization while maintaining high accuracy through model optimization and algorithmic enhancements.
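The adaptive learning rate adjustment and regularization mentioned above could, for example, be realised with weight decay and a plateau-based scheduler, as in the minimal sketch below; the scheduler choice and coefficients are assumptions, since the paper does not specify them.

```python
# Hedged sketch of adaptive learning rate adjustment plus L2 regularisation;
# the scheduler type and coefficients are illustrative assumptions.
import torch

def make_optimizer_and_scheduler(model, base_lr=1e-3, weight_decay=1e-4):
    # L2 regularisation via weight decay on all parameters
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr,
                                 weight_decay=weight_decay)
    # Reduce the learning rate when the validation loss stops improving
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=3)
    return optimizer, scheduler

# After each validation pass: scheduler.step(val_loss)
```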

In comparison to existing skeleton-based action recognition methods, the proposed approach demonstrates superior recognition accuracy across multiple datasets. Through the application of optimization algorithms, the method exhibits enhanced robustness in identifying basketball actions under varying lighting conditions and complex backgrounds. Furthermore, through cross-dataset testing, the method showcases strong generalization ability, adept at adapting to diverse data distributions and action characteristics.

While the proposed method represents a significant advancement in the realm of basketball action recognition, opportunities for further improvement and optimization persist. Future research endeavors could explore techniques to reduce model parameter count and computational complexity, thereby enhancing the feasibility of real-time applications. Additionally, considering the integration of basketball action recognition with other pertinent tasks such as player identification and tactical analysis through multi-task learning could bolster the overall performance of the model. Moreover, validating the model’s performance on larger and more diverse datasets would provide deeper insights into its generalization capabilities. Furthermore, investigating the integration of video data with other modalities such as audio and text could yield a more comprehensive action recognition solution.
