Capstone Project Report
Project Title: Feasibility of Automated Bone Age Estimation via Google Teachable Machine: A Proof-of-Concept Study
CHAPTER 1: INTRODUCTION
1.1 Background of the Study
Bone age assessment (BAA) is an essential diagnostic tool in pediatric endocrinology.
Manual methods, such as the Greulich-Pyle (GP) atlas and Tanner-Whitehouse
(TW2/TW3) scoring, require radiologists to visually compare hand radiographs to
reference standards. These methods are effective, but time-intensive, subjective, and
susceptible to inter- and intra-observer variability.
Advances in artificial intelligence (AI), particularly in deep learning, offer promising
alternatives. Convolutional Neural Networks (CNNs) have demonstrated strong
performance, with mean absolute errors as low as 6–8 months in benchmark studies
such as the RSNA Pediatric Bone Age Challenge.
1.2 Relevance of the Project in AI
This study evaluated the use of no-code AI platforms, specifically Google Teachable
Machine (GTM), to train deep learning models for medical imaging tasks. GTM
simplifies model development through a drag-and-drop interface that applies transfer
learning to the MobileNetV2 architecture.
We assessed the GTM output in two environments: the original GTM interface and a
code-based evaluation in Python using TensorFlow Lite (TFLite). This reflects a
complete AI development pipeline, from rapid prototyping to formal evaluation.
1.3 Scope of the Project
This project aims to classify pediatric hand X-rays into four bone age categories:
. Class 0: 0–1 year
. Class 1: 1–5 years
. Class 2: 5–10 years
. Class 3: >10 years
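The four categories above can be expressed as a small labelling function. This is a minimal sketch, not the project's own code: it assumes bone age is given in months, and because the listed ranges share the endpoints 1 and 5 years, it assumes those boundary ages fall in the older class. Exactly 10 years stays in Class 2, since Class 3 is written as ">10 years". The function name is hypothetical.

```python
def bone_age_class(age_months: float) -> int:
    """Map a bone age in months to one of the four class indices (0-3).

    Assumption: ages of exactly 12 and 60 months go to the older class;
    exactly 120 months stays in Class 2 because Class 3 is ">10 years".
    """
    if age_months < 0:
        raise ValueError("bone age cannot be negative")
    if age_months < 12:      # Class 0: 0-1 year
        return 0
    if age_months < 60:      # Class 1: 1-5 years
        return 1
    if age_months <= 120:    # Class 2: 5-10 years
        return 2
    return 3                 # Class 3: >10 years
```

For example, `bone_age_class(94)` returns 2, because 94 months is just under 8 years.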
The model was trained in GTM on a balanced subset of the RSNA Pediatric Bone Age dataset and exported for evaluation in Google Colab on a separate test set. The evaluation covers classification metrics and model portability.
The study is limited to a classification-based proof of concept. Its primary goal is to evaluate the feasibility and limitations of no-code AI tools in a clinically relevant use case.
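One way to draw a balanced subset is to sample the same number of images from every class. The sketch below illustrates the idea only. The report does not state how its subset was selected, so the function name, the `per_class` parameter, and the fixed seed are all assumptions.

```python
import random
from collections import defaultdict


def balanced_subset(labels, per_class, seed=0):
    """Return sorted indices holding exactly `per_class` items per label."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    chosen = []
    for label in sorted(by_class):
        pool = by_class[label]
        if len(pool) < per_class:
            raise ValueError(f"class {label} has only {len(pool)} items")
        chosen.extend(rng.sample(pool, per_class))
    return sorted(chosen)
```

Images not selected for training can then be held back for the separate test set.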
1.4 Significance of the Work
This work holds significance for multiple stakeholders:
In a broader sense, this project contributes to the growing field of AI-enabled medical imaging, where accessibility, interpretability, and clinical integration are key priorities. The outcomes underscore the balance between ease of use and analytical rigor, advocating for a pipeline that starts with rapid prototyping and evolves toward clinical-grade deployment through formal testing and iterative refinement.
CHAPTER 2: OBJECTIVES OF THE PROJECT
This proof-of-concept study evaluates whether Google Teachable Machine (GTM) can be
used to train a multi-class bone age classifier. It also compares the performance of the
GTM model with that of the exported TensorFlow Lite (TFLite) version.
Objectives:
. Train a four-class classifier using GTM.
. Export and evaluate the model using Python.
. Compare usability and performance across platforms.
CHAPTER 3: LITERATURE REVIEW
3.1 Summary of Existing Research Work
Bone age assessment (BAA) is a well-established diagnostic tool in pediatric radiology
that is primarily used to evaluate skeletal development relative to chronological age.
Clinicians rely on this assessment to diagnose and manage a wide range of growth
and endocrine disorders, including delayed or precocious puberty, idiopathic short
stature, and hormonal imbalance.
The two most common traditional approaches are as follows:
Greulich–Pyle (GP) Atlas: A reference-based system where clinicians visually compare a patient’s left-hand X-ray to standardized images in an atlas.
Tanner–Whitehouse (TW2/TW3) Scoring: A more quantitative approach that scores the maturity of individual bones and calculates bone age from a summative scale.
Although these methods are widely used, they suffer from inherent limitations:
. High inter-observer variability
. Dependence on the clinician’s experience
. Time-consuming manual processes
In the last decade, several AI-based studies have aimed to automate BAA.
The RSNA Pediatric Bone Age Challenge (2017) provided a large annotated dataset and
catalyzed research into deep-learning solutions. Top-performing models, typically based
on convolutional neural networks (CNNs), achieved mean absolute errors (MAEs) as low as
6–8 months.
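For reference, the mean absolute error quoted above is the average of the absolute differences between predicted and reference bone ages. The sketch below shows the calculation on made-up values in months. The numbers are illustrative only and are not taken from any study.

```python
def mean_absolute_error(true_months, predicted_months):
    """Average absolute difference between reference and predicted ages."""
    if len(true_months) != len(predicted_months) or not true_months:
        raise ValueError("inputs must be non-empty and of equal length")
    differences = [abs(t - p) for t, p in zip(true_months, predicted_months)]
    return sum(differences) / len(differences)


# Illustrative values only: errors of 6, 6 and 12 months average to 8.
example_mae = mean_absolute_error([24, 60, 132], [30, 54, 120])
```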
More recent studies have incorporated attention mechanisms, ensemble models, and multi-modal inputs (e.g., combining clinical metadata like sex or height), further
improving prediction accuracy. The evidence was structured in the PICO format:
Population (P)
Pediatric patients requiring bone-age assessment:
. Growth disorder diagnosis [1][2]
. Forensic age estimation (particularly undocumented adolescents) [3]
. Treatment monitoring (e.g., growth hormone therapy) [1][4]
Intervention (I)
Automated Systems:
populations) [6]
Comparison (C)
Manual Greulich-Pyle assessment by radiologists
Outcomes (O)
Metric | Automated Systems | Manual Assessment | Studies |
MAE (months) | 3.34-5.45 | 6.96-8.16 | [3][6][4][5] |
RMS error (years) | 0.33 (true accuracy) | 0.52-0.68 | [3][5] |
Clinical error rate* | 0.5% | 5.9% | [3] |
Processing time | <2 minutes | 15-20 minutes | [1][3] |
*Errors changing clinical diagnosis (e.g., misclassifying prepubertal vs pubertal status)
Key Evidence
ratings in disputed cases
(automated) of natural bone age variance
compared to image-only models
Clinical adoption requires the following considerations:
1. Population-specific calibration for ethnic/racial groups
2. Integration of uncertainty margins (±1.5 years) in forensic applications
3. Continuous validation against updated reference standards
Despite these advances, most existing solutions require the following
3.2 Identification of Gaps
3.3 Research Questions
This study addresses the following research questions:
By addressing these questions, this study contributes not only to the development of AI-based tools in radiology, but also to a broader discussion on making AI tools more usable, interpretable, and scalable in clinical workflows.
CHAPTER 4: PROBLEM STATEMENT AND KPIs
4.1 Problem Statement
Bone age estimation is a critical component of pediatric radiology, routinely used to assess skeletal development and identify growth disorders. Despite its clinical importance, the current standard methods—Greulich-Pyle atlas and Tanner-Whitehouse scoring—are inherently subjective, time-consuming, and prone to inter-observer variability. These limitations make the process inefficient, especially in busy clinical settings or regions with limited access to pediatric radiologists.
In parallel, deep learning–based automation has shown potential to provide consistent and accurate predictions. However, the high technical barrier to entry, including requirements for coding expertise, GPU-based infrastructure, and complex deployment pipelines, limits the accessibility of such solutions for non-technical healthcare professionals.
This project addresses a dual-layered problem:
4.2 Objective of the AI Solution
To evaluate whether a no-code AI tool (Google Teachable Machine) can be effectively used to prototype a pediatric bone age classifier, and whether its exported model (as a TFLite file) can be rigorously evaluated and validated in a Python-based environment. This allows assessment of both the usability and real-world performance of such accessible tools in solving a clinically relevant task.
4.3 Key Performance Indicators (KPIs)
To objectively evaluate the AI solution, the following key performance indicators were defined:
KPI | Definition | Target / Insight Expected |
Overall Accuracy | Proportion of test images correctly classified into the correct age group (4-class model) | ≥ 50% for proof-of-concept; higher desired with data expansion |
Class-wise Recall | Ability of the model to correctly identify true positives for each class | Especially critical for Class 2 (5–10 years), often underrepresented |
Class-wise Precision | Proportion of predicted labels for a class that were correct | Helps identify if model is over-predicting certain classes |
F1-Score | Harmonic mean of precision and recall per class | Balanced metric for evaluating both false positives and false negatives |
Macro-average F1 | Average F1-score across all classes, treating each class equally | Used as the primary measure of multi-class model balance |
Exportability to TFLite | Whether the GTM-trained model can be exported and re-evaluated in a code-based pipeline | Required for integration and deployment in real-world systems |
Model Usability (No-Code Accessibility) | Ease of model training via GTM interface without coding | Should demonstrate that a clinician can build the model independently |
Interpretability via Confusion Matrix | Ability to visualize error patterns across age groups | Helps identify which classes require further refinement |
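The classification KPIs in the table can all be derived from a confusion matrix. The following dependency-free sketch shows one way to do so. The function names are hypothetical, and in practice scikit-learn's `confusion_matrix` and `classification_report` provide the same quantities.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def per_class_metrics(matrix):
    """Precision, recall and F1 for each class of a square matrix."""
    n = len(matrix)
    rows = []
    for c in range(n):
        true_positives = matrix[c][c]
        predicted = sum(matrix[r][c] for r in range(n))  # column total
        actual = sum(matrix[c])                          # row total
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        rows.append({"precision": precision, "recall": recall, "f1": f1})
    return rows


def summarize(matrix):
    """Return overall accuracy, macro-average F1, and per-class rows."""
    rows = per_class_metrics(matrix)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[i][i] for i in range(len(matrix))) / total
    macro_f1 = sum(row["f1"] for row in rows) / len(rows)
    return accuracy, macro_f1, rows
```

Macro-averaging weights every class equally, so a single weak class lowers the score even when overall accuracy looks acceptable.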
CHAPTER 5: METHODOLOGY
Approach and Strategy Used to Solve the Problem
The project adopts a two-stage strategy to assess the feasibility of building a bone age classifier using an accessible no-code platform (Google Teachable Machine) and validating its real-world applicability using a coding-based evaluation pipeline (TensorFlow Lite in Google Colab). The objective was to train a deep learning model to classify pediatric hand radiographs into four discrete age categories.
This dual-model workflow helps bridge the gap between rapid AI prototyping and structured AI validation.
Algorithms and AI Techniques Considered
Project Workflow Diagram
CHAPTER 6: TOOLS AND TECHNOLOGIES USED
Programming Languages
Frameworks and Libraries
Library / Tool | Purpose / Use |
matplotlib | Plotting visualizations such as bar charts and confusion matrices |
seaborn | Enhanced statistical plotting; used for heatmaps and styled bar plots |
numpy | Handling arrays and numerical data (e.g., defining confusion matrix data) |
pandas | Useful for handling tabular data, though not essential for these visuals |
scikit-learn | Computing classification metrics (precision, recall, F1-score); model evaluation |
TensorFlow Lite | Running the exported model for inference in a Python environment |
Google Teachable Machine | No-code interface for model training using transfer learning (MobileNetV2) |
Google Colab | Cloud environment for running Python scripts, evaluating models, and visualizing |
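As an illustration of how the exported model might be run with the tools listed above, the sketch below classifies a single image with the TensorFlow Lite interpreter. It is not the project's script and rests on several assumptions: a floating-point (non-quantized) export, an image already resized to the model's input size (commonly 224×224 RGB for GTM image models), and pixel scaling to the range [-1, 1]. The model path is supplied by the caller, and the function names are hypothetical.

```python
import numpy as np


def preprocess(image_uint8):
    """Scale an HxWx3 uint8 image to float32 in [-1, 1] and add a batch axis.

    Assumption: the exported model expects this normalization.
    """
    array = np.asarray(image_uint8, dtype=np.float32)
    return np.expand_dims(array / 127.5 - 1.0, axis=0)


def predict_class(model_path, image_uint8):
    """Return the index of the highest-scoring class for one image."""
    import tensorflow as tf  # imported lazily; only needed for inference

    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    input_info = interpreter.get_input_details()[0]
    output_info = interpreter.get_output_details()[0]

    interpreter.set_tensor(input_info["index"], preprocess(image_uint8))
    interpreter.invoke()
    scores = interpreter.get_tensor(output_info["index"])[0]
    return int(np.argmax(scores))
```

A quantized export would expect a different input dtype, so the `dtype` entry returned by `get_input_details()` is worth checking before reusing this preprocessing.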
Development Environments
Cloud or Deployment Tools
CHAPTER 7: DATA COLLECTION AND PREPROCESSING
Source of Data
Description of Datasets
Cleaning, Normalization, Feature Engineering
CHAPTER 8: MODEL DEVELOPMENT
Model Selection Rationale
Training and Validation
Model Architecture
CHAPTER 9: RESULTS AND EVALUATION
Performance Metrics Used
The model was evaluated in two settings: within the Google Teachable Machine (GTM) interface in which it was trained, and as the exported TensorFlow Lite (TFLite) model tested in Google Colab. The performance was assessed using:
GTM Evaluation Summary
Class | Accuracy (Recall) | Sample Count |
0 | 100% | 23 |
1 | 66.7% | 18 |
2 | 57.9% | 19 |
3 | 57.9% | 19 |
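An overall accuracy can be recovered from the table above by weighting each class's recall by its sample count. The short sketch below performs this arithmetic on the tabulated values and gives roughly 72.2% across the 79 samples. The variable and function names are illustrative.

```python
# (recall, sample count) per class, taken from the GTM summary table
gtm_results = [(1.000, 23), (0.667, 18), (0.579, 19), (0.579, 19)]


def overall_accuracy(per_class):
    """Sample-weighted average of per-class recall."""
    total = sum(count for _, count in per_class)
    correct = sum(recall * count for recall, count in per_class)
    return correct / total


gtm_accuracy = overall_accuracy(gtm_results)  # about 0.722
```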
TFLite Evaluation Summary
Evaluated on 461 test images:
Class | Precision | Recall | F1-score | Support |
Class 0 | 0.62 | 1.00 | 0.76 | 101 |
Class 1 | 0.40 | 0.61 | 0.48 | 113 |
Class 2 | 0.64 | 0.06 | 0.10 | 123 |
Class 3 | 0.66 | 0.60 | 0.63 | 124 |
Overall accuracy | – | – | 0.54 | 461 |
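The macro-average F1-score is the unweighted mean of the four per-class F1 values in the table. The sketch below reproduces that arithmetic from the tabulated, already rounded figures, so the result is approximate. It also estimates how many Class 2 images were classified correctly, which makes the low recall concrete. The names are illustrative.

```python
# Per-class values copied from the TFLite summary table
tflite_f1 = {0: 0.76, 1: 0.48, 2: 0.10, 3: 0.63}
class2_recall, class2_support = 0.06, 123


def macro_average(values):
    """Unweighted mean, so every class counts equally."""
    values = list(values)
    return sum(values) / len(values)


macro_f1 = macro_average(tflite_f1.values())            # about 0.49
class2_correct = round(class2_recall * class2_support)  # about 7 of 123
```

Because each class counts equally, the weak Class 2 score pulls the macro-average F1 below the overall accuracy of 0.54.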
Visualizations
Comparison with Baseline
While GTM provided a faster estimate with higher apparent accuracy (72.2%), the TFLite model demonstrated lower accuracy (54.4%) due to:
This gap highlights the need for deeper validation beyond GTM’s built-in interface.
CHAPTER 10: CHALLENGES FACED
Technical Challenges
Data Challenges
Project Management Challenges
How Challenges Were Overcome
CHAPTER 11: CONCLUSION
Summary of Findings
This capstone project successfully demonstrated the feasibility of using Google Teachable Machine (GTM), a no-code AI tool, to develop an image-based bone age classification model. The model, trained on the RSNA Pediatric Bone Age dataset and built upon the MobileNetV2 architecture, showed promising results in classifying infant bone age radiographs (Class 0) with high accuracy. However, its performance diminished in older pediatric age groups, especially Class 2 (5–10 years), which had the lowest recall owing to subtle skeletal features and class overlap.
Exporting the GTM model to TensorFlow Lite (TFLite) allowed a more detailed and scalable evaluation in Python. The TFLite model, tested on a larger dataset, achieved an overall accuracy of 54.4% and a macro-average F1-score of 0.49, revealing weaknesses that GTM’s built-in interface did not expose.
Impact of the Project
This project highlights the potential of democratized AI tools, such as GTM, for rapid prototyping in healthcare settings. Simultaneously, it reinforces the importance of rigorous evaluation pipelines (e.g., TFLite + Python) before considering any model for real-world clinical use. This study serves as a valuable proof-of-concept for bridging non-programmer-friendly platforms with technically robust deployment paths.
CHAPTER 12: FUTURE WORK
Possible Enhancements
Scope for Further Research
CHAPTER 13: REFERENCES
CHAPTER 14: APPENDICES
A. Source Code and Scripts
B. Performance Visualizations
C. Online Access
All files, notebooks, and high-resolution figures are available online at:
https://drive.google.com/drive/folders/1fH4OQYhprOL1gUbF_6zqmmZQ81gNA23D?usp=drive_link