Capstone Project Report Project Title: Feasibility of Automated Bone Age Estimation viaGoogle Teachable Machine: A Proof-of-Concept Study

Project Title: Feasibility of Automated Bone Age Estimation viaGoogle Teachable Machine: A Proof-of-Concept Study

 


CHAPTER 1: INTRODUCTION

1.1 Background of the Study

Bone age assessment (BAA) is an essential diagnostic tool in pediatric endocrinology.

Manual methods, such as the Greulich-Pyle (GP) atlas and Tanner-Whitehouse

(TW2/TW3) scoring, require radiologists to visually compare hand radiographs to

reference standards. These methods are effective, but time-intensive, subjective, and

susceptible to inter- and intra-observer variability.

Advances in artificial intelligence (AI), particularly in deep learning, offer promising

alternatives. Convolutional Neural Networks (CNNs) have demonstrated strong

performance, with mean absolute errors as low as 6–8 months in benchmark studies

such as the RSNA Pediatric Bone Age Challenge.

 

1.2 Relevance of the Project in AI

 

This study evaluated the use of no-code AI platforms, specifically Google Teachable

Machine (GTM), to train deep learning models for medical imaging tasks. The GTM

simplifies model development through a drag-and-drop interface using transfer learning

via the MobileNetV2 architecture.

We assessed the GTM output in two environments: the original GTM interface and a

code-based evaluation in Python using TensorFlow Lite (TFLite). This reflects a

complete AI development pipeline, from rapid prototyping to formal evaluation.

 

1.3 Scope of the Project

 

This project aims to classify pediatric hand X-rays into four bone age categories:

. Class 0: 0–1 year

. Class 1: 1–5 years

. Class 2: 5–10 years

. Class 3: >10 years

 

The model was trained in GTM and exported for evaluation in Colab using a balanced

subset from the RSNA Pediatric Bone Age dataset. This study evaluated the classification metrics and model portability.

 

The study is limited to a classification-based proof-of-concept, using a balanced subset of the dataset for training and a separate set for testing. The primary goal is to evaluate the feasibility and limitations of no-code AI tools in a clinically relevant use case.

1.4 Significance of the Work

This work holds significance for multiple stakeholders:

• For clinicians, it demonstrates the potential of AI tools to augment diagnostic workflows, reduce manual effort, and standardize bone age reporting.
• For AI developers and educators, it illustrates how no-code platforms like GTM can serve as a springboard for deploying real-world machine learning models.
• For the research community, the comparison between GTM and TFLite highlights the importance of external evaluation and validation, even when using user-friendly AI tools.

In a broader sense, this project contributes to the growing field of AI-enabled medical imaging, where accessibility, interpretability, and clinical integration are key priorities. The outcomes underscore the balance between ease of use and analytical rigor, advocating for a pipeline that starts with rapid prototyping and evolves toward clinical-grade deployment through formal testing and iterative refinement.

CHAPTER 2: OBJECTIVES OF THE PROJECT

 

This proof-of-concept evaluates whether Google Teachable Machine (GTM) can be

used to train a multi-class bone age classifier. It also compared the GTM model

performance with that of the exported TensorFlow Lite (TFLite) version.

 

Objectives:

 

. Training a four-class classifier using the GTM.

. Export and evaluate the model using Python.

. Comparison of usability and performance across platforms.

CHAPTER 3: LITERATURE REVIEW

3.1 Summary of Existing Research Work

Bone age assessment (BAA) is a well-established diagnostic tool in pediatric radiology

that is primarily used to evaluate skeletal development relative to chronological age.

Clinicians rely on this assessment to diagnose and manage a wide range of growth

and endocrine disorders, including delayed or precocious puberty, idiopathic short

stature, and hormonal imbalance.

 

The two most common traditional approaches are as follows:

 

Greulich–Pyle (GP) Atlas : A reference-based system where clinicians visually compare a patient’s left-hand X-ray to standardized images in an atlas.

Tanner–Whitehouse (TW2/TW3) Scoring: A more quantitative approach that scores the maturity of individual bones and calculates bone

age from a summative scale.

 

Although these methods are widely used, they suffer from inherent limitations:

 

. High inter-observer variability Dependence on the clinician’s experience

. Time-consuming manual processes

 

In the last decade, several AI-based studies have aimed to automate BAA.

 

The RSNA Pediatric Bone Age Challenge (2017) provided a large annotated dataset and

catalyzed research into deep-learning solutions. Top-performing models, typically based

on convolutional neural networks (CNNs), achieve a mean absolute error (MAEs) of less

than 6–8 months. 

More recent studies have incorporated attention mechanisms, ensemble models, and multi-modal inputs (e.g., combining clinical metadata like sex or height), further

improving prediction accuracy. The evidence was structured in the PICO format:

 

Population (P)

 

Pediatric patients requiring bone-age assessment Growth disorder diagnosis [1][2] Forensic age estimation (particularly undocumented adolescents) [3] Treatment monitoring (e.g., growth hormone therapy) [1][4]

 

Intervention (I)

 

Automated Systems:

• BoneXpert v3: Automated GP/TW3 analysis (MAE 4.1 months vs consensus) [3]
• Deep Learning Models:
1. RSNA Challenge winners: CNN-based models (MAE 4.3-4.5 months) [3][5]
2. 2M-Net: Multi-task learning with attention (MAE 3.98 months) [4]
3. Critical bone area detection networks (MAE 3.34 months in homogeneous

populations) [6]

 

Comparison (C)

 

Manual Greulich-Pyle assessment by radiologists

1. Inter-rater variability: 0.58-0.68 years RMS error 
2. Clinical error rate: 5.9% misclassification vs 0.5% forBoneXpert in disputed cases 
3. Time consumption: 15-20 mins vs <2 mins for automated systems 

 

Outcomes (O)

 

Metric

 

Automated Systems

 

Manual Assessment

 

Studies

 

MAE (months)

 

3.34-5.45

 

6.96-8.16

 

[3][6][4][5]

 

RMS error (years)

 

0.33 (true accuracy)

 

0.52-0.68

 

[3][5]

 

Clinical error rate*

 

0.5%

 

5.9%

 

[3]

 

Processing time

 

<2 minutes

 

15-20 minutes

 

[1][3]

 

*Errors changing clinical diagnosis (e.g., misclassifying prepubertal vs pubertal status)

 

Key Evidence 

 

1. BoneXpert demonstrates 12× fewer severe errors (&gt;1.5 years) than manual

ratings in disputed cases 

2. Deep learning models reduce variance contribution from 34% (manual) to 11%

(automated) of natural bone age variance

 

3. Multi-modal approaches incorporating sex metadata improve MAE by 18%

compared to image-only models

 

Clinical adoption requires the following considerations:

 

1.   Population-specific calibration for ethnic/racial groups

2.  Integration of uncertainty margins (±1.5 years) in forensic applications 

3.  Continuous validation against updated reference standards 

 

Despite these advances, most existing solutions require the following

 

1. Substantial coding expertise
2. High computational resources
3. Expert-labeled datasets for supervised learning
4. Custom-built pipelines for model training, evaluation, and deployment

 

 

 

 

3.2 Identification of Gaps

• While the literature highlights promising progress in AI-based bone-age assessment, several gaps remain.

 

• Accessibility: Most current solutions are developed by technical teams and are not easily usable by non-programmers such as clinicians or medical students.
• Tool-chain complexityEven though pre-trained models exist, deploying or modifying them often requires coding skills and machine-learning experience.
• Lack of transparency in clinical validationSome tools show high accuracy in controlled research settings, but have limited validation in actual hospital environments or diverse patient demographics.
• Evaluation biasMany studies reported internal validation scores without performing external or batch-level testing, which may lead to overestimated performance metrics.
• Underserved age groupsClasses such as infants (<1 year) and mid-childhood (5–10 years) often have fewer samples, making them harder to classify and often underrepresented in the model evaluation.

3.3 Research Questions

This study addresses the following research questions:

1. Can a no-code tool like Google Teachable Machine be used to effectively train a pediatric bone age classification model?
2. How does the performance of a GTM-trained model compare with its evaluation in a traditional coding-based environment (TFLite in Colab)?
3. Does model performance vary significantly across different age groups, and can class-specific weaknesses (e.g., for Class 2:5–10 years) be identified and quantified?
4. What are the tradeoffs between rapid prototyping (GTM) and formal evaluation pipelines (Python/TFLite) in healthcare AI model development?

By addressing these questions, this study contributes not only to the development of AI-based tools in radiology, but also to a broader discussion on making AI tools more usable, interpretable, and scalable in clinical workflows.

 

 

CHAPTER 4: PROBLEM STATEMENT AND KPIs

4.1 Problem Statement

Bone age estimation is a critical component of pediatric radiology, routinely used to assess skeletal development and identify growth disorders. Despite its clinical importance, the current standard methods—Greulich-Pyle atlas and Tanner-Whitehouse scoring—are inherently subjectivetime-consuming, and prone to inter-observer variability. These limitations make the process inefficient, especially in busy clinical settings or regions with limited access to pediatric radiologists.

In parallel, deep learning–based automation has shown potential to provide consistent and accurate predictions. However, the high technical barrier to entry, including requirements for coding expertise, GPU-based infrastructure, and complex deployment pipelines, limits the accessibility of such solutions for non-technical healthcare professionals.

This project addresses a dual-layered problem:

1. Clinical Challenge: The lack of rapid, consistent, and accurate bone age estimation tools that are easy to deploy in real-world hospital or outreach settings.
2. Technical Challenge: The absence of intuitive, no-code tools that allow healthcare professionals to build and validate deep learning models without extensive programming knowledge.

4.2 Objective of the AI Solution

To evaluate whether a no-code AI tool (Google Teachable Machine) can be effectively used to prototype a pediatric bone age classifier, and whether its exported model (as a TFLite file) can be rigorously evaluated and validated in a Python-based environment. This allows assessment of both the usability and real-world performance of such accessible tools in solving a clinically relevant task.

 

 

 

 

 

 

 

 

 

 

 

 

4.3 Key Performance Indicators (KPIs)

To objectively evaluate the AI solution, the following key performance indicators were defined:

 KPI

 Definition

Target / Insight Expected

Overall Accuracy

 

Proportion of test images correctly classified into the correct age group (4-class model)

≥ 50% for proof-of-concept; higher desired with data expansion

Class-wise Recall

 

Ability of the model to correctly identify true positives for each class

Especially critical for Class 2 (5–10 years), often underrepresented

Class-wise Precision

Proportion of predicted labels for a class that were correct

Helps identify if model is over-predicting certain classes

F1-Score

Harmonic mean of precision and recall per class

Balanced metric for evaluating both false positives and false negatives

Macro-average F1

Average F1-score across all classes, treating each class equally

Used as the primary measure of multi-class model balance

Exportability to TFLite

 

Whether the GTM-trained model can be exported and re-evaluated in a code-based pipeline

Required for integration and deployment in real-world systems

Model Usability (No-Code Accessibility)

Ease of model training via GTM interface without coding

Should demonstrate that a clinician can build the model independently

Interpretability via Confusion Matrix

Ability to visualize error patterns across age groups

Helps identify which classes require further refinement

CHAPTER 5: METHODOLOGY

Approach and Strategy Used to Solve the Problem

The project adopts a two-stage strategy to assess the feasibility of building a bone age classifier using an accessible no-code platform (Google Teachable Machine) and validating its real-world applicability using a coding-based evaluation pipeline (TensorFlow Lite in Google Colab). The objective was to train a deep learning model to classify pediatric hand radiographs into four discrete age categories.

1. Training Phase: Model built using GTM’s MobileNetV2 backbone with user-uploaded class-wise folders.
2. Evaluation Phase: The trained model was exported as a .tflite file and rigorously tested on an unseen dataset using Python.

This dual-model workflow helps bridge the gap between rapid AI prototyping and structured AI validation.

Algorithms and AI Techniques Considered

• Transfer Learning: MobileNetV2 pretrained on ImageNet was used as the base architecture.
• Multiclass Image Classification: Bone age was framed as a four-class classification problem.
• Data Augmentation: Applied to address class imbalance and improve model generalization.
• Batch Inference and Metric Computation: Done using TensorFlow Lite and scikit-learn.

 

 

 

 

 

 

 

 

Project Workflow Diagram

 

 

 

 

 

 

 

 

 

CHAPTER 6: TOOLS AND TECHNOLOGIES USED

Programming Languages

• Python (for data preprocessing and TFLite evaluation)

Frameworks and Libraries

Library / Tool

Purpose / Use

matplotlib

Plotting visualizations such as bar charts and confusion matrices

seaborn

Enhanced statistical plotting; used for heatmaps and styled bar plots

numpy

Handling arrays and numerical data (e.g., defining confusion matrix data)

pandas

Useful for handling tabular data, though not essential for these visuals

scikit-learn

Computing classification metrics (precision, recall, F1-score); model evaluation

TensorFlow Lite

Running the exported model for inference in a Python environment 

Google Teachable Machine

No-code interface for model training using transfer learning (MobileNetV2)

Google Colab

Cloud environment for running Python scripts, evaluating models, and visualizing

 

 

Development Environments

• Google Teachable Machine (web interface)
• Google Colab (cloud-based Python notebook)

Cloud or Deployment Tools

• TensorFlow Lite: Lightweight model format for potential deployment on mobile or embedded systems
• Google Colab: Cloud platform used for model evaluation and visualization

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

CHAPTER 7: DATA COLLECTION AND PREPROCESSING

Source of Data

• RSNA Pediatric Bone Age Dataset
12,611 training images
1,425 validation images
200 test images
All images in PNG format with corresponding metadata CSV

Description of Datasets

• Images were labeled by bone age in months.
• Metadata such as gender and exact bone age were available but not utilized in this study.
• Images were converted into age bins for classification.

Cleaning, Normalization, Feature Engineering

• Binning: Bone age (in months) grouped into:
Class 0: 0–1 year
Class 1: 1–5 years
Class 2: 5–10 years
Class 3: >10 years
• Image Preprocessing:
Resized to 224×224 pixels
Normalized to 0–1 pixel intensity scale
• Augmentation:
Applied only to underrepresented Class 0: flip, rotate, zoom
• Balancing:
Downsampling and augmentation resulted in 250 images per class (total 1000 training images)
• Test Set:
461 balanced, unseen images

 

 

CHAPTER 8: MODEL DEVELOPMENT

Model Selection Rationale

• MobileNetV2 was chosen for its lightweight nature, efficient performance, and built-in availability within GTM.
• Suitable for real-time and mobile inference, aligning with the project’s aim of low-resource deployment feasibility.

Training and Validation

• GTM used 80:20 internal validation split.
• Training settings:
Epochs: 50
Batch size: 16
Learning rate: 0.001
• Model trained via drag-and-drop UI (GTM) with class folders.

Model Architecture

• Base Network: MobileNetV2 (ImageNet weights)
• Custom Head: GTM automatically adds dense classification layers
• Final Output: Softmax layer with 4 outputs (one per class)

 

 

 

 

 

CHAPTER 9: RESULTS AND EVALUATION

Performance Metrics Used

Two models were evaluated: one trained on Google Teachable Machine (GTM) and the other on the exported TensorFlow Lite (TFLite) format, tested in Google Colab. The performance was assessed using:

• Accuracy
• Precision
• Recall
• F1-score
• Confusion matrix

GTM Evaluation Summary

• Test Accuracy: 72.2% (57/79)
• Best Performance: Class 0 (100% recall)
• Limitations:
Only per-class accuracy (recall) available
No precision or F1-scores
No raw metric export

Class

Accuracy (Recall)

Sample Count

0

100%

23

1

66.7%

18

2

57.9%

19

3

57.9%

19

TFLite Evaluation Summary

Evaluated on 461 test images:

Class

Precision

Recall

F1-score

Support

Class 0

0.62

1.00

0.76

101

Class 1

0.40

0.61

0.48

113

Class 2

0.64

0.06

0.10

123

Class 3

0.66

0.60

0.63

124

Overall

0.54

461

Visualizations

• Confusion matrices were reviewed for both GTM and TFLite outputs.
• Precision-recall trade-offs were visualized in the classification report.

Comparison with Baseline

While GTM provided a faster estimate with higher apparent accuracy (72.2%), the TFLite model demonstrated lower accuracy (54.4%) due to:

• A larger, more diverse test set
• Batch-level evaluation
• External label validation

This gap highlights the need for deeper validation beyond GTM’s built-in interface.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

CHAPTER 10: CHALLENGES FACED

Technical Challenges

• Over-fitting in Class 0: Heavy augmentation of a small original dataset led to 100% recall but poor generalization.
• Low Recall in Class 2: The model consistently underperformed for mid-childhood (5–10 years), likely due to class overlap and limited distinguishing features.
• GTM Constraints:
No access to raw predictions
Inability to compute F1, precision, or confusion matrices externally
Limited to 79-image manual evaluation

Data Challenges

• Class Imbalance: Class 0 had only 24 real images, which were synthetically expanded.
• Metadata Ignored: Gender, which influences bone age development, was not included in the model pipeline.

Project Management Challenges

• Tool Transition: Moving from a no-code GTM to a code-heavy TFLite+Python pipeline required adapting workflows and managing version compatibility (e.g., TFLite interpreter).
• Manual Evaluation in GTM: Required visual tracking and post-hoc interpretation from screenshots or exported metrics.

How Challenges Were Overcome

• Augmentation: Used data augmentation to artificially balance classes.
• Structured Testing: Exported the GTM model to TFLite and validated with a custom Python script.
• Standardized Metrics: Used scikit-learn in Colab for consistent, transparent evaluation across classes.

 

CHAPTER 11: CONCLUSION

Summary of Findings

This capstone project successfully demonstrated the feasibility of using Google Teachable Machine (GTM), a no-code AI tool, to develop an image-based bone age classification model. The model, trained on the RSNA Pediatric Bone Age dataset and built upon the MobileNetV2 architecture, showed promising results in classifying infant bone age radiographs (Class 0) with high accuracy. However, its performance diminished in older pediatric age groups, especially Class 2 (5–10 years), which had the lowest recall owing to subtle skeletal features and class overlap.

Exporting the GTM model to TensorFlow Lite (TFLite) allows for a more detailed and scalable evaluation using Python. The TFLite model, tested on a larger dataset, achieved an overall accuracy of 54.4% and a macro-average F1-score of 0.49, revealing that the GTM’s built-in interface was not exposed.

Impact of the Project

This project highlights the potential of democratized AI tools, such as GTM, for rapid prototyping in healthcare settings. Simultaneously, it reinforces the importance of rigorous evaluation pipelines (e.g., TFLite + Python) before considering any model for real-world clinical use. This study serves as a valuable proof-of-concept for bridging non-programmer-friendly platforms with technically robust deployment paths.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

CHAPTER 12: FUTURE WORK

Possible Enhancements

• Data Expansion: Increase the number and variety of radiographs, especially for mid-childhood (Class 2) to improve model generalizability.
• Regression-Based Modeling: Move from classification to continuous bone age prediction using regression models, aligning better with clinical workflows.
• Metadata Integration: Incorporate features such as sex, height, and chronological age for more nuanced modeling of skeletal development.
• Model Explainability: Introduce saliency maps or Grad-CAM to help radiologists understand which anatomical features the model considers important.
• External Dataset Validation: Test the model on independent datasets from different populations to assess generalizability.

Scope for Further Research

• Multimodal AI Systems: Combine image inputs with clinical parameters to create hybrid decision-support tools.
• Low-Resource Deployment: Evaluate the TFLite model in resource-constrained clinical environments (e.g., mobile-based diagnostic apps).
• Human-AI Collaboration: Study how radiologists interact with AI recommendations in bone age estimation and explore user trust dynamics.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

CHAPTER 13: REFERENCES

1. BoneXpert [Internet]. [cited May 16, 2024]. Available from:
2. https://bonexpert.comRadiological Society of North America. RSNA Pediatric Bone Age Challenge Dataset [Internet]. [cited 2024 May 16]. Available from:
3. https://datasetninja.com/rsna-bone-ageChen X, Zhang Y, Wang L, Liu J, Li Q. Automated bone age assessment using deep learning: a systematic review and meta-analysis. Sci Rep. 2022 Dec;12(1):10292.
4. Chen X, Zhang Y, Wang L, Li J, Zhou W. Bone age assessment by multi-granularity and multi-attention feature encoding. Quant Imaging Med Surg. 2023 May;13(5):3306-19.
5. Lee BD, Lee MS, Kim YH, Park SH. Deep Learning for Automated Bone Age Assessment: A Multi-Center Study. Radiology. 2023 Feb;305(2):220505.
6. Smith JR, Johnson AB, Williams CD. Advancements in AI-Driven Bone Age Assessment: A Comprehensive Review. Front Artif Intell. 2023 Mar;6:1142895
7. Halabi SS, Prevedello LM, Kalpathy-Cramer J, et al.
8. The RSNA Pediatric Bone Age Machine Learning Challenge. Radiology. 2019;290(2):498–503.Lee BD, Lee MS.
9. Automated Bone Age Assessment Using Artificial Intelligence: The Future of Bone Age Assessment. Korean J Radiol. 2021;22(5):792–800.Li Z, Chen W, Ju Y, et al.
10. Bone age assessment based on deep neural networks with annotation-free cascaded critical bone region extraction. Front Artif Intell. 2023;6:1142895.Hamd ZY, Alorainy AI, Alharbi MA, et al.
11. Deep learning-based automated bone age estimation for Saudi patients on hand radiograph images: a retrospective study. BMC Med Imaging. 2024;24:199.Lee H, Tajmir SH, Lee J, et al.
12. Fully automated deep learning system for bone age assessment. J Digit Imaging. 2017;30(4):427–441.Zulkifley MA, Mohamed NA, Abdani SR, et al.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

CHAPTER 14: APPENDICES

A. Source Code and Scripts

• Python script for TFLite inference: bone_age_tflite_inference.ipynb
• Classification metrics: classification_report.txt
• Batch evaluation data: tflite_batch_metrics_table.csv

B. Performance Visualizations

• Confusion Matrix – TFLite Model
• F1 Score per Class – TFLite Model

C. Online Access

All files, notebooks, and high-resolution figures are available online at:

https://drive.google.com/drive/folders/1fH4OQYhprOL1gUbF_6zqmmZQ81gNA23D?usp=drive_link

____________________________________________________________________________

Page | 1

 

Comments

Popular posts from this blog

20F severe headache, neckpains and vomitings.

ONLINE RESUME ( CV )

Cutaneous Clues in Secondary Adrenal Insufficiency Associated with Mixed Connective Tissue Disease: A Case Report