Every sound is a shape. Some shapes are on your lips, some are hidden on your tongue. Here are both, for a Japanese line.
Words from イノチミジカシコイセヨオトメ · CreepHyp. Front mouths are Google's pronunciation visemes. Side views are the CC0 midsagittal set by Wright and McCloy.
English and Japanese spell it the same. The tongue does not agree.
Words from the song
Speech is animated by visemes and articulations, the handful of distinct poses the mouth makes. Many sounds share one pose, so the set is small. This tool maps each Japanese mora to a quick consonant pose then a held vowel pose, and plays the sequence.
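A minimal sketch of that mapping, in Python. It is not the tool's own code. It assumes morae arrive already romanized (for example "chi"), and the names `Pose`, `mora_to_poses`, `word_to_sequence` and both durations are hypothetical, chosen only to illustrate the idea of a quick consonant pose followed by a held vowel pose.

```python
from dataclasses import dataclass
from typing import List

VOWELS = "aiueo"

# Assumed timings in milliseconds, for illustration only.
CONSONANT_MS = 60
VOWEL_MS = 180


@dataclass(frozen=True)
class Pose:
    sound: str        # romanized consonant or vowel this pose stands for
    duration_ms: int  # how long the pose stays on screen
    held: bool        # True for a held pose, False for a quick one


def mora_to_poses(mora: str) -> List[Pose]:
    """Split one romanized mora into a quick consonant pose and a held vowel pose."""
    mora = mora.lower()
    if not mora:
        raise ValueError("empty mora")
    if mora[-1] not in VOWELS:
        # No vowel, e.g. the moraic nasal "n": hold the consonant pose instead.
        return [Pose(mora, VOWEL_MS, held=True)]
    onset, vowel = mora[:-1], mora[-1]
    poses = []
    if onset:
        poses.append(Pose(onset, CONSONANT_MS, held=False))
    poses.append(Pose(vowel, VOWEL_MS, held=True))
    return poses


def word_to_sequence(morae: List[str]) -> List[Pose]:
    """Concatenate the poses of each mora, in order, ready to be played."""
    sequence: List[Pose] = []
    for mora in morae:
        sequence.extend(mora_to_poses(mora))
    return sequence


if __name__ == "__main__":
    for pose in word_to_sequence(["i", "no", "chi"]):
        kind = "held" if pose.held else "quick"
        print(f"{pose.sound:>2}  {pose.duration_ms} ms  {kind}")
```

A real player would look up an image for each `sound` and swap it on a timer. Here the sequence is only printed, so the order and timing are easy to check by eye.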
The front view uses Google's own pronunciation mouth images, recolored. The side view uses a public-domain midsagittal set: the vocal tract sliced down the middle so the tongue, palate and teeth show. Switch views and play the same word to see the lips and the tongue tell two halves of one story.
The midsagittal set has one drawing per International Phonetic Alphabet symbol, not per Japanese sound, so a few mappings are close cousins rather than exact matches.
None of this is your mouth doing it wrong. It is one diagram set stretched over a different language.
Side-view set: Wright and McCloy, CC0 · the T contrast follows Dogen.
In the front view, each mora's underline marks its mouth-posture family.