ByteDance has been constructing an AI music beast… with a bit assist from The Beatles and Michael Jackson

TikTok’s $300 billion-valued guardian firm, ByteDance, is likely one of the world’s busiest AI builders. It plans to spend billions of {dollars} on AI chips this yr, whereas its tech provides Sam Altman’s OpenAI a run for its cash.

ByteDance’s Duobao AI chatbot is at the moment the most well-liked AI assistant in China, with 78.6 million month-to-month lively customers as of January.

This makes it the world’s second most-used AI app behind OpenAI’s ChatGPT (with 349.4 million MAUs). The not too long ago launched Doubao-1.5-pro is claimed to match the efficiency of OpenAI’s GPT-4o at a fraction of the associated fee.

As Counterpoint Analysis notes on this breakdown of Duobao’s positioning and performance, “very like its worldwide rival ChatGPT, the cornerstone of Doubao’s attraction is its multimodality, providing superior textual content, picture, and audio processing capabilities”.

It could additionally generate music.

In September, ByteDance added an AI music technology perform to the Duobao app, which apparently “helps greater than ten kinds of music types and permits you to write lyrics and compose music with one click on”.

This, although, isn’t the tip of ByteDance’s fascination with constructing music AI applied sciences.

On September 18, ByteDance’s Duobao Workforce introduced the large launch of a collection of AI music fashions dubbed Seed-Music.

Seed-Music, they claimed, would “empower folks to discover extra potentialities in music creation”.

Established in 2023, the ByteDance Doubao (Seed) Workforce is “devoted to constructing industry-leading AI basis fashions”.

In line with the official launch announcement for Seed-Music in September, the AI music product “helps score-to-song conversion, controllable technology, music and lyrics modifying, and low-threshold voice cloning”.

It additionally claims that “it cleverly combines the strengths of language fashions and diffusion fashions and integrates them into the music composition workflow, making it appropriate for various music creation eventualities for each inexperienced persons and professionals”.

The official Seed-Music web site comprises a variety of audio clips that show what it could possibly do.

You possibly can hear a few of that, beneath:

Extra essential, although, is how Seed-Music was constructed.

Fortunately, the Duobao Workforce has printed a tech report that explains the interior workings of their Seed-Music challenge.

MBW has learn it cowl to cowl.

Within the introduction to ByteDance’s analysis paper, which you’ll learn in full right here, the corporate’s researchers state that, “music is deeply embedded in human tradition” and that “all through human historical past, vocal music has accompanied key moments in life and society: from love calls to seasonal harvests”.

“Our aim is to leverage fashionable generative modeling applied sciences, to not substitute human creativity, however to decrease the obstacles to music creation.”

ByeDance analysis paper for Seed-Music

The intro continues: “Right this moment, vocal music stays central to international tradition. Nevertheless, creating vocal music is a fancy, multi-stage course of involving pre-production, writing, recording, modifying, mixing, and mastering, making it difficult for most individuals.”

“Our aim is to leverage fashionable generative modeling applied sciences, to not substitute human creativity, however to decrease the obstacles to music creation. By providing interactive creation and modifying instruments, we intention to empower each novices and professionals to have interaction at completely different levels of the music manufacturing course of.”

How Seed-Music works

ByteDance’s researchers clarify that the “unified framework” behind Seed-Music “is constructed upon three elementary representations: audio tokens, symbolic tokens, and vocoder latents”, which every correspond to “a technology pipeline.”

The audio token-based pipeline, as illustrated within the chart beneath, works like this: “(1) Enter embedders convert multi-modal controlling inputs, resembling music fashion description, lyrics, reference audio, or music scores, right into a prefix embedding sequence. (2) The auto-regressive LM generates a sequence of audio tokens. (3) The diffusion transformer mannequin generates steady vocoder latents. (4) The acoustic vocoder produces high-quality 44.1kHz stereo audio.”

In distinction to the audio token-based pipeline, the symbolic token-based Generator, which you’ll see within the chart beneath, is “designed to foretell symbolic tokens for higher interpretability”, which the researchers state is “essential for addressing musicians’ workflows in Seed-Music”.

In line with the analysis paper, “Symbolic representations, resembling MIDI, ABC notation and MusicXML, are discrete and could be simply tokenized right into a format appropriate with LMs”.

ByteDance’s researchers add within the paper: “In contrast to audio tokens, symbolic representations are interpretable, permitting creators to learn and modify them immediately. Nevertheless, their lack of acoustic particulars means the system has to rely closely on the Renderer’s means to generate nuanced acoustic traits for musical efficiency. Coaching such a Renderer requires large-scale datasets of paired audio and symbolic transcriptions, that are particularly scarce for vocal music.”

The plain query…

By now, you’re in all probability asking the place The Beatles and Michael Jackson’s music come into all of this.

We’re practically there. First, we have to speak about MIRs.

In line with the Seed-Music analysis paper, “to extract the symbolic options from audio for coaching the above system,” the staff behind the tech used varied “in-house Music Info Retrieval (MIR) fashions”.

In line with this very clear rationalization over at Dataloop, MIR “is a subcategory of AI fashions that focuses on extracting significant info from music knowledge, resembling audio indicators, lyrics, and metadata”.

Aka: It’s a metadata scraper. Stick a tune into the jaws of a MIR mannequin, and it’ll analyze, predict and current knowledge which may embody pitch, beats-per-minute (BPM), lyrics, chords, and extra.

Music Info Retrieval analysis first gained recognition over its means to assist with the digital classification of genres, moods, tempos, and so on. – key constructing blocks for advice techniques utilized by music streaming providers.

Now, although, main generative AI music platforms are reportedly utilizing MIR analysis to enhance their product output.

Are you able to see the place that is going? Sure, after all.

ByteDance’s analysis staff has efficiently constructed its personal in-house MIR fashions, which have been utilized by the ByteDance staff to “extract the symbolic options from audio” to construct components of its Seed-Music system. These MIR fashions embody:

AI, are you okay? Are you okay, AI?

Taking a deeper dive into the analysis printed by ByteDance for its Structural evaluation-focused MIR mannequin, we discover a analysis paper titled:

‘To catch a refrain, verse, intro, or anything: Analyzing a tune with structural capabilities’.

It was printed in 2022. You can learn it right here.

In line with the paper: “Typical music construction evaluation algorithms intention to divide a tune into segments and to group them with summary labels (e.g., ‘A’, ‘B’, and ‘C’).

“Nevertheless, explicitly figuring out the perform of every phase (e.g., ‘verse’ or ‘refrain’) isn’t tried, however has many purposes”.

On this analysis paper, they “introduce a multi-task deep studying framework to mannequin these structural semantic labels immediately from audio by estimating ‘verseness,’ ‘chorusness,’ and so forth, as a perform of time”.

To conduct this analysis, the ByteDance staff used 4 “public datasets”, together with one referred to as the ‘Isophonics’ dataset, which, it notes, “comprises 277 songs from The Beatles, Carole King, Michael Jackson, and Queen.”

The supply of the Isophonics dataset utilized by ByteDance’s researchers seems to be Isophonics.internet, described as the house for software program and knowledge assets from the Centre for Digital Music (C4DM) at Queen Mary, College of London.

The Isophonics web site notes that its “chord, onset, and segmentation annotations have been utilized by many researchers within the MIR group.”

The web site explains that “the annotations printed right here fall into 4 classes: chords, keys, structural segmentations, and beats/bars”.

In 2022, ByteDance’s researchers printed a video presentation of their, To catch a refrain, verse, intro, or anything: Analyzing a tune with structural capabilities paper for the Worldwide Convention on Acoustics, Speech, and Sign Processing (ICASSP).

You possibly can see this presentation beneath.

The video’s caption describes a “novel system/methodology that segments a tune into sections resembling refrain, verse, intro, outro, bridge, and so on”.

It demonstrates its findings associated to songs by the Beatles, Michael Jackson, Avril Lavigne and different artists:

We have to be cautious right here over any suggestion that ByteDance’s AI music-generating know-how could have been “skilled” utilizing songs by fashionable artists just like the Beatles or Michael Jackson.

But, as you may see, a dataset containing annotations of such songs has clearly been used as part of a ByteDance analysis challenge on this discipline.

Any evaluation or reference to fashionable songs and their annotations in analysis carried out or funded by a multi-billion-dollar know-how firm will certainly increase a variety of questions for the music {industry} – particularly these employed to guard its copyrights.

“We firmly consider that AI applied sciences ought to help, not disrupt, the livelihoods of musicians and artists. AI ought to function a software for creative expression, as true artwork all the time stems from human intention.”

ByteDance’s Seed-Music researchers

There’s a part devoted to Ethics and Security on the backside of ByteDance’s Seed-Music analysis paper.

In line with ByteDance’s researchers, they “firmly consider that AI applied sciences ought to help, not disrupt, the livelihoods of musicians and artists“.

They add: “AI ought to function a software for creative expression, as true artwork all the time stems from human intention. Our aim is to current this know-how as a possibility to advance the music {industry} by decreasing obstacles to entry, providing smarter, sooner modifying instruments, producing new and thrilling sounds, and opening up new potentialities for creative exploration.”

The ByteDance researchers additionally define moral points particularly: “We acknowledge that AI instruments are inherently liable to bias, and our aim is to offer a software that stays impartial and advantages everybody. To attain this, we intention to supply a variety of management components that assist decrease preexisting biases.

“By returning creative decisions to customers, we consider we will promote equality, protect creativity, and improve the worth of their work. With these priorities in thoughts, we hope our breakthroughs in lead sheet tokens spotlight our dedication to empowering musicians and fostering human creativity by AI.”

When it comes to Security / ‘deepfake’ considerations, the researchers clarify that, “within the case of vocal music, we acknowledge how the singing voice evokes one of many strongest expressions of particular person identification”.

They add: “To safeguard towards the misuse of this know-how in impersonating others, we undertake a course of just like the security measures specified by Seed-TTS. This entails a multistep verification methodology for spoken content material and voice to make sure the enrollment of audio tokens comprises solely the voice of approved customers.

“We additionally implement a multi-level water-marking scheme and duplication checks throughout the generative course of. Trendy techniques for music technology could basically reshape tradition and the connection between creative creation and consumption.

“We’re assured that, with sturdy consensus between stakeholders, these applied sciences will and revolutionize music creation workflow and profit music novices, professionals, and listeners alike.”Music Enterprise Worldwide

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Latest Posts

ByteDance has been constructing an AI music beast… with a bit assist from The Beatles and Michael Jackson

How Seed-Music works

The plain query…

AI, are you okay? Are you okay, AI?

RELATED ARTICLES

Latest Posts

Don't Miss

Stay in touch

Work From Home

popular

finance

career

Stay in touch

Contact us